Diff of /code/trunk/doc/pcre.txt

revision 69 by nigel, Sat Feb 24 21:40:18 2007 UTC revision 73 by nigel, Sat Feb 24 21:40:30 2007 UTC
# Line 5  synopses of each function in the library Line 5  synopses of each function in the library
5  separate text files for the pcregrep and pcretest commands.  separate text files for the pcregrep and pcretest commands.
6  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
7    
8  NAME  PCRE(3)                                                                PCRE(3)
9       PCRE - Perl-compatible regular expressions  
10    
11    
12    NAME
13           PCRE - Perl-compatible regular expressions
14    
15  DESCRIPTION  DESCRIPTION
16    
17       The PCRE library is a set of functions that implement  regu-         The  PCRE  library is a set of functions that implement regular expres-
18       lar  expression  pattern  matching using the same syntax and         sion pattern matching using the same syntax and semantics as Perl, with
19       semantics as Perl, with just a few differences. The  current         just  a  few  differences.  The current implementation of PCRE (release
20       implementation  of  PCRE  (release 4.x) corresponds approxi-         4.x) corresponds approximately with Perl  5.8,  including  support  for
21       mately with Perl 5.8, including support  for  UTF-8  encoded         UTF-8  encoded  strings.   However,  this  support has to be explicitly
22       strings.    However,  this  support  has  to  be  explicitly         enabled; it is not the default.
23       enabled; it is not the default.  
24           PCRE is written in C and released as a C library. However, a number  of
25       PCRE is written in C and released as a C library. However, a         people  have  written  wrappers  and interfaces of various kinds. A C++
26       number  of  people  have  written wrappers and interfaces of         class is included in these contributions, which can  be  found  in  the
27       various kinds. A C++ class is included  in  these  contribu-         Contrib directory at the primary FTP site, which is:
28       tions,  which  can  be found in the Contrib directory at the  
29       primary FTP site, which is:         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
30    
31       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         Details  of  exactly which Perl regular expression features are and are
32           not supported by PCRE are given in separate documents. See the pcrepat-
33       Details of exactly which Perl  regular  expression  features         tern and pcrecompat pages.
34       are  and  are  not  supported  by PCRE are given in separate  
35       documents. See the pcrepattern and pcrecompat pages.         Some  features  of  PCRE can be included, excluded, or changed when the
36           library is built. The pcre_config() function makes it  possible  for  a
37       Some features of PCRE can be included, excluded, or  changed         client  to  discover  which features are available. Documentation about
38       when  the library is built. The pcre_config() function makes         building PCRE for various operating systems can be found in the  README
39       it possible for a client  to  discover  which  features  are         file in the source distribution.
      available.  Documentation  about  building  PCRE for various  
      operating systems can be found in the  README  file  in  the  
      source distribution.  
40    
41    
42  USER DOCUMENTATION  USER DOCUMENTATION
43    
44       The user documentation for PCRE has been  split  up  into  a         The user documentation for PCRE has been split up into a number of dif-
45       number  of  different sections. In the "man" format, each of         ferent sections. In the "man" format, each of these is a separate  "man
46       these is a separate "man page". In the HTML format, each  is         page".  In  the  HTML  format, each is a separate page, linked from the
47       a  separate  page,  linked from the index page. In the plain         index page. In the plain text format, all  the  sections  are  concate-
48       text format, all the sections are concatenated, for ease  of         nated, for ease of searching. The sections are as follows:
49       searching. The sections are as follows:  
50             pcre              this document
51         pcre              this document           pcreapi           details of PCRE's native API
52         pcreapi           details of PCRE's native API           pcrebuild         options for building PCRE
53         pcrebuild         options for building PCRE           pcrecallout       details of the callout feature
54         pcrecallout       details of the callout feature           pcrecompat        discussion of Perl compatibility
55         pcrecompat        discussion of Perl compatibility           pcregrep          description of the pcregrep command
56         pcregrep          description of the pcregrep command           pcrepattern       syntax and semantics of supported
57         pcrepattern       syntax and semantics of supported                               regular expressions
58                             regular expressions           pcreperform       discussion of performance issues
59         pcreperform       discussion of performance issues           pcreposix         the POSIX-compatible API
60         pcreposix         the POSIX-compatible API           pcresample        discussion of the sample program
61         pcresample        discussion of the sample program           pcretest          the pcretest testing command
62         pcretest          the pcretest testing command  
63           In  addition,  in the "man" and HTML formats, there is a short page for
64       In addition, in the "man" and HTML formats, there is a short         each library function, listing its arguments and results.
      page  for  each  library function, listing its arguments and  
      results.  
65    
66    
67  LIMITATIONS  LIMITATIONS
68    
69       There are some size limitations in PCRE but it is hoped that         There are some size limitations in PCRE but it is hoped that they  will
70       they will never in practice be relevant.         never in practice be relevant.
71    
72       The maximum length of a  compiled  pattern  is  65539  (sic)         The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE
73       bytes  if PCRE is compiled with the default internal linkage         is compiled with the default internal linkage size of 2. If you want to
74       size of 2. If you want to process regular  expressions  that         process  regular  expressions  that are truly enormous, you can compile
75       are  truly  enormous,  you can compile PCRE with an internal         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
76       linkage size of 3 or 4 (see the README file  in  the  source         the  source  distribution and the pcrebuild documentation for details).
77       distribution  and  the pcrebuild documentation for details).         In these cases the limit is substantially larger.  However,  the  speed
78       In these cases the limit is substantially larger.   However,         of execution will be slower.
79       the speed of execution will be slower.  
80           All values in repeating quantifiers must be less than 65536.  The maxi-
81       All values in repeating quantifiers must be less than 65536.         mum number of capturing subpatterns is 65535.
82       The maximum number of capturing subpatterns is 65535.  
83           There is no limit to the number of non-capturing subpatterns,  but  the
84       There is no limit to the  number  of  non-capturing  subpat-         maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,
85       terns,  but  the  maximum  depth  of nesting of all kinds of         including capturing subpatterns, assertions, and other types of subpat-
86       parenthesized subpattern, including  capturing  subpatterns,         tern, is 200.
87       assertions, and other types of subpattern, is 200.  
88           The  maximum  length of a subject string is the largest positive number
89       The maximum length of a subject string is the largest  posi-         that an integer variable can hold. However, PCRE uses recursion to han-
90       tive number that an integer variable can hold. However, PCRE         dle  subpatterns  and indefinite repetition. This means that the avail-
91       uses recursion to handle subpatterns and indefinite  repeti-         able stack space may limit the size of a subject  string  that  can  be
92       tion.  This  means  that the available stack space may limit         processed by certain patterns.
      the size of a subject string that can be processed  by  cer-  
      tain patterns.  
93    
94    
95  UTF-8 SUPPORT  UTF-8 SUPPORT
96    
97       Starting at release 3.3, PCRE has had some support for char-         Starting  at  release  3.3,  PCRE  has  had  some support for character
98       acter  strings  encoded in the UTF-8 format. For release 4.0         strings encoded in the UTF-8 format. For  release  4.0  this  has  been
99       this has been greatly extended to cover most common require-         greatly extended to cover most common requirements.
100       ments.  
101           In  order  to process  UTF-8 strings, you must build PCRE to include UTF-8
102       In order to process UTF-8  strings,  you  must  build  PCRE  to         support in the code, and, in addition,  you  must  call  pcre_compile()
103       include  UTF-8  support  in  the code, and, in addition, you         with  the PCRE_UTF8 option flag. When you do this, both the pattern and
104       must call pcre_compile() with  the  PCRE_UTF8  option  flag.         any subject strings that are matched against it are  treated  as  UTF-8
105       When  you  do this, both the pattern and any subject strings         strings instead of just strings of bytes.
106       that are matched against it are  treated  as  UTF-8  strings  
107       instead of just strings of bytes.         If  you compile PCRE with UTF-8 support, but do not use it at run time,
108           the library will be a bit bigger, but the additional run time  overhead
109       If you compile PCRE with UTF-8 support, but do not use it at         is  limited  to testing the PCRE_UTF8 flag in several places, so should
110       run  time,  the  library will be a bit bigger, but the addi-         not be very large.
111       tional run time overhead is limited to testing the PCRE_UTF8  
112       flag in several places, so should not be very large.         The following comments apply when PCRE is running in UTF-8 mode:
113    
114       The following comments apply when PCRE is running  in  UTF-8         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and
115       mode:         subjects  are  checked for validity on entry to the relevant functions.
116           If an invalid UTF-8 string is passed, an error return is given. In some
117       1. PCRE assumes that the strings it is given  contain  valid         situations,  you  may  already  know  that  your strings are valid, and
118       UTF-8  codes. It does not diagnose invalid UTF-8 strings. If         therefore want to skip these checks in order to improve performance. If
119       you pass invalid UTF-8 strings  to  PCRE,  the  results  are         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,
120       undefined.         PCRE assumes that the pattern or subject  it  is  given  (respectively)
121           contains  only valid UTF-8 codes. In this case, it does not diagnose an
122       2. In a pattern, the escape sequence \x{...}, where the con-         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when
123       tents  of  the  braces is a string of hexadecimal digits, is         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
124       interpreted as a UTF-8 character whose code  number  is  the         crash.
125       given  hexadecimal  number, for example: \x{1234}. If a non-  
126       hexadecimal digit appears between the braces,  the  item  is         2. In a pattern, the escape sequence \x{...}, where the contents of the
127       not  recognized.  This escape sequence can be used either as         braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8
128       a literal, or within a character class.         character whose code number is the given hexadecimal number, for  exam-
129           ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,
130       3. The original hexadecimal escape sequence, \xhh, matches a         the item is not recognized.  This escape sequence can be used either as
131       two-byte UTF-8 character if the value is greater than 127.         a literal, or within a character class.
132    
133       4. Repeat quantifiers apply to  complete  UTF-8  characters,         3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte
134       not to individual bytes, for example: \x{100}{3}.         UTF-8 character if the value is greater than 127.
135    
136       5. The dot metacharacter matches one UTF-8 character instead         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
137       of a single byte.         vidual bytes, for example: \x{100}{3}.
138    
139       6. The escape sequence \C can be used to match a single byte         5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a
140       in UTF-8 mode, but its use can lead to some strange effects.         single byte.
141    
142       7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W         6. The escape sequence \C can be used to match a single byte  in  UTF-8
143       correctly test characters of any code value, but the charac-         mode, but its use can lead to some strange effects.
144       ters that PCRE recognizes as digits, spaces, or word charac-  
145       ters  remain  the  same  set as before, all with values less         7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
146       than 256.         test characters of any code value, but the characters that PCRE  recog-
147           nizes  as  digits,  spaces,  or  word characters remain the same set as
148       8. Case-insensitive  matching  applies  only  to  characters         before, all with values less than 256.
149       whose  values  are  less than 256. PCRE does not support the  
150       notion of "case" for higher-valued characters.         8. Case-insensitive matching applies only to  characters  whose  values
151           are  less  than  256.  PCRE  does  not support the notion of "case" for
152           higher-valued characters.
153    
154       9. PCRE does not support the use of Unicode tables and  pro-         9. PCRE does not support the use of Unicode tables  and  properties  or
155       perties or the Perl escapes \p, \P, and \X.         the Perl escapes \p, \P, and \X.
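
Items 2 to 4 above depend on how a code number maps to UTF-8 bytes. The sketch below is an illustration only and is not taken from the PCRE sources: utf8_encode() is a hypothetical helper that encodes one code point using the modern four-byte UTF-8 limit, which may differ from what this PCRE release accepts. It does not reject surrogate values.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): encode one code point as UTF-8.
 * Writes up to 4 bytes into out and returns the byte count, or 0 if the
 * value is above 0x10FFFF. Surrogates are not rejected in this sketch. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                      /* 1 byte: plain ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 2 bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {                  /* 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* out of range for this sketch */
}
```

Under this encoding a value such as 0xE9 (greater than 127) occupies two bytes, which is why \xhh can match a two-byte character in UTF-8 mode, and \x{100}{3} repeats a two-byte character three times rather than repeating a single byte.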
156    
157    
158  AUTHOR  AUTHOR
159    
160       Philip Hazel <ph10@cam.ac.uk>         Philip Hazel <ph10@cam.ac.uk>
161       University Computing Service,         University Computing Service,
162       Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
163       Phone: +44 1223 334714         Phone: +44 1223 334714
164    
165  Last updated: 04 February 2003  Last updated: 20 August 2003
166  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
167  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
168    
169  NAME  PCRE(3)                                                                PCRE(3)
      PCRE - Perl-compatible regular expressions  
170    
171    
172    
173    NAME
174           PCRE - Perl-compatible regular expressions
175    
176  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
177    
178       This document describes the optional features of  PCRE  that         This  document  describes  the  optional  features  of PCRE that can be
179       can  be  selected when the library is compiled. They are all         selected when the library is compiled. They are all selected, or  dese-
180       selected, or deselected, by providing options to the config-         lected,  by  providing  options  to  the  configure script which is run
181       ure  script  which  is run before the make command. The com-         before the make command. The complete list  of  options  for  configure
182       plete list of options  for  configure  (which  includes  the         (which  includes the standard ones such as the selection of the instal-
183       standard  ones  such  as  the  selection of the installation         lation directory) can be obtained by running
184       directory) can be obtained by running  
185             ./configure --help
186         ./configure --help  
187           The following sections describe certain options whose names begin  with
188       The following sections describe certain options whose  names         --enable  or  --disable. These settings specify changes to the defaults
189       begin  with  --enable  or  --disable. These settings specify         for the configure command. Because of the  way  that  configure  works,
190       changes to the defaults for the configure  command.  Because         --enable  and  --disable  always  come  in  pairs, so the complementary
191       of  the  way  that  configure  works, --enable and --disable         option always exists as well, but as it specifies the  default,  it  is
192       always come in pairs, so  the  complementary  option  always         not described.
      exists  as  well, but as it specifies the default, it is not  
      described.  
193    
194    
195  UTF-8 SUPPORT  UTF-8 SUPPORT
196    
197       To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF-8 character strings, add
198    
199         --enable-utf8           --enable-utf8
200    
201       to the configure command. Of itself, this does not make PCRE         to  the  configure  command.  Of  itself, this does not make PCRE treat
202       treat  strings as UTF-8. As well as compiling PCRE with this         strings as UTF-8. As well as compiling PCRE with this option, you  also
203       option, you also have to set the PCRE_UTF8 option  when         have to set the PCRE_UTF8 option when you call the pcre_compile()
204       you call the pcre_compile() function.         function.
205    
206    
207  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
208    
209       By default, PCRE treats character 10 (linefeed) as the  new-         By default, PCRE treats character 10 (linefeed) as the newline  charac-
210       line  character.  This  is  the  normal newline character on         ter. This is the normal newline character on Unix-like systems. You can
211       Unix-like systems. You can compile PCRE to use character  13         compile PCRE to use character 13 (carriage return) instead by adding
212       (carriage return) instead by adding  
213             --enable-newline-is-cr
214         --enable-newline-is-cr  
215           to the configure command. For completeness there is  also  a  --enable-
216       to the configure command. For completeness there is  also  a         newline-is-lf  option,  which explicitly specifies linefeed as the new-
217       --enable-newline-is-lf  option,  which  explicitly specifies         line character.
      linefeed as the newline character.  
218    
219    
220  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
221    
222       The PCRE building process uses libtool to build both  shared         The PCRE building process uses libtool to build both shared and  static
223       and  static  Unix libraries by default. You can suppress one         Unix  libraries by default. You can suppress one of these by adding one
224       of these by adding one of         of
225    
226         --disable-shared           --disable-shared
227         --disable-static           --disable-static
228    
229       to the configure command, as required.         to the configure command, as required.
230    
231    
232  POSIX MALLOC USAGE  POSIX MALLOC USAGE
233    
234       When PCRE is called through the  POSIX  interface  (see  the         When PCRE is called through the  POSIX  interface  (see  the  pcreposix
235       pcreposix  documentation),  additional  working  storage  is         documentation),  additional working storage is required for holding the
236       required for holding the pointers  to  capturing  substrings         pointers to capturing substrings because PCRE requires  three  integers
237       because  PCRE requires three integers per substring, whereas         per  substring,  whereas  the POSIX interface provides only two. If the
238       the POSIX interface provides only  two.  If  the  number  of         number of expected substrings is small, the wrapper function uses space
239       expected  substrings  is  small,  the  wrapper function uses         on the stack, because this is faster than using malloc() for each call.
240       space on the stack, because this is faster than  using  mal-         The default threshold above which the stack is no longer used is 10; it
241       loc()  for  each call. The default threshold above which the         can be changed by adding a setting such as
      stack is no longer used is 10; it can be changed by adding a  
      setting such as  
242    
243         --with-posix-malloc-threshold=20           --with-posix-malloc-threshold=20
244    
245       to the configure command.         to the configure command.
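
The stack-or-malloc decision described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code of the POSIX wrapper itself: the names MALLOC_THRESHOLD, INTS_PER_SUBSTRING, acquire_ovector() and release_ovector() are hypothetical, and the threshold of 10 simply mirrors the documented default.

```c
#include <stdlib.h>

#define MALLOC_THRESHOLD   10  /* hypothetical name; documented default value */
#define INTS_PER_SUBSTRING 3   /* PCRE needs three integers per substring */

/* Return a vector large enough for nmatch substrings. At or below the
 * threshold the caller's stack buffer is used; above it the vector is
 * allocated with malloc(). *used_heap records which path was taken. */
int *acquire_ovector(size_t nmatch, int *small_buf, int *used_heap)
{
    if (nmatch <= MALLOC_THRESHOLD) {
        *used_heap = 0;
        return small_buf;
    }
    *used_heap = 1;
    return malloc(nmatch * INTS_PER_SUBSTRING * sizeof(int));
}

/* Release the vector only if it came from the heap. */
void release_ovector(int *vec, int used_heap)
{
    if (used_heap)
        free(vec);
}
```

A caller would declare int small[MALLOC_THRESHOLD * INTS_PER_SUBSTRING] as a local variable and pass it as small_buf, so that the common case with few substrings never touches the heap.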
246    
247    
248  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
249    
250       Internally, PCRE has a  function  called  match()  which  it         Internally,  PCRE  has a function called match() which it calls repeat-
251       calls  repeatedly  (possibly  recursively) when performing a         edly (possibly recursively) when performing a  matching  operation.  By
252       matching operation. By limiting the  number  of  times  this         limiting  the  number of times this function may be called, a limit can
253       function  may  be  called,  a  limit  can  be  placed on the         be placed on the resources used by a single call  to  pcre_exec().  The
254       resources used by a single call to  pcre_exec().  The  limit         limit  can be changed at run time, as described in the pcreapi documen-
255       can  be  changed  at  run  time, as described in the pcreapi         tation. The default is 10 million, but this can be changed by adding  a
256       documentation. The default is 10 million, but  this  can  be         setting such as
      changed by adding a setting such as  
257    
258         --with-match-limit=500000           --with-match-limit=500000
259    
260       to the configure command.         to the configure command.
261    
262    
263  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
264    
265       Within a compiled pattern, offset values are used  to  point         Within  a  compiled  pattern,  offset values are used to point from one
266       from  one  part  to  another  (for  example, from an opening         part to another (for example, from an opening parenthesis to an  alter-
267       parenthesis to an  alternation  metacharacter).  By  default         nation  metacharacter).  By  default two-byte values are used for these
268       two-byte  values  are  used  for these offsets, leading to a         offsets, leading to a maximum size for a  compiled  pattern  of  around
269       maximum size for a compiled pattern of around 64K.  This  is         64K.  This  is sufficient to handle all but the most gigantic patterns.
270       sufficient  to  handle  all  but the most gigantic patterns.         Nevertheless, some people do want to process enormous patterns,  so  it
271       Nevertheless, some people do want to process  enormous  pat-         is  possible  to compile PCRE to use three-byte or four-byte offsets by
272       terns,  so  it is possible to compile PCRE to use three-byte         adding a setting such as
      or four-byte offsets by adding a setting such as  
   
        --with-link-size=3  
   
      to the configure command. The value given must be 2,  3,  or  
      4.  Using  longer  offsets  slows down the operation of PCRE  
      because it has to load additional bytes when handling them.  
   
      If you build PCRE with an increased link size, test  2  (and  
      test 5 if you are using UTF-8) will fail. Part of the output  
      of these tests is a representation of the compiled  pattern,  
      and this changes with the link size.  
273    
274  Last updated: 21 January 2003           --with-link-size=3
275    
276           to the configure command. The value given must be 2,  3,  or  4.  Using
277           longer  offsets slows down the operation of PCRE because it has to load
278           additional bytes when handling them.
279    
280           If you build PCRE with an increased link size, test 2 (and  test  5  if
281           you  are using UTF-8) will fail. Part of the output of these tests is a
282           representation of the compiled pattern, and this changes with the  link
283           size.
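
To give a sense of scale for the link sizes above, the following sketch computes the largest unsigned value that fits in an offset of 2, 3, or 4 bytes. It is only arithmetic: max_link_offset() is a hypothetical helper, and the exact pattern-size limits PCRE enforces are not derived from it (the LIMITATIONS section quotes 65539 bytes for link size 2, for instance).

```c
#include <stdint.h>

/* Hypothetical helper: largest value representable in an unsigned
 * offset field of link_size bytes. Returns 0 for sizes other than
 * the permitted 2, 3, or 4. */
uint64_t max_link_offset(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return (UINT64_C(1) << (8 * link_size)) - 1;
}
```

Two bytes give 65535 (the "around 64K" mentioned above), three bytes give 16777215, and four bytes give 4294967295.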
284    
285    
286    AVOIDING EXCESSIVE STACK USAGE
287    
288           PCRE  implements  backtracking while matching by making recursive calls
289           to an internal function called match(). In environments where the  size
290           of the stack is limited, this can severely limit PCRE's operation. (The
291           Unix environment does not usually suffer from this problem.) An  alter-
292           native  approach  that  uses  memory  from  the  heap to remember data,
293           instead of using recursive function calls, has been implemented to work
294           round  this  problem. If you want to build a version of PCRE that works
295           this way, add
296    
297             --disable-stack-for-recursion
298    
299           to the configure command. With this configuration, PCRE  will  use  the
300           pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory
301           management functions. Separate functions are provided because the usage
302           is very predictable: the block sizes requested are always the same, and
303           the blocks are always freed in reverse order. A calling  program  might
304           be  able  to implement optimized functions that perform better than the
305           standard malloc() and  free()  functions.  PCRE  runs  noticeably  more
306           slowly when built in this way.
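
Because the blocks are always freed in reverse order, a calling program could, for example, serve them from a simple last-in-first-out arena. The sketch below is one possible approach and not something PCRE supplies: lifo_malloc(), lifo_free() and ARENA_BYTES are hypothetical names, the arena size is arbitrary, and the code is not thread-safe. It falls back to malloc() when the arena is full.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ARENA_BYTES (64 * 1024)   /* arbitrary size chosen for this sketch */

/* The union forces suitable alignment for the byte array. */
static union { max_align_t align; unsigned char bytes[ARENA_BYTES]; } arena;
static size_t arena_top = 0;      /* offset of the first free byte */

/* Round a request up to a multiple of the strictest alignment. */
static size_t round_up(size_t n)
{
    size_t a = sizeof(max_align_t);
    return (n + a - 1) / a * a;
}

/* Same shape as the pcre_stack_malloc variable: void *(*)(size_t). */
void *lifo_malloc(size_t size)
{
    size_t need = round_up(size);
    if (need > ARENA_BYTES - arena_top)
        return malloc(size);              /* arena full: fall back to the heap */
    void *p = arena.bytes + arena_top;
    arena_top += need;
    return p;
}

/* Same shape as the pcre_stack_free variable: void (*)(void *). */
void lifo_free(void *block)
{
    uintptr_t b = (uintptr_t)block;
    uintptr_t lo = (uintptr_t)arena.bytes;
    if (block == NULL)
        return;
    if (b >= lo && b < lo + ARENA_BYTES)
        arena_top = (size_t)(b - lo);     /* pop: valid only for reverse-order frees */
    else
        free(block);                      /* block came from the malloc() fallback */
}
```

In a build made with --disable-stack-for-recursion, the program would assign these functions to the pcre_stack_malloc and pcre_stack_free variables before matching. Whether this actually outperforms the standard malloc() and free() depends on the platform and should be measured.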
307    
308    
309    USING EBCDIC CODE
310    
311           PCRE  assumes  by  default that it will run in an environment where the
312           character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE
313           can, however, be compiled to run in an EBCDIC environment by adding
314    
315             --enable-ebcdic
316    
317           to the configure command.
318    
319    Last updated: 09 December 2003
320  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
321  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
322    
323  NAME  PCRE(3)                                                                PCRE(3)
324       PCRE - Perl-compatible regular expressions  
325    
326    
327    NAME
328           PCRE - Perl-compatible regular expressions
329    
330  SYNOPSIS OF PCRE API  SYNOPSIS OF PCRE API
331    
332       #include <pcre.h>         #include <pcre.h>
333    
334       pcre *pcre_compile(const char *pattern, int options,         pcre *pcre_compile(const char *pattern, int options,
335            const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
336            const unsigned char *tableptr);              const unsigned char *tableptr);
337    
338       pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
339            const char **errptr);              const char **errptr);
340    
341       int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
342            const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
343            int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
344    
345       int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
346            const char *subject, int *ovector,              const char *subject, int *ovector,
347            int stringcount, const char *stringname,              int stringcount, const char *stringname,
348            char *buffer, int buffersize);              char *buffer, int buffersize);
349    
350       int pcre_copy_substring(const char *subject, int *ovector,         int pcre_copy_substring(const char *subject, int *ovector,
351            int stringcount, int stringnumber, char *buffer,              int stringcount, int stringnumber, char *buffer,
352            int buffersize);              int buffersize);
353    
354       int pcre_get_named_substring(const pcre *code,         int pcre_get_named_substring(const pcre *code,
355            const char *subject, int *ovector,              const char *subject, int *ovector,
356            int stringcount, const char *stringname,              int stringcount, const char *stringname,
357            const char **stringptr);              const char **stringptr);
358    
359       int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
360            const char *name);              const char *name);
361    
362       int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
363            int stringcount, int stringnumber,              int stringcount, int stringnumber,
364            const char **stringptr);              const char **stringptr);
365    
366       int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
367            int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
368    
369       void pcre_free_substring(const char *stringptr);         void pcre_free_substring(const char *stringptr);
370    
371       void pcre_free_substring_list(const char **stringptr);         void pcre_free_substring_list(const char **stringptr);
372    
373       const unsigned char *pcre_maketables(void);         const unsigned char *pcre_maketables(void);
374    
375       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
376            int what, void *where);              int what, void *where);
377    
378           int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
379    
380       int pcre_info(const pcre *code, int *optptr, *firstcharptr);         int pcre_config(int what, void *where);
381    
382       int pcre_config(int what, void *where);         char *pcre_version(void);
383    
384       char *pcre_version(void);         void *(*pcre_malloc)(size_t);
385    
386       void *(*pcre_malloc)(size_t);         void (*pcre_free)(void *);
387    
388       void (*pcre_free)(void *);         void *(*pcre_stack_malloc)(size_t);
389    
390       int (*pcre_callout)(pcre_callout_block *);         void (*pcre_stack_free)(void *);
391    
392           int (*pcre_callout)(pcre_callout_block *);
393    
394    
395  PCRE API  PCRE API
396    
397       PCRE has its own native API,  which  is  described  in  this         PCRE has its own native API, which is described in this document. There
398       document.  There  is  also  a  set of wrapper functions that         is also a set of wrapper functions that correspond to the POSIX regular
399       correspond to the POSIX regular expression API.   These  are         expression API.  These are described in the pcreposix documentation.
400       described in the pcreposix documentation.  
401           The  native  API  function  prototypes  are  defined in the header file
402       The native API function prototypes are defined in the header         pcre.h, and on Unix systems the library itself is called libpcre.a,  so
403       file  pcre.h,  and  on  Unix  systems  the library itself is         can be accessed by adding -lpcre to the command for linking an applica-
404       called libpcre.a, so can be accessed by adding -lpcre to the         tion which calls it. The header file defines the macros PCRE_MAJOR  and
405       command  for  linking  an  application  which  calls it. The         PCRE_MINOR  to  contain  the  major  and  minor release numbers for the
406       header file defines the macros PCRE_MAJOR and PCRE_MINOR  to         library. Applications can use these to include  support  for  different
407       contain the major and minor release numbers for the library.         releases.
408       Applications can use these to include support for  different  
409       releases.         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used
410           for compiling and matching regular expressions. A sample  program  that
411       The functions pcre_compile(), pcre_study(), and  pcre_exec()         demonstrates  the simplest way of using them is given in the file pcre-
412       are  used  for compiling and matching regular expressions. A         demo.c. The pcresample documentation describes how to run it.
413       sample program that demonstrates the simplest way  of  using  
414       them  is  given in the file pcredemo.c. The pcresample docu-         There are convenience functions for extracting captured substrings from
415       mentation describes how to run it.         a matched subject string. They are:
416    
417       There are convenience functions for extracting captured sub-           pcre_copy_substring()
418       strings from a matched subject string. They are:           pcre_copy_named_substring()
419             pcre_get_substring()
420         pcre_copy_substring()           pcre_get_named_substring()
421         pcre_copy_named_substring()           pcre_get_substring_list()
422         pcre_get_substring()  
423         pcre_get_named_substring()         pcre_free_substring() and pcre_free_substring_list() are also provided,
424         pcre_get_substring_list()         to free the memory used for extracted strings.
425    
426       pcre_free_substring()  and  pcre_free_substring_list()   are         The function pcre_maketables() is used (optionally) to build a  set  of
427       also  provided,  to  free  the  memory  used  for  extracted         character tables in the current locale for passing to pcre_compile().
428       strings.  
429           The  function  pcre_fullinfo()  is used to find out information about a
430       The function pcre_maketables() is used (optionally) to build         compiled pattern; pcre_info() is an obsolete version which returns only
431       a  set of character tables in the current locale for passing         some  of  the available information, but is retained for backwards com-
432       to pcre_compile().         patibility.  The function pcre_version() returns a pointer to a  string
433           containing the version of PCRE and its date of release.
434       The function pcre_fullinfo() is used to find out information  
435       about a compiled pattern; pcre_info() is an obsolete version         The  global  variables  pcre_malloc and pcre_free initially contain the
436       which returns only some of the available information, but is         entry points of the standard  malloc()  and  free()  functions  respec-
437       retained   for   backwards   compatibility.    The  function         tively. PCRE calls the memory management functions via these variables,
438       pcre_version() returns a pointer to a string containing  the         so a calling program can replace them if it  wishes  to  intercept  the
439       version of PCRE and its date of release.         calls. This should be done before calling any PCRE functions.
440    
441       The global variables  pcre_malloc  and  pcre_free  initially         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
442       contain the entry points of the standard malloc() and free()         indirections to memory management functions.  These  special  functions
443       functions respectively. PCRE  calls  the  memory  management         are  used  only  when  PCRE is compiled to use the heap for remembering
444       functions  via  these  variables,  so  a calling program can         data, instead of recursive function calls. This is a  non-standard  way
445       replace them if it  wishes  to  intercept  the  calls.  This         of  building  PCRE,  for  use in environments that have limited stacks.
446       should be done before calling any PCRE functions.         Because of the greater use of memory management, it runs  more  slowly.
447           Separate  functions  are provided so that special-purpose external code
448       The global variable pcre_callout initially contains NULL. It         can be used for this case. When used, these functions are always called
449       can be set by the caller to a "callout" function, which PCRE         in  a  stack-like  manner  (last obtained, first freed), and always for
450       will then call at specified points during a matching  opera-         memory blocks of the same size.
451       tion. Details are given in the pcrecallout documentation.  
452           The global variable pcre_callout initially contains NULL. It can be set
453           by  the  caller  to  a "callout" function, which PCRE will then call at
454           specified points during a matching operation. Details are given in  the
455           pcrecallout documentation.
456    
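
The stack-like, same-size usage described above for pcre_stack_malloc and
pcre_stack_free lends itself to a small block cache. The following is a
minimal sketch and not part of PCRE: the names cached_stack_malloc(),
cached_stack_free() and cached_stack_release() are hypothetical, and only
the two function-pointer signatures are taken from the synopsis. It assumes
a single thread and relies on the statement that every request has the same
size; a request of any other size is refused.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper names; only the signatures match pcre.h. */
typedef struct free_block {
    struct free_block *next;
} free_block;

static free_block *free_list = NULL;  /* released blocks kept for reuse */
static size_t block_size = 0;         /* size seen on the first request */

/* Same type as pcre_stack_malloc: void *(*)(size_t). */
void *cached_stack_malloc(size_t size)
{
    if (size < sizeof(free_block))
        size = sizeof(free_block);    /* leave room for the list link */
    if (block_size == 0)
        block_size = size;            /* remember the one block size */
    if (size != block_size)
        return NULL;                  /* outside the documented usage */
    if (free_list != NULL) {          /* reuse the most recently freed block */
        free_block *b = free_list;
        free_list = b->next;
        return b;
    }
    return malloc(size);
}

/* Same type as pcre_stack_free: void (*)(void *). */
void cached_stack_free(void *block)
{
    free_block *b = block;
    if (b == NULL)
        return;
    b->next = free_list;              /* keep the block instead of freeing */
    free_list = b;
}

/* Hand all cached blocks back to the system, e.g. at program exit. */
void cached_stack_release(void)
{
    while (free_list != NULL) {
        free_block *b = free_list;
        free_list = b->next;
        free(b);
    }
}

/* With pcre.h included, installation before any other PCRE call would be:
 *     pcre_stack_malloc = cached_stack_malloc;
 *     pcre_stack_free   = cached_stack_free;
 */
```

Keeping freed blocks on a list avoids a round trip through the general
allocator for each level of backtracking, which is the cost the text above
attributes to heap-based builds.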
457    
458  MULTITHREADING  MULTITHREADING
459    
460       The PCRE functions can be used in  multi-threading  applica-         The  PCRE  functions  can be used in multi-threading applications, with
461       tions, with the proviso that the memory management functions         the  proviso  that  the  memory  management  functions  pointed  to  by
462       pointed to by pcre_malloc and  pcre_free,  and  the  callout         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
463       function  pointed  to  by  pcre_callout,  are  shared by all         callout function pointed to by pcre_callout, are shared by all threads.
464       threads.  
465           The  compiled form of a regular expression is not altered during match-
466       The compiled form of a regular  expression  is  not  altered         ing, so the same compiled pattern can safely be used by several threads
467       during  matching, so the same compiled pattern can safely be         at once.
      used by several threads at once.  
468    
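
Because the replacement memory functions are shared by all threads, any
state they keep has to be safe for concurrent use. The following is a
minimal sketch of allocation-counting wrappers that could be installed in
pcre_malloc and pcre_free. It assumes a C11 compiler with <stdatomic.h>;
the names counting_malloc(), counting_free() and counting_live_blocks() are
hypothetical and do not come from PCRE.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Number of blocks currently allocated; static storage starts at zero. */
static atomic_long live_blocks;

/* Same type as pcre_malloc: void *(*)(size_t). */
void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        atomic_fetch_add(&live_blocks, 1);  /* atomic, so safe across threads */
    return p;
}

/* Same type as pcre_free: void (*)(void *). */
void counting_free(void *p)
{
    if (p != NULL)
        atomic_fetch_sub(&live_blocks, 1);
    free(p);
}

/* Snapshot of the counter, for leak checks in tests. */
long counting_live_blocks(void)
{
    return atomic_load(&live_blocks);
}

/* With pcre.h included, installation before any other PCRE call would be:
 *     pcre_malloc = counting_malloc;
 *     pcre_free   = counting_free;
 */
```

An atomic counter is used here rather than a lock so that the wrappers add
no contention point of their own.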
469    
470  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
471    
472       int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
473    
474           The  function pcre_config() makes it possible for a PCRE client to dis-
475           cover which optional features have been compiled into the PCRE library.
476           The  pcrebuild documentation has more details about these optional fea-
477           tures.
478    
479           The first argument for pcre_config() is an  integer,  specifying  which
480           information is required; the second argument is a pointer to a variable
481           into which the information is  placed.  The  following  information  is
482           available:
483    
484       The function pcre_config() makes  it  possible  for  a  PCRE           PCRE_CONFIG_UTF8
      client  to  discover  which optional features have been com-  
      piled into the PCRE library. The pcrebuild documentation has  
      more details about these optional features.  
485    
486       The first argument for pcre_config() is an integer, specify-         The  output is an integer that is set to one if UTF-8 support is avail-
487       ing  which information is required; the second argument is a         able; otherwise it is set to zero.
      pointer to a variable into which the information is  placed.  
      The following information is available:  
488    
489         PCRE_CONFIG_UTF8           PCRE_CONFIG_NEWLINE
490    
491       The output is an integer that is set to one if UTF-8 support         The output is an integer that is set to the value of the code  that  is
492       is available; otherwise it is set to zero.         used  for the newline character. It is either linefeed (10) or carriage
493           return (13), and should normally be the  standard  character  for  your
494           operating system.
495    
496         PCRE_CONFIG_NEWLINE           PCRE_CONFIG_LINK_SIZE
497    
498       The output is an integer that is set to  the  value  of  the         The  output  is  an  integer that contains the number of bytes used for
499       code  that  is  used for the newline character. It is either         internal linkage in compiled regular expressions. The value is 2, 3, or
500       linefeed (10) or carriage return (13), and  should  normally         4.  Larger  values  allow larger regular expressions to be compiled, at
501       be the standard character for your operating system.         the expense of slower matching. The default value of  2  is  sufficient
502           for  all  but  the  most massive patterns, since it allows the compiled
503           pattern to be up to 64K in size.
504    
505         PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
506    
507       The output is an integer that contains the number  of  bytes         The output is an integer that contains the threshold  above  which  the
508       used  for  internal linkage in compiled regular expressions.         POSIX  interface  uses malloc() for output vectors. Further details are
509       The value is 2, 3, or 4. Larger values allow larger  regular         given in the pcreposix documentation.
      expressions  to be compiled, at the expense of slower match-  
      ing. The default value of 2 is sufficient for  all  but  the  
      most  massive patterns, since it allows the compiled pattern  
      to be up to 64K in size.  
510    
511         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_MATCH_LIMIT
512    
513       The output is an integer that contains the  threshold  above         The output is an integer that gives the default limit for the number of
514       which  the POSIX interface uses malloc() for output vectors.         internal  matching  function  calls in a pcre_exec() execution. Further
515       Further details are given in the pcreposix documentation.         details are given with pcre_exec() below.
516    
517         PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_STACKRECURSE
518    
519       The output is an integer that gives the  default  limit  for         The output is an integer that is set to one if  internal  recursion  is
520       the   number  of  internal  matching  function  calls  in  a         implemented  by recursive function calls that use the stack to remember
521       pcre_exec()  execution.  Further  details  are  given   with         their state. This is the usual way that PCRE is compiled. The output is
522       pcre_exec() below.         zero  if PCRE was compiled to use blocks of data on the heap instead of
523           recursive  function  calls.  In  this   case,   pcre_stack_malloc   and
524           pcre_stack_free  are  called  to manage memory blocks on the heap, thus
525           avoiding the use of the stack.
526    
527    
528  COMPILING A PATTERN  COMPILING A PATTERN
529    
530       pcre *pcre_compile(const char *pattern, int options,         pcre *pcre_compile(const char *pattern, int options,
531            const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
532            const unsigned char *tableptr);              const unsigned char *tableptr);
533    
534       The function pcre_compile() is called to compile  a  pattern  
535       into  an internal form. The pattern is a C string terminated         The function pcre_compile() is called to  compile  a  pattern  into  an
536       by a binary zero, and is passed in the argument  pattern.  A         internal  form.  The pattern is a C string terminated by a binary zero,
537       pointer  to  a  single  block of memory that is obtained via         and is passed in the argument pattern. A pointer to a single  block  of
538       pcre_malloc is returned. This contains the compiled code and         memory  that is obtained via pcre_malloc is returned. This contains the
539       related  data.  The  pcre  type  is defined for the returned         compiled code and related data.  The  pcre  type  is  defined  for  the
540       block; this is a typedef for a structure whose contents  are         returned  block;  this  is a typedef for a structure whose contents are
541       not  externally  defined. It is up to the caller to free the         not externally defined. It is up to the caller to free the memory  when
542       memory when it is no longer required.         it is no longer required.
543    
544       Although the compiled code of a PCRE regex  is  relocatable,         Although  the compiled code of a PCRE regex is relocatable, that is, it
545       that is, it does not depend on memory location, the complete         does not depend on memory location, the complete pcre data block is not
546       pcre data block is not fully relocatable,  because  it  con-         fully relocatable, because it contains a copy of the tableptr argument,
547       tains  a  copy of the tableptr argument, which is an address         which is an address (see below).
548       (see below).  
549       The options argument contains independent bits  that  affect         The options argument contains independent bits that affect the compila-
550       the  compilation.  It  should  be  zero  if  no  options are         tion.  It  should  be  zero  if  no  options  are required. Some of the
551       required. Some of the options, in particular, those that are         options, in particular, those that are compatible with Perl,  can  also
552       compatible  with Perl, can also be set and unset from within         be  set and unset from within the pattern (see the detailed description
553       the pattern (see the detailed description of regular expres-         of regular expressions in the  pcrepattern  documentation).  For  these
554       sions  in the pcrepattern documentation). For these options,         options,  the  contents of the options argument specifies their initial
555       the contents of the options argument specifies their initial         settings at the start of compilation and execution.  The  PCRE_ANCHORED
556       settings  at  the  start  of  compilation and execution. The         option can be set at the time of matching as well as at compile time.
557       PCRE_ANCHORED option can be set at the time of  matching  as  
558       well as at compile time.         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
559           if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
560       If errptr is NULL, pcre_compile() returns NULL  immediately.         sets the variable pointed to by errptr to point to a textual error mes-
561       Otherwise, if compilation of a pattern fails, pcre_compile()         sage. The offset from the start of the pattern to the  character  where
562       returns NULL, and sets the variable pointed to by errptr  to         the  error  was  discovered  is  placed  in  the variable pointed to by
563       point  to a textual error message. The offset from the start         erroffset, which must not be NULL. If it  is,  an  immediate  error  is
564       of  the  pattern  to  the  character  where  the  error  was         given.
565       discovered   is   placed  in  the  variable  pointed  to  by  
566       erroffset, which must not be NULL. If it  is,  an  immediate         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
567       error is given.         character tables which are built when it is compiled, using the default
568           C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to
569       If the final  argument,  tableptr,  is  NULL,  PCRE  uses  a         pcre_maketables(). See the section on locale support below.
570       default  set  of character tables which are built when it is  
571       compiled, using the default C  locale.  Otherwise,  tableptr         This code fragment shows a typical straightforward  call  to  pcre_com-
572       must  be  the result of a call to pcre_maketables(). See the         pile():
573       section on locale support below.  
574             pcre *re;
575       This code fragment shows a typical straightforward  call  to           const char *error;
576       pcre_compile():           int erroffset;
577             re = pcre_compile(
578         pcre *re;             "^A.*Z",          /* the pattern */
579         const char *error;             0,                /* default options */
580         int erroffset;             &error,           /* for error message */
581         re = pcre_compile(             &erroffset,       /* for error offset */
582           "^A.*Z",          /* the pattern */             NULL);            /* use default character tables */
583           0,                /* default options */  
584           &error,           /* for error message */         The following option bits are defined:
585           &erroffset,       /* for error offset */  
586           NULL);            /* use default character tables */           PCRE_ANCHORED
587    
588       The following option bits are defined:         If this bit is set, the pattern is forced to be "anchored", that is, it
589           is constrained to match only at the first matching point in the  string
590         PCRE_ANCHORED         which is being searched (the "subject string"). This effect can also be
591           achieved by appropriate constructs in the pattern itself, which is  the
592       If this bit is set, the pattern is forced to be  "anchored",         only way to do it in Perl.
593       that is, it is constrained to match only at the first match-  
594       ing point in the string which is being searched  (the  "sub-           PCRE_CASELESS
595       ject string"). This effect can also be achieved by appropri-  
596       ate constructs in the pattern itself, which is the only  way         If  this  bit is set, letters in the pattern match both upper and lower
597       to do it in Perl.         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
598           changed within a pattern by a (?i) option setting.
599         PCRE_CASELESS  
600             PCRE_DOLLAR_ENDONLY
601       If this bit is set, letters in the pattern match both  upper  
602       and  lower  case  letters.  It  is  equivalent  to Perl's /i         If  this bit is set, a dollar metacharacter in the pattern matches only
603       option, and it can be changed within a  pattern  by  a  (?i)         at the end of the subject string. Without this option,  a  dollar  also
604       option setting.         matches  immediately before the final character if it is a newline (but
605           not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is
606         PCRE_DOLLAR_ENDONLY         ignored if PCRE_MULTILINE is set. There is no equivalent to this option
607           in Perl, and no way to set it within a pattern.
608       If this bit is set, a dollar metacharacter  in  the  pattern  
609       matches  only at the end of the subject string. Without this           PCRE_DOTALL
610       option, a dollar also matches immediately before  the  final  
611       character  if it is a newline (but not before any other new-         If this bit is set, a dot metacharacter in the pattern matches all char-
612       lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if         acters,  including  newlines.  Without  it, newlines are excluded. This
613       PCRE_MULTILINE is set. There is no equivalent to this option         option is equivalent to Perl's /s option, and it can be changed  within
614       in Perl, and no way to set it within a pattern.         a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]
615           always matches a newline character, independent of the setting of  this
616         PCRE_DOTALL         option.
617    
618       If this bit is set,  a  dot  metacharacter  in  the  pattern           PCRE_EXTENDED
619       matches all characters, including newlines. Without it, new-  
620       lines are excluded. This option is equivalent to  Perl's  /s         If  this  bit  is  set,  whitespace  data characters in the pattern are
621       option,  and  it  can  be changed within a pattern by a (?s)         totally ignored except  when  escaped  or  inside  a  character  class.
622       option setting. A negative class such as [^a] always matches         Whitespace  does  not  include the VT character (code 11). In addition,
623       a  newline  character,  independent  of  the setting of this         characters between an unescaped # outside a  character  class  and  the
624       option.         next newline character, inclusive, are also ignored. This is equivalent
625           to Perl's /x option, and it can be changed within a pattern by  a  (?x)
626         PCRE_EXTENDED         option setting.
627    
628       If this bit is set, whitespace data characters in  the  pat-         This  option  makes  it possible to include comments inside complicated
629       tern  are  totally  ignored  except when escaped or inside a         patterns.  Note, however, that this applies only  to  data  characters.
630       character class. Whitespace does not include the VT  charac-         Whitespace   characters  may  never  appear  within  special  character
631       ter  (code 11). In addition, characters between an unescaped         sequences in a pattern, for  example  within  the  sequence  (?(  which
632       # outside a character class and the next newline  character,         introduces a conditional subpattern.
633       inclusive, are also ignored. This is equivalent to Perl's /x  
634       option, and it can be changed within a  pattern  by  a  (?x)           PCRE_EXTRA
635       option setting.  
636           This  option  was invented in order to turn on additional functionality
637       This option makes it possible  to  include  comments  inside         of PCRE that is incompatible with Perl, but it  is  currently  of  very
638       complicated patterns.  Note, however, that this applies only         little  use. When set, any backslash in a pattern that is followed by a
639       to data characters. Whitespace characters may  never  appear         letter that has no special meaning  causes  an  error,  thus  reserving
640       within special character sequences in a pattern, for example         these  combinations  for  future  expansion.  By default, as in Perl, a
641       within the sequence (?( which introduces a conditional  sub-         backslash followed by a letter with no special meaning is treated as  a
642       pattern.         literal.  There  are  at  present  no other features controlled by this
643           option. It can also be set by a (?X) option setting within a pattern.
644         PCRE_EXTRA  
645             PCRE_MULTILINE
646       This option was invented in  order  to  turn  on  additional  
647       functionality of PCRE that is incompatible with Perl, but it         By default, PCRE treats the subject string as consisting  of  a  single
648       is currently of very little use. When set, any backslash  in         "line"  of  characters (even if it actually contains several newlines).
649       a  pattern  that is followed by a letter that has no special         The "start of line" metacharacter (^) matches only at the start of  the
650       meaning causes an error, thus reserving  these  combinations         string,  while  the "end of line" metacharacter ($) matches only at the
651       for  future  expansion.  By default, as in Perl, a backslash         end of the string, or before a terminating  newline  (unless  PCRE_DOL-
652       followed by a letter with no special meaning is treated as a         LAR_ENDONLY is set). This is the same as Perl.
653       literal.  There  are at present no other features controlled  
654       by this option. It can also be set by a (?X) option  setting         When  PCRE_MULTILINE  is  set,  the "start of line" and "end of line"
655       within a pattern.         constructs match immediately following or immediately before  any  new-
656           line  in the subject string, respectively, as well as at the very start
657         PCRE_MULTILINE         and end. This is equivalent to Perl's /m option, and it can be  changed
658           within a pattern by a (?m) option setting. If there are no "\n" charac-
659       By default, PCRE treats the subject string as consisting  of         ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,
660       a  single "line" of characters (even if it actually contains         setting PCRE_MULTILINE has no effect.
661       several newlines). The "start  of  line"  metacharacter  (^)  
662       matches  only  at the start of the string, while the "end of           PCRE_NO_AUTO_CAPTURE
663       line" metacharacter ($) matches  only  at  the  end  of  the  
664       string,    or   before   a   terminating   newline   (unless         If this option is set, it disables the use of numbered capturing paren-
665       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.         theses in the pattern. Any opening parenthesis that is not followed  by
666           ?  behaves as if it were followed by ?: but named parentheses can still
667       When PCRE_MULTILINE is set,  the  "start of line"  and  "end         be used for capturing (and they acquire  numbers  in  the  usual  way).
668       of  line"  constructs match immediately following or immedi-         There is no equivalent of this option in Perl.
669       ately before any newline  in  the  subject  string,  respec-  
670       tively,  as  well  as  at  the  very  start and end. This is           PCRE_UNGREEDY
671       equivalent to Perl's /m option, and it can be changed within  
672       a  pattern  by  a  (?m) option setting. If there are no "\n"         This  option  inverts  the "greediness" of the quantifiers so that they
673       characters in a subject string, or no occurrences of ^ or  $         are not greedy by default, but become greedy if followed by "?". It  is
674       in a pattern, setting PCRE_MULTILINE has no effect.         not  compatible  with Perl. It can also be set by a (?U) option setting
675           within the pattern.
676         PCRE_NO_AUTO_CAPTURE  
677             PCRE_UTF8
678       If this option is set, it disables the use of numbered  cap-  
679       turing  parentheses  in the pattern. Any opening parenthesis         This option causes PCRE to regard both the pattern and the  subject  as
680       that is not followed by ? behaves as if it were followed  by         strings  of  UTF-8 characters instead of single-byte character strings.
681       ?:  but  named  parentheses  can still be used for capturing         However, it is available only if PCRE has been built to  include  UTF-8
682       (and they acquire numbers in the usual  way).  There  is  no         support.  If  not, the use of this option provokes an error. Details of
683       equivalent of this option in Perl.         how this option changes the behaviour of PCRE are given in the  section
684           on UTF-8 support in the main pcre page.
685         PCRE_UNGREEDY  
686             PCRE_NO_UTF8_CHECK
687       This option inverts the "greediness" of the  quantifiers  so  
688       that  they  are  not greedy by default, but become greedy if         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
689       followed by "?". It is not compatible with Perl. It can also         automatically checked. If an invalid UTF-8 sequence of bytes is  found,
690       be set by a (?U) option setting within the pattern.         pcre_compile()  returns an error. If you already know that your pattern
691           is valid, and you want to skip this check for performance reasons,  you
692         PCRE_UTF8         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of
693           passing an invalid UTF-8 string as a pattern is undefined. It may cause
694       This option causes PCRE to regard both the pattern  and  the         your  program  to  crash.  Note that there is a similar option for sup-
695       subject  as  strings  of UTF-8 characters instead of single-         pressing the checking of subject strings passed to pcre_exec().
696       byte character strings. However, it  is  available  only  if  
      PCRE  has  been  built to include UTF-8 support. If not, the  
      use of this option provokes an error. Details  of  how  this  
      option  changes  the behaviour of PCRE are given in the sec-  
      tion on UTF-8 support in the main pcre page.  
697    
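
To make the notion of an "invalid UTF-8 sequence of bytes" concrete, the
following is a minimal sketch of a validity check that an application could
run once on a pattern it intends to compile repeatedly. It follows the rules
of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF),
which is an assumption: it is not PCRE's own check, and PCRE may accept or
reject some sequences differently. The name utf8_is_valid() is hypothetical.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8 under RFC 3629
 * rules, and 0 otherwise. Not PCRE's internal check. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;          /* number of continuation bytes expected */
        unsigned long cp;      /* code point being decoded */
        size_t k;

        if (c < 0x80) {        /* single-byte (ASCII) character */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07;
        } else {
            return 0;          /* stray continuation byte or bad lead byte */
        }

        if (len - i - 1 < extra)
            return 0;          /* sequence is cut off by end of string */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;      /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong encodings, surrogates, and out-of-range values. */
        if ((extra == 1 && cp < 0x80) ||
            (extra == 2 && cp < 0x800) ||
            (extra == 3 && cp < 0x10000) ||
            (cp >= 0xD800 && cp <= 0xDFFF) ||
            cp > 0x10FFFF)
            return 0;

        i += extra + 1;
    }
    return 1;
}
```

A caller might validate a pattern once when it is loaded and only then set
PCRE_NO_UTF8_CHECK on later compilations of the same string.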
698    
699  STUDYING A PATTERN  STUDYING A PATTERN
700    
701       pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
702            const char **errptr);              const char **errptr);
703    
704       When a pattern is going to be  used  several  times,  it  is         When a pattern is going to be used several times, it is worth  spending
705       worth  spending  more time analyzing it in order to speed up         more  time  analyzing it in order to speed up the time taken for match-
706       the time taken for matching. The function pcre_study() takes         ing. The function pcre_study() takes a pointer to a compiled pattern as
707       a  pointer  to  a compiled pattern as its first argument. If         its first argument. If studying the pattern produces additional informa-
708       studying the pattern produces  additional  information  that         tion that will help speed up matching, pcre_study() returns  a  pointer
709       will  help speed up matching, pcre_study() returns a pointer         to  a  pcre_extra  block,  in  which the study_data field points to the
710       to a pcre_extra block, in which the study_data field  points         results of the study.
711       to the results of the study.  
712           The returned value from  a  pcre_study()  can  be  passed  directly  to
713       The  returned  value  from  a  pcre_study()  can  be  passed         pcre_exec().  However,  the pcre_extra block also contains other fields
714       directly  to pcre_exec(). However, the pcre_extra block also         that can be set by the caller before the block  is  passed;  these  are
715       contains other fields that can be set by the  caller  before         described  below.  If  studying  the pattern does not produce any addi-
716       the  block is passed; these are described below. If studying         tional information, pcre_study() returns NULL. In that circumstance, if
717       the pattern does not  produce  any  additional  information,         the  calling  program  wants  to  pass  some  of  the  other  fields to
718       pcre_study() returns NULL. In that circumstance, if the cal-         pcre_exec(), it must set up its own pcre_extra block.
719       ling program wants to pass  some  of  the  other  fields  to  
720       pcre_exec(), it must set up its own pcre_extra block.         The second argument contains option bits. At present,  no  options  are
721           defined for pcre_study(), and this argument should always be zero.
722       The second argument contains option  bits.  At  present,  no  
723       options  are  defined  for  pcre_study(),  and this argument         The  third argument for pcre_study() is a pointer for an error message.
724       should always be zero.         If studying succeeds (even if no data is  returned),  the  variable  it
725           points  to  is set to NULL. Otherwise it points to a textual error mes-
726       The third argument for pcre_study()  is  a  pointer  for  an         sage. You should therefore test the error pointer for NULL after  call-
727       error  message.  If  studying  succeeds  (even if no data is         ing pcre_study(), to be sure that it has run successfully.
728       returned), the variable it points to is set to NULL.  Other-  
729       wise it points to a textual error message. You should there-         This is a typical call to pcre_study():
730       fore  test  the  error  pointer  for  NULL   after   calling  
731       pcre_study(), to be sure that it has run successfully.           pcre_extra *pe;
732             pe = pcre_study(
733       This is a typical call to pcre_study():             re,             /* result of pcre_compile() */
734               0,              /* no options exist */
735         pcre_extra *pe;             &error);        /* set to NULL or points to a message */
736         pe = pcre_study(  
737           re,             /* result of pcre_compile() */         At present, studying a pattern is useful only for non-anchored patterns
738           0,              /* no options exist */         that do not have a single fixed starting character. A bitmap of  possi-
739           &error);        /* set to NULL or points to a message */         ble starting characters is created.
   
      At present, studying a  pattern  is  useful  only  for  non-  
      anchored  patterns  that do not have a single fixed starting  
      character. A  bitmap  of  possible  starting  characters  is  
      created.  
740    
741    
742  LOCALE SUPPORT  LOCALE SUPPORT
743    
744       PCRE handles caseless matching, and determines whether char-         PCRE  handles  caseless matching, and determines whether characters are
745       acters  are  letters, digits, or whatever, by reference to a         letters, digits, or whatever, by reference to a  set  of  tables.  When
746       set of tables. When running in UTF-8 mode, this applies only         running  in UTF-8 mode, this applies only to characters with codes less
747       to characters with codes less than 256. The library contains         than 256. The library contains a default set of tables that is  created
748       a default set of tables that is created  in  the  default  C         in  the  default  C locale when PCRE is compiled. This is used when the
749       locale  when  PCRE  is compiled. This is used when the final         final argument of pcre_compile() is NULL, and is  sufficient  for  many
750       argument of pcre_compile() is NULL, and  is  sufficient  for         applications.
751       many applications.  
752           An alternative set of tables can, however, be supplied. Such tables are
753       An alternative set of tables can, however, be supplied. Such         built by calling the pcre_maketables() function,  which  has  no  argu-
754       tables  are built by calling the pcre_maketables() function,         ments,  in  the  relevant  locale.  The  result  can  then be passed to
755       which has no arguments, in the relevant locale.  The  result         pcre_compile() as often as necessary. For example,  to  build  and  use
756       can  then be passed to pcre_compile() as often as necessary.         tables that are appropriate for the French locale (where accented char-
757       For example, to build and use tables  that  are  appropriate         acters with codes greater than 128 are treated as letters), the follow-
758       for  the French locale (where accented characters with codes         ing code could be used:
759       greater than 128 are treated as letters), the following code  
760       could be used:           setlocale(LC_CTYPE, "fr");
761             tables = pcre_maketables();
762         setlocale(LC_CTYPE, "fr");           re = pcre_compile(..., tables);
763         tables = pcre_maketables();  
764         re = pcre_compile(..., tables);         The  tables  are  built in memory that is obtained via pcre_malloc. The
765           pointer that is passed to pcre_compile is saved with the compiled  pat-
766       The  tables  are  built  in  memory  that  is  obtained  via         tern, and the same tables are used via this pointer by pcre_study() and
767       pcre_malloc.  The  pointer that is passed to pcre_compile is         pcre_exec(). Thus, for any single pattern,  compilation,  studying  and
768       saved with the compiled pattern, and  the  same  tables  are         matching  all  happen in the same locale, but different patterns can be
769       used via this pointer by pcre_study() and pcre_exec(). Thus,         compiled in different locales. It is  the  caller's  responsibility  to
770       for any single pattern, compilation, studying  and  matching         ensure  that  the memory containing the tables remains available for as
771       all happen in the same locale, but different patterns can be         long as it is needed.
      compiled in different locales. It is the caller's  responsi-  
      bility  to  ensure  that  the  memory  containing the tables  
      remains available for as long as it is needed.  
772    
773    
774  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
775    
776       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
777            int what, void *where);              int what, void *where);
778    
779       The pcre_fullinfo() function  returns  information  about  a         The pcre_fullinfo() function returns information about a compiled  pat-
780       compiled pattern. It replaces the obsolete pcre_info() func-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
781       tion, which is nevertheless retained for  backwards compati-         less retained for backwards compatibility (and is documented below).
782       bility (and is documented below).  
783           The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
784       The first argument for pcre_fullinfo() is a pointer  to  the         pattern.  The second argument is the result of pcre_study(), or NULL if
785       compiled  pattern.  The  second  argument  is  the result of         the pattern was not studied. The third argument specifies  which  piece
786       pcre_study(), or NULL if the pattern was  not  studied.  The         of  information  is required, and the fourth argument is a pointer to a
787       third  argument  specifies  which  piece  of  information is         variable to receive the data. The yield of the  function  is  zero  for
788       required, and the fourth argument is a pointer to a variable         success, or one of the following negative numbers:
789       to  receive  the data. The yield of the function is zero for  
790       success, or one of the following negative numbers:           PCRE_ERROR_NULL       the argument code was NULL
791                                   the argument where was NULL
792         PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_BADMAGIC   the "magic number" was not found
793                               the argument where was NULL           PCRE_ERROR_BADOPTION  the value of what was invalid
794         PCRE_ERROR_BADMAGIC   the "magic number" was not found  
795         PCRE_ERROR_BADOPTION  the value of what was invalid         Here  is a typical call of pcre_fullinfo(), to obtain the length of the
796           compiled pattern:
797       Here is a typical call of  pcre_fullinfo(),  to  obtain  the  
798       length of the compiled pattern:           int rc;
799             unsigned long int length;
800         int rc;           rc = pcre_fullinfo(
801         unsigned long int length;             re,               /* result of pcre_compile() */
802         rc = pcre_fullinfo(             pe,               /* result of pcre_study(), or NULL */
803           re,               /* result of pcre_compile() */             PCRE_INFO_SIZE,   /* what is required */
804           pe,               /* result of pcre_study(), or NULL */             &length);         /* where to put the data */
805           PCRE_INFO_SIZE,   /* what is required */  
806           &length);         /* where to put the data */         The possible values for the third argument are defined in  pcre.h,  and
807           are as follows:
808       The possible values for the third argument  are  defined  in  
809       pcre.h, and are as follows:           PCRE_INFO_BACKREFMAX
810    
811         PCRE_INFO_BACKREFMAX         Return  the  number  of  the highest back reference in the pattern. The
812           fourth argument should point to an int variable. Zero  is  returned  if
813       Return the number of the highest back reference in the  pat-         there are no back references.
814       tern.  The  fourth argument should point to an int variable.  
815       Zero is returned if there are no back references.           PCRE_INFO_CAPTURECOUNT
816    
817         PCRE_INFO_CAPTURECOUNT         Return  the  number of capturing subpatterns in the pattern. The fourth
818           argument should point to an int variable.
819       Return the number of capturing subpatterns in  the  pattern.  
820       The fourth argument should point to an int variable.           PCRE_INFO_FIRSTBYTE
821    
822         PCRE_INFO_FIRSTBYTE         Return information about the first byte of any matched  string,  for  a
823           non-anchored    pattern.    (This    option    used    to   be   called
824       Return information about  the  first  byte  of  any  matched         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards
825       string,  for a non-anchored pattern. (This option used to be         compatibility.)
826       called PCRE_INFO_FIRSTCHAR; the old name is still recognized  
827       for backwards compatibility.)         If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as
828           (cat|cow|coyote), it is returned in the integer pointed  to  by  where.
829       If there is a fixed first byte, e.g. from a pattern such  as         Otherwise, if either
830       (cat|cow|coyote),  it  is returned in the integer pointed to  
831       by where. Otherwise, if either         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
832           branch starts with "^", or
833       (a) the pattern was compiled with the PCRE_MULTILINE option,  
834       and every branch starts with "^", or         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
835           set (if it were set, the pattern would be anchored),
836       (b) every  branch  of  the  pattern  starts  with  ".*"  and  
837       PCRE_DOTALL is not set (if it were set, the pattern would be         -1  is  returned, indicating that the pattern matches only at the start
838       anchored),         of a subject string or after any newline within the  string.  Otherwise
839           -2 is returned. For anchored patterns, -2 is returned.
840       -1 is returned, indicating that the pattern matches only  at  
841       the  start  of  a subject string or after any newline within           PCRE_INFO_FIRSTTABLE
842       the string. Otherwise -2 is returned. For anchored patterns,  
843       -2 is returned.         If  the pattern was studied, and this resulted in the construction of a
844           256-bit table indicating a fixed set of bytes for the first byte in any
845         PCRE_INFO_FIRSTTABLE         matching  string, a pointer to the table is returned. Otherwise NULL is
846           returned. The fourth argument should point to an unsigned char *  vari-
847       If the pattern was studied, and this resulted  in  the  con-         able.
848       struction of a 256-bit table indicating a fixed set of bytes  
849       for the first byte in any matching string, a pointer to  the           PCRE_INFO_LASTLITERAL
850       table  is  returned.  Otherwise NULL is returned. The fourth  
851       argument should point to an unsigned char * variable.         Return  the  value of the rightmost literal byte that must exist in any
852           matched string, other than at its  start,  if  such  a  byte  has  been
853         PCRE_INFO_LASTLITERAL         recorded. The fourth argument should point to an int variable. If there
854           is no such byte, -1 is returned. For anchored patterns, a last  literal
855       Return the value of the rightmost  literal  byte  that  must         byte  is  recorded only if it follows something of variable length. For
856       exist  in  any  matched  string, other than at its start, if         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
857       such a byte has been recorded. The  fourth  argument  should         /^a\dz\d/ the returned value is -1.
858       point  to  an  int variable. If there is no such byte, -1 is  
859       returned. For anchored patterns,  a  last  literal  byte  is           PCRE_INFO_NAMECOUNT
860       recorded  only  if  it follows something of variable length.           PCRE_INFO_NAMEENTRYSIZE
861       For example, for the pattern /^a\d+z\d+/ the returned  value           PCRE_INFO_NAMETABLE
862       is "z", but for /^a\dz\d/ the returned value is -1.  
863           PCRE  supports the use of named as well as numbered capturing parenthe-
864         PCRE_INFO_NAMECOUNT         ses. The names are just an additional way of identifying the  parenthe-
865         PCRE_INFO_NAMEENTRYSIZE         ses,  which still acquire a number. A caller that wants to extract data
866         PCRE_INFO_NAMETABLE         from a named subpattern must convert the name to a number in  order  to
867           access  the  correct  pointers  in  the  output  vector (described with
868       PCRE supports the use of named as well as numbered capturing         pcre_exec() below). In order to do this, it must first use these  three
869       parentheses. The names are just an additional way of identi-         values to obtain the name-to-number mapping table for the pattern.
870       fying the parentheses,  which  still  acquire  a  number.  A  
871       caller  that  wants  to extract data from a named subpattern         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
872       must convert the name to a number in  order  to  access  the         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
873       correct  pointers  in  the  output  vector  (described  with         of  each  entry;  both  of  these  return  an int value. The entry size
874       pcre_exec() below). In order to do this, it must  first  use         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
875       these  three  values  to  obtain  the name-to-number mapping         a  pointer  to  the  first  entry of the table (a pointer to char). The
876       table for the pattern.         first two bytes of each entry are the number of the capturing parenthe-
877           sis,  most  significant byte first. The rest of the entry is the corre-
878       The  map  consists  of  a  number  of  fixed-size   entries.         sponding name, zero terminated. The names are  in  alphabetical  order.
879       PCRE_INFO_NAMECOUNT   gives   the  number  of  entries,  and         For  example,  consider  the following pattern (assume PCRE_EXTENDED is
880       PCRE_INFO_NAMEENTRYSIZE gives the size of each  entry;  both         set, so white space - including newlines - is ignored):
881       of  these return an int value. The entry size depends on the  
882       length of the longest name.  PCRE_INFO_NAMETABLE  returns  a           (?P<date> (?P<year>(\d\d)?\d\d) -
883       pointer to the first entry of the table (a pointer to char).           (?P<month>\d\d) - (?P<day>\d\d) )
884       The first two bytes of each entry are the number of the cap-  
885       turing parenthesis, most significant byte first. The rest of         There are four named subpatterns, so the table has  four  entries,  and
886       the entry is the corresponding name,  zero  terminated.  The         each  entry  in the table is eight bytes long. The table is as follows,
887       names  are  in alphabetical order. For example, consider the         with non-printing bytes shown in hex, and undefined bytes shown as ??:
888       following pattern (assume PCRE_EXTENDED  is  set,  so  white  
889       space - including newlines - is ignored):           00 01 d  a  t  e  00 ??
890             00 05 d  a  y  00 ?? ??
891         (?P<date> (?P<year>(\d\d)?\d\d) -           00 04 m  o  n  t  h  00
892         (?P<month>\d\d) - (?P<day>\d\d) )           00 02 y  e  a  r  00 ??
893    
894       There are four named subpatterns,  so  the  table  has  four         When writing code to extract data from named subpatterns, remember that
895       entries,  and  each  entry in the table is eight bytes long.         the length of each entry may be different for each compiled pattern.
896       The table is as follows, with non-printing  bytes  shown  in  
897       hex, and undefined bytes shown as ??:           PCRE_INFO_OPTIONS
898    
899         00 01 d  a  t  e  00 ??         Return  a  copy of the options with which the pattern was compiled. The
900         00 05 d  a  y  00 ?? ??         fourth argument should point to an unsigned long  int  variable.  These
901         00 04 m  o  n  t  h  00         option bits are those specified in the call to pcre_compile(), modified
902         00 02 y  e  a  r  00 ??         by any top-level option settings within the pattern itself.
903    
904       When writing code to extract data  from  named  subpatterns,         A pattern is automatically anchored by PCRE if  all  of  its  top-level
905       remember  that the length of each entry may be different for         alternatives begin with one of the following:
906       each compiled pattern.  
907             ^     unless PCRE_MULTILINE is set
908         PCRE_INFO_OPTIONS           \A    always
909             \G    always
910       Return a copy of the options with which the pattern was com-           .*    if PCRE_DOTALL is set and there are no back
911       piled.  The fourth argument should point to an unsigned long                   references to the subpattern in which .* appears
912       int variable. These option bits are those specified  in  the  
913       call  to  pcre_compile(),  modified  by any top-level option         For such patterns, the PCRE_ANCHORED bit is set in the options returned
914       settings within the pattern itself.         by pcre_fullinfo().
915    
916       A pattern is automatically anchored by PCRE if  all  of  its           PCRE_INFO_SIZE
917       top-level alternatives begin with one of the following:  
918           Return the size of the compiled pattern, that is, the  value  that  was
919         ^     unless PCRE_MULTILINE is set         passed as the argument to pcre_malloc() when PCRE was getting memory in
920         \A    always         which to place the compiled data. The fourth argument should point to a
921         \G    always         size_t variable.
922         .*    if PCRE_DOTALL is set and there are no back  
923                 references to the subpattern in which .* appears           PCRE_INFO_STUDYSIZE
924    
925       For such patterns, the  PCRE_ANCHORED  bit  is  set  in  the         Returns  the  size of the data block pointed to by the study_data field
926       options returned by pcre_fullinfo().         in a pcre_extra block. That is, it is the  value  that  was  passed  to
927           pcre_malloc() when PCRE was getting memory into which to place the data
928         PCRE_INFO_SIZE         created by pcre_study(). The fourth argument should point to  a  size_t
929           variable.
      Return the size of the compiled pattern, that is, the  value  
      that  was  passed as the argument to pcre_malloc() when PCRE  
      was getting memory in which to place the compiled data.  The  
      fourth argument should point to a size_t variable.  
   
        PCRE_INFO_STUDYSIZE  
   
      Returns the size  of  the  data  block  pointed  to  by  the  
      study_data  field  in a pcre_extra block. That is, it is the  
      value that was passed to pcre_malloc() when PCRE was getting  
      memory into which to place the data created by pcre_study().  
      The fourth argument should point to a size_t variable.  
930    
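The name-to-number mapping described under PCRE_INFO_NAMETABLE above can be
sketched as a small lookup routine. This is only an illustration, not part
of the PCRE API: the function name name_to_number() and the array
example_table are hypothetical, and the bytes shown as ?? in the table above
are filled with zeros here purely so that the array can be initialized. The
routine relies only on the layout stated in the text: fixed-size entries, a
two-byte number with the most significant byte first, then a zero-terminated
name.

```c
#include <stddef.h>
#include <string.h>

/* The four-entry table from the text (entry size 8). Bytes that the
   documentation shows as ?? are set to zero here as an assumption. */
static const char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Hypothetical helper: return the number of the capturing parenthesis
   called "name", or -1 if the name is absent or the arguments are unusable.
   table, name_count and entry_size stand for the values obtained with
   PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE. */
int name_to_number(const char *table, int name_count, int entry_size,
                   const char *name)
{
    int i;

    /* An entry needs two number bytes plus at least a terminating zero. */
    if (table == NULL || name == NULL || entry_size < 3)
        return -1;

    for (i = 0; i < name_count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;

        /* The name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];   /* most significant first */
    }
    return -1;
}
```

With a real compiled pattern, the three arguments would come from three
calls to pcre_fullinfo(), and the entry size must be re-read for each
pattern because it may differ. A linear scan keeps the sketch short; since
the names are stored in alphabetical order, a binary search would also be
possible.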
931    
932  OBSOLETE INFO FUNCTION  OBSOLETE INFO FUNCTION
933    
934       int pcre_info(const pcre *code, int *optptr, *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
935    
936       The pcre_info() function is now obsolete because its  inter-         The  pcre_info()  function is now obsolete because its interface is too
937       face  is  too  restrictive  to return all the available data         restrictive to return all the available data about a compiled  pattern.
938       about  a  compiled  pattern.   New   programs   should   use         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of
939       pcre_fullinfo()  instead.  The  yield  of pcre_info() is the         pcre_info() is the number of capturing subpatterns, or one of the  fol-
940       number of capturing subpatterns, or  one  of  the  following         lowing negative numbers:
941       negative numbers:  
942             PCRE_ERROR_NULL       the argument code was NULL
943         PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_BADMAGIC   the "magic number" was not found
944         PCRE_ERROR_BADMAGIC   the "magic number" was not found  
945           If  the  optptr  argument is not NULL, a copy of the options with which
946       If the optptr argument is not NULL, a copy  of  the  options         the pattern was compiled is placed in the integer  it  points  to  (see
947       with which the pattern was compiled is placed in the integer         PCRE_INFO_OPTIONS above).
948       it points to (see PCRE_INFO_OPTIONS above).  
949           If  the  pattern  is  not anchored and the firstcharptr argument is not
950       If the pattern is not anchored and the firstcharptr argument         NULL, it is used to pass back information about the first character  of
951       is  not  NULL, it is used to pass back information about the         any matched string (see PCRE_INFO_FIRSTBYTE above).
      first    character    of    any    matched    string    (see  
      PCRE_INFO_FIRSTBYTE above).  
952    
953    
954  MATCHING A PATTERN  MATCHING A PATTERN
955    
956       int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
957            const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
958            int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
959    
960       The function pcre_exec() is called to match a subject string         The  function pcre_exec() is called to match a subject string against a
961       against  a pre-compiled pattern, which is passed in the code         pre-compiled pattern, which is passed in the code argument. If the pat-
962       argument. If the pattern has been studied, the result of the         tern  has been studied, the result of the study should be passed in the
963       study should be passed in the extra argument.         extra argument.
964    
965       Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
966    
967         int rc;           int rc;
968         int ovector[30];           int ovector[30];
969         rc = pcre_exec(           rc = pcre_exec(
970           re,             /* result of pcre_compile() */             re,             /* result of pcre_compile() */
971           NULL,           /* we didn't study the pattern */             NULL,           /* we didn't study the pattern */
972           "some string",  /* the subject string */             "some string",  /* the subject string */
973           11,             /* the length of the subject string */             11,             /* the length of the subject string */
974           0,              /* start at offset 0 in the subject */             0,              /* start at offset 0 in the subject */
975           0,              /* default options */             0,              /* default options */
976           ovector,        /* vector for substring information */             ovector,        /* vector for substring information */
977           30);            /* number of elements in the vector */             30);            /* number of elements in the vector */
978    
979       If the extra argument is  not  NULL,  it  must  point  to  a         If the extra argument is not NULL, it must point to a  pcre_extra  data
980       pcre_extra  data  block.  The  pcre_study() function returns         block.  The pcre_study() function returns such a block (when it doesn't
981       such a block (when it doesn't return NULL), but you can also         return NULL), but you can also create one for yourself, and pass  addi-
982       create  one for yourself, and pass additional information in         tional information in it. The fields in the block are as follows:
983       it. The fields in the block are as follows:  
984             unsigned long int flags;
985         unsigned long int flags;           void *study_data;
986         void *study_data;           unsigned long int match_limit;
987         unsigned long int match_limit;           void *callout_data;
988         void *callout_data;  
989           The  flags  field  is a bitmap that specifies which of the other fields
990       The flags field is a bitmap  that  specifies  which  of  the         are set. The flag bits are:
991       other fields are set. The flag bits are:  
992             PCRE_EXTRA_STUDY_DATA
993         PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_MATCH_LIMIT
994         PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_CALLOUT_DATA
995         PCRE_EXTRA_CALLOUT_DATA  
996           Other flag bits should be set to zero. The study_data field is  set  in
997       Other flag bits should be set to zero. The study_data  field         the  pcre_extra  block  that is returned by pcre_study(), together with
998       is   set  in  the  pcre_extra  block  that  is  returned  by         the appropriate flag bit. You should not set this yourself, but you can
999       pcre_study(), together with the appropriate  flag  bit.  You         add to the block by setting the other fields.
1000       should  not  set this yourself, but you can add to the block  
1001       by setting the other fields.         The match_limit field provides a means of preventing PCRE from using up
1002           a vast amount of resources when running patterns that are not going  to
1003       The match_limit field provides a means  of  preventing  PCRE         match,  but  which  have  a very large number of possibilities in their
1004       from  using  up a vast amount of resources when running pat-         search trees. The classic  example  is  the  use  of  nested  unlimited
1005       terns that are not going to match, but  which  have  a  very         repeats. Internally, PCRE uses a function called match() which it calls
1006       large  number  of  possibilities  in their search trees. The         repeatedly (sometimes recursively). The limit is imposed on the  number
1007       classic example is the  use  of  nested  unlimited  repeats.         of  times  this function is called during a match, which has the effect
1008       Internally,  PCRE  uses  a  function called match() which it         of limiting the amount of recursion  and  backtracking  that  can  take
1009       calls  repeatedly  (sometimes  recursively).  The  limit  is         place.  For  patterns that are not anchored, the count starts from zero
1010       imposed  on the number of times this function is called dur-         for each position in the subject string.
1011       ing a match, which has the effect of limiting the amount  of  
1012       recursion and backtracking that can take place. For patterns         The default limit for the library can be set when PCRE  is  built;  the
1013       that are not anchored, the count starts from zero  for  each         default  default  is 10 million, which handles all but the most extreme
1014       position in the subject string.         cases. You can reduce  the  default  by  supplying pcre_exec()  with  a
1015           pcre_extra  block  in  which match_limit is set to a smaller value, and
1016       The default limit for the library can be set  when  PCRE  is         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1017       built;  the default default is 10 million, which handles all         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1018       but the most extreme cases. You can reduce  the  default  by  
1019       supplying pcre_exec()  with  a  pcre_extra  block  in  which         The  callout_data  field is used in conjunction with the "callout" fea-
1020       match_limit   is   set   to    a    smaller    value,    and         ture, which is described in the pcrecallout documentation.
1021       PCRE_EXTRA_MATCH_LIMIT  is  set  in  the flags field. If the  
1022       limit      is      exceeded,       pcre_exec()       returns         The PCRE_ANCHORED option can be passed in the options  argument,  whose
1023       PCRE_ERROR_MATCHLIMIT.         unused  bits  must  be zero. This limits pcre_exec() to matching at the
1024           first matching position.  However,  if  a  pattern  was  compiled  with
1025       The callout_data field is used in conjunction with the "cal-         PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,
1026       lout"  feature,  which is described in the pcrecallout docu-         it cannot be made unanchored at matching time.
1027       mentation.  
1028           When PCRE_UTF8 was set at compile time, the validity of the subject  as
1029       The PCRE_ANCHORED option can be passed in the options  argu-         a  UTF-8  string is automatically checked, and the value of startoffset
1030       ment,   whose   unused   bits  must  be  zero.  This  limits         is also checked to ensure that it points to the start of a UTF-8  char-
1031       pcre_exec() to matching at the first matching position. How-         acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()
1032       ever,  if  a  pattern  was  compiled  with PCRE_ANCHORED, or         returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an
1033       turned out to be anchored by virtue of its contents, it can-         invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
1034       not be made unanchored at matching time.  
1035           If  you  already  know that your subject is valid, and you want to skip
1036       There are also three further options that can be set only at         these   checks   for   performance   reasons,   you   can    set    the
1037       matching time:         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
1038           do this for the second and subsequent calls to pcre_exec() if  you  are
1039         PCRE_NOTBOL         making  repeated  calls  to  find  all  the matches in a single subject
1040           string. However, you should be  sure  that  the  value  of  startoffset
1041       The first character of the string is not the beginning of  a         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1042       line,  so  the  circumflex  metacharacter  should  not match         set, the effect of passing an invalid UTF-8 string as a subject,  or  a
1043       before it. Setting this without PCRE_MULTILINE  (at  compile         value  of startoffset that does not point to the start of a UTF-8 char-
1044       time) causes circumflex never to match.         acter, is undefined. Your program may crash.
1045    
1046         PCRE_NOTEOL         There are also three further options that can be set only  at  matching
1047           time:
1048       The end of the string is not the end of a line, so the  dol-  
1049       lar  metacharacter should not match it nor (except in multi-           PCRE_NOTBOL
1050       line mode) a newline immediately  before  it.  Setting  this  
1051       without PCRE_MULTILINE (at compile time) causes dollar never         The  first  character  of the string is not the beginning of a line, so
1052       to match.         the circumflex metacharacter should not match before it.  Setting  this
1053           without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to
1054         PCRE_NOTEMPTY         match.
1055    
1056       An empty string is not considered to be  a  valid  match  if           PCRE_NOTEOL
1057       this  option  is  set. If there are alternatives in the pat-  
1058       tern, they are tried. If  all  the  alternatives  match  the         The end of the string is not the end of a line, so the dollar metachar-
1059       empty  string,  the  entire match fails. For example, if the         acter  should  not  match  it  nor (except in multiline mode) a newline
1060       pattern         immediately before it. Setting this without PCRE_MULTILINE (at  compile
1061           time) causes dollar never to match.
1062         a?b?  
1063             PCRE_NOTEMPTY
1064       is applied to a string not beginning with  "a"  or  "b",  it  
1065       matches  the  empty string at the start of the subject. With         An empty string is not considered to be a valid match if this option is
1066       PCRE_NOTEMPTY set, this match is not valid, so PCRE searches         set. If there are alternatives in the pattern, they are tried.  If  all
1067       further into the string for occurrences of "a" or "b".         the  alternatives  match  the empty string, the entire match fails. For
1068           example, if the pattern
1069       Perl has no direct equivalent of PCRE_NOTEMPTY, but it  does  
1070       make  a  special case of a pattern match of the empty string           a?b?
1071       within its split() function, and when using the /g modifier.  
1072       It  is possible to emulate Perl's behaviour after matching a         is applied to a string not beginning with "a" or "b",  it  matches  the
1073       null string by first trying the  match  again  at  the  same         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
1074       offset  with  PCRE_NOTEMPTY  set,  and then if that fails by         match is not valid, so PCRE searches further into the string for occur-
1075       advancing the starting offset  (see  below)  and  trying  an         rences of "a" or "b".
1076       ordinary match again.  
1077           Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1078       The subject string is passed to pcre_exec() as a pointer  in         cial case of a pattern match of the empty  string  within  its  split()
1079       subject,  a length in length, and a starting offset in star-         function,  and  when  using  the /g modifier. It is possible to emulate
1080       toffset. Unlike the pattern string, the subject may  contain         Perl's behaviour after matching a null string by first trying the match
1081       binary  zero  bytes.  When  the starting offset is zero, the         again at the same offset with PCRE_NOTEMPTY set, and then if that fails
1082       search for a match starts at the beginning of  the  subject,         by advancing the starting offset (see below)  and  trying  an  ordinary
1083       and this is by far the most common case.         match again.
1084    
1085       If the pattern was compiled with the PCRE_UTF8  option,  the         The  subject string is passed to pcre_exec() as a pointer in subject, a
1086       subject  must  be  a sequence of bytes that is a valid UTF-8         length in length, and a starting byte offset in startoffset. Unlike the
1087       string.  If  an  invalid  UTF-8  string  is  passed,  PCRE's         pattern  string,  the  subject  may contain binary zero bytes. When the
1088       behaviour is not defined.         starting offset is zero, the search for a match starts at the beginning
1089           of the subject, and this is by far the most common case.
1090       A non-zero starting offset  is  useful  when  searching  for  
1091       another  match  in  the  same subject by calling pcre_exec()         If the pattern was compiled with the PCRE_UTF8 option, the subject must
1092       again after a previous success.  Setting startoffset differs         be a sequence of bytes that is a valid UTF-8 string, and  the  starting
1093       from  just  passing  over  a  shortened  string  and setting         offset  must point to the beginning of a UTF-8 character. If an invalid
1094       PCRE_NOTBOL in the case of a pattern that  begins  with  any         UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8
1095       kind of lookbehind. For example, consider the pattern         or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option
1096           PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not
1097         \Biss\B         defined.
1098    
1099       which finds occurrences of "iss" in the middle of words. (\B         A  non-zero  starting offset is useful when searching for another match
1100       matches only if the current position in the subject is not a         in the same subject by calling pcre_exec() again after a previous  suc-
1101       word boundary.) When applied to the string "Mississippi" the         cess.   Setting  startoffset differs from just passing over a shortened
1102       first  call  to  pcre_exec()  finds the first occurrence. If         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
1103       pcre_exec() is called again with just the remainder  of  the         with any kind of lookbehind. For example, consider the pattern
1104       subject,  namely "issippi", it does not match, because \B is  
1105       always false at the start of the subject, which is deemed to           \Biss\B
1106       be  a  word  boundary. However, if pcre_exec() is passed the  
1107       entire string again, but with startoffset set to 4, it finds         which  finds  occurrences  of "iss" in the middle of words. (\B matches
1108       the  second  occurrence  of "iss" because it is able to look         only if the current position in the subject is not  a  word  boundary.)
1109       behind the starting point to discover that it is preceded by         When  applied to the string "Mississippi" the first call to pcre_exec()
1110       a letter.         finds the first occurrence. If pcre_exec() is called  again  with  just
1111           the  remainder  of  the  subject,  namely "issippi", it does not match,
1112       If a non-zero starting offset is passed when the pattern  is         because \B is always false at the start of the subject, which is deemed
1113       anchored, one attempt to match at the given offset is tried.         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
1114       This can only succeed if the pattern does  not  require  the         string again, but with startoffset  set  to  4,  it  finds  the  second
1115       match to be at the start of the subject.         occurrence  of  "iss"  because  it  is able to look behind the starting
1116           point to discover that it is preceded by a letter.
1117       In general, a pattern matches a certain portion of the  sub-  
1118       ject,  and  in addition, further substrings from the subject         If a non-zero starting offset is passed when the pattern  is  anchored,
1119       may be picked out by parts of  the  pattern.  Following  the         one  attempt  to match at the given offset is tried. This can only suc-
1120       usage  in  Jeffrey Friedl's book, this is called "capturing"         ceed if the pattern does not require the match to be at  the  start  of
1121       in what follows, and the phrase  "capturing  subpattern"  is         the subject.
1122       used for a fragment of a pattern that picks out a substring.  
1123       PCRE supports several other kinds of  parenthesized  subpat-         In  general, a pattern matches a certain portion of the subject, and in
1124       tern that do not cause substrings to be captured.         addition, further substrings from the subject  may  be  picked  out  by
1125           parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
1126       Captured substrings are returned to the caller via a  vector         this is called "capturing" in what follows, and the  phrase  "capturing
1127       of  integer  offsets whose address is passed in ovector. The         subpattern"  is  used for a fragment of a pattern that picks out a sub-
1128       number of elements in the vector is passed in ovecsize.  The         string. PCRE supports several other kinds of  parenthesized  subpattern
1129       first two-thirds of the vector is used to pass back captured         that do not cause substrings to be captured.
1130       substrings, each substring using a  pair  of  integers.  The  
1131       remaining  third  of  the  vector  is  used  as workspace by         Captured  substrings are returned to the caller via a vector of integer
1132       pcre_exec() while matching capturing subpatterns, and is not         offsets whose address is passed in ovector. The number of  elements  in
1133       available for passing back information. The length passed in         the vector is passed in ovecsize. The first two-thirds of the vector is
1134       ovecsize should always be a multiple of three. If it is not,         used to pass back captured substrings, each substring using a  pair  of
1135       it is rounded down.         integers.  The  remaining  third  of the vector is used as workspace by
1136           pcre_exec() while matching capturing subpatterns, and is not  available
1137       When a match has been successful, information about captured         for  passing  back  information.  The  length passed in ovecsize should
1138       substrings is returned in pairs of integers, starting at the         always be a multiple of three. If it is not, it is rounded down.
1139       beginning of ovector, and continuing up to two-thirds of its  
1140       length  at  the  most. The first element of a pair is set to         When a match has been successful, information about captured substrings
1141       the offset of the first character in a  substring,  and  the         is returned in pairs of integers, starting at the beginning of ovector,
1142       second is set to the offset of the first character after the         and continuing up to two-thirds of its length at the  most.  The  first
1143       end of a substring. The first  pair,  ovector[0]  and  ovec-         element of a pair is set to the offset of the first character in a sub-
1144       tor[1],  identify  the portion of the subject string matched         string, and the second is set to the  offset  of  the  first  character
1145       by the entire pattern. The next pair is used for  the  first         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
1146       capturing  subpattern,  and  so  on.  The  value returned by         tor[1], identify the portion of  the  subject  string  matched  by  the
1147       pcre_exec() is the number of pairs that have  been  set.  If         entire  pattern.  The next pair is used for the first capturing subpat-
1148       there  are no capturing subpatterns, the return value from a         tern, and so on. The value returned by pcre_exec()  is  the  number  of
1149       successful match is 1, indicating that just the  first  pair         pairs  that  have  been set. If there are no capturing subpatterns, the
1150       of offsets has been set.         return value from a successful match is 1,  indicating  that  just  the
1151           first pair of offsets has been set.
1152       Some convenience functions are provided for  extracting  the  
1153       captured substrings as separate strings. These are described         Some  convenience  functions  are  provided for extracting the captured
1154       in the following section.         substrings as separate strings. These are described  in  the  following
1155           section.
1156       It is possible for a  capturing  subpattern  number  n+1  to  
1157       match  some  part  of  the subject when subpattern n has not         It  is  possible  for  a  capturing subpattern number n+1 to match some
1158       been used at all.  For  example,  if  the  string  "abc"  is         part of the subject when subpattern n has not been  used  at  all.  For
1159       matched  against the pattern (a|(z))(bc) subpatterns 1 and 3         example, if the string "abc" is matched against the pattern (a|(z))(bc)
1160       are matched, but 2 is not. When this  happens,  both  offset         subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both
1161       values corresponding to the unused subpattern are set to -1.         offset values corresponding to the unused subpattern are set to -1.
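
       The pair layout described above can be sketched without calling the
       library at all. The helper names capture_is_set() and capture_length()
       below are hypothetical and are not part of PCRE; they only read an
       ovector that pcre_exec() is assumed to have filled in, together with
       its positive return value rc. For "abc" matched against (a|(z))(bc),
       the vector would be expected to start 0,3, 0,1, -1,-1, 1,3 with rc
       equal to 4.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Returns 1 if capture number i (0 = whole match) was set by a match
   whose pcre_exec() return value was rc, and 0 otherwise.
   Unset pairs hold -1 in both slots. */
int capture_is_set(const int *ovector, int rc, int i)
{
    assert(ovector != NULL);
    if (i < 0 || i >= rc)
        return 0;               /* beyond the pairs that were filled in */
    return ovector[2 * i] >= 0;
}

/* Hypothetical helper, not a PCRE function.
   Length in bytes of capture i, or -1 if it is unset. */
int capture_length(const int *ovector, int rc, int i)
{
    if (!capture_is_set(ovector, rc, i))
        return -1;
    /* The second offset of a pair is one past the end of the substring. */
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

       Returning -1 for an unset capture keeps it distinct from a genuine
       zero-length substring, whose length is 0.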
1162    
1163       If a capturing subpattern is matched repeatedly, it  is  the         If a capturing subpattern is matched repeatedly, it is the last portion
1164       last  portion  of  the  string  that  it  matched  that gets         of the string that it matched that gets returned.
1165       returned.  
1166           If the vector is too small to hold all the captured substrings,  it  is
1167       If the vector is too small to hold  all  the  captured  sub-         used as far as possible (up to two-thirds of its length), and the func-
1168       strings,  it is used as far as possible (up to two-thirds of         tion returns a value of zero. In particular, if the  substring  offsets
1169       its length), and the function returns a value  of  zero.  In         are  not  of interest, pcre_exec() may be called with ovector passed as
1170       particular,  if  the  substring offsets are not of interest,         NULL and ovecsize as zero. However, if the pattern contains back refer-
1171       pcre_exec() may be called with ovector passed  as  NULL  and         ences  and  the  ovector  isn't big enough to remember the related sub-
1172       ovecsize  as  zero.  However,  if  the pattern contains back         strings, PCRE has to get additional memory  for  use  during  matching.
1173       references and the ovector isn't big enough to remember  the         Thus it is usually advisable to supply an ovector.
1174       related  substrings,  PCRE  has to get additional memory for  
1175       use during matching. Thus it is usually advisable to  supply         Note  that  pcre_info() can be used to find out how many capturing sub-
1176       an ovector.         patterns there are in a compiled pattern. The smallest size for ovector
1177           that  will  allow for n captured substrings, in addition to the offsets
1178       Note that pcre_info() can be used to find out how many  cap-         of the substring matched by the whole pattern, is (n+1)*3.
1179       turing  subpatterns  there  are  in  a compiled pattern. The  
1180       smallest size for ovector that will  allow  for  n  captured         If pcre_exec() fails, it returns a negative number. The  following  are
1181       substrings,  in  addition  to  the  offsets of the substring         defined in the header file:
1182       matched by the whole pattern, is (n+1)*3.  
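
       The sizing rules above reduce to two small pieces of arithmetic. The
       function names below are hypothetical, chosen for illustration only;
       the formulas are the ones stated in the text: (n+1)*3 elements for n
       captured substrings, and one offset pair for every three elements of
       the vector after rounding down.

```c
#include <assert.h>

/* Hypothetical helper, not a PCRE function.
   Smallest ovector length that allows for n captured substrings in
   addition to the offsets of the whole match. */
int ovector_size_for(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Hypothetical helper, not a PCRE function.
   Number of offset pairs that can be passed back for a given ovecsize.
   The size is rounded down to a multiple of three; two-thirds of that
   holds pairs of ints, which works out to one pair per three elements. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

       For example, a pattern with three capturing subpatterns needs an
       ovector of at least 12 elements, which gives room for four pairs.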
1183             PCRE_ERROR_NOMATCH        (-1)
1184       If pcre_exec() fails, it returns a negative number. The fol-  
1185       lowing are defined in the header file:         The subject string did not match the pattern.
1186    
1187         PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NULL           (-2)
1188    
1189       The subject string did not match the pattern.         Either  code  or  subject  was  passed as NULL, or ovector was NULL and
1190           ovecsize was not zero.
1191         PCRE_ERROR_NULL           (-2)  
1192             PCRE_ERROR_BADOPTION      (-3)
1193       Either code or subject was passed as NULL,  or  ovector  was  
1194       NULL and ovecsize was not zero.         An unrecognized bit was set in the options argument.
1195    
1196         PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADMAGIC       (-4)
1197    
1198       An unrecognized bit was set in the options argument.         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
1199           to  catch  the case when it is passed a junk pointer. This is the error
1200         PCRE_ERROR_BADMAGIC       (-4)         it gives when the magic number isn't present.
1201    
1202       PCRE stores a 4-byte "magic number" at the start of the com-           PCRE_ERROR_UNKNOWN_NODE   (-5)
1203       piled  code,  to  catch  the  case  when it is passed a junk  
1204       pointer. This is the error it gives when  the  magic  number         While running the pattern match, an unknown item was encountered in the
1205       isn't present.         compiled  pattern.  This  error  could be caused by a bug in PCRE or by
1206           overwriting of the compiled pattern.
1207         PCRE_ERROR_UNKNOWN_NODE   (-5)  
1208             PCRE_ERROR_NOMEMORY       (-6)
1209       While running the pattern match, an unknown item was encoun-  
1210       tered in the compiled pattern. This error could be caused by         If a pattern contains back references, but the ovector that  is  passed
1211       a bug in PCRE or by overwriting of the compiled pattern.         to pcre_exec() is not big enough to remember the referenced substrings,
1212           PCRE gets a block of memory at the start of matching to  use  for  this
1213         PCRE_ERROR_NOMEMORY       (-6)         purpose.  If the call via pcre_malloc() fails, this error is given. The
1214           memory is freed at the end of matching.
1215       If a pattern contains back references, but the ovector  that  
1216       is  passed  to pcre_exec() is not big enough to remember the           PCRE_ERROR_NOSUBSTRING    (-7)
1217       referenced substrings, PCRE gets a block of  memory  at  the  
1218       start  of  matching to use for this purpose. If the call via         This error is used by the pcre_copy_substring(),  pcre_get_substring(),
1219       pcre_malloc() fails, this error  is  given.  The  memory  is         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1220       freed at the end of matching.         returned by pcre_exec().
1221    
1222         PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_MATCHLIMIT     (-8)
1223    
1224       This   error   is   used   by   the   pcre_copy_substring(),         The recursion and backtracking limit, as specified by  the  match_limit
1225       pcre_get_substring(),  and  pcre_get_substring_list()  func-         field  in  a  pcre_extra  structure (or defaulted) was reached. See the
1226       tions (see below). It is never returned by pcre_exec().         description above.
1227    
1228         PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_CALLOUT        (-9)
1229    
1230       The recursion and backtracking limit, as  specified  by  the         This error is never generated by pcre_exec() itself. It is provided for
1231       match_limit  field  in a pcre_extra structure (or defaulted)         use  by  callout functions that want to yield a distinctive error code.
1232       was reached. See the description above.         See the pcrecallout documentation for details.
1233    
1234         PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_BADUTF8        (-10)
1235    
1236       This error is never generated by pcre_exec() itself.  It  is         A string that contains an invalid UTF-8 byte sequence was passed  as  a
1237       provided  for  use by callout functions that want to yield a         subject.
1238       distinctive error code. See  the  pcrecallout  documentation  
1239       for details.           PCRE_ERROR_BADUTF8_OFFSET (-11)
1240    
1241           The UTF-8 byte sequence that was passed as a subject was valid, but the
1242           value of startoffset did not point to the beginning of a UTF-8  charac-
1243           ter.
1244    
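
       A caller usually has to separate four cases: a match, a match whose
       offsets did not all fit in ovector (return value zero), an ordinary
       failure to match, and a real error. The sketch below shows one way
       to do that. The enum and function names are hypothetical; the error
       values are copied from the list above and are only defined here so
       that the fragment compiles without pcre.h, which supplies them in a
       real program.

```c
#include <assert.h>

/* Values as listed above; normally these come from pcre.h. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH         (-1)
#define PCRE_ERROR_NULL            (-2)
#define PCRE_ERROR_BADOPTION       (-3)
#define PCRE_ERROR_BADMAGIC        (-4)
#define PCRE_ERROR_UNKNOWN_NODE    (-5)
#define PCRE_ERROR_NOMEMORY        (-6)
#define PCRE_ERROR_NOSUBSTRING     (-7)
#define PCRE_ERROR_MATCHLIMIT      (-8)
#define PCRE_ERROR_CALLOUT         (-9)
#define PCRE_ERROR_BADUTF8        (-10)
#define PCRE_ERROR_BADUTF8_OFFSET (-11)
#endif

/* Hypothetical classification, not part of PCRE. */
enum exec_outcome {
    EXEC_MATCHED,        /* rc > 0: rc offset pairs were set */
    EXEC_OVECTOR_FULL,   /* rc == 0: matched, but ovector was too small */
    EXEC_NO_MATCH,       /* PCRE_ERROR_NOMATCH */
    EXEC_FAILED          /* any other negative value */
};

enum exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return EXEC_MATCHED;
    if (rc == 0)
        return EXEC_OVECTOR_FULL;
    if (rc == PCRE_ERROR_NOMATCH)
        return EXEC_NO_MATCH;
    return EXEC_FAILED;
}

/* Maps a negative return value to the name used in this document. */
const char *exec_error_name(int rc)
{
    assert(rc < 0);      /* only negative values are error codes */
    switch (rc) {
    case PCRE_ERROR_NOMATCH:        return "PCRE_ERROR_NOMATCH";
    case PCRE_ERROR_NULL:           return "PCRE_ERROR_NULL";
    case PCRE_ERROR_BADOPTION:      return "PCRE_ERROR_BADOPTION";
    case PCRE_ERROR_BADMAGIC:       return "PCRE_ERROR_BADMAGIC";
    case PCRE_ERROR_UNKNOWN_NODE:   return "PCRE_ERROR_UNKNOWN_NODE";
    case PCRE_ERROR_NOMEMORY:       return "PCRE_ERROR_NOMEMORY";
    case PCRE_ERROR_NOSUBSTRING:    return "PCRE_ERROR_NOSUBSTRING";
    case PCRE_ERROR_MATCHLIMIT:     return "PCRE_ERROR_MATCHLIMIT";
    case PCRE_ERROR_CALLOUT:        return "PCRE_ERROR_CALLOUT";
    case PCRE_ERROR_BADUTF8:        return "PCRE_ERROR_BADUTF8";
    case PCRE_ERROR_BADUTF8_OFFSET: return "PCRE_ERROR_BADUTF8_OFFSET";
    default:                        return "unknown error";
    }
}
```

       Treating PCRE_ERROR_NOMATCH separately from the other negative
       values is a design choice: in most programs a failure to match is an
       expected result rather than a fault.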
1245    
1246  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1247    
1248       int pcre_copy_substring(const char *subject, int *ovector,         int pcre_copy_substring(const char *subject, int *ovector,
1249            int stringcount, int stringnumber, char *buffer,              int stringcount, int stringnumber, char *buffer,
1250            int buffersize);              int buffersize);
1251    
1252       int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
1253            int stringcount, int stringnumber,              int stringcount, int stringnumber,
1254            const char **stringptr);              const char **stringptr);
1255    
1256       int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
1257            int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
1258    
1259       Captured substrings can be accessed directly  by  using  the         Captured  substrings  can  be  accessed  directly  by using the offsets
1260       offsets returned by pcre_exec() in ovector. For convenience,         returned by pcre_exec() in  ovector.  For  convenience,  the  functions
1261       the functions  pcre_copy_substring(),  pcre_get_substring(),         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
1262       and  pcre_get_substring_list()  are  provided for extracting         string_list() are provided for extracting captured substrings  as  new,
1263       captured  substrings  as  new,   separate,   zero-terminated         separate,  zero-terminated strings. These functions identify substrings
1264       strings.  These functions identify substrings by number. The         by number. The next section describes functions  for  extracting  named
1265       next section describes functions for extracting  named  sub-         substrings.  A  substring  that  contains  a  binary  zero is correctly
1266       strings.   A  substring  that  contains  a  binary  zero  is         extracted and has a further zero added on the end, but  the  result  is
1267       correctly extracted and has a further zero added on the end,         not, of course, a C string.
1268       but the result is not, of course, a C string.  
1269           The  first  three  arguments  are the same for all three of these func-
1270       The first three arguments are the  same  for  all  three  of         tions: subject is the subject string which has just  been  successfully
1271       these  functions:   subject  is the subject string which has         matched, ovector is a pointer to the vector of integer offsets that was
1272       just been successfully matched, ovector is a pointer to  the         passed to pcre_exec(), and stringcount is the number of substrings that
1273       vector  of  integer  offsets that was passed to pcre_exec(),         were  captured  by  the match, including the substring that matched the
1274       and stringcount is the number of substrings that  were  cap-         entire regular expression. This is the value returned by  pcre_exec  if
1275       tured by the match, including the substring that matched the         it  is greater than zero. If pcre_exec() returned zero, indicating that
1276       entire regular expression. This is  the  value  returned  by         it ran out of space in ovector, the value passed as stringcount  should
1277       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()         be the size of the vector divided by three.
1278       returned zero, indicating that it ran out of space in  ovec-  
1279       tor,  the  value passed as stringcount should be the size of         The  functions pcre_copy_substring() and pcre_get_substring() extract a
1280       the vector divided by three.         single substring, whose number is given as  stringnumber.  A  value  of
1281           zero  extracts  the  substring  that  matched the entire pattern, while
1282       The functions pcre_copy_substring() and pcre_get_substring()         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
1283       extract a single substring, whose number is given as string-         string(),  the  string  is  placed  in buffer, whose length is given by
1284       number. A value of zero extracts the substring that  matched         buffersize, while for pcre_get_substring() a new  block  of  memory  is
1285       the entire pattern, while higher values extract the captured         obtained  via  pcre_malloc,  and its address is returned via stringptr.
1286       substrings. For pcre_copy_substring(), the string is  placed         The yield of the function is the length of the  string,  not  including
1287       in  buffer,  whose  length is given by buffersize, while for         the terminating zero, or one of
1288       pcre_get_substring() a new block of memory is  obtained  via  
1289       pcre_malloc,  and its address is returned via stringptr. The           PCRE_ERROR_NOMEMORY       (-6)
1290       yield of the function is  the  length  of  the  string,  not  
1291       including the terminating zero, or one of         The  buffer  was too small for pcre_copy_substring(), or the attempt to
1292           get memory failed for pcre_get_substring().
1293         PCRE_ERROR_NOMEMORY       (-6)  
1294             PCRE_ERROR_NOSUBSTRING    (-7)
1295       The buffer was too small for pcre_copy_substring(),  or  the  
1296       attempt to get memory failed for pcre_get_substring().         There is no substring whose number is stringnumber.
1297    
1298         PCRE_ERROR_NOSUBSTRING    (-7)         The pcre_get_substring_list()  function  extracts  all  available  sub-
1299           strings  and  builds  a list of pointers to them. All this is done in a
1300       There is no substring whose number is stringnumber.         single block of memory which is obtained via pcre_malloc.  The  address
1301           of the memory block is returned via listptr, which is also the start of
1302       The pcre_get_substring_list() function extracts  all  avail-         the list of string pointers. The end of the list is marked  by  a  NULL
1303       able  substrings  and builds a list of pointers to them. All         pointer. The yield of the function is zero if all went well, or
1304       this is done in a single block of memory which  is  obtained  
1305       via pcre_malloc. The address of the memory block is returned           PCRE_ERROR_NOMEMORY       (-6)
1306       via listptr, which is also the start of the list  of  string  
1307       pointers.  The  end of the list is marked by a NULL pointer.         if the attempt to get the memory block failed.
1308       The yield of the function is zero if all went well, or  
1309           When any of these functions encounters a substring that is unset, which
1310         PCRE_ERROR_NOMEMORY       (-6)         can happen when capturing subpattern number n+1 matches  some  part  of
1311           the  subject, but subpattern n has not been used at all, they return an
1312       if the attempt to get the memory block failed.         empty string. This can be distinguished from a genuine zero-length sub-
1313           string  by inspecting the appropriate offset in ovector, which is nega-
1314       When any of these functions encounters a substring  that  is         tive for unset substrings.
1315       unset, which can happen when capturing subpattern number n+1  
1316       matches some part of the subject, but subpattern n  has  not         The    two    convenience    functions    pcre_free_substring()     and
1317       been  used  at all, they return an empty string. This can be         pcre_free_substring_list() can be used to free the memory returned by a
1318       distinguished  from  a  genuine  zero-length  substring   by         previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),
1319       inspecting the appropriate offset in ovector, which is nega-         respectively. They do nothing more than call the function pointed to by
1320       tive for unset substrings.         pcre_free, which of course could be called directly from a  C  program.
1321           However,  PCRE is used in some situations where it is linked via a spe-
1322       The  two  convenience  functions  pcre_free_substring()  and         cial  interface  to  another  programming  language  which  cannot  use
1323       pcre_free_substring_list()  can  be  used to free the memory         pcre_free  directly;  it is for these cases that the functions are pro-
1324       returned by  a  previous  call  of  pcre_get_substring()  or         vided.
      pcre_get_substring_list(),  respectively.  They  do  nothing  
      more than call the function pointed to by  pcre_free,  which  
      of  course  could  be called directly from a C program. How-  
      ever, PCRE is used in some situations where it is linked via  
      a  special  interface  to another programming language which  
      cannot use pcre_free directly; it is for  these  cases  that  
      the functions are provided.  
1325    
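The test for an unset substring described above can be sketched in C as follows. This is only an illustration: the helper names substring_is_unset() and substring_length() are hypothetical and are not part of PCRE. The sketch assumes that ovector was filled in by a successful call to pcre_exec(), and that offsets are stored in pairs, so that substring n starts at ovector[2*n] and ends just before ovector[2*n+1].

```c
/* Hypothetical helper (not a PCRE function): returns 1 if capturing
   subpattern n is unset, judging by its start offset in ovector,
   which is negative for unset substrings. */
int substring_is_unset(const int *ovector, int n)
{
    return ovector[2 * n] < 0;
}

/* Hypothetical helper (not a PCRE function): returns the length of
   substring n, or -1 if it is unset. A genuine zero-length substring
   yields 0, so the two cases can be told apart. */
int substring_length(const int *ovector, int n)
{
    if (substring_is_unset(ovector, n))
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For instance, if ovector holds 0, 3, -1, -1, 3, 3, then substring 1 is unset, whereas substring 2 is a genuine empty string at offset 3.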
1326    
1327  EXTRACTING CAPTURED SUBSTRINGS BY NAME  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1328    
1329       int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
1330            const char *subject, int *ovector,              const char *subject, int *ovector,
1331            int stringcount, const char *stringname,              int stringcount, const char *stringname,
1332            char *buffer, int buffersize);              char *buffer, int buffersize);
1333    
1334       int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
1335            const char *name);              const char *name);
1336    
1337       int pcre_get_named_substring(const pcre *code,         int pcre_get_named_substring(const pcre *code,
1338            const char *subject, int *ovector,              const char *subject, int *ovector,
1339            int stringcount, const char *stringname,              int stringcount, const char *stringname,
1340            const char **stringptr);              const char **stringptr);
1341    
1342       To extract a substring by name, you first have to  find  the         To extract a substring by name, you first have to find the  associated
1343       associated   number.   This   can   be   done   by   calling         number.  This  can  be  done by calling pcre_get_stringnumber(). The first
1344       pcre_get_stringnumber(). The first argument is the  compiled         argument is the compiled pattern, and the second is the name. For exam-
1345       pattern,  and  the second is the name. For example, for this         ple, for this pattern
1346       pattern  
1347             ab(?<xxx>\d+)...
1348         ab(?<xxx>\d+)...  
1349           the  number  of the subpattern called "xxx" is 1. Given the number, you
1350       the number of the subpattern called "xxx" is  1.  Given  the         can then extract the substring directly, or use one  of  the  functions
1351       number,  you can then extract the substring directly, or use         described  in the previous section. For convenience, there are also two
1352       one of the functions described in the previous section.  For         functions that do the whole job.
1353       convenience,  there are also two functions that do the whole  
1354       job.         Most   of   the   arguments    of    pcre_copy_named_substring()    and
1355           pcre_get_named_substring() are the same as those for the functions that
1356       Most of the  arguments  of  pcre_copy_named_substring()  and         extract by number, and so are not re-described here. There are just two
1357       pcre_get_named_substring()  are  the  same  as those for the         differences.
1358       functions that  extract  by  number,  and  so  are  not  re-  
1359       described here. There are just two differences.         First,  instead  of a substring number, a substring name is given. Sec-
1360           ond, there is an extra argument, given at the start, which is a pointer
1361       First, instead of a substring number, a  substring  name  is         to  the compiled pattern. This is needed in order to gain access to the
1362       given.  Second,  there  is  an  extra argument, given at the         name-to-number translation table.
1363       start, which is a pointer to the compiled pattern.  This  is  
1364       needed  in order to gain access to the name-to-number trans-         These functions call pcre_get_stringnumber(), and if it succeeds,  they
1365       lation table.         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-
1366           ate.
      These functions  call  pcre_get_stringnumber(),  and  if  it  
      succeeds,    they   then   call   pcre_copy_substring()   or  
      pcre_get_substring(), as appropriate.  
1367    
1368  Last updated: 03 February 2003  Last updated: 09 December 2003
1369  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
1370  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
1371    
1372  NAME  PCRE(3)                                                                PCRE(3)
1373       PCRE - Perl-compatible regular expressions  
1374    
1375    
1376    NAME
1377           PCRE - Perl-compatible regular expressions
1378    
1379  PCRE CALLOUTS  PCRE CALLOUTS
1380    
1381       int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
1382    
1383       PCRE provides a feature called "callout", which is  a  means         PCRE provides a feature called "callout", which is a means of temporar-
1384       of  temporarily passing control to the caller of PCRE in the         ily passing control to the caller of PCRE  in  the  middle  of  pattern
1385       middle of pattern matching. The caller of PCRE  provides  an         matching.  The  caller of PCRE provides an external function by putting
1386       external  function  by putting its entry point in the global         its entry point in the global variable pcre_callout. By  default,  this
1387       variable pcre_callout. By default,  this  variable  contains         variable contains NULL, which disables all calling out.
1388       NULL, which disables all calling out.  
1389           Within  a  regular  expression,  (?C) indicates the points at which the
1390       Within a regular expression, (?C) indicates  the  points  at         external function is to be called.  Different  callout  points  can  be
1391       which  the external function is to be called. Different cal-         identified  by  putting  a number less than 256 after the letter C. The
1392       lout points can be identified by putting a number less  than         default value is zero.  For  example,  this  pattern  has  two  callout
1393       256  after  the  letter  C.  The default value is zero.  For         points:
1394       example, this pattern has two callout points:  
1395             (?C1)abc(?C2)def
1396         (?C1)9abc(?C2)def  
1397           During matching, when PCRE reaches a callout point (and pcre_callout is
1398       During matching, when PCRE  reaches  a  callout  point  (and         set), the external function is called. Its only argument is  a  pointer
1399       pcre_callout  is  set), the external function is called. Its         to a pcre_callout block. This contains the following variables:
1400       only argument is a pointer to  a  pcre_callout  block.  This  
1401       contains the following variables:           int          version;
1402             int          callout_number;
1403         int          version;           int         *offset_vector;
1404         int          callout_number;           const char  *subject;
1405         int         *offset_vector;           int          subject_length;
1406         const char  *subject;           int          start_match;
1407         int          subject_length;           int          current_position;
1408         int          start_match;           int          capture_top;
1409         int          current_position;           int          capture_last;
1410         int          capture_top;           void        *callout_data;
1411         int          capture_last;  
1412         void        *callout_data;         The  version  field  is an integer containing the version number of the
1413           block format. The current version  is  zero.  The  version  number  may
1414       The version field  is  an  integer  containing  the  version         change  in  future if additional fields are added, but the intention is
1415       number of the block format. The current version is zero. The         never to remove any of the existing fields.
1416       version number may change in future if additional fields are  
1417       added,  but  the  intention  is  never  to remove any of the         The callout_number field contains the number of the  callout,  as  com-
1418       existing fields.         piled into the pattern (that is, the number after ?C).
1419    
1420       The callout_number field contains the number of the callout,         The  offset_vector field is a pointer to the vector of offsets that was
1421       as compiled into the pattern (that is, the number after ?C).         passed by the caller to pcre_exec(). The contents can be  inspected  in
1422           order  to extract substrings that have been matched so far, in the same
1423       The offset_vector field  is  a  pointer  to  the  vector  of         way as for extracting substrings after a match has completed.
1424       offsets  that  was  passed by the caller to pcre_exec(). The  
1425       contents can be inspected in  order  to  extract  substrings         The subject and subject_length fields contain copies of the values that
1426       that  have  been  matched  so  far,  in  the same way as for         were passed to pcre_exec().
1427       extracting substrings after a match has completed.  
1428       The subject and subject_length fields contain copies of  the         The  start_match  field contains the offset within the subject at which
1429       values that were passed to pcre_exec().         the current match attempt started. If the pattern is not anchored,  the
1430           callout  function  may  be  called several times for different starting
1431       The start_match field contains the offset within the subject         points.
1432       at  which  the current match attempt started. If the pattern  
1433       is not anchored, the callout function may be called  several         The current_position field contains the offset within  the  subject  of
1434       times for different starting points.         the current match pointer.
1435    
1436       The current_position field contains the  offset  within  the         The  capture_top field contains one more than the number of the highest
1437       subject of the current match pointer.         numbered  captured  substring  so  far.  If  no  substrings  have  been
1438           captured, the value of capture_top is one.
1439       The capture_top field contains the  number  of  the  highest  
1440       captured substring so far.         The  capture_last  field  contains the number of the most recently cap-
1441           tured substring.
1442       The capture_last field  contains  the  number  of  the  most  
1443       recently captured substring.         The callout_data field contains a value that is passed  to  pcre_exec()
1444           by  the  caller specifically so that it can be passed back in callouts.
1445       The callout_data field contains a value that  is  passed  to         It is passed in the callout_data field of the  pcre_extra  data  struc-
1446       pcre_exec()  by  the  caller  specifically so that it can be         ture.  If  no  such  data  was  passed,  the value of callout_data in a
1447       passed back in callouts. It is passed  in  the  callout_data         pcre_callout block is NULL. There is a description  of  the  pcre_extra
1448       field  of the pcre_extra data structure. If no such data was         structure in the pcreapi documentation.
      passed, the value of callout_data in a pcre_callout block is  
      NULL.  There is a description of the pcre_extra structure in  
      the pcreapi documentation.  
1449    
1450    
1451    
1452  RETURN VALUES  RETURN VALUES
1453    
1454       The callout function returns an integer.  If  the  value  is         The callout function returns an integer. If the value is zero, matching
1455       zero,  matching  proceeds as normal. If the value is greater         proceeds as normal. If the value is greater than zero,  matching  fails
1456       than zero, matching fails at the current  point,  but  back-         at the current point, but backtracking to test other possibilities goes
1457       tracking  to test other possibilities goes ahead, just as if         ahead, just as if a lookahead assertion had failed.  If  the  value  is
1458       a lookahead assertion had failed. If the value is less  than         less  than  zero,  the  match is abandoned, and pcre_exec() returns the
1459       zero,  the  match  is abandoned, and pcre_exec() returns the         value.
1460       value.  
1461           Negative  values  should  normally  be   chosen   from   the   set   of
1462       Negative values should normally be chosen from  the  set  of         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
1463       PCRE_ERROR_xxx  values.  In  particular,  PCRE_ERROR_NOMATCH         dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
1464       forces a standard "no  match"  failure.   The  error  number         reserved  for  use  by callout functions; it will never be used by PCRE
1465       PCRE_ERROR_CALLOUT is reserved for use by callout functions;         itself.
      it will never be used by PCRE itself.  
1466    
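The three kinds of return value can be illustrated by a small callout function. The following is a sketch under stated assumptions, not code taken from PCRE: the type demo_callout_block is a local stand-in that copies the fields listed above so that the example compiles without pcre.h, DEMO_ERROR_ABANDON is a stand-in for a negative PCRE_ERROR_xxx value, and the policy (callout 1 always continues, callout 2 enforces a length limit supplied through callout_data) is invented for the illustration.

```c
#include <stddef.h>

/* Stand-in for pcre_callout_block, mirroring the documented fields.
   Real code would use the type declared in pcre.h instead. */
typedef struct {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
} demo_callout_block;

/* Stand-in for a negative PCRE_ERROR_xxx value (an assumption made
   for this sketch; real code would use a constant from pcre.h). */
#define DEMO_ERROR_ABANDON (-9)

/* Hypothetical callout function. callout_data, if not NULL, is assumed
   to point to an int giving the maximum number of subject characters
   that may have been matched when callout 2 is reached. */
int demo_callout(demo_callout_block *block)
{
    const int *max_length = (const int *)block->callout_data;
    int matched_so_far = block->current_position - block->start_match;

    if (block->callout_number == 1)
        return 0;               /* zero: matching proceeds as normal */

    if (block->callout_number == 2) {
        if (max_length != NULL && matched_so_far > *max_length)
            return 1;           /* positive: fail here, but backtrack */
        return 0;
    }

    return DEMO_ERROR_ABANDON;  /* negative: the match is abandoned */
}
```

In a real program the function would take a pcre_callout_block pointer and would be installed by putting its entry point in the global variable pcre_callout before calling pcre_exec().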
1467  Last updated: 21 January 2003  Last updated: 21 January 2003
1468  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
1469  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
1470    
1471  NAME  PCRE(3)                                                                PCRE(3)
1472       PCRE - Perl-compatible regular expressions  
1473    
1474    
1475    NAME
1476           PCRE - Perl-compatible regular expressions
1477    
1478  DIFFERENCES FROM PERL  DIFFERENCES FROM PERL
1479    
1480       This document describes the differences  in  the  ways  that         This  document describes the differences in the ways that PCRE and Perl
1481       PCRE  and  Perl  handle regular expressions. The differences         handle regular expressions. The differences  described  here  are  with
1482       described here are with respect to Perl 5.8.         respect to Perl 5.8.
   
      1. PCRE does  not  allow  repeat  quantifiers  on  lookahead  
      assertions. Perl permits them, but they do not mean what you  
      might think. For example, (?!a){3} does not assert that  the  
      next  three characters are not "a". It just asserts that the  
      next character is not "a" three times.  
   
      2. Capturing subpatterns that occur inside  negative  looka-  
      head  assertions  are  counted,  but  their  entries  in the  
      offsets vector are never set. Perl sets its numerical  vari-  
      ables  from  any  such  patterns that are matched before the  
      assertion fails to match something (thereby succeeding), but  
      only  if  the negative lookahead assertion contains just one  
      branch.  
   
      3. Though binary zero characters are supported in  the  sub-  
      ject  string,  they  are  not  allowed  in  a pattern string  
      because it is passed as a normal  C  string,  terminated  by  
      zero. The escape sequence "\0" can be used in the pattern to  
      represent a binary zero.  
   
      4. The following Perl escape sequences  are  not  supported:  
      \l,  \u,  \L,  \U,  \P, \p, and \X. In fact these are imple-  
      mented by Perl's general string-handling and are not part of  
      its pattern matching engine. If any of these are encountered  
      by PCRE, an error is generated.  
   
      5. PCRE does support the \Q...\E  escape  for  quoting  sub-  
      strings. Characters in between are treated as literals. This  
      is slightly different from Perl in that $  and  @  are  also  
      handled  as  literals inside the quotes. In Perl, they cause  
      variable interpolation (but of course  PCRE  does  not  have  
      variables). Note the following examples:  
   
          Pattern            PCRE matches      Perl matches  
   
          \Qabc$xyz\E        abc$xyz           abc followed by the  
                                                 contents of $xyz  
          \Qabc\$xyz\E       abc\$xyz          abc\$xyz  
          \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz  
   
      In PCRE, the \Q...\E mechanism is not  recognized  inside  a  
      character class.  
   
      8. Fairly obviously, PCRE does not support the (?{code}) and  
      (?p{code})  constructions. However, there is some experimen-  
      tal support for recursive patterns using the non-Perl  items  
      (?R),  (?number)  and  (?P>name).  Also,  the PCRE "callout"  
      feature allows an external function to be called during pat-  
      tern matching.  
   
      9. There are some differences that are  concerned  with  the  
      settings  of  captured  strings  when  part  of a pattern is  
      repeated. For example, matching "aba"  against  the  pattern  
      /^(a(b)?)+$/  in Perl leaves $2 unset, but in PCRE it is set  
      to "b".  
   
      10. PCRE  provides  some  extensions  to  the  Perl  regular  
      expression facilities:  
   
      (a) Although lookbehind assertions must match  fixed  length  
      strings,  each  alternative branch of a lookbehind assertion  
      can match a different length of string. Perl  requires  them  
      all to have the same length.  
   
      (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not  
      set,  the  $  meta-character matches only at the very end of  
      the string.  
   
      (c) If PCRE_EXTRA is set, a backslash followed by  a  letter  
      with no special meaning is faulted.  
   
      (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-  
      tion  quantifiers  is inverted, that is, by default they are  
      not greedy, but if followed by a question mark they are.  
   
      (e) PCRE_ANCHORED can be used to force a pattern to be tried  
      only at the first matching position in the subject string.  
   
      (f)  The  PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   and  
      PCRE_NO_AUTO_CAPTURE  options  for  pcre_exec() have no Perl  
      equivalents.  
   
      (g) The (?R), (?number), and (?P>name) constructs allow  for  
      recursive  pattern  matching  (Perl  can  do  this using the  
      (?p{code}) construct, which PCRE cannot support.)  
   
      (h) PCRE supports  named  capturing  substrings,  using  the  
      Python syntax.  
1483    
1484       (i) PCRE supports the  possessive  quantifier  "++"  syntax,         1.  PCRE does not have full UTF-8 support. Details of what it does have
1485       taken from Sun's Java package.         are given in the section on UTF-8 support in the main pcre page.
1486    
1487       (j) The (R) condition, for  testing  recursion,  is  a  PCRE         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
1488       extension.         permits  them,  but they do not mean what you might think. For example,
1489           (?!a){3} does not assert that the next three characters are not "a". It
1490           just asserts that the next character is not "a" three times.
1491    
1492       (k) The callout facility is PCRE-specific.         3.  Capturing  subpatterns  that occur inside negative lookahead asser-
1493           tions are counted, but their entries in the offsets  vector  are  never
1494           set.  Perl sets its numerical variables from any such patterns that are
1495           matched before the assertion fails to match something (thereby succeed-
1496           ing),  but  only  if the negative lookahead assertion contains just one
1497           branch.
1498    
1499  Last updated: 03 February 2003         4. Though binary zero characters are supported in the  subject  string,
1500           they are not allowed in a pattern string because it is passed as a nor-
1501           mal C string, terminated by zero. The escape sequence "\0" can be  used
1502           in the pattern to represent a binary zero.
1503    
1504           5.  The  following Perl escape sequences are not supported: \l, \u, \L,
1505           \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
1506           string-handling and are not part of its pattern matching engine. If any
1507           of these are encountered by PCRE, an error is generated.
1508    
1509           6. PCRE does support the \Q...\E escape for quoting substrings. Charac-
1510           ters  in  between  are  treated as literals. This is slightly different
1511           from Perl in that $ and @ are  also  handled  as  literals  inside  the
1512           quotes.  In Perl, they cause variable interpolation (but of course PCRE
1513           does not have variables). Note the following examples:
1514    
1515               Pattern            PCRE matches      Perl matches
1516    
1517               \Qabc$xyz\E        abc$xyz           abc followed by the
1518                                                      contents of $xyz
1519               \Qabc\$xyz\E       abc\$xyz          abc\$xyz
1520               \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
1521    
1522           The \Q...\E sequence is recognized both inside  and  outside  character
1523           classes.
1524    
1525           7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
1526           constructions. However, there is some experimental support  for  recur-
1527           sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).
1528           Also, the PCRE "callout" feature allows  an  external  function  to  be
1529           called during pattern matching.
1530    
1531           8.  There  are some differences that are concerned with the settings of
1532           captured strings when part of  a  pattern  is  repeated.  For  example,
1533           matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
1534           unset, but in PCRE it is set to "b".
1535    
1536           9. PCRE  provides  some  extensions  to  the  Perl  regular  expression
1537           facilities:
1538    
1539           (a)  Although  lookbehind  assertions  must match fixed length strings,
1540           each alternative branch of a lookbehind assertion can match a different
1541           length of string. Perl requires them all to have the same length.
1542    
1543           (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
1544           meta-character matches only at the very end of the string.
1545    
1546           (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
1547           cial meaning is faulted.
1548    
1549           (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
1550           fiers is inverted, that is, by default they are not greedy, but if fol-
1551           lowed by a question mark they are.
1552    
1553           (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at
1554           the first matching position in the subject string.
1555    
1556           (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-
1557           TURE options for pcre_exec() have no Perl equivalents.
1558    
1559           (g)  The (?R), (?number), and (?P>name) constructs allow for  recursive
1560           pattern matching (Perl can do  this  using  the  (?p{code})  construct,
1561           which PCRE cannot support.)
1562    
1563           (h)  PCRE supports named capturing substrings, using the Python syntax.
1564    
1565           (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from
1566           Sun's Java package.
1567    
1568           (j) The (R) condition, for testing recursion, is a PCRE extension.
1569    
1570           (k) The callout facility is PCRE-specific.
1571    
1572    Last updated: 09 December 2003
1573  Copyright (c) 1997-2003 University of Cambridge.  Copyright (c) 1997-2003 University of Cambridge.
1574  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
1575    
1576  NAME  PCRE(3)                                                                PCRE(3)
1577       PCRE - Perl-compatible regular expressions  
1578    
1579    
1580    NAME
1581           PCRE - Perl-compatible regular expressions
1582    
1583  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
1584    
1585       The syntax and semantics of  the  regular  expressions  sup-         The  syntax  and semantics of the regular expressions supported by PCRE
1586       ported  by PCRE are described below. Regular expressions are         are described below. Regular expressions are also described in the Perl
1587       also described in the Perl documentation and in a number  of         documentation  and in a number of other books, some of which have copi-
1588       other  books,  some  of which have copious examples. Jeffrey         ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-
1589       Friedl's  "Mastering  Regular  Expressions",  published   by         lished  by  O'Reilly, covers them in great detail. The description here
1590       O'Reilly,  covers them in great detail. The description here         is intended as reference documentation.
1591       is intended as reference documentation.  
1592           The basic operation of PCRE is on strings of bytes. However,  there  is
1593       The basic operation of PCRE is on strings of bytes. However,         also  support for UTF-8 character strings. To use this support you must
1594       there  is  also  support for UTF-8 character strings. To use         build PCRE to include UTF-8 support, and then call pcre_compile()  with
1595       this support you must build PCRE to include  UTF-8  support,         the  PCRE_UTF8  option.  How  this affects the pattern matching is men-
1596       and  then call pcre_compile() with the PCRE_UTF8 option. How         tioned in several places below. There is also a summary of  UTF-8  fea-
1597       this affects the pattern matching is  mentioned  in  several         tures in the section on UTF-8 support in the main pcre page.
1598       places  below.  There is also a summary of UTF-8 features in  
1599       the section on UTF-8 support in the main pcre page.         A  regular  expression  is  a pattern that is matched against a subject
1600           string from left to right. Most characters stand for  themselves  in  a
1601       A regular expression is a pattern that is matched against  a         pattern,  and  match  the corresponding characters in the subject. As a
1602       subject string from left to right. Most characters stand for         trivial example, the pattern
1603       themselves in a pattern, and match the corresponding charac-  
1604       ters in the subject. As a trivial example, the pattern           The quick brown fox
1605    
1606         The quick brown fox         matches a portion of a subject string that is identical to itself.  The
1607           power of regular expressions comes from the ability to include alterna-
1608       matches a portion of a subject string that is  identical  to         tives and repetitions in the pattern. These are encoded in the  pattern
1609       itself.  The  power  of  regular  expressions comes from the         by  the  use  of meta-characters, which do not stand for themselves but
1610       ability to include alternatives and repetitions in the  pat-         instead are interpreted in some special way.
1611       tern.  These  are encoded in the pattern by the use of meta-  
1612       characters, which do not stand for  themselves  but  instead         There are two different sets of meta-characters: those that are  recog-
1613       are interpreted in some special way.         nized  anywhere in the pattern except within square brackets, and those
1614           that are recognized in square brackets. Outside  square  brackets,  the
1615       There are two different sets of meta-characters: those  that         meta-characters are as follows:
1616       are  recognized anywhere in the pattern except within square  
1617       brackets, and those that are recognized in square  brackets.           \      general escape character with several uses
1618       Outside square brackets, the meta-characters are as follows:           ^      assert start of string (or line, in multiline mode)
1619             $      assert end of string (or line, in multiline mode)
1620         \      general escape character with several uses           .      match any character except newline (by default)
1621         ^      assert start of string (or line, in multiline mode)           [      start character class definition
1622         $      assert end of string (or line, in multiline mode)           |      start of alternative branch
1623         .      match any character except newline (by default)           (      start subpattern
1624         [      start character class definition           )      end subpattern
1625         |      start of alternative branch           ?      extends the meaning of (
1626         (      start subpattern                  also 0 or 1 quantifier
1627         )      end subpattern                  also quantifier minimizer
1628         ?      extends the meaning of (           *      0 or more quantifier
1629                also 0 or 1 quantifier           +      1 or more quantifier
1630                also quantifier minimizer                  also "possessive quantifier"
1631         *      0 or more quantifier           {      start min/max quantifier
1632         +      1 or more quantifier  
1633                also "possessive quantifier"         Part  of  a  pattern  that is in square brackets is called a "character
1634         {      start min/max quantifier         class". In a character class the only meta-characters are:
1635    
1636       Part of a pattern that is in square  brackets  is  called  a           \      general escape character
1637       "character  class".  In  a  character  class  the only meta-           ^      negate the class, but only if the first character
1638       characters are:           -      indicates character range
1639             [      POSIX character class (only if followed by POSIX
1640         \      general escape character                    syntax)
1641         ^      negate the class, but only if the first character           ]      terminates the character class
        -      indicates character range  
        [      POSIX character class (only if followed by POSIX  
                 syntax)  
        ]      terminates the character class  
1642    
1643       The following sections describe  the  use  of  each  of  the         The following sections describe the use of each of the meta-characters.
      meta-characters.  
1644    
1645    
1646  BACKSLASH  BACKSLASH
1647    
1648       The backslash character has several uses. Firstly, if it  is         The backslash character has several uses. Firstly, if it is followed by
1649       followed  by  a  non-alphameric character, it takes away any         a non-alphameric character, it takes  away  any  special  meaning  that
1650       special  meaning  that  character  may  have.  This  use  of         character  may  have.  This  use  of  backslash  as an escape character
1651       backslash  as  an  escape  character applies both inside and         applies both inside and outside character classes.
1652       outside character classes.  
1653           For example, if you want to match a * character, you write  \*  in  the
1654       For example, if you want to match a * character,  you  write         pattern.   This  escaping  action  applies whether or not the following
1655       \*  in the pattern.  This escaping action applies whether or         character would otherwise be interpreted as a meta-character, so it  is
1656       not the following character would otherwise  be  interpreted         always  safe to precede a non-alphameric with backslash to specify that
1657       as  a meta-character, so it is always safe to precede a non-         it stands for itself. In particular, if you want to match a  backslash,
1658       alphameric with backslash to  specify  that  it  stands  for         you write \\.
1659       itself. In particular, if you want to match a backslash, you  
1660       write \\.         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
1661           the pattern (other than in a character class) and characters between  a
1662       If a pattern is compiled with the PCRE_EXTENDED option, whi-         # outside a character class and the next newline character are ignored.
1663       tespace in the pattern (other than in a character class) and         An escaping backslash can be used to include a whitespace or #  charac-
1664       characters between a # outside a  character  class  and  the         ter as part of the pattern.
1665       next  newline  character  are ignored. An escaping backslash  
1666       can be used to include a whitespace or # character  as  part         If  you  want  to remove the special meaning from a sequence of charac-
1667       of the pattern.         ters, you can do so by putting them between \Q and \E. This is  differ-
1668           ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E
1669       If you want to remove the special meaning from a sequence of         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-
1670       characters, you can do so by putting them between \Q and \E.         tion. Note the following examples:
1671       This is different from Perl in that $ and @ are  handled  as  
1672       literals  in  \Q...\E  sequences in PCRE, whereas in Perl, $           Pattern            PCRE matches   Perl matches
1673       and @ cause variable interpolation. Note the following exam-  
1674       ples:           \Qabc$xyz\E        abc$xyz        abc followed by the
1675                                                 contents of $xyz
1676         Pattern            PCRE matches   Perl matches           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1677             \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1678         \Qabc$xyz\E        abc$xyz        abc followed by the  
1679           The  \Q...\E  sequence  is recognized both inside and outside character
1680                                             contents of $xyz         classes.
1681         \Qabc\$xyz\E       abc\$xyz       abc\$xyz  
1682         \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz         A second use of backslash provides a way of encoding non-printing char-
1683           acters  in patterns in a visible manner. There is no restriction on the
1684       The \Q...\E sequence is recognized both inside  and  outside         appearance of non-printing characters, apart from the binary zero  that
1685       character classes.         terminates  a  pattern,  but  when  a pattern is being prepared by text
1686           editing, it is usually easier  to  use  one  of  the  following  escape
1687       A second use of backslash provides a way  of  encoding  non-         sequences than the binary character it represents:
1688       printing  characters  in patterns in a visible manner. There  
1689       is no restriction on the appearance of non-printing  charac-           \a        alarm, that is, the BEL character (hex 07)
1690       ters,  apart from the binary zero that terminates a pattern,           \cx       "control-x", where x is any character
1691       but when a pattern is being prepared by text editing, it  is           \e        escape (hex 1B)
1692       usually  easier to use one of the following escape sequences           \f        formfeed (hex 0C)
1693       than the binary character it represents:           \n        newline (hex 0A)
1694             \r        carriage return (hex 0D)
1695         \a        alarm, that is, the BEL character (hex 07)           \t        tab (hex 09)
1696         \cx       "control-x", where x is any character           \ddd      character with octal code ddd, or backreference
1697         \e        escape (hex 1B)           \xhh      character with hex code hh
1698         \f        formfeed (hex 0C)           \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1699         \n        newline (hex 0A)  
1700         \r        carriage return (hex 0D)         The  precise  effect of \cx is as follows: if x is a lower case letter,
1701         \t        tab (hex 09)         it is converted to upper case. Then bit 6 of the character (hex 40)  is
1702         \ddd      character with octal code ddd, or backreference         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
1703         \xhh      character with hex code hh         becomes hex 7B.
1704         \x{hhh..} character with hex code hhh... (UTF-8 mode only)  
1705           After \x, from zero to two hexadecimal digits are read (letters can  be
1706       The precise effect of \cx is as follows: if  x  is  a  lower         in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
1707       case  letter,  it  is converted to upper case. Then bit 6 of         its may appear between \x{ and }, but the value of the  character  code
1708       the character (hex 40) is inverted.  Thus  \cz  becomes  hex         must  be  less  than  2**31  (that is, the maximum hexadecimal value is
1709       1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.         7FFFFFFF). If characters other than hexadecimal digits  appear  between
1710           \x{  and }, or if there is no terminating }, this form of escape is not
1711       After \x, from zero  to  two  hexadecimal  digits  are  read         recognized. Instead, the initial \x will be interpreted as a basic hex-
1712       (letters  can be in upper or lower case). In UTF-8 mode, any         adecimal escape, with no following digits, giving a byte whose value is
1713       number of hexadecimal digits may appear between \x{  and  },         zero.
1714       but  the value of the character code must be less than 2**31  
1715       (that is, the maximum hexadecimal  value  is  7FFFFFFF).  If         Characters whose value is less than 256 can be defined by either of the
1716       characters  other than hexadecimal digits appear between \x{         two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
1717       and }, or if there is no terminating }, this form of  escape         in the way they are handled. For example, \xdc is exactly the  same  as
1718       is  not  recognized.  Instead, the initial \x will be inter-         \x{dc}.
1719       preted as a basic  hexadecimal  escape,  with  no  following  
1720       digits, giving a byte whose value is zero.         After  \0  up  to  two further octal digits are read. In both cases, if
1721           there are fewer than two digits, just those that are present are  used.
1722       Characters whose value is less than 256 can  be  defined  by         Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL
1723       either  of  the  two  syntaxes  for \x when PCRE is in UTF-8         character (code value 7). Make sure you supply  two  digits  after  the
1724       mode. There is no difference in the way  they  are  handled.         initial zero if the character that follows is itself an octal digit.
1725       For example, \xdc is exactly the same as \x{dc}.  
1726           The handling of a backslash followed by a digit other than 0 is compli-
1727       After \0 up to two further octal digits are  read.  In  both         cated.  Outside a character class, PCRE reads it and any following dig-
1728       cases,  if  there are fewer than two digits, just those that         its  as  a  decimal  number. If the number is less than 10, or if there
1729       are present are used. Thus the  sequence  \0\x\07  specifies         have been at least that many previous capturing left parentheses in the
1730       two binary zeros followed by a BEL character (code value 7).         expression,  the  entire  sequence  is  taken  as  a  back reference. A
1731       Make sure you supply two digits after the  initial  zero  if         description of how this works is given later, following the  discussion
1732       the character that follows is itself an octal digit.         of parenthesized subpatterns.
1733    
1734       The handling of a backslash followed by a digit other than 0         Inside  a  character  class, or if the decimal number is greater than 9
1735       is  complicated.   Outside  a character class, PCRE reads it         and there have not been that many capturing subpatterns, PCRE  re-reads
1736       and any following digits as a decimal number. If the  number         up  to three octal digits following the backslash, and generates a sin-
1737       is  less  than  10, or if there have been at least that many         gle byte from the least significant 8 bits of the value. Any subsequent
1738       previous capturing left parentheses in the  expression,  the         digits stand for themselves.  For example:
1739       entire  sequence is taken as a back reference. A description  
1740       of how this works is given later, following  the  discussion           \040   is another way of writing a space
1741       of parenthesized subpatterns.           \40    is the same, provided there are fewer than 40
1742                       previous capturing subpatterns
1743       Inside a character  class,  or  if  the  decimal  number  is           \7     is always a back reference
1744       greater  than  9 and there have not been that many capturing           \11    might be a back reference, or another way of
1745       subpatterns, PCRE re-reads up to three octal digits  follow-                     writing a tab
1746       ing  the  backslash,  and  generates  a single byte from the           \011   is always a tab
1747       least significant 8 bits of the value. Any subsequent digits           \0113  is a tab followed by the character "3"
1748       stand for themselves.  For example:           \113   might be a back reference, otherwise the
1749                       character with octal code 113
1750         \040   is another way of writing a space           \377   might be a back reference, otherwise
1751         \40    is the same, provided there are fewer than 40                     the byte consisting entirely of 1 bits
1752                   previous capturing subpatterns           \81    is either a back reference, or a binary zero
1753         \7     is always a back reference                     followed by the two characters "8" and "1"
1754         \11    might be a back reference, or another way of  
1755                   writing a tab         Note  that  octal  values of 100 or greater must not be introduced by a
1756         \011   is always a tab         leading zero, because no more than three octal digits are ever read.
1757         \0113  is a tab followed by the character "3"  
1758         \113   might be a back reference, otherwise the         All the sequences that define a single byte value  or  a  single  UTF-8
1759                   character with octal code 113         character (in UTF-8 mode) can be used both inside and outside character
1760         \377   might be a back reference, otherwise         classes. In addition, inside a character  class,  the  sequence  \b  is
1761                   the byte consisting entirely of 1 bits         interpreted  as  the  backspace character (hex 08). Outside a character
1762         \81    is either a back reference, or a binary zero         class it has a different meaning (see below).
1763                   followed by the two characters "8" and "1"  
1764           The third use of backslash is for specifying generic character types:
1765       Note that octal values of 100 or greater must not be  intro-  
1766       duced  by  a  leading zero, because no more than three octal           \d     any decimal digit
1767       digits are ever read.           \D     any character that is not a decimal digit
1768             \s     any whitespace character
1769       All the sequences that define a single byte value or a  sin-           \S     any character that is not a whitespace character
1770       gle  UTF-8 character (in UTF-8 mode) can be used both inside           \w     any "word" character
1771       and outside character classes. In addition, inside a charac-           \W     any "non-word" character
1772       ter  class,  the sequence \b is interpreted as the backspace  
1773       character (hex 08). Outside a character class it has a  dif-         Each pair of escape sequences partitions the complete set of characters
1774       ferent meaning (see below).         into  two disjoint sets. Any given character matches one, and only one,
1775           of each pair.
1776       The third use of backslash is for specifying generic charac-  
1777       ter types:         In UTF-8 mode, characters with values greater than 255 never match  \d,
1778           \s, or \w, and always match \D, \S, and \W.
1779         \d     any decimal digit  
1780         \D     any character that is not a decimal digit         For  compatibility  with Perl, \s does not match the VT character (code
1781         \s     any whitespace character         11).  This makes it different from the the POSIX "space" class. The  \s
1782         \S     any character that is not a whitespace character         characters are HT (9), LF (10), FF (12), CR (13), and space (32).
1783         \w     any "word" character  
1784         \W     any "non-word" character         that is, any character which can be part of a Perl "word". The  defini-
1785           that is, any character which can be part of a Perl "word". The  defini-
1786       Each pair of escape sequences partitions the complete set of         tion  of  letters  and digits is controlled by PCRE's character tables,
1787       characters  into  two  disjoint  sets.  Any  given character         and may vary if locale-specific matching is taking place (see  "Locale
1788       matches one, and only one, of each pair.         support"  in  the  pcreapi  page).  For  example,  in the "fr" (French)
1789           locale, some character codes greater than 128  are  used  for  accented
1790       In UTF-8 mode, characters with values greater than 255 never         letters, and these are matched by \w.
1791       match \d, \s, or \w, and always match \D, \S, and \W.  
1792           These character type sequences can appear both inside and outside char-
1793       For compatibility with Perl, \s does not match the VT  char-         acter classes. They each match one character of the  appropriate  type.
1794       acter (code 11).  This makes it different from the POSIX         If  the current matching point is at the end of the subject string, all
1795       "space" class. The \s characters are HT  (9),  LF  (10),  FF         of them fail, since there is no character to match.
1796       (12), CR (13), and space (32).  
1797           The fourth use of backslash is for certain simple assertions. An asser-
1798       A "word" character is any letter or digit or the  underscore         tion  specifies a condition that has to be met at a particular point in
1799       character,  that  is,  any  character which can be part of a         a match, without consuming any characters from the subject string.  The
1800       Perl "word". The definition of letters and  digits  is  con-         use  of subpatterns for more complicated assertions is described below.
1801       trolled  by PCRE's character tables, and may vary if locale-         The backslashed assertions are
1802       specific matching is taking place (see "Locale  support"  in  
1803       the pcreapi page). For example, in the "fr" (French) locale,           \b     matches at a word boundary
1804       some character codes greater than 128 are used for  accented           \B     matches when not at a word boundary
1805       letters, and these are matched by \w.           \A     matches at start of subject
1806             \Z     matches at end of subject or before newline at end
1807       These character type sequences can appear  both  inside  and           \z     matches at end of subject
1808       outside  character classes. They each match one character of           \G     matches at first matching position in subject
1809       the appropriate type. If the current matching  point  is  at  
1810       the end of the subject string, all of them fail, since there         These assertions may not appear in character classes (but note that  \b
1811       is no character to match.         has a different meaning, namely the backspace character, inside a char-
1812           acter class).
1813       The fourth use of backslash is  for  certain  simple  asser-  
1814       tions. An assertion specifies a condition that has to be met         A word boundary is a position in the subject string where  the  current
1815       at a particular point in  a  match,  without  consuming  any         character  and  the previous character do not both match \w or \W (i.e.
1816       characters  from  the subject string. The use of subpatterns         one matches \w and the other matches \W), or the start or  end  of  the
1817       for more complicated  assertions  is  described  below.  The         string if the first or last character matches \w, respectively.
1818       backslashed assertions are  
1819           The  \A,  \Z,  and \z assertions differ from the traditional circumflex
1820         \b     matches at a word boundary         and dollar (described below) in that they only ever match at  the  very
1821         \B     matches when not at a word boundary         start  and  end  of the subject string, whatever options are set. Thus,
1822         \A     matches at start of subject         they are independent of multiline mode.
1823         \Z     matches at end of subject or before newline at end  
1824         \z     matches at end of subject         They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
1825         \G     matches at first matching position in subject         startoffset argument of pcre_exec() is non-zero, indicating that match-
1826           ing is to start at a point other than the beginning of the subject,  \A
1827       These assertions may not appear in  character  classes  (but         can  never  match.  The difference between \Z and \z is that \Z matches
1828       note  that  \b has a different meaning, namely the backspace         before a newline that is the last character of the string as well as at
1829       character, inside a character class).         the end of the string, whereas \z matches only at the end.
1830    
1831       A word boundary is a position in the  subject  string  where         The  \G assertion is true only when the current matching position is at
1832       the current character and the previous character do not both         the start point of the match, as specified by the startoffset  argument
1833       match \w or \W (i.e. one matches \w and  the  other  matches         of  pcre_exec().  It  differs  from \A when the value of startoffset is
1834       \W),  or the start or end of the string if the first or last         non-zero. By calling pcre_exec() multiple times with appropriate  argu-
1835       character matches \w, respectively.         ments, you can mimic Perl's /g option, and it is in this kind of imple-
1836       The \A, \Z, and \z assertions differ  from  the  traditional         mentation where \G can be useful.
1837       circumflex  and  dollar  (described below) in that they only  
1838       ever match at the very start and end of the subject  string,         Note, however, that PCRE's interpretation of \G, as the  start  of  the
1839       whatever options are set. Thus, they are independent of mul-         current match, is subtly different from Perl's, which defines it as the
1840       tiline mode.         end of the previous match. In Perl, these can  be  different  when  the
1841           previously  matched  string was empty. Because PCRE does just one match
1842       They are not affected  by  the  PCRE_NOTBOL  or  PCRE_NOTEOL         at a time, it cannot reproduce this behaviour.
1843       options.  If the startoffset argument of pcre_exec() is non-  
1844       zero, indicating that matching is to start at a point  other         If all the alternatives of a pattern begin with \G, the  expression  is
1845       than  the  beginning of the subject, \A can never match. The         anchored to the starting match position, and the "anchored" flag is set
1846       difference between \Z and \z is that  \Z  matches  before  a         in the compiled regular expression.
      newline  that is the last character of the string as well as  
      at the end of the string, whereas \z  matches  only  at  the  
      end.  
   
      The \G assertion is true  only  when  the  current  matching  
      position is at the start point of the match, as specified by  
      the startoffset argument of pcre_exec(). It differs from  \A  
      when  the  value  of  startoffset  is  non-zero.  By calling  
      pcre_exec() multiple times with appropriate  arguments,  you  
      can mimic Perl's /g option, and it is in this kind of imple-  
      mentation where \G can be useful.  
   
      Note, however, that PCRE's  interpretation  of  \G,  as  the  
      start of the current match, is subtly different from Perl's,  
      which defines it as the end of the previous match. In  Perl,  
      these  can  be  different when the previously matched string  
      was empty. Because PCRE does just one match at  a  time,  it  
      cannot reproduce this behaviour.  
   
      If all the alternatives of a  pattern  begin  with  \G,  the  
      expression  is  anchored to the starting match position, and  
      the "anchored" flag is set in the compiled  regular  expres-  
      sion.  
1847    
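The \cx rule described in the BACKSLASH section above can be checked
with a short sketch. The Python below is illustrative only and is not
taken from the PCRE source; the function name control_escape is
hypothetical. It converts a lower case letter to upper case and then
inverts bit 6 (hex 40), which reproduces the three examples given in
the text (\cz, \c{ and \c;).

```python
def control_escape(x):
    """Return the character code that \\cx would yield under the documented rule.

    Hypothetical helper: a model of the rule as described, not PCRE's code.
    """
    code = ord(x)
    # Step 1: a lower case ASCII letter is converted to upper case.
    if ord("a") <= code <= ord("z"):
        code -= 0x20
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


for ch in "z{;":
    print(repr(ch), hex(control_escape(ch)))
```

Because the second step is an exclusive-or rather than a mask, a
character that already has bit 6 clear (such as ";") gains it, which is
why \c; yields hex 7B rather than a control character.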
1848    
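The rules for a backslash followed by digits (back reference or octal
byte) are easier to follow when written out as a decision procedure.
The sketch below is a model of the rules as stated in the BACKSLASH
section, not PCRE's actual parser; the function name, its parameters and
the tuple it returns are all assumptions made for illustration.

```python
def interpret_backslash_digits(digits, capture_count, in_class=False):
    """Classify the digits that follow a backslash.

    digits        -- the run of decimal digits after the backslash, e.g. "0113"
    capture_count -- number of previous capturing left parentheses
    in_class      -- True if the escape appears inside a character class

    Returns ("backref", number, "") or ("byte", value, literal_rest), where
    literal_rest holds any trailing digits that stand for themselves.
    """
    if not digits or any(c not in "0123456789" for c in digits):
        raise ValueError("expected one or more decimal digits")

    # Outside a class, a sequence not starting with 0 is first read as a
    # decimal number and may be a back reference.
    if digits[0] != "0" and not in_class:
        number = int(digits)
        if number < 10 or number <= capture_count:
            return ("backref", number, "")

    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    octal = ""
    for c in digits[:3]:
        if c not in "01234567":
            break
        octal += c
    value = (int(octal, 8) & 0xFF) if octal else 0
    return ("byte", value, digits[len(octal):])


print(interpret_backslash_digits("0113", 0))
print(interpret_backslash_digits("81", 0))
```

With no previous capturing subpatterns, "0113" is read as a tab (byte
value 9) followed by a literal "3", and "81" as a binary zero followed
by the literal characters "8" and "1", matching the table of examples.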
1849  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
1850    
1851       Outside a character class, in the default matching mode, the         Outside a character class, in the default matching mode, the circumflex
1852       circumflex  character  is an assertion which is true only if         character  is  an  assertion which is true only if the current matching
1853       the current matching point is at the start  of  the  subject         point is at the start of the subject string. If the  startoffset  argu-
1854       string.  If  the startoffset argument of pcre_exec() is non-         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
1855       zero, circumflex  can  never  match  if  the  PCRE_MULTILINE         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
1856       option is unset. Inside a character class, circumflex has an         has an entirely different meaning (see below).
1857       entirely different meaning (see below).  
1858           Circumflex  need  not be the first character of the pattern if a number
1859       Circumflex need not be the first character of the pattern if         of alternatives are involved, but it should be the first thing in  each
1860       a  number of alternatives are involved, but it should be the         alternative  in  which  it appears if the pattern is ever to match that
1861       first thing in each alternative in which it appears  if  the         branch. If all possible alternatives start with a circumflex, that  is,
1862       pattern is ever to match that branch. If all possible alter-         if  the  pattern  is constrained to match only at the start of the sub-
1863       natives start with a circumflex, that is, if the pattern  is         ject, it is said to be an "anchored" pattern.  (There  are  also  other
1864       constrained to match only at the start of the subject, it is         constructs that can cause a pattern to be anchored.)
1865       said to be an "anchored" pattern. (There are also other con-  
1866       structs that can cause a pattern to be anchored.)         A  dollar  character  is an assertion which is true only if the current
1867           matching point is at the end of  the  subject  string,  or  immediately
1868       A dollar character is an assertion which is true only if the         before a newline character that is the last character in the string (by
1869       current  matching point is at the end of the subject string,         default). Dollar need not be the last character of  the  pattern  if  a
1870       or immediately before a newline character that is  the  last         number  of alternatives are involved, but it should be the last item in
1871       character in the string (by default). Dollar need not be the         any branch in which it appears.  Dollar has no  special  meaning  in  a
1872       last character of the pattern if a  number  of  alternatives         character class.
1873       are  involved,  but it should be the last item in any branch  
1874       in which it appears.  Dollar has no  special  meaning  in  a         The  meaning  of  dollar  can be changed so that it matches only at the
1875       character class.         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
1876           compile time. This does not affect the \Z assertion.
1877       The meaning of dollar can be changed so that it matches only  
1878       at   the   very   end   of   the   string,  by  setting  the         The meanings of the circumflex and dollar characters are changed if the
1879       PCRE_DOLLAR_ENDONLY option at compile time.  This  does  not         PCRE_MULTILINE option is set. When this is the case, they match immedi-
1880       affect the \Z assertion.         ately  after  and  immediately  before  an  internal newline character,
1881           respectively, in addition to matching at the start and end of the  sub-
1882       The meanings of the circumflex  and  dollar  characters  are         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject
1883       changed  if  the  PCRE_MULTILINE option is set. When this is         string "def\nabc" in multiline mode, but not  otherwise.  Consequently,
1884       the case,  they  match  immediately  after  and  immediately         patterns  that  are  anchored  in single line mode because all branches
1885       before an internal newline character, respectively, in addi-         start with ^ are not anchored in multiline mode, and a match  for  cir-
1886       tion to matching at the start and end of the subject string.         cumflex  is  possible  when  the startoffset argument of pcre_exec() is
1887       For  example, the pattern /^abc$/ matches the subject string         non-zero. The PCRE_DOLLAR_ENDONLY option is ignored  if  PCRE_MULTILINE
1888       "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-         is set.
1889       quently,  patterns  that  are  anchored  in single line mode  
1890       because all branches start with ^ are not anchored in multi-         Note  that  the sequences \A, \Z, and \z can be used to match the start
1891       line  mode,  and a match for circumflex is possible when the         and end of the subject in both modes, and if all branches of a  pattern
1892       startoffset  argument  of  pcre_exec()  is   non-zero.   The         start  with  \A it is always anchored, whether PCRE_MULTILINE is set or
1893       PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is         not.
      set.  
   
      Note that the sequences \A, \Z, and \z can be used to  match  
      the  start  and end of the subject in both modes, and if all  
      branches of a pattern start with \A it is  always  anchored,  
      whether PCRE_MULTILINE is set or not.  
1894    
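The /^abc$/ example from the CIRCUMFLEX AND DOLLAR section above can be
tried out with any Perl-style engine. The sketch below uses Python's re
module purely as a stand-in: it is not PCRE, and re.MULTILINE is only an
analogue of PCRE_MULTILINE, but for these particular cases the two
engines are expected to agree. The variable names are arbitrary.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
trailing = re.search(r"abc$", "abc\n")

print(single)
print(multi.span())
print(trailing.span())
```

Note that the escapes \Z and \z are deliberately left out of this
sketch, because Python gives \Z a different meaning from the one
documented here.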
1895    
1896  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
1897    
1898       Outside a character class, a dot in the pattern matches  any         Outside a character class, a dot in the pattern matches any one charac-
1899       one character in the subject, including a non-printing char-         ter  in  the  subject,  including a non-printing character, but not (by
1900       acter, but not (by default) newline.  In UTF-8 mode,  a  dot         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,
1901       matches  any  UTF-8  character, which might be more than one         which  might  be  more than one byte long, except (by default) for new-
1902       byte  long,  except  (by  default)  for  newline.   If   the         line. If the PCRE_DOTALL option is set, dots match  newlines  as  well.
1903       PCRE_DOTALL  option is set, dots match newlines as well. The         The  handling of dot is entirely independent of the handling of circum-
1904       handling of dot is entirely independent of the  handling  of         flex and dollar, the only relationship being  that  they  both  involve
1905       circumflex and dollar, the only relationship being that they         newline characters. Dot has no special meaning in a character class.
      both involve newline characters. Dot has no special  meaning  
      in a character class.  
   
1906    
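The default treatment of newline by dot, and the effect of a "dot
matches all" option, can be sketched in the same way. Python's re
module is again used only as a stand-in for PCRE, with re.DOTALL playing
the role that PCRE_DOTALL plays in the FULL STOP section above; the
variable names are arbitrary.

```python
import re

# By default, dot does not match a newline ...
default_match = re.match(r"a.c", "a\nc")

# ... but it does match other non-printing characters, such as a tab.
tab_match = re.match(r"a.c", "a\tc")

# With the "dot matches all" option set, a newline is matched as well.
dotall_match = re.match(r"a.c", "a\nc", re.DOTALL)

print(default_match is None)
print(tab_match is not None)
print(dotall_match is not None)
```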
1907    
1908  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
1909    
1910       Outside a character class, the escape  sequence  \C  matches         Outside a character class, the escape sequence \C matches any one byte,
1911       any  one  byte, both in and out of UTF-8 mode. Unlike a dot,         both in and out of UTF-8 mode. Unlike a dot, it always matches  a  new-
1912       it always matches a newline. The feature is provided in Perl         line.  The  feature  is  provided  in Perl in order to match individual
1913       in  order  to match individual bytes in UTF-8 mode.  Because         bytes in UTF-8 mode.  Because it breaks up UTF-8 characters into  indi-
1914       it breaks up UTF-8 characters into  individual  bytes,  what         vidual  bytes,  what  remains  in  the  string may be a malformed UTF-8
1915       remains  in  the string may be a malformed UTF-8 string. For         string. For this reason it is best avoided.
1916       this reason it is best avoided.  
1917           PCRE does not allow \C to appear in lookbehind assertions (see  below),
1918       PCRE does not allow \C to appear  in  lookbehind  assertions         because in UTF-8 mode it makes it impossible to calculate the length of
1919       (see below), because in UTF-8 mode it makes it impossible to         the lookbehind.
      calculate the length of the lookbehind.  
1920    
1921    
1922  SQUARE BRACKETS  SQUARE BRACKETS
1923    
1924       An opening square bracket introduces a character class, ter-         An opening square bracket introduces a character class, terminated by a
1925       minated  by  a  closing  square  bracket.  A  closing square         closing square bracket. A closing square bracket on its own is not spe-
1926       bracket on its own is  not  special.  If  a  closing  square         cial. If a closing square bracket is required as a member of the class,
1927       bracket  is  required as a member of the class, it should be         it  should  be  the first data character in the class (after an initial
1928       the first data character in the class (after an initial cir-         circumflex, if present) or escaped with a backslash.
1929       cumflex, if present) or escaped with a backslash.  
1930           A character class matches a single character in the subject.  In  UTF-8
1931       A character class matches a single character in the subject.         mode,  the character may occupy more than one byte. A matched character
1932       In  UTF-8 mode, the character may occupy more than one byte.         must be in the set of characters defined by the class, unless the first
1933       A matched character must be in the set of characters defined         character  in  the  class definition is a circumflex, in which case the
1934       by the class, unless the first character in the class defin-         subject character must not be in the set defined by  the  class.  If  a
1935       ition is a circumflex, in which case the  subject  character         circumflex  is actually required as a member of the class, ensure it is
1936       must not be in the set defined by the class. If a circumflex         not the first character, or escape it with a backslash.
1937       is actually required as a member of the class, ensure it  is  
1938       not the first character, or escape it with a backslash.         For example, the character class [aeiou] matches any lower case  vowel,
1939           while  [^aeiou]  matches  any character that is not a lower case vowel.
1940       For example, the character class [aeiou] matches  any  lower         Note that a circumflex is just a convenient notation for specifying the
1941       case vowel, while [^aeiou] matches any character that is not         characters which are in the class by enumerating those that are not. It
1942       a lower case vowel. Note that a circumflex is  just  a  con-         is not an assertion: it still consumes a  character  from  the  subject
1943       venient  notation for specifying the characters which are in         string, and fails if the current pointer is at the end of the string.
1944       the class by enumerating those that are not. It  is  not  an  
1945       assertion:  it  still  consumes a character from the subject         In  UTF-8 mode, characters with values greater than 255 can be included
1946       string, and fails if the current pointer is at  the  end  of         in a class as a literal string of bytes, or by using the  \x{  escaping
1947       the string.         mechanism.
1948    
1949       In UTF-8 mode, characters with values greater than  255  can         When  caseless  matching  is set, any letters in a class represent both
1950       be  included  in a class as a literal string of bytes, or by         their upper case and lower case versions, so for  example,  a  caseless
1951       using the \x{ escaping mechanism.         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
1952           match "A", whereas a caseful version would. PCRE does not  support  the
1953       When caseless matching  is  set,  any  letters  in  a  class         concept of case for characters with values greater than 255.
1954       represent  both their upper case and lower case versions, so  
1955       for example, a caseless [aeiou] matches "A" as well as  "a",         The  newline character is never treated in any special way in character
1956       and  a caseless [^aeiou] does not match "A", whereas a case-         classes, whatever the setting  of  the  PCRE_DOTALL  or  PCRE_MULTILINE
1957       ful version would. PCRE does not support the concept of case         options is. A class such as [^a] will always match a newline.
1958       for characters with values greater than 255.  
1959       The newline character is never treated in any special way in         The  minus (hyphen) character can be used to specify a range of charac-
1960       character  classes,  whatever the setting of the PCRE_DOTALL         ters in a character  class.  For  example,  [d-m]  matches  any  letter
1961       or PCRE_MULTILINE options is. A  class  such  as  [^a]  will         between  d  and  m,  inclusive.  If  a minus character is required in a
1962       always match a newline.         class, it must be escaped with a backslash  or  appear  in  a  position
1963           where  it cannot be interpreted as indicating a range, typically as the
1964       The minus (hyphen) character can be used to specify a  range         first or last character in the class.
1965       of  characters  in  a  character  class.  For example, [d-m]  
1966       matches any letter between d and m, inclusive.  If  a  minus         It is not possible to have the literal character "]" as the end charac-
1967       character  is required in a class, it must be escaped with a         ter  of a range. A pattern such as [W-]46] is interpreted as a class of
1968       backslash or appear in a position where it cannot be  inter-         two characters ("W" and "-") followed by a literal string "46]", so  it
1969       preted as indicating a range, typically as the first or last         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
1970       character in the class.         backslash it is interpreted as the end of range, so [W-\]46] is  inter-
1971           preted  as  a  single class containing a range followed by two separate
1972       It is not possible to have the literal character "]" as  the         characters. The octal or hexadecimal representation of "]" can also  be
1973       end  character  of  a  range.  A  pattern such as [W-]46] is         used to end a range.
1974       interpreted as a class of two characters ("W" and "-")  fol-  
1975       lowed by a literal string "46]", so it would match "W46]" or         Ranges  operate in the collating sequence of character values. They can
1976       "-46]". However, if the "]" is escaped with a  backslash  it         also  be  used  for  characters  specified  numerically,  for   example
1977       is  interpreted  as  the end of range, so [W-\]46] is inter-         [\000-\037].  In UTF-8 mode, ranges can include characters whose values
1978       preted as a single class containing a range followed by  two         are greater than 255, for example [\x{100}-\x{2ff}].
1979       separate characters. The octal or hexadecimal representation  
1980       of "]" can also be used to end a range.         If a range that includes letters is used when caseless matching is set,
1981           it matches the letters in either case. For example, [W-c] is equivalent
1982       Ranges  operate  in  the  collating  sequence  of  character         to [][\^_`wxyzabc], matched caselessly, and if character tables for the
1983       values.  They  can  also  be  used  for characters specified         "fr"  locale  are  in use, [\xc8-\xcb] matches accented E characters in
1984       numerically, for example [\000-\037]. In UTF-8 mode,  ranges         both cases.
1985       can  include  characters  whose values are greater than 255,  
1986       for example [\x{100}-\x{2ff}].         The character types \d, \D, \s, \S, \w, and \W may  also  appear  in  a
1987           character  class,  and add the characters that they match to the class.
1988       If a range that  includes  letters  is  used  when  caseless         For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
1989       matching  is set, it matches the letters in either case. For         conveniently  be  used with the upper case character types to specify a
1990       example, [W-c] is  equivalent  to  [][\^_`wxyzabc],  matched         more restricted set of characters than the matching  lower  case  type.
1991       caselessly,  and if character tables for the "fr" locale are         For  example,  the  class  [^\W_]  matches any letter or digit, but not
1992       in use, [\xc8-\xcb] matches accented E  characters  in  both         underscore.
1993       cases.  
1994           All non-alphameric characters other than \, -, ^ (at the start) and the
1995       The character types \d, \D, \s, \S,  \w,  and  \W  may  also         terminating ] are non-special in character classes, but it does no harm
1996       appear  in  a  character  class, and add the characters that         if they are escaped.
      they match to the class. For example, [\dABCDEF] matches any  
      hexadecimal  digit.  A  circumflex  can conveniently be used  
      with the upper case character types to specify a  more  res-  
      tricted set of characters than the matching lower case type.  
      For example, the class [^\W_] matches any letter  or  digit,  
      but not underscore.  
   
      All non-alphameric characters other than \,  -,  ^  (at  the  
      start)  and  the  terminating ] are non-special in character  
      classes, but it does no harm if they are escaped.  
1997    
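The class rules above can be tried out in a short sketch. It uses Python's re module as a stand-in for PCRE, which is an assumption of this example: the two engines agree on the cases shown, but they are different implementations. The matches() helper is hypothetical and exists only for this illustration.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

def matches(pattern, subject, flags=0):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject, flags) is not None

# Plain and negated classes.
print(matches(r"[aeiou]", "e"))      # a lower case vowel
print(matches(r"[^aeiou]", "\n"))    # a negated class still matches a newline
print(matches(r"[^aeiou]", ""))      # not an assertion: it needs a character

# Caseless matching applies to letters in a class.
print(matches(r"[aeiou]", "A", re.IGNORECASE))
print(matches(r"[^aeiou]", "A", re.IGNORECASE))

# "]" cannot end a range unless it is escaped.
print(matches(r"[W-]46]", "-46]"))   # class of "W" and "-", then literal "46]"
print(matches(r"[W-\]46]", "Z"))     # one class: range W..] plus "4" and "6"

# Character types inside a class.
print(matches(r"[\dABCDEF]", "C"))   # hexadecimal digit (upper case only)
print(matches(r"[^\W_]", "_"))       # letters and digits, but not underscore
```

Using a whole-subject match keeps each line a test of the class itself rather than of where the match starts.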
1998    
1999  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
2000    
2001       Perl supports the  POSIX  notation  for  character  classes,         Perl supports the POSIX notation  for  character  classes,  which  uses
2002       which  uses names enclosed by [: and :] within the enclosing         names  enclosed by [: and :] within the enclosing square brackets. PCRE
2003       square brackets. PCRE also supports this notation. For exam-         also supports this notation. For example,
2004       ple,  
2005             [01[:alpha:]%]
2006         [01[:alpha:]%]  
2007           matches "0", "1", any alphabetic character, or "%". The supported class
2008       matches "0", "1", any alphabetic character, or "%". The sup-         names are
2009       ported class names are  
2010             alnum    letters and digits
2011         alnum    letters and digits           alpha    letters
2012         alpha    letters           ascii    character codes 0 - 127
2013         ascii    character codes 0 - 127           blank    space or tab only
2014         blank    space or tab only           cntrl    control characters
2015         cntrl    control characters           digit    decimal digits (same as \d)
2016         digit    decimal digits (same as \d)           graph    printing characters, excluding space
2017         graph    printing characters, excluding space           lower    lower case letters
2018         lower    lower case letters           print    printing characters, including space
2019         print    printing characters, including space           punct    printing characters, excluding letters and digits
2020         punct    printing characters, excluding letters and digits           space    white space (not quite the same as \s)
2021         space    white space (not quite the same as \s)           upper    upper case letters
2022         upper    upper case letters           word     "word" characters (same as \w)
2023         word     "word" characters (same as \w)           xdigit   hexadecimal digits
2024         xdigit   hexadecimal digits  
2025           The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
2026       The "space" characters are HT (9),  LF  (10),  VT  (11),  FF         and space (32). Notice that this list includes the VT  character  (code
2027       (12),  CR  (13),  and  space  (32).  Notice  that  this list         11). This makes "space" different to \s, which does not include VT (for
2028       includes the VT character (code 11). This makes "space" dif-         Perl compatibility).
2029       ferent  to  \s, which does not include VT (for Perl compati-  
2030       bility).         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
2031           from  Perl  5.8. Another Perl extension is negation, which is indicated
2032       The name "word" is a Perl extension, and "blank"  is  a  GNU         by a ^ character after the colon. For example,
2033       extension from Perl 5.8. Another Perl extension is negation,  
2034       which is indicated by a ^ character  after  the  colon.  For           [12[:^digit:]]
2035       example,  
2036           matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the
2037         [12[:^digit:]]         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
2038           these are not supported, and an error is given if they are encountered.
      matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also  
      recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a  
      "collating element", but these are  not  supported,  and  an  
      error is given if they are encountered.  
2039    
2040       In UTF-8 mode, characters with values greater  than  255  do         In UTF-8 mode, characters with values greater than 255 do not match any
2041       not match any of the POSIX character classes.         of the POSIX character classes.
2042    
2043    
2044  VERTICAL BAR  VERTICAL BAR
2045    
2046       Vertical bar characters are  used  to  separate  alternative         Vertical bar characters are used to separate alternative patterns.  For
2047       patterns. For example, the pattern         example, the pattern
2048    
2049         gilbert|sullivan           gilbert|sullivan
2050    
2051       matches either "gilbert" or "sullivan". Any number of alter-         matches  either "gilbert" or "sullivan". Any number of alternatives may
2052       natives  may  appear,  and an empty alternative is permitted         appear, and an empty  alternative  is  permitted  (matching  the  empty
2053       (matching the empty string).   The  matching  process  tries         string).   The  matching  process  tries each alternative in turn, from
2054       each  alternative in turn, from left to right, and the first         left to right, and the first one that succeeds is used. If the alterna-
2055       one that succeeds is used. If the alternatives are within  a         tives  are within a subpattern (defined below), "succeeds" means match-
2056       subpattern  (defined  below),  "succeeds" means matching the         ing the rest of the main pattern as well as the alternative in the sub-
2057       rest of the main pattern as well as the alternative  in  the         pattern.
      subpattern.  
2058    
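The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can be illustrated with a minimal sketch. Python's re module is used here as a stand-in for PCRE (an assumption of this example); the variable names are hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# Either alternative may match.
either = re.search(r"gilbert|sullivan", "sullivan and gilbert").group(0)

# Alternatives are tried from left to right and the first success is used,
# so "a" wins even though "ab" would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group(0)

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern matches too. "a" alone leaves "bc", so "c" fails and "ab" is tried.
needs_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_ok = re.fullmatch(r"a|", "") is not None

print(either, first_wins, needs_rest, empty_ok)
```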
2059    
2060  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
2061    
2062       The   settings   of   the   PCRE_CASELESS,   PCRE_MULTILINE,         The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
2063       PCRE_DOTALL,  and  PCRE_EXTENDED options can be changed from         PCRE_EXTENDED options can be changed  from  within  the  pattern  by  a
2064       within the pattern by a  sequence  of  Perl  option  letters         sequence  of  Perl  option  letters  enclosed between "(?" and ")". The
2065       enclosed between "(?" and ")". The option letters are         option letters are
2066    
2067         i  for PCRE_CASELESS           i  for PCRE_CASELESS
2068         m  for PCRE_MULTILINE           m  for PCRE_MULTILINE
2069         s  for PCRE_DOTALL           s  for PCRE_DOTALL
2070         x  for PCRE_EXTENDED           x  for PCRE_EXTENDED
2071    
2072       For example, (?im) sets caseless, multiline matching. It  is         For example, (?im) sets caseless, multiline matching. It is also possi-
2073       also possible to unset these options by preceding the letter         ble to unset these options by preceding the letter with a hyphen, and a
2074       with a hyphen, and a combined setting and unsetting such  as         combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
2075       (?im-sx),  which sets PCRE_CASELESS and PCRE_MULTILINE while         LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
2076       unsetting PCRE_DOTALL and PCRE_EXTENDED, is also  permitted.         is also permitted. If a  letter  appears  both  before  and  after  the
2077       If  a  letter  appears both before and after the hyphen, the         hyphen, the option is unset.
2078       option is unset.  
2079           When  an option change occurs at top level (that is, not inside subpat-
2080       When an option change occurs at  top  level  (that  is,  not         tern parentheses), the change applies to the remainder of  the  pattern
2081       inside  subpattern  parentheses),  the change applies to the         that follows.  If the change is placed right at the start of a pattern,
2082       remainder of the pattern that follows.   If  the  change  is         PCRE extracts it into the global options (and it will therefore show up
2083       placed  right  at  the  start of a pattern, PCRE extracts it         in data extracted by the pcre_fullinfo() function).
2084       into the global options (and it will therefore  show  up  in  
2085       data extracted by the pcre_fullinfo() function).         An option change within a subpattern affects only that part of the cur-
2086           rent pattern that follows it, so
2087       An option change within a subpattern affects only that  part  
2088       of the current pattern that follows it, so           (a(?i)b)c
2089    
2090         (a(?i)b)c         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
2091           used).   By  this means, options can be made to have different settings
2092       matches  abc  and  aBc  and  no  other   strings   (assuming         in different parts of the pattern. Any changes made in one  alternative
2093       PCRE_CASELESS  is  not used).  By this means, options can be         do  carry  on  into subsequent branches within the same subpattern. For
2094       made to have different settings in different  parts  of  the         example,
2095       pattern.  Any  changes  made  in one alternative do carry on  
2096       into subsequent branches within  the  same  subpattern.  For           (a(?i)b|c)
2097       example,  
2098           matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the
2099         (a(?i)b|c)         first  branch  is  abandoned before the option setting. This is because
2100           the effects of option settings happen at compile time. There  would  be
2101       matches "ab", "aB", "c", and "C", even though when  matching         some very weird behaviour otherwise.
2102       "C" the first branch is abandoned before the option setting.  
2103       This is because the effects of  option  settings  happen  at         The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed
2104       compile  time. There would be some very weird behaviour oth-         in the same way as the Perl-compatible options by using the  characters
2105       erwise.         U  and X respectively. The (?X) flag setting is special in that it must
2106           always occur earlier in the pattern than any of the additional features
2107       The PCRE-specific options PCRE_UNGREEDY and  PCRE_EXTRA  can         it turns on, even when it is at top level. It is best put at the start.
      be changed in the same way as the Perl-compatible options by  
      using the characters U and X  respectively.  The  (?X)  flag  
      setting  is  special in that it must always occur earlier in  
      the pattern than any of the additional features it turns on,  
      even when it is at top level. It is best put at the start.  
2108    
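A rough illustration of internal option setting follows, using Python's re module as a stand-in for PCRE. This is an assumption with real limits: recent Python versions reject an option change in the middle of a pattern, so the sketch writes (a(?i)b)c in the scoped form (a(?i:b))c, and Python has no counterpart to the U and X letters. Reading the flags attribute is only loosely analogous to inspecting the global options with pcre_fullinfo().

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

# Options set at the very start of the pattern become global options.
compiled = re.compile(r"(?im)^abc$")
has_caseless = bool(compiled.flags & re.IGNORECASE)
has_multiline = bool(compiled.flags & re.MULTILINE)

# Caseless and multiline together: "ABC" matches on its own line.
found = compiled.search("x\nABC\ny") is not None

# An option change limited to part of a subpattern: only "b" is caseless.
scoped = re.compile(r"(a(?i:b))c")
accepted = [s for s in ("abc", "aBc", "Abc", "abC") if scoped.fullmatch(s)]

print(has_caseless, has_multiline, found, accepted)
```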
2109    
2110  SUBPATTERNS  SUBPATTERNS
2111    
2112       Subpatterns are delimited by parentheses  (round  brackets),         Subpatterns are delimited by parentheses (round brackets), which can be
2113       which can be nested.  Marking part of a pattern as a subpat-         nested.  Marking part of a pattern as a subpattern does two things:
2114       tern does two things:  
2115           1. It localizes a set of alternatives. For example, the pattern
2116       1. It localizes a set of alternatives. For example, the pat-  
2117       tern           cat(aract|erpillar|)
2118    
2119         cat(aract|erpillar|)         matches  one  of the words "cat", "cataract", or "caterpillar". Without
2120           the parentheses, it would match "cataract",  "erpillar"  or  the  empty
2121       matches one of the words "cat",  "cataract",  or  "caterpil-         string.
2122       lar".  Without  the  parentheses, it would match "cataract",  
2123       "erpillar" or the empty string.         2.  It  sets  up  the  subpattern as a capturing subpattern (as defined
2124           above).  When the whole pattern matches, that portion  of  the  subject
2125       2. It sets up the subpattern as a capturing  subpattern  (as         string that matched the subpattern is passed back to the caller via the
2126       defined  above).   When the whole pattern matches, that por-         ovector argument of pcre_exec(). Opening parentheses are  counted  from
2127       tion of the subject string that matched  the  subpattern  is         left  to right (starting from 1) to obtain the numbers of the capturing
2128       passed  back  to  the  caller  via  the  ovector argument of         subpatterns.
2129       pcre_exec(). Opening parentheses are counted  from  left  to  
2130       right (starting from 1) to obtain the numbers of the captur-         For example, if the string "the red king" is matched against  the  pat-
2131       ing subpatterns.         tern
2132    
2133       For example, if the string "the red king" is matched against           the ((red|white) (king|queen))
2134       the pattern  
2135           the captured substrings are "red king", "red", and "king", and are num-
2136         the ((red|white) (king|queen))         bered 1, 2, and 3, respectively.
2137    
2138       the captured substrings are "red king", "red",  and  "king",         The fact that plain parentheses fulfil  two  functions  is  not  always
2139       and are numbered 1, 2, and 3, respectively.         helpful.   There are often times when a grouping subpattern is required
2140           without a capturing requirement. If an opening parenthesis is  followed
2141       The fact that plain parentheses fulfil two functions is  not         by  a question mark and a colon, the subpattern does not do any captur-
2142       always  helpful.  There are often times when a grouping sub-         ing, and is not counted when computing the  number  of  any  subsequent
2143       pattern is required without a capturing requirement.  If  an         capturing  subpatterns. For example, if the string "the white queen" is
2144       opening  parenthesis  is  followed  by a question mark and a         matched against the pattern
2145       colon, the subpattern does not do any capturing, and is  not  
2146       counted  when computing the number of any subsequent captur-           the ((?:red|white) (king|queen))
2147       ing subpatterns. For  example,  if  the  string  "the  white  
2148       queen" is matched against the pattern         the captured substrings are "white queen" and "queen", and are numbered
2149           1  and 2. The maximum number of capturing subpatterns is 65535, and the
2150         the ((?:red|white) (king|queen))         maximum depth of nesting of all subpatterns, both  capturing  and  non-
2151           capturing, is 200.
2152       the captured substrings are "white queen" and  "queen",  and  
2153       are  numbered  1 and 2. The maximum number of capturing sub-         As  a  convenient shorthand, if any option settings are required at the
2154       patterns is 65535, and the maximum depth of nesting  of  all         start of a non-capturing subpattern,  the  option  letters  may  appear
2155       subpatterns, both capturing and non-capturing, is 200.         between the "?" and the ":". Thus the two patterns
2156    
2157       As a  convenient  shorthand,  if  any  option  settings  are           (?i:saturday|sunday)
2158       required  at  the  start  of a non-capturing subpattern, the           (?:(?i)saturday|sunday)
2159       option letters may appear between the "?" and the ":".  Thus  
2160       the two patterns         match exactly the same set of strings. Because alternative branches are
2161           tried from left to right, and options are not reset until  the  end  of
2162         (?i:saturday|sunday)         the  subpattern is reached, an option setting in one branch does affect
2163         (?:(?i)saturday|sunday)         subsequent branches, so the above patterns match "SUNDAY"  as  well  as
2164           "Saturday".
      match exactly the same set of strings.  Because  alternative  
      branches  are  tried from left to right, and options are not  
      reset until the end of the subpattern is reached, an  option  
      setting  in  one  branch does affect subsequent branches, so  
      the above patterns match "SUNDAY" as well as "Saturday".  
2165    
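The numbering of capturing and non-capturing subpatterns can be checked with a small sketch. Python's re module is used as a stand-in for PCRE (an assumption of this example; PCRE itself returns captures through the ovector argument of pcre_exec()). The captures() helper is hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

def captures(pattern, subject):
    """Return the captured substrings in order of group number, or None."""
    m = re.search(pattern, subject)
    return m.groups() if m else None

# Opening parentheses are counted from left to right, starting from 1.
numbered = captures(r"the ((red|white) (king|queen))", "the red king")

# (?: groups without capturing, so later groups are renumbered.
grouped = captures(r"the ((?:red|white) (king|queen))", "the white queen")

# Parentheses localize the alternatives, including the empty one.
words = [w for w in ("cat", "cataract", "caterpillar", "catalog")
         if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Option letters between "?" and ":" apply to every branch.
sunday = re.fullmatch(r"(?i:saturday|sunday)", "SUNDAY") is not None

print(numbered, grouped, words, sunday)
```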
2166    
2167  NAMED SUBPATTERNS  NAMED SUBPATTERNS
2168    
2169       Identifying capturing parentheses by number is  simple,  but         Identifying  capturing  parentheses  by number is simple, but it can be
2170       it  can be very hard to keep track of the numbers in compli-         very hard to keep track of the numbers in complicated  regular  expres-
2171       cated regular expressions. Furthermore, if an expression  is         sions.  Furthermore,  if  an  expression  is  modified, the numbers may
2172       modified,  the  numbers  may change. To help with the diffi-         change. To help with the difficulty, PCRE supports the naming  of  sub-
2173       culty, PCRE supports the naming  of  subpatterns,  something         patterns,  something  that  Perl  does  not  provide. The Python syntax
2174       that  Perl does not provide. The Python syntax (?P<name>...)         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and
2175       is used. Names consist of alphanumeric characters and under-         underscores, and must be unique within a pattern.
2176       scores, and must be unique within a pattern.  
2177           Named  capturing  parentheses  are  still  allocated numbers as well as
2178       Named capturing parentheses are still allocated  numbers  as         names. The PCRE API provides function calls for extracting the name-to-
2179       well  as  names.  The  PCRE  API provides function calls for         number  translation  table from a compiled pattern. For further details
2180       extracting the name-to-number translation table from a  com-         see the pcreapi documentation.
      piled  pattern. For further details see the pcreapi documen-  
      tation.  
2181    
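Since the syntax is borrowed from Python, a short Python sketch shows the idea. The groupindex attribute used below is Python's own name-to-number table; it is shown only as an analogue, and the PCRE calls that do the same job are described in the pcreapi documentation. The pattern and group names are hypothetical.

```python
import re

# Python's re module, from which the (?P<name>...) syntax is taken.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d\d)-(?P<day>\d\d)")

# Named groups are still allocated numbers as well as names.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2007-02-24")
by_name = m.group("month")
by_number = m.group(2)

print(name_to_number, by_name, by_number)
```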
2182    
2183  REPETITION  REPETITION
2184    
2185       Repetition is specified by quantifiers, which can follow any         Repetition is specified by quantifiers, which can  follow  any  of  the
2186       of the following items:         following items:
2187    
2188             a literal data character
2189             the . metacharacter
2190             the \C escape sequence
2191             escapes such as \d that match single characters
2192             a character class
2193             a back reference (see next section)
2194             a parenthesized subpattern (unless it is an assertion)
2195    
2196           The  general repetition quantifier specifies a minimum and maximum num-
2197           ber of permitted matches, by giving the two numbers in  curly  brackets
2198           (braces),  separated  by  a comma. The numbers must be less than 65536,
2199           and the first must be less than or equal to the second. For example:
2200    
2201             z{2,4}
2202    
2203           matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
2204           special  character.  If  the second number is omitted, but the comma is
2205           present, there is no upper limit; if the second number  and  the  comma
2206           are  both omitted, the quantifier specifies an exact number of required
2207           matches. Thus
2208    
2209             [aeiou]{3,}
2210    
2211           matches at least 3 successive vowels, but may match many more, while
2212    
2213             \d{8}
2214    
2215           matches exactly 8 digits. An opening curly bracket that  appears  in  a
2216           position  where a quantifier is not allowed, or one that does not match
2217           the syntax of a quantifier, is taken as a literal character. For  exam-
2218           ple, {,6} is not a quantifier, but a literal string of four characters.
2219    
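The three quantifier examples above can be exercised with a minimal sketch. Python's re module is used as a stand-in for PCRE (an assumption of this example). One difference matters here: Python treats {,6} as a quantifier meaning "from 0 to 6", unlike the literal reading described above, so that case is deliberately left out. The full() helper is hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
z_runs = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowels = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight.
eight = full(r"\d{8}", "20070224")
seven = full(r"\d{8}", "2007022")

print(z_runs, vowels, eight, seven)
```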
2220           In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
2221           individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
2222           acters, each of which is represented by a two-byte sequence.
2223    
2224           The quantifier {0} is permitted, causing the expression to behave as if
2225           the previous item and the quantifier were not present.
2226    
2227           For  convenience  (and  historical compatibility) the three most common
2228           quantifiers have single-character abbreviations:
2229    
2230             *    is equivalent to {0,}
2231             +    is equivalent to {1,}
2232             ?    is equivalent to {0,1}
2233    
2234           It is possible to construct infinite loops by  following  a  subpattern
2235           that can match no characters with a quantifier that has no upper limit,
2236           for example:
2237    
2238             (a?)*
2239    
2240           Earlier versions of Perl and PCRE used to give an error at compile time
2241           for  such  patterns. However, because there are cases where this can be
2242           useful, such patterns are now accepted, but if any  repetition  of  the
2243           subpattern  does in fact match no characters, the loop is forcibly bro-
2244           ken.
2245    
2246           By default, the quantifiers are "greedy", that is, they match  as  much
2247           as  possible  (up  to  the  maximum number of permitted times), without
2248           causing the rest of the pattern to fail. The classic example  of  where
2249           this gives problems is in trying to match comments in C programs. These
2250           appear between the sequences /* and */ and within the  sequence,  indi-
2251           vidual * and / characters may appear. An attempt to match C comments by
2252           applying the pattern
2253    
2254             /\*.*\*/
2255    
2256           to the string
2257    
2258             /* first comment */  not comment  /* second comment */
2259    
2260           fails, because it matches the entire string owing to the greediness  of
2261           the .*  item.
2262    
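The greedy behaviour described above can be reproduced with a short sketch. Python's re module is used as a stand-in for PCRE (an assumption of this example), and the variable names are hypothetical.

```python
import re

# Assumption: Python's re module stands in for PCRE in this sketch.
subject = "/* first comment */  not comment  /* second comment */"

# The greedy .* runs on to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# So the pattern finds one "comment" where the subject contains two.
found = re.findall(r"/\*.*\*/", subject)

print(greedy == subject, len(found))
```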
2263           However,  if  a quantifier is followed by a question mark, it ceases to
2264           be greedy, and instead matches the minimum number of times possible, so
2265           the pattern
2266    
2267         a literal data character           /\*.*?\*/
        the . metacharacter  
        the \C escape sequence  
        escapes such as \d that match single characters  
        a character class  
        a back reference (see next section)  
        a parenthesized subpattern (unless it is an assertion)  
   
      The general repetition quantifier specifies  a  minimum  and  
      maximum  number  of  permitted  matches,  by  giving the two  
      numbers in curly brackets (braces), separated  by  a  comma.  
      The  numbers  must be less than 65536, and the first must be  
      less than or equal to the second. For example:  
   
        z{2,4}  
   
      matches "zz", "zzz", or "zzzz". A closing brace on  its  own  
      is not a special character. If the second number is omitted,  
      but the comma is present, there is no upper  limit;  if  the  
      second number and the comma are both omitted, the quantifier  
      specifies an exact number of required matches. Thus  
   
        [aeiou]{3,}  
   
      matches at least 3 successive vowels,  but  may  match  many  
      more, while  
   
        \d{8}  
   
      matches exactly 8 digits.  An  opening  curly  bracket  that  
      appears  in a position where a quantifier is not allowed, or  
      one that does not match the syntax of a quantifier, is taken  
      as  a literal character. For example, {,6} is not a quantif-  
      ier, but a literal string of four characters.  
   
      In UTF-8 mode, quantifiers apply to UTF-8 characters  rather  
      than  to  individual  bytes.  Thus,  for example, \x{100}{2}  
      matches two UTF-8 characters, each of which  is  represented  
      by a two-byte sequence.  
   
      The quantifier {0} is permitted, causing the  expression  to  
      behave  as  if the previous item and the quantifier were not  
      present.  
   
      For convenience (and  historical  compatibility)  the  three  
      most common quantifiers have single-character abbreviations:  
   
        *    is equivalent to {0,}  
        +    is equivalent to {1,}  
        ?    is equivalent to {0,1}  
   
      It is possible to construct infinite loops  by  following  a  
      subpattern  that  can  match no characters with a quantifier  
      that has no upper limit, for example:  
   
        (a?)*  
   
      Earlier versions of Perl and PCRE used to give an  error  at  
      compile  time  for such patterns. However, because there are  
      cases where this  can  be  useful,  such  patterns  are  now  
      accepted,  but  if  any repetition of the subpattern does in  
      fact match no characters, the loop is forcibly broken.  
   
      By default, the quantifiers  are  "greedy",  that  is,  they  
      match  as much as possible (up to the maximum number of per-  
      mitted times), without causing the rest of  the  pattern  to  
      fail. The classic example of where this gives problems is in  
      trying to match comments in C programs. These appear between  
      the  sequences /* and */ and within the sequence, individual  
      * and / characters may appear. An attempt to  match  C  com-  
      ments by applying the pattern  
   
        /\*.*\*/  
   
      to the string  
   
        /* first command */  not comment  /* second comment */  
   
      fails, because it matches the entire  string  owing  to  the  
      greediness of the .*  item.  
   
      However, if a quantifier is followed by a question mark,  it  
      ceases  to be greedy, and instead matches the minimum number  
      of times possible, so the pattern  
   
        /\*.*?\*/  
   
      does the right thing with the C comments. The meaning of the  
      various  quantifiers is not otherwise changed, just the pre-  
      ferred number of matches.  Do not confuse this use of  ques-  
      tion  mark  with  its  use as a quantifier in its own right.  
      Because it has two uses, it can sometimes appear doubled, as  
      in  
   
        \d??\d  
   
      which matches one digit by preference, but can match two  if  
      that is the only way the rest of the pattern matches.  
   
      If the PCRE_UNGREEDY option is set (an option which  is  not  
      available  in  Perl),  the  quantifiers  are  not  greedy by  
      default, but individual ones can be made greedy by following  
      them  with  a  question mark. In other words, it inverts the  
      default behaviour.  
   
      When a parenthesized subpattern is quantified with a minimum  
      repeat  count  that is greater than 1 or with a limited max-  
      imum, more store is required for the  compiled  pattern,  in  
      proportion to the size of the minimum or maximum.  
      If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL  
      option (equivalent to Perl's /s) is set, thus allowing the .  
      to match  newlines,  the  pattern  is  implicitly  anchored,  
      because whatever follows will be tried against every charac-  
      ter position in the subject string, so there is no point  in  
      retrying  the overall match at any position after the first.  
      PCRE normally treats such a pattern as though it  were  pre-  
      ceded by \A.  
   
      In cases where it is known that the subject string  contains  
      no  newlines,  it  is  worth setting PCRE_DOTALL in order to  
      obtain this optimization, or alternatively using ^ to  indi-  
      cate anchoring explicitly.  
   
      However, there is one situation where the optimization  can-  
      not  be  used. When .*  is inside capturing parentheses that  
      are the subject of a backreference elsewhere in the pattern,  
      a match at the start may fail, and a later one succeed. Con-  
      sider, for example:  
   
        (.*)abc\1  
   
      If the subject is "xyz123abc123"  the  match  point  is  the  
      fourth  character.  For  this  reason, such a pattern is not  
      implicitly anchored.  
   
      When a capturing subpattern is repeated, the value  captured  
      is the substring that matched the final iteration. For exam-  
      ple, after  
   
        (tweedle[dume]{3}\s*)+  
   
      has matched "tweedledum tweedledee" the value  of  the  cap-  
      tured  substring  is  "tweedledee".  However,  if  there are  
      nested capturing  subpatterns,  the  corresponding  captured  
      values  may  have been set in previous iterations. For exam-  
      ple, after  
2268    
2269         /(a|(b))+/         does  the  right  thing with the C comments. The meaning of the various
2270           quantifiers is not otherwise changed,  just  the  preferred  number  of
2271           matches.   Do  not  confuse this use of question mark with its use as a
2272           quantifier in its own right. Because it has two uses, it can  sometimes
2273           appear doubled, as in
2274    
2275       matches "aba" the value of the second captured substring  is           \d??\d
2276       "b".  
2277           which matches one digit by preference, but can match two if that is the
2278           only way the rest of the pattern matches.
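
A minimal sketch of the greedy and minimizing behaviour described above. It uses Python's re module as a stand-in rather than PCRE itself (an assumption made for illustration; the two engines agree on these constructs). The subject string is a hypothetical one of the same shape as the C comment example.

```python
import re

# Hypothetical subject with two C-style comments.
subject = "/* one */ x = 1; /* two */"

greedy = re.compile(r"/\*.*\*/")    # .* takes as much as it can
lazy = re.compile(r"/\*.*?\*/")     # .*? takes as little as it can

greedy_hits = greedy.findall(subject)   # one hit spanning the whole subject
lazy_hits = lazy.findall(subject)       # one hit per comment

# The doubled question mark: \d?? prefers to match no digit at all,
# but yields a digit when the rest of the pattern requires it.
one_digit = re.match(r"\d??\d", "42")       # unanchored at the end
two_digit = re.fullmatch(r"\d??\d", "42")   # must consume both digits

print(greedy_hits, lazy_hits, one_digit.group(), two_digit.group())
```

The greedy pattern returns the entire subject as a single match, while the minimizing form returns each comment separately.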
2279    
2280           If the PCRE_UNGREEDY option is set (an option which is not available in
2281           Perl),  the  quantifiers are not greedy by default, but individual ones
2282           can be made greedy by following them with a  question  mark.  In  other
2283           words, it inverts the default behaviour.
2284    
2285           When  a  parenthesized  subpattern  is quantified with a minimum repeat
2286           count that is greater than 1 or with a limited maximum, more  store  is
2287           required  for  the  compiled  pattern, in proportion to the size of the
2288           minimum or maximum.
2289    
2290           If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
2291           alent  to Perl's /s) is set, thus allowing the . to match newlines, the
2292           pattern is implicitly anchored, because whatever follows will be  tried
2293           against  every character position in the subject string, so there is no
2294           point in retrying the overall match at any position  after  the  first.
2295           PCRE normally treats such a pattern as though it were preceded by \A.
2296    
2297           In  cases  where  it  is known that the subject string contains no new-
2298           lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
2299           mization, or alternatively using ^ to indicate anchoring explicitly.
2300    
2301           However,  there is one situation where the optimization cannot be used.
2302           When .*  is inside capturing parentheses that  are  the  subject  of  a
2303           backreference  elsewhere in the pattern, a match at the start may fail,
2304           and a later one succeed. Consider, for example:
2305    
2306             (.*)abc\1
2307    
2308           If the subject is "xyz123abc123" the match point is the fourth  charac-
2309           ter. For this reason, such a pattern is not implicitly anchored.
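
The match point in this example can be checked with a short sketch. It uses Python's re module as a stand-in for PCRE (an assumption; back references inside such a pattern behave the same way in both for this case).

```python
import re

# An unanchored search must be retried at later start positions,
# because the back reference fails for the first three.
m = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = m.start()   # 0-based offset, so 3 means the fourth character
captured = m.group(1)     # the text that \1 had to repeat

print(match_start, captured)
```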
2310    
2311           When a capturing subpattern is repeated, the value captured is the sub-
2312           string that matched the final iteration. For example, after
2313    
2314             (tweedle[dume]{3}\s*)+
2315    
2316           has matched "tweedledum tweedledee" the value of the captured substring
2317           is  "tweedledee".  However,  if there are nested capturing subpatterns,
2318           the corresponding captured values may have been set in previous  itera-
2319           tions. For example, after
2320    
2321             /(a|(b))+/
2322    
2323           matches "aba" the value of the second captured substring is "b".
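
The two capture examples above can be reproduced with a short sketch. It uses Python's re module as a stand-in for PCRE (an assumption; Python happens to keep nested captures from earlier iterations in the same way for these patterns).

```python
import re

# The outer group is overwritten on every iteration, so only the text
# of the final iteration is left in it.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The final iteration matches "a" through the first alternative, so the
# nested group keeps the "b" that was set one iteration earlier.
m2 = re.match(r"(a|(b))+", "aba")
outer, nested = m2.groups()

print(last_iteration, outer, nested)
```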
2324    
2325    
2326  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
2327    
2328       With both maximizing and minimizing repetition,  failure  of         With both maximizing and minimizing repetition, failure of what follows
2329       what  follows  normally  causes  the repeated item to be re-         normally causes the repeated item to be re-evaluated to see if  a  dif-
2330       evaluated to see if a different number of repeats allows the         ferent number of repeats allows the rest of the pattern to match. Some-
2331       rest  of  the  pattern  to  match. Sometimes it is useful to         times it is useful to prevent this, either to change the nature of  the
2332       prevent this, either to change the nature of the  match,  or         match,  or  to  cause it to fail earlier than it otherwise might, when the
2333       to  cause  it fail earlier than it otherwise might, when the         author of the pattern knows there is no point in carrying on.
2334       author of the pattern knows there is no  point  in  carrying  
2335       on.         Consider, for example, the pattern \d+foo when applied to  the  subject
2336           line
2337       Consider, for example, the pattern \d+foo  when  applied  to  
2338       the subject line           123456bar
2339    
2340         123456bar         After matching all 6 digits and then failing to match "foo", the normal
2341           action of the matcher is to try again with only 5 digits  matching  the
2342       After matching all 6 digits and then failing to match "foo",         \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
2343       the normal action of the matcher is to try again with only 5         "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
2344       digits matching the \d+ item, and then with 4,  and  so  on,         the  means for specifying that once a subpattern has matched, it is not
2345       before  ultimately  failing. "Atomic grouping" (a term taken         to be re-evaluated in this way.
2346       from Jeffrey Friedl's book) provides the means for  specify-  
2347       ing  that once a subpattern has matched, it is not to be re-         If we use atomic grouping for the previous example, the  matcher  would
2348       evaluated in this way.         give up immediately on failing to match "foo" the first time. The nota-
2349           tion is a kind of special parenthesis, starting with  (?>  as  in  this
2350       If we use atomic grouping  for  the  previous  example,  the         example:
2351       matcher  would give up immediately on failing to match "foo"  
2352       the  first  time.  The  notation  is  a  kind   of   special           (?>\d+)foo
2353       parenthesis, starting with (?> as in this example:  
2354           This  kind  of  parenthesis "locks up" the  part of the pattern it con-
2355         (?>\d+)bar         tains once it has matched, and a failure further into  the  pattern  is
2356           prevented  from  backtracking into it. Backtracking past it to previous
2357       This kind of parenthesis "locks up" the  part of the pattern         items, however, works as normal.
2358       it  contains once it has matched, and a failure further into  
2359       the pattern is prevented from backtracking  into  it.  Back-         An alternative description is that a subpattern of  this  type  matches
2360       tracking  past  it to previous items, however, works as nor-         the  string  of  characters  that an identical standalone pattern would
2361       mal.         match, if anchored at the current point in the subject string.
2362    
2363       An alternative description is that a subpattern of this type         Atomic grouping subpatterns are not capturing subpatterns. Simple cases
2364       matches  the  string  of  characters that an identical stan-         such as the above example can be thought of as a maximizing repeat that
2365       dalone pattern would match, if anchored at the current point         must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
2366       in the subject string.         pared  to  adjust  the number of digits they match in order to make the
2367           rest of the pattern match, (?>\d+) can only match an entire sequence of
2368       Atomic grouping subpatterns are not  capturing  subpatterns.         digits.
2369       Simple  cases such as the above example can be thought of as  
2370       a maximizing repeat that must swallow everything it can. So,         Atomic  groups in general can of course contain arbitrarily complicated
2371       while both \d+ and \d+? are prepared to adjust the number of         subpatterns, and can be nested. However, when  the  subpattern  for  an
2372       digits they match in order to make the rest of  the  pattern         atomic group is just a single repeated item, as in the example above, a
2373       match, (?>\d+) can only match an entire sequence of digits.         simpler notation, called a "possessive quantifier" can  be  used.  This
2374           consists  of  an  additional  + character following a quantifier. Using
2375       Atomic groups in general can of course  contain  arbitrarily         this notation, the previous example can be rewritten as
2376       complicated  subpatterns,  and  can be nested. However, when  
2377       the subpattern for an atomic group is just a single repeated           \d++foo
2378       item,  as in the example above, a simpler notation, called a  
2379       "possessive quantifier" can be used.  This  consists  of  an         Possessive  quantifiers  are  always  greedy;  the   setting   of   the
2380       additional  +  character  following a quantifier. Using this         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
2381       notation, the previous example can be rewritten as         simpler forms of atomic group. However, there is no difference  in  the
2382           meaning  or  processing  of  a possessive quantifier and the equivalent
2383         \d++bar         atomic group.
2384    
2385       Possessive quantifiers are always greedy; the setting of the         The possessive quantifier syntax is an extension to the Perl syntax. It
2386       PCRE_UNGREEDY option is ignored. They are a convenient nota-         originates in Sun's Java package.
2387       tion for the simpler forms of atomic group.  However,  there  
2388       is  no  difference in the meaning or processing of a posses-         When  a  pattern  contains an unlimited repeat inside a subpattern that
2389       sive quantifier and the equivalent atomic group.         can itself be repeated an unlimited number of  times,  the  use  of  an
2390           atomic  group  is  the  only way to avoid some failing matches taking a
2391       The possessive quantifier syntax is an extension to the Perl         very long time indeed. The pattern
2392       syntax. It originates in Sun's Java package.  
2393             (\D+|<\d+>)*[!?]
2394       When a pattern contains an unlimited repeat inside a subpat-  
2395       tern  that  can  itself  be  repeated an unlimited number of         matches an unlimited number of substrings that either consist  of  non-
2396       times, the use of an atomic group is the only way  to  avoid         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
2397       some  failing  matches  taking  a very long time indeed. The         matches, it runs quickly. However, if it is applied to
2398       pattern  
2399             aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2400         (\D+|<\d+>)*[!?]  
2401           it takes a long time before reporting  failure.  This  is  because  the
2402       matches an unlimited number of substrings that  either  con-         string  can  be  divided  between  the two repeats in a large number of
2403       sist  of  non-digits,  or digits enclosed in <>, followed by         ways, and all have to be tried. (The example used [!?]  rather  than  a
2404       either ! or ?. When it matches, it runs quickly. However, if         single  character  at the end, because both PCRE and Perl have an opti-
2405       it is applied to         mization that allows for fast failure when a single character is  used.
2406           They  remember  the last single character that is required for a match,
2407         aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa         and fail early if it is not present in the string.)  If the pattern  is
2408           changed to
      it takes a long  time  before  reporting  failure.  This  is  
      because the string can be divided between the two repeats in  
      a large number of ways, and all have to be tried. (The exam-  
      ple  used  [!?]  rather  than a single character at the end,  
      because both PCRE and Perl have an optimization that  allows  
      for  fast  failure  when  a  single  character is used. They  
      remember the last single character that is  required  for  a  
      match,  and  fail early if it is not present in the string.)  
      If the pattern is changed to  
2409    
2410         ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
2411    
2412       sequences of non-digits cannot be broken, and  failure  hap-         sequences  of non-digits cannot be broken, and failure happens quickly.
      pens quickly.  
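
A sketch of the "locking" effect of an atomic group. It is written in Python rather than against PCRE, and it does not use the (?>...) notation: Python accepts that syntax and possessive quantifiers only from version 3.11, so the sketch emulates an atomic group with a lookahead followed by a back reference. The lookahead captures the longest run of digits without consuming it, and \1 then consumes exactly that text with no way to give any of it back. This emulation is an illustration only, not how PCRE implements atomic groups.

```python
import re

subject = "123456bar"

plain = re.compile(r"\d+foo")            # retries with 5 digits, 4, ...
atomic = re.compile(r"(?=(\d+))\1foo")   # emulates (?>\d+)foo

# Both fail on this subject; the atomic form simply gives up sooner.
plain_fails = plain.match(subject) is None
atomic_fails = atomic.match(subject) is None

# Atomic grouping can also change the nature of the match: \d+ may give
# back the final "6", whereas the locked digit run may not.
backtracking = re.match(r"\d+6", "123456")
locked = re.match(r"(?=(\d+))\1[6]", "123456")   # [6] keeps \1 from reading as \16

print(plain_fails, atomic_fails, backtracking is not None, locked is None)
```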
2413    
2414    
2415  BACK REFERENCES  BACK REFERENCES
2416    
2417       Outside a character class, a backslash followed by  a  digit         Outside a character class, a backslash followed by a digit greater than
2418       greater  than  0  (and  possibly  further  digits) is a back         0 (and possibly further digits) is a back reference to a capturing sub-
2419       reference to a capturing subpattern earlier (that is, to its         pattern earlier (that is, to its left) in the pattern,  provided  there
2420       left)  in  the  pattern,  provided there have been that many         have been that many previous capturing left parentheses.
2421       previous capturing left parentheses.  
2422           However, if the decimal number following the backslash is less than 10,
2423       However, if the decimal number following  the  backslash  is         it is always taken as a back reference, and causes  an  error  only  if
2424       less  than  10,  it is always taken as a back reference, and         there  are  not that many capturing left parentheses in the entire pat-
2425       causes an error only if there are not  that  many  capturing         tern. In other words, the parentheses that are referenced need  not  be
2426       left  parentheses in the entire pattern. In other words, the         to  the left of the reference for numbers less than 10. See the section
2427       parentheses that are referenced need not be to the  left  of         entitled "Backslash" above for further details of the handling of  dig-
2428       the  reference  for  numbers  less  than 10. See the section         its following a backslash.
2429       entitled "Backslash" above for further details of  the  han-  
2430       dling of digits following a backslash.         A  back  reference matches whatever actually matched the capturing sub-
2431           pattern in the current subject string, rather  than  anything  matching
2432       A back reference matches whatever actually matched the  cap-         the subpattern itself (see "Subpatterns as subroutines" below for a way
2433       turing subpattern in the current subject string, rather than         of doing that). So the pattern
2434       anything matching the subpattern itself (see "Subpatterns as  
2435       subroutines" below for a way of doing that). So the pattern           (sens|respons)e and \1ibility
2436    
2437         (sens|respons)e and \1ibility         matches "sense and sensibility" and "response and responsibility",  but
2438           not  "sense and responsibility". If caseful matching is in force at the
2439       matches "sense and sensibility" and "response and  responsi-         time of the back reference, the case of letters is relevant. For  exam-
2440       bility",  but  not  "sense  and  responsibility". If caseful         ple,
2441       matching is in force at the time of the back reference,  the  
2442       case of letters is relevant. For example,           ((?i)rah)\s+\1
2443    
2444         ((?i)rah)\s+\1         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
2445           original capturing subpattern is matched caselessly.
2446       matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even  
2447       though  the  original  capturing subpattern is matched case-         Back references to named subpatterns use the Python  syntax  (?P=name).
2448       lessly.         We could rewrite the above example as follows:
2449    
2450       Back references to named subpatterns use the  Python  syntax           (?P<p1>(?i)rah)\s+(?P=p1)
2451       (?P=name). We could rewrite the above example as follows:  
2452           There  may be more than one back reference to the same subpattern. If a
2453         (?<p1>(?i)rah)\s+(?P=p1)         subpattern has not actually been used in a particular match,  any  back
2454           references to it always fail. For example, the pattern
2455       There may be more than one back reference to the  same  sub-  
2456       pattern.  If  a  subpattern  has not actually been used in a           (a|(bc))\2
2457       particular match, any back references to it always fail. For  
2458       example, the pattern         always  fails if it starts to match "a" rather than "bc". Because there
2459           may be many capturing parentheses in a pattern,  all  digits  following
2460         (a|(bc))\2         the  backslash  are taken as part of a potential back reference number.
2461           If the pattern continues with a digit character, some delimiter must be
2462       always fails if it starts to match  "a"  rather  than  "bc".         used  to  terminate  the back reference. If the PCRE_EXTENDED option is
2463       Because  there  may  be many capturing parentheses in a pat-         set, this can be whitespace.  Otherwise an empty comment can be used.
2464       tern, all digits following the backslash are taken  as  part  
2465       of a potential back reference number. If the pattern contin-         A back reference that occurs inside the parentheses to which it  refers
2466       ues with a digit character, some delimiter must be  used  to         fails  when  the subpattern is first used, so, for example, (a\1) never
2467       terminate the back reference. If the PCRE_EXTENDED option is         matches.  However, such references can be useful inside  repeated  sub-
2468       set, this can be whitespace.  Otherwise an empty comment can         patterns. For example, the pattern
2469       be used.  
2470             (a|b\1)+
2471       A back reference that occurs inside the parentheses to which  
2472       it  refers  fails when the subpattern is first used, so, for         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
2473       example, (a\1) never matches.  However, such references  can         ation of the subpattern,  the  back  reference  matches  the  character
2474       be useful inside repeated subpatterns. For example, the pat-         string  corresponding  to  the previous iteration. In order for this to
2475       tern         work, the pattern must be such that the first iteration does  not  need
2476           to  match the back reference. This can be done using alternation, as in
2477         (a|b\1)+         the example above, or by a quantifier with a minimum of zero.
   
      matches any number of "a"s and also "aba", "ababbaa" etc. At  
      each iteration of the subpattern, the back reference matches  
      the character string corresponding to  the  previous  itera-  
      tion.  In  order  for this to work, the pattern must be such  
      that the first iteration does not need  to  match  the  back  
      reference.  This  can  be  done using alternation, as in the  
      example above, or by a quantifier with a minimum of zero.  
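
A sketch of the back reference examples above, using Python's re module as a stand-in for PCRE (an assumption). Two differences are worth labeling: current Python versions require an option set in mid-pattern to be written in the scoped form (?i:...), and Python rejects a reference to a group that is still open, so the (a|b\1)+ example cannot be reproduced this way.

```python
import re

# \1 must repeat the text actually matched, not just match the subpattern.
sense = re.compile(r"(sens|respons)e and \1ibility")
ok_sense = sense.fullmatch("sense and sensibility") is not None
ok_response = sense.fullmatch("response and responsibility") is not None
mixed = sense.fullmatch("sense and responsibility") is not None

# The group is matched caselessly, but the back reference is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [rah.fullmatch(s) is not None
               for s in ("rah rah", "RAH RAH", "RAH rah")]

# A back reference to a group that took no part in the match always fails.
unset = re.compile(r"(a|(bc))\2")
via_a = unset.match("a") is not None
via_bc = unset.match("bcbc") is not None

print(ok_sense, ok_response, mixed, rah_results, via_a, via_bc)
```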
2478    
2479    
2480  ASSERTIONS  ASSERTIONS
2481    
2482       An assertion is  a  test  on  the  characters  following  or         An assertion is a test on the characters  following  or  preceding  the
2483       preceding  the current matching point that does not actually         current  matching  point that does not actually consume any characters.
2484       consume any characters. The simple assertions coded  as  \b,         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
2485       \B,  \A, \G, \Z, \z, ^ and $ are described above.  More com-         described above.  More complicated assertions are coded as subpatterns.
2486       plicated assertions are coded as subpatterns. There are  two         There are two kinds: those that look ahead of the current  position  in
2487       kinds:  those that look ahead of the current position in the         the subject string, and those that look behind it.
      subject string, and those that look behind it.  
2488    
2489       An assertion subpattern is matched in the normal way, except         An  assertion  subpattern  is matched in the normal way, except that it
2490       that  it  does not cause the current matching position to be         does not cause the current matching position to be  changed.  Lookahead
2491       changed. Lookahead assertions start with  (?=  for  positive         assertions  start with (?= for positive assertions and (?! for negative
2492       assertions and (?! for negative assertions. For example,         assertions. For example,
2493    
2494         \w+(?=;)           \w+(?=;)
2495    
2496       matches a word followed by a semicolon, but does not include         matches a word followed by a semicolon, but does not include the  semi-
2497       the semicolon in the match, and         colon in the match, and
2498    
2499         foo(?!bar)           foo(?!bar)
2500    
2501       matches any occurrence of "foo"  that  is  not  followed  by         matches  any  occurrence  of  "foo" that is not followed by "bar". Note
2502       "bar". Note that the apparently similar pattern         that the apparently similar pattern
2503    
2504         (?!foo)bar           (?!foo)bar
2505    
2506       does not find an occurrence of "bar"  that  is  preceded  by         does not find an occurrence of "bar"  that  is  preceded  by  something
2507       something other than "foo"; it finds any occurrence of "bar"         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
2508       whatsoever, because the assertion  (?!foo)  is  always  true         the assertion (?!foo) is always true when the next three characters are
2509       when  the  next  three  characters  are  "bar". A lookbehind         "bar". A lookbehind assertion is needed to achieve this effect.
      assertion is needed to achieve this effect.  
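
A sketch of the lookahead examples above, using Python's re module as a stand-in for PCRE (an assumption; the lookahead syntax is the same). The subject strings are hypothetical.

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookahead: only the "foo" that is not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar looks ahead, not behind: at the point where "bar" starts,
# the next three characters are "bar", so the assertion is always true.
bar_start = re.search(r"(?!foo)bar", "foobar").start()

# An empty negative lookahead can never succeed.
forced_failure = re.search(r"a(?!)", "aaa") is None

print(words, foo_starts, bar_start, forced_failure)
```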
2510    
2511       If you want to force a matching failure at some point  in  a         If you want to force a matching failure at some point in a pattern, the
2512       pattern,  the  most  convenient  way  to  do it is with (?!)         most convenient way to do it is  with  (?!)  because  an  empty  string
2513       because an empty string always matches, so an assertion that         always  matches, so an assertion that requires there not to be an empty
2514       requires there not to be an empty string must always fail.         string must always fail.
2515    
2516       Lookbehind assertions start with (?<=  for  positive  asser-         Lookbehind assertions start with (?<= for positive assertions and  (?<!
2517       tions and (?<! for negative assertions. For example,         for negative assertions. For example,
2518    
2519         (?<!foo)bar           (?<!foo)bar
2520    
2521       does find an occurrence of "bar" that  is  not  preceded  by         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
2522       "foo". The contents of a lookbehind assertion are restricted         contents of a lookbehind assertion are restricted  such  that  all  the
2523       such that all the strings  it  matches  must  have  a  fixed         strings it matches must have a fixed length. However, if there are sev-
2524       length.  However, if there are several alternatives, they do         eral alternatives, they do not all have to have the same fixed  length.
2525       not all have to have the same fixed length. Thus         Thus
2526    
2527         (?<=bullock|donkey)           (?<=bullock|donkey)
2528    
2529       is permitted, but         is permitted, but
2530    
2531         (?<!dogs?|cats?)           (?<!dogs?|cats?)
2532    
2533       causes an error at compile time. Branches  that  match  dif-         causes  an  error at compile time. Branches that match different length
2534       ferent length strings are permitted only at the top level of         strings are permitted only at the top level of a lookbehind  assertion.
2535       a lookbehind assertion. This is an extension  compared  with         This  is  an  extension  compared  with  Perl (at least for 5.8), which
2536       Perl  (at  least  for  5.8),  which requires all branches to         requires all branches to match the same length of string. An  assertion
2537       match the same length of string. An assertion such as         such as
2538    
2539         (?<=ab(c|de))           (?<=ab(c|de))
2540    
2541       is not permitted, because its single  top-level  branch  can         is  not  permitted,  because  its single top-level branch can match two
2542       match two different lengths, but it is acceptable if rewrit-         different lengths, but it is acceptable if rewritten to  use  two  top-
2543       ten to use two top-level branches:         level branches:
2544    
2545         (?<=abc|abde)           (?<=abc|abde)
2546    
2547       The implementation of lookbehind  assertions  is,  for  each         The  implementation  of lookbehind assertions is, for each alternative,
2548       alternative,  to  temporarily move the current position back         to temporarily move the current position back by the  fixed  width  and
2549       by the fixed width and then  try  to  match.  If  there  are         then try to match. If there are insufficient characters before the cur-
2550       insufficient  characters  before  the  current position, the         rent position, the match is deemed to fail.
      match is deemed to fail.  
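
A sketch of lookbehind behaviour, using Python's re module as a stand-in for PCRE (an assumption). Python is stricter than the rule described above: it requires the whole lookbehind to have a single fixed width, so a pattern such as (?<=bullock|donkey), which PCRE permits, is rejected when compiled. The subject strings are hypothetical.

```python
import re

# Negative lookbehind: "bar" only where it is not preceded by "foo".
starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar xbar bar")]

# With fewer than three characters before the current position, a positive
# lookbehind fails and a negative one therefore succeeds.
positive_at_start = re.search(r"(?<=foo)bar", "bar")
negative_at_start = re.search(r"(?<!foo)bar", "bar")

# Python-specific restriction: alternatives of different lengths are an error.
try:
    re.compile(r"(?<=bullock|donkey)")
    mixed_width_ok = True
except re.error:
    mixed_width_ok = False

print(starts, positive_at_start, negative_at_start.start(), mixed_width_ok)
```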
2551    
2552       PCRE does not allow the \C escape (which  matches  a  single         PCRE does not allow the \C escape (which matches a single byte in UTF-8
2553       byte  in  UTF-8  mode)  to  appear in lookbehind assertions,         mode)  to appear in lookbehind assertions, because it makes it impossi-
2554       because it makes it impossible to calculate  the  length  of         ble to calculate the length of the lookbehind.
      the lookbehind.  
2555    
2556       Atomic groups can be used  in  conjunction  with  lookbehind         Atomic groups can be used in conjunction with lookbehind assertions  to
2557       assertions  to  specify efficient matching at the end of the         specify efficient matching at the end of the subject string. Consider a
2558       subject string. Consider a simple pattern such as         simple pattern such as
2559    
2560         abcd$           abcd$
2561    
2562       when applied to a long string that does not  match.  Because         when applied to a long string that does  not  match.  Because  matching
2563       matching  proceeds  from  left  to right, PCRE will look for         proceeds from left to right, PCRE will look for each "a" in the subject
2564       each "a" in the subject and then see if what follows matches         and then see if what follows matches the rest of the  pattern.  If  the
2565       the rest of the pattern. If the pattern is specified as         pattern is specified as
2566    
2567         ^.*abcd$           ^.*abcd$
2568    
2569       the initial .* matches the entire string at first, but  when         the  initial .* matches the entire string at first, but when this fails
2570       this  fails  (because  there  is no following "a"), it back-         (because there is no following "a"), it backtracks to match all but the
2571       tracks to match all but the last character, then all but the         last  character,  then all but the last two characters, and so on. Once
2572       last  two  characters,  and so on. Once again the search for         again the search for "a" covers the entire string, from right to  left,
2573       "a" covers the entire string, from right to left, so we  are         so we are no better off. However, if the pattern is written as
      no better off. However, if the pattern is written as  
2574    
2575         ^(?>.*)(?<=abcd)           ^(?>.*)(?<=abcd)
2576    
2577       or, equivalently,         or, equivalently,
2578    
2579         ^.*+(?<=abcd)           ^.*+(?<=abcd)
2580    
2581       there can be no backtracking for the .* item; it  can  match         there  can  be  no  backtracking for the .* item; it can match only the
2582       only  the entire string. The subsequent lookbehind assertion         entire string. The subsequent lookbehind assertion does a  single  test
2583       does a single test on the last four characters. If it fails,         on  the last four characters. If it fails, the match fails immediately.
2584       the match fails immediately. For long strings, this approach         For long strings, this approach makes a significant difference  to  the
2585       makes a significant difference to the processing time.         processing time.
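
The same idea can be tried out with Python's re module, which is not PCRE but treats these constructs similarly. As an assumption made for portability, the sketch below does not use (?>.*) directly (Python accepts that syntax only from version 3.11); instead it substitutes the idiom (?=(.*))\1, which behaves like an atomic group because a lookahead is never re-entered once it has succeeded. The name ends_with_abcd is hypothetical.

```python
import re

# (?=(.*))\1 stands in for the atomic group (?>.*): the lookahead captures
# the rest of the line once, and the backreference then consumes exactly
# that text, so .* can never give characters back.
ENDS_WITH_ABCD = re.compile(r"^(?=(.*))\1(?<=abcd)")

def ends_with_abcd(subject):
    """Return True if the single lookbehind test on the last four characters succeeds."""
    return ENDS_WITH_ABCD.search(subject) is not None
```

For a subject such as "abcdx" the lookbehind is tested once, after the whole string has been consumed, and the match then fails without any further attempts.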
2586    
2587       Several assertions (of any sort) may  occur  in  succession.         Several assertions (of any sort) may occur in succession. For example,
      For example,  
2588    
2589         (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
2590    
2591       matches "foo" preceded by three digits that are  not  "999".         matches  "foo" preceded by three digits that are not "999". Notice that
2592       Notice  that each of the assertions is applied independently         each of the assertions is applied independently at the  same  point  in
2593       at the same point in the subject string. First  there  is  a         the  subject  string.  First  there  is a check that the previous three
2594       check that the previous three characters are all digits, and         characters are all digits, and then there is  a  check  that  the  same
2595       then there is a check that the same three characters are not         three characters are not "999".  This pattern does not match "foo" pre-
2596       "999".   This  pattern  does not match "foo" preceded by six         ceded by six characters, the first of which are  digits  and  the  last
2597       characters, the first of which are digits and the last three         three  of  which  are not "999". For example, it doesn't match "123abc-
2598       of  which  are  not  "999".  For  example,  it doesn't match         foo". A pattern to do that is
      "123abcfoo". A pattern to do that is  
2599    
2600         (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
2601    
2602       This time the first assertion looks  at  the  preceding  six         This time the first assertion looks at the  preceding  six  characters,
2603       characters,  checking  that  the first three are digits, and         checking that the first three are digits, and then the second assertion
2604       then the second assertion checks that  the  preceding  three         checks that the preceding three characters are not "999".
      characters are not "999".  
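
The difference between the two patterns can be checked with a short Python sketch. It uses Python's re module rather than PCRE, on the assumption that both treat these particular assertions the same way; the names below are illustrative only.

```python
import re

# Both assertions are applied at the same point, just before "foo".
THREE_BACK = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now looks at six characters; the second still looks at three.
SIX_BACK = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```

Here found(THREE_BACK, "123abcfoo") is False because the three characters before "foo" are not digits, whereas found(SIX_BACK, "123abcfoo") is True.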
2605    
2606       Assertions can be nested in any combination. For example,         Assertions can be nested in any combination. For example,
2607    
2608         (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
2609    
2610       matches an occurrence of "baz" that  is  preceded  by  "bar"         matches an occurrence of "baz" that is preceded by "bar" which in  turn
2611       which in turn is not preceded by "foo", while         is not preceded by "foo", while
2612    
2613         (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
2614    
2615       is another pattern which matches  "foo"  preceded  by  three         is another pattern which matches "foo" preceded by three digits and any
2616       digits and any three characters that are not "999".         three characters that are not "999".
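
Nested assertions can be exercised in the same way. Again this is a sketch using Python's re module as a stand-in for PCRE, assuming the two agree on these patterns; the names are hypothetical.

```python
import re

# "baz" preceded by "bar", which in turn is not preceded by "foo".
BAR_NOT_AFTER_FOO = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999";
# the negative lookahead is evaluated inside the lookbehind.
DIGITS_THEN_NOT_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")

def found(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```

Note that found(BAR_NOT_AFTER_FOO, "barbaz") is True: at the start of the subject there are no characters before "bar", so the inner negative lookbehind cannot find "foo" and therefore succeeds.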
2617    
2618       Assertion subpatterns are not capturing subpatterns, and may         Assertion subpatterns are not capturing subpatterns,  and  may  not  be
2619       not  be  repeated,  because  it makes no sense to assert the         repeated,  because  it  makes no sense to assert the same thing several
2620       same thing several times. If any kind of assertion  contains         times. If any kind of assertion contains capturing  subpatterns  within
2621       capturing  subpatterns  within it, these are counted for the         it,  these are counted for the purposes of numbering the capturing sub-
2622       purposes of numbering the capturing subpatterns in the whole         patterns in the whole pattern.  However, substring capturing is carried
2623       pattern.   However,  substring capturing is carried out only         out  only  for  positive assertions, because it does not make sense for
2624       for positive assertions, because it does not make sense  for         negative assertions.
      negative assertions.  
2625    
2626    
2627  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
2628    
2629       It is possible to cause the matching process to obey a  sub-         It is possible to cause the matching process to obey a subpattern  con-
2630       pattern  conditionally  or to choose between two alternative         ditionally  or to choose between two alternative subpatterns, depending
2631       subpatterns, depending on the result  of  an  assertion,  or         on the  result  of  an  assertion,  or  whether  a  previous  capturing
2632       whether  a previous capturing subpattern matched or not. The         subpattern  matched  or not. The two possible forms of conditional sub-
2633       two possible forms of conditional subpattern are         pattern are
2634