Diff of /code/trunk/doc/pcre.txt

revision 83 by nigel, Sat Feb 24 21:41:06 2007 UTC revision 197 by ph10, Tue Jul 31 10:50:18 2007 UTC
# Line 18  INTRODUCTION Line 18  INTRODUCTION
18    
19         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
20         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
21         just  a  few  differences.  The current implementation of PCRE (release         just  a  few differences. (Certain features that appeared in Python and
22         6.x) corresponds approximately with Perl  5.8,  including  support  for         PCRE before they appeared in Perl are also available using  the  Python
23         UTF-8 encoded strings and Unicode general category properties. However,         syntax.)
24         this support has to be explicitly enabled; it is not the default.  
25           The  current  implementation of PCRE (release 7.x) corresponds approxi-
26         In addition to the Perl-compatible matching function,  PCRE  also  con-         mately with Perl 5.10, including support for UTF-8 encoded strings  and
27         tains  an  alternative matching function that matches the same compiled         Unicode general category properties. However, UTF-8 and Unicode support
28         patterns in a different way. In certain circumstances, the  alternative         has to be explicitly enabled; it is not the default. The Unicode tables
29         function  has  some  advantages.  For  a discussion of the two matching         correspond to Unicode release 5.0.0.
30         algorithms, see the pcrematching page.  
31           In  addition to the Perl-compatible matching function, PCRE contains an
32         PCRE is written in C and released as a C library. A  number  of  people         alternative matching function that matches the same  compiled  patterns
33         have  written  wrappers and interfaces of various kinds. In particular,         in  a different way. In certain circumstances, the alternative function
34         Google Inc.  have provided a comprehensive C++  wrapper.  This  is  now         has some advantages. For a discussion of the two  matching  algorithms,
35           see the pcrematching page.
36    
37           PCRE  is  written  in C and released as a C library. A number of people
38           have written wrappers and interfaces of various kinds.  In  particular,
39           Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
40         included as part of the PCRE distribution. The pcrecpp page has details         included as part of the PCRE distribution. The pcrecpp page has details
41         of this interface. Other people's contributions can  be  found  in  the         of  this  interface.  Other  people's contributions can be found in the
42         Contrib directory at the primary FTP site, which is:         Contrib directory at the primary FTP site, which is:
43    
44         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
45    
46         Details  of  exactly which Perl regular expression features are and are         Details of exactly which Perl regular expression features are  and  are
47         not supported by PCRE are given in separate documents. See the pcrepat-         not supported by PCRE are given in separate documents. See the pcrepat-
48         tern and pcrecompat pages.         tern and pcrecompat pages.
49    
50         Some  features  of  PCRE can be included, excluded, or changed when the         Some features of PCRE can be included, excluded, or  changed  when  the
51         library is built. The pcre_config() function makes it  possible  for  a         library  is  built.  The pcre_config() function makes it possible for a
52         client  to  discover  which  features are available. The features them-         client to discover which features are  available.  The  features  them-
53         selves are described in the pcrebuild page. Documentation about  build-         selves  are described in the pcrebuild page. Documentation about build-
54         ing  PCRE for various operating systems can be found in the README file         ing PCRE for various operating systems can be found in the README  file
55         in the source distribution.         in the source distribution.
56    
57         The library contains a number of undocumented  internal  functions  and         The  library  contains  a number of undocumented internal functions and
58         data  tables  that  are  used by more than one of the exported external         data tables that are used by more than one  of  the  exported  external
59         functions, but which are not intended  for  use  by  external  callers.         functions,  but  which  are  not  intended for use by external callers.
60         Their  names  all begin with "_pcre_", which hopefully will not provoke         Their names all begin with "_pcre_", which hopefully will  not  provoke
61         any name clashes. In some environments, it is possible to control which         any name clashes. In some environments, it is possible to control which
62         external  symbols  are  exported when a shared library is built, and in         external symbols are exported when a shared library is  built,  and  in
63         these cases the undocumented symbols are not exported.         these cases the undocumented symbols are not exported.
64    
65    
66  USER DOCUMENTATION  USER DOCUMENTATION
67    
68         The user documentation for PCRE comprises a number  of  different  sec-         The  user  documentation  for PCRE comprises a number of different sec-
69         tions.  In the "man" format, each of these is a separate "man page". In         tions. In the "man" format, each of these is a separate "man page".  In
70         the HTML format, each is a separate page, linked from the  index  page.         the  HTML  format, each is a separate page, linked from the index page.
71         In  the  plain text format, all the sections are concatenated, for ease         In the plain text format, all the sections are concatenated,  for  ease
72         of searching. The sections are as follows:         of searching. The sections are as follows:
73    
74           pcre              this document           pcre              this document
75             pcre-config       show PCRE installation configuration information
76           pcreapi           details of PCRE's native C API           pcreapi           details of PCRE's native C API
77           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
78           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
# Line 81  USER DOCUMENTATION Line 87  USER DOCUMENTATION
87           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API
88           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
89           pcresample        discussion of the sample program           pcresample        discussion of the sample program
90             pcrestack         discussion of stack usage
91           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
92    
93         In  addition,  in the "man" and HTML formats, there is a short page for         In addition, in the "man" and HTML formats, there is a short  page  for
94         each C library function, listing its arguments and results.         each C library function, listing its arguments and results.
95    
96    
97  LIMITATIONS  LIMITATIONS
98    
99         There are some size limitations in PCRE but it is hoped that they  will         There  are some size limitations in PCRE but it is hoped that they will
100         never in practice be relevant.         never in practice be relevant.
101    
102         The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE         The maximum length of a compiled pattern is 65539 (sic) bytes  if  PCRE
103         is compiled with the default internal linkage size of 2. If you want to         is compiled with the default internal linkage size of 2. If you want to
104         process  regular  expressions  that are truly enormous, you can compile         process regular expressions that are truly enormous,  you  can  compile
105         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in         PCRE  with  an  internal linkage size of 3 or 4 (see the README file in
106         the  source  distribution and the pcrebuild documentation for details).         the source distribution and the pcrebuild documentation  for  details).
107         In these cases the limit is substantially larger.  However,  the  speed         In  these  cases the limit is substantially larger.  However, the speed
108         of execution will be slower.         of execution is slower.
109    
110         All values in repeating quantifiers must be less than 65536.  The maxi-         All values in repeating quantifiers must be less than 65536. The  maxi-
111         mum number of capturing subpatterns is 65535.         mum  compiled  length  of  subpattern  with an explicit repeat count is
112           30000 bytes. The maximum number of capturing subpatterns is 65535.
113         There is no limit to the number of non-capturing subpatterns,  but  the  
114         maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,         There is no limit to the number of parenthesized subpatterns, but there
115         including capturing subpatterns, assertions, and other types of subpat-         can be no more than 65535 capturing subpatterns.
116         tern, is 200.  
117           If  a  non-capturing subpattern with an unlimited repetition quantifier
118           can match an empty string, there is a limit of 1000 on  the  number  of
119           times  it  can  be  repeated while not matching an empty string - if it
120           does match an empty string, the loop is immediately broken.
121    
122           The maximum length of name for a named subpattern is 32 characters, and
123           the maximum number of named subpatterns is 10000.
124    
125         The  maximum  length of a subject string is the largest positive number         The  maximum  length of a subject string is the largest positive number
126         that an integer variable can hold. However, when using the  traditional         that an integer variable can hold. However, when using the  traditional
127         matching function, PCRE uses recursion to handle subpatterns and indef-         matching function, PCRE uses recursion to handle subpatterns and indef-
128         inite repetition.  This means that the available stack space may  limit         inite repetition.  This means that the available stack space may  limit
129         the size of a subject string that can be processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
130           For a discussion of stack issues, see the pcrestack documentation.
131    
132    
133  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
# Line 130  UTF-8 AND UNICODE PROPERTY SUPPORT Line 145  UTF-8 AND UNICODE PROPERTY SUPPORT
145    
146         If  you compile PCRE with UTF-8 support, but do not use it at run time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
147         the library will be a bit bigger, but the additional run time  overhead         the library will be a bit bigger, but the additional run time  overhead
148         is  limited  to testing the PCRE_UTF8 flag in several places, so should         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
149         not be very large.         very big.
150    
151         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
152         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
153         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
154         general  category  properties such as Lu for an upper case letter or Nd         general  category  properties such as Lu for an upper case letter or Nd
155         for a decimal number. A full list is given in the pcrepattern  documen-         for a decimal number, the Unicode script names such as Arabic  or  Han,
156         tation. The PCRE library is increased in size by about 90K when Unicode         and  the  derived  properties  Any  and L&. A full list is given in the
157         property support is included.         pcrepattern documentation. Only the short names for properties are sup-
158           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
159           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
160           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
161           does not support this.
162    
163         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
164    
# Line 155  UTF-8 AND UNICODE PROPERTY SUPPORT Line 174  UTF-8 AND UNICODE PROPERTY SUPPORT
174         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
175         crash.         crash.
176    
177         2. In a pattern, the escape sequence \x{...}, where the contents of the         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
178         braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8         two-byte UTF-8 character if the value is greater than 127.
        character whose code number is the given hexadecimal number, for  exam-  
        ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,  
        the item is not recognized.  This escape sequence can be used either as  
        a literal, or within a character class.  
179    
180         3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte         3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
181         UTF-8 character if the value is greater than 127.         characters for values greater than \177.
182    
183         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
184         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
# Line 187  UTF-8 AND UNICODE PROPERTY SUPPORT Line 202  UTF-8 AND UNICODE PROPERTY SUPPORT
202         8.  Similarly,  characters that match the POSIX named character classes         8.  Similarly,  characters that match the POSIX named character classes
203         are all low-valued characters.         are all low-valued characters.
204    
205         9. Case-insensitive matching applies only to  characters  whose  values         9. However, the Perl 5.10 horizontal and vertical  whitespace  matching
206           escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
207           acters.
208    
209           10. Case-insensitive matching applies only to characters  whose  values
210         are  less than 128, unless PCRE is built with Unicode property support.         are  less than 128, unless PCRE is built with Unicode property support.
211         Even when Unicode property support is available, PCRE  still  uses  its         Even when Unicode property support is available, PCRE  still  uses  its
212         own  character  tables when checking the case of low-valued characters,         own  character  tables when checking the case of low-valued characters,
213         so as not to degrade performance.  The Unicode property information  is         so as not to degrade performance.  The Unicode property information  is
214         used only for characters with higher values.         used only for characters with higher values. Even when Unicode property
215           support is available, PCRE supports case-insensitive matching only when
216           there  is  a  one-to-one  mapping between a letter's cases. There are a
217           small number of many-to-one mappings in Unicode;  these  are  not  sup-
218           ported by PCRE.
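
Items 2 and 3 above rest on the fact that every code point from 128 up to 0x7FF takes exactly two bytes in UTF-8. The following is a minimal sketch of that encoding step, not code from PCRE itself; the function name utf8_encode_small is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): encode a code point below
 * 0x800 as UTF-8. Returns the number of bytes written (1 or 2), or 0
 * if the value needs more than two bytes. */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {
        /* Values up to 127 (\177) are a single byte, as in ASCII. */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 110xxxxx 10xxxxxx: top 5 bits, then low 6 bits. */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;
}
```

With this sketch, the value 0xb3 becomes the two bytes C2 B3, and octal 777 (decimal 511) becomes C7 BF, which is why \xb3 and \777 match two-byte characters in UTF-8 mode.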
219    
220    
221  AUTHOR  AUTHOR
222    
223         Philip Hazel         Philip Hazel
224         University Computing Service,         University Computing Service
225         Cambridge CB2 3QG, England.         Cambridge CB2 3QH, England.
226    
227         Putting  an actual email address here seems to have been a spam magnet,         Putting  an actual email address here seems to have been a spam magnet,
228         so I've taken it away. If you want to email me, use my initial and sur-         so I've taken it away. If you want to email me, use  my  two  initials,
229         name, separated by a dot, at the domain ucs.cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
230    
231    
232    REVISION
233    
234  Last updated: 07 March 2005         Last updated: 30 July 2007
235  Copyright (c) 1997-2005 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
236  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
237    
238    
# Line 228  PCRE BUILD-TIME OPTIONS Line 254  PCRE BUILD-TIME OPTIONS
254    
255           ./configure --help           ./configure --help
256    
257         The following sections describe certain options whose names begin  with         The following sections include  descriptions  of  options  whose  names
258         --enable  or  --disable. These settings specify changes to the defaults         begin with --enable or --disable. These settings specify changes to the
259         for the configure command. Because of the  way  that  configure  works,         defaults for the configure command. Because of the way  that  configure
260         --enable  and  --disable  always  come  in  pairs, so the complementary         works,  --enable  and --disable always come in pairs, so the complemen-
261         option always exists as well, but as it specifies the  default,  it  is         tary option always exists as well, but as it specifies the default,  it
262         not described.         is not described.
263    
264    
265  C++ SUPPORT  C++ SUPPORT
# Line 272  UNICODE CHARACTER PROPERTY SUPPORT Line 298  UNICODE CHARACTER PROPERTY SUPPORT
298         to the configure command. This implies UTF-8 support, even if you  have         to the configure command. This implies UTF-8 support, even if you  have
299         not explicitly requested it.         not explicitly requested it.
300    
301         Including  Unicode  property  support  adds around 90K of tables to the         Including  Unicode  property  support  adds around 30K of tables to the
302         PCRE library, approximately doubling its size. Only the  general  cate-         PCRE library. Only the general category properties such as  Lu  and  Nd
303         gory  properties  such as Lu and Nd are supported. Details are given in         are supported. Details are given in the pcrepattern documentation.
        the pcrepattern documentation.  
304    
305    
306  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
307    
308         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By  default,  PCRE interprets character 10 (linefeed, LF) as indicating
309         ter. This is the normal newline character on Unix-like systems. You can         the end of a line. This is the normal newline  character  on  Unix-like
310         compile PCRE to use character 13 (carriage return) instead by adding         systems. You can compile PCRE to use character 13 (carriage return, CR)
311           instead, by adding
312    
313           --enable-newline-is-cr           --enable-newline-is-cr
314    
315         to the configure command. For completeness there is  also  a  --enable-         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
316         newline-is-lf  option,  which explicitly specifies linefeed as the new-         option, which explicitly specifies linefeed as the newline character.
317         line character.  
318           Alternatively, you can specify that line endings are to be indicated by
319           the two character sequence CRLF. If you want this, add
320    
321             --enable-newline-is-crlf
322    
323           to the configure command. There is a fourth option, specified by
324    
325             --enable-newline-is-anycrlf
326    
327           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
328           CRLF as indicating a line ending. Finally, a fifth option, specified by
329    
330             --enable-newline-is-any
331    
332           causes PCRE to recognize any Unicode newline sequence.
333    
334           Whatever line ending convention is selected when PCRE is built  can  be
335           overridden  when  the library functions are called. At build time it is
336           conventional to use the standard for your operating system.
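
The first four conventions described above can be sketched as a small function that reports how many characters form a line ending at a given position. This is an illustration only, not PCRE's internal logic; the enum and function names are assumptions, and the fifth ("any Unicode newline") convention is left out to keep the sketch short.

```c
#include <stddef.h>

/* Hypothetical names, chosen to mirror the configure options. */
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } newline_mode;

/* Return the length (0, 1 or 2) of the line ending that starts at
 * s[pos] under the given convention; 0 means "not a line ending". */
size_t newline_length(const char *s, size_t len, size_t pos,
                      newline_mode mode)
{
    int cr, lf, crlf;

    if (pos >= len)
        return 0;

    cr = (s[pos] == '\r');
    lf = (s[pos] == '\n');
    crlf = cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (mode) {
    case NL_CR:      return cr ? 1 : 0;
    case NL_LF:      return lf ? 1 : 0;
    case NL_CRLF:    return crlf ? 2 : 0;
    case NL_ANYCRLF: return crlf ? 2 : ((cr || lf) ? 1 : 0);
    }
    return 0;
}
```

Note how the same subject "a\r\nb" ends its first line after one character under the CR convention but after two under CRLF or ANYCRLF, which is why the build-time default should normally follow the operating system.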
337    
338    
339  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
# Line 319  POSIX MALLOC USAGE Line 364  POSIX MALLOC USAGE
364         to the configure command.         to the configure command.
365    
366    
 LIMITING PCRE RESOURCE USAGE  
   
        Internally,  PCRE has a function called match(), which it calls repeat-  
        edly  (possibly  recursively)  when  matching  a   pattern   with   the  
        pcre_exec()  function.  By controlling the maximum number of times this  
        function may be called during a single matching operation, a limit  can  
        be  placed  on  the resources used by a single call to pcre_exec(). The  
        limit can be changed at run time, as described in the pcreapi  documen-  
        tation.  The default is 10 million, but this can be changed by adding a  
        setting such as  
   
          --with-match-limit=500000  
   
        to  the  configure  command.  This  setting  has  no  effect   on   the  
        pcre_dfa_exec() matching function.  
   
   
367  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
368    
369         Within  a  compiled  pattern,  offset values are used to point from one         Within  a  compiled  pattern,  offset values are used to point from one
# Line 353  HANDLING VERY LARGE PATTERNS Line 381  HANDLING VERY LARGE PATTERNS
381         longer  offsets slows down the operation of PCRE because it has to load         longer  offsets slows down the operation of PCRE because it has to load
382         additional bytes when handling them.         additional bytes when handling them.
383    
        If you build PCRE with an increased link size, test 2 (and  test  5  if  
        you  are using UTF-8) will fail. Part of the output of these tests is a  
        representation of the compiled pattern, and this changes with the  link  
        size.  
   
384    
385  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
386    
387         When matching with the pcre_exec() function, PCRE implements backtrack-         When matching with the pcre_exec() function, PCRE implements backtrack-
388         ing by making recursive calls to an internal function  called  match().         ing  by  making recursive calls to an internal function called match().
389         In  environments  where  the size of the stack is limited, this can se-         In environments where the size of the stack is limited,  this  can  se-
390         verely limit PCRE's operation. (The Unix environment does  not  usually         verely  limit  PCRE's operation. (The Unix environment does not usually
391         suffer  from  this  problem.)  An alternative approach that uses memory         suffer from this problem, but it may sometimes be necessary to increase
392         from the heap to remember data, instead  of  using  recursive  function         the  maximum  stack size.  There is a discussion in the pcrestack docu-
393         calls,  has been implemented to work round this problem. If you want to         mentation.) An alternative approach to recursion that uses memory  from
394         build a version of PCRE that works this way, add         the  heap  to remember data, instead of using recursive function calls,
395           has been implemented to work round the problem of limited  stack  size.
396           If you want to build a version of PCRE that works this way, add
397    
398           --disable-stack-for-recursion           --disable-stack-for-recursion
399    
400         to the configure command. With this configuration, PCRE  will  use  the         to  the  configure  command. With this configuration, PCRE will use the
401         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
402         ment functions. Separate functions are provided because  the  usage  is         ment  functions. By default these point to malloc() and free(), but you
403         very  predictable:  the  block sizes requested are always the same, and         can replace the pointers so that your own functions are used.
404         the blocks are always freed in reverse order. A calling  program  might  
405         be  able  to implement optimized functions that perform better than the         Separate functions are  provided  rather  than  using  pcre_malloc  and
406         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more         pcre_free  because  the  usage  is  very  predictable:  the block sizes
407         slowly when built in this way. This option affects only the pcre_exec()         requested are always the same, and  the  blocks  are  always  freed  in
408         function; it is not relevant for the pcre_dfa_exec() function.         reverse  order.  A calling program might be able to implement optimized
409           functions that perform better  than  malloc()  and  free().  PCRE  runs
410           noticeably more slowly when built in this way. This option affects only
411           the  pcre_exec()  function;  it  is  not  relevant  for  the
412           pcre_dfa_exec() function.
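
Because the blocks are all the same size and are freed in reverse order, a replacement allocator can be as simple as a fixed pool used as a stack. The sketch below is one possible shape for such functions, under stated assumptions: the names frame_alloc and frame_free, the pool depth, and the per-block byte bound are all hypothetical, and the code is not thread-safe. It falls back to malloc() and free() when the pool is full or a request is too large.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define FRAME_POOL_SLOTS 64    /* hypothetical pool depth */
#define FRAME_POOL_BYTES 1024  /* hypothetical bound on one block */

/* The union keeps each slot suitably aligned for common types. */
union frame_slot {
    long double ld;
    long long ll;
    void *p;
    unsigned char bytes[FRAME_POOL_BYTES];
};

static union frame_slot frame_pool[FRAME_POOL_SLOTS];
static size_t frame_pool_top = 0;   /* index of the next free slot */

static int frame_in_pool(void *block)
{
    uintptr_t b = (uintptr_t)block;
    uintptr_t lo = (uintptr_t)frame_pool;
    return b >= lo && b < lo + sizeof frame_pool;
}

void *frame_alloc(size_t size)
{
    if (size <= FRAME_POOL_BYTES && frame_pool_top < FRAME_POOL_SLOTS)
        return frame_pool[frame_pool_top++].bytes;
    return malloc(size);            /* pool full or block too large */
}

void frame_free(void *block)
{
    if (block == NULL)
        return;
    if (frame_in_pool(block)) {
        /* Relies on blocks being freed in reverse order. */
        if (frame_pool_top > 0)
            frame_pool_top--;
    } else {
        free(block);
    }
}

/* In a program linked against a PCRE built with
 * --disable-stack-for-recursion, the hooks would be installed with:
 *     pcre_stack_malloc = frame_alloc;
 *     pcre_stack_free = frame_free;
 */
```

Whether such a pool actually beats malloc() and free() depends on the platform, so it is worth measuring before adopting it.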
413    
414    
415    LIMITING PCRE RESOURCE USAGE
416    
417           Internally,  PCRE has a function called match(), which it calls repeat-
418           edly  (sometimes  recursively)  when  matching  a  pattern   with   the
419           pcre_exec()  function.  By controlling the maximum number of times this
420           function may be called during a single matching operation, a limit  can
421           be  placed  on  the resources used by a single call to pcre_exec(). The
422           limit can be changed at run time, as described in the pcreapi  documen-
423           tation.  The default is 10 million, but this can be changed by adding a
424           setting such as
425    
426             --with-match-limit=500000
427    
428           to  the  configure  command.  This  setting  has  no  effect   on   the
429           pcre_dfa_exec() matching function.
430    
431           In  some  environments  it is desirable to limit the depth of recursive
432           calls of match() more strictly than the total number of calls, in order
433           to  restrict  the maximum amount of stack (or heap, if --disable-stack-
434           for-recursion is specified) that is used. A second limit controls this;
435           it  defaults  to  the  value  that is set for --with-match-limit, which
436           imposes no additional constraints. However, you can set a  lower  limit
437           by adding, for example,
438    
439             --with-match-limit-recursion=10000
440    
441           to  the  configure  command.  This  value can also be overridden at run
442           time.
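
The difference between the two limits can be illustrated with a toy backtracking matcher that counts both the total number of calls and the current recursion depth. This is a sketch only: it is not PCRE's match() function, it supports nothing but literal characters and a "*" wildcard, and every name in it is an assumption.

```c
#include <stddef.h>

enum { TOY_NOMATCH = 0, TOY_MATCH = 1,
       TOY_ERR_CALLS = -1, TOY_ERR_DEPTH = -2 };

typedef struct {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* analogous to the match limit */
    unsigned depth_limit;       /* analogous to the recursion limit */
} toy_limits;

static int toy_match_here(const char *p, const char *s,
                          unsigned depth, toy_limits *lim)
{
    if (++lim->calls > lim->call_limit)
        return TOY_ERR_CALLS;
    if (depth > lim->depth_limit)
        return TOY_ERR_DEPTH;

    if (*p == '\0')
        return *s == '\0' ? TOY_MATCH : TOY_NOMATCH;

    if (*p == '*') {
        /* Try every possible length for the star, shortest first. */
        for (;; s++) {
            int rc = toy_match_here(p + 1, s, depth + 1, lim);
            if (rc != TOY_NOMATCH)
                return rc;      /* a match, or a limit was hit */
            if (*s == '\0')
                return TOY_NOMATCH;
        }
    }

    if (*s != '\0' && *p == *s)
        return toy_match_here(p + 1, s + 1, depth + 1, lim);
    return TOY_NOMATCH;
}

int toy_match(const char *pattern, const char *subject,
              unsigned long call_limit, unsigned depth_limit)
{
    toy_limits lim;
    lim.calls = 0;
    lim.call_limit = call_limit;
    lim.depth_limit = depth_limit;
    return toy_match_here(pattern, subject, 0, &lim);
}
```

Matching "a*b" against "aXXb" in this toy takes six calls but never nests more than three deep, so a call limit bounds the total work while a depth limit bounds only the stack (or heap) in use at any one moment.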
443    
444    
445    CREATING CHARACTER TABLES AT BUILD TIME
446    
447           PCRE uses fixed tables for processing characters whose code values  are
448           less  than 256. By default, PCRE is built with a set of tables that are
449           distributed in the file pcre_chartables.c.dist. These  tables  are  for
450           ASCII codes only. If you add
451    
452             --enable-rebuild-chartables
453    
454           to  the  configure  command, the distributed tables are no longer used.
455           Instead, a program called dftables is compiled and  run.  This  outputs
456           the source for a new set of tables, created in the default locale of your
457           C runtime system. (This method of replacing the tables does not work if
458           you  are cross compiling, because dftables is run on the local host. If
459           you need to create alternative tables when cross  compiling,  you  will
460           have to do so "by hand".)
461    
462    
463  USING EBCDIC CODE  USING EBCDIC CODE
464    
465         PCRE assumes by default that it will run in an  environment  where  the         PCRE  assumes  by  default that it will run in an environment where the
466         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
467         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by         This  is  the  case for most computer operating systems. PCRE can, how-
468         adding         ever, be compiled to run in an EBCDIC environment by adding
469    
470           --enable-ebcdic           --enable-ebcdic
471    
472         to the configure command.         to the configure command. This setting implies --enable-rebuild-charta-
473           bles.  You  should  only  use  it if you know that you are in an EBCDIC
474           environment (for example, an IBM mainframe operating system).
475    
476    
477    SEE ALSO
478    
479           pcreapi(3), pcre_config(3).
480    
481  Last updated: 15 August 2005  
482  Copyright (c) 1997-2005 University of Cambridge.  AUTHOR
483    
484           Philip Hazel
485           University Computing Service
486           Cambridge CB2 3QH, England.
487    
488    
489    REVISION
490    
491           Last updated: 30 July 2007
492           Copyright (c) 1997-2007 University of Cambridge.
493  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
494    
495    
# Line 431  PCRE MATCHING ALGORITHMS Line 525  PCRE MATCHING ALGORITHMS
525           <something> <something else> <something further>           <something> <something else> <something further>
526    
527         there are three possible answers. The standard algorithm finds only one         there are three possible answers. The standard algorithm finds only one
528         of them, whereas the DFA algorithm finds all three.         of them, whereas the alternative algorithm finds all three.
529    
530    
531  REGULAR EXPRESSIONS AS TREES  REGULAR EXPRESSIONS AS TREES
# Line 440  REGULAR EXPRESSIONS AS TREES Line 534  REGULAR EXPRESSIONS AS TREES
534         resented  as  a  tree structure. An unlimited repetition in the pattern         resented  as  a  tree structure. An unlimited repetition in the pattern
535         makes the tree of infinite size, but it is still a tree.  Matching  the         makes the tree of infinite size, but it is still a tree.  Matching  the
536         pattern  to a given subject string (from a given starting point) can be         pattern  to a given subject string (from a given starting point) can be
537         thought of as a search of the tree.  There are  two  standard  ways  to         thought of as a search of the tree.  There are two  ways  to  search  a
538         search  a  tree: depth-first and breadth-first, and these correspond to         tree:  depth-first  and  breadth-first, and these correspond to the two
539         the two matching algorithms provided by PCRE.         matching algorithms provided by PCRE.
540    
541    
542  THE STANDARD MATCHING ALGORITHM  THE STANDARD MATCHING ALGORITHM
543    
544         In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-         In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
545         sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a         sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
546         depth-first search of the pattern tree. That is, it  proceeds  along  a         depth-first search of the pattern tree. That is, it  proceeds  along  a
547         single path through the tree, checking that the subject matches what is         single path through the tree, checking that the subject matches what is
548         required. When there is a mismatch, the algorithm  tries  any  alterna-         required. When there is a mismatch, the algorithm  tries  any  alterna-
# Line 472  THE STANDARD MATCHING ALGORITHM Line 566  THE STANDARD MATCHING ALGORITHM
566         This provides support for capturing parentheses and back references.         This provides support for capturing parentheses and back references.
567    
568    
569  THE DFA MATCHING ALGORITHM  THE ALTERNATIVE MATCHING ALGORITHM
570    
571         DFA stands for "deterministic finite automaton", but you do not need to         This algorithm conducts a breadth-first search of  the  tree.  Starting
572         understand the origins of that name. This algorithm conducts a breadth-         from  the  first  matching  point  in the subject, it scans the subject
573         first search of the tree. Starting from the first matching point in the         string from left to right, once, character by character, and as it does
574         subject,  it scans the subject string from left to right, once, charac-         this,  it remembers all the paths through the tree that represent valid
575         ter by character, and as it does  this,  it  remembers  all  the  paths         matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
576         through the tree that represent valid matches.         though  it is not implemented as a traditional finite state machine (it
577           keeps multiple states active simultaneously).
578         The  scan  continues until either the end of the subject is reached, or  
579         there are no more unterminated paths. At this point,  terminated  paths         The scan continues until either the end of the subject is  reached,  or
580         represent  the different matching possibilities (if there are none, the         there  are  no more unterminated paths. At this point, terminated paths
581         match has failed).  Thus, if there is more  than  one  possible  match,         represent the different matching possibilities (if there are none,  the
582           match  has  failed).   Thus,  if there is more than one possible match,
583         this algorithm finds all of them, and in particular, it finds the long-         this algorithm finds all of them, and in particular, it finds the long-
584         est. In PCRE, there is an option to stop the algorithm after the  first         est.  In PCRE, there is an option to stop the algorithm after the first
585         match (which is necessarily the shortest) has been found.         match (which is necessarily the shortest) has been found.
586    
587         Note that all the matches that are found start at the same point in the         Note that all the matches that are found start at the same point in the
# Line 494  THE DFA MATCHING ALGORITHM Line 589  THE DFA MATCHING ALGORITHM
589    
590           cat(er(pillar)?)           cat(er(pillar)?)
591    
592         is matched against the string "the caterpillar catchment",  the  result         is  matched  against the string "the caterpillar catchment", the result
593         will  be the three strings "cat", "cater", and "caterpillar" that start         will be the three strings "cat", "cater", and "caterpillar" that  start
594         at the fifth character of the subject. The algorithm does not automat-         at the fifth character of the subject. The algorithm does not automat-
595         ically move on to find matches that start at later positions.         ically move on to find matches that start at later positions.
596    
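The "all matches at one starting point" behaviour described above can be modelled in a few lines of plain C. This is only a toy sketch: it makes no PCRE calls, the helper name matches_at is hypothetical, and the three strings that cat(er(pillar)?) can match are written out by hand rather than derived from a compiled pattern.

```c
#include <stddef.h>
#include <string.h>

/* Toy model only: the three strings that cat(er(pillar)?) can match,
   written out by hand. This is not how PCRE represents a pattern. */
static const char *const expansions[] = { "cat", "cater", "caterpillar" };

/* Hypothetical helper: store in lengths[] the length of every expansion
   that matches subject at offset start, and return how many were found.
   The scan never moves on to a later starting offset. */
static size_t matches_at(const char *subject, size_t start,
                         size_t lengths[3])
{
    size_t found = 0;
    size_t i;
    size_t subject_len = strlen(subject);

    if (start > subject_len)
        return 0;
    for (i = 0; i < 3; i++) {
        size_t n = strlen(expansions[i]);
        /* strncmp stops at the subject's terminating zero, so a short
           tail is handled safely. */
        if (strncmp(subject + start, expansions[i], n) == 0)
            lengths[found++] = n;
    }
    return found;
}
```

Called with the subject "the caterpillar catchment" and a zero-based start offset of 4, the helper reports three matches; finding the "cat" in "catchment" needs a separate call with a later start offset, which mirrors the behaviour described in the text.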
597         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
598         supported by the DFA matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
599    
600         1. Because the algorithm finds all  possible  matches,  the  greedy  or         1.  Because  the  algorithm  finds  all possible matches, the greedy or
601         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
602         ungreedy quantifiers are treated in exactly the same way.         ungreedy quantifiers are treated in exactly the same way. However, pos-
603           sessive quantifiers can make a difference when what follows could  also
604           match what is quantified, for example in a pattern like this:
605    
606             ^a++\w!
607    
608           This  pattern matches "aaab!" but not "aaa!", which would be matched by
609           a non-possessive quantifier. Similarly, if an atomic group is  present,
610           it  is matched as if it were a standalone pattern at the current point,
611           and the longest match is then "locked in" for the rest of  the  overall
612           pattern.
613    
614         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
615         is  not  straightforward  to  keep track of captured substrings for the         is not straightforward to keep track of  captured  substrings  for  the
616         different matching possibilities, and  PCRE's  implementation  of  this         different  matching  possibilities,  and  PCRE's implementation of this
617         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
618         strings are available.         strings are available.
619    
620         3. Because no substrings are captured, back references within the  pat-         3.  Because no substrings are captured, back references within the pat-
621         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
622    
623         4.  For  the same reason, conditional expressions that use a backrefer-         4. For the same reason, conditional expressions that use  a  backrefer-
624         ence as the condition are not supported.         ence  as  the  condition or test for a specific group recursion are not
625           supported.
626    
627         5. Callouts are supported, but the value of the  capture_top  field  is         5. Because many paths through the tree may be  active,  the  \K  escape
628           sequence, which resets the start of the match when encountered (but may
629           be on some paths and not on others), is not  supported.  It  causes  an
630           error if encountered.
631    
632           6.  Callouts  are  supported, but the value of the capture_top field is
633         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
634    
635         6.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The \C escape sequence, which (in the standard algorithm) matches a
636         single byte, even in UTF-8 mode, is not supported because the DFA algo-         single  byte, even in UTF-8 mode, is not supported because the alterna-
637         rithm moves through the subject string one character at a time, for all         tive algorithm moves through the subject  string  one  character  at  a
638         active paths through the tree.         time, for all active paths through the tree.
639    
640    
641  ADVANTAGES OF THE DFA ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
642    
643         Using the DFA matching algorithm provides the following advantages:         Using  the alternative matching algorithm provides the following advan-
644           tages:
645    
646         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
647         ically  found,  and  in particular, the longest match is found. To find         ically  found,  and  in particular, the longest match is found. To find
# Line 538  ADVANTAGES OF THE DFA ALGORITHM Line 650  ADVANTAGES OF THE DFA ALGORITHM
650    
651         2.  There is much better support for partial matching. The restrictions         2.  There is much better support for partial matching. The restrictions
652         on the content of the pattern that apply when using the standard  algo-         on the content of the pattern that apply when using the standard  algo-
653         rithm  for partial matching do not apply to the DFA algorithm. For non-         rithm  for  partial matching do not apply to the alternative algorithm.
654         anchored patterns, the starting position of a partial match  is  avail-         For non-anchored patterns, the starting position of a partial match  is
655         able.         available.
656    
657         3.  Because  the  DFA algorithm scans the subject string just once, and         3.  Because  the  alternative  algorithm  scans the subject string just
658         never needs to backtrack, it is possible  to  pass  very  long  subject         once, and never needs to backtrack, it is possible to  pass  very  long
659         strings  to  the matching function in several pieces, checking for par-         subject  strings  to  the matching function in several pieces, checking
660         tial matching each time.         for partial matching each time.
661    
662    
663  DISADVANTAGES OF THE DFA ALGORITHM  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
664    
665         The DFA algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
666    
667         1. It is substantially slower than  the  standard  algorithm.  This  is         1. It is substantially slower than  the  standard  algorithm.  This  is
668         partly  because  it has to search for all possible matches, but is also         partly  because  it has to search for all possible matches, but is also
# Line 558  DISADVANTAGES OF THE DFA ALGORITHM Line 670  DISADVANTAGES OF THE DFA ALGORITHM
670    
671         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
672    
673         3. The "atomic group" feature of PCRE regular expressions is supported,         3. Although atomic groups are supported, their use does not provide the
674         but  does not provide the advantage that it does for the standard algo-         performance advantage that it does for the standard algorithm.
675         rithm.  
676    
677    AUTHOR
678    
679           Philip Hazel
680           University Computing Service
681           Cambridge CB2 3QH, England.
682    
683    
684  Last updated: 28 February 2005  REVISION
685  Copyright (c) 1997-2005 University of Cambridge.  
686           Last updated: 29 May 2007
687           Copyright (c) 1997-2007 University of Cambridge.
688  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
689    
690    
# Line 616  PCRE NATIVE API Line 737  PCRE NATIVE API
737         int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
738              const char *name);              const char *name);
739    
740           int pcre_get_stringtable_entries(const pcre *code,
741                const char *name, char **first, char **last);
742    
743         int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
744              int stringcount, int stringnumber,              int stringcount, int stringnumber,
745              const char **stringptr);              const char **stringptr);
# Line 654  PCRE NATIVE API Line 778  PCRE NATIVE API
778  PCRE API OVERVIEW  PCRE API OVERVIEW
779    
780         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
781         is also a set of wrapper functions that correspond to the POSIX regular         are also some wrapper functions that correspond to  the  POSIX  regular
782         expression  API.  These  are  described in the pcreposix documentation.         expression  API.  These  are  described in the pcreposix documentation.
783         Both of these APIs define a set of C function calls. A C++  wrapper  is         Both of these APIs define a set of C function calls. A C++  wrapper  is
784         distributed with PCRE. It is documented in the pcrecpp page.         distributed with PCRE. It is documented in the pcrecpp page.
# Line 676  PCRE API OVERVIEW Line 800  PCRE API OVERVIEW
800    
801         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
802         ble,  is  also provided. This uses a different algorithm for the match-         ble,  is  also provided. This uses a different algorithm for the match-
803         ing. This allows it to find all possible matches (at a given  point  in         ing. The alternative algorithm finds all possible matches (at  a  given
804         the  subject),  not  just  one. However, this algorithm does not return         point  in  the subject), and scans the subject just once. However, this
805         captured substrings. A description of the two matching  algorithms  and         algorithm does not return captured substrings. A description of the two
806         their  advantages  and disadvantages is given in the pcrematching docu-         matching  algorithms and their advantages and disadvantages is given in
807         mentation.         the pcrematching documentation.
808    
809         In addition to the main compiling and  matching  functions,  there  are         In addition to the main compiling and  matching  functions,  there  are
810         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
# Line 692  PCRE API OVERVIEW Line 816  PCRE API OVERVIEW
816           pcre_get_named_substring()           pcre_get_named_substring()
817           pcre_get_substring_list()           pcre_get_substring_list()
818           pcre_get_stringnumber()           pcre_get_stringnumber()
819             pcre_get_stringtable_entries()
820    
821         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
822         to free the memory used for extracted strings.         to free the memory used for extracted strings.
# Line 723  PCRE API OVERVIEW Line 848  PCRE API OVERVIEW
848         indirections  to  memory  management functions. These special functions         indirections  to  memory  management functions. These special functions
849         are used only when PCRE is compiled to use  the  heap  for  remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
850         data, instead of recursive function calls, when running the pcre_exec()         data, instead of recursive function calls, when running the pcre_exec()
851         function. This is a non-standard way of building PCRE, for use in envi-         function. See the pcrebuild documentation for  details  of  how  to  do
852         ronments that have limited stacks. Because of the greater use of memory         this.  It  is  a non-standard way of building PCRE, for use in environ-
853         management, it runs more slowly.  Separate functions  are  provided  so         ments that have limited stacks. Because of the greater  use  of  memory
854         that  special-purpose  external  code  can  be used for this case. When         management,  it  runs  more  slowly. Separate functions are provided so
855         used, these functions are always called in a  stack-like  manner  (last         that special-purpose external code can be  used  for  this  case.  When
856         obtained,  first freed), and always for memory blocks of the same size.         used,  these  functions  are always called in a stack-like manner (last
857           obtained, first freed), and always for memory blocks of the same  size.
858           There  is  a discussion about PCRE's stack usage in the pcrestack docu-
859           mentation.
860    
861         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
862         by  the  caller  to  a "callout" function, which PCRE will then call at         by  the  caller  to  a "callout" function, which PCRE will then call at
# Line 736  PCRE API OVERVIEW Line 864  PCRE API OVERVIEW
864         pcrecallout documentation.         pcrecallout documentation.
865    
866    
867    NEWLINES
868    
869           PCRE  supports five different conventions for indicating line breaks in
870           strings: a single CR (carriage return) character, a  single  LF  (line-
871           feed) character, the two-character sequence CRLF, any of the three pre-
872           ceding, or any Unicode newline sequence. The Unicode newline  sequences
873           are  the  three just mentioned, plus the single characters VT (vertical
874           tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
875           separator, U+2028), and PS (paragraph separator, U+2029).
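As an illustration of the "any Unicode newline sequence" convention, the list above can be written as a small classifier. This is a sketch under stated assumptions, not PCRE's own code: the function name unicode_newline_length is hypothetical, and the input is assumed to be an array of already-decoded code points.

```c
#include <stddef.h>

/* Hypothetical helper: if a newline sequence from the "any Unicode
   newline" list starts at cp[0], return how many code points it
   occupies (2 for CRLF, 1 for the others); otherwise return 0.
   cp holds already-decoded code points; len is how many are available. */
static size_t unicode_newline_length(const unsigned long *cp, size_t len)
{
    if (len == 0)
        return 0;
    switch (cp[0]) {
    case 0x000D: /* CR, possibly followed by LF */
        return (len > 1 && cp[1] == 0x000A) ? 2 : 1;
    case 0x000A: /* LF */
    case 0x000B: /* VT */
    case 0x000C: /* FF */
    case 0x0085: /* NEL */
    case 0x2028: /* LS */
    case 0x2029: /* PS */
        return 1;
    default:
        return 0;
    }
}
```

Treating CR followed by LF as a single two-character sequence, rather than as two separate line breaks, is the reason the helper returns a length instead of a yes/no answer.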
876    
877           Each  of  the first three conventions is used by at least one operating
878           system as its standard newline sequence. When PCRE is built, a  default
879           can  be  specified.  The default default is LF, which is the Unix stan-
880           dard. When PCRE is run, the default can be overridden,  either  when  a
881           pattern is compiled, or when it is matched.
882    
883           In the PCRE documentation the word "newline" is used to mean "the char-
884           acter or pair of characters that indicate a line break". The choice  of
885           newline  convention  affects  the  handling of the dot, circumflex, and
886           dollar metacharacters, the handling of #-comments in /x mode, and, when
887           CRLF  is a recognized line ending sequence, the match position advance-
888           ment for a non-anchored pattern. The choice of newline convention  does
889           not affect the interpretation of the \n or \r escape sequences.
890    
891    
892  MULTITHREADING  MULTITHREADING
893    
894         The  PCRE  functions  can be used in multi-threading applications, with         The  PCRE  functions  can be used in multi-threading applications, with
# Line 753  SAVING PRECOMPILED PATTERNS FOR LATER US Line 906  SAVING PRECOMPILED PATTERNS FOR LATER US
906         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
907         later time, possibly by a different program, and even on a  host  other         later time, possibly by a different program, and even on a  host  other
908         than  the  one  on  which  it  was  compiled.  Details are given in the         than  the  one  on  which  it  was  compiled.  Details are given in the
909         pcreprecompile documentation.         pcreprecompile documentation. However, compiling a  regular  expression
910           with  one version of PCRE for use with a different version is not guar-
911           anteed to work and may cause crashes.
912    
913    
914  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
# Line 782  CHECKING BUILD-TIME OPTIONS Line 937  CHECKING BUILD-TIME OPTIONS
937    
938           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
939    
940         The output is an integer that is set to the value of the code  that  is         The output is an integer whose value specifies  the  default  character
941         used  for the newline character. It is either linefeed (10) or carriage         sequence  that is recognized as meaning "newline". The five values that
942         return (13), and should normally be the  standard  character  for  your         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
943         operating system.         and  -1  for  ANY. The default should normally be the standard sequence
944           for your operating system.
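The values listed for PCRE_CONFIG_NEWLINE can be turned into readable names with a small lookup. The sketch below uses only the numbers given in the text; the helper name newline_config_name is hypothetical. In a real program the integer would be obtained by calling pcre_config() with PCRE_CONFIG_NEWLINE, a call that is deliberately left out here so that the fragment does not depend on the library.

```c
/* Hypothetical helper: map the integer reported for PCRE_CONFIG_NEWLINE
   to a name, using only the values listed in the text above. */
static const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";    /* 13 * 256 + 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}
```

Note that 3338 is 13 * 256 + 10, that is, the codes for CR and LF packed into one integer.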
945    
946           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
947    
948         The  output  is  an  integer that contains the number of bytes used for         The output is an integer that contains the number  of  bytes  used  for
949         internal linkage in compiled regular expressions. The value is 2, 3, or         internal linkage in compiled regular expressions. The value is 2, 3, or
950         4.  Larger  values  allow larger regular expressions to be compiled, at         4. Larger values allow larger regular expressions to  be  compiled,  at
951         the expense of slower matching. The default value of  2  is  sufficient         the  expense  of  slower matching. The default value of 2 is sufficient
952         for  all  but  the  most massive patterns, since it allows the compiled         for all but the most massive patterns, since  it  allows  the  compiled
953         pattern to be up to 64K in size.         pattern to be up to 64K in size.
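The 64K figure for a link size of 2 is the number of distinct offsets that fit in two bytes. The sketch below extends that arithmetic to the other permitted values. It assumes that the limit is simply 256 raised to the link size, which the text above confirms only for the value 2; the helper name link_size_limit is hypothetical.

```c
/* Hypothetical helper: number of distinct offsets representable in
   link_size bytes (256 to the power link_size). Returns 0 for values
   outside the documented range 2..4. Assumes a C99 compiler, because
   the result for 4 does not fit in 32 bits. */
static unsigned long long link_size_limit(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return 1ULL << (8 * link_size);
}
```

Under that assumption a link size of 2 gives 65536 (64K), and a link size of 3 gives 16777216.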
954    
955           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
956    
957         The output is an integer that contains the threshold  above  which  the         The  output  is  an integer that contains the threshold above which the
958         POSIX  interface  uses malloc() for output vectors. Further details are         POSIX interface uses malloc() for output vectors. Further  details  are
959         given in the pcreposix documentation.         given in the pcreposix documentation.
960    
961           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
962    
963         The output is an integer that gives the default limit for the number of         The output is an integer that gives the default limit for the number of
964         internal  matching  function  calls in a pcre_exec() execution. Further         internal matching function calls in a  pcre_exec()  execution.  Further
965         details are given with pcre_exec() below.         details are given with pcre_exec() below.
966    
967             PCRE_CONFIG_MATCH_LIMIT_RECURSION
968    
969           The  output is an integer that gives the default limit for the depth of
970           recursion when calling the internal matching function in a  pcre_exec()
971           execution. Further details are given with pcre_exec() below.
972    
973           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
974    
975         The output is an integer that is set to one if internal recursion  when         The  output is an integer that is set to one if internal recursion when
976         running pcre_exec() is implemented by recursive function calls that use         running pcre_exec() is implemented by recursive function calls that use
977         the stack to remember their state. This is the usual way that  PCRE  is         the  stack  to remember their state. This is the usual way that PCRE is
978         compiled. The output is zero if PCRE was compiled to use blocks of data         compiled. The output is zero if PCRE was compiled to use blocks of data
979         on the  heap  instead  of  recursive  function  calls.  In  this  case,         on  the  heap  instead  of  recursive  function  calls.  In  this case,
980         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
981         blocks on the heap, thus avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
982    
983    
# Line 832  COMPILING A PATTERN Line 994  COMPILING A PATTERN
994    
995         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
996         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
997         the two interfaces is that pcre_compile2() has an additional  argument,         the  two interfaces is that pcre_compile2() has an additional argument,
998         errorcodeptr, via which a numerical error code can be returned.         errorcodeptr, via which a numerical error code can be returned.
999    
1000         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
1001         the pattern argument. A pointer to a single block  of  memory  that  is         the  pattern  argument.  A  pointer to a single block of memory that is
1002         obtained  via  pcre_malloc is returned. This contains the compiled code         obtained via pcre_malloc is returned. This contains the  compiled  code
1003         and related data. The pcre type is defined for the returned block; this         and related data. The pcre type is defined for the returned block; this
1004         is a typedef for a structure whose contents are not externally defined.         is a typedef for a structure whose contents are not externally defined.
1005         It is up to the caller  to  free  the  memory  when  it  is  no  longer         It is up to the caller to free the memory (via pcre_free) when it is no
1006         required.         longer required.
1007    
1008         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although the compiled code of a PCRE regex is relocatable, that is,  it
1009         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
1010         fully  relocatable, because it may contain a copy of the tableptr argu-         fully relocatable, because it may contain a copy of the tableptr  argu-
1011         ment, which is an address (see below).         ment, which is an address (see below).
1012    
1013         The options argument contains independent bits that affect the compila-         The options argument contains various bit settings that affect the com-
1014         tion.  It  should  be  zero  if  no options are required. The available         pilation. It should be zero if no options are required.  The  available
1015         options are described below. Some of them, in  particular,  those  that         options  are  described  below. Some of them, in particular, those that
1016         are  compatible  with  Perl,  can also be set and unset from within the         are compatible with Perl, can also be set and  unset  from  within  the
1017         pattern (see the detailed description  in  the  pcrepattern  documenta-         pattern  (see  the  detailed  description in the pcrepattern documenta-
1018         tion).  For  these options, the contents of the options argument speci-         tion). For these options, the contents of the options  argument  speci-
1019         fies their initial settings at the start of compilation and  execution.         fies  their initial settings at the start of compilation and execution.
1020         The  PCRE_ANCHORED option can be set at the time of matching as well as         The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the  time
1021         at compile time.         of matching as well as at compile time.
1022    
1023         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1024         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
1025         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1026         sage.  The  offset from the start of the pattern to the character where         sage. This is a static string that is part of the library. You must not
1027         the error was discovered is  placed  in  the  variable  pointed  to  by         try to free it. The offset from the start of the pattern to the charac-
1028         erroffset,  which  must  not  be  NULL. If it is, an immediate error is         ter where the error was discovered is placed in the variable pointed to
1029           by  erroffset,  which must not be NULL. If it is, an immediate error is
1030         given.         given.
1031    
1032         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
# Line 926  COMPILING A PATTERN Line 1089  COMPILING A PATTERN
1089    
1090         If this bit is set, a dollar metacharacter in the pattern matches  only         If this bit is set, a dollar metacharacter in the pattern matches  only
1091         at  the  end  of the subject string. Without this option, a dollar also         at  the  end  of the subject string. Without this option, a dollar also
1092         matches immediately before the final character if it is a newline  (but         matches immediately before a newline at the end of the string (but  not
1093         not  before  any  other  newlines).  The  PCRE_DOLLAR_ENDONLY option is         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
1094         ignored if PCRE_MULTILINE is set. There is no equivalent to this option         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
1095         in Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
1096    
1097           PCRE_DOTALL           PCRE_DOTALL
1098    
1099         If this bit is set, a dot metacharacter in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches all char-
1100         acters, including newlines. Without it,  newlines  are  excluded.  This         acters, including those that indicate newline. Without it, a  dot  does
1101         option  is equivalent to Perl's /s option, and it can be changed within         not  match  when  the  current position is at a newline. This option is
1102         a pattern by a (?s) option setting.  A  negative  class  such  as  [^a]         equivalent to Perl's /s option, and it can be changed within a  pattern
1103         always  matches a newline character, independent of the setting of this         by  a (?s) option setting. A negative class such as [^a] always matches
1104         option.         newline characters, independent of the setting of this option.
1105    
1106             PCRE_DUPNAMES
1107    
1108           If this bit is set, names used to identify capturing  subpatterns  need
1109           not be unique. This can be helpful for certain types of pattern when it
1110           is known that only one instance of the named  subpattern  can  ever  be
1111           matched.  There  are  more details of named subpatterns below; see also
1112           the pcrepattern documentation.
1113    
1114           PCRE_EXTENDED           PCRE_EXTENDED
1115    
# Line 946  COMPILING A PATTERN Line 1117  COMPILING A PATTERN
1117         totally ignored except when escaped or inside a character class. White-         totally ignored except when escaped or inside a character class. White-
1118         space does not include the VT character (code 11). In addition, charac-         space does not include the VT character (code 11). In addition, charac-
1119         ters between an unescaped # outside a character class and the next new-         ters between an unescaped # outside a character class and the next new-
1120         line character, inclusive, are also  ignored.  This  is  equivalent  to         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x
1121         Perl's  /x  option,  and  it  can be changed within a pattern by a (?x)         option,  and  it  can be changed within a pattern by a (?x) option set-
1122         option setting.         ting.
1123    
1124         This option makes it possible to include  comments  inside  complicated         This option makes it possible to include  comments  inside  complicated
1125         patterns.   Note,  however,  that this applies only to data characters.         patterns.   Note,  however,  that this applies only to data characters.
# Line 964  COMPILING A PATTERN Line 1135  COMPILING A PATTERN
1135         letter  that  has  no  special  meaning causes an error, thus reserving         letter  that  has  no  special  meaning causes an error, thus reserving
1136         these combinations for future expansion. By  default,  as  in  Perl,  a         these combinations for future expansion. By  default,  as  in  Perl,  a
1137         backslash  followed by a letter with no special meaning is treated as a         backslash  followed by a letter with no special meaning is treated as a
1138         literal. There are at present no  other  features  controlled  by  this         literal. (Perl can, however, be persuaded to give a warning for  this.)
1139         option. It can also be set by a (?X) option setting within a pattern.         There  are  at  present no other features controlled by this option. It
1140           can also be set by a (?X) option setting within a pattern.
1141    
1142           PCRE_FIRSTLINE           PCRE_FIRSTLINE
1143    
1144         If  this  option  is  set,  an  unanchored pattern is required to match         If this option is set, an  unanchored  pattern  is  required  to  match
1145         before or at the first newline character in the subject string,  though         before  or  at  the  first  newline  in  the subject string, though the
1146         the matched text may continue over the newline.         matched text may continue over the newline.
1147    
1148           PCRE_MULTILINE           PCRE_MULTILINE
1149    
1150         By  default,  PCRE  treats the subject string as consisting of a single         By default, PCRE treats the subject string as consisting  of  a  single
1151         line of characters (even if it actually contains newlines). The  "start         line  of characters (even if it actually contains newlines). The "start
1152         of  line"  metacharacter  (^)  matches only at the start of the string,         of line" metacharacter (^) matches only at the  start  of  the  string,
1153         while the "end of line" metacharacter ($) matches only at  the  end  of         while  the  "end  of line" metacharacter ($) matches only at the end of
1154         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1155         is set). This is the same as Perl.         is set). This is the same as Perl.
1156    
1157         When PCRE_MULTILINE is set, the "start of line" and  "end  of  line"         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"
1158         constructs  match  immediately following or immediately before any new-         constructs match immediately following or immediately  before  internal
1159         line in the subject string, respectively, as well as at the very  start         newlines  in  the  subject string, respectively, as well as at the very
1160         and  end. This is equivalent to Perl's /m option, and it can be changed         start and end. This is equivalent to Perl's /m option, and  it  can  be
1161         within a pattern by a (?m) option setting. If there are no "\n" charac-         changed within a pattern by a (?m) option setting. If there are no new-
1162         ters  in  a  subject  string, or no occurrences of ^ or $ in a pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1163         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
1164    
1165             PCRE_NEWLINE_CR
1166             PCRE_NEWLINE_LF
1167             PCRE_NEWLINE_CRLF
1168             PCRE_NEWLINE_ANYCRLF
1169             PCRE_NEWLINE_ANY
1170    
1171           These  options  override the default newline definition that was chosen
1172           when PCRE was built. Setting the first or the second specifies  that  a
1173           newline  is  indicated  by a single character (CR or LF, respectively).
1174           Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1175           two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1176           that any of the three preceding sequences should be recognized. Setting
1177           PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1178           recognized. The Unicode newline sequences are the three just mentioned,
1179           plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1180           U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1181           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1182           UTF-8 mode.
1183    
1184           The newline setting in the  options  word  uses  three  bits  that  are
1185           treated as a number, giving eight possibilities. Currently only six are
1186           used (default plus the five values above). This means that if  you  set
1187           more  than one newline option, the combination may or may not be sensi-
1188           ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1189           PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1190           cause an error.
1191    
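       The three-bit newline field can be illustrated with a small sketch.
       The SKETCH_NEWLINE_* values below are assumptions chosen to mirror
       the definitions in pcre.h for the 7.x series; check them against
       your own header before relying on them. The function names are
       hypothetical and not part of PCRE.

```c
#include <assert.h>

/* Assumed to match pcre.h in the 7.x series; verify against your header. */
#define SKETCH_NEWLINE_CR       0x00100000UL
#define SKETCH_NEWLINE_LF       0x00200000UL
#define SKETCH_NEWLINE_CRLF     0x00300000UL
#define SKETCH_NEWLINE_ANY      0x00400000UL
#define SKETCH_NEWLINE_ANYCRLF  0x00500000UL
#define SKETCH_NEWLINE_MASK     0x00700000UL

/* Return the newline field of an options word as a number from 0 to 7. */
unsigned newline_field(unsigned long options)
{
    /* Sanity check: the mask covers exactly three bits. */
    assert((SKETCH_NEWLINE_MASK >> 20) == 7);
    return (unsigned)((options & SKETCH_NEWLINE_MASK) >> 20);
}

/* Nonzero if the field is one of the six values currently in use:
   0 (build-time default) or 1 to 5 (the five options above). */
int newline_field_in_use(unsigned long options)
{
    return newline_field(options) <= 5;
}
```

       With these values, CR combined with LF gives the same field as CRLF,
       while CRLF combined with ANY gives 7, one of the unused numbers.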
1192           The only time that a line break is specially recognized when  compiling
1193           a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
1194           character class is encountered. This indicates  a  comment  that  lasts
1195           until  after the next line break sequence. In other circumstances, line
1196           break  sequences  are  treated  as  literal  data,   except   that   in
1197           PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1198           and are therefore ignored.
1199    
1200           The newline option that is set at compile time becomes the default that
1201           is  used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1202    
1203           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1204    
1205         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
# Line 1031  COMPILATION ERROR CODES Line 1241  COMPILATION ERROR CODES
1241    
1242         The following table lists the error  codes  that  may  be  returned  by         The following table lists the error  codes  that  may  be  returned  by
1243         pcre_compile2(),  along with the error messages that may be returned by         pcre_compile2(),  along with the error messages that may be returned by
1244         both compiling functions.         both compiling functions. As PCRE has developed, some error codes  have
1245           fallen out of use. To avoid confusion, they have not been re-used.
1246    
1247            0  no error            0  no error
1248            1  \ at end of pattern            1  \ at end of pattern
# Line 1043  COMPILATION ERROR CODES Line 1254  COMPILATION ERROR CODES
1254            7  invalid escape sequence in character class            7  invalid escape sequence in character class
1255            8  range out of order in character class            8  range out of order in character class
1256            9  nothing to repeat            9  nothing to repeat
1257           10  operand of unlimited repeat could match the empty string           10  [this code is not in use]
1258           11  internal error: unexpected repeat           11  internal error: unexpected repeat
1259           12  unrecognized character after (?           12  unrecognized character after (?
1260           13  POSIX named classes are supported only within a class           13  POSIX named classes are supported only within a class
# Line 1052  COMPILATION ERROR CODES Line 1263  COMPILATION ERROR CODES
1263           16  erroffset passed as NULL           16  erroffset passed as NULL
1264           17  unknown option bit(s) set           17  unknown option bit(s) set
1265           18  missing ) after comment           18  missing ) after comment
1266           19  parentheses nested too deeply           19  [this code is not in use]
1267           20  regular expression too large           20  regular expression too large
1268           21  failed to get memory           21  failed to get memory
1269           22  unmatched parentheses           22  unmatched parentheses
1270           23  internal error: code overflow           23  internal error: code overflow
1271           24  unrecognized character after (?<           24  unrecognized character after (?<
1272           25  lookbehind assertion is not fixed length           25  lookbehind assertion is not fixed length
1273           26  malformed number after (?(           26  malformed number or name after (?(
1274           27  conditional group contains more than two branches           27  conditional group contains more than two branches
1275           28  assertion expected after (?(           28  assertion expected after (?(
1276           29  (?R or (?digits must be followed by )           29  (?R or (?[+-]digits must be followed by )
1277           30  unknown POSIX class name           30  unknown POSIX class name
1278           31  POSIX collating elements are not supported           31  POSIX collating elements are not supported
1279           32  this version of PCRE is not compiled with PCRE_UTF8 support           32  this version of PCRE is not compiled with PCRE_UTF8 support
1280           33  spare error           33  [this code is not in use]
1281           34  character value in \x{...} sequence is too large           34  character value in \x{...} sequence is too large
1282           35  invalid condition (?(0)           35  invalid condition (?(0)
1283           36  \C not allowed in lookbehind assertion           36  \C not allowed in lookbehind assertion
# Line 1075  COMPILATION ERROR CODES Line 1286  COMPILATION ERROR CODES
1286           39  closing ) for (?C expected           39  closing ) for (?C expected
1287           40  recursive call could loop indefinitely           40  recursive call could loop indefinitely
1288           41  unrecognized character after (?P           41  unrecognized character after (?P
1289           42  syntax error after (?P           42  syntax error in subpattern name (missing terminator)
1290           43  two named groups have the same name           43  two named subpatterns have the same name
1291           44  invalid UTF-8 string           44  invalid UTF-8 string
1292           45  support for \P, \p, and \X has not been compiled           45  support for \P, \p, and \X has not been compiled
1293           46  malformed \P or \p sequence           46  malformed \P or \p sequence
1294           47  unknown property name after \P or \p           47  unknown property name after \P or \p
1295             48  subpattern name is too long (maximum 32 characters)
1296             49  too many named subpatterns (maximum 10,000)
1297             50  repeated subpattern is too long
1298             51  octal value is greater than \377 (not in UTF-8 mode)
1299             52  internal error: overran compiling workspace
1300             53   internal  error:  previously-checked  referenced  subpattern not
1301           found
1302             54  DEFINE group contains more than one branch
1303             55  repeating a DEFINE group is not allowed
1304             56  inconsistent NEWLINE options
1305             57  \g is not followed by a braced name or an optionally braced
1306                   non-zero number
1307             58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number
1308    
1309    
1310  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1111  STUDYING A PATTERN Line 1335  STUDYING A PATTERN
1335    
1336         The  third argument for pcre_study() is a pointer for an error message.         The  third argument for pcre_study() is a pointer for an error message.
1337         If studying succeeds (even if no data is  returned),  the  variable  it         If studying succeeds (even if no data is  returned),  the  variable  it
1338         points  to  is set to NULL. Otherwise it points to a textual error mes-         points  to  is  set  to NULL. Otherwise it is set to point to a textual
1339         sage. You should therefore test the error pointer for NULL after  call-         error message. This is a static string that is part of the library. You
1340         ing pcre_study(), to be sure that it has run successfully.         must  not  try  to  free it. You should test the error pointer for NULL
1341           after calling pcre_study(), to be sure that it has run successfully.
1342    
1343         This is a typical call to pcre_study():         This is a typical call to pcre_study():
1344    
# Line 1124  STUDYING A PATTERN Line 1349  STUDYING A PATTERN
1349             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1350    
1351         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1352         that do not have a single fixed starting character. A bitmap of  possi-         that  do not have a single fixed starting character. A bitmap of possi-
1353         ble starting bytes is created.         ble starting bytes is created.
1354    
1355    
1356  LOCALE SUPPORT  LOCALE SUPPORT
1357    
1358         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
1359         letters digits, or whatever, by reference to a set of  tables,  indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
1360         by  character  value.  When running in UTF-8 mode, this applies only to         by character value. When running in UTF-8 mode, this  applies  only  to
1361         characters with codes less than 128. Higher-valued  codes  never  match         characters  with  codes  less than 128. Higher-valued codes never match
1362         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1363         with Unicode character property support.         with  Unicode  character property support. The use of locales with Uni-
1364           code is discouraged. If you are handling characters with codes  greater
1365         An internal set of tables is created in the default C locale when  PCRE         than  128, you should either use UTF-8 and Unicode, or use locales, but
1366         is  built.  This  is  used when the final argument of pcre_compile() is         not try to mix the two.
1367         NULL, and is sufficient for many applications. An  alternative  set  of  
1368         tables  can,  however, be supplied. These may be created in a different         PCRE contains an internal set of tables that are used  when  the  final
1369         locale from the default. As more and more applications change to  using         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1370         Unicode, the need for this locale support is expected to die away.         applications.  Normally, the internal tables recognize only ASCII char-
1371           acters. However, when PCRE is built, it is possible to cause the inter-
1372         External  tables  are  built by calling the pcre_maketables() function,         nal tables to be rebuilt in the default "C" locale of the local system,
1373         which has no arguments, in the relevant locale. The result can then  be         which may cause them to be different.
1374         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For  
1375         example, to build and use tables that are appropriate  for  the  French         The  internal tables can always be overridden by tables supplied by the
1376         locale  (where  accented  characters  with  values greater than 128 are         application that calls PCRE. These may be created in a different locale
1377           from  the  default.  As more and more applications change to using Uni-
1378           code, the need for this locale support is expected to die away.
1379    
1380           External tables are built by calling  the  pcre_maketables()  function,
1381           which  has no arguments, in the relevant locale. The result can then be
1382           passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1383           example,  to  build  and use tables that are appropriate for the French
1384           locale (where accented characters with  values  greater  than  128  are
1385         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1386    
1387           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1388           tables = pcre_maketables();           tables = pcre_maketables();
1389           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1390    
1391           The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1392           if you are using Windows, the name for the French locale is "french".
1393    
1394         When pcre_maketables() runs, the tables are built  in  memory  that  is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1395         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1396         that the memory containing the tables remains available for as long  as         that the memory containing the tables remains available for as long  as
# Line 1200  INFORMATION ABOUT A PATTERN Line 1436  INFORMATION ABOUT A PATTERN
1436         pattern:         pattern:
1437    
1438           int rc;           int rc;
1439           unsigned long int length;           size_t length;
1440           rc = pcre_fullinfo(           rc = pcre_fullinfo(
1441             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
1442             pe,               /* result of pcre_study(), or NULL */             pe,               /* result of pcre_study(), or NULL */
# Line 1232  INFORMATION ABOUT A PATTERN Line 1468  INFORMATION ABOUT A PATTERN
1468           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1469    
1470         Return  information  about  the first byte of any matched string, for a         Return  information  about  the first byte of any matched string, for a
1471         non-anchored   pattern.   (This    option    used    to    be    called         non-anchored pattern. The fourth argument should point to an int  vari-
1472         PCRE_INFO_FIRSTCHAR;  the  old  name  is still recognized for backwards         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1473         compatibility.)         is still recognized for backwards compatibility.)
1474    
1475         If there is a fixed first byte, for example, from  a  pattern  such  as         If there is a fixed first byte, for example, from  a  pattern  such  as
1476         (cat|cow|coyote),  it  is  returned in the integer pointed to by where.         (cat|cow|coyote), its value is returned. Otherwise, if either
        Otherwise, if either  
1477    
1478         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1479         branch starts with "^", or         branch starts with "^", or
1480    
1481         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1482         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1483    
1484         -1 is returned, indicating that the pattern matches only at  the  start         -1  is  returned, indicating that the pattern matches only at the start
1485         of  a  subject string or after any newline within the string. Otherwise         of a subject string or after any newline within the  string.  Otherwise
1486         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
1487    
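       The three kinds of result described above can be told apart with a
       simple test on the returned integer. The sketch below is only an
       illustration; the enum and function names are hypothetical and not
       part of PCRE, and value stands for the int that pcre_fullinfo()
       stored for PCRE_INFO_FIRSTBYTE.

```c
#include <assert.h>

/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the fixed first byte (0 to 255) */
    FIRST_BYTE_LINE_START,  /* -1: matches only at the start of the
                               subject or after a newline */
    FIRST_BYTE_NONE         /* -2: nothing recorded, or pattern anchored */
};

enum first_byte_kind classify_first_byte(int value)
{
    /* The documented results are a byte value, -1, or -2. */
    assert(value >= -2 && value <= 255);

    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```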
1488           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1489    
1490         If the pattern was studied, and this resulted in the construction of  a         If  the pattern was studied, and this resulted in the construction of a
1491         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1492         matching string, a pointer to the table is returned. Otherwise NULL  is         matching  string, a pointer to the table is returned. Otherwise NULL is
1493         returned.  The fourth argument should point to an unsigned char * vari-         returned. The fourth argument should point to an unsigned char *  vari-
1494         able.         able.
1495    
1496             PCRE_INFO_JCHANGED
1497    
1498           Return  1  if the (?J) option setting is used in the pattern, otherwise
1499           0. The fourth argument should point to an int variable. The (?J) inter-
1500           nal option setting changes the local PCRE_DUPNAMES option.
1501    
1502           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1503    
1504         Return the value of the rightmost literal byte that must exist  in  any         Return  the  value of the rightmost literal byte that must exist in any
1505         matched  string,  other  than  at  its  start,  if such a byte has been         matched string, other than at its  start,  if  such  a  byte  has  been
1506         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1507         is  no such byte, -1 is returned. For anchored patterns, a last literal         is no such byte, -1 is returned. For anchored patterns, a last  literal
1508         byte is recorded only if it follows something of variable  length.  For         byte  is  recorded only if it follows something of variable length. For
1509         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1510         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1511    
# Line 1272  INFORMATION ABOUT A PATTERN Line 1513  INFORMATION ABOUT A PATTERN
1513           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1514           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1515    
1516         PCRE supports the use of named as well as numbered capturing  parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1517         ses.  The names are just an additional way of identifying the parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1518         ses,  which  still  acquire  numbers.  A  convenience  function  called         ses, which still acquire numbers. Several convenience functions such as
1519         pcre_get_named_substring()  is  provided  for  extracting an individual         pcre_get_named_substring() are provided for  extracting  captured  sub-
1520         captured substring by name. It is also possible  to  extract  the  data         strings  by  name. It is also possible to extract the data directly, by
1521         directly,  by  first converting the name to a number in order to access         first converting the name to a number in order to  access  the  correct
1522         the correct pointers in the output vector (described  with  pcre_exec()         pointers in the output vector (described with pcre_exec() below). To do
1523         below).  To  do the conversion, you need to use the name-to-number map,         the conversion, you need  to  use  the  name-to-number  map,  which  is
1524         which is described by these three values.         described by these three values.
1525    
1526         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1527         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1528         of each entry; both of these  return  an  int  value.  The  entry  size         of  each  entry;  both  of  these  return  an int value. The entry size
1529         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1530         a pointer to the first entry of the table  (a  pointer  to  char).  The         a  pointer  to  the  first  entry of the table (a pointer to char). The
1531         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1532         sis, most significant byte first. The rest of the entry is  the  corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1533         sponding  name,  zero  terminated. The names are in alphabetical order.         sponding name, zero terminated. The names are  in  alphabetical  order.
1534         For example, consider the following pattern  (assume  PCRE_EXTENDED  is         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1535         set, so white space - including newlines - is ignored):         theses numbers. For example, consider  the  following  pattern  (assume
1536           PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1537           ignored):
1538    
1539           (?P<date> (?P<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
1540           (?P<month>\d\d) - (?P<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
1541    
1542         There  are  four  named subpatterns, so the table has four entries, and         There are four named subpatterns, so the table has  four  entries,  and
1543         each entry in the table is eight bytes long. The table is  as  follows,         each  entry  in the table is eight bytes long. The table is as follows,
1544         with non-printing bytes shown in hexadecimal, and undefined bytes shown         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1545         as ??:         as ??:
1546    
# Line 1306  INFORMATION ABOUT A PATTERN Line 1549  INFORMATION ABOUT A PATTERN
1549           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1550           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1551    
1552         When writing code to extract data  from  named  subpatterns  using  the         When  writing  code  to  extract  data from named subpatterns using the
1553         name-to-number map, remember that the length of each entry is likely to         name-to-number map, remember that the length of the entries  is  likely
1554         be different for each compiled pattern.         to be different for each compiled pattern.
1555    
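       A minimal sketch of a lookup in the name-to-number map follows. The
       function lookup_group_number() is hypothetical and not part of PCRE.
       It assumes that the caller has already obtained the three values
       with pcre_fullinfo() and passes the table as a pointer to unsigned
       char, so that the two number bytes are not sign-extended. A linear
       scan is used for simplicity; because the names are in alphabetical
       order, a binary search would also be possible.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of the PCRE API. Returns the number of
   the first entry whose name equals "name", or -1 if there is none. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;

    /* Each entry holds two number bytes plus at least a terminating zero. */
    assert(entry_size > 2);

    for (i = 0; i < name_count; i++) {
        /* Entries are fixed-size, so entry i starts at i * entry_size. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;

        /* The name is zero terminated and starts after the two bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* most significant first */
    }
    return -1;
}
```

       Applied to the four-entry table for the date pattern above, with an
       entry size of eight, looking up "month" yields 4 and "year" yields
       2. The entry size is passed in rather than hard-coded because it
       differs from one compiled pattern to another.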
1556             PCRE_INFO_OKPARTIAL
1557    
1558           Return  1 if the pattern can be used for partial matching, otherwise 0.
1559           The fourth argument should point to an int  variable.  The  pcrepartial
1560           documentation  lists  the restrictions that apply to patterns when par-
1561           tial matching is used.
1562    
1563           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1564    
1565         Return a copy of the options with which the pattern was  compiled.  The         Return a copy of the options with which the pattern was  compiled.  The
1566         fourth  argument  should  point to an unsigned long int variable. These         fourth  argument  should  point to an unsigned long int variable. These
1567         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1568         by any top-level option settings within the pattern itself.         by any top-level option settings at the start of the pattern itself. In
1569           other words, they are the options that will be in force  when  matching
1570           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1571           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1572           and PCRE_EXTENDED.
1573    
1574         A  pattern  is  automatically  anchored by PCRE if all of its top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1575         alternatives begin with one of the following:         alternatives begin with one of the following:
# Line 1428  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1682  MATCHING A PATTERN: THE TRADITIONAL FUNC
1682         If the extra argument is not NULL, it must point to a  pcre_extra  data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1683         block.  The pcre_study() function returns such a block (when it doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1684         return NULL), but you can also create one for yourself, and pass  addi-         return NULL), but you can also create one for yourself, and pass  addi-
1685         tional  information in it. The fields in a pcre_extra block are as fol-         tional  information  in it. The pcre_extra block contains the following
1686         lows:         fields (not necessarily in this order):
1687    
1688           unsigned long int flags;           unsigned long int flags;
1689           void *study_data;           void *study_data;
1690           unsigned long int match_limit;           unsigned long int match_limit;
1691             unsigned long int match_limit_recursion;
1692           void *callout_data;           void *callout_data;
1693           const unsigned char *tables;           const unsigned char *tables;
1694    
# Line 1442  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1697  MATCHING A PATTERN: THE TRADITIONAL FUNC
1697    
1698           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1699           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1700             PCRE_EXTRA_MATCH_LIMIT_RECURSION
1701           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1702           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1703    
# Line 1458  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1714  MATCHING A PATTERN: THE TRADITIONAL FUNC
1714         repeats.         repeats.
1715    
1716         Internally,  PCRE uses a function called match() which it calls repeat-         Internally,  PCRE uses a function called match() which it calls repeat-
1717         edly (sometimes recursively). The limit is imposed  on  the  number  of         edly (sometimes recursively). The limit set by match_limit  is  imposed
1718         times  this  function is called during a match, which has the effect of         on  the  number  of times this function is called during a match, which
1719         limiting the amount of recursion and backtracking that can take  place.         has the effect of limiting the amount of  backtracking  that  can  take
1720         For patterns that are not anchored, the count starts from zero for each         place. For patterns that are not anchored, the count restarts from zero
1721         position in the subject string.         for each position in the subject string.
1722    
1723         The default limit for the library can be set when PCRE  is  built;  the         The default value for the limit can be set  when  PCRE  is  built;  the
1724         default  default  is 10 million, which handles all but the most extreme         default  default  is 10 million, which handles all but the most extreme
1725         cases. You can reduce  the  default  by  supplying pcre_exec()  with  a         cases. You can override the default  by  supplying pcre_exec()  with  a
1726         pcre_extra  block  in  which match_limit is set to a smaller value, and         pcre_extra     block    in    which    match_limit    is    set,    and
1727         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1728         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1729    
1730           The  match_limit_recursion field is similar to match_limit, but instead
1731           of limiting the total number of times that match() is called, it limits
1732           the  depth  of  recursion. The recursion depth is a smaller number than
1733           the total number of calls, because not all calls to match() are  recur-
1734           sive.  This limit is of use only if it is set smaller than match_limit.
1735    
1736           Limiting the recursion depth limits the amount of  stack  that  can  be
1737           used, or, when PCRE has been compiled to use memory on the heap instead
1738           of the stack, the amount of heap memory that can be used.
1739    
1740           The default value for match_limit_recursion can be  set  when  PCRE  is
1741           built;  the  default  default  is  the  same  value  as the default for
1742           match_limit. You can override the default by supplying pcre_exec() with
1743           a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1744           PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1745           limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1746    
1747         The  pcre_callout  field is used in conjunction with the "callout" fea-         The  pcre_callout  field is used in conjunction with the "callout" fea-
1748         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1749    
# Line 1488  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1761  MATCHING A PATTERN: THE TRADITIONAL FUNC
1761     Option bits for pcre_exec()     Option bits for pcre_exec()
1762    
1763         The  unused  bits of the options argument for pcre_exec() must be zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1764         The  only  bits  that  may  be  set  are  PCRE_ANCHORED,   PCRE_NOTBOL,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1765         PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1766           PCRE_PARTIAL.
1767    
1768           PCRE_ANCHORED           PCRE_ANCHORED
1769    
1770         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1771         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1772         turned  out to be anchored by virtue of its contents, it cannot be made         turned out to be anchored by virtue of its contents, it cannot be  made
1773         unanchored at matching time.         unanchored at matching time.
1774    
1775             PCRE_NEWLINE_CR
1776             PCRE_NEWLINE_LF
1777             PCRE_NEWLINE_CRLF
1778             PCRE_NEWLINE_ANYCRLF
1779             PCRE_NEWLINE_ANY
1780    
1781           These  options  override  the  newline  definition  that  was chosen or
1782           defaulted when the pattern was compiled. For details, see the  descrip-
1783           tion  of  pcre_compile()  above.  During  matching,  the newline choice
1784           affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
1785           ters.  It may also alter the way the match position is advanced after a
1786           match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,
1787           PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY is set, and a match attempt
1788           fails when the current position is at a CRLF sequence, the match  posi-
1789           tion  is  advanced by two characters instead of one, in other words, to
1790           after the CRLF.
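
The advance rule described above can be sketched as a small function. This is only an illustration of the rule, not PCRE's internal code: pcre_exec() applies it itself, so callers normally need nothing like this. The name next_start_offset and its parameters are hypothetical, and the sketch works in bytes, ignoring UTF-8 mode, where one character may occupy several bytes.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: given the offset at which an
   unanchored match attempt has just failed, return the offset for the next
   attempt. crlf_is_newline is nonzero when the newline setting is one of
   CRLF, ANYCRLF, or ANY. */
static size_t next_start_offset(const char *subject, size_t length,
                                size_t offset, int crlf_is_newline)
{
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;   /* step over the whole CRLF pair */
    return offset + 1;       /* ordinary one-byte advance */
}
```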
1791    
1792           PCRE_NOTBOL           PCRE_NOTBOL
1793    
1794         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
# Line 1633  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1924  MATCHING A PATTERN: THE TRADITIONAL FUNC
1924         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-
1925         tor[1],  identify  the  portion  of  the  subject string matched by the         tor[1],  identify  the  portion  of  the  subject string matched by the
1926         entire pattern. The next pair is used for the first  capturing  subpat-         entire pattern. The next pair is used for the first  capturing  subpat-
1927         tern,  and  so  on.  The value returned by pcre_exec() is the number of         tern, and so on. The value returned by pcre_exec() is one more than the
1928         pairs that have been set. If there are no  capturing  subpatterns,  the         highest numbered pair that has been set. For example, if two substrings
1929         return  value  from  a  successful match is 1, indicating that just the         have  been captured, the returned value is 3. If there are no capturing
1930         first pair of offsets has been set.         subpatterns, the return value from a successful match is 1,  indicating
1931           that just the first pair of offsets has been set.
        Some convenience functions are provided  for  extracting  the  captured  
        substrings  as  separate  strings. These are described in the following  
        section.  
   
        It is possible for an capturing subpattern number  n+1  to  match  some  
        part  of  the  subject  when subpattern n has not been used at all. For  
        example, if the string "abc" is matched against the pattern (a|(z))(bc)  
        subpatterns  1 and 3 are matched, but 2 is not. When this happens, both  
        offset values corresponding to the unused subpattern are set to -1.  
1932    
1933         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1934         of the string that it matched that is returned.         of the string that it matched that is returned.
1935    
1936         If  the vector is too small to hold all the captured substring offsets,         If the vector is too small to hold all the captured substring  offsets,
1937         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
1938         function  returns a value of zero. In particular, if the substring off-         function returns a value of zero. In particular, if the substring  off-
1939         sets are not of interest, pcre_exec() may be called with ovector passed         sets are not of interest, pcre_exec() may be called with ovector passed
1940         as  NULL  and  ovecsize  as zero. However, if the pattern contains back         as NULL and ovecsize as zero. However, if  the  pattern  contains  back
1941         references and the ovector is not big enough to  remember  the  related         references  and  the  ovector is not big enough to remember the related
1942         substrings,  PCRE has to get additional memory for use during matching.         substrings, PCRE has to get additional memory for use during  matching.
1943         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1944    
1945         Note that pcre_info() can be used to find out how many  capturing  sub-         The  pcre_info()  function  can  be used to find out how many capturing
1946         patterns there are in a compiled pattern. The smallest size for ovector         subpatterns there are in a compiled  pattern.  The  smallest  size  for
1947         that will allow for n captured substrings, in addition to  the  offsets         ovector  that  will allow for n captured substrings, in addition to the
1948         of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
1949    
1950           It is possible for capturing subpattern number n+1 to match  some  part
1951           of the subject when subpattern n has not been used at all. For example,
1952           if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
1953           return from the function is 4, and subpatterns 1 and 3 are matched, but
1954           2 is not. When this happens, both values in  the  offset  pairs  corre-
1955           sponding to unused subpatterns are set to -1.
1956    
1957           Offset  values  that correspond to unused subpatterns at the end of the
1958           expression are also set to -1. For example,  if  the  string  "abc"  is
1959           matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
1960           matched. The return from the function is 2, because  the  highest  used
1961           capturing subpattern number is 1. However, you can refer to the offsets
1962           for the second and third capturing subpatterns if  you  wish  (assuming
1963           the vector is large enough, of course).
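
As a sketch of how the return value and the offset pairs fit together, the following helper (a hypothetical name, not a PCRE function) reports the length of a captured substring, treating both kinds of unused subpattern in the same way: those whose pair was set to -1, and those numbered at or above the return value. The OVECTOR_SIZE macro is likewise an illustrative name for the (n+1)*3 rule.

```c
/* Smallest ovector size (in ints) that allows for n captured substrings
   in addition to the whole match, per the (n+1)*3 rule. */
#define OVECTOR_SIZE(n) (((n) + 1) * 3)

/* Hypothetical helper, not part of PCRE: rc is a positive value returned
   by pcre_exec(). Returns the length of substring number n, or -1 if that
   subpattern took no part in the match. */
static int group_length(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return -1;              /* beyond the highest pair that was set */
    if (ovector[2 * n] < 0)
        return -1;              /* unused subpattern: both offsets are -1 */
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For "abc" matched against (a|(z))(bc), the return value is 4 and the first eight elements of the vector hold 0,3, 0,1, -1,-1, 1,3, so the helper yields 3, 1, -1, and 2 for substrings 0 to 3.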
1964    
1965           Some  convenience  functions  are  provided for extracting the captured
1966           substrings as separate strings. These are described below.
1967    
1968     Return values from pcre_exec()     Error return values from pcre_exec()
1969    
1970         If  pcre_exec()  fails, it returns a negative number. The following are         If pcre_exec() fails, it returns a negative number. The  following  are
1971         defined in the header file:         defined in the header file:
1972    
1973           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1676  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1976  MATCHING A PATTERN: THE TRADITIONAL FUNC
1976    
1977           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1978    
1979         Either code or subject was passed as NULL,  or  ovector  was  NULL  and         Either  code  or  subject  was  passed as NULL, or ovector was NULL and
1980         ovecsize was not zero.         ovecsize was not zero.
1981    
1982           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1685  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1985  MATCHING A PATTERN: THE TRADITIONAL FUNC
1985    
1986           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1987    
1988         PCRE  stores a 4-byte "magic number" at the start of the compiled code,         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
1989         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
1990         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
1991         an environment with the other endianness. This is the error  that  PCRE         an  environment  with the other endianness. This is the error that PCRE
1992         gives when the magic number is not present.         gives when the magic number is not present.
1993    
1994           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
1995    
1996         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1997         compiled pattern. This error could be caused by a bug  in  PCRE  or  by         compiled  pattern.  This  error  could be caused by a bug in PCRE or by
1998         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1999    
2000           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2001    
2002         If  a  pattern contains back references, but the ovector that is passed         If a pattern contains back references, but the ovector that  is  passed
2003         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
2004         PCRE  gets  a  block of memory at the start of matching to use for this         PCRE gets a block of memory at the start of matching to  use  for  this
2005         purpose. If the call via pcre_malloc() fails, this error is given.  The         purpose.  If the call via pcre_malloc() fails, this error is given. The
2006         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
2007    
2008           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2009    
2010         This  error is used by the pcre_copy_substring(), pcre_get_substring(),         This error is used by the pcre_copy_substring(),  pcre_get_substring(),
2011         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
2012         returned by pcre_exec().         returned by pcre_exec().
2013    
2014           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
2015    
2016         The  recursion  and backtracking limit, as specified by the match_limit         The backtracking limit, as specified by  the  match_limit  field  in  a
2017         field in a pcre_extra structure (or defaulted)  was  reached.  See  the         pcre_extra  structure  (or  defaulted) was reached. See the description
2018         description above.         above.
2019    
2020           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2021    
2022         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
2023         use by callout functions that want to yield a distinctive  error  code.         use  by  callout functions that want to yield a distinctive error code.
2024         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
2025    
2026           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2027    
2028         A  string  that contains an invalid UTF-8 byte sequence was passed as a         A string that contains an invalid UTF-8 byte sequence was passed  as  a
2029         subject.         subject.
2030    
2031           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2032    
2033         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
2034         value  of startoffset did not point to the beginning of a UTF-8 charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2035         ter.         ter.
2036    
2037           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2038    
2039         The subject string did not match, but it did match partially.  See  the         The  subject  string did not match, but it did match partially. See the
2040         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
2041    
2042           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2043    
2044         The  PCRE_PARTIAL  option  was  used with a compiled pattern containing         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing
2045         items that are not supported for partial matching. See the  pcrepartial         items  that are not supported for partial matching. See the pcrepartial
2046         documentation for details of partial matching.         documentation for details of partial matching.
2047    
2048           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2049    
2050         An  unexpected  internal error has occurred. This error could be caused         An unexpected internal error has occurred. This error could  be  caused
2051         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2052    
2053           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
2054    
2055         This error is given if the value of the ovecsize argument is  negative.         This  error is given if the value of the ovecsize argument is negative.
2056    
2057             PCRE_ERROR_RECURSIONLIMIT (-21)
2058    
2059           The internal recursion limit, as specified by the match_limit_recursion
2060           field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2061           description above.
2062    
2063             PCRE_ERROR_BADNEWLINE     (-23)
2064    
2065           An invalid combination of PCRE_NEWLINE_xxx options was given.
2066    
2067           Error numbers -16 to -20 and -22 are not used by pcre_exec().
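
For diagnostics it can be convenient to turn these values into names. The sketch below uses the numeric values exactly as listed above, so that it stands alone; the function name exec_error_name is hypothetical, and a real program should switch on the PCRE_ERROR_xxx macros from the header file instead of on literal numbers.

```c
/* Hypothetical helper, not part of PCRE: map a pcre_exec() return value to
   the name of the corresponding error, using the values listed above. */
static const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "NOMATCH";
    case -2:  return "NULL";
    case -3:  return "BADOPTION";
    case -4:  return "BADMAGIC";
    case -5:  return "UNKNOWN_OPCODE";
    case -6:  return "NOMEMORY";
    case -7:  return "NOSUBSTRING";   /* never returned by pcre_exec() */
    case -8:  return "MATCHLIMIT";
    case -9:  return "CALLOUT";       /* only from callout functions */
    case -10: return "BADUTF8";
    case -11: return "BADUTF8_OFFSET";
    case -12: return "PARTIAL";
    case -13: return "BADPARTIAL";
    case -14: return "INTERNAL";
    case -15: return "BADCOUNT";
    case -21: return "RECURSIONLIMIT";
    case -23: return "BADNEWLINE";
    default:  return rc >= 0 ? "not an error" : "not used by pcre_exec()";
    }
}
```

Most callers will want to treat NOMATCH and PARTIAL as ordinary outcomes and everything else as a failure worth reporting.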
2068    
2069    
2070  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
# Line 1768  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2080  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2080         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2081              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2082    
2083         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
2084         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
2085         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2086         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
2087         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
2088         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
2089         substrings.  A  substring  that  contains  a  binary  zero is correctly         substrings.
2090         extracted and has a further zero added on the end, but  the  result  is  
2091         not, of course, a C string.         A substring that contains a binary zero is correctly extracted and  has
2092           a  further zero added on the end, but the result is not, of course, a C
2093           string.  However, you can process such a string  by  referring  to  the
2094           length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
2095           string().  Unfortunately, the interface to pcre_get_substring_list() is
2096           not  adequate for handling strings containing binary zeros, because the
2097           end of the final string is not independently indicated.
2098    
2099         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
2100         tions: subject is the subject string that has  just  been  successfully         tions:  subject  is  the subject string that has just been successfully
2101         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2102         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2103         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
2104         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2105         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
2106         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
2107         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
2108    
2109         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
2110         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
2111         zero  extracts  the  substring that matched the entire pattern, whereas         zero extracts the substring that matched the  entire  pattern,  whereas
2112         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
2113         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
2114         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
2115         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
2116         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
2117         the terminating zero, or one of         the terminating zero, or one of these error codes:
2118    
2119           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2120    
2121         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
2122         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2123    
2124           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2125    
2126         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
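
The behaviour described for pcre_copy_substring() can be sketched in a few lines of standalone C. This is an illustration under stated assumptions, not the library's source: the name copy_substring_sketch is hypothetical, the error values are the literal numbers given above, and an unused subpattern (offsets of -1) is assumed to yield an empty string.

```c
#include <string.h>

/* Hypothetical re-statement of the documented behaviour; not PCRE code.
   Copies substring number stringnumber into buffer and returns its length
   (excluding the terminating zero), or -7 / -6 on the errors above. */
static int copy_substring_sketch(const char *subject, const int *ovector,
                                 int stringcount, int stringnumber,
                                 char *buffer, int buffersize)
{
    int start, length;
    if (stringnumber < 0 || stringnumber >= stringcount)
        return -7;                      /* PCRE_ERROR_NOSUBSTRING */
    start = ovector[2 * stringnumber];
    /* Assumption: an unused subpattern is treated as an empty string. */
    length = (start < 0) ? 0 : ovector[2 * stringnumber + 1] - start;
    if (buffersize < length + 1)
        return -6;                      /* PCRE_ERROR_NOMEMORY */
    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = 0;                 /* the added terminating zero */
    return length;
}
```

Because the length is returned separately from the terminating zero, a caller can still process a substring that contains binary zeros by using the returned length rather than strlen().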
2127    
2128         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
2129         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
2130         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2131         the  memory  block  is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
2132         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
2133         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all  went  well,  or  the
2134           error code
2135    
2136           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2137    
# Line 1831  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2150  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2150         tively.  They  do  nothing  more  than  call the function pointed to by         tively.  They  do  nothing  more  than  call the function pointed to by
2151         pcre_free, which of course could be called directly from a  C  program.         pcre_free, which of course could be called directly from a  C  program.
2152         However,  PCRE is used in some situations where it is linked via a spe-         However,  PCRE is used in some situations where it is linked via a spe-
2153         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
2154         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free  directly;  it is for these cases that the functions are pro-
2155         vided.         vided.
2156    
# Line 1854  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2173  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2173         To extract a substring by name, you first have to find  the  associated         To extract a substring by name, you first have to find  the  associated
2174         number. For example, for this pattern         number. For example, for this pattern
2175    
2176           (a+)b(?P<xxx>\d+)...           (a+)b(?<xxx>\d+)...
2177    
2178         the number of the subpattern called "xxx" is 2. You can find the number         the number of the subpattern called "xxx" is 2. If the name is known to
2179         from the name by calling pcre_get_stringnumber(). The first argument is         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2180         the  compiled  pattern,  and  the  second is the name. The yield of the         name by calling pcre_get_stringnumber(). The first argument is the com-
2181         function is the subpattern number, or  PCRE_ERROR_NOSUBSTRING  (-7)  if         piled pattern, and the second is the name. The yield of the function is
2182         there is no subpattern of that name.         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2183           subpattern of that name.
2184    
2185         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2186         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2187         are also two functions that do the whole job.         are also two functions that do the whole job.
2188    
2189         Most    of    the    arguments   of   pcre_copy_named_substring()   and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
2190         pcre_get_named_substring() are the same  as  those  for  the  similarly         pcre_get_named_substring()  are  the  same  as  those for the similarly
2191         named  functions  that extract by number. As these are described in the         named functions that extract by number. As these are described  in  the
2192         previous section, they are not re-described here. There  are  just  two         previous  section,  they  are not re-described here. There are just two
2193         differences:         differences:
2194    
2195         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
2196         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2197         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
2198         name-to-number translation table.         name-to-number translation table.
2199    
2200         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2201         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2202         ate.         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2203           behaviour may not be what you want (see the next section).
2204    
2205    
2206    DUPLICATE SUBPATTERN NAMES
2207    
2208           int pcre_get_stringtable_entries(const pcre *code,
2209                const char *name, char **first, char **last);
2210    
2211           When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
2212           subpatterns  are  not  required  to  be unique. Normally, patterns with
2213           duplicate names are such that in any one match, only one of  the  named
2214           subpatterns  participates. An example is shown in the pcrepattern docu-
2215           mentation. When duplicates are present, pcre_copy_named_substring() and
2216           pcre_get_named_substring()  return the first substring corresponding to
2217           the given name that is set.  If  none  are  set,  an  empty  string  is
2218           returned.  The pcre_get_stringnumber() function returns one of the num-
2219           bers that are associated with the name, but it is not defined which  it
2220           is.
2221    
2222           If  you want to get full details of all captured substrings for a given
2223           name, you must use  the  pcre_get_stringtable_entries()  function.  The
2224           first argument is the compiled pattern, and the second is the name. The
2225           third and fourth are pointers to variables which  are  updated  by  the
2226           function. After it has run, they point to the first and last entries in
2227           the name-to-number table  for  the  given  name.  The  function  itself
2228           returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
2229           there are none. The format of the table is described above in the  sec-
2230           tion  entitled  Information  about  a  pattern.  Given all the relevant
2231           entries for the name, you can extract each of their numbers, and  hence
2232           the captured data, if any.
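
The last step above, turning the entries between first and last into group numbers, can be sketched as follows. This is an illustration, not part of PCRE: the helper name collect_group_numbers() is hypothetical. It assumes the entry layout given in the section on information about a pattern, namely a two-byte group number, most significant byte first, followed by the zero-terminated name, with every entry the same length.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): collect the group numbers
   held in the name-table entries first..last, inclusive. Each entry is
   entry_size bytes long: a 2-byte group number, most significant byte
   first, then the zero-terminated name. Returns the number of entries
   seen; at most max numbers are stored. */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *numbers, int max)
{
    int count = 0;
    const char *entry;

    if (entry_size <= 2 || first == NULL || last == NULL)
        return 0;

    for (entry = first; entry <= last; entry += entry_size) {
        const unsigned char *p = (const unsigned char *)entry;
        if (count < max)
            numbers[count] = (p[0] << 8) | p[1];
        count++;
    }
    return count;
}
```

In real use, entry_size would be the positive value returned by pcre_get_stringtable_entries(), and first and last the pointers it sets. For each number n obtained, ovector[2*n] and ovector[2*n+1] then give the captured substring, or -1 if that group did not participate in the match.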
2233    
2234    
2235  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2236    
2237         The traditional matching function uses a  similar  algorithm  to  Perl,         The  traditional  matching  function  uses a similar algorithm to Perl,
2238         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2239         the subject. If you want to find all possible matches, or  the  longest         the  subject.  If you want to find all possible matches, or the longest
2240         possible  match,  consider using the alternative matching function (see         possible match, consider using the alternative matching  function  (see
2241         below) instead. If you cannot use the alternative function,  but  still         below)  instead.  If you cannot use the alternative function, but still
2242         need  to  find all possible matches, you can kludge it up by making use         need to find all possible matches, you can kludge it up by  making  use
2243         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2244         tation.         tation.
2245    
2246         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2247         tern.  When your callout function is called, extract and save the  cur-         tern.   When your callout function is called, extract and save the cur-
2248         rent  matched  substring.  Then  return  1, which forces pcre_exec() to         rent matched substring. Then return  1,  which  forces  pcre_exec()  to
2249         backtrack and try other alternatives. Ultimately, when it runs  out  of         backtrack  and  try other alternatives. Ultimately, when it runs out of
2250         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2251    
2252    
# Line 1907  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2257  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2257              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2258              int *workspace, int wscount);              int *workspace, int wscount);
2259    
2260         The  function  pcre_dfa_exec()  is  called  to  match  a subject string         The function pcre_dfa_exec()  is  called  to  match  a  subject  string
2261         against a compiled pattern, using a "DFA" matching algorithm. This  has         against  a  compiled pattern, using a matching algorithm that scans the
2262         different  characteristics to the normal algorithm, and is not compati-         subject string just once, and does not backtrack.  This  has  different
2263         ble with Perl. Some of the features of PCRE patterns are not supported.         characteristics  to  the  normal  algorithm, and is not compatible with
2264         Nevertheless, there are times when this kind of matching can be useful.         Perl. Some of the features of PCRE patterns are not  supported.  Never-
2265         For a discussion of the two matching algorithms, see  the  pcrematching         theless,  there are times when this kind of matching can be useful. For
2266         documentation.         a discussion of the two matching algorithms, see the pcrematching docu-
2267           mentation.
2268    
2269         The  arguments  for  the  pcre_dfa_exec()  function are the same as for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2270         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
# Line 1925  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2276  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2276         workspace vector should contain at least 20 elements. It  is  used  for         workspace vector should contain at least 20 elements. It  is  used  for
2277         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2278         workspace will be needed for patterns and subjects where  there  are  a         workspace will be needed for patterns and subjects where  there  are  a
2279         lot of possible matches.         lot of potential matches.
2280    
2281         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_dfa_exec():
2282    
2283           int rc;           int rc;
2284           int ovector[10];           int ovector[10];
2285           int wspace[20];           int wspace[20];
2286           rc = pcre_exec(           rc = pcre_dfa_exec(
2287             re,             /* result of pcre_compile() */             re,             /* result of pcre_compile() */
2288             NULL,           /* we didn't study the pattern */             NULL,           /* we didn't study the pattern */
2289             "some string",  /* the subject string */             "some string",  /* the subject string */
# Line 1947  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2298  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2298     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2299    
2300         The  unused  bits  of  the options argument for pcre_dfa_exec() must be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2301         zero. The only bits that may be  set  are  PCRE_ANCHORED,  PCRE_NOTBOL,         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2302         PCRE_NOTEOL,     PCRE_NOTEMPTY,    PCRE_NO_UTF8_CHECK,    PCRE_PARTIAL,         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2303         PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All  but  the  last  three  of         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2304         these  are  the  same  as  for pcre_exec(), so their description is not         three of these are the same as for pcre_exec(), so their description is
2305         repeated here.         not repeated here.
2306    
2307           PCRE_PARTIAL           PCRE_PARTIAL
2308    
# Line 1966  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2317  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2317           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2318    
2319         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2320         stop  as  soon  as  it  has found one match. Because of the way the DFA         stop as soon as it has found one match. Because of the way the alterna-
2321         algorithm works, this is necessarily the shortest possible match at the         tive algorithm works, this is necessarily the shortest  possible  match
2322         first possible matching point in the subject string.         at the first possible matching point in the subject string.
2323    
2324           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2325    
# Line 2004  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2355  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2355         On success, the yield of the function is a number  greater  than  zero,         On success, the yield of the function is a number  greater  than  zero,
2356         which  is  the  number of matched substrings. The substrings themselves         which  is  the  number of matched substrings. The substrings themselves
2357         are returned in ovector. Each string uses two elements;  the  first  is         are returned in ovector. Each string uses two elements;  the  first  is
2358         the  offset  to the start, and the second is the offset to the end. All         the  offset  to  the start, and the second is the offset to the end. In
2359         the strings have the same start offset. (Space could have been saved by         fact, all the strings have the same start  offset.  (Space  could  have
2360         giving  this only once, but it was decided to retain some compatibility         been  saved by giving this only once, but it was decided to retain some
2361         with the way pcre_exec() returns data, even though the meaning  of  the         compatibility with the way pcre_exec() returns data,  even  though  the
2362         strings is different.)         meaning of the strings is different.)
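
As a sketch of how the pairs described above might be read, the following hypothetical helper (copy_dfa_match() is not a PCRE function) copies one of the matched strings out of the subject. It assumes rc is the positive value returned by pcre_dfa_exec() and that ovector holds rc pairs of offsets.

```c
#include <string.h>

/* Hypothetical helper: copy match i (0-based) from a pcre_dfa_exec()
   style result into buf. rc is the function's positive return value.
   Returns the length of the match, or -1 if i is out of range or buf
   is too small to hold the string and its terminator. */
static int copy_dfa_match(const char *subject, const int *ovector, int rc,
                          int i, char *buf, int bufsize)
{
    int start, end, len;

    if (rc <= 0 || i < 0 || i >= rc)
        return -1;

    start = ovector[2 * i];      /* offset of the start of match i */
    end = ovector[2 * i + 1];    /* offset of the end of match i */
    len = end - start;
    if (len < 0 || len >= bufsize)
        return -1;

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For instance, with the subject "catalogue" and a hand-filled vector {0, 9, 0, 7, 0, 3} standing in for a real result with rc equal to 3, match 0 is "catalogue", match 1 is "catalog", and match 2 is "cat".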
2363    
2364         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2365         est matching string is given first. If there were too many  matches  to         est matching string is given first. If there were too many  matches  to
# Line 2030  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2381  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2381    
2382           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2383    
2384         This  return is given if pcre_dfa_exec() encounters a condition item in         This  return  is  given  if pcre_dfa_exec() encounters a condition item
2385         a pattern that uses a back reference for the  condition.  This  is  not         that uses a back reference for the condition, or a test  for  recursion
2386         supported.         in a specific group. These are not supported.
2387    
2388           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2389    
# Line 2052  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2403  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2403         This error is given if the output vector  is  not  large  enough.  This         This error is given if the output vector  is  not  large  enough.  This
2404         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
2405    
2406  Last updated: 16 May 2005  
2407  Copyright (c) 1997-2005 University of Cambridge.  SEE ALSO
2408    
2409           pcrebuild(3),  pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar-
2410           tial(3), pcreposix(3), pcreprecompile(3), pcresample(3),  pcrestack(3).
2411    
2412    
2413    AUTHOR
2414    
2415           Philip Hazel
2416           University Computing Service
2417           Cambridge CB2 3QH, England.
2418    
2419    
2420    REVISION
2421    
2422           Last updated: 30 July 2007
2423           Copyright (c) 1997-2007 University of Cambridge.
2424  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2425    
2426    
# Line 2080  PCRE CALLOUTS Line 2447  PCRE CALLOUTS
2447         default value is zero.  For  example,  this  pattern  has  two  callout         default value is zero.  For  example,  this  pattern  has  two  callout
2448         points:         points:
2449    
2450           (?C1)eabc(?C2)def           (?C1)abc(?C2)def
2451    
2452         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
2453         called, PCRE automatically  inserts  callouts,  all  with  number  255,         called, PCRE automatically  inserts  callouts,  all  with  number  255,
# Line 2155  THE CALLOUT INTERFACE Line 2522  THE CALLOUT INTERFACE
2522         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2523         were passed to pcre_exec().         were passed to pcre_exec().
2524    
2525         The start_match field contains the offset within the subject  at  which         The start_match field normally contains the offset within  the  subject
2526         the  current match attempt started. If the pattern is not anchored, the         at  which  the  current  match  attempt started. However, if the escape
2527         callout function may be called several times from the same point in the         sequence \K has been encountered, this value is changed to reflect  the
2528         pattern for different starting points in the subject.         modified  starting  point.  If the pattern is not anchored, the callout
2529           function may be called several times from the same point in the pattern
2530           for different starting points in the subject.
2531    
2532         The  current_position  field  contains the offset within the subject of         The  current_position  field  contains the offset within the subject of
2533         the current match pointer.         the current match pointer.
# Line 2211  RETURN VALUES Line 2580  RETURN VALUES
2580         reserved for use by callout functions; it will never be  used  by  PCRE         reserved for use by callout functions; it will never be  used  by  PCRE
2581         itself.         itself.
2582    
2583  Last updated: 28 February 2005  
2584  Copyright (c) 1997-2005 University of Cambridge.  AUTHOR
2585    
2586           Philip Hazel
2587           University Computing Service
2588           Cambridge CB2 3QH, England.
2589    
2590    
2591    REVISION
2592    
2593           Last updated: 29 May 2007
2594           Copyright (c) 1997-2007 University of Cambridge.
2595  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2596    
2597    
# Line 2226  NAME Line 2605  NAME
2605  DIFFERENCES BETWEEN PCRE AND PERL  DIFFERENCES BETWEEN PCRE AND PERL
2606    
2607         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2608         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences described here  are  mainly
2609         respect to Perl 5.8.         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2610           some features that are expected to be in the forthcoming Perl 5.10.
2611         1.  PCRE does not have full UTF-8 support. Details of what it does have  
2612         are given in the section on UTF-8 support in the main pcre page.         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2613           of  what  it does have are given in the section on UTF-8 support in the
2614           main pcre page.
2615    
2616         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2617         permits  them,  but they do not mean what you might think. For example,         permits  them,  but they do not mean what you might think. For example,
# Line 2257  DIFFERENCES BETWEEN PCRE AND PERL Line 2638  DIFFERENCES BETWEEN PCRE AND PERL
2638         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
2639         is  built  with Unicode character property support. The properties that         is  built  with Unicode character property support. The properties that
2640         can be tested with \p and \P are limited to the general category  prop-         can be tested with \p and \P are limited to the general category  prop-
2641         erties such as Lu and Nd.         erties  such  as  Lu and Nd, script names such as Greek or Han, and the
2642           derived properties Any and L&.
2643    
2644         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2645         ters in between are treated as literals.  This  is  slightly  different         ters  in  between  are  treated as literals. This is slightly different
2646         from  Perl  in  that  $  and  @ are also handled as literals inside the         from Perl in that $ and @ are  also  handled  as  literals  inside  the
2647         quotes. In Perl, they cause variable interpolation (but of course  PCRE         quotes.  In Perl, they cause variable interpolation (but of course PCRE
2648         does not have variables). Note the following examples:         does not have variables). Note the following examples:
2649    
2650             Pattern            PCRE matches      Perl matches             Pattern            PCRE matches      Perl matches
# Line 2272  DIFFERENCES BETWEEN PCRE AND PERL Line 2654  DIFFERENCES BETWEEN PCRE AND PERL
2654             \Qabc\$xyz\E       abc\$xyz          abc\$xyz             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
2655             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
2656    
2657         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
2658         classes.         classes.
2659    
2660         8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2661         constructions.  However,  there is support for recursive patterns using         constructions. However, there is support for recursive  patterns.  This
2662         the non-Perl items (?R),  (?number),  and  (?P>name).  Also,  the  PCRE         is  not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
2663         "callout"  feature allows an external function to be called during pat-         "callout" feature allows an external function to be called during  pat-
2664         tern matching. See the pcrecallout documentation for details.         tern matching. See the pcrecallout documentation for details.
2665    
2666         9. There are some differences that are concerned with the  settings  of         9.  Subpatterns  that  are  called  recursively or as "subroutines" are
2667         captured  strings  when  part  of  a  pattern is repeated. For example,         always treated as atomic groups in  PCRE.  This  is  like  Python,  but
2668         matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2         unlike Perl.
2669    
2670           10.  There are some differences that are concerned with the settings of
2671           captured strings when part of  a  pattern  is  repeated.  For  example,
2672           matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
2673         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
2674    
2675         10. PCRE provides some extensions to the Perl regular expression facil-         11. PCRE provides some extensions to the Perl regular expression facil-
2676         ities:         ities.   Perl  5.10  will  include new features that are not in earlier
2677           versions, some of which (such as named parentheses) have been  in  PCRE
2678           for some time. This list is with respect to Perl 5.10:
2679    
2680         (a) Although lookbehind assertions must  match  fixed  length  strings,         (a)  Although  lookbehind  assertions  must match fixed length strings,
2681         each alternative branch of a lookbehind assertion can match a different         each alternative branch of a lookbehind assertion can match a different
2682         length of string. Perl requires them all to have the same length.         length of string. Perl requires them all to have the same length.
2683    
2684         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
2685         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2686    
2687         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2688         cial meaning is faulted.         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
2689           ignored.  (Perl can be made to issue a warning.)
2690    
2691         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
2692         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
# Line 2309  DIFFERENCES BETWEEN PCRE AND PERL Line 2698  DIFFERENCES BETWEEN PCRE AND PERL
2698         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2699         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2700    
2701         (g) The (?R), (?number), and (?P>name) constructs allows for  recursive         (g) The callout facility is PCRE-specific.
        pattern  matching  (Perl  can  do  this using the (?p{code}) construct,  
        which PCRE cannot support.)  
2702    
2703         (h) PCRE supports named capturing substrings, using the Python  syntax.         (h) The partial matching facility is PCRE-specific.
2704    
2705         (i)  PCRE  supports  the  possessive quantifier "++" syntax, taken from         (i) Patterns compiled by PCRE can be saved and re-used at a later time,
2706         Sun's Java package.         even on different hosts that have the other endianness.
2707    
2708         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j)  The  alternative  matching function (pcre_dfa_exec()) matches in a
2709           different way and is not Perl-compatible.
2710    
        (k) The callout facility is PCRE-specific.  
2711    
2712         (l) The partial matching facility is PCRE-specific.  AUTHOR
2713    
2714         (m) Patterns compiled by PCRE can be saved and re-used at a later time,         Philip Hazel
2715         even on different hosts that have the other endianness.         University Computing Service
2716           Cambridge CB2 3QH, England.
2717    
        (n)  The  alternative  matching function (pcre_dfa_exec()) matches in a  
        different way and is not Perl-compatible.  
2718    
2719  Last updated: 28 February 2005  REVISION
2720  Copyright (c) 1997-2005 University of Cambridge.  
2721           Last updated: 13 June 2007
2722           Copyright (c) 1997-2007 University of Cambridge.
2723  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2724    
2725    
# Line 2363  PCRE REGULAR EXPRESSION DETAILS Line 2751  PCRE REGULAR EXPRESSION DETAILS
2751         ported  by  PCRE when its main matching function, pcre_exec(), is used.         ported  by  PCRE when its main matching function, pcre_exec(), is used.
2752         From  release  6.0,   PCRE   offers   a   second   matching   function,         From  release  6.0,   PCRE   offers   a   second   matching   function,
2753         pcre_dfa_exec(),  which matches using a different algorithm that is not         pcre_dfa_exec(),  which matches using a different algorithm that is not
2754         Perl-compatible. The advantages and disadvantages  of  the  alternative         Perl-compatible. Some of the features discussed below are not available
2755         function, and how it differs from the normal function, are discussed in         when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
2756         the pcrematching page.         alternative function, and how it differs from the normal function,  are
2757           discussed in the pcrematching page.
2758         A regular expression is a pattern that is  matched  against  a  subject  
2759         string  from  left  to right. Most characters stand for themselves in a  
2760         pattern, and match the corresponding characters in the  subject.  As  a  CHARACTERS AND METACHARACTERS
2761    
2762           A  regular  expression  is  a pattern that is matched against a subject
2763           string from left to right. Most characters stand for  themselves  in  a
2764           pattern,  and  match  the corresponding characters in the subject. As a
2765         trivial example, the pattern         trivial example, the pattern
2766    
2767           The quick brown fox           The quick brown fox
2768    
2769         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
2770         caseless matching is specified (the PCRE_CASELESS option), letters  are         caseless  matching is specified (the PCRE_CASELESS option), letters are
2771         matched  independently  of case. In UTF-8 mode, PCRE always understands         matched independently of case. In UTF-8 mode, PCRE  always  understands
2772         the concept of case for characters whose values are less than  128,  so         the  concept  of case for characters whose values are less than 128, so
2773         caseless  matching  is always possible. For characters with higher val-         caseless matching is always possible. For characters with  higher  val-
2774         ues, the concept of case is supported if PCRE is compiled with  Unicode         ues,  the concept of case is supported if PCRE is compiled with Unicode
2775         property  support,  but  not  otherwise.   If  you want to use caseless         property support, but not otherwise.   If  you  want  to  use  caseless
2776         matching for characters 128 and above, you must  ensure  that  PCRE  is         matching  for  characters  128  and above, you must ensure that PCRE is
2777         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF-8 support.
2778    
2779         The  power  of  regular  expressions  comes from the ability to include         The power of regular expressions comes  from  the  ability  to  include
2780         alternatives and repetitions in the pattern. These are encoded  in  the         alternatives  and  repetitions in the pattern. These are encoded in the
2781         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
2782         but instead are interpreted in some special way.         but instead are interpreted in some special way.
2783    
2784         There are two different sets of metacharacters: those that  are  recog-         There  are  two different sets of metacharacters: those that are recog-
2785         nized  anywhere in the pattern except within square brackets, and those         nized anywhere in the pattern except within square brackets, and  those
2786         that are recognized in square brackets. Outside  square  brackets,  the         that  are  recognized  within square brackets. Outside square brackets,
2787         metacharacters are as follows:         the metacharacters are as follows:
2788    
2789           \      general escape character with several uses           \      general escape character with several uses
2790           ^      assert start of string (or line, in multiline mode)           ^      assert start of string (or line, in multiline mode)
# Line 2410  PCRE REGULAR EXPRESSION DETAILS Line 2802  PCRE REGULAR EXPRESSION DETAILS
2802                  also "possessive quantifier"                  also "possessive quantifier"
2803           {      start min/max quantifier           {      start min/max quantifier
2804    
2805         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
2806         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2807    
2808           \      general escape character           \      general escape character
# Line 2420  PCRE REGULAR EXPRESSION DETAILS Line 2812  PCRE REGULAR EXPRESSION DETAILS
2812                    syntax)                    syntax)
2813           ]      terminates the character class           ]      terminates the character class
2814    
2815         The following sections describe the use of each of the  metacharacters.         The  following sections describe the use of each of the metacharacters.
2816    
2817    
2818  BACKSLASH  BACKSLASH
2819    
2820         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2821         a non-alphanumeric character, it takes away any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
2822         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
2823         applies both inside and outside character classes.         applies both inside and outside character classes.
2824    
2825         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
2826         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
2827         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
2828         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
2829         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
2830         slash, you write \\.         slash, you write \\.
2831    
2832         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
2833         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
2834         # outside a character class and the next newline character are ignored.         # outside a character class and the next newline are ignored. An escap-
2835         An escaping backslash can be used to include a whitespace or #  charac-         ing  backslash  can  be  used to include a whitespace or # character as
2836         ter as part of the pattern.         part of the pattern.
2837    
2838         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
2839         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
2840         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
2841         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
2842         tion. Note the following examples:         tion. Note the following examples:
2843    
2844           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 2456  BACKSLASH Line 2848  BACKSLASH
2848           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2849           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2850    
2851         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
2852         classes.         classes.
2853    
2854     Non-printing characters     Non-printing characters
2855    
2856         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2857         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
2858         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
2859         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
2860         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
2861         sequences than the binary character it represents:         sequences than the binary character it represents:
2862    
2863           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 2477  BACKSLASH Line 2869  BACKSLASH
2869           \t        tab (hex 09)           \t        tab (hex 09)
2870           \ddd      character with octal code ddd, or backreference           \ddd      character with octal code ddd, or backreference
2871           \xhh      character with hex code hh           \xhh      character with hex code hh
2872           \x{hhh..} character with hex code hhh... (UTF-8 mode only)           \x{hhh..} character with hex code hhh..
2873    
2874         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
2875         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
2876         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
2877         becomes hex 7B.         becomes hex 7B.
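
The \cx rule above can be sketched in C as follows. This is only an illustration of the arithmetic described in the text, assuming an ASCII character set; the function name control_escape is hypothetical and is not part of the PCRE API.

```c
/* Hypothetical helper (not a PCRE function): the character value that
   \cx stands for, assuming x is an ASCII character. */
unsigned char control_escape(unsigned char x)
{
    if (x >= 'a' && x <= 'z')
        x = (unsigned char)(x - 'a' + 'A');   /* lower case becomes upper case */
    return (unsigned char)(x ^ 0x40);         /* invert bit 6 (hex 40) */
}
```

With this sketch, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, in line with the examples in the text.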
2878    
2879         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
2880         in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-         in upper or lower case). Any number of hexadecimal  digits  may  appear
2881         its may appear between \x{ and }, but the value of the  character  code         between  \x{  and  },  but the value of the character code must be less
2882         must  be  less  than  2**31  (that is, the maximum hexadecimal value is         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2883         7FFFFFFF). If characters other than hexadecimal digits  appear  between         the  maximum  hexadecimal  value is 7FFFFFFF). If characters other than
2884         \x{  and }, or if there is no terminating }, this form of escape is not         hexadecimal digits appear between \x{ and }, or if there is  no  termi-
2885         recognized. Instead, the initial \x will  be  interpreted  as  a  basic         nating  }, this form of escape is not recognized.  Instead, the initial
2886         hexadecimal  escape, with no following digits, giving a character whose         \x will be interpreted as a basic hexadecimal escape, with no following
2887         value is zero.         digits, giving a character whose value is zero.
2888    
2889         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2890         two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference         two syntaxes for \x. There is no difference in the way  they  are  han-
2891         in the way they are handled. For example, \xdc is exactly the  same  as         dled. For example, \xdc is exactly the same as \x{dc}.
2892         \x{dc}.  
2893           After  \0  up  to two further octal digits are read. If there are fewer
2894         After  \0  up  to  two further octal digits are read. In both cases, if         than two digits, just  those  that  are  present  are  used.  Thus  the
2895         there are fewer than two digits, just those that are present are  used.         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2896         Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL         (code value 7). Make sure you supply two digits after the initial  zero
2897         character (code value 7). Make sure you supply  two  digits  after  the         if the pattern character that follows is itself an octal digit.
        initial  zero  if the pattern character that follows is itself an octal  
        digit.  
2898    
2899         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2900         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2901         its as a decimal number. If the number is less than  10,  or  if  there         its  as  a  decimal  number. If the number is less than 10, or if there
2902         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2903         expression, the entire  sequence  is  taken  as  a  back  reference.  A         expression,  the  entire  sequence  is  taken  as  a  back reference. A
2904         description  of how this works is given later, following the discussion         description of how this works is given later, following the  discussion
2905         of parenthesized subpatterns.         of parenthesized subpatterns.
2906    
2907         Inside a character class, or if the decimal number is  greater  than  9         Inside  a  character  class, or if the decimal number is greater than 9
2908         and  there have not been that many capturing subpatterns, PCRE re-reads         and there have not been that many capturing subpatterns, PCRE  re-reads
2909         up to three octal digits following the backslash, and generates a  sin-         up to three octal digits following the backslash, and uses them to gen-
2910         gle byte from the least significant 8 bits of the value. Any subsequent         erate a data character. Any subsequent digits stand for themselves.  In
2911         digits stand for themselves.  For example:         non-UTF-8  mode,  the  value  of a character specified in octal must be
2912           less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
2913           example:
2914    
2915           \040   is another way of writing a space           \040   is another way of writing a space
2916           \40    is the same, provided there are fewer than 40           \40    is the same, provided there are fewer than 40
# Line 2535  BACKSLASH Line 2927  BACKSLASH
2927           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2928                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2929    
2930         Note that octal values of 100 or greater must not be  introduced  by  a         Note  that  octal  values of 100 or greater must not be introduced by a
2931         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
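
The two-stage reading of a backslash followed by a non-zero digit might be modelled as below. This is a sketch of the rule as described here, not PCRE's actual source: the type digit_escape, the function classify_digit_escape, and its parameters are all hypothetical names. The digits argument is assumed to point at the first digit after the backslash, and that digit is assumed to be 1 to 9. The per-mode limits on octal values (\400 and \777) are not checked.

```c
#include <stddef.h>

/* Hypothetical result type for this illustration. */
typedef struct {
    int    is_backref;  /* 1: back reference, 0: character given in octal */
    int    value;       /* group number, or character code */
    size_t consumed;    /* number of digits used after the backslash */
} digit_escape;

/* groups_so_far: capturing left parentheses seen earlier in the pattern.
   in_class:      non-zero when the escape is inside a character class. */
digit_escape classify_digit_escape(const char *digits, int groups_so_far,
                                   int in_class)
{
    digit_escape r = {0, 0, 0};
    size_t i = 0;
    long n = 0;

    /* First reading: take all following digits as a decimal number. */
    while (digits[i] >= '0' && digits[i] <= '9') {
        if (n < 1000000)                 /* crude guard against overflow */
            n = n * 10 + (digits[i] - '0');
        i++;
    }
    if (!in_class && (n < 10 || n <= groups_so_far)) {
        r.is_backref = 1;
        r.value = (int)n;
        r.consumed = i;
        return r;
    }

    /* Second reading: up to three octal digits form a character code.
       Any digits left over stand for themselves. */
    for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
        r.value = r.value * 8 + (digits[i] - '0');
    r.consumed = i;
    return r;
}
```

For "81" with no capturing groups, the sketch consumes no digits and yields the value zero, which corresponds to the "binary zero followed by the two characters" case in the table above.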
2932    
2933         All  the  sequences  that  define a single byte value or a single UTF-8         All the sequences that define a single character value can be used both
2934         character (in UTF-8 mode) can be used both inside and outside character         inside  and  outside character classes. In addition, inside a character
2935         classes.  In  addition,  inside  a  character class, the sequence \b is         class, the sequence \b is interpreted as the backspace  character  (hex
2936         interpreted as the backspace character (hex 08), and the sequence \X is         08),  and the sequences \R and \X are interpreted as the characters "R"
2937         interpreted  as  the  character  "X".  Outside a character class, these         and "X", respectively. Outside a character class, these sequences  have
2938         sequences have different meanings (see below).         different meanings (see below).
2939    
2940       Absolute and relative back references
2941    
2942           The  sequence  \g followed by a positive or negative number, optionally
2943           enclosed in braces, is an absolute or relative back reference. A  named
2944           back  reference can be coded as \g{name}. Back references are discussed
2945           later, following the discussion of parenthesized subpatterns.
2946    
2947     Generic character types     Generic character types
2948    
2949         The third use of backslash is for specifying generic  character  types.         Another use of backslash is for specifying generic character types. The
2950         The following are always recognized:         following are always recognized:
2951    
2952           \d     any decimal digit           \d     any decimal digit
2953           \D     any character that is not a decimal digit           \D     any character that is not a decimal digit
2954             \h     any horizontal whitespace character
2955             \H     any character that is not a horizontal whitespace character
2956           \s     any whitespace character           \s     any whitespace character
2957           \S     any character that is not a whitespace character           \S     any character that is not a whitespace character
2958             \v     any vertical whitespace character
2959             \V     any character that is not a vertical whitespace character
2960           \w     any "word" character           \w     any "word" character
2961           \W     any "non-word" character           \W     any "non-word" character
2962    
# Line 2568  BACKSLASH Line 2971  BACKSLASH
2971    
2972         For compatibility with Perl, \s does not match the VT  character  (code         For compatibility with Perl, \s does not match the VT  character  (code
2973         11).   This makes it different from the POSIX "space" class. The \s         11).   This makes it different from the POSIX "space" class. The \s
2974         characters are HT (9), LF (10), FF (12), CR (13), and space (32).         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
2975           "use locale;" is included in a Perl script, \s may match the VT charac-
2976           ter. In PCRE, it never does.
2977    
2978           In UTF-8 mode, characters with values greater than 128 never match  \d,
2979           \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2980           code character property support is available.  These  sequences  retain
2981           their original meanings from before UTF-8 support was available, mainly
2982           for efficiency reasons.
2983    
2984           The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
2985           the  other  sequences, these do match certain high-valued codepoints in
2986           UTF-8 mode.  The horizontal space characters are:
2987    
2988             U+0009     Horizontal tab
2989             U+0020     Space
2990             U+00A0     Non-break space
2991             U+1680     Ogham space mark
2992             U+180E     Mongolian vowel separator
2993             U+2000     En quad
2994             U+2001     Em quad
2995             U+2002     En space
2996             U+2003     Em space
2997             U+2004     Three-per-em space
2998             U+2005     Four-per-em space
2999             U+2006     Six-per-em space
3000             U+2007     Figure space
3001             U+2008     Punctuation space
3002             U+2009     Thin space
3003             U+200A     Hair space
3004             U+202F     Narrow no-break space
3005             U+205F     Medium mathematical space
3006             U+3000     Ideographic space
3007    
3008           The vertical space characters are:
3009    
3010             U+000A     Linefeed
3011             U+000B     Vertical tab
3012             U+000C     Formfeed
3013             U+000D     Carriage return
3014             U+0085     Next line
3015             U+2028     Line separator
3016             U+2029     Paragraph separator
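
The two lists above can be restated as simple code point tests. The following C sketch copies the values from the lists; the function names are hypothetical and the code is not taken from PCRE itself.

```c
/* Hypothetical helpers (not PCRE functions): test a Unicode code point
   against the horizontal and vertical space lists for \h and \v. */
int is_horizontal_space(unsigned long c)
{
    switch (c) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680:
    case 0x180E: case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        return c >= 0x2000 && c <= 0x200A;   /* En quad to Hair space */
    }
}

int is_vertical_space(unsigned long c)
{
    switch (c) {
    case 0x000A: case 0x000B: case 0x000C: case 0x000D:
    case 0x0085: case 0x2028: case 0x2029:
        return 1;
    default:
        return 0;
    }
}
```

In this model \H and \V are simply the negations of the two tests.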
3017    
3018         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
3019         is  a  letter  or  digit.  The definition of letters and digits is con-         is  a  letter  or  digit.  The definition of letters and digits is con-
3020         trolled by PCRE's low-valued character tables, and may vary if  locale-         trolled by PCRE's low-valued character tables, and may vary if  locale-
3021         specific  matching is taking place (see "Locale support" in the pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
3022         page). For example, in the  "fr_FR"  (French)  locale,  some  character         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3023         codes  greater  than  128  are used for accented letters, and these are         systems,  or "french" in Windows, some character codes greater than 128
3024         matched by \w.         are used for accented letters, and these are matched by \w. The use  of
3025           locales with Unicode is discouraged.
3026    
3027       Newline sequences
3028    
3029           Outside  a  character class, the escape sequence \R matches any Unicode
3030           newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R  is
3031           equivalent to the following:
3032    
3033             (?>\r\n|\n|\x0b|\f|\r|\x85)
3034    
3035           This  is  an  example  of an "atomic group", details of which are given
3036           below.  This particular group matches either the two-character sequence
3037           CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
3038           U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3039           return, U+000D), or NEL (next line, U+0085). The two-character sequence
3040           is treated as a single unit that cannot be split.
3041    
3042           In UTF-8 mode, two additional characters whose codepoints  are  greater
3043           than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3044           rator, U+2029).  Unicode character property support is not  needed  for
3045           these characters to be recognized.
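
For the non-UTF-8 case, the behaviour of \R at a given position might be modelled by the C sketch below. It is an illustration only: newline_sequence_length is a hypothetical name, the subject is treated as a byte buffer, and the LS and PS characters that are added in UTF-8 mode are deliberately left out.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): number of bytes that \R
   would match at position pos of a non-UTF-8 subject, or 0 if there is
   no newline sequence there. */
size_t newline_sequence_length(const unsigned char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return 0;
    /* CR followed by LF is tried first and taken as a single unit. */
    if (s[pos] == '\r' && pos + 1 < len && s[pos + 1] == '\n')
        return 2;
    switch (s[pos]) {
    case '\n':   /* LF  */
    case 0x0B:   /* VT  */
    case '\f':   /* FF  */
    case '\r':   /* CR  */
    case 0x85:   /* NEL */
        return 1;
    default:
        return 0;
    }
}
```

Because the CR LF pair is checked before the single characters, a caller never sees the pair split, which reflects the effect of the atomic group.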
3046    
3047         In UTF-8 mode, characters with values greater than 128 never match  \d,         Inside a character class, \R matches the letter "R".
        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-  
        code character property support is available.  
3048    
3049     Unicode character properties     Unicode character properties
3050    
3051         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
3052         tional  escape sequences to match generic character types are available         tional escape sequences that match characters with specific  properties
3053         when UTF-8 mode is selected. They are:         are  available.   When not in UTF-8 mode, these sequences are of course
3054           limited to testing characters whose codepoints are less than  256,  but
3055          \p{xx}   a character with the xx property         they do work in this mode.  The extra escape sequences are:
3056          \P{xx}   a character without the xx property  
3057          \X       an extended Unicode sequence           \p{xx}   a character with the xx property
3058             \P{xx}   a character without the xx property
3059         The property names represented by xx above are limited to  the  Unicode           \X       an extended Unicode sequence
3060         general  category properties. Each character has exactly one such prop-  
3061         erty, specified by a two-letter abbreviation.  For  compatibility  with         The  property  names represented by xx above are limited to the Unicode
3062         Perl,  negation  can be specified by including a circumflex between the         script names, the general category properties, and "Any", which matches
3063         opening brace and the property name. For example, \p{^Lu} is  the  same         any character (including newline). Other properties such as "InMusical-
3064         as \P{Lu}.         Symbols" are not currently supported by PCRE. Note  that  \P{Any}  does
3065           not match any characters, so always causes a match failure.
3066         If  only  one  letter  is  specified with \p or \P, it includes all the  
3067         properties that start with that letter. In this case, in the absence of         Sets of Unicode characters are defined as belonging to certain scripts.
3068         negation, the curly brackets in the escape sequence are optional; these         A character from one of these sets can be matched using a script  name.
3069         two examples have the same effect:         For example:
3070    
3071             \p{Greek}
3072             \P{Han}
3073    
3074           Those  that are not part of an identified script are lumped together as
3075           "Common". The current list of scripts is:
3076    
3077           Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
3078           Buhid,   Canadian_Aboriginal,   Cherokee,  Common,  Coptic,  Cuneiform,
3079           Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
3080           Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
3081           gana, Inherited, Kannada,  Katakana,  Kharoshthi,  Khmer,  Lao,  Latin,
3082           Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
3083           Ogham, Old_Italic, Old_Persian, Oriya, Osmanya,  Phags_Pa,  Phoenician,
3084           Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
3085           Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
3086    
3087           Each character has exactly one general category property, specified  by
3088           a two-letter abbreviation. For compatibility with Perl, negation can be
3089           specified by including a circumflex between the opening brace  and  the
3090           property name. For example, \p{^Lu} is the same as \P{Lu}.
3091    
3092           If only one letter is specified with \p or \P, it includes all the gen-
3093           eral category properties that start with that letter. In this case,  in
3094           the  absence of negation, the curly brackets in the escape sequence are
3095           optional; these two examples have the same effect:
3096    
3097           \p{L}           \p{L}
3098           \pL           \pL
3099    
3100         The following property codes are supported:         The following general category property codes are supported:
3101    
3102           C     Other           C     Other
3103           Cc    Control           Cc    Control
# Line 2653  BACKSLASH Line 3143  BACKSLASH
3143           Zp    Paragraph separator           Zp    Paragraph separator
3144           Zs    Space separator           Zs    Space separator
3145    
3146         Extended properties such as "Greek" or "InMusicalSymbols" are not  sup-         The special property L& is also supported: it matches a character  that
3147         ported by PCRE.         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
3148           classified as a modifier or "other".
3149    
3150           The long synonyms for these properties  that  Perl  supports  (such  as
3151           \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
3152           any of these properties with "Is".
3153    
3154           No character that is in the Unicode table has the Cn (unassigned) prop-
3155           erty.  Instead, this property is assumed for any code point that is not
3156           in the Unicode table.
3157    
3158         Specifying  caseless  matching  does not affect these escape sequences.         Specifying caseless matching does not affect  these  escape  sequences.
3159         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3160    
3161         The \X escape matches any number of Unicode  characters  that  form  an         The  \X  escape  matches  any number of Unicode characters that form an
3162         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3163    
3164           (?>\PM\pM*)           (?>\PM\pM*)
3165    
3166         That  is,  it matches a character without the "mark" property, followed         That is, it matches a character without the "mark"  property,  followed
3167         by zero or more characters with the "mark"  property,  and  treats  the         by  zero  or  more  characters with the "mark" property, and treats the
3168         sequence  as  an  atomic group (see below).  Characters with the "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3169         property are typically accents that affect the preceding character.         property  are  typically  accents  that affect the preceding character.
3170           None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X
3171           matches any one character.
3172    
3173         Matching characters by Unicode property is not fast, because  PCRE  has         Matching  characters  by Unicode property is not fast, because PCRE has
3174         to  search  a  structure  that  contains data for over fifteen thousand         to search a structure that contains  data  for  over  fifteen  thousand
3175         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
3176         \w do not use Unicode properties in PCRE.         \w do not use Unicode properties in PCRE.
3177    
3178       Resetting the match start
3179    
3180           The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
3181           ously  matched  characters  not  to  be  included  in the final matched
3182           sequence. For example, the pattern:
3183    
3184             foo\Kbar
3185    
3186           matches "foobar", but reports that it has matched "bar".  This  feature
3187           is  similar  to  a lookbehind assertion (described below).  However, in
3188           this case, the part of the subject before the real match does not  have
3189           to  be of fixed length, as lookbehind assertions do. The use of \K does
3190           not interfere with the setting of captured  substrings.   For  example,
3191           when the pattern
3192    
3193             (foo)\Kbar
3194    
3195           matches "foobar", the first substring is still set to "foo".
3196    
3197     Simple assertions     Simple assertions
3198    
3199         The fourth use of backslash is for certain simple assertions. An asser-         The  final use of backslash is for certain simple assertions. An asser-
3200         tion specifies a condition that has to be met at a particular point  in         tion specifies a condition that has to be met at a particular point  in
3201         a  match, without consuming any characters from the subject string. The         a  match, without consuming any characters from the subject string. The
3202         use of subpatterns for more complicated assertions is described  below.         use of subpatterns for more complicated assertions is described  below.
# Line 2684  BACKSLASH Line 3204  BACKSLASH
3204    
3205           \b     matches at a word boundary           \b     matches at a word boundary
3206           \B     matches when not at a word boundary           \B     matches when not at a word boundary
3207           \A     matches at start of subject           \A     matches at the start of the subject
3208           \Z     matches at end of subject or before newline at end           \Z     matches at the end of the subject
3209           \z     matches at end of subject                   also matches before a newline at the end of the subject
3210           \G     matches at first matching position in subject           \z     matches only at the end of the subject
3211             \G     matches at the first matching position in the subject
3212    
3213         These  assertions may not appear in character classes (but note that \b         These  assertions may not appear in character classes (but note that \b
3214         has a different meaning, namely the backspace character, inside a char-         has a different meaning, namely the backspace character, inside a char-
# Line 2707  BACKSLASH Line 3228  BACKSLASH
3228         However, if the startoffset argument of pcre_exec() is non-zero,  indi-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
3229         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
3230         the subject, \A can never match. The difference between \Z  and  \z  is         the subject, \A can never match. The difference between \Z  and  \z  is
3231         that  \Z  matches  before  a  newline that is the last character of the         that \Z matches before a newline at the end of the string as well as at
3232         string as well as at the end of the string, whereas \z matches only  at         the very end, whereas \z matches only at the end.
3233         the end.  
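
The difference between \Z and \z comes down to one extra position. The C sketch below shows the two conditions, assuming that the newline is a single LF character (PCRE can be built or run with other newline conventions, which this sketch ignores). The function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: \z is true only at the very end of the subject. */
int at_end_lower_z(size_t len, size_t pos)
{
    return pos == len;
}

/* Hypothetical helper: \Z is also true just before a final newline.
   Assumes the newline convention is a single LF. */
int at_end_upper_Z(const char *subject, size_t len, size_t pos)
{
    return pos == len || (pos + 1 == len && subject[pos] == '\n');
}
```

For the four-character subject "abc" followed by LF, position 3 satisfies the \Z test but not the \z test, while position 4 satisfies both.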
3234           The \G assertion is true only when the current matching position is  at
3235         The  \G assertion is true only when the current matching position is at         the  start point of the match, as specified by the startoffset argument
3236         the start point of the match, as specified by the startoffset  argument         of pcre_exec(). It differs from \A when the  value  of  startoffset  is
3237         of  pcre_exec().  It  differs  from \A when the value of startoffset is         non-zero.  By calling pcre_exec() multiple times with appropriate argu-
        non-zero. By calling pcre_exec() multiple times with appropriate  argu-  
3238         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
3239         mentation where \G can be useful.         mentation where \G can be useful.
3240    
3241         Note, however, that PCRE's interpretation of \G, as the  start  of  the         Note,  however,  that  PCRE's interpretation of \G, as the start of the
3242         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
3243         end of the previous match. In Perl, these can  be  different  when  the         end  of  the  previous  match. In Perl, these can be different when the
3244         previously  matched  string was empty. Because PCRE does just one match         previously matched string was empty. Because PCRE does just  one  match
3245         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
3246    
3247         If all the alternatives of a pattern begin with \G, the  expression  is         If  all  the alternatives of a pattern begin with \G, the expression is
3248         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
3249         in the compiled regular expression.         in the compiled regular expression.
3250    
# Line 2732  BACKSLASH Line 3252  BACKSLASH
3252  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
3253    
3254         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
3255         character  is  an  assertion  that is true only if the current matching         character is an assertion that is true only  if  the  current  matching
3256         point is at the start of the subject string. If the  startoffset  argu-         point  is  at the start of the subject string. If the startoffset argu-
3257         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
3258         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
3259         has an entirely different meaning (see below).         has an entirely different meaning (see below).
3260    
3261         Circumflex  need  not be the first character of the pattern if a number         Circumflex need not be the first character of the pattern if  a  number
3262         of alternatives are involved, but it should be the first thing in  each         of  alternatives are involved, but it should be the first thing in each
3263         alternative  in  which  it appears if the pattern is ever to match that         alternative in which it appears if the pattern is ever  to  match  that
3264         branch. If all possible alternatives start with a circumflex, that  is,         branch.  If all possible alternatives start with a circumflex, that is,
3265         if  the  pattern  is constrained to match only at the start of the sub-         if the pattern is constrained to match only at the start  of  the  sub-
3266         ject, it is said to be an "anchored" pattern.  (There  are  also  other         ject,  it  is  said  to be an "anchored" pattern. (There are also other
3267         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
3268    
3269         A  dollar  character  is  an assertion that is true only if the current         A dollar character is an assertion that is true  only  if  the  current
3270         matching point is at the end of  the  subject  string,  or  immediately         matching  point  is  at  the  end of the subject string, or immediately
3271         before a newline character that is the last character in the string (by         before a newline at the end of the string (by default). Dollar need not
3272         default). Dollar need not be the last character of  the  pattern  if  a         be  the  last  character of the pattern if a number of alternatives are
3273         number  of alternatives are involved, but it should be the last item in         involved, but it should be the last item in  any  branch  in  which  it
3274         any branch in which it appears.  Dollar has no  special  meaning  in  a         appears. Dollar has no special meaning in a character class.
        character class.  
3275    
3276         The  meaning  of  dollar  can be changed so that it matches only at the         The  meaning  of  dollar  can be changed so that it matches only at the
3277         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
3278         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
3279    
3280         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
3281         PCRE_MULTILINE option is set. When this is the case, they match immedi-         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
3282         ately  after  and  immediately  before  an  internal newline character,         matches  immediately after internal newlines as well as at the start of
3283         respectively, in addition to matching at the start and end of the  sub-         the subject string. It does not match after a  newline  that  ends  the
3284         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject         string.  A dollar matches before any newlines in the string, as well as
3285         string "def\nabc" (where \n represents a newline character)  in  multi-         at the very end, when PCRE_MULTILINE is set. When newline is  specified
3286         line mode, but not otherwise.  Consequently, patterns that are anchored         as  the  two-character  sequence CRLF, isolated CR and LF characters do
3287         in single line mode because all branches start with ^ are not  anchored         not indicate newlines.
3288         in  multiline  mode,  and  a  match for circumflex is possible when the  
3289         startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-         For example, the pattern /^abc$/ matches the subject string  "def\nabc"
3290         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.         (where  \n  represents a newline) in multiline mode, but not otherwise.
3291           Consequently, patterns that are anchored in single  line  mode  because
3292         Note  that  the sequences \A, \Z, and \z can be used to match the start         all  branches  start  with  ^ are not anchored in multiline mode, and a
3293         and end of the subject in both modes, and if all branches of a  pattern         match for circumflex is  possible  when  the  startoffset  argument  of
3294         start  with  \A it is always anchored, whether PCRE_MULTILINE is set or         pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
3295         not.         PCRE_MULTILINE is set.
3296    
3297           Note that the sequences \A, \Z, and \z can be used to match  the  start
3298           and  end of the subject in both modes, and if all branches of a pattern
3299           start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
3300           set.
3301    
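The /^abc$/ example can be tried out with a short sketch. Python's re module is used as a stand-in, on the assumption that re.MULTILINE behaves like PCRE_MULTILINE for a subject that uses a single LF as its newline; this is not the PCRE API, and the CRLF rules described above are not reproduced.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")

# \A stays anchored to the start of the subject even in multiline mode.
anchored = re.search(r"\Aabc", subject, re.MULTILINE)
```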
3302    
3303  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
3304    
3305         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
3306         ter  in  the  subject,  including a non-printing character, but not (by         ter in the subject string except (by default) a character  that  signi-
3307         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,         fies  the  end  of  a line. In UTF-8 mode, the matched character may be
3308         which might be more than one byte long, except (by default) newline. If         more than one byte long.
3309         the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-  
3310         dling  of dot is entirely independent of the handling of circumflex and         When a line ending is defined as a single character, dot never  matches
3311         dollar, the only relationship being  that  they  both  involve  newline         that  character; when the two-character sequence CRLF is used, dot does
3312         characters. Dot has no special meaning in a character class.         not match CR if it is immediately followed  by  LF,  but  otherwise  it
3313           matches  all characters (including isolated CRs and LFs). When any Uni-
3314           code line endings are being recognized, dot does not match CR or LF  or
3315           any of the other line ending characters.
3316    
3317           The  behaviour  of  dot  with regard to newlines can be changed. If the
3318           PCRE_DOTALL option is set, a dot matches  any  one  character,  without
3319           exception. If the two-character sequence CRLF is present in the subject
3320           string, it takes two dots to match it.
3321    
3322           The handling of dot is entirely independent of the handling of  circum-
3323           flex  and  dollar,  the  only relationship being that they both involve
3324           newlines. Dot has no special meaning in a character class.
3325    
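A minimal sketch of the dot behaviour, again using Python's re module as an assumed stand-in (re.DOTALL in place of PCRE_DOTALL). Python treats only LF as a line ending for this purpose, so the CRLF and Unicode line-ending variants described above are not covered.

```python
import re

# Without DOTALL, a dot does not match the newline character.
dot_default = re.match(r".", "\n")

# With DOTALL, a dot matches any one character, without exception.
dot_all = re.match(r".", "\n", re.DOTALL)

# CRLF is two characters, so it takes two dots to match it.
crlf_one_dot = re.fullmatch(r".", "\r\n", re.DOTALL)
crlf_two_dots = re.fullmatch(r"..", "\r\n", re.DOTALL)
```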
3326    
3327  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
3328    
3329         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
3330         both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.         both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any
3331         The  feature  is provided in Perl in order to match individual bytes in         line-ending characters. The feature is provided in  Perl  in  order  to
3332         UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual         match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
3333         bytes,  what remains in the string may be a malformed UTF-8 string. For         acters into individual bytes, what remains in the string may be a  mal-
3334         this reason, the \C escape sequence is best avoided.         formed  UTF-8  string.  For this reason, the \C escape sequence is best
3335           avoided.
3336    
3337         PCRE does not allow \C to appear in  lookbehind  assertions  (described         PCRE does not allow \C to appear in  lookbehind  assertions  (described
3338         below),  because  in UTF-8 mode this would make it impossible to calcu-         below),  because  in UTF-8 mode this would make it impossible to calcu-
# Line 2842  SQUARE BRACKETS AND CHARACTER CLASSES Line 3379  SQUARE BRACKETS AND CHARACTER CLASSES
3379         PCRE  is  compiled  with Unicode property support as well as with UTF-8         PCRE  is  compiled  with Unicode property support as well as with UTF-8
3380         support.         support.
3381    
3382         The newline character is never treated in any special way in  character         Characters that might indicate line breaks are  never  treated  in  any
3383         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE         special  way  when  matching  character  classes,  whatever line-ending
3384         options is. A class such as [^a] will always match a newline.         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
3385           PCRE_MULTILINE options is used. A class such as [^a] always matches one
3386           of these characters.
3387    
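The point about negated classes can be checked with a small sketch. It uses Python's re module as a stand-in for PCRE, with re.DOTALL and re.MULTILINE assumed to correspond to the PCRE options of similar name.

```python
import re

# A negated class matches a newline whatever options are set.
plain = re.match(r"[^a]", "\n")
with_dotall = re.match(r"[^a]", "\n", re.DOTALL)
with_multiline = re.match(r"[^a]", "\n", re.MULTILINE)

all_match = all(m is not None for m in (plain, with_dotall, with_multiline))
```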
3388         The minus (hyphen) character can be used to specify a range of  charac-         The minus (hyphen) character can be used to specify a range of  charac-
3389         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
# Line 2870  SQUARE BRACKETS AND CHARACTER CLASSES Line 3409  SQUARE BRACKETS AND CHARACTER CLASSES
3409         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
3410         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
3411         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
3412         character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches         character tables for a French locale are in  use,  [\xc8-\xcb]  matches
3413         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
3414         concept of case for characters with values greater than 128  only  when         concept of case for characters with values greater than 128  only  when
3415         it is compiled with Unicode property support.         it is compiled with Unicode property support.
# Line 2945  VERTICAL BAR Line 3484  VERTICAL BAR
3484    
3485         matches  either "gilbert" or "sullivan". Any number of alternatives may         matches  either "gilbert" or "sullivan". Any number of alternatives may
3486         appear, and an empty  alternative  is  permitted  (matching  the  empty         appear, and an empty  alternative  is  permitted  (matching  the  empty
3487         string).   The  matching  process  tries each alternative in turn, from         string). The matching process tries each alternative in turn, from left
3488         left to right, and the first one that succeeds is used. If the alterna-         to right, and the first one that succeeds is used. If the  alternatives
3489         tives  are within a subpattern (defined below), "succeeds" means match-         are  within a subpattern (defined below), "succeeds" means matching the
3490         ing the rest of the main pattern as well as the alternative in the sub-         rest of the main pattern as well as the alternative in the  subpattern.
        pattern.  
3491    
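The left-to-right rule for alternatives can be illustrated as follows. This is a sketch using Python's re module, which is assumed to share this ordering rule with PCRE; it is not the PCRE API.

```python
import re

# The first alternative that succeeds is used, even if a later one
# would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the
# pattern: "a" alone cannot be followed by "c", so "ab" is chosen.
inside_group = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alternative = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)
```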
3492    
3493  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
# Line 2977  INTERNAL OPTION SETTING Line 3515  INTERNAL OPTION SETTING
3515         PCRE extracts it into the global options (and it will therefore show up         PCRE extracts it into the global options (and it will therefore show up
3516         in data extracted by the pcre_fullinfo() function).         in data extracted by the pcre_fullinfo() function).
3517    
3518         An option change within a subpattern affects only that part of the cur-         An  option  change  within a subpattern (see below for a description of
3519         rent pattern that follows it, so         subpatterns) affects only that part of the current pattern that follows
3520           it, so
3521    
3522           (a(?i)b)c           (a(?i)b)c
3523    
3524         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
3525         used).   By  this means, options can be made to have different settings         used).  By this means, options can be made to have  different  settings
3526         in different parts of the pattern. Any changes made in one  alternative         in  different parts of the pattern. Any changes made in one alternative
3527         do  carry  on  into subsequent branches within the same subpattern. For         do carry on into subsequent branches within the  same  subpattern.  For
3528         example,         example,
3529    
3530           (a(?i)b|c)           (a(?i)b|c)
3531    
3532         matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the         matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
3533         first  branch  is  abandoned before the option setting. This is because         first branch is abandoned before the option setting.  This  is  because
3534         the effects of option settings happen at compile time. There  would  be         the  effects  of option settings happen at compile time. There would be
3535         some very weird behaviour otherwise.         some very weird behaviour otherwise.
3536    
3537         The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
3538         in the same way as the Perl-compatible options by using the  characters         can  be changed in the same way as the Perl-compatible options by using
3539         U  and X respectively. The (?X) flag setting is special in that it must         the characters J, U and X respectively.
        always occur earlier in the pattern than any of the additional features  
        it  turns on, even when it is at top level. It is best to put it at the  
        start.  
3540    
3541    
3542  SUBPATTERNS  SUBPATTERNS
# Line 3013  SUBPATTERNS Line 3549  SUBPATTERNS
3549           cat(aract|erpillar|)           cat(aract|erpillar|)
3550    
3551         matches  one  of the words "cat", "cataract", or "caterpillar". Without         matches  one  of the words "cat", "cataract", or "caterpillar". Without
3552         the parentheses, it would match "cataract",  "erpillar"  or  the  empty         the parentheses, it would match  "cataract",  "erpillar"  or  an  empty
3553         string.         string.
3554    
3555         2.  It  sets  up  the  subpattern as a capturing subpattern. This means         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
# Line 3042  SUBPATTERNS Line 3578  SUBPATTERNS
3578           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
3579    
3580         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
3581         1  and 2. The maximum number of capturing subpatterns is 65535, and the         1 and 2. The maximum number of capturing subpatterns is 65535.
        maximum depth of nesting of all subpatterns, both  capturing  and  non-  
        capturing, is 200.  
3582    
3583         As  a  convenient shorthand, if any option settings are required at the         As  a  convenient shorthand, if any option settings are required at the
3584         start of a non-capturing subpattern,  the  option  letters  may  appear         start of a non-capturing subpattern,  the  option  letters  may  appear
# Line 3060  SUBPATTERNS Line 3594  SUBPATTERNS
3594         "Saturday".         "Saturday".
3595    
3596    
3597    DUPLICATE SUBPATTERN NUMBERS
3598    
3599           Perl 5.10 introduced a feature whereby each alternative in a subpattern
3600           uses the same numbers for its capturing parentheses. Such a  subpattern
3601           starts  with (?| and is itself a non-capturing subpattern. For example,
3602           consider this pattern:
3603    
3604             (?|(Sat)ur|(Sun))day
3605    
3606           Because the two alternatives are inside a (?| group, both sets of  cap-
3607           turing  parentheses  are  numbered one. Thus, when the pattern matches,
3608           you can look at captured substring number  one,  whichever  alternative
3609           matched.  This  construct  is useful when you want to capture part, but
3610           not all, of one of a number of alternatives. Inside a (?| group, paren-
3611           theses  are  numbered as usual, but the number is reset at the start of
3612           each branch. The numbers of any capturing buffers that follow the  sub-
3613           pattern  start after the highest number used in any branch. The follow-
3614           ing example is taken from the Perl documentation.  The  numbers  under-
3615           neath show in which buffer the captured content will be stored.
3616    
3617             # before  ---------------branch-reset----------- after
3618             / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3619             # 1            2         2  3        2     3     4
3620    
3621           A  backreference  or  a  recursive call to a numbered subpattern always
3622           refers to the first one in the pattern with the given number.
3623    
3624           An alternative approach to using this "branch reset" feature is to  use
3625           duplicate named subpatterns, as described in the next section.
3626    
3627    
3628  NAMED SUBPATTERNS  NAMED SUBPATTERNS
3629    
3630         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying  capturing  parentheses  by number is simple, but it can be
3631         very hard to keep track of the numbers in complicated  regular  expres-         very hard to keep track of the numbers in complicated  regular  expres-
3632         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions.  Furthermore,  if  an  expression  is  modified, the numbers may
3633         change. To help with this difficulty, PCRE supports the naming of  sub-         change. To help with this difficulty, PCRE supports the naming of  sub-
3634         patterns,  something  that  Perl  does  not  provide. The Python syntax         patterns. This feature was not added to Perl until release 5.10. Python
3635         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and         had the feature earlier, and PCRE introduced it at release  4.0,  using
3636         underscores, and must be unique within a pattern.         the  Python syntax. PCRE now supports both the Perl and the Python syn-
3637           tax.
3638    
3639           In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
3640           or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
3641           to capturing parentheses from other parts of the pattern, such as back-
3642           references,  recursion,  and conditions, can be made by name as well as
3643           by number.
3644    
3645           Names consist of up to  32  alphanumeric  characters  and  underscores.
3646         Named  capturing  parentheses  are  still  allocated numbers as well as         Named  capturing  parentheses  are  still  allocated numbers as well as
3647         names. The PCRE API provides function calls for extracting the name-to-         names, exactly as if the names were not present. The PCRE API  provides
3648         number  translation table from a compiled pattern. There is also a con-         function calls for extracting the name-to-number translation table from
3649         venience function for extracting a captured substring by name. For fur-         a compiled pattern. There is also a convenience function for extracting
3650         ther details see the pcreapi documentation.         a captured substring by name.
3651    
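The relationship between names and numbers can be sketched with the Python syntax (?P<name>...), which PCRE also accepts. Python's re module is used here as a stand-in; its groupindex attribute is only an analogue of the name-to-number table that the PCRE API exposes, not the same interface. The pattern and the names year and month are illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2007-07")

# Named groups are still numbered exactly as if the names were absent.
name_to_number = dict(date.groupindex)

# The same captured substring is reachable by name or by number.
by_name = m.group("year")
by_number = m.group(1)
```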
3652           By  default, a name must be unique within a pattern, but it is possible
3653           to relax this constraint by setting the PCRE_DUPNAMES option at compile
3654           time.  This  can  be useful for patterns where only one instance of the
3655           named parentheses can match. Suppose you want to match the  name  of  a
3656           weekday,  either as a 3-letter abbreviation or as the full name, and in
3657           both cases you want to extract the abbreviation. This pattern (ignoring
3658           the line breaks) does the job:
3659    
3660             (?<DN>Mon|Fri|Sun)(?:day)?|
3661             (?<DN>Tue)(?:sday)?|
3662             (?<DN>Wed)(?:nesday)?|
3663             (?<DN>Thu)(?:rsday)?|
3664             (?<DN>Sat)(?:urday)?
3665    
3666           There  are  five capturing substrings, but only one is ever set after a
3667           match.  (An alternative way of solving this problem is to use a "branch
3668           reset" subpattern, as described in the previous section.)
3669    
3670           The  convenience  function  for extracting the data by name returns the
3671           substring for the first (and in this example, the only)  subpattern  of
3672           that  name  that  matched.  This saves searching to find which numbered
3673           subpattern it was. If you make a reference to a non-unique  named  sub-
3674           pattern  from elsewhere in the pattern, the one that corresponds to the
3675           lowest number is used. For further details of the interfaces  for  han-
3676           dling named subpatterns, see the pcreapi documentation.
3677    
3678    
3679  REPETITION  REPETITION
# Line 3083  REPETITION Line 3682  REPETITION
3682         following items:         following items:
3683    
3684           a literal data character           a literal data character
3685           the . metacharacter           the dot metacharacter
3686           the \C escape sequence           the \C escape sequence
3687           the \X escape sequence (in UTF-8 mode with Unicode properties)           the \X escape sequence (in UTF-8 mode with Unicode properties)
3688             the \R escape sequence
3689           an escape such as \d that matches a single character           an escape such as \d that matches a single character
3690           a character class           a character class
3691           a back reference (see next section)           a back reference (see next section)
# Line 3125  REPETITION Line 3725  REPETITION
3725         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
3726         the previous item and the quantifier were not present.         the previous item and the quantifier were not present.
3727    
3728         For  convenience  (and  historical compatibility) the three most common         For  convenience, the three most common quantifiers have single-charac-
3729         quantifiers have single-character abbreviations:         ter abbreviations:
3730    
3731           *    is equivalent to {0,}           *    is equivalent to {0,}
3732           +    is equivalent to {1,}           +    is equivalent to {1,}
# Line 3178  REPETITION Line 3778  REPETITION
3778         which matches one digit by preference, but can match two if that is the         which matches one digit by preference, but can match two if that is the
3779         only way the rest of the pattern matches.         only way the rest of the pattern matches.
3780    
3781         If the PCRE_UNGREEDY option is set (an option which is not available in         If the PCRE_UNGREEDY option is set (an option that is not available  in
3782         Perl),  the  quantifiers are not greedy by default, but individual ones         Perl),  the  quantifiers are not greedy by default, but individual ones
3783         can be made greedy by following them with a  question  mark.  In  other         can be made greedy by following them with a  question  mark.  In  other
3784         words, it inverts the default behaviour.         words, it inverts the default behaviour.
# Line 3189  REPETITION Line 3789  REPETITION
3789         minimum or maximum.         minimum or maximum.
3790    
3791         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
3792         alent  to Perl's /s) is set, thus allowing the . to match newlines, the         alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
3793         pattern is implicitly anchored, because whatever follows will be  tried         the pattern is implicitly anchored, because whatever  follows  will  be
3794         against  every character position in the subject string, so there is no         tried  against every character position in the subject string, so there
3795         point in retrying the overall match at any position  after  the  first.         is no point in retrying the overall match at  any  position  after  the
3796         PCRE normally treats such a pattern as though it were preceded by \A.         first.  PCRE  normally treats such a pattern as though it were preceded
3797           by \A.
3798    
3799         In  cases  where  it  is known that the subject string contains no new-         In cases where it is known that the subject  string  contains  no  new-
3800         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-         lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
3801         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
3802    
3803         However,  there is one situation where the optimization cannot be used.         However, there is one situation where the optimization cannot be  used.
3804         When .*  is inside capturing parentheses that  are  the  subject  of  a         When  .*   is  inside  capturing  parentheses that are the subject of a
3805         backreference  elsewhere in the pattern, a match at the start may fail,         backreference elsewhere in the pattern, a match at the start  may  fail
3806         and a later one succeed. Consider, for example:         where a later one succeeds. Consider, for example:
3807    
3808           (.*)abc\1           (.*)abc\1
3809    
3810         If the subject is "xyz123abc123" the match point is the fourth  charac-         If  the subject is "xyz123abc123" the match point is the fourth charac-
3811         ter. For this reason, such a pattern is not implicitly anchored.         ter. For this reason, such a pattern is not implicitly anchored.
3812    
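The backreference example can be checked with a short sketch. Python's re module stands in for PCRE here; whether Python performs the same implicit-anchoring optimization is not assumed, only that the match result is the same.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)

# No match is possible at offsets 0, 1 or 2, because the text captured
# there ("xyz123", "yz123", "z123") does not recur after "abc".
start = m.start()        # zero-based offset of the match point
captured = m.group(1)
```

A zero-based offset of 3 corresponds to the fourth character of the subject, which is why the pattern cannot be treated as if it were preceded by \A.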
3813         When a capturing subpattern is repeated, the value captured is the sub-         When a capturing subpattern is repeated, the value captured is the sub-
# Line 3215  REPETITION Line 3816  REPETITION
3816           (tweedle[dume]{3}\s*)+           (tweedle[dume]{3}\s*)+
3817    
3818         has matched "tweedledum tweedledee" the value of the captured substring         has matched "tweedledum tweedledee" the value of the captured substring
3819         is  "tweedledee".  However,  if there are nested capturing subpatterns,         is "tweedledee". However, if there are  nested  capturing  subpatterns,
3820         the corresponding captured values may have been set in previous  itera-         the  corresponding captured values may have been set in previous itera-
3821         tions. For example, after         tions. For example, after
3822    
3823           /(a|(b))+/           /(a|(b))+/
# Line 3226  REPETITION Line 3827  REPETITION
3827    
3828  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
3829    
3830         With both maximizing and minimizing repetition, failure of what follows         With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
3831         normally causes the repeated item to be re-evaluated to see if  a  dif-         repetition,  failure  of what follows normally causes the repeated item
3832         ferent number of repeats allows the rest of the pattern to match. Some-         to be re-evaluated to see if a different number of repeats  allows  the
3833         times it is useful to prevent this, either to change the nature of  the         rest  of  the pattern to match. Sometimes it is useful to prevent this,
3834         match,  or  to  cause it to fail earlier than it otherwise might, when the         either to change the nature of the match, or to cause it to fail earlier
3835         author of the pattern knows there is no point in carrying on.         than  it otherwise might, when the author of the pattern knows there is
3836           no point in carrying on.
3837    
3838         Consider, for example, the pattern \d+foo when applied to  the  subject         Consider, for example, the pattern \d+foo when applied to  the  subject
3839         line         line
# Line 3245  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3847  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3847         the  means for specifying that once a subpattern has matched, it is not         the  means for specifying that once a subpattern has matched, it is not
3848         to be re-evaluated in this way.         to be re-evaluated in this way.
3849    
3850         If we use atomic grouping for the previous example, the  matcher  would         If we use atomic grouping for the previous example, the  matcher  gives
3851         give up immediately on failing to match "foo" the first time. The nota-         up  immediately  on failing to match "foo" the first time. The notation
3852         tion is a kind of special parenthesis, starting with  (?>  as  in  this         is a kind of special parenthesis, starting with (?> as in this example:
        example:  
3853    
3854           (?>\d+)foo           (?>\d+)foo
3855    
# Line 3280  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3881  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3881         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Possessive  quantifiers  are  always  greedy;  the   setting   of   the
3882         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
3883         simpler forms of atomic group. However, there is no difference  in  the         simpler forms of atomic group. However, there is no difference  in  the
3884         meaning  or  processing  of  a possessive quantifier and the equivalent         meaning  of  a  possessive  quantifier and the equivalent atomic group,
3885         atomic group.         though there may be a performance  difference;  possessive  quantifiers
3886           should be slightly faster.
3887         The possessive quantifier syntax is an extension to the Perl syntax. It  
3888         originates in Sun's Java package.         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
3889           tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
3890         When  a  pattern  contains an unlimited repeat inside a subpattern that         edition of his book. Mike McCloskey liked it, so implemented it when he
3891         can itself be repeated an unlimited number of  times,  the  use  of  an         built Sun's Java package, and PCRE copied it from there. It  ultimately
3892         atomic  group  is  the  only way to avoid some failing matches taking a         found its way into Perl at release 5.10.
3893    
3894           PCRE has an optimization that automatically "possessifies" certain sim-
3895           ple pattern constructs. For example, the sequence  A+B  is  treated  as
3896           A++B  because  there is no point in backtracking into a sequence of A's
3897           when B must follow.
3898    
3899           When a pattern contains an unlimited repeat inside  a  subpattern  that
3900           can  itself  be  repeated  an  unlimited number of times, the use of an
3901           atomic group is the only way to avoid some  failing  matches  taking  a
3902         very long time indeed. The pattern         very long time indeed. The pattern
3903    
3904           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
3905    
3906         matches an unlimited number of substrings that either consist  of  non-         matches  an  unlimited number of substrings that either consist of non-
3907         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
3908         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
3909    
3910           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3911    
3912         it takes a long time before reporting  failure.  This  is  because  the         it  takes  a  long  time  before reporting failure. This is because the
3913         string  can be divided between the internal \D+ repeat and the external         string can be divided between the internal \D+ repeat and the  external
3914         * repeat in a large number of ways, and all  have  to  be  tried.  (The         *  repeat  in  a  large  number of ways, and all have to be tried. (The
3915         example  uses  [!?]  rather than a single character at the end, because         example uses [!?] rather than a single character at  the  end,  because
3916         both PCRE and Perl have an optimization that allows  for  fast  failure         both  PCRE  and  Perl have an optimization that allows for fast failure
3917         when  a single character is used. They remember the last single charac-         when a single character is used. They remember the last single  charac-
3918         ter that is required for a match, and fail early if it is  not  present         ter  that  is required for a match, and fail early if it is not present
3919         in  the  string.)  If  the pattern is changed so that it uses an atomic         in the string.) If the pattern is changed so that  it  uses  an  atomic
3920         group, like this:         group, like this:
3921    
3922           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
3923    
3924         sequences of non-digits cannot be broken, and failure happens  quickly.         sequences  of non-digits cannot be broken, and failure happens quickly.
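
The fast failure described above can be tried out with Python's re module. This is only a sketch under stated assumptions: Python's re is not PCRE, and atomic groups only entered its syntax in Python 3.11, so the example below emulates (?>\D+) with the lookahead plus back reference idiom (?=(\D+))\2. That idiom is a substitute technique, not PCRE's notation. The names PLAIN, ATOMIC_EMULATED and long_subject are illustrative.

```python
import re

# The original pattern: \D+ inside an unlimited * repeat.
PLAIN = re.compile(r'(\D+|<\d+>)*[!?]')

# Emulation of ((?>\D+)|<\d+>)*[!?]. A lookahead never gives back part of
# what it matched, so (?=(\D+))\2 consumes a run of non-digits as one
# unbreakable unit, much as the atomic group (?>\D+) does.
ATOMIC_EMULATED = re.compile(r'((?=(\D+))\2|<\d+>)*[!?]')

# A short failing subject is cheap for both patterns.
short_subject = 'a' * 12
plain_result = PLAIN.match(short_subject)
atomic_result = ATOMIC_EMULATED.match(short_subject)

# The 52-character subject from the text is only tried with the emulated
# atomic form, which fails without exploring every way of splitting the run.
long_subject = 'a' * 52
long_result = ATOMIC_EMULATED.match(long_subject)

# A subject that still matches: digits in <>, followed by !
bracket_result = ATOMIC_EMULATED.match('<12>!')

print(plain_result, atomic_result, long_result)
print(bracket_result.group(0) if bracket_result else None)
```

Note that the emulation adds a capturing group, so group numbers differ from those of the PCRE pattern.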
3925    
3926    
3927  BACK REFERENCES  BACK REFERENCES
3928    
3929         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
3930         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
3931         pattern  earlier  (that is, to its left) in the pattern, provided there         pattern earlier (that is, to its left) in the pattern,  provided  there
3932         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
3933    
3934         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
3935         it  is  always  taken  as a back reference, and causes an error only if         it is always taken as a back reference, and causes  an  error  only  if
3936         there are not that many capturing left parentheses in the  entire  pat-         there  are  not that many capturing left parentheses in the entire pat-
3937         tern.  In  other words, the parentheses that are referenced need not be         tern. In other words, the parentheses that are referenced need  not  be
3938         to the left of the reference for numbers less than 10. See the  subsec-         to  the left of the reference for numbers less than 10. A "forward back
3939         tion  entitled  "Non-printing  characters" above for further details of         reference" of this type can make sense when a  repetition  is  involved
3940         the handling of digits following a backslash.         and  the  subpattern to the right has participated in an earlier itera-
3941           tion.
3942    
3943           It is not possible to have a numerical "forward back  reference"  to  a
3944           subpattern  whose  number  is  10  or  more using this syntax because a
3945           sequence such as \50 is interpreted as a character  defined  in  octal.
3946           See the subsection entitled "Non-printing characters" above for further
3947           details of the handling of digits following a backslash.  There  is  no
3948           such  problem  when named parentheses are used. A back reference to any
3949           subpattern is possible using named parentheses (see below).
3950    
3951           Another way of avoiding the ambiguity inherent in  the  use  of  digits
3952           following a backslash is to use the \g escape sequence, which is a fea-
3953           ture introduced in Perl 5.10. This escape must be followed by  a  posi-
3954           tive  or  a negative number, optionally enclosed in braces. These exam-
3955           ples are all identical:
3956    
3957             (ring), \1
3958             (ring), \g1
3959             (ring), \g{1}
3960    
3961           A positive number specifies an absolute reference without the ambiguity
3962           that  is  present  in  the older syntax. It is also useful when literal
3963           digits follow the reference. A negative number is a relative reference.
3964           Consider this example:
3965    
3966             (abc(def)ghi)\g{-1}
3967    
3968           The sequence \g{-1} is a reference to the most recently started
3969           capturing subpattern before \g, that is, it is equivalent to \2. Similarly,
3970           \g{-2} would be equivalent to \1. The use of relative references can be
3971           helpful in long patterns, and also in  patterns  that  are  created  by
3972           joining together fragments that contain references within themselves.
3973    
3974         A back reference matches whatever actually matched the  capturing  sub-         A  back  reference matches whatever actually matched the capturing sub-
3975         pattern  in  the  current subject string, rather than anything matching         pattern in the current subject string, rather  than  anything  matching
3976         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
3977         of doing that). So the pattern         of doing that). So the pattern
3978    
3979           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3980    
3981         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
3982         not "sense and responsibility". If caseful matching is in force at  the         not  "sense and responsibility". If caseful matching is in force at the
3983         time  of the back reference, the case of letters is relevant. For exam-         time of the back reference, the case of letters is relevant. For  exam-
3984         ple,         ple,
3985    
3986           ((?i)rah)\s+\1           ((?i)rah)\s+\1
3987    
3988         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
3989         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
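
The two back reference examples above can be checked with a short Python sketch. It assumes Python 3.6 or later, where the scoped flag (?i:...) is available; this stands in for the ((?i)rah) form in the text, because Python's re is not PCRE and treats a bare (?i) differently. The helper name whole_match is hypothetical.

```python
import re

# \1 must repeat the text that group 1 actually matched.
PAIR = re.compile(r'(sens|respons)e and \1ibility')

# Caseless matching applies only inside the group; the back reference
# outside it is compared casefully against what the group captured.
RAH = re.compile(r'((?i:rah))\s+\1')


def whole_match(pattern, text):
    """Return True if the whole of text matches pattern."""
    return pattern.fullmatch(text) is not None


for text in ('sense and sensibility',
             'response and responsibility',
             'sense and responsibility'):
    print(text, whole_match(PAIR, text))

for text in ('rah rah', 'RAH RAH', 'RAH rah'):
    print(text, whole_match(RAH, text))
```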
3990    
3991         Back  references  to named subpatterns use the Python syntax (?P=name).         There are several different ways of writing back  references  to  named
3992         We could rewrite the above example as follows:         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
3993           \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
3994           unified back reference syntax, in which \g can be used for both numeric
3995           and named references, is also supported. We  could  rewrite  the  above
3996           example in any of the following ways:
3997    
3998             (?<p1>(?i)rah)\s+\k<p1>
3999             (?'p1'(?i)rah)\s+\k{p1}
4000             (?P<p1>(?i)rah)\s+(?P=p1)
4001             (?<p1>(?i)rah)\s+\g{p1}
4002    
4003           (?<p1>(?i)rah)\s+(?P=p1)         A  subpattern  that  is  referenced  by  name may appear in the pattern
4004           before or after the reference.
4005    
4006         There may be more than one back reference to the same subpattern. If  a         There may be more than one back reference to the same subpattern. If  a
4007         subpattern  has  not actually been used in a particular match, any back         subpattern  has  not actually been used in a particular match, any back
# Line 3438  ASSERTIONS Line 4090  ASSERTIONS
4090         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
4091         contents of a lookbehind assertion are restricted  such  that  all  the         contents of a lookbehind assertion are restricted  such  that  all  the
4092         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4093         eral alternatives, they do not all have to have the same fixed  length.         eral top-level alternatives, they do not all  have  to  have  the  same
4094         Thus         fixed length. Thus
4095    
4096           (?<=bullock|donkey)           (?<=bullock|donkey)
4097    
# Line 3461  ASSERTIONS Line 4113  ASSERTIONS
4113    
4114           (?<=abc|abde)           (?<=abc|abde)
4115    
4116           In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4117           instead of a lookbehind assertion; this is not restricted to  a  fixed
4118           length.
4119    
4120         The  implementation  of lookbehind assertions is, for each alternative,         The  implementation  of lookbehind assertions is, for each alternative,
4121         to temporarily move the current position back by the  fixed  width  and         to temporarily move the current position back by the fixed  length  and
4122         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
4123         rent position, the match is deemed to fail.         rent position, the assertion fails.
4124    
4125         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
4126         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode)  to appear in lookbehind assertions, because it makes it impossi-
4127         ble to calculate the length of the lookbehind. The \X escape, which can         ble to calculate the length of the lookbehind. The \X and  \R  escapes,
4128         match different numbers of bytes, is also not permitted.         which can match different numbers of bytes, are also not permitted.
4129    
4130         Atomic  groups can be used in conjunction with lookbehind assertions to         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
4131         specify efficient matching at the end of the subject string. Consider a         assertions to specify efficient matching at  the  end  of  the  subject
4132         simple pattern such as         string. Consider a simple pattern such as
4133    
4134           abcd$           abcd$
4135    
# Line 3490  ASSERTIONS Line 4146  ASSERTIONS
4146         again  the search for "a" covers the entire string, from right to left,         again  the search for "a" covers the entire string, from right to left,
4147         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
4148    
          ^(?>.*)(?<=abcd)  
   
        or, equivalently, using the possessive quantifier syntax,  
   
4149           ^.*+(?<=abcd)           ^.*+(?<=abcd)
4150    
4151         there can be no backtracking for the .* item; it  can  match  only  the         there can be no backtracking for the .*+ item; it can  match  only  the
4152         entire  string.  The subsequent lookbehind assertion does a single test         entire  string.  The subsequent lookbehind assertion does a single test
4153         on the last four characters. If it fails, the match fails  immediately.         on the last four characters. If it fails, the match fails  immediately.
4154         For  long  strings, this approach makes a significant difference to the         For  long  strings, this approach makes a significant difference to the
# Line 3551  CONDITIONAL SUBPATTERNS Line 4203  CONDITIONAL SUBPATTERNS
4203         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
4204         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
4205    
4206         There are three kinds of condition. If the text between the parentheses         There  are  four  kinds of condition: references to subpatterns, refer-
4207         consists of a sequence of digits, the condition  is  satisfied  if  the         ences to recursion, a pseudo-condition called DEFINE, and assertions.
4208         capturing  subpattern of that number has previously matched. The number  
4209         must be greater than zero. Consider the following pattern,  which  con-     Checking for a used subpattern by number
4210         tains  non-significant white space to make it more readable (assume the  
4211         PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of         If the text between the parentheses consists of a sequence  of  digits,
4212         discussion:         the  condition  is  true if the capturing subpattern of that number has
4213           previously matched. An alternative notation is to  precede  the  digits
4214           with a plus or minus sign. In this case, the subpattern number is rela-
4215           tive rather than absolute.  The most recently opened parentheses can be
4216           referenced  by  (?(-1),  the  next most recent by (?(-2), and so on. In
4217           looping constructs it can also make sense to refer to subsequent groups
4218           with constructs such as (?(+2).
4219    
4220           Consider  the  following  pattern, which contains non-significant white
4221           space to make it more readable (assume the PCRE_EXTENDED option) and to
4222           divide it into three parts for ease of discussion:
4223    
4224           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
4225    
# Line 3572  CONDITIONAL SUBPATTERNS Line 4234  CONDITIONAL SUBPATTERNS
4234         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
4235         optionally enclosed in parentheses.         optionally enclosed in parentheses.
4236    
4237         If the condition is the string (R), it is satisfied if a recursive call         If you were embedding this pattern in a larger one,  you  could  use  a
4238         to  the pattern or subpattern has been made. At "top level", the condi-         relative reference:
4239         tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are  
4240         described in the next section.           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
4241    
4242           This  makes  the  fragment independent of the parentheses in the larger
4243           pattern.
4244    
4245       Checking for a used subpattern by name
4246    
4247           Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
4248           used  subpattern  by  name.  For compatibility with earlier versions of
4249           PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
4250           also  recognized. However, there is a possible ambiguity with this syn-
4251           tax, because subpattern names may  consist  entirely  of  digits.  PCRE
4252           looks  first for a named subpattern; if it cannot find one and the name
4253           consists entirely of digits, PCRE looks for a subpattern of  that  num-
4254           ber,  which must be greater than zero. Using subpattern names that con-
4255           sist entirely of digits is not recommended.
4256    
4257           Rewriting the above example to use a named subpattern gives this:
4258    
4259         If  the  condition  is  not  a sequence of digits or (R), it must be an           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
4260    
4261    
4262       Checking for pattern recursion
4263    
4264           If the condition is the string (R), and there is no subpattern with the
4265           name  R, the condition is true if a recursive call to the whole pattern
4266           or any subpattern has been made. If digits or a name preceded by amper-
4267           sand follow the letter R, for example:
4268    
4269             (?(R3)...) or (?(R&name)...)
4270    
4271           the  condition is true if the most recent recursion is into the subpat-
4272           tern whose number or name is given. This condition does not  check  the
4273           entire recursion stack.
4274    
4275           At  "top  level", all these recursion test conditions are false. Recur-
4276           sive patterns are described below.
4277    
4278       Defining subpatterns for use by reference only
4279    
4280           If the condition is the string (DEFINE), and  there  is  no  subpattern
4281           with  the  name  DEFINE,  the  condition is always false. In this case,
4282           there may be only one alternative  in  the  subpattern.  It  is  always
4283           skipped  if  control  reaches  this  point  in the pattern; the idea of
4284           DEFINE is that it can be used to define "subroutines" that can be  ref-
4285           erenced  from elsewhere. (The use of "subroutines" is described below.)
4286           For example, a pattern to match an IPv4 address could be  written  like
4287           this (ignore whitespace and line breaks):
4288    
4289             (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
4290             \b (?&byte) (\.(?&byte)){3} \b
4291    
4292           The  first part of the pattern is a DEFINE group inside which  another
4293           group named "byte" is defined. This matches an individual component  of
4294           an  IPv4  address  (a number less than 256). When matching takes place,
4295           this part of the pattern is skipped because DEFINE acts  like  a  false
4296           condition.
4297    
4298           The rest of the pattern uses references to the named group to match the
4299           four dot-separated components of an IPv4 address, insisting on  a  word
4300           boundary at each end.
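
Python's re module has neither DEFINE nor subroutine calls, so the IPv4 pattern cannot be copied over directly. A loose analogue, shown here only as a sketch of the "define once, reference several times" idea, is to build the pattern by string composition. This is a different technique from the one in the text, and the names BYTE, IPV4 and find_ipv4 are hypothetical.

```python
import re

# The "byte" fragment from the text, written once as a non-capturing group.
BYTE = r'(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)'

# Reuse the fragment for all four components, with a word boundary at each
# end, mirroring  \b (?&byte) (\.(?&byte)){3} \b
IPV4 = re.compile(r'\b' + BYTE + r'(?:\.' + BYTE + r'){3}\b')


def find_ipv4(text):
    """Return the first substring that looks like an IPv4 address, or None."""
    m = IPV4.search(text)
    return m.group(0) if m else None


print(find_ipv4('host 192.168.0.1 up'))
print(find_ipv4('256.1.1.1'))
print(find_ipv4('1.2.3'))
```

Unlike a real subroutine call, composition copies the text of the fragment into the pattern four times; the compiled pattern is larger, but the matching behaviour for this example is the same.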
4301    
4302       Assertion conditions
4303    
4304           If  the  condition  is  not  in any of the above formats, it must be an
4305         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.  This may be a positive or negative lookahead or  lookbehind
4306         assertion.  Consider  this  pattern,  again  containing non-significant         assertion.  Consider  this  pattern,  again  containing non-significant
4307         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
# Line 3602  COMMENTS Line 4326  COMMENTS
4326         at all.         at all.
4327    
4328         If  the PCRE_EXTENDED option is set, an unescaped # character outside a         If  the PCRE_EXTENDED option is set, an unescaped # character outside a
4329         character class introduces a comment that continues up to the next new-         character class introduces a  comment  that  continues  to  immediately
4330         line character in the pattern.         after the next newline in the pattern.
4331    
4332    
4333  RECURSIVE PATTERNS  RECURSIVE PATTERNS
# Line 3612  RECURSIVE PATTERNS Line 4336  RECURSIVE PATTERNS
4336         unlimited nested parentheses. Without the use of  recursion,  the  best         unlimited nested parentheses. Without the use of  recursion,  the  best
4337         that  can  be  done  is  to use a pattern that matches up to some fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
4338         depth of nesting. It is not possible to  handle  an  arbitrary  nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
4339         depth.  Perl  provides  a  facility  that allows regular expressions to         depth.
4340         recurse (amongst other things). It does this by interpolating Perl code  
4341         in the expression at run time, and the code can refer to the expression         For some time, Perl has provided a facility that allows regular expres-
4342         itself. A Perl pattern to solve the parentheses problem can be  created         sions to recurse (amongst other things). It does this by  interpolating
4343         like this:         Perl  code in the expression at run time, and the code can refer to the
4344           expression itself. A Perl pattern using code interpolation to solve the
4345           parentheses problem can be created like this:
4346    
4347           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
4348    
4349         The (?p{...}) item interpolates Perl code at run time, and in this case         The (?p{...}) item interpolates Perl code at run time, and in this case
4350         refers recursively to the pattern in which it appears. Obviously,  PCRE         refers recursively to the pattern in which it appears.
4351         cannot  support  the  interpolation  of Perl code. Instead, it supports  
4352         some special syntax for recursion of the entire pattern, and  also  for         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
4353         individual subpattern recursion.         it  supports  special  syntax  for recursion of the entire pattern, and
4354           also for individual subpattern recursion.  After  its  introduction  in
4355           PCRE  and  Python,  this  kind of recursion was introduced into Perl at
4356           release 5.10.
4357    
4358         The  special item that consists of (? followed by a number greater than         A special item that consists of (? followed by a  number  greater  than
4359         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
4360         the  given  number, provided that it occurs inside that subpattern. (If         the given number, provided that it occurs inside that  subpattern.  (If
4361         not, it is a "subroutine" call, which is described  in  the  next  sec-         not,  it  is  a  "subroutine" call, which is described in the next sec-
4362         tion.)  The special item (?R) is a recursive call of the entire regular         tion.) The special item (?R) or (?0) is a recursive call of the  entire
4363         expression.         regular expression.
4364    
4365           In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
4366           always treated as an atomic group. That is, once it has matched some of
4367           the subject string, it is never re-entered, even if it contains untried
4368           alternatives and there is a subsequent matching failure.
4369    
4370         For example, this PCRE pattern solves the  nested  parentheses  problem         This PCRE pattern solves the nested  parentheses  problem  (assume  the
4371         (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is         PCRE_EXTENDED option is set so that white space is ignored):
        ignored):  
4372    
4373           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
4374    
4375         First it matches an opening parenthesis. Then it matches any number  of         First  it matches an opening parenthesis. Then it matches any number of
4376         substrings  which  can  either  be  a sequence of non-parentheses, or a         substrings which can either be a  sequence  of  non-parentheses,  or  a
4377         recursive match of the pattern itself (that is  a  correctly  parenthe-         recursive  match  of the pattern itself (that is, a correctly parenthe-
4378         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
4379    
4380         If  this  were  part of a larger pattern, you would not want to recurse         If this were part of a larger pattern, you would not  want  to  recurse
4381         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
4382    
4383           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4384    
4385         We have put the pattern into parentheses, and caused the  recursion  to         We  have  put the pattern into parentheses, and caused the recursion to
4386         refer  to them instead of the whole pattern. In a larger pattern, keep-         refer to them instead of the whole pattern.
4387         ing track of parenthesis numbers can be tricky. It may be  more  conve-  
4388         nient  to use named parentheses instead. For this, PCRE uses (?P>name),         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
4389         which is an extension to the Python syntax that  PCRE  uses  for  named         tricky.  This is made easier by the use of relative references. (A Perl
4390         parentheses (Perl does not provide named parentheses). We could rewrite         5.10 feature.)  Instead of (?1) in the  pattern  above  you  can  write
4391         the above example as follows:         (?-2) to refer to the second most recently opened parentheses preceding
4392           the recursion. In other  words,  a  negative  number  counts  capturing
4393           (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )         parentheses leftwards from the point at which it is encountered.
4394    
4395         This particular example pattern contains nested unlimited repeats,  and         It  is  also  possible  to refer to subsequently opened parentheses, by
4396         so  the  use of atomic grouping for matching strings of non-parentheses         writing references such as (?+2). However, these  cannot  be  recursive
4397         is important when applying the pattern to strings that  do  not  match.         because  the  reference  is  not inside the parentheses that are refer-
4398         For example, when this pattern is applied to         enced. They are always "subroutine" calls, as  described  in  the  next
4399           section.
4400    
4401           An  alternative  approach is to use named parentheses instead. The Perl
4402           syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
4403           supported. We could rewrite the above example as follows:
4404    
4405             (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4406    
4407           If  there  is more than one subpattern with the same name, the earliest
4408           one is used.
4409    
4410           This particular example pattern that we have been looking  at  contains
4411           nested  unlimited repeats, and so the use of atomic grouping for match-
4412           ing strings of non-parentheses is important when applying  the  pattern
4413           to strings that do not match. For example, when this pattern is applied
4414           to
4415    
4416           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4417    
4418         it  yields "no match" quickly. However, if atomic grouping is not used,         it yields "no match" quickly. However, if atomic grouping is not  used,
4419         the match runs for a very long time indeed because there  are  so  many         the  match  runs  for a very long time indeed because there are so many
4420         different  ways  the  + and * repeats can carve up the subject, and all         different ways the + and * repeats can carve up the  subject,  and  all
4421         have to be tested before failure can be reported.         have to be tested before failure can be reported.
4422    
4423         At the end of a match, the values set for any capturing subpatterns are         At the end of a match, the values set for any capturing subpatterns are
4424         those from the outermost level of the recursion at which the subpattern         those from the outermost level of the recursion at which the subpattern
4425         value is set.  If you want to obtain  intermediate  values,  a  callout         value  is  set.   If  you want to obtain intermediate values, a callout
4426         function can be used (see the next section and the pcrecallout documen-         function can be used (see below and the pcrecallout documentation).  If
4427         tation). If the pattern above is matched against         the pattern above is matched against
4428    
4429           (ab(cd)ef)           (ab(cd)ef)
4430    
4431         the value for the capturing parentheses is  "ef",  which  is  the  last         the  value  for  the  capturing  parentheses is "ef", which is the last
4432         value  taken  on at the top level. If additional parentheses are added,         value taken on at the top level. If additional parentheses  are  added,
4433         giving         giving
4434