
Diff of /code/trunk/doc/pcre.txt


revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC | revision 208 by ph10, Mon Aug 6 15:23:29 2007 UTC
# Line 18  INTRODUCTION Line 18  INTRODUCTION
18    
19         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
20         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
21         just  a  few  differences.  The current implementation of PCRE (release         just  a  few differences. (Certain features that appeared in Python and
22         6.x) corresponds approximately with Perl  5.8,  including  support  for         PCRE before they appeared in Perl are also available using  the  Python
23         UTF-8 encoded strings and Unicode general category properties. However,         syntax.)
24         this support has to be explicitly enabled; it is not the default.  
25           The  current  implementation of PCRE (release 7.x) corresponds approxi-
26         In addition to the Perl-compatible matching function,  PCRE  also  con-         mately with Perl 5.10, including support for UTF-8 encoded strings  and
27         tains  an  alternative matching function that matches the same compiled         Unicode general category properties. However, UTF-8 and Unicode support
28         patterns in a different way. In certain circumstances, the  alternative         has to be explicitly enabled; it is not the default. The Unicode tables
29         function  has  some  advantages.  For  a discussion of the two matching         correspond to Unicode release 5.0.0.
30         algorithms, see the pcrematching page.  
31           In  addition to the Perl-compatible matching function, PCRE contains an
32         PCRE is written in C and released as a C library. A  number  of  people         alternative matching function that matches the same  compiled  patterns
33         have  written  wrappers and interfaces of various kinds. In particular,         in  a different way. In certain circumstances, the alternative function
34         Google Inc.  have provided a comprehensive C++  wrapper.  This  is  now         has some advantages. For a discussion of the two  matching  algorithms,
35           see the pcrematching page.
36    
37           PCRE  is  written  in C and released as a C library. A number of people
38           have written wrappers and interfaces of various kinds.  In  particular,
39           Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
40         included as part of the PCRE distribution. The pcrecpp page has details         included as part of the PCRE distribution. The pcrecpp page has details
41         of this interface. Other people's contributions can  be  found  in  the         of  this  interface.  Other  people's contributions can be found in the
42         Contrib directory at the primary FTP site, which is:         Contrib directory at the primary FTP site, which is:
43    
44         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
45    
46         Details  of  exactly which Perl regular expression features are and are         Details of exactly which Perl regular expression features are  and  are
47         not supported by PCRE are given in separate documents. See the pcrepat-         not supported by PCRE are given in separate documents. See the pcrepat-
48         tern and pcrecompat pages.         tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
49           page.
50    
51         Some  features  of  PCRE can be included, excluded, or changed when the         Some  features  of  PCRE can be included, excluded, or changed when the
52         library is built. The pcre_config() function makes it  possible  for  a         library is built. The pcre_config() function makes it  possible  for  a
# Line 67  USER DOCUMENTATION Line 73  USER DOCUMENTATION
73         of searching. The sections are as follows:         of searching. The sections are as follows:
74    
75           pcre              this document           pcre              this document
76             pcre-config       show PCRE installation configuration information
77           pcreapi           details of PCRE's native C API           pcreapi           details of PCRE's native C API
78           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
79           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
# Line 77  USER DOCUMENTATION Line 84  USER DOCUMENTATION
84           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
85           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
86                               regular expressions                               regular expressions
87             pcresyntax        quick syntax reference
88           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
89           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API
90           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
# Line 99  LIMITATIONS Line 107  LIMITATIONS
107         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
108         the  source  distribution and the pcrebuild documentation for details).         the  source  distribution and the pcrebuild documentation for details).
109         In these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
110         of execution will be slower.         of execution is slower.
111    
112           All values in repeating quantifiers must be less than 65536.
113    
114         All  values in repeating quantifiers must be less than 65536. The maxi-         There is no limit to the number of parenthesized subpatterns, but there
115         mum compiled length of subpattern with  an  explicit  repeat  count  is         can be no more than 65535 capturing subpatterns.
        30000 bytes. The maximum number of capturing subpatterns is 65535.  
   
        There  is  no limit to the number of non-capturing subpatterns, but the  
        maximum depth of nesting of  all  kinds  of  parenthesized  subpattern,  
        including capturing subpatterns, assertions, and other types of subpat-  
        tern, is 200.  
116    
117         The maximum length of name for a named subpattern is 32, and the  maxi-         The maximum length of name for a named subpattern is 32 characters, and
118         mum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
119    
120         The  maximum  length of a subject string is the largest positive number         The  maximum  length of a subject string is the largest positive number
121         that an integer variable can hold. However, when using the  traditional         that an integer variable can hold. However, when using the  traditional
# Line 136  UTF-8 AND UNICODE PROPERTY SUPPORT Line 140  UTF-8 AND UNICODE PROPERTY SUPPORT
140    
141         If  you compile PCRE with UTF-8 support, but do not use it at run time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
142         the library will be a bit bigger, but the additional run time  overhead         the library will be a bit bigger, but the additional run time  overhead
143         is  limited  to testing the PCRE_UTF8 flag in several places, so should         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
144         not be very large.         very big.
145    
146         If PCRE is built with Unicode character property support (which implies         If PCRE is built with Unicode character property support (which implies
147         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
# Line 193  UTF-8 AND UNICODE PROPERTY SUPPORT Line 197  UTF-8 AND UNICODE PROPERTY SUPPORT
197         8.  Similarly,  characters that match the POSIX named character classes         8.  Similarly,  characters that match the POSIX named character classes
198         are all low-valued characters.         are all low-valued characters.
199    
200         9. Case-insensitive matching applies only to  characters  whose  values         9. However, the Perl 5.10 horizontal and vertical  whitespace  matching
201           escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
202           acters.
203    
204           10. Case-insensitive matching applies only to characters  whose  values
205         are  less than 128, unless PCRE is built with Unicode property support.         are  less than 128, unless PCRE is built with Unicode property support.
206         Even when Unicode property support is available, PCRE  still  uses  its         Even when Unicode property support is available, PCRE  still  uses  its
207         own  character  tables when checking the case of low-valued characters,         own  character  tables when checking the case of low-valued characters,
# Line 208  UTF-8 AND UNICODE PROPERTY SUPPORT Line 216  UTF-8 AND UNICODE PROPERTY SUPPORT
216  AUTHOR  AUTHOR
217    
218         Philip Hazel         Philip Hazel
219         University Computing Service,         University Computing Service
220         Cambridge CB2 3QG, England.         Cambridge CB2 3QH, England.
221    
222         Putting  an actual email address here seems to have been a spam magnet,         Putting  an actual email address here seems to have been a spam magnet,
223         so I've taken it away. If you want to email me, use my initial and sur-         so I've taken it away. If you want to email me, use  my  two  initials,
224         name, separated by a dot, at the domain ucs.cam.ac.uk.         followed by the two digits 10, at the domain cam.ac.uk.
225    
226    
227    REVISION
228    
229  Last updated: 05 June 2006         Last updated: 06 August 2007
230  Copyright (c) 1997-2006 University of Cambridge.         Copyright (c) 1997-2007 University of Cambridge.
231  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
232    
233    
# Line 238  PCRE BUILD-TIME OPTIONS Line 249  PCRE BUILD-TIME OPTIONS
249    
250           ./configure --help           ./configure --help
251    
252         The following sections describe certain options whose names begin  with         The following sections include  descriptions  of  options  whose  names
253         --enable  or  --disable. These settings specify changes to the defaults         begin with --enable or --disable. These settings specify changes to the
254         for the configure command. Because of the  way  that  configure  works,         defaults for the configure command. Because of the way  that  configure
255         --enable  and  --disable  always  come  in  pairs, so the complementary         works,  --enable  and --disable always come in pairs, so the complemen-
256         option always exists as well, but as it specifies the  default,  it  is         tary option always exists as well, but as it specifies the default,  it
257         not described.         is not described.
258    
259    
260  C++ SUPPORT  C++ SUPPORT
# Line 282  UNICODE CHARACTER PROPERTY SUPPORT Line 293  UNICODE CHARACTER PROPERTY SUPPORT
293         to the configure command. This implies UTF-8 support, even if you  have         to the configure command. This implies UTF-8 support, even if you  have
294         not explicitly requested it.         not explicitly requested it.
295    
296         Including  Unicode  property  support  adds around 90K of tables to the         Including  Unicode  property  support  adds around 30K of tables to the
297         PCRE library, approximately doubling its size. Only the  general  cate-         PCRE library. Only the general category properties such as  Lu  and  Nd
298         gory  properties  such as Lu and Nd are supported. Details are given in         are supported. Details are given in the pcrepattern documentation.
        the pcrepattern documentation.  
299    
300    
301  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
302    
303         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating         By  default,  PCRE interprets character 10 (linefeed, LF) as indicating
304         the  end  of  a line. This is the normal newline character on Unix-like         the end of a line. This is the normal newline  character  on  Unix-like
305         systems. You can compile PCRE to use character 13 (carriage return, CR)         systems. You can compile PCRE to use character 13 (carriage return, CR)
306         instead, by adding         instead, by adding
307    
308           --enable-newline-is-cr           --enable-newline-is-cr
309    
310         to  the  configure  command.  There  is  also  a --enable-newline-is-lf         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
311         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
312    
313         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 305  CODE VALUE OF NEWLINE Line 315  CODE VALUE OF NEWLINE
315    
316           --enable-newline-is-crlf           --enable-newline-is-crlf
317    
318         to  the  configure command. Whatever line ending convention is selected         to the configure command. There is a fourth option, specified by
319         when PCRE is built can be overridden when  the  library  functions  are  
320         called.  At  build time it is conventional to use the standard for your           --enable-newline-is-anycrlf
321         operating system.  
322           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
323           CRLF as indicating a line ending. Finally, a fifth option, specified by
324    
325             --enable-newline-is-any
326    
327           causes PCRE to recognize any Unicode newline sequence.
328    
329           Whatever line ending convention is selected when PCRE is built  can  be
330           overridden  when  the library functions are called. At build time it is
331           conventional to use the standard for your operating system.
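
The newline conventions described above can be illustrated with a small sketch. This is not PCRE code: the enum and function names are hypothetical, and the "any Unicode newline" convention is left out to keep the example short. It only shows how the CR, LF, CRLF, and ANYCRLF choices differ when deciding whether a line ending starts at a given position.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; they are not PCRE identifiers. */
enum newline_kind { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Return the length of the line ending that starts at s[pos], or 0 if no
   line ending starts there under the given convention. */
size_t newline_length(enum newline_kind kind, const char *s,
                      size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    if (pos == len) return 0;

    int is_cr = (s[pos] == '\r');
    int is_lf = (s[pos] == '\n');
    int is_crlf = is_cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (kind) {
    case NL_CR:      return is_cr ? 1 : 0;
    case NL_LF:      return is_lf ? 1 : 0;
    case NL_CRLF:    return is_crlf ? 2 : 0;
    case NL_ANYCRLF: return is_crlf ? 2 : (is_cr || is_lf) ? 1 : 0;
    }
    return 0;
}
```

Under NL_CRLF a lone CR or LF is ordinary data, whereas under NL_ANYCRLF each of CR, LF, and CRLF ends a line, with CRLF treated as a single two-character ending.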
332    
333    
334  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
# Line 356  HANDLING VERY LARGE PATTERNS Line 376  HANDLING VERY LARGE PATTERNS
376         longer  offsets slows down the operation of PCRE because it has to load         longer  offsets slows down the operation of PCRE because it has to load
377         additional bytes when handling them.         additional bytes when handling them.
378    
        If you build PCRE with an increased link size, test 2 (and  test  5  if  
        you  are using UTF-8) will fail. Part of the output of these tests is a  
        representation of the compiled pattern, and this changes with the  link  
        size.  
   
379    
380  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
381    
382         When matching with the pcre_exec() function, PCRE implements backtrack-         When matching with the pcre_exec() function, PCRE implements backtrack-
383         ing by making recursive calls to an internal function  called  match().         ing  by  making recursive calls to an internal function called match().
384         In  environments  where  the size of the stack is limited, this can se-         In environments where the size of the stack is limited,  this  can  se-
385         verely limit PCRE's operation. (The Unix environment does  not  usually         verely  limit  PCRE's operation. (The Unix environment does not usually
386         suffer from this problem, but it may sometimes be necessary to increase         suffer from this problem, but it may sometimes be necessary to increase
387         the maximum stack size.  There is a discussion in the  pcrestack  docu-         the  maximum  stack size.  There is a discussion in the pcrestack docu-
388         mentation.)  An alternative approach to recursion that uses memory from         mentation.) An alternative approach to recursion that uses memory  from
389         the heap to remember data, instead of using recursive  function  calls,         the  heap  to remember data, instead of using recursive function calls,
390         has  been  implemented to work round the problem of limited stack size.         has been implemented to work round the problem of limited  stack  size.
391         If you want to build a version of PCRE that works this way, add         If you want to build a version of PCRE that works this way, add
392    
393           --disable-stack-for-recursion           --disable-stack-for-recursion
394    
395         to the configure command. With this configuration, PCRE  will  use  the         to  the  configure  command. With this configuration, PCRE will use the
396         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-
397         ment functions. Separate functions are provided because  the  usage  is         ment  functions. By default these point to malloc() and free(), but you
398         very  predictable:  the  block sizes requested are always the same, and         can replace the pointers so that your own functions are used.
399         the blocks are always freed in reverse order. A calling  program  might  
400         be  able  to implement optimized functions that perform better than the         Separate functions are  provided  rather  than  using  pcre_malloc  and
401         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more         pcre_free  because  the  usage  is  very  predictable:  the block sizes
402         slowly when built in this way. This option affects only the pcre_exec()         requested are always the same, and  the  blocks  are  always  freed  in
403         function; it is not relevant for the pcre_dfa_exec() function.          reverse  order.  A calling program might be able to implement optimized
404           functions that perform better  than  malloc()  and  free().  PCRE  runs
405           noticeably more slowly when built in this way. This option affects only
406           the  pcre_exec()  function;  it  is  not  relevant   for   the
407           pcre_dfa_exec() function.
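
Because the blocks are always freed in reverse order, a replacement allocator can be very simple. The following is a minimal sketch, assuming a fixed-size arena; the names lifo_stack_malloc and lifo_stack_free, the arena size, and the choice to return NULL on exhaustion are all assumptions of this example, not part of PCRE. The two functions have the same shape as malloc() and free(), so a program built this way could point pcre_stack_malloc and pcre_stack_free at them.

```c
#include <assert.h>
#include <stddef.h>

/* A unit that is suitably aligned for any ordinary object. */
typedef union { long long ll; long double ld; void *p; } align_unit;

#define ARENA_UNITS 4096            /* arbitrary size chosen for this sketch */

static align_unit arena[ARENA_UNITS];
static size_t arena_top = 0;        /* index of the first free unit */

/* Hand out the next block from the top of the arena. */
void *lifo_stack_malloc(size_t size)
{
    if (size > sizeof(arena)) return NULL;
    size_t units = (size + sizeof(align_unit) - 1) / sizeof(align_unit);
    if (units > ARENA_UNITS - arena_top) return NULL;   /* arena exhausted */
    void *block = &arena[arena_top];
    arena_top += units;
    return block;
}

/* Release a block; it must be the most recently allocated live block. */
void lifo_stack_free(void *block)
{
    if (block == NULL) return;
    size_t offset = (size_t)((align_unit *)block - arena);
    assert(offset <= arena_top);    /* blocks are freed in reverse order */
    arena_top = offset;
}
```

Freeing a block simply moves the top of the arena back to where that block started, which is only valid because of the strict last-in, first-out usage described above. A general-purpose allocator cannot make that assumption, which is why it has more bookkeeping to do.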
408    
409    
410  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
411    
412         Internally, PCRE has a function called match(), which it calls  repeat-         Internally,  PCRE has a function called match(), which it calls repeat-
413         edly   (sometimes   recursively)  when  matching  a  pattern  with  the         edly  (sometimes  recursively)  when  matching  a  pattern   with   the
414         pcre_exec() function. By controlling the maximum number of  times  this         pcre_exec()  function.  By controlling the maximum number of times this
415         function  may be called during a single matching operation, a limit can         function may be called during a single matching operation, a limit  can
416         be placed on the resources used by a single call  to  pcre_exec().  The         be  placed  on  the resources used by a single call to pcre_exec(). The
417         limit  can be changed at run time, as described in the pcreapi documen-         limit can be changed at run time, as described in the pcreapi  documen-
418         tation. The default is 10 million, but this can be changed by adding  a         tation.  The default is 10 million, but this can be changed by adding a
419         setting such as         setting such as
420    
421           --with-match-limit=500000           --with-match-limit=500000
422    
423         to   the   configure  command.  This  setting  has  no  effect  on  the         to  the  configure  command.  This  setting  has  no  effect   on   the
424         pcre_dfa_exec() matching function.         pcre_dfa_exec() matching function.
425    
426         In some environments it is desirable to limit the  depth  of  recursive         In  some  environments  it is desirable to limit the depth of recursive
427         calls of match() more strictly than the total number of calls, in order         calls of match() more strictly than the total number of calls, in order
428         to restrict the maximum amount of stack (or heap,  if  --disable-stack-         to  restrict  the maximum amount of stack (or heap, if --disable-stack-
429         for-recursion is specified) that is used. A second limit controls this;         for-recursion is specified) that is used. A second limit controls this;
430         it defaults to the value that  is  set  for  --with-match-limit,  which         it  defaults  to  the  value  that is set for --with-match-limit, which
431         imposes  no  additional constraints. However, you can set a lower limit         imposes no additional constraints. However, you can set a  lower  limit
432         by adding, for example,         by adding, for example,
433    
434           --with-match-limit-recursion=10000           --with-match-limit-recursion=10000
435    
436         to the configure command. This value can  also  be  overridden  at  run         to  the  configure  command.  This  value can also be overridden at run
437         time.         time.
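
The effect of the two limits can be seen in a toy backtracking matcher. This is only a sketch and bears no relation to PCRE's real match() function: the matcher below handles literal characters, '.', and a postfix '*' on a single character, and every name in it (match_here, limited_match, the error codes) is hypothetical. It counts each call against a total limit and checks the nesting depth against a separate, possibly lower, limit.

```c
#include <assert.h>

enum { MATCH_NO = 0, MATCH_YES = 1,
       ERR_MATCHLIMIT = -1, ERR_RECURSIONLIMIT = -2 };

struct limits {
    unsigned long match_limit;      /* maximum total calls of match_here() */
    unsigned long recursion_limit;  /* maximum nesting depth of match_here() */
    unsigned long calls;            /* calls made so far */
};

/* Match pattern p at the start of text t, backtracking where necessary. */
int match_here(const char *p, const char *t, struct limits *lim,
               unsigned long depth)
{
    if (++lim->calls > lim->match_limit) return ERR_MATCHLIMIT;
    if (depth > lim->recursion_limit) return ERR_RECURSIONLIMIT;

    if (*p == '\0') return MATCH_YES;
    if (p[1] == '*') {
        /* Greedy repeat: take as many characters as possible, then back off. */
        const char *end = t;
        while (*end != '\0' && (*p == '.' || *p == *end)) end++;
        for (;;) {
            int rc = match_here(p + 2, end, lim, depth + 1);
            if (rc != MATCH_NO) return rc;   /* success, or a limit was hit */
            if (end == t) return MATCH_NO;
            end--;
        }
    }
    if (*t != '\0' && (*p == '.' || *p == *t))
        return match_here(p + 1, t + 1, lim, depth + 1);
    return MATCH_NO;
}

int limited_match(const char *p, const char *t,
                  unsigned long match_limit, unsigned long recursion_limit)
{
    struct limits lim = { match_limit, recursion_limit, 0 };
    return match_here(p, t, &lim, 1);
}
```

A pattern such as a*a*a*a*a*b applied to a long run of "a" characters makes many calls at a shallow depth, so it is stopped by the total limit; a long literal pattern makes few calls but nests deeply, so it is stopped by the depth limit. The depth can never exceed the number of calls, which is why a depth limit equal to the total limit imposes no additional constraint.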
438    
439    
440    CREATING CHARACTER TABLES AT BUILD TIME
441    
442           PCRE uses fixed tables for processing characters whose code values  are
443           less  than 256. By default, PCRE is built with a set of tables that are
444           distributed in the file pcre_chartables.c.dist. These  tables  are  for
445           ASCII codes only. If you add
446    
447             --enable-rebuild-chartables
448    
449           to  the  configure  command, the distributed tables are no longer used.
450           Instead, a program called dftables is compiled and  run.  This  outputs
451           the source for a new set of tables, created in the default locale of your
452           C runtime system. (This method of replacing the tables does not work if
453           you  are cross compiling, because dftables is run on the local host. If
454           you need to create alternative tables when cross  compiling,  you  will
455           have to do so "by hand".)
456    
457    
458  USING EBCDIC CODE  USING EBCDIC CODE
459    
460         PCRE  assumes  by  default that it will run in an environment where the         PCRE  assumes  by  default that it will run in an environment where the
461         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
462         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by         This  is  the  case for most computer operating systems. PCRE can, how-
463         adding         ever, be compiled to run in an EBCDIC environment by adding
464    
465           --enable-ebcdic           --enable-ebcdic
466    
467         to the configure command.         to the configure command. This setting implies --enable-rebuild-charta-
468           bles.  You  should  only  use  it if you know that you are in an EBCDIC
469           environment (for example, an IBM mainframe operating system).
470    
471    
472    SEE ALSO
473    
474           pcreapi(3), pcre_config(3).
475    
476  Last updated: 06 June 2006  
477  Copyright (c) 1997-2006 University of Cambridge.  AUTHOR
478    
479           Philip Hazel
480           University Computing Service
481           Cambridge CB2 3QH, England.
482    
483    
484    REVISION
485    
486           Last updated: 30 July 2007
487           Copyright (c) 1997-2007 University of Cambridge.
488  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
489    
490    
# Line 466  PCRE MATCHING ALGORITHMS Line 520  PCRE MATCHING ALGORITHMS
520           <something> <something else> <something further>           <something> <something else> <something further>
521    
522         there are three possible answers. The standard algorithm finds only one         there are three possible answers. The standard algorithm finds only one
523         of them, whereas the DFA algorithm finds all three.         of them, whereas the alternative algorithm finds all three.
524    
525    
526  REGULAR EXPRESSIONS AS TREES  REGULAR EXPRESSIONS AS TREES
# Line 482  REGULAR EXPRESSIONS AS TREES Line 536  REGULAR EXPRESSIONS AS TREES
536    
537  THE STANDARD MATCHING ALGORITHM  THE STANDARD MATCHING ALGORITHM
538    
539         In the terminology of Jeffrey Friedl's book Mastering  Regular  Expres-         In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
540         sions,  the  standard  algorithm  is  an "NFA algorithm". It conducts a         sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
541         depth-first search of the pattern tree. That is, it  proceeds  along  a         depth-first search of the pattern tree. That is, it  proceeds  along  a
542         single path through the tree, checking that the subject matches what is         single path through the tree, checking that the subject matches what is
543         required. When there is a mismatch, the algorithm  tries  any  alterna-         required. When there is a mismatch, the algorithm  tries  any  alterna-
# Line 507  THE STANDARD MATCHING ALGORITHM Line 561  THE STANDARD MATCHING ALGORITHM
561         This provides support for capturing parentheses and back references.         This provides support for capturing parentheses and back references.
562    
563    
564  THE DFA MATCHING ALGORITHM  THE ALTERNATIVE MATCHING ALGORITHM
565    
566         DFA stands for "deterministic finite automaton", but you do not need to         This algorithm conducts a breadth-first search of  the  tree.  Starting
567         understand the origins of that name. This algorithm conducts a breadth-         from  the  first  matching  point  in the subject, it scans the subject
568         first search of the tree. Starting from the first matching point in the         string from left to right, once, character by character, and as it does
569         subject,  it scans the subject string from left to right, once, charac-         this,  it remembers all the paths through the tree that represent valid
570         ter by character, and as it does  this,  it  remembers  all  the  paths         matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
571         through the tree that represent valid matches.         though  it is not implemented as a traditional finite state machine (it
572           keeps multiple states active simultaneously).
573         The  scan  continues until either the end of the subject is reached, or  
574         there are no more unterminated paths. At this point,  terminated  paths         The scan continues until either the end of the subject is  reached,  or
575         represent  the different matching possibilities (if there are none, the         there  are  no more unterminated paths. At this point, terminated paths
576         match has failed).  Thus, if there is more  than  one  possible  match,         represent the different matching possibilities (if there are none,  the
577           match  has  failed).   Thus,  if there is more than one possible match,
578         this algorithm finds all of them, and in particular, it finds the long-         this algorithm finds all of them, and in particular, it finds the long-
579         est. In PCRE, there is an option to stop the algorithm after the  first         est.  In PCRE, there is an option to stop the algorithm after the first
580         match (which is necessarily the shortest) has been found.         match (which is necessarily the shortest) has been found.
581    
582         Note that all the matches that are found start at the same point in the         Note that all the matches that are found start at the same point in the
# Line 529  THE DFA MATCHING ALGORITHM Line 584  THE DFA MATCHING ALGORITHM
584    
585           cat(er(pillar)?)           cat(er(pillar)?)
586    
587         is matched against the string "the caterpillar catchment",  the  result         is  matched  against the string "the caterpillar catchment", the result
588         will  be the three strings "cat", "cater", and "caterpillar" that start         will be the three strings "cat", "cater", and "caterpillar" that  start
589         at the fourth character of the subject. The algorithm does not automat-         at the fourth character of the subject. The algorithm does not automat-
590         ically move on to find matches that start at later positions.         ically move on to find matches that start at later positions.
591    
592         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
593         supported by the DFA matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
594    
595         1. Because the algorithm finds all  possible  matches,  the  greedy  or         1.  Because  the  algorithm  finds  all possible matches, the greedy or
596         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
597         ungreedy quantifiers are treated in exactly the same way.         ungreedy quantifiers are treated in exactly the same way. However, pos-
598           sessive quantifiers can make a difference when what follows could  also
599           match what is quantified, for example in a pattern like this:
600    
601             ^a++\w!
602    
603           This  pattern matches "aaab!" but not "aaa!", which would be matched by
604           a non-possessive quantifier. Similarly, if an atomic group is  present,
605           it  is matched as if it were a standalone pattern at the current point,
606           and the longest match is then "locked in" for the rest of  the  overall
607           pattern.
608    
609         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
610         is  not  straightforward  to  keep track of captured substrings for the         is not straightforward to keep track of  captured  substrings  for  the
611         different matching possibilities, and  PCRE's  implementation  of  this         different  matching  possibilities,  and  PCRE's implementation of this
612         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
613         strings are available.         strings are available.
614    
615         3. Because no substrings are captured, back references within the  pat-         3.  Because no substrings are captured, back references within the pat-
616         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
617    
618         4.  For  the same reason, conditional expressions that use a backrefer-         4. For the same reason, conditional expressions that use  a  backrefer-
619         ence as the condition are not supported.         ence  as  the  condition or test for a specific group recursion are not
620           supported.
621    
622         5. Callouts are supported, but the value of the  capture_top  field  is         5. Because many paths through the tree may be  active,  the  \K  escape
623           sequence, which resets the start of the match when encountered (but may
624           be on some paths and not on others), is not  supported.  It  causes  an
625           error if encountered.
626    
627           6.  Callouts  are  supported, but the value of the capture_top field is
628         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
629    
630         6.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The \C escape sequence, which (in the standard algorithm) matches a
631         single byte, even in UTF-8 mode, is not supported because the DFA algo-         single  byte, even in UTF-8 mode, is not supported because the alterna-
632         rithm moves through the subject string one character at a time, for all         tive algorithm moves through the subject  string  one  character  at  a
633         active paths through the tree.         time, for all active paths through the tree.
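
The possessive-quantifier case in item 1 can be illustrated without PCRE at all. The following is a minimal sketch, assuming hand-written matchers for the two specific patterns ^a+\w! and ^a++\w! (the function names are hypothetical and are not part of the PCRE API). It only shows why giving characters back matters; it is not how either matching algorithm is implemented.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Count the leading run of 'a' characters in s. */
static size_t run_of_a(const char *s)
{
    size_t n = 0;
    while (s[n] == 'a')
        n++;
    return n;
}

/* \w with the default tables: ASCII letter, digit, or underscore. */
static int is_word(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* ^a+\w!  : the greedy a+ may give characters back to \w. */
int match_greedy(const char *s)
{
    assert(s != NULL);
    for (size_t n = run_of_a(s); n >= 1; n--)   /* try shorter and shorter runs */
        if (is_word(s[n]) && s[n + 1] == '!')
            return 1;
    return 0;
}

/* ^a++\w! : the possessive a++ keeps the whole run, with no retry. */
int match_possessive(const char *s)
{
    assert(s != NULL);
    size_t n = run_of_a(s);
    return n >= 1 && is_word(s[n]) && s[n + 1] == '!';
}
```

With the subject "aaa!", the greedy version succeeds by letting \w take the last "a", whereas the possessive version has nothing left for \w to match.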
634    
635    
636  ADVANTAGES OF THE DFA ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
637    
638         Using the DFA matching algorithm provides the following advantages:         Using  the alternative matching algorithm provides the following advan-
639           tages:
640    
641         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
642         ically  found,  and  in particular, the longest match is found. To find         ically  found,  and  in particular, the longest match is found. To find
# Line 573  ADVANTAGES OF THE DFA ALGORITHM Line 645  ADVANTAGES OF THE DFA ALGORITHM
645    
646         2.  There is much better support for partial matching. The restrictions         2.  There is much better support for partial matching. The restrictions
647         on the content of the pattern that apply when using the standard  algo-         on the content of the pattern that apply when using the standard  algo-
648         rithm  for partial matching do not apply to the DFA algorithm. For non-         rithm  for  partial matching do not apply to the alternative algorithm.
649         anchored patterns, the starting position of a partial match  is  avail-         For non-anchored patterns, the starting position of a partial match  is
650         able.         available.
651    
652         3.  Because  the  DFA algorithm scans the subject string just once, and         3.  Because  the  alternative  algorithm  scans the subject string just
653         never needs to backtrack, it is possible  to  pass  very  long  subject         once, and never needs to backtrack, it is possible to  pass  very  long
654         strings  to  the matching function in several pieces, checking for par-         subject  strings  to  the matching function in several pieces, checking
655         tial matching each time.         for partial matching each time.
656    
657    
658  DISADVANTAGES OF THE DFA ALGORITHM  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
659    
660         The DFA algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
661    
662         1. It is substantially slower than  the  standard  algorithm.  This  is         1. It is substantially slower than  the  standard  algorithm.  This  is
663         partly  because  it has to search for all possible matches, but is also         partly  because  it has to search for all possible matches, but is also
# Line 593  DISADVANTAGES OF THE DFA ALGORITHM Line 665  DISADVANTAGES OF THE DFA ALGORITHM
665    
666         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
667    
668         3. The "atomic group" feature of PCRE regular expressions is supported,         3. Although atomic groups are supported, their use does not provide the
669         but  does not provide the advantage that it does for the standard algo-         performance advantage that it does for the standard algorithm.
        rithm.  
670    
671  Last updated: 06 June 2006  
672  Copyright (c) 1997-2006 University of Cambridge.  AUTHOR
673    
674           Philip Hazel
675           University Computing Service
676           Cambridge CB2 3QH, England.
677    
678    
679    REVISION
680    
681           Last updated: 29 May 2007
682           Copyright (c) 1997-2007 University of Cambridge.
683  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
684    
685    
# Line 692  PCRE NATIVE API Line 773  PCRE NATIVE API
773  PCRE API OVERVIEW  PCRE API OVERVIEW
774    
775         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
776         is also a set of wrapper functions that correspond to the POSIX regular         are also some wrapper functions that correspond to  the  POSIX  regular
777         expression  API.  These  are  described in the pcreposix documentation.         expression  API.  These  are  described in the pcreposix documentation.
778         Both of these APIs define a set of C function calls. A C++  wrapper  is         Both of these APIs define a set of C function calls. A C++  wrapper  is
779         distributed with PCRE. It is documented in the pcrecpp page.         distributed with PCRE. It is documented in the pcrecpp page.
# Line 715  PCRE API OVERVIEW Line 796  PCRE API OVERVIEW
796         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
797         ble,  is  also provided. This uses a different algorithm for the match-         ble,  is  also provided. This uses a different algorithm for the match-
798         ing. The alternative algorithm finds all possible matches (at  a  given         ing. The alternative algorithm finds all possible matches (at  a  given
799         point in the subject). However, this algorithm does not return captured         point  in  the subject), and scans the subject just once. However, this
800         substrings. A description of the  two  matching  algorithms  and  their         algorithm does not return captured substrings. A description of the two
801         advantages  and  disadvantages  is given in the pcrematching documenta-         matching  algorithms and their advantages and disadvantages is given in
802         tion.         the pcrematching documentation.
803    
804         In addition to the main compiling and  matching  functions,  there  are         In addition to the main compiling and  matching  functions,  there  are
805         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
# Line 779  PCRE API OVERVIEW Line 860  PCRE API OVERVIEW
860    
861    
862  NEWLINES  NEWLINES
863         PCRE supports three different conventions for indicating line breaks in  
864         strings: a single CR character, a single LF character, or the two-char-         PCRE  supports five different conventions for indicating line breaks in
865         acter  sequence  CRLF.  All  three  are used as "standard" by different         strings: a single CR (carriage return) character, a  single  LF  (line-
866         operating systems.  When PCRE is built, a default can be specified. The         feed) character, the two-character sequence CRLF, any of the three pre-
867         default  default  is  LF, which is the Unix standard. When PCRE is run,         ceding, or any Unicode newline sequence. The Unicode newline  sequences
868         the default can be overridden, either when a pattern  is  compiled,  or         are  the  three just mentioned, plus the single characters VT (vertical
869         when it is matched.         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
870           separator, U+2028), and PS (paragraph separator, U+2029).
871    
872           Each  of  the first three conventions is used by at least one operating
873           system as its standard newline sequence. When PCRE is built, a  default
874           can  be  specified.  The default default is LF, which is the Unix stan-
875           dard. When PCRE is run, the default can be overridden,  either  when  a
876           pattern is compiled, or when it is matched.
877    
878         In the PCRE documentation the word "newline" is used to mean "the char-         In the PCRE documentation the word "newline" is used to mean "the char-
879         acter or pair of characters that indicate a line break".         acter or pair of characters that indicate a line break". The choice  of
880           newline  convention  affects  the  handling of the dot, circumflex, and
881           dollar metacharacters, the handling of #-comments in /x mode, and, when
882           CRLF  is a recognized line ending sequence, the match position advance-
883           ment for a non-anchored pattern. The choice of newline convention  does
884           not affect the interpretation of the \n or \r escape sequences.
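
As an illustration of the "any Unicode newline sequence" convention described above, here is a minimal sketch of a recognizer for a UTF-8 buffer. The function name is hypothetical, the input is assumed to be valid UTF-8, and this is not PCRE's internal code; it simply encodes the list CR, LF, CRLF, VT, FF, NEL, LS, and PS.

```c
#include <assert.h>
#include <stddef.h>

/* Return the byte length of the newline sequence that starts at p[0],
   or 0 if no newline starts there. len is the number of bytes available. */
size_t any_newline_len(const unsigned char *p, size_t len)
{
    assert(p != NULL);
    if (len == 0)
        return 0;
    switch (p[0]) {
    case 0x0D:                                  /* CR, or CR followed by LF */
        return (len >= 2 && p[1] == 0x0A) ? 2 : 1;
    case 0x0A:                                  /* LF */
    case 0x0B:                                  /* VT */
    case 0x0C:                                  /* FF */
        return 1;
    case 0xC2:                                  /* NEL, U+0085 = C2 85 */
        return (len >= 2 && p[1] == 0x85) ? 2 : 0;
    case 0xE2:                                  /* LS U+2028, PS U+2029 */
        return (len >= 3 && p[1] == 0x80 &&
                (p[2] == 0xA8 || p[2] == 0xA9)) ? 3 : 0;
    default:
        return 0;
    }
}
```

Treating CRLF as one two-byte sequence rather than two separate newlines is what makes the match position advance past both bytes when CRLF is a recognized line ending.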
885    
886    
887  MULTITHREADING  MULTITHREADING
888    
889         The PCRE functions can be used in  multi-threading  applications,  with         The  PCRE  functions  can be used in multi-threading applications, with
890         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
891         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
892         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
893    
894         The compiled form of a regular expression is not altered during  match-         The  compiled form of a regular expression is not altered during match-
895         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
896         at once.         at once.
897    
# Line 806  MULTITHREADING Line 899  MULTITHREADING
899  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
900    
901         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
902         later  time,  possibly by a different program, and even on a host other         later time, possibly by a different program, and even on a  host  other
903         than the one on which  it  was  compiled.  Details  are  given  in  the         than  the  one  on  which  it  was  compiled.  Details are given in the
904         pcreprecompile documentation.         pcreprecompile documentation. However, compiling a  regular  expression
905           with  one version of PCRE for use with a different version is not guar-
906           anteed to work and may cause crashes.
907    
908    
909  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
910    
911         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
912    
913         The  function pcre_config() makes it possible for a PCRE client to dis-         The function pcre_config() makes it possible for a PCRE client to  dis-
914         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
915         The  pcrebuild documentation has more details about these optional fea-         The pcrebuild documentation has more details about these optional  fea-
916         tures.         tures.
917    
918         The first argument for pcre_config() is an  integer,  specifying  which         The  first  argument  for pcre_config() is an integer, specifying which
919         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
920         into which the information is  placed.  The  following  information  is         into  which  the  information  is  placed. The following information is
921         available:         available:
922    
923           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
924    
925         The  output is an integer that is set to one if UTF-8 support is avail-         The output is an integer that is set to one if UTF-8 support is  avail-
926         able; otherwise it is set to zero.         able; otherwise it is set to zero.
927    
928           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
929    
930         The output is an integer that is set to  one  if  support  for  Unicode         The  output  is  an  integer  that is set to one if support for Unicode
931         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
932    
933           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
934    
935         The  output  is  an integer whose value specifies the default character         The output is an integer whose value specifies  the  default  character
936         sequence that is recognized as meaning "newline". The three values that         sequence  that is recognized as meaning "newline". The five values that
937         are supported are: 10 for LF, 13 for CR, and 3338 for CRLF. The default         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
938         should normally be the standard sequence for your operating system.         and  -1  for  ANY. The default should normally be the standard sequence
939           for your operating system.
940    
941           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
942    
# Line 909  COMPILING A PATTERN Line 1005  COMPILING A PATTERN
1005         fully relocatable, because it may contain a copy of the tableptr  argu-         fully relocatable, because it may contain a copy of the tableptr  argu-
1006         ment, which is an address (see below).         ment, which is an address (see below).
1007    
1008         The options argument contains independent bits that affect the compila-         The options argument contains various bit settings that affect the com-
1009         tion. It should be zero if  no  options  are  required.  The  available         pilation. It should be zero if no options are required.  The  available
1010         options  are  described  below. Some of them, in particular, those that         options  are  described  below. Some of them, in particular, those that
1011         are compatible with Perl, can also be set and  unset  from  within  the         are compatible with Perl, can also be set and  unset  from  within  the
1012         pattern  (see  the  detailed  description in the pcrepattern documenta-         pattern  (see  the  detailed  description in the pcrepattern documenta-
# Line 1000  COMPILING A PATTERN Line 1096  COMPILING A PATTERN
1096         not  match  when  the  current position is at a newline. This option is         not  match  when  the  current position is at a newline. This option is
1097         equivalent to Perl's /s option, and it can be changed within a  pattern         equivalent to Perl's /s option, and it can be changed within a  pattern
1098         by  a (?s) option setting. A negative class such as [^a] always matches         by  a (?s) option setting. A negative class such as [^a] always matches
1099         newlines, independent of the setting of this option.         newline characters, independent of the setting of this option.
1100    
1101           PCRE_DUPNAMES           PCRE_DUPNAMES
1102    
# Line 1064  COMPILING A PATTERN Line 1160  COMPILING A PATTERN
1160           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1161           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1162           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1163             PCRE_NEWLINE_ANYCRLF
1164             PCRE_NEWLINE_ANY
1165    
1166         These  options  override the default newline definition that was chosen         These  options  override the default newline definition that was chosen
1167         when PCRE was built. Setting the first or the second specifies  that  a         when PCRE was built. Setting the first or the second specifies  that  a
1168         newline  is  indicated  by a single character (CR or LF, respectively).         newline  is  indicated  by a single character (CR or LF, respectively).
1169         Setting both of them specifies that a newline is indicated by the  two-         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1170         character  CRLF sequence. For convenience, PCRE_NEWLINE_CRLF is defined         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1171         to contain both bits. The only time that a line break is relevant  when         that any of the three preceding sequences should be recognized. Setting
1172         compiling a pattern is if PCRE_EXTENDED is set, and an unescaped # out-         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1173         side a character class is encountered. This indicates  a  comment  that         recognized. The Unicode newline sequences are the three just mentioned,
1174         lasts until after the next newline.         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1175           U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1176           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1177           UTF-8 mode.
1178    
1179           The newline setting in the  options  word  uses  three  bits  that  are
1180           treated as a number, giving eight possibilities. Currently only six are
1181           used (default plus the five values above). This means that if  you  set
1182           more  than one newline option, the combination may or may not be sensi-
1183           ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1184           PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1185           cause an error.
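
The three-bit newline field can be sketched as follows. The NL_* constants are local stand-ins whose values are assumed to mirror the PCRE_NEWLINE_* definitions in pcre.h; check them against your own header before relying on them. The validity test is likewise an illustration of the "six of eight values are used" rule, not PCRE's actual check.

```c
#include <assert.h>

/* Assumed to match pcre.h; verify against the installed header. */
#define NL_CR       0x00100000UL
#define NL_LF       0x00200000UL
#define NL_CRLF     0x00300000UL
#define NL_ANY      0x00400000UL
#define NL_ANYCRLF  0x00500000UL
#define NL_MASK     0x00700000UL    /* the three newline bits */

/* Setting CR and LF together yields the CRLF value. */
static_assert((NL_CR | NL_LF) == NL_CRLF, "CR|LF must equal CRLF");

/* Treat the field as a number 0..7; 0 is "use the default",
   1..5 are the five conventions, and 6 and 7 are unused. */
int newline_field_is_valid(unsigned long options)
{
    unsigned long n = (options & NL_MASK) >> 20;
    return n <= 5;
}
```

For example, combining the LF and ANY bits gives the unused value 6, which is the kind of combination that produces compile error 56.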
1186    
1187           The only time that a line break is specially recognized when  compiling
1188           a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
1189           character class is encountered. This indicates  a  comment  that  lasts
1190           until  after the next line break sequence. In other circumstances, line
1191           break  sequences  are  treated  as  literal  data,   except   that   in
1192           PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1193           and are therefore ignored.
1194    
1195         The newline option set at compile time becomes the default that is used         The newline option that is set at compile time becomes the default that
1196         for pcre_exec() and pcre_dfa_exec(), but it can be overridden.         is  used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1197    
1198           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1199    
# Line 1119  COMPILATION ERROR CODES Line 1236  COMPILATION ERROR CODES
1236    
1237         The following table lists the error  codes  that  may  be  returned  by         The following table lists the error  codes  that  may  be  returned  by
1238         pcre_compile2(),  along with the error messages that may be returned by         pcre_compile2(),  along with the error messages that may be returned by
1239         both compiling functions.         both compiling functions. As PCRE has developed, some error codes  have
1240           fallen out of use. To avoid confusion, they have not been re-used.
1241    
1242            0  no error            0  no error
1243            1  \ at end of pattern            1  \ at end of pattern
# Line 1131  COMPILATION ERROR CODES Line 1249  COMPILATION ERROR CODES
1249            7  invalid escape sequence in character class            7  invalid escape sequence in character class
1250            8  range out of order in character class            8  range out of order in character class
1251            9  nothing to repeat            9  nothing to repeat
1252           10  operand of unlimited repeat could match the empty string           10  [this code is not in use]
1253           11  internal error: unexpected repeat           11  internal error: unexpected repeat
1254           12  unrecognized character after (?           12  unrecognized character after (?
1255           13  POSIX named classes are supported only within a class           13  POSIX named classes are supported only within a class
# Line 1140  COMPILATION ERROR CODES Line 1258  COMPILATION ERROR CODES
1258           16  erroffset passed as NULL           16  erroffset passed as NULL
1259           17  unknown option bit(s) set           17  unknown option bit(s) set
1260           18  missing ) after comment           18  missing ) after comment
1261           19  parentheses nested too deeply           19  [this code is not in use]
1262           20  regular expression too large           20  regular expression too large
1263           21  failed to get memory           21  failed to get memory
1264           22  unmatched parentheses           22  unmatched parentheses
# Line 1150  COMPILATION ERROR CODES Line 1268  COMPILATION ERROR CODES
1268           26  malformed number or name after (?(           26  malformed number or name after (?(
1269           27  conditional group contains more than two branches           27  conditional group contains more than two branches
1270           28  assertion expected after (?(           28  assertion expected after (?(
1271           29  (?R or (?digits must be followed by )           29  (?R or (?[+-]digits must be followed by )
1272           30  unknown POSIX class name           30  unknown POSIX class name
1273           31  POSIX collating elements are not supported           31  POSIX collating elements are not supported
1274           32  this version of PCRE is not compiled with PCRE_UTF8 support           32  this version of PCRE is not compiled with PCRE_UTF8 support
1275           33  spare error           33  [this code is not in use]
1276           34  character value in \x{...} sequence is too large           34  character value in \x{...} sequence is too large
1277           35  invalid condition (?(0)           35  invalid condition (?(0)
1278           36  \C not allowed in lookbehind assertion           36  \C not allowed in lookbehind assertion
# Line 1163  COMPILATION ERROR CODES Line 1281  COMPILATION ERROR CODES
1281           39  closing ) for (?C expected           39  closing ) for (?C expected
1282           40  recursive call could loop indefinitely           40  recursive call could loop indefinitely
1283           41  unrecognized character after (?P           41  unrecognized character after (?P
1284           42  syntax error after (?P           42  syntax error in subpattern name (missing terminator)
1285           43  two named subpatterns have the same name           43  two named subpatterns have the same name
1286           44  invalid UTF-8 string           44  invalid UTF-8 string
1287           45  support for \P, \p, and \X has not been compiled           45  support for \P, \p, and \X has not been compiled
# Line 1171  COMPILATION ERROR CODES Line 1289  COMPILATION ERROR CODES
1289           47  unknown property name after \P or \p           47  unknown property name after \P or \p
1290           48  subpattern name is too long (maximum 32 characters)           48  subpattern name is too long (maximum 32 characters)
1291           49  too many named subpatterns (maximum 10,000)           49  too many named subpatterns (maximum 10,000)
1292           50  repeated subpattern is too long           50  [this code is not in use]
1293           51  octal value is greater than \377 (not in UTF-8 mode)           51  octal value is greater than \377 (not in UTF-8 mode)
1294             52  internal error: overran compiling workspace
1295             53  internal error: previously-checked referenced subpattern
1296                   not found
1297             54  DEFINE group contains more than one branch
1298             55  repeating a DEFINE group is not allowed
1299             56  inconsistent NEWLINE options
1300             57  \g is not followed by a braced name or an optionally braced
1301                   non-zero number
1302             58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number
1303    
1304    
1305  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1224  STUDYING A PATTERN Line 1351  STUDYING A PATTERN
1351  LOCALE SUPPORT  LOCALE SUPPORT
1352    
1353         PCRE handles caseless matching, and determines whether  characters  are         PCRE handles caseless matching, and determines whether  characters  are
1354         letters  digits,  or whatever, by reference to a set of tables, indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
1355         by character value. When running in UTF-8 mode, this  applies  only  to         by character value. When running in UTF-8 mode, this  applies  only  to
1356         characters  with  codes  less than 128. Higher-valued codes never match         characters  with  codes  less than 128. Higher-valued codes never match
1357         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1358         with  Unicode  character property support. The use of locales with Uni-         with  Unicode  character property support. The use of locales with Uni-
1359         code is discouraged.         code is discouraged. If you are handling characters with codes  greater
1360           than  128, you should either use UTF-8 and Unicode, or use locales, but
1361         An internal set of tables is created in the default C locale when  PCRE         not try to mix the two.
1362         is  built.  This  is  used when the final argument of pcre_compile() is  
1363         NULL, and is sufficient for many applications. An  alternative  set  of         PCRE contains an internal set of tables that are used  when  the  final
1364         tables  can,  however, be supplied. These may be created in a different         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1365         locale from the default. As more and more applications change to  using         applications.  Normally, the internal tables recognize only ASCII char-
1366         Unicode, the need for this locale support is expected to die away.         acters. However, when PCRE is built, it is possible to cause the inter-
1367           nal tables to be rebuilt in the default "C" locale of the local system,
1368         External  tables  are  built by calling the pcre_maketables() function,         which may cause them to be different.
1369         which has no arguments, in the relevant locale. The result can then  be  
1370         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For         The  internal tables can always be overridden by tables supplied by the
1371         example, to build and use tables that are appropriate  for  the  French         application that calls PCRE. These may be created in a different locale
1372         locale  (where  accented  characters  with  values greater than 128 are         from  the  default.  As more and more applications change to using Uni-
1373           code, the need for this locale support is expected to die away.
1374    
1375           External tables are built by calling  the  pcre_maketables()  function,
1376           which  has no arguments, in the relevant locale. The result can then be
1377           passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1378           example,  to  build  and use tables that are appropriate for the French
1379           locale (where accented characters with  values  greater  than  128  are
1380         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1381    
1382           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1383           tables = pcre_maketables();           tables = pcre_maketables();
1384           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1385    
1386           The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1387           if you are using Windows, the name for the French locale is "french".
1388    
1389         When pcre_maketables() runs, the tables are built  in  memory  that  is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1390         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1391         that the memory containing the tables remains available for as long  as         that the memory containing the tables remains available for as long  as
# Line 1331  INFORMATION ABOUT A PATTERN Line 1468  INFORMATION ABOUT A PATTERN
1468         is still recognized for backwards compatibility.)         is still recognized for backwards compatibility.)
1469    
1470         If there is a fixed first byte, for example, from  a  pattern  such  as         If there is a fixed first byte, for example, from  a  pattern  such  as
1471         (cat|cow|coyote). Otherwise, if either         (cat|cow|coyote), its value is returned. Otherwise, if either
1472    
1473         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1474         branch starts with "^", or         branch starts with "^", or
# Line 1351  INFORMATION ABOUT A PATTERN Line 1488  INFORMATION ABOUT A PATTERN
1488         returned. The fourth argument should point to an unsigned char *  vari-         returned. The fourth argument should point to an unsigned char *  vari-
1489         able.         able.
1490    
1491             PCRE_INFO_JCHANGED
1492    
1493           Return  1  if the (?J) option setting is used in the pattern, otherwise
1494           0. The fourth argument should point to an int variable. The (?J) inter-
1495           nal option setting changes the local PCRE_DUPNAMES option.
1496    
1497           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1498    
1499         Return  the  value of the rightmost literal byte that must exist in any         Return  the  value of the rightmost literal byte that must exist in any
# Line 1388  INFORMATION ABOUT A PATTERN Line 1531  INFORMATION ABOUT A PATTERN
1531         PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is         PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1532         ignored):         ignored):
1533    
1534           (?P<date> (?P<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
1535           (?P<month>\d\d) - (?P<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
1536    
1537         There are four named subpatterns, so the table has  four  entries,  and         There are four named subpatterns, so the table has  four  entries,  and
1538         each  entry  in the table is eight bytes long. The table is as follows,         each  entry  in the table is eight bytes long. The table is as follows,
# Line 1405  INFORMATION ABOUT A PATTERN Line 1548  INFORMATION ABOUT A PATTERN
1548         name-to-number map, remember that the length of the entries  is  likely         name-to-number map, remember that the length of the entries  is  likely
1549         to be different for each compiled pattern.         to be different for each compiled pattern.
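
A minimal sketch of reading such a name-to-number table, assuming the entry layout that the PCRE documentation gives for PCRE_INFO_NAMETABLE: two bytes holding the group number, most significant byte first, followed by the zero-terminated name. The helper name lookup_group_number and its parameters are hypothetical. In a real program the table pointer, entry count, and entry size would be obtained from pcre_fullinfo(), and the library's own pcre_get_stringnumber() is probably the more convenient route.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: linear search of a name table.
   table      - first byte of the table
   count      - number of entries in the table
   entry_size - size of each entry in bytes; differs between patterns
   Returns the group number, or -1 if the name is not in the table. */
int lookup_group_number(const unsigned char *table, int count,
                        int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;
        /* Bytes 0-1 hold the number (MSB first); the name starts at byte 2. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

Because entry_size is passed in rather than hard-coded, the same function works for patterns whose longest name gives a different entry length.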
1550    
1551             PCRE_INFO_OKPARTIAL
1552    
1553           Return  1 if the pattern can be used for partial matching, otherwise 0.
1554           The fourth argument should point to an int  variable.  The  pcrepartial
1555           documentation  lists  the restrictions that apply to patterns when par-
1556           tial matching is used.
1557    
1558           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1559    
1560         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1561         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1562         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1563         by any top-level option settings within the pattern itself.         by any top-level option settings at the start of the pattern itself. In
1564           other words, they are the options that will be in force  when  matching
1565           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1566           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1567           and PCRE_EXTENDED.
1568    
1569         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1570         alternatives begin with one of the following:         alternatives begin with one of the following:
1571    
1572           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1426  INFORMATION ABOUT A PATTERN Line 1580  INFORMATION ABOUT A PATTERN
1580    
1581           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1582    
1583         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1584         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1585         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1586         size_t variable.         size_t variable.
# Line 1434  INFORMATION ABOUT A PATTERN Line 1588  INFORMATION ABOUT A PATTERN
1588           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1589    
1590         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1591         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1592         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1593         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1594         variable.         variable.
1595    
1596    
# Line 1444  OBSOLETE INFO FUNCTION Line 1598  OBSOLETE INFO FUNCTION
1598    
1599         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1600    
1601         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1602         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1603         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1604         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1605         lowing negative numbers:         lowing negative numbers:
1606    
1607           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1608           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1609    
1610         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1611         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1612         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1613    
1614         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1615         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1616         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1617    
1618    
# Line 1466  REFERENCE COUNTS Line 1620  REFERENCE COUNTS
1620    
1621         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1622    
1623         The  pcre_refcount()  function is used to maintain a reference count in         The pcre_refcount() function is used to maintain a reference  count  in
1624         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1625         benefit  of  applications  that  operate  in an object-oriented manner,         benefit of applications that  operate  in  an  object-oriented  manner,
1626         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1627         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1628    
1629         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1630         zero.  It is changed only by calling this function, whose action is  to         zero.   It is changed only by calling this function, whose action is to
1631         add  the  adjust  value  (which may be positive or negative) to it. The         add the adjust value (which may be positive or  negative)  to  it.  The
1632         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1633         is  constrained to lie between 0 and 65535, inclusive. If the new value         is constrained to lie between 0 and 65535, inclusive. If the new  value
1634         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
1635    
1636         Except when it is zero, the reference count is not correctly  preserved         Except  when it is zero, the reference count is not correctly preserved
1637         if  a  pattern  is  compiled on one host and then transferred to a host         if a pattern is compiled on one host and then  transferred  to  a  host
1638         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1639    
1640    
# Line 1490  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1644  MATCHING A PATTERN: THE TRADITIONAL FUNC
1644              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1645              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1646    
1647         The  function pcre_exec() is called to match a subject string against a         The function pcre_exec() is called to match a subject string against  a
1648         compiled pattern, which is passed in the code argument. If the  pattern         compiled  pattern, which is passed in the code argument. If the pattern
1649         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1650         argument. This function is the main matching facility of  the  library,         argument.  This  function is the main matching facility of the library,
1651         and it operates in a Perl-like manner. For specialist use there is also         and it operates in a Perl-like manner. For specialist use there is also
1652         an alternative matching function, which is described below in the  sec-         an  alternative matching function, which is described below in the sec-
1653         tion about the pcre_dfa_exec() function.         tion about the pcre_dfa_exec() function.
1654    
1655         In  most applications, the pattern will have been compiled (and option-         In most applications, the pattern will have been compiled (and  option-
1656         ally studied) in the same process that calls pcre_exec().  However,  it         ally  studied)  in the same process that calls pcre_exec(). However, it
1657         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1658         later in different processes, possibly even on different hosts.  For  a         later  in  different processes, possibly even on different hosts. For a
1659         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1660    
1661         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1520  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1674  MATCHING A PATTERN: THE TRADITIONAL FUNC
1674    
1675     Extra data for pcre_exec()     Extra data for pcre_exec()
1676    
1677         If  the  extra argument is not NULL, it must point to a pcre_extra data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1678         block. The pcre_study() function returns such a block (when it  doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1679         return  NULL), but you can also create one for yourself, and pass addi-         return NULL), but you can also create one for yourself, and pass  addi-
1680         tional information in it. The pcre_extra block contains  the  following         tional  information  in it. The pcre_extra block contains the following
1681         fields (not necessarily in this order):         fields (not necessarily in this order):
1682    
1683           unsigned long int flags;           unsigned long int flags;
# Line 1533  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1687  MATCHING A PATTERN: THE TRADITIONAL FUNC
1687           void *callout_data;           void *callout_data;
1688           const unsigned char *tables;           const unsigned char *tables;
1689    
1690         The  flags  field  is a bitmap that specifies which of the other fields         The flags field is a bitmap that specifies which of  the  other  fields
1691         are set. The flag bits are:         are set. The flag bits are:
1692    
1693           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
# Line 1542  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1696  MATCHING A PATTERN: THE TRADITIONAL FUNC
1696           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1697           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1698    
1699         Other flag bits should be set to zero. The study_data field is  set  in         Other  flag  bits should be set to zero. The study_data field is set in
1700         the  pcre_extra  block  that is returned by pcre_study(), together with         the pcre_extra block that is returned by  pcre_study(),  together  with
1701         the appropriate flag bit. You should not set this yourself, but you may         the appropriate flag bit. You should not set this yourself, but you may
1702         add  to  the  block by setting the other fields and their corresponding         add to the block by setting the other fields  and  their  corresponding
1703         flag bits.         flag bits.
1704    
1705         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1706         a  vast amount of resources when running patterns that are not going to         a vast amount of resources when running patterns that are not going  to
1707         match, but which have a very large number  of  possibilities  in  their         match,  but  which  have  a very large number of possibilities in their
1708         search  trees.  The  classic  example  is  the  use of nested unlimited         search trees. The classic  example  is  the  use  of  nested  unlimited
1709         repeats.         repeats.
1710    
1711         Internally, PCRE uses a function called match() which it calls  repeat-         Internally,  PCRE uses a function called match() which it calls repeat-
1712         edly  (sometimes  recursively). The limit set by match_limit is imposed         edly (sometimes recursively). The limit set by match_limit  is  imposed
1713         on the number of times this function is called during  a  match,  which         on  the  number  of times this function is called during a match, which
1714         has  the  effect  of  limiting the amount of backtracking that can take         has the effect of limiting the amount of  backtracking  that  can  take
1715         place. For patterns that are not anchored, the count restarts from zero         place. For patterns that are not anchored, the count restarts from zero
1716         for each position in the subject string.         for each position in the subject string.
1717    
1718         The  default  value  for  the  limit can be set when PCRE is built; the         The default value for the limit can be set  when  PCRE  is  built;  the
1719         default default is 10 million, which handles all but the  most  extreme         default  default  is 10 million, which handles all but the most extreme
1720         cases.  You  can  override  the default by supplying pcre_exec() with a         cases. You can override the default by  supplying  pcre_exec()  with  a
1721         pcre_extra    block    in    which    match_limit    is    set,     and         pcre_extra     block    in    which    match_limit    is    set,    and
1722         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1723         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1724    
1725         The match_limit_recursion field is similar to match_limit, but  instead         The  match_limit_recursion field is similar to match_limit, but instead
1726         of limiting the total number of times that match() is called, it limits         of limiting the total number of times that match() is called, it limits
1727         the depth of recursion. The recursion depth is a  smaller  number  than         the  depth  of  recursion. The recursion depth is a smaller number than
1728         the  total number of calls, because not all calls to match() are recur-         the total number of calls, because not all calls to match() are  recur-
1729         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
1730    
1731         Limiting  the  recursion  depth  limits the amount of stack that can be         Limiting the recursion depth limits the amount of  stack  that  can  be
1732         used, or, when PCRE has been compiled to use memory on the heap instead         used, or, when PCRE has been compiled to use memory on the heap instead
1733         of the stack, the amount of heap memory that can be used.         of the stack, the amount of heap memory that can be used.
1734    
1735         The  default  value  for  match_limit_recursion can be set when PCRE is         The default value for match_limit_recursion can be  set  when  PCRE  is
1736         built; the default default  is  the  same  value  as  the  default  for         built;  the  default  default  is  the  same  value  as the default for
1737         match_limit. You can override the default by supplying pcre_exec() with         match_limit. You can override the default by supplying pcre_exec() with
1738         a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1739         PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1740         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1741    
1742         The pcre_callout field is used in conjunction with the  "callout"  fea-         The  pcre_callout  field is used in conjunction with the "callout" fea-
1743         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1744    
1745         The  tables  field  is  used  to  pass  a  character  tables pointer to         The tables field  is  used  to  pass  a  character  tables  pointer  to
1746         pcre_exec(); this overrides the value that is stored with the  compiled         pcre_exec();  this overrides the value that is stored with the compiled
1747         pattern.  A  non-NULL value is stored with the compiled pattern only if         pattern. A non-NULL value is stored with the compiled pattern  only  if
1748         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1749         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1750         PCRE's internal tables to be used. This facility is  helpful  when  re-         PCRE's  internal  tables  to be used. This facility is helpful when re-
1751         using  patterns  that  have been saved after compiling with an external         using patterns that have been saved after compiling  with  an  external
1752         set of tables, because the external tables  might  be  at  a  different         set  of  tables,  because  the  external tables might be at a different
1753         address  when  pcre_exec() is called. See the pcreprecompile documenta-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1754         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
1755    
1756     Option bits for pcre_exec()     Option bits for pcre_exec()
1757    
1758         The unused bits of the options argument for pcre_exec() must  be  zero.         The  unused  bits of the options argument for pcre_exec() must be zero.
1759         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1760         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1761         PCRE_PARTIAL.         PCRE_PARTIAL.
1762    
1763           PCRE_ANCHORED           PCRE_ANCHORED
1764    
1765         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1766         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1767         turned  out to be anchored by virtue of its contents, it cannot be made         turned out to be anchored by virtue of its contents, it cannot be  made
1768         unanchored at matching time.         unanchored at matching time.
1769    
1770           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
1771           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
1772           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
1773             PCRE_NEWLINE_ANYCRLF
1774             PCRE_NEWLINE_ANY
1775    
1776         These options override  the  newline  definition  that  was  chosen  or         These  options  override  the  newline  definition  that  was chosen or
1777         defaulted  when the pattern was compiled. For details, see the descrip-         defaulted when the pattern was compiled. For details, see the  descrip-
1778         tion pcre_compile() above. During matching, the newline choice  affects         tion  of  pcre_compile()  above.  During  matching,  the newline choice
1779         the behaviour of the dot, circumflex, and dollar metacharacters.         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
1780           ters.  It may also alter the way the match position is advanced after a
1781           match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,
1782           PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY is set, and a match attempt
1783           fails when the current position is at a CRLF sequence, the match  posi-
1784           tion  is  advanced by two characters instead of one, in other words, to
1785           after the CRLF.
1786    
1787           PCRE_NOTBOL           PCRE_NOTBOL
1788    
1789         This option specifies that the first character of the subject string is         This option specifies that the first character of the subject string is
1790         not the beginning of a line, so the circumflex metacharacter should not         not the beginning of a line, so the circumflex metacharacter should not
1791         match  before it. Setting this without PCRE_MULTILINE (at compile time)         match before it. Setting this without PCRE_MULTILINE (at compile  time)
1792         causes circumflex never to match. This option affects only  the  behav-         causes  circumflex  never to match. This option affects only the behav-
1793         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
1794    
1795           PCRE_NOTEOL           PCRE_NOTEOL
1796    
1797         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
1798         of a line, so the dollar metacharacter should not match it nor  (except         of  a line, so the dollar metacharacter should not match it nor (except
1799         in  multiline mode) a newline immediately before it. Setting this with-         in multiline mode) a newline immediately before it. Setting this  with-
1800         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1801         option  affects only the behaviour of the dollar metacharacter. It does         option affects only the behaviour of the dollar metacharacter. It  does
1802         not affect \Z or \z.         not affect \Z or \z.
1803    
1804           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1805    
1806         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
1807         set.  If  there are alternatives in the pattern, they are tried. If all         set. If there are alternatives in the pattern, they are tried.  If  all
1808         the alternatives match the empty string, the entire  match  fails.  For         the  alternatives  match  the empty string, the entire match fails. For
1809         example, if the pattern         example, if the pattern
1810    
1811           a?b?           a?b?
1812    
1813         is  applied  to  a string not beginning with "a" or "b", it matches the         is applied to a string not beginning with "a" or "b",  it  matches  the
1814         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
1815         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
1816         rences of "a" or "b".         rences of "a" or "b".
1817    
1818         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1819         cial  case  of  a  pattern match of the empty string within its split()         cial case of a pattern match of the empty  string  within  its  split()
1820         function, and when using the /g modifier. It  is  possible  to  emulate         function,  and  when  using  the /g modifier. It is possible to emulate
1821         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1822         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1823         if  that  fails by advancing the starting offset (see below) and trying         if that fails by advancing the starting offset (see below)  and  trying
1824         an ordinary match again. There is some code that demonstrates how to do         an ordinary match again. There is some code that demonstrates how to do
1825         this in the pcredemo.c sample program.         this in the pcredemo.c sample program.
1826    
1827           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1828    
1829         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
1830         UTF-8 string is automatically checked when pcre_exec() is  subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
1831         called.   The  value  of  startoffset is also checked to ensure that it         called.  The value of startoffset is also checked  to  ensure  that  it
1832         points to the start of a UTF-8 character. If an invalid UTF-8  sequence         points  to the start of a UTF-8 character. If an invalid UTF-8 sequence
1833         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1834         startoffset contains an  invalid  value,  PCRE_ERROR_BADUTF8_OFFSET  is         startoffset  contains  an  invalid  value, PCRE_ERROR_BADUTF8_OFFSET is
1835         returned.         returned.
1836    
1837         If  you  already  know that your subject is valid, and you want to skip         If you already know that your subject is valid, and you  want  to  skip
1838         these   checks   for   performance   reasons,   you   can    set    the         these    checks    for   performance   reasons,   you   can   set   the
1839         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
1840         do this for the second and subsequent calls to pcre_exec() if  you  are         do  this  for the second and subsequent calls to pcre_exec() if you are
1841         making  repeated  calls  to  find  all  the matches in a single subject         making repeated calls to find all  the  matches  in  a  single  subject
1842         string. However, you should be  sure  that  the  value  of  startoffset         string.  However,  you  should  be  sure  that the value of startoffset
1843         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is
1844         set, the effect of passing an invalid UTF-8 string as a subject,  or  a         set,  the  effect of passing an invalid UTF-8 string as a subject, or a
1845         value  of startoffset that does not point to the start of a UTF-8 char-         value of startoffset that does not point to the start of a UTF-8  char-
1846         acter, is undefined. Your program may crash.         acter, is undefined. Your program may crash.
1847    
1848           PCRE_PARTIAL           PCRE_PARTIAL
1849    
1850         This option turns on the  partial  matching  feature.  If  the  subject         This  option  turns  on  the  partial  matching feature. If the subject
1851         string  fails to match the pattern, but at some point during the match-         string fails to match the pattern, but at some point during the  match-
1852         ing process the end of the subject was reached (that  is,  the  subject         ing  process  the  end of the subject was reached (that is, the subject
1853         partially  matches  the  pattern and the failure to match occurred only         partially matches the pattern and the failure to  match  occurred  only
1854         because there were not enough subject characters), pcre_exec()  returns         because  there were not enough subject characters), pcre_exec() returns
1855         PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is
1856         used, there are restrictions on what may appear in the  pattern.  These         used,  there  are restrictions on what may appear in the pattern. These
1857         are discussed in the pcrepartial documentation.         are discussed in the pcrepartial documentation.
1858    
1859     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
1860    
1861         The  subject string is passed to pcre_exec() as a pointer in subject, a         The subject string is passed to pcre_exec() as a pointer in subject,  a
1862         length in length, and a starting byte offset in startoffset.  In  UTF-8         length  in  length, and a starting byte offset in startoffset. In UTF-8
1863         mode,  the  byte  offset  must point to the start of a UTF-8 character.         mode, the byte offset must point to the start  of  a  UTF-8  character.
1864         Unlike the pattern string, the subject may contain binary  zero  bytes.         Unlike  the  pattern string, the subject may contain binary zero bytes.
1865         When  the starting offset is zero, the search for a match starts at the         When the starting offset is zero, the search for a match starts at  the
1866         beginning of the subject, and this is by far the most common case.         beginning of the subject, and this is by far the most common case.
1867    
1868         A non-zero starting offset is useful when searching for  another  match         A  non-zero  starting offset is useful when searching for another match
1869         in  the same subject by calling pcre_exec() again after a previous suc-         in the same subject by calling pcre_exec() again after a previous  suc-
1870         cess.  Setting startoffset differs from just passing over  a  shortened         cess.   Setting  startoffset differs from just passing over a shortened
1871         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins
1872         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
1873    
1874           \Biss\B           \Biss\B
1875    
1876         which finds occurrences of "iss" in the middle of  words.  (\B  matches         which  finds  occurrences  of "iss" in the middle of words. (\B matches
1877         only  if  the  current position in the subject is not a word boundary.)         only if the current position in the subject is not  a  word  boundary.)
1878         When applied to the string "Mississippi" the first call to  pcre_exec()         When applied  to the string "Mississippi" the first call to pcre_exec()
1879         finds  the  first  occurrence. If pcre_exec() is called again with just         finds the first occurrence. If pcre_exec() is called  again  with  just
1880         the remainder of the subject,  namely  "issippi", it  does  not  match,         the remainder  of  the  subject,  namely  "issippi", it does not match,
1881         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
1882         to be a word boundary. However, if pcre_exec()  is  passed  the  entire         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
1883         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
1884         rence of "iss" because it is able to look behind the starting point  to         rence  of "iss" because it is able to look behind the starting point to
1885         discover that it is preceded by a letter.         discover that it is preceded by a letter.
1886    
1887         If  a  non-zero starting offset is passed when the pattern is anchored,         If a non-zero starting offset is passed when the pattern  is  anchored,
1888         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
1889         if  the  pattern  does  not require the match to be at the start of the         if the pattern does not require the match to be at  the  start  of  the
1890         subject.         subject.
1891    
1892     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
1893    
1894         In general, a pattern matches a certain portion of the subject, and  in         In  general, a pattern matches a certain portion of the subject, and in
1895         addition,  further  substrings  from  the  subject may be picked out by         addition, further substrings from the subject  may  be  picked  out  by
1896         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
1897         this  is  called "capturing" in what follows, and the phrase "capturing         this is called "capturing" in what follows, and the  phrase  "capturing
1898         subpattern" is used for a fragment of a pattern that picks out  a  sub-         subpattern"  is  used for a fragment of a pattern that picks out a sub-
1899         string.  PCRE  supports several other kinds of parenthesized subpattern         string. PCRE supports several other kinds of  parenthesized  subpattern
1900         that do not cause substrings to be captured.         that do not cause substrings to be captured.
1901    
1902         Captured substrings are returned to the caller via a vector of  integer         Captured  substrings are returned to the caller via a vector of integer
1903         offsets  whose  address is passed in ovector. The number of elements in         offsets whose address is passed in ovector. The number of  elements  in
1904         the vector is passed in ovecsize, which must be a non-negative  number.         the  vector is passed in ovecsize, which must be a non-negative number.
1905         Note: this argument is NOT the size of ovector in bytes.         Note: this argument is NOT the size of ovector in bytes.
1906    
1907         The  first  two-thirds of the vector is used to pass back captured sub-         The first two-thirds of the vector is used to pass back  captured  sub-
1908         strings, each substring using a pair of integers. The  remaining  third         strings,  each  substring using a pair of integers. The remaining third
1909         of  the  vector is used as workspace by pcre_exec() while matching cap-         of the vector is used as workspace by pcre_exec() while  matching  cap-
1910         turing subpatterns, and is not available for passing back  information.         turing  subpatterns, and is not available for passing back information.
1911         The  length passed in ovecsize should always be a multiple of three. If         The length passed in ovecsize should always be a multiple of three.  If
1912         it is not, it is rounded down.         it is not, it is rounded down.
1913    
1914         When a match is successful, information about  captured  substrings  is         When  a  match  is successful, information about captured substrings is
1915         returned  in  pairs  of integers, starting at the beginning of ovector,         returned in pairs of integers, starting at the  beginning  of  ovector,
1916         and continuing up to two-thirds of its length at the  most.  The  first         and  continuing  up  to two-thirds of its length at the most. The first
1917         element of a pair is set to the offset of the first character in a sub-         element of a pair is set to the offset of the first character in a sub-
1918         string, and the second is set to the  offset  of  the  first  character         string,  and  the  second  is  set to the offset of the first character
1919         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-
1920         tor[1], identify the portion of  the  subject  string  matched  by  the         tor[1],  identify  the  portion  of  the  subject string matched by the
1921         entire  pattern.  The next pair is used for the first capturing subpat-         entire pattern. The next pair is used for the first  capturing  subpat-
1922         tern, and so on. The value returned by pcre_exec() is one more than the         tern, and so on. The value returned by pcre_exec() is one more than the
1923         highest numbered pair that has been set. For example, if two substrings         highest numbered pair that has been set. For example, if two substrings
1924         have been captured, the returned value is 3. If there are no  capturing         have  been captured, the returned value is 3. If there are no capturing
1925         subpatterns,  the return value from a successful match is 1, indicating         subpatterns, the return value from a successful match is 1,  indicating
1926         that just the first pair of offsets has been set.         that just the first pair of offsets has been set.
1927    
1928         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1929         of the string that it matched that is returned.         of the string that it matched that is returned.
1930    
1931         If  the vector is too small to hold all the captured substring offsets,         If the vector is too small to hold all the captured substring  offsets,
1932         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
1933         function  returns a value of zero. In particular, if the substring off-         function returns a value of zero. In particular, if the substring  off-
1934         sets are not of interest, pcre_exec() may be called with ovector passed         sets are not of interest, pcre_exec() may be called with ovector passed
1935         as  NULL  and  ovecsize  as zero. However, if the pattern contains back         as NULL and ovecsize as zero. However, if  the  pattern  contains  back
1936         references and the ovector is not big enough to  remember  the  related         references  and  the  ovector is not big enough to remember the related
1937         substrings,  PCRE has to get additional memory for use during matching.         substrings, PCRE has to get additional memory for use during  matching.
1938         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1939    
1940         The pcre_info() function can be used to find  out  how  many  capturing         The  pcre_info()  function  can  be used to find out how many capturing
1941         subpatterns  there  are  in  a  compiled pattern. The smallest size for         subpatterns there are in a compiled  pattern.  The  smallest  size  for
1942         ovector that will allow for n captured substrings, in addition  to  the         ovector  that  will allow for n captured substrings, in addition to the
1943         offsets of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
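
       As a small illustration of the sizing rules described above, the
       following sketch computes the smallest ovecsize for a given number
       of capturing subpatterns, and the number of offset pairs that a
       vector of a given size can hold. The function names min_ovecsize()
       and usable_pairs() are assumptions made for this example; they are
       not part of the PCRE API.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): the smallest ovecsize that
   allows for capture_count captured substrings plus the offsets of the
   whole match, following the (n+1)*3 rule. */
int min_ovecsize(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Hypothetical helper: how many offset pairs a vector of ovecsize
   elements can pass back. The size is rounded down to a multiple of
   three, and only the first two-thirds of the vector holds pairs, so
   this is simply ovecsize / 3 in integer arithmetic. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

       For example, a pattern with two capturing subpatterns needs a
       vector of at least nine integers, and a vector of ten integers
       holds no more pairs than a vector of nine.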
1944    
1945         It  is  possible for capturing subpattern number n+1 to match some part         It is possible for capturing subpattern number n+1 to match  some  part
1946         of the subject when subpattern n has not been used at all. For example,         of the subject when subpattern n has not been used at all. For example,
1947         if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the         if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
1948         return from the function is 4, and subpatterns 1 and 3 are matched, but         return from the function is 4, and subpatterns 1 and 3 are matched, but
1949         2  is  not.  When  this happens, both values in the offset pairs corre-         2 is not. When this happens, both values in  the  offset  pairs  corre-
1950         sponding to unused subpatterns are set to -1.         sponding to unused subpatterns are set to -1.
1951    
1952         Offset values that correspond to unused subpatterns at the end  of  the         Offset  values  that correspond to unused subpatterns at the end of the
1953         expression  are  also  set  to  -1. For example, if the string "abc" is         expression are also set to -1. For example,  if  the  string  "abc"  is
1954         matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
1955         matched.  The  return  from the function is 2, because the highest used         matched. The return from the function is 2, because  the  highest  used
1956         capturing subpattern number is 1. However, you can refer to the offsets         capturing subpattern number is 1. However, you can refer to the offsets
1957         for  the  second  and third capturing subpatterns if you wish (assuming         for the second and third capturing subpatterns if  you  wish  (assuming
1958         the vector is large enough, of course).         the vector is large enough, of course).
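
       The layout of the offset pairs can be sketched as follows. This is
       a minimal example, not PCRE code: copy_capture() is a hypothetical
       helper, and its return value of -1 for an unset or out-of-range
       pair is an assumption of the sketch. The argument pairs is the
       number of pairs the vector can hold (ovecsize divided by three),
       so that pairs beyond the value returned by pcre_exec() can still
       be inspected.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): copy captured substring
   number n out of subject, using the offsets in ovector. Returns the
   length of the copied string, or -1 if the pair is out of range, is
   unset (both offsets are -1), or does not fit in buf. */
int copy_capture(const char *subject, const int *ovector, int pairs,
                 int n, char *buf, int bufsize)
{
    int start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (n < 0 || n >= pairs) return -1;

    start = ovector[2 * n];       /* offset of first character */
    end   = ovector[2 * n + 1];   /* offset just past the last character */
    if (start < 0 || end < start) return -1;   /* unused subpattern */

    len = end - start;
    if (len >= bufsize) return -1;             /* no room for the zero */

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

       With the subject "abc" and the pattern (a|(z))(bc), the offsets
       would be 0,3 for the whole match, 0,1 for subpattern 1, -1,-1 for
       subpattern 2, and 1,3 for subpattern 3, so the helper yields "a"
       for number 1, fails for number 2, and yields "bc" for number 3.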
1959    
1960         Some convenience functions are provided  for  extracting  the  captured         Some  convenience  functions  are  provided for extracting the captured
1961         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
1962    
1963     Error return values from pcre_exec()     Error return values from pcre_exec()
1964    
1965         If  pcre_exec()  fails, it returns a negative number. The following are         If pcre_exec() fails, it returns a negative number. The  following  are
1966         defined in the header file:         defined in the header file:
1967    
1968           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1809  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1971  MATCHING A PATTERN: THE TRADITIONAL FUNC
1971    
1972           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1973    
1974         Either code or subject was passed as NULL,  or  ovector  was  NULL  and         Either  code  or  subject  was  passed as NULL, or ovector was NULL and
1975         ovecsize was not zero.         ovecsize was not zero.
1976    
1977           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1818  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1980  MATCHING A PATTERN: THE TRADITIONAL FUNC
1980    
1981           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1982    
1983         PCRE  stores a 4-byte "magic number" at the start of the compiled code,         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
1984         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
1985         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
1986         an environment with the other endianness. This is the error  that  PCRE         an  environment  with the other endianness. This is the error that PCRE
1987         gives when the magic number is not present.         gives when the magic number is not present.
1988    
1989           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
1990    
1991         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1992         compiled pattern. This error could be caused by a bug  in  PCRE  or  by         compiled  pattern.  This  error  could be caused by a bug in PCRE or by
1993         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1994    
1995           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1996    
1997         If  a  pattern contains back references, but the ovector that is passed         If a pattern contains back references, but the ovector that  is  passed
1998         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1999         PCRE  gets  a  block of memory at the start of matching to use for this         PCRE gets a block of memory at the start of matching to  use  for  this
2000         purpose. If the call via pcre_malloc() fails, this error is given.  The         purpose.  If the call via pcre_malloc() fails, this error is given. The
2001         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
2002    
2003           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2004    
2005         This  error is used by the pcre_copy_substring(), pcre_get_substring(),         This error is used by the pcre_copy_substring(),  pcre_get_substring(),
2006         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
2007         returned by pcre_exec().         returned by pcre_exec().
2008    
2009           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
2010    
2011         The  backtracking  limit,  as  specified  by the match_limit field in a         The backtracking limit, as specified by  the  match_limit  field  in  a
2012         pcre_extra structure (or defaulted) was reached.  See  the  description         pcre_extra  structure  (or  defaulted) was reached. See the description
2013         above.         above.
2014    
          PCRE_ERROR_RECURSIONLIMIT (-21)  
   
        The internal recursion limit, as specified by the match_limit_recursion  
        field in a pcre_extra structure (or defaulted)  was  reached.  See  the  
        description above.  
   
2015           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2016    
2017         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
2018         use by callout functions that want to yield a distinctive  error  code.         use  by  callout functions that want to yield a distinctive error code.
2019         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
2020    
2021           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2022    
2023         A  string  that contains an invalid UTF-8 byte sequence was passed as a         A string that contains an invalid UTF-8 byte sequence was passed  as  a
2024         subject.         subject.
2025    
2026           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2027    
2028         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
2029         value  of startoffset did not point to the beginning of a UTF-8 charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2030         ter.         ter.
2031    
2032           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2033    
2034         The subject string did not match, but it did match partially.  See  the         The  subject  string did not match, but it did match partially. See the
2035         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
2036    
2037           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2038    
2039         The  PCRE_PARTIAL  option  was  used with a compiled pattern containing         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing
2040         items that are not supported for partial matching. See the  pcrepartial         items  that are not supported for partial matching. See the pcrepartial
2041         documentation for details of partial matching.         documentation for details of partial matching.
2042    
2043           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2044    
2045         An  unexpected  internal error has occurred. This error could be caused         An unexpected internal error has occurred. This error could  be  caused
2046         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2047    
2048           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
2049    
2050         This error is given if the value of the ovecsize argument is  negative.         This  error is given if the value of the ovecsize argument is negative.
2051    
2052             PCRE_ERROR_RECURSIONLIMIT (-21)
2053    
2054           The internal recursion limit, as specified by the match_limit_recursion
2055           field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2056           description above.
2057    
2058             PCRE_ERROR_BADNEWLINE     (-23)
2059    
2060           An invalid combination of PCRE_NEWLINE_xxx options was given.
2061    
2062           Error numbers -16 to -20 and -22 are not used by pcre_exec().
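
       A caller might turn the negative return values listed above into
       messages along the following lines. This is only a sketch: the
       function describe_exec_error() and its message texts are
       assumptions of the example, and the numeric values are written as
       literals (with the PCRE names in comments) so that the fragment
       compiles without pcre.h. Real code would normally use the names
       from the header file.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): map a negative return
   value from pcre_exec() to a short description. */
const char *describe_exec_error(int rc)
{
    assert(rc < 0);
    switch (rc) {
    case -1:  return "no match";                   /* PCRE_ERROR_NOMATCH */
    case -2:  return "NULL argument";              /* PCRE_ERROR_NULL */
    case -3:  return "bad option bits";            /* PCRE_ERROR_BADOPTION */
    case -4:  return "bad magic number";           /* PCRE_ERROR_BADMAGIC */
    case -5:  return "unknown item in pattern";    /* PCRE_ERROR_UNKNOWN_OPCODE */
    case -6:  return "out of memory";              /* PCRE_ERROR_NOMEMORY */
    case -8:  return "match limit reached";        /* PCRE_ERROR_MATCHLIMIT */
    case -9:  return "callout error";              /* PCRE_ERROR_CALLOUT */
    case -10: return "invalid UTF-8 subject";      /* PCRE_ERROR_BADUTF8 */
    case -11: return "bad UTF-8 start offset";     /* PCRE_ERROR_BADUTF8_OFFSET */
    case -12: return "partial match";              /* PCRE_ERROR_PARTIAL */
    case -13: return "pattern unsuitable for partial matching";
                                                   /* PCRE_ERROR_BADPARTIAL */
    case -14: return "internal error";             /* PCRE_ERROR_INTERNAL */
    case -15: return "negative ovecsize";          /* PCRE_ERROR_BADCOUNT */
    case -21: return "recursion limit reached";    /* PCRE_ERROR_RECURSIONLIMIT */
    case -23: return "bad newline options";        /* PCRE_ERROR_BADNEWLINE */
    default:  return "not returned by pcre_exec()";
    }
}
```

       Note that PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL are usually
       ordinary outcomes rather than failures, so a caller will often
       test for them before reporting anything.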
2063    
2064    
2065  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
# Line 1907  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2075  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2075         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2076              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2077    
2078         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
2079         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
2080         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2081         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
2082         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
2083         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
2084         substrings.         substrings.
2085    
2086         A  substring that contains a binary zero is correctly extracted and has         A substring that contains a binary zero is correctly extracted and  has
2087         a further zero added on the end, but the result is not, of course, a  C         a  further zero added on the end, but the result is not, of course, a C
2088         string.   However,  you  can  process such a string by referring to the         string.  However, you can process such a string  by  referring  to  the
2089         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
2090         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
2091         not adequate for handling strings containing binary zeros, because  the         not  adequate for handling strings containing binary zeros, because the
2092         end of the final string is not independently indicated.         end of the final string is not independently indicated.
2093    
2094         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
2095         tions: subject is the subject string that has  just  been  successfully         tions:  subject  is  the subject string that has just been successfully
2096         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2097         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2098         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
2099         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
2100         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
2101         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
2102         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
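
       The rule for choosing stringcount can be written as a one-line
       helper. The name stringcount_for() is an assumption of this
       sketch and is not part of PCRE; rc is the non-negative value that
       pcre_exec() returned, and ovecsize is the number of elements in
       the vector that was passed to it.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): the value to pass as
   stringcount to the substring extraction functions. */
int stringcount_for(int rc, int ovecsize)
{
    assert(rc >= 0 && ovecsize >= 0);
    /* A return of zero means the vector was too small, so every pair
       that fits in it (ovecsize / 3) has been filled in. */
    return rc > 0 ? rc : ovecsize / 3;
}
```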
2103    
2104         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
2105         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
2106         zero  extracts  the  substring that matched the entire pattern, whereas         zero extracts the substring that matched the  entire  pattern,  whereas
2107         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
2108         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
2109         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
2110         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
2111         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
2112         the terminating zero, or one of         the terminating zero, or one of these error codes:
2113    
2114           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2115    
2116         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
2117         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2118    
2119           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2120    
2121         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2122    
2123         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
2124         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
2125         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
2126         the  memory  block  is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
2127         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
2128         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all  went  well,  or  the
2129           error code
2130    
2131           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2132    
# Line 1999  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2168  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2168         To extract a substring by name, you first have to  find  the  associated         To extract a substring by name, you first have to  find  the  associated
2169         number. For example, for this pattern         number. For example, for this pattern
2170    
2171           (a+)b(?P<xxx>\d+)...           (a+)b(?<xxx>\d+)...
2172    
2173         the number of the subpattern called "xxx" is 2. If the name is known to         the number of the subpattern called "xxx" is 2. If the name is known to
2174         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
# Line 2025  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2194  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2194    
2195         These  functions call pcre_get_stringnumber(), and if it succeeds, they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2196         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2197         ate.         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2198           behaviour may not be what you want (see the next section).
2199    
2200    
2201  DUPLICATE SUBPATTERN NAMES  DUPLICATE SUBPATTERN NAMES
# Line 2033  DUPLICATE SUBPATTERN NAMES Line 2203  DUPLICATE SUBPATTERN NAMES
2203         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
2204              const char *name, char **first, char **last);              const char *name, char **first, char **last);
2205    
2206         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
2207         subpatterns are not required to  be  unique.  Normally,  patterns  with         subpatterns  are  not  required  to  be unique. Normally, patterns with
2208         duplicate  names  are such that in any one match, only one of the named         duplicate names are such that in any one match, only one of  the  named
2209         subpatterns participates. An example is shown in the pcrepattern  docu-         subpatterns  participates. An example is shown in the pcrepattern docu-
2210         mentation. When duplicates are present, pcre_copy_named_substring() and         mentation.
2211         pcre_get_named_substring() return the first substring corresponding  to  
2212         the  given  name  that  is  set.  If  none  are set, an empty string is         When   duplicates   are   present,   pcre_copy_named_substring()    and
2213         returned.  The pcre_get_stringnumber() function returns one of the num-         pcre_get_named_substring()  return the first substring corresponding to
2214         bers  that are associated with the name, but it is not defined which it         the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
2215         is.         (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
2216           function returns one of the numbers that are associated with the  name,
2217           but it is not defined which it is.
2218    
2219         If you want to get full details of all captured substrings for a  given         If  you want to get full details of all captured substrings for a given
2220         name,  you  must  use  the pcre_get_stringtable_entries() function. The         name, you must use  the  pcre_get_stringtable_entries()  function.  The
2221         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
2222         third  and  fourth  are  pointers to variables which are updated by the         third and fourth are pointers to variables which  are  updated  by  the
2223         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
2224         the  name-to-number  table  for  the  given  name.  The function itself         the name-to-number table  for  the  given  name.  The  function  itself
2225         returns the length of each entry, or  PCRE_ERROR_NOSUBSTRING  if  there         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
2226         are  none.  The  format  of the table is described above in the section         there are none. The format of the table is described above in the  sec-
2227         entitled Information about a pattern. Given all  the  relevant  entries         tion  entitled  Information  about  a  pattern.  Given all the relevant
2228         for the name, you can extract each of their numbers, and hence the cap-         entries for the name, you can extract each of their numbers, and  hence
2229         tured data, if any.         the captured data, if any.
2230    
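       As a rough illustration of the last step, the sketch below walks the
       run of table entries delimited by the two pointers and collects the
       group numbers. It assumes the entry layout given in the section on
       information about a pattern: two bytes of group number, most
       significant byte first, followed by the zero-terminated name. The
       helper name collect_group_numbers() is hypothetical and is not part of
       the PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Collect the group numbers stored in a run of name-table entries.
 *
 * Assumed entry layout: two bytes of group number, most significant byte
 * first, then the zero-terminated name. "first" and "last" stand for the
 * pointers filled in by pcre_get_stringtable_entries(), and "entry_size"
 * for its positive return value.
 *
 * Returns how many numbers were written to "out" (at most "max_out").
 * This is a hypothetical helper, not a PCRE function.
 */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *out, int max_out)
{
    int count = 0;
    const char *entry;

    /* Two number bytes plus at least the terminating zero of the name. */
    assert(entry_size >= 3);

    for (entry = first; entry <= last && count < max_out; entry += entry_size) {
        const unsigned char *bytes = (const unsigned char *)entry;
        out[count++] = (bytes[0] << 8) | bytes[1];
    }
    return count;
}
```

       After a successful pcre_exec() call, each collected number n can be
       used to index the offsets vector (elements 2*n and 2*n+1) to see
       whether that particular group was set.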
2231    
2232  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2233    
2234         The traditional matching function uses a  similar  algorithm  to  Perl,         The  traditional  matching  function  uses a similar algorithm to Perl,
2235         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2236         the subject. If you want to find all possible matches, or  the  longest         the  subject.  If you want to find all possible matches, or the longest
2237         possible  match,  consider using the alternative matching function (see         possible match, consider using the alternative matching  function  (see
2238         below) instead. If you cannot use the alternative function,  but  still         below)  instead.  If you cannot use the alternative function, but still
2239         need  to  find all possible matches, you can kludge it up by making use         need to find all possible matches, you can kludge it up by  making  use
2240         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2241         tation.         tation.
2242    
2243         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2244         tern.  When your callout function is called, extract and save the  cur-         tern.   When your callout function is called, extract and save the cur-
2245         rent  matched  substring.  Then  return  1, which forces pcre_exec() to         rent matched substring. Then return  1,  which  forces  pcre_exec()  to
2246         backtrack and try other alternatives. Ultimately, when it runs  out  of         backtrack  and  try other alternatives. Ultimately, when it runs out of
2247         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2248    
2249    
# Line 2082  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2254  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2254              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2255              int *workspace, int wscount);              int *workspace, int wscount);
2256    
2257         The  function  pcre_dfa_exec()  is  called  to  match  a subject string         The function pcre_dfa_exec()  is  called  to  match  a  subject  string
2258         against a compiled pattern, using a "DFA" matching algorithm. This  has         against  a  compiled pattern, using a matching algorithm that scans the
2259         different  characteristics to the normal algorithm, and is not compati-         subject string just once, and does not backtrack.  This  has  different
2260         ble with Perl. Some of the features of PCRE patterns are not supported.         characteristics  to  the  normal  algorithm, and is not compatible with
2261         Nevertheless, there are times when this kind of matching can be useful.         Perl. Some of the features of PCRE patterns are not  supported.  Never-
2262         For a discussion of the two matching algorithms, see  the  pcrematching         theless,  there are times when this kind of matching can be useful. For
2263         documentation.         a discussion of the two matching algorithms, see the pcrematching docu-
2264           mentation.
2265    
2266         The  arguments  for  the  pcre_dfa_exec()  function are the same as for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2267         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
# Line 2141  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2314  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2314           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2315    
2316         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2317         stop  as  soon  as  it  has found one match. Because of the way the DFA         stop as soon as it has found one match. Because of the way the alterna-
2318         algorithm works, this is necessarily the shortest possible match at the         tive algorithm works, this is necessarily the shortest  possible  match
2319         first possible matching point in the subject string.         at the first possible matching point in the subject string.
2320    
2321           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2322    
# Line 2179  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2352  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2352         On success, the yield of the function is a number  greater  than  zero,         On success, the yield of the function is a number  greater  than  zero,
2353         which  is  the  number of matched substrings. The substrings themselves         which  is  the  number of matched substrings. The substrings themselves
2354         are returned in ovector. Each string uses two elements;  the  first  is         are returned in ovector. Each string uses two elements;  the  first  is
2355         the  offset  to the start, and the second is the offset to the end. All         the  offset  to  the start, and the second is the offset to the end. In
2356         the strings have the same start offset. (Space could have been saved by         fact, all the strings have the same start  offset.  (Space  could  have
2357         giving  this only once, but it was decided to retain some compatibility         been  saved by giving this only once, but it was decided to retain some
2358         with the way pcre_exec() returns data, even though the meaning  of  the         compatibility with the way pcre_exec() returns data,  even  though  the
2359         strings is different.)         meaning of the strings is different.)
2360    
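       The layout described above can be read back with a simple loop. The
       following sketch assumes that ovector and the match count come from a
       successful pcre_dfa_exec() call; the helper name print_dfa_matches()
       is hypothetical, and the values used to exercise it would be supplied
       by the caller.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Print every match described by an ovector in the pcre_dfa_exec() layout.
 *
 * "count" stands for the positive return value of the call. Match i uses
 * ovector[2*i] (offset of the start) and ovector[2*i + 1] (offset of the
 * end). Returns the length of the longest match seen.
 * This is a hypothetical helper, not a PCRE function.
 */
static int print_dfa_matches(const char *subject, const int *ovector, int count)
{
    int i;
    int longest = 0;

    assert(count > 0);

    for (i = 0; i < count; i++) {
        int start = ovector[2 * i];
        int length = ovector[2 * i + 1] - start;

        /* %.*s prints exactly "length" bytes starting at the match. */
        printf("match %d: \"%.*s\"\n", i, length, subject + start);
        if (length > longest)
            longest = length;
    }
    return longest;
}
```

       The loop does not rely on the order of the entries, so it also copes
       with a vector that holds only some of the matches.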
2361         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2362         est matching string is given first. If there were too many  matches  to         est matching string is given first. If there were too many  matches  to
# Line 2205  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2378  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2378    
2379           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2380    
2381         This  return is given if pcre_dfa_exec() encounters a condition item in         This  return  is  given  if pcre_dfa_exec() encounters a condition item
2382         a pattern that uses a back reference for the  condition.  This  is  not         that uses a back reference for the condition, or a test  for  recursion
2383         supported.         in a specific group. These are not supported.
2384    
2385           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2386    
# Line 2227  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2400  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2400         This error is given if the output vector  is  not  large  enough.  This         This error is given if the output vector  is  not  large  enough.  This
2401         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
2402    
2403  Last updated: 08 June 2006  
2404  Copyright (c) 1997-2006 University of Cambridge.  SEE ALSO
2405    
2406           pcrebuild(3),  pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar-
2407           tial(3), pcreposix(3), pcreprecompile(3), pcresample(3),  pcrestack(3).
2408    
2409    
2410    AUTHOR
2411    
2412           Philip Hazel
2413           University Computing Service
2414           Cambridge CB2 3QH, England.
2415    
2416    
2417    REVISION
2418    
2419           Last updated: 30 July 2007
2420           Copyright (c) 1997-2007 University of Cambridge.
2421  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2422    
2423    
# Line 2255  PCRE CALLOUTS Line 2444  PCRE CALLOUTS
2444         default value is zero.  For  example,  this  pattern  has  two  callout         default value is zero.  For  example,  this  pattern  has  two  callout
2445         points:         points:
2446    
2447           (?C1)eabc(?C2)def           (?C1)abc(?C2)def
2448    
2449         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
2450         called, PCRE automatically  inserts  callouts,  all  with  number  255,         called, PCRE automatically  inserts  callouts,  all  with  number  255,
# Line 2330  THE CALLOUT INTERFACE Line 2519  THE CALLOUT INTERFACE
2519         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
2520         were passed to pcre_exec().         were passed to pcre_exec().
2521    
2522         The start_match field contains the offset within the subject  at  which         The start_match field normally contains the offset within  the  subject
2523         the  current match attempt started. If the pattern is not anchored, the         at  which  the  current  match  attempt started. However, if the escape
2524         callout function may be called several times from the same point in the         sequence \K has been encountered, this value is changed to reflect  the
2525         pattern for different starting points in the subject.         modified  starting  point.  If the pattern is not anchored, the callout
2526           function may be called several times from the same point in the pattern
2527           for different starting points in the subject.
2528    
2529         The  current_position  field  contains the offset within the subject of         The  current_position  field  contains the offset within the subject of
2530         the current match pointer.         the current match pointer.
# Line 2386  RETURN VALUES Line 2577  RETURN VALUES
2577         reserved for use by callout functions; it will never be  used  by  PCRE         reserved for use by callout functions; it will never be  used  by  PCRE
2578         itself.         itself.
2579    
2580  Last updated: 28 February 2005  
2581  Copyright (c) 1997-2005 University of Cambridge.  AUTHOR
2582    
2583           Philip Hazel
2584           University Computing Service
2585           Cambridge CB2 3QH, England.
2586    
2587    
2588    REVISION
2589    
2590           Last updated: 29 May 2007
2591           Copyright (c) 1997-2007 University of Cambridge.
2592  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2593    
2594    
# Line 2401  NAME Line 2602  NAME
2602  DIFFERENCES BETWEEN PCRE AND PERL  DIFFERENCES BETWEEN PCRE AND PERL
2603    
2604         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2605         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences described here  are  mainly
2606         respect to Perl 5.8.         with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2607           some features that are expected to be in the forthcoming Perl 5.10.
2608    
2609         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details         1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2610         of what it does have are given in the section on UTF-8 support  in  the         of  what  it does have are given in the section on UTF-8 support in the
2611         main pcre page.         main pcre page.
2612    
2613         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2614         permits them, but they do not mean what you might think.  For  example,         permits  them,  but they do not mean what you might think. For example,
2615         (?!a){3} does not assert that the next three characters are not "a". It         (?!a){3} does not assert that the next three characters are not "a". It
2616         just asserts that the next character is not "a" three times.         just asserts that the next character is not "a" three times.
2617    
2618         3. Capturing subpatterns that occur inside  negative  lookahead  asser-         3.  Capturing  subpatterns  that occur inside negative lookahead asser-
2619         tions  are  counted,  but their entries in the offsets vector are never         tions are counted, but their entries in the offsets  vector  are  never
2620         set. Perl sets its numerical variables from any such patterns that  are         set.  Perl sets its numerical variables from any such patterns that are
2621         matched before the assertion fails to match something (thereby succeed-         matched before the assertion fails to match something (thereby succeed-
2622         ing), but only if the negative lookahead assertion  contains  just  one         ing),  but  only  if the negative lookahead assertion contains just one
2623         branch.         branch.
2624    
2625         4.  Though  binary zero characters are supported in the subject string,         4. Though binary zero characters are supported in the  subject  string,
2626         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
2627         mal C string, terminated by zero. The escape sequence \0 can be used in         mal C string, terminated by zero. The escape sequence \0 can be used in
2628         the pattern to represent a binary zero.         the pattern to represent a binary zero.
2629    
2630         5. The following Perl escape sequences are not supported: \l,  \u,  \L,         5.  The  following Perl escape sequences are not supported: \l, \u, \L,
2631         \U, and \N. In fact these are implemented by Perl's general string-han-         \U, and \N. In fact these are implemented by Perl's general string-han-
2632         dling and are not part of its pattern matching engine. If any of  these         dling  and are not part of its pattern matching engine. If any of these
2633         are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
2634    
2635         6.  The Perl escape sequences \p, \P, and \X are supported only if PCRE         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
2636         is built with Unicode character property support. The  properties  that         is  built  with Unicode character property support. The properties that
2637         can  be tested with \p and \P are limited to the general category prop-         can be tested with \p and \P are limited to the general category  prop-
2638         erties such as Lu and Nd, script names such as Greek or  Han,  and  the         erties  such  as  Lu and Nd, script names such as Greek or Han, and the
2639         derived properties Any and L&.         derived properties Any and L&.
2640    
2641         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2642         ters in between are treated as literals.  This  is  slightly  different         ters  in  between  are  treated as literals. This is slightly different
2643         from  Perl  in  that  $  and  @ are also handled as literals inside the         from Perl in that $ and @ are  also  handled  as  literals  inside  the
2644         quotes. In Perl, they cause variable interpolation (but of course  PCRE         quotes.  In Perl, they cause variable interpolation (but of course PCRE
2645         does not have variables). Note the following examples:         does not have variables). Note the following examples:
2646    
2647             Pattern            PCRE matches      Perl matches             Pattern            PCRE matches      Perl matches
# Line 2449  DIFFERENCES BETWEEN PCRE AND PERL Line 2651  DIFFERENCES BETWEEN PCRE AND PERL
2651             \Qabc\$xyz\E       abc\$xyz          abc\$xyz             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
2652             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
2653    
2654         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
2655         classes.         classes.
2656    
2657         8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2658         constructions.  However,  there is support for recursive patterns using         constructions. However, there is support for recursive  patterns.  This
2659         the non-Perl items (?R),  (?number),  and  (?P>name).  Also,  the  PCRE         is  not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
2660         "callout"  feature allows an external function to be called during pat-         "callout" feature allows an external function to be called during  pat-
2661         tern matching. See the pcrecallout documentation for details.         tern matching. See the pcrecallout documentation for details.
2662    
2663         9. There are some differences that are concerned with the  settings  of         9.  Subpatterns  that  are  called  recursively or as "subroutines" are
2664         captured  strings  when  part  of  a  pattern is repeated. For example,         always treated as atomic groups in  PCRE.  This  is  like  Python,  but
2665         matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2         unlike Perl.
2666    
2667           10.  There are some differences that are concerned with the settings of
2668           captured strings when part of  a  pattern  is  repeated.  For  example,
2669           matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
2670         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
2671    
2672         10. PCRE provides some extensions to the Perl regular expression facil-         11. PCRE provides some extensions to the Perl regular expression facil-
2673         ities:         ities.   Perl  5.10  will  include new features that are not in earlier
2674           versions, some of which (such as named parentheses) have been  in  PCRE
2675           for some time. This list is with respect to Perl 5.10:
2676    
2677         (a) Although lookbehind assertions must  match  fixed  length  strings,         (a)  Although  lookbehind  assertions  must match fixed length strings,
2678         each alternative branch of a lookbehind assertion can match a different         each alternative branch of a lookbehind assertion can match a different
2679         length of string. Perl requires them all to have the same length.         length of string. Perl requires them all to have the same length.
2680    
2681         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
2682         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2683    
2684         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2685         cial meaning  is  faulted.  Otherwise,  like  Perl,  the  backslash  is         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
2686         ignored. (Perl can be made to issue a warning.)         ignored.  (Perl can be made to issue a warning.)
2687    
2688         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
2689         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
2690         lowed by a question mark they are.         lowed by a question mark they are.
2691    
2692         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2693         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
2694    
2695         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2696         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2697    
2698         (g)  The (?R), (?number), and (?P>name) constructs allow for recursive         (g) The callout facility is PCRE-specific.
        pattern matching (Perl can do  this  using  the  (?p{code})  construct,  
        which PCRE cannot support.)  
2699    
2700         (h)  PCRE supports named capturing substrings, using the Python syntax.         (h) The partial matching facility is PCRE-specific.
2701    
2702         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from         (i) Patterns compiled by PCRE can be saved and re-used at a later time,
2703         Sun's Java package.         even on different hosts that have the other endianness.
2704    
2705         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j)  The  alternative  matching function (pcre_dfa_exec()) matches in a
2706           different way and is not Perl-compatible.
2707    
        (k) The callout facility is PCRE-specific.  
2708    
2709         (l) The partial matching facility is PCRE-specific.  AUTHOR
2710    
2711         (m) Patterns compiled by PCRE can be saved and re-used at a later time,         Philip Hazel
2712         even on different hosts that have the other endianness.         University Computing Service
2713           Cambridge CB2 3QH, England.
2714    
        (n) The alternative matching function (pcre_dfa_exec())  matches  in  a  
        different way and is not Perl-compatible.  
2715    
2716  Last updated: 06 June 2006  REVISION
2717  Copyright (c) 1997-2006 University of Cambridge.  
2718           Last updated: 13 June 2007
2719           Copyright (c) 1997-2007 University of Cambridge.
2720  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2721    
2722    
# Line 2522  NAME Line 2729  NAME
2729    
2730  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
2731    
2732         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax and semantics of the regular expressions that are supported
2733         are described below. Regular expressions are also described in the Perl         by PCRE are described in detail below. There is a quick-reference  syn-
2734         documentation  and  in  a  number  of books, some of which have copious         tax  summary  in  the  pcresyntax  page. Perl's regular expressions are
2735         examples.  Jeffrey Friedl's "Mastering Regular Expressions",  published         described in its own documentation, and regular expressions in  general
2736         by  O'Reilly, covers regular expressions in great detail. This descrip-         are  covered in a number of books, some of which have copious examples.
2737         tion of PCRE's regular expressions is intended as reference material.         Jeffrey  Friedl's  "Mastering  Regular   Expressions",   published   by
2738           O'Reilly,  covers regular expressions in great detail. This description
2739           of PCRE's regular expressions is intended as reference material.
2740    
2741         The original operation of PCRE was on strings of  one-byte  characters.         The original operation of PCRE was on strings of  one-byte  characters.
2742         However,  there is now also support for UTF-8 character strings. To use         However,  there is now also support for UTF-8 character strings. To use
# Line 2541  PCRE REGULAR EXPRESSION DETAILS Line 2750  PCRE REGULAR EXPRESSION DETAILS
2750         ported  by  PCRE when its main matching function, pcre_exec(), is used.         ported  by  PCRE when its main matching function, pcre_exec(), is used.
2751         From  release  6.0,   PCRE   offers   a   second   matching   function,         From  release  6.0,   PCRE   offers   a   second   matching   function,
2752         pcre_dfa_exec(),  which matches using a different algorithm that is not         pcre_dfa_exec(),  which matches using a different algorithm that is not
2753         Perl-compatible. The advantages and disadvantages  of  the  alternative         Perl-compatible. Some of the features discussed below are not available
2754         function, and how it differs from the normal function, are discussed in         when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
2755         the pcrematching page.         alternative function, and how it differs from the normal function,  are
2756           discussed in the pcrematching page.
2757         A regular expression is a pattern that is  matched  against  a  subject  
2758         string  from  left  to right. Most characters stand for themselves in a  
2759         pattern, and match the corresponding characters in the  subject.  As  a  CHARACTERS AND METACHARACTERS
2760    
2761           A  regular  expression  is  a pattern that is matched against a subject
2762           string from left to right. Most characters stand for  themselves  in  a
2763           pattern,  and  match  the corresponding characters in the subject. As a
2764         trivial example, the pattern         trivial example, the pattern
2765    
2766           The quick brown fox           The quick brown fox
2767    
2768         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
2769         caseless matching is specified (the PCRE_CASELESS option), letters  are         caseless  matching is specified (the PCRE_CASELESS option), letters are
2770         matched  independently  of case. In UTF-8 mode, PCRE always understands         matched independently of case. In UTF-8 mode, PCRE  always  understands
2771         the concept of case for characters whose values are less than  128,  so         the  concept  of case for characters whose values are less than 128, so
2772         caseless  matching  is always possible. For characters with higher val-         caseless matching is always possible. For characters with  higher  val-
2773         ues, the concept of case is supported if PCRE is compiled with  Unicode         ues,  the concept of case is supported if PCRE is compiled with Unicode
2774         property  support,  but  not  otherwise.   If  you want to use caseless         property support, but not otherwise.   If  you  want  to  use  caseless
2775         matching for characters 128 and above, you must  ensure  that  PCRE  is         matching  for  characters  128  and above, you must ensure that PCRE is
2776         compiled with Unicode property support as well as with UTF-8 support.         compiled with Unicode property support as well as with UTF-8 support.
2777    
2778         The  power  of  regular  expressions  comes from the ability to include         The power of regular expressions comes  from  the  ability  to  include
2779         alternatives and repetitions in the pattern. These are encoded  in  the         alternatives  and  repetitions in the pattern. These are encoded in the
2780         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
2781         but instead are interpreted in some special way.         but instead are interpreted in some special way.
2782    
2783         There are two different sets of metacharacters: those that  are  recog-         There  are  two different sets of metacharacters: those that are recog-
2784         nized  anywhere in the pattern except within square brackets, and those         nized anywhere in the pattern except within square brackets, and  those
2785         that are recognized in square brackets. Outside  square  brackets,  the         that  are  recognized  within square brackets. Outside square brackets,
2786         metacharacters are as follows:         the metacharacters are as follows:
2787    
2788           \      general escape character with several uses           \      general escape character with several uses
2789           ^      assert start of string (or line, in multiline mode)           ^      assert start of string (or line, in multiline mode)
# Line 2588  PCRE REGULAR EXPRESSION DETAILS Line 2801  PCRE REGULAR EXPRESSION DETAILS
2801                  also "possessive quantifier"                  also "possessive quantifier"
2802           {      start min/max quantifier           {      start min/max quantifier
2803    
2804         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
2805         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
2806    
2807           \      general escape character           \      general escape character
# Line 2598  PCRE REGULAR EXPRESSION DETAILS Line 2811  PCRE REGULAR EXPRESSION DETAILS
2811                    syntax)                    syntax)
2812           ]      terminates the character class           ]      terminates the character class
2813    
2814         The following sections describe the use of each of the  metacharacters.         The  following sections describe the use of each of the metacharacters.
2815    
2816    
2817  BACKSLASH  BACKSLASH
2818    
2819         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
2820         a non-alphanumeric character, it takes away any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
2821         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
2822         applies both inside and outside character classes.         applies both inside and outside character classes.
2823    
2824         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
2825         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
2826         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
2827         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
2828         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
2829         slash, you write \\.         slash, you write \\.
2830    
2831         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
2832         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
2833         # outside a character class and the next newline are ignored. An escap-         # outside a character class and the next newline are ignored. An escap-
2834         ing backslash can be used to include a whitespace  or  #  character  as         ing  backslash  can  be  used to include a whitespace or # character as
2835         part of the pattern.         part of the pattern.
2836    
2837         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
2838         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
2839         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
2840         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
2841         tion. Note the following examples:         tion. Note the following examples:
2842    
2843           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 2634  BACKSLASH Line 2847  BACKSLASH
2847           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2848           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2849    
2850         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
2851         classes.         classes.
2852    
2853     Non-printing characters     Non-printing characters
2854    
2855         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
2856         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
2857         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
2858         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
2859         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
2860         sequences than the binary character it represents:         sequences than the binary character it represents:
2861    
2862           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 2657  BACKSLASH Line 2870  BACKSLASH
2870           \xhh      character with hex code hh           \xhh      character with hex code hh
2871           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
2872    
2873         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
2874         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
2875         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
2876         becomes hex 7B.         becomes hex 7B.
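
The \cx rule above is plain arithmetic on the character code, so it can be traced by hand. The following is a minimal Python sketch of that arithmetic only. The function name control_escape is hypothetical and is not part of PCRE.

```python
def control_escape(x: str) -> int:
    r"""Return the character code that \cx denotes, following the rule above."""
    code = ord(x)
    # A lower case letter is first converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Then bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


print(hex(control_escape("z")))  # 0x1a
print(hex(control_escape("{")))  # 0x3b
print(hex(control_escape(";")))  # 0x7b
```

The three calls reproduce the \cz, \c{ and \c; cases quoted in the text.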
2877    
2878         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
2879         in  upper  or  lower case). Any number of hexadecimal digits may appear         in upper or lower case). Any number of hexadecimal  digits  may  appear
2880         between \x{ and }, but the value of the character  code  must  be  less         between  \x{  and  },  but the value of the character code must be less
2881         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2882         the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than         the  maximum  hexadecimal  value is 7FFFFFFF). If characters other than
2883         hexadecimal  digits  appear between \x{ and }, or if there is no termi-         hexadecimal digits appear between \x{ and }, or if there is  no  termi-
2884         nating }, this form of escape is not recognized.  Instead, the  initial         nating  }, this form of escape is not recognized.  Instead, the initial
2885         \x will be interpreted as a basic hexadecimal escape, with no following         \x will be interpreted as a basic hexadecimal escape, with no following
2886         digits, giving a character whose value is zero.         digits, giving a character whose value is zero.
2887    
2888         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2889         two  syntaxes  for  \x. There is no difference in the way they are han-         two syntaxes for \x. There is no difference in the way  they  are  han-
2890         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
2891    
2892         After \0 up to two further octal digits are read. If  there  are  fewer         After  \0  up  to two further octal digits are read. If there are fewer
2893         than  two  digits,  just  those  that  are  present  are used. Thus the         than two digits, just  those  that  are  present  are  used.  Thus  the
2894         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2895         (code  value 7). Make sure you supply two digits after the initial zero         (code value 7). Make sure you supply two digits after the initial  zero
2896         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
2897    
2898         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2899         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2900         its as a decimal number. If the number is less than  10,  or  if  there         its  as  a  decimal  number. If the number is less than 10, or if there
2901         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2902         expression, the entire  sequence  is  taken  as  a  back  reference.  A         expression,  the  entire  sequence  is  taken  as  a  back reference. A
2903         description  of how this works is given later, following the discussion         description of how this works is given later, following the  discussion
2904         of parenthesized subpatterns.         of parenthesized subpatterns.
2905    
2906         Inside a character class, or if the decimal number is  greater  than  9         Inside  a  character  class, or if the decimal number is greater than 9
2907         and  there have not been that many capturing subpatterns, PCRE re-reads         and there have not been that many capturing subpatterns, PCRE  re-reads
2908         up to three octal digits following the backslash, ane uses them to gen-         up to three octal digits following the backslash, and uses them to gen-
2909         erate  a data character. Any subsequent digits stand for themselves. In         erate a data character. Any subsequent digits stand for themselves.  In
2910         non-UTF-8 mode, the value of a character specified  in  octal  must  be         non-UTF-8  mode,  the  value  of a character specified in octal must be
2911         less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For         less than \400. In UTF-8 mode, values up to  \777  are  permitted.  For
2912         example:         example:
2913    
2914           \040   is another way of writing a space           \040   is another way of writing a space
# Line 2713  BACKSLASH Line 2926  BACKSLASH
2926           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2927                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2928    
2929         Note that octal values of 100 or greater must not be  introduced  by  a         Note  that  octal  values of 100 or greater must not be introduced by a
2930         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
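
The decision between a back reference and an octal character can be hard to follow in prose. The sketch below restates the rule as described above in Python. It is an illustration, not PCRE's implementation: the function name, its parameters, and the returned tuples are all hypothetical, and the upper limits on the value (\400 or \777) are not checked.

```python
def interpret_backslash_digits(digits: str, groups_so_far: int,
                               in_class: bool = False):
    """Classify the digits that follow a backslash.

    Returns ("backref", number, "") or ("char", code, literal_rest), where
    literal_rest holds the digits that stand for themselves.
    """
    if not in_class:
        # Outside a class, the digits are first read as a decimal number.
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits as a data character.
    i = 0
    while i < len(digits) and i < 3 and digits[i] in "01234567":
        i += 1
    code = int(digits[:i], 8) if i else 0
    return ("char", code, digits[i:])


print(interpret_backslash_digits("040", 0))  # ('char', 32, '')
print(interpret_backslash_digits("81", 0))   # ('char', 0, '81')
print(interpret_backslash_digits("7", 0))    # ('backref', 7, '')
```

The second call shows the \81 case from the list: with fewer than 81 capturing groups, no octal digits can be read, so the result is a binary zero followed by the literal characters "8" and "1".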
2931    
2932         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
2933         inside and outside character classes. In addition, inside  a  character         inside  and  outside character classes. In addition, inside a character
2934         class,  the  sequence \b is interpreted as the backspace character (hex         class, the sequence \b is interpreted as the backspace  character  (hex
2935         08), and the sequence \X is interpreted as the character "X". Outside a         08),  and the sequences \R and \X are interpreted as the characters "R"
2936         character class, these sequences have different meanings (see below).         and "X", respectively. Outside a character class, these sequences  have
2937           different meanings (see below).
2938    
2939       Absolute and relative back references
2940    
2941           The  sequence  \g followed by an unsigned or a negative number, option-
2942           ally enclosed in braces, is an absolute or relative back  reference.  A
2943           named back reference can be coded as \g{name}. Back references are dis-
2944           cussed later, following the discussion of parenthesized subpatterns.
2945    
2946     Generic character types     Generic character types
2947    
2948         The  third  use of backslash is for specifying generic character types.         Another use of backslash is for specifying generic character types. The
2949         The following are always recognized:         following are always recognized:
2950    
2951           \d     any decimal digit           \d     any decimal digit
2952           \D     any character that is not a decimal digit           \D     any character that is not a decimal digit
2953             \h     any horizontal whitespace character
2954             \H     any character that is not a horizontal whitespace character
2955           \s     any whitespace character           \s     any whitespace character
2956           \S     any character that is not a whitespace character           \S     any character that is not a whitespace character
2957             \v     any vertical whitespace character
2958             \V     any character that is not a vertical whitespace character
2959           \w     any "word" character           \w     any "word" character
2960           \W     any "non-word" character           \W     any "non-word" character
2961    
2962         Each pair of escape sequences partitions the complete set of characters         Each pair of escape sequences partitions the complete set of characters
2963         into  two disjoint sets. Any given character matches one, and only one,         into two disjoint sets. Any given character matches one, and only  one,
2964         of each pair.         of each pair.
2965    
2966         These character type sequences can appear both inside and outside char-         These character type sequences can appear both inside and outside char-
2967         acter  classes.  They each match one character of the appropriate type.         acter classes. They each match one character of the  appropriate  type.
2968         If the current matching point is at the end of the subject string,  all         If  the current matching point is at the end of the subject string, all
2969         of them fail, since there is no character to match.         of them fail, since there is no character to match.
2970    
2971         For  compatibility  with Perl, \s does not match the VT character (code         For compatibility with Perl, \s does not match the VT  character  (code
2972         11).  This makes it different from the POSIX "space" class. The  \s         11).   This makes it different from the POSIX "space" class. The \s
2973         characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If
2974         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
2975         ter. In PCRE, it never does.)         ter. In PCRE, it never does.
2976    
2977           In UTF-8 mode, characters with values greater than 128 never match  \d,
2978           \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2979           code character property support is available.  These  sequences  retain
2980           their original meanings from before UTF-8 support was available, mainly
2981           for efficiency reasons.
2982    
2983           The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
2984           the  other  sequences, these do match certain high-valued codepoints in
2985           UTF-8 mode.  The horizontal space characters are:
2986    
2987             U+0009     Horizontal tab
2988             U+0020     Space
2989             U+00A0     Non-break space
2990             U+1680     Ogham space mark
2991             U+180E     Mongolian vowel separator
2992             U+2000     En quad
2993             U+2001     Em quad
2994             U+2002     En space
2995             U+2003     Em space
2996             U+2004     Three-per-em space
2997             U+2005     Four-per-em space
2998             U+2006     Six-per-em space
2999             U+2007     Figure space
3000             U+2008     Punctuation space
3001             U+2009     Thin space
3002             U+200A     Hair space
3003             U+202F     Narrow no-break space
3004             U+205F     Medium mathematical space
3005             U+3000     Ideographic space
3006    
3007           The vertical space characters are:
3008    
3009             U+000A     Linefeed
3010             U+000B     Vertical tab
3011             U+000C     Formfeed
3012             U+000D     Carriage return
3013             U+0085     Next line
3014             U+2028     Line separator
3015             U+2029     Paragraph separator
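
The two lists above can be transcribed into lookup sets to check which class a code point falls into. This is a minimal Python sketch that only mirrors the tables. The names HORIZONTAL_SPACE, VERTICAL_SPACE, is_horizontal_space and is_vertical_space are hypothetical and do not come from PCRE.

```python
# Code points matched by \h, copied from the horizontal space list above.
HORIZONTAL_SPACE = frozenset({
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),          # U+2000 (en quad) to U+200A (hair space)
    0x202F, 0x205F, 0x3000,
})

# Code points matched by \v, copied from the vertical space list above.
VERTICAL_SPACE = frozenset({
    0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029,
})


def is_horizontal_space(ch: str) -> bool:
    """True if ch is in the \\h list; \\H is the complement."""
    return ord(ch) in HORIZONTAL_SPACE


def is_vertical_space(ch: str) -> bool:
    """True if ch is in the \\v list; \\V is the complement."""
    return ord(ch) in VERTICAL_SPACE


print(is_horizontal_space("\u00a0"))  # True
print(is_vertical_space("\u0085"))    # True
print(is_horizontal_space("\n"))      # False
```

The two sets share no code points, so no character is both horizontal and vertical whitespace.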
3016    
3017         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
3018         is a letter or digit. The definition of  letters  and  digits  is  con-         is  a  letter  or  digit.  The definition of letters and digits is con-
3019         trolled  by PCRE's low-valued character tables, and may vary if locale-         trolled by PCRE's low-valued character tables, and may vary if  locale-
3020         specific matching is taking place (see "Locale support" in the  pcreapi         specific  matching is taking place (see "Locale support" in the pcreapi
3021         page).  For  example,  in  the  "fr_FR" (French) locale, some character         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
3022         codes greater than 128 are used for accented  letters,  and  these  are         systems,  or "french" in Windows, some character codes greater than 128
3023         matched by \w.         are used for accented letters, and these are matched by \w. The use  of
3024           locales with Unicode is discouraged.
3025    
3026       Newline sequences
3027    
3028           Outside  a  character class, the escape sequence \R matches any Unicode
3029           newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R  is
3030           equivalent to the following:
3031    
3032             (?>\r\n|\n|\x0b|\f|\r|\x85)
3033    
3034           This  is  an  example  of an "atomic group", details of which are given
3035           below.  This particular group matches either the two-character sequence
3036           CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,
3037           U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3038           return, U+000D), or NEL (next line, U+0085). The two-character sequence
3039           is treated as a single unit that cannot be split.
3040    
3041           In UTF-8 mode, two additional characters whose codepoints  are  greater
3042           than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3043           rator, U+2029).  Unicode character property support is not  needed  for
3044           these characters to be recognized.
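
The behaviour of \R described above can be modelled without a regular expression engine. The following Python sketch is an illustration under stated assumptions: the function name newline_sequence_length and its utf8 flag are hypothetical, and a Python string of code points stands in for the subject string.

```python
def newline_sequence_length(subject: str, pos: int, utf8: bool = False) -> int:
    """Return the length of the newline sequence at pos, or 0 if none."""
    # CR followed by LF is tried first and is always taken as a single unit.
    if subject.startswith("\r\n", pos):
        return 2
    # Single characters: LF, VT, FF, CR and NEL.
    singles = "\n\x0b\x0c\r\x85"
    if utf8:
        # LS and PS are added only in UTF-8 mode.
        singles += "\u2028\u2029"
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0


print(newline_sequence_length("a\r\nb", 1))             # 2
print(newline_sequence_length("\u2028", 0))             # 0
print(newline_sequence_length("\u2028", 0, utf8=True))  # 1
```

Because the CRLF check returns immediately, the sketch never reports a lone CR when an LF follows, which corresponds to the atomic group not being split.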
3045    
3046         In  UTF-8 mode, characters with values greater than 128 never match \d,         Inside a character class, \R matches the letter "R".
        \s, or \w, and always match \D, \S, and \W. This is true even when Uni-  
        code  character  property support is available. The use of locales with  
        Unicode is discouraged.  
3047    
3048     Unicode character properties     Unicode character properties
3049    
3050         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
3051         tional  escape  sequences  to  match character properties are available         tional escape sequences that match characters with specific  properties
3052         when UTF-8 mode is selected. They are:         are  available.   When not in UTF-8 mode, these sequences are of course
3053           limited to testing characters whose codepoints are less than  256,  but
3054           they do work in this mode.  The extra escape sequences are:
3055    
3056           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3057           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
3058           \X       an extended Unicode sequence           \X       an extended Unicode sequence
3059    
3060         The property names represented by xx above are limited to  the  Unicode         The  property  names represented by xx above are limited to the Unicode
3061         script names, the general category properties, and "Any", which matches         script names, the general category properties, and "Any", which matches
3062         any character (including newline). Other properties such as "InMusical-         any character (including newline). Other properties such as "InMusical-
3063         Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does         Symbols" are not currently supported by PCRE. Note  that  \P{Any}  does
3064         not match any characters, so always causes a match failure.         not match any characters, so always causes a match failure.
3065    
3066         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
3067         A  character from one of these sets can be matched using a script name.         A character from one of these sets can be matched using a script  name.
3068         For example:         For example:
3069    
3070           \p{Greek}           \p{Greek}
3071           \P{Han}           \P{Han}
3072    
3073         Those that are not part of an identified script are lumped together  as         Those  that are not part of an identified script are lumped together as
3074         "Common". The current list of scripts is:         "Common". The current list of scripts is:
3075    
3076         Arabic,  Armenian,  Bengali,  Bopomofo, Braille, Buginese, Buhid, Cana-         Arabic,  Armenian,  Balinese,  Bengali,  Bopomofo,  Braille,  Buginese,
3077         dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic,  Deseret,         Buhid,   Canadian_Aboriginal,   Cherokee,  Common,  Coptic,  Cuneiform,
3078         Devanagari,  Ethiopic,  Georgian,  Glagolitic, Gothic, Greek, Gujarati,         Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
3079         Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,  Inherited,  Kannada,         Gothic,  Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
3080         Katakana,  Kharoshthi,  Khmer,  Lao, Latin, Limbu, Linear_B, Malayalam,         gana, Inherited, Kannada,  Katakana,  Kharoshthi,  Khmer,  Lao,  Latin,
3081         Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,         Limbu,  Linear_B,  Malayalam,  Mongolian,  Myanmar,  New_Tai_Lue,  Nko,
3082         Osmanya,  Runic,  Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-         Ogham, Old_Italic, Old_Persian, Oriya, Osmanya,  Phags_Pa,  Phoenician,
3083         banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,         Runic,  Shavian,  Sinhala,  Syloti_Nagri,  Syriac,  Tagalog,  Tagbanwa,
3084         Ugaritic, Yi.         Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
3085    
3086         Each  character has exactly one general category property, specified by         Each character has exactly one general category property, specified  by
3087         a two-letter abbreviation. For compatibility with Perl, negation can be         a two-letter abbreviation. For compatibility with Perl, negation can be
3088         specified  by  including a circumflex between the opening brace and the         specified by including a circumflex between the opening brace  and  the
3089         property name. For example, \p{^Lu} is the same as \P{Lu}.         property name. For example, \p{^Lu} is the same as \P{Lu}.
3090    
3091         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
3092         eral  category properties that start with that letter. In this case, in         eral category properties that start with that letter. In this case,  in
3093         the absence of negation, the curly brackets in the escape sequence  are         the  absence of negation, the curly brackets in the escape sequence are
3094         optional; these two examples have the same effect:         optional; these two examples have the same effect:
3095    
3096           \p{L}           \p{L}
# Line 2857  BACKSLASH Line 3142  BACKSLASH
3142           Zp    Paragraph separator           Zp    Paragraph separator
3143           Zs    Space separator           Zs    Space separator
3144    
3145         The  special property L& is also supported: it matches a character that         The special property L& is also supported: it matches a character  that
3146         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
3147         classified as a modifier or "other".         classified as a modifier or "other".
3148    
3149         The  long  synonyms  for  these  properties that Perl supports (such as         The long synonyms for these properties  that  Perl  supports  (such  as
3150         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
3151         any of these properties with "Is".         any of these properties with "Is".
3152    
3153         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
3154         erty.  Instead, this property is assumed for any code point that is not         erty.  Instead, this property is assumed for any code point that is not
3155         in the Unicode table.         in the Unicode table.
3156    
3157         Specifying  caseless  matching  does not affect these escape sequences.         Specifying caseless matching does not affect  these  escape  sequences.
3158         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3159    
3160         The \X escape matches any number of Unicode  characters  that  form  an         The  \X  escape  matches  any number of Unicode characters that form an
3161         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3162    
3163           (?>\PM\pM*)           (?>\PM\pM*)
3164    
3165         That  is,  it matches a character without the "mark" property, followed         That is, it matches a character without the "mark"  property,  followed
3166         by zero or more characters with the "mark"  property,  and  treats  the         by  zero  or  more  characters with the "mark" property, and treats the
3167         sequence  as  an  atomic group (see below).  Characters with the "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3168         property are typically accents that affect the preceding character.         property  are  typically  accents  that affect the preceding character.
3169           None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X
3170           matches any one character.
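
The equivalence between \X and (?>\PM\pM*) can be illustrated by grouping a string into such sequences by hand. This Python sketch uses the standard unicodedata module to test for the "mark" property. The function names are hypothetical, and Python's Unicode tables are likely to be newer than the ones PCRE was built with, so results for recently assigned characters may differ.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if ch has a general category starting with M (Mn, Mc or Me)."""
    return unicodedata.category(ch).startswith("M")


def extended_sequences(text: str) -> list:
    """Split text into runs of one non-mark character plus following marks."""
    sequences = []
    i = 0
    while i < len(text):
        if is_mark(text[i]):
            # \X cannot start on a mark; this sketch keeps a stray
            # leading mark as an item of its own.
            sequences.append(text[i])
            i += 1
            continue
        j = i + 1
        # Absorb every following mark: they are never given back.
        while j < len(text) and is_mark(text[j]):
            j += 1
        sequences.append(text[i:j])
        i = j
    return sequences


# "e" followed by U+0301 (combining acute accent), then "a".
print([len(s) for s in extended_sequences("e\u0301a")])  # [2, 1]
```

For text with no code points of 256 or above, every item has length one, which matches the statement that \X matches any one character in non-UTF-8 mode.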
3171    
3172         Matching characters by Unicode property is not fast, because  PCRE  has         Matching  characters  by Unicode property is not fast, because PCRE has
3173         to  search  a  structure  that  contains data for over fifteen thousand         to search a structure that contains  data  for  over  fifteen  thousand
3174         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
3175         \w do not use Unicode properties in PCRE.         \w do not use Unicode properties in PCRE.
3176    
3177       Resetting the match start
3178    
3179           The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
3180           ously  matched  characters  not  to  be  included  in the final matched
3181           sequence. For example, the pattern:
3182    
3183             foo\Kbar
3184    
3185           matches "foobar", but reports that it has matched "bar".  This  feature
3186           is  similar  to  a lookbehind assertion (described below).  However, in
3187           this case, the part of the subject before the real match does not  have
3188           to  be of fixed length, as lookbehind assertions do. The use of \K does
3189           not interfere with the setting of captured  substrings.   For  example,
3190           when the pattern
3191    
3192             (foo)\Kbar
3193    
3194           matches "foobar", the first substring is still set to "foo".
3195    
3196     Simple assertions     Simple assertions
3197    
3198         The fourth use of backslash is for certain simple assertions. An asser-         The  final use of backslash is for certain simple assertions. An asser-
3199         tion specifies a condition that has to be met at a particular point  in         tion specifies a condition that has to be met at a particular point  in
3200         a  match, without consuming any characters from the subject string. The         a  match, without consuming any characters from the subject string. The
3201         use of subpatterns for more complicated assertions is described  below.         use of subpatterns for more complicated assertions is described  below.
# Line 2897  BACKSLASH Line 3203  BACKSLASH
3203    
3204           \b     matches at a word boundary           \b     matches at a word boundary
3205           \B     matches when not at a word boundary           \B     matches when not at a word boundary
3206           \A     matches at start of subject           \A     matches at the start of the subject
3207           \Z     matches at end of subject or before newline at end           \Z     matches at the end of the subject
3208           \z     matches at end of subject                   also matches before a newline at the end of the subject
3209           \G     matches at first matching position in subject           \z     matches only at the end of the subject
3210             \G     matches at the first matching position in the subject
3211    
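The zero-width nature of these assertions can be tried out with a short sketch. It uses Python's re module as a stand-in rather than PCRE itself, on the assumption that \b, \B and \A behave the same way there; \Z and \z are left out because Python defines \Z differently. The subject strings are made up for illustration.

```python
import re

subject = "cat concat"

# \b: "cat" with a word boundary on both sides.
whole_word = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \B: "cat" that does not begin at a word boundary.
inside_word = [m.start() for m in re.finditer(r"\Bcat", subject)]

# \A: only the very start of the subject, even in multiline mode.
at_start = re.findall(r"\Acat", "cat\ncat", re.MULTILINE)

print(whole_word, inside_word, at_start)
```

None of the assertions consumes a character: each reported offset is where "cat" itself begins.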
3212         These  assertions may not appear in character classes (but note that \b         These  assertions may not appear in character classes (but note that \b
3213         has a different meaning, namely the backspace character, inside a char-         has a different meaning, namely the backspace character, inside a char-
# Line 2997  FULL STOP (PERIOD, DOT) Line 3304  FULL STOP (PERIOD, DOT)
3304         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
3305         ter in the subject string except (by default) a character  that  signi-         ter in the subject string except (by default) a character  that  signi-
3306         fies  the  end  of  a line. In UTF-8 mode, the matched character may be         fies  the  end  of  a line. In UTF-8 mode, the matched character may be
3307         more than one byte long. When a line ending  is  defined  as  a  single         more than one byte long.
        character  (CR  or LF), dot never matches that character; when the two-  
        character sequence CRLF is used, dot does not match CR if it is immedi-  
        ately  followed by LF, but otherwise it matches all characters (includ-  
        ing isolated CRs and LFs).  
   
        The behaviour of dot with regard to newlines can  be  changed.  If  the  
        PCRE_DOTALL  option  is  set,  a dot matches any one character, without  
        exception. If newline is defined as the two-character sequence CRLF, it  
        takes two dots to match it.  
3308    
3309         The  handling of dot is entirely independent of the handling of circum-         When a line ending is defined as a single character, dot never  matches
3310         flex and dollar, the only relationship being  that  they  both  involve         that  character; when the two-character sequence CRLF is used, dot does
3311           not match CR if it is immediately followed  by  LF,  but  otherwise  it
3312           matches  all characters (including isolated CRs and LFs). When any Uni-
3313           code line endings are being recognized, dot does not match CR or LF  or
3314           any of the other line ending characters.
3315    
3316           The  behaviour  of  dot  with regard to newlines can be changed. If the
3317           PCRE_DOTALL option is set, a dot matches  any  one  character,  without
3318           exception. If the two-character sequence CRLF is present in the subject
3319           string, it takes two dots to match it.
3320    
3321           The handling of dot is entirely independent of the handling of  circum-
3322           flex  and  dollar,  the  only relationship being that they both involve
3323         newlines. Dot has no special meaning in a character class.         newlines. Dot has no special meaning in a character class.
3324    
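As a rough illustration of the dot rules above, the following sketch uses Python's re module, not PCRE. It assumes that re.DOTALL plays the role of PCRE_DOTALL and that only LF counts as a line ending, which corresponds to just one of the newline conventions PCRE can be built with. The helper name dot_matches is hypothetical.

```python
import re

def dot_matches(subject, dotall=False):
    """Return True if a single dot matches the whole of `subject`."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(r".", subject, flags) is not None

print(dot_matches("a"))                # an ordinary character
print(dot_matches("\n"))               # a line ending, default mode
print(dot_matches("\n", dotall=True))  # a line ending with DOTALL set

# CRLF is two characters, so one dot is not enough even with DOTALL.
print(dot_matches("\r\n", dotall=True))
print(re.fullmatch(r"..", "\r\n", re.DOTALL) is not None)
```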
3325    
3326  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
3327    
3328         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
3329         both in and out of UTF-8 mode. Unlike a dot, it always matches  CR  and         both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any
3330         LF.  The feature is provided in Perl in order to match individual bytes         line-ending characters. The feature is provided in  Perl  in  order  to
3331         in UTF-8 mode.  Because it breaks up UTF-8 characters  into  individual         match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
3332         bytes,  what remains in the string may be a malformed UTF-8 string. For         acters into individual bytes, what remains in the string may be a  mal-
3333         this reason, the \C escape sequence is best avoided.         formed  UTF-8  string.  For this reason, the \C escape sequence is best
3334           avoided.
3335    
3336         PCRE does not allow \C to appear in  lookbehind  assertions  (described         PCRE does not allow \C to appear in  lookbehind  assertions  (described
3337         below),  because  in UTF-8 mode this would make it impossible to calcu-         below),  because  in UTF-8 mode this would make it impossible to calcu-
# Line 3067  SQUARE BRACKETS AND CHARACTER CLASSES Line 3378  SQUARE BRACKETS AND CHARACTER CLASSES
3378         PCRE  is  compiled  with Unicode property support as well as with UTF-8         PCRE  is  compiled  with Unicode property support as well as with UTF-8
3379         support.         support.
3380    
3381         Characters that might indicate  line  breaks  (CR  and  LF)  are  never         Characters that might indicate line breaks are  never  treated  in  any
3382         treated  in  any  special way when matching character classes, whatever         special  way  when  matching  character  classes,  whatever line-ending
3383         line-ending sequence is in use, and whatever setting of the PCRE_DOTALL         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and
3384         and PCRE_MULTILINE options is used. A class such as [^a] always matches         PCRE_MULTILINE options is used. A class such as [^a] always matches one
3385         one of these characters.         of these characters.
3386    
3387         The minus (hyphen) character can be used to specify a range of  charac-         The minus (hyphen) character can be used to specify a range of  charac-
3388         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
# Line 3097  SQUARE BRACKETS AND CHARACTER CLASSES Line 3408  SQUARE BRACKETS AND CHARACTER CLASSES
3408         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
3409         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
3410         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if
3411         character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches         character tables for a French locale are in  use,  [\xc8-\xcb]  matches
3412         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the
3413         concept of case for characters with values greater than 128  only  when         concept of case for characters with values greater than 128  only  when
3414         it is compiled with Unicode property support.         it is compiled with Unicode property support.
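The [W-c] equivalence can be checked mechanically. This is a minimal sketch using Python's re module in place of PCRE, restricted to ASCII subjects; re.IGNORECASE is assumed to correspond to caseless matching, and the names RANGE, SPELLED_OUT and matched_ascii are invented for the example.

```python
import re

RANGE = re.compile(r"[W-c]", re.IGNORECASE)
SPELLED_OUT = re.compile(r"[][\\^_`wxyzabc]", re.IGNORECASE)

def matched_ascii(pattern):
    """Return the set of ASCII characters that the class matches."""
    return {chr(c) for c in range(128) if pattern.fullmatch(chr(c))}

same = matched_ascii(RANGE) == matched_ascii(SPELLED_OUT)
print(same)
print("".join(sorted(matched_ascii(RANGE))))
```

The range runs from W (0x57) to c (0x63), so it picks up the punctuation between the upper case and lower case letters as well as both cases of W to Z and A to C.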
# Line 3203  INTERNAL OPTION SETTING Line 3514  INTERNAL OPTION SETTING
3514         PCRE extracts it into the global options (and it will therefore show up         PCRE extracts it into the global options (and it will therefore show up
3515         in data extracted by the pcre_fullinfo() function).         in data extracted by the pcre_fullinfo() function).
3516    
3517         An option change within a subpattern affects only that part of the cur-         An  option  change  within a subpattern (see below for a description of
3518         rent pattern that follows it, so         subpatterns) affects only that part of the current pattern that follows
3519           it, so
3520    
3521           (a(?i)b)c           (a(?i)b)c
3522    
3523         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
3524         used).   By  this means, options can be made to have different settings         used).  By this means, options can be made to have  different  settings
3525         in different parts of the pattern. Any changes made in one  alternative         in  different parts of the pattern. Any changes made in one alternative
3526         do  carry  on  into subsequent branches within the same subpattern. For         do carry on into subsequent branches within the  same  subpattern.  For
3527         example,         example,
3528    
3529           (a(?i)b|c)           (a(?i)b|c)
3530    
3531         matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the         matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
3532         first  branch  is  abandoned before the option setting. This is because         first branch is abandoned before the option setting.  This  is  because
3533         the effects of option settings happen at compile time. There  would  be         the  effects  of option settings happen at compile time. There would be
3534         some very weird behaviour otherwise.         some very weird behaviour otherwise.
3535    
3536         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA
3537         can be changed in the same way as the Perl-compatible options by  using         can  be changed in the same way as the Perl-compatible options by using
3538         the characters J, U and X respectively.         the characters J, U and X respectively.
3539    
3540    
# Line 3235  SUBPATTERNS Line 3547  SUBPATTERNS
3547    
3548           cat(aract|erpillar|)           cat(aract|erpillar|)
3549    
3550         matches one of the words "cat", "cataract", or  "caterpillar".  Without         matches  one  of the words "cat", "cataract", or "caterpillar". Without
3551         the  parentheses,  it  would  match "cataract", "erpillar" or the empty         the parentheses, it would match  "cataract",  "erpillar"  or  an  empty
3552         string.         string.
3553    
3554         2. It sets up the subpattern as  a  capturing  subpattern.  This  means         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
3555         that,  when  the  whole  pattern  matches,  that portion of the subject         that, when the whole pattern  matches,  that  portion  of  the  subject
3556         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
3557         ovector  argument  of pcre_exec(). Opening parentheses are counted from         ovector argument of pcre_exec(). Opening parentheses are  counted  from
3558         left to right (starting from 1) to obtain  numbers  for  the  capturing         left  to  right  (starting  from 1) to obtain numbers for the capturing
3559         subpatterns.         subpatterns.
3560    
3561         For  example,  if the string "the red king" is matched against the pat-         For example, if the string "the red king" is matched against  the  pat-
3562         tern         tern
3563    
3564           the ((red|white) (king|queen))           the ((red|white) (king|queen))
# Line 3254  SUBPATTERNS Line 3566  SUBPATTERNS
3566         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
3567         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
3568    
3569         The  fact  that  plain  parentheses  fulfil two functions is not always         The fact that plain parentheses fulfil  two  functions  is  not  always
3570         helpful.  There are often times when a grouping subpattern is  required         helpful.   There are often times when a grouping subpattern is required
3571         without  a capturing requirement. If an opening parenthesis is followed         without a capturing requirement. If an opening parenthesis is  followed
3572         by a question mark and a colon, the subpattern does not do any  captur-         by  a question mark and a colon, the subpattern does not do any captur-
3573         ing,  and  is  not  counted when computing the number of any subsequent         ing, and is not counted when computing the  number  of  any  subsequent
3574         capturing subpatterns. For example, if the string "the white queen"  is         capturing  subpatterns. For example, if the string "the white queen" is
3575         matched against the pattern         matched against the pattern
3576    
3577           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
3578    
3579         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
3580         1 and 2. The maximum number of capturing subpatterns is 65535, and  the         1 and 2. The maximum number of capturing subpatterns is 65535.
        maximum  depth  of  nesting of all subpatterns, both capturing and non-  
        capturing, is 200.  
3581    
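The numbering rules above can be observed directly. The sketch below runs the document's own patterns through Python's re module rather than through pcre_exec(), on the assumption that group numbering works the same way for these simple cases; the variable names are illustrative only.

```python
import re

# Plain parentheses both group and capture; numbers follow the
# opening parentheses from left to right, starting at 1.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())

# (?: ... ) groups without capturing and is skipped when numbering.
mixed = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
print(mixed.groups())

# Grouping localizes the alternatives; the empty third alternative
# lets the pattern match the bare word "cat".
candidates = ("cat", "cataract", "caterpillar", "erpillar")
words = [w for w in candidates if re.fullmatch(r"cat(aract|erpillar|)", w)]
print(words)
```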
3582         As a convenient shorthand, if any option settings are required  at  the         As  a  convenient shorthand, if any option settings are required at the
3583         start  of  a  non-capturing  subpattern,  the option letters may appear         start of a non-capturing subpattern,  the  option  letters  may  appear
3584         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
3585    
3586           (?i:saturday|sunday)           (?i:saturday|sunday)
3587           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
3588    
3589         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
3590         tried  from  left  to right, and options are not reset until the end of         tried from left to right, and options are not reset until  the  end  of
3591         the subpattern is reached, an option setting in one branch does  affect         the  subpattern is reached, an option setting in one branch does affect
3592         subsequent  branches,  so  the above patterns match "SUNDAY" as well as         subsequent branches, so the above patterns match "SUNDAY"  as  well  as
3593         "Saturday".         "Saturday".
3594    
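A short sketch of the shorthand form, again using Python's re module as a stand-in for PCRE. Only the (?i:...) spelling is shown, because recent Python versions reject an option setting such as (?i) that is not at the start of the pattern. The name DAY is hypothetical.

```python
import re

# The option applies to every branch of the non-capturing subpattern.
DAY = re.compile(r"(?i:saturday|sunday)")

for subject in ("Saturday", "SUNDAY", "sunday", "Monday"):
    print(subject, DAY.fullmatch(subject) is not None)
```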
3595    
3596    DUPLICATE SUBPATTERN NUMBERS
3597    
3598           Perl 5.10 introduced a feature whereby each alternative in a subpattern
3599           uses the same numbers for its capturing parentheses. Such a  subpattern
3600           starts  with (?| and is itself a non-capturing subpattern. For example,
3601           consider this pattern:
3602    
3603             (?|(Sat)ur|(Sun))day
3604    
3605           Because the two alternatives are inside a (?| group, both sets of  cap-
3606           turing  parentheses  are  numbered one. Thus, when the pattern matches,
3607           you can look at captured substring number  one,  whichever  alternative
3608           matched.  This  construct  is useful when you want to capture part, but
3609           not all, of one of a number of alternatives. Inside a (?| group, paren-
3610           theses  are  numbered as usual, but the number is reset at the start of
3611           each branch. The numbers of any capturing buffers that follow the  sub-
3612           pattern  start after the highest number used in any branch. The follow-
3613           ing example is taken from the Perl documentation.  The  numbers  under-
3614           neath show in which buffer the captured content will be stored.
3615    
3616             # before  ---------------branch-reset----------- after
3617             / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3618             # 1            2         2  3        2     3     4
3619    
3620           A  backreference  or  a  recursive call to a numbered subpattern always
3621           refers to the first one in the pattern with the given number.
3622    
3623           An alternative approach to using this "branch reset" feature is to  use
3624           duplicate named subpatterns, as described in the next section.
3625    
3626    
3627  NAMED SUBPATTERNS  NAMED SUBPATTERNS
3628    
3629         Identifying capturing parentheses by number is simple, but  it  can  be         Identifying  capturing  parentheses  by number is simple, but it can be
3630         very  hard  to keep track of the numbers in complicated regular expres-         very hard to keep track of the numbers in complicated  regular  expres-
3631         sions. Furthermore, if an  expression  is  modified,  the  numbers  may         sions.  Furthermore,  if  an  expression  is  modified, the numbers may
3632         change.  To help with this difficulty, PCRE supports the naming of sub-         change. To help with this difficulty, PCRE supports the naming of  sub-
3633         patterns, something that Perl  does  not  provide.  The  Python  syntax         patterns. This feature was not added to Perl until release 5.10. Python
3634         (?P<name>...)  is  used. References to capturing parentheses from other         had the feature earlier, and PCRE introduced it at release  4.0,  using
3635         parts of the pattern, such as  backreferences,  recursion,  and  condi-         the  Python syntax. PCRE now supports both the Perl and the Python syn-
3636         tions, can be made by name as well as by number.         tax.
3637    
3638         Names  consist  of  up  to  32 alphanumeric characters and underscores.         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
3639         Named capturing parentheses are still  allocated  numbers  as  well  as         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
3640         names. The PCRE API provides function calls for extracting the name-to-         to capturing parentheses from other parts of the pattern, such as back-
3641         number translation table from a compiled pattern. There is also a  con-         references,  recursion,  and conditions, can be made by name as well as
3642         venience function for extracting a captured substring by name.         by number.
3643    
3644           Names consist of up to  32  alphanumeric  characters  and  underscores.
3645           Named  capturing  parentheses  are  still  allocated numbers as well as
3646           names, exactly as if the names were not present. The PCRE API  provides
3647           function calls for extracting the name-to-number translation table from
3648           a compiled pattern. There is also a convenience function for extracting
3649           a captured substring by name.
3650    
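Since the (?P<name>...) form comes from Python, the relationship between names and numbers can be sketched with Python's re module. The groupindex attribute stands in for PCRE's name-to-number table and group("name") for the convenience function; these are Python facilities, not the PCRE API, and the pattern and group names are made up for the example.

```python
import re

DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

match = DATE.fullmatch("2007-08")

# Named groups are still numbered exactly as if the names were absent.
print(match.group("year"), match.group(1))
print(match.group("month"), match.group(2))

# The name-to-number table of the compiled pattern.
print(dict(DATE.groupindex))
```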
3651         By  default, a name must be unique within a pattern, but it is possible         By  default, a name must be unique within a pattern, but it is possible
3652         to relax this constraint by setting the PCRE_DUPNAMES option at compile         to relax this constraint by setting the PCRE_DUPNAMES option at compile
# Line 3308  NAMED SUBPATTERNS Line 3656  NAMED SUBPATTERNS
3656         both cases you want to extract the abbreviation. This pattern (ignoring         both cases you want to extract the abbreviation. This pattern (ignoring
3657         the line breaks) does the job:         the line breaks) does the job:
3658    
3659           (?P<DN>Mon|Fri|Sun)(?:day)?|           (?<DN>Mon|Fri|Sun)(?:day)?|
3660           (?P<DN>Tue)(?:sday)?|           (?<DN>Tue)(?:sday)?|
3661           (?P<DN>Wed)(?:nesday)?|           (?<DN>Wed)(?:nesday)?|
3662           (?P<DN>Thu)(?:rsday)?|           (?<DN>Thu)(?:rsday)?|
3663           (?P<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
3664    
3665         There  are  five capturing substrings, but only one is ever set after a         There  are  five capturing substrings, but only one is ever set after a
3666         match.  The convenience  function  for  extracting  the  data  by  name         match.  (An alternative way of solving this problem is to use a "branch
3667         returns  the  substring  for  the first, and in this example, the only,         reset" subpattern, as described in the previous section.)
3668         subpattern of that name that matched.  This  saves  searching  to  find  
3669         which  numbered  subpattern  it  was. If you make a reference to a non-         The  convenience  function  for extracting the data by name returns the
3670         unique named subpattern from elsewhere in the  pattern,  the  one  that         substring for the first (and in this example, the only)  subpattern  of
3671         corresponds  to  the  lowest number is used. For further details of the         that  name  that  matched.  This saves searching to find which numbered
3672         interfaces for handling named subpatterns, see the  pcreapi  documenta-         subpattern it was. If you make a reference to a non-unique  named  sub-
3673         tion.         pattern  from elsewhere in the pattern, the one that corresponds to the
3674           lowest number is used. For further details of the interfaces  for  han-
3675           dling named subpatterns, see the pcreapi documentation.
3676    
3677    
3678  REPETITION  REPETITION
# Line 3331  REPETITION Line 3681  REPETITION
3681         following items:         following items:
3682    
3683           a literal data character           a literal data character
3684           the . metacharacter           the dot metacharacter
3685           the \C escape sequence           the \C escape sequence
3686           the \X escape sequence (in UTF-8 mode with Unicode properties)           the \X escape sequence (in UTF-8 mode with Unicode properties)
3687             the \R escape sequence
3688           an escape such as \d that matches a single character           an escape such as \d that matches a single character
3689           a character class           a character class
3690           a back reference (see next section)           a back reference (see next section)
# Line 3373  REPETITION Line 3724  REPETITION
3724         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
3725         the previous item and the quantifier were not present.         the previous item and the quantifier were not present.
3726    
3727         For  convenience  (and  historical compatibility) the three most common         For  convenience, the three most common quantifiers have single-charac-
3728         quantifiers have single-character abbreviations:         ter abbreviations:
3729    
3730           *    is equivalent to {0,}           *    is equivalent to {0,}
3731           +    is equivalent to {1,}           +    is equivalent to {1,}
# Line 3426  REPETITION Line 3777  REPETITION
3777         which matches one digit by preference, but can match two if that is the         which matches one digit by preference, but can match two if that is the
3778         only way the rest of the pattern matches.         only way the rest of the pattern matches.
3779    
3780         If the PCRE_UNGREEDY option is set (an option which is not available in         If the PCRE_UNGREEDY option is set (an option that is not available  in
3781         Perl),  the  quantifiers are not greedy by default, but individual ones         Perl),  the  quantifiers are not greedy by default, but individual ones
3782         can be made greedy by following them with a  question  mark.  In  other         can be made greedy by following them with a  question  mark.  In  other
3783         words, it inverts the default behaviour.         words, it inverts the default behaviour.
# Line 3437  REPETITION Line 3788  REPETITION
3788         minimum or maximum.         minimum or maximum.
3789    
3790         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
3791         alent  to Perl's /s) is set, thus allowing the . to match newlines, the         alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
3792         pattern is implicitly anchored, because whatever follows will be  tried         the pattern is implicitly anchored, because whatever  follows  will  be
3793         against  every character position in the subject string, so there is no         tried  against every character position in the subject string, so there
3794         point in retrying the overall match at any position  after  the  first.         is no point in retrying the overall match at  any  position  after  the
3795         PCRE normally treats such a pattern as though it were preceded by \A.         first.  PCRE  normally treats such a pattern as though it were preceded
3796           by \A.
3797    
3798         In  cases  where  it  is known that the subject string contains no new-         In cases where it is known that the subject  string  contains  no  new-
3799         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-         lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
3800         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
3801    
3802         However,  there is one situation where the optimization cannot be used.         However, there is one situation where the optimization cannot be  used.
3803         When .*  is inside capturing parentheses that  are  the  subject  of  a         When  .*   is  inside  capturing  parentheses that are the subject of a
3804         backreference  elsewhere in the pattern, a match at the start may fail,         backreference elsewhere in the pattern, a match at the start  may  fail
3805         and a later one succeed. Consider, for example:         where a later one succeeds. Consider, for example:
3806    
3807           (.*)abc\1           (.*)abc\1
3808    
3809         If the subject is "xyz123abc123" the match point is the fourth  charac-         If  the subject is "xyz123abc123" the match point is the fourth charac-
3810         ter. For this reason, such a pattern is not implicitly anchored.         ter. For this reason, such a pattern is not implicitly anchored.
3811    
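The backreference case can be reproduced with a brief sketch. It uses Python's re module rather than PCRE, assuming the same backtracking semantics for this pattern; offsets are counted from zero, so offset 3 is the fourth character.

```python
import re

match = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the text captured by
# group 1 does not recur after "abc"; the attempt at offset 3 works.
print(match.start(), match.group(1))
```

An engine that treated the pattern as anchored at the start would report no match here.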
3812         When a capturing subpattern is repeated, the value captured is the sub-         When a capturing subpattern is repeated, the value captured is the sub-
# Line 3463  REPETITION Line 3815  REPETITION
3815           (tweedle[dume]{3}\s*)+           (tweedle[dume]{3}\s*)+
3816    
3817         has matched "tweedledum tweedledee" the value of the captured substring         has matched "tweedledum tweedledee" the value of the captured substring
3818         is  "tweedledee".  However,  if there are nested capturing subpatterns,         is "tweedledee". However, if there are  nested  capturing  subpatterns,
3819         the corresponding captured values may have been set in previous  itera-         the  corresponding captured values may have been set in previous itera-
3820         tions. For example, after         tions. For example, after
3821    
3822           /(a|(b))+/           /(a|(b))+/
# Line 3474  REPETITION Line 3826  REPETITION
3826    
3827  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
3828    
3829         With both maximizing and minimizing repetition, failure of what follows         With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
3830         normally causes the repeated item to be re-evaluated to see if  a  dif-         repetition,  failure  of what follows normally causes the repeated item
3831         ferent number of repeats allows the rest of the pattern to match. Some-         to be re-evaluated to see if a different number of repeats  allows  the
3832         times it is useful to prevent this, either to change the nature of  the         rest  of  the pattern to match. Sometimes it is useful to prevent this,
3833         match,  or  to  cause it to fail earlier than it otherwise might, when the         either to change the nature of the match, or to cause it to fail earlier
3834         author of the pattern knows there is no point in carrying on.         than  it otherwise might, when the author of the pattern knows there is
3835           no point in carrying on.
3836    
3837         Consider, for example, the pattern \d+foo when applied to  the  subject         Consider, for example, the pattern \d+foo when applied to  the  subject
3838         line         line
# Line 3493  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3846  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3846         the  means for specifying that once a subpattern has matched, it is not         the  means for specifying that once a subpattern has matched, it is not
3847         to be re-evaluated in this way.         to be re-evaluated in this way.
3848    
3849         If we use atomic grouping for the previous example, the  matcher  would         If we use atomic grouping for the previous example, the  matcher  gives
3850         give up immediately on failing to match "foo" the first time. The nota-         up  immediately  on failing to match "foo" the first time. The notation
3851         tion is a kind of special parenthesis, starting with  (?>  as  in  this         is a kind of special parenthesis, starting with (?> as in this example:
        example:  
3852    
3853           (?>\d+)foo           (?>\d+)foo
3854    
# Line 3525  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3877  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3877    
3878           \d++foo           \d++foo
3879    
3880         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Note that a possessive quantifier can be used with an entire group, for
3881           example:
3882    
3883             (abc|xyz){2,3}+
3884    
3885           Possessive   quantifiers   are   always  greedy;  the  setting  of  the
3886         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
3887         simpler forms of atomic group. However, there is no difference  in  the         simpler  forms  of atomic group. However, there is no difference in the
3888         meaning  or  processing  of  a possessive quantifier and the equivalent         meaning of a possessive quantifier and  the  equivalent  atomic  group,
3889         atomic group.         though  there  may  be a performance difference; possessive quantifiers
3890           should be slightly faster.
3891         The possessive quantifier syntax is an extension to  the  Perl  syntax.  
3892         Jeffrey  Friedl originated the idea (and the name) in the first edition         The possessive quantifier syntax is an extension to the Perl  5.8  syn-
3893         of his book.  Mike McCloskey liked it, so implemented it when he  built         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
3894         Sun's Java package, and PCRE copied it from there.         edition of his book. Mike McCloskey liked it, so implemented it when he
3895           built  Sun's Java package, and PCRE copied it from there. It ultimately
3896           found its way into Perl at release 5.10.
3897    
3898           PCRE has an optimization that automatically "possessifies" certain sim-
3899           ple  pattern  constructs.  For  example, the sequence A+B is treated as
3900           A++B because there is no point in backtracking into a sequence  of  A's
3901           when B must follow.
3902    
3903         When  a  pattern  contains an unlimited repeat inside a subpattern that         When  a  pattern  contains an unlimited repeat inside a subpattern that
3904         can itself be repeated an unlimited number of  times,  the  use  of  an         can itself be repeated an unlimited number of  times,  the  use  of  an
# Line 3580  BACK REFERENCES Line 3944  BACK REFERENCES
3944         and the subpattern to the right has participated in an  earlier  itera-         and the subpattern to the right has participated in an  earlier  itera-
3945         tion.         tion.
3946    
3947         It is not possible to have a numerical "forward back reference" to sub-         It  is  not  possible to have a numerical "forward back reference" to a
3948         pattern whose number is 10 or more. However, a back  reference  to  any         subpattern whose number is 10 or  more  using  this  syntax  because  a
3949         subpattern  is  possible  using named parentheses (see below). See also         sequence  such  as  \50 is interpreted as a character defined in octal.
3950         the subsection entitled "Non-printing  characters"  above  for  further         See the subsection entitled "Non-printing characters" above for further
3951         details of the handling of digits following a backslash.         details  of  the  handling of digits following a backslash. There is no
3952           such problem when named parentheses are used. A back reference  to  any
3953           subpattern is possible using named parentheses (see below).
3954    
3955           Another  way  of  avoiding  the ambiguity inherent in the use of digits
3956           following a backslash is to use the \g escape sequence, which is a fea-
3957           ture  introduced  in  Perl  5.10.  This  escape  must be followed by an
3958           unsigned number or a negative number, optionally  enclosed  in  braces.
3959           These examples are all identical:
3960    
3961             (ring), \1
3962             (ring), \g1
3963             (ring), \g{1}
3964    
3965           An  unsigned number specifies an absolute reference without the ambigu-
3966           ity that is present in the older syntax. It is also useful when literal
3967           digits follow the reference. A negative number is a relative reference.
3968           Consider this example:
3969    
3970             (abc(def)ghi)\g{-1}
3971    
3972           The sequence \g{-1} is a reference to the most recently started captur-
3973           ing  subpattern  before  \g, that is, it is equivalent to \2. Similarly,
3974           \g{-2} would be equivalent to \1. The use of relative references can be
3975           helpful  in  long  patterns,  and  also in patterns that are created by
3976           joining together fragments that contain references within themselves.
3977    
3978         A  back  reference matches whatever actually matched the capturing sub-         A back reference matches whatever actually matched the  capturing  sub-
3979         pattern in the current subject string, rather  than  anything  matching         pattern  in  the  current subject string, rather than anything matching
3980         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
3981         of doing that). So the pattern         of doing that). So the pattern
3982    
3983           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3984    
3985         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
3986         not  "sense and responsibility". If caseful matching is in force at the         not "sense and responsibility". If caseful matching is in force at  the
3987         time of the back reference, the case of letters is relevant. For  exam-         time  of the back reference, the case of letters is relevant. For exam-
3988         ple,         ple,
3989    
3990           ((?i)rah)\s+\1           ((?i)rah)\s+\1
3991    
3992         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
3993         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
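
The two back reference examples above can be tried out with a short sketch. It uses Python's re module rather than PCRE itself, on the assumption that numbered back references behave the same way in both for these simple cases. Python does not accept a bare (?i) in the middle of a pattern in recent versions, so the scoped form (?i:...) stands in for ((?i)rah); the variable names are illustrative only.

```python
import re

# A numbered back reference repeats the text that group 1 actually matched,
# not "anything that group 1 could match".
pair = re.compile(r"(sens|respons)e and \1ibility")

same_first = pair.fullmatch("sense and sensibility") is not None
same_second = pair.fullmatch("response and responsibility") is not None
mixed = pair.fullmatch("sense and responsibility") is not None

# Assumption: (?i:rah) is used in place of PCRE's ((?i)rah). The group is
# matched caselessly, but \1 sits outside the caseless scope, so the back
# reference compares letters casefully.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: rah.fullmatch(s) is not None
               for s in ("rah rah", "RAH RAH", "RAH rah")}

print(same_first, same_second, mixed)
print(rah_results)
```

Only the last subject, "RAH rah", is rejected by the second pattern, because the back reference has to reproduce "RAH" exactly.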
3994    
3995         Back references to named subpatterns use the Python  syntax  (?P=name).         There  are  several  different ways of writing back references to named
3996         We could rewrite the above example as follows:         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
3997           \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
3998           unified back reference syntax, in which \g can be used for both numeric
3999           and  named  references,  is  also supported. We could rewrite the above
4000           example in any of the following ways:
4001    
4002             (?<p1>(?i)rah)\s+\k<p1>
4003             (?'p1'(?i)rah)\s+\k{p1}
4004           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
4005             (?<p1>(?i)rah)\s+\g{p1}
4006    
4007         A  subpattern  that  is  referenced  by  name may appear in the pattern         A subpattern that is referenced by  name  may  appear  in  the  pattern
4008         before or after the reference.         before or after the reference.
4009    
4010         There may be more than one back reference to the same subpattern. If  a         There  may be more than one back reference to the same subpattern. If a
4011         subpattern  has  not actually been used in a particular match, any back         subpattern has not actually been used in a particular match,  any  back
4012         references to it always fail. For example, the pattern         references to it always fail. For example, the pattern
4013    
4014           (a|(bc))\2           (a|(bc))\2
4015    
4016         always fails if it starts to match "a" rather than "bc". Because  there         always  fails if it starts to match "a" rather than "bc". Because there
4017         may  be  many  capturing parentheses in a pattern, all digits following         may be many capturing parentheses in a pattern,  all  digits  following
4018         the backslash are taken as part of a potential back  reference  number.         the  backslash  are taken as part of a potential back reference number.
4019         If the pattern continues with a digit character, some delimiter must be         If the pattern continues with a digit character, some delimiter must be
4020         used to terminate the back reference. If the  PCRE_EXTENDED  option  is         used  to  terminate  the back reference. If the PCRE_EXTENDED option is
4021         set,  this  can  be  whitespace.  Otherwise an empty comment (see "Com-         set, this can be whitespace.  Otherwise an  empty  comment  (see  "Com-
4022         ments" below) can be used.         ments" below) can be used.
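
As a rough illustration of the two points above (a reference to an unused subpattern fails, and a delimiter is needed before a literal digit), here is a sketch using Python's re module, which is assumed to agree with PCRE on these particular constructs. The empty comment (?#) plays the role of the delimiter; the pattern for the second case is a made-up example, not one taken from the PCRE documentation.

```python
import re

# Group 2 is only set when the "bc" alternative is taken, so \2 can only
# succeed in that case.
unset_ref = re.compile(r"(a|(bc))\2")
took_bc = unset_ref.fullmatch("bcbc") is not None
took_a = unset_ref.search("aa") is not None

# Hypothetical example: a digit, the same digit again, then a literal "0".
# Writing \10 would be read as a reference to group 10, so an empty comment
# terminates the back reference before the literal digit.
delimited = re.compile(r"(\d)\1(?#)0")
doubled_then_zero = delimited.fullmatch("550") is not None
not_doubled = delimited.fullmatch("560") is not None

print(took_bc, took_a, doubled_then_zero, not_doubled)
```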
4023    
4024         A back reference that occurs inside the parentheses to which it  refers         A  back reference that occurs inside the parentheses to which it refers
4025         fails  when  the subpattern is first used, so, for example, (a\1) never         fails when the subpattern is first used, so, for example,  (a\1)  never
4026         matches.  However, such references can be useful inside  repeated  sub-         matches.   However,  such references can be useful inside repeated sub-
4027         patterns. For example, the pattern         patterns. For example, the pattern
4028    
4029           (a|b\1)+           (a|b\1)+
4030    
4031         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4032         ation of the subpattern,  the  back  reference  matches  the  character         ation  of  the  subpattern,  the  back  reference matches the character
4033         string  corresponding  to  the previous iteration. In order for this to         string corresponding to the previous iteration. In order  for  this  to
4034         work, the pattern must be such that the first iteration does  not  need         work,  the  pattern must be such that the first iteration does not need
4035         to  match the back reference. This can be done using alternation, as in         to match the back reference. This can be done using alternation, as  in
4036         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4037    
4038    
4039  ASSERTIONS  ASSERTIONS
4040    
4041         An assertion is a test on the characters  following  or  preceding  the         An  assertion  is  a  test on the characters following or preceding the
4042         current  matching  point that does not actually consume any characters.         current matching point that does not actually consume  any  characters.
4043         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
4044         described above.         described above.
4045    
4046         More  complicated  assertions  are  coded as subpatterns. There are two         More complicated assertions are coded as  subpatterns.  There  are  two
4047         kinds: those that look ahead of the current  position  in  the  subject         kinds:  those  that  look  ahead of the current position in the subject
4048         string,  and  those  that  look  behind  it. An assertion subpattern is         string, and those that look  behind  it.  An  assertion  subpattern  is
4049         matched in the normal way, except that it does not  cause  the  current         matched  in  the  normal way, except that it does not cause the current
4050         matching position to be changed.         matching position to be changed.
4051    
4052         Assertion  subpatterns  are  not  capturing subpatterns, and may not be         Assertion subpatterns are not capturing subpatterns,  and  may  not  be
4053         repeated, because it makes no sense to assert the  same  thing  several         repeated,  because  it  makes no sense to assert the same thing several
4054         times.  If  any kind of assertion contains capturing subpatterns within         times. If any kind of assertion contains capturing  subpatterns  within
4055         it, these are counted for the purposes of numbering the capturing  sub-         it,  these are counted for the purposes of numbering the capturing sub-
4056         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
4057         out only for positive assertions, because it does not  make  sense  for         out  only  for  positive assertions, because it does not make sense for
4058         negative assertions.         negative assertions.
4059    
4060     Lookahead assertions     Lookahead assertions
# Line 3668  ASSERTIONS Line 4064  ASSERTIONS
4064    
4065           \w+(?=;)           \w+(?=;)
4066    
4067         matches a word followed by a semicolon, but does not include the  semi-         matches  a word followed by a semicolon, but does not include the semi-
4068         colon in the match, and         colon in the match, and
4069    
4070           foo(?!bar)           foo(?!bar)
4071    
4072         matches  any  occurrence  of  "foo" that is not followed by "bar". Note         matches any occurrence of "foo" that is not  followed  by  "bar".  Note
4073         that the apparently similar pattern         that the apparently similar pattern
4074    
4075           (?!foo)bar           (?!foo)bar
4076    
4077         does not find an occurrence of "bar"  that  is  preceded  by  something         does  not  find  an  occurrence  of "bar" that is preceded by something
4078         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other than "foo"; it finds any occurrence of "bar" whatsoever,  because
4079         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
4080         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
4081    
4082         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
4083         most convenient way to do it is  with  (?!)  because  an  empty  string         most  convenient  way  to  do  it  is with (?!) because an empty string
4084         always  matches, so an assertion that requires there not to be an empty         always matches, so an assertion that requires there not to be an  empty
4085         string must always fail.         string must always fail.
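
The lookahead examples in this subsection can be checked with the following sketch. It uses Python's re module as a stand-in, on the assumption that (?=...), (?!...) and (?!) behave as described here for PCRE; the subject strings are invented for the demonstration.

```python
import re

# Positive lookahead: the semicolon is required but not included in the match.
word_before_semicolon = re.search(r"\w+(?=;)", "alpha;").group()

# Negative lookahead: only the "foo" that is not followed by "bar" is found.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar does not mean "bar not preceded by foo": the assertion looks at
# the characters "bar" themselves, so it is always true there.
any_bar = re.search(r"(?!foo)bar", "foobar")

# An empty negative lookahead can never succeed.
forced_failure = re.search(r"(?!)", "anything")

print(word_before_semicolon, foo_not_bar, any_bar.start(), forced_failure)
```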
4086    
4087     Lookbehind assertions     Lookbehind assertions
4088    
4089         Lookbehind assertions start with (?<= for positive assertions and  (?<!         Lookbehind  assertions start with (?<= for positive assertions and (?<!
4090         for negative assertions. For example,         for negative assertions. For example,
4091    
4092           (?<!foo)bar           (?<!foo)bar
4093    
4094         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does find an occurrence of "bar" that is not  preceded  by  "foo".  The
4095         contents of a lookbehind assertion are restricted  such  that  all  the         contents  of  a  lookbehind  assertion are restricted such that all the
4096         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4097         eral top-level alternatives, they do not all  have  to  have  the  same         eral  top-level  alternatives,  they  do  not all have to have the same
4098         fixed length. Thus         fixed length. Thus
4099    
4100           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 3707  ASSERTIONS Line 4103  ASSERTIONS
4103    
4104           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4105    
4106         causes  an  error at compile time. Branches that match different length         causes an error at compile time. Branches that match  different  length
4107         strings are permitted only at the top level of a lookbehind  assertion.         strings  are permitted only at the top level of a lookbehind assertion.
4108         This  is  an  extension  compared  with  Perl (at least for 5.8), which         This is an extension compared with  Perl  (at  least  for  5.8),  which
4109         requires all branches to match the same length of string. An  assertion         requires  all branches to match the same length of string. An assertion
4110         such as         such as
4111    
4112           (?<=ab(c|de))           (?<=ab(c|de))
4113    
4114         is  not  permitted,  because  its single top-level branch can match two         is not permitted, because its single top-level  branch  can  match  two
4115         different lengths, but it is acceptable if rewritten to  use  two  top-         different  lengths,  but  it is acceptable if rewritten to use two top-
4116         level branches:         level branches:
4117    
4118           (?<=abc|abde)           (?<=abc|abde)
4119    
4120         The  implementation  of lookbehind assertions is, for each alternative,         In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4121         to temporarily move the current position back by the  fixed  width  and         instead  of  a lookbehind assertion; this is not restricted to a fixed-
4122           length.
4123    
4124           The implementation of lookbehind assertions is, for  each  alternative,
4125           to  temporarily  move the current position back by the fixed length and
4126         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
4127         rent position, the match is deemed to fail.         rent position, the assertion fails.
4128    
4129         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
4130         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode) to appear in lookbehind assertions, because it makes it  impossi-
4131         ble to calculate the length of the lookbehind. The \X escape, which can         ble  to  calculate the length of the lookbehind. The \X and \R escapes,
4132         match different numbers of bytes, is also not permitted.         which can match different numbers of bytes, are also not permitted.
4133    
4134         Atomic  groups can be used in conjunction with lookbehind assertions to         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
4135         specify efficient matching at the end of the subject string. Consider a         assertions  to  specify  efficient  matching  at the end of the subject
4136         simple pattern such as         string. Consider a simple pattern such as
4137    
4138           abcd$           abcd$
4139    
4140         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
4141         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
4142         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
4143         pattern is specified as         pattern is specified as
4144    
4145           ^.*abcd$           ^.*abcd$
4146    
4147         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
4148         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
4149         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
4150         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
4151         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
4152    
          ^(?>.*)(?<=abcd)  
   
        or, equivalently, using the possessive quantifier syntax,  
   
4153           ^.*+(?<=abcd)           ^.*+(?<=abcd)
4154    
4155         there can be no backtracking for the .* item; it  can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
4156         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
4157         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
4158         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
4159         processing time.         processing time.
4160    
4161     Using multiple assertions     Using multiple assertions
# Line 3768  ASSERTIONS Line 4164  ASSERTIONS
4164    
4165           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
4166    
4167         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
4168         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
4169         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
4170         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
4171         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
4172         ceded  by six characters, the first three of which are digits and the last         ceded by six characters, the first three of which are digits and the  last
4173         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
4174         foo". A pattern to do that is         foo". A pattern to do that is
4175    
4176           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
4177    
4178         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
4179         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
4180         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
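
The difference between the two patterns above is easy to confirm experimentally. The sketch below uses Python's re module, assuming its fixed-length lookbehind assertions behave like PCRE's for these patterns; the helper name `found` and the subject strings are illustrative.

```python
import re

# Both assertions are applied at the same point, just before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")
# Here the first assertion looks back six characters instead of three.
spaced = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, subject):
    """Return True if the pattern matches somewhere in the subject."""
    return pattern.search(subject) is not None

adjacent_results = [found(adjacent, s) for s in ("123foo", "999foo", "123abcfoo")]
spaced_results = [found(spaced, s) for s in ("123abcfoo", "123999foo", "123foo")]

print(adjacent_results)
print(spaced_results)
```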
4181    
# Line 3787  ASSERTIONS Line 4183  ASSERTIONS
4183    
4184           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
4185    
4186         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
4187         is not preceded by "foo", while         is not preceded by "foo", while
4188    
4189           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
4190    
4191         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
4192         three characters that are not "999".         three characters that are not "999".
4193    
4194    
4195  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
4196    
4197         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
4198         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
4199         on  the result of an assertion, or whether a previous capturing subpat-         on the result of an assertion, or whether a previous capturing  subpat-
4200         tern matched or not. The two possible forms of  conditional  subpattern         tern  matched  or not. The two possible forms of conditional subpattern
4201         are         are
4202    
4203           (?(condition)yes-pattern)           (?(condition)yes-pattern)
4204           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
4205    
4206         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
4207         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
4208         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
4209    
4210         There are three kinds of condition. If the text between the parentheses         There are four kinds of condition: references  to  subpatterns,  refer-
4211         consists of a sequence of digits, or a sequence of alphanumeric charac-         ences to recursion, a pseudo-condition called DEFINE, and assertions.
        ters  and underscores, the condition is satisfied if the capturing sub-  
        pattern of that number or name has previously matched. There is a  pos-  
        sible  ambiguity here, because subpattern names may consist entirely of  
        digits. PCRE looks first for a named subpattern; if it cannot find  one  
        and  the text consists entirely of digits, it looks for a subpattern of  
        that number, which must be greater than zero.  Using  subpattern  names  
        that consist entirely of digits is not recommended.  
4212    
4213         Consider  the  following  pattern, which contains non-significant white     Checking for a used subpattern by number
4214    
4215           If  the  text between the parentheses consists of a sequence of digits,
4216           the condition is true if the capturing subpattern of  that  number  has
4217           previously  matched.  An  alternative notation is to precede the digits
4218           with a plus or minus sign. In this case, the subpattern number is rela-
4219           tive rather than absolute.  The most recently opened parentheses can be
4220           referenced by (?(-1), the next most recent by (?(-2),  and  so  on.  In
4221           looping constructs it can also make sense to refer to subsequent groups
4222           with constructs such as (?(+2).
4223    
4224           Consider the following pattern, which  contains  non-significant  white
4225         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
4226         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
4227    
4228           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
4229    
4230         The  first  part  matches  an optional opening parenthesis, and if that         The first part matches an optional opening  parenthesis,  and  if  that
4231         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
4232         ond  part  matches one or more characters that are not parentheses. The         ond part matches one or more characters that are not  parentheses.  The
4233         third part is a conditional subpattern that tests whether the first set         third part is a conditional subpattern that tests whether the first set
4234         of parentheses matched or not. If they did, that is, if the subject started         of parentheses matched or not. If they did, that is, if the subject started
4235         with an opening parenthesis, the condition is true, and so the yes-pat-         with an opening parenthesis, the condition is true, and so the yes-pat-
4236         tern  is  executed  and  a  closing parenthesis is required. Otherwise,         tern is executed and a  closing  parenthesis  is  required.  Otherwise,
4237         since no-pattern is not present, the  subpattern  matches  nothing.  In         since  no-pattern  is  not  present, the subpattern matches nothing. In
4238         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other words,  this  pattern  matches  a  sequence  of  non-parentheses,
4239         optionally enclosed in parentheses. Rewriting it to use a named subpat-         optionally enclosed in parentheses.
        tern gives this:  
4240    
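
A minimal sketch of the numbered condition described above, using Python's re module with its VERBOSE flag in place of PCRE_EXTENDED. This assumes that the (?(1)...) construct behaves the same way in both engines, which appears to hold for this simple case; the relative form (?(-1)...) is not part of Python's re and is not shown.

```python
import re

# Optional "(", then non-parentheses, then ")" only if group 1 matched.
optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

subjects = ["(abc)", "abc", "(abc", "abc)"]
whole_match = {s: optional_parens.fullmatch(s) is not None for s in subjects}

print(whole_match)
```

The unbalanced subjects are rejected as whole strings: "(abc" because the condition demands a closing parenthesis, and "abc)" because nothing in the pattern can consume the stray one.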
4241           (?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )         If  you  were  embedding  this pattern in a larger one, you could use a
4242           relative reference:
4243    
4244         If the condition is the string (R), and there is no subpattern with the           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
4245         name R, the condition is satisfied if a recursive call to  the  pattern  
4246         or  subpattern  has  been made. At "top level", the condition is false.         This makes the fragment independent of the parentheses  in  the  larger
4247         This is a PCRE extension.  Recursive patterns are described in the next         pattern.
        section.  
4248    
4249         If  the  condition  is  not  a sequence of digits or (R), it must be an     Checking for a used subpattern by name
4250         assertion.  This may be a positive or negative lookahead or  lookbehind  
4251         assertion.  Consider  this  pattern,  again  containing non-significant         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
4252           used subpattern by name. For compatibility  with  earlier  versions  of
4253           PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
4254           also recognized. However, there is a possible ambiguity with this  syn-
4255           tax,  because  subpattern  names  may  consist entirely of digits. PCRE
4256           looks first for a named subpattern; if it cannot find one and the  name
4257           consists  entirely  of digits, PCRE looks for a subpattern of that num-
4258           ber, which must be greater than zero. Using subpattern names that  con-
4259           sist entirely of digits is not recommended.
4260    
4261           Rewriting the above example to use a named subpattern gives this:
4262    
4263             (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
4264    
4265    
4266       Checking for pattern recursion
4267    
4268           If the condition is the string (R), and there is no subpattern with the
4269           name R, the condition is true if a recursive call to the whole  pattern
4270           or any subpattern has been made. If digits or a name preceded by amper-
4271           sand follow the letter R, for example:
4272    
4273             (?(R3)...) or (?(R&name)...)
4274    
4275           the condition is true if the most recent recursion is into the  subpat-
4276           tern  whose  number or name is given. This condition does not check the
4277           entire recursion stack.
4278    
4279           At "top level", all these recursion test conditions are  false.  Recur-
4280           sive patterns are described below.
4281    
4282       Defining subpatterns for use by reference only
4283    
4284           If  the  condition  is  the string (DEFINE), and there is no subpattern
4285           with the name DEFINE, the condition is  always  false.  In  this  case,
4286           there  may  be  only  one  alternative  in the subpattern. It is always
4287           skipped if control reaches this point  in  the  pattern;  the  idea  of
4288           DEFINE  is that it can be used to define "subroutines" that can be ref-
4289           erenced from elsewhere. (The use of "subroutines" is described  below.)
4290           For  example,  a pattern to match an IPv4 address could be written like
4291           this (ignore whitespace and line breaks):
4292    
4293             (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
4294             \b (?&byte) (\.(?&byte)){3} \b
4295    
4296           The first part of the pattern is a DEFINE group inside  which  another
4297           group  named "byte" is defined. This matches an individual component of
4298           an IPv4 address (a number less than 256). When  matching  takes  place,
4299           this  part  of  the pattern is skipped because DEFINE acts like a false
4300           condition.
4301    
4302           The rest of the pattern uses references to the named group to match the
4303           four  dot-separated  components of an IPv4 address, insisting on a word
4304           boundary at each end.
4305    
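The "define once, reference several times" idea behind DEFINE can be approximated in engines that lack it. The sketch below assumes Python's standard re module, which has neither (?(DEFINE)...) nor (?&name); it swaps in plain string interpolation, so the component is written once in a variable and substituted into the pattern before compilation. The names BYTE, IPV4_RE and find_ipv4 are hypothetical, and a non-capturing group replaces the capturing one in the PCRE pattern.

```python
import re

# One component of an IPv4 address (0-255), written once and reused.
# This variable plays the role of the "byte" group inside DEFINE.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Roughly equivalent to: \b (?&byte) (\.(?&byte)){3} \b
# re.ASCII restricts \d and \b to ASCII characters.
IPV4_RE = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b", re.ASCII)


def find_ipv4(text):
    """Return the first IPv4-like address found in text, or None."""
    m = IPV4_RE.search(text)
    return m.group(0) if m else None
```

Unlike a true DEFINE group, the interpolated text is copied into the pattern four times, so the compiled pattern grows with each reference. The matching behaviour for this example should be the same.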
4306       Assertion conditions
4307    
4308           If the condition is not in any of the above  formats,  it  must  be  an
4309           assertion.   This may be a positive or negative lookahead or lookbehind
4310           assertion. Consider  this  pattern,  again  containing  non-significant
4311         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
4312    
4313           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
4314           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
4315    
4316         The condition  is  a  positive  lookahead  assertion  that  matches  an         The  condition  is  a  positive  lookahead  assertion  that  matches an
4317         optional  sequence of non-letters followed by a letter. In other words,         optional sequence of non-letters followed by a letter. In other  words,
4318         it tests for the presence of at least one letter in the subject.  If  a         it  tests  for the presence of at least one letter in the subject. If a
4319         letter  is found, the subject is matched against the first alternative;         letter is found, the subject is matched against the first  alternative;
4320         otherwise it is  matched  against  the  second.  This  pattern  matches         otherwise  it  is  matched  against  the  second.  This pattern matches
4321         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
4322         letters and dd are digits.         letters and dd are digits.
4323    
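The effect of an assertion condition can be reproduced, as a sketch, in an engine without this feature. Python's standard re module accepts only a group number or name as a condition, so the example below assumes the rewriting (?(?=C) YES | NO) into (?=C) YES | (?!C) NO, where each alternative is guarded by the assertion or its negation. The names HAS_LETTER, DATE_RE and match_date are hypothetical.

```python
import re

# The condition from the pattern above: non-letters followed by a letter.
HAS_LETTER = r"[^a-z]*[a-z]"

# Assumed emulation of the conditional: the first alternative is tried only
# when the lookahead succeeds, the second only when it fails.
DATE_RE = re.compile(
    rf"(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}}"
)


def match_date(text):
    """Return the text matched at the start of the subject, or None."""
    m = DATE_RE.match(text)
    return m.group(0) if m else None
```

Note that the condition looks at the rest of the subject, not only at the part that is finally matched: "12-34-56 x" does not match, because the trailing letter selects the first alternative, which then fails.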
4324    
4325  COMMENTS  COMMENTS
4326    
4327         The sequence (?# marks the start of a comment that continues up to  the         The  sequence (?# marks the start of a comment that continues up to the
4328         next  closing  parenthesis.  Nested  parentheses are not permitted. The         next closing parenthesis. Nested parentheses  are  not  permitted.  The
4329         characters that make up a comment play no part in the pattern  matching         characters  that make up a comment play no part in the pattern matching
4330         at all.         at all.
4331    
4332         If  the PCRE_EXTENDED option is set, an unescaped # character outside a         If the PCRE_EXTENDED option is set, an unescaped # character outside  a
4333         character class introduces a  comment  that  continues  to  immediately         character  class  introduces  a  comment  that continues to immediately
4334         after the next newline in the pattern.         after the next newline in the pattern.
4335    
4336    
4337  RECURSIVE PATTERNS  RECURSIVE PATTERNS
4338    
4339         Consider  the problem of matching a string in parentheses, allowing for         Consider the problem of matching a string in parentheses, allowing  for
4340         unlimited nested parentheses. Without the use of  recursion,  the  best         unlimited  nested  parentheses.  Without the use of recursion, the best
4341         that  can  be  done  is  to use a pattern that matches up to some fixed         that can be done is to use a pattern that  matches  up  to  some  fixed
4342         depth of nesting. It is not possible to  handle  an  arbitrary  nesting         depth  of  nesting.  It  is not possible to handle an arbitrary nesting
4343         depth.  Perl  provides  a  facility  that allows regular expressions to         depth.
4344         recurse (amongst other things). It does this by interpolating Perl code  
4345         in the expression at run time, and the code can refer to the expression         For some time, Perl has provided a facility that allows regular expres-
4346         itself. A Perl pattern to solve the parentheses problem can be  created         sions  to recurse (amongst other things). It does this by interpolating
4347         like this:         Perl code in the expression at run time, and the code can refer to  the
4348           expression itself. A Perl pattern using code interpolation to solve the
4349           parentheses problem can be created like this:
4350    
4351           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
4352    
4353         The (?p{...}) item interpolates Perl code at run time, and in this case         The (?p{...}) item interpolates Perl code at run time, and in this case
4354         refers recursively to the pattern in which it appears. Obviously,  PCRE         refers recursively to the pattern in which it appears.
        cannot  support  the  interpolation  of Perl code. Instead, it supports  
        some special syntax for recursion of the entire pattern, and  also  for  
        individual subpattern recursion.  
4355    
4356         The  special item that consists of (? followed by a number greater than         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
4357           it supports special syntax for recursion of  the  entire  pattern,  and
4358           also  for  individual  subpattern  recursion. After its introduction in
4359           PCRE and Python, this kind of recursion was  introduced  into  Perl  at
4360           release 5.10.
4361    
4362           A  special  item  that consists of (? followed by a number greater than
4363         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
4364         the  given  number, provided that it occurs inside that subpattern. (If         the  given  number, provided that it occurs inside that subpattern. (If
4365         not, it is a "subroutine" call, which is described  in  the  next  sec-         not, it is a "subroutine" call, which is described  in  the  next  sec-
4366         tion.)  The special item (?R) is a recursive call of the entire regular         tion.)  The special item (?R) or (?0) is a recursive call of the entire
4367         expression.         regular expression.
4368    
4369         A recursive subpattern call is always treated as an atomic group.  That         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
4370         is,  once  it  has  matched some of the subject string, it is never re-         always treated as an atomic group. That is, once it has matched some of
4371         entered, even if it contains untried alternatives and there is a subse-         the subject string, it is never re-entered, even if it contains untried
4372         quent matching failure.         alternatives and there is a subsequent matching failure.
4373    
4374         This  PCRE  pattern  solves  the nested parentheses problem (assume the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
4375         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
# Line 3924  RECURSIVE PATTERNS Line 4387  RECURSIVE PATTERNS
4387           ( \( ( (?>[^()]+) | (?1) )* \) )           ( \( ( (?>[^()]+) | (?1) )* \) )
4388    
4389         We have put the pattern into parentheses, and caused the  recursion  to         We have put the pattern into parentheses, and caused the  recursion  to
4390         refer  to them instead of the whole pattern. In a larger pattern, keep-         refer to them instead of the whole pattern.
4391         ing track of parenthesis numbers can be tricky. It may be  more  conve-  
4392         nient  to use named parentheses instead. For this, PCRE uses (?P>name),         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
4393         which is an extension to the Python syntax that  PCRE  uses  for  named         tricky. This is made easier by the use of relative references. (A  Perl
4394         parentheses (Perl does not provide named parentheses). We could rewrite         5.10  feature.)   Instead  of  (?1)  in the pattern above you can write
4395         the above example as follows:         (?-2) to refer to the second most recently opened parentheses preceding
4396           the  recursion.  In  other  words,  a  negative number counts capturing
4397           (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )         parentheses leftwards from the point at which it is encountered.
4398    
4399         This particular example pattern contains nested unlimited repeats,  and         It is also possible to refer to  subsequently  opened  parentheses,  by
4400         so  the  use of atomic grouping for matching strings of non-parentheses         writing  references  such  as (?+2). However, these cannot be recursive
4401         is important when applying the pattern to strings that  do  not  match.         because the reference is not inside the  parentheses  that  are  refer-
4402         For example, when this pattern is applied to         enced.  They  are  always  "subroutine" calls, as described in the next
4403           section.
4404