Diff of /code/trunk/doc/pcre.txt

revision 90 by nigel, Sat Feb 24 21:41:21 2007 UTC | revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC
# Line 81  USER DOCUMENTATION Line 81  USER DOCUMENTATION
81           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API
82           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
83           pcresample        discussion of the sample program           pcresample        discussion of the sample program
84             pcrestack         discussion of stack usage
85           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
86    
87         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
# Line 100  LIMITATIONS Line 101  LIMITATIONS
101         In these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
102         of execution will be slower.         of execution will be slower.
103    
104         All values in repeating quantifiers must be less than 65536.  The maxi-         All  values in repeating quantifiers must be less than 65536. The maxi-
105         mum number of capturing subpatterns is 65535.         mum compiled length of a subpattern with an  explicit  repeat  count  is
106           30000 bytes. The maximum number of capturing subpatterns is 65535.
107    
108         There is no limit to the number of non-capturing subpatterns,  but  the         There  is  no limit to the number of non-capturing subpatterns, but the
109         maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,         maximum depth of nesting of  all  kinds  of  parenthesized  subpattern,
110         including capturing subpatterns, assertions, and other types of subpat-         including capturing subpatterns, assertions, and other types of subpat-
111         tern, is 200.         tern, is 200.
112    
113           The maximum length of a name for a named subpattern is 32, and the maxi-
114           mum number of named subpatterns is 10000.
115    
116         The  maximum  length of a subject string is the largest positive number         The  maximum  length of a subject string is the largest positive number
117         that an integer variable can hold. However, when using the  traditional         that an integer variable can hold. However, when using the  traditional
118         matching function, PCRE uses recursion to handle subpatterns and indef-         matching function, PCRE uses recursion to handle subpatterns and indef-
119         inite repetition.  This means that the available stack space may  limit         inite repetition.  This means that the available stack space may  limit
120         the size of a subject string that can be processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
121           For a discussion of stack issues, see the pcrestack documentation.
122    
123    
124  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
# Line 162  UTF-8 AND UNICODE PROPERTY SUPPORT Line 168  UTF-8 AND UNICODE PROPERTY SUPPORT
168         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
169         two-byte UTF-8 character if the value is greater than 127.         two-byte UTF-8 character if the value is greater than 127.
170    
171         3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-         3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
172           characters for values greater than \177.
173    
174           4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
175         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
176    
177         4. The dot metacharacter matches one UTF-8 character instead of a  sin-         5.  The dot metacharacter matches one UTF-8 character instead of a sin-
178         gle byte.         gle byte.
179    
180         5.  The  escape sequence \C can be used to match a single byte in UTF-8         6. The escape sequence \C can be used to match a single byte  in  UTF-8
181         mode, but its use can lead to some strange effects.  This  facility  is         mode,  but  its  use can lead to some strange effects. This facility is
182         not available in the alternative matching function, pcre_dfa_exec().         not available in the alternative matching function, pcre_dfa_exec().
183    
184         6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly         7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
185         test characters of any code value, but the characters that PCRE  recog-         test  characters of any code value, but the characters that PCRE recog-
186         nizes  as  digits,  spaces,  or  word characters remain the same set as         nizes as digits, spaces, or word characters  remain  the  same  set  as
187         before, all with values less than 256. This remains true even when PCRE         before, all with values less than 256. This remains true even when PCRE
188         includes  Unicode  property support, because to do otherwise would slow         includes Unicode property support, because to do otherwise  would  slow
189         down PCRE in many common cases. If you really want to test for a  wider         down  PCRE in many common cases. If you really want to test for a wider
190         sense  of,  say,  "digit",  you must use Unicode property tests such as         sense of, say, "digit", you must use Unicode  property  tests  such  as
191         \p{Nd}.         \p{Nd}.
192    
193         7. Similarly, characters that match the POSIX named  character  classes         8.  Similarly,  characters that match the POSIX named character classes
194         are all low-valued characters.         are all low-valued characters.
195    
196         8.  Case-insensitive  matching  applies only to characters whose values         9. Case-insensitive matching applies only to  characters  whose  values
197         are less than 128, unless PCRE is built with Unicode property  support.         are  less than 128, unless PCRE is built with Unicode property support.
198         Even  when  Unicode  property support is available, PCRE still uses its         Even when Unicode property support is available, PCRE  still  uses  its
199         own character tables when checking the case of  low-valued  characters,         own  character  tables when checking the case of low-valued characters,
200         so  as not to degrade performance.  The Unicode property information is         so as not to degrade performance.  The Unicode property information  is
201         used only for characters with higher values. Even when Unicode property         used only for characters with higher values. Even when Unicode property
202         support is available, PCRE supports case-insensitive matching only when         support is available, PCRE supports case-insensitive matching only when
203         there is a one-to-one mapping between a letter's  cases.  There  are  a         there  is  a  one-to-one  mapping between a letter's cases. There are a
204         small  number  of  many-to-one  mappings in Unicode; these are not sup-         small number of many-to-one mappings in Unicode;  these  are  not  sup-
205         ported by PCRE.         ported by PCRE.
206    
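Items 2 and 3 in the list above say that values greater than 127 (\x80 to \xff, and octal values above \177 up to \777) match two-byte UTF-8 characters. The following C sketch is not part of PCRE; the function name utf8_encode is hypothetical and the code simply applies the standard UTF-8 encoding rule, under which code points 128 to 2047 occupy exactly two bytes. It does not reject surrogate values.

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function: encode code point cp as UTF-8
   into buf (which must hold at least 4 bytes). Returns the number of bytes
   written, or 0 if cp is above 0x10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {                      /* 0..127: one byte, same as ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 128..2047: two bytes */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 2048..65535: three bytes */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* up to 0x10FFFF: four bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

Calling utf8_encode(0xB3, buf) stores the two bytes 0xC2 0xB3, and utf8_encode(0777, buf) stores 0xC7 0xBF, which is why the largest octal escape still fits in two bytes.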
207    
# Line 202  AUTHOR Line 211  AUTHOR
211         University Computing Service,         University Computing Service,
212         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
213    
214         Putting an actual email address here seems to have been a spam  magnet,         Putting  an actual email address here seems to have been a spam magnet,
215         so I've taken it away. If you want to email me, use my initial and sur-         so I've taken it away. If you want to email me, use my initial and sur-
216         name, separated by a dot, at the domain ucs.cam.ac.uk.         name, separated by a dot, at the domain ucs.cam.ac.uk.
217    
218  Last updated: 24 January 2006  Last updated: 05 June 2006
219  Copyright (c) 1997-2006 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
220  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
221    
# Line 281  UNICODE CHARACTER PROPERTY SUPPORT Line 290  UNICODE CHARACTER PROPERTY SUPPORT
290    
291  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
292    
293         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating
294         ter. This is the normal newline character on Unix-like systems. You can         the  end  of  a line. This is the normal newline character on Unix-like
295         compile PCRE to use character 13 (carriage return) instead by adding         systems. You can compile PCRE to use character 13 (carriage return, CR)
296           instead, by adding
297    
298           --enable-newline-is-cr           --enable-newline-is-cr
299    
300         to the configure command. For completeness there is  also  a  --enable-         to  the  configure  command.  There  is  also  a --enable-newline-is-lf
301         newline-is-lf  option,  which explicitly specifies linefeed as the new-         option, which explicitly specifies linefeed as the newline character.
302         line character.  
303           Alternatively, you can specify that line endings are to be indicated by
304           the two character sequence CRLF. If you want this, add
305    
306             --enable-newline-is-crlf
307    
308           to  the  configure command. Whatever line ending convention is selected
309           when PCRE is built can be overridden when  the  library  functions  are
310           called.  At  build time it is conventional to use the standard for your
311           operating system.
312    
313    
314  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
# Line 320  POSIX MALLOC USAGE Line 339  POSIX MALLOC USAGE
339         to the configure command.         to the configure command.
340    
341    
 LIMITING PCRE RESOURCE USAGE  
   
        Internally,  PCRE has a function called match(), which it calls repeat-  
        edly  (possibly  recursively)  when  matching  a   pattern   with   the  
        pcre_exec()  function.  By controlling the maximum number of times this  
        function may be called during a single matching operation, a limit  can  
        be  placed  on  the resources used by a single call to pcre_exec(). The  
        limit can be changed at run time, as described in the pcreapi  documen-  
        tation.  The default is 10 million, but this can be changed by adding a  
        setting such as  
   
          --with-match-limit=500000  
   
        to  the  configure  command.  This  setting  has  no  effect   on   the  
        pcre_dfa_exec() matching function.  
   
   
342  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
343    
344         Within  a  compiled  pattern,  offset values are used to point from one         Within  a  compiled  pattern,  offset values are used to point from one
# Line 366  AVOIDING EXCESSIVE STACK USAGE Line 368  AVOIDING EXCESSIVE STACK USAGE
368         ing by making recursive calls to an internal function  called  match().         ing by making recursive calls to an internal function  called  match().
369         In  environments  where  the size of the stack is limited, this can se-         In  environments  where  the size of the stack is limited, this can se-
370         verely limit PCRE's operation. (The Unix environment does  not  usually         verely limit PCRE's operation. (The Unix environment does  not  usually
371         suffer  from  this  problem.)  An alternative approach that uses memory         suffer from this problem, but it may sometimes be necessary to increase
372         from the heap to remember data, instead  of  using  recursive  function         the maximum stack size.  There is a discussion in the  pcrestack  docu-
373         calls,  has been implemented to work round this problem. If you want to         mentation.)  An alternative approach to recursion that uses memory from
374         build a version of PCRE that works this way, add         the heap to remember data, instead of using recursive  function  calls,
375           has  been  implemented to work round the problem of limited stack size.
376           If you want to build a version of PCRE that works this way, add
377    
378           --disable-stack-for-recursion           --disable-stack-for-recursion
379    
# Line 384  AVOIDING EXCESSIVE STACK USAGE Line 388  AVOIDING EXCESSIVE STACK USAGE
388         function; it is not relevant for the pcre_dfa_exec() function.         function; it is not relevant for the pcre_dfa_exec() function.
389    
390    
391    LIMITING PCRE RESOURCE USAGE
392    
393           Internally, PCRE has a function called match(), which it calls  repeat-
394           edly   (sometimes   recursively)  when  matching  a  pattern  with  the
395           pcre_exec() function. By controlling the maximum number of  times  this
396           function  may be called during a single matching operation, a limit can
397           be placed on the resources used by a single call  to  pcre_exec().  The
398           limit  can be changed at run time, as described in the pcreapi documen-
399           tation. The default is 10 million, but this can be changed by adding  a
400           setting such as
401    
402             --with-match-limit=500000
403    
404           to   the   configure  command.  This  setting  has  no  effect  on  the
405           pcre_dfa_exec() matching function.
406    
407           In some environments it is desirable to limit the  depth  of  recursive
408           calls of match() more strictly than the total number of calls, in order
409           to restrict the maximum amount of stack (or heap,  if  --disable-stack-
410           for-recursion is specified) that is used. A second limit controls this;
411           it defaults to the value that  is  set  for  --with-match-limit,  which
412           imposes  no  additional constraints. However, you can set a lower limit
413           by adding, for example,
414    
415             --with-match-limit-recursion=10000
416    
417           to the configure command. This value can  also  be  overridden  at  run
418           time.
419    
420    
421  USING EBCDIC CODE  USING EBCDIC CODE
422    
423         PCRE assumes by default that it will run in an  environment  where  the         PCRE  assumes  by  default that it will run in an environment where the
424         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
425         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by
426         adding         adding
427    
428           --enable-ebcdic           --enable-ebcdic
429    
430         to the configure command.         to the configure command.
431    
432  Last updated: 15 August 2005  Last updated: 06 June 2006
433  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
434  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
435    
436    
# Line 441  REGULAR EXPRESSIONS AS TREES Line 475  REGULAR EXPRESSIONS AS TREES
475         resented  as  a  tree structure. An unlimited repetition in the pattern         resented  as  a  tree structure. An unlimited repetition in the pattern
476         makes the tree of infinite size, but it is still a tree.  Matching  the         makes the tree of infinite size, but it is still a tree.  Matching  the
477         pattern  to a given subject string (from a given starting point) can be         pattern  to a given subject string (from a given starting point) can be
478         thought of as a search of the tree.  There are  two  standard  ways  to         thought of as a search of the tree.  There are two  ways  to  search  a
479         search  a  tree: depth-first and breadth-first, and these correspond to         tree:  depth-first  and  breadth-first, and these correspond to the two
480         the two matching algorithms provided by PCRE.         matching algorithms provided by PCRE.
481    
482    
483  THE STANDARD MATCHING ALGORITHM  THE STANDARD MATCHING ALGORITHM
# Line 563  DISADVANTAGES OF THE DFA ALGORITHM Line 597  DISADVANTAGES OF THE DFA ALGORITHM
597         but  does not provide the advantage that it does for the standard algo-         but  does not provide the advantage that it does for the standard algo-
598         rithm.         rithm.
599    
600  Last updated: 28 February 2005  Last updated: 06 June 2006
601  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
602  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
603    
604    
# Line 617  PCRE NATIVE API Line 651  PCRE NATIVE API
651         int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
652              const char *name);              const char *name);
653    
654           int pcre_get_stringtable_entries(const pcre *code,
655                const char *name, char **first, char **last);
656    
657         int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
658              int stringcount, int stringnumber,              int stringcount, int stringnumber,
659              const char **stringptr);              const char **stringptr);
# Line 677  PCRE API OVERVIEW Line 714  PCRE API OVERVIEW
714    
715         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
716         ble,  is  also provided. This uses a different algorithm for the match-         ble,  is  also provided. This uses a different algorithm for the match-
717         ing. This allows it to find all possible matches (at a given  point  in         ing. The alternative algorithm finds all possible matches (at  a  given
718         the  subject),  not  just  one. However, this algorithm does not return         point in the subject). However, this algorithm does not return captured
719         captured substrings. A description of the two matching  algorithms  and         substrings. A description of the  two  matching  algorithms  and  their
720         their  advantages  and disadvantages is given in the pcrematching docu-         advantages  and  disadvantages  is given in the pcrematching documenta-
721         mentation.         tion.
722    
723         In addition to the main compiling and  matching  functions,  there  are         In addition to the main compiling and  matching  functions,  there  are
724         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
# Line 693  PCRE API OVERVIEW Line 730  PCRE API OVERVIEW
730           pcre_get_named_substring()           pcre_get_named_substring()
731           pcre_get_substring_list()           pcre_get_substring_list()
732           pcre_get_stringnumber()           pcre_get_stringnumber()
733             pcre_get_stringtable_entries()
734    
735         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
736         to free the memory used for extracted strings.         to free the memory used for extracted strings.
# Line 724  PCRE API OVERVIEW Line 762  PCRE API OVERVIEW
762         indirections  to  memory  management functions. These special functions         indirections  to  memory  management functions. These special functions
763         are used only when PCRE is compiled to use  the  heap  for  remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
764         data, instead of recursive function calls, when running the pcre_exec()         data, instead of recursive function calls, when running the pcre_exec()
765         function. This is a non-standard way of building PCRE, for use in envi-         function. See the pcrebuild documentation for  details  of  how  to  do
766         ronments that have limited stacks. Because of the greater use of memory         this.  It  is  a non-standard way of building PCRE, for use in environ-
767         management, it runs more slowly.  Separate functions  are  provided  so         ments that have limited stacks. Because of the greater  use  of  memory
768         that  special-purpose  external  code  can  be used for this case. When         management,  it  runs  more  slowly. Separate functions are provided so
769         used, these functions are always called in a  stack-like  manner  (last         that special-purpose external code can be  used  for  this  case.  When
770         obtained,  first freed), and always for memory blocks of the same size.         used,  these  functions  are always called in a stack-like manner (last
771           obtained, first freed), and always for memory blocks of the same  size.
772           There  is  a discussion about PCRE's stack usage in the pcrestack docu-
773           mentation.
774    
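The paragraph above notes that the stack-replacement functions are always called in a stack-like manner and always for blocks of the same size. A special-purpose allocator can exploit that by caching freed blocks instead of returning them to the system. The following is a minimal sketch under that assumption, not code from PCRE: the names lifo_malloc, lifo_free and lifo_release are hypothetical, and the sketch is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header stored in front of every block. The extra union members force an
   alignment suitable for ordinary data placed directly after the header. */
typedef union block_header {
    struct {
        size_t size;                  /* usable size of this block */
        union block_header *next;     /* next cached block on the free list */
    } info;
    long double align_ld;
    long long align_ll;
    void *align_ptr;
} block_header;

static block_header *free_list = NULL;    /* most recently freed block first */

/* Same shape as malloc(): reuse the most recently freed block when its size
   matches, otherwise obtain a new block from the system. */
static void *lifo_malloc(size_t size)
{
    block_header *h = free_list;
    if (h != NULL && h->info.size == size) {
        free_list = h->info.next;
    } else {
        h = malloc(sizeof(block_header) + size);
        if (h == NULL) return NULL;
        h->info.size = size;
    }
    h->info.next = NULL;
    return h + 1;                     /* memory just past the header */
}

/* Same shape as free(): push the block onto the cache instead of freeing it. */
static void lifo_free(void *block)
{
    block_header *h;
    if (block == NULL) return;
    h = (block_header *)block - 1;
    h->info.next = free_list;
    free_list = h;
}

/* Return every cached block to the system, for example at program exit. */
static void lifo_release(void)
{
    while (free_list != NULL) {
        block_header *h = free_list;
        free_list = h->info.next;
        free(h);
    }
}
```

With a PCRE library built to use the heap for recursion, such functions would be installed by assigning them to pcre_stack_malloc and pcre_stack_free before calling pcre_exec(). Because those variables are shared by all threads, a real implementation would need locking or per-thread caches.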
775         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
776         by  the  caller  to  a "callout" function, which PCRE will then call at         by  the  caller  to  a "callout" function, which PCRE will then call at
# Line 737  PCRE API OVERVIEW Line 778  PCRE API OVERVIEW
778         pcrecallout documentation.         pcrecallout documentation.
779    
780    
781    NEWLINES
782           PCRE supports three different conventions for indicating line breaks in
783           strings: a single CR character, a single LF character, or the two-char-
784           acter  sequence  CRLF.  All  three  are used as "standard" by different
785           operating systems.  When PCRE is built, a default can be specified;  if
786           none is given, the default is LF, which is the Unix standard. When PCRE is run,
787           the default can be overridden, either when a pattern  is  compiled,  or
788           when it is matched.
789    
790           In the PCRE documentation the word "newline" is used to mean "the char-
791           acter or pair of characters that indicate a line break".
792    
793    
794  MULTITHREADING  MULTITHREADING
795    
796         The  PCRE  functions  can be used in multi-threading applications, with         The PCRE functions can be used in  multi-threading  applications,  with
797         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
798         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
799         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
800    
801         The  compiled form of a regular expression is not altered during match-         The compiled form of a regular expression is not altered during  match-
802         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
803         at once.         at once.
804    
# Line 752  MULTITHREADING Line 806  MULTITHREADING
806  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
807    
808         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
809         later time, possibly by a different program, and even on a  host  other         later  time,  possibly by a different program, and even on a host other
810         than  the  one  on  which  it  was  compiled.  Details are given in the         than the one on which  it  was  compiled.  Details  are  given  in  the
811         pcreprecompile documentation.         pcreprecompile documentation.
812    
813    
# Line 761  CHECKING BUILD-TIME OPTIONS Line 815  CHECKING BUILD-TIME OPTIONS
815    
816         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
817    
818         The function pcre_config() makes it possible for a PCRE client to  dis-         The  function pcre_config() makes it possible for a PCRE client to dis-
819         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
820         The pcrebuild documentation has more details about these optional  fea-         The  pcrebuild documentation has more details about these optional fea-
821         tures.         tures.
822    
823         The  first  argument  for pcre_config() is an integer, specifying which         The first argument for pcre_config() is an  integer,  specifying  which
824         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
825         into  which  the  information  is  placed. The following information is         into which the information is  placed.  The  following  information  is
826         available:         available:
827    
828           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
829    
830         The output is an integer that is set to one if UTF-8 support is  avail-         The  output is an integer that is set to one if UTF-8 support is avail-
831         able; otherwise it is set to zero.         able; otherwise it is set to zero.
832    
833           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
834    
835         The  output  is  an  integer  that is set to one if support for Unicode         The output is an integer that is set to  one  if  support  for  Unicode
836         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
837    
838           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
839    
840         The output is an integer that is set to the value of the code  that  is         The  output  is  an integer whose value specifies the default character
841         used  for the newline character. It is either linefeed (10) or carriage         sequence that is recognized as meaning "newline". The three values that
842         return (13), and should normally be the  standard  character  for  your         are supported are: 10 for LF, 13 for CR, and 3338 for CRLF. The default
843         operating system.         should normally be the standard sequence for your operating system.
844    
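The three values listed for PCRE_CONFIG_NEWLINE are consistent with packing the character codes into one integer: 3338 equals (13 << 8) | 10, that is, CR in the high byte and LF in the low byte. That reading is an inference from the arithmetic rather than something the text states. The sketch below, whose function name newline_sequence is hypothetical, unpacks such a value into the character sequence it denotes.

```c
/* Hypothetical helper, not a PCRE function: given the integer that
   pcre_config() stores for PCRE_CONFIG_NEWLINE, write the newline sequence
   into seq (room for 2 characters plus a terminating zero) and return its
   length. Returns 0 for any value other than the three documented ones. */
static int newline_sequence(int value, char seq[3])
{
    int high = (value >> 8) & 0xFF;   /* first character of a pair, or 0 */
    int low = value & 0xFF;           /* only or last character */
    int n = 0;

    if (value != 10 && value != 13 && value != ((13 << 8) | 10))
        return 0;
    if (high != 0)
        seq[n++] = (char)high;
    seq[n++] = (char)low;
    seq[n] = '\0';
    return n;
}
```

A caller would pass the integer obtained from pcre_config(PCRE_CONFIG_NEWLINE, &value); for 3338 the function returns 2 and seq holds "\r\n".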
845           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
846    
847         The  output  is  an  integer that contains the number of bytes used for         The output is an integer that contains the number  of  bytes  used  for
848         internal linkage in compiled regular expressions. The value is 2, 3, or         internal linkage in compiled regular expressions. The value is 2, 3, or
849         4.  Larger  values  allow larger regular expressions to be compiled, at         4. Larger values allow larger regular expressions to  be  compiled,  at
850         the expense of slower matching. The default value of  2  is  sufficient         the  expense  of  slower matching. The default value of 2 is sufficient
851         for  all  but  the  most massive patterns, since it allows the compiled         for all but the most massive patterns, since  it  allows  the  compiled
852         pattern to be up to 64K in size.         pattern to be up to 64K in size.
853    
854           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
855    
856         The output is an integer that contains the threshold  above  which  the         The  output  is  an integer that contains the threshold above which the
857         POSIX  interface  uses malloc() for output vectors. Further details are         POSIX interface uses malloc() for output vectors. Further  details  are
858         given in the pcreposix documentation.         given in the pcreposix documentation.
859    
860           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
861    
862         The output is an integer that gives the default limit for the number of         The output is an integer that gives the default limit for the number of
863         internal  matching  function  calls in a pcre_exec() execution. Further         internal matching function calls in a  pcre_exec()  execution.  Further
864         details are given with pcre_exec() below.         details are given with pcre_exec() below.
865    
866           PCRE_CONFIG_MATCH_LIMIT_RECURSION           PCRE_CONFIG_MATCH_LIMIT_RECURSION
867    
868         The output is an integer that gives the default limit for the depth  of         The  output is an integer that gives the default limit for the depth of
869         recursion  when calling the internal matching function in a pcre_exec()         recursion when calling the internal matching function in a  pcre_exec()
870         execution. Further details are given with pcre_exec() below.         execution. Further details are given with pcre_exec() below.
871    
872           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
873    
874         The output is an integer that is set to one if internal recursion  when         The  output is an integer that is set to one if internal recursion when
875         running pcre_exec() is implemented by recursive function calls that use         running pcre_exec() is implemented by recursive function calls that use
876         the stack to remember their state. This is the usual way that  PCRE  is         the  stack  to remember their state. This is the usual way that PCRE is
877         compiled. The output is zero if PCRE was compiled to use blocks of data         compiled. The output is zero if PCRE was compiled to use blocks of data
878         on the  heap  instead  of  recursive  function  calls.  In  this  case,         on  the  heap  instead  of  recursive  function  calls.  In  this case,
879         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
880         blocks on the heap, thus avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
881    
882    
# Line 839  COMPILING A PATTERN Line 893  COMPILING A PATTERN
893    
894         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
895         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
896         the two interfaces is that pcre_compile2() has an additional  argument,         the  two interfaces is that pcre_compile2() has an additional argument,
897         errorcodeptr, via which a numerical error code can be returned.         errorcodeptr, via which a numerical error code can be returned.
898    
899         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
900         the pattern argument. A pointer to a single block  of  memory  that  is         the  pattern  argument.  A  pointer to a single block of memory that is
901         obtained  via  pcre_malloc is returned. This contains the compiled code         obtained via pcre_malloc is returned. This contains the  compiled  code
902         and related data. The pcre type is defined for the returned block; this         and related data. The pcre type is defined for the returned block; this
903         is a typedef for a structure whose contents are not externally defined.         is a typedef for a structure whose contents are not externally defined.
904         It is up to the caller  to  free  the  memory  when  it  is  no  longer         It is up to the caller to free the memory (via pcre_free) when it is no
905         required.         longer required.
906    
907         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although the compiled code of a PCRE regex is relocatable, that is,  it
908         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
909         fully  relocatable, because it may contain a copy of the tableptr argu-         fully relocatable, because it may contain a copy of the tableptr  argu-
910         ment, which is an address (see below).         ment, which is an address (see below).
911    
912         The options argument contains independent bits that affect the compila-         The options argument contains independent bits that affect the compila-
913         tion.  It  should  be  zero  if  no options are required. The available         tion. It should be zero if  no  options  are  required.  The  available
914         options are described below. Some of them, in  particular,  those  that         options  are  described  below. Some of them, in particular, those that
915         are  compatible  with  Perl,  can also be set and unset from within the         are compatible with Perl, can also be set and  unset  from  within  the
916         pattern (see the detailed description  in  the  pcrepattern  documenta-         pattern  (see  the  detailed  description in the pcrepattern documenta-
917         tion).  For  these options, the contents of the options argument speci-         tion). For these options, the contents of the options  argument  speci-
918         fies their initial settings at the start of compilation and  execution.         fies  their initial settings at the start of compilation and execution.
919         The  PCRE_ANCHORED option can be set at the time of matching as well as         The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the  time
920         at compile time.         of matching as well as at compile time.
921    
922         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
923         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
924         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
925         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
926         try to free it. The offset from the start of the pattern to the charac-         try to free it. The offset from the start of the pattern to the charac-
927         ter where the error was discovered is placed in the variable pointed to         ter where the error was discovered is placed in the variable pointed to
928         by erroffset, which must not be NULL. If it is, an immediate  error  is         by  erroffset,  which must not be NULL. If it is, an immediate error is
929         given.         given.
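
A caller typically combines the message returned via errptr with the offset returned via erroffset to show where compilation stopped. The following is a minimal sketch of such reporting. The helper name format_compile_error and its output layout are assumptions made for illustration; they are not part of the PCRE API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): formats a diagnostic from the
   values that pcre_compile() hands back on failure. The output has three
   lines: the message with its offset, the pattern itself, and a caret
   placed under the character at which the error was discovered.
   Returns what snprintf() returns. */
int format_compile_error(char *buf, size_t size, const char *pattern,
                         const char *errmsg, int erroffset)
{
    int len = (int)strlen(pattern);

    /* Clamp the offset so the caret never lands outside the pattern. */
    if (erroffset < 0) erroffset = 0;
    if (erroffset > len) erroffset = len;

    /* "%*s" with an empty string emits erroffset spaces before the caret. */
    return snprintf(buf, size, "compile failed at offset %d: %s\n%s\n%*s^\n",
                    erroffset, errmsg, pattern, erroffset, "");
}
```

In real use, pattern would be the string passed to pcre_compile(), and errmsg and erroffset would be the values it stored when it returned NULL. The message must not be freed, because it is a static string inside the library.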
930    
931         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
932         codeptr argument is not NULL, a non-zero error code number is  returned         codeptr  argument is not NULL, a non-zero error code number is returned
933         via  this argument in the event of an error. This is in addition to the         via this argument in the event of an error. This is in addition to  the
934         textual error message. Error codes and messages are listed below.         textual error message. Error codes and messages are listed below.
935    
936         If the final argument, tableptr, is NULL, PCRE uses a  default  set  of         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
937         character  tables  that  are  built  when  PCRE  is compiled, using the         character tables that are  built  when  PCRE  is  compiled,  using  the
938         default C locale. Otherwise, tableptr must be an address  that  is  the         default  C  locale.  Otherwise, tableptr must be an address that is the
939         result  of  a  call to pcre_maketables(). This value is stored with the         result of a call to pcre_maketables(). This value is  stored  with  the
940         compiled pattern, and used again by pcre_exec(), unless  another  table         compiled  pattern,  and used again by pcre_exec(), unless another table
941         pointer is passed to it. For more discussion, see the section on locale         pointer is passed to it. For more discussion, see the section on locale
942         support below.         support below.
943    
944         This code fragment shows a typical straightforward  call  to  pcre_com-         This  code  fragment  shows a typical straightforward call to pcre_com-
945         pile():         pile():
946    
947           pcre *re;           pcre *re;
# Line 900  COMPILING A PATTERN Line 954  COMPILING A PATTERN
954             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
955             NULL);            /* use default character tables */             NULL);            /* use default character tables */
956    
957         The  following  names  for option bits are defined in the pcre.h header         The following names for option bits are defined in  the  pcre.h  header
958         file:         file:
959    
960           PCRE_ANCHORED           PCRE_ANCHORED
961    
962         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
963         is  constrained to match only at the first matching point in the string         is constrained to match only at the first matching point in the  string
964         that is being searched (the "subject string"). This effect can also  be         that  is being searched (the "subject string"). This effect can also be
965         achieved  by appropriate constructs in the pattern itself, which is the         achieved by appropriate constructs in the pattern itself, which is  the
966         only way to do it in Perl.         only way to do it in Perl.
967    
968           PCRE_AUTO_CALLOUT           PCRE_AUTO_CALLOUT
969    
970         If this bit is set, pcre_compile() automatically inserts callout items,         If this bit is set, pcre_compile() automatically inserts callout items,
971         all  with  number  255, before each pattern item. For discussion of the         all with number 255, before each pattern item. For  discussion  of  the
972         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
973    
974           PCRE_CASELESS           PCRE_CASELESS
975    
976         If this bit is set, letters in the pattern match both upper  and  lower         If  this  bit is set, letters in the pattern match both upper and lower
977         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
978         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE         changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
979         always  understands the concept of case for characters whose values are         always understands the concept of case for characters whose values  are
980         less than 128, so caseless matching is always possible. For  characters         less  than 128, so caseless matching is always possible. For characters
981         with  higher  values,  the concept of case is supported if PCRE is com-         with higher values, the concept of case is supported if  PCRE  is  com-
982         piled with Unicode property support, but not otherwise. If you want  to         piled  with Unicode property support, but not otherwise. If you want to
983         use  caseless  matching  for  characters 128 and above, you must ensure         use caseless matching for characters 128 and  above,  you  must  ensure
984         that PCRE is compiled with Unicode property support  as  well  as  with         that  PCRE  is  compiled  with Unicode property support as well as with
985         UTF-8 support.         UTF-8 support.
986    
987           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
988    
989         If  this bit is set, a dollar metacharacter in the pattern matches only         If this bit is set, a dollar metacharacter in the pattern matches  only
990         at the end of the subject string. Without this option,  a  dollar  also         at  the  end  of the subject string. Without this option, a dollar also
991         matches  immediately before the final character if it is a newline (but         matches immediately before a newline at the end of the string (but  not
992         not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
993         ignored if PCRE_MULTILINE is set. There is no equivalent to this option         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
994         in Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
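
The rule for where a dollar may match when PCRE_MULTILINE is not set can be restated as a small predicate. This is a toy model written for illustration only, not code taken from PCRE; the function name dollar_can_match is hypothetical, and the sketch assumes that a newline is a single LF character.

```c
#include <stddef.h>

/* Toy model, not PCRE code: reports whether "$" could match at position
   pos of a subject of length len when PCRE_MULTILINE is not set.
   Assumes the newline convention is a single LF character. */
int dollar_can_match(const char *subject, size_t len, size_t pos,
                     int dollar_endonly)
{
    if (pos == len)
        return 1;   /* the very end of the subject always qualifies */
    if (!dollar_endonly && pos + 1 == len && subject[pos] == '\n')
        return 1;   /* immediately before a newline that ends the subject */
    return 0;       /* any other position, including earlier newlines */
}
```

For the subject "abc\n", the model accepts positions 3 and 4 without PCRE_DOLLAR_ENDONLY, and only position 4 with it.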
995    
996           PCRE_DOTALL           PCRE_DOTALL
997    
998         If this bit is set, a dot metacharacter in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches all char-
999         acters,  including  newlines.  Without  it, newlines are excluded. This         acters, including those that indicate newline. Without it, a  dot  does
1000         option is equivalent to Perl's /s option, and it can be changed  within         not  match  when  the  current position is at a newline. This option is
1001         a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]         equivalent to Perl's /s option, and it can be changed within a  pattern
1002         always matches a newline character, independent of the setting of  this         by  a (?s) option setting. A negative class such as [^a] always matches
1003         option.         newlines, independent of the setting of this option.
1004    
1005             PCRE_DUPNAMES
1006    
1007           If this bit is set, names used to identify capturing  subpatterns  need
1008           not be unique. This can be helpful for certain types of pattern when it
1009           is known that only one instance of the named  subpattern  can  ever  be
1010           matched.  There  are  more details of named subpatterns below; see also
1011           the pcrepattern documentation.
1012    
1013           PCRE_EXTENDED           PCRE_EXTENDED
1014    
1015         If  this  bit  is  set,  whitespace  data characters in the pattern are         If this bit is set, whitespace  data  characters  in  the  pattern  are
1016         totally ignored except when escaped or inside a character class. White-         totally ignored except when escaped or inside a character class. White-
1017         space does not include the VT character (code 11). In addition, charac-         space does not include the VT character (code 11). In addition, charac-
1018         ters between an unescaped # outside a character class and the next new-         ters between an unescaped # outside a character class and the next new-
1019         line  character,  inclusive,  are  also  ignored. This is equivalent to         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x
1020         Perl's /x option, and it can be changed within  a  pattern  by  a  (?x)         option,  and  it  can be changed within a pattern by a (?x) option set-
1021         option setting.         ting.
1022    
1023         This  option  makes  it possible to include comments inside complicated         This option makes it possible to include  comments  inside  complicated
1024         patterns.  Note, however, that this applies only  to  data  characters.         patterns.   Note,  however,  that this applies only to data characters.
1025         Whitespace   characters  may  never  appear  within  special  character         Whitespace  characters  may  never  appear  within  special   character
1026         sequences in a pattern, for  example  within  the  sequence  (?(  which         sequences  in  a  pattern,  for  example  within the sequence (?( which
1027         introduces a conditional subpattern.         introduces a conditional subpattern.
1028    
1029           PCRE_EXTRA           PCRE_EXTRA
1030    
1031         This  option  was invented in order to turn on additional functionality         This option was invented in order to turn on  additional  functionality
1032         of PCRE that is incompatible with Perl, but it  is  currently  of  very         of  PCRE  that  is  incompatible with Perl, but it is currently of very
1033         little  use. When set, any backslash in a pattern that is followed by a         little use. When set, any backslash in a pattern that is followed by  a
1034         letter that has no special meaning  causes  an  error,  thus  reserving         letter  that  has  no  special  meaning causes an error, thus reserving
1035         these  combinations  for  future  expansion.  By default, as in Perl, a         these combinations for future expansion. By  default,  as  in  Perl,  a
1036         backslash followed by a letter with no special meaning is treated as  a         backslash  followed by a letter with no special meaning is treated as a
1037         literal.  There  are  at  present  no other features controlled by this         literal. (Perl can, however, be persuaded to give a warning for  this.)
1038         option. It can also be set by a (?X) option setting within a pattern.         There  are  at  present no other features controlled by this option. It
1039           can also be set by a (?X) option setting within a pattern.
1040    
1041           PCRE_FIRSTLINE           PCRE_FIRSTLINE
1042    
1043         If this option is set, an  unanchored  pattern  is  required  to  match         If this option is set, an  unanchored  pattern  is  required  to  match
1044         before  or at the first newline character in the subject string, though         before  or  at  the  first  newline  in  the subject string, though the
1045         the matched text may continue over the newline.         matched text may continue over the newline.
1046    
1047           PCRE_MULTILINE           PCRE_MULTILINE
1048    
# Line 991  COMPILING A PATTERN Line 1054  COMPILING A PATTERN
1054         is set). This is the same as Perl.         is set). This is the same as Perl.
1055    
1056         When PCRE_MULTILINE is set, the "start of line" and "end of line"         When PCRE_MULTILINE is set, the "start of line" and "end of line"
1057         constructs match immediately following or immediately before  any  new-         constructs match immediately following or immediately  before  internal
1058         line  in the subject string, respectively, as well as at the very start         newlines  in  the  subject string, respectively, as well as at the very
1059         and end. This is equivalent to Perl's /m option, and it can be  changed         start and end. This is equivalent to Perl's /m option, and  it  can  be
1060         within a pattern by a (?m) option setting. If there are no "\n" charac-         changed within a pattern by a (?m) option setting. If there are no new-
1061         ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1062         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
1063    
1064             PCRE_NEWLINE_CR
1065             PCRE_NEWLINE_LF
1066             PCRE_NEWLINE_CRLF
1067    
1068           These  options  override the default newline definition that was chosen
1069           when PCRE was built. Setting the first or the second specifies  that  a
1070           newline  is  indicated  by a single character (CR or LF, respectively).
1071           Setting both of them specifies that a newline is indicated by the  two-
1072           character  CRLF sequence. For convenience, PCRE_NEWLINE_CRLF is defined
1073           to contain both bits. The only time that a line break is relevant  when
1074           compiling a pattern is if PCRE_EXTENDED is set, and an unescaped # out-
1075           side a character class is encountered. This indicates  a  comment  that
1076           lasts until after the next newline.
1077    
1078           The newline option set at compile time becomes the default that is used
1079           for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
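
Because PCRE_NEWLINE_CRLF is defined to contain both the CR bit and the LF bit, a caller that inspects an options word has to test the two bits together rather than one at a time. The sketch below illustrates this. The EX_ constants are stand-ins defined locally so that the fragment compiles without pcre.h; their numeric values are assumptions, and a real program should use the PCRE_NEWLINE_xxx names from the header instead. The function name newline_convention is also hypothetical.

```c
/* Stand-in bit values for illustration only (assumed, not authoritative).
   Real code should include pcre.h and use PCRE_NEWLINE_CR, PCRE_NEWLINE_LF
   and PCRE_NEWLINE_CRLF. */
#define EX_NEWLINE_CR   0x00100000
#define EX_NEWLINE_LF   0x00200000
#define EX_NEWLINE_CRLF (EX_NEWLINE_CR | EX_NEWLINE_LF)

/* Hypothetical helper: names the newline convention selected by an
   options word. Masking with the combined value isolates both bits, so
   CRLF is not mistaken for plain CR or plain LF. */
const char *newline_convention(int options)
{
    switch (options & EX_NEWLINE_CRLF) {
    case EX_NEWLINE_CR:   return "CR";
    case EX_NEWLINE_LF:   return "LF";
    case EX_NEWLINE_CRLF: return "CRLF";
    default:              return "build-time default";
    }
}
```

When neither bit is set, the default that was chosen when PCRE was built applies, which is why the last case does not name a specific sequence.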
1080    
1081           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1082    
1083         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
1084         theses in the pattern. Any opening parenthesis that is not followed  by         theses  in the pattern. Any opening parenthesis that is not followed by
1085         ?  behaves as if it were followed by ?: but named parentheses can still         ? behaves as if it were followed by ?: but named parentheses can  still
1086         be used for capturing (and they acquire  numbers  in  the  usual  way).         be  used  for  capturing  (and  they acquire numbers in the usual way).
1087         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
1088    
1089           PCRE_UNGREEDY           PCRE_UNGREEDY
1090    
1091         This  option  inverts  the "greediness" of the quantifiers so that they         This option inverts the "greediness" of the quantifiers  so  that  they
1092         are not greedy by default, but become greedy if followed by "?". It  is         are  not greedy by default, but become greedy if followed by "?". It is
1093         not  compatible  with Perl. It can also be set by a (?U) option setting         not compatible with Perl. It can also be set by a (?U)  option  setting
1094         within the pattern.         within the pattern.
1095    
1096           PCRE_UTF8           PCRE_UTF8
1097    
1098         This option causes PCRE to regard both the pattern and the  subject  as         This  option  causes PCRE to regard both the pattern and the subject as
1099         strings  of  UTF-8 characters instead of single-byte character strings.         strings of UTF-8 characters instead of single-byte  character  strings.
1100         However, it is available only when PCRE is built to include UTF-8  sup-         However,  it is available only when PCRE is built to include UTF-8 sup-
1101         port.  If not, the use of this option provokes an error. Details of how         port. If not, the use of this option provokes an error. Details of  how
1102         this option changes the behaviour of PCRE are given in the  section  on         this  option  changes the behaviour of PCRE are given in the section on
1103         UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1104    
1105           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1106    
1107         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1108         automatically checked. If an invalid UTF-8 sequence of bytes is  found,         automatically  checked. If an invalid UTF-8 sequence of bytes is found,
1109         pcre_compile()  returns an error. If you already know that your pattern         pcre_compile() returns an error. If you already know that your  pattern
1110         is valid, and you want to skip this check for performance reasons,  you         is  valid, and you want to skip this check for performance reasons, you
1111         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of         can set the PCRE_NO_UTF8_CHECK option. When it is set,  the  effect  of
1112         passing an invalid UTF-8 string as a pattern is undefined. It may cause         passing an invalid UTF-8 string as a pattern is undefined. It may cause
1113         your  program  to  crash.   Note that this option can also be passed to         your program to crash.  Note that this option can  also  be  passed  to
1114         pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity  check-         pcre_exec()  and pcre_dfa_exec(), to suppress the UTF-8 validity check-
1115         ing of subject strings.         ing of subject strings.
1116    
1117    
1118  COMPILATION ERROR CODES  COMPILATION ERROR CODES
1119    
1120         The  following  table  lists  the  error  codes that may be returned by         The following table lists the error  codes  that  may  be  returned  by
1121         pcre_compile2(), along with the error messages that may be returned  by         pcre_compile2(),  along with the error messages that may be returned by
1122         both compiling functions.         both compiling functions.
1123    
1124            0  no error            0  no error
# Line 1067  COMPILATION ERROR CODES Line 1147  COMPILATION ERROR CODES
1147           23  internal error: code overflow           23  internal error: code overflow
1148           24  unrecognized character after (?<           24  unrecognized character after (?<
1149           25  lookbehind assertion is not fixed length           25  lookbehind assertion is not fixed length
1150           26  malformed number after (?(           26  malformed number or name after (?(
1151           27  conditional group contains more than two branches           27  conditional group contains more than two branches
1152           28  assertion expected after (?(           28  assertion expected after (?(
1153           29  (?R or (?digits must be followed by )           29  (?R or (?digits must be followed by )
# Line 1084  COMPILATION ERROR CODES Line 1164  COMPILATION ERROR CODES
1164           40  recursive call could loop indefinitely           40  recursive call could loop indefinitely
1165           41  unrecognized character after (?P           41  unrecognized character after (?P
1166           42  syntax error after (?P           42  syntax error after (?P
1167           43  two named groups have the same name           43  two named subpatterns have the same name
1168           44  invalid UTF-8 string           44  invalid UTF-8 string
1169           45  support for \P, \p, and \X has not been compiled           45  support for \P, \p, and \X has not been compiled
1170           46  malformed \P or \p sequence           46  malformed \P or \p sequence
1171           47  unknown property name after \P or \p           47  unknown property name after \P or \p
1172             48  subpattern name is too long (maximum 32 characters)
1173             49  too many named subpatterns (maximum 10,000)
1174             50  repeated subpattern is too long
1175             51  octal value is greater than \377 (not in UTF-8 mode)
1176    
1177    
1178  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1096  STUDYING A PATTERN Line 1180  STUDYING A PATTERN
1180         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1181              const char **errptr);              const char **errptr);
1182    
1183         If  a  compiled  pattern is going to be used several times, it is worth         If a compiled pattern is going to be used several times,  it  is  worth
1184         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
1185         matching.  The function pcre_study() takes a pointer to a compiled pat-         matching. The function pcre_study() takes a pointer to a compiled  pat-
1186         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
1187         information  that  will  help speed up matching, pcre_study() returns a         information that will help speed up matching,  pcre_study()  returns  a
1188         pointer to a pcre_extra block, in which the study_data field points  to         pointer  to a pcre_extra block, in which the study_data field points to
1189         the results of the study.         the results of the study.
1190    
1191         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1192         pcre_exec(). However, a pcre_extra block  also  contains  other  fields         pcre_exec().  However,  a  pcre_extra  block also contains other fields
1193         that  can  be  set  by the caller before the block is passed; these are         that can be set by the caller before the block  is  passed;  these  are
1194         described below in the section on matching a pattern.         described below in the section on matching a pattern.
1195    
1196         If studying the pattern does not  produce  any  additional  information,         If  studying  the  pattern  does not produce any additional information,
1197         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study() returns NULL. In that circumstance, if the calling program
1198         wants to pass any of the other fields to pcre_exec(), it  must  set  up         wants  to  pass  any of the other fields to pcre_exec(), it must set up
1199         its own pcre_extra block.         its own pcre_extra block.
1200    
1201         The  second  argument of pcre_study() contains option bits. At present,         The second argument of pcre_study() contains option bits.  At  present,
1202         no options are defined, and this argument should always be zero.         no options are defined, and this argument should always be zero.
1203    
1204         The third argument for pcre_study() is a pointer for an error  message.         The  third argument for pcre_study() is a pointer for an error message.
1205         If  studying  succeeds  (even  if no data is returned), the variable it         If studying succeeds (even if no data is  returned),  the  variable  it
1206         points to is set to NULL. Otherwise it is set to  point  to  a  textual         points  to  is  set  to NULL. Otherwise it is set to point to a textual
1207         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
1208         must not try to free it. You should test the  error  pointer  for  NULL         must  not  try  to  free it. You should test the error pointer for NULL
1209         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
1210    
1211         This is a typical call to pcre_study():         This is a typical call to pcre_study():
# Line 1133  STUDYING A PATTERN Line 1217  STUDYING A PATTERN
1217             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1218    
1219         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1220         that do not have a single fixed starting character. A bitmap of  possi-         that  do not have a single fixed starting character. A bitmap of possi-
1221         ble starting bytes is created.         ble starting bytes is created.
1222    
1223    
1224  LOCALE SUPPORT  LOCALE SUPPORT
1225    
1226         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
1227         letters, digits, or whatever, by reference to a set of  tables,  indexed         letters,  digits,  or whatever, by reference to a set of tables, indexed
1228         by  character  value.  When running in UTF-8 mode, this applies only to         by character value. When running in UTF-8 mode, this  applies  only  to
1229         characters with codes less than 128. Higher-valued  codes  never  match         characters  with  codes  less than 128. Higher-valued codes never match
1230         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1231         with Unicode character property support. The use of locales  with  Uni-         with  Unicode  character property support. The use of locales with Uni-
1232         code is discouraged.         code is discouraged.
1233    
1234         An  internal set of tables is created in the default C locale when PCRE         An internal set of tables is created in the default C locale when  PCRE
1235         is built. This is used when the final  argument  of  pcre_compile()  is         is  built.  This  is  used when the final argument of pcre_compile() is
1236         NULL,  and  is  sufficient for many applications. An alternative set of         NULL, and is sufficient for many applications. An  alternative  set  of
1237         tables can, however, be supplied. These may be created in  a  different         tables  can,  however, be supplied. These may be created in a different
1238         locale  from the default. As more and more applications change to using         locale from the default. As more and more applications change to  using
1239         Unicode, the need for this locale support is expected to die away.         Unicode, the need for this locale support is expected to die away.
1240    
1241         External tables are built by calling  the  pcre_maketables()  function,         External  tables  are  built by calling the pcre_maketables() function,
1242         which  has no arguments, in the relevant locale. The result can then be         which has no arguments, in the relevant locale. The result can then  be
1243         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1244         example,  to  build  and use tables that are appropriate for the French         example, to build and use tables that are appropriate  for  the  French
1245         locale (where accented characters with  values  greater  than  128  are         locale  (where  accented  characters  with  values greater than 128 are
1246         treated as letters), the following code could be used:         treated as letters), the following code could be used:
1247    
1248           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
1249           tables = pcre_maketables();           tables = pcre_maketables();
1250           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1251    
1252         When  pcre_maketables()  runs,  the  tables are built in memory that is         When pcre_maketables() runs, the tables are built  in  memory  that  is
1253         obtained via pcre_malloc. It is the caller's responsibility  to  ensure         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1254         that  the memory containing the tables remains available for as long as         that the memory containing the tables remains available for as long  as
1255         it is needed.         it is needed.
1256    
1257         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
1258         pattern,  and the same tables are used via this pointer by pcre_study()         pattern, and the same tables are used via this pointer by  pcre_study()
1259         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
1260         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
1261         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
1262    
1263         It is possible to pass a table pointer or NULL (indicating the  use  of         It  is  possible to pass a table pointer or NULL (indicating the use of
1264         the  internal  tables)  to  pcre_exec(). Although not intended for this         the internal tables) to pcre_exec(). Although  not  intended  for  this
1265         purpose, this facility could be used to match a pattern in a  different         purpose,  this facility could be used to match a pattern in a different
1266         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
1267         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
1268    
# Line 1188  INFORMATION ABOUT A PATTERN Line 1272  INFORMATION ABOUT A PATTERN
1272         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1273              int what, void *where);              int what, void *where);
1274    
1275         The pcre_fullinfo() function returns information about a compiled  pat-         The  pcre_fullinfo() function returns information about a compiled pat-
1276         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1277         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1278    
1279         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
1280         pattern.  The second argument is the result of pcre_study(), or NULL if         pattern. The second argument is the result of pcre_study(), or NULL  if
1281         the pattern was not studied. The third argument specifies  which  piece         the  pattern  was not studied. The third argument specifies which piece
1282         of  information  is required, and the fourth argument is a pointer to a         of information is required, and the fourth argument is a pointer  to  a
1283         variable to receive the data. The yield of the  function  is  zero  for         variable  to  receive  the  data. The yield of the function is zero for
1284         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1285    
1286           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 1204  INFORMATION ABOUT A PATTERN Line 1288  INFORMATION ABOUT A PATTERN
1288           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1289           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1290    
1291         The  "magic  number" is placed at the start of each compiled pattern as         The "magic number" is placed at the start of each compiled  pattern  as
1292         a simple check against passing an arbitrary memory pointer. Here is  a         a  simple check against passing an arbitrary memory pointer. Here is a
1293         typical  call  of pcre_fullinfo(), to obtain the length of the compiled         typical call of pcre_fullinfo(), to obtain the length of  the  compiled
1294         pattern:         pattern:
1295    
1296           int rc;           int rc;
1297           unsigned long int length;           size_t length;
1298           rc = pcre_fullinfo(           rc = pcre_fullinfo(
1299             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
1300             pe,               /* result of pcre_study(), or NULL */             pe,               /* result of pcre_study(), or NULL */
1301             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1302             &length);         /* where to put the data */             &length);         /* where to put the data */
1303    
1304         The possible values for the third argument are defined in  pcre.h,  and         The  possible  values for the third argument are defined in pcre.h, and
1305         are as follows:         are as follows:
1306    
1307           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1308    
1309         Return  the  number  of  the highest back reference in the pattern. The         Return the number of the highest back reference  in  the  pattern.  The
1310         fourth argument should point to an int variable. Zero  is  returned  if         fourth  argument  should  point to an int variable. Zero is returned if
1311         there are no back references.         there are no back references.
1312    
1313           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1314    
1315         Return  the  number of capturing subpatterns in the pattern. The fourth         Return the number of capturing subpatterns in the pattern.  The  fourth
1316         argument should point to an int variable.         argument should point to an int variable.
1317    
1318           PCRE_INFO_DEFAULT_TABLES           PCRE_INFO_DEFAULT_TABLES
1319    
1320         Return a pointer to the internal default character tables within  PCRE.         Return  a pointer to the internal default character tables within PCRE.
1321         The  fourth  argument should point to an unsigned char * variable. This         The fourth argument should point to an unsigned char *  variable.  This
1322         information call is provided for internal use by the pcre_study() func-         information call is provided for internal use by the pcre_study() func-
1323         tion.  External  callers  can  cause PCRE to use its internal tables by         tion. External callers can cause PCRE to use  its  internal  tables  by
1324         passing a NULL table pointer.         passing a NULL table pointer.
1325    
1326           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1327    
1328         Return information about the first byte of any matched  string,  for  a         Return  information  about  the first byte of any matched string, for a
1329         non-anchored    pattern.    (This    option    used    to   be   called         non-anchored pattern. The fourth argument should point to an int  vari-
1330         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1331         compatibility.)         is still recognized for backwards compatibility.)
1332    
1333         If  there  is  a  fixed first byte, for example, from a pattern such as         If there is a fixed first byte, for example, from  a  pattern  such  as
1334         (cat|cow|coyote), it is returned in the integer pointed  to  by  where.         (cat|cow|coyote), it is returned in the integer pointed to by where. Otherwise, if either
        Otherwise, if either  
1335    
1336         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1337         branch starts with "^", or         branch starts with "^", or
# Line 1284  INFORMATION ABOUT A PATTERN Line 1367  INFORMATION ABOUT A PATTERN
1367    
1368         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1369         ses. The names are just an additional way of identifying the  parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1370         ses,  which  still  acquire  numbers.  A  convenience  function  called         ses, which still acquire numbers. Several convenience functions such as
1371         pcre_get_named_substring() is provided  for  extracting  an  individual         pcre_get_named_substring() are provided for  extracting  captured  sub-
1372         captured  substring  by  name.  It is also possible to extract the data         strings  by  name. It is also possible to extract the data directly, by
1373         directly, by first converting the name to a number in order  to  access         first converting the name to a number in order to  access  the  correct
1374         the  correct  pointers in the output vector (described with pcre_exec()         pointers in the output vector (described with pcre_exec() below). To do
1375         below). To do the conversion, you need to use the  name-to-number  map,         the conversion, you need  to  use  the  name-to-number  map,  which  is
1376         which is described by these three values.         described by these three values.
1377    
1378         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1379         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
# Line 1300  INFORMATION ABOUT A PATTERN Line 1383  INFORMATION ABOUT A PATTERN
1383         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1384         sis,  most  significant byte first. The rest of the entry is the corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1385         sponding name, zero terminated. The names are  in  alphabetical  order.         sponding name, zero terminated. The names are  in  alphabetical  order.
1386         For  example,  consider  the following pattern (assume PCRE_EXTENDED is         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1387         set, so white space - including newlines - is ignored):         theses numbers. For example, consider  the  following  pattern  (assume
1388           PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1389           ignored):
1390    
1391           (?P<date> (?P<year>(\d\d)?\d\d) -           (?P<date> (?P<year>(\d\d)?\d\d) -
1392           (?P<month>\d\d) - (?P<day>\d\d) )           (?P<month>\d\d) - (?P<day>\d\d) )
# Line 1317  INFORMATION ABOUT A PATTERN Line 1402  INFORMATION ABOUT A PATTERN
1402           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1403    
1404         When  writing  code  to  extract  data from named subpatterns using the         When  writing  code  to  extract  data from named subpatterns using the
1405         name-to-number map, remember that the length of each entry is likely to         name-to-number map, remember that the length of the entries  is  likely
1406         be different for each compiled pattern.         to be different for each compiled pattern.
1407    
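The table layout described above can be walked with a few lines of C. The following is a minimal sketch, not part of PCRE: find_group_number() and example_table are hypothetical names, and the table contents are reconstructed from the group numbering of the date pattern, with the padding bytes (shown as ?? in the listing) assumed to be zero. In real use the table pointer, entry count and entry size would come from pcre_fullinfo() with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Look up a group name in a name-to-number table: name_count entries of
   entry_size bytes each. Bytes 0-1 of an entry hold the group number,
   most significant byte first; the zero-terminated name follows.
   Returns the group number, or -1 if the name is not present.
   find_group_number is a hypothetical helper, not a PCRE function. */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
    assert(table != NULL && name != NULL && entry_size > 2);
    for (int i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Table for the date pattern: four entries of eight bytes each.
   Padding bytes are assumed to be zero for this illustration. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

Because the entry size is passed in rather than hard-coded, the same helper works for any compiled pattern. A linear scan keeps the sketch short; since the names are in alphabetical order, a binary search would also be possible.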
1408           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1409    
# Line 1517  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1602  MATCHING A PATTERN: THE TRADITIONAL FUNC
1602     Option bits for pcre_exec()     Option bits for pcre_exec()
1603    
1604         The unused bits of the options argument for pcre_exec() must  be  zero.         The unused bits of the options argument for pcre_exec() must  be  zero.
1605         The   only  bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NOTBOL,         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
1606         PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1607           PCRE_PARTIAL.
1608    
1609           PCRE_ANCHORED           PCRE_ANCHORED
1610    
1611         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
1612         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
1613         turned out to be anchored by virtue of its contents, it cannot be  made         turned  out to be anchored by virtue of its contents, it cannot be made
1614         unanchored at matching time.         unanchored at matching time.
1615    
1616             PCRE_NEWLINE_CR
1617             PCRE_NEWLINE_LF
1618             PCRE_NEWLINE_CRLF
1619    
1620           These options override  the  newline  definition  that  was  chosen  or
1621           defaulted  when the pattern was compiled. For details, see the descrip-
1622           tion of pcre_compile() above. During matching, the newline choice affects
1623           the behaviour of the dot, circumflex, and dollar metacharacters.
1624    
1625           PCRE_NOTBOL           PCRE_NOTBOL
1626    
1627         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
# Line 1662  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1757  MATCHING A PATTERN: THE TRADITIONAL FUNC
1757         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
1758         tor[1], identify the portion of  the  subject  string  matched  by  the         tor[1], identify the portion of  the  subject  string  matched  by  the
1759         entire  pattern.  The next pair is used for the first capturing subpat-         entire  pattern.  The next pair is used for the first capturing subpat-
1760         tern, and so on. The value returned by pcre_exec()  is  the  number  of         tern, and so on. The value returned by pcre_exec() is one more than the
1761         pairs  that  have  been set. If there are no capturing subpatterns, the         highest numbered pair that has been set. For example, if two substrings
1762         return value from a successful match is 1,  indicating  that  just  the         have been captured, the returned value is 3. If there are no  capturing
1763         first pair of offsets has been set.         subpatterns,  the return value from a successful match is 1, indicating
1764           that just the first pair of offsets has been set.
        Some  convenience  functions  are  provided for extracting the captured  
        substrings as separate strings. These are described  in  the  following  
        section.  
   
        It  is  possible  for  an capturing subpattern number n+1 to match some  
        part of the subject when subpattern n has not been  used  at  all.  For  
        example, if the string "abc" is matched against the pattern (a|(z))(bc)  
        subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both  
        offset values corresponding to the unused subpattern are set to -1.  
1765    
1766         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1767         of the string that it matched that is returned.         of the string that it matched that is returned.
1768    
1769         If the vector is too small to hold all the captured substring  offsets,         If  the vector is too small to hold all the captured substring offsets,
1770         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
1771         function returns a value of zero. In particular, if the substring  off-         function  returns a value of zero. In particular, if the substring off-
1772         sets are not of interest, pcre_exec() may be called with ovector passed         sets are not of interest, pcre_exec() may be called with ovector passed
1773         as NULL and ovecsize as zero. However, if  the  pattern  contains  back         as  NULL  and  ovecsize  as zero. However, if the pattern contains back
1774         references  and  the  ovector is not big enough to remember the related         references and the ovector is not big enough to  remember  the  related
1775         substrings, PCRE has to get additional memory for use during  matching.         substrings,  PCRE has to get additional memory for use during matching.
1776         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1777    
1778         Note  that  pcre_info() can be used to find out how many capturing sub-         The pcre_info() function can be used to find  out  how  many  capturing
1779         patterns there are in a compiled pattern. The smallest size for ovector         subpatterns  there  are  in  a  compiled pattern. The smallest size for
1780         that  will  allow for n captured substrings, in addition to the offsets         ovector that will allow for n captured substrings, in addition  to  the
1781         of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
1782    
1783           It  is  possible for capturing subpattern number n+1 to match some part
1784           of the subject when subpattern n has not been used at all. For example,
1785           if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
1786           return from the function is 4, and subpatterns 1 and 3 are matched, but
1787           2  is  not.  When  this happens, both values in the offset pairs corre-
1788           sponding to unused subpatterns are set to -1.
1789    
1790           Offset values that correspond to unused subpatterns at the end  of  the
1791           expression  are  also  set  to  -1. For example, if the string "abc" is
1792           matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
1793           matched.  The  return  from the function is 2, because the highest used
1794           capturing subpattern number is 1. However, you can refer to the offsets
1795           for  the  second  and third capturing subpatterns if you wish (assuming
1796           the vector is large enough, of course).
1797    
1798           Some convenience functions are provided  for  extracting  the  captured
1799           substrings as separate strings. These are described below.
1800    
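The rules above for sizing ovector and for recognizing unused subpatterns can be sketched in plain C. This is an illustration only: ovector_size_for(), group_is_set() and group_length() are hypothetical helpers, not PCRE functions, and they operate on an offsets vector that pcre_exec() is assumed to have filled in already.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest ovector size, in ints, that allows for n captured substrings
   plus the offsets of the whole match: (n+1)*3. */
int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}

/* Nonzero if capturing subpattern `group` took part in the match.
   `rc` is the value returned by pcre_exec(). Pairs numbered rc or higher
   are treated as unset without being read; pairs below rc are unset when
   both offsets are -1. */
int group_is_set(const int *ovector, int rc, int group)
{
    assert(ovector != NULL);
    if (group < 0 || group >= rc)
        return 0;
    return ovector[2 * group] >= 0 && ovector[2 * group + 1] >= 0;
}

/* Length in bytes of the text matched by `group`, or -1 if it is unset. */
int group_length(const int *ovector, int rc, int group)
{
    if (!group_is_set(ovector, rc, group))
        return -1;
    return ovector[2 * group + 1] - ovector[2 * group];
}
```

For "abc" matched against (a|(z))(bc), the return value is 4 and the first eight elements of the vector would be 0,3, 0,1, -1,-1, 1,3. With those values group_is_set() reports subpatterns 1 and 3 as set and subpattern 2 as unset.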
1801     Return values from pcre_exec()     Error return values from pcre_exec()
1802    
1803         If pcre_exec() fails, it returns a negative number. The  following  are         If  pcre_exec()  fails, it returns a negative number. The following are
1804         defined in the header file:         defined in the header file:
1805    
1806           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 1705  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1809  MATCHING A PATTERN: THE TRADITIONAL FUNC
1809    
1810           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
1811    
1812         Either  code  or  subject  was  passed as NULL, or ovector was NULL and         Either code or subject was passed as NULL,  or  ovector  was  NULL  and
1813         ovecsize was not zero.         ovecsize was not zero.
1814    
1815           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 1714  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1818  MATCHING A PATTERN: THE TRADITIONAL FUNC
1818    
1819           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
1820    
1821         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE  stores a 4-byte "magic number" at the start of the compiled code,
1822         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
1823         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
1824         an  environment  with the other endianness. This is the error that PCRE         an environment with the other endianness. This is the error  that  PCRE
1825         gives when the magic number is not present.         gives when the magic number is not present.
1826    
1827           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_NODE   (-5)
1828    
1829         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
1830         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled pattern. This error could be caused by a bug  in  PCRE  or  by
1831         overwriting of the compiled pattern.         overwriting of the compiled pattern.
1832    
1833           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1834    
1835         If a pattern contains back references, but the ovector that  is  passed         If  a  pattern contains back references, but the ovector that is passed
1836         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
1837         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE  gets  a  block of memory at the start of matching to use for this
1838         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose. If the call via pcre_malloc() fails, this error is given.  The
1839         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
1840    
1841           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1842    
1843         This error is used by the pcre_copy_substring(),  pcre_get_substring(),         This  error is used by the pcre_copy_substring(), pcre_get_substring(),
1844         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
1845         returned by pcre_exec().         returned by pcre_exec().
1846    
1847           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1848    
1849         The backtracking limit, as specified by  the  match_limit  field  in  a         The  backtracking  limit,  as  specified  by the match_limit field in a
1850         pcre_extra  structure  (or  defaulted) was reached. See the description         pcre_extra structure (or defaulted) was reached.  See  the  description
1851         above.         above.
1852    
1853           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
1854    
1855         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
1856         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field in a pcre_extra structure (or defaulted)  was  reached.  See  the
1857         description above.         description above.
1858    
1859           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
1860    
1861         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
1862         use  by  callout functions that want to yield a distinctive error code.         use by callout functions that want to yield a distinctive  error  code.
1863         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
1864    
1865           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
1866    
1867         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A  string  that contains an invalid UTF-8 byte sequence was passed as a
1868         subject.         subject.
1869    
1870           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
1871    
1872         The UTF-8 byte sequence that was passed as a subject was valid, but the         The UTF-8 byte sequence that was passed as a subject was valid, but the
1873         value of startoffset did not point to the beginning of a UTF-8  charac-         value  of startoffset did not point to the beginning of a UTF-8 charac-
1874         ter.         ter.
1875    
1876           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
1877    
1878         The  subject  string did not match, but it did match partially. See the         The subject string did not match, but it did match partially.  See  the
1879         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
1880    
1881           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
1882    
1883         The PCRE_PARTIAL option was used with  a  compiled  pattern  containing         The  PCRE_PARTIAL  option  was  used with a compiled pattern containing
1884         items  that are not supported for partial matching. See the pcrepartial         items that are not supported for partial matching. See the  pcrepartial
1885         documentation for details of partial matching.         documentation for details of partial matching.
1886    
1887           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
1888    
1889         An unexpected internal error has occurred. This error could  be  caused         An  unexpected  internal error has occurred. This error could be caused
1890         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
1891    
1892           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
1893    
1894         This  error is given if the value of the ovecsize argument is negative.         This error is given if the value of the ovecsize argument is  negative.
1895    
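When reporting failures it can help to turn the negative return values listed above into readable text. The following is a minimal sketch: exec_error_text() is a hypothetical helper, not a PCRE function, and the numeric values are simply copied from the list above. The #ifndef guard is an assumption made so that the fragment compiles on its own as well as alongside pcre.h.

```c
#include <assert.h>   /* only needed by callers that check the results */
#include <string.h>   /* only needed by callers that compare the messages */

/* Values copied from the list above. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH         (-1)
#define PCRE_ERROR_NULL            (-2)
#define PCRE_ERROR_BADOPTION       (-3)
#define PCRE_ERROR_BADMAGIC        (-4)
#define PCRE_ERROR_UNKNOWN_NODE    (-5)
#define PCRE_ERROR_NOMEMORY        (-6)
#define PCRE_ERROR_NOSUBSTRING     (-7)
#define PCRE_ERROR_MATCHLIMIT      (-8)
#define PCRE_ERROR_CALLOUT         (-9)
#define PCRE_ERROR_BADUTF8         (-10)
#define PCRE_ERROR_BADUTF8_OFFSET  (-11)
#define PCRE_ERROR_PARTIAL         (-12)
#define PCRE_ERROR_BADPARTIAL      (-13)
#define PCRE_ERROR_INTERNAL        (-14)
#define PCRE_ERROR_BADCOUNT        (-15)
#define PCRE_ERROR_RECURSIONLIMIT  (-21)
#endif

/* Hypothetical helper: map a pcre_exec() return value to a short message. */
const char *exec_error_text(int rc)
{
    if (rc >= 0)
        return "not an error";
    switch (rc) {
    case PCRE_ERROR_NOMATCH:        return "no match";
    case PCRE_ERROR_NULL:           return "NULL argument";
    case PCRE_ERROR_BADOPTION:      return "unrecognized option bit";
    case PCRE_ERROR_BADMAGIC:       return "magic number not found";
    case PCRE_ERROR_UNKNOWN_NODE:   return "unknown item in compiled pattern";
    case PCRE_ERROR_NOMEMORY:       return "out of memory";
    case PCRE_ERROR_NOSUBSTRING:    return "no such substring";
    case PCRE_ERROR_MATCHLIMIT:     return "backtracking limit reached";
    case PCRE_ERROR_CALLOUT:        return "error raised by callout";
    case PCRE_ERROR_BADUTF8:        return "invalid UTF-8 in subject";
    case PCRE_ERROR_BADUTF8_OFFSET: return "startoffset not at a UTF-8 character";
    case PCRE_ERROR_PARTIAL:        return "partial match";
    case PCRE_ERROR_BADPARTIAL:     return "pattern not supported for partial matching";
    case PCRE_ERROR_INTERNAL:       return "unexpected internal error";
    case PCRE_ERROR_BADCOUNT:       return "negative ovecsize";
    case PCRE_ERROR_RECURSIONLIMIT: return "recursion limit reached";
    default:                        return "unknown error code";
    }
}
```

A caller would typically treat PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL as ordinary outcomes and the remaining codes as failures; that policy is a design choice of the application, not something the library imposes.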
1896    
1897  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
# Line 1803  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1907  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1907         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
1908              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
1909    
1910         Captured substrings can be  accessed  directly  by  using  the  offsets         Captured  substrings  can  be  accessed  directly  by using the offsets
1911         returned  by  pcre_exec()  in  ovector.  For convenience, the functions         returned by pcre_exec() in  ovector.  For  convenience,  the  functions
1912         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
1913         string_list()  are  provided for extracting captured substrings as new,         string_list() are provided for extracting captured substrings  as  new,
1914         separate, zero-terminated strings. These functions identify  substrings         separate,  zero-terminated strings. These functions identify substrings
1915         by  number.  The  next section describes functions for extracting named         by number. The next section describes functions  for  extracting  named
1916         substrings. A substring  that  contains  a  binary  zero  is  correctly         substrings.
1917         extracted  and  has  a further zero added on the end, but the result is  
1918         not, of course, a C string.         A  substring that contains a binary zero is correctly extracted and has
1919           a further zero added on the end, but the result is not, of course, a  C
1920           string.   However,  you  can  process such a string by referring to the
1921           length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
1922           string().  Unfortunately, the interface to pcre_get_substring_list() is
1923           not adequate for handling strings containing binary zeros, because  the
1924           end of the final string is not independently indicated.
1925    
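The point about binary zeros can be illustrated without calling PCRE at all. In the sketch below, copy_span() is a hypothetical stand-in that mimics the behaviour described above (copy the bytes, add a terminating zero, return the length); it is not the PCRE implementation. It shows why the returned length, rather than strlen(), has to be used when a substring may contain a zero byte.

```c
#include <assert.h>
#include <string.h>

/* Copy subject[start..end) into buffer, append a zero byte, and return
   the number of bytes copied (not counting the added zero). Returns -1
   if the offsets are invalid or the buffer is too small.
   copy_span is a hypothetical helper for illustration only. */
int copy_span(const char *subject, int start, int end,
              char *buffer, int buffersize)
{
    assert(subject != NULL && buffer != NULL);
    int length = end - start;
    if (start < 0 || length < 0 || length + 1 > buffersize)
        return -1;
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

If the five bytes 'a', 'b', zero, 'c', 'd' are copied this way, the returned length is 5 while strlen() on the buffer stops at the embedded zero and reports 2. Code that must handle such data should carry the length alongside the pointer.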
1926         The first three arguments are the same for all  three  of  these  func-         The  first  three  arguments  are the same for all three of these func-
1927         tions:  subject  is  the subject string that has just been successfully         tions: subject is the subject string that has  just  been  successfully
1928         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
1929         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
1930         were captured by the match, including the substring  that  matched  the         were  captured  by  the match, including the substring that matched the
1931         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
1932         it is greater than zero. If pcre_exec() returned zero, indicating  that         it  is greater than zero. If pcre_exec() returned zero, indicating that
1933         it  ran out of space in ovector, the value passed as stringcount should         it ran out of space in ovector, the value passed as stringcount  should
1934         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
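The rule above for choosing stringcount can be sketched as a small helper.
This is only an illustration: stringcount_for() is a hypothetical name and
is not part of the PCRE API; rc stands for the value returned by
pcre_exec() and ovecsize for the number of elements in ovector.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): pick the value to pass as
   stringcount to the substring extraction functions.
   rc       - the value returned by pcre_exec()
   ovecsize - the number of int elements in ovector
   Returns -1 when rc is negative (no match, or an error). */
int stringcount_for(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;            /* number of substrings that were captured */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small to hold them all */
    return -1;                /* nothing to extract */
}
```

With a 30-element ovector, a return of 0 from pcre_exec() therefore gives
a stringcount of 10.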
1935    
1936         The functions pcre_copy_substring() and pcre_get_substring() extract  a         The  functions pcre_copy_substring() and pcre_get_substring() extract a
1937         single  substring,  whose  number  is given as stringnumber. A value of         single substring, whose number is given as  stringnumber.  A  value  of
1938         zero extracts the substring that matched the  entire  pattern,  whereas         zero  extracts  the  substring that matched the entire pattern, whereas
1939         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-
1940         string(), the string is placed in buffer,  whose  length  is  given  by         string(),  the  string  is  placed  in buffer, whose length is given by
1941         buffersize,  while  for  pcre_get_substring()  a new block of memory is         buffersize, while for pcre_get_substring() a new  block  of  memory  is
1942         obtained via pcre_malloc, and its address is  returned  via  stringptr.         obtained  via  pcre_malloc,  and its address is returned via stringptr.
1943         The  yield  of  the function is the length of the string, not including         The yield of the function is the length of the  string,  not  including
1944         the terminating zero, or one of         the terminating zero, or one of
1945    
1946           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1947    
1948         The buffer was too small for pcre_copy_substring(), or the  attempt  to         The  buffer  was too small for pcre_copy_substring(), or the attempt to
1949         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
1950    
1951           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
1952    
1953         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
1954    
1955         The  pcre_get_substring_list()  function  extracts  all  available sub-         The pcre_get_substring_list()  function  extracts  all  available  sub-
1956         strings and builds a list of pointers to them. All this is  done  in  a         strings  and  builds  a list of pointers to them. All this is done in a
1957         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
1958         the memory block is returned via listptr, which is also  the  start  of         the  memory  block  is returned via listptr, which is also the start of
1959         the  list  of  string pointers. The end of the list is marked by a NULL         the list of string pointers. The end of the list is marked  by  a  NULL
1960         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all went well, or
1961    
1962           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
1963    
1964         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
1965    
1966         When any of these functions encounters a substring that is unset, which         When any of these functions encounters a substring that is unset, which
1967         can  happen  when  capturing subpattern number n+1 matches some part of         can happen when capturing subpattern number n+1 matches  some  part  of
1968         the subject, but subpattern n has not been used at all, they return  an         the  subject, but subpattern n has not been used at all, they return an
1969         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
1970         string by inspecting the appropriate offset in ovector, which is  nega-         string  by inspecting the appropriate offset in ovector, which is nega-
1971         tive for unset substrings.         tive for unset substrings.
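A minimal sketch of the check described above, assuming the ovector layout
given in the description of pcre_exec() (element 2*n is the start offset
of substring n, element 2*n+1 is the offset just past its end). The
function names are hypothetical and are not part of PCRE.

```c
#include <assert.h>

/* Hypothetical helper: non-zero if substring n was not set by the match.
   Unset substrings have a negative start offset in ovector. */
int group_is_unset(const int *ovector, int n)
{
    assert(n >= 0);
    return ovector[2 * n] < 0;
}

/* Hypothetical helper: length in bytes of substring n, or 0 if unset.
   A set substring may also have length 0, so use group_is_unset() to
   tell the two cases apart. */
int group_length(const int *ovector, int n)
{
    assert(n >= 0);
    if (group_is_unset(ovector, n))
        return 0;
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

Both an unset substring and a genuinely empty one have length 0; only the
sign of the start offset distinguishes them.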
1972    
1973         The  two convenience functions pcre_free_substring() and pcre_free_sub-         The two convenience functions pcre_free_substring() and  pcre_free_sub-
1974         string_list() can be used to free the memory  returned  by  a  previous         string_list()  can  be  used  to free the memory returned by a previous
1975         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
1976         tively. They do nothing more than  call  the  function  pointed  to  by         tively.  They  do  nothing  more  than  call the function pointed to by
1977         pcre_free,  which  of course could be called directly from a C program.         pcre_free, which of course could be called directly from a  C  program.
1978         However, PCRE is used in some situations where it is linked via a  spe-         However,  PCRE is used in some situations where it is linked via a spe-
1979         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
1980         pcre_free directly; it is for these cases that the functions  are  pro-         pcre_free  directly;  it is for these cases that the functions are pro-
1981         vided.         vided.
1982    
1983    
# Line 1886  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 1996  EXTRACTING CAPTURED SUBSTRINGS BY NAME
1996              int stringcount, const char *stringname,              int stringcount, const char *stringname,
1997              const char **stringptr);              const char **stringptr);
1998    
1999         To extract a substring by name, you first have to find the associated num-         To extract a substring by name, you first have to find the associated num-
2000         ber.  For example, for this pattern         ber.  For example, for this pattern
2001    
2002           (a+)b(?P<xxx>\d+)...           (a+)b(?P<xxx>\d+)...
2003    
2004         the number of the subpattern called "xxx" is 2. You can find the number         the number of the subpattern called "xxx" is 2. If the name is known to
2005         from the name by calling pcre_get_stringnumber(). The first argument is         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2006         the compiled pattern, and the second is the  name.  The  yield  of  the         name by calling pcre_get_stringnumber(). The first argument is the com-
2007         function  is  the  subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if         piled pattern, and the second is the name. The yield of the function is
2008         there is no subpattern of that name.         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2009           subpattern of that name.
2010    
2011         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2012         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
# Line 1917  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2028  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2028         ate.         ate.
2029    
2030    
2031    DUPLICATE SUBPATTERN NAMES
2032    
2033           int pcre_get_stringtable_entries(const pcre *code,
2034                const char *name, char **first, char **last);
2035    
2036           When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2037           subpatterns are not required to  be  unique.  Normally,  patterns  with
2038           duplicate  names  are such that in any one match, only one of the named
2039           subpatterns participates. An example is shown in the pcrepattern  docu-
2040           mentation. When duplicates are present, pcre_copy_named_substring() and
2041           pcre_get_named_substring() return the first substring corresponding  to
2042           the  given  name  that  is  set.  If  none  are set, an empty string is
2043           returned.  The pcre_get_stringnumber() function returns one of the num-
2044           bers  that are associated with the name, but it is not defined which it
2045           is.
2046    
2047           If you want to get full details of all captured substrings for a  given
2048           name,  you  must  use  the pcre_get_stringtable_entries() function. The
2049           first argument is the compiled pattern, and the second is the name. The
2050           third  and  fourth  are  pointers to variables which are updated by the
2051           function. After it has run, they point to the first and last entries in
2052           the  name-to-number  table  for  the  given  name.  The function itself
2053           returns the length of each entry, or  PCRE_ERROR_NOSUBSTRING  if  there
2054           are  none.  The  format  of the table is described above in the section
2055           entitled Information about a pattern. Given all  the  relevant  entries
2056           for the name, you can extract each of their numbers, and hence the cap-
2057           tured data, if any.
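The last step, going from table entries to numbers, might look like the
sketch below. It assumes the entry layout given for PCRE_INFO_NAMETABLE:
two bytes holding the subpattern number, most significant byte first,
followed by the zero-terminated name. The function names and the sample
data are hypothetical; first, last, and entry_size stand for the values
produced by pcre_get_stringtable_entries().

```c
#include <assert.h>

/* Hypothetical sample: two entries of 5 bytes each for the name "nm",
   belonging to subpatterns 1 and 3. */
static const char dup_sample_table[] = "\000\001nm\000\000\003nm";

/* Hypothetical sample ovector: subpatterns 1 and 2 unset, 3 set. */
static const int dup_sample_ovector[] = { 0, 4, -1, -1, -1, -1, 2, 4 };

/* Decode the subpattern number at the start of one table entry. */
int entry_group_number(const char *entry)
{
    const unsigned char *p = (const unsigned char *)entry;
    return (p[0] << 8) | p[1];
}

/* Return the number of the first subpattern among the entries from first
   to last (inclusive) that is set in ovector, or -1 if none is set.
   ovector must be large enough to hold every listed subpattern. */
int first_set_group(const char *first, const char *last,
                    int entry_size, const int *ovector)
{
    int entries, i;

    assert(entry_size > 2 && last >= first);
    entries = (int)((last - first) / entry_size) + 1;
    for (i = 0; i < entries; i++) {
        int n = entry_group_number(first + i * entry_size);
        if (ovector[2 * n] >= 0)
            return n;
    }
    return -1;
}
```

The same loop can collect every set subpattern instead of stopping at the
first one, if all captured data for the name is wanted.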
2058    
2059    
2060  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2061    
2062         The  traditional  matching  function  uses a similar algorithm to Perl,         The traditional matching function uses a  similar  algorithm  to  Perl,
2063         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
2064         the  subject.  If you want to find all possible matches, or the longest         the subject. If you want to find all possible matches, or  the  longest
2065         possible match, consider using the alternative matching  function  (see         possible  match,  consider using the alternative matching function (see
2066         below)  instead.  If you cannot use the alternative function, but still         below) instead. If you cannot use the alternative function,  but  still
2067         need to find all possible matches, you can kludge it up by  making  use         need  to  find all possible matches, you can kludge it up by making use
2068         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
2069         tation.         tation.
2070    
2071         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
2072         tern.   When your callout function is called, extract and save the cur-         tern.  When your callout function is called, extract and save the  cur-
2073         rent matched substring. Then return  1,  which  forces  pcre_exec()  to         rent  matched  substring.  Then  return  1, which forces pcre_exec() to
2074         backtrack  and  try other alternatives. Ultimately, when it runs out of         backtrack and try other alternatives. Ultimately, when it runs  out  of
2075         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2076    
2077    
# Line 1942  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2082  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2082              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
2083              int *workspace, int wscount);              int *workspace, int wscount);
2084    
2085         The function pcre_dfa_exec()  is  called  to  match  a  subject  string         The  function  pcre_dfa_exec()  is  called  to  match  a subject string
2086         against  a compiled pattern, using a "DFA" matching algorithm. This has         against a compiled pattern, using a "DFA" matching algorithm. This  has
2087         different characteristics to the normal algorithm, and is not  compati-         different  characteristics to the normal algorithm, and is not compati-
2088         ble with Perl. Some of the features of PCRE patterns are not supported.         ble with Perl. Some of the features of PCRE patterns are not supported.
2089         Nevertheless, there are times when this kind of matching can be useful.         Nevertheless, there are times when this kind of matching can be useful.
2090         For  a  discussion of the two matching algorithms, see the pcrematching         For a discussion of the two matching algorithms, see  the  pcrematching
2091         documentation.         documentation.
2092    
2093         The arguments for the pcre_dfa_exec() function  are  the  same  as  for         The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2094         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
2095         ent way, and this is described below. The other  common  arguments  are         ent  way,  and  this is described below. The other common arguments are
2096         used  in  the  same way as for pcre_exec(), so their description is not         used in the same way as for pcre_exec(), so their  description  is  not
2097         repeated here.         repeated here.
2098    
2099         The two additional arguments provide workspace for  the  function.  The         The  two  additional  arguments provide workspace for the function. The
2100         workspace  vector  should  contain at least 20 elements. It is used for         workspace vector should contain at least 20 elements. It  is  used  for
2101         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2102         workspace  will  be  needed for patterns and subjects where there are a         workspace will be needed for patterns and subjects where  there  are  a
2103         lot of possible matches.         lot of potential matches.
2104    
2105         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
2106    
# Line 1981  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2121  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2121    
2122     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2123    
2124         The unused bits of the options argument  for  pcre_dfa_exec()  must  be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2125         zero.  The  only  bits  that may be set are PCRE_ANCHORED, PCRE_NOTBOL,         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2126         PCRE_NOTEOL,    PCRE_NOTEMPTY,    PCRE_NO_UTF8_CHECK,     PCRE_PARTIAL,         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2127         PCRE_DFA_SHORTEST,  and  PCRE_DFA_RESTART.  All  but  the last three of         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2128         these are the same as for pcre_exec(),  so  their  description  is  not         three of these are the same as for pcre_exec(), so their description is
2129         repeated here.         not repeated here.
2130    
2131           PCRE_PARTIAL           PCRE_PARTIAL
2132    
2133         This  has  the  same general effect as it does for pcre_exec(), but the         This has the same general effect as it does for  pcre_exec(),  but  the
2134         details  are  slightly  different.  When  PCRE_PARTIAL   is   set   for         details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for
2135         pcre_dfa_exec(),  the  return code PCRE_ERROR_NOMATCH is converted into         pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into
2136         PCRE_ERROR_PARTIAL if the end of the subject  is  reached,  there  have         PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have
2137         been no complete matches, but there is still at least one matching pos-         been no complete matches, but there is still at least one matching pos-
2138         sibility. The portion of the string that provided the partial match  is         sibility.  The portion of the string that provided the partial match is
2139         set as the first matching string.         set as the first matching string.
2140    
2141           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
2142    
2143         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2144         stop as soon as it has found one match. Because  of  the  way  the  DFA         stop  as  soon  as  it  has found one match. Because of the way the DFA
2145         algorithm works, this is necessarily the shortest possible match at the         algorithm works, this is necessarily the shortest possible match at the
2146         first possible matching point in the subject string.         first possible matching point in the subject string.
2147    
2148           PCRE_DFA_RESTART           PCRE_DFA_RESTART
2149    
2150         When pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option,  and         When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and
2151         returns  a  partial  match, it is possible to call it again, with addi-         returns a partial match, it is possible to call it  again,  with  addi-
2152         tional subject characters, and have it continue with  the  same  match.         tional  subject  characters,  and have it continue with the same match.
2153         The  PCRE_DFA_RESTART  option requests this action; when it is set, the         The PCRE_DFA_RESTART option requests this action; when it is  set,  the
2154         workspace and wscount options must reference the same vector as  before         workspace  and wscount options must reference the same vector as before
2155         because  data  about  the  match so far is left in them after a partial         because data about the match so far is left in  them  after  a  partial
2156         match. There is more discussion of this  facility  in  the  pcrepartial         match.  There  is  more  discussion of this facility in the pcrepartial
2157         documentation.         documentation.
2158    
2159     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
2160    
2161         When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-         When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
2162         string in the subject. Note, however, that all the matches from one run         string in the subject. Note, however, that all the matches from one run
2163         of  the  function  start  at the same point in the subject. The shorter         of the function start at the same point in  the  subject.  The  shorter
2164         matches are all initial substrings of the longer matches. For  example,         matches  are all initial substrings of the longer matches. For example,
2165         if the pattern         if the pattern
2166    
2167           <.*>           <.*>
# Line 2036  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2176  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2176           <something> <something else>           <something> <something else>
2177           <something> <something else> <something further>           <something> <something else> <something further>
2178    
2179         On  success,  the  yield of the function is a number greater than zero,         On success, the yield of the function is a number  greater  than  zero,
2180         which is the number of matched substrings.  The  substrings  themselves         which  is  the  number of matched substrings. The substrings themselves
2181         are  returned  in  ovector. Each string uses two elements; the first is         are returned in ovector. Each string uses two elements;  the  first  is
2182         the offset to the start, and the second is the offset to the  end.  All         the  offset  to the start, and the second is the offset to the end. All
2183         the strings have the same start offset. (Space could have been saved by         the strings have the same start offset. (Space could have been saved by
2184         giving this only once, but it was decided to retain some  compatibility         giving  this only once, but it was decided to retain some compatibility
2185         with  the  way pcre_exec() returns data, even though the meaning of the         with the way pcre_exec() returns data, even though the meaning  of  the
2186         strings is different.)         strings is different.)
2187    
2188         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
2189         est  matching  string is given first. If there were too many matches to         est matching string is given first. If there were too many  matches  to
2190         fit into ovector, the yield of the function is zero, and the vector  is         fit  into ovector, the yield of the function is zero, and the vector is
2191         filled with the longest matches.         filled with the longest matches.
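As an illustration of this layout, the sketch below walks the offset pairs
left in ovector after a successful call. The helper names are
hypothetical, and the sample subject and offsets are assumptions chosen so
that the three matches listed above result; count stands for a positive
return value from pcre_dfa_exec().

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical sample data: the "<" is at offset 8 of the subject. */
static const char dfa_sample_subject[] =
    "This is <something> <something else> <something further> no more";
static const int dfa_sample_ovector[] = { 8, 56, 8, 36, 8, 19 };

/* Print each match, longest first, as stored in ovector. */
void print_dfa_matches(const char *subject, const int *ovector, int count)
{
    int i;
    for (i = 0; i < count; i++) {
        int start = ovector[2 * i];
        int end = ovector[2 * i + 1];
        printf("%2d: %.*s\n", i, end - start, subject + start);
    }
}

/* Non-zero if match i is exactly the string expected. */
int dfa_match_is(const char *subject, const int *ovector, int i,
                 const char *expected)
{
    int start = ovector[2 * i];
    size_t len = (size_t)(ovector[2 * i + 1] - start);
    return len == strlen(expected) &&
           memcmp(subject + start, expected, len) == 0;
}

/* Check the documented layout: every match has the same start offset,
   and the lengths never increase from one pair to the next. */
int dfa_layout_ok(const int *ovector, int count)
{
    int i;
    assert(count > 0);
    for (i = 1; i < count; i++) {
        if (ovector[2 * i] != ovector[0])
            return 0;
        if (ovector[2 * i + 1] > ovector[2 * i - 1])
            return 0;
    }
    return 1;
}
```

Calling print_dfa_matches(dfa_sample_subject, dfa_sample_ovector, 3) lists
the longest match first and the shortest last.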
2192    
2193     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
2194    
2195         The  pcre_dfa_exec()  function returns a negative number when it fails.         The pcre_dfa_exec() function returns a negative number when  it  fails.
2196         Many of the errors are the same  as  for  pcre_exec(),  and  these  are         Many  of  the  errors  are  the  same as for pcre_exec(), and these are
2197         described  above.   There are in addition the following errors that are         described above.  There are in addition the following errors  that  are
2198         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
2199    
2200           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
2201    
2202         This return is given if pcre_dfa_exec() encounters an item in the  pat-         This  return is given if pcre_dfa_exec() encounters an item in the pat-
2203         tern  that  it  does not support, for instance, the use of \C or a back         tern that it does not support, for instance, the use of \C  or  a  back
2204         reference.         reference.
2205    
2206           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
2207    
2208         This return is given if pcre_dfa_exec() encounters a condition item  in         This  return is given if pcre_dfa_exec() encounters a condition item in
2209         a  pattern  that  uses  a back reference for the condition. This is not         a pattern that uses a back reference for the  condition.  This  is  not
2210         supported.         supported.
2211    
2212           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
2213    
2214         This return is given if pcre_dfa_exec() is called with an  extra  block         This  return  is given if pcre_dfa_exec() is called with an extra block
2215         that contains a setting of the match_limit field. This is not supported         that contains a setting of the match_limit field. This is not supported
2216         (it is meaningless).         (it is meaningless).
2217    
2218           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
2219    
2220         This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the         This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
2221         workspace vector.         workspace vector.
2222    
2223           PCRE_ERROR_DFA_RECURSE    (-20)           PCRE_ERROR_DFA_RECURSE    (-20)
2224    
2225         When  a  recursive subpattern is processed, the matching function calls         When a recursive subpattern is processed, the matching  function  calls
2226         itself recursively, using private vectors for  ovector  and  workspace.         itself  recursively,  using  private vectors for ovector and workspace.
2227         This  error  is  given  if  the output vector is not large enough. This         This error is given if the output vector  is  not  large  enough.  This
2228         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
2229    
2230  Last updated: 18 January 2006  Last updated: 08 June 2006
2231  Copyright (c) 1997-2006 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
2232  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2233    
# Line 2334  DIFFERENCES BETWEEN PCRE AND PERL Line 2474  DIFFERENCES BETWEEN PCRE AND PERL
2474         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2475    
2476         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2477         cial meaning is faulted.         cial meaning  is  faulted.  Otherwise,  like  Perl,  the  backslash  is
2478           ignored. (Perl can be made to issue a warning.)
2479    
2480         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
2481         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
2482         lowed by a question mark they are.         lowed by a question mark they are.
2483    
2484         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2485         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
2486    
2487         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-
2488         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2489    
2490         (g) The (?R), (?number), and (?P>name) constructs allow for  recursive         (g)  The (?R), (?number), and (?P>name) constructs allow for recursive
2491         pattern  matching  (Perl  can  do  this using the (?p{code}) construct,         pattern matching (Perl can do  this  using  the  (?p{code})  construct,
2492         which PCRE cannot support.)         which PCRE cannot support.)
2493    
2494         (h) PCRE supports named capturing substrings, using the Python  syntax.         (h)  PCRE supports named capturing substrings, using the Python syntax.
2495    
2496         (i)  PCRE  supports  the  possessive quantifier "++" syntax, taken from         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from
2497         Sun's Java package.         Sun's Java package.
2498    
2499         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j) The (R) condition, for testing recursion, is a PCRE extension.
# Line 2364  DIFFERENCES BETWEEN PCRE AND PERL Line 2505  DIFFERENCES BETWEEN PCRE AND PERL
2505         (m) Patterns compiled by PCRE can be saved and re-used at a later time,         (m) Patterns compiled by PCRE can be saved and re-used at a later time,
2506         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
2507    
2508         (n)  The  alternative  matching function (pcre_dfa_exec()) matches in a         (n) The alternative matching function (pcre_dfa_exec())  matches  in  a
2509         different way and is not Perl-compatible.         different way and is not Perl-compatible.
2510    
2511  Last updated: 24 January 2006  Last updated: 06 June 2006
2512  Copyright (c) 1997-2006 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
2513  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2514    
# Line 2476  BACKSLASH Line 2617  BACKSLASH
2617    
2618         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
2619         the pattern (other than in a character class) and characters between  a         the pattern (other than in a character class) and characters between  a
2620         # outside a character class and the next newline character are ignored.         # outside a character class and the next newline are ignored. An escap-
2621         An escaping backslash can be used to include a whitespace or #  charac-         ing backslash can be used to include a whitespace  or  #  character  as
2622         ter as part of the pattern.         part of the pattern.
2623    
2624         If  you  want  to remove the special meaning from a sequence of charac-         If  you  want  to remove the special meaning from a sequence of charac-
2625         ters, you can do so by putting them between \Q and \E. This is  differ-         ters, you can do so by putting them between \Q and \E. This is  differ-
# Line 2535  BACKSLASH Line 2676  BACKSLASH
2676         two  syntaxes  for  \x. There is no difference in the way they are han-         two  syntaxes  for  \x. There is no difference in the way they are han-
2677         dled. For example, \xdc is exactly the same as \x{dc}.         dled. For example, \xdc is exactly the same as \x{dc}.
2678    
2679         After \0 up to two further octal digits are read.  In  both  cases,  if         After \0 up to two further octal digits are read. If  there  are  fewer
2680         there  are fewer than two digits, just those that are present are used.         than  two  digits,  just  those  that  are  present  are used. Thus the
2681         Thus the sequence \0\x\07 specifies two binary zeros followed by a  BEL         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2682         character  (code  value  7).  Make sure you supply two digits after the         (code  value 7). Make sure you supply two digits after the initial zero
2683         initial zero if the pattern character that follows is itself  an  octal         if the pattern character that follows is itself an octal digit.
        digit.  
2684    
2685         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2686         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
2687         its  as  a  decimal  number. If the number is less than 10, or if there         its as a decimal number. If the number is less than  10,  or  if  there
2688         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
2689         expression,  the  entire  sequence  is  taken  as  a  back reference. A         expression, the entire  sequence  is  taken  as  a  back  reference.  A
2690         description of how this works is given later, following the  discussion         description  of how this works is given later, following the discussion
2691         of parenthesized subpatterns.         of parenthesized subpatterns.
2692    
2693         Inside  a  character  class, or if the decimal number is greater than 9         Inside a character class, or if the decimal number is  greater  than  9
2694         and there have not been that many capturing subpatterns, PCRE  re-reads         and  there have not been that many capturing subpatterns, PCRE re-reads
2695         up  to three octal digits following the backslash, and generates a sin-         up to three octal digits following the backslash, and uses them to gen-
2696         gle byte from the least significant 8 bits of the value. Any subsequent         erate  a data character. Any subsequent digits stand for themselves. In
2697         digits stand for themselves.  For example:         non-UTF-8 mode, the value of a character specified  in  octal  must  be
2698           less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
2699           example:
2700    
2701           \040   is another way of writing a space           \040   is another way of writing a space
2702           \40    is the same, provided there are fewer than 40           \40    is the same, provided there are fewer than 40
# Line 2571  BACKSLASH Line 2713  BACKSLASH
2713           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
2714                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
2715    
2716         Note  that  octal  values of 100 or greater must not be introduced by a         Note that octal values of 100 or greater must not be  introduced  by  a
2717         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
2718    
2719         All the sequences that define a single byte value  or  a  single  UTF-8         All the sequences that define a single character value can be used both
2720         character (in UTF-8 mode) can be used both inside and outside character         inside and outside character classes. In addition, inside  a  character
2721         classes. In addition, inside a character  class,  the  sequence  \b  is         class,  the  sequence \b is interpreted as the backspace character (hex
2722         interpreted as the backspace character (hex 08), and the sequence \X is         08), and the sequence \X is interpreted as the character "X". Outside a
2723         interpreted as the character "X".  Outside  a  character  class,  these         character class, these sequences have different meanings (see below).
        sequences have different meanings (see below).  
2724    
2725     Generic character types     Generic character types
2726    
# Line 2604  BACKSLASH Line 2745  BACKSLASH
2745    
2746         For  compatibility  with Perl, \s does not match the VT character (code         For  compatibility  with Perl, \s does not match the VT character (code
2747         11).  This makes it different  from  the  POSIX  "space" class. The  \s         11).  This makes it different  from  the  POSIX  "space" class. The  \s
2748         characters are HT (9), LF (10), FF (12), CR (13), and space (32).         characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If
2749           "use locale;" is included in a Perl script, \s may match the VT charac-
2750           ter. In PCRE, it never does.)
2751    
2752         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
2753         is a letter or digit. The definition of  letters  and  digits  is  con-         is a letter or digit. The definition of  letters  and  digits  is  con-
# Line 2719  BACKSLASH Line 2862  BACKSLASH
2862         classified as a modifier or "other".         classified as a modifier or "other".
2863    
2864         The  long  synonyms  for  these  properties that Perl supports (such as         The  long  synonyms  for  these  properties that Perl supports (such as
2865         \p{Letter}) are not supported by PCRE. Nor is is  permitted  to  prefix         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
2866         any of these properties with "Is".         any of these properties with "Is".
2867    
2868         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
# Line 2777  BACKSLASH Line 2920  BACKSLASH
2920         However, if the startoffset argument of pcre_exec() is non-zero,  indi-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
2921         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
2922         the subject, \A can never match. The difference between \Z  and  \z  is         the subject, \A can never match. The difference between \Z  and  \z  is
2923         that  \Z  matches  before  a  newline that is the last character of the         that \Z matches before a newline at the end of the string as well as at
2924         string as well as at the end of the string, whereas \z matches only  at         the very end, whereas \z matches only at the end.
2925         the end.  
2926           The \G assertion is true only when the current matching position is  at
2927         The  \G assertion is true only when the current matching position is at         the  start point of the match, as specified by the startoffset argument
2928         the start point of the match, as specified by the startoffset  argument         of pcre_exec(). It differs from \A when the  value  of  startoffset  is
2929         of  pcre_exec().  It  differs  from \A when the value of startoffset is         non-zero.  By calling pcre_exec() multiple times with appropriate argu-
        non-zero. By calling pcre_exec() multiple times with appropriate  argu-  
2930         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
2931         mentation where \G can be useful.         mentation where \G can be useful.
2932    
2933         Note, however, that PCRE's interpretation of \G, as the  start  of  the         Note,  however,  that  PCRE's interpretation of \G, as the start of the
2934         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
2935         end of the previous match. In Perl, these can  be  different  when  the         end  of  the  previous  match. In Perl, these can be different when the
2936         previously  matched  string was empty. Because PCRE does just one match         previously matched string was empty. Because PCRE does just  one  match
2937         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
2938    
2939         If all the alternatives of a pattern begin with \G, the  expression  is         If  all  the alternatives of a pattern begin with \G, the expression is
2940         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
2941         in the compiled regular expression.         in the compiled regular expression.
2942    
# Line 2802  BACKSLASH Line 2944  BACKSLASH
2944  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
2945    
2946         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
2947         character  is  an  assertion  that is true only if the current matching         character is an assertion that is true only  if  the  current  matching
2948         point is at the start of the subject string. If the  startoffset  argu-         point  is  at the start of the subject string. If the startoffset argu-
2949         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
2950         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
2951         has an entirely different meaning (see below).         has an entirely different meaning (see below).
2952    
2953         Circumflex  need  not be the first character of the pattern if a number         Circumflex need not be the first character of the pattern if  a  number
2954         of alternatives are involved, but it should be the first thing in  each         of  alternatives are involved, but it should be the first thing in each
2955         alternative  in  which  it appears if the pattern is ever to match that         alternative in which it appears if the pattern is ever  to  match  that
2956         branch. If all possible alternatives start with a circumflex, that  is,         branch.  If all possible alternatives start with a circumflex, that is,
2957         if  the  pattern  is constrained to match only at the start of the sub-         if the pattern is constrained to match only at the start  of  the  sub-
2958         ject, it is said to be an "anchored" pattern.  (There  are  also  other         ject,  it  is  said  to be an "anchored" pattern. (There are also other
2959         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
2960    
2961         A  dollar  character  is  an assertion that is true only if the current         A dollar character is an assertion that is true  only  if  the  current
2962         matching point is at the end of  the  subject  string,  or  immediately         matching  point  is  at  the  end of the subject string, or immediately
2963         before a newline character that is the last character in the string (by         before a newline at the end of the string (by default). Dollar need not
2964         default). Dollar need not be the last character of  the  pattern  if  a         be  the  last  character of the pattern if a number of alternatives are
2965         number  of alternatives are involved, but it should be the last item in         involved, but it should be the last item in  any  branch  in  which  it
2966         any branch in which it appears.  Dollar has no  special  meaning  in  a         appears. Dollar has no special meaning in a character class.
        character class.  
2967    
2968         The  meaning  of  dollar  can be changed so that it matches only at the         The  meaning  of  dollar  can be changed so that it matches only at the
2969         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
2970         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
2971    
2972         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
2973         PCRE_MULTILINE option is set. When this is the case, they match immedi-         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
2974         ately  after  and  immediately  before  an  internal newline character,         matches  immediately after internal newlines as well as at the start of
2975         respectively, in addition to matching at the start and end of the  sub-         the subject string. It does not match after a  newline  that  ends  the
2976         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject         string.  A dollar matches before any newlines in the string, as well as
2977         string "def\nabc" (where \n represents a newline character)  in  multi-         at the very end, when PCRE_MULTILINE is set. When newline is  specified
2978         line mode, but not otherwise.  Consequently, patterns that are anchored         as  the  two-character  sequence CRLF, isolated CR and LF characters do
2979         in single line mode because all branches start with ^ are not  anchored         not indicate newlines.
2980         in  multiline  mode,  and  a  match for circumflex is possible when the  
2981         startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-         For example, the pattern /^abc$/ matches the subject string  "def\nabc"
2982         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.         (where  \n  represents a newline) in multiline mode, but not otherwise.
2983           Consequently, patterns that are anchored in single  line  mode  because
2984         Note  that  the sequences \A, \Z, and \z can be used to match the start         all  branches  start  with  ^ are not anchored in multiline mode, and a
2985         and end of the subject in both modes, and if all branches of a  pattern         match for circumflex is  possible  when  the  startoffset  argument  of
2986         start  with  \A it is always anchored, whether PCRE_MULTILINE is set or         pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
2987         not.         PCRE_MULTILINE is set.
2988    
2989           Note that the sequences \A, \Z, and \z can be used to match  the  start
2990           and  end of the subject in both modes, and if all branches of a pattern
2991           start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
2992           set.
2993    
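       The effect of multiline mode on circumflex and  dollar  can  be  tried
       out with a short script. The sketch below uses Python's "re" module as
       a stand-in, not PCRE itself; it assumes that newline is a single LF and
       that re.MULTILINE plays the role of  PCRE_MULTILINE.  Only  behaviour
       that the two libraries share is shown.

```python
import re

subject = "def\nabc"

# Default mode: ^ and $ anchor only at the ends of the whole subject,
# so "abc" on the second line is not found.
single = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multi = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")

print(single)
print(multi.span())
print(before_final_newline.span())
```

       The variable names are illustrative only. Details such as CRLF newline
       handling are specific to PCRE and are not modelled here.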
2994    
2995  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
2996    
2997         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
2998         ter  in  the  subject,  including a non-printing character, but not (by         ter in the subject string except (by default) a character  that  signi-
2999         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,         fies  the  end  of  a line. In UTF-8 mode, the matched character may be
3000         which might be more than one byte long, except (by default) newline. If         more than one byte long. When a line ending  is  defined  as  a  single
3001         the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-         character  (CR  or LF), dot never matches that character; when the two-
3002         dling  of dot is entirely independent of the handling of circumflex and         character sequence CRLF is used, dot does not match CR if it is immedi-
3003         dollar, the only relationship being  that  they  both  involve  newline         ately  followed by LF, but otherwise it matches all characters (includ-
3004         characters. Dot has no special meaning in a character class.         ing isolated CRs and LFs).
3005    
3006           The behaviour of dot with regard to newlines can  be  changed.  If  the
3007           PCRE_DOTALL  option  is  set,  a dot matches any one character, without
3008           exception. If newline is defined as the two-character sequence CRLF, it
3009           takes two dots to match it.
3010    
3011           The  handling of dot is entirely independent of the handling of circum-
3012           flex and dollar, the only relationship being  that  they  both  involve
3013           newlines. Dot has no special meaning in a character class.
3014    
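       A minimal sketch of the default and "dotall" behaviour of dot follows.
       It uses Python's "re" module as a stand-in for PCRE, with re.DOTALL in
       the role of PCRE_DOTALL, and assumes LF is the only newline character.
       The CRLF rules described above are PCRE-specific and are not shown.

```python
import re

subject = "a\nb"

# By default, dot does not match the newline between "a" and "b".
default_dot = re.search(r"a.b", subject)

# With the dotall option set, dot matches any one character.
dotall = re.search(r"a.b", subject, re.DOTALL)

print(default_dot)
print(dotall.group(0) == subject)
```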
3015    
3016  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
3017    
3018         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
3019         both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.         both in and out of UTF-8 mode. Unlike a dot, it always matches  CR  and
3020         The  feature  is provided in Perl in order to match individual bytes in         LF.  The feature is provided in Perl in order to match individual bytes
3021         UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual         in UTF-8 mode.  Because it breaks up UTF-8 characters  into  individual
3022         bytes,  what remains in the string may be a malformed UTF-8 string. For         bytes,  what remains in the string may be a malformed UTF-8 string. For
3023         this reason, the \C escape sequence is best avoided.         this reason, the \C escape sequence is best avoided.
3024    
# Line 2912  SQUARE BRACKETS AND CHARACTER CLASSES Line 3067  SQUARE BRACKETS AND CHARACTER CLASSES
3067         PCRE  is  compiled  with Unicode property support as well as with UTF-8         PCRE  is  compiled  with Unicode property support as well as with UTF-8
3068         support.         support.
3069    
3070         The newline character is never treated in any special way in  character         Characters that might indicate  line  breaks  (CR  and  LF)  are  never
3071         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE         treated  in  any  special way when matching character classes, whatever
3072         options is. A class such as [^a] will always match a newline.         line-ending sequence is in use, and whatever setting of the PCRE_DOTALL
3073           and PCRE_MULTILINE options is used. A class such as [^a] always matches
3074           one of these characters.
3075    
3076         The minus (hyphen) character can be used to specify a range of  charac-         The minus (hyphen) character can be used to specify a range of  charac-
3077         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
# Line 3015  VERTICAL BAR Line 3172  VERTICAL BAR
3172    
3173         matches  either "gilbert" or "sullivan". Any number of alternatives may         matches  either "gilbert" or "sullivan". Any number of alternatives may
3174         appear, and an empty  alternative  is  permitted  (matching  the  empty         appear, and an empty  alternative  is  permitted  (matching  the  empty
3175         string).   The  matching  process  tries each alternative in turn, from         string). The matching process tries each alternative in turn, from left
3176         left to right, and the first one that succeeds is used. If the alterna-         to right, and the first one that succeeds is used. If the  alternatives
3177         tives  are within a subpattern (defined below), "succeeds" means match-         are  within a subpattern (defined below), "succeeds" means matching the
3178         ing the rest of the main pattern as well as the alternative in the sub-         rest of the main pattern as well as the alternative in the  subpattern.
        pattern.  
3179    
3180    
3181  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
# Line 3065  INTERNAL OPTION SETTING Line 3221  INTERNAL OPTION SETTING
3221         the effects of option settings happen at compile time. There  would  be         the effects of option settings happen at compile time. There  would  be
3222         some very weird behaviour otherwise.         some very weird behaviour otherwise.
3223    
3224         The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
3225         in the same way as the Perl-compatible options by using the  characters         can be changed in the same way as the Perl-compatible options by  using
3226         U  and X respectively. The (?X) flag setting is special in that it must         the characters J, U and X respectively.
        always occur earlier in the pattern than any of the additional features  
        it  turns on, even when it is at top level. It is best to put it at the  
        start.  
3227    
3228    
3229  SUBPATTERNS  SUBPATTERNS
# Line 3082  SUBPATTERNS Line 3235  SUBPATTERNS
3235    
3236           cat(aract|erpillar|)           cat(aract|erpillar|)
3237    
3238         matches  one  of the words "cat", "cataract", or "caterpillar". Without         matches one of the words "cat", "cataract", or  "caterpillar".  Without
3239         the parentheses, it would match "cataract",  "erpillar"  or  the  empty         the  parentheses,  it  would  match "cataract", "erpillar" or the empty
3240         string.         string.
3241    
3242         2.  It  sets  up  the  subpattern as a capturing subpattern. This means         2. It sets up the subpattern as  a  capturing  subpattern.  This  means
3243         that, when the whole pattern  matches,  that  portion  of  the  subject         that,  when  the  whole  pattern  matches,  that portion of the subject
3244         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
3245         ovector argument of pcre_exec(). Opening parentheses are  counted  from         ovector  argument  of pcre_exec(). Opening parentheses are counted from
3246         left  to  right  (starting  from 1) to obtain numbers for the capturing         left to right (starting from 1) to obtain  numbers  for  the  capturing
3247         subpatterns.         subpatterns.
3248    
3249         For example, if the string "the red king" is matched against  the  pat-         For  example,  if the string "the red king" is matched against the pat-
3250         tern         tern
3251    
3252           the ((red|white) (king|queen))           the ((red|white) (king|queen))
# Line 3101  SUBPATTERNS Line 3254  SUBPATTERNS
3254         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
3255         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
3256    
3257         The fact that plain parentheses fulfil  two  functions  is  not  always         The  fact  that  plain  parentheses  fulfil two functions is not always
3258         helpful.   There are often times when a grouping subpattern is required         helpful.  There are often times when a grouping subpattern is  required
3259         without a capturing requirement. If an opening parenthesis is  followed         without  a capturing requirement. If an opening parenthesis is followed
3260         by  a question mark and a colon, the subpattern does not do any captur-         by a question mark and a colon, the subpattern does not do any  captur-
3261         ing, and is not counted when computing the  number  of  any  subsequent         ing,  and  is  not  counted when computing the number of any subsequent
3262         capturing  subpatterns. For example, if the string "the white queen" is         capturing subpatterns. For example, if the string "the white queen"  is
3263         matched against the pattern         matched against the pattern
3264    
3265           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
3266    
3267         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
3268         1  and 2. The maximum number of capturing subpatterns is 65535, and the         1 and 2. The maximum number of capturing subpatterns is 65535, and  the
3269         maximum depth of nesting of all subpatterns, both  capturing  and  non-         maximum  depth  of  nesting of all subpatterns, both capturing and non-
3270         capturing, is 200.         capturing, is 200.
3271    
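       The numbering of capturing and non-capturing  subpatterns  described
       above can be checked with the two example patterns. This sketch  uses
       Python's "re" module rather than the PCRE API (where  the  substrings
       would come back through the ovector argument of pcre_exec()); for these
       patterns the numbering rules are the same.

```python
import re

# Three capturing groups, numbered by their opening parentheses.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")

# (?:...) groups without capturing and is skipped when numbering.
non_capturing = re.match(r"the ((?:red|white) (king|queen))",
                         "the white queen")

print(capturing.groups())
print(non_capturing.groups())
```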
3272         As  a  convenient shorthand, if any option settings are required at the         As a convenient shorthand, if any option settings are required  at  the
3273         start of a non-capturing subpattern,  the  option  letters  may  appear         start  of  a  non-capturing  subpattern,  the option letters may appear
3274         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
3275    
3276           (?i:saturday|sunday)           (?i:saturday|sunday)
3277           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
3278    
3279         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
3280         tried from left to right, and options are not reset until  the  end  of         tried  from  left  to right, and options are not reset until the end of
3281         the  subpattern is reached, an option setting in one branch does affect         the subpattern is reached, an option setting in one branch does  affect
3282         subsequent branches, so the above patterns match "SUNDAY"  as  well  as         subsequent  branches,  so  the above patterns match "SUNDAY" as well as
3283         "Saturday".         "Saturday".
3284    
3285    
3286  NAMED SUBPATTERNS  NAMED SUBPATTERNS
3287    
3288         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying capturing parentheses by number is simple, but  it  can  be
3289         very hard to keep track of the numbers in complicated  regular  expres-         very  hard  to keep track of the numbers in complicated regular expres-
3290         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions. Furthermore, if an  expression  is  modified,  the  numbers  may
3291         change. To help with this difficulty, PCRE supports the naming of  sub-         change.  To help with this difficulty, PCRE supports the naming of sub-
3292         patterns,  something  that  Perl  does  not  provide. The Python syntax         patterns, something that Perl  does  not  provide.  The  Python  syntax
3293         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and         (?P<name>...)  is  used. References to capturing parentheses from other
3294         underscores, and must be unique within a pattern.         parts of the pattern, such as  backreferences,  recursion,  and  condi-
3295           tions, can be made by name as well as by number.
3296    
3297         Named  capturing  parentheses  are  still  allocated numbers as well as         Names  consist  of  up  to  32 alphanumeric characters and underscores.
3298           Named capturing parentheses are still  allocated  numbers  as  well  as
3299         names. The PCRE API provides function calls for extracting the name-to-         names. The PCRE API provides function calls for extracting the name-to-
3300         number  translation table from a compiled pattern. There is also a con-         number translation table from a compiled pattern. There is also a  con-
3301         venience function for extracting a captured substring by name. For fur-         venience function for extracting a captured substring by name.
3302         ther details see the pcreapi documentation.  
3303           By  default, a name must be unique within a pattern, but it is possible
3304           to relax this constraint by setting the PCRE_DUPNAMES option at compile
3305           time.  This  can  be useful for patterns where only one instance of the
3306           named parentheses can match. Suppose you want to match the  name  of  a
3307           weekday,  either as a 3-letter abbreviation or as the full name, and in
3308           both cases you want to extract the abbreviation. This pattern (ignoring
3309           the line breaks) does the job:
3310    
3311             (?P<DN>Mon|Fri|Sun)(?:day)?|
3312             (?P<DN>Tue)(?:sday)?|
3313             (?P<DN>Wed)(?:nesday)?|
3314             (?P<DN>Thu)(?:rsday)?|
3315             (?P<DN>Sat)(?:urday)?
3316    
3317           There  are  five capturing substrings, but only one is ever set after a
3318           match.  The convenience  function  for  extracting  the  data  by  name
3319           returns  the  substring  for  the first, and in this example, the only,
3320           subpattern of that name that matched.  This  saves  searching  to  find
3321           which  numbered  subpattern  it  was. If you make a reference to a non-
3322           unique named subpattern from elsewhere in the  pattern,  the  one  that
3323           corresponds  to  the  lowest number is used. For further details of the
3324           interfaces for handling named subpatterns, see the  pcreapi  documenta-
3325           tion.
3326    
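       Because the (?P<name>...) syntax is borrowed from Python, the fact that
       named parentheses are still allocated numbers can be illustrated  with
       Python's "re" module. This is a sketch only: the pattern and the names
       "year" and "month" are made up for illustration, and the  "groupindex"
       mapping is Python's counterpart of the name-to-number table, not  the
       PCRE API. Python's "re" rejects duplicate group names, so the weekday
       pattern above, which relies on PCRE_DUPNAMES, is not reproduced.

```python
import re

# Hypothetical pattern with two named capturing groups.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2007-02")

# Each name maps to the number its group would have had anyway.
name_to_number = dict(date.groupindex)

# A captured substring can be fetched by name or by number.
by_name = m.group("month")
by_number = m.group(2)

print(name_to_number, by_name, by_number)
```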
3327    
3328  REPETITION  REPETITION
# Line 3353  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3531  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3531         meaning  or  processing  of  a possessive quantifier and the equivalent         meaning  or  processing  of  a possessive quantifier and the equivalent
3532         atomic group.         atomic group.
3533    
3534         The possessive quantifier syntax is an extension to the Perl syntax. It         The possessive quantifier syntax is an extension to  the  Perl  syntax.
3535         originates in Sun's Java package.         Jeffrey  Friedl originated the idea (and the name) in the first edition
3536           of his book.  Mike McCloskey liked it, so implemented it when he  built
3537           Sun's Java package, and PCRE copied it from there.
3538    
3539         When  a  pattern  contains an unlimited repeat inside a subpattern that         When  a  pattern  contains an unlimited repeat inside a subpattern that
3540         can itself be repeated an unlimited number of  times,  the  use  of  an         can itself be repeated an unlimited number of  times,  the  use  of  an
# Line 3395  BACK REFERENCES Line 3575  BACK REFERENCES
3575         it  is  always  taken  as a back reference, and causes an error only if         it  is  always  taken  as a back reference, and causes an error only if
3576         there are not that many capturing left parentheses in the  entire  pat-         there are not that many capturing left parentheses in the  entire  pat-
3577         tern.  In  other words, the parentheses that are referenced need not be         tern.  In  other words, the parentheses that are referenced need not be
3578         to the left of the reference for numbers less than 10. See the  subsec-         to the left of the reference for numbers less than 10. A "forward  back
3579         tion  entitled  "Non-printing  characters" above for further details of         reference"  of  this  type can make sense when a repetition is involved
3580         the handling of digits following a backslash.         and the subpattern to the right has participated in an  earlier  itera-
3581           tion.
3582    
3583           It is not possible to have a numerical "forward back reference"  to  a
3584           subpattern whose number is 10 or more. However, a back reference to any
3585           subpattern  is  possible  using named parentheses (see below). See also
3586           the subsection entitled "Non-printing  characters"  above  for  further
3587           details of the handling of digits following a backslash.
3588    
3589         A back reference matches whatever actually matched the  capturing  sub-         A  back  reference matches whatever actually matched the capturing sub-
3590         pattern  in  the  current subject string, rather than anything matching         pattern in the current subject string, rather  than  anything  matching
3591         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
3592         of doing that). So the pattern         of doing that). So the pattern
3593    
3594           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3595    
3596         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
3597         not "sense and responsibility". If caseful matching is in force at  the         not  "sense and responsibility". If caseful matching is in force at the
3598         time  of the back reference, the case of letters is relevant. For exam-         time of the back reference, the case of letters is relevant. For  exam-
3599         ple,         ple,
3600    
3601           ((?i)rah)\s+\1           ((?i)rah)\s+\1
3602    
3603         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
3604         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
3605    
3606         Back  references  to named subpatterns use the Python syntax (?P=name).         Back references to named subpatterns use the Python  syntax  (?P=name).
3607         We could rewrite the above example as follows:         We could rewrite the above example as follows:
3608    
3609           (?<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
3610    
3611           A  subpattern  that  is  referenced  by  name may appear in the pattern
3612           before or after the reference.
3613    
3614         There may be more than one back reference to the same subpattern. If  a         There may be more than one back reference to the same subpattern. If  a
3615         subpattern  has  not actually been used in a particular match, any back         subpattern  has  not actually been used in a particular match, any back
# Line 3508  ASSERTIONS Line 3698  ASSERTIONS
3698         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
3699         contents of a lookbehind assertion are restricted  such  that  all  the         contents of a lookbehind assertion are restricted  such  that  all  the
3700         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
3701         eral alternatives, they do not all have to have the same fixed  length.         eral top-level alternatives, they do not all  have  to  have  the  same
3702         Thus         fixed length. Thus
3703    
3704           (?<=bullock|donkey)           (?<=bullock|donkey)
3705    
# Line 3622  CONDITIONAL SUBPATTERNS Line 3812  CONDITIONAL SUBPATTERNS
3812         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
3813    
3814         There are three kinds of condition. If the text between the parentheses         There are three kinds of condition. If the text between the parentheses
3815         consists of a sequence of digits, the condition  is  satisfied  if  the         consists of a sequence of digits, or a sequence of alphanumeric charac-
3816         capturing  subpattern of that number has previously matched. The number         ters  and underscores, the condition is satisfied if the capturing sub-
3817         must be greater than zero. Consider the following pattern,  which  con-         pattern of that number or name has previously matched. There is a  pos-
3818         tains  non-significant white space to make it more readable (assume the         sible  ambiguity here, because subpattern names may consist entirely of
3819         PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of         digits. PCRE looks first for a named subpattern; if it cannot find  one
3820         discussion:         and  the text consists entirely of digits, it looks for a subpattern of
3821           that number, which must be greater than zero.  Using  subpattern  names
3822           that consist entirely of digits is not recommended.
3823    
3824           Consider  the  following  pattern, which contains non-significant white
3825           space to make it more readable (assume the PCRE_EXTENDED option) and to
3826           divide it into three parts for ease of discussion:
3827    
3828           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
3829    
# Line 3640  CONDITIONAL SUBPATTERNS Line 3836  CONDITIONAL SUBPATTERNS
3836         tern  is  executed  and  a  closing parenthesis is required. Otherwise,         tern  is  executed  and  a  closing parenthesis is required. Otherwise,
3837         since no-pattern is not present, the  subpattern  matches  nothing.  In         since no-pattern is not present, the  subpattern  matches  nothing.  In
3838         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
3839         optionally enclosed in parentheses.         optionally enclosed in parentheses. Rewriting it to use a named subpat-
3840           tern gives this:
3841    
3842             (?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )
3843    
3844         If the condition is the string (R), it is satisfied if a recursive call         If the condition is the string (R), and there is no subpattern with the
3845         to  the pattern or subpattern has been made. At "top level", the condi-         name R, the condition is satisfied if a recursive call to  the  pattern
3846         tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are         or  subpattern  has  been made. At "top level", the condition is false.
3847         described in the next section.         This is a PCRE extension.  Recursive patterns are described in the next
3848           section.
3849    
3850         If  the  condition  is  not  a sequence of digits or (R), it must be an         If  the  condition  is  not  a sequence of digits or (R), it must be an
3851         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.  This may be a positive or negative lookahead or  lookbehind
# Line 3672  COMMENTS Line 3872  COMMENTS
3872         at all.         at all.
3873    
3874         If  the PCRE_EXTENDED option is set, an unescaped # character outside a         If  the PCRE_EXTENDED option is set, an unescaped # character outside a
3875         character class introduces a comment that continues up to the next new-         character class introduces a comment that continues up to and including
3876         line character in the pattern.         the next newline in the pattern.
3877    
3878    
3879  RECURSIVE PATTERNS  RECURSIVE PATTERNS
# Line 3796  SUBPATTERNS AS SUBROUTINES Line 3996  SUBPATTERNS AS SUBROUTINES
3996           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
3997    
3998         is  used, it does match "sense and responsibility" as well as the other         is  used, it does match "sense and responsibility" as well as the other
3999         two strings. Such references must, however, follow  the  subpattern  to         two strings. Such references, if given  numerically,  must  follow  the
4000         which they refer.         subpattern  to which they refer. However, named references can refer to
4001           later subpatterns.
4002    
4003         Like recursive subpatterns, a "subroutine" call is always treated as an         Like recursive subpatterns, a "subroutine" call is always treated as an
4004         atomic group. That is, once it has matched some of the subject  string,         atomic  group. That is, once it has matched some of the subject string,
4005         it  is  never  re-entered, even if it contains untried alternatives and         it is never re-entered, even if it contains  untried  alternatives  and
4006         there is a subsequent matching failure.         there is a subsequent matching failure.
4007    
4008    
4009  CALLOUTS  CALLOUTS
4010    
4011         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
4012         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
4013         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
4014         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
4015         tion.         tion.
4016    
4017         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
4018         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
4019         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
4020         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
4021         all calling out.         all calling out.
4022    
4023         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
4024         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
4025         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
4026         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
4027         points:         points:
4028    
4029           (?C1)abc(?C2)def           (?C1)abc(?C2)def
4030    
4031         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
4032         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
4033         numbered 255.         numbered 255.
4034    
4035         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
4036         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
4037         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
4038         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
4039         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
4040         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
4041         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
4042    
4043  Last updated: 24 January 2006  Last updated: 06 June 2006
4044  Copyright (c) 1997-2006 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
4045  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4046    
# Line 4847  PCRE SAMPLE PROGRAM Line 5048  PCRE SAMPLE PROGRAM
5048  Last updated: 09 September 2004  Last updated: 09 September 2004
5049  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
5050  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5051    PCRESTACK(3)                                                      PCRESTACK(3)
5052    
5053    
5054    NAME
5055           PCRE - Perl-compatible regular expressions
5056    
5057    
5058    PCRE DISCUSSION OF STACK USAGE
5059    
5060           When  you call pcre_exec(), it makes use of an internal function called
5061           match(). This calls itself recursively at branch points in the pattern,
5062           in  order to remember the state of the match so that it can back up and
5063           try a different alternative if the first one fails.  As  matching  pro-
5064           ceeds  deeper  and deeper into the tree of possibilities, the recursion
5065           depth increases.
5066    
5067           Not all calls of match() increase the recursion depth; for an item such
5068           as  a* it may be called several times at the same level, after matching
5069           different numbers of a's. Furthermore, in a number of cases  where  the
5070           result  of  the  recursive call would immediately be passed back as the
5071           result of the current call (a "tail recursion"), the function  is  just
5072           restarted instead.
5073    
5074           The pcre_dfa_exec() function operates in an entirely different way, and
5075           hardly uses recursion at all. The limit on its complexity is the amount
5076           of  workspace  it  is  given.  The comments that follow do NOT apply to
5077           pcre_dfa_exec(); they are relevant only for pcre_exec().
5078    
5079           You can set limits on the number of times that match() is called,  both
5080           in  total  and  recursively. If the limit is exceeded, an error occurs.
5081           For details, see the section on  extra  data  for  pcre_exec()  in  the
5082           pcreapi documentation.
5083    
5084           Each  time  that match() is actually called recursively, it uses memory
5085           from the process stack. For certain kinds of  pattern  and  data,  very
5086           large  amounts of stack may be needed, despite the recognition of "tail
5087           recursion".  You can often reduce the amount of recursion,  and  there-
5088           fore  the  amount of stack used, by modifying the pattern that is being
5089           matched. Consider, for example, this pattern:
5090    
5091             ([^<]|<(?!inet))+
5092    
5093           It matches from wherever it starts until it encounters "<inet"  or  the
5094           end  of  the  data,  and is the kind of pattern that might be used when
5095           processing an XML file. Each iteration of the outer parentheses matches
5096           either  one  character that is not "<" or a "<" that is not followed by
5097           "inet". However, each time a  parenthesis  is  processed,  a  recursion
5098           occurs, so this formulation uses a stack frame for each matched charac-
5099           ter. For a long string, a lot of stack is required. Consider  now  this
5100           rewritten pattern, which matches exactly the same strings:
5101    
5102             ([^<]++|<(?!inet))+
5103    
5104           This  uses very much less stack, because runs of characters that do not
5105           contain "<" are "swallowed" in one item inside the parentheses.  Recur-
5106           sion  happens  only when a "<" character that is not followed by "inet"
5107           is encountered (and we assume this is relatively  rare).  A  possessive
5108           quantifier  is  used  to stop any backtracking into the runs of non-"<"
5109           characters, but that is not related to stack usage.
5110    
5111           In environments where stack memory is constrained, you  might  want  to
5112           compile  PCRE to use heap memory instead of stack for remembering back-
5113           up points. This makes it run a lot more slowly, however. Details of how
5114           to do this are given in the pcrebuild documentation.
5115    
5116           In Unix-like environments, there is not often a problem with the stack,
5117           though the default limit on stack size varies from  system  to  system.
5118           Values  from 8MB to 64MB are common. You can find your default limit by
5119           running the command:
5120    
5121             ulimit -s
5122    
5123           The effect of running out of stack is often SIGSEGV,  though  sometimes
5124           an error message is given. You can normally increase the limit on stack
5125           size by code such as this:
5126    
5127             struct rlimit rlim;
5128             getrlimit(RLIMIT_STACK, &rlim);
5129             rlim.rlim_cur = 100*1024*1024;
5130             setrlimit(RLIMIT_STACK, &rlim);
5131    
5132           This reads the current limits (soft and hard) using  getrlimit(),  then
5133           attempts  to  increase  the  soft limit to 100MB using setrlimit(). You
5134           must do this before calling pcre_exec().
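
The fragment above omits error handling. A slightly fuller sketch, assuming a POSIX system that provides <sys/resource.h>, might look like the following. The helper name raise_stack_limit() is hypothetical and is not part of PCRE. It clamps the request to the hard limit, on the assumption that the process is unprivileged and so cannot raise its soft limit any further.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper (not a PCRE function): try to make the soft stack
   limit at least `wanted` bytes, clamped to the hard limit.
   Returns 0 on success and -1 if getrlimit() or setrlimit() fails. */
int raise_stack_limit(rlim_t wanted)
{
    struct rlimit rlim;

    assert(wanted > 0);

    /* Read the current soft (rlim_cur) and hard (rlim_max) limits. */
    if (getrlimit(RLIMIT_STACK, &rlim) != 0)
        return -1;

    /* Nothing to do if the soft limit is already unlimited or large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* The soft limit may not exceed the hard limit, so clamp the request. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    return setrlimit(RLIMIT_STACK, &rlim);
}
```

A caller would invoke something like raise_stack_limit((rlim_t)100 * 1024 * 1024) once, early in the program and before the first call to pcre_exec(). A return value of 0 means only that the soft limit is now as large as the hard limit allows, which may still be less than the amount requested.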
5135    
5136           PCRE has an internal counter that can be used to  limit  the  depth  of
5137           recursion,  and  thus cause pcre_exec() to give an error code before it
5138           runs out of stack. By default, the limit is very  large,  and  unlikely
5139           ever  to operate. It can be changed when PCRE is built, and it can also
5140           be set when pcre_exec() is called. For details of these interfaces, see
5141           the pcrebuild and pcreapi documentation.
5142    
5143           As a very rough rule of thumb, you should reckon on about 500 bytes per
5144           recursion. Thus, if you want to limit your  stack  usage  to  8MB,  you
5145           should  set  the  limit at 16000 recursions. A 64MB stack, on the other
5146           hand, can support around 128000 recursions. The pcretest  test  program
5147           has  a command line option (-S) that can be used to increase its stack.
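
The rule of thumb above is simple division, and can be sketched as a small C helper. The function name recursion_limit_for_stack() is hypothetical, and the figure of 500 bytes per recursion is the rough estimate quoted in the text, not a measured value for any particular build.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): estimate how many match()
   recursions fit in a stack budget, given a rough per-recursion cost. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes,
                                        unsigned long bytes_per_recursion)
{
    assert(bytes_per_recursion > 0);

    /* Integer division rounds down, which errs on the safe side. */
    return stack_bytes / bytes_per_recursion;
}
```

With a budget of 8*1024*1024 bytes and 500 bytes per recursion this gives 16777, which the text rounds down to 16000. With 64*1024*1024 bytes it gives 134217, in line with the "around 128000" quoted above. The result is the kind of value that would be supplied as the recursion limit when pcre_exec() is called (the pcreapi documentation describes the match_limit_recursion field of the pcre_extra block for this purpose).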
5148    
5149    Last updated: 29 June 2006
5150    Copyright (c) 1997-2006 University of Cambridge.
5151    ------------------------------------------------------------------------------
5152    
5153    
