Diff of /code/trunk/doc/pcre.txt

revision 83 by nigel, Sat Feb 24 21:41:06 2007 UTC revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC
# Line 81  USER DOCUMENTATION Line 81  USER DOCUMENTATION
81           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API
82           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
83           pcresample        discussion of the sample program           pcresample        discussion of the sample program
84             pcrestack         discussion of stack usage
85           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
86    
87         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
# Line 100  LIMITATIONS Line 101  LIMITATIONS
101         In these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
102         of execution will be slower.         of execution will be slower.
103    
104         All values in repeating quantifiers must be less than 65536.  The maxi-         All  values in repeating quantifiers must be less than 65536. The maxi-
105         mum number of capturing subpatterns is 65535.         mum compiled length of subpattern with  an  explicit  repeat  count  is
106           30000 bytes. The maximum number of capturing subpatterns is 65535.
107    
108         There is no limit to the number of non-capturing subpatterns,  but  the         There  is  no limit to the number of non-capturing subpatterns, but the
109         maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,         maximum depth of nesting of  all  kinds  of  parenthesized  subpattern,
110         including capturing subpatterns, assertions, and other types of subpat-         including capturing subpatterns, assertions, and other types of subpat-
111         tern, is 200.         tern, is 200.
112    
113           The maximum length of name for a named subpattern is 32, and the  maxi-
114           mum number of named subpatterns is 10000.
115    
116         The  maximum  length of a subject string is the largest positive number         The  maximum  length of a subject string is the largest positive number
117         that an integer variable can hold. However, when using the  traditional         that an integer variable can hold. However, when using the  traditional
118         matching function, PCRE uses recursion to handle subpatterns and indef-         matching function, PCRE uses recursion to handle subpatterns and indef-
119         inite repetition.  This means that the available stack space may  limit         inite repetition.  This means that the available stack space may  limit
120         the size of a subject string that can be processed by certain patterns.         the size of a subject string that can be processed by certain patterns.
121           For a discussion of stack issues, see the pcrestack documentation.
122    
123    
124  UTF-8 AND UNICODE PROPERTY SUPPORT  UTF-8 AND UNICODE PROPERTY SUPPORT
# Line 137  UTF-8 AND UNICODE PROPERTY SUPPORT Line 143  UTF-8 AND UNICODE PROPERTY SUPPORT
143         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-         UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
144         ported.  The available properties that can be tested are limited to the         ported.  The available properties that can be tested are limited to the
145         general  category  properties such as Lu for an upper case letter or Nd         general  category  properties such as Lu for an upper case letter or Nd
146         for a decimal number. A full list is given in the pcrepattern  documen-         for a decimal number, the Unicode script names such as Arabic  or  Han,
147         tation. The PCRE library is increased in size by about 90K when Unicode         and  the  derived  properties  Any  and L&. A full list is given in the
148         property support is included.         pcrepattern documentation. Only the short names for properties are sup-
149           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
150           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
151           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
152           does not support this.
153    
154         The following comments apply when PCRE is running in UTF-8 mode:         The following comments apply when PCRE is running in UTF-8 mode:
155    
# Line 155  UTF-8 AND UNICODE PROPERTY SUPPORT Line 165  UTF-8 AND UNICODE PROPERTY SUPPORT
165         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may
166         crash.         crash.
167    
168         2. In a pattern, the escape sequence \x{...}, where the contents of the         2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
169         braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8         two-byte UTF-8 character if the value is greater than 127.
        character whose code number is the given hexadecimal number, for  exam-  
        ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,  
        the item is not recognized.  This escape sequence can be used either as  
        a literal, or within a character class.  
170    
171         3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte         3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
172         UTF-8 character if the value is greater than 127.         characters for values greater than \177.
173    
174         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
175         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
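
The "two-byte UTF-8 character" wording in items 2 and 3 can be checked by hand. The following is a minimal sketch of standard UTF-8 encoding for values below 0x800. It is not PCRE code, and the function name utf8_encode_small is made up for this illustration.

```c
#include <stddef.h>

/* Encode a code point below 0x800 as UTF-8.
   Returns the number of bytes written to out (1 or 2), or 0 if the
   value is outside the range that this sketch handles. */
size_t utf8_encode_small(unsigned int cp, unsigned char out[2])
{
    if (cp < 0x80) {                 /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                /* 128..2047: lead byte + continuation */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    return 0;                        /* larger values need 3 or more bytes */
}
```

With this encoding, the value of \xb3 becomes the bytes 0xC2 0xB3, the value of \777 becomes 0xC7 0xBF, and \177 stays a single byte.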
# Line 192  UTF-8 AND UNICODE PROPERTY SUPPORT Line 198  UTF-8 AND UNICODE PROPERTY SUPPORT
198         Even when Unicode property support is available, PCRE  still  uses  its         Even when Unicode property support is available, PCRE  still  uses  its
199         own  character  tables when checking the case of low-valued characters,         own  character  tables when checking the case of low-valued characters,
200         so as not to degrade performance.  The Unicode property information  is         so as not to degrade performance.  The Unicode property information  is
201         used only for characters with higher values.         used only for characters with higher values. Even when Unicode property
202           support is available, PCRE supports case-insensitive matching only when
203           there  is  a  one-to-one  mapping between a letter's cases. There are a
204           small number of many-to-one mappings in Unicode;  these  are  not  sup-
205           ported by PCRE.
206    
207    
208  AUTHOR  AUTHOR
# Line 205  AUTHOR Line 215  AUTHOR
215         so I've taken it away. If you want to email me, use my initial and sur-         so I've taken it away. If you want to email me, use my initial and sur-
216         name, separated by a dot, at the domain ucs.cam.ac.uk.         name, separated by a dot, at the domain ucs.cam.ac.uk.
217    
218  Last updated: 07 March 2005  Last updated: 05 June 2006
219  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
220  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
221    
222    
# Line 280  UNICODE CHARACTER PROPERTY SUPPORT Line 290  UNICODE CHARACTER PROPERTY SUPPORT
290    
291  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
292    
293         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating
294         ter. This is the normal newline character on Unix-like systems. You can         the  end  of  a line. This is the normal newline character on Unix-like
295         compile PCRE to use character 13 (carriage return) instead by adding         systems. You can compile PCRE to use character 13 (carriage return, CR)
296           instead, by adding
297    
298           --enable-newline-is-cr           --enable-newline-is-cr
299    
300         to the configure command. For completeness there is  also  a  --enable-         to  the  configure  command.  There  is  also  a --enable-newline-is-lf
301         newline-is-lf  option,  which explicitly specifies linefeed as the new-         option, which explicitly specifies linefeed as the newline character.
302         line character.  
303           Alternatively, you can specify that line endings are to be indicated by
304           the two character sequence CRLF. If you want this, add
305    
306             --enable-newline-is-crlf
307    
308           to  the  configure command. Whatever line ending convention is selected
309           when PCRE is built can be overridden when  the  library  functions  are
310           called.  At  build time it is conventional to use the standard for your
311           operating system.
312    
313    
314  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
# Line 319  POSIX MALLOC USAGE Line 339  POSIX MALLOC USAGE
339         to the configure command.         to the configure command.
340    
341    
 LIMITING PCRE RESOURCE USAGE  
   
        Internally,  PCRE has a function called match(), which it calls repeat-  
        edly  (possibly  recursively)  when  matching  a   pattern   with   the  
        pcre_exec()  function.  By controlling the maximum number of times this  
        function may be called during a single matching operation, a limit  can  
        be  placed  on  the resources used by a single call to pcre_exec(). The  
        limit can be changed at run time, as described in the pcreapi  documen-  
        tation.  The default is 10 million, but this can be changed by adding a  
        setting such as  
   
          --with-match-limit=500000  
   
        to  the  configure  command.  This  setting  has  no  effect   on   the  
        pcre_dfa_exec() matching function.  
   
   
342  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
343    
344         Within  a  compiled  pattern,  offset values are used to point from one         Within  a  compiled  pattern,  offset values are used to point from one
# Line 365  AVOIDING EXCESSIVE STACK USAGE Line 368  AVOIDING EXCESSIVE STACK USAGE
368         ing by making recursive calls to an internal function  called  match().         ing by making recursive calls to an internal function  called  match().
369         In  environments  where  the size of the stack is limited, this can se-         In  environments  where  the size of the stack is limited, this can se-
370         verely limit PCRE's operation. (The Unix environment does  not  usually         verely limit PCRE's operation. (The Unix environment does  not  usually
371         suffer  from  this  problem.)  An alternative approach that uses memory         suffer from this problem, but it may sometimes be necessary to increase
372         from the heap to remember data, instead  of  using  recursive  function         the maximum stack size.  There is a discussion in the  pcrestack  docu-
373         calls,  has been implemented to work round this problem. If you want to         mentation.)  An alternative approach to recursion that uses memory from
374         build a version of PCRE that works this way, add         the heap to remember data, instead of using recursive  function  calls,
375           has  been  implemented to work round the problem of limited stack size.
376           If you want to build a version of PCRE that works this way, add
377    
378           --disable-stack-for-recursion           --disable-stack-for-recursion
379    
# Line 383  AVOIDING EXCESSIVE STACK USAGE Line 388  AVOIDING EXCESSIVE STACK USAGE
388         function; it is not relevant for the pcre_dfa_exec() function.         function; it is not relevant for the pcre_dfa_exec() function.
389    
390    
391    LIMITING PCRE RESOURCE USAGE
392    
393           Internally, PCRE has a function called match(), which it calls  repeat-
394           edly   (sometimes   recursively)  when  matching  a  pattern  with  the
395           pcre_exec() function. By controlling the maximum number of  times  this
396           function  may be called during a single matching operation, a limit can
397           be placed on the resources used by a single call  to  pcre_exec().  The
398           limit  can be changed at run time, as described in the pcreapi documen-
399           tation. The default is 10 million, but this can be changed by adding  a
400           setting such as
401    
402             --with-match-limit=500000
403    
404           to   the   configure  command.  This  setting  has  no  effect  on  the
405           pcre_dfa_exec() matching function.
406    
407           In some environments it is desirable to limit the  depth  of  recursive
408           calls of match() more strictly than the total number of calls, in order
409           to restrict the maximum amount of stack (or heap,  if  --disable-stack-
410           for-recursion is specified) that is used. A second limit controls this;
411           it defaults to the value that  is  set  for  --with-match-limit,  which
412           imposes  no  additional constraints. However, you can set a lower limit
413           by adding, for example,
414    
415             --with-match-limit-recursion=10000
416    
417           to the configure command. This value can  also  be  overridden  at  run
418           time.
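
The difference between the two limits can be illustrated with a toy backtracking matcher. This is only a sketch under stated assumptions: it is not PCRE's match() function, it understands only literals, '.', and a postfix '*', and every name and result code in it is hypothetical. One limit caps the total number of calls; the other caps how deeply the calls nest.

```c
#include <stddef.h>

/* Hypothetical result codes for this sketch (not PCRE's error codes). */
enum { TOY_NOMATCH = 0, TOY_MATCH = 1, TOY_LIMIT = -1, TOY_RECURSION_LIMIT = -2 };

typedef struct {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* cap on the total number of calls */
    unsigned long depth_limit;  /* cap on the nesting depth of calls */
} toy_limits;

/* Match all of subject s against all of pattern p, depth-first. */
static int toy_match_here(const char *p, const char *s,
                          unsigned long depth, toy_limits *lim)
{
    if (++lim->calls > lim->call_limit) return TOY_LIMIT;
    if (depth > lim->depth_limit) return TOY_RECURSION_LIMIT;

    if (*p == '\0') return *s == '\0' ? TOY_MATCH : TOY_NOMATCH;

    if (p[1] == '*') {
        /* Try zero repetitions first; a limit error stops the search. */
        int rc = toy_match_here(p + 2, s, depth + 1, lim);
        if (rc != TOY_NOMATCH) return rc;
        /* Otherwise consume one character and try the same item again. */
        if (*s != '\0' && (*p == '.' || *p == *s))
            return toy_match_here(p, s + 1, depth + 1, lim);
        return TOY_NOMATCH;
    }

    if (*s != '\0' && (*p == '.' || *p == *s))
        return toy_match_here(p + 1, s + 1, depth + 1, lim);
    return TOY_NOMATCH;
}

int toy_match(const char *p, const char *s,
              unsigned long call_limit, unsigned long depth_limit)
{
    toy_limits lim = {0, call_limit, depth_limit};
    return toy_match_here(p, s, 1, &lim);
}
```

In this sketch the nesting depth can never exceed the number of calls, which is why a depth limit equal to the call limit adds no constraint. A lower depth limit stops deep recursion early even when the call budget is far from exhausted.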
419    
420    
421  USING EBCDIC CODE  USING EBCDIC CODE
422    
423         PCRE assumes by default that it will run in an  environment  where  the         PCRE  assumes  by  default that it will run in an environment where the
424         character  code  is  ASCII  (or Unicode, which is a superset of ASCII).         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
425         PCRE can, however, be compiled to  run  in  an  EBCDIC  environment  by         PCRE  can,  however,  be  compiled  to  run in an EBCDIC environment by
426         adding         adding
427    
428           --enable-ebcdic           --enable-ebcdic
429    
430         to the configure command.         to the configure command.
431    
432  Last updated: 15 August 2005  Last updated: 06 June 2006
433  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
434  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
435    
436    
# Line 440  REGULAR EXPRESSIONS AS TREES Line 475  REGULAR EXPRESSIONS AS TREES
475         resented  as  a  tree structure. An unlimited repetition in the pattern         resented  as  a  tree structure. An unlimited repetition in the pattern
476         makes the tree of infinite size, but it is still a tree.  Matching  the         makes the tree of infinite size, but it is still a tree.  Matching  the
477         pattern  to a given subject string (from a given starting point) can be         pattern  to a given subject string (from a given starting point) can be
478         thought of as a search of the tree.  There are  two  standard  ways  to         thought of as a search of the tree.  There are two  ways  to  search  a
479         search  a  tree: depth-first and breadth-first, and these correspond to         tree:  depth-first  and  breadth-first, and these correspond to the two
480         the two matching algorithms provided by PCRE.         matching algorithms provided by PCRE.
481    
482    
483  THE STANDARD MATCHING ALGORITHM  THE STANDARD MATCHING ALGORITHM
# Line 562  DISADVANTAGES OF THE DFA ALGORITHM Line 597  DISADVANTAGES OF THE DFA ALGORITHM
597         but  does not provide the advantage that it does for the standard algo-         but  does not provide the advantage that it does for the standard algo-
598         rithm.         rithm.
599    
600  Last updated: 28 February 2005  Last updated: 06 June 2006
601  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
602  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
603    
604    
# Line 616  PCRE NATIVE API Line 651  PCRE NATIVE API
651         int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
652              const char *name);              const char *name);
653    
654           int pcre_get_stringtable_entries(const pcre *code,
655                const char *name, char **first, char **last);
656    
657         int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
658              int stringcount, int stringnumber,              int stringcount, int stringnumber,
659              const char **stringptr);              const char **stringptr);
# Line 676  PCRE API OVERVIEW Line 714  PCRE API OVERVIEW
714    
715         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
716         ble,  is  also provided. This uses a different algorithm for the match-         ble,  is  also provided. This uses a different algorithm for the match-
717         ing. This allows it to find all possible matches (at a given  point  in         ing. The alternative algorithm finds all possible matches (at  a  given
718         the  subject),  not  just  one. However, this algorithm does not return         point in the subject). However, this algorithm does not return captured
719         captured substrings. A description of the two matching  algorithms  and         substrings. A description of the  two  matching  algorithms  and  their
720         their  advantages  and disadvantages is given in the pcrematching docu-         advantages  and  disadvantages  is given in the pcrematching documenta-
721         mentation.         tion.
722    
723         In addition to the main compiling and  matching  functions,  there  are         In addition to the main compiling and  matching  functions,  there  are
724         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
# Line 692  PCRE API OVERVIEW Line 730  PCRE API OVERVIEW
730           pcre_get_named_substring()           pcre_get_named_substring()
731           pcre_get_substring_list()           pcre_get_substring_list()
732           pcre_get_stringnumber()           pcre_get_stringnumber()
733             pcre_get_stringtable_entries()
734    
735         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
736         to free the memory used for extracted strings.         to free the memory used for extracted strings.
# Line 723  PCRE API OVERVIEW Line 762  PCRE API OVERVIEW
762         indirections  to  memory  management functions. These special functions         indirections  to  memory  management functions. These special functions
763         are used only when PCRE is compiled to use  the  heap  for  remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
764         data, instead of recursive function calls, when running the pcre_exec()         data, instead of recursive function calls, when running the pcre_exec()
765         function. This is a non-standard way of building PCRE, for use in envi-         function. See the pcrebuild documentation for  details  of  how  to  do
766         ronments that have limited stacks. Because of the greater use of memory         this.  It  is  a non-standard way of building PCRE, for use in environ-
767         management, it runs more slowly.  Separate functions  are  provided  so         ments that have limited stacks. Because of the greater  use  of  memory
768         that  special-purpose  external  code  can  be used for this case. When         management,  it  runs  more  slowly. Separate functions are provided so
769         used, these functions are always called in a  stack-like  manner  (last         that special-purpose external code can be  used  for  this  case.  When
770         obtained,  first freed), and always for memory blocks of the same size.         used,  these  functions  are always called in a stack-like manner (last
771           obtained, first freed), and always for memory blocks of the same  size.
772           There  is  a discussion about PCRE's stack usage in the pcrestack docu-
773           mentation.
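
Because the blocks are requested in a stack-like order and share one size, special-purpose code can keep freed blocks for reuse instead of returning them to the system each time. The following is a minimal sketch of that idea, not part of PCRE: the names sketch_stack_malloc and sketch_stack_free, the cache size, and the size header are all assumptions made for illustration. It needs a C11 compiler for max_align_t and is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

#define CACHE_MAX 32            /* hypothetical cap on cached blocks */

/* Header stored in front of each block so that free knows its size.
   The union keeps the user data suitably aligned (max_align_t is C11). */
typedef union {
    size_t      size;
    max_align_t align;
} block_header;

static block_header *cache[CACHE_MAX];   /* freed blocks, most recent last */
static size_t cache_count = 0;
static size_t cache_block_size = 0;      /* size of every cached block */

void *sketch_stack_malloc(size_t size)
{
    block_header *h;
    if (cache_count > 0 && size == cache_block_size) {
        h = cache[--cache_count];        /* reuse the last block freed */
    } else {
        h = malloc(sizeof(block_header) + size);
        if (h == NULL) return NULL;
        h->size = size;
    }
    return h + 1;                        /* user data follows the header */
}

void sketch_stack_free(void *block)
{
    block_header *h;
    if (block == NULL) return;
    h = (block_header *)block - 1;
    if (cache_count == 0) cache_block_size = h->size;
    if (h->size == cache_block_size && cache_count < CACHE_MAX)
        cache[cache_count++] = h;        /* keep for the next request */
    else
        free(h);                         /* wrong size or cache is full */
}
```

Functions of this shape take the same arguments as malloc() and free(), so a caller could assign them to pcre_stack_malloc and pcre_stack_free, assuming those variables hold pointers of the matching function types.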
774    
775         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
776         by  the  caller  to  a "callout" function, which PCRE will then call at         by  the  caller  to  a "callout" function, which PCRE will then call at
# Line 736  PCRE API OVERVIEW Line 778  PCRE API OVERVIEW
778         pcrecallout documentation.         pcrecallout documentation.
779    
780    
781    NEWLINES
782           PCRE supports three different conventions for indicating line breaks in
783           strings: a single CR character, a single LF character, or the two-char-
784           acter  sequence  CRLF.  All  three  are used as "standard" by different
785           operating systems.  When PCRE is built, a default can be specified. The
786           default  default  is  LF, which is the Unix standard. When PCRE is run,
787           the default can be overridden, either when a pattern  is  compiled,  or
788           when it is matched.
789    
790           In the PCRE documentation the word "newline" is used to mean "the char-
791           acter or pair of characters that indicate a line break".
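
To make the three conventions concrete, here is a small sketch that counts line breaks in a buffer under each one. It is illustrative only and does not show how PCRE itself scans for newlines; the enum and function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical names for the three conventions described above. */
typedef enum { NL_CR, NL_LF, NL_CRLF } nl_convention;

/* Return the length of the newline that starts at s[pos], or 0 if
   no newline of the chosen convention starts there. */
size_t newline_length_at(const char *s, size_t len, size_t pos,
                         nl_convention nl)
{
    if (pos >= len) return 0;
    switch (nl) {
    case NL_CR:
        return s[pos] == '\r' ? 1 : 0;
    case NL_LF:
        return s[pos] == '\n' ? 1 : 0;
    case NL_CRLF:
        return (pos + 1 < len && s[pos] == '\r' && s[pos + 1] == '\n') ? 2 : 0;
    }
    return 0;
}

/* Count the line breaks in s under the given convention. */
size_t count_newlines(const char *s, size_t len, nl_convention nl)
{
    size_t count = 0, pos = 0;
    while (pos < len) {
        size_t n = newline_length_at(s, len, pos, nl);
        if (n > 0) { count++; pos += n; }
        else pos++;
    }
    return count;
}
```

The same bytes can give different line counts: in "a\r\nb\nc\r" there are two breaks under the CR convention, two under LF, and one under CRLF.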
792    
793    
794  MULTITHREADING  MULTITHREADING
795    
796         The  PCRE  functions  can be used in multi-threading applications, with         The PCRE functions can be used in  multi-threading  applications,  with
797         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
798         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
799         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
800    
801         The  compiled form of a regular expression is not altered during match-         The compiled form of a regular expression is not altered during  match-
802         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
803         at once.         at once.
804    
# Line 751  MULTITHREADING Line 806  MULTITHREADING
806  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
807    
808         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
809         later time, possibly by a different program, and even on a  host  other         later  time,  possibly by a different program, and even on a host other
810         than  the  one  on  which  it  was  compiled.  Details are given in the         than the one on which  it  was  compiled.  Details  are  given  in  the
811         pcreprecompile documentation.         pcreprecompile documentation.
812    
813    
# Line 760  CHECKING BUILD-TIME OPTIONS Line 815  CHECKING BUILD-TIME OPTIONS
815    
816         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
817    
818         The function pcre_config() makes it possible for a PCRE client to  dis-         The  function pcre_config() makes it possible for a PCRE client to dis-
819         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
820         The pcrebuild documentation has more details about these optional  fea-         The  pcrebuild documentation has more details about these optional fea-
821         tures.         tures.
822    
823         The  first  argument  for pcre_config() is an integer, specifying which         The first argument for pcre_config() is an  integer,  specifying  which
824         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
825         into  which  the  information  is  placed. The following information is         into which the information is  placed.  The  following  information  is
826         available:         available:
827    
828           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
829    
830         The output is an integer that is set to one if UTF-8 support is  avail-         The  output is an integer that is set to one if UTF-8 support is avail-
831         able; otherwise it is set to zero.         able; otherwise it is set to zero.
832    
833           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
834    
835         The  output  is  an  integer  that is set to one if support for Unicode         The output is an integer that is set to  one  if  support  for  Unicode
836         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
837    
838           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
839    
840         The output is an integer that is set to the value of the code  that  is         The  output  is  an integer whose value specifies the default character
841         used  for the newline character. It is either linefeed (10) or carriage         sequence that is recognized as meaning "newline". The three values that
842         return (13), and should normally be the  standard  character  for  your         are supported are: 10 for LF, 13 for CR, and 3338 for CRLF. The default
843         operating system.         should normally be the standard sequence for your operating system.
844    
845           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
846    
847         The  output  is  an  integer that contains the number of bytes used for         The output is an integer that contains the number  of  bytes  used  for
848         internal linkage in compiled regular expressions. The value is 2, 3, or         internal linkage in compiled regular expressions. The value is 2, 3, or
849         4.  Larger  values  allow larger regular expressions to be compiled, at         4. Larger values allow larger regular expressions to  be  compiled,  at
850         the expense of slower matching. The default value of  2  is  sufficient         the  expense  of  slower matching. The default value of 2 is sufficient
851         for  all  but  the  most massive patterns, since it allows the compiled         for all but the most massive patterns, since  it  allows  the  compiled
852         pattern to be up to 64K in size.         pattern to be up to 64K in size.
853    
854           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
855    
856         The output is an integer that contains the threshold  above  which  the         The  output  is  an integer that contains the threshold above which the
857         POSIX  interface  uses malloc() for output vectors. Further details are         POSIX interface uses malloc() for output vectors. Further  details  are
858         given in the pcreposix documentation.         given in the pcreposix documentation.
859    
860           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
861    
862         The output is an integer that gives the default limit for the number of         The output is an integer that gives the default limit for the number of
863         internal  matching  function  calls in a pcre_exec() execution. Further         internal matching function calls in a  pcre_exec()  execution.  Further
864         details are given with pcre_exec() below.         details are given with pcre_exec() below.
865    
866             PCRE_CONFIG_MATCH_LIMIT_RECURSION
867    
868           The  output is an integer that gives the default limit for the depth of
869           recursion when calling the internal matching function in a  pcre_exec()
870           execution. Further details are given with pcre_exec() below.
871    
872           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
873    
874         The output is an integer that is set to one if internal recursion  when         The  output is an integer that is set to one if internal recursion when
875         running pcre_exec() is implemented by recursive function calls that use         running pcre_exec() is implemented by recursive function calls that use
876         the stack to remember their state. This is the usual way that  PCRE  is         the  stack  to remember their state. This is the usual way that PCRE is
877         compiled. The output is zero if PCRE was compiled to use blocks of data         compiled. The output is zero if PCRE was compiled to use blocks of data
878         on the  heap  instead  of  recursive  function  calls.  In  this  case,         on  the  heap  instead  of  recursive  function  calls.  In  this case,
879         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
880         blocks on the heap, thus avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
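
The three values listed above for PCRE_CONFIG_NEWLINE can be turned back into byte sequences. The value 3338 equals 13 * 256 + 10, that is, CR in the high byte and LF in the low byte; the sketch below relies on that arithmetic observation and accepts only the three documented values. The function name is hypothetical.

```c
#include <stddef.h>

/* Convert a newline code (10, 13, or 3338) into the bytes it stands for.
   Returns the number of bytes written to out (1 or 2), or 0 if the
   code is not one of the three documented values. */
size_t newline_code_to_bytes(int code, unsigned char out[2])
{
    switch (code) {
    case 10:                       /* LF */
        out[0] = 10;
        return 1;
    case 13:                       /* CR */
        out[0] = 13;
        return 1;
    case (13 << 8) | 10:           /* 3338: CR followed by LF */
        out[0] = 13;
        out[1] = 10;
        return 2;
    default:
        return 0;
    }
}
```

A caller would typically obtain the code by passing PCRE_CONFIG_NEWLINE and the address of an int to pcre_config(), then hand the result to a helper such as this one.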
881    
882    
# Line 832  COMPILING A PATTERN Line 893  COMPILING A PATTERN
893    
894         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
895         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
896         the two interfaces is that pcre_compile2() has an additional  argument,         the  two interfaces is that pcre_compile2() has an additional argument,
897         errorcodeptr, via which a numerical error code can be returned.         errorcodeptr, via which a numerical error code can be returned.
898    
899         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
900         the pattern argument. A pointer to a single block  of  memory  that  is         the  pattern  argument.  A  pointer to a single block of memory that is
901         obtained  via  pcre_malloc is returned. This contains the compiled code         obtained via pcre_malloc is returned. This contains the  compiled  code
902         and related data. The pcre type is defined for the returned block; this         and related data. The pcre type is defined for the returned block; this
903         is a typedef for a structure whose contents are not externally defined.         is a typedef for a structure whose contents are not externally defined.
904         It is up to the caller  to  free  the  memory  when  it  is  no  longer         It is up to the caller to free the memory (via pcre_free) when it is no
905         required.         longer required.
906    
907         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although the compiled code of a PCRE regex is relocatable, that is,  it
908         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
909         fully  relocatable, because it may contain a copy of the tableptr argu-         fully relocatable, because it may contain a copy of the tableptr  argu-
910         ment, which is an address (see below).         ment, which is an address (see below).
911    
912         The options argument contains independent bits that affect the compila-         The options argument contains independent bits that affect the compila-
913         tion.  It  should  be  zero  if  no options are required. The available         tion. It should be zero if  no  options  are  required.  The  available
914         options are described below. Some of them, in  particular,  those  that         options  are  described  below. Some of them, in particular, those that
915         are  compatible  with  Perl,  can also be set and unset from within the         are compatible with Perl, can also be set and  unset  from  within  the
916         pattern (see the detailed description  in  the  pcrepattern  documenta-         pattern  (see  the  detailed  description in the pcrepattern documenta-
917         tion).  For  these options, the contents of the options argument speci-         tion). For these options, the contents of the options  argument  speci-
918         fies their initial settings at the start of compilation and  execution.         fies  their initial settings at the start of compilation and execution.
919         The  PCRE_ANCHORED option can be set at the time of matching as well as         The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the  time
920         at compile time.         of matching as well as at compile time.
921    
922         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
923         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
924         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
925         sage.  The  offset from the start of the pattern to the character where         sage. This is a static string that is part of the library. You must not
926         the error was discovered is  placed  in  the  variable  pointed  to  by         try to free it. The offset from the start of the pattern to the charac-
927         erroffset,  which  must  not  be  NULL. If it is, an immediate error is         ter where the error was discovered is placed in the variable pointed to
928           by  erroffset,  which must not be NULL. If it is, an immediate error is
929         given.         given.
930    
931         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
# Line 926  COMPILING A PATTERN Line 988  COMPILING A PATTERN
988    
989         If this bit is set, a dollar metacharacter in the pattern matches  only         If this bit is set, a dollar metacharacter in the pattern matches  only
990         at  the  end  of the subject string. Without this option, a dollar also         at  the  end  of the subject string. Without this option, a dollar also
991         matches immediately before the final character if it is a newline  (but         matches immediately before a newline at the end of the string (but  not
992         not  before  any  other  newlines).  The  PCRE_DOLLAR_ENDONLY option is         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
993         ignored if PCRE_MULTILINE is set. There is no equivalent to this option         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
994         in Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
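The rule for where a dollar can match in the default (non-multiline) mode can be sketched in plain C. This is only a model of the behaviour described above, not PCRE's code: dollar_matches_at is a hypothetical helper, and it assumes that a newline is the single character LF and that PCRE_MULTILINE is not set.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if a dollar metacharacter could match at
   offset `pos` of a subject of `length` bytes, else 0. Assumes LF is the
   newline and that PCRE_MULTILINE is not set. */
static int dollar_matches_at(const char *subject, size_t length,
                             size_t pos, int dollar_endonly)
{
    assert(subject != NULL && pos <= length);

    /* The very end of the subject always qualifies. */
    if (pos == length)
        return 1;

    /* Without PCRE_DOLLAR_ENDONLY, the position just before a newline
       that ends the subject also qualifies. */
    if (!dollar_endonly && pos + 1 == length && subject[pos] == '\n')
        return 1;

    return 0;
}
```

For the subject "abc\n", offset 3 qualifies only when dollar_endonly is zero, while offset 4 qualifies in both cases.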
995    
996           PCRE_DOTALL           PCRE_DOTALL
997    
998         If this bit is set, a dot metacharacter in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches all char-
999         acters, including newlines. Without it,  newlines  are  excluded.  This         acters, including those that indicate newline. Without it, a  dot  does
1000         option  is equivalent to Perl's /s option, and it can be changed within         not  match  when  the  current position is at a newline. This option is
1001         a pattern by a (?s) option setting.  A  negative  class  such  as  [^a]         equivalent to Perl's /s option, and it can be changed within a  pattern
1002         always  matches a newline character, independent of the setting of this         by  a (?s) option setting. A negative class such as [^a] always matches
1003         option.         newlines, independent of the setting of this option.
1004    
1005             PCRE_DUPNAMES
1006    
1007           If this bit is set, names used to identify capturing  subpatterns  need
1008           not be unique. This can be helpful for certain types of pattern when it
1009           is known that only one instance of the named  subpattern  can  ever  be
1010           matched.  There  are  more details of named subpatterns below; see also
1011           the pcrepattern documentation.
1012    
1013           PCRE_EXTENDED           PCRE_EXTENDED
1014    
# Line 946  COMPILING A PATTERN Line 1016  COMPILING A PATTERN
1016         totally ignored except when escaped or inside a character class. White-         totally ignored except when escaped or inside a character class. White-
1017         space does not include the VT character (code 11). In addition, charac-         space does not include the VT character (code 11). In addition, charac-
1018         ters between an unescaped # outside a character class and the next new-         ters between an unescaped # outside a character class and the next new-
1019         line character, inclusive, are also  ignored.  This  is  equivalent  to         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x
1020         Perl's  /x  option,  and  it  can be changed within a pattern by a (?x)         option,  and  it  can be changed within a pattern by a (?x) option set-
1021         option setting.         ting.
1022    
1023         This option makes it possible to include  comments  inside  complicated         This option makes it possible to include  comments  inside  complicated
1024         patterns.   Note,  however,  that this applies only to data characters.         patterns.   Note,  however,  that this applies only to data characters.
# Line 964  COMPILING A PATTERN Line 1034  COMPILING A PATTERN
1034         letter  that  has  no  special  meaning causes an error, thus reserving         letter  that  has  no  special  meaning causes an error, thus reserving
1035         these combinations for future expansion. By  default,  as  in  Perl,  a         these combinations for future expansion. By  default,  as  in  Perl,  a
1036         backslash  followed by a letter with no special meaning is treated as a         backslash  followed by a letter with no special meaning is treated as a
1037         literal. There are at present no  other  features  controlled  by  this         literal. (Perl can, however, be persuaded to give a warning for  this.)
1038         option. It can also be set by a (?X) option setting within a pattern.         There  are  at  present no other features controlled by this option. It
1039           can also be set by a (?X) option setting within a pattern.
1040    
1041           PCRE_FIRSTLINE           PCRE_FIRSTLINE
1042    
1043         If  this  option  is  set,  an  unanchored pattern is required to match         If this option is set, an  unanchored  pattern  is  required  to  match
1044         before or at the first newline character in the subject string,  though         before  or  at  the  first  newline  in  the subject string, though the
1045         the matched text may continue over the newline.         matched text may continue over the newline.
1046    
1047           PCRE_MULTILINE           PCRE_MULTILINE
1048    
1049         By  default,  PCRE  treats the subject string as consisting of a single         By default, PCRE treats the subject string as consisting  of  a  single
1050         line of characters (even if it actually contains newlines). The  "start         line  of characters (even if it actually contains newlines). The "start
1051         of  line"  metacharacter  (^)  matches only at the start of the string,         of line" metacharacter (^) matches only at the  start  of  the  string,
1052         while the "end of line" metacharacter ($) matches only at  the  end  of         while  the  "end  of line" metacharacter ($) matches only at the end of
1053         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1054         is set). This is the same as Perl.         is set). This is the same as Perl.
1055    
1056         When PCRE_MULTILINE is set, the "start of line" and  "end  of  line"         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"
1057         constructs  match  immediately following or immediately before any new-         constructs match immediately following or immediately  before  internal
1058         line in the subject string, respectively, as well as at the very  start         newlines  in  the  subject string, respectively, as well as at the very
1059         and  end. This is equivalent to Perl's /m option, and it can be changed         start and end. This is equivalent to Perl's /m option, and  it  can  be
1060         within a pattern by a (?m) option setting. If there are no "\n" charac-         changed within a pattern by a (?m) option setting. If there are no new-
1061         ters  in  a  subject  string, or no occurrences of ^ or $ in a pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1062         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
1063    
1064             PCRE_NEWLINE_CR
1065             PCRE_NEWLINE_LF
1066             PCRE_NEWLINE_CRLF
1067    
1068           These  options  override the default newline definition that was chosen
1069           when PCRE was built. Setting the first or the second specifies  that  a
1070           newline  is  indicated  by a single character (CR or LF, respectively).
1071           Setting both of them specifies that a newline is indicated by the  two-
1072           character  CRLF sequence. For convenience, PCRE_NEWLINE_CRLF is defined
1073           to contain both bits. The only time that a line break is relevant  when
1074           compiling a pattern is if PCRE_EXTENDED is set, and an unescaped # out-
1075           side a character class is encountered. This indicates  a  comment  that
1076           lasts until after the next newline.
1077    
1078           The newline option set at compile time becomes the default that is used
1079           for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
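The way the two single-character bits combine can be sketched as follows. The SKETCH_ constants are placeholder values chosen for illustration and are not the values in pcre.h; real code should use PCRE_NEWLINE_CR, PCRE_NEWLINE_LF and PCRE_NEWLINE_CRLF directly. The helper newline_kind_from_options is likewise hypothetical.

```c
#include <assert.h>

/* Placeholder bit values for illustration only; they are not the values
   defined in pcre.h. */
#define SKETCH_NEWLINE_CR   0x1u
#define SKETCH_NEWLINE_LF   0x2u
#define SKETCH_NEWLINE_CRLF (SKETCH_NEWLINE_CR | SKETCH_NEWLINE_LF)

enum newline_kind { NL_DEFAULT, NL_CR, NL_LF, NL_CRLF };

/* Hypothetical helper: decode the newline choice carried in an options
   word. Neither bit set means "use the default chosen at build time". */
static enum newline_kind newline_kind_from_options(unsigned int options)
{
    /* The decoding only works if the two single-character bits differ. */
    assert((SKETCH_NEWLINE_CR & SKETCH_NEWLINE_LF) == 0);

    switch (options & SKETCH_NEWLINE_CRLF) {
    case SKETCH_NEWLINE_CR:   return NL_CR;
    case SKETCH_NEWLINE_LF:   return NL_LF;
    case SKETCH_NEWLINE_CRLF: return NL_CRLF;
    default:                  return NL_DEFAULT;
    }
}
```

Because the CRLF constant is simply both bits together, OR-ing the CR and LF options gives the same result as passing the CRLF option.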
1080    
1081           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1082    
1083         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
# Line 1059  COMPILATION ERROR CODES Line 1147  COMPILATION ERROR CODES
1147           23  internal error: code overflow           23  internal error: code overflow
1148           24  unrecognized character after (?<           24  unrecognized character after (?<
1149           25  lookbehind assertion is not fixed length           25  lookbehind assertion is not fixed length
1150           26  malformed number after (?(           26  malformed number or name after (?(
1151           27  conditional group contains more than two branches           27  conditional group contains more than two branches
1152           28  assertion expected after (?(           28  assertion expected after (?(
1153           29  (?R or (?digits must be followed by )           29  (?R or (?digits must be followed by )
# Line 1076  COMPILATION ERROR CODES Line 1164  COMPILATION ERROR CODES
1164           40  recursive call could loop indefinitely           40  recursive call could loop indefinitely
1165           41  unrecognized character after (?P           41  unrecognized character after (?P
1166           42  syntax error after (?P           42  syntax error after (?P
1167           43  two named groups have the same name           43  two named subpatterns have the same name
1168           44  invalid UTF-8 string           44  invalid UTF-8 string
1169           45  support for \P, \p, and \X has not been compiled           45  support for \P, \p, and \X has not been compiled
1170           46  malformed \P or \p sequence           46  malformed \P or \p sequence
1171           47  unknown property name after \P or \p           47  unknown property name after \P or \p
1172             48  subpattern name is too long (maximum 32 characters)
1173             49  too many named subpatterns (maximum 10,000)
1174             50  repeated subpattern is too long
1175             51  octal value is greater than \377 (not in UTF-8 mode)
1176    
1177    
1178  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1111  STUDYING A PATTERN Line 1203  STUDYING A PATTERN
1203    
1204         The  third argument for pcre_study() is a pointer for an error message.         The  third argument for pcre_study() is a pointer for an error message.
1205         If studying succeeds (even if no data is  returned),  the  variable  it         If studying succeeds (even if no data is  returned),  the  variable  it
1206         points  to  is set to NULL. Otherwise it points to a textual error mes-         points  to  is  set  to NULL. Otherwise it is set to point to a textual
1207         sage. You should therefore test the error pointer for NULL after  call-         error message. This is a static string that is part of the library. You
1208         ing pcre_study(), to be sure that it has run successfully.         must  not  try  to  free it. You should test the error pointer for NULL
1209           after calling pcre_study(), to be sure that it has run successfully.
1210    
1211         This is a typical call to pcre_study():         This is a typical call to pcre_study():
1212    
# Line 1124  STUDYING A PATTERN Line 1217  STUDYING A PATTERN
1217             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1218    
1219         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1220         that do not have a single fixed starting character. A bitmap of  possi-         that  do not have a single fixed starting character. A bitmap of possi-
1221         ble starting bytes is created.         ble starting bytes is created.
1222    
1223    
1224  LOCALE SUPPORT  LOCALE SUPPORT
1225    
1226         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
1227         letters, digits, or whatever, by reference to a set of  tables,  indexed         letters,  digits,  or whatever, by reference to a set of tables, indexed
1228         by  character  value.  When running in UTF-8 mode, this applies only to         by character value. When running in UTF-8 mode, this  applies  only  to
1229         characters with codes less than 128. Higher-valued  codes  never  match         characters  with  codes  less than 128. Higher-valued codes never match
1230         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1231         with Unicode character property support.         with  Unicode  character property support. The use of locales with Uni-
1232           code is discouraged.
1233    
1234         An internal set of tables is created in the default C locale when  PCRE         An internal set of tables is created in the default C locale when  PCRE
1235         is  built.  This  is  used when the final argument of pcre_compile() is         is  built.  This  is  used when the final argument of pcre_compile() is
# Line 1200  INFORMATION ABOUT A PATTERN Line 1294  INFORMATION ABOUT A PATTERN
1294         pattern:         pattern:
1295    
1296           int rc;           int rc;
1297           unsigned long int length;           size_t length;
1298           rc = pcre_fullinfo(           rc = pcre_fullinfo(
1299             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
1300             pe,               /* result of pcre_study(), or NULL */             pe,               /* result of pcre_study(), or NULL */
# Line 1232  INFORMATION ABOUT A PATTERN Line 1326  INFORMATION ABOUT A PATTERN
1326           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1327    
1328         Return  information  about  the first byte of any matched string, for a         Return  information  about  the first byte of any matched string, for a
1329         non-anchored   pattern.   (This    option    used    to    be    called         non-anchored pattern. The fourth argument should point to an int  vari-
1330         PCRE_INFO_FIRSTCHAR;  the  old  name  is still recognized for backwards         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1331         compatibility.)         is still recognized for backwards compatibility.)
1332    
1333         If there is a fixed first byte, for example, from  a  pattern  such  as         If there is a fixed first byte, for example, from  a  pattern  such  as
1334         (cat|cow|coyote),  it  is  returned in the integer pointed to by where.         (cat|cow|coyote). Otherwise, if either
        Otherwise, if either  
1335    
1336         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1337         branch starts with "^", or         branch starts with "^", or
1338    
1339         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1340         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1341    
1342         -1 is returned, indicating that the pattern matches only at  the  start         -1  is  returned, indicating that the pattern matches only at the start
1343         of  a  subject string or after any newline within the string. Otherwise         of a subject string or after any newline within the  string.  Otherwise
1344         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
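The three kinds of value described above can be told apart with a small helper. This is a sketch only: classify_first_byte and the enum names are hypothetical, and the value passed in is assumed to be the int that was filled in for PCRE_INFO_FIRSTBYTE.

```c
#include <assert.h>

enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value is the fixed first byte (0..255)    */
    FIRST_BYTE_LINE_START,  /* -1: matches only at start or after newline */
    FIRST_BYTE_NONE         /* -2: no first-byte information, or anchored */
};

/* Hypothetical helper: interpret the int returned for
   PCRE_INFO_FIRSTBYTE. */
static enum first_byte_kind classify_first_byte(int value)
{
    assert(value >= -2 && value <= 255);

    if (value >= 0)
        return FIRST_BYTE_FIXED;
    return (value == -1) ? FIRST_BYTE_LINE_START : FIRST_BYTE_NONE;
}
```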
1345    
1346           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1347    
1348         If the pattern was studied, and this resulted in the construction of  a         If  the pattern was studied, and this resulted in the construction of a
1349         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1350         matching string, a pointer to the table is returned. Otherwise NULL  is         matching  string, a pointer to the table is returned. Otherwise NULL is
1351         returned.  The fourth argument should point to an unsigned char * vari-         returned. The fourth argument should point to an unsigned char *  vari-
1352         able.         able.
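A caller that has obtained a non-NULL table can test whether a given byte may start a match. The sketch below assumes that byte c is represented by bit (c % 8) of table[c / 8]; that bit order is an assumption made here, not something stated in the text above. The helper may_start_match is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: test one bit of a 256-bit (32-byte) first-byte
   table. Assumes byte c maps to bit (c % 8) of table[c / 8]. The caller
   must deal with a NULL table (no table was built) before calling. */
static int may_start_match(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] >> (c % 8)) & 1;
}
```

A byte whose bit is clear cannot be the first byte of any match, so a scanning loop can skip it without running the matcher.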
1353    
1354           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1355    
1356         Return the value of the rightmost literal byte that must exist  in  any         Return  the  value of the rightmost literal byte that must exist in any
1357         matched  string,  other  than  at  its  start,  if such a byte has been         matched string, other than at its  start,  if  such  a  byte  has  been
1358         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1359         is  no such byte, -1 is returned. For anchored patterns, a last literal         is no such byte, -1 is returned. For anchored patterns, a last  literal
1360         byte is recorded only if it follows something of variable  length.  For         byte  is  recorded only if it follows something of variable length. For
1361         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1362         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1363    
# Line 1272  INFORMATION ABOUT A PATTERN Line 1365  INFORMATION ABOUT A PATTERN
1365           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1366           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1367    
1368         PCRE supports the use of named as well as numbered capturing  parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1369         ses.  The names are just an additional way of identifying the parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1370         ses,  which  still  acquire  numbers.  A  convenience  function  called         ses, which still acquire numbers. Several convenience functions such as
1371         pcre_get_named_substring()  is  provided  for  extracting an individual         pcre_get_named_substring() are provided for  extracting  captured  sub-
1372         captured substring by name. It is also possible  to  extract  the  data         strings  by  name. It is also possible to extract the data directly, by
1373         directly,  by  first converting the name to a number in order to access         first converting the name to a number in order to  access  the  correct
1374         the correct pointers in the output vector (described  with  pcre_exec()         pointers in the output vector (described with pcre_exec() below). To do
1375         below).  To  do the conversion, you need to use the name-to-number map,         the conversion, you need  to  use  the  name-to-number  map,  which  is
1376         which is described by these three values.         described by these three values.
1377    
1378         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1379         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1380         of each entry; both of these  return  an  int  value.  The  entry  size         of  each  entry;  both  of  these  return  an int value. The entry size
1381         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1382         a pointer to the first entry of the table  (a  pointer  to  char).  The         a  pointer  to  the  first  entry of the table (a pointer to char). The
1383         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1384         sis, most significant byte first. The rest of the entry is  the  corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1385         sponding  name,  zero  terminated. The names are in alphabetical order.         sponding name, zero terminated. The names are  in  alphabetical  order.
1386         For example, consider the following pattern  (assume  PCRE_EXTENDED  is         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1387         set, so white space - including newlines - is ignored):         theses numbers. For example, consider  the  following  pattern  (assume
1388           PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1389           ignored):
1390    
1391           (?P<date> (?P<year>(\d\d)?\d\d) -           (?P<date> (?P<year>(\d\d)?\d\d) -
1392           (?P<month>\d\d) - (?P<day>\d\d) )           (?P<month>\d\d) - (?P<day>\d\d) )
1393    
1394         There  are  four  named subpatterns, so the table has four entries, and         There are four named subpatterns, so the table has  four  entries,  and
1395         each entry in the table is eight bytes long. The table is  as  follows,         each  entry  in the table is eight bytes long. The table is as follows,
1396         with non-printing bytes shown in hexadecimal, and undefined bytes shown         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1397         as ??:         as ??:
1398    
# Line 1306  INFORMATION ABOUT A PATTERN Line 1401  INFORMATION ABOUT A PATTERN
1401           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1402           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1403    
1404         When writing code to extract data  from  named  subpatterns  using  the         When  writing  code  to  extract  data from named subpatterns using the
1405         name-to-number map, remember that the length of each entry is likely to         name-to-number map, remember that the length of the entries  is  likely
1406         be different for each compiled pattern.         to be different for each compiled pattern.
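A lookup over the name-to-number map might be written as below. This is a minimal sketch: find_group_number is a hypothetical helper, not a PCRE function, and it assumes the caller has already obtained the three values (entry count, entry size and table pointer) from pcre_fullinfo(). It uses a simple linear scan rather than exploiting the alphabetical ordering.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search a name table of `count` entries, each
   `entry_size` bytes long. The first two bytes of an entry hold the
   parenthesis number, most significant byte first; the rest is the
   zero-terminated name. Returns the number, or -1 if the name is absent. */
static int find_group_number(const char *table, int count, int entry_size,
                             const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size > 2);

    for (i = 0; i < count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;

        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

Because the entry size is passed in rather than hard-coded, the same helper works for any compiled pattern, whatever the length of its longest name.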
1407    
1408           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1409    
1410         Return a copy of the options with which the pattern was  compiled.  The         Return  a  copy of the options with which the pattern was compiled. The
1411         fourth  argument  should  point to an unsigned long int variable. These         fourth argument should point to an unsigned long  int  variable.  These
1412         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1413         by any top-level option settings within the pattern itself.         by any top-level option settings within the pattern itself.
1414    
1415         A  pattern  is  automatically  anchored by PCRE if all of its top-level         A pattern is automatically anchored by PCRE if  all  of  its  top-level
1416         alternatives begin with one of the following:         alternatives begin with one of the following:
1417    
1418           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1331  INFORMATION ABOUT A PATTERN Line 1426  INFORMATION ABOUT A PATTERN
1426    
1427           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1428    
1429         Return  the  size  of the compiled pattern, that is, the value that was         Return the size of the compiled pattern, that is, the  value  that  was
1430         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1431         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1432         size_t variable.         size_t variable.
# Line 1339  INFORMATION ABOUT A PATTERN Line 1434  INFORMATION ABOUT A PATTERN
1434           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1435    
1436         Return the size of the data block pointed to by the study_data field in         Return the size of the data block pointed to by the study_data field in
1437         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to
1438         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1439         created  by  pcre_study(). The fourth argument should point to a size_t         created by pcre_study(). The fourth argument should point to  a  size_t
1440         variable.         variable.
1441    
1442    
# Line 1349  OBSOLETE INFO FUNCTION Line 1444  OBSOLETE INFO FUNCTION
1444    
1445         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1446    
1447         The pcre_info() function is now obsolete because its interface  is  too         The  pcre_info()  function is now obsolete because its interface is too
1448         restrictive  to return all the available data about a compiled pattern.         restrictive to return all the available data about a compiled  pattern.
1449         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of
1450         pcre_info()  is the number of capturing subpatterns, or one of the fol-         pcre_info() is the number of capturing subpatterns, or one of the  fol-
1451         lowing negative numbers:         lowing negative numbers:
1452    
1453           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1454           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1455    
1456         If the optptr argument is not NULL, a copy of the  options  with  which         If  the  optptr  argument is not NULL, a copy of the options with which
1457         the  pattern  was  compiled  is placed in the integer it points to (see         the pattern was compiled is placed in the integer  it  points  to  (see
1458         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1459    
1460         If the pattern is not anchored and the  firstcharptr  argument  is  not         If  the  pattern  is  not anchored and the firstcharptr argument is not
1461         NULL,  it is used to pass back information about the first character of         NULL, it is used to pass back information about the first character  of
1462         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1463    
1464    
# Line 1371  REFERENCE COUNTS Line 1466  REFERENCE COUNTS
1466    
1467         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1468    
1469         The pcre_refcount() function is used to maintain a reference  count  in         The  pcre_refcount()  function is used to maintain a reference count in
1470         the data block that contains a compiled pattern. It is provided for the         the data block that contains a compiled pattern. It is provided for the
1471         benefit of applications that  operate  in  an  object-oriented  manner,         benefit  of  applications  that  operate  in an object-oriented manner,
1472         where different parts of the application may be using the same compiled         where different parts of the application may be using the same compiled
1473         pattern, but you want to free the block when they are all done.         pattern, but you want to free the block when they are all done.
1474    
1475         When a pattern is compiled, the reference count field is initialized to         When a pattern is compiled, the reference count field is initialized to
1476         zero.   It is changed only by calling this function, whose action is to         zero.  It is changed only by calling this function, whose action is  to
1477         add the adjust value (which may be positive or  negative)  to  it.  The         add  the  adjust  value  (which may be positive or negative) to it. The
1478         yield of the function is the new value. However, the value of the count         yield of the function is the new value. However, the value of the count
1479         is constrained to lie between 0 and 65535, inclusive. If the new  value         is  constrained to lie between 0 and 65535, inclusive. If the new value
1480         is outside these limits, it is forced to the appropriate limit value.         is outside these limits, it is forced to the appropriate limit value.
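
The clamping rule just described can be modelled in a few lines of plain C. This is only a sketch of the documented behaviour, not PCRE's source: the name model_refcount is hypothetical, and the real pcre_refcount() operates on a compiled pattern rather than on a bare counter.

```c
#include <assert.h>

/* Hypothetical model of the documented rule: add adjust to the current
   count, then force the result into the range 0..65535. */
int model_refcount(int current, int adjust)
{
    long long updated;

    assert(current >= 0 && current <= 65535);

    /* Widen before adding so that an extreme adjust cannot overflow int. */
    updated = (long long)current + adjust;

    if (updated < 0) updated = 0;
    if (updated > 65535) updated = 65535;
    return (int)updated;
}
```

Under this model a caller that decrements more often than it incremented simply sees zero again, so a yield of zero is the natural signal that the block may be freed.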
1481    
1482         Except  when it is zero, the reference count is not correctly preserved         Except when it is zero, the reference count is not correctly  preserved
1483         if a pattern is compiled on one host and then  transferred  to  a  host         if  a  pattern  is  compiled on one host and then transferred to a host
1484         whose byte-order is different. (This seems a highly unlikely scenario.)         whose byte-order is different. (This seems a highly unlikely scenario.)
1485    
1486    
# Line 1395  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1490  MATCHING A PATTERN: THE TRADITIONAL FUNC
1490              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1491              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1492    
1493         The function pcre_exec() is called to match a subject string against  a         The  function pcre_exec() is called to match a subject string against a
1494         compiled  pattern, which is passed in the code argument. If the pattern         compiled pattern, which is passed in the code argument. If the  pattern
1495         has been studied, the result of the study should be passed in the extra         has been studied, the result of the study should be passed in the extra
1496         argument.  This  function is the main matching facility of the library,         argument. This function is the main matching facility of  the  library,
1497         and it operates in a Perl-like manner. For specialist use there is also         and it operates in a Perl-like manner. For specialist use there is also
1498         an  alternative matching function, which is described below in the sec-         an alternative matching function, which is described below in the  sec-
1499         tion about the pcre_dfa_exec() function.         tion about the pcre_dfa_exec() function.
1500    
1501         In most applications, the pattern will have been compiled (and  option-         In  most applications, the pattern will have been compiled (and option-
1502         ally  studied)  in the same process that calls pcre_exec(). However, it         ally studied) in the same process that calls pcre_exec().  However,  it
1503         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
1504         later  in  different processes, possibly even on different hosts. For a         later in different processes, possibly even on different hosts.  For  a
1505         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
1506    
1507         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1425  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1520  MATCHING A PATTERN: THE TRADITIONAL FUNC
1520    
1521     Extra data for pcre_exec()     Extra data for pcre_exec()
1522    
1523         If the extra argument is not NULL, it must point to a  pcre_extra  data         If  the  extra argument is not NULL, it must point to a pcre_extra data
1524         block.  The pcre_study() function returns such a block (when it doesn't         block. The pcre_study() function returns such a block (when it  doesn't
1525         return NULL), but you can also create one for yourself, and pass  addi-         return  NULL), but you can also create one for yourself, and pass addi-
1526         tional  information in it. The fields in a pcre_extra block are as fol-         tional information in it. The pcre_extra block contains  the  following
1527         lows:         fields (not necessarily in this order):
1528    
1529           unsigned long int flags;           unsigned long int flags;
1530           void *study_data;           void *study_data;
1531           unsigned long int match_limit;           unsigned long int match_limit;
1532             unsigned long int match_limit_recursion;
1533           void *callout_data;           void *callout_data;
1534           const unsigned char *tables;           const unsigned char *tables;
1535    
1536         The flags field is a bitmap that specifies which of  the  other  fields         The  flags  field  is a bitmap that specifies which of the other fields
1537         are set. The flag bits are:         are set. The flag bits are:
1538    
1539           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1540           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1541             PCRE_EXTRA_MATCH_LIMIT_RECURSION
1542           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1543           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
1544    
1545         Other  flag  bits should be set to zero. The study_data field is set in         Other flag bits should be set to zero. The study_data field is  set  in
1546         the pcre_extra block that is returned by  pcre_study(),  together  with         the  pcre_extra  block  that is returned by pcre_study(), together with
1547         the appropriate flag bit. You should not set this yourself, but you may         the appropriate flag bit. You should not set this yourself, but you may
1548         add to the block by setting the other fields  and  their  corresponding         add  to  the  block by setting the other fields and their corresponding
1549         flag bits.         flag bits.
1550    
1551         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1552         a vast amount of resources when running patterns that are not going  to         a  vast amount of resources when running patterns that are not going to
1553         match,  but  which  have  a very large number of possibilities in their         match, but which have a very large number  of  possibilities  in  their
1554         search trees. The classic  example  is  the  use  of  nested  unlimited         search  trees.  The  classic  example  is  the  use of nested unlimited
1555         repeats.         repeats.
1556    
1557         Internally,  PCRE uses a function called match() which it calls repeat-         Internally, PCRE uses a function called match() which it calls  repeat-
1558         edly (sometimes recursively). The limit is imposed  on  the  number  of         edly  (sometimes  recursively). The limit set by match_limit is imposed
1559         times  this  function is called during a match, which has the effect of         on the number of times this function is called during  a  match,  which
1560         limiting the amount of recursion and backtracking that can take  place.         has  the  effect  of  limiting the amount of backtracking that can take
1561         For patterns that are not anchored, the count starts from zero for each         place. For patterns that are not anchored, the count restarts from zero
1562         position in the subject string.         for each position in the subject string.
1563    
1564         The default limit for the library can be set when PCRE  is  built;  the         The  default  value  for  the  limit can be set when PCRE is built; the
1565         default  default  is 10 million, which handles all but the most extreme         default default is 10 million, which handles all but the  most  extreme
1566         cases. You can reduce  the  default  by  supplying pcre_exec()  with  a         cases.  You  can  override  the  default by supplying pcre_exec() with a
1567         pcre_extra  block  in  which match_limit is set to a smaller value, and         pcre_extra    block    in    which    match_limit    is    set,     and
1568         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1569         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1570    
1571         The  pcre_callout  field is used in conjunction with the "callout" fea-         The match_limit_recursion field is similar to match_limit, but  instead
1572           of limiting the total number of times that match() is called, it limits
1573           the depth of recursion. The recursion depth is a  smaller  number  than
1574           the  total number of calls, because not all calls to match() are recur-
1575           sive.  This limit is of use only if it is set smaller than match_limit.
1576    
1577           Limiting  the  recursion  depth  limits the amount of stack that can be
1578           used, or, when PCRE has been compiled to use memory on the heap instead
1579           of the stack, the amount of heap memory that can be used.
1580    
1581           The  default  value  for  match_limit_recursion can be set when PCRE is
1582           built; the default default  is  the  same  value  as  the  default  for
1583           match_limit.  You can override the default by supplying pcre_exec() with
1584           a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and
1585           PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the
1586           limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
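
The difference between the two limits may be easier to see in a toy backtracking matcher. The sketch below is not PCRE's match() function and does not use the PCRE API; every name in it (toy_match, toy_limits, the TOY_ codes) is hypothetical. It supports only literal characters optionally followed by '*', and it must match the whole subject. One counter bounds the total number of calls, in the spirit of match_limit, and a depth argument bounds the recursion, in the spirit of match_limit_recursion.

```c
#include <assert.h>

/* Return codes of the toy matcher (hypothetical, not PCRE's). */
enum { TOY_MATCH = 1, TOY_NOMATCH = 0, TOY_CALL_LIMIT = -1, TOY_DEPTH_LIMIT = -2 };

typedef struct {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* plays the role of match_limit */
    unsigned long depth_limit;  /* plays the role of match_limit_recursion */
} toy_limits;

/* Match all of s against p. Each pattern item is a literal character,
   optionally followed by '*' (zero or more, greedy, with backtracking). */
int toy_match(const char *p, const char *s, unsigned long depth, toy_limits *lim)
{
    assert(lim != 0);

    /* Every call counts towards the total; depth counts only nesting. */
    if (++lim->calls > lim->call_limit) return TOY_CALL_LIMIT;
    if (depth > lim->depth_limit) return TOY_DEPTH_LIMIT;

    if (*p == '\0') return (*s == '\0') ? TOY_MATCH : TOY_NOMATCH;

    if (p[1] == '*') {
        const char *t = s;
        while (*t == *p) t++;            /* take as many as possible */
        for (;;) {
            int rc = toy_match(p + 2, t, depth + 1, lim);
            if (rc != TOY_NOMATCH) return rc;   /* match or limit hit */
            if (t == s) return TOY_NOMATCH;     /* nothing left to give back */
            t--;                                /* backtrack one character */
        }
    }

    if (*s == *p) return toy_match(p + 1, s + 1, depth + 1, lim);
    return TOY_NOMATCH;
}

/* Convenience wrapper: start a fresh count for one matching attempt. */
int toy_run(const char *p, const char *s,
            unsigned long call_limit, unsigned long depth_limit)
{
    toy_limits lim;
    lim.calls = 0;
    lim.call_limit = call_limit;
    lim.depth_limit = depth_limit;
    return toy_match(p, s, 0, &lim);
}
```

In this toy, a pattern such as a*a*a*b applied to a run of twenty "a" characters makes about two thousand calls yet never nests more than a few levels deep, which illustrates why the depth is a smaller number than the total number of calls.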
1587    
1588           The callout_data field is used in conjunction with the  "callout"  fea-
1589         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1590    
1591         The tables field  is  used  to  pass  a  character  tables  pointer  to         The  tables  field  is  used  to  pass  a  character  tables pointer to
1592         pcre_exec();  this overrides the value that is stored with the compiled         pcre_exec(); this overrides the value that is stored with the  compiled
1593         pattern. A non-NULL value is stored with the compiled pattern  only  if         pattern.  A  non-NULL value is stored with the compiled pattern only if
1594         custom  tables  were  supplied to pcre_compile() via its tableptr argu-         custom tables were supplied to pcre_compile() via  its  tableptr  argu-
1595         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1596         PCRE's  internal  tables  to be used. This facility is helpful when re-         PCRE's internal tables to be used. This facility is  helpful  when  re-
1597         using patterns that have been saved after compiling  with  an  external         using  patterns  that  have been saved after compiling with an external
1598         set  of  tables,  because  the  external tables might be at a different         set of tables, because the external tables  might  be  at  a  different
1599         address when pcre_exec() is called. See the  pcreprecompile  documenta-         address  when  pcre_exec() is called. See the pcreprecompile documenta-
1600         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
1601    
1602     Option bits for pcre_exec()     Option bits for pcre_exec()
1603    
1604         The  unused  bits of the options argument for pcre_exec() must be zero.         The unused bits of the options argument for pcre_exec() must  be  zero.
1605         The  only  bits  that  may  be  set  are  PCRE_ANCHORED,   PCRE_NOTBOL,         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
1606         PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1607           PCRE_PARTIAL.
1608    
1609           PCRE_ANCHORED           PCRE_ANCHORED
1610    
# Line 1498  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1613  MATCHING A PATTERN: THE TRADITIONAL FUNC
1613         turned  out to be anchored by virtue of its contents, it cannot be made         turned  out to be anchored by virtue of its contents, it cannot be made
1614         unanchored at matching time.         unanchored at matching time.
1615    
1616             PCRE_NEWLINE_CR
1617             PCRE_NEWLINE_LF
1618             PCRE_NEWLINE_CRLF
1619    
1620           These options override  the  newline  definition  that  was  chosen  or
1621           defaulted  when the pattern was compiled. For details, see the descrip-
1622           tion of pcre_compile() above. During matching, the newline choice affects
1623           the behaviour of the dot, circumflex, and dollar metacharacters.
1624    
1625           PCRE_NOTBOL           PCRE_NOTBOL
1626    
1627         This option specifies that the first character of the subject string is not         This option specifies that the first character of the subject string is not
1628         the  beginning  of  a  line, so the circumflex metacharacter should not         the beginning of a line, so the  circumflex  metacharacter  should  not
1629         match before it. Setting this without PCRE_MULTILINE (at compile  time)         match  before it. Setting this without PCRE_MULTILINE (at compile time)
1630         causes  circumflex  never to match. This option affects only the behav-         causes circumflex never to match. This option affects only  the  behav-
1631         iour of the circumflex metacharacter. It does not affect \A.         iour of the circumflex metacharacter. It does not affect \A.
1632    
1633           PCRE_NOTEOL           PCRE_NOTEOL
1634    
1635         This option specifies that the end of the subject string is not the end         This option specifies that the end of the subject string is not the end
1636         of  a line, so the dollar metacharacter should not match it nor (except         of a line, so the dollar metacharacter should not match it nor  (except
1637         in multiline mode) a newline immediately before it. Setting this  with-         in  multiline mode) a newline immediately before it. Setting this with-
1638         out PCRE_MULTILINE (at compile time) causes dollar never to match. This         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1639         option affects only the behaviour of the dollar metacharacter. It  does         option  affects only the behaviour of the dollar metacharacter. It does
1640         not affect \Z or \z.         not affect \Z or \z.
1641    
1642           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1643    
1644         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
1645         set. If there are alternatives in the pattern, they are tried.  If  all         set.  If  there are alternatives in the pattern, they are tried. If all
1646         the  alternatives  match  the empty string, the entire match fails. For         the alternatives match the empty string, the entire  match  fails.  For
1647         example, if the pattern         example, if the pattern
1648    
1649           a?b?           a?b?
1650    
1651         is applied to a string not beginning with "a" or "b",  it  matches  the         is  applied  to  a string not beginning with "a" or "b", it matches the
1652         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
1653         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
1654         rences of "a" or "b".         rences of "a" or "b".
1655    
1656         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1657         cial case of a pattern match of the empty  string  within  its  split()         cial  case  of  a  pattern match of the empty string within its split()
1658         function,  and  when  using  the /g modifier. It is possible to emulate         function, and when using the /g modifier. It  is  possible  to  emulate
1659         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1660         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1661         if that fails by advancing the starting offset (see below)  and  trying         if  that  fails by advancing the starting offset (see below) and trying
1662         an ordinary match again. There is some code that demonstrates how to do         an ordinary match again. There is some code that demonstrates how to do
1663         this in the pcredemo.c sample program.         this in the pcredemo.c sample program.
1664    
1665           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1666    
1667         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
1668         UTF-8  string is automatically checked when pcre_exec() is subsequently         UTF-8 string is automatically checked when pcre_exec() is  subsequently
1669         called.  The value of startoffset is also checked  to  ensure  that  it         called.   The  value  of  startoffset is also checked to ensure that it
1670         points  to the start of a UTF-8 character. If an invalid UTF-8 sequence         points to the start of a UTF-8 character. If an invalid UTF-8  sequence
1671         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1672         startoffset  contains  an  invalid  value, PCRE_ERROR_BADUTF8_OFFSET is         startoffset contains an  invalid  value,  PCRE_ERROR_BADUTF8_OFFSET  is
1673         returned.         returned.
1674    
1675         If you already know that your subject is valid, and you  want  to  skip         If  you  already  know that your subject is valid, and you want to skip
1676         these    checks    for   performance   reasons,   you   can   set   the         these   checks   for   performance   reasons,   you   can    set    the
1677         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
1678         do  this  for the second and subsequent calls to pcre_exec() if you are         do this for the second and subsequent calls to pcre_exec() if  you  are
1679         making repeated calls to find all  the  matches  in  a  single  subject         making  repeated  calls  to  find  all  the matches in a single subject
1680         string.  However,  you  should  be  sure  that the value of startoffset         string. However, you should be  sure  that  the  value  of  startoffset
1681         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1682         set,  the  effect of passing an invalid UTF-8 string as a subject, or a         set, the effect of passing an invalid UTF-8 string as a subject,  or  a
1683         value of startoffset that does not point to the start of a UTF-8  char-         value  of startoffset that does not point to the start of a UTF-8 char-
1684         acter, is undefined. Your program may crash.         acter, is undefined. Your program may crash.
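
Because the behaviour is undefined when PCRE_NO_UTF8_CHECK is set and startoffset is wrong, a caller may want a cheap check of its own before skipping PCRE's. The following is a minimal sketch in plain C, assuming the subject is already known to be valid UTF-8. The function name starts_utf8_char is hypothetical, and treating an offset equal to the length as acceptable is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if offset can be the start of a UTF-8 character in subject,
   0 otherwise. Only the byte at offset is inspected, so this does not
   validate the string as a whole. */
int starts_utf8_char(const unsigned char *subject, int length, int offset)
{
    assert(subject != NULL && length >= 0);

    if (offset < 0 || offset > length) return 0;
    if (offset == length) return 1;   /* end of subject (assumed acceptable) */

    /* Bytes of the form 10xxxxxx are continuation bytes, which can only
       appear in the middle of a character. */
    return (subject[offset] & 0xC0) != 0x80;
}
```

This covers only the startoffset half of the requirement. Whether the subject itself is valid still has to be known from elsewhere, for example from an earlier call that did not set PCRE_NO_UTF8_CHECK.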
1685    
1686           PCRE_PARTIAL           PCRE_PARTIAL
1687    
1688         This  option  turns  on  the  partial  matching feature. If the subject         This option turns on the  partial  matching  feature.  If  the  subject
1689         string fails to match the pattern, but at some point during the  match-         string  fails to match the pattern, but at some point during the match-
1690         ing  process  the  end of the subject was reached (that is, the subject         ing process the end of the subject was reached (that  is,  the  subject
1691         partially matches the pattern and the failure to  match  occurred  only         partially  matches  the  pattern and the failure to match occurred only
1692         because  there were not enough subject characters), pcre_exec() returns         because there were not enough subject characters), pcre_exec()  returns
1693         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is         PCRE_ERROR_PARTIAL  instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
1694         used,  there  are restrictions on what may appear in the pattern. These         used, there are restrictions on what may appear in the  pattern.  These
1695         are discussed in the pcrepartial documentation.         are discussed in the pcrepartial documentation.
1696    
1697     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
1698    
1699         The subject string is passed to pcre_exec() as a pointer in subject,  a         The  subject string is passed to pcre_exec() as a pointer in subject, a
1700         length  in  length, and a starting byte offset in startoffset. In UTF-8         length in length, and a starting byte offset in startoffset.  In  UTF-8
1701         mode, the byte offset must point to the start  of  a  UTF-8  character.         mode,  the  byte  offset  must point to the start of a UTF-8 character.
1702         Unlike  the  pattern string, the subject may contain binary zero bytes.         Unlike the pattern string, the subject may contain binary  zero  bytes.
1703         When the starting offset is zero, the search for a match starts at  the         When  the starting offset is zero, the search for a match starts at the
1704         beginning of the subject, and this is by far the most common case.         beginning of the subject, and this is by far the most common case.
1705    
1706         A  non-zero  starting offset is useful when searching for another match         A non-zero starting offset is useful when searching for  another  match
1707         in the same subject by calling pcre_exec() again after a previous  suc-         in  the same subject by calling pcre_exec() again after a previous suc-
1708         cess.   Setting  startoffset differs from just passing over a shortened         cess.  Setting startoffset differs from just passing over  a  shortened
1709         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
1710         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
1711    
1712           \Biss\B           \Biss\B
1713    
1714         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
1715         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
1716         When  applied to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call to  pcre_exec()
1717         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
1718         the  remainder  of  the  subject,  namely "issippi", it does not match,         the remainder of the subject,  namely  "issippi",  it does  not  match,
1719         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
1720         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
1721         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
1722         rence  of "iss" because it is able to look behind the starting point to         rence of "iss" because it is able to look behind the starting point  to
1723         discover that it is preceded by a letter.         discover that it is preceded by a letter.
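
The reason the two approaches differ can be shown without a regular expression engine. The sketch below is plain C and only models the word-boundary test described above. The function names are hypothetical, and a word character is assumed to be a letter, digit, or underscore.

```c
#include <assert.h>
#include <ctype.h>

/* Assumed definition of a word character: letter, digit or underscore. */
int is_word_char(int c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos in subject[0..length) is a word boundary,
   that is, if exactly one of the characters on either side of pos is a
   word character. Positions outside the subject count as non-word. */
int at_word_boundary(const char *subject, int length, int pos)
{
    int before, after;

    assert(subject != 0 && pos >= 0 && pos <= length);

    before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

When only the remainder of the subject is passed, position 0 has nothing before it, so it is a boundary and \B fails. When the whole subject is passed with a starting offset of 4, the character before the offset is visible, the position is not a boundary, and \B succeeds.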
1724    
1725         If a non-zero starting offset is passed when the pattern  is  anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
1726         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
1727         if the pattern does not require the match to be at  the  start  of  the         if  the  pattern  does  not require the match to be at the start of the
1728         subject.         subject.
1729    
1730     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
1731    
1732         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
1733         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
1734         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
1735         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
1736         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
1737         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
1738         that do not cause substrings to be captured.         that do not cause substrings to be captured.
1739    
1740         Captured  substrings are returned to the caller via a vector of integer         Captured substrings are returned to the caller via a vector of  integer
1741         offsets whose address is passed in ovector. The number of  elements  in         offsets  whose  address is passed in ovector. The number of elements in
1742         the  vector is passed in ovecsize, which must be a non-negative number.         the vector is passed in ovecsize, which must be a non-negative  number.
1743         Note: this argument is NOT the size of ovector in bytes.         Note: this argument is NOT the size of ovector in bytes.
1744    
1745         The first two-thirds of the vector is used to pass back  captured  sub-         The  first  two-thirds of the vector is used to pass back captured sub-
1746         strings,  each  substring using a pair of integers. The remaining third         strings, each substring using a pair of integers. The  remaining  third
1747         of the vector is used as workspace by pcre_exec() while  matching  cap-         of  the  vector is used as workspace by pcre_exec() while matching cap-
1748         turing  subpatterns, and is not available for passing back information.         turing subpatterns, and is not available for passing back  information.
1749         The length passed in ovecsize should always be a multiple of three.  If         The  length passed in ovecsize should always be a multiple of three. If
1750         it is not, it is rounded down.         it is not, it is rounded down.
1751    
1752         When  a  match  is successful, information about captured substrings is         When a match is successful, information about  captured  substrings  is
1753         returned in pairs of integers, starting at the  beginning  of  ovector,         returned  in  pairs  of integers, starting at the beginning of ovector,
1754         and  continuing  up  to two-thirds of its length at the most. The first         and continuing up to two-thirds of its length at the  most.  The  first
1755         element of a pair is set to the offset of the first character in a sub-         element of a pair is set to the offset of the first character in a sub-
1756         string,  and  the  second  is  set to the offset of the first character         string, and the second is set to the  offset  of  the  first  character
1757         after the end of a substring. The  first  pair,  ovector[0]  and  ovec-         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-
1758         tor[1],  identify  the  portion  of  the  subject string matched by the         tor[1], identify the portion of  the  subject  string  matched  by  the
1759         entire pattern. The next pair is used for the first  capturing  subpat-         entire  pattern.  The next pair is used for the first capturing subpat-
1760         tern,  and  so  on.  The value returned by pcre_exec() is the number of         tern, and so on. The value returned by pcre_exec() is one more than the
1761         pairs that have been set. If there are no  capturing  subpatterns,  the         highest numbered pair that has been set. For example, if two substrings
1762         return  value  from  a  successful match is 1, indicating that just the         have been captured, the returned value is 3. If there are no  capturing
1763         first pair of offsets has been set.         subpatterns,  the return value from a successful match is 1, indicating
1764           that just the first pair of offsets has been set.
        Some convenience functions are provided  for  extracting  the  captured  
        substrings  as  separate  strings. These are described in the following  
        section.  
   
        It is possible for an capturing subpattern number  n+1  to  match  some  
        part  of  the  subject  when subpattern n has not been used at all. For  
        example, if the string "abc" is matched against the pattern (a|(z))(bc)  
        subpatterns  1 and 3 are matched, but 2 is not. When this happens, both  
        offset values corresponding to the unused subpattern are set to -1.  
1765    
1766         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
1767         of the string that it matched that is returned.         of the string that it matched that is returned.
# Line 1660  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1775  MATCHING A PATTERN: THE TRADITIONAL FUNC
1775         substrings,  PCRE has to get additional memory for use during matching.         substrings,  PCRE has to get additional memory for use during matching.
1776         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
1777    
1778         Note that pcre_info() can be used to find out how many  capturing  sub-         The pcre_info() function can be used to find  out  how  many  capturing
1779         patterns there are in a compiled pattern. The smallest size for ovector         subpatterns  there  are  in  a  compiled pattern. The smallest size for
1780         that will allow for n captured substrings, in addition to  the  offsets         ovector that will allow for n captured substrings, in addition  to  the
1781         of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
1782    
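The sizing rule above can be sketched in C. This is a minimal illustration, not part of PCRE: the names min_ovector_size and usable_offset_slots are hypothetical helpers, and the rounding down to a multiple of three is an assumption about how a non-conforming size would be treated.

```c
#include <assert.h>

/* Smallest ovector size (in ints) that holds the whole-match pair plus
   n captured substrings, using the (n+1)*3 rule from the text. */
int min_ovector_size(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of leading ints that can carry offsets back to the caller:
   at most two-thirds of the vector. A size that is not a multiple of
   three is treated here as if rounded down (an assumption). */
int usable_offset_slots(int ovecsize)
{
    assert(ovecsize >= 0);
    return (ovecsize / 3) * 2;
}
```

For a pattern with three capturing subpatterns, min_ovector_size(3) gives 12, of which the first 8 ints hold the four offset pairs.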
1783           It  is  possible for capturing subpattern number n+1 to match some part
1784           of the subject when subpattern n has not been used at all. For example,
1785           if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
1786           return from the function is 4, and subpatterns 1 and 3 are matched, but
1787           2  is  not.  When  this happens, both values in the offset pairs corre-
1788           sponding to unused subpatterns are set to -1.
1789    
1790           Offset values that correspond to unused subpatterns at the end  of  the
1791           expression  are  also  set  to  -1. For example, if the string "abc" is
1792           matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
1793           matched.  The  return  from the function is 2, because the highest used
1794           capturing subpattern number is 1. However, you can refer to the offsets
1795           for  the  second  and third capturing subpatterns if you wish (assuming
1796           the vector is large enough, of course).
1797    
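As a rough sketch of how a caller might interpret the offset pairs and the return value described above, the helper below reports the length of a captured substring, or -1 when the subpattern was not used. The function name capture_length is hypothetical; the ovector contents in the usage note are the values the text implies for "abc" matched against (a|(z))(bc), not output captured from a real run.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the length of capture group n, or -1 if the group did not
   take part in the match. rc is the positive value returned by
   pcre_exec(); ovector holds the offset pairs it filled in. */
int capture_length(const int *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= rc)
        return -1;      /* beyond the highest numbered pair that was set */
    if (ovector[2 * n] < 0)
        return -1;      /* unused subpattern: both offsets are -1 */
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

With ovector = {0,3, 0,1, -1,-1, 1,3} and a return value of 4, group 0 has length 3, group 1 has length 1, group 2 is reported as unset, and group 3 has length 2.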
1798     Return values from pcre_exec()         Some convenience functions are provided  for  extracting  the  captured
1799           substrings as separate strings. These are described below.
1800    
1801       Error return values from pcre_exec()
1802    
1803         If  pcre_exec()  fails, it returns a negative number. The following are         If  pcre_exec()  fails, it returns a negative number. The following are
1804         defined in the header file:         defined in the header file:
# Line 1713  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 1846  MATCHING A PATTERN: THE TRADITIONAL FUNC
1846    
1847           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
1848    
1849         The  recursion  and backtracking limit, as specified by the match_limit         The  backtracking  limit,  as  specified  by the match_limit field in a
1850           pcre_extra structure (or defaulted) was reached.  See  the  description
1851           above.
1852    
1853             PCRE_ERROR_RECURSIONLIMIT (-21)
1854    
1855           The internal recursion limit, as specified by the match_limit_recursion
1856         field in a pcre_extra structure (or defaulted)  was  reached.  See  the         field in a pcre_extra structure (or defaulted)  was  reached.  See  the
1857         description above.         description above.
1858    
# Line 1774  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1913  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1913         string_list() are provided for extracting captured substrings  as  new,         string_list() are provided for extracting captured substrings  as  new,
1914         separate,  zero-terminated strings. These functions identify substrings         separate,  zero-terminated strings. These functions identify substrings
1915         by number. The next section describes functions  for  extracting  named         by number. The next section describes functions  for  extracting  named
1916         substrings.  A  substring  that  contains  a  binary  zero is correctly         substrings.
1917         extracted and has a further zero added on the end, but  the  result  is  
1918         not, of course, a C string.         A  substring that contains a binary zero is correctly extracted and has
1919           a further zero added on the end, but the result is not, of course, a  C
1920           string.   However,  you  can  process such a string by referring to the
1921           length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-
1922           string().  Unfortunately, the interface to pcre_get_substring_list() is
1923           not adequate for handling strings containing binary zeros, because  the
1924           end of the final string is not independently indicated.
1925    
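The point about binary zeros can be illustrated with a small sketch: once a substring may contain a zero byte, the returned length, not the terminating zero, has to bound any processing. The helper name count_binary_zeros is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Counts the binary zeros inside a substring of known length. A
   strlen()-style scan would stop at the first zero, so the length
   reported by the extraction function must drive the loop. */
int count_binary_zeros(const char *buf, int length)
{
    int i, zeros = 0;
    assert(buf != NULL && length >= 0);
    for (i = 0; i < length; i++)
        if (buf[i] == '\0')
            zeros++;
    return zeros;
}
```

For the five-byte substring 'a', 'b', zero, 'c', 'd', the function finds one embedded zero, whereas treating the buffer as a C string would see only "ab".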
1926         The  first  three  arguments  are the same for all three of these func-         The  first  three  arguments  are the same for all three of these func-
1927         tions: subject is the subject string that has  just  been  successfully         tions: subject is the subject string that has  just  been  successfully
# Line 1831  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 1976  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1976         tively.  They  do  nothing  more  than  call the function pointed to by         tively.  They  do  nothing  more  than  call the function pointed to by
1977         pcre_free, which of course could be called directly from a  C  program.         pcre_free, which of course could be called directly from a  C  program.
1978         However,  PCRE is used in some situations where it is linked via a spe-         However,  PCRE is used in some situations where it is linked via a spe-
1979         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
1980         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free  directly;  it is for these cases that the functions are pro-
1981         vided.         vided.
1982    
# Line 1856  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 2001  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2001    
2002           (a+)b(?P<xxx>\d+)...           (a+)b(?P<xxx>\d+)...
2003    
2004         the number of the subpattern called "xxx" is 2. You can find the number         the number of the subpattern called "xxx" is 2. If the name is known to
2005         from the name by calling pcre_get_stringnumber(). The first argument is         be unique (PCRE_DUPNAMES was not set), you can find the number from the
2006         the  compiled  pattern,  and  the  second is the name. The yield of the         name by calling pcre_get_stringnumber(). The first argument is the com-
2007         function is the subpattern number, or  PCRE_ERROR_NOSUBSTRING  (-7)  if         piled pattern, and the second is the name. The yield of the function is
2008         there is no subpattern of that name.         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2009           subpattern of that name.
2010    
2011         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
2012         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
2013         are also two functions that do the whole job.         are also two functions that do the whole job.
2014    
2015         Most    of    the    arguments   of   pcre_copy_named_substring()   and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
2016         pcre_get_named_substring() are the same  as  those  for  the  similarly         pcre_get_named_substring()  are  the  same  as  those for the similarly
2017         named  functions  that extract by number. As these are described in the         named functions that extract by number. As these are described  in  the
2018         previous section, they are not re-described here. There  are  just  two         previous  section,  they  are not re-described here. There are just two
2019         differences:         differences:
2020    
2021         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
2022         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2023         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
2024         name-to-number translation table.         name-to-number translation table.
2025    
2026         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2027         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2028         ate.         ate.
2029    
2030    
2031    DUPLICATE SUBPATTERN NAMES
2032    
2033           int pcre_get_stringtable_entries(const pcre *code,
2034                const char *name, char **first, char **last);
2035    
2036           When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for
2037           subpatterns are not required to  be  unique.  Normally,  patterns  with
2038           duplicate  names  are such that in any one match, only one of the named
2039           subpatterns participates. An example is shown in the pcrepattern  docu-
2040           mentation. When duplicates are present, pcre_copy_named_substring() and
2041           pcre_get_named_substring() return the first substring corresponding  to
2042           the  given  name  that  is  set.  If  none  are set, an empty string is
2043           returned.  The pcre_get_stringnumber() function returns one of the num-
2044           bers  that are associated with the name, but it is not defined which it
2045           is.
2046    
2047           If you want to get full details of all captured substrings for a  given
2048           name,  you  must  use  the pcre_get_stringtable_entries() function. The
2049           first argument is the compiled pattern, and the second is the name. The
2050           third  and  fourth  are  pointers to variables which are updated by the
2051           function. After it has run, they point to the first and last entries in
2052           the  name-to-number  table  for  the  given  name.  The function itself
2053           returns the length of each entry, or  PCRE_ERROR_NOSUBSTRING  if  there
2054           are  none.  The  format  of the table is described above in the section
2055           entitled Information about a pattern. Given all  the  relevant  entries
2056           for the name, you can extract each of their numbers, and hence the cap-
2057           tured data, if any.
2058    
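Walking the entries between the two pointers might look like the sketch below. It assumes the name table layout documented for PCRE_INFO_NAMETABLE (two bytes holding the subpattern number, most significant byte first, followed by the zero-terminated name); the function collect_group_numbers is a hypothetical helper, and first, last and entry_size stand for the values that pcre_get_stringtable_entries() hands back.

```c
#include <assert.h>
#include <stddef.h>

/* Collects the subpattern numbers stored in the name-table entries
   from first to last inclusive. Returns how many were stored. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *numbers, int max_numbers)
{
    const unsigned char *entry;
    int count = 0;
    assert(first != NULL && last != NULL && entry_size > 2);
    for (entry = (const unsigned char *)first;
         entry <= (const unsigned char *)last && count < max_numbers;
         entry += entry_size)
        numbers[count++] = (entry[0] << 8) | entry[1];
    return count;
}
```

Each collected number can then be used to index the offset pairs in ovector, checking for -1 to see which of the duplicates actually captured something.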
2059    
2060  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
2061    
2062         The traditional matching function uses a  similar  algorithm  to  Perl,         The traditional matching function uses a  similar  algorithm  to  Perl,
# Line 1925  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2100  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2100         workspace vector should contain at least 20 elements. It  is  used  for         workspace vector should contain at least 20 elements. It  is  used  for
2101         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2102         workspace will be needed for patterns and subjects where  there  are  a         workspace will be needed for patterns and subjects where  there  are  a
2103         lot of possible matches.         lot of potential matches.
2104    
2105         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_dfa_exec():
2106    
2107           int rc;           int rc;
2108           int ovector[10];           int ovector[10];
2109           int wspace[20];           int wspace[20];
2110           rc = pcre_exec(           rc = pcre_dfa_exec(
2111             re,             /* result of pcre_compile() */             re,             /* result of pcre_compile() */
2112             NULL,           /* we didn't study the pattern */             NULL,           /* we didn't study the pattern */
2113             "some string",  /* the subject string */             "some string",  /* the subject string */
# Line 1947  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2122  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2122     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
2123    
2124         The  unused  bits  of  the options argument for pcre_dfa_exec() must be         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2125         zero. The only bits that may be  set  are  PCRE_ANCHORED,  PCRE_NOTBOL,         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2126         PCRE_NOTEOL,     PCRE_NOTEMPTY,    PCRE_NO_UTF8_CHECK,    PCRE_PARTIAL,         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2127         PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All  but  the  last  three  of         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2128         these  are  the  same  as  for pcre_exec(), so their description is not         three of these are the same as for pcre_exec(), so their description is
2129         repeated here.         not repeated here.
2130    
2131           PCRE_PARTIAL           PCRE_PARTIAL
2132    
# Line 2052  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 2227  MATCHING A PATTERN: THE ALTERNATIVE FUNC
2227         This error is given if the output vector  is  not  large  enough.  This         This error is given if the output vector  is  not  large  enough.  This
2228         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
2229    
2230  Last updated: 16 May 2005  Last updated: 08 June 2006
2231  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
2232  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2233    
2234    
# Line 2229  DIFFERENCES BETWEEN PCRE AND PERL Line 2404  DIFFERENCES BETWEEN PCRE AND PERL
2404         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
2405         respect to Perl 5.8.         respect to Perl 5.8.
2406    
2407         1.  PCRE does not have full UTF-8 support. Details of what it does have         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
2408         are given in the section on UTF-8 support in the main pcre page.         of what it does have are given in the section on UTF-8 support  in  the
2409           main pcre page.
2410    
2411         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2412         permits  them,  but they do not mean what you might think. For example,         permits them, but they do not mean what you might think.  For  example,
2413         (?!a){3} does not assert that the next three characters are not "a". It         (?!a){3} does not assert that the next three characters are not "a". It
2414         just asserts that the next character is not "a" three times.         just asserts that the next character is not "a" three times.
2415    
2416         3.  Capturing  subpatterns  that occur inside negative lookahead asser-         3. Capturing subpatterns that occur inside  negative  lookahead  asser-
2417         tions are counted, but their entries in the offsets  vector  are  never         tions  are  counted,  but their entries in the offsets vector are never
2418         set.  Perl sets its numerical variables from any such patterns that are         set. Perl sets its numerical variables from any such patterns that  are
2419         matched before the assertion fails to match something (thereby succeed-         matched before the assertion fails to match something (thereby succeed-
2420         ing),  but  only  if the negative lookahead assertion contains just one         ing), but only if the negative lookahead assertion  contains  just  one
2421         branch.         branch.
2422    
2423         4. Though binary zero characters are supported in the  subject  string,         4.  Though  binary zero characters are supported in the subject string,
2424         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
2425         mal C string, terminated by zero. The escape sequence \0 can be used in         mal C string, terminated by zero. The escape sequence \0 can be used in
2426         the pattern to represent a binary zero.         the pattern to represent a binary zero.
2427    
2428         5.  The  following Perl escape sequences are not supported: \l, \u, \L,         5. The following Perl escape sequences are not supported: \l,  \u,  \L,
2429         \U, and \N. In fact these are implemented by Perl's general string-han-         \U, and \N. In fact these are implemented by Perl's general string-han-
2430         dling  and are not part of its pattern matching engine. If any of these         dling and are not part of its pattern matching engine. If any of  these
2431         are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
2432    
2433         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE         6.  The Perl escape sequences \p, \P, and \X are supported only if PCRE
2434         is  built  with Unicode character property support. The properties that         is built with Unicode character property support. The  properties  that
2435         can be tested with \p and \P are limited to the general category  prop-         can  be tested with \p and \P are limited to the general category prop-
2436         erties such as Lu and Nd.         erties such as Lu and Nd, script names such as Greek or  Han,  and  the
2437           derived properties Any and L&.
2438    
2439         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2440         ters in between are treated as literals.  This  is  slightly  different         ters in between are treated as literals.  This  is  slightly  different
# Line 2297  DIFFERENCES BETWEEN PCRE AND PERL Line 2474  DIFFERENCES BETWEEN PCRE AND PERL
2474         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2475    
2476         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2477         cial meaning is faulted.         cial meaning  is  faulted.  Otherwise,  like  Perl,  the  backslash  is
2478           ignored. (Perl can be made to issue a warning.)
2479    
2480         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
2481         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
2482         lowed by a question mark they are.         lowed by a question mark they are.
2483    
2484         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2485         tried only at the first matching position in the subject string.         tried only at the first matching position in the subject string.
2486    
2487         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-
2488         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2489    
2490         (g) The (?R), (?number), and (?P>name) constructs allow for  recursive         (g)  The (?R), (?number), and (?P>name) constructs allow for recursive
2491         pattern  matching  (Perl  can  do  this using the (?p{code}) construct,         pattern matching (Perl can do  this  using  the  (?p{code})  construct,
2492         which PCRE cannot support.)         which PCRE cannot support.)
2493    
2494         (h) PCRE supports named capturing substrings, using the Python  syntax.         (h)  PCRE supports named capturing substrings, using the Python syntax.
2495    
2496         (i)  PCRE  supports  the  possessive quantifier "++" syntax, taken from         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from
2497         Sun's Java package.         Sun's Java package.
2498    
2499         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j) The (R) condition, for testing recursion, is a PCRE extension.
# Line 2327  DIFFERENCES BETWEEN PCRE AND PERL Line 2505  DIFFERENCES BETWEEN PCRE AND PERL
2505         (m) Patterns compiled by PCRE can be saved and re-used at a later time,         (m) Patterns compiled by PCRE can be saved and re-used at a later time,
2506         even on different hosts that have the other endianness.         even on different hosts that have the other endianness.
2507    
2508         (n)  The  alternative  matching function (pcre_dfa_exec()) matches in a         (n) The alternative matching function (pcre_dfa_exec())  matches  in  a
2509         different way and is not Perl-compatible.         different way and is not Perl-compatible.
2510    
2511  Last updated: 28 February 2005  Last updated: 06 June 2006
2512  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
2513  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
2514    
2515    
# Line 2439  BACKSLASH Line 2617  BACKSLASH
2617    
2618         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in
2619         the pattern (other than in a character class) and characters between  a         the pattern (other than in a character class) and characters between  a
2620         # outside a character class and the next newline character are ignored.         # outside a character class and the next newline are ignored. An escap-
2621         An escaping backslash can be used to include a whitespace or #  charac-         ing backslash can be used to include a whitespace  or  #  character  as
2622         ter as part of the pattern.         part of the pattern.
2623    
2624         If  you  want  to remove the special meaning from a sequence of charac-         If  you  want  to remove the special meaning from a sequence of charac-
2625         ters, you can do so by putting them between \Q and \E. This is  differ-         ters, you can do so by putting them between \Q and \E. This is  differ-
# Line 2477  BACKSLASH Line 2655  BACKSLASH
2655           \t        tab (hex 09)           \t        tab (hex 09)
2656           \ddd      character with octal code ddd, or backreference           \ddd      character with octal code ddd, or backreference
2657           \xhh      character with hex code hh           \xhh      character with hex code hh
2658           \x{hhh..} character with hex code hhh... (UTF-8 mode only)           \x{hhh..} character with hex code hhh..
2659    
2660         The  precise  effect of \cx is as follows: if x is a lower case letter,         The  precise  effect of \cx is as follows: if x is a lower case letter,
2661         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it is converted to upper case. Then bit 6 of the character (hex 40)  is
# Line 2485  BACKSLASH Line 2663  BACKSLASH
2663         becomes hex 7B.         becomes hex 7B.
2664    
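The \cx rule can be written out as a short C function. This is a sketch that assumes ASCII character codes; control_escape_value is a hypothetical name, not a PCRE function.

```c
#include <assert.h>

/* Character produced by \cx: a lower case letter is first converted to
   upper case, then bit 6 (hex 40) of the character is inverted. */
int control_escape_value(int x)
{
    assert(x >= 0 && x < 128);
    if (x >= 'a' && x <= 'z')
        x -= 'a' - 'A';
    return x ^ 0x40;
}
```

With this rule, 'z' gives hex 1A, '{' gives hex 3B, and ';' gives hex 7B.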
2665         After \x, from zero to two hexadecimal digits are read (letters can  be         After \x, from zero to two hexadecimal digits are read (letters can  be
2666         in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-         in  upper  or  lower case). Any number of hexadecimal digits may appear
2667         its may appear between \x{ and }, but the value of the  character  code         between \x{ and }, but the value of the character  code  must  be  less
2668         must  be  less  than  2**31  (that is, the maximum hexadecimal value is         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2669         7FFFFFFF). If characters other than hexadecimal digits  appear  between         the maximum hexadecimal value is 7FFFFFFF). If  characters  other  than
2670         \x{  and }, or if there is no terminating }, this form of escape is not         hexadecimal  digits  appear between \x{ and }, or if there is no termi-
2671         recognized. Instead, the initial \x will  be  interpreted  as  a  basic         nating }, this form of escape is not recognized.  Instead, the  initial
2672         hexadecimal  escape, with no following digits, giving a character whose         \x will be interpreted as a basic hexadecimal escape, with no following
2673         value is zero.         digits, giving a character whose value is zero.
2674    
2675         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
2676         two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference         two  syntaxes  for  \x. There is no difference in the way they are han-
2677         in the way they are handled. For example, \xdc is exactly the  same  as         dled. For example, \xdc is exactly the same as \x{dc}.
2678         \x{dc}.  
2679           After \0 up to two further octal digits are read. If  there  are  fewer
2680         After  \0  up  to  two further octal digits are read. In both cases, if         than  two  digits,  just  those  that  are  present  are used. Thus the
2681         there are fewer than two digits, just those that are present are  used.         sequence \0\x\07 specifies two binary zeros followed by a BEL character
2682         Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL         (code  value 7). Make sure you supply two digits after the initial zero
2683         character (code value 7). Make sure you supply  two  digits  after  the         if the pattern character that follows is itself an octal digit.
        initial  zero  if the pattern character that follows is itself an octal  
        digit.  
2684    
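The reading rule for \0 can be sketched as follows. The function is a hypothetical illustration of the rule, not PCRE's own parser: it is handed the pattern text immediately after the "\0" and consumes at most two further octal digits.

```c
#include <assert.h>
#include <stddef.h>

/* Reads up to two octal digits following "\0" and returns the resulting
   character value. *digits_used receives how many digits were consumed;
   anything after them is ordinary pattern text. */
int read_zero_escape(const char *after_zero, int *digits_used)
{
    int value = 0, n = 0;
    assert(after_zero != NULL && digits_used != NULL);
    while (n < 2 && after_zero[n] >= '0' && after_zero[n] <= '7') {
        value = value * 8 + (after_zero[n] - '0');
        n++;
    }
    *digits_used = n;
    return value;
}
```

Given "7" the result is 7 (BEL) with one digit used; given "071" the result is also 7, with two digits used and the trailing "1" left as a literal, which is why two digits should be supplied when an octal digit follows.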
2685         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
2686         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
# Line 2516  BACKSLASH Line 2692  BACKSLASH
2692    
2693         Inside a character class, or if the decimal number is  greater  than  9         Inside a character class, or if the decimal number is  greater  than  9
2694         and  there have not been that many capturing subpatterns, PCRE re-reads         and  there have not been that many capturing subpatterns, PCRE re-reads
2695         up to three octal digits following the backslash, and generates a  sin-         up to three octal digits following the backslash, and uses them to gen-
2696         gle byte from the least significant 8 bits of the value. Any subsequent         erate  a data character. Any subsequent digits stand for themselves. In
2697         digits stand for themselves.  For example:         non-UTF-8 mode, the value of a character specified  in  octal  must  be
2698           less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
2699           example:
2700    
2701           \040   is another way of writing a space           \040   is another way of writing a space
2702           \40    is the same, provided there are fewer than 40           \40    is the same, provided there are fewer than 40
# Line 2538  BACKSLASH Line 2716  BACKSLASH
2716         Note that octal values of 100 or greater must not be  introduced  by  a         Note that octal values of 100 or greater must not be  introduced  by  a
2717         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
2718    
2719         All  the  sequences  that  define a single byte value or a single UTF-8         All the sequences that define a single character value can be used both
2720         character (in UTF-8 mode) can be used both inside and outside character         inside and outside character classes. In addition, inside  a  character
2721         classes.  In  addition,  inside  a  character class, the sequence \b is         class,  the  sequence \b is interpreted as the backspace character (hex
2722         interpreted as the backspace character (hex 08), and the sequence \X is         08), and the sequence \X is interpreted as the character "X". Outside a
2723         interpreted  as  the  character  "X".  Outside a character class, these         character class, these sequences have different meanings (see below).
        sequences have different meanings (see below).  
2724    
2725     Generic character types     Generic character types
2726    
2727         The third use of backslash is for specifying generic  character  types.         The  third  use of backslash is for specifying generic character types.
2728         The following are always recognized:         The following are always recognized:
2729    
2730           \d     any decimal digit           \d     any decimal digit
# Line 2558  BACKSLASH Line 2735  BACKSLASH
2735           \W     any "non-word" character           \W     any "non-word" character
2736    
2737         Each pair of escape sequences partitions the complete set of characters         Each pair of escape sequences partitions the complete set of characters
2738         into two disjoint sets. Any given character matches one, and only  one,         into  two disjoint sets. Any given character matches one, and only one,
2739         of each pair.         of each pair.
2740    
2741         These character type sequences can appear both inside and outside char-         These character type sequences can appear both inside and outside char-
2742         acter classes. They each match one character of the  appropriate  type.         acter  classes.  They each match one character of the appropriate type.
2743         If  the current matching point is at the end of the subject string, all         If the current matching point is at the end of the subject string,  all
2744         of them fail, since there is no character to match.         of them fail, since there is no character to match.
2745    
2746         For compatibility with Perl, \s does not match the VT  character  (code         For  compatibility  with Perl, \s does not match the VT character (code
2747         11).   This makes it different from the POSIX "space" class. The \s         11).  This makes it different from the POSIX "space" class. The  \s
2748         characters are HT (9), LF (10), FF (12), CR (13), and space (32).         characters  are  HT (9), LF (10), FF (12), CR (13), and space (32). (If
2749           "use locale;" is included in a Perl script, \s may match the VT charac-
2750           ter. In PCRE, it never does.)
2751    
2752         A "word" character is an underscore or any character less than 256 that         A "word" character is an underscore or any character less than 256 that
2753         is  a  letter  or  digit.  The definition of letters and digits is con-         is a letter or digit. The definition of  letters  and  digits  is  con-
2754         trolled by PCRE's low-valued character tables, and may vary if  locale-         trolled  by PCRE's low-valued character tables, and may vary if locale-
2755         specific  matching is taking place (see "Locale support" in the pcreapi         specific matching is taking place (see "Locale support" in the  pcreapi
2756         page). For example, in the  "fr_FR"  (French)  locale,  some  character         page).  For  example,  in  the  "fr_FR" (French) locale, some character
2757         codes  greater  than  128  are used for accented letters, and these are         codes greater than 128 are used for accented  letters,  and  these  are
2758         matched by \w.         matched by \w.
2759    
2760         In UTF-8 mode, characters with values greater than 128 never match  \d,         In  UTF-8 mode, characters with values greater than 128 never match \d,
2761         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-         \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2762         code character property support is available.         code  character  property support is available. The use of locales with
2763           Unicode is discouraged.
2764    
2765     Unicode character properties     Unicode character properties
2766    
2767         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
2768         tional  escape sequences to match generic character types are available         tional  escape  sequences  to  match character properties are available
2769         when UTF-8 mode is selected. They are:         when UTF-8 mode is selected. They are:
2770    
2771          \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
2772          \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
2773          \X       an extended Unicode sequence           \X       an extended Unicode sequence
2774    
2775         The property names represented by xx above are limited to  the  Unicode         The property names represented by xx above are limited to  the  Unicode
2776         general  category properties. Each character has exactly one such prop-         script names, the general category properties, and "Any", which matches
2777         erty, specified by a two-letter abbreviation.  For  compatibility  with         any character (including newline). Other properties such as "InMusical-
2778         Perl,  negation  can be specified by including a circumflex between the         Symbols"  are  not  currently supported by PCRE. Note that \P{Any} does
2779         opening brace and the property name. For example, \p{^Lu} is  the  same         not match any characters, so always causes a match failure.
2780         as \P{Lu}.  
2781           Sets of Unicode characters are defined as belonging to certain scripts.
2782         If  only  one  letter  is  specified with \p or \P, it includes all the         A  character from one of these sets can be matched using a script name.
2783         properties that start with that letter. In this case, in the absence of         For example:
2784         negation, the curly brackets in the escape sequence are optional; these  
2785         two examples have the same effect:           \p{Greek}
2786             \P{Han}
2787    
2788           Those that are not part of an identified script are lumped together  as
2789           "Common". The current list of scripts is:
2790    
2791           Arabic,  Armenian,  Bengali,  Bopomofo, Braille, Buginese, Buhid, Cana-
2792           dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic,  Deseret,
2793           Devanagari,  Ethiopic,  Georgian,  Glagolitic, Gothic, Greek, Gujarati,
2794           Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana,  Inherited,  Kannada,
2795           Katakana,  Kharoshthi,  Khmer,  Lao, Latin, Limbu, Linear_B, Malayalam,
2796           Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya,
2797           Osmanya,  Runic,  Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag-
2798           banwa,  Tai_Le,  Tamil,  Telugu,  Thaana,  Thai,   Tibetan,   Tifinagh,
2799           Ugaritic, Yi.
2800    
2801           Each  character has exactly one general category property, specified by
2802           a two-letter abbreviation. For compatibility with Perl, negation can be
2803           specified  by  including a circumflex between the opening brace and the
2804           property name. For example, \p{^Lu} is the same as \P{Lu}.
2805    
2806           If only one letter is specified with \p or \P, it includes all the gen-
2807           eral  category properties that start with that letter. In this case, in
2808           the absence of negation, the curly brackets in the escape sequence  are
2809           optional; these two examples have the same effect:
2810    
2811           \p{L}           \p{L}
2812           \pL           \pL
2813    
2814         The following property codes are supported:         The following general category property codes are supported:
2815    
2816           C     Other           C     Other
2817           Cc    Control           Cc    Control
# Line 2653  BACKSLASH Line 2857  BACKSLASH
2857           Zp    Paragraph separator           Zp    Paragraph separator
2858           Zs    Space separator           Zs    Space separator
2859    
2860         Extended properties such as "Greek" or "InMusicalSymbols" are not  sup-         The  special property L& is also supported: it matches a character that
2861         ported by PCRE.         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not
2862           classified as a modifier or "other".
2863    
2864           The  long  synonyms  for  these  properties that Perl supports (such as
2865           \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix
2866           any of these properties with "Is".
2867    
2868           No character that is in the Unicode table has the Cn (unassigned) prop-
2869           erty.  Instead, this property is assumed for any code point that is not
2870           in the Unicode table.
2871    
2872         Specifying  caseless  matching  does not affect these escape sequences.         Specifying  caseless  matching  does not affect these escape sequences.
2873         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
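
The general category rules above can be illustrated outside PCRE. The following is only a sketch in Python: has_property() is a hypothetical helper, not a PCRE function, and it relies on Python's own Unicode tables, which may be a different Unicode version from the tables built into PCRE. It shows that each character has exactly one two-letter category, that a one-letter code covers every category starting with that letter, and that L& covers only Lu, Ll, and Lt.

```python
import unicodedata


def has_property(ch, code):
    """Hypothetical emulation of \\p{code} for general category codes."""
    category = unicodedata.category(ch)  # always one two-letter code, e.g. "Lu"
    if code == "L&":
        # L& means a letter that is Lu, Ll or Lt (not a modifier or "other").
        return category in ("Lu", "Ll", "Lt")
    # A one-letter code such as "L" includes every category that starts
    # with it; a two-letter code must match exactly.
    return category.startswith(code)


upper_is_lu = has_property("A", "Lu")        # upper case letter
lower_is_lu = has_property("a", "Lu")        # lower case letter is Ll, not Lu
alef_is_l = has_property("\u05d0", "L")      # HEBREW LETTER ALEF, category Lo
alef_is_cased = has_property("\u05d0", "L&")  # Lo is excluded from L&
title_is_cased = has_property("\u01c5", "L&")  # a titlecase letter, category Lt

print(upper_is_lu, lower_is_lu, alef_is_l, alef_is_cased, title_is_cased)
```

The negated forms \P{xx} and \p{^xx} correspond to "not has_property(ch, code)" in this sketch.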
# Line 2707  BACKSLASH Line 2920  BACKSLASH
2920         However, if the startoffset argument of pcre_exec() is non-zero,  indi-         However, if the startoffset argument of pcre_exec() is non-zero,  indi-
2921         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
2922         the subject, \A can never match. The difference between \Z  and  \z  is         the subject, \A can never match. The difference between \Z  and  \z  is
2923         that  \Z  matches  before  a  newline that is the last character of the         that \Z matches before a newline at the end of the string as well as at
2924         string as well as at the end of the string, whereas \z matches only  at         the very end, whereas \z matches only at the end.
2925         the end.  
2926           The \G assertion is true only when the current matching position is  at
2927         The  \G assertion is true only when the current matching position is at         the  start point of the match, as specified by the startoffset argument
2928         the start point of the match, as specified by the startoffset  argument         of pcre_exec(). It differs from \A when the  value  of  startoffset  is
2929         of  pcre_exec().  It  differs  from \A when the value of startoffset is         non-zero.  By calling pcre_exec() multiple times with appropriate argu-
        non-zero. By calling pcre_exec() multiple times with appropriate  argu-  
2930         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
2931         mentation where \G can be useful.         mentation where \G can be useful.
2932    
2933         Note, however, that PCRE's interpretation of \G, as the  start  of  the         Note,  however,  that  PCRE's interpretation of \G, as the start of the
2934         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
2935         end of the previous match. In Perl, these can  be  different  when  the         end  of  the  previous  match. In Perl, these can be different when the
2936         previously  matched  string was empty. Because PCRE does just one match         previously matched string was empty. Because PCRE does just  one  match
2937         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
2938    
2939         If all the alternatives of a pattern begin with \G, the  expression  is         If  all  the alternatives of a pattern begin with \G, the expression is
2940         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
2941         in the compiled regular expression.         in the compiled regular expression.
2942    
# Line 2732  BACKSLASH Line 2944  BACKSLASH
2944  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
2945    
2946         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
2947         character  is  an  assertion  that is true only if the current matching         character is an assertion that is true only  if  the  current  matching
2948         point is at the start of the subject string. If the  startoffset  argu-         point  is  at the start of the subject string. If the startoffset argu-
2949         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the
2950         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex
2951         has an entirely different meaning (see below).         has an entirely different meaning (see below).
2952    
2953         Circumflex  need  not be the first character of the pattern if a number         Circumflex need not be the first character of the pattern if  a  number
2954         of alternatives are involved, but it should be the first thing in  each         of  alternatives are involved, but it should be the first thing in each
2955         alternative  in  which  it appears if the pattern is ever to match that         alternative in which it appears if the pattern is ever  to  match  that
2956         branch. If all possible alternatives start with a circumflex, that  is,         branch.  If all possible alternatives start with a circumflex, that is,
2957         if  the  pattern  is constrained to match only at the start of the sub-         if the pattern is constrained to match only at the start  of  the  sub-
2958         ject, it is said to be an "anchored" pattern.  (There  are  also  other         ject,  it  is  said  to be an "anchored" pattern. (There are also other
2959         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
2960    
2961         A  dollar  character  is  an assertion that is true only if the current         A dollar character is an assertion that is true  only  if  the  current
2962         matching point is at the end of  the  subject  string,  or  immediately         matching  point  is  at  the  end of the subject string, or immediately
2963         before a newline character that is the last character in the string (by         before a newline at the end of the string (by default). Dollar need not
2964         default). Dollar need not be the last character of  the  pattern  if  a         be  the  last  character of the pattern if a number of alternatives are
2965         number  of alternatives are involved, but it should be the last item in         involved, but it should be the last item in  any  branch  in  which  it
2966         any branch in which it appears.  Dollar has no  special  meaning  in  a         appears. Dollar has no special meaning in a character class.
        character class.  
2967    
2968         The  meaning  of  dollar  can be changed so that it matches only at the         The  meaning  of  dollar  can be changed so that it matches only at the
2969         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
2970         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
2971    
2972         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
2973         PCRE_MULTILINE option is set. When this is the case, they match immedi-         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
2974         ately  after  and  immediately  before  an  internal newline character,         matches  immediately after internal newlines as well as at the start of
2975         respectively, in addition to matching at the start and end of the  sub-         the subject string. It does not match after a  newline  that  ends  the
2976         ject  string.  For  example,  the  pattern  /^abc$/ matches the subject         string.  A dollar matches before any newlines in the string, as well as
2977         string "def\nabc" (where \n represents a newline character)  in  multi-         at the very end, when PCRE_MULTILINE is set. When newline is  specified
2978         line mode, but not otherwise.  Consequently, patterns that are anchored         as  the  two-character  sequence CRLF, isolated CR and LF characters do
2979         in single line mode because all branches start with ^ are not  anchored         not indicate newlines.
2980         in  multiline  mode,  and  a  match for circumflex is possible when the  
2981         startoffset  argument  of  pcre_exec()  is  non-zero.   The   PCRE_DOL-         For example, the pattern /^abc$/ matches the subject string  "def\nabc"
2982         LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.         (where  \n  represents a newline) in multiline mode, but not otherwise.
2983           Consequently, patterns that are anchored in single  line  mode  because
2984         Note  that  the sequences \A, \Z, and \z can be used to match the start         all  branches  start  with  ^ are not anchored in multiline mode, and a
2985         and end of the subject in both modes, and if all branches of a  pattern         match for circumflex is  possible  when  the  startoffset  argument  of
2986         start  with  \A it is always anchored, whether PCRE_MULTILINE is set or         pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
2987         not.         PCRE_MULTILINE is set.
2988    
2989           Note that the sequences \A, \Z, and \z can be used to match  the  start
2990           and  end of the subject in both modes, and if all branches of a pattern
2991           start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
2992           set.
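
The /^abc$/ example can be tried with any engine that shares these anchor semantics. The sketch below uses Python's re module rather than PCRE itself, on the assumption that its re.MULTILINE flag behaves like PCRE_MULTILINE for a subject that uses LF as the newline. It does not model the CRLF case or the startoffset argument of pcre_exec().

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches immediately after an internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# \A stays anchored to the start of the subject even in multiline mode.
start_only = re.search(r"\Aabc", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")

print(single_line, multi_line.span(), start_only, before_final_newline.span())
```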
2993    
2994    
2995  FULL STOP (PERIOD, DOT)  FULL STOP (PERIOD, DOT)
2996    
2997         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
2998         ter  in  the  subject,  including a non-printing character, but not (by         ter in the subject string except (by default) a character  that  signi-
2999         default) newline.  In UTF-8 mode, a dot matches  any  UTF-8  character,         fies  the  end  of  a line. In UTF-8 mode, the matched character may be
3000         which might be more than one byte long, except (by default) newline. If         more than one byte long. When a line ending  is  defined  as  a  single
3001         the PCRE_DOTALL option is set, dots match newlines as  well.  The  han-         character  (CR  or LF), dot never matches that character; when the two-
3002         dling  of dot is entirely independent of the handling of circumflex and         character sequence CRLF is used, dot does not match CR if it is immedi-
3003         dollar, the only relationship being  that  they  both  involve  newline         ately  followed by LF, but otherwise it matches all characters (includ-
3004         characters. Dot has no special meaning in a character class.         ing isolated CRs and LFs).
3005    
3006           The behaviour of dot with regard to newlines can  be  changed.  If  the
3007           PCRE_DOTALL  option  is  set,  a dot matches any one character, without
3008           exception. If newline is defined as the two-character sequence CRLF, it
3009           takes two dots to match it.
3010    
3011           The  handling of dot is entirely independent of the handling of circum-
3012           flex and dollar, the only relationship being  that  they  both  involve
3013           newlines. Dot has no special meaning in a character class.
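
As a rough illustration of the default and PCRE_DOTALL behaviour of dot, here is a sketch using Python's re module, whose re.DOTALL flag is assumed to play the role of PCRE_DOTALL. Python treats only LF as the line ending for this purpose, so the CR and CRLF conventions described above are not reproduced.

```python
import re

# By default a dot does not match the newline character.
default_dot = re.fullmatch(r"a.c", "a\nc")

# With the "dot matches all" option set, a dot matches any one character.
dotall_dot = re.fullmatch(r"a.c", "a\nc", re.DOTALL)

# Other non-printing characters, such as a tab, are matched in either mode.
tab_dot = re.fullmatch(r"a.c", "a\tc")

print(default_dot, dotall_dot is not None, tab_dot is not None)
```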
3014    
3015    
3016  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
3017    
3018         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
3019         both in and out of UTF-8 mode. Unlike a dot, it can  match  a  newline.         both in and out of UTF-8 mode. Unlike a dot, it always matches  CR  and
3020         The  feature  is provided in Perl in order to match individual bytes in         LF.  The feature is provided in Perl in order to match individual bytes
3021         UTF-8 mode. Because it  breaks  up  UTF-8  characters  into  individual         in UTF-8 mode.  Because it breaks up UTF-8 characters  into  individual
3022         bytes,  what remains in the string may be a malformed UTF-8 string. For         bytes,  what remains in the string may be a malformed UTF-8 string. For
3023         this reason, the \C escape sequence is best avoided.         this reason, the \C escape sequence is best avoided.
3024    
# Line 2842  SQUARE BRACKETS AND CHARACTER CLASSES Line 3067  SQUARE BRACKETS AND CHARACTER CLASSES
3067         PCRE  is  compiled  with Unicode property support as well as with UTF-8         PCRE  is  compiled  with Unicode property support as well as with UTF-8
3068         support.         support.
3069    
3070         The newline character is never treated in any special way in  character         Characters that might indicate  line  breaks  (CR  and  LF)  are  never
3071         classes,  whatever  the  setting  of  the PCRE_DOTALL or PCRE_MULTILINE         treated  in  any  special way when matching character classes, whatever
3072         options is. A class such as [^a] will always match a newline.         line-ending sequence is in use, and whatever setting of the PCRE_DOTALL
3073           and PCRE_MULTILINE options is used. A class such as [^a] always matches
3074           one of these characters.
3075    
3076         The minus (hyphen) character can be used to specify a range of  charac-         The minus (hyphen) character can be used to specify a range of  charac-
3077         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters  in  a  character  class.  For  example,  [d-m] matches any letter
# Line 2945  VERTICAL BAR Line 3172  VERTICAL BAR
3172    
3173         matches  either "gilbert" or "sullivan". Any number of alternatives may         matches  either "gilbert" or "sullivan". Any number of alternatives may
3174         appear, and an empty  alternative  is  permitted  (matching  the  empty         appear, and an empty  alternative  is  permitted  (matching  the  empty
3175         string).   The  matching  process  tries each alternative in turn, from         string). The matching process tries each alternative in turn, from left
3176         left to right, and the first one that succeeds is used. If the alterna-         to right, and the first one that succeeds is used. If the  alternatives
3177         tives  are within a subpattern (defined below), "succeeds" means match-         are  within a subpattern (defined below), "succeeds" means matching the
3178         ing the rest of the main pattern as well as the alternative in the sub-         rest of the main pattern as well as the alternative in the  subpattern.
        pattern.  
3179    
3180    
3181  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
# Line 2995  INTERNAL OPTION SETTING Line 3221  INTERNAL OPTION SETTING
3221         the effects of option settings happen at compile time. There  would  be         the effects of option settings happen at compile time. There  would  be
3222         some very weird behaviour otherwise.         some very weird behaviour otherwise.
3223    
3224         The  PCRE-specific  options PCRE_UNGREEDY and PCRE_EXTRA can be changed         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
3225         in the same way as the Perl-compatible options by using the  characters         can be changed in the same way as the Perl-compatible options by  using
3226         U  and X respectively. The (?X) flag setting is special in that it must         the characters J, U and X respectively.
        always occur earlier in the pattern than any of the additional features  
        it  turns on, even when it is at top level. It is best to put it at the  
        start.  
3227    
3228    
3229  SUBPATTERNS  SUBPATTERNS
# Line 3012  SUBPATTERNS Line 3235  SUBPATTERNS
3235    
3236           cat(aract|erpillar|)           cat(aract|erpillar|)
3237    
3238         matches  one  of the words "cat", "cataract", or "caterpillar". Without         matches one of the words "cat", "cataract", or  "caterpillar".  Without
3239         the parentheses, it would match "cataract",  "erpillar"  or  the  empty         the  parentheses,  it  would  match "cataract", "erpillar" or the empty
3240         string.         string.
3241    
3242         2.  It  sets  up  the  subpattern as a capturing subpattern. This means         2. It sets up the subpattern as  a  capturing  subpattern.  This  means
3243         that, when the whole pattern  matches,  that  portion  of  the  subject         that,  when  the  whole  pattern  matches,  that portion of the subject
3244         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
3245         ovector argument of pcre_exec(). Opening parentheses are  counted  from         ovector  argument  of pcre_exec(). Opening parentheses are counted from
3246         left  to  right  (starting  from 1) to obtain numbers for the capturing         left to right (starting from 1) to obtain  numbers  for  the  capturing
3247         subpatterns.         subpatterns.
3248    
3249         For example, if the string "the red king" is matched against  the  pat-         For  example,  if the string "the red king" is matched against the pat-
3250         tern         tern
3251    
3252           the ((red|white) (king|queen))           the ((red|white) (king|queen))
# Line 3031  SUBPATTERNS Line 3254  SUBPATTERNS
3254         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
3255         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
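
Both functions of parentheses can be checked with a short sketch. It uses Python's re module instead of the PCRE API, so the captured substrings are read from a match object rather than from the ovector argument of pcre_exec(). The variable names are illustrative only.

```python
import re

# 1. Grouping: the parentheses localize the set of alternatives.
grouped = re.compile(r"cat(aract|erpillar|)")
ungrouped = re.compile(r"cataract|erpillar|")

grouped_hits = [w for w in ["cat", "cataract", "caterpillar"]
                if grouped.fullmatch(w)]
ungrouped_hits = [w for w in ["cataract", "erpillar", "", "cat"]
                  if ungrouped.fullmatch(w)]

# 2. Capturing: opening parentheses are counted from left to right,
#    starting from 1, to number the captured substrings.
match = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captures = match.groups()

print(grouped_hits)    # all three words match
print(ungrouped_hits)  # "cat" does not match without the parentheses
print(captures)        # substrings numbered 1, 2 and 3
```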
3256    
3257         The fact that plain parentheses fulfil  two  functions  is  not  always         The  fact  that  plain  parentheses  fulfil two functions is not always
3258         helpful.   There are often times when a grouping subpattern is required         helpful.  There are often times when a grouping subpattern is  required
3259         without a capturing requirement. If an opening parenthesis is  followed         without  a capturing requirement. If an opening parenthesis is followed
3260         by  a question mark and a colon, the subpattern does not do any captur-         by a question mark and a colon, the subpattern does not do any  captur-
3261         ing, and is not counted when computing the  number  of  any  subsequent         ing,  and  is  not  counted when computing the number of any subsequent
3262         capturing  subpatterns. For example, if the string "the white queen" is         capturing subpatterns. For example, if the string "the white queen"  is
3263         matched against the pattern         matched against the pattern
3264    
3265           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
3266    
3267         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
3268         1  and 2. The maximum number of capturing subpatterns is 65535, and the         1 and 2. The maximum number of capturing subpatterns is 65535, and  the
3269         maximum depth of nesting of all subpatterns, both  capturing  and  non-         maximum  depth  of  nesting of all subpatterns, both capturing and non-
3270         capturing, is 200.         capturing, is 200.
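
The effect of (?: on numbering can be seen in this sketch, again using Python's re module as a stand-in for PCRE. The non-capturing subpattern is skipped when the remaining parentheses are counted.

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
match = pattern.fullmatch("the white queen")

capture_count = pattern.groups   # number of capturing subpatterns
captures = match.groups()        # substrings numbered 1 and 2

print(capture_count, captures)
```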
3271    
3272         As  a  convenient shorthand, if any option settings are required at the         As a convenient shorthand, if any option settings are required  at  the
3273         start of a non-capturing subpattern,  the  option  letters  may  appear         start  of  a  non-capturing  subpattern,  the option letters may appear
3274         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
3275    
3276           (?i:saturday|sunday)           (?i:saturday|sunday)
3277           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
3278    
3279         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
3280         tried from left to right, and options are not reset until  the  end  of         tried  from  left  to right, and options are not reset until the end of
3281         the  subpattern is reached, an option setting in one branch does affect         the subpattern is reached, an option setting in one branch does  affect
3282         subsequent branches, so the above patterns match "SUNDAY"  as  well  as         subsequent  branches,  so  the above patterns match "SUNDAY" as well as
3283         "Saturday".         "Saturday".
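
A minimal sketch of the shorthand form, using Python's re module. Only (?i:...) is exercised here, because recent versions of Python's re module reject an option setting such as (?i) that is not at the start of the pattern. The second form shown above is therefore not reproduced. Support for (?i:...) in Python 3.6 or later is assumed.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday)")

# The option applies to every branch inside the subpattern.
matches_upper = weekend.fullmatch("SUNDAY") is not None
matches_mixed = weekend.fullmatch("Saturday") is not None

# Outside the subpattern, matching is still case-sensitive.
partly_caseless = re.compile(r"(?i:sat)urday")
inside_only = partly_caseless.fullmatch("SATurday") is not None
outside_too = partly_caseless.fullmatch("SATURDAY") is not None

print(matches_upper, matches_mixed, inside_only, outside_too)
```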
3284    
3285    
3286  NAMED SUBPATTERNS  NAMED SUBPATTERNS
3287    
3288         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying capturing parentheses by number is simple, but  it  can  be
3289         very hard to keep track of the numbers in complicated  regular  expres-         very  hard  to keep track of the numbers in complicated regular expres-
3290         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions. Furthermore, if an  expression  is  modified,  the  numbers  may
3291         change. To help with this difficulty, PCRE supports the naming of  sub-         change.  To help with this difficulty, PCRE supports the naming of sub-
3292         patterns,  something  that  Perl  does  not  provide. The Python syntax         patterns, something that Perl  does  not  provide.  The  Python  syntax
3293         (?P<name>...) is used. Names consist  of  alphanumeric  characters  and         (?P<name>...)  is  used. References to capturing parentheses from other
3294         underscores, and must be unique within a pattern.         parts of the pattern, such as  backreferences,  recursion,  and  condi-
3295           tions, can be made by name as well as by number.
3296    
3297         Named  capturing  parentheses  are  still  allocated numbers as well as         Names  consist  of  up  to  32 alphanumeric characters and underscores.
3298           Named capturing parentheses are still  allocated  numbers  as  well  as
3299         names. The PCRE API provides function calls for extracting the name-to-         names. The PCRE API provides function calls for extracting the name-to-
3300         number  translation table from a compiled pattern. There is also a con-         number translation table from a compiled pattern. There is also a  con-
3301         venience function for extracting a captured substring by name. For fur-         venience function for extracting a captured substring by name.
3302         ther details see the pcreapi documentation.  
3303           By  default, a name must be unique within a pattern, but it is possible
3304           to relax this constraint by setting the PCRE_DUPNAMES option at compile
3305           time.  This  can  be useful for patterns where only one instance of the
3306           named parentheses can match. Suppose you want to match the  name  of  a
3307           weekday,  either as a 3-letter abbreviation or as the full name, and in
3308           both cases you want to extract the abbreviation. This pattern (ignoring
3309           the line breaks) does the job:
3310    
3311             (?P<DN>Mon|Fri|Sun)(?:day)?|
3312             (?P<DN>Tue)(?:sday)?|
3313             (?P<DN>Wed)(?:nesday)?|
3314             (?P<DN>Thu)(?:rsday)?|
3315             (?P<DN>Sat)(?:urday)?
3316    
3317           There  are  five capturing substrings, but only one is ever set after a
3318           match.  The convenience  function  for  extracting  the  data  by  name
3319           returns  the  substring  for  the first, and in this example, the only,
3320           subpattern of that name that matched.  This  saves  searching  to  find
3321           which  numbered  subpattern  it  was. If you make a reference to a non-
3322           unique named subpattern from elsewhere in the  pattern,  the  one  that
3323           corresponds  to  the  lowest number is used. For further details of the
3324           interfaces for handling named subpatterns, see the  pcreapi  documenta-
3325           tion.
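
The "first subpattern of that name that matched" rule can be illustrated without calling PCRE at all. The sketch below assumes a simplified name table (the struct is hypothetical and is not PCRE's real table layout) and an offsets vector in which group n occupies elements 2n and 2n+1, with -1 marking a group that did not take part in the match. It scans the entries in table order, which is assumed here to be ascending group number.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified name table entry (not PCRE's real layout). */
struct named_group {
    const char *name;   /* subpattern name, for example "DN" */
    int number;         /* capturing group number, starting at 1 */
};

/* The five (?P<DN>...) groups of the weekday pattern, in group order. */
static const struct named_group weekday_groups[] = {
    {"DN", 1}, {"DN", 2}, {"DN", 3}, {"DN", 4}, {"DN", 5}
};

/* Returns the number of the first group called name that is set in
   ovector, or -1 if no group of that name took part in the match. */
int first_set_group(const struct named_group *table, size_t count,
                    const char *name, const int *ovector)
{
    size_t i;

    for (i = 0; i < count; i++) {
        int n = table[i].number;
        if (strcmp(table[i].name, name) == 0 && ovector[2 * n] >= 0)
            return n;
    }
    return -1;
}
```

For the subject "Wednesday", only group 3 is set, so the lookup yields 3 and the caller never has to probe groups 1 to 5 by hand.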
3326    
3327    
3328  REPETITION  REPETITION
# Line 3283  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 3531  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
3531         meaning  or  processing  of  a possessive quantifier and the equivalent         meaning  or  processing  of  a possessive quantifier and the equivalent
3532         atomic group.         atomic group.
3533    
3534         The possessive quantifier syntax is an extension to the Perl syntax. It         The possessive quantifier syntax is an extension to  the  Perl  syntax.
3535         originates in Sun's Java package.         Jeffrey  Friedl originated the idea (and the name) in the first edition
3536           of his book.  Mike McCloskey liked it, so implemented it when he  built
3537           Sun's Java package, and PCRE copied it from there.
3538    
3539         When  a  pattern  contains an unlimited repeat inside a subpattern that         When  a  pattern  contains an unlimited repeat inside a subpattern that
3540         can itself be repeated an unlimited number of  times,  the  use  of  an         can itself be repeated an unlimited number of  times,  the  use  of  an
# Line 3325  BACK REFERENCES Line 3575  BACK REFERENCES
3575         it  is  always  taken  as a back reference, and causes an error only if         it  is  always  taken  as a back reference, and causes an error only if
3576         there are not that many capturing left parentheses in the  entire  pat-         there are not that many capturing left parentheses in the  entire  pat-
3577         tern.  In  other words, the parentheses that are referenced need not be         tern.  In  other words, the parentheses that are referenced need not be
3578         to the left of the reference for numbers less than 10. See the  subsec-         to the left of the reference for numbers less than 10. A "forward  back
3579         tion  entitled  "Non-printing  characters" above for further details of         reference"  of  this  type can make sense when a repetition is involved
3580         the handling of digits following a backslash.         and the subpattern to the right has participated in an  earlier  itera-
3581           tion.
3582    
3583           It is not possible to have a numerical "forward back reference" to a
3584           subpattern whose number is 10 or more. However, a back reference to any
3585           subpattern  is  possible  using named parentheses (see below). See also
3586           the subsection entitled "Non-printing  characters"  above  for  further
3587           details of the handling of digits following a backslash.
3588    
3589         A back reference matches whatever actually matched the  capturing  sub-         A  back  reference matches whatever actually matched the capturing sub-
3590         pattern  in  the  current subject string, rather than anything matching         pattern in the current subject string, rather  than  anything  matching
3591         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
3592         of doing that). So the pattern         of doing that). So the pattern
3593    
3594           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
3595    
3596         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
3597         not "sense and responsibility". If caseful matching is in force at  the         not  "sense and responsibility". If caseful matching is in force at the
3598         time  of the back reference, the case of letters is relevant. For exam-         time of the back reference, the case of letters is relevant. For  exam-
3599         ple,         ple,
3600    
3601           ((?i)rah)\s+\1           ((?i)rah)\s+\1
3602    
3603         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
3604         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
3605    
3606         Back  references  to named subpatterns use the Python syntax (?P=name).         Back references to named subpatterns use the Python  syntax  (?P=name).
3607         We could rewrite the above example as follows:         We could rewrite the above example as follows:
3608    
3609           (?<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
3610    
3611           A  subpattern  that  is  referenced  by  name may appear in the pattern
3612           before or after the reference.
3613    
3614         There may be more than one back reference to the same subpattern. If  a         There may be more than one back reference to the same subpattern. If  a
3615         subpattern  has  not actually been used in a particular match, any back         subpattern  has  not actually been used in a particular match, any back
# Line 3438  ASSERTIONS Line 3698  ASSERTIONS
3698         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
3699         contents of a lookbehind assertion are restricted  such  that  all  the         contents of a lookbehind assertion are restricted  such  that  all  the
3700         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
3701         eral alternatives, they do not all have to have the same fixed  length.         eral top-level alternatives, they do not all  have  to  have  the  same
3702         Thus         fixed length. Thus
3703    
3704           (?<=bullock|donkey)           (?<=bullock|donkey)
3705    
# Line 3552  CONDITIONAL SUBPATTERNS Line 3812  CONDITIONAL SUBPATTERNS
3812         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
3813    
3814         There are three kinds of condition. If the text between the parentheses         There are three kinds of condition. If the text between the parentheses
3815         consists of a sequence of digits, the condition  is  satisfied  if  the         consists of a sequence of digits, or a sequence of alphanumeric charac-
3816         capturing  subpattern of that number has previously matched. The number         ters  and underscores, the condition is satisfied if the capturing sub-
3817         must be greater than zero. Consider the following pattern,  which  con-         pattern of that number or name has previously matched. There is a  pos-
3818         tains  non-significant white space to make it more readable (assume the         sible  ambiguity here, because subpattern names may consist entirely of
3819         PCRE_EXTENDED option) and to divide it into three  parts  for  ease  of         digits. PCRE looks first for a named subpattern; if it cannot find  one
3820         discussion:         and  the text consists entirely of digits, it looks for a subpattern of
3821           that number, which must be greater than zero.  Using  subpattern  names
3822           that consist entirely of digits is not recommended.
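
The lookup order described above (name first, then number) can be sketched in C. This is an illustration of the stated rule only, not PCRE's compile-time code; the struct, the example table, and the function name are all hypothetical. The example table pretends that group 3 of some pattern has been given the all-digit name "2", which is exactly the situation the text advises against.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified name table entry (not PCRE's real layout). */
struct group_name {
    const char *name;
    int number;
};

/* Example table: group 1 is called OPEN, group 3 is called "2". */
static const struct group_name example_names[] = {
    {"OPEN", 1}, {"2", 3}
};

/* Returns the group number that the condition text refers to,
   or -1 if it names no group and is not a positive number. */
int resolve_condition(const struct group_name *table, size_t count,
                      const char *text)
{
    size_t i;
    long n;

    /* Step 1: a named subpattern wins, even if the name is all digits. */
    for (i = 0; i < count; i++)
        if (strcmp(table[i].name, text) == 0)
            return table[i].number;

    /* Step 2: otherwise the text must consist entirely of digits. */
    if (text[0] == '\0' || strspn(text, "0123456789") != strlen(text))
        return -1;

    /* The number must be greater than zero; 65535 is the documented
       maximum number of capturing subpatterns. */
    n = strtol(text, NULL, 10);
    return (n > 0 && n <= 65535) ? (int)n : -1;
}
```

With the example table, the condition text "2" resolves to group 3 rather than group 2, which shows why all-digit names invite confusion.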
3823    
3824           Consider  the  following  pattern, which contains non-significant white
3825           space to make it more readable (assume the PCRE_EXTENDED option) and to
3826           divide it into three parts for ease of discussion:
3827    
3828           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
3829    
# Line 3570  CONDITIONAL SUBPATTERNS Line 3836  CONDITIONAL SUBPATTERNS
3836         tern  is  executed  and  a  closing parenthesis is required. Otherwise,         tern  is  executed  and  a  closing parenthesis is required. Otherwise,
3837         since no-pattern is not present, the  subpattern  matches  nothing.  In         since no-pattern is not present, the  subpattern  matches  nothing.  In
3838         other  words,  this  pattern  matches  a  sequence  of non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
3839         optionally enclosed in parentheses.         optionally enclosed in parentheses. Rewriting it to use a named subpat-
3840           tern gives this:
3841    
3842             (?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )
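
To make the behaviour of the conditional concrete, here is a hand-written C function that accepts the same strings as the pattern above when the match is anchored at the start of the subject. It is a sketch for illustration, not generated from PCRE; the function name is hypothetical, and anchoring at offset 0 is an assumption (an unanchored match could also succeed further along the subject).

```c
/* Hypothetical helper: returns the length matched at the start of s
   by ( \( )? [^()]+ (?(1) \) ), or -1 if there is no match there. */
int match_optional_parens(const char *s)
{
    int i = 0;
    int open = (s[0] == '(');   /* ( \( )? : group 1 is set if '(' is seen */
    int start;

    if (open) i++;
    start = i;

    /* [^()]+ : one or more characters that are not parentheses */
    while (s[i] != '\0' && s[i] != '(' && s[i] != ')')
        i++;
    if (i == start) return -1;

    /* (?(1) \) ) : a closing parenthesis is required only if group 1 matched */
    if (open) {
        if (s[i] != ')') return -1;
        i++;
    }
    return i;
}
```

Backtracking cannot rescue a failed attempt here: if the opening parenthesis is skipped, the character class has nothing to match, and if the run of non-parentheses is shortened, the next character is still not a closing parenthesis.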
3843    
3844         If the condition is the string (R), it is satisfied if a recursive call         If the condition is the string (R), and there is no subpattern with the
3845         to  the pattern or subpattern has been made. At "top level", the condi-         name R, the condition is satisfied if a recursive call to  the  pattern
3846         tion is false.  This  is  a  PCRE  extension.  Recursive  patterns  are         or  subpattern  has  been made. At "top level", the condition is false.
3847         described in the next section.         This is a PCRE extension.  Recursive patterns are described in the next
3848           section.
3849    
3850         If  the  condition  is  not  a sequence of digits or (R), it must be an         If  the  condition  is  not  a sequence of digits or (R), it must be an
3851         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.  This may be a positive or negative lookahead or  lookbehind
# Line 3602  COMMENTS Line 3872  COMMENTS
3872         at all.         at all.
3873    
3874         If  the PCRE_EXTENDED option is set, an unescaped # character outside a         If  the PCRE_EXTENDED option is set, an unescaped # character outside a
3875         character class introduces a comment that continues up to the next new-         character class introduces a  comment  that  continues  to  immediately
3876         line character in the pattern.         after the next newline in the pattern.
3877    
3878    
3879  RECURSIVE PATTERNS  RECURSIVE PATTERNS
# Line 3633  RECURSIVE PATTERNS Line 3903  RECURSIVE PATTERNS
3903         tion.)  The special item (?R) is a recursive call of the entire regular         tion.)  The special item (?R) is a recursive call of the entire regular
3904         expression.         expression.
3905    
3906         For example, this PCRE pattern solves the  nested  parentheses  problem         A recursive subpattern call is always treated as an atomic group.  That
3907         (assume  the  PCRE_EXTENDED  option  is  set  so  that  white  space is         is,  once  it  has  matched some of the subject string, it is never re-
3908         ignored):         entered, even if it contains untried alternatives and there is a subse-
3909           quent matching failure.
3910    
3911           This  PCRE  pattern  solves  the nested parentheses problem (assume the
3912           PCRE_EXTENDED option is set so that white space is ignored):
3913    
3914           \( ( (?>[^()]+) | (?R) )* \)           \( ( (?>[^()]+) | (?R) )* \)
3915    
3916         First it matches an opening parenthesis. Then it matches any number  of         First it matches an opening parenthesis. Then it matches any number  of
3917         substrings  which  can  either  be  a sequence of non-parentheses, or a         substrings  which  can  either  be  a sequence of non-parentheses, or a
3918         recursive match of the pattern itself (that is  a  correctly  parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
3919         sized substring).  Finally there is a closing parenthesis.         sized substring).  Finally there is a closing parenthesis.
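
The structure of the recursive pattern maps directly onto a recursive C function. The sketch below is a hand-written equivalent for a match anchored at the start of the subject; it is not how PCRE executes the pattern, and the function name is hypothetical. Each part of the pattern is marked in a comment.

```c
/* Hypothetical helper: returns the length of the correctly
   parenthesized substring at the start of s, or -1 if none. */
int match_balanced(const char *s)
{
    int i = 0;

    if (s[i] != '(') return -1;                 /* \( */
    i++;

    for (;;) {                                  /* ( ... | ... )* */
        if (s[i] != '\0' && s[i] != '(' && s[i] != ')') {
            /* (?>[^()]+) : swallow a whole run of non-parentheses */
            while (s[i] != '\0' && s[i] != '(' && s[i] != ')')
                i++;
        } else if (s[i] == '(') {
            int n = match_balanced(s + i);      /* (?R) */
            if (n < 0) return -1;
            i += n;
        } else {
            break;                              /* no alternative applies */
        }
    }

    if (s[i] != ')') return -1;                 /* \) */
    return i + 1;
}
```

The function never needs to undo a nested call, which mirrors the statement that a recursive call is treated as an atomic group: once the inner match has succeeded it is not re-entered.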
3920    
3921         If  this  were  part of a larger pattern, you would not want to recurse         If  this  were  part of a larger pattern, you would not want to recurse
# Line 3722  SUBPATTERNS AS SUBROUTINES Line 3996  SUBPATTERNS AS SUBROUTINES
3996           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
3997    
3998         is  used, it does match "sense and responsibility" as well as the other         is  used, it does match "sense and responsibility" as well as the other
3999         two strings. Such references must, however, follow  the  subpattern  to         two strings. Such references, if given  numerically,  must  follow  the
4000         which they refer.         subpattern  to which they refer. However, named references can refer to
4001           later subpatterns.
4002    
4003           Like recursive subpatterns, a "subroutine" call is always treated as an
4004           atomic  group. That is, once it has matched some of the subject string,
4005           it is never re-entered, even if it contains  untried  alternatives  and
4006           there is a subsequent matching failure.
4007    
4008    
4009  CALLOUTS  CALLOUTS
# Line 3760  CALLOUTS Line 4040  CALLOUTS
4040         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
4041         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
4042    
4043  Last updated: 28 February 2005  Last updated: 06 June 2006
4044  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
4045  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4046    
4047    
# Line 3851  EXAMPLE OF PARTIAL MATCHING USING PCRETE Line 4131  EXAMPLE OF PARTIAL MATCHING USING PCRETE
4131         uses the date example quoted above:         uses the date example quoted above:
4132    
4133             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/             re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4134           data> 25jun04P           data> 25jun04\P
4135            0: 25jun04            0: 25jun04
4136            1: jun            1: jun
4137           data> 25dec3P           data> 25dec3\P
4138           Partial match           Partial match
4139           data> 3juP           data> 3ju\P
4140           Partial match           Partial match
4141           data> 3jujP           data> 3juj\P
4142           No match           No match
4143           data> jP           data> j\P
4144           No match           No match
4145    
4146         The first data string is matched  completely,  so  pcretest  shows  the         The first data string is matched  completely,  so  pcretest  shows  the
# Line 3950  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe Line 4230  MULTI-SEGMENT MATCHING WITH pcre_dfa_exe
4230         Because of this phenomenon, it does not usually make  sense  to  end  a         Because of this phenomenon, it does not usually make  sense  to  end  a
4231         pattern that is going to be matched in this way with a variable repeat.         pattern that is going to be matched in this way with a variable repeat.
4232    
4233  Last updated: 28 February 2005         4. Patterns that contain alternatives at the top level which do not all
4234  Copyright (c) 1997-2005 University of Cambridge.         start with the same pattern item may not work as expected. For example,
4235           consider this pattern:
4236    
4237             1234|3789
4238    
4239           If the first part of the subject is "ABC123", a partial  match  of  the
4240           first  alternative  is found at offset 3. There is no partial match for
4241           the second alternative, because such a match does not start at the same
4242           point  in  the  subject  string. Attempting to continue with the string
4243           "789" does not yield a match because only those alternatives that match
4244           at  one point in the subject are remembered. The problem arises because
4245           the start of the second alternative matches within the  first  alterna-
4246           tive. There is no problem with anchored patterns or patterns such as:
4247    
4248             1234|ABCD
4249    
4250           where no string can be a partial match for both alternatives.
4251    
4252    Last updated: 16 January 2006
4253    Copyright (c) 1997-2006 University of Cambridge.
4254  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4255    
4256    
# Line 4065  COMPATIBILITY WITH DIFFERENT PCRE RELEAS Line 4364  COMPATIBILITY WITH DIFFERENT PCRE RELEAS
4364         them for release 5.0. However, from now on, it should  be  possible  to         them for release 5.0. However, from now on, it should  be  possible  to
4365         make changes in a compatible manner.         make changes in a compatible manner.
4366    
4367  Last updated: 28 February 2005         Notwithstanding the above, if you have any saved patterns in UTF-8 mode
4368  Copyright (c) 1997-2005 University of Cambridge.         that use \p or \P that were compiled with any release up to and includ-
4369           ing 6.4, you will have to recompile them for release 6.5 and above.
4370    
4371    Last updated: 01 February 2006
4372    Copyright (c) 1997-2006 University of Cambridge.
4373  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4374    
4375    
# Line 4191  DESCRIPTION Line 4494  DESCRIPTION
4494         functions call the native ones, it is also necessary to add -lpcre.         functions call the native ones, it is also necessary to add -lpcre.
4495    
4496         I have implemented only those option bits that can be reasonably mapped         I have implemented only those option bits that can be reasonably mapped
4497         to  PCRE  native  options.  In  addition,  the options REG_EXTENDED and         to PCRE native options. In addition, the option REG_EXTENDED is defined
4498         REG_NOSUB are defined with the value zero. They  have  no  effect,  but         with the value zero. This has no effect, but since  programs  that  are
4499         since  programs that are written to the POSIX interface often use them,         written  to  the  POSIX interface often use it, this makes it easier to
4500         this makes it easier to slot in PCRE as a  replacement  library.  Other         slot in PCRE as a replacement library. Other POSIX options are not even
4501         POSIX options are not even defined.         defined.
4502    
4503         When  PCRE  is  called  via these functions, it is only the API that is         When  PCRE  is  called  via these functions, it is only the API that is
4504         POSIX-like in style. The syntax and semantics of  the  regular  expres-         POSIX-like in style. The syntax and semantics of  the  regular  expres-
# Line 4220  COMPILING A PATTERN Line 4523  COMPILING A PATTERN
4523         form. The pattern is a C string terminated by a  binary  zero,  and  is         form. The pattern is a C string terminated by a  binary  zero,  and  is
4524         passed  in  the  argument  pattern. The preg argument is a pointer to a         passed  in  the  argument  pattern. The preg argument is a pointer to a
4525         regex_t structure that is used as a base for storing information  about         regex_t structure that is used as a base for storing information  about
4526         the compiled expression.         the compiled regular expression.
4527    
4528         The argument cflags is either zero, or contains one or more of the bits         The argument cflags is either zero, or contains one or more of the bits
4529         defined by the following macros:         defined by the following macros:
4530    
4531           REG_DOTALL           REG_DOTALL
4532    
4533         The PCRE_DOTALL option is set when the expression is passed for  compi-         The PCRE_DOTALL option is set when the regular expression is passed for
4534         lation  to the native function. Note that REG_DOTALL is not part of the         compilation to the native function. Note that REG_DOTALL is not part of
4535         POSIX standard.         the POSIX standard.
4536    
4537           REG_ICASE           REG_ICASE
4538    
4539         The PCRE_CASELESS option is set when the expression is passed for  com-         The PCRE_CASELESS option is set when the regular expression  is  passed
4540         pilation to the native function.         for compilation to the native function.
4541    
4542           REG_NEWLINE           REG_NEWLINE
4543    
4544         The PCRE_MULTILINE option is set when the expression is passed for com-         The  PCRE_MULTILINE option is set when the regular expression is passed
4545         pilation to the native function. Note that  this  does  not  mimic  the         for compilation to the native function. Note that this does  not  mimic
4546         defined POSIX behaviour for REG_NEWLINE (see the following section).         the  defined  POSIX  behaviour  for REG_NEWLINE (see the following sec-
4547           tion).
4548    
4549             REG_NOSUB
4550    
4551           The PCRE_NO_AUTO_CAPTURE option is set when the regular  expression  is
4552           passed for compilation to the native function. In addition, when a pat-
4553           tern that is compiled with this flag is passed to regexec() for  match-
4554           ing,  the  nmatch  and  pmatch  arguments  are ignored, and no captured
4555           strings are returned.
4556    
4557             REG_UTF8
4558    
4559           The PCRE_UTF8 option is set when the regular expression is  passed  for
4560           compilation  to the native function. This causes the pattern itself and
4561           all data strings used for matching it to be treated as  UTF-8  strings.
4562           Note that REG_UTF8 is not part of the POSIX standard.
4563    
4564         In  the  absence  of  these  flags, no options are passed to the native         In  the  absence  of  these  flags, no options are passed to the native
4565         function.  This means that the regex  is  compiled  with  PCRE  default         function.  This means that the regex  is  compiled  with  PCRE  default
# Line 4307  MATCHING A PATTERN Line 4626  MATCHING A PATTERN
4626         The PCRE_NOTEOL option is set when calling the underlying PCRE matching         The PCRE_NOTEOL option is set when calling the underlying PCRE matching
4627         function.         function.
4628    
4629         The  portion of the string that was matched, and also any captured sub-         If  the pattern was compiled with the REG_NOSUB flag, no data about any
4630         strings, are returned via the pmatch argument, which points to an array         matched strings  is  returned.  The  nmatch  and  pmatch  arguments  of
4631         of  nmatch  structures of type regmatch_t, containing the members rm_so         regexec() are ignored.
4632         and rm_eo. These contain the offset to the first character of each sub-  
4633         string and the offset to the first character after the end of each sub-         Otherwise, the portion of the string that was matched, and also any cap-
4634         string, respectively. The 0th element of  the  vector  relates  to  the         tured substrings, are returned via the pmatch argument, which points to
4635         entire  portion  of string that was matched; subsequent elements relate         an  array  of nmatch structures of type regmatch_t, containing the mem-
4636         to the capturing subpatterns of the regular expression. Unused  entries         bers rm_so and rm_eo. These contain the offset to the  first  character
4637         in the array have both structure members set to -1.         of  each  substring and the offset to the first character after the end
4638           of each substring, respectively. The 0th element of the vector  relates
4639           to  the  entire portion of string that was matched; subsequent elements
4640           relate to the capturing subpatterns of the regular  expression.  Unused
4641           entries in the array have both structure members set to -1.
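
The use of pmatch, rm_so and rm_eo can be sketched as follows. To keep the example buildable without PCRE installed, it is written against the system <regex.h>; this is an assumption made for the sketch. To run it through PCRE's wrapper you would include pcreposix.h instead and link with -lpcreposix -lpcre, where REG_EXTENDED is defined as zero. The function name and the MAX_GROUPS limit are hypothetical.

```c
#include <regex.h>
#include <stddef.h>

#define MAX_GROUPS 10   /* arbitrary size of the pmatch array in this sketch */

/* Hypothetical helper: compiles pattern, matches it against subject,
   and returns the offsets of the requested group. Both members are -1
   if the group is unused, if there is no match, or on any error. */
regmatch_t group_span(const char *pattern, const char *subject, size_t group)
{
    regex_t re;
    regmatch_t pmatch[MAX_GROUPS];
    regmatch_t none;

    none.rm_so = -1;
    none.rm_eo = -1;

    if (group >= MAX_GROUPS) return none;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) return none;

    /* Element 0 describes the whole match; elements 1.. describe
       the capturing subpatterns. Unused entries are set to -1. */
    if (regexec(&re, subject, MAX_GROUPS, pmatch, 0) == 0)
        none = pmatch[group];

    regfree(&re);
    return none;
}
```

For the pattern (a)|(b) and the subject "xb", group 0 and group 2 both span offsets 1 to 2, while group 1 did not take part in the match and reports -1 in both members.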
4642    
4643         A  successful  match  yields  a  zero  return;  various error codes are         A  successful  match  yields  a  zero  return;  various error codes are
4644         defined in the header file, of  which  REG_NOMATCH  is  the  "expected"         defined in the header file, of  which  REG_NOMATCH  is  the  "expected"
# Line 4346  AUTHOR Line 4669  AUTHOR
4669         University Computing Service,         University Computing Service,
4670         Cambridge CB2 3QG, England.         Cambridge CB2 3QG, England.
4671    
4672  Last updated: 28 February 2005  Last updated: 16 January 2006
4673  Copyright (c) 1997-2005 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
4674  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4675    
4676    
# Line 4520  PASSING MODIFIERS TO THE REGULAR EXPRESS Line 4843  PASSING MODIFIERS TO THE REGULAR EXPRESS
4843    
4844           RE_Options & set_caseless(bool)           RE_Options & set_caseless(bool)
4845    
4846         which sets or unsets the  modifier.  Moreover,  PCRE_CONFIG_MATCH_LIMIT         which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
4847         can  be accessed through the set_match_limit() and match_limit() member         be  accessed  through  the  set_match_limit()  and match_limit() member
4848         functions. Setting match_limit to a non-zero value will limit the  exe-         functions. Setting match_limit to a non-zero value will limit the  exe-
4849         cution  of pcre to keep it from doing bad things like blowing the stack         cution  of pcre to keep it from doing bad things like blowing the stack
4850         or taking an eternity to return a result.  A  value  of  5000  is  good         or taking an eternity to return a result.  A  value  of  5000  is  good
4851         enough  to stop stack blowup in a 2MB thread stack. Setting match_limit         enough  to stop stack blowup in a 2MB thread stack. Setting match_limit
4852         to zero disables match limiting.         to  zero  disables  match  limiting.  Alternatively,   you   can   call
4853           match_limit_recursion()  which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION to
4854           limit how much  PCRE  recurses.  match_limit()  limits  the  number  of
4855           matches PCRE does; match_limit_recursion() limits the depth of internal
4856           recursion, and therefore the amount of stack that is used.
4857    
4858         Normally, to pass one or more modifiers to a RE class,  you  declare  a         Normally, to pass one or more modifiers to a RE class,  you  declare  a
4859         RE_Options object, set the appropriate options, and pass this object to         RE_Options object, set the appropriate options, and pass this object to
# Line 4721  PCRE SAMPLE PROGRAM Line 5048  PCRE SAMPLE PROGRAM
5048  Last updated: 09 September 2004  Last updated: 09 September 2004
5049  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2004 University of Cambridge.
5050  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5051    PCRESTACK(3)                                                      PCRESTACK(3)
5052    
5053    
5054    NAME
5055           PCRE - Perl-compatible regular expressions
5056    
5057    
5058    PCRE DISCUSSION OF STACK USAGE
5059    
5060           When  you call pcre_exec(), it makes use of an internal function called
5061           match(). This calls itself recursively at branch points in the pattern,
5062           in  order to remember the state of the match so that it can back up and
5063           try a different alternative if the first one fails.  As  matching  pro-
5064           ceeds  deeper  and deeper into the tree of possibilities, the recursion
5065           depth increases.
5066    
5067           Not all calls of match() increase the recursion depth; for an item such
5068           as  a* it may be called several times at the same level, after matching
5069           different numbers of a's. Furthermore, in a number of cases  where  the
5070           result  of  the  recursive call would immediately be passed back as the
5071           result of the current call (a "tail recursion"), the function  is  just
5072           restarted instead.
5073    
5074           The pcre_dfa_exec() function operates in an entirely different way, and
5075           hardly uses recursion at all. The limit on its complexity is the amount
5076           of  workspace  it  is  given.  The comments that follow do NOT apply to
5077           pcre_dfa_exec(); they are relevant only for pcre_exec().
5078    
5079           You can set limits on the number of times that match() is called,  both
5080           in  total  and  recursively. If the limit is exceeded, an error occurs.
5081           For details, see the section on  extra  data  for  pcre_exec()  in  the
5082           pcreapi documentation.
5083    
5084           Each  time  that match() is actually called recursively, it uses memory
5085           from the process stack. For certain kinds of  pattern  and  data,  very
5086           large  amounts of stack may be needed, despite the recognition of "tail
5087           recursion".  You can often reduce the amount of recursion,  and  there-
5088           fore  the  amount of stack used, by modifying the pattern that is being
5089           matched. Consider, for example, this pattern:
5090    
5091             ([^<]|<(?!inet))+
5092    
5093           It matches from wherever it starts until it encounters "<inet"  or  the
5094           end  of  the  data,  and is the kind of pattern that might be used when
5095           processing an XML file. Each iteration of the outer parentheses matches
5096           either  one  character that is not "<" or a "<" that is not followed by
5097           "inet". However, each time a  parenthesis  is  processed,  a  recursion
5098           occurs, so this formulation uses a stack frame for each matched charac-
5099           ter. For a long string, a lot of stack is required. Consider  now  this
5100           rewritten pattern, which matches exactly the same strings:
5101    
5102             ([^<]++|<(?!inet))+
5103    
5104           This  uses very much less stack, because runs of characters that do not
5105           contain "<" are "swallowed" in one item inside the parentheses.  Recur-
5106           sion  happens  only when a "<" character that is not followed by "inet"
5107           is encountered (and we assume this is relatively  rare).  A  possessive
5108           quantifier  is  used  to stop any backtracking into the runs of non-"<"
5109           characters, but that is not related to stack usage.
5110    
5111           In environments where stack memory is constrained, you  might  want  to
5112           compile  PCRE to use heap memory instead of stack for remembering back-
5113           up points. This makes it run a lot more slowly, however. Details of how
5114           to do this are given in the pcrebuild documentation.
5115    
5116           In Unix-like environments, the stack is rarely a problem, though the
5117           default limit on stack size varies from  system  to  system.
5118           Values  from 8Mb to 64Mb are common. You can find your default limit by
5119           running the command:
5120    
5121             ulimit -s
5122    
5123           The effect of running out of stack is often SIGSEGV,  though  sometimes
5124           an error message is given. You can normally increase the limit on stack
5125           size by code such as this:
5126    
5127             struct rlimit rlim;
5128             getrlimit(RLIMIT_STACK, &rlim);
5129             rlim.rlim_cur = 100*1024*1024;
5130             setrlimit(RLIMIT_STACK, &rlim);
5131    
5132           This reads the current limits (soft and hard) using  getrlimit(),  then
5133           attempts  to  increase  the  soft limit to 100Mb using setrlimit(). You
5134           must do this before calling pcre_exec().
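
The fragment above omits error handling. As a minimal sketch, assuming a POSIX system, the same steps can be wrapped in a helper that checks each call and never asks for more than the hard limit. The name raise_stack_limit() is hypothetical and is not part of PCRE.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper (not part of PCRE): try to make the soft stack
   limit at least `wanted` bytes. If `wanted` exceeds the hard limit, the
   request is clamped to the hard limit. Returns 0 if the soft limit was
   already sufficient or setrlimit() succeeded, and -1 on error. */
int raise_stack_limit(rlim_t wanted)
{
    struct rlimit rlim;

    /* Read the current soft (rlim_cur) and hard (rlim_max) limits. */
    if (getrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("getrlimit");
        return -1;
    }

    /* Nothing to do if the soft limit is already large enough. */
    if (rlim.rlim_cur == RLIM_INFINITY || rlim.rlim_cur >= wanted)
        return 0;

    /* An unprivileged process cannot raise the soft limit above the
       hard limit, so clamp the request. */
    if (rlim.rlim_max != RLIM_INFINITY && wanted > rlim.rlim_max)
        wanted = rlim.rlim_max;

    rlim.rlim_cur = wanted;
    if (setrlimit(RLIMIT_STACK, &rlim) != 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}
```

A caller would invoke, for example, raise_stack_limit(100UL*1024*1024) once at start-up, before the first call to pcre_exec(). Whether the new limit affects the stack of an already running main thread depends on the operating system.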
5135    
5136           PCRE has an internal counter that can be used to  limit  the  depth  of
5137           recursion,  and  thus cause pcre_exec() to give an error code before it
5138           runs out of stack. By default, the limit is very  large,  and  unlikely
5139           ever  to operate. It can be changed when PCRE is built, and it can also
5140           be set when pcre_exec() is called. For details of these interfaces, see
5141           the pcrebuild and pcreapi documentation.
5142    
5143           As a very rough rule of thumb, you should reckon on about 500 bytes per
5144           recursion. Thus, if you want to limit your  stack  usage  to  8Mb,  you
5145           should  set  the  limit at 16000 recursions. A 64Mb stack, on the other
5146           hand, can support around 128000 recursions. The pcretest  test  program
5147           has  a command line option (-S) that can be used to increase its stack.
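
The rule of thumb amounts to a single division. The sketch below makes the arithmetic explicit. The 500-byte figure is the rough estimate quoted above, not a measured value, and the function name is hypothetical.

```c
/* Rough per-recursion stack cost taken from the rule of thumb in the
   text; an estimate only, since the real figure varies by platform. */
#define BYTES_PER_RECURSION 500UL

/* Hypothetical helper: the largest recursion depth that fits in a stack
   budget of `stack_bytes`, under the estimate above. */
unsigned long recursion_limit_for_stack(unsigned long stack_bytes)
{
    return stack_bytes / BYTES_PER_RECURSION;
}
```

For a budget of 8*1024*1024 bytes this gives 16777, and for 64*1024*1024 bytes it gives 134217. The figures of 16000 and 128000 in the text are the same results rounded down, which leaves some margin for stack used outside match().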
5148    
5149    Last updated: 29 June 2006
5150    Copyright (c) 1997-2006 University of Cambridge.
5151    ------------------------------------------------------------------------------
5152    
5153    
