Diff of /code/trunk/doc/pcre.txt

revision 958 by ph10, Sat Mar 31 18:09:26 2012 UTC revision 959 by ph10, Sat Apr 14 16:16:58 2012 UTC
# Line 367  OPTION NAMES Line 367  OPTION NAMES
367         There   are   two   new   general   option   names,   PCRE_UTF16    and         There   are   two   new   general   option   names,   PCRE_UTF16    and
368         PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and         PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
369         PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options         PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
370         define the same bits in the options word.         define  the  same bits in the options word. There is a discussion about
371           the validity of UTF-16 strings in the pcreunicode page.
372    
373         For  the  pcre16_config() function there is an option PCRE_CONFIG_UTF16         For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
374         that returns 1 if UTF-16 support is configured, otherwise  0.  If  this         that  returns  1  if UTF-16 support is configured, otherwise 0. If this
375         option  is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option is         option is given to pcre_config(), or if the PCRE_CONFIG_UTF8 option  is
376         given to pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.         given to pcre16_config(), the result is the PCRE_ERROR_BADOPTION error.
377    
378    
379  CHARACTER CODES  CHARACTER CODES
380    
381         In 16-bit mode, when  PCRE_UTF16  is  not  set,  character  values  are         In  16-bit  mode,  when  PCRE_UTF16  is  not  set, character values are
382         treated in the same way as in 8-bit, non UTF-8 mode, except, of course,         treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
383         that they can range from 0 to 0xffff instead of 0  to  0xff.  Character         that  they  can  range from 0 to 0xffff instead of 0 to 0xff. Character
384         types  for characters less than 0xff can therefore be influenced by the         types for characters less than 0xff can therefore be influenced by  the
385         locale in the same way as before.  Characters greater  than  0xff  have         locale  in  the  same way as before.  Characters greater than 0xff have
386         only one case, and no "type" (such as letter or digit).         only one case, and no "type" (such as letter or digit).
387    
388         In  UTF-16  mode,  the  character  code  is  Unicode, in the range 0 to         In UTF-16 mode, the character code  is  Unicode,  in  the  range  0  to
389         0x10ffff, with the exception of values in the range  0xd800  to  0xdfff         0x10ffff,  with  the  exception of values in the range 0xd800 to 0xdfff
390         because  those  are "surrogate" values that are used in pairs to encode         because those are "surrogate" values that are used in pairs  to  encode
391         values greater than 0xffff.         values greater than 0xffff.
392    
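The surrogate arithmetic described above can be sketched in C as follows. This is a minimal illustration, not code from PCRE; the helper names is_high_surrogate(), is_low_surrogate() and combine_surrogates() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers, not part of the PCRE API. */

/* A high (leading) surrogate lies in 0xd800..0xdbff. */
static int is_high_surrogate(uint16_t unit)
{
    return unit >= 0xd800 && unit <= 0xdbff;
}

/* A low (trailing) surrogate lies in 0xdc00..0xdfff. */
static int is_low_surrogate(uint16_t unit)
{
    return unit >= 0xdc00 && unit <= 0xdfff;
}

/* Combine a valid surrogate pair into one code point in the
   range 0x10000..0x10ffff. Each surrogate carries 10 bits. */
static uint32_t combine_surrogates(uint16_t high, uint16_t low)
{
    assert(is_high_surrogate(high) && is_low_surrogate(low));
    return 0x10000u + (((uint32_t)(high - 0xd800) << 10) |
                       (uint32_t)(low - 0xdc00));
}
```

For example, the pair 0xd83d, 0xde00 combines to the code point 0x1f600.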
393         A UTF-16 string can indicate its endianness by special code known as  a         A  UTF-16 string can indicate its endianness by special code known as a
394         byte-order mark (BOM). The PCRE functions do not handle this, expecting         byte-order mark (BOM). The PCRE functions do not handle this, expecting
395         strings  to  be  in  host  byte  order.  A  utility   function   called         strings   to   be  in  host  byte  order.  A  utility  function  called
396         pcre16_utf16_to_host_byte_order()  is  provided  to help with this (see         pcre16_utf16_to_host_byte_order() is provided to help  with  this  (see
397         above).         above).
398    
399    
400  ERROR NAMES  ERROR NAMES
401    
402         The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16  corre-         The  errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16 corre-
403         spond  to  their  8-bit  counterparts.  The error PCRE_ERROR_BADMODE is         spond to their 8-bit  counterparts.  The  error  PCRE_ERROR_BADMODE  is
404         given when a compiled pattern is passed to a  function  that  processes         given  when  a  compiled pattern is passed to a function that processes
405         patterns  in  the  other  mode, for example, if a pattern compiled with         patterns in the other mode, for example, if  a  pattern  compiled  with
406         pcre_compile() is passed to pcre16_exec().         pcre_compile() is passed to pcre16_exec().
407    
408         There are new error codes whose names  begin  with  PCRE_UTF16_ERR  for         There  are  new  error  codes whose names begin with PCRE_UTF16_ERR for
409         invalid  UTF-16  strings,  corresponding to the PCRE_UTF8_ERR codes for         invalid UTF-16 strings, corresponding to the  PCRE_UTF8_ERR  codes  for
410         UTF-8 strings that are described in the section entitled "Reason  codes         UTF-8  strings that are described in the section entitled "Reason codes
411         for  invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors         for invalid UTF-8 strings" in the main pcreapi page. The UTF-16  errors
412         are:         are:
413    
414           PCRE_UTF16_ERR1  Missing low surrogate at end of string           PCRE_UTF16_ERR1  Missing low surrogate at end of string
# Line 418  ERROR NAMES Line 419  ERROR NAMES
419    
420  ERROR TEXTS  ERROR TEXTS
421    
422         If there is an error while compiling a pattern, the error text that  is         If  there is an error while compiling a pattern, the error text that is
423         passed  back by pcre16_compile() or pcre16_compile2() is still an 8-bit         passed back by pcre16_compile() or pcre16_compile2() is still an  8-bit
424         character string, zero-terminated.         character string, zero-terminated.
425    
426    
427  CALLOUTS  CALLOUTS
428    
429         The subject and mark fields in the callout block that is  passed  to  a         The  subject  and  mark fields in the callout block that is passed to a
430         callout function point to 16-bit vectors.         callout function point to 16-bit vectors.
431    
432    
433  TESTING  TESTING
434    
435         The  pcretest  program continues to operate with 8-bit input and output         The pcretest program continues to operate with 8-bit input  and  output
436         files, but it can be used for testing the 16-bit library. If it is  run         files,  but it can be used for testing the 16-bit library. If it is run
437         with the command line option -16, patterns and subject strings are con-         with the command line option -16, patterns and subject strings are con-
438         verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit         verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
439         library  functions  are used instead of the 8-bit ones. Returned 16-bit         library functions are used instead of the 8-bit ones.  Returned  16-bit
440         strings are converted to 8-bit for output. If the 8-bit library was not         strings are converted to 8-bit for output. If the 8-bit library was not
441         compiled, pcretest defaults to 16-bit and the -16 option is ignored.         compiled, pcretest defaults to 16-bit and the -16 option is ignored.
442    
443         When  PCRE  is  being built, the RunTest script that is called by "make         When PCRE is being built, the RunTest script that is  called  by  "make
444         check" uses the pcretest -C option to discover which of the  8-bit  and         check"  uses  the pcretest -C option to discover which of the 8-bit and
445         16-bit libraries has been built, and runs the tests appropriately.         16-bit libraries has been built, and runs the tests appropriately.
446    
447    
448  NOT SUPPORTED IN 16-BIT MODE  NOT SUPPORTED IN 16-BIT MODE
449    
450         Not all the features of the 8-bit library are available with the 16-bit         Not all the features of the 8-bit library are available with the 16-bit
451         library. The C++ and POSIX wrapper functions  support  only  the  8-bit         library.  The  C++  and  POSIX wrapper functions support only the 8-bit
452         library, and the pcregrep program is at present 8-bit only.         library, and the pcregrep program is at present 8-bit only.
453    
454    
# Line 460  AUTHOR Line 461  AUTHOR
461    
462  REVISION  REVISION
463    
464         Last updated: 08 January 2012         Last updated: 14 April 2012
465         Copyright (c) 1997-2012 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
466  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
467    
# Line 2656  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2657  MATCHING A PATTERN: THE TRADITIONAL FUNC
2657    
2658         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
2659         UTF-8  string is automatically checked when pcre_exec() is subsequently         UTF-8  string is automatically checked when pcre_exec() is subsequently
2660         called.  The value of startoffset is also checked  to  ensure  that  it         called.  The entire string is checked before any other processing takes
2661         points  to  the start of a UTF-8 character. There is a discussion about         place.  The  value  of  startoffset  is  also checked to ensure that it
2662         the validity of UTF-8 strings in the pcreunicode page.  If  an  invalid         points to the start of a UTF-8 character. There is a  discussion  about
2663         sequence   of   bytes   is   found,   pcre_exec()   returns  the  error         the  validity  of  UTF-8 strings in the pcreunicode page. If an invalid
2664           sequence  of  bytes   is   found,   pcre_exec()   returns   the   error
2665         PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a         PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
2666         truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In         truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
2667         both cases, information about the precise nature of the error may  also         both  cases, information about the precise nature of the error may also
2668         be  returned (see the descriptions of these errors in the section enti-         be returned (see the descriptions of these errors in the section  enti-
2669         tled Error return values from pcre_exec() below).  If startoffset  con-         tled  Error return values from pcre_exec() below).  If startoffset con-
2670         tains a value that does not point to the start of a UTF-8 character (or         tains a value that does not point to the start of a UTF-8 character (or
2671         to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.         to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
2672    
2673         If you already know that your subject is valid, and you  want  to  skip         If  you  already  know that your subject is valid, and you want to skip
2674         these    checks    for   performance   reasons,   you   can   set   the         these   checks   for   performance   reasons,   you   can    set    the
2675         PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
2676         do  this  for the second and subsequent calls to pcre_exec() if you are         do this for the second and subsequent calls to pcre_exec() if  you  are
2677         making repeated calls to find all  the  matches  in  a  single  subject         making  repeated  calls  to  find  all  the matches in a single subject
2678         string.  However,  you  should  be  sure  that the value of startoffset         string. However, you should be  sure  that  the  value  of  startoffset
2679         points to the start of a character (or the end of  the  subject).  When         points  to  the  start of a character (or the end of the subject). When
2680         PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a         PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
2681         subject or an invalid value of startoffset is undefined.  Your  program         subject  or  an invalid value of startoffset is undefined. Your program
2682         may crash.         may crash.
2683    
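If PCRE_NO_UTF8_CHECK is used, the caller becomes responsible for the startoffset condition described above. A minimal sketch of that single check, assuming the subject is already known to be valid UTF-8, might look like this. The function name is_utf8_char_start() is hypothetical and is not part of PCRE; it does not validate the string itself.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if offset is the end of the subject or the start of a
   character, 0 otherwise. It relies on the fact that UTF-8
   continuation bytes always have the bit pattern 10xxxxxx. */
static int is_utf8_char_start(const char *subject, size_t length,
                              size_t offset)
{
    assert(subject != NULL);
    if (offset > length)
        return 0;               /* beyond the subject */
    if (offset == length)
        return 1;               /* the end of the subject is allowed */
    return ((unsigned char)subject[offset] & 0xc0) != 0x80;
}
```

With the four-byte subject 'a', 0xc3, 0xa9, 'b', offsets 0, 1, 3 and 4 are acceptable, while offset 2 falls inside the two-byte character.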
2684           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
2685           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
2686    
2687         These  options turn on the partial matching feature. For backwards com-         These options turn on the partial matching feature. For backwards  com-
2688         patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial         patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2689         match  occurs if the end of the subject string is reached successfully,         match occurs if the end of the subject string is reached  successfully,
2690         but there are not enough subject characters to complete the  match.  If         but  there  are not enough subject characters to complete the match. If
2691         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
2692         matching continues by testing any remaining alternatives.  Only  if  no         matching  continues  by  testing any remaining alternatives. Only if no
2693         complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of         complete match can be found is PCRE_ERROR_PARTIAL returned  instead  of
2694         PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the         PCRE_ERROR_NOMATCH.  In  other  words,  PCRE_PARTIAL_SOFT says that the
2695         caller  is  prepared to handle a partial match, but only if no complete         caller is prepared to handle a partial match, but only if  no  complete
2696         match can be found.         match can be found.
2697    
2698         If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this         If  PCRE_PARTIAL_HARD  is  set, it overrides PCRE_PARTIAL_SOFT. In this
2699         case,  if  a  partial  match  is found, pcre_exec() immediately returns         case, if a partial match  is  found,  pcre_exec()  immediately  returns
2700         PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In         PCRE_ERROR_PARTIAL,  without  considering  any  other  alternatives. In
2701         other  words, when PCRE_PARTIAL_HARD is set, a partial match is         other words, when PCRE_PARTIAL_HARD is set, a partial match is
2702         considered to be more important than an alternative complete match.         considered to be more important than an alternative complete match.
2703    
2704         In both cases, the portion of the string that was  inspected  when  the         In  both  cases,  the portion of the string that was inspected when the
2705         partial match was found is set as the first matching string. There is a         partial match was found is set as the first matching string. There is a
2706         more detailed discussion of partial and  multi-segment  matching,  with         more  detailed  discussion  of partial and multi-segment matching, with
2707         examples, in the pcrepartial documentation.         examples, in the pcrepartial documentation.
2708    
2709     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
2710    
2711         The  subject string is passed to pcre_exec() as a pointer in subject, a         The subject string is passed to pcre_exec() as a pointer in subject,  a
2712         length in bytes in length, and a starting byte offset  in  startoffset.         length  in  bytes in length, and a starting byte offset in startoffset.
2713         If  this  is  negative  or  greater  than  the  length  of the subject,         If this is  negative  or  greater  than  the  length  of  the  subject,
2714         pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is         pcre_exec()  returns  PCRE_ERROR_BADOFFSET. When the starting offset is
2715         zero,  the  search  for a match starts at the beginning of the subject,         zero, the search for a match starts at the beginning  of  the  subject,
2716         and this is by far the most common case. In UTF-8 mode, the byte offset         and this is by far the most common case. In UTF-8 mode, the byte offset
2717         must  point  to  the start of a UTF-8 character (or the end of the sub-         must point to the start of a UTF-8 character (or the end  of  the  sub-
2718         ject). Unlike the pattern string, the subject may contain  binary  zero         ject).  Unlike  the pattern string, the subject may contain binary zero
2719         bytes.         bytes.
2720    
2721         A  non-zero  starting offset is useful when searching for another match         A non-zero starting offset is useful when searching for  another  match
2722         in the same subject by calling pcre_exec() again after a previous  suc-         in  the same subject by calling pcre_exec() again after a previous suc-
2723         cess.   Setting  startoffset differs from just passing over a shortened         cess.  Setting startoffset differs from just passing over  a  shortened
2724         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins         string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
2725         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
2726    
2727           \Biss\B           \Biss\B
2728    
2729         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
2730         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
2731         When  applied  to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call  to  pcre_exec()
2732         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
2733         the  remainder  of  the  subject,  namely  "issippi", it does not match,         the remainder of the subject,  namely  "issippi",  it  does  not  match,
2734         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2735         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
2736         string again, but with startoffset set to 4, it finds the second occur-         string again, but with startoffset set to 4, it finds the second occur-
2737         rence  of "iss" because it is able to look behind the starting point to         rence of "iss" because it is able to look behind the starting point  to
2738         discover that it is preceded by a letter.         discover that it is preceded by a letter.
2739    
2740         Finding all the matches in a subject is tricky  when  the  pattern  can         Finding  all  the  matches  in a subject is tricky when the pattern can
2741         match an empty string. It is possible to emulate Perl's /g behaviour by         match an empty string. It is possible to emulate Perl's /g behaviour by
2742         first  trying  the  match  again  at  the   same   offset,   with   the         first   trying   the   match   again  at  the  same  offset,  with  the
2743         PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED  options,  and  then  if that         PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED  options,  and  then  if  that
2744         fails, advancing the starting  offset  and  trying  an  ordinary  match         fails,  advancing  the  starting  offset  and  trying an ordinary match
2745         again. There is some code that demonstrates how to do this in the pcre-         again. There is some code that demonstrates how to do this in the pcre-
2746         demo sample program. In the most general case, you have to check to see         demo sample program. In the most general case, you have to check to see
2747         if  the newline convention recognizes CRLF as a newline, and if so, and         if the newline convention recognizes CRLF as a newline, and if so,  and
2748         the current character is CR followed by LF, advance the starting offset         the current character is CR followed by LF, advance the starting offset
2749         by two characters instead of one.         by two characters instead of one.
2750    
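The offset-advancing step in the paragraph above can be sketched as a small helper. This is an illustration only: the name advance_after_empty_match() and the crlf_is_newline flag are assumptions, and the sketch covers the non-UTF-8 case, where one character is one byte. In UTF-8 mode the caller would also have to step over any continuation bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns the starting offset for the next ordinary match attempt
   after the anchored, non-empty retry at 'offset' has failed.
   crlf_is_newline is non-zero when the newline convention in use
   recognizes CRLF as a newline. */
static size_t advance_after_empty_match(const char *subject, size_t length,
                                        size_t offset, int crlf_is_newline)
{
    assert(subject != NULL && offset <= length);
    if (offset >= length)
        return length;          /* already at the end of the subject */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;      /* step over CR and LF together */
    return offset + 1;          /* step over a single character */
}
```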
2751         If  a  non-zero starting offset is passed when the pattern is anchored,         If a non-zero starting offset is passed when the pattern  is  anchored,
2752         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
2753         if  the  pattern  does  not require the match to be at the start of the         if the pattern does not require the match to be at  the  start  of  the
2754         subject.         subject.
2755    
2756     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
2757    
2758         In general, a pattern matches a certain portion of the subject, and  in         In  general, a pattern matches a certain portion of the subject, and in
2759         addition,  further  substrings  from  the  subject may be picked out by         addition, further substrings from the subject  may  be  picked  out  by
2760         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,
2761         this  is  called "capturing" in what follows, and the phrase "capturing         this is called "capturing" in what follows, and the  phrase  "capturing
2762         subpattern" is used for a fragment of a pattern that picks out  a  sub-         subpattern"  is  used for a fragment of a pattern that picks out a sub-
2763         string.  PCRE  supports several other kinds of parenthesized subpattern         string. PCRE supports several other kinds of  parenthesized  subpattern
2764         that do not cause substrings to be captured.         that do not cause substrings to be captured.
2765    
2766         Captured substrings are returned to the caller via a vector of integers         Captured substrings are returned to the caller via a vector of integers
2767         whose  address is passed in ovector. The number of elements in the vec-         whose address is passed in ovector. The number of elements in the  vec-
2768         tor is passed in ovecsize, which must be a non-negative  number.  Note:         tor  is  passed in ovecsize, which must be a non-negative number. Note:
2769         this argument is NOT the size of ovector in bytes.         this argument is NOT the size of ovector in bytes.
2770    
2771         The  first  two-thirds of the vector is used to pass back captured sub-         The first two-thirds of the vector is used to pass back  captured  sub-
2772         strings, each substring using a pair of integers. The  remaining  third         strings,  each  substring using a pair of integers. The remaining third
2773         of  the  vector is used as workspace by pcre_exec() while matching cap-         of the vector is used as workspace by pcre_exec() while  matching  cap-
2774         turing subpatterns, and is not available for passing back  information.         turing  subpatterns, and is not available for passing back information.
2775         The  number passed in ovecsize should always be a multiple of three. If         The number passed in ovecsize should always be a multiple of three.  If
2776         it is not, it is rounded down.         it is not, it is rounded down.
2777    
2778         When a match is successful, information about  captured  substrings  is         When  a  match  is successful, information about captured substrings is
2779         returned  in  pairs  of integers, starting at the beginning of ovector,         returned in pairs of integers, starting at the  beginning  of  ovector,
2780         and continuing up to two-thirds of its length at the  most.  The  first         and  continuing  up  to two-thirds of its length at the most. The first
2781         element  of  each pair is set to the byte offset of the first character         element of each pair is set to the byte offset of the  first  character
2782         in a substring, and the second is set to the byte offset of  the  first         in  a  substring, and the second is set to the byte offset of the first
2783         character  after  the end of a substring. Note: these values are always         character after the end of a substring. Note: these values  are  always
2784         byte offsets, even in UTF-8 mode. They are not character counts.         byte offsets, even in UTF-8 mode. They are not character counts.
2785    
2786         The first pair of integers, ovector[0]  and  ovector[1],  identify  the         The  first  pair  of  integers, ovector[0] and ovector[1], identify the
2787         portion  of  the subject string matched by the entire pattern. The next         portion of the subject string matched by the entire pattern.  The  next
2788         pair is used for the first capturing subpattern, and so on.  The  value         pair  is  used for the first capturing subpattern, and so on. The value
2789         returned by pcre_exec() is one more than the highest numbered pair that         returned by pcre_exec() is one more than the highest numbered pair that
2790         has been set.  For example, if two substrings have been  captured,  the         has  been  set.  For example, if two substrings have been captured, the
2791         returned  value is 3. If there are no capturing subpatterns, the return         returned value is 3. If there are no capturing subpatterns, the  return
2792         value from a successful match is 1, indicating that just the first pair         value from a successful match is 1, indicating that just the first pair
2793         of offsets has been set.         of offsets has been set.
2794    
2795         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
2796         of the string that it matched that is returned.         of the string that it matched that is returned.
2797    
2798         If the vector is too small to hold all the captured substring  offsets,         If  the vector is too small to hold all the captured substring offsets,
2799         it is used as far as possible (up to two-thirds of its length), and the         it is used as far as possible (up to two-thirds of its length), and the
2800         function returns a value of zero. If neither the actual string  matched         function  returns a value of zero. If neither the actual string matched
2801         nor  any captured substrings are of interest, pcre_exec() may be called         nor any captured substrings are of interest, pcre_exec() may be  called
2802         with ovector passed as NULL and ovecsize as zero. However, if the  pat-         with  ovector passed as NULL and ovecsize as zero. However, if the pat-
2803         tern  contains  back  references  and  the ovector is not big enough to         tern contains back references and the ovector  is  not  big  enough  to
2804         remember the related substrings, PCRE has to get additional memory  for         remember  the related substrings, PCRE has to get additional memory for
2805         use  during matching. Thus it is usually advisable to supply an ovector         use during matching. Thus it is usually advisable to supply an  ovector
2806         of reasonable size.         of reasonable size.
2807    
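The return-value rules above (a positive count, or zero when the vector was too small) can be summarized in a short helper. This is a sketch under the assumption that the caller only wants to know how many offset pairs are safe to read; the name pairs_set() is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper, not part of the PCRE API.
   rc is the value returned by pcre_exec(), and ovecsize is the
   number of elements (not bytes) that was passed to it. */
static int pairs_set(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* one more than the highest pair set */
    if (rc == 0)
        return ovecsize / 3;    /* overflow: two-thirds of the vector
                                   was filled, two integers per pair */
    return 0;                   /* negative: no match or an error */
}
```

A return of zero with a 6-element vector therefore means that 2 pairs are available, which agrees with the "namely 2" example in the text.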
2808         There are some cases where zero is returned  (indicating  vector  over-         There  are  some  cases where zero is returned (indicating vector over-
2809         flow)  when  in fact the vector is exactly the right size for the final         flow) when in fact the vector is exactly the right size for  the  final
2810         match. For example, consider the pattern         match. For example, consider the pattern
2811    
2812           (a)(?:(b)c|bd)           (a)(?:(b)c|bd)
2813    
2814         If a vector of 6 elements (allowing for only 1 captured  substring)  is         If  a  vector of 6 elements (allowing for only 1 captured substring) is
2815         given with subject string "abd", pcre_exec() will try to set the second         given with subject string "abd", pcre_exec() will try to set the second
2816         captured string, thereby recording a vector overflow, before failing to         captured string, thereby recording a vector overflow, before failing to
2817         match  "c"  and  backing  up  to  try  the second alternative. The zero         match "c" and backing up  to  try  the  second  alternative.  The  zero
2818         return, however, does correctly indicate that  the  maximum  number  of         return,  however,  does  correctly  indicate that the maximum number of
2819         slots (namely 2) have been filled. In similar cases where there is tem-         slots (namely 2) have been filled. In similar cases where there is tem-
2820         porary overflow, but the final number of used slots  is  actually  less         porary  overflow,  but  the final number of used slots is actually less
2821         than the maximum, a non-zero value is returned.         than the maximum, a non-zero value is returned.
2822    
2823         The pcre_fullinfo() function can be used to find out how many capturing         The pcre_fullinfo() function can be used to find out how many capturing
2824         subpatterns there are in a compiled  pattern.  The  smallest  size  for         subpatterns  there  are  in  a  compiled pattern. The smallest size for
2825         ovector  that  will allow for n captured substrings, in addition to the         ovector that will allow for n captured substrings, in addition  to  the
2826         offsets of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
2827    
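The sizing rule above is simple enough to write down directly. In this sketch the name min_ovecsize() is hypothetical; the capture count would normally be obtained from pcre_fullinfo() with the PCRE_INFO_CAPTURECOUNT option.

```c
#include <assert.h>

/* Hypothetical helper, not part of the PCRE API.
   Smallest ovecsize that can hold the whole-pattern offsets plus
   capture_count captured substrings: (n+1) pairs in the first
   two-thirds, and a further third used as workspace. */
static int min_ovecsize(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}
```

A pattern with one capturing subpattern needs 6 elements, and one with two needs 9.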
2828         It is possible for capturing subpattern number n+1 to match  some  part         It  is  possible for capturing subpattern number n+1 to match some part
2829         of the subject when subpattern n has not been used at all. For example,         of the subject when subpattern n has not been used at all. For example,
2830         if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the         if  the  string  "abc"  is  matched against the pattern (a|(z))(bc) the
2831         return from the function is 4, and subpatterns 1 and 3 are matched, but         return from the function is 4, and subpatterns 1 and 3 are matched, but
2832         2 is not. When this happens, both values in  the  offset  pairs  corre-         2  is  not.  When  this happens, both values in the offset pairs corre-
2833         sponding to unused subpatterns are set to -1.         sponding to unused subpatterns are set to -1.
2834    
2835         Offset  values  that correspond to unused subpatterns at the end of the         Offset values that correspond to unused subpatterns at the end  of  the
2836         expression are also set to -1. For example,  if  the  string  "abc"  is         expression  are  also  set  to  -1. For example, if the string "abc" is
2837         matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not         matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are  not
2838         matched. The return from the function is 2, because  the  highest  used         matched.  The  return  from the function is 2, because the highest used
2839         capturing  subpattern  number  is  1,  and the offsets for the second         capturing subpattern number is 1, and the  offsets  for  the  second
2840         and third capturing subpatterns (assuming the vector is  large  enough,         and  third  capturing subpatterns (assuming the vector is large enough,
2841         of course) are set to -1.         of course) are set to -1.
2842    
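The -1 convention for unused subpatterns can be checked directly in ovector. The sketch below is illustrative only: group_is_set() and example_ovector are hypothetical names, and the array holds the offsets that the (a|(z))(bc) example above would be expected to produce for "abc". The caller is assumed to pass a group number that the vector actually covers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function. Nonzero if capturing
   subpattern n took part in the match; unset groups have both
   offsets of their pair set to -1. */
int group_is_set(const int *ovector, int n)
{
  assert(ovector != NULL && n >= 0);
  return ovector[2 * n] >= 0;
}

/* Offsets for "abc" against (a|(z))(bc): pair 0 is the whole match,
   pair 1 is "a", pair 2 is unset, pair 3 is "bc". */
const int example_ovector[8] = { 0, 3, 0, 1, -1, -1, 1, 3 };
```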
2843         Note:  Elements  in  the first two-thirds of ovector that do not corre-         Note: Elements in the first two-thirds of ovector that  do  not  corre-
2844         spond to capturing parentheses in the pattern are never  changed.  That         spond  to  capturing parentheses in the pattern are never changed. That
2845         is,  if  a pattern contains n capturing parentheses, no more than ovec-         is, if a pattern contains n capturing parentheses, no more  than  ovec-
2846         tor[0] to ovector[2n+1] are set by pcre_exec(). The other elements  (in         tor[0]  to ovector[2n+1] are set by pcre_exec(). The other elements (in
2847         the first two-thirds) retain whatever values they previously had.         the first two-thirds) retain whatever values they previously had.
2848    
2849         Some  convenience  functions  are  provided for extracting the captured         Some convenience functions are provided  for  extracting  the  captured
2850         substrings as separate strings. These are described below.         substrings as separate strings. These are described below.
2851    
2852     Error return values from pcre_exec()     Error return values from pcre_exec()
2853    
2854         If pcre_exec() fails, it returns a negative number. The  following  are         If  pcre_exec()  fails, it returns a negative number. The following are
2855         defined in the header file:         defined in the header file:
2856    
2857           PCRE_ERROR_NOMATCH        (-1)           PCRE_ERROR_NOMATCH        (-1)
# Line 2858  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2860  MATCHING A PATTERN: THE TRADITIONAL FUNC
2860    
2861           PCRE_ERROR_NULL           (-2)           PCRE_ERROR_NULL           (-2)
2862    
2863         Either  code  or  subject  was  passed as NULL, or ovector was NULL and         Either code or subject was passed as NULL,  or  ovector  was  NULL  and
2864         ovecsize was not zero.         ovecsize was not zero.
2865    
2866           PCRE_ERROR_BADOPTION      (-3)           PCRE_ERROR_BADOPTION      (-3)
# Line 2867  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2869  MATCHING A PATTERN: THE TRADITIONAL FUNC
2869    
2870           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
2871    
2872         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE  stores a 4-byte "magic number" at the start of the compiled code,
2873         to catch the case when it is passed a junk pointer and to detect when a         to catch the case when it is passed a junk pointer and to detect when a
2874         pattern that was compiled in an environment of one endianness is run in         pattern that was compiled in an environment of one endianness is run in
2875         an  environment  with the other endianness. This is the error that PCRE         an environment with the other endianness. This is the error  that  PCRE
2876         gives when the magic number is not present.         gives when the magic number is not present.
2877    
2878           PCRE_ERROR_UNKNOWN_OPCODE (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
2879    
2880         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
2881         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled pattern. This error could be caused by a bug  in  PCRE  or  by
2882         overwriting of the compiled pattern.         overwriting of the compiled pattern.
2883    
2884           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2885    
2886         If a pattern contains back references, but the ovector that  is  passed         If  a  pattern contains back references, but the ovector that is passed
2887         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
2888         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE  gets  a  block of memory at the start of matching to use for this
2889         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose. If the call via pcre_malloc() fails, this error is given.  The
2890         memory is automatically freed at the end of matching.         memory is automatically freed at the end of matching.
2891    
2892         This error is also given if pcre_stack_malloc() fails  in  pcre_exec().         This  error  is also given if pcre_stack_malloc() fails in pcre_exec().
2893         This  can happen only when PCRE has been compiled with --disable-stack-         This can happen only when PCRE has been compiled with  --disable-stack-
2894         for-recursion.         for-recursion.
2895    
2896           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2897    
2898         This error is used by the pcre_copy_substring(),  pcre_get_substring(),         This  error is used by the pcre_copy_substring(), pcre_get_substring(),
2899         and  pcre_get_substring_list()  functions  (see  below).  It  is  never         and  pcre_get_substring_list()  functions  (see  below).  It  is  never
2900         returned by pcre_exec().         returned by pcre_exec().
2901    
2902           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
2903    
2904         The backtracking limit, as specified by  the  match_limit  field  in  a         The  backtracking  limit,  as  specified  by the match_limit field in a
2905         pcre_extra  structure  (or  defaulted) was reached. See the description         pcre_extra structure (or defaulted) was reached.  See  the  description
2906         above.         above.
2907    
2908           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2909    
2910         This error is never generated by pcre_exec() itself. It is provided for         This error is never generated by pcre_exec() itself. It is provided for
2911         use  by  callout functions that want to yield a distinctive error code.         use by callout functions that want to yield a distinctive  error  code.
2912         See the pcrecallout documentation for details.         See the pcrecallout documentation for details.
2913    
2914           PCRE_ERROR_BADUTF8        (-10)           PCRE_ERROR_BADUTF8        (-10)
2915    
2916         A string that contains an invalid UTF-8 byte sequence was passed  as  a         A  string  that contains an invalid UTF-8 byte sequence was passed as a
2917         subject,  and the PCRE_NO_UTF8_CHECK option was not set. If the size of         subject, and the PCRE_NO_UTF8_CHECK option was not set. If the size  of
2918         the output vector (ovecsize) is at least 2,  the  byte  offset  to  the         the  output  vector  (ovecsize)  is  at least 2, the byte offset to the
2919         start  of the invalid UTF-8 character is placed in the first element,         start of the invalid UTF-8 character is placed in the  first  element,
2920         and a reason code is  placed  in  the  second  element.  The  reason         and  a  reason  code  is placed in the second element. The reason
2921         codes are listed in the following section.  For backward compatibility,         codes are listed in the following section.  For backward compatibility,
2922         if PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8  char-         if  PCRE_PARTIAL_HARD is set and the problem is a truncated UTF-8 char-
2923         acter   at   the   end   of   the   subject  (reason  codes  1  to  5),         acter  at  the  end  of  the   subject   (reason   codes   1   to   5),
2924         PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.         PCRE_ERROR_SHORTUTF8 is returned instead of PCRE_ERROR_BADUTF8.
2925    
2926           PCRE_ERROR_BADUTF8_OFFSET (-11)           PCRE_ERROR_BADUTF8_OFFSET (-11)
2927    
2928         The UTF-8 byte sequence that was passed as a subject  was  checked  and         The  UTF-8  byte  sequence that was passed as a subject was checked and
2929         found  to be valid (the PCRE_NO_UTF8_CHECK option was not set), but the         found to be valid (the PCRE_NO_UTF8_CHECK option was not set), but  the
2930         value of startoffset did not point to the beginning of a UTF-8  charac-         value  of startoffset did not point to the beginning of a UTF-8 charac-
2931         ter or the end of the subject.         ter or the end of the subject.
2932    
2933           PCRE_ERROR_PARTIAL        (-12)           PCRE_ERROR_PARTIAL        (-12)
2934    
2935         The  subject  string did not match, but it did match partially. See the         The subject string did not match, but it did match partially.  See  the
2936         pcrepartial documentation for details of partial matching.         pcrepartial documentation for details of partial matching.
2937    
2938           PCRE_ERROR_BADPARTIAL     (-13)           PCRE_ERROR_BADPARTIAL     (-13)
2939    
2940         This code is no longer in  use.  It  was  formerly  returned  when  the         This  code  is  no  longer  in  use.  It was formerly returned when the
2941         PCRE_PARTIAL  option  was used with a compiled pattern containing items         PCRE_PARTIAL option was used with a compiled pattern  containing  items
2942         that were  not  supported  for  partial  matching.  From  release  8.00         that  were  not  supported  for  partial  matching.  From  release 8.00
2943         onwards, there are no restrictions on partial matching.         onwards, there are no restrictions on partial matching.
2944    
2945           PCRE_ERROR_INTERNAL       (-14)           PCRE_ERROR_INTERNAL       (-14)
2946    
2947         An  unexpected  internal error has occurred. This error could be caused         An unexpected internal error has occurred. This error could  be  caused
2948         by a bug in PCRE or by overwriting of the compiled pattern.         by a bug in PCRE or by overwriting of the compiled pattern.
2949    
2950           PCRE_ERROR_BADCOUNT       (-15)           PCRE_ERROR_BADCOUNT       (-15)
# Line 2952  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2954  MATCHING A PATTERN: THE TRADITIONAL FUNC
2954           PCRE_ERROR_RECURSIONLIMIT (-21)           PCRE_ERROR_RECURSIONLIMIT (-21)
2955    
2956         The internal recursion limit, as specified by the match_limit_recursion         The internal recursion limit, as specified by the match_limit_recursion
2957         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         field in a pcre_extra structure (or defaulted)  was  reached.  See  the
2958         description above.         description above.
2959    
2960           PCRE_ERROR_BADNEWLINE     (-23)           PCRE_ERROR_BADNEWLINE     (-23)
# Line 2966  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2968  MATCHING A PATTERN: THE TRADITIONAL FUNC
2968    
2969           PCRE_ERROR_SHORTUTF8      (-25)           PCRE_ERROR_SHORTUTF8      (-25)
2970    
2971         This  error  is returned instead of PCRE_ERROR_BADUTF8 when the subject         This error is returned instead of PCRE_ERROR_BADUTF8 when  the  subject
2972         string ends with a truncated UTF-8 character and the  PCRE_PARTIAL_HARD         string  ends with a truncated UTF-8 character and the PCRE_PARTIAL_HARD
2973         option  is  set.   Information  about  the  failure  is returned as for         option is set.  Information  about  the  failure  is  returned  as  for
2974         PCRE_ERROR_BADUTF8. That information is sufficient to detect this case, but         PCRE_ERROR_BADUTF8.  That information is sufficient to detect this case, but
2975         this  special error code for PCRE_PARTIAL_HARD precedes the implementa-         this special error code for PCRE_PARTIAL_HARD precedes the  implementa-
2976         tion of returned information; it is retained for backwards  compatibil-         tion  of returned information; it is retained for backwards compatibil-
2977         ity.         ity.
2978    
2979           PCRE_ERROR_RECURSELOOP    (-26)           PCRE_ERROR_RECURSELOOP    (-26)
2980    
2981         This error is returned when pcre_exec() detects a recursion loop within         This error is returned when pcre_exec() detects a recursion loop within
2982         the pattern. Specifically, it means that either the whole pattern or  a         the  pattern. Specifically, it means that either the whole pattern or a
2983         subpattern  has been called recursively for the second time at the same         subpattern has been called recursively for the second time at the  same
2984         position in the subject string. Some simple patterns that might do this         position in the subject string. Some simple patterns that might do this
2985         are  detected  and faulted at compile time, but more complicated cases,         are detected and faulted at compile time, but more  complicated  cases,
2986         in particular mutual recursions between two different subpatterns, can-         in particular mutual recursions between two different subpatterns, can-
2987         not be detected until run time.         not be detected until run time.
2988    
2989           PCRE_ERROR_JIT_STACKLIMIT (-27)           PCRE_ERROR_JIT_STACKLIMIT (-27)
2990    
2991         This  error  is  returned  when a pattern that was successfully studied         This error is returned when a pattern  that  was  successfully  studied
2992         using a JIT compile option is being matched, but the  memory  available         using  a  JIT compile option is being matched, but the memory available
2993         for  the  just-in-time  processing  stack  is not large enough. See the         for the just-in-time processing stack is  not  large  enough.  See  the
2994         pcrejit documentation for more details.         pcrejit documentation for more details.
2995    
2996           PCRE_ERROR_BADMODE (-28)           PCRE_ERROR_BADMODE (-28)
# Line 2998  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3000  MATCHING A PATTERN: THE TRADITIONAL FUNC
3000    
3001           PCRE_ERROR_BADENDIANNESS (-29)           PCRE_ERROR_BADENDIANNESS (-29)
3002    
3003         This  error  is  given  if  a  pattern  that  was compiled and saved is         This error is given if  a  pattern  that  was  compiled  and  saved  is
3004         reloaded on a host with  different  endianness.  The  utility  function         reloaded  on  a  host  with  different endianness. The utility function
3005         pcre_pattern_to_host_byte_order() can be used to convert such a pattern         pcre_pattern_to_host_byte_order() can be used to convert such a pattern
3006         so that it runs on the new host.         so that it runs on the new host.
3007    
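For diagnostics it can be convenient to turn a negative return value into a name. The following is a minimal sketch, not a PCRE facility: exec_error_name() is a hypothetical helper, it covers only some of the codes listed above, and it uses literal values copied from that listing. A real program would include pcre.h and use the PCRE_ERROR_* macros instead.

```c
/* Hypothetical helper, not a PCRE function. Maps some of the
   pcre_exec() return values listed above to their names. */
const char *exec_error_name(int rc)
{
  switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -25: return "PCRE_ERROR_SHORTUTF8";
    case -26: return "PCRE_ERROR_RECURSELOOP";
    case -27: return "PCRE_ERROR_JIT_STACKLIMIT";
    default:  return (rc < 0) ? "unlisted error" : "not an error";
  }
}
```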
# Line 3007  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3009  MATCHING A PATTERN: THE TRADITIONAL FUNC
3009    
3010     Reason codes for invalid UTF-8 strings     Reason codes for invalid UTF-8 strings
3011    
3012         This section applies only  to  the  8-bit  library.  The  corresponding         This  section  applies  only  to  the  8-bit library. The corresponding
3013         information for the 16-bit library is given in the pcre16 page.         information for the 16-bit library is given in the pcre16 page.
3014    
3015         When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-         When pcre_exec() returns either PCRE_ERROR_BADUTF8 or PCRE_ERROR_SHORT-
3016         UTF8, and the size of the output vector (ovecsize) is at least  2,  the         UTF8,  and  the size of the output vector (ovecsize) is at least 2, the
3017         offset  of  the  start  of the invalid UTF-8 character is placed in the         offset of the start of the invalid UTF-8 character  is  placed  in  the
3018         first output vector element (ovector[0]) and a reason code is placed in         first output vector element (ovector[0]) and a reason code is placed in
3019         the  second  element  (ovector[1]). The reason codes are given names in         the second element (ovector[1]). The reason codes are  given  names  in
3020         the pcre.h header file:         the pcre.h header file:
3021    
3022           PCRE_UTF8_ERR1           PCRE_UTF8_ERR1
# Line 3023  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3025  MATCHING A PATTERN: THE TRADITIONAL FUNC
3025           PCRE_UTF8_ERR4           PCRE_UTF8_ERR4
3026           PCRE_UTF8_ERR5           PCRE_UTF8_ERR5
3027    
3028         The string ends with a truncated UTF-8 character;  the  code  specifies         The  string  ends  with a truncated UTF-8 character; the code specifies
3029         how  many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8         how many bytes are missing (1 to 5). Although RFC 3629 restricts  UTF-8
3030         characters to be no longer than 4 bytes, the  encoding  scheme  (origi-         characters  to  be  no longer than 4 bytes, the encoding scheme (origi-
3031         nally  defined  by  RFC  2279)  allows  for  up to 6 bytes, and this is         nally defined by RFC 2279) allows for  up  to  6  bytes,  and  this  is
3032         checked first; hence the possibility of 4 or 5 missing bytes.         checked first; hence the possibility of 4 or 5 missing bytes.
3033    
3034           PCRE_UTF8_ERR6           PCRE_UTF8_ERR6
# Line 3036  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3038  MATCHING A PATTERN: THE TRADITIONAL FUNC
3038           PCRE_UTF8_ERR10           PCRE_UTF8_ERR10
3039    
3040         The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of         The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
3041         the  character  do  not have the binary value 0b10 (that is, either the         the character do not have the binary value 0b10 (that  is,  either  the
3042         most significant bit is 0, or the next bit is 1).         most significant bit is 0, or the next bit is 1).
3043    
3044           PCRE_UTF8_ERR11           PCRE_UTF8_ERR11
3045           PCRE_UTF8_ERR12           PCRE_UTF8_ERR12
3046    
3047         A character that is valid by the RFC 2279 rules is either 5 or 6  bytes         A  character that is valid by the RFC 2279 rules is either 5 or 6 bytes
3048         long; these code points are excluded by RFC 3629.         long; these code points are excluded by RFC 3629.
3049    
3050           PCRE_UTF8_ERR13           PCRE_UTF8_ERR13
3051    
3052         A 4-byte character has a value greater than 0x10ffff; these code points         A 4-byte character has a value greater than 0x10ffff; these code points
3053         are excluded by RFC 3629.         are excluded by RFC 3629.
3054    
3055           PCRE_UTF8_ERR14           PCRE_UTF8_ERR14
3056    
3057         A 3-byte character has a value in the  range  0xd800  to  0xdfff;  this         A  3-byte  character  has  a  value in the range 0xd800 to 0xdfff; this
3058         range  of  code points is reserved by RFC 3629 for use with UTF-16, and         range of code points is reserved by RFC 3629 for use with UTF-16,  and
3059         so is excluded from UTF-8.         so is excluded from UTF-8.
3060    
3061           PCRE_UTF8_ERR15           PCRE_UTF8_ERR15
# Line 3062  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3064  MATCHING A PATTERN: THE TRADITIONAL FUNC
3064           PCRE_UTF8_ERR18           PCRE_UTF8_ERR18
3065           PCRE_UTF8_ERR19           PCRE_UTF8_ERR19
3066    
3067         A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it  codes         A  2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
3068         for  a  value that can be represented by fewer bytes, which is invalid.         for a value that can be represented by fewer bytes, which  is  invalid.
3069         For example, the two bytes 0xc0, 0xae give the value 0x2e,  whose  cor-         For  example,  the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
3070         rect coding uses just one byte.         rect coding uses just one byte.
3071    
3072           PCRE_UTF8_ERR20           PCRE_UTF8_ERR20
3073    
3074         The two most significant bits of the first byte of a character have the         The two most significant bits of the first byte of a character have the
3075         binary value 0b10 (that is, the most significant bit is 1 and the  sec-         binary  value 0b10 (that is, the most significant bit is 1 and the sec-
3076         ond  is  0). Such a byte can only validly occur as the second or subse-         ond is 0). Such a byte can only validly occur as the second  or  subse-
3077         quent byte of a multi-byte character.         quent byte of a multi-byte character.
3078    
3079           PCRE_UTF8_ERR21           PCRE_UTF8_ERR21
3080    
3081         The first byte of a character has the value 0xfe or 0xff. These  values         The  first byte of a character has the value 0xfe or 0xff. These values
3082         can never occur in a valid UTF-8 string.         can never occur in a valid UTF-8 string.
3083    
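Several of the reason codes above follow from the first byte of a character alone. The sketch below illustrates the rules as described; it is not PCRE's own validity check, and all three function names are hypothetical. It classifies a first byte under the original RFC 2279 scheme (1 to 6 bytes), reports how many bytes a truncated character is missing, and decodes a 2-byte sequence without validation so that the overlong example 0xc0, 0xae can be reproduced.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Total sequence length implied by a first byte
   under the RFC 2279 scheme: 1 to 6, or 0 for a 10xxxxxx byte (valid
   only as a second or subsequent byte), or -1 for 0xfe and 0xff. */
int utf8_seq_length(unsigned char c)
{
  if (c < 0x80) return 1;
  if (c < 0xc0) return 0;
  if (c < 0xe0) return 2;
  if (c < 0xf0) return 3;
  if (c < 0xf8) return 4;
  if (c < 0xfc) return 5;
  if (c < 0xfe) return 6;
  return -1;
}

/* Hypothetical helper. Number of bytes missing from the character that
   starts at s when only avail bytes remain in the subject; 0 if the
   character is not truncated or s[0] cannot start a character. */
int utf8_missing_bytes(const unsigned char *s, int avail)
{
  int len;
  assert(s != NULL && avail > 0);
  len = utf8_seq_length(s[0]);
  return (len > avail) ? len - avail : 0;
}

/* Hypothetical helper. Decodes a 2-byte sequence with no validation. */
unsigned int utf8_decode2(unsigned char b0, unsigned char b1)
{
  return ((b0 & 0x1fu) << 6) | (b1 & 0x3fu);
}
```

With these definitions, utf8_decode2(0xc0, 0xae) yields 0x2e, a value below 0x80 that fits in a single byte, which is why the 2-byte form is overlong.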
3084    
# Line 3093  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 3095  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
3095         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
3096              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
3097    
3098         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
3099         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
3100         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
3101         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
3102         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
3103         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
3104         substrings.         substrings.
3105    
3106         A  substring that contains a binary zero is correctly extracted and has         A substring that contains a binary zero is correctly extracted and  has
3107         a further zero added on the end, but the result is not, of course, a  C         a  further zero added on the end, but the result is not, of course, a C
3108         string.   However,  you  can  process such a string by referring to the         string.  However, you can process such a string  by  referring  to  the
3109         length that is  returned  by  pcre_copy_substring()  and  pcre_get_sub-         length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
3110         string().  Unfortunately, the interface to pcre_get_substring_list() is         string().  Unfortunately, the interface to pcre_get_substring_list() is
3111         not adequate for handling strings containing binary zeros, because  the         not  adequate for handling strings containing binary zeros, because the
3112         end of the final string is not independently indicated.         end of the final string is not independently indicated.
3113    
3114         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
3115         tions: subject is the subject string that has  just  been  successfully         tions:  subject  is  the subject string that has just been successfully
3116         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
3117         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
3118         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
3119         entire regular expression. This is the value returned by pcre_exec() if         entire regular expression. This is the value returned by pcre_exec() if
3120         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
3121         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
3122         be the number of elements in the vector divided by three.         be the number of elements in the vector divided by three.
3123    
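The choice of stringcount described above can be written as a one-line rule. This is a sketch under the stated assumptions: substring_count() is a hypothetical name, rc is a non-negative return value from pcre_exec(), and ovecsize is the number of elements in ovector.

```c
#include <assert.h>

/* Hypothetical helper, not a PCRE function. Value to pass as stringcount:
   the pcre_exec() return value if it is greater than zero, otherwise
   (ovector was too small) the number of vector elements divided by three. */
int substring_count(int rc, int ovecsize)
{
  assert(rc >= 0 && ovecsize >= 0);
  return (rc > 0) ? rc : ovecsize / 3;
}
```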
3124         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
3125         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
3126         zero  extracts  the  substring that matched the entire pattern, whereas         zero extracts the substring that matched the  entire  pattern,  whereas
3127         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
3128         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
3129         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
3130         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
3131         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
3132         the terminating zero, or one of these error codes:         the terminating zero, or one of these error codes:
3133    
3134           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
3135    
3136         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
3137         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
3138    
3139           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
3140    
3141         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
3142    
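To show what "accessing substrings directly" through ovector involves, here is a minimal sketch of a copy routine with the same return conventions as described above. It is not the library's implementation: copy_group() and the two ERR_* macros are hypothetical stand-ins, with the macro values copied from the listing.

```c
#include <assert.h>
#include <string.h>

#define ERR_NOMEMORY    (-6)  /* same value as PCRE_ERROR_NOMEMORY above */
#define ERR_NOSUBSTRING (-7)  /* same value as PCRE_ERROR_NOSUBSTRING above */

/* Hypothetical helper, not a PCRE function. Copies substring number
   stringnumber into buffer as a zero-terminated string. Returns its
   length, or a negative code if there is no such substring or the
   buffer is too small. An unset group (-1, -1) yields an empty string. */
int copy_group(const char *subject, const int *ovector, int stringcount,
               int stringnumber, char *buffer, int buffersize)
{
  int start, len;
  assert(subject != NULL && ovector != NULL && buffer != NULL);
  if (stringnumber < 0 || stringnumber >= stringcount)
    return ERR_NOSUBSTRING;
  start = ovector[2 * stringnumber];
  len = ovector[2 * stringnumber + 1] - start;  /* 0 for an unset group */
  if (buffersize < len + 1)
    return ERR_NOMEMORY;
  if (len > 0)
    memcpy(buffer, subject + start, (size_t)len);
  buffer[len] = '\0';
  return len;
}
```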
3143         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
3144         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
3145         single block of memory that is obtained via pcre_malloc. The address of         single block of memory that is obtained via pcre_malloc. The address of
3146         the  memory  block  is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
3147         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
3148         pointer.  The  yield  of  the function is zero if all went well, or the         pointer. The yield of the function is zero if all  went  well,  or  the
3149         error code         error code
3150    
3151           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
3152    
3153         if the attempt to get the memory block failed.         if the attempt to get the memory block failed.
3154    
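Because the list is terminated by a NULL pointer, a caller can walk it without knowing stringcount. The sketch below assumes list is the pointer that pcre_get_substring_list() stored via listptr; substring_list_length() is a hypothetical name, not a PCRE function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper. Counts the entries in a NULL-terminated list of
   string pointers, such as the one returned via listptr. */
int substring_list_length(const char **list)
{
  int n = 0;
  assert(list != NULL);
  while (list[n] != NULL)
    n++;
  return n;
}
```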
3155         When any of these functions encounters a substring that is unset, which         When any of these functions encounters a substring that is unset, which
3156         can  happen  when  capturing subpattern number n+1 matches some part of         can happen when capturing subpattern number n+1 matches  some  part  of
3157         the subject, but subpattern n has not been used at all, they return  an         the  subject, but subpattern n has not been used at all, they return an
3158         empty string. This can be distinguished from a genuine zero-length sub-         empty string. This can be distinguished from a genuine zero-length sub-
3159         string by inspecting the appropriate offset in ovector, which is  nega-         string  by inspecting the appropriate offset in ovector, which is nega-
3160         tive for unset substrings.         tive for unset substrings.
3161    
3162         The  two convenience functions pcre_free_substring() and pcre_free_sub-         The two convenience functions pcre_free_substring() and  pcre_free_sub-
3163         string_list() can be used to free the memory  returned  by  a  previous         string_list()  can  be  used  to free the memory returned by a previous
3164         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
3165         tively. They do nothing more than  call  the  function  pointed  to  by         tively.  They  do  nothing  more  than  call the function pointed to by
3166         pcre_free,  which  of course could be called directly from a C program.         pcre_free, which of course could be called directly from a  C  program.
3167         However, PCRE is used in some situations where it is linked via a  spe-         However,  PCRE is used in some situations where it is linked via a spe-
3168         cial   interface  to  another  programming  language  that  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
3169         pcre_free directly; it is for these cases that the functions  are  pro-         pcre_free  directly;  it is for these cases that the functions are pro-
3170         vided.         vided.
3171    
3172    
# Line 3183  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 3185  EXTRACTING CAPTURED SUBSTRINGS BY NAME
3185              int stringcount, const char *stringname,              int stringcount, const char *stringname,
3186              const char **stringptr);              const char **stringptr);
3187    
3188         To extract a substring by name, you first have to find the associated         To extract a substring by name, you first have to find the associated
3189         number. For example, for this pattern         number. For example, for this pattern
3190    
3191           (a+)b(?<xxx>\d+)...           (a+)b(?<xxx>\d+)...
# Line 3192  EXTRACTING CAPTURED SUBSTRINGS BY NAME Line 3194  EXTRACTING CAPTURED SUBSTRINGS BY NAME
3194         be unique (PCRE_DUPNAMES was not set), you can find the number from the         be unique (PCRE_DUPNAMES was not set), you can find the number from the
3195         name by calling pcre_get_stringnumber(). The first argument is the com-         name by calling pcre_get_stringnumber(). The first argument is the com-
3196         piled pattern, and the second is the name. The yield of the function is         piled pattern, and the second is the name. The yield of the function is
3197         the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if  there  is  no         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
3198         subpattern of that name.         subpattern of that name.
3199    
3200         Given the number, you can extract the substring directly, or use one of         Given the number, you can extract the substring directly, or use one of
3201         the functions described in the previous section. For convenience, there         the functions described in the previous section. For convenience, there
3202         are also two functions that do the whole job.         are also two functions that do the whole job.
3203    
3204         Most    of    the    arguments   of   pcre_copy_named_substring()   and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
3205         pcre_get_named_substring() are the same  as  those  for  the  similarly         pcre_get_named_substring()  are  the  same  as  those for the similarly
3206         named  functions  that extract by number. As these are described in the         named functions that extract by number. As these are described  in  the
3207         previous section, they are not re-described here. There  are  just  two         previous  section,  they  are not re-described here. There are just two
3208         differences:         differences:
3209    
3210         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
3211         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
3212         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
3213         name-to-number translation table.         name-to-number translation table.
3214    
3215         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
3216         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
3217         ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate  names,  the         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
3218         behaviour may not be what you want (see the next section).         behaviour may not be what you want (see the next section).
3219    
3220         Warning: If the pattern uses the (?| feature to set up multiple subpat-         Warning: If the pattern uses the (?| feature to set up multiple subpat-
3221         terns with the same number, as described in the  section  on  duplicate         terns  with  the  same number, as described in the section on duplicate
3222         subpattern  numbers  in  the  pcrepattern page, you cannot use names to         subpattern numbers in the pcrepattern page, you  cannot  use  names  to
3223         distinguish the different subpatterns, because names are  not  included         distinguish  the  different subpatterns, because names are not included
3224         in  the compiled code. The matching process uses only numbers. For this         in the compiled code. The matching process uses only numbers. For  this
3225         reason, the use of different names for subpatterns of the  same  number         reason,  the  use of different names for subpatterns of the same number
3226         causes an error at compile time.         causes an error at compile time.
3227    
3228    
# Line 3229  DUPLICATE SUBPATTERN NAMES Line 3231  DUPLICATE SUBPATTERN NAMES
3231         int pcre_get_stringtable_entries(const pcre *code,         int pcre_get_stringtable_entries(const pcre *code,
3232              const char *name, char **first, char **last);              const char *name, char **first, char **last);
3233    
3234         When  a  pattern  is  compiled with the PCRE_DUPNAMES option, names for         When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
3235         subpatterns are not required to be unique. (Duplicate names are  always         subpatterns  are not required to be unique. (Duplicate names are always
3236         allowed  for subpatterns with the same number, created by using the (?|         allowed for subpatterns with the same number, created by using the  (?|
3237         feature. Indeed, if such subpatterns are named, they  are  required  to         feature.  Indeed,  if  such subpatterns are named, they are required to
3238         use the same names.)         use the same names.)
3239    
3240         Normally, patterns with duplicate names are such that in any one match,         Normally, patterns with duplicate names are such that in any one match,
3241         only one of the named subpatterns participates. An example is shown  in         only  one of the named subpatterns participates. An example is shown in
3242         the pcrepattern documentation.         the pcrepattern documentation.
3243    
3244         When    duplicates   are   present,   pcre_copy_named_substring()   and         When   duplicates   are   present,   pcre_copy_named_substring()    and
3245         pcre_get_named_substring() return the first substring corresponding  to         pcre_get_named_substring()  return the first substring corresponding to
3246         the  given  name  that  is set. If none are set, PCRE_ERROR_NOSUBSTRING         the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
3247         (-7) is returned; no  data  is  returned.  The  pcre_get_stringnumber()         (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
3248         function  returns one of the numbers that are associated with the name,         function returns one of the numbers that are associated with the  name,
3249         but it is not defined which it is.         but it is not defined which it is.
3250    
3251         If you want to get full details of all captured substrings for a  given         If  you want to get full details of all captured substrings for a given
3252         name,  you  must  use  the pcre_get_stringtable_entries() function. The         name, you must use  the  pcre_get_stringtable_entries()  function.  The
3253         first argument is the compiled pattern, and the second is the name. The         first argument is the compiled pattern, and the second is the name. The
3254         third  and  fourth  are  pointers to variables which are updated by the         third and fourth are pointers to variables which  are  updated  by  the
3255         function. After it has run, they point to the first and last entries in         function. After it has run, they point to the first and last entries in
3256         the  name-to-number  table  for  the  given  name.  The function itself         the name-to-number table  for  the  given  name.  The  function  itself
3257         returns the length of each entry,  or  PCRE_ERROR_NOSUBSTRING  (-7)  if         returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
3258         there are none. The format of the table is described above in the         there are none. The format of the table is described above in the
3259         section entitled "Information about a pattern". Given all the relevant         section entitled "Information about a pattern". Given all the relevant
3260         entries for the name, you can extract each of their numbers, and         entries for the name, you can extract each of their numbers, and
3261         hence the captured data, if any.         hence the captured data, if any.
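
As a rough sketch of the last step, the following C function walks the entries between the first and last pointers and collects the group numbers. It assumes the 8-bit name table layout, in which each entry starts with the group number in two bytes, most significant first, followed by the zero-terminated name. The function name collect_group_numbers and its parameters are hypothetical; entry_size stands for the value returned by pcre_get_stringtable_entries().

```c
#include <assert.h>
#include <stddef.h>

/* Collect the group numbers stored in the name table entries from
   first to last inclusive. Returns how many numbers were written to
   numbers[], never more than max_numbers. */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *numbers,
                                 int max_numbers)
{
    const char *entry;
    int count = 0;

    assert(first != NULL && last != NULL && numbers != NULL);
    assert(entry_size > 2);   /* two number bytes plus at least a NUL */

    for (entry = first; count < max_numbers; entry += entry_size) {
        /* The number occupies the first two bytes, high byte first. */
        numbers[count++] = ((unsigned char)entry[0] << 8) |
                           (unsigned char)entry[1];
        if (entry == last)
            break;            /* last is the final entry, not one past it */
    }
    return count;
}
```

Each number obtained this way can then be used to index ovector, or be passed to pcre_get_substring(), to see whether that particular group was set.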
3262    
3263    
3264  FINDING ALL POSSIBLE MATCHES  FINDING ALL POSSIBLE MATCHES
3265    
3266         The traditional matching function uses a  similar  algorithm  to  Perl,         The  traditional  matching  function  uses a similar algorithm to Perl,
3267         which stops when it finds the first match, starting at a given point in         which stops when it finds the first match, starting at a given point in
3268         the subject. If you want to find all possible matches, or  the  longest         the  subject.  If you want to find all possible matches, or the longest
3269         possible  match,  consider using the alternative matching function (see         possible match, consider using the alternative matching  function  (see
3270         below) instead. If you cannot use the alternative function,  but  still         below)  instead.  If you cannot use the alternative function, but still
3271         need  to  find all possible matches, you can kludge it up by making use         need to find all possible matches, you can kludge it up by  making  use
3272         of the callout facility, which is described in the pcrecallout documen-         of the callout facility, which is described in the pcrecallout documen-
3273         tation.         tation.
3274    
3275         What you have to do is to insert a callout right at the end of the pat-         What you have to do is to insert a callout right at the end of the pat-
3276         tern.  When your callout function is called, extract and save the  cur-         tern.   When your callout function is called, extract and save the cur-
3277         rent  matched  substring.  Then  return  1, which forces pcre_exec() to         rent matched substring. Then return  1,  which  forces  pcre_exec()  to
3278         backtrack and try other alternatives. Ultimately, when it runs  out  of         backtrack  and  try other alternatives. Ultimately, when it runs out of
3279         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.         matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
3280    
3281    
3282  OBTAINING AN ESTIMATE OF STACK USAGE  OBTAINING AN ESTIMATE OF STACK USAGE
3283    
3284         Matching  certain  patterns  using pcre_exec() can use a lot of process         Matching certain patterns using pcre_exec() can use a  lot  of  process
3285         stack, which in certain environments can be  rather  limited  in  size.         stack,  which  in  certain  environments can be rather limited in size.
3286         Some  users  find it helpful to have an estimate of the amount of stack         Some users find it helpful to have an estimate of the amount  of  stack
3287         that is used by pcre_exec(), to help  them  set  recursion  limits,  as         that  is  used  by  pcre_exec(),  to help them set recursion limits, as
3288         described  in  the pcrestack documentation. The estimate that is output         described in the pcrestack documentation. The estimate that  is  output
3289         by pcretest when called with the -m and -C options is obtained by call-         by pcretest when called with the -m and -C options is obtained by call-
3290         ing  pcre_exec with the values NULL, NULL, NULL, -999, and -999 for its         ing pcre_exec with the values NULL, NULL, NULL, -999, and -999 for  its
3291         first five arguments.         first five arguments.
3292    
3293         Normally, if  its  first  argument  is  NULL,  pcre_exec()  immediately         Normally,  if  its  first  argument  is  NULL,  pcre_exec() immediately
3294         returns  the negative error code PCRE_ERROR_NULL, but with this special         returns the negative error code PCRE_ERROR_NULL, but with this  special
3295         combination of arguments, it returns instead a  negative  number  whose         combination  of  arguments,  it returns instead a negative number whose
3296         absolute  value  is the approximate stack frame size in bytes. (A nega-         absolute value is the approximate stack frame size in bytes.  (A  nega-
3297         tive number is used so that it is clear that no  match  has  happened.)         tive  number  is  used so that it is clear that no match has happened.)
3298         The  value  is  approximate  because  in some cases, recursive calls to         The value is approximate because in  some  cases,  recursive  calls  to
3299         pcre_exec() occur when there are one or two additional variables on the         pcre_exec() occur when there are one or two additional variables on the
3300         stack.         stack.
3301    
3302         If  PCRE  has  been  compiled  to use the heap instead of the stack for         If PCRE has been compiled to use the heap  instead  of  the  stack  for
3303         recursion, the value returned  is  the  size  of  each  block  that  is         recursion,  the  value  returned  is  the  size  of  each block that is
3304         obtained from the heap.         obtained from the heap.
3305    
3306    
# Line 3309  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 3311  MATCHING A PATTERN: THE ALTERNATIVE FUNC
3311              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
3312              int *workspace, int wscount);              int *workspace, int wscount);
3313    
3314         The  function  pcre_dfa_exec()  is  called  to  match  a subject string         The function pcre_dfa_exec()  is  called  to  match  a  subject  string
3315         against a compiled pattern, using a matching algorithm that  scans  the         against  a  compiled pattern, using a matching algorithm that scans the
3316         subject  string  just  once, and does not backtrack. This has different         subject string just once, and does not backtrack.  This  has  different
3317         characteristics to the normal algorithm, and  is  not  compatible  with         characteristics  to  the  normal  algorithm, and is not compatible with
3318         Perl.  Some  of the features of PCRE patterns are not supported. Never-         Perl. Some of the features of PCRE patterns are not  supported.  Never-
3319         theless, there are times when this kind of matching can be useful.  For         theless,  there are times when this kind of matching can be useful. For
3320         a  discussion  of  the  two matching algorithms, and a list of features         a discussion of the two matching algorithms, and  a  list  of  features
3321         that pcre_dfa_exec() does not support, see the pcrematching  documenta-         that  pcre_dfa_exec() does not support, see the pcrematching documenta-
3322         tion.         tion.
3323    
3324         The  arguments  for  the  pcre_dfa_exec()  function are the same as for         The arguments for the pcre_dfa_exec() function  are  the  same  as  for
3325         pcre_exec(), plus two extras. The ovector argument is used in a differ-         pcre_exec(), plus two extras. The ovector argument is used in a differ-
3326         ent  way,  and  this is described below. The other common arguments are         ent way, and this is described below. The other  common  arguments  are
3327         used in the same way as for pcre_exec(), so their  description  is  not         used  in  the  same way as for pcre_exec(), so their description is not
3328         repeated here.         repeated here.
3329    
3330         The  two  additional  arguments provide workspace for the function. The         The two additional arguments provide workspace for  the  function.  The
3331         workspace vector should contain at least 20 elements. It  is  used  for         workspace  vector  should  contain at least 20 elements. It is used for
3332         keeping  track  of  multiple  paths  through  the  pattern  tree.  More         keeping  track  of  multiple  paths  through  the  pattern  tree.  More
3333         workspace will be needed for patterns and subjects where  there  are  a         workspace  will  be  needed for patterns and subjects where there are a
3334         lot of potential matches.         lot of potential matches.
3335    
3336         Here is an example of a simple call to pcre_dfa_exec():         Here is an example of a simple call to pcre_dfa_exec():
# Line 3350  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 3352  MATCHING A PATTERN: THE ALTERNATIVE FUNC
3352    
3353     Option bits for pcre_dfa_exec()     Option bits for pcre_dfa_exec()
3354    
3355         The  unused  bits  of  the options argument for pcre_dfa_exec() must be         The unused bits of the options argument  for  pcre_dfa_exec()  must  be
3356         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-         zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
3357         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,         LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
3358         PCRE_NOTEMPTY_ATSTART,      PCRE_NO_UTF8_CHECK,       PCRE_BSR_ANYCRLF,         PCRE_NOTEMPTY_ATSTART,       PCRE_NO_UTF8_CHECK,      PCRE_BSR_ANYCRLF,
3359         PCRE_BSR_UNICODE,  PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD, PCRE_PAR-         PCRE_BSR_UNICODE, PCRE_NO_START_OPTIMIZE, PCRE_PARTIAL_HARD,  PCRE_PAR-
3360         TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART.  All but  the  last         TIAL_SOFT,  PCRE_DFA_SHORTEST,  and PCRE_DFA_RESTART.  All but the last
3361         four  of  these  are  exactly  the  same  as  for pcre_exec(), so their         four of these are  exactly  the  same  as  for  pcre_exec(),  so  their
3362         description is not repeated here.         description is not repeated here.
3363    
3364           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
3365           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
3366    
3367         These have the same general effect as they do for pcre_exec(), but  the         These  have the same general effect as they do for pcre_exec(), but the
3368         details  are  slightly  different.  When  PCRE_PARTIAL_HARD  is set for         details are slightly  different.  When  PCRE_PARTIAL_HARD  is  set  for
3369         pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of  the  sub-         pcre_dfa_exec(),  it  returns PCRE_ERROR_PARTIAL if the end of the sub-
3370         ject  is  reached  and there is still at least one matching possibility         ject is reached and there is still at least  one  matching  possibility
3371         that requires additional characters. This happens even if some complete         that requires additional characters. This happens even if some complete
3372         matches have also been found. When PCRE_PARTIAL_SOFT is set, the return         matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
3373         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end         code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
3374         of  the  subject  is  reached, there have been no complete matches, but         of the subject is reached, there have been  no  complete  matches,  but
3375         there is still at least one matching possibility. The  portion  of  the         there  is  still  at least one matching possibility. The portion of the
3376         string  that  was inspected when the longest partial match was found is         string that was inspected when the longest partial match was  found  is
3377         set as the first matching string  in  both  cases.   There  is  a  more         set  as  the  first  matching  string  in  both cases.  There is a more
3378         detailed  discussion  of partial and multi-segment matching, with exam-         detailed discussion of partial and multi-segment matching,  with  exam-
3379         ples, in the pcrepartial documentation.         ples, in the pcrepartial documentation.
3380    
3381           PCRE_DFA_SHORTEST           PCRE_DFA_SHORTEST
3382    
3383         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to         Setting  the  PCRE_DFA_SHORTEST option causes the matching algorithm to
3384         stop as soon as it has found one match. Because of the way the alterna-         stop as soon as it has found one match. Because of the way the alterna-
3385         tive algorithm works, this is necessarily the shortest  possible  match         tive  algorithm  works, this is necessarily the shortest possible match
3386         at the first possible matching point in the subject string.         at the first possible matching point in the subject string.
3387    
3388           PCRE_DFA_RESTART           PCRE_DFA_RESTART
3389    
3390         When pcre_dfa_exec() returns a partial match, it is possible to call it         When pcre_dfa_exec() returns a partial match, it is possible to call it
3391         again, with additional subject characters, and have  it  continue  with         again,  with  additional  subject characters, and have it continue with
3392         the  same match. The PCRE_DFA_RESTART option requests this action; when         the same match. The PCRE_DFA_RESTART option requests this action;  when
3393         it is set, the workspace and wscount arguments must reference the same         it is set, the workspace and wscount arguments must reference the same
3394         vector  as  before  because data about the match so far is left in them         vector as before because data about the match so far is  left  in  them
3395         after a partial match. There is more discussion of this facility in the         after a partial match. There is more discussion of this facility in the
3396         pcrepartial documentation.         pcrepartial documentation.
3397    
3398     Successful returns from pcre_dfa_exec()     Successful returns from pcre_dfa_exec()
3399    
3400         When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-         When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
3401         string in the subject. Note, however, that all the matches from one run         string in the subject. Note, however, that all the matches from one run
3402         of  the  function  start  at the same point in the subject. The shorter         of the function start at the same point in  the  subject.  The  shorter
3403         matches are all initial substrings of the longer matches. For  example,         matches  are all initial substrings of the longer matches. For example,
3404         if the pattern         if the pattern
3405    
3406           <.*>           <.*>
# Line 3413  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 3415  MATCHING A PATTERN: THE ALTERNATIVE FUNC
3415           <something> <something else>           <something> <something else>
3416           <something> <something else> <something further>           <something> <something else> <something further>
3417    
3418         On  success,  the  yield of the function is a number greater than zero,         On success, the yield of the function is a number  greater  than  zero,
3419         which is the number of matched substrings.  The  substrings  themselves         which  is  the  number of matched substrings. The substrings themselves
3420         are  returned  in  ovector. Each string uses two elements; the first is         are returned in ovector. Each string uses two elements;  the  first  is
3421         the offset to the start, and the second is the offset to  the  end.  In         the  offset  to  the start, and the second is the offset to the end. In
3422         fact,  all  the  strings  have the same start offset. (Space could have         fact, all the strings have the same start  offset.  (Space  could  have
3423         been saved by giving this only once, but it was decided to retain  some         been  saved by giving this only once, but it was decided to retain some
3424         compatibility  with  the  way pcre_exec() returns data, even though the         compatibility with the way pcre_exec() returns data,  even  though  the
3425         meaning of the strings is different.)         meaning of the strings is different.)
3426    
3427         The strings are returned in reverse order of length; that is, the long-         The strings are returned in reverse order of length; that is, the long-
3428         est  matching  string is given first. If there were too many matches to         est matching string is given first. If there were too many  matches  to
3429         fit into ovector, the yield of the function is zero, and the vector  is         fit  into ovector, the yield of the function is zero, and the vector is
3430         filled  with  the  longest matches. Unlike pcre_exec(), pcre_dfa_exec()         filled with the longest matches.  Unlike  pcre_exec(),  pcre_dfa_exec()
3431         can use the entire ovector for returning matched strings.         can use the entire ovector for returning matched strings.
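
The return conventions above might be handled along the following lines. This is only a sketch with hypothetical helper names (dfa_stored_matches, dfa_match_length); it assumes rc is the value returned by pcre_dfa_exec() and ovecsize is the number of int elements in ovector, all of which can hold results.

```c
#include <assert.h>
#include <stddef.h>

/* Number of (start, end) pairs that were actually stored in ovector. */
static int dfa_stored_matches(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* every match fitted */
    if (rc == 0)
        return ovecsize / 2;  /* overflow: vector holds the longest ones */
    return 0;                 /* negative: error or no match */
}

/* Length of the i-th stored match; i == 0 is the longest. */
static int dfa_match_length(const int *ovector, int i)
{
    assert(ovector != NULL && i >= 0);
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

Because the strings are ordered longest first, a caller that only wants the longest match can read ovector[0] and ovector[1] and ignore the rest.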
3432    
3433     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
3434    
3435         The pcre_dfa_exec() function returns a negative number when  it  fails.         The  pcre_dfa_exec()  function returns a negative number when it fails.
3436         Many  of  the  errors  are  the  same as for pcre_exec(), and these are         Many of the errors are the same  as  for  pcre_exec(),  and  these  are
3437         described above.  There are in addition the following errors  that  are         described  above.   There are in addition the following errors that are
3438         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
3439    
3440           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
3441    
3442         This  return is given if pcre_dfa_exec() encounters an item in the pat-         This return is given if pcre_dfa_exec() encounters an item in the  pat-
3443         tern that it does not support, for instance, the use of \C  or  a  back         tern  that  it  does not support, for instance, the use of \C or a back
3444         reference.         reference.
3445    
3446           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
3447    
3448         This  return  is  given  if pcre_dfa_exec() encounters a condition item         This return is given if pcre_dfa_exec()  encounters  a  condition  item
3449         that uses a back reference for the condition, or a test  for  recursion         that  uses  a back reference for the condition, or a test for recursion
3450         in a specific group. These are not supported.         in a specific group. These are not supported.
3451    
3452           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
3453    
3454         This  return  is given if pcre_dfa_exec() is called with an extra block         This return is given if pcre_dfa_exec() is called with an  extra  block
3455         that contains a setting of  the  match_limit  or  match_limit_recursion         that  contains  a  setting  of the match_limit or match_limit_recursion
3456         fields.  This  is  not  supported (these fields are meaningless for DFA         fields. This is not supported (these fields  are  meaningless  for  DFA
3457         matching).         matching).
3458    
3459           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
3460    
3461         This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the         This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
3462         workspace vector.         workspace vector.
3463    
3464           PCRE_ERROR_DFA_RECURSE    (-20)           PCRE_ERROR_DFA_RECURSE    (-20)
3465    
3466         When  a  recursive subpattern is processed, the matching function calls         When a recursive subpattern is processed, the matching  function  calls
3467         itself recursively, using private vectors for  ovector  and  workspace.         itself  recursively,  using  private vectors for ovector and workspace.
3468         This  error  is  given  if  the output vector is not large enough. This         This error is given if the output vector  is  not  large  enough.  This
3469         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
3470    
3471    
3472  SEE ALSO  SEE ALSO
3473    
3474         pcre16(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematch-         pcre16(3), pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematch-
3475         ing(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),         ing(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3),
3476         pcrestack(3).         pcrestack(3).
3477    
# Line 3483  AUTHOR Line 3485  AUTHOR
3485    
3486  REVISION  REVISION
3487    
3488         Last updated: 24 February 2012         Last updated: 14 April 2012
3489         Copyright (c) 1997-2012 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
3490  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3491    
# Line 4697  MATCHING A SINGLE DATA UNIT Line 4699  MATCHING A SINGLE DATA UNIT
4699         means that the rest of the string may start with a malformed UTF  char-         means that the rest of the string may start with a malformed UTF  char-
4700         acter.  This  has  undefined  results,  because PCRE assumes that it is         acter.  This  has  undefined  results,  because PCRE assumes that it is
4701         dealing with valid UTF strings (and by default it checks  this  at  the         dealing with valid UTF strings (and by default it checks  this  at  the
4702         start of processing unless the PCRE_NO_UTF8_CHECK option is used).         start     of    processing    unless    the    PCRE_NO_UTF8_CHECK    or
4703           PCRE_NO_UTF16_CHECK option is used).
4704    
4705         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE does not allow \C to appear in  lookbehind  assertions  (described
4706         below) in a UTF mode, because this would make it impossible  to  calcu-         below)  in  a UTF mode, because this would make it impossible to calcu-
4707         late the length of the lookbehind.         late the length of the lookbehind.
4708    
4709         In general, the \C escape sequence is best avoided. However, one way of         In general, the \C escape sequence is best avoided. However, one way of
4710         using it that avoids the problem of malformed UTF characters is to  use         using  it that avoids the problem of malformed UTF characters is to use
4711         a  lookahead to check the length of the next character, as in this pat-         a lookahead to check the length of the next character, as in this  pat-
4712         tern, which could be used with a UTF-8 string (ignore white  space  and         tern,  which  could be used with a UTF-8 string (ignore white space and
4713         line breaks):         line breaks):
4714    
4715           (?| (?=[\x00-\x7f])(\C) |           (?| (?=[\x00-\x7f])(\C) |
# Line 4714  MATCHING A SINGLE DATA UNIT Line 4717  MATCHING A SINGLE DATA UNIT
4717               (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |               (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
4718               (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))               (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
4719    
4720         A  group  that starts with (?| resets the capturing parentheses numbers         A group that starts with (?| resets the capturing  parentheses  numbers
4721         in each alternative (see "Duplicate  Subpattern  Numbers"  below).  The         in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
4722         assertions  at  the start of each branch check the next UTF-8 character         assertions at the start of each branch check the next  UTF-8  character
4723         for values whose encoding uses 1, 2, 3, or 4 bytes,  respectively.  The         for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
4724         character's  individual bytes are then captured by the appropriate num-         character's individual bytes are then captured by the appropriate  num-
4725         ber of groups.         ber of groups.
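
The four lookahead ranges in the pattern above correspond to the four possible UTF-8 encoded lengths. The sketch below only illustrates that mapping. It is plain Python rather than PCRE, and the helper name utf8_length is hypothetical, not part of any library.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for a code point (illustrative helper)."""
    if code_point <= 0x7F:
        return 1          # matches the [\x00-\x7f] branch
    if code_point <= 0x7FF:
        return 2          # matches the [\x80-\x{7ff}] branch
    if code_point <= 0xFFFF:
        return 3          # matches the [\x{800}-\x{ffff}] branch
    return 4              # matches the [\x{10000}-\x{1fffff}] branch


# Cross-check against Python's own UTF-8 encoder, one sample per range.
samples = (0x41, 0xE9, 0x20AC, 0x1F600)
lengths = [(utf8_length(cp), len(chr(cp).encode("utf-8"))) for cp in samples]
```

Each pair in lengths should hold two equal numbers, one from the range test and one from the encoder.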
4726    
4727    
# Line 4728  SQUARE BRACKETS AND CHARACTER CLASSES Line 4731  SQUARE BRACKETS AND CHARACTER CLASSES
4731         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
4732         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
4733         a lone closing square bracket causes a compile-time error. If a closing         a lone closing square bracket causes a compile-time error. If a closing
4734         square bracket is required as a member of the class, it should  be  the         square  bracket  is required as a member of the class, it should be the
4735         first  data  character  in  the  class (after an initial circumflex, if         first data character in the class  (after  an  initial  circumflex,  if
4736         present) or escaped with a backslash.         present) or escaped with a backslash.
4737    
4738         A character class matches a single character in the subject. In  a  UTF         A  character  class matches a single character in the subject. In a UTF
4739         mode,  the  character  may  be  more than one data unit long. A matched         mode, the character may be more than one  data  unit  long.  A  matched
4740         character must be in the set of characters defined by the class, unless         character must be in the set of characters defined by the class, unless
4741         the  first  character in the class definition is a circumflex, in which         the first character in the class definition is a circumflex,  in  which
4742         case the subject character must not be in the set defined by the class.         case the subject character must not be in the set defined by the class.
4743         If  a  circumflex is actually required as a member of the class, ensure         If a circumflex is actually required as a member of the  class,  ensure
4744         it is not the first character, or escape it with a backslash.         it is not the first character, or escape it with a backslash.
4745    
4746         For example, the character class [aeiou] matches any lower case  vowel,         For  example, the character class [aeiou] matches any lower case vowel,
4747         while  [^aeiou]  matches  any character that is not a lower case vowel.         while [^aeiou] matches any character that is not a  lower  case  vowel.
4748         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
4749         characters  that  are in the class by enumerating those that are not. A         characters that are in the class by enumerating those that are  not.  A
4750         class that starts with a circumflex is not an assertion; it still  con-         class  that starts with a circumflex is not an assertion; it still con-
4751         sumes  a  character  from the subject string, and therefore it fails if         sumes a character from the subject string, and therefore  it  fails  if
4752         the current pointer is at the end of the string.         the current pointer is at the end of the string.
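
A short sketch of the behaviour described above, using Python's re module as a stand-in. This is an assumption made for illustration: re is not PCRE, but it treats these simple classes the same way.

```python
import re

# A positive class matches one character from the listed set.
vowel = re.fullmatch(r"[aeiou]", "e")
not_listed = re.fullmatch(r"[aeiou]", "x")

# A negated class still consumes a character, so it cannot match
# when the current position is the end of the subject.
at_end = re.search(r"a[^aeiou]", "a")
with_next = re.search(r"a[^aeiou]", "ab")

# A literal circumflex: put it anywhere but first, or escape it.
caret_not_first = re.fullmatch(r"[a^]", "^")
caret_escaped = re.fullmatch(r"[\^a]", "^")
```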
4753    
4754         In UTF-8  (UTF-16)  mode,  characters  with  values  greater  than  255         In  UTF-8  (UTF-16)  mode,  characters  with  values  greater  than 255
4755         (0xffff)  can be included in a class as a literal string of data units,         (0xffff) can be included in a class as a literal string of data  units,
4756         or by using the \x{ escaping mechanism.         or by using the \x{ escaping mechanism.
4757    
4758         When caseless matching is set, any letters in a  class  represent  both         When  caseless  matching  is set, any letters in a class represent both
4759         their  upper  case  and lower case versions, so for example, a caseless         their upper case and lower case versions, so for  example,  a  caseless
4760         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
4761         match  "A", whereas a caseful version would. In a UTF mode, PCRE always         match "A", whereas a caseful version would. In a UTF mode, PCRE  always
4762         understands the concept of case for characters whose  values  are  less         understands  the  concept  of case for characters whose values are less
4763         than  128, so caseless matching is always possible. For characters with         than 128, so caseless matching is always possible. For characters  with
4764         higher values, the concept of case is supported  if  PCRE  is  compiled         higher  values,  the  concept  of case is supported if PCRE is compiled
4765         with  Unicode  property support, but not otherwise.  If you want to use         with Unicode property support, but not otherwise.  If you want  to  use
4766         caseless matching in a UTF mode for characters 128 and above, you  must         caseless  matching in a UTF mode for characters 128 and above, you must
4767         ensure  that  PCRE is compiled with Unicode property support as well as         ensure that PCRE is compiled with Unicode property support as  well  as
4768         with UTF support.         with UTF support.
4769    
4770         Characters that might indicate line breaks are  never  treated  in  any         Characters  that  might  indicate  line breaks are never treated in any
4771         special  way  when  matching  character  classes,  whatever line-ending         special way  when  matching  character  classes,  whatever  line-ending
4772         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and         sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
4773         PCRE_MULTILINE options is used. A class such as [^a] always matches one         PCRE_MULTILINE options is used. A class such as [^a] always matches one
4774         of these characters.         of these characters.
4775    
4776         The minus (hyphen) character can be used to specify a range of  charac-         The  minus (hyphen) character can be used to specify a range of charac-
4777         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters in a character  class.  For  example,  [d-m]  matches  any  letter
4778         between d and m, inclusive. If a  minus  character  is  required  in  a         between  d  and  m,  inclusive.  If  a minus character is required in a
4779         class,  it  must  be  escaped  with a backslash or appear in a position         class, it must be escaped with a backslash  or  appear  in  a  position
4780         where it cannot be interpreted as indicating a range, typically as  the         where  it cannot be interpreted as indicating a range, typically as the
4781         first or last character in the class.         first or last character in the class.
4782    
4783         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
4784         ter of a range. A pattern such as [W-]46] is interpreted as a class  of         ter  of a range. A pattern such as [W-]46] is interpreted as a class of
4785         two  characters ("W" and "-") followed by a literal string "46]", so it         two characters ("W" and "-") followed by a literal string "46]", so  it
4786         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
4787         backslash  it is interpreted as the end of range, so [W-\]46] is inter-         backslash it is interpreted as the end of range, so [W-\]46] is  inter-
4788         preted as a class containing a range followed by two other  characters.         preted  as a class containing a range followed by two other characters.
4789         The  octal or hexadecimal representation of "]" can also be used to end         The octal or hexadecimal representation of "]" can also be used to  end
4790         a range.         a range.
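
The two readings of "]" can be checked with a small sketch. Python's re module is used here as an assumption. It is a different engine from PCRE, but it parses these particular classes the same way.

```python
import re

# Unescaped "]" closes the class: [W-] is "W" or "-", followed by literal "46]".
two_char_class = re.compile(r"[W-]46]")

# Escaped "]" ends a range: the class is W..] plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

# The hexadecimal form of "]" (0x5d) can also end the range.
hex_range_class = re.compile(r"[W-\x5d46]")

members = [ch for ch in "VWXYZ[\\]^46" if range_class.fullmatch(ch)]
hex_members = [ch for ch in "VWXYZ[\\]^46" if hex_range_class.fullmatch(ch)]
```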
4791    
4792         Ranges operate in the collating sequence of character values. They  can         Ranges  operate in the collating sequence of character values. They can
4793         also   be  used  for  characters  specified  numerically,  for  example         also  be  used  for  characters  specified  numerically,  for   example
4794         [\000-\037]. Ranges can include any characters that are valid  for  the         [\000-\037].  Ranges  can include any characters that are valid for the
4795         current mode.         current mode.
4796    
4797         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
4798         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
4799         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in a non-UTF mode, if         to [][\\^_`wxyzabc], matched caselessly, and  in  a  non-UTF  mode,  if
4800         character tables for a French locale are in  use,  [\xc8-\xcb]  matches         character  tables  for  a French locale are in use, [\xc8-\xcb] matches
4801         accented  E  characters  in both cases. In UTF modes, PCRE supports the         accented E characters in both cases. In UTF modes,  PCRE  supports  the
4802         concept of case for characters with values greater than 127  only  when         concept  of  case for characters with values greater than 127 only when
4803         it is compiled with Unicode property support.         it is compiled with Unicode property support.
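
The [W-c] equivalence stated above can be verified by enumeration. The sketch uses Python's re module as an assumption, not PCRE, and restricts itself to ASCII subjects, where the two engines agree on case.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Every ASCII character accepted by the caseless range.
matched = {chr(n) for n in range(128) if caseless_range.fullmatch(chr(n))}

# The set given in the text, [][\\^_`wxyzabc], expanded to both cases.
expected = set("][\\^_`") | set("wxyzabc") | set("WXYZABC")
```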
4804    
4805         The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,         The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
4806         \w, and \W may appear in a character class, and add the characters that         \w, and \W may appear in a character class, and add the characters that
4807         they  match to the class. For example, [\dABCDEF] matches any hexadeci-         they match to the class. For example, [\dABCDEF] matches any  hexadeci-
4808         mal digit. In UTF modes, the PCRE_UCP option affects  the  meanings  of         mal  digit.  In  UTF modes, the PCRE_UCP option affects the meanings of
4809         \d,  \s,  \w  and  their upper case partners, just as it does when they         \d, \s, \w and their upper case partners, just as  it  does  when  they
4810         appear outside a character class, as described in the section  entitled         appear  outside a character class, as described in the section entitled
4811         "Generic character types" above. The escape sequence \b has a different         "Generic character types" above. The escape sequence \b has a different
4812         meaning inside a character class; it matches the  backspace  character.         meaning  inside  a character class; it matches the backspace character.
4813         The  sequences  \B,  \N,  \R, and \X are not special inside a character         The sequences \B, \N, \R, and \X are not  special  inside  a  character
4814         class. Like any other unrecognized escape sequences, they  are  treated         class.  Like  any other unrecognized escape sequences, they are treated
4815         as  the literal characters "B", "N", "R", and "X" by default, but cause         as the literal characters "B", "N", "R", and "X" by default, but  cause
4816         an error if the PCRE_EXTRA option is set.         an error if the PCRE_EXTRA option is set.
4817    
4818         A circumflex can conveniently be used with  the  upper  case  character         A  circumflex  can  conveniently  be used with the upper case character
4819         types  to specify a more restricted set of characters than the matching         types to specify a more restricted set of characters than the  matching
4820         lower case type.  For example, the class [^\W_] matches any  letter  or         lower  case  type.  For example, the class [^\W_] matches any letter or
4821         digit, but not underscore, whereas [\w] includes underscore. A positive         digit, but not underscore, whereas [\w] includes underscore. A positive
4822         character class should be read as "something OR something OR ..." and a         character class should be read as "something OR something OR ..." and a
4823         negative class as "NOT something AND NOT something AND NOT ...".         negative class as "NOT something AND NOT something AND NOT ...".
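
A brief sketch of the [^\W_] idiom, using Python's re module as an assumption. The re.ASCII flag keeps \w restricted to ASCII letters, digits, and underscore, which is closer to PCRE's default behaviour.

```python
import re

# "NOT non-word AND NOT underscore": letters and digits only.
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# Plain \w inside a class still includes underscore.
word_char = re.compile(r"[\w]", re.ASCII)

accepted = [ch for ch in "a7_-" if letter_or_digit.fullmatch(ch)]
word_accepted = [ch for ch in "a7_-" if word_char.fullmatch(ch)]
```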
4824    
4825         The  only  metacharacters  that are recognized in character classes are         The only metacharacters that are recognized in  character  classes  are
4826         backslash, hyphen (only where it can be  interpreted  as  specifying  a         backslash,  hyphen  (only  where  it can be interpreted as specifying a
4827         range),  circumflex  (only  at the start), opening square bracket (only         range), circumflex (only at the start), opening  square  bracket  (only
4828         when it can be interpreted as introducing a POSIX class name - see  the         when  it can be interpreted as introducing a POSIX class name - see the
4829         next  section),  and  the  terminating closing square bracket. However,         next section), and the terminating  closing  square  bracket.  However,
4830         escaping other non-alphanumeric characters does no harm.         escaping other non-alphanumeric characters does no harm.
4831    
4832    
4833  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
4834    
4835         Perl supports the POSIX notation for character classes. This uses names         Perl supports the POSIX notation for character classes. This uses names
4836         enclosed  by  [: and :] within the enclosing square brackets. PCRE also         enclosed by [: and :] within the enclosing square brackets.  PCRE  also
4837         supports this notation. For example,         supports this notation. For example,
4838    
4839           [01[:alpha:]%]           [01[:alpha:]%]
# Line 4853  POSIX CHARACTER CLASSES Line 4856  POSIX CHARACTER CLASSES
4856           word     "word" characters (same as \w)           word     "word" characters (same as \w)
4857           xdigit   hexadecimal digits           xdigit   hexadecimal digits
4858    
4859         The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),         The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
4860         and space (32). Notice that this list includes the VT  character  (code         and  space  (32). Notice that this list includes the VT character (code
4861         11). This makes "space" different to \s, which does not include VT (for         11). This makes "space" different to \s, which does not include VT (for
4862         Perl compatibility).         Perl compatibility).
4863    
4864         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension         The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
4865         from  Perl  5.8. Another Perl extension is negation, which is indicated         from Perl 5.8. Another Perl extension is negation, which  is  indicated
4866         by a ^ character after the colon. For example,         by a ^ character after the colon. For example,
4867    
4868           [12[:^digit:]]           [12[:^digit:]]
4869    
4870         matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the         matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
4871         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
4872         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
4873    
4874         By default, in UTF modes, characters with values greater  than  127  do         By  default,  in  UTF modes, characters with values greater than 127 do
4875         not  match any of the POSIX character classes. However, if the PCRE_UCP         not match any of the POSIX character classes. However, if the  PCRE_UCP
4876         option is passed to pcre_compile(), some of the classes are changed  so         option  is passed to pcre_compile(), some of the classes are changed so
4877         that Unicode character properties are used. This is achieved by replac-         that Unicode character properties are used. This is achieved by replac-
4878         ing the POSIX classes by other sequences, as follows:         ing the POSIX classes by other sequences, as follows:
4879    
# Line 4883  POSIX CHARACTER CLASSES Line 4886  POSIX CHARACTER CLASSES
4886           [:upper:]  becomes  \p{Lu}           [:upper:]  becomes  \p{Lu}
4887           [:word:]   becomes  \p{Xwd}           [:word:]   becomes  \p{Xwd}
4888    
4889         Negated versions, such as [:^alpha:] use \P instead of  \p.  The  other         Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
4890         POSIX classes are unchanged, and match only characters with code points         POSIX classes are unchanged, and match only characters with code points
4891         less than 128.         less than 128.
4892    
4893    
4894  VERTICAL BAR  VERTICAL BAR
4895    
4896         Vertical bar characters are used to separate alternative patterns.  For         Vertical  bar characters are used to separate alternative patterns. For
4897         example, the pattern         example, the pattern
4898    
4899           gilbert|sullivan           gilbert|sullivan
4900    
4901         matches  either "gilbert" or "sullivan". Any number of alternatives may         matches either "gilbert" or "sullivan". Any number of alternatives  may
4902         appear, and an empty  alternative  is  permitted  (matching  the  empty         appear,  and  an  empty  alternative  is  permitted (matching the empty
4903         string). The matching process tries each alternative in turn, from left         string). The matching process tries each alternative in turn, from left
4904         to right, and the first one that succeeds is used. If the  alternatives         to  right, and the first one that succeeds is used. If the alternatives
4905         are  within a subpattern (defined below), "succeeds" means matching the         are within a subpattern (defined below), "succeeds" means matching  the
4906         rest of the main pattern as well as the alternative in the subpattern.         rest of the main pattern as well as the alternative in the subpattern.
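
The left-to-right rule can be seen in a small sketch. Python's re module is used as an assumption. It is not PCRE, but it applies the same ordered-alternative semantics here.

```python
import re

either = re.compile(r"gilbert|sullivan")

# Alternatives are tried in order, so the shorter first branch wins
# even though a later branch could match more of the subject.
first_wins = re.match(r"sul|sullivan", "sullivan").group(0)

# Inside a subpattern, an alternative only "succeeds" if the rest of the
# pattern matches too, so "$" forces the second branch to be used.
with_rest = re.match(r"(sul|sullivan)$", "sullivan").group(1)

# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|", "")
```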
4907    
4908    
4909  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
4910    
4911         The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and         The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
4912         PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from         PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
4913         within the pattern by  a  sequence  of  Perl  option  letters  enclosed         within  the  pattern  by  a  sequence  of  Perl option letters enclosed
4914         between "(?" and ")".  The option letters are         between "(?" and ")".  The option letters are
4915    
4916           i  for PCRE_CASELESS           i  for PCRE_CASELESS
# Line 4917  INTERNAL OPTION SETTING Line 4920  INTERNAL OPTION SETTING
4920    
4921         For example, (?im) sets caseless, multiline matching. It is also possi-         For example, (?im) sets caseless, multiline matching. It is also possi-
4922         ble to unset these options by preceding the letter with a hyphen, and a         ble to unset these options by preceding the letter with a hyphen, and a
4923         combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-         combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
4924         LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,         LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
4925         is  also  permitted.  If  a  letter  appears  both before and after the         is also permitted. If a  letter  appears  both  before  and  after  the
4926         hyphen, the option is unset.         hyphen, the option is unset.
4927    
4928         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
4929         can  be changed in the same way as the Perl-compatible options by using         can be changed in the same way as the Perl-compatible options by  using
4930         the characters J, U and X respectively.         the characters J, U and X respectively.
4931    
4932         When one of these option changes occurs at  top  level  (that  is,  not         When  one  of  these  option  changes occurs at top level (that is, not
4933         inside  subpattern parentheses), the change applies to the remainder of         inside subpattern parentheses), the change applies to the remainder  of
4934         the pattern that follows. If the change is placed right at the start of         the pattern that follows. If the change is placed right at the start of
4935         a pattern, PCRE extracts it into the global options (and it will there-         a pattern, PCRE extracts it into the global options (and it will there-
4936         fore show up in data extracted by the pcre_fullinfo() function).         fore show up in data extracted by the pcre_fullinfo() function).
4937    
4938         An option change within a subpattern (see below for  a  description  of         An  option  change  within a subpattern (see below for a description of
4939         subpatterns)  affects only that part of the subpattern that follows it,         subpatterns) affects only that part of the subpattern that follows  it,
4940         so         so
4941    
4942           (a(?i)b)c           (a(?i)b)c
4943    
4944         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
4945         used).   By  this means, options can be made to have different settings         used).  By this means, options can be made to have  different  settings
4946         in different parts of the pattern. Any changes made in one  alternative         in  different parts of the pattern. Any changes made in one alternative
4947         do  carry  on  into subsequent branches within the same subpattern. For         do carry on into subsequent branches within the  same  subpattern.  For
4948         example,         example,
4949    
4950           (a(?i)b|c)           (a(?i)b|c)
4951    
4952         matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the         matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
4953         first  branch  is  abandoned before the option setting. This is because         first branch is abandoned before the option setting.  This  is  because
4954         the effects of option settings happen at compile time. There  would  be         the  effects  of option settings happen at compile time. There would be
4955         some very weird behaviour otherwise.         some very weird behaviour otherwise.
4956    
4957         Note:  There  are  other  PCRE-specific  options that can be set by the         Note: There are other PCRE-specific options that  can  be  set  by  the
4958         application when the compiling or matching  functions  are  called.  In         application  when  the  compiling  or matching functions are called. In
4959         some  cases  the  pattern can contain special leading sequences such as         some cases the pattern can contain special leading  sequences  such  as
4960         (*CRLF) to override what the application  has  set  or  what  has  been         (*CRLF)  to  override  what  the  application  has set or what has been
4961         defaulted.   Details   are  given  in  the  section  entitled  "Newline         defaulted.  Details  are  given  in  the  section   entitled   "Newline
4962         sequences" above. There are also  the  (*UTF8),  (*UTF16),  and  (*UCP)         sequences"  above.  There  are  also  the (*UTF8), (*UTF16), and (*UCP)
4963         leading  sequences  that  can  be  used to set UTF and Unicode property         leading sequences that can be used to  set  UTF  and  Unicode  property
4964         modes; they are equivalent to setting the  PCRE_UTF8,  PCRE_UTF16,  and         modes;  they  are  equivalent to setting the PCRE_UTF8, PCRE_UTF16, and
4965         the PCRE_UCP options, respectively.         the PCRE_UCP options, respectively.
4966    
4967    
# Line 4971  SUBPATTERNS Line 4974  SUBPATTERNS
4974    
4975           cat(aract|erpillar|)           cat(aract|erpillar|)
4976    
4977         matches "cataract", "caterpillar", or "cat". Without  the  parentheses,         matches  "cataract",  "caterpillar", or "cat". Without the parentheses,
4978         it would match "cataract", "erpillar" or an empty string.         it would match "cataract", "erpillar" or an empty string.
4979    
4980         2.  It  sets  up  the  subpattern as a capturing subpattern. This means         2. It sets up the subpattern as  a  capturing  subpattern.  This  means
4981         that, when the whole pattern  matches,  that  portion  of  the  subject         that,  when  the  whole  pattern  matches,  that portion of the subject
4982         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
4983         ovector argument of the matching function. (This applies  only  to  the         ovector  argument  of  the matching function. (This applies only to the
4984         traditional  matching functions; the DFA matching functions do not sup-         traditional matching functions; the DFA matching functions do not  sup-
4985         port capturing.)         port capturing.)
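
Both roles of parentheses can be sketched as follows. Python's re module is assumed as a stand-in for PCRE, and its groups() call plays the role of reading the ovector.

```python
import re

# 1. Parentheses localize the alternatives to the part after "cat".
grouped = re.compile(r"cat(aract|erpillar|)")
# Without them, the alternation splits the whole pattern.
ungrouped = re.compile(r"cataract|erpillar|")

grouped_hits = [s for s in ("cataract", "caterpillar", "cat") if grouped.fullmatch(s)]
ungrouped_hits = [s for s in ("cataract", "erpillar", "", "caterpillar") if ungrouped.fullmatch(s)]

# 2. The parenthesized part is captured and handed back to the caller.
captured = grouped.fullmatch("caterpillar").group(1)
```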
4986    
4987         Opening parentheses are counted from left to right (starting from 1) to         Opening parentheses are counted from left to right (starting from 1) to
4988         obtain  numbers  for  the  capturing  subpatterns.  For example, if the         obtain numbers for the  capturing  subpatterns.  For  example,  if  the
4989         string "the red king" is matched against the pattern         string "the red king" is matched against the pattern
4990    
4991           the ((red|white) (king|queen))           the ((red|white) (king|queen))
# Line 4990  SUBPATTERNS Line 4993  SUBPATTERNS
4993         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
4994         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
4995    
4996         The  fact  that  plain  parentheses  fulfil two functions is not always         The fact that plain parentheses fulfil  two  functions  is  not  always
4997         helpful.  There are often times when a grouping subpattern is  required         helpful.   There are often times when a grouping subpattern is required
4998         without  a capturing requirement. If an opening parenthesis is followed         without a capturing requirement. If an opening parenthesis is  followed
4999         by a question mark and a colon, the subpattern does not do any  captur-         by  a question mark and a colon, the subpattern does not do any captur-
5000         ing,  and  is  not  counted when computing the number of any subsequent         ing, and is not counted when computing the  number  of  any  subsequent
5001         capturing subpatterns. For example, if the string "the white queen"  is         capturing  subpatterns. For example, if the string "the white queen" is
5002         matched against the pattern         matched against the pattern
5003    
5004           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
# Line 5003  SUBPATTERNS Line 5006  SUBPATTERNS
5006         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
5007         1 and 2. The maximum number of capturing subpatterns is 65535.         1 and 2. The maximum number of capturing subpatterns is 65535.
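
The numbering of the two patterns above can be reproduced in a short sketch. Python's re module is assumed here in place of PCRE, since the (?: syntax and the left-to-right numbering are the same in both.

```python
import re

# Capturing parentheses are numbered by their opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
capturing_groups = capturing.groups()

# (?: groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
non_capturing_groups = non_capturing.groups()
```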
5008    
5009         As a convenient shorthand, if any option settings are required  at  the         As  a  convenient shorthand, if any option settings are required at the
5010         start  of  a  non-capturing  subpattern,  the option letters may appear         start of a non-capturing subpattern,  the  option  letters  may  appear
5011         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
5012    
5013           (?i:saturday|sunday)           (?i:saturday|sunday)
5014           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
5015    
5016         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
5017         tried  from  left  to right, and options are not reset until the end of         tried from left to right, and options are not reset until  the  end  of
5018         the subpattern is reached, an option setting in one branch does  affect         the  subpattern is reached, an option setting in one branch does affect
5019         subsequent  branches,  so  the above patterns match "SUNDAY" as well as         subsequent branches, so the above patterns match "SUNDAY"  as  well  as
5020         "Saturday".         "Saturday".
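
The shorthand form can be tried with Python's re module, used here only as an approximation of PCRE (an assumption of this sketch). Only the (?i:...) form is shown, because recent Python versions reject an option setting that is not at the start of the pattern, so the second PCRE form above cannot be reproduced this way.

```python
import re

# The "i" option applies to every branch inside the group.
scoped = re.compile(r"(?i:saturday|sunday)")
matches_sunday = scoped.fullmatch("SUNDAY") is not None
matches_saturday = scoped.fullmatch("Saturday") is not None

# The option ends with the group: "urday" here is still case-sensitive.
outside_is_caseless = re.fullmatch(r"(?i:sat)urday", "SATURDAY") is not None
```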
5021    
5022    
5023  DUPLICATE SUBPATTERN NUMBERS  DUPLICATE SUBPATTERN NUMBERS
5024    
5025         Perl 5.10 introduced a feature whereby each alternative in a subpattern         Perl 5.10 introduced a feature whereby each alternative in a subpattern
5026         uses  the same numbers for its capturing parentheses. Such a subpattern         uses the same numbers for its capturing parentheses. Such a  subpattern
5027         starts with (?| and is itself a non-capturing subpattern. For  example,         starts  with (?| and is itself a non-capturing subpattern. For example,
5028         consider this pattern:         consider this pattern:
5029    
5030           (?|(Sat)ur|(Sun))day           (?|(Sat)ur|(Sun))day
5031    
5032         Because  the two alternatives are inside a (?| group, both sets of cap-         Because the two alternatives are inside a (?| group, both sets of  cap-
5033         turing parentheses are numbered one. Thus, when  the  pattern  matches,         turing  parentheses  are  numbered one. Thus, when the pattern matches,
5034         you  can  look  at captured substring number one, whichever alternative         you can look at captured substring number  one,  whichever  alternative
5035         matched. This construct is useful when you want to  capture  part,  but         matched.  This  construct  is useful when you want to capture part, but
5036         not all, of one of a number of alternatives. Inside a (?| group, paren-         not all, of one of a number of alternatives. Inside a (?| group, paren-
5037         theses are numbered as usual, but the number is reset at the  start  of         theses  are  numbered as usual, but the number is reset at the start of
5038         each  branch.  The numbers of any capturing parentheses that follow the         each branch. The numbers of any capturing parentheses that  follow  the
5039         subpattern start after the highest number used in any branch. The  fol-         subpattern  start after the highest number used in any branch. The fol-
5040         lowing example is taken from the Perl documentation. The numbers under-         lowing example is taken from the Perl documentation. The numbers under-
5041         neath show in which buffer the captured content will be stored.         neath show in which buffer the captured content will be stored.
5042    
# Line 5041  DUPLICATE SUBPATTERN NUMBERS Line 5044  DUPLICATE SUBPATTERN NUMBERS
5044           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
5045           # 1            2         2  3        2     3     4           # 1            2         2  3        2     3     4
5046    
5047         A back reference to a numbered subpattern uses the  most  recent  value         A  back  reference  to a numbered subpattern uses the most recent value
5048         that  is  set  for that number by any subpattern. The following pattern         that is set for that number by any subpattern.  The  following  pattern
5049         matches "abcabc" or "defdef":         matches "abcabc" or "defdef":
5050    
5051           /(?|(abc)|(def))\1/           /(?|(abc)|(def))\1/
5052    
5053         In contrast, a subroutine call to a numbered subpattern  always  refers         In  contrast,  a subroutine call to a numbered subpattern always refers
5054         to  the  first  one in the pattern with the given number. The following         to the first one in the pattern with the given  number.  The  following
5055         pattern matches "abcabc" or "defabc":         pattern matches "abcabc" or "defabc":
5056    
5057           /(?|(abc)|(def))(?1)/           /(?|(abc)|(def))(?1)/
5058    
5059         If a condition test for a subpattern's having matched refers to a  non-         If  a condition test for a subpattern's having matched refers to a non-
5060         unique  number, the test is true if any of the subpatterns of that num-         unique number, the test is true if any of the subpatterns of that  num-
5061         ber have matched.         ber have matched.
5062    
5063         An alternative approach to using this "branch reset" feature is to  use         An  alternative approach to using this "branch reset" feature is to use
5064         duplicate named subpatterns, as described in the next section.         duplicate named subpatterns, as described in the next section.
5065    
5066    
5067  NAMED SUBPATTERNS  NAMED SUBPATTERNS
5068    
5069         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying capturing parentheses by number is simple, but  it  can  be
5070         very hard to keep track of the numbers in complicated  regular  expres-         very  hard  to keep track of the numbers in complicated regular expres-
5071         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions. Furthermore, if an  expression  is  modified,  the  numbers  may
5072         change. To help with this difficulty, PCRE supports the naming of  sub-         change.  To help with this difficulty, PCRE supports the naming of sub-
5073         patterns. This feature was not added to Perl until release 5.10. Python         patterns. This feature was not added to Perl until release 5.10. Python
5074         had the feature earlier, and PCRE introduced it at release  4.0,  using         had  the  feature earlier, and PCRE introduced it at release 4.0, using
5075         the  Python syntax. PCRE now supports both the Perl and the Python syn-         the Python syntax. PCRE now supports both the Perl and the Python  syn-
5076         tax. Perl allows identically numbered  subpatterns  to  have  different         tax.  Perl  allows  identically  numbered subpatterns to have different
5077         names, but PCRE does not.         names, but PCRE does not.
5078    
5079         In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
5080         or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
5081         to  capturing parentheses from other parts of the pattern, such as back         to capturing parentheses from other parts of the pattern, such as  back
5082         references, recursion, and conditions, can be made by name as  well  as         references,  recursion,  and conditions, can be made by name as well as
5083         by number.         by number.
5084    
5085         Names  consist  of  up  to  32 alphanumeric characters and underscores.         Names consist of up to  32  alphanumeric  characters  and  underscores.
5086         Named capturing parentheses are still  allocated  numbers  as  well  as         Named  capturing  parentheses  are  still  allocated numbers as well as
5087         names,  exactly as if the names were not present. The PCRE API provides         names, exactly as if the names were not present. The PCRE API  provides
5088         function calls for extracting the name-to-number translation table from         function calls for extracting the name-to-number translation table from
5089         a compiled pattern. There is also a convenience function for extracting         a compiled pattern. There is also a convenience function for extracting
5090         a captured substring by name.         a captured substring by name.
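
The relationship between names and numbers can be sketched with Python's re module, whose (?P<name>...) syntax PCRE adopted. The groupindex attribute used below is Python's own name-to-number mapping; it is shown as an analogue of the PCRE translation table, not as the PCRE API, and the date pattern is a hypothetical example.

```python
import re

# Named groups still receive ordinary numbers, in order of their
# opening parentheses.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
name_to_number = dict(date.groupindex)

m = date.match("2012-04")
by_name = m.group("month")
by_number = m.group(name_to_number["month"])
```

Looking a substring up by name and by its translated number gives the same result, which is the property the convenience function relies on.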
5091    
5092         By default, a name must be unique within a pattern, but it is  possible         By  default, a name must be unique within a pattern, but it is possible
5093         to relax this constraint by setting the PCRE_DUPNAMES option at compile         to relax this constraint by setting the PCRE_DUPNAMES option at compile
5094         time. (Duplicate names are also always permitted for  subpatterns  with         time.  (Duplicate  names are also always permitted for subpatterns with
5095         the  same  number, set up as described in the previous section.) Dupli-         the same number, set up as described in the previous  section.)  Dupli-
5096         cate names can be useful for patterns where only one  instance  of  the         cate  names  can  be useful for patterns where only one instance of the
5097         named  parentheses  can  match. Suppose you want to match the name of a         named parentheses can match. Suppose you want to match the  name  of  a
5098         weekday, either as a 3-letter abbreviation or as the full name, and  in         weekday,  either as a 3-letter abbreviation or as the full name, and in
5099         both cases you want to extract the abbreviation. This pattern (ignoring         both cases you want to extract the abbreviation. This pattern (ignoring
5100         the line breaks) does the job:         the line breaks) does the job:
5101    
# Line 5102  NAMED SUBPATTERNS Line 5105  NAMED SUBPATTERNS
5105           (?<DN>Thu)(?:rsday)?|           (?<DN>Thu)(?:rsday)?|
5106           (?<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
5107    
5108         There are five capturing substrings, but only one is ever set  after  a         There  are  five capturing substrings, but only one is ever set after a
5109         match.  (An alternative way of solving this problem is to use a "branch         match.  (An alternative way of solving this problem is to use a "branch
5110         reset" subpattern, as described in the previous section.)         reset" subpattern, as described in the previous section.)
5111    
5112         The convenience function for extracting the data by  name  returns  the         The  convenience  function  for extracting the data by name returns the
5113         substring  for  the first (and in this example, the only) subpattern of         substring for the first (and in this example, the only)  subpattern  of
5114         that name that matched. This saves searching  to  find  which  numbered         that  name  that  matched.  This saves searching to find which numbered
5115         subpattern it was.         subpattern it was.
5116    
5117         If  you  make  a  back  reference to a non-unique named subpattern from         If you make a back reference to  a  non-unique  named  subpattern  from
5118         elsewhere in the pattern, the one that corresponds to the first  occur-         elsewhere  in the pattern, the one that corresponds to the first occur-
5119         rence of the name is used. In the absence of duplicate numbers (see the         rence of the name is used. In the absence of duplicate numbers (see the
5120         previous section) this is the one with the lowest number. If you use  a         previous  section) this is the one with the lowest number. If you use a
5121         named  reference  in a condition test (see the section about conditions         named reference in a condition test (see the section  about  conditions
5122         below), either to check whether a subpattern has matched, or  to  check         below),  either  to check whether a subpattern has matched, or to check
5123         for  recursion,  all  subpatterns with the same name are tested. If the         for recursion, all subpatterns with the same name are  tested.  If  the
5124         condition is true for any one of them, the overall condition  is  true.         condition  is  true for any one of them, the overall condition is true.
5125         This is the same behaviour as testing by number. For further details of         This is the same behaviour as testing by number. For further details of
5126         the interfaces for handling named subpatterns, see the pcreapi documen-         the interfaces for handling named subpatterns, see the pcreapi documen-
5127         tation.         tation.
5128    
5129         Warning: You cannot use different names to distinguish between two sub-         Warning: You cannot use different names to distinguish between two sub-
5130         patterns with the same number because PCRE uses only the  numbers  when         patterns  with  the same number because PCRE uses only the numbers when
5131         matching. For this reason, an error is given at compile time if differ-         matching. For this reason, an error is given at compile time if differ-
5132         ent names are given to subpatterns with the same number.  However,  you         ent  names  are given to subpatterns with the same number. However, you
5133         can  give  the same name to subpatterns with the same number, even when         can give the same name to subpatterns with the same number,  even  when
5134         PCRE_DUPNAMES is not set.         PCRE_DUPNAMES is not set.
5135    
5136    
5137  REPETITION  REPETITION
5138    
5139         Repetition is specified by quantifiers, which can  follow  any  of  the         Repetition  is  specified  by  quantifiers, which can follow any of the
5140         following items:         following items:
5141    
5142           a literal data character           a literal data character
# Line 5147  REPETITION Line 5150  REPETITION
5150           a parenthesized subpattern (including assertions)           a parenthesized subpattern (including assertions)
5151           a subroutine call to a subpattern (recursive or otherwise)           a subroutine call to a subpattern (recursive or otherwise)
5152    
5153         The  general repetition quantifier specifies a minimum and maximum num-         The general repetition quantifier specifies a minimum and maximum  num-
5154         ber of permitted matches, by giving the two numbers in  curly  brackets         ber  of  permitted matches, by giving the two numbers in curly brackets
5155         (braces),  separated  by  a comma. The numbers must be less than 65536,         (braces), separated by a comma. The numbers must be  less  than  65536,
5156         and the first must be less than or equal to the second. For example:         and the first must be less than or equal to the second. For example:
5157    
5158           z{2,4}           z{2,4}
5159    
5160         matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a         matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
5161         special  character.  If  the second number is omitted, but the comma is         special character. If the second number is omitted, but  the  comma  is
5162         present, there is no upper limit; if the second number  and  the  comma         present,  there  is  no upper limit; if the second number and the comma
5163         are  both omitted, the quantifier specifies an exact number of required         are both omitted, the quantifier specifies an exact number of  required
5164         matches. Thus         matches. Thus
5165    
5166           [aeiou]{3,}           [aeiou]{3,}
# Line 5166  REPETITION Line 5169  REPETITION
5169    
5170           \d{8}           \d{8}
5171    
5172         matches exactly 8 digits. An opening curly bracket that  appears  in  a         matches  exactly  8  digits. An opening curly bracket that appears in a
5173         position  where a quantifier is not allowed, or one that does not match         position where a quantifier is not allowed, or one that does not  match
5174         the syntax of a quantifier, is taken as a literal character. For  exam-         the  syntax of a quantifier, is taken as a literal character. For exam-
5175         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
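
The three quantifier forms can be checked with a short sketch in Python's re module, used here as a stand-in for PCRE (an assumption of this example). The {,6} case is deliberately left out: Python treats it as {0,6}, unlike the PCRE behaviour described above. The helper name is hypothetical.

```python
import re

def matches(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repetitions.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
           if matches(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
open_ended = matches(r"[aeiou]{3,}", "aeiouaeiou")
too_few = matches(r"[aeiou]{3,}", "ae")

# {8}: exactly eight.
exact = matches(r"\d{8}", "20120414")
inexact = matches(r"\d{8}", "2012041")
```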
5176    
5177         In UTF modes, quantifiers apply to characters rather than to individual         In UTF modes, quantifiers apply to characters rather than to individual
5178         data units. Thus, for example, \x{100}{2} matches two characters,  each         data  units. Thus, for example, \x{100}{2} matches two characters, each
5179         of which is represented by a two-byte sequence in a UTF-8 string. Simi-         of which is represented by a two-byte sequence in a UTF-8 string. Simi-
5180         larly, \X{3} matches three Unicode extended sequences,  each  of  which         larly,  \X{3}  matches  three Unicode extended sequences, each of which
5181         may be several data units long (and they may be of different lengths).         may be several data units long (and they may be of different lengths).
5182    
5183         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
5184         the previous item and the quantifier were not present. This may be use-         the previous item and the quantifier were not present. This may be use-
5185         ful  for  subpatterns that are referenced as subroutines from elsewhere         ful for subpatterns that are referenced as subroutines  from  elsewhere
5186         in the pattern (but see also the section entitled "Defining subpatterns         in the pattern (but see also the section entitled "Defining subpatterns
5187         for  use  by  reference only" below). Items other than subpatterns that         for use by reference only" below). Items other  than  subpatterns  that
5188         have a {0} quantifier are omitted from the compiled pattern.         have a {0} quantifier are omitted from the compiled pattern.
5189    
5190         For convenience, the three most common quantifiers have  single-charac-         For  convenience, the three most common quantifiers have single-charac-
5191         ter abbreviations:         ter abbreviations:
5192    
5193           *    is equivalent to {0,}           *    is equivalent to {0,}
5194           +    is equivalent to {1,}           +    is equivalent to {1,}
5195           ?    is equivalent to {0,1}           ?    is equivalent to {0,1}
5196    
5197         It  is  possible  to construct infinite loops by following a subpattern         It is possible to construct infinite loops by  following  a  subpattern
5198         that can match no characters with a quantifier that has no upper limit,         that can match no characters with a quantifier that has no upper limit,
5199         for example:         for example:
5200    
5201           (a?)*           (a?)*
5202    
5203         Earlier versions of Perl and PCRE used to give an error at compile time         Earlier versions of Perl and PCRE used to give an error at compile time
5204         for such patterns. However, because there are cases where this  can  be         for  such  patterns. However, because there are cases where this can be
5205         useful,  such  patterns  are now accepted, but if any repetition of the         useful, such patterns are now accepted, but if any  repetition  of  the
5206         subpattern does in fact match no characters, the loop is forcibly  bro-         subpattern  does in fact match no characters, the loop is forcibly bro-
5207         ken.         ken.
5208    
5209         By  default,  the quantifiers are "greedy", that is, they match as much         By default, the quantifiers are "greedy", that is, they match  as  much
5210         as possible (up to the maximum  number  of  permitted  times),  without         as  possible  (up  to  the  maximum number of permitted times), without
5211         causing  the  rest of the pattern to fail. The classic example of where         causing the rest of the pattern to fail. The classic example  of  where
5212         this gives problems is in trying to match comments in C programs. These         this gives problems is in trying to match comments in C programs. These
5213         appear  between  /*  and  */ and within the comment, individual * and /         appear between /* and */ and within the comment,  individual  *  and  /
5214         characters may appear. An attempt to match C comments by  applying  the         characters  may  appear. An attempt to match C comments by applying the
5215         pattern         pattern
5216    
5217           /\*.*\*/           /\*.*\*/
# Line 5217  REPETITION Line 5220  REPETITION
5220    
5221           /* first comment */  not comment  /* second comment */           /* first comment */  not comment  /* second comment */
5222    
5223         fails,  because it matches the entire string owing to the greediness of         fails, because it matches the entire string owing to the greediness  of
5224         the .*  item.         the .*  item.
5225    
5226         However, if a quantifier is followed by a question mark, it  ceases  to         However,  if  a quantifier is followed by a question mark, it ceases to
5227         be greedy, and instead matches the minimum number of times possible, so         be greedy, and instead matches the minimum number of times possible, so
5228         the pattern         the pattern
5229    
5230           /\*.*?\*/           /\*.*?\*/
5231    
5232         does the right thing with the C comments. The meaning  of  the  various         does  the  right  thing with the C comments. The meaning of the various
5233         quantifiers  is  not  otherwise  changed,  just the preferred number of         quantifiers is not otherwise changed,  just  the  preferred  number  of
5234         matches.  Do not confuse this use of question mark with its  use  as  a         matches.   Do  not  confuse this use of question mark with its use as a
5235         quantifier  in its own right. Because it has two uses, it can sometimes         quantifier in its own right. Because it has two uses, it can  sometimes
5236         appear doubled, as in         appear doubled, as in
5237    
5238           \d??\d           \d??\d
# Line 5237  REPETITION Line 5240  REPETITION
5240         which matches one digit by preference, but can match two if that is the         which matches one digit by preference, but can match two if that is the
5241         only way the rest of the pattern matches.         only way the rest of the pattern matches.
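
The difference between greedy and minimizing quantifiers can be seen in a small sketch. Python's re module is used as a stand-in for PCRE (an assumption of this example); the patterns and the subject string are the ones from the text above.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two when the rest of the
# match (here, reaching the end of the subject) requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```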
5242    
5243         If  the PCRE_UNGREEDY option is set (an option that is not available in         If the PCRE_UNGREEDY option is set (an option that is not available  in
5244         Perl), the quantifiers are not greedy by default, but  individual  ones         Perl),  the  quantifiers are not greedy by default, but individual ones
5245         can  be  made  greedy  by following them with a question mark. In other         can be made greedy by following them with a  question  mark.  In  other
5246         words, it inverts the default behaviour.         words, it inverts the default behaviour.
5247    
5248         When a parenthesized subpattern is quantified  with  a  minimum  repeat         When  a  parenthesized  subpattern  is quantified with a minimum repeat
5249         count  that is greater than 1 or with a limited maximum, more memory is         count that is greater than 1 or with a limited maximum, more memory  is
5250         required for the compiled pattern, in proportion to  the  size  of  the         required  for  the  compiled  pattern, in proportion to the size of the
5251         minimum or maximum.         minimum or maximum.
5252    
5253         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
5254         alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,         alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
5255         the  pattern  is  implicitly anchored, because whatever follows will be         the pattern is implicitly anchored, because whatever  follows  will  be
5256         tried against every character position in the subject string, so  there         tried  against every character position in the subject string, so there
5257         is  no  point  in  retrying the overall match at any position after the         is no point in retrying the overall match at  any  position  after  the
5258         first. PCRE normally treats such a pattern as though it  were  preceded         first.  PCRE  normally treats such a pattern as though it were preceded
5259         by \A.         by \A.
5260    
5261         In  cases  where  it  is known that the subject string contains no new-         In cases where it is known that the subject  string  contains  no  new-
5262         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-         lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
5263         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
5264    
5265         However,  there is one situation where the optimization cannot be used.         However, there is one situation where the optimization cannot be  used.
5266         When .*  is inside capturing parentheses that are the subject of a back         When .*  is inside capturing parentheses that are the subject of a back
5267         reference elsewhere in the pattern, a match at the start may fail where         reference elsewhere in the pattern, a match at the start may fail where
5268         a later one succeeds. Consider, for example:         a later one succeeds. Consider, for example:
5269    
5270           (.*)abc\1           (.*)abc\1
5271    
5272         If the subject is "xyz123abc123" the match point is the fourth  charac-         If  the subject is "xyz123abc123" the match point is the fourth charac-
5273         ter. For this reason, such a pattern is not implicitly anchored.         ter. For this reason, such a pattern is not implicitly anchored.
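
The back reference case can be checked with Python's re module, used here only as an approximation of PCRE (an assumption of this sketch). It shows why the pattern cannot be treated as anchored: an attempt at the start of the subject fails, while a later starting position succeeds.

```python
import re

pattern = re.compile(r"(.*)abc\1")
subject = "xyz123abc123"

# Anchored at the start, no split of the subject satisfies \1.
anchored = pattern.match(subject)

# An unanchored search succeeds from offset 3, the fourth character.
found = pattern.search(subject)
start = found.start()
captured = found.group(1)
```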
5274    
5275         When a capturing subpattern is repeated, the value captured is the sub-         When a capturing subpattern is repeated, the value captured is the sub-
# Line 5275  REPETITION Line 5278  REPETITION
5278           (tweedle[dume]{3}\s*)+           (tweedle[dume]{3}\s*)+
5279    
5280         has matched "tweedledum tweedledee" the value of the captured substring         has matched "tweedledum tweedledee" the value of the captured substring
5281         is  "tweedledee".  However,  if there are nested capturing subpatterns,         is "tweedledee". However, if there are  nested  capturing  subpatterns,
5282         the corresponding captured values may have been set in previous  itera-         the  corresponding captured values may have been set in previous itera-
5283         tions. For example, after         tions. For example, after
5284    
5285           /(a|(b))+/           /(a|(b))+/
# Line 5286  REPETITION Line 5289  REPETITION
5289    
5290  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
5291    
5292         With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")         With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")
5293         repetition, failure of what follows normally causes the  repeated  item         repetition,  failure  of what follows normally causes the repeated item
5294         to  be  re-evaluated to see if a different number of repeats allows the         to be re-evaluated to see if a different number of repeats  allows  the
5295         rest of the pattern to match. Sometimes it is useful to  prevent  this,         rest  of  the pattern to match. Sometimes it is useful to prevent this,
5296         either  to  change the nature of the match, or to cause it fail earlier         either to change the nature of the match, or to cause it  fail  earlier
5297         than it otherwise might, when the author of the pattern knows there  is         than  it otherwise might, when the author of the pattern knows there is
5298         no point in carrying on.         no point in carrying on.
5299    
5300         Consider,  for  example, the pattern \d+foo when applied to the subject         Consider, for example, the pattern \d+foo when applied to  the  subject
5301         line         line
5302    
5303           123456bar           123456bar
5304    
5305         After matching all 6 digits and then failing to match "foo", the normal         After matching all 6 digits and then failing to match "foo", the normal
5306         action  of  the matcher is to try again with only 5 digits matching the         action of the matcher is to try again with only 5 digits  matching  the
5307         \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.         \d+  item,  and  then  with  4,  and  so on, before ultimately failing.
5308         "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides         "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides
5309         the means for specifying that once a subpattern has matched, it is  not         the  means for specifying that once a subpattern has matched, it is not
5310         to be re-evaluated in this way.         to be re-evaluated in this way.
5311    
5312         If  we  use atomic grouping for the previous example, the matcher gives         If we use atomic grouping for the previous example, the  matcher  gives
5313         up immediately on failing to match "foo" the first time.  The  notation         up  immediately  on failing to match "foo" the first time. The notation
5314         is a kind of special parenthesis, starting with (?> as in this example:         is a kind of special parenthesis, starting with (?> as in this example:
5315    
5316           (?>\d+)foo           (?>\d+)foo
5317    
5318         This  kind  of  parenthesis "locks up" the  part of the pattern it con-         This kind of parenthesis "locks up" the  part of the  pattern  it  con-
5319         tains once it has matched, and a failure further into  the  pattern  is         tains  once  it  has matched, and a failure further into the pattern is
5320         prevented  from  backtracking into it. Backtracking past it to previous         prevented from backtracking into it. Backtracking past it  to  previous
5321         items, however, works as normal.         items, however, works as normal.
5322    
5323         An alternative description is that a subpattern of  this  type  matches         An  alternative  description  is that a subpattern of this type matches
5324         the  string  of  characters  that an identical standalone pattern would         the string of characters that an  identical  standalone  pattern  would
5325         match, if anchored at the current point in the subject string.         match, if anchored at the current point in the subject string.
5326    
5327         Atomic grouping subpatterns are not capturing subpatterns. Simple cases         Atomic grouping subpatterns are not capturing subpatterns. Simple cases
5328         such as the above example can be thought of as a maximizing repeat that         such as the above example can be thought of as a maximizing repeat that
5329         must swallow everything it can. So, while both \d+ and  \d+?  are  pre-         must  swallow  everything  it can. So, while both \d+ and \d+? are pre-
5330         pared  to  adjust  the number of digits they match in order to make the         pared to adjust the number of digits they match in order  to  make  the
5331         rest of the pattern match, (?>\d+) can only match an entire sequence of         rest of the pattern match, (?>\d+) can only match an entire sequence of
5332         digits.         digits.
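
The effect of "swallowing everything" can be sketched in Python. Because the (?>...) syntax is only available in newer Python versions, this example emulates an atomic group with a lookahead followed by a back reference, a commonly used substitute that is not the PCRE notation; the group name "d" is hypothetical.

```python
import re

subject = "1234"

# Ordinary greedy repeat: \d+ gives back the final digit so that the
# literal "4" can match.
backtracking = re.search(r"\d+4", subject)

# Emulated atomic group: the lookahead captures the whole digit run
# and is not re-entered on failure, so the back reference must consume
# all of it and nothing is left for the literal "4".
atomic_like = re.search(r"(?=(?P<d>\d+))(?P=d)4", subject)
```

The first search succeeds and the second fails, which is the change in the nature of the match that atomic grouping is meant to provide.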
5333    
5334         Atomic  groups in general can of course contain arbitrarily complicated         Atomic groups in general can of course contain arbitrarily  complicated
5335         subpatterns, and can be nested. However, when  the  subpattern  for  an         subpatterns,  and  can  be  nested. However, when the subpattern for an
5336         atomic group is just a single repeated item, as in the example above, a         atomic group is just a single repeated item, as in the example above, a
5337         simpler notation, called a "possessive quantifier" can  be  used.  This         simpler  notation,  called  a "possessive quantifier" can be used. This
5338         consists  of  an  additional  + character following a quantifier. Using         consists of an additional + character  following  a  quantifier.  Using
5339         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
5340    
5341           \d++foo           \d++foo
# Line 5342  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 5345  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
5345    
5346           (abc|xyz){2,3}+           (abc|xyz){2,3}+
5347    
5348         Possessive   quantifiers   are   always  greedy;  the  setting  of  the         Possessive  quantifiers  are  always  greedy;  the   setting   of   the
5349         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
5350         simpler  forms  of atomic group. However, there is no difference in the         simpler forms of atomic group. However, there is no difference  in  the
5351         meaning of a possessive quantifier and  the  equivalent  atomic  group,         meaning  of  a  possessive  quantifier and the equivalent atomic group,
5352         though  there  may  be a performance difference; possessive quantifiers         though there may be a performance  difference;  possessive  quantifiers
5353         should be slightly faster.         should be slightly faster.
5354    
5355         The possessive quantifier syntax is an extension to the Perl  5.8  syn-         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-
5356         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first         tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first
5357         edition of his book. Mike McCloskey liked it, so implemented it when he         edition of his book. Mike McCloskey liked it, so implemented it when he
5358         built  Sun's Java package, and PCRE copied it from there. It ultimately         built Sun's Java package, and PCRE copied it from there. It  ultimately
5359         found its way into Perl at release 5.10.         found its way into Perl at release 5.10.
5360    
5361         PCRE has an optimization that automatically "possessifies" certain sim-         PCRE has an optimization that automatically "possessifies" certain sim-
5362         ple  pattern  constructs.  For  example, the sequence A+B is treated as         ple pattern constructs. For example, the sequence  A+B  is  treated  as
5363         A++B because there is no point in backtracking into a sequence  of  A's         A++B  because  there is no point in backtracking into a sequence of A's
5364         when B must follow.         when B must follow.
5365    
5366         When  a  pattern  contains an unlimited repeat inside a subpattern that         When a pattern contains an unlimited repeat inside  a  subpattern  that
5367         can itself be repeated an unlimited number of  times,  the  use  of  an         can  itself  be  repeated  an  unlimited number of times, the use of an
5368         atomic  group  is  the  only way to avoid some failing matches taking a         atomic group is the only way to avoid some  failing  matches  taking  a
5369         very long time indeed. The pattern         very long time indeed. The pattern
5370    
5371           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
5372    
5373         matches an unlimited number of substrings that either consist  of  non-         matches  an  unlimited number of substrings that either consist of non-
5374         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it
5375         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
5376    
5377           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
5378    
5379         it takes a long time before reporting  failure.  This  is  because  the         it  takes  a  long  time  before reporting failure. This is because the
5380         string  can be divided between the internal \D+ repeat and the external         string can be divided between the internal \D+ repeat and the  external
5381         * repeat in a large number of ways, and all  have  to  be  tried.  (The         *  repeat  in  a  large  number of ways, and all have to be tried. (The
5382         example  uses  [!?]  rather than a single character at the end, because         example uses [!?] rather than a single character at  the  end,  because
5383         both PCRE and Perl have an optimization that allows  for  fast  failure         both  PCRE  and  Perl have an optimization that allows for fast failure
5384         when  a single character is used. They remember the last single charac-         when a single character is used. They remember the last single  charac-
5385         ter that is required for a match, and fail early if it is  not  present         ter  that  is required for a match, and fail early if it is not present
5386         in  the  string.)  If  the pattern is changed so that it uses an atomic         in the string.) If the pattern is changed so that  it  uses  an  atomic
5387         group, like this:         group, like this:
5388    
5389           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
# Line 5392  BACK REFERENCES Line 5395  BACK REFERENCES
5395    
5396         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
5397         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
5398         pattern earlier (that is, to its left) in the pattern,  provided  there         pattern  earlier  (that is, to its left) in the pattern, provided there
5399         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
5400    
5401         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
5402         it is always taken as a back reference, and causes  an  error  only  if         it  is  always  taken  as a back reference, and causes an error only if
5403         there  are  not that many capturing left parentheses in the entire pat-         there are not that many capturing left parentheses in the  entire  pat-
5404         tern. In other words, the parentheses that are referenced need  not  be         tern.  In  other words, the parentheses that are referenced need not be
5405         to  the left of the reference for numbers less than 10. A "forward back         to the left of the reference for numbers less than 10. A "forward  back
5406         reference" of this type can make sense when a  repetition  is  involved         reference"  of  this  type can make sense when a repetition is involved
5407         and  the  subpattern to the right has participated in an earlier itera-         and the subpattern to the right has participated in an  earlier  itera-
5408         tion.         tion.
5409    
5410         It is not possible to have a numerical "forward back  reference"  to  a         It  is  not  possible to have a numerical "forward back reference" to a
5411         subpattern  whose  number  is  10  or  more using this syntax because a         subpattern whose number is 10 or  more  using  this  syntax  because  a
5412         sequence such as \50 is interpreted as a character  defined  in  octal.         sequence  such  as  \50 is interpreted as a character defined in octal.
5413         See the subsection entitled "Non-printing characters" above for further         See the subsection entitled "Non-printing characters" above for further
5414         details of the handling of digits following a backslash.  There  is  no         details  of  the  handling of digits following a backslash. There is no
5415         such  problem  when named parentheses are used. A back reference to any         such problem when named parentheses are used. A back reference  to  any
5416         subpattern is possible using named parentheses (see below).         subpattern is possible using named parentheses (see below).
5417    
5418         Another way of avoiding the ambiguity inherent in  the  use  of  digits         Another  way  of  avoiding  the ambiguity inherent in the use of digits
5419         following  a  backslash  is  to use the \g escape sequence. This escape         following a backslash is to use the \g  escape  sequence.  This  escape
5420         must be followed by an unsigned number or a negative number, optionally         must be followed by an unsigned number or a negative number, optionally
5421         enclosed in braces. These examples are all identical:         enclosed in braces. These examples are all identical:
5422    
# Line 5421  BACK REFERENCES Line 5424  BACK REFERENCES
5424           (ring), \g1           (ring), \g1
5425           (ring), \g{1}           (ring), \g{1}
5426    
5427         An  unsigned number specifies an absolute reference without the ambigu-         An unsigned number specifies an absolute reference without the  ambigu-
5428         ity that is present in the older syntax. It is also useful when literal         ity that is present in the older syntax. It is also useful when literal
5429         digits follow the reference. A negative number is a relative reference.         digits follow the reference. A negative number is a relative reference.
5430         Consider this example:         Consider this example:
# Line 5430  BACK REFERENCES Line 5433  BACK REFERENCES
5433    
5434         The sequence \g{-1} is a reference to the most recently started captur-         The sequence \g{-1} is a reference to the most recently started captur-
5435         ing subpattern before \g, that is, it is equivalent to \2 in this exam-         ing subpattern before \g, that is, it is equivalent to \2 in this exam-
5436         ple.  Similarly, \g{-2} would be equivalent to \1. The use of  relative         ple.   Similarly, \g{-2} would be equivalent to \1. The use of relative
5437         references  can  be helpful in long patterns, and also in patterns that         references can be helpful in long patterns, and also in  patterns  that
5438         are created by  joining  together  fragments  that  contain  references         are  created  by  joining  together  fragments  that contain references
5439         within themselves.         within themselves.
5440    
5441         A  back  reference matches whatever actually matched the capturing sub-         A back reference matches whatever actually matched the  capturing  sub-
5442         pattern in the current subject string, rather  than  anything  matching         pattern  in  the  current subject string, rather than anything matching
5443         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
5444         of doing that). So the pattern         of doing that). So the pattern
5445    
5446           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
5447    
5448         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
5449         not  "sense and responsibility". If caseful matching is in force at the         not "sense and responsibility". If caseful matching is in force at  the
5450         time of the back reference, the case of letters is relevant. For  exam-         time  of the back reference, the case of letters is relevant. For exam-
5451         ple,         ple,
5452    
5453           ((?i)rah)\s+\1           ((?i)rah)\s+\1
5454    
5455         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the
5456         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
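
The two patterns above can be tried out with a short sketch. It uses Python's re module purely as a stand-in for PCRE, which is an assumption of this example: the two engines agree on these cases but differ elsewhere. Python does not accept (?i) in the middle of a pattern, so the scoped form (?i:rah) is substituted for it.

```python
import re

# A back reference matches the text that group 1 captured, not the group's pattern.
pat = re.compile(r"(sens|respons)e and \1ibility")
same = [bool(pat.fullmatch(s)) for s in (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)]

# Group 1 is matched caselessly, but the back reference itself is caseful.
rah = re.compile(r"((?i:rah))\s+\1")
cases = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]

print(same)   # [True, True, False]
print(cases)  # [True, True, False]
```

The third subject fails in both lists for the same reason: the back reference needs an exact repeat of the captured text.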
5457    
5458         There are several different ways of writing back  references  to  named         There  are  several  different ways of writing back references to named
5459         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or
5460         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's         \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's
5461         unified back reference syntax, in which \g can be used for both numeric         unified back reference syntax, in which \g can be used for both numeric
5462         and named references, is also supported. We  could  rewrite  the  above         and  named  references,  is  also supported. We could rewrite the above
5463         example in any of the following ways:         example in any of the following ways:
5464    
5465           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
# Line 5464  BACK REFERENCES Line 5467  BACK REFERENCES
5467           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
5468           (?<p1>(?i)rah)\s+\g{p1}           (?<p1>(?i)rah)\s+\g{p1}
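
Of the named forms listed here, the Python one can be checked directly with Python's re module, used as an illustrative stand-in for PCRE. The \k and \g forms are not accepted in Python patterns, and (?i:rah) again replaces the mid-pattern (?i); both substitutions are assumptions of this sketch.

```python
import re

# (?P<p1>...) names the group; (?P=p1) refers back to what it captured.
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
results = {s: bool(named.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")}

print(results)  # {'rah rah': True, 'RAH RAH': True, 'RAH rah': False}
```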
5469    
5470         A  subpattern  that  is  referenced  by  name may appear in the pattern         A subpattern that is referenced by  name  may  appear  in  the  pattern
5471         before or after the reference.         before or after the reference.
5472    
5473         There may be more than one back reference to the same subpattern. If  a         There  may be more than one back reference to the same subpattern. If a
5474         subpattern  has  not actually been used in a particular match, any back         subpattern has not actually been used in a particular match,  any  back
5475         references to it always fail by default. For example, the pattern         references to it always fail by default. For example, the pattern
5476    
5477           (a|(bc))\2           (a|(bc))\2
5478    
5479         always fails if it starts to match "a" rather than  "bc".  However,  if         always  fails  if  it starts to match "a" rather than "bc". However, if
5480         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
5481         ence to an unset value matches an empty string.         ence to an unset value matches an empty string.
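
The default behaviour for an unset group can be sketched as follows, again with Python's re module standing in for PCRE (an assumption). Python has no counterpart to PCRE_JAVASCRIPT_COMPAT, so only the default case is shown.

```python
import re

pat = re.compile(r"(a|(bc))\2")

# Group 2 takes no part when "a" is matched, so \2 has nothing to match and fails.
unset = pat.match("a")

# Here group 2 captures "bc", so \2 matches the second "bc".
both = pat.match("bcbc")

print(unset)          # None
print(both.group(0))  # bcbc
```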
5482    
5483         Because there may be many capturing parentheses in a pattern, all  dig-         Because  there may be many capturing parentheses in a pattern, all dig-
5484         its  following a backslash are taken as part of a potential back refer-         its following a backslash are taken as part of a potential back  refer-
5485         ence number.  If the pattern continues with  a  digit  character,  some         ence  number.   If  the  pattern continues with a digit character, some
5486         delimiter  must  be  used  to  terminate  the  back  reference.  If the         delimiter must  be  used  to  terminate  the  back  reference.  If  the
5487         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
5488         syntax or an empty comment (see "Comments" below) can be used.         syntax or an empty comment (see "Comments" below) can be used.
5489    
5490     Recursive back references     Recursive back references
5491    
5492         A  back reference that occurs inside the parentheses to which it refers         A back reference that occurs inside the parentheses to which it  refers
5493         fails when the subpattern is first used, so, for example,  (a\1)  never         fails  when  the subpattern is first used, so, for example, (a\1) never
5494         matches.   However,  such references can be useful inside repeated sub-         matches.  However, such references can be useful inside  repeated  sub-
5495         patterns. For example, the pattern         patterns. For example, the pattern
5496    
5497           (a|b\1)+           (a|b\1)+
5498    
5499         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
5500         ation  of  the  subpattern,  the  back  reference matches the character         ation of the subpattern,  the  back  reference  matches  the  character
5501         string corresponding to the previous iteration. In order  for  this  to         string  corresponding  to  the previous iteration. In order for this to
5502         work,  the  pattern must be such that the first iteration does not need         work, the pattern must be such that the first iteration does  not  need
5503         to match the back reference. This can be done using alternation, as  in         to  match the back reference. This can be done using alternation, as in
5504         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
5505    
5506         Back  references of this type cause the group that they reference to be         Back references of this type cause the group that they reference to  be
5507         treated as an atomic group.  Once the whole group has been  matched,  a         treated  as  an atomic group.  Once the whole group has been matched, a
5508         subsequent  matching  failure cannot cause backtracking into the middle         subsequent matching failure cannot cause backtracking into  the  middle
5509         of the group.         of the group.
5510    
5511    
5512  ASSERTIONS  ASSERTIONS
5513    
5514         An assertion is a test on the characters  following  or  preceding  the         An  assertion  is  a  test on the characters following or preceding the
5515         current  matching  point that does not actually consume any characters.         current matching point that does not actually consume  any  characters.
5516         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
5517         described above.         described above.
5518    
5519         More  complicated  assertions  are  coded as subpatterns. There are two         More complicated assertions are coded as  subpatterns.  There  are  two
5520         kinds: those that look ahead of the current  position  in  the  subject         kinds:  those  that  look  ahead of the current position in the subject
5521         string,  and  those  that  look  behind  it. An assertion subpattern is         string, and those that look  behind  it.  An  assertion  subpattern  is
5522         matched in the normal way, except that it does not  cause  the  current         matched  in  the  normal way, except that it does not cause the current
5523         matching position to be changed.         matching position to be changed.
5524    
5525         Assertion  subpatterns are not capturing subpatterns. If such an asser-         Assertion subpatterns are not capturing subpatterns. If such an  asser-
5526         tion contains capturing subpatterns within it, these  are  counted  for         tion  contains  capturing  subpatterns within it, these are counted for
5527         the  purposes  of numbering the capturing subpatterns in the whole pat-         the purposes of numbering the capturing subpatterns in the  whole  pat-
5528         tern. However, substring capturing is carried  out  only  for  positive         tern.  However,  substring  capturing  is carried out only for positive
5529         assertions, because it does not make sense for negative assertions.         assertions, because it does not make sense for negative assertions.
5530    
5531         For  compatibility  with  Perl,  assertion subpatterns may be repeated;         For compatibility with Perl, assertion  subpatterns  may  be  repeated;
5532         though it makes no sense to assert the same thing  several  times,  the         though  it  makes  no sense to assert the same thing several times, the
5533         side  effect  of  capturing  parentheses may occasionally be useful. In         side effect of capturing parentheses may  occasionally  be  useful.  In
5534         practice, there are only three cases:         practice, there are only three cases:
5535    
5536         (1) If the quantifier is {0}, the  assertion  is  never  obeyed  during         (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
5537         matching.   However,  it  may  contain internal capturing parenthesized         matching.  However, it may  contain  internal  capturing  parenthesized
5538         groups that are called from elsewhere via the subroutine mechanism.         groups that are called from elsewhere via the subroutine mechanism.
5539    
5540         (2) If the quantifier is {0,n} where n is greater than zero, it is treated         (2) If the quantifier is {0,n} where n is greater than zero, it is treated
5541         as  if  it  were  {0,1}.  At run time, the rest of the pattern match is         as if it were {0,1}. At run time, the rest  of  the  pattern  match  is
5542         tried with and without the assertion, the order depending on the greed-         tried with and without the assertion, the order depending on the greed-
5543         iness of the quantifier.         iness of the quantifier.
5544    
5545         (3)  If  the minimum repetition is greater than zero, the quantifier is         (3) If the minimum repetition is greater than zero, the  quantifier  is
5546         ignored.  The assertion is obeyed just  once  when  encountered  during         ignored.   The  assertion  is  obeyed just once when encountered during
5547         matching.         matching.
5548    
5549     Lookahead assertions     Lookahead assertions
# Line 5550  ASSERTIONS Line 5553  ASSERTIONS
5553    
5554           \w+(?=;)           \w+(?=;)
5555    
5556         matches a word followed by a semicolon, but does not include the  semi-         matches  a word followed by a semicolon, but does not include the semi-
5557         colon in the match, and         colon in the match, and
5558    
5559           foo(?!bar)           foo(?!bar)
5560    
5561         matches  any  occurrence  of  "foo" that is not followed by "bar". Note         matches any occurrence of "foo" that is not  followed  by  "bar".  Note
5562         that the apparently similar pattern         that the apparently similar pattern
5563    
5564           (?!foo)bar           (?!foo)bar
5565    
5566         does not find an occurrence of "bar"  that  is  preceded  by  something         does  not  find  an  occurrence  of "bar" that is preceded by something
5567         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other than "foo"; it finds any occurrence of "bar" whatsoever,  because
5568         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
5569         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
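
The three lookahead patterns discussed above can be compared side by side. This is a minimal sketch using Python's re module as a stand-in for PCRE (an assumption); the subject string is made up for the example.

```python
import re

text = "alpha; foobar foobaz bar"

# The semicolon is tested but is not part of the match.
words = re.findall(r"\w+(?=;)", text)

# Only the "foo" in "foobaz" qualifies; the one in "foobar" is rejected.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", text)]

# (?!foo) is tested at the start of "bar" itself, so every "bar" is found,
# including the one inside "foobar".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", text)]

print(words)        # ['alpha']
print(foo_not_bar)  # [14]
print(any_bar)      # [10, 21]
```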
5570    
5571         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
5572         most convenient way to do it is  with  (?!)  because  an  empty  string         most  convenient  way  to  do  it  is with (?!) because an empty string
5573         always  matches, so an assertion that requires there not to be an empty         always matches, so an assertion that requires there not to be an  empty
5574         string must always fail.  The backtracking control verb (*FAIL) or (*F)         string must always fail.  The backtracking control verb (*FAIL) or (*F)
5575         is a synonym for (?!).         is a synonym for (?!).
5576    
5577     Lookbehind assertions     Lookbehind assertions
5578    
5579         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
5580         for negative assertions. For example,         for negative assertions. For example,
5581    
5582           (?<!foo)bar           (?<!foo)bar
5583    
5584         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
5585         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
5586         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
5587         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
5588         fixed length. Thus         fixed length. Thus
5589    
5590           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 5590  ASSERTIONS Line 5593  ASSERTIONS
5593    
5594           (?<!dogs?|cats?)           (?<!dogs?|cats?)
5595    
5596         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
5597         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
5598         This is an extension compared with Perl, which requires all branches to         This is an extension compared with Perl, which requires all branches to
5599         match the same length of string. An assertion such as         match the same length of string. An assertion such as
5600    
5601           (?<=ab(c|de))           (?<=ab(c|de))
5602    
5603         is  not  permitted,  because  its single top-level branch can match two         is not permitted, because its single top-level  branch  can  match  two
5604         different lengths, but it is acceptable to PCRE if rewritten to use two         different lengths, but it is acceptable to PCRE if rewritten to use two
5605         top-level branches:         top-level branches:
5606    
5607           (?<=abc|abde)           (?<=abc|abde)
5608    
5609         In  some  cases, the escape sequence \K (see above) can be used instead         In some cases, the escape sequence \K (see above) can be  used  instead
5610         of a lookbehind assertion to get round the fixed-length restriction.         of a lookbehind assertion to get round the fixed-length restriction.
5611    
5612         The implementation of lookbehind assertions is, for  each  alternative,         The  implementation  of lookbehind assertions is, for each alternative,
5613         to  temporarily  move the current position back by the fixed length and         to temporarily move the current position back by the fixed  length  and
5614         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
5615         rent position, the assertion fails.         rent position, the assertion fails.
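
The step back by a fixed length can be observed in a small sketch, once more with Python's re module standing in for PCRE (an assumption). Python is stricter than PCRE here: it requires every alternative in a lookbehind to have the same length, so only single-branch lookbehinds are used.

```python
import re

# Negative lookbehind: "bar" is reported only where "foo" does not precede it.
starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar bar")]

# The positive lookbehind needs three characters before the current position.
# At offset 0 there are none, so the assertion fails.
short = re.match(r"(?<=abc)d", "d")

# With "abc" in front, the same assertion succeeds at offset 3.
ok = re.search(r"(?<=abc)d", "abcd")

print(starts)      # [10, 14]
print(short)       # None
print(ok.start())  # 3
```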
5616    
5617         In  a UTF mode, PCRE does not allow the \C escape (which matches a sin-         In a UTF mode, PCRE does not allow the \C escape (which matches a  sin-
5618         gle data unit even in a UTF mode) to appear in  lookbehind  assertions,         gle  data  unit even in a UTF mode) to appear in lookbehind assertions,
5619         because  it  makes it impossible to calculate the length of the lookbe-         because it makes it impossible to calculate the length of  the  lookbe-
5620         hind. The \X and \R escapes, which can match different numbers of  data         hind.  The \X and \R escapes, which can match different numbers of data
5621         units, are also not permitted.         units, are also not permitted.
5622    
5623         "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in         "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
5624         lookbehinds, as long as the subpattern matches a  fixed-length  string.         lookbehinds,  as  long as the subpattern matches a fixed-length string.
5625         Recursion, however, is not supported.         Recursion, however, is not supported.
5626    
5627         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
5628         assertions to specify efficient matching of fixed-length strings at the         assertions to specify efficient matching of fixed-length strings at the
5629         end of subject strings. Consider a simple pattern such as         end of subject strings. Consider a simple pattern such as
5630    
5631           abcd$           abcd$
5632    
5633         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
5634         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
5635         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
5636         pattern is specified as         pattern is specified as
5637    
5638           ^.*abcd$           ^.*abcd$
5639    
5640         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
5641         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
5642         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
5643         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
5644         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
5645    
5646           ^.*+(?<=abcd)           ^.*+(?<=abcd)
5647    
5648         there can be no backtracking for the .*+ item; it can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
5649         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
5650         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
5651         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
5652         processing time.         processing time.
5653    
5654     Using multiple assertions     Using multiple assertions
# Line 5654  ASSERTIONS Line 5657  ASSERTIONS
5657    
5658           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
5659    
5660         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
5661         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
5662         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
5663         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
5664         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
5665         ceded  by  six  characters,  the first of which are digits and the last         ceded by six characters, the first of which are  digits  and  the  last
5666         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
5667         foo". A pattern to do that is         foo". A pattern to do that is
5668    
5669           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
5670    
5671         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
5672         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
5673         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
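The difference between the two patterns above can be checked with a short sketch. It uses Python's `re` module rather than PCRE itself, on the assumption that fixed-length lookbehind assertions behave the same way in both for these simple cases. The names `same_point`, `six_back` and `matches` are illustrative only.

```python
import re

# Both assertions are tested at the same position: immediately before "foo".
same_point = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first assertion looks back six characters, while the second
# still looks back only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the compiled pattern matches anywhere in subject."""
    return pattern.search(subject) is not None
```

With these definitions, `matches(same_point, "123abcfoo")` is false because the three characters before "foo" are not digits, whereas `matches(six_back, "123abcfoo")` is true.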
5674    
# Line 5673  ASSERTIONS Line 5676  ASSERTIONS
5676    
5677           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
5678    
5679         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
5680         is not preceded by "foo", while         is not preceded by "foo", while
5681    
5682           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
5683    
5684         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
5685         three characters that are not "999".         three characters that are not "999".
5686    
5687    
5688  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
5689    
5690         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
5691         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
5692         on  the result of an assertion, or whether a specific capturing subpat-         on the result of an assertion, or whether a specific capturing  subpat-
5693         tern has already been matched. The two possible  forms  of  conditional         tern  has  already  been matched. The two possible forms of conditional
5694         subpattern are:         subpattern are:
5695    
5696           (?(condition)yes-pattern)           (?(condition)yes-pattern)
5697           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
5698    
5699         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
5700         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
5701         tives  in  the subpattern, a compile-time error occurs. Each of the two         tives in the subpattern, a compile-time error occurs. Each of  the  two
5702         alternatives may itself contain nested subpatterns of any form, includ-         alternatives may itself contain nested subpatterns of any form, includ-
5703         ing  conditional  subpatterns;  the  restriction  to  two  alternatives         ing  conditional  subpatterns;  the  restriction  to  two  alternatives
5704         applies only at the level of the condition. This pattern fragment is an         applies only at the level of the condition. This pattern fragment is an
# Line 5704  CONDITIONAL SUBPATTERNS Line 5707  CONDITIONAL SUBPATTERNS
5707           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
5708    
5709    
5710         There  are  four  kinds of condition: references to subpatterns, refer-         There are four kinds of condition: references  to  subpatterns,  refer-
5711         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
5712    
5713     Checking for a used subpattern by number     Checking for a used subpattern by number
5714    
5715         If the text between the parentheses consists of a sequence  of  digits,         If  the  text between the parentheses consists of a sequence of digits,
5716         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
5717         viously matched. If there is more than one  capturing  subpattern  with         viously  matched.  If  there is more than one capturing subpattern with
5718         the  same  number  (see  the earlier section about duplicate subpattern         the same number (see the earlier  section  about  duplicate  subpattern
5719         numbers), the condition is true if any of them have matched. An  alter-         numbers),  the condition is true if any of them have matched. An alter-
5720         native  notation is to precede the digits with a plus or minus sign. In         native notation is to precede the digits with a plus or minus sign.  In
5721         this case, the subpattern number is relative rather than absolute.  The         this  case, the subpattern number is relative rather than absolute. The
5722         most  recently opened parentheses can be referenced by (?(-1), the next         most recently opened parentheses can be referenced by (?(-1), the  next
5723         most recent by (?(-2), and so on. Inside loops it can also  make  sense         most  recent  by (?(-2), and so on. Inside loops it can also make sense
5724         to refer to subsequent groups. The next parentheses to be opened can be         to refer to subsequent groups. The next parentheses to be opened can be
5725         referenced as (?(+1), and so on. (The value zero in any of these  forms         referenced  as (?(+1), and so on. (The value zero in any of these forms
5726         is not used; it provokes a compile-time error.)         is not used; it provokes a compile-time error.)
5727    
5728         Consider  the  following  pattern, which contains non-significant white         Consider the following pattern, which  contains  non-significant  white
5729         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
5730         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
5731    
5732           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
5733    
5734         The  first  part  matches  an optional opening parenthesis, and if that         The first part matches an optional opening  parenthesis,  and  if  that
5735         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
5736         ond  part  matches one or more characters that are not parentheses. The         ond part matches one or more characters that are not  parentheses.  The
5737         third part is a conditional subpattern that tests whether  or  not  the         third  part  is  a conditional subpattern that tests whether or not the
5738         first set of parentheses matched. If they did, that is, if the subject         first set of parentheses matched. If they did, that is, if the subject
5739         started with an opening parenthesis, the condition is true, and so  the         started  with an opening parenthesis, the condition is true, and so the
5740         yes-pattern  is  executed and a closing parenthesis is required. Other-         yes-pattern is executed and a closing parenthesis is  required.  Other-
5741         wise, since no-pattern is not present, the subpattern matches  nothing.         wise,  since no-pattern is not present, the subpattern matches nothing.
5742         In  other  words,  this  pattern matches a sequence of non-parentheses,         In other words, this pattern matches  a  sequence  of  non-parentheses,
5743         optionally enclosed in parentheses.         optionally enclosed in parentheses.
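As a rough illustration of the numbered condition, the pattern can be tried in Python's `re` module, which also accepts the (?(1)...) form (this is an assumption about Python, not a statement about the PCRE API; relative forms such as (?(-1)...) are not available there). `re.VERBOSE` plays the role of PCRE_EXTENDED, and the helper name is hypothetical.

```python
import re

# Group 1 captures an optional opening parenthesis; the conditional at the
# end demands a closing parenthesis only if group 1 took part in the match.
OPTIONAL_PARENS = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(subject):
    """Return True if the whole subject matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

Under these assumptions "abc" and "(abc)" are accepted, while "(abc" and "abc)" are rejected when the whole subject must match.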
5744    
5745         If you were embedding this pattern in a larger one,  you  could  use  a         If  you  were  embedding  this pattern in a larger one, you could use a
5746         relative reference:         relative reference:
5747    
5748           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
5749    
5750         This  makes  the  fragment independent of the parentheses in the larger         This makes the fragment independent of the parentheses  in  the  larger
5751         pattern.         pattern.
5752    
5753     Checking for a used subpattern by name     Checking for a used subpattern by name
5754    
5755         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
5756         used  subpattern  by  name.  For compatibility with earlier versions of         used subpattern by name. For compatibility  with  earlier  versions  of
5757         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
5758         also  recognized. However, there is a possible ambiguity with this syn-         also recognized. However, there is a possible ambiguity with this  syn-
5759         tax, because subpattern names may  consist  entirely  of  digits.  PCRE         tax,  because  subpattern  names  may  consist entirely of digits. PCRE
5760         looks  first for a named subpattern; if it cannot find one and the name         looks first for a named subpattern; if it cannot find one and the  name
5761         consists entirely of digits, PCRE looks for a subpattern of  that  num-         consists  entirely  of digits, PCRE looks for a subpattern of that num-
5762         ber,  which must be greater than zero. Using subpattern names that con-         ber, which must be greater than zero. Using subpattern names that  con-
5763         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
5764    
5765         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
5766    
5767           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
5768    
5769         If the name used in a condition of this kind is a duplicate,  the  test         If  the  name used in a condition of this kind is a duplicate, the test
5770         is  applied to all subpatterns of the same name, and is true if any one         is applied to all subpatterns of the same name, and is true if any  one
5771         of them has matched.         of them has matched.
5772    
5773     Checking for pattern recursion     Checking for pattern recursion
5774    
5775         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
5776         name  R, the condition is true if a recursive call to the whole pattern         name R, the condition is true if a recursive call to the whole  pattern
5777         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
5778         sand follow the letter R, for example:         sand follow the letter R, for example:
5779    
# Line 5778  CONDITIONAL SUBPATTERNS Line 5781  CONDITIONAL SUBPATTERNS
5781    
5782         the condition is true if the most recent recursion is into a subpattern         the condition is true if the most recent recursion is into a subpattern
5783         whose number or name is given. This condition does not check the entire         whose number or name is given. This condition does not check the entire
5784         recursion  stack.  If  the  name  used in a condition of this kind is a         recursion stack. If the name used in a condition  of  this  kind  is  a
5785         duplicate, the test is applied to all subpatterns of the same name, and         duplicate, the test is applied to all subpatterns of the same name, and
5786         is true if any one of them is the most recent recursion.         is true if any one of them is the most recent recursion.
5787    
5788         At  "top  level",  all  these recursion test conditions are false.  The         At "top level", all these recursion test  conditions  are  false.   The
5789         syntax for recursive patterns is described below.         syntax for recursive patterns is described below.
5790    
5791     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
5792    
5793         If the condition is the string (DEFINE), and  there  is  no  subpattern         If  the  condition  is  the string (DEFINE), and there is no subpattern
5794         with  the  name  DEFINE,  the  condition is always false. In this case,         with the name DEFINE, the condition is  always  false.  In  this  case,
5795         there may be only one alternative  in  the  subpattern.  It  is  always         there  may  be  only  one  alternative  in the subpattern. It is always
5796         skipped  if  control  reaches  this  point  in the pattern; the idea of         skipped if control reaches this point  in  the  pattern;  the  idea  of
5797         DEFINE is that it can be used to define subroutines that can be  refer-         DEFINE  is that it can be used to define subroutines that can be refer-
5798         enced  from elsewhere. (The use of subroutines is described below.) For         enced from elsewhere. (The use of subroutines is described below.)  For
5799         example, a pattern to match an IPv4 address  such  as  "192.168.23.245"         example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
5800         could be written like this (ignore whitespace and line breaks):         could be written like this (ignore whitespace and line breaks):
5801    
5802           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
5803           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
5804    
5805         The first part of the pattern is a DEFINE group inside which another         The first part of the pattern is a DEFINE group inside which another
5806         group named "byte" is defined. This matches an individual component  of         group  named "byte" is defined. This matches an individual component of
5807         an  IPv4  address  (a number less than 256). When matching takes place,         an IPv4 address (a number less than 256). When  matching  takes  place,
5808         this part of the pattern is skipped because DEFINE acts  like  a  false         this  part  of  the pattern is skipped because DEFINE acts like a false
5809         condition.  The  rest of the pattern uses references to the named group         condition. The rest of the pattern uses references to the  named  group
5810         to match the four dot-separated components of an IPv4 address,  insist-         to  match the four dot-separated components of an IPv4 address, insist-
5811         ing on a word boundary at each end.         ing on a word boundary at each end.
5812    
5813     Assertion conditions     Assertion conditions
5814    
5815         If  the  condition  is  not  in any of the above formats, it must be an         If the condition is not in any of the above  formats,  it  must  be  an
5816         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.   This may be a positive or negative lookahead or lookbehind
5817         assertion.  Consider  this  pattern,  again  containing non-significant         assertion. Consider  this  pattern,  again  containing  non-significant
5818         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
5819    
5820           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
5821           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
5822    
5823         The condition  is  a  positive  lookahead  assertion  that  matches  an         The  condition  is  a  positive  lookahead  assertion  that  matches an
5824         optional  sequence of non-letters followed by a letter. In other words,         optional sequence of non-letters followed by a letter. In other  words,
5825         it tests for the presence of at least one letter in the subject.  If  a         it  tests  for the presence of at least one letter in the subject. If a
5826         letter  is found, the subject is matched against the first alternative;         letter is found, the subject is matched against the first  alternative;
5827         otherwise it is  matched  against  the  second.  This  pattern  matches         otherwise  it  is  matched  against  the  second.  This pattern matches
5828         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
5829         letters and dd are digits.         letters and dd are digits.
5830    
5831    
# Line 5831  COMMENTS Line 5834  COMMENTS
5834         There are two ways of including comments in patterns that are processed         There are two ways of including comments in patterns that are processed
5835         by PCRE. In both cases, the start of the comment must not be in a char-         by PCRE. In both cases, the start of the comment must not be in a char-
5836         acter class, nor in the middle of any other sequence of related charac-         acter class, nor in the middle of any other sequence of related charac-
5837         ters  such  as  (?: or a subpattern name or number. The characters that         ters such as (?: or a subpattern name or number.  The  characters  that
5838         make up a comment play no part in the pattern matching.         make up a comment play no part in the pattern matching.
5839    
5840         The sequence (?# marks the start of a comment that continues up to  the         The  sequence (?# marks the start of a comment that continues up to the
5841         next  closing parenthesis. Nested parentheses are not permitted. If the         next closing parenthesis. Nested parentheses are not permitted. If  the
5842         PCRE_EXTENDED option is set, an unescaped # character also introduces a         PCRE_EXTENDED option is set, an unescaped # character also introduces a
5843         comment,  which  in  this  case continues to immediately after the next         comment, which in this case continues to  immediately  after  the  next
5844         newline character or character sequence in the pattern.  Which  charac-         newline  character  or character sequence in the pattern. Which charac-
5845         ters are interpreted as newlines is controlled by the options passed to         ters are interpreted as newlines is controlled by the options passed to
5846         a compiling function or by a special sequence at the start of the  pat-         a  compiling function or by a special sequence at the start of the pat-
5847         tern, as described in the section entitled "Newline conventions" above.         tern, as described in the section entitled "Newline conventions" above.
5848         Note that the end of this type of comment is a literal newline sequence         Note that the end of this type of comment is a literal newline sequence
5849         in  the pattern; escape sequences that happen to represent a newline do         in the pattern; escape sequences that happen to represent a newline  do
5850         not count. For example, consider this  pattern  when  PCRE_EXTENDED  is         not  count.  For  example,  consider this pattern when PCRE_EXTENDED is
5851         set, and the default newline convention is in force:         set, and the default newline convention is in force:
5852    
5853           abc #comment \n still comment           abc #comment \n still comment
5854    
5855         On  encountering  the  # character, pcre_compile() skips along, looking         On encountering the # character, pcre_compile()  skips  along,  looking
5856         for a newline in the pattern. The sequence \n is still literal at  this         for  a newline in the pattern. The sequence \n is still literal at this
5857         stage,  so  it does not terminate the comment. Only an actual character         stage, so it does not terminate the comment. Only an  actual  character
5858         with the code value 0x0a (the default newline) does so.         with the code value 0x0a (the default newline) does so.
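The distinction between an escape sequence and a literal newline can be seen in a small sketch. It relies on Python's `re` module in verbose mode, which, as far as this example goes, ends a # comment in the same way; it is not a demonstration of pcre_compile() itself. The variable names are illustrative.

```python
import re

# Raw string: the pattern holds a backslash followed by "n", not a newline,
# so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: Python converts \n to a real 0x0a character before the
# pattern is compiled, so the comment ends there and the rest is pattern text.
literal = re.compile("abc #comment \n still comment", re.VERBOSE)
```

The first pattern is equivalent to "abc". The second, once insignificant white space is dropped, is equivalent to "abcstillcomment".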
5859    
5860    
5861  RECURSIVE PATTERNS  RECURSIVE PATTERNS
5862    
5863         Consider the problem of matching a string in parentheses, allowing  for         Consider  the problem of matching a string in parentheses, allowing for
5864         unlimited  nested  parentheses.  Without the use of recursion, the best         unlimited nested parentheses. Without the use of  recursion,  the  best
5865         that can be done is to use a pattern that  matches  up  to  some  fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
5866         depth  of  nesting.  It  is not possible to handle an arbitrary nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
5867         depth.         depth.
5868    
5869         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
5870         sions  to recurse (amongst other things). It does this by interpolating         sions to recurse (amongst other things). It does this by  interpolating
5871         Perl code in the expression at run time, and the code can refer to  the         Perl  code in the expression at run time, and the code can refer to the
5872         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
5873         parentheses problem can be created like this:         parentheses problem can be created like this:
5874    
# Line 5875  RECURSIVE PATTERNS Line 5878  RECURSIVE PATTERNS
5878         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
5879    
5880         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
5881         it supports special syntax for recursion of  the  entire  pattern,  and         it  supports  special  syntax  for recursion of the entire pattern, and
5882         also  for  individual  subpattern  recursion. After its introduction in         also for individual subpattern recursion.  After  its  introduction  in
5883         PCRE and Python, this kind of  recursion  was  subsequently  introduced         PCRE  and  Python,  this  kind of recursion was subsequently introduced
5884         into Perl at release 5.10.         into Perl at release 5.10.
5885    
5886         A  special  item  that consists of (? followed by a number greater than         A special item that consists of (? followed by a  number  greater  than
5887         zero and a closing parenthesis is a recursive subroutine  call  of  the         zero  and  a  closing parenthesis is a recursive subroutine call of the
5888         subpattern  of  the  given  number, provided that it occurs inside that         subpattern of the given number, provided that  it  occurs  inside  that
5889         subpattern. (If not, it is a non-recursive subroutine  call,  which  is         subpattern.  (If  not,  it is a non-recursive subroutine call, which is
5890         described  in  the  next  section.)  The special item (?R) or (?0) is a         described in the next section.) The special item  (?R)  or  (?0)  is  a
5891         recursive call of the entire regular expression.         recursive call of the entire regular expression.
5892    
5893         This PCRE pattern solves the nested  parentheses  problem  (assume  the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
5894         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
5895    
5896           \( ( [^()]++ | (?R) )* \)           \( ( [^()]++ | (?R) )* \)
5897    
5898         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
5899         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
5900         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
5901         sized substring).  Finally there is a closing parenthesis. Note the use         sized substring).  Finally there is a closing parenthesis. Note the use
5902         of a possessive quantifier to avoid backtracking into sequences of non-         of a possessive quantifier to avoid backtracking into sequences of non-
5903         parentheses.         parentheses.
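The structure of this recursive pattern can be mirrored by an ordinary recursive function. The sketch below is plain Python, not PCRE and not a regular expression (Python's `re` module has no (?R) item). It is anchored at a given position, and the name `match_balanced` is hypothetical.

```python
def match_balanced(subject, pos=0):
    """Return the index just past a balanced parenthesized substring that
    starts at pos, or None if there is no such substring at that position."""
    # \(  : an opening parenthesis is required first.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # \)  : the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            # (?R) : recurse for a nested parenthesized substring.
            end = match_balanced(subject, pos)
            if end is None:
                return None
            pos = end
        else:
            # [^()]++ : consume a non-parenthesis character, never given back.
            pos += 1
    # The subject ran out before the closing parenthesis was found.
    return None
```

Because each character is consumed once and never revisited, the function fails quickly on unbalanced input, which is the effect the possessive quantifier has in the pattern.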
5904    
5905         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
5906         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
5907    
5908           ( \( ( [^()]++ | (?1) )* \) )           ( \( ( [^()]++ | (?1) )* \) )
5909    
5910         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
5911         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
5912    
5913         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
5914         tricky.  This is made easier by the use of relative references. Instead         tricky. This is made easier by the use of relative references.  Instead
5915         of (?1) in the pattern above you can write (?-2) to refer to the second         of (?1) in the pattern above you can write (?-2) to refer to the second
5916         most  recently  opened  parentheses  preceding  the recursion. In other         most recently opened parentheses  preceding  the  recursion.  In  other
5917         words, a negative number counts capturing  parentheses  leftwards  from         words,  a  negative  number counts capturing parentheses leftwards from
5918         the point at which it is encountered.         the point at which it is encountered.
5919    
5920         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
5921         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
5922         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
5923         enced. They are always non-recursive subroutine calls, as described  in         enced.  They are always non-recursive subroutine calls, as described in
5924         the next section.         the next section.
5925    
5926         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
5927         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
5928         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
5929    
5930           (?<pn> \( ( [^()]++ | (?&pn) )* \) )           (?<pn> \( ( [^()]++ | (?&pn) )* \) )
5931    
5932         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
5933         one is used.         one is used.
5934    
5935         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
5936         nested unlimited repeats, and so the use of a possessive quantifier for         nested unlimited repeats, and so the use of a possessive quantifier for
5937         matching strings of non-parentheses is important when applying the pat-         matching strings of non-parentheses is important when applying the pat-
5938         tern  to  strings  that do not match. For example, when this pattern is         tern to strings that do not match. For example, when  this  pattern  is
5939         applied to         applied to
5940    
5941           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
5942    
5943         it yields "no match" quickly. However, if a  possessive  quantifier  is         it  yields  "no  match" quickly. However, if a possessive quantifier is
5944         not  used, the match runs for a very long time indeed because there are         not used, the match runs for a very long time indeed because there  are
5945         so many different ways the + and * repeats can carve  up  the  subject,         so  many  different  ways the + and * repeats can carve up the subject,
5946         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
5947    
5948         At  the  end  of a match, the values of capturing parentheses are those         At the end of a match, the values of capturing  parentheses  are  those
5949         from the outermost level. If you want to obtain intermediate values,  a         from  the outermost level. If you want to obtain intermediate values, a
5950         callout  function can be used (see below and the pcrecallout documenta-         callout function can be used (see below and the pcrecallout  documenta-
5951         tion). If the pattern above is matched against         tion). If the pattern above is matched against
5952    
5953           (ab(cd)ef)           (ab(cd)ef)
5954    
5955         the value for the inner capturing parentheses  (numbered  2)  is  "ef",         the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
5956         which  is the last value taken on at the top level. If a capturing sub-         which is the last value taken on at the top level. If a capturing  sub-
5957         pattern is not matched at the top level, its final  captured  value  is         pattern  is  not  matched at the top level, its final captured value is
5958         unset,  even  if  it was (temporarily) set at a deeper level during the         unset, even if it was (temporarily) set at a deeper  level  during  the
5959         matching process.         matching process.
5960    
5961         If there are more than 15 capturing parentheses in a pattern, PCRE  has         If  there are more than 15 capturing parentheses in a pattern, PCRE has
5962         to  obtain extra memory to store data during a recursion, which it does         to obtain extra memory to store data during a recursion, which it  does
5963         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
5964         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
5965    
5966         Do  not  confuse  the (?R) item with the condition (R), which tests for         Do not confuse the (?R) item with the condition (R),  which  tests  for
5967         recursion.  Consider this pattern, which matches text in  angle  brack-         recursion.   Consider  this pattern, which matches text in angle brack-
5968         ets,  allowing for arbitrary nesting. Only digits are allowed in nested         ets, allowing for arbitrary nesting. Only digits are allowed in  nested
5969         brackets (that is, when recursing), whereas any characters are  permit-         brackets  (that is, when recursing), whereas any characters are permit-
5970         ted at the outer level.         ted at the outer level.
5971    
5972           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
5973    
5974         In  this  pattern, (?(R) is the start of a conditional subpattern, with         In this pattern, (?(R) is the start of a conditional  subpattern,  with
5975         two different alternatives for the recursive and  non-recursive  cases.         two  different  alternatives for the recursive and non-recursive cases.
5976         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
5977    
5978     Differences in recursion processing between PCRE and Perl     Differences in recursion processing between PCRE and Perl
5979    
5980         Recursion  processing  in PCRE differs from Perl in two important ways.         Recursion processing in PCRE differs from Perl in two  important  ways.
5981         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
5982         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
5983         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
5984         alternatives  and  there  is a subsequent matching failure. This can be         alternatives and there is a subsequent matching failure.  This  can  be
5985         illustrated by the following pattern, which purports to match a  palin-         illustrated  by the following pattern, which purports to match a palin-
5986         dromic  string  that contains an odd number of characters (for example,         dromic string that contains an odd number of characters  (for  example,
5987         "a", "aba", "abcba", "abcdcba"):         "a", "aba", "abcba", "abcdcba"):
5988    
5989           ^(.|(.)(?1)\2)$           ^(.|(.)(?1)\2)$
5990    
5991         The idea is that it either matches a single character, or two identical         The idea is that it either matches a single character, or two identical
5992         characters  surrounding  a sub-palindrome. In Perl, this pattern works;         characters surrounding a sub-palindrome. In Perl, this  pattern  works;
5993         in PCRE it does not if the subject string is longer than three characters.         in PCRE it does not if the subject string is longer than three characters.
5994         Consider the subject string "abcba":         Consider the subject string "abcba":
5995    
5996         At  the  top level, the first character is matched, but as it is not at         At the top level, the first character is matched, but as it is  not  at
5997         the end of the string, the first alternative fails; the second alterna-         the end of the string, the first alternative fails; the second alterna-
5998         tive is taken and the recursion kicks in. The recursive call to subpat-         tive is taken and the recursion kicks in. The recursive call to subpat-
5999         tern 1 successfully matches the next character ("b").  (Note  that  the         tern  1  successfully  matches the next character ("b"). (Note that the
6000         beginning and end of line tests are not part of the recursion.)         beginning and end of line tests are not part of the recursion.)
6001    
6002         Back  at  the top level, the next character ("c") is compared with what         Back at the top level, the next character ("c") is compared  with  what
6003         subpattern 2 matched, which was "a". This fails. Because the  recursion         subpattern  2 matched, which was "a". This fails. Because the recursion
6004         is  treated  as  an atomic group, there are now no backtracking points,         is treated as an atomic group, there are now  no  backtracking  points,
6005         and so the entire match fails. (Perl is able, at  this  point,  to  re-         and  so  the  entire  match fails. (Perl is able, at this point, to re-
6006         enter  the  recursion  and try the second alternative.) However, if the         enter the recursion and try the second alternative.)  However,  if  the
6007         pattern is written with the alternatives in the other order, things are         pattern is written with the alternatives in the other order, things are
6008         different:         different:
6009    
6010           ^((.)(?1)\2|.)$           ^((.)(?1)\2|.)$
6011    
6012         This  time,  the recursing alternative is tried first, and continues to         This time, the recursing alternative is tried first, and  continues  to
6013         recurse until it runs out of characters, at which point  the  recursion         recurse  until  it runs out of characters, at which point the recursion
6014         fails.  But  this  time  we  do  have another alternative to try at the         fails. But this time we do have  another  alternative  to  try  at  the
6015         higher level. That is the big difference:  in  the  previous  case  the         higher  level.  That  is  the  big difference: in the previous case the
6016         remaining alternative is at a deeper recursion level, which PCRE cannot         remaining alternative is at a deeper recursion level, which PCRE cannot
6017         use.         use.
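The difference between an atomic and a re-enterable recursion can be modelled in a few lines. The following Python sketch is a toy model of the two odd-length palindrome patterns above, not an implementation of PCRE or Perl. The names group1_ends and matches are hypothetical; atomic=True keeps only the first way each recursive call matched (the behaviour described for PCRE), and atomic=False lets the caller retry every way (the behaviour described for Perl).

```python
import itertools


def group1_ends(subject, pos, atomic, recurse_first=False):
    """Yield the end offsets at which group 1 can match when tried at pos."""

    def single():
        # Alternative ".": any one character.
        if pos < len(subject):
            yield pos + 1

    def recursing():
        # Alternative "(.)(?1)\2": a character, a recursive call, the same character.
        if pos >= len(subject):
            return
        ch = subject[pos]
        inner = group1_ends(subject, pos + 1, atomic, recurse_first)
        if atomic:
            # Atomic call: keep only the first way the recursion matched.
            inner = list(itertools.islice(inner, 1))
        for end in inner:
            if end < len(subject) and subject[end] == ch:
                yield end + 1

    order = (recursing, single) if recurse_first else (single, recursing)
    for alternative in order:
        yield from alternative()


def matches(subject, atomic, recurse_first=False):
    # The ^ and $ anchors sit outside group 1, so they are checked here.
    return any(end == len(subject)
               for end in group1_ends(subject, 0, atomic, recurse_first))
```

In this model, matches("abcba", atomic=True) is False for the original alternative order and True with recurse_first=True, while matches("abcba", atomic=False) is True.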
6018    
6019         To change the pattern so that it matches all palindromic  strings,  not         To  change  the pattern so that it matches all palindromic strings, not
6020         just  those  with an odd number of characters, it is tempting to change         just those with an odd number of characters, it is tempting  to  change
6021         the pattern to this:         the pattern to this:
6022    
6023           ^((.)(?1)\2|.?)$           ^((.)(?1)\2|.?)$
6024    
6025         Again, this works in Perl, but not in PCRE, and for  the  same  reason.         Again,  this  works  in Perl, but not in PCRE, and for the same reason.
6026         When  a  deeper  recursion has matched a single character, it cannot be         When a deeper recursion has matched a single character,  it  cannot  be
6027         entered again in order to match an empty string.  The  solution  is  to         entered  again  in  order  to match an empty string. The solution is to
6028         separate  the two cases, and write out the odd and even cases as alter-         separate the two cases, and write out the odd and even cases as  alter-
6029         natives at the higher level:         natives at the higher level:
6030    
6031           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
6032    
6033         If you want to match typical palindromic phrases, the  pattern  has  to         If  you  want  to match typical palindromic phrases, the pattern has to
6034         ignore all non-word characters, which can be done like this:         ignore all non-word characters, which can be done like this:
6035    
6036           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
6037    
6038         If run with the PCRE_CASELESS option, this pattern matches phrases such         If run with the PCRE_CASELESS option, this pattern matches phrases such
6039         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
6040         Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-         Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
6041         ing into sequences of non-word characters. Without this, PCRE  takes  a         ing  into  sequences of non-word characters. Without this, PCRE takes a
6042         great  deal  longer  (ten  times or more) to match typical phrases, and         great deal longer (ten times or more) to  match  typical  phrases,  and
6043         Perl takes so long that you think it has gone into a loop.         Perl takes so long that you think it has gone into a loop.
6044    
6045         WARNING: The palindrome-matching patterns above work only if  the  sub-         WARNING:  The  palindrome-matching patterns above work only if the sub-
6046         ject  string  does not start with a palindrome that is shorter than the         ject string does not start with a palindrome that is shorter  than  the
6047         entire string.  For example, although "abcba" is correctly matched,  if         entire  string.  For example, although "abcba" is correctly matched, if
6048         the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,         the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
6049         then fails at top level because the end of the string does not  follow.         then  fails at top level because the end of the string does not follow.
6050         Once  again, it cannot jump back into the recursion to try other alter-         Once again, it cannot jump back into the recursion to try other  alter-
6051         natives, so the entire match fails.         natives, so the entire match fails.
6052    
6053         The second way in which PCRE and Perl differ in  their  recursion  pro-         The  second  way  in which PCRE and Perl differ in their recursion pro-
6054         cessing  is in the handling of captured values. In Perl, when a subpat-         cessing is in the handling of captured values. In Perl, when a  subpat-
6055         tern is called recursively or as a subpattern (see the  next  section),         tern  is  called recursively or as a subpattern (see the next section),
6056         it  has  no  access to any values that were captured outside the recur-         it has no access to any values that were captured  outside  the  recur-
6057         sion, whereas in PCRE these values can  be  referenced.  Consider  this         sion,  whereas  in  PCRE  these values can be referenced. Consider this
6058         pattern:         pattern:
6059    
6060           ^(.)(\1|a(?2))           ^(.)(\1|a(?2))
6061    
6062         In  PCRE,  this  pattern matches "bab". The first capturing parentheses         In PCRE, this pattern matches "bab". The  first  capturing  parentheses
6063         match "b", then in the second group, when the back reference  \1  fails         match  "b",  then in the second group, when the back reference \1 fails
6064         to  match "b", the second alternative matches "a" and then recurses. In         to match "b", the second alternative matches "a" and then recurses.  In
6065         the recursion, \1 does now match "b" and so the whole  match  succeeds.         the  recursion,  \1 does now match "b" and so the whole match succeeds.
6066         In  Perl,  the pattern fails to match because inside the recursive call         In Perl, the pattern fails to match because inside the  recursive  call
6067         \1 cannot access the externally set value.         \1 cannot access the externally set value.
6068    
6069    
6070  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
6071    
6072         If the syntax for a recursive subpattern call (either by number  or  by         If  the  syntax for a recursive subpattern call (either by number or by
6073         name)  is  used outside the parentheses to which it refers, it operates         name) is used outside the parentheses to which it refers,  it  operates
6074         like a subroutine in a programming language. The called subpattern  may         like  a subroutine in a programming language. The called subpattern may
6075         be  defined  before or after the reference. A numbered reference can be         be defined before or after the reference. A numbered reference  can  be
6076         absolute or relative, as in these examples:         absolute or relative, as in these examples:
6077    
6078           (...(absolute)...)...(?2)...           (...(absolute)...)...(?2)...
# Line 6080  SUBPATTERNS AS SUBROUTINES Line 6083  SUBPATTERNS AS SUBROUTINES
6083    
6084           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
6085    
6086         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
6087         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
6088    
6089           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
6090    
6091         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
6092         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
6093         above.         above.
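The contrast can be tried out with Python's re module, with one caveat: re supports back references but has no (?1) subroutine call. In this sketch the call is therefore emulated by repeating the text of group 1 as a non-capturing group, which is an assumption made for illustration and not how PCRE implements it.

```python
import re

GROUP = r"(sens|respons)"

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(GROUP + r"e and \1ibility")

# Stand-in for (?1): the pattern of group 1 is applied again from scratch.
subroutine_like = re.compile(GROUP + r"e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
backref_hits = [s for s in subjects if backref.fullmatch(s)]
subroutine_hits = [s for s in subjects if subroutine_like.fullmatch(s)]
```

Only the first two subjects end up in backref_hits; all three end up in subroutine_hits.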
6094    
6095         All  subroutine  calls, whether recursive or not, are always treated as         All subroutine calls, whether recursive or not, are always  treated  as
6096         atomic groups. That is, once a subroutine has matched some of the  sub-         atomic  groups. That is, once a subroutine has matched some of the sub-
6097         ject string, it is never re-entered, even if it contains untried alter-         ject string, it is never re-entered, even if it contains untried alter-
6098         natives and there is  a  subsequent  matching  failure.  Any  capturing         natives  and  there  is  a  subsequent  matching failure. Any capturing
6099         parentheses  that  are  set  during the subroutine call revert to their         parentheses that are set during the subroutine  call  revert  to  their
6100         previous values afterwards.         previous values afterwards.
6101    
6102         Processing options such as case-independence are fixed when  a  subpat-         Processing  options  such as case-independence are fixed when a subpat-
6103         tern  is defined, so if it is used as a subroutine, such options cannot         tern is defined, so if it is used as a subroutine, such options  cannot
6104         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
6105    
6106           (abc)(?i:(?-1))           (abc)(?i:(?-1))
6107    
6108         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
6109         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
6110    
6111    
6112  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
6113    
6114         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
6115         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
6116         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
6117         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
6118         ten using this syntax:         ten using this syntax:
6119    
6120           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
6121           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
6122    
6123         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
6124         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
6125    
6126           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
6127    
6128         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
6129         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
6130         call.         call.
6131    
6132    
6133  CALLOUTS  CALLOUTS
6134    
6135         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
6136         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
6137         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
6138         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
6139         tion.         tion.
6140    
6141         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
6142         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
6143         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
6144         pcre_callout  (8-bit  library)  or  pcre16_callout (16-bit library). By         pcre_callout (8-bit library) or  pcre16_callout  (16-bit  library).  By
6145         default, this variable contains NULL, which disables all calling out.         default, this variable contains NULL, which disables all calling out.
6146    
6147         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
6148         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
6149         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
6150         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
6151         points:         points:
6152    
6153           (?C1)abc(?C2)def           (?C1)abc(?C2)def
6154    
6155         If the PCRE_AUTO_CALLOUT flag is passed to a compiling function,  call-         If  the PCRE_AUTO_CALLOUT flag is passed to a compiling function, call-
6156         outs  are automatically installed before each item in the pattern. They         outs are automatically installed before each item in the pattern.  They
6157         are all numbered 255.         are all numbered 255.
6158    
6159         During matching, when PCRE reaches a callout point, the external  func-         During  matching, when PCRE reaches a callout point, the external func-
6160         tion  is  called.  It  is  provided with the number of the callout, the         tion is called. It is provided with the  number  of  the  callout,  the
6161         position in the pattern, and, optionally, one item of  data  originally         position  in  the pattern, and, optionally, one item of data originally
6162         supplied  by  the caller of the matching function. The callout function         supplied by the caller of the matching function. The  callout  function
6163         may cause matching to proceed, to backtrack, or to fail  altogether.  A         may  cause  matching to proceed, to backtrack, or to fail altogether. A
6164         complete  description of the interface to the callout function is given         complete description of the interface to the callout function is  given
6165         in the pcrecallout documentation.         in the pcrecallout documentation.
6166    
6167    
6168  BACKTRACKING CONTROL  BACKTRACKING CONTROL
6169    
6170         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
6171         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
6172         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
6173         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
6174         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
6175         in this section.         in this section.
6176    
6177         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
6178         them can be used only when the pattern is to be matched  using  one  of         them  can  be  used only when the pattern is to be matched using one of
6179         the traditional matching functions, which use a backtracking algorithm.         the traditional matching functions, which use a backtracking algorithm.
6180         With the exception of (*FAIL), which behaves like  a  failing  negative         With  the  exception  of (*FAIL), which behaves like a failing negative
6181         assertion,  they  cause an error if encountered by a DFA matching func-         assertion, they cause an error if encountered by a DFA  matching  func-
6182         tion.         tion.
6183    
6184         If any of these verbs are used in an assertion or in a subpattern  that         If  any of these verbs are used in an assertion or in a subpattern that
6185         is called as a subroutine (whether or not recursively), their effect is         is called as a subroutine (whether or not recursively), their effect is
6186         confined to that subpattern; it does not extend to the surrounding pat-         confined to that subpattern; it does not extend to the surrounding pat-
6187         tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)         tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
6188         that is encountered in a successful positive assertion is  passed  back         that  is  encountered in a successful positive assertion is passed back
6189         when  a  match  succeeds (compare capturing parentheses in assertions).         when a match succeeds (compare capturing  parentheses  in  assertions).
6190         Note that such subpatterns are processed as anchored at the point where         Note that such subpatterns are processed as anchored at the point where
6191         they are tested. Note also that Perl's treatment of subroutines is dif-         they are tested. Note also that Perl's treatment of subroutines is dif-
6192         ferent in some cases.         ferent in some cases.
6193    
6194         The new verbs make use of what was previously invalid syntax: an  open-         The  new verbs make use of what was previously invalid syntax: an open-
6195         ing parenthesis followed by an asterisk. They are generally of the form         ing parenthesis followed by an asterisk. They are generally of the form
6196         (*VERB) or (*VERB:NAME). Some may take either form, with differing  be-         (*VERB)  or (*VERB:NAME). Some may take either form, with differing be-
6197         haviour,  depending on whether or not an argument is present. A name is         haviour, depending on whether or not an argument is present. A name  is
6198         any sequence of characters that does not include a closing parenthesis.         any sequence of characters that does not include a closing parenthesis.
6199         If  the  name is empty, that is, if the closing parenthesis immediately         If the name is empty, that is, if the closing  parenthesis  immediately
6200         follows the colon, the effect is as if the colon were  not  there.  Any         follows  the  colon,  the effect is as if the colon were not there. Any
6201         number of these verbs may occur in a pattern.         number of these verbs may occur in a pattern.
6202    
6203     Optimizations that affect backtracking verbs     Optimizations that affect backtracking verbs
6204    
6205         PCRE  contains some optimizations that are used to speed up matching by         PCRE contains some optimizations that are used to speed up matching  by
6206         running some checks at the start of each match attempt. For example, it         running some checks at the start of each match attempt. For example, it
6207         may  know  the minimum length of matching subject, or that a particular         may know the minimum length of matching subject, or that  a  particular
6208         character must be present. When one of these  optimizations  suppresses         character  must  be present. When one of these optimizations suppresses
6209         the  running  of  a match, any included backtracking verbs will not, of         the running of a match, any included backtracking verbs  will  not,  of
6210         course, be processed. You can suppress the start-of-match optimizations         course, be processed. You can suppress the start-of-match optimizations
6211         by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-         by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-
6212         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
6213         There is more discussion of this option in the section entitled "Option         There is more discussion of this option in the section entitled "Option
6214         bits for pcre_exec()" in the pcreapi documentation.         bits for pcre_exec()" in the pcreapi documentation.
6215    
6216         Experiments with Perl suggest that it too  has  similar  optimizations,         Experiments  with  Perl  suggest that it too has similar optimizations,
6217         sometimes leading to anomalous results.         sometimes leading to anomalous results.
6218    
6219     Verbs that act immediately     Verbs that act immediately
6220    
6221         The  following  verbs act as soon as they are encountered. They may not         The following verbs act as soon as they are encountered. They  may  not
6222         be followed by a name.         be followed by a name.
6223    
6224            (*ACCEPT)            (*ACCEPT)
6225    
6226         This verb causes the match to end successfully, skipping the  remainder         This  verb causes the match to end successfully, skipping the remainder
6227         of  the pattern. However, when it is inside a subpattern that is called         of the pattern. However, when it is inside a subpattern that is  called
6228         as a subroutine, only that subpattern is ended  successfully.  Matching         as  a  subroutine, only that subpattern is ended successfully. Matching
6229         then  continues  at  the  outer level. If (*ACCEPT) is inside capturing         then continues at the outer level. If  (*ACCEPT)  is  inside  capturing
6230         parentheses, the data so far is captured. For example:         parentheses, the data so far is captured. For example:
6231    
6232           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
6233    
6234         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
6235         tured by the outer parentheses.         tured by the outer parentheses.
6236    
6237           (*FAIL) or (*F)           (*FAIL) or (*F)
6238    
6239         This  verb causes a matching failure, forcing backtracking to occur. It         This verb causes a matching failure, forcing backtracking to occur.  It
6240         is equivalent to (?!) but easier to read. The Perl documentation  notes         is  equivalent to (?!) but easier to read. The Perl documentation notes
6241         that  it  is  probably  useful only when combined with (?{}) or (??{}).         that it is probably useful only when combined  with  (?{})  or  (??{}).
6242         Those are, of course, Perl features that are not present in  PCRE.  The         Those  are,  of course, Perl features that are not present in PCRE. The
6243         nearest  equivalent is the callout feature, as for example in this pat-         nearest equivalent is the callout feature, as for example in this  pat-
6244         tern:         tern:
6245    
6246           a+(?C)(*FAIL)           a+(?C)(*FAIL)
6247    
6248         A match with the string "aaaa" always fails, but the callout  is  taken         A  match  with the string "aaaa" always fails, but the callout is taken
6249         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
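
The figure of 10 can be reproduced with a small model of the backtracking. This is only a sketch in Python, not a call into PCRE: the function name count_callouts is hypothetical, and it assumes that no start-of-match optimization is applied, so every starting offset in the subject is tried.

```python
def count_callouts(subject):
    """Count callouts for a+(?C)(*FAIL) in a simple backtracking model.

    Assumption: every start offset is tried (no start-of-match optimization).
    """
    total = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this start offset.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ can match run, run-1, ..., 1 characters; each length reaches
        # the callout once before (*FAIL) forces the next backtrack.
        total += run
    return total


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```

For "aaaa" the four starting offsets contribute 4, 3, 2 and 1 callouts respectively.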
6250    
6251     Recording which path was taken     Recording which path was taken
6252    
6253         There  is  one  verb  whose  main  purpose  is to track how a match was         There is one verb whose main purpose  is  to  track  how  a  match  was
6254         arrived at, though it also has a  secondary  use  in  conjunction  with         arrived  at,  though  it  also  has a secondary use in conjunction with
6255         advancing the match starting point (see (*SKIP) below).         advancing the match starting point (see (*SKIP) below).
6256    
6257           (*MARK:NAME) or (*:NAME)           (*MARK:NAME) or (*:NAME)
6258    
6259         A  name  is  always  required  with  this  verb.  There  may be as many         A name is always  required  with  this  verb.  There  may  be  as  many
6260         instances of (*MARK) as you like in a pattern, and their names  do  not         instances  of  (*MARK) as you like in a pattern, and their names do not
6261         have to be unique.         have to be unique.
6262    
6263         When  a match succeeds, the name of the last-encountered (*MARK) on the         When a match succeeds, the name of the last-encountered (*MARK) on  the
6264         matching path is passed back to the caller as described in the  section         matching  path is passed back to the caller as described in the section
6265         entitled  "Extra  data  for  pcre_exec()" in the pcreapi documentation.         entitled "Extra data for pcre_exec()"  in  the  pcreapi  documentation.
6266         Here is an example of pcretest output, where the /K  modifier  requests         Here  is  an example of pcretest output, where the /K modifier requests
6267         the retrieval and outputting of (*MARK) data:         the retrieval and outputting of (*MARK) data:
6268    
6269             re> /X(*MARK:A)Y|X(*MARK:B)Z/K             re> /X(*MARK:A)Y|X(*MARK:B)Z/K
# Line 6272  BACKTRACKING CONTROL Line 6275  BACKTRACKING CONTROL
6275           MK: B           MK: B
6276    
6277         The (*MARK) name is tagged with "MK:" in this output, and in this exam-         The (*MARK) name is tagged with "MK:" in this output, and in this exam-
6278         ple it indicates which of the two alternatives matched. This is a  more         ple  it indicates which of the two alternatives matched. This is a more
6279         efficient  way of obtaining this information than putting each alterna-         efficient way of obtaining this information than putting each  alterna-
6280         tive in its own capturing parentheses.         tive in its own capturing parentheses.
6281    
6282         If (*MARK) is encountered in a positive assertion, its name is recorded         If (*MARK) is encountered in a positive assertion, its name is recorded
6283         and passed back if it is the last-encountered. This does not happen for         and passed back if it is the last-encountered. This does not happen for
6284         negative assertions.         negative assertions.
6285    
6286         After a partial match or a failed match, the name of the  last  encoun-         After  a  partial match or a failed match, the name of the last encoun-
6287         tered (*MARK) in the entire match process is returned. For example:         tered (*MARK) in the entire match process is returned. For example:
6288    
6289             re> /X(*MARK:A)Y|X(*MARK:B)Z/K             re> /X(*MARK:A)Y|X(*MARK:B)Z/K
6290           data> XP           data> XP
6291           No match, mark = B           No match, mark = B
6292    
6293         Note  that  in  this  unanchored  example the mark is retained from the         Note that in this unanchored example the  mark  is  retained  from  the
6294         match attempt that started at the letter "X" in the subject. Subsequent         match attempt that started at the letter "X" in the subject. Subsequent
6295         match attempts starting at "P" and then with an empty string do not get         match attempts starting at "P" and then with an empty string do not get
6296         as far as the (*MARK) item, but nevertheless do not reset it.         as far as the (*MARK) item, but nevertheless do not reset it.
6297    
6298         If you are interested in  (*MARK)  values  after  failed  matches,  you         If  you  are  interested  in  (*MARK)  values after failed matches, you
6299         should  probably  set  the PCRE_NO_START_OPTIMIZE option (see above) to         should probably set the PCRE_NO_START_OPTIMIZE option  (see  above)  to
6300         ensure that the match is always attempted.         ensure that the match is always attempted.
6301    
6302     Verbs that act after backtracking     Verbs that act after backtracking
6303    
6304         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
6305         tinues  with what follows, but if there is no subsequent match, causing         tinues with what follows, but if there is no subsequent match,  causing
6306         a backtrack to the verb, a failure is  forced.  That  is,  backtracking         a  backtrack  to  the  verb, a failure is forced. That is, backtracking
6307         cannot  pass  to the left of the verb. However, when one of these verbs         cannot pass to the left of the verb. However, when one of  these  verbs
6308         appears inside an atomic group, its effect is confined to  that  group,         appears  inside  an atomic group, its effect is confined to that group,
6309         because  once the group has been matched, there is never any backtrack-         because once the group has been matched, there is never any  backtrack-
6310         ing into it. In this situation, backtracking can  "jump  back"  to  the         ing  into  it.  In  this situation, backtracking can "jump back" to the
6311         left  of the entire atomic group. (Remember also, as stated above, that         left of the entire atomic group. (Remember also, as stated above,  that
6312         this localization also applies in subroutine calls and assertions.)         this localization also applies in subroutine calls and assertions.)
6313    
6314         These verbs differ in exactly what kind of failure  occurs  when  back-         These  verbs  differ  in exactly what kind of failure occurs when back-
6315         tracking reaches them.         tracking reaches them.
6316    
6317           (*COMMIT)           (*COMMIT)
6318    
6319         This  verb, which may not be followed by a name, causes the whole match         This verb, which may not be followed by a name, causes the whole  match
6320         to fail outright if the rest of the pattern does not match. Even if the         to fail outright if the rest of the pattern does not match. Even if the
6321         pattern is unanchored, no further attempts to find a match by advancing         pattern is unanchored, no further attempts to find a match by advancing
6322         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
6323         pcre_exec()  is  committed  to  finding a match at the current starting         pcre_exec() is committed to finding a match  at  the  current  starting
6324         point, or not at all. For example:         point, or not at all. For example:
6325    
6326           a+(*COMMIT)b           a+(*COMMIT)b
6327    
6328         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
6329         of dynamic anchor, or "I've started, so I must finish." The name of the         of dynamic anchor, or "I've started, so I must finish." The name of the
6330         most recently passed (*MARK) in the path is passed back when  (*COMMIT)         most  recently passed (*MARK) in the path is passed back when (*COMMIT)
6331         forces a match failure.         forces a match failure.
6332    
6333         Note  that  (*COMMIT)  at  the start of a pattern is not the same as an         Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
6334         anchor, unless PCRE's start-of-match optimizations are turned  off,  as         anchor,  unless  PCRE's start-of-match optimizations are turned off, as
6335         shown in this pcretest example:         shown in this pcretest example:
6336    
6337             re> /(*COMMIT)abc/             re> /(*COMMIT)abc/
# Line 6337  BACKTRACKING CONTROL Line 6340  BACKTRACKING CONTROL
6340           xyzabc\Y           xyzabc\Y
6341           No match           No match
6342    
6343         PCRE  knows  that  any  match  must start with "a", so the optimization         PCRE knows that any match must start  with  "a",  so  the  optimization
6344         skips along the subject to "a" before running the first match  attempt,         skips  along the subject to "a" before running the first match attempt,
6345         which  succeeds.  When the optimization is disabled by the \Y escape in         which succeeds. When the optimization is disabled by the \Y  escape  in
6346         the second subject, the match starts at "x" and so the (*COMMIT) causes         the second subject, the match starts at "x" and so the (*COMMIT) causes
6347         it to fail without trying any other starting points.         it to fail without trying any other starting points.
6348    
6349           (*PRUNE) or (*PRUNE:NAME)           (*PRUNE) or (*PRUNE:NAME)
6350    
6351         This  verb causes the match to fail at the current starting position in         This verb causes the match to fail at the current starting position  in
6352         the subject if the rest of the pattern does not match. If  the  pattern         the  subject  if the rest of the pattern does not match. If the pattern
6353         is  unanchored,  the  normal  "bumpalong"  advance to the next starting         is unanchored, the normal "bumpalong"  advance  to  the  next  starting
6354         character then happens. Backtracking can occur as usual to the left  of         character  then happens. Backtracking can occur as usual to the left of
6355         (*PRUNE),  before  it  is  reached,  or  when  matching to the right of         (*PRUNE), before it is reached,  or  when  matching  to  the  right  of
6356         (*PRUNE), but if there is no match to the  right,  backtracking  cannot         (*PRUNE),  but  if  there is no match to the right, backtracking cannot
6357         cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-         cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter-
6358         native to an atomic group or possessive quantifier, but there are  some         native  to an atomic group or possessive quantifier, but there are some
6359         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
6360         iour of (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE).  In  an         iour  of  (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE). In an
6361         anchored pattern (*PRUNE) has the same effect as (*COMMIT).         anchored pattern (*PRUNE) has the same effect as (*COMMIT).
6362    
6363           (*SKIP)           (*SKIP)
6364    
6365         This  verb, when given without a name, is like (*PRUNE), except that if         This verb, when given without a name, is like (*PRUNE), except that  if
6366         the pattern is unanchored, the "bumpalong" advance is not to  the  next         the  pattern  is unanchored, the "bumpalong" advance is not to the next
6367         character, but to the position in the subject where (*SKIP) was encoun-         character, but to the position in the subject where (*SKIP) was encoun-
6368         tered. (*SKIP) signifies that whatever text was matched leading  up  to         tered.  (*SKIP)  signifies that whatever text was matched leading up to
6369         it cannot be part of a successful match. Consider:         it cannot be part of a successful match. Consider:
6370    
6371           a+(*SKIP)b           a+(*SKIP)b
6372    
6373         If  the  subject  is  "aaaac...",  after  the first match attempt fails         If the subject is "aaaac...",  after  the  first  match  attempt  fails
6374         (starting at the first character in the  string),  the  starting  point         (starting  at  the  first  character in the string), the starting point
6375         skips on to start the next attempt at "c". Note that a possessive quan-         skips on to start the next attempt at "c". Note that a possessive quan-
6376         tifier does not have the same effect as this example; although it would         tifier does not have the same effect as this example; although it would
6377         suppress  backtracking  during  the  first  match  attempt,  the second         suppress backtracking  during  the  first  match  attempt,  the  second
6378         attempt would start at the second character instead of skipping  on  to         attempt  would  start at the second character instead of skipping on to
6379         "c".         "c".
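
The difference in "bumpalong" behaviour can be illustrated with a toy matcher. The Python below is a sketch of the idea only, not PCRE itself: the function start_positions and its use_skip parameter are hypothetical, it handles just the pattern a+b, and it assumes that start-of-match optimizations are disabled so that each starting offset (including the empty string at the end of the subject) is attempted.

```python
def start_positions(subject, use_skip):
    """Model a+(*SKIP)b (use_skip=True) or a++b (use_skip=False).

    Returns (list of start offsets tried, (start, end) of the match or None).
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        pos = start
        # Greedy a+: consume the whole run of "a" characters.
        while pos < len(subject) and subject[pos] == "a":
            pos += 1
        matched_a = pos > start
        if matched_a and pos < len(subject) and subject[pos] == "b":
            return tried, (start, pos + 1)
        if use_skip and matched_a:
            # (*SKIP) was passed at offset pos, so the next attempt starts there.
            start = pos
        else:
            # Normal bumpalong: advance by one character.
            start += 1
    return tried, None


print(start_positions("aaaac", use_skip=True))   # ([0, 4, 5], None)
print(start_positions("aaaac", use_skip=False))  # ([0, 1, 2, 3, 4, 5], None)
```

With (*SKIP) the second attempt begins at "c"; with only a possessive quantifier the second attempt begins at the second "a".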
6380    
6381           (*SKIP:NAME)           (*SKIP:NAME)
6382    
6383         When  (*SKIP) has an associated name, its behaviour is modified. If the         When (*SKIP) has an associated name, its behaviour is modified. If  the
6384         following pattern fails to match, the previous path through the pattern         following pattern fails to match, the previous path through the pattern
6385         is  searched for the most recent (*MARK) that has the same name. If one         is searched for the most recent (*MARK) that has the same name. If  one
6386         is found, the "bumpalong" advance is to the subject position that  cor-         is  found, the "bumpalong" advance is to the subject position that cor-
6387         responds  to  that (*MARK) instead of to where (*SKIP) was encountered.         responds to that (*MARK) instead of to where (*SKIP)  was  encountered.
6388         If no (*MARK) with a matching name is found, the (*SKIP) is ignored.         If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
6389    
6390           (*THEN) or (*THEN:NAME)           (*THEN) or (*THEN:NAME)
6391    
6392         This verb causes a skip to the next innermost alternative if  the  rest         This  verb  causes a skip to the next innermost alternative if the rest
6393         of  the  pattern does not match. That is, it cancels pending backtrack-         of the pattern does not match. That is, it cancels  pending  backtrack-
6394         ing, but only within the current alternative. Its name comes  from  the         ing,  but  only within the current alternative. Its name comes from the
6395         observation that it can be used for a pattern-based if-then-else block:         observation that it can be used for a pattern-based if-then-else block:
6396    
6397           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
6398    
6399         If  the COND1 pattern matches, FOO is tried (and possibly further items         If the COND1 pattern matches, FOO is tried (and possibly further  items
6400         after the end of the group if FOO succeeds); on  failure,  the  matcher         after  the  end  of the group if FOO succeeds); on failure, the matcher
6401         skips  to  the second alternative and tries COND2, without backtracking         skips to the second alternative and tries COND2,  without  backtracking
6402         into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as         into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as
6403         (*MARK:NAME)(*THEN).   If (*THEN) is not inside an alternation, it acts         (*MARK:NAME)(*THEN).  If (*THEN) is not inside an alternation, it  acts
6404         like (*PRUNE).         like (*PRUNE).
6405    
6406         Note that a subpattern that does not contain a | character  is  just  a         Note  that  a  subpattern that does not contain a | character is just a
6407         part  of the enclosing alternative; it is not a nested alternation with         part of the enclosing alternative; it is not a nested alternation  with
6408         only one alternative. The effect of (*THEN) extends beyond such a  sub-         only  one alternative. The effect of (*THEN) extends beyond such a sub-
6409         pattern  to  the enclosing alternative. Consider this pattern, where A,         pattern to the enclosing alternative. Consider this pattern,  where  A,
6410         B, etc. are complex pattern fragments that do not contain any | charac-         B, etc. are complex pattern fragments that do not contain any | charac-
6411         ters at this level:         ters at this level:
6412    
6413           A (B(*THEN)C) | D           A (B(*THEN)C) | D
6414    
6415         If  A and B are matched, but there is a failure in C, matching does not         If A and B are matched, but there is a failure in C, matching does  not
6416         backtrack into A; instead it moves to the next alternative, that is, D.         backtrack into A; instead it moves to the next alternative, that is, D.
6417         However,  if the subpattern containing (*THEN) is given an alternative,         However, if the subpattern containing (*THEN) is given an  alternative,
6418         it behaves differently:         it behaves differently:
6419    
6420           A (B(*THEN)C | (*FAIL)) | D           A (B(*THEN)C | (*FAIL)) | D
6421    
6422         The effect of (*THEN) is now confined to the inner subpattern. After  a         The  effect of (*THEN) is now confined to the inner subpattern. After a
6423         failure in C, matching moves to (*FAIL), which causes the whole subpat-         failure in C, matching moves to (*FAIL), which causes the whole subpat-
6424         tern to fail because there are no more alternatives  to  try.  In  this         tern  to  fail  because  there are no more alternatives to try. In this
6425         case, matching does now backtrack into A.         case, matching does now backtrack into A.
6426    
6427         Note also that a conditional subpattern is not considered as having two         Note also that a conditional subpattern is not considered as having two
6428         alternatives, because only one is ever used.  In  other  words,  the  |         alternatives,  because  only  one  is  ever used. In other words, the |
6429         character in a conditional subpattern has a different meaning. Ignoring         character in a conditional subpattern has a different meaning. Ignoring
6430         white space, consider:         white space, consider:
6431    
6432           ^.*? (?(?=a) a | b(*THEN)c )           ^.*? (?(?=a) a | b(*THEN)c )
6433    
6434         If the subject is "ba", this pattern does not  match.  Because  .*?  is         If  the  subject  is  "ba", this pattern does not match. Because .*? is
6435         ungreedy,  it  initially  matches  zero characters. The condition (?=a)         ungreedy, it initially matches zero  characters.  The  condition  (?=a)
6436         then fails, the character "b" is matched,  but  "c"  is  not.  At  this         then  fails,  the  character  "b"  is  matched, but "c" is not. At this
6437         point,  matching does not backtrack to .*? as might perhaps be expected         point, matching does not backtrack to .*? as might perhaps be  expected
6438         from the presence of the | character.  The  conditional  subpattern  is         from  the  presence  of  the | character. The conditional subpattern is
6439         part of the single alternative that comprises the whole pattern, and so         part of the single alternative that comprises the whole pattern, and so
6440         the match fails. (If there was a backtrack into  .*?,  allowing  it  to         the  match  fails.  (If  there was a backtrack into .*?, allowing it to
6441         match "b", the match would succeed.)         match "b", the match would succeed.)
6442    
6443         The  verbs just described provide four different "strengths" of control         The verbs just described provide four different "strengths" of  control
6444         when subsequent matching fails. (*THEN) is the weakest, carrying on the         when subsequent matching fails. (*THEN) is the weakest, carrying on the
6445         match  at  the next alternative. (*PRUNE) comes next, failing the match         match at the next alternative. (*PRUNE) comes next, failing  the  match
6446         at the current starting position, but allowing an advance to  the  next         at  the  current starting position, but allowing an advance to the next
6447         character  (for an unanchored pattern). (*SKIP) is similar, except that         character (for an unanchored pattern). (*SKIP) is similar, except  that
6448         the advance may be more than one character. (*COMMIT) is the strongest,         the advance may be more than one character. (*COMMIT) is the strongest,
6449         causing the entire match to fail.         causing the entire match to fail.
6450    
# Line 6451  BACKTRACKING CONTROL Line 6454  BACKTRACKING CONTROL
6454    
6455           (A(*COMMIT)B(*THEN)C|D)           (A(*COMMIT)B(*THEN)C|D)
6456    
6457         Once  A  has  matched,  PCRE is committed to this match, at the current         Once A has matched, PCRE is committed to this  match,  at  the  current
6458         starting position. If subsequently B matches, but C does not, the  nor-         starting  position. If subsequently B matches, but C does not, the nor-
6459         mal (*THEN) action of trying the next alternative (that is, D) does not         mal (*THEN) action of trying the next alternative (that is, D) does not
6460         happen because (*COMMIT) overrides.         happen because (*COMMIT) overrides.
6461    
6462    
6463  SEE ALSO  SEE ALSO
6464    
6465         pcreapi(3), pcrecallout(3),  pcrematching(3),  pcresyntax(3),  pcre(3),         pcreapi(3),  pcrecallout(3),  pcrematching(3),  pcresyntax(3), pcre(3),
6466         pcre16(3).         pcre16(3).
6467    
6468    
# Line 6472  AUTHOR Line 6475  AUTHOR
6475    
6476  REVISION  REVISION
6477    
6478         Last updated: 24 February 2012         Last updated: 14 April 2012
6479         Copyright (c) 1997-2012 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
6480  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6481    
# Line 6915  UNICODE PROPERTY SUPPORT Line 6918  UNICODE PROPERTY SUPPORT
6918    
6919         When you set the PCRE_UTF8 flag, the byte strings  passed  as  patterns         When you set the PCRE_UTF8 flag, the byte strings  passed  as  patterns
6920         and subjects are (by default) checked for validity on entry to the rel-         and subjects are (by default) checked for validity on entry to the rel-
6921         evant functions. From release 7.3 of PCRE, the check is according to the         evant functions. The entire string is checked before any other process-
6922           ing takes place. From release 7.3 of PCRE, the check is according to the
6923         rules of RFC 3629, which are themselves derived from the Unicode speci-         rules of RFC 3629, which are themselves derived from the Unicode speci-
6924         fication. Earlier releases of PCRE followed  the  rules  of  RFC  2279,         fication.  Earlier  releases  of  PCRE  followed the rules of RFC 2279,
6925         which  allows  the  full  range of 31-bit values (0 to 0x7FFFFFFF). The         which allows the full range of 31-bit values  (0  to  0x7FFFFFFF).  The
6926         current check allows only values in the range U+0 to U+10FFFF,  exclud-         current  check allows only values in the range U+0 to U+10FFFF, exclud-
6927         ing U+D800 to U+DFFF.         ing U+D800 to U+DFFF.
6928    
6929         The  excluded code points are the "Surrogate Area" of Unicode. They are         The excluded code points are the "Surrogate Area" of Unicode. They  are
6930         reserved for use by UTF-16, where they are  used  in  pairs  to  encode         reserved  for  use  by  UTF-16,  where they are used in pairs to encode
6931         codepoints  with  values  greater than 0xFFFF. The code points that are         codepoints with values greater than 0xFFFF. The code  points  that  are
6932         encoded by UTF-16 pairs are available independently in the UTF-8 encod-         encoded by UTF-16 pairs are available independently in the UTF-8 encod-
6933         ing.  (In  other words, the whole surrogate thing is a fudge for UTF-16         ing. (In other words, the whole surrogate thing is a fudge  for  UTF-16
6934         which unfortunately messes up UTF-8.)         which unfortunately messes up UTF-8.)
6935    
6936         If an invalid UTF-8 string is passed to PCRE, an error return is given.         If an invalid UTF-8 string is passed to PCRE, an error return is given.
6937         At  compile  time, the only additional information is the offset to the         At compile time, the only additional information is the offset  to  the
6938         first byte of the failing character. The runtime functions  pcre_exec()         first  byte of the failing character. The runtime functions pcre_exec()
6939         and  pcre_dfa_exec() also pass back this information, as well as a more         and pcre_dfa_exec() also pass back this information, as well as a  more
6940         detailed reason code if the caller has provided memory in which  to  do         detailed  reason  code if the caller has provided memory in which to do
6941         this.         this.
6942    
6943         In  some  situations, you may already know that your strings are valid,         In some situations, you may already know that your strings  are  valid,
6944         and therefore want to skip these checks in  order  to  improve  perfor-         and  therefore  want  to  skip these checks in order to improve perfor-
6945         mance. If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run         mance, for example in the case of a long subject string that  is  being
6946         time, PCRE assumes that the pattern or subject  it  is  given  (respec-         scanned   repeatedly   with   different   patterns.   If  you  set  the
6947         tively)  contains  only  valid  UTF-8  codes. In this case, it does not         PCRE_NO_UTF8_CHECK flag at compile time or at run  time,  PCRE  assumes
6948         diagnose an invalid UTF-8 string.         that  the  pattern  or subject it is given (respectively) contains only
6949           valid UTF-8 codes. In this case, it does not diagnose an invalid  UTF-8
6950           string.
6951    
6952         If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,         If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
6953         what  happens  depends on why the string is invalid. If the string con-         what happens depends on why the string is invalid. If the  string  con-
6954         forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a         forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
6955         string  of  characters  in the range 0 to 0x7FFFFFFF by pcre_dfa_exec()         string of characters in the range 0 to  0x7FFFFFFF  by  pcre_dfa_exec()
6956         and the interpreted version of pcre_exec(). In other words, apart  from         and  the interpreted version of pcre_exec(). In other words, apart from
6957         the  initial validity test, these functions (when in UTF-8 mode) handle         the initial validity test, these functions (when in UTF-8 mode)  handle
6958         strings according to the more liberal rules of RFC 2279.  However,  the         strings  according  to the more liberal rules of RFC 2279. However, the
6959         just-in-time (JIT) optimization for pcre_exec() supports only RFC 3629.         just-in-time (JIT) optimization for pcre_exec() supports only RFC 3629.
6960         If you are using JIT optimization, or if the string does not even  con-         If  you are using JIT optimization, or if the string does not even con-
6961         form to RFC 2279, the result is undefined. Your program may crash.         form to RFC 2279, the result is undefined. Your program may crash.
6962    
6963         If  you  want  to  process  strings  of  values  in the full range 0 to         If you want to process strings  of  values  in  the  full  range  0  to
6964         0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you  can         0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can
6965         set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in         set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
6966         this situation, you will have to apply your  own  validity  check,  and         this  situation,  you  will  have to apply your own validity check, and
6967         avoid the use of JIT optimization.         avoid the use of JIT optimization.
6968    
6969     Validity of UTF-16 strings     Validity of UTF-16 strings
6970    
6971         When you set the PCRE_UTF16 flag, the strings of 16-bit data units that         When you set the PCRE_UTF16 flag, the strings of 16-bit data units that
6972         are passed as patterns and subjects are (by default) checked for valid-         are passed as patterns and subjects are (by default) checked for valid-
6973         ity  on entry to the relevant functions. Values other than those in the         ity on entry to the relevant functions. Values other than those in  the
6974         surrogate range U+D800 to U+DFFF are independent code points. Values in         surrogate range U+D800 to U+DFFF are independent code points. Values in
6975         the surrogate range must be used in pairs in the correct manner.         the surrogate range must be used in pairs in the correct manner.
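
The pairing rule can be sketched as follows. This Python is an illustration, not PCRE's implementation: the function name first_invalid_utf16 is hypothetical, and it assumes the standard UTF-16 rule that a high surrogate (0xD800 to 0xDBFF) must be followed immediately by a low surrogate (0xDC00 to 0xDFFF). It returns the offset of the first data unit of the failing character, or -1 if the sequence is valid.

```python
def first_invalid_utf16(units):
    """Return the offset of the first badly paired surrogate, or -1 if none."""
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: the next unit must be a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return i
        if 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            return i
        # Any other value is an independent code point.
        i += 1
    return -1


print(first_invalid_utf16([0x0041, 0xD83D, 0xDE00]))  # -1 (valid pair)
print(first_invalid_utf16([0x0041, 0xDE00]))          # 1 (isolated low surrogate)
```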
6976    
6977         If  an  invalid  UTF-16  string  is  passed to PCRE, an error return is         If an invalid UTF-16 string is passed  to  PCRE,  an  error  return  is
6978         given. At compile time, the only additional information is  the  offset         given.  At  compile time, the only additional information is the offset
6979         to  the first data unit of the failing character. The runtime functions         to the first data unit of the failing character. The runtime  functions
6980         pcre16_exec() and pcre16_dfa_exec() also pass back this information, as         pcre16_exec() and pcre16_dfa_exec() also pass back this information, as
6981         well  as  a more detailed reason code if the caller has provided memory         well as a more detailed reason code if the caller has  provided  memory
6982         in which to do this.         in which to do this.
6983    
6984         In some situations, you may already know that your strings  are  valid,         In  some  situations, you may already know that your strings are valid,
6985         and  therefore  want  to  skip these checks in order to improve perfor-         and therefore want to skip these checks in  order  to  improve  perfor-
6986         mance. If you set the PCRE_NO_UTF16_CHECK flag at compile  time  or  at         mance.  If  you  set the PCRE_NO_UTF16_CHECK flag at compile time or at
6987         run time, PCRE assumes that the pattern or subject it is given (respec-         run time, PCRE assumes that the pattern or subject it is given (respec-
6988         tively) contains only valid UTF-16 sequences. In this case, it does not         tively) contains only valid UTF-16 sequences. In this case, it does not
6989         diagnose an invalid UTF-16 string.         diagnose an invalid UTF-16 string.
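
The surrogate pairing rule described above can be sketched in a few lines of C. This is only an illustration: utf16_first_invalid() is a hypothetical helper, not a PCRE function, and PCRE's own validity check may apply rules beyond surrogate pairing. Like the compile-time error information, it reports the offset of the first data unit of the failing character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE). Returns -1 if s[0..len-1] obeys
   the surrogate pairing rule, otherwise the offset of the first data unit
   of the failing character. */
long utf16_first_invalid(const uint16_t *s, size_t len)
{
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        uint16_t u = s[i];
        if (u < 0xD800 || u > 0xDFFF)
            continue;              /* independent code point */
        if (u >= 0xDC00)
            return (long)i;        /* low surrogate with no high before it */
        /* u is a high surrogate: a low surrogate must follow */
        if (i + 1 >= len || s[i + 1] < 0xDC00 || s[i + 1] > 0xDFFF)
            return (long)i;
        i++;                       /* skip the low half of the pair */
    }
    return -1;
}
```

A caller that has already run a check of this kind on its data is the sort of caller for whom setting PCRE_NO_UTF16_CHECK might be reasonable.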
6990    
6991     General comments about UTF modes     General comments about UTF modes
6992    
6993         1.  Codepoints  less  than  256  can  be  specified by either braced or         1. Codepoints less than 256  can  be  specified  by  either  braced  or
6994         unbraced hexadecimal escape sequences (for example,  \x{b3}  or  \xb3).         unbraced  hexadecimal  escape  sequences (for example, \x{b3} or \xb3).
6995         Larger values have to use braced sequences.         Larger values have to use braced sequences.
6996    
6997         2.  Octal  numbers  up  to \777 are recognized, and in UTF-8 mode, they         2. Octal numbers up to \777 are recognized, and  in  UTF-8  mode,  they
6998         match two-byte characters for values greater than \177.         match two-byte characters for values greater than \177.
6999    
7000         3. Repeat quantifiers apply to complete UTF characters, not to individ-         3. Repeat quantifiers apply to complete UTF characters, not to individ-
7001         ual data units, for example: \x{100}{3}.         ual data units, for example: \x{100}{3}.
7002    
7003         4.  The dot metacharacter matches one UTF character instead of a single         4. The dot metacharacter matches one UTF character instead of a  single
7004         data unit.         data unit.
7005    
7006         5. The escape sequence \C can be used to match a single byte  in  UTF-8         5.  The  escape sequence \C can be used to match a single byte in UTF-8
7007         mode, or a single 16-bit data unit in UTF-16 mode, but its use can lead         mode, or a single 16-bit data unit in UTF-16 mode, but its use can lead
7008         to some strange effects because it breaks up multi-unit characters (see         to some strange effects because it breaks up multi-unit characters (see
7009         the  description of \C in the pcrepattern documentation). The use of \C         the description of \C in the pcrepattern documentation). The use of  \C
7010         is   not   supported   in    the    alternative    matching    function         is    not    supported    in    the   alternative   matching   function
7011         pcre[16]_dfa_exec(),  nor  is it supported in UTF mode by the JIT opti-         pcre[16]_dfa_exec(), nor is it supported in UTF mode by the  JIT  opti-
7012         mization of pcre[16]_exec(). If JIT optimization is requested for a UTF         mization of pcre[16]_exec(). If JIT optimization is requested for a UTF
7013         pattern that contains \C, it will not succeed, and so the matching will         pattern that contains \C, it will not succeed, and so the matching will
7014         be carried out by the normal interpretive function.         be carried out by the normal interpretive function.
7015    
7016         6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly         6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
7017         test characters of any code value, but, by default, the characters that         test characters of any code value, but, by default, the characters that
7018         PCRE recognizes as digits, spaces, or word characters remain  the  same         PCRE  recognizes  as digits, spaces, or word characters remain the same
7019         set  as  in  non-UTF  mode, all with values less than 256. This remains         set as in non-UTF mode, all with values less  than  256.  This  remains
7020         true even when PCRE is  built  to  include  Unicode  property  support,         true  even  when  PCRE  is  built  to include Unicode property support,
7021         because to do otherwise would slow down PCRE in many common cases. Note         because to do otherwise would slow down PCRE in many common cases. Note
7022         in particular that this applies to \b and \B, because they are  defined         in  particular that this applies to \b and \B, because they are defined
7023         in terms of \w and \W. If you really want to test for a wider sense of,         in terms of \w and \W. If you really want to test for a wider sense of,
7024         say, "digit", you can use  explicit  Unicode  property  tests  such  as         say,  "digit",  you  can  use  explicit  Unicode property tests such as
7025         \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the         \p{Nd}. Alternatively, if you set the PCRE_UCP option, the way that the
7026         character escapes work is changed so that Unicode properties  are  used         character  escapes  work is changed so that Unicode properties are used
7027         to determine which characters match. There are more details in the sec-         to determine which characters match. There are more details in the sec-
7028         tion on generic character types in the pcrepattern documentation.         tion on generic character types in the pcrepattern documentation.
7029    
7030         7. Similarly, characters that match the POSIX named  character  classes         7.  Similarly,  characters that match the POSIX named character classes
7031         are all low-valued characters, unless the PCRE_UCP option is set.         are all low-valued characters, unless the PCRE_UCP option is set.
7032    
7033         8.  However,  the  horizontal  and vertical whitespace matching escapes         8. However, the horizontal and  vertical  whitespace  matching  escapes
7034         (\h, \H, \v, and \V) do match all the appropriate  Unicode  characters,         (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,
7035         whether or not PCRE_UCP is set.         whether or not PCRE_UCP is set.
7036    
7037         9.  Case-insensitive  matching  applies only to characters whose values         9. Case-insensitive matching applies only to  characters  whose  values
7038         are less than 128, unless PCRE is built with Unicode property  support.         are  less than 128, unless PCRE is built with Unicode property support.
7039         Even  when  Unicode  property support is available, PCRE still uses its         Even when Unicode property support is available, PCRE  still  uses  its
7040         own character tables when checking the case of  low-valued  characters,         own  character  tables when checking the case of low-valued characters,
7041         so  as not to degrade performance.  The Unicode property information is         so as not to degrade performance.  The Unicode property information  is
7042         used only for characters with higher values. Furthermore, PCRE supports         used only for characters with higher values. Furthermore, PCRE supports
7043         case-insensitive  matching  only  when  there  is  a one-to-one mapping         case-insensitive matching only  when  there  is  a  one-to-one  mapping
7044         between a letter's cases. There are a small number of many-to-one  map-         between  a letter's cases. There are a small number of many-to-one map-
7045         pings in Unicode; these are not supported by PCRE.         pings in Unicode; these are not supported by PCRE.
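
Comments 3 and 4 above distinguish complete UTF characters from data units. In UTF-16 mode the two differ whenever a subject contains surrogate pairs, which the following sketch illustrates by counting characters in a string of 16-bit units. The function utf16_char_count() is a hypothetical helper, not part of PCRE, and it assumes the input has already been checked for validity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PCRE). Counts UTF-16 characters in
   s[0..len-1], treating a high surrogate followed by a low surrogate as
   one character. Assumes the input is valid UTF-16. */
size_t utf16_char_count(const uint16_t *s, size_t len)
{
    size_t count = 0;
    assert(s != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < len &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                   /* consume the low half of the pair */
        count++;
    }
    return count;
}
```

For the four data units 0x0100, 0xD83D, 0xDE00, 0x0041 the function returns 3. It is that character count, not the unit count, that a dot or a repeat quantifier works with in UTF mode.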
7046    
7047    
# Line 7048  AUTHOR Line 7054  AUTHOR
7054    
7055  REVISION  REVISION
7056    
7057         Last updated: 13 January 2012         Last updated: 14 April 2012
7058         Copyright (c) 1997-2012 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
7059  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7060    
# Line 7195  SIMPLE USE OF JIT Line 7201  SIMPLE USE OF JIT
7201  UNSUPPORTED OPTIONS AND PATTERN ITEMS  UNSUPPORTED OPTIONS AND PATTERN ITEMS
7202    
7203         The  only  pcre_exec() options that are supported for JIT execution are         The  only  pcre_exec() options that are supported for JIT execution are
7204         PCRE_NO_UTF8_CHECK,    PCRE_NOTBOL,     PCRE_NOTEOL,     PCRE_NOTEMPTY,         PCRE_NO_UTF8_CHECK,  PCRE_NO_UTF16_CHECK,   PCRE_NOTBOL,   PCRE_NOTEOL,
7205         PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT.         PCRE_NOTEMPTY,  PCRE_NOTEMPTY_ATSTART, PCRE_PARTIAL_HARD, and PCRE_PAR-
7206           TIAL_SOFT.
7207    
7208         The unsupported pattern items are:         The unsupported pattern items are:
7209    
# Line 7213  UNSUPPORTED OPTIONS AND PATTERN ITEMS Line 7220  UNSUPPORTED OPTIONS AND PATTERN ITEMS
7220    
7221  RETURN VALUES FROM JIT EXECUTION  RETURN VALUES FROM JIT EXECUTION
7222    
7223         When  a  pattern  is matched using JIT execution, the return values are         When a pattern is matched using JIT execution, the  return  values  are
7224         the same as those given by the interpretive pcre_exec() code, with  the         the  same as those given by the interpretive pcre_exec() code, with the
7225         addition  of  one new error code: PCRE_ERROR_JIT_STACKLIMIT. This means         addition of one new error code: PCRE_ERROR_JIT_STACKLIMIT.  This  means
7226         that the memory used for the JIT stack was insufficient. See  "Control-         that  the memory used for the JIT stack was insufficient. See "Control-
7227         ling the JIT stack" below for a discussion of JIT stack usage. For com-         ling the JIT stack" below for a discussion of JIT stack usage. For com-
7228         patibility with the interpretive pcre_exec() code, no  more  than  two-         patibility  with  the  interpretive pcre_exec() code, no more than two-
7229         thirds  of  the ovector argument is used for passing back captured sub-         thirds of the ovector argument is used for passing back  captured  sub-
7230         strings.         strings.
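
The two-thirds rule has a practical consequence for sizing the ovector, sketched below. Both functions are hypothetical helpers, not part of PCRE. They assume the usual layout in which each captured substring, including the whole match, is passed back as a pair of offsets.

```c
#include <assert.h>

/* Hypothetical helper (not part of PCRE). Number of int elements the
   ovector needs so that the whole match plus capture_count substrings
   fit in its first two-thirds. */
int ovector_size_for(int capture_count)
{
    assert(capture_count >= 0);
    return 3 * (capture_count + 1);
}

/* Hypothetical helper (not part of PCRE). Number of offset pairs,
   counting the whole match, that can be passed back in an ovector of
   ovecsize elements. */
int ovector_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;           /* two-thirds of the vector, 2 ints per pair */
}
```

For example, a pattern with four capturing groups would need an ovector of 15 elements, and an ovector of 30 elements has room for 10 pairs.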
7231    
7232         The error code PCRE_ERROR_MATCHLIMIT is returned by  the  JIT  code  if         The  error  code  PCRE_ERROR_MATCHLIMIT  is returned by the JIT code if
7233         searching  a  very large pattern tree goes on for too long, as it is in         searching a very large pattern tree goes on for too long, as it  is  in
7234         the same circumstance when JIT is not used, but the details of  exactly         the  same circumstance when JIT is not used, but the details of exactly
7235         what  is  counted are not the same. The PCRE_ERROR_RECURSIONLIMIT error         what is counted are not the same. The  PCRE_ERROR_RECURSIONLIMIT  error
7236         code is never returned by JIT execution.         code is never returned by JIT execution.
7237    
7238    
7239  SAVING AND RESTORING COMPILED PATTERNS  SAVING AND RESTORING COMPILED PATTERNS
7240    
7241         The code that is generated by the  JIT  compiler  is  architecture-spe-         The  code  that  is  generated by the JIT compiler is architecture-spe-
7242         cific,  and  is also position dependent. For those reasons it cannot be         cific, and is also position dependent. For those reasons it  cannot  be
7243         saved (in a file or database) and restored later like the bytecode  and         saved  (in a file or database) and restored later like the bytecode and
7244         other  data  of  a compiled pattern. Saving and restoring compiled pat-         other data of a compiled pattern. Saving and  restoring  compiled  pat-
7245         terns is not something many people do. More detail about this  facility         terns  is not something many people do. More detail about this facility
7246         is  given in the pcreprecompile documentation. It should be possible to         is given in the pcreprecompile documentation. It should be possible  to
7247         run pcre_study() on a saved and restored pattern, and thereby  recreate         run  pcre_study() on a saved and restored pattern, and thereby recreate
7248         the  JIT  data, but because JIT compilation uses significant resources,         the JIT data, but because JIT compilation uses  significant  resources,
7249         it is probably not worth doing this; you might as  well  recompile  the         it  is  probably  not worth doing this; you might as well recompile the
7250         original pattern.         original pattern.
7251    
7252    
7253  CONTROLLING THE JIT STACK  CONTROLLING THE JIT STACK
7254    
7255         When the compiled JIT code runs, it needs a block of memory to use as a         When the compiled JIT code runs, it needs a block of memory to use as a
7256         stack.  By default, it uses 32K on the  machine  stack.  However,  some         stack.   By  default,  it  uses 32K on the machine stack. However, some
7257         large   or   complicated  patterns  need  more  than  this.  The  error         large  or  complicated  patterns  need  more  than  this.   The   error
7258         PCRE_ERROR_JIT_STACKLIMIT is given when  there  is  not  enough  stack.         PCRE_ERROR_JIT_STACKLIMIT  is  given  when  there  is not enough stack.
7259         Three  functions  are provided for managing blocks of memory for use as         Three functions are provided for managing blocks of memory for  use  as
7260         JIT stacks. There is further discussion about the use of JIT stacks  in         JIT  stacks. There is further discussion about the use of JIT stacks in
7261         the section entitled "JIT stack FAQ" below.         the section entitled "JIT stack FAQ" below.
7262    
7263         The  pcre_jit_stack_alloc() function creates a JIT stack. Its arguments         The pcre_jit_stack_alloc() function creates a JIT stack. Its  arguments
7264         are a starting size and a maximum size, and it returns a pointer to  an         are  a starting size and a maximum size, and it returns a pointer to an
7265         opaque  structure of type pcre_jit_stack, or NULL if there is an error.         opaque structure of type pcre_jit_stack, or NULL if there is an  error.
7266         The pcre_jit_stack_free() function can be used to free a stack that  is         The  pcre_jit_stack_free() function can be used to free a stack that is
7267         no  longer  needed.  (For  the technically minded: the address space is         no longer needed. (For the technically minded:  the  address  space  is
7268         allocated by mmap or VirtualAlloc.)         allocated by mmap or VirtualAlloc.)
7269    
7270         JIT uses far less memory for recursion than the interpretive code,  and         JIT  uses far less memory for recursion than the interpretive code, and
7271         a  maximum  stack size of 512K to 1M should be more than enough for any         a maximum stack size of 512K to 1M should be more than enough  for  any
7272         pattern.         pattern.
7273    
7274         The pcre_assign_jit_stack() function specifies  which  stack  JIT  code         The  pcre_assign_jit_stack()  function  specifies  which stack JIT code