Diff of /code/trunk/doc/pcre.txt

revision 1403 by ph10, Fri Jun 14 09:09:28 2013 UTC | revision 1404 by ph10, Tue Nov 19 15:36:57 2013 UTC
# Line 53  INTRODUCTION Line 53  INTRODUCTION
53         5.12, including support for UTF-8/16/32  encoded  strings  and  Unicode         5.12, including support for UTF-8/16/32  encoded  strings  and  Unicode
54         general  category  properties. However, UTF-8/16/32 and Unicode support         general  category  properties. However, UTF-8/16/32 and Unicode support
55         has to be explicitly enabled; it is not the default. The Unicode tables         has to be explicitly enabled; it is not the default. The Unicode tables
56         correspond to Unicode release 6.2.0.         correspond to Unicode release 6.3.0.
57    
58         In  addition to the Perl-compatible matching function, PCRE contains an         In  addition to the Perl-compatible matching function, PCRE contains an
59         alternative function that matches the same compiled patterns in a  dif-         alternative function that matches the same compiled patterns in a  dif-
# Line 180  REVISION Line 180  REVISION
180         Last updated: 13 May 2013         Last updated: 13 May 2013
181         Copyright (c) 1997-2013 University of Cambridge.         Copyright (c) 1997-2013 University of Cambridge.
182  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
183    
184    
185  PCRE(3)                    Library Functions Manual                    PCRE(3)  PCRE(3)                    Library Functions Manual                    PCRE(3)
186    
187    
# Line 512  REVISION Line 512  REVISION
512         Last updated: 12 May 2013         Last updated: 12 May 2013
513         Copyright (c) 1997-2013 University of Cambridge.         Copyright (c) 1997-2013 University of Cambridge.
514  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
515    
516    
517  PCRE(3)                    Library Functions Manual                    PCRE(3)  PCRE(3)                    Library Functions Manual                    PCRE(3)
518    
519    
# Line 840  REVISION Line 840  REVISION
840         Last updated: 12 May 2013         Last updated: 12 May 2013
841         Copyright (c) 1997-2013 University of Cambridge.         Copyright (c) 1997-2013 University of Cambridge.
842  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
843    
844    
845  PCREBUILD(3)               Library Functions Manual               PCREBUILD(3)  PCREBUILD(3)               Library Functions Manual               PCREBUILD(3)
846    
847    
# Line 1343  REVISION Line 1343  REVISION
1343         Last updated: 12 May 2013         Last updated: 12 May 2013
1344         Copyright (c) 1997-2013 University of Cambridge.         Copyright (c) 1997-2013 University of Cambridge.
1345  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
1346    
1347    
1348  PCREMATCHING(3)            Library Functions Manual            PCREMATCHING(3)  PCREMATCHING(3)            Library Functions Manual            PCREMATCHING(3)
1349    
1350    
# Line 1457  THE ALTERNATIVE MATCHING ALGORITHM Line 1457  THE ALTERNATIVE MATCHING ALGORITHM
1457         at the fifth character of the subject. The algorithm does not automati-         at the fifth character of the subject. The algorithm does not automati-
1458         cally move on to find matches that start at later positions.         cally move on to find matches that start at later positions.
1459    
1460           PCRE's  "auto-possessification" optimization usually applies to charac-
1461           ter repeats at the end of a pattern (as well as internally). For  exam-
1462           ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
1463           is no point even considering the possibility of backtracking  into  the
1464           repeated  digits.  For  DFA matching, this means that only one possible
1465           match is found. If you really do want multiple matches in  such  cases,
1466           either use an ungreedy repeat ("a\d+?") or set the PCRE_NO_AUTO_POSSESS
1467           option when compiling.
1468    
1469         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
1470         supported by the alternative matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
1471    
1472         1. Because the algorithm finds all  possible  matches,  the  greedy  or         1.  Because  the  algorithm  finds  all possible matches, the greedy or
1473         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
1474         ungreedy quantifiers are treated in exactly the same way. However, pos-         ungreedy quantifiers are treated in exactly the same way. However, pos-
1475         sessive  quantifiers can make a difference when what follows could also         sessive quantifiers can make a difference when what follows could  also
1476         match what is quantified, for example in a pattern like this:         match what is quantified, for example in a pattern like this:
1477    
1478           ^a++\w!           ^a++\w!
1479    
1480         This pattern matches "aaab!" but not "aaa!", which would be matched  by         This  pattern matches "aaab!" but not "aaa!", which would be matched by
1481         a  non-possessive quantifier. Similarly, if an atomic group is present,         a non-possessive quantifier. Similarly, if an atomic group is  present,
1482         it is matched as if it were a standalone pattern at the current  point,         it  is matched as if it were a standalone pattern at the current point,
1483         and  the  longest match is then "locked in" for the rest of the overall         and the longest match is then "locked in" for the rest of  the  overall
1484         pattern.         pattern.
1485    
1486         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
1487         is  not  straightforward  to  keep track of captured substrings for the         is not straightforward to keep track of  captured  substrings  for  the
1488         different matching possibilities, and  PCRE's  implementation  of  this         different  matching  possibilities,  and  PCRE's implementation of this
1489         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
1490         strings are available.         strings are available.
1491    
1492         3. Because no substrings are captured, back references within the  pat-         3.  Because no substrings are captured, back references within the pat-
1493         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
1494    
1495         4.  For  the same reason, conditional expressions that use a backrefer-         4. For the same reason, conditional expressions that use  a  backrefer-
1496         ence as the condition or test for a specific group  recursion  are  not         ence  as  the  condition or test for a specific group recursion are not
1497         supported.         supported.
1498    
1499         5.  Because  many  paths  through the tree may be active, the \K escape         5. Because many paths through the tree may be  active,  the  \K  escape
1500         sequence, which resets the start of the match when encountered (but may         sequence, which resets the start of the match when encountered (but may
1501         be  on  some  paths  and not on others), is not supported. It causes an         be on some paths and not on others), is not  supported.  It  causes  an
1502         error if encountered.         error if encountered.
1503    
1504         6. Callouts are supported, but the value of the  capture_top  field  is         6.  Callouts  are  supported, but the value of the capture_top field is
1505         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
1506    
1507         7.  The  \C  escape  sequence, which (in the standard algorithm) always         7. The \C escape sequence, which (in  the  standard  algorithm)  always
1508         matches a single data unit, even in UTF-8, UTF-16 or UTF-32  modes,  is         matches  a  single data unit, even in UTF-8, UTF-16 or UTF-32 modes, is
1509         not  supported  in these modes, because the alternative algorithm moves         not supported in these modes, because the alternative  algorithm  moves
1510         through the subject string one character (not data unit) at a time, for         through the subject string one character (not data unit) at a time, for
1511         all active paths through the tree.         all active paths through the tree.
1512    
1513         8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
1514         are not supported. (*FAIL) is supported, and  behaves  like  a  failing         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
1515         negative assertion.         negative assertion.
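
The possessive-quantifier example in item 1 can be checked by hand. The following C sketch hard-codes two tiny matchers: one treats the run of a's possessively, as in ^a++\w!, and the other allows backtracking, as in ^a+\w!. The function names are hypothetical and not part of PCRE. The sketch assumes ASCII subjects and approximates \w with isalnum() plus underscore.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Approximation of \w for ASCII subjects: letters, digits, underscore. */
static int is_word(char c)
{
    return c == '_' || isalnum((unsigned char)c);
}

/* Does "\w!" match exactly at s[pos]? */
static int word_then_bang(const char *s, size_t len, size_t pos)
{
    return pos + 1 < len && is_word(s[pos]) && s[pos + 1] == '!';
}

/* ^a++\w!  The a's are consumed greedily and never given back. */
int match_possessive(const char *s)
{
    size_t len = strlen(s), n = 0;
    while (n < len && s[n] == 'a')
        n++;
    return n >= 1 && word_then_bang(s, len, n);
}

/* ^a+\w!   Try the longest run of a's first, then give back one at a time. */
int match_backtracking(const char *s)
{
    size_t len = strlen(s), n = 0;
    while (n < len && s[n] == 'a')
        n++;
    for (; n >= 1; n--) {
        if (word_then_bang(s, len, n))
            return 1;
    }
    return 0;
}
```

With the subject "aaa!", match_possessive() fails because nothing is left for \w once all three a's have been taken. match_backtracking() succeeds by giving back the last a. Both succeed on "aaab!".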
1516    
1517    
1518  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1519    
1520         Using  the alternative matching algorithm provides the following advan-         Using the alternative matching algorithm provides the following  advan-
1521         tages:         tages:
1522    
1523         1. All possible matches (at a single point in the subject) are automat-         1. All possible matches (at a single point in the subject) are automat-
1524         ically  found,  and  in particular, the longest match is found. To find         ically found, and in particular, the longest match is  found.  To  find
1525         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
1526         things with callouts.         things with callouts.
1527    
1528         2.  Because  the  alternative  algorithm  scans the subject string just         2. Because the alternative algorithm  scans  the  subject  string  just
1529         once, and never needs to backtrack (except for lookbehinds), it is pos-         once, and never needs to backtrack (except for lookbehinds), it is pos-
1530         sible  to  pass  very  long subject strings to the matching function in         sible to pass very long subject strings to  the  matching  function  in
1531         several pieces, checking for partial matching each time. Although it is         several pieces, checking for partial matching each time. Although it is
1532         possible  to  do multi-segment matching using the standard algorithm by         possible to do multi-segment matching using the standard  algorithm  by
1533         retaining partially matched substrings, it  is  more  complicated.  The         retaining  partially  matched  substrings,  it is more complicated. The
1534         pcrepartial  documentation  gives  details of partial matching and dis-         pcrepartial documentation gives details of partial  matching  and  dis-
1535         cusses multi-segment matching.         cusses multi-segment matching.
1536    
1537    
# Line 1530  DISADVANTAGES OF THE ALTERNATIVE ALGORIT Line 1539  DISADVANTAGES OF THE ALTERNATIVE ALGORIT
1539    
1540         The alternative algorithm suffers from a number of disadvantages:         The alternative algorithm suffers from a number of disadvantages:
1541    
1542         1. It is substantially slower than  the  standard  algorithm.  This  is         1.  It  is  substantially  slower  than the standard algorithm. This is
1543         partly  because  it has to search for all possible matches, but is also         partly because it has to search for all possible matches, but  is  also
1544         because it is less susceptible to optimization.         because it is less susceptible to optimization.
1545    
1546         2. Capturing parentheses and back references are not supported.         2. Capturing parentheses and back references are not supported.
# Line 1549  AUTHOR Line 1558  AUTHOR
1558    
1559  REVISION  REVISION
1560    
1561         Last updated: 08 January 2012         Last updated: 12 November 2013
1562         Copyright (c) 1997-2012 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
1563  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
1564    
1565    
1566  PCREAPI(3)                 Library Functions Manual                 PCREAPI(3)  PCREAPI(3)                 Library Functions Manual                 PCREAPI(3)
1567    
1568    
# Line 1957  CHECKING BUILD-TIME OPTIONS Line 1966  CHECKING BUILD-TIME OPTIONS
1966         POSIX interface uses malloc() for output vectors. Further  details  are         POSIX interface uses malloc() for output vectors. Further  details  are
1967         given in the pcreposix documentation.         given in the pcreposix documentation.
1968    
1969             PCRE_CONFIG_PARENS_LIMIT
1970    
1971           The output is a long integer that gives the maximum depth of nesting of
1972           parentheses (of any kind) in a pattern. This limit is  imposed  to  cap
1973           the amount of system stack used when a pattern is compiled. It is spec-
1974           ified when PCRE is built; the default is 250.
1975    
1976           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
1977    
1978         The  output is a long integer that gives the default limit for the num-         The output is a long integer that gives the default limit for the  num-
1979         ber of internal matching function calls  in  a  pcre_exec()  execution.         ber  of  internal  matching  function calls in a pcre_exec() execution.
1980         Further details are given with pcre_exec() below.         Further details are given with pcre_exec() below.
1981    
1982           PCRE_CONFIG_MATCH_LIMIT_RECURSION           PCRE_CONFIG_MATCH_LIMIT_RECURSION
1983    
1984         The output is a long integer that gives the default limit for the depth         The output is a long integer that gives the default limit for the depth
1985         of  recursion  when  calling  the  internal  matching  function  in   a         of   recursion  when  calling  the  internal  matching  function  in  a
1986         pcre_exec()  execution.  Further  details  are  given  with pcre_exec()         pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
1987         below.         below.
1988    
1989           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
1990    
1991         The output is an integer that is set to one if internal recursion  when         The  output is an integer that is set to one if internal recursion when
1992         running pcre_exec() is implemented by recursive function calls that use         running pcre_exec() is implemented by recursive function calls that use
1993         the stack to remember their state. This is the usual way that  PCRE  is         the  stack  to remember their state. This is the usual way that PCRE is
1994         compiled. The output is zero if PCRE was compiled to use blocks of data         compiled. The output is zero if PCRE was compiled to use blocks of data
1995         on the  heap  instead  of  recursive  function  calls.  In  this  case,         on  the  heap  instead  of  recursive  function  calls.  In  this case,
1996         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
1997         blocks on the heap, thus avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
1998    
1999    
# Line 1994  COMPILING A PATTERN Line 2010  COMPILING A PATTERN
2010    
2011         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
2012         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
2013         the two interfaces is that pcre_compile2() has an additional  argument,         the  two interfaces is that pcre_compile2() has an additional argument,
2014         errorcodeptr,  via  which  a  numerical  error code can be returned. To         errorcodeptr, via which a numerical error  code  can  be  returned.  To
2015         avoid too much repetition, we refer just to pcre_compile()  below,  but         avoid  too  much repetition, we refer just to pcre_compile() below, but
2016         the information applies equally to pcre_compile2().         the information applies equally to pcre_compile2().
2017    
2018         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
2019         the pattern argument. A pointer to a single block  of  memory  that  is         the  pattern  argument.  A  pointer to a single block of memory that is
2020         obtained  via  pcre_malloc is returned. This contains the compiled code         obtained via pcre_malloc is returned. This contains the  compiled  code
2021         and related data. The pcre type is defined for the returned block; this         and related data. The pcre type is defined for the returned block; this
2022         is a typedef for a structure whose contents are not externally defined.         is a typedef for a structure whose contents are not externally defined.
2023         It is up to the caller to free the memory (via pcre_free) when it is no         It is up to the caller to free the memory (via pcre_free) when it is no
2024         longer required.         longer required.
2025    
2026         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although the compiled code of a PCRE regex is relocatable, that is,  it
2027         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
2028         fully  relocatable, because it may contain a copy of the tableptr argu-         fully relocatable, because it may contain a copy of the tableptr  argu-
2029         ment, which is an address (see below).         ment, which is an address (see below).
2030    
2031         The options argument contains various bit settings that affect the com-         The options argument contains various bit settings that affect the com-
2032         pilation.  It  should be zero if no options are required. The available         pilation. It should be zero if no options are required.  The  available
2033         options are described below. Some of them (in  particular,  those  that         options  are  described  below. Some of them (in particular, those that
2034         are  compatible with Perl, but some others as well) can also be set and         are compatible with Perl, but some others as well) can also be set  and
2035         unset from within the pattern (see  the  detailed  description  in  the         unset  from  within  the  pattern  (see the detailed description in the
2036         pcrepattern  documentation). For those options that can be different in         pcrepattern documentation). For those options that can be different  in
2037         different parts of the pattern, the contents of  the  options  argument         different  parts  of  the pattern, the contents of the options argument
2038         specifies their settings at the start of compilation and execution. The         specifies their settings at the start of compilation and execution. The
2039         PCRE_ANCHORED, PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK,  and         PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
2040         PCRE_NO_START_OPTIMIZE  options  can  be set at the time of matching as         PCRE_NO_START_OPTIMIZE options can be set at the time  of  matching  as
2041         well as at compile time.         well as at compile time.
2042    
2043         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
2044         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
2045         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
2046         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
2047         try to free it. Normally, the offset from the start of the  pattern  to         try  to  free it. Normally, the offset from the start of the pattern to
2048         the data unit that was being processed when the error was discovered is         the data unit that was being processed when the error was discovered is
2049         placed in the variable pointed to by erroffset, which must not be  NULL         placed  in the variable pointed to by erroffset, which must not be NULL
2050         (if  it is, an immediate error is given). However, for an invalid UTF-8         (if it is, an immediate error is given). However, for an invalid  UTF-8
2051         or UTF-16 string, the offset is that of the  first  data  unit  of  the         or  UTF-16  string,  the  offset  is that of the first data unit of the
2052         failing character.         failing character.
2053    
2054         Some  errors are not detected until the whole pattern has been scanned;         Some errors are not detected until the whole pattern has been  scanned;
2055         in these cases, the offset passed back is the length  of  the  pattern.         in  these  cases,  the offset passed back is the length of the pattern.
2056         Note  that  the  offset is in data units, not characters, even in a UTF         Note that the offset is in data units, not characters, even  in  a  UTF
2057         mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-         mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
2058         acter.         acter.
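
Because the offset counts data units, a caller that wants to report a character position for a UTF-8 pattern has to convert it. The following is a minimal sketch of one way to do that, assuming the 8-bit library and a pattern that is valid UTF-8 up to the offset. utf8_char_index() is a hypothetical helper, not a PCRE function.

```c
#include <stddef.h>

/*
 * Count the UTF-8 characters that begin before byte_offset.
 * Continuation bytes have the bit pattern 10xxxxxx and do not start a
 * character, so an offset that points into the middle of a character
 * is reported as the index of the following character.
 */
size_t utf8_char_index(const char *pattern, size_t byte_offset)
{
    size_t chars = 0;
    for (size_t i = 0; i < byte_offset && pattern[i] != '\0'; i++) {
        if (((unsigned char)pattern[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For a valid offset that falls on the first byte of a character, the result is the zero-based index of that character. It could be applied to the value stored via erroffset before printing a diagnostic.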
2059    
2060         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
2061         codeptr argument is not NULL, a non-zero error code number is  returned         codeptr  argument is not NULL, a non-zero error code number is returned
2062         via  this argument in the event of an error. This is in addition to the         via this argument in the event of an error. This is in addition to  the
2063         textual error message. Error codes and messages are listed below.         textual error message. Error codes and messages are listed below.
2064    
2065         If the final argument, tableptr, is NULL, PCRE uses a  default  set  of         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of
2066         character  tables  that  are  built  when  PCRE  is compiled, using the         character tables that are  built  when  PCRE  is  compiled,  using  the
2067         default C locale. Otherwise, tableptr must be an address  that  is  the         default  C  locale.  Otherwise, tableptr must be an address that is the
2068         result  of  a  call to pcre_maketables(). This value is stored with the         result of a call to pcre_maketables(). This value is  stored  with  the
2069         compiled pattern, and used again by pcre_exec(), unless  another  table         compiled  pattern,  and  used  again by pcre_exec() and pcre_dfa_exec()
2070         pointer is passed to it. For more discussion, see the section on locale         when the pattern is matched. For more discussion, see  the  section  on
2071         support below.         locale support below.
2072    
2073         This code fragment shows a typical straightforward  call  to  pcre_com-         This  code  fragment  shows a typical straightforward call to pcre_com-
2074         pile():         pile():
2075    
2076           pcre *re;           pcre *re;
# Line 2067  COMPILING A PATTERN Line 2083  COMPILING A PATTERN
2083             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
2084             NULL);            /* use default character tables */             NULL);            /* use default character tables */
2085    
2086         The  following  names  for option bits are defined in the pcre.h header         The following names for option bits are defined in  the  pcre.h  header
2087         file:         file:
2088    
2089           PCRE_ANCHORED           PCRE_ANCHORED
2090    
2091         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
2092         is  constrained to match only at the first matching point in the string         is constrained to match only at the first matching point in the  string
2093         that is being searched (the "subject string"). This effect can also  be         that  is being searched (the "subject string"). This effect can also be
2094         achieved  by appropriate constructs in the pattern itself, which is the         achieved by appropriate constructs in the pattern itself, which is  the
2095         only way to do it in Perl.         only way to do it in Perl.
2096    
2097           PCRE_AUTO_CALLOUT           PCRE_AUTO_CALLOUT
2098    
2099         If this bit is set, pcre_compile() automatically inserts callout items,         If this bit is set, pcre_compile() automatically inserts callout items,
2100         all  with  number  255, before each pattern item. For discussion of the         all with number 255, before each pattern item. For  discussion  of  the
2101         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
2102    
2103           PCRE_BSR_ANYCRLF           PCRE_BSR_ANYCRLF
2104           PCRE_BSR_UNICODE           PCRE_BSR_UNICODE
2105    
2106         These options (which are mutually exclusive) control what the \R escape         These options (which are mutually exclusive) control what the \R escape
2107         sequence  matches.  The choice is either to match only CR, LF, or CRLF,         sequence matches. The choice is either to match only CR, LF,  or  CRLF,
2108         or to match any Unicode newline sequence. The default is specified when         or to match any Unicode newline sequence. The default is specified when
2109         PCRE is built. It can be overridden from within the pattern, or by set-         PCRE is built. It can be overridden from within the pattern, or by set-
2110         ting an option when a compiled pattern is matched.         ting an option when a compiled pattern is matched.
2111    
2112           PCRE_CASELESS           PCRE_CASELESS
2113    
2114         If this bit is set, letters in the pattern match both upper  and  lower         If  this  bit is set, letters in the pattern match both upper and lower
2115         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be
2116         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE         changed  within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
2117         always  understands the concept of case for characters whose values are         always understands the concept of case for characters whose values  are
2118         less than 128, so caseless matching is always possible. For  characters         less  than 128, so caseless matching is always possible. For characters
2119         with  higher  values,  the concept of case is supported if PCRE is com-         with higher values, the concept of case is supported if  PCRE  is  com-
2120         piled with Unicode property support, but not otherwise. If you want  to         piled  with Unicode property support, but not otherwise. If you want to
2121         use  caseless  matching  for  characters 128 and above, you must ensure         use caseless matching for characters 128 and  above,  you  must  ensure
2122         that PCRE is compiled with Unicode property support  as  well  as  with         that  PCRE  is  compiled  with Unicode property support as well as with
2123         UTF-8 support.         UTF-8 support.
2124    
2125           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
2126    
2127         If  this bit is set, a dollar metacharacter in the pattern matches only         If this bit is set, a dollar metacharacter in the pattern matches  only
2128         at the end of the subject string. Without this option,  a  dollar  also         at  the  end  of the subject string. Without this option, a dollar also
2129         matches  immediately before a newline at the end of the string (but not         matches immediately before a newline at the end of the string (but  not
2130         before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored         before  any  other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
2131         if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in         if PCRE_MULTILINE is set.  There is no equivalent  to  this  option  in
2132         Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
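
The two behaviours of a dollar outside multiline mode can be stated as a small predicate. This is an illustrative sketch only, assuming LF is the newline convention. dollar_matches_at() is a hypothetical helper, not part of the PCRE API.

```c
#include <stddef.h>

/*
 * May a non-multiline "$" match at position pos of a subject of length len?
 * endonly != 0 models PCRE_DOLLAR_ENDONLY: only the very end qualifies.
 * Otherwise the position just before a final newline also qualifies.
 */
int dollar_matches_at(const char *subject, size_t len, size_t pos, int endonly)
{
    if (pos == len)
        return 1;
    if (!endonly && len > 0 && pos == len - 1 && subject[pos] == '\n')
        return 1;
    return 0;
}
```

For the four-character subject "abc\n", position 3 qualifies only when endonly is zero, whereas position 4 qualifies in both cases.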
2133    
2134           PCRE_DOTALL           PCRE_DOTALL
2135    
2136         If this bit is set, a dot metacharacter in the pattern matches a  char-         If  this bit is set, a dot metacharacter in the pattern matches a char-
2137         acter of any value, including one that indicates a newline. However, it         acter of any value, including one that indicates a newline. However, it
2138         only ever matches one character, even if newlines are  coded  as  CRLF.         only  ever  matches  one character, even if newlines are coded as CRLF.
2139         Without  this option, a dot does not match when the current position is         Without this option, a dot does not match when the current position  is
2140         at a newline. This option is equivalent to Perl's /s option, and it can         at a newline. This option is equivalent to Perl's /s option, and it can
2141         be  changed within a pattern by a (?s) option setting. A negative class         be changed within a pattern by a (?s) option setting. A negative  class
2142         such as [^a] always matches newline characters, independent of the set-         such as [^a] always matches newline characters, independent of the set-
2143         ting of this option.         ting of this option.
2144    
2145           PCRE_DUPNAMES           PCRE_DUPNAMES
2146    
2147         If  this  bit is set, names used to identify capturing subpatterns need         If this bit is set, names used to identify capturing  subpatterns  need
2148         not be unique. This can be helpful for certain types of pattern when it         not be unique. This can be helpful for certain types of pattern when it
2149         is  known  that  only  one instance of the named subpattern can ever be         is known that only one instance of the named  subpattern  can  ever  be
2150         matched. There are more details of named subpatterns  below;  see  also         matched.  There  are  more details of named subpatterns below; see also
2151         the pcrepattern documentation.         the pcrepattern documentation.
2152    
2153           PCRE_EXTENDED           PCRE_EXTENDED
2154    
2155         If  this  bit  is  set,  white space data characters in the pattern are         If this bit is set, most white space  characters  in  the  pattern  are
2156         totally ignored except when escaped or inside a character class.  White         totally  ignored  except when escaped or inside a character class. How-
2157         space does not include the VT character (code 11). In addition, charac-         ever, white space is not allowed within  sequences  such  as  (?>  that
2158         ters between an unescaped # outside a character class and the next new-         introduce  various  parenthesized  subpatterns,  nor within a numerical
2159         line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x         quantifier such as {1,3}.  However, ignorable white space is  permitted
2160         option, and it can be changed within a pattern by a  (?x)  option  set-         between an item and a following quantifier and between a quantifier and
2161         ting.         a following + that indicates possessiveness.
2162    
2163         Which  characters  are  interpreted  as  newlines  is controlled by the         White space formerly did not include the VT character (code 11), because
2164         options passed to pcre_compile() or by a special sequence at the  start         Perl did not treat this character as white space. However, Perl changed
2165         of  the  pattern, as described in the section entitled "Newline conven-         at release 5.18, so PCRE followed  at  release  8.34,  and  VT  is  now
2166           treated as white space.
2167    
2168           PCRE_EXTENDED  also  causes characters between an unescaped # outside a
2169           character class  and  the  next  newline,  inclusive,  to  be  ignored.
2170           PCRE_EXTENDED  is equivalent to Perl's /x option, and it can be changed
2171           within a pattern by a (?x) option setting.
2172    
2173           Which characters are interpreted  as  newlines  is  controlled  by  the
2174           options  passed to pcre_compile() or by a special sequence at the start
2175           of the pattern, as described in the section entitled  "Newline  conven-
2176         tions" in the pcrepattern documentation. Note that the end of this type         tions" in the pcrepattern documentation. Note that the end of this type
2177         of  comment  is  a  literal  newline  sequence  in  the pattern; escape         of comment is  a  literal  newline  sequence  in  the  pattern;  escape
2178         sequences that happen to represent a newline do not count.         sequences that happen to represent a newline do not count.
2179    
2180         This option makes it possible to include  comments  inside  complicated         This  option  makes  it possible to include comments inside complicated
2181         patterns.   Note,  however,  that this applies only to data characters.         patterns.  Note, however, that this applies only  to  data  characters.
2182         White space  characters  may  never  appear  within  special  character         White  space  characters  may  never  appear  within  special character
2183         sequences in a pattern, for example within the sequence (?( that intro-         sequences in a pattern, for example within the sequence (?( that intro-
2184         duces a conditional subpattern.         duces a conditional subpattern.
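
The PCRE_EXTENDED rules above can be pictured with a small stand-alone model. This is only a sketch of the documented behaviour, not PCRE's own code: the function name strip_extended is hypothetical, a comment is assumed to end at LF, VT counts as white space (as in release 8.34 and later), and \Q...\E, POSIX classes such as [[:alpha:]], and the ban on white space inside sequences like (?> or {1,3} are not modelled.

```c
#include <stddef.h>

/* Simplified model of what PCRE_EXTENDED ignores (hypothetical helper,
   not part of PCRE). Copies src into dst, which must hold at least
   strlen(src) + 1 bytes, and returns the length of the result. */
size_t strip_extended(const char *src, char *dst)
{
    size_t n = 0;
    int in_class = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        char c = src[i];

        /* An escaped character is always kept, white space included. */
        if (c == '\\' && src[i + 1] != '\0') {
            dst[n++] = c;
            dst[n++] = src[++i];
            continue;
        }
        /* Inside a character class nothing is ignored. */
        if (in_class) {
            dst[n++] = c;
            if (c == ']') in_class = 0;
            continue;
        }
        if (c == '[') {
            in_class = 1;
            dst[n++] = c;
            if (src[i + 1] == '^') dst[n++] = src[++i];
            if (src[i + 1] == ']') dst[n++] = src[++i]; /* leading ] is data */
            continue;
        }
        /* Unescaped # outside a class: skip through the next LF, inclusive. */
        if (c == '#') {
            while (src[i] != '\0' && src[i] != '\n') i++;
            if (src[i] == '\0') break;
            continue;
        }
        /* Space, HT, LF, VT, FF and CR are ignorable white space. */
        if (c == ' ' || (c >= '\t' && c <= '\r')) continue;

        dst[n++] = c;
    }
    dst[n] = '\0';
    return n;
}
```

Under these assumptions, the pattern text "a b # comment", a newline, then " [ x] \ c" reduces to "ab[ x]\ c": the space inside the class and the escaped space survive, while everything else that is white space or comment is dropped.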
2185    
2186           PCRE_EXTRA           PCRE_EXTRA
2187    
2188         This option was invented in order to turn on  additional  functionality         This  option  was invented in order to turn on additional functionality
2189         of  PCRE  that  is  incompatible with Perl, but it is currently of very         of PCRE that is incompatible with Perl, but it  is  currently  of  very
2190         little use. When set, any backslash in a pattern that is followed by  a         little  use. When set, any backslash in a pattern that is followed by a
2191         letter  that  has  no  special  meaning causes an error, thus reserving         letter that has no special meaning  causes  an  error,  thus  reserving
2192         these combinations for future expansion. By  default,  as  in  Perl,  a         these  combinations  for  future  expansion.  By default, as in Perl, a
2193         backslash  followed by a letter with no special meaning is treated as a         backslash followed by a letter with no special meaning is treated as  a
2194         literal. (Perl can, however, be persuaded to give an error for this, by         literal. (Perl can, however, be persuaded to give an error for this, by
2195         running  it with the -w option.) There are at present no other features         running it with the -w option.) There are at present no other  features
2196         controlled by this option. It can also be set by a (?X) option  setting         controlled  by this option. It can also be set by a (?X) option setting
2197         within a pattern.         within a pattern.
2198    
2199           PCRE_FIRSTLINE           PCRE_FIRSTLINE
2200    
2201         If  this  option  is  set,  an  unanchored pattern is required to match         If this option is set, an  unanchored  pattern  is  required  to  match
2202         before or at the first  newline  in  the  subject  string,  though  the         before  or  at  the  first  newline  in  the subject string, though the
2203         matched text may continue over the newline.         matched text may continue over the newline.
2204    
2205           PCRE_JAVASCRIPT_COMPAT           PCRE_JAVASCRIPT_COMPAT
2206    
2207         If this option is set, PCRE's behaviour is changed in some ways so that         If this option is set, PCRE's behaviour is changed in some ways so that
2208         it is compatible with JavaScript rather than Perl. The changes  are  as         it  is  compatible with JavaScript rather than Perl. The changes are as
2209         follows:         follows:
2210    
2211         (1)  A  lone  closing square bracket in a pattern causes a compile-time         (1) A lone closing square bracket in a pattern  causes  a  compile-time
2212         error, because this is illegal in JavaScript (by default it is  treated         error,  because this is illegal in JavaScript (by default it is treated
2213         as a data character). Thus, the pattern AB]CD becomes illegal when this         as a data character). Thus, the pattern AB]CD becomes illegal when this
2214         option is set.         option is set.
2215    
2216         (2) At run time, a back reference to an unset subpattern group  matches         (2)  At run time, a back reference to an unset subpattern group matches
2217         an  empty  string (by default this causes the current matching alterna-         an empty string (by default this causes the current  matching  alterna-
2218         tive to fail). A pattern such as (\1)(a) succeeds when this  option  is         tive  to  fail). A pattern such as (\1)(a) succeeds when this option is
2219         set  (assuming  it can find an "a" in the subject), whereas it fails by         set (assuming it can find an "a" in the subject), whereas it  fails  by
2220         default, for Perl compatibility.         default, for Perl compatibility.
2221    
2222         (3) \U matches an upper case "U" character; by default \U causes a com-         (3) \U matches an upper case "U" character; by default \U causes a com-
2223         pile time error (Perl uses \U to upper case subsequent characters).         pile time error (Perl uses \U to upper case subsequent characters).
2224    
2225         (4) \u matches a lower case "u" character unless it is followed by four         (4) \u matches a lower case "u" character unless it is followed by four
2226         hexadecimal digits, in which case the hexadecimal  number  defines  the         hexadecimal  digits,  in  which case the hexadecimal number defines the
2227         code  point  to match. By default, \u causes a compile time error (Perl         code point to match. By default, \u causes a compile time  error  (Perl
2228         uses it to upper case the following character).         uses it to upper case the following character).
2229    
2230         (5) \x matches a lower case "x" character unless it is followed by  two         (5)  \x matches a lower case "x" character unless it is followed by two
2231         hexadecimal  digits,  in  which case the hexadecimal number defines the         hexadecimal digits, in which case the hexadecimal  number  defines  the
2232         code point to match. By default, as in Perl, a  hexadecimal  number  is         code  point  to  match. By default, as in Perl, a hexadecimal number is
2233         always expected after \x, but it may have zero, one, or two digits (so,         always expected after \x, but it may have zero, one, or two digits (so,
2234         for example, \xz matches a binary zero character followed by z).         for example, \xz matches a binary zero character followed by z).
2235    
2236           PCRE_MULTILINE           PCRE_MULTILINE
2237    
2238         By default, for the purposes of matching "start of line"  and  "end  of         By  default,  for  the purposes of matching "start of line" and "end of
2239         line", PCRE treats the subject string as consisting of a single line of         line", PCRE treats the subject string as consisting of a single line of
2240         characters, even if it actually contains newlines. The "start of  line"         characters,  even if it actually contains newlines. The "start of line"
2241         metacharacter (^) matches only at the start of the string, and the "end         metacharacter (^) matches only at the start of the string, and the "end
2242         of line" metacharacter ($) matches only at the end of  the  string,  or         of  line"  metacharacter  ($) matches only at the end of the string, or
2243         before  a terminating newline (except when PCRE_DOLLAR_ENDONLY is set).         before a terminating newline (except when PCRE_DOLLAR_ENDONLY is  set).
2244         Note, however, that unless PCRE_DOTALL  is  set,  the  "any  character"         Note,  however,  that  unless  PCRE_DOTALL  is set, the "any character"
2245         metacharacter  (.)  does not match at a newline. This behaviour (for ^,         metacharacter (.) does not match at a newline. This behaviour  (for  ^,
2246         $, and dot) is the same as Perl.         $, and dot) is the same as Perl.
2247    
2248         When PCRE_MULTILINE is set, the "start of line" and  "end  of  line"         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"
2249         constructs  match  immediately following or immediately before internal         constructs match immediately following or immediately  before  internal
2250         newlines in the subject string, respectively, as well as  at  the  very         newlines  in  the  subject string, respectively, as well as at the very
2251         start  and  end.  This is equivalent to Perl's /m option, and it can be         start and end. This is equivalent to Perl's /m option, and  it  can  be
2252         changed within a pattern by a (?m) option setting. If there are no new-         changed within a pattern by a (?m) option setting. If there are no new-
2253         lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
2254         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
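
How PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY interact for the dollar metacharacter can be summarized as a small decision function. The sketch below is a model of the rules stated in the text, not code from PCRE: the name dollar_matches is hypothetical, the two options are passed as plain int flags, and LF is assumed to be the only newline.

```c
#include <stddef.h>

/* Model only: would "$" match at offset pos (0 <= pos <= len) of the
   subject s? Hypothetical helper, LF-only newline convention assumed. */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos > len) return 0;
    if (pos == len) return 1;        /* the very end always matches */
    if (s[pos] != '\n') return 0;    /* otherwise must be just before a newline */
    if (multiline) return 1;         /* any newline; DOLLAR_ENDONLY is ignored */
    if (dollar_endonly) return 0;    /* only the very end is allowed */
    return pos + 1 == len;           /* default: before a terminating newline */
}
```

For the subject "ab\ncd\n" (length 6), the default setting accepts offsets 5 and 6, PCRE_DOLLAR_ENDONLY accepts only offset 6, and PCRE_MULTILINE accepts offsets 2, 5 and 6 whether or not PCRE_DOLLAR_ENDONLY is also set.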
2255    
2256           PCRE_NEVER_UTF           PCRE_NEVER_UTF
2257    
2258         This option locks out interpretation of the pattern as UTF-8 (or UTF-16         This option locks out interpretation of the pattern as UTF-8 (or UTF-16
2259         or  UTF-32  in the 16-bit and 32-bit libraries). In particular, it pre-         or UTF-32 in the 16-bit and 32-bit libraries). In particular,  it  pre-
2260         vents the creator of the pattern from switching to  UTF  interpretation         vents  the  creator of the pattern from switching to UTF interpretation
2261         by starting the pattern with (*UTF). This may be useful in applications         by starting the pattern with (*UTF). This may be useful in applications
2262         that  process  patterns  from  external  sources.  The  combination  of         that  process  patterns  from  external  sources.  The  combination  of
2263         PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.         PCRE_UTF8 and PCRE_NEVER_UTF also causes an error.
# Line 2242  COMPILING A PATTERN Line 2268  COMPILING A PATTERN
2268           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
2269           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
2270    
2271         These  options  override the default newline definition that was chosen         These options override the default newline definition that  was  chosen
2272         when PCRE was built. Setting the first or the second specifies  that  a         when  PCRE  was built. Setting the first or the second specifies that a
2273         newline  is  indicated  by a single character (CR or LF, respectively).         newline is indicated by a single character (CR  or  LF,  respectively).
2274         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the         Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
2275         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies         two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies
2276         that any of the three preceding sequences should be recognized. Setting         that any of the three preceding sequences should be recognized. Setting
2277         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be         PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be
2278         recognized.         recognized.
2279    
2280         In an ASCII/Unicode environment, the Unicode newline sequences are  the         In  an ASCII/Unicode environment, the Unicode newline sequences are the
2281         three  just  mentioned,  plus  the  single characters VT (vertical tab,         three just mentioned, plus the  single  characters  VT  (vertical  tab,
2282         U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-         U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
2283         arator,  U+2028),  and  PS (paragraph separator, U+2029). For the 8-bit         arator, U+2028), and PS (paragraph separator, U+2029).  For  the  8-bit
2284         library, the last two are recognized only in UTF-8 mode.         library, the last two are recognized only in UTF-8 mode.
2285    
2286         When PCRE is compiled to run in an EBCDIC (mainframe) environment,  the         When  PCRE is compiled to run in an EBCDIC (mainframe) environment, the
2287         code for CR is 0x0d, the same as ASCII. However, the character code for         code for CR is 0x0d, the same as ASCII. However, the character code for
2288         LF is normally 0x15, though in some EBCDIC environments 0x25  is  used.         LF  is  normally 0x15, though in some EBCDIC environments 0x25 is used.
2289         Whichever  of  these  is  not LF is made to correspond to Unicode's NEL         Whichever of these is not LF is made to  correspond  to  Unicode's  NEL
2290         character. EBCDIC codes are all less than 256. For  more  details,  see         character.  EBCDIC  codes  are all less than 256. For more details, see
2291         the pcrebuild documentation.         the pcrebuild documentation.
2292    
2293         The  newline  setting  in  the  options  word  uses three bits that are         The newline setting in the  options  word  uses  three  bits  that  are
2294         treated as a number, giving eight possibilities. Currently only six are         treated as a number, giving eight possibilities. Currently only six are
2295         used  (default  plus the five values above). This means that if you set         used (default plus the five values above). This means that if  you  set
2296         more than one newline option, the combination may or may not be  sensi-         more  than one newline option, the combination may or may not be sensi-
2297         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to         ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
2298         PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
2299         cause an error.         cause an error.
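
The three-bit newline field can be illustrated without linking against PCRE. In the sketch below the SK_ constants are local stand-ins; their values are what I believe pcre.h defines for the corresponding PCRE_NEWLINE_ options in the 8.x series, but that is an assumption to check against your own header. The function name newline_setting_name is hypothetical.

```c
/* Local stand-ins for the PCRE_NEWLINE_xxx bits (assumed values; verify
   against the pcre.h you actually build with). */
#define SK_NEWLINE_CR       0x00100000UL
#define SK_NEWLINE_LF       0x00200000UL
#define SK_NEWLINE_CRLF     0x00300000UL
#define SK_NEWLINE_ANY      0x00400000UL
#define SK_NEWLINE_ANYCRLF  0x00500000UL
#define SK_NEWLINE_MASK     0x00700000UL

/* Interpret the newline field of an options word as a single number. */
const char *newline_setting_name(unsigned long options)
{
    switch (options & SK_NEWLINE_MASK) {
    case 0:                  return "default";
    case SK_NEWLINE_CR:      return "CR";
    case SK_NEWLINE_LF:      return "LF";
    case SK_NEWLINE_CRLF:    return "CRLF";
    case SK_NEWLINE_ANY:     return "ANY";
    case SK_NEWLINE_ANYCRLF: return "ANYCRLF";
    default:                 return "unused";   /* would cause an error */
    }
}
```

With these values, OR-ing the CR and LF bits yields the CRLF setting, as the text says, while OR-ing LF with ANY lands on one of the two unused numbers. This is why it is safer to pass exactly one newline option.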
2300    
2301         The  only  time  that a line break in a pattern is specially recognized         The only time that a line break in a pattern  is  specially  recognized
2302         when compiling is when PCRE_EXTENDED is set. CR and LF are white  space         when  compiling is when PCRE_EXTENDED is set. CR and LF are white space
2303         characters,  and so are ignored in this mode. Also, an unescaped # out-         characters, and so are ignored in this mode. Also, an unescaped #  out-
2304         side a character class indicates a comment that lasts until  after  the         side  a  character class indicates a comment that lasts until after the
2305         next  line break sequence. In other circumstances, line break sequences         next line break sequence. In other circumstances, line break  sequences
2306         in patterns are treated as literal data.         in patterns are treated as literal data.
2307    
2308         The newline option that is set at compile time becomes the default that         The newline option that is set at compile time becomes the default that
# Line 2285  COMPILING A PATTERN Line 2311  COMPILING A PATTERN
2311           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
2312    
2313         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
2314         theses in the pattern. Any opening parenthesis that is not followed  by         theses  in the pattern. Any opening parenthesis that is not followed by
2315         ?  behaves as if it were followed by ?: but named parentheses can still         ? behaves as if it were followed by ?: but named parentheses can  still
2316         be used for capturing (and they acquire  numbers  in  the  usual  way).         be  used  for  capturing  (and  they acquire numbers in the usual way).
2317         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
2318    
2319             PCRE_NO_AUTO_POSSESS
2320    
2321           If this option is set, it disables "auto-possessification". This is  an
2322           optimization  that,  for example, turns a+b into a++b in order to avoid
2323           backtracks into a+ that can never be successful. However,  if  callouts
2324           are  in  use,  auto-possessification  means that some of them are never
2325           taken. You can set this option if you want the matching functions to do
2326           a  full  unoptimized  search and run all the callouts, but it is mainly
2327           provided for testing purposes.
2328    
2329           PCRE_NO_START_OPTIMIZE           PCRE_NO_START_OPTIMIZE
2330    
2331         This  is an option that acts at matching time; that is, it is really an         This is an option that acts at matching time; that is, it is really  an
2332         option for pcre_exec() or pcre_dfa_exec(). If  it  is  set  at  compile         option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
2333         time,  it is remembered with the compiled pattern and assumed at match-         time, it is remembered with the compiled pattern and assumed at  match-
2334         ing time. This is necessary if you want to use JIT  execution,  because         ing  time.  This is necessary if you want to use JIT execution, because
2335         the  JIT  compiler needs to know whether or not this option is set. For         the JIT compiler needs to know whether or not this option is  set.  For
2336         details see the discussion of PCRE_NO_START_OPTIMIZE below.         details see the discussion of PCRE_NO_START_OPTIMIZE below.
2337    
2338           PCRE_UCP           PCRE_UCP
2339    
2340         This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,         This  option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W,
2341         \w,  and  some  of  the POSIX character classes. By default, only ASCII         \w, and some of the POSIX character classes.  By  default,  only  ASCII
2342         characters are recognized, but if PCRE_UCP is set,  Unicode  properties         characters  are  recognized, but if PCRE_UCP is set, Unicode properties
2343         are  used instead to classify characters. More details are given in the         are used instead to classify characters. More details are given in  the
2344         section on generic character types in the pcrepattern page. If you  set         section  on generic character types in the pcrepattern page. If you set
2345         PCRE_UCP,  matching  one of the items it affects takes much longer. The         PCRE_UCP, matching one of the items it affects takes much  longer.  The
2346         option is available only if PCRE has been compiled with  Unicode  prop-         option  is  available only if PCRE has been compiled with Unicode prop-
2347         erty support.         erty support.
2348    
2349           PCRE_UNGREEDY           PCRE_UNGREEDY
2350    
2351         This  option  inverts  the "greediness" of the quantifiers so that they         This option inverts the "greediness" of the quantifiers  so  that  they
2352         are not greedy by default, but become greedy if followed by "?". It  is         are  not greedy by default, but become greedy if followed by "?". It is
2353         not  compatible  with Perl. It can also be set by a (?U) option setting         not compatible with Perl. It can also be set by a (?U)  option  setting
2354         within the pattern.         within the pattern.
2355    
2356           PCRE_UTF8           PCRE_UTF8
2357    
2358         This option causes PCRE to regard both the pattern and the  subject  as         This  option  causes PCRE to regard both the pattern and the subject as
2359         strings of UTF-8 characters instead of single-byte strings. However, it         strings of UTF-8 characters instead of single-byte strings. However, it
2360         is available only when PCRE is built to include UTF  support.  If  not,         is  available  only  when PCRE is built to include UTF support. If not,
2361         the  use  of  this option provokes an error. Details of how this option         the use of this option provokes an error. Details of  how  this  option
2362         changes the behaviour of PCRE are given in the pcreunicode page.         changes the behaviour of PCRE are given in the pcreunicode page.
2363    
2364           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
2365    
2366         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
2367         automatically  checked.  There  is  a  discussion about the validity of         automatically checked. There is a  discussion  about  the  validity  of
2368         UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence  is         UTF-8  strings in the pcreunicode page. If an invalid UTF-8 sequence is
2369         found,  pcre_compile()  returns an error. If you already know that your         found, pcre_compile() returns an error. If you already know  that  your
2370         pattern is valid, and you want to skip this check for performance  rea-         pattern  is valid, and you want to skip this check for performance rea-
2371         sons,  you  can set the PCRE_NO_UTF8_CHECK option.  When it is set, the         sons, you can set the PCRE_NO_UTF8_CHECK option.  When it is  set,  the
2372         effect of passing an invalid UTF-8 string as a pattern is undefined. It         effect of passing an invalid UTF-8 string as a pattern is undefined. It
2373         may  cause  your  program  to  crash. Note that this option can also be         may cause your program to crash or loop. Note that this option can also
2374         passed to pcre_exec() and pcre_dfa_exec(),  to  suppress  the  validity         be  passed to pcre_exec() and pcre_dfa_exec(), to suppress the validity
2375         checking  of  subject strings only. If the same string is being matched         checking of subject strings only. If the same string is  being  matched
2376         many times, the option can be safely set for the second and  subsequent         many  times, the option can be safely set for the second and subsequent
2377         matchings to improve performance.         matchings to improve performance.
2378    
2379    
2380  COMPILATION ERROR CODES  COMPILATION ERROR CODES
2381    
2382         The  following  table  lists  the  error  codes that may be returned by         The following table lists the error  codes  that  may  be  returned  by
2383         pcre_compile2(), along with the error messages that may be returned  by         pcre_compile2(),  along with the error messages that may be returned by
2384         both  compiling  functions.  Note  that error messages are always 8-bit         both compiling functions. Note that error  messages  are  always  8-bit
2385         ASCII strings, even in 16-bit or 32-bit mode. As  PCRE  has  developed,         ASCII  strings,  even  in 16-bit or 32-bit mode. As PCRE has developed,
2386         some  error codes have fallen out of use. To avoid confusion, they have         some error codes have fallen out of use. To avoid confusion, they  have
2387         not been re-used.         not been re-used.
2388    
2389            0  no error            0  no error
# Line 2384  COMPILATION ERROR CODES Line 2420  COMPILATION ERROR CODES
2420           31  POSIX collating elements are not supported           31  POSIX collating elements are not supported
2421           32  this version of PCRE is compiled without UTF support           32  this version of PCRE is compiled without UTF support
2422           33  [this code is not in use]           33  [this code is not in use]
2423           34  character value in \x{...} sequence is too large           34  character value in \x{} or \o{} is too large
2424           35  invalid condition (?(0)           35  invalid condition (?(0)
2425           36  \C not allowed in lookbehind assertion           36  \C not allowed in lookbehind assertion
2426           37  PCRE does not support \L, \l, \N{name}, \U, or \u           37  PCRE does not support \L, \l, \N{name}, \U, or \u
# Line 2432  COMPILATION ERROR CODES Line 2468  COMPILATION ERROR CODES
2468           75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)           75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
2469           76  character value in \u.... sequence is too large           76  character value in \u.... sequence is too large
2470           77  invalid UTF-32 string (specifically UTF-32)           77  invalid UTF-32 string (specifically UTF-32)
2471             78  setting UTF is disabled by the application
2472             79  non-hex character in \x{} (closing brace missing?)
2473             80  non-octal character in \o{} (closing brace missing?)
2474             81  missing opening brace after \o
2475             82  parentheses are too deeply nested
2476             83  invalid range in character class
2477    
2478         The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different         The  numbers  32  and 10000 in errors 48 and 49 are defaults; different
2479         values may be used if the limits were changed when PCRE was built.         values may be used if the limits were changed when PCRE was built.
2480    
2481    
# Line 2442  STUDYING A PATTERN Line 2484  STUDYING A PATTERN
2484         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
2485              const char **errptr);              const char **errptr);
2486    
2487         If  a  compiled  pattern is going to be used several times, it is worth         If a compiled pattern is going to be used several times,  it  is  worth
2488         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
2489         matching.  The function pcre_study() takes a pointer to a compiled pat-         matching. The function pcre_study() takes a pointer to a compiled  pat-
2490         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
2491         information  that  will  help speed up matching, pcre_study() returns a         information that will help speed up matching,  pcre_study()  returns  a
2492         pointer to a pcre_extra block, in which the study_data field points  to         pointer  to a pcre_extra block, in which the study_data field points to
2493         the results of the study.         the results of the study.
2494    
2495         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
2496         pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-         pcre_exec()  or  pcre_dfa_exec(). However, a pcre_extra block also con-
2497         tains  other  fields  that can be set by the caller before the block is         tains other fields that can be set by the caller before  the  block  is
2498         passed; these are described below in the section on matching a pattern.         passed; these are described below in the section on matching a pattern.
2499    
2500         If studying the  pattern  does  not  produce  any  useful  information,         If  studying  the  pattern  does  not  produce  any useful information,
2501         pcre_study()  returns  NULL  by  default.  In that circumstance, if the         pcre_study() returns NULL by default.  In  that  circumstance,  if  the
2502         calling program wants to pass any of the other fields to pcre_exec() or         calling program wants to pass any of the other fields to pcre_exec() or
2503         pcre_dfa_exec(),  it  must set up its own pcre_extra block. However, if         pcre_dfa_exec(), it must set up its own pcre_extra block.  However,  if
2504         pcre_study() is called  with  the  PCRE_STUDY_EXTRA_NEEDED  option,  it         pcre_study()  is  called  with  the  PCRE_STUDY_EXTRA_NEEDED option, it
2505         returns a pcre_extra block even if studying did not find any additional         returns a pcre_extra block even if studying did not find any additional
2506         information. It may still return NULL, however, if an error  occurs  in         information.  It  may still return NULL, however, if an error occurs in
2507         pcre_study().         pcre_study().
2508    
2509         The  second  argument  of  pcre_study() contains option bits. There are         The second argument of pcre_study() contains  option  bits.  There  are
2510         three further options in addition to PCRE_STUDY_EXTRA_NEEDED:         three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
2511    
2512           PCRE_STUDY_JIT_COMPILE           PCRE_STUDY_JIT_COMPILE
2513           PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE           PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
2514           PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE           PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
2515    
2516         If any of these are set, and the just-in-time  compiler  is  available,         If  any  of  these are set, and the just-in-time compiler is available,
2517         the  pattern  is  further compiled into machine code that executes much         the pattern is further compiled into machine code  that  executes  much
2518         faster than the pcre_exec()  interpretive  matching  function.  If  the         faster  than  the  pcre_exec()  interpretive  matching function. If the
2519         just-in-time  compiler is not available, these options are ignored. All         just-in-time compiler is not available, these options are ignored.  All
2520         undefined bits in the options argument must be zero.         undefined bits in the options argument must be zero.
2521    
2522         JIT compilation is a heavyweight optimization. It can  take  some  time         JIT  compilation  is  a heavyweight optimization. It can take some time
2523         for  patterns  to  be analyzed, and for one-off matches and simple pat-         for patterns to be analyzed, and for one-off matches  and  simple  pat-
2524         terns the benefit of faster execution might be offset by a much  slower         terns  the benefit of faster execution might be offset by a much slower
2525         study time.  Not all patterns can be optimized by the JIT compiler. For         study time.  Not all patterns can be optimized by the JIT compiler. For
2526         those that cannot be handled, matching automatically falls back to  the         those  that cannot be handled, matching automatically falls back to the
2527         pcre_exec()  interpreter.  For more details, see the pcrejit documenta-         pcre_exec() interpreter. For more details, see the  pcrejit  documenta-
2528         tion.         tion.
2529    
2530         The third argument for pcre_study() is a pointer for an error  message.         The  third argument for pcre_study() is a pointer for an error message.
2531         If  studying  succeeds  (even  if no data is returned), the variable it         If studying succeeds (even if no data is  returned),  the  variable  it
2532         points to is set to NULL. Otherwise it is set to  point  to  a  textual         points  to  is  set  to NULL. Otherwise it is set to point to a textual
2533         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
2534         must not try to free it. You should test the  error  pointer  for  NULL         must  not  try  to  free it. You should test the error pointer for NULL
2535         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
2536    
2537         When  you are finished with a pattern, you can free the memory used for         When you are finished with a pattern, you can free the memory used  for
2538         the study data by calling pcre_free_study(). This function was added to         the study data by calling pcre_free_study(). This function was added to
2539         the  API  for  release  8.20. For earlier versions, the memory could be         the API for release 8.20. For earlier versions,  the  memory  could  be
2540         freed with pcre_free(), just like the pattern itself. This  will  still         freed  with  pcre_free(), just like the pattern itself. This will still
2541         work  in  cases where JIT optimization is not used, but it is advisable         work in cases where JIT optimization is not used, but it  is  advisable
2542         to change to the new function when convenient.         to change to the new function when convenient.
2543    
2544         This is a typical way in which pcre_study() is used (except that  in  a         This  is  a typical way in which pcre_study() is used (except that in a
2545         real application there should be tests for errors):         real application there should be tests for errors):
2546    
2547           int rc;           int rc;
# Line 2519  STUDYING A PATTERN Line 2561  STUDYING A PATTERN
2561         Studying a pattern does two things: first, a lower bound for the length         Studying a pattern does two things: first, a lower bound for the length
2562         of subject string that is needed to match the pattern is computed. This         of subject string that is needed to match the pattern is computed. This
2563         does not mean that there are any strings of that length that match, but         does not mean that there are any strings of that length that match, but
2564         it does guarantee that no shorter strings match. The value is  used  to         it  does  guarantee that no shorter strings match. The value is used to
2565         avoid wasting time by trying to match strings that are shorter than the         avoid wasting time by trying to match strings that are shorter than the
2566         lower bound. You can find out the value in a calling  program  via  the         lower  bound.  You  can find out the value in a calling program via the
2567         pcre_fullinfo() function.         pcre_fullinfo() function.
2568    
2569         Studying a pattern is also useful for non-anchored patterns that do not         Studying a pattern is also useful for non-anchored patterns that do not
2570         have a single fixed starting character. A bitmap of  possible  starting         have  a  single fixed starting character. A bitmap of possible starting
2571         bytes  is  created. This speeds up finding a position in the subject at         bytes is created. This speeds up finding a position in the  subject  at
2572         which to start matching. (In 16-bit mode, the bitmap is used for 16-bit         which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
2573         values  less  than  256.  In 32-bit mode, the bitmap is used for 32-bit         values less than 256.  In 32-bit mode, the bitmap is  used  for  32-bit
2574         values less than 256.)         values less than 256.)
2575    
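The two study optimizations described above can be illustrated with a small stand-alone sketch. It is not PCRE's implementation: the names start_bitmap, bitmap_init(), bitmap_set(), bitmap_test() and first_candidate() are hypothetical, and the sketch assumes 8-bit subjects and a minimum match length of at least 1. It shows how a 256-bit table of possible starting bytes, combined with the lower bound on the subject length, lets a matcher skip positions at which no match can begin.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One bit per possible byte value: 256 bits = 32 bytes. */
typedef struct {
    unsigned char bits[32];
} start_bitmap;

/* Clear every bit, so that no byte is a possible starting byte. */
static void bitmap_init(start_bitmap *bm)
{
    memset(bm->bits, 0, sizeof bm->bits);
}

/* Record that a match may start with byte c. */
static void bitmap_set(start_bitmap *bm, unsigned char c)
{
    bm->bits[c >> 3] |= (unsigned char)(1u << (c & 7));
}

/* Return 1 if a match may start with byte c, otherwise 0. */
static int bitmap_test(const start_bitmap *bm, unsigned char c)
{
    return (bm->bits[c >> 3] >> (c & 7)) & 1;
}

/* Return the offset of the first position at which a match could start,
   or -1 if there is none. Positions closer than min_length to the end of
   the subject are never tried, because no match can fit there. */
static long first_candidate(const start_bitmap *bm, const char *subject,
                            size_t length, size_t min_length)
{
    size_t i;
    assert(bm != NULL && subject != NULL && min_length >= 1);
    for (i = 0; i + min_length <= length; i++) {
        if (bitmap_test(bm, (unsigned char)subject[i]))
            return (long)i;
    }
    return -1;
}
```

For a pattern such as (cat|dog), a caller of this sketch would set the bits for 'c' and 'd' and pass 3 as min_length. Only the offsets that first_candidate() returns then need a full match attempt.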
2576         These two optimizations apply to both pcre_exec() and  pcre_dfa_exec(),         These  two optimizations apply to both pcre_exec() and pcre_dfa_exec(),
2577         and  the  information  is also used by the JIT compiler.  The optimiza-         and the information is also used by the JIT  compiler.   The  optimiza-
2578         tions can be disabled by  setting  the  PCRE_NO_START_OPTIMIZE  option.         tions  can  be  disabled  by setting the PCRE_NO_START_OPTIMIZE option.
2579         You  might want to do this if your pattern contains callouts or (*MARK)         You might want to do this if your pattern contains callouts or  (*MARK)
2580         and you want to make use of these facilities in  cases  where  matching         and  you  want  to make use of these facilities in cases where matching
2581         fails.         fails.
2582    
2583         PCRE_NO_START_OPTIMIZE  can be specified at either compile time or exe-         PCRE_NO_START_OPTIMIZE can be specified at either compile time or  exe-
2584         cution  time.  However,  if   PCRE_NO_START_OPTIMIZE   is   passed   to         cution   time.   However,   if   PCRE_NO_START_OPTIMIZE  is  passed  to
2585         pcre_exec(), (that is, after any JIT compilation has happened) JIT exe-         pcre_exec(), (that is, after any JIT compilation has happened) JIT exe-
2586         cution is disabled. For JIT execution to work with  PCRE_NO_START_OPTI-         cution  is disabled. For JIT execution to work with PCRE_NO_START_OPTI-
2587         MIZE, the option must be set at compile time.         MIZE, the option must be set at compile time.
2588    
2589         There is a longer discussion of PCRE_NO_START_OPTIMIZE below.         There is a longer discussion of PCRE_NO_START_OPTIMIZE below.
# Line 2549  STUDYING A PATTERN Line 2591  STUDYING A PATTERN
2591    
2592  LOCALE SUPPORT  LOCALE SUPPORT
2593    
2594         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
2595         letters, digits, or whatever, by reference to a set of tables,  indexed         letters,  digits, or whatever, by reference to a set of tables, indexed
2596         by  character  value.  When running in UTF-8 mode, this applies only to         by character code point. When running in UTF-8 mode, or in the  16-  or
2597         characters with codes less than 128. By  default,  higher-valued  codes         32-bit libraries, this applies only to characters with code points less
2598         never match escapes such as \w or \d, but they can be tested with \p if         than 256. By default, higher-valued code  points  never  match  escapes
2599         PCRE is built with Unicode character property  support.  Alternatively,         such  as \w or \d. However, if PCRE is built with Unicode property sup-
2600         the  PCRE_UCP  option  can  be  set at compile time; this causes \w and         port, all characters can be tested with \p and \P,  or,  alternatively,
2601         friends to use Unicode property support instead of built-in tables. The         the  PCRE_UCP option can be set when a pattern is compiled; this causes
2602         use of locales with Unicode is discouraged. If you are handling charac-         \w and friends to use Unicode property support instead of the  built-in
2603         ters with codes greater than 128, you should either use UTF-8 and  Uni-         tables.
2604         code, or use locales, but not try to mix the two.  
2605           The  use  of  locales  with Unicode is discouraged. If you are handling
2606           characters with code points greater than 128,  you  should  either  use
2607           Unicode support, or use locales, but not try to mix the two.
2608    
2609         PCRE  contains  an  internal set of tables that are used when the final         PCRE  contains  an  internal set of tables that are used when the final
2610         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
# Line 2575  LOCALE SUPPORT Line 2620  LOCALE SUPPORT
2620    
2621         External  tables  are  built by calling the pcre_maketables() function,         External  tables  are  built by calling the pcre_maketables() function,
2622         which has no arguments, in the relevant locale. The result can then  be         which has no arguments, in the relevant locale. The result can then  be
2623         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For         passed  to  pcre_compile() as often as necessary. For example, to build
2624         example, to build and use tables that are appropriate  for  the  French         and use tables that  are  appropriate  for  the  French  locale  (where
2625         locale  (where  accented  characters  with  values greater than 128 are         accented  characters  with  values greater than 128 are treated as let-
2626         treated as letters), the following code could be used:         ters), the following code could be used:
2627    
2628           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
2629           tables = pcre_maketables();           tables = pcre_maketables();
# Line 2594  LOCALE SUPPORT Line 2639  LOCALE SUPPORT
2639    
2640         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
2641         pattern,  and the same tables are used via this pointer by pcre_study()         pattern,  and the same tables are used via this pointer by pcre_study()
2642         and normally also by pcre_exec(). Thus, by default, for any single pat-         and also by pcre_exec() and pcre_dfa_exec(). Thus, for any single  pat-
2643         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
2644         but different patterns can be compiled in different locales.         but different patterns can be processed in different locales.
2645    
2646         It is possible to pass a table pointer or NULL (indicating the  use  of         It is possible to pass a table pointer or NULL (indicating the  use  of
2647         the  internal  tables)  to  pcre_exec(). Although not intended for this         the internal tables) to pcre_exec() or pcre_dfa_exec() (see the discus-
2648         purpose, this facility could be used to match a pattern in a  different         sion below in the section on matching a pattern). This facility is pro-
2649         locale from the one in which it was compiled. Passing table pointers at         vided  for  use  with  pre-compiled  patterns  that have been saved and
2650         run time is discussed below in the section on matching a pattern.         reloaded.  Character tables are not saved with patterns, so if  a  non-
2651           standard table was used at compile time, it must be provided again when
2652           the reloaded pattern is matched. Attempting to  use  this  facility  to
2653           match a pattern in a different locale from the one in which it was com-
2654           piled is likely to lead to anomalous (usually incorrect) results.
2655    
2656    
2657  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
# Line 2743  INFORMATION ABOUT A PATTERN Line 2792  INFORMATION ABOUT A PATTERN
2792         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
2793    
2794         Since for the 32-bit library using the non-UTF-32 mode,  this  function         Since for the 32-bit library using the non-UTF-32 mode,  this  function
2795         is  unable to return the full 32-bit range of the character, this value         is  unable to return the full 32-bit range of characters, this value is
2796         is   deprecated;   instead    the    PCRE_INFO_REQUIREDCHARFLAGS    and         deprecated;     instead     the     PCRE_INFO_REQUIREDCHARFLAGS     and
2797         PCRE_INFO_REQUIREDCHAR values should be used.         PCRE_INFO_REQUIREDCHAR values should be used.
2798    
2799             PCRE_INFO_MATCH_EMPTY
2800    
2801           Return  1  if  the  pattern can match an empty string, otherwise 0. The
2802           fourth argument should point to an int variable.
2803    
2804           PCRE_INFO_MATCHLIMIT           PCRE_INFO_MATCHLIMIT
2805    
2806         If  the  pattern  set  a  match  limit by including an item of the form         If the pattern set a match limit by  including  an  item  of  the  form
2807         (*LIMIT_MATCH=nnnn) at the start, the value  is  returned.  The  fourth         (*LIMIT_MATCH=nnnn)  at  the  start,  the value is returned. The fourth
2808         argument  should  point to an unsigned 32-bit integer. If no such value         argument should point to an unsigned 32-bit integer. If no  such  value
2809         has  been  set,  the  call  to  pcre_fullinfo()   returns   the   error         has   been   set,   the  call  to  pcre_fullinfo()  returns  the  error
2810         PCRE_ERROR_UNSET.         PCRE_ERROR_UNSET.
2811    
2812           PCRE_INFO_MAXLOOKBEHIND           PCRE_INFO_MAXLOOKBEHIND
2813    
2814         Return  the  number  of  characters  (NB not data units) in the longest         Return the number of characters (NB not  data  units)  in  the  longest
2815         lookbehind assertion in the pattern. This information  is  useful  when         lookbehind  assertion  in  the pattern. This information is useful when
2816         doing  multi-segment  matching  using  the partial matching facilities.         doing multi-segment matching using  the  partial  matching  facilities.
2817         Note that the simple assertions \b and \B require a one-character look-         Note that the simple assertions \b and \B require a one-character look-
2818         behind.  \A  also  registers a one-character lookbehind, though it does         behind. \A also registers a one-character lookbehind,  though  it  does
2819         not actually inspect the previous character. This is to ensure that  at         not  actually inspect the previous character. This is to ensure that at
2820         least one character from the old segment is retained when a new segment         least one character from the old segment is retained when a new segment
2821         is processed. Otherwise, if there are no lookbehinds in the pattern, \A         is processed. Otherwise, if there are no lookbehinds in the pattern, \A
2822         might match incorrectly at the start of a new segment.         might match incorrectly at the start of a new segment.
2823    
2824           PCRE_INFO_MINLENGTH           PCRE_INFO_MINLENGTH
2825    
2826         If  the  pattern  was studied and a minimum length for matching subject         If the pattern was studied and a minimum length  for  matching  subject
2827         strings was computed, its value is  returned.  Otherwise  the  returned         strings  was  computed,  its  value is returned. Otherwise the returned
2828         value is -1. The value is a number of characters, which in UTF mode may         value is -1. The value is a number of characters, which in UTF mode may
2829         be different from the number of data units. The fourth argument  should         be  different from the number of data units. The fourth argument should
2830         point  to an int variable. A non-negative value is a lower bound to the         point to an int variable. A non-negative value is a lower bound to  the
2831         length of any matching string. There may not be  any  strings  of  that         length  of  any  matching  string. There may not be any strings of that
2832         length  that  do actually match, but every string that does match is at         length that do actually match, but every string that does match  is  at
2833         least that long.         least that long.
2834    
2835           PCRE_INFO_NAMECOUNT           PCRE_INFO_NAMECOUNT
2836           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
2837           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
2838    
2839         PCRE supports the use of named as well as numbered capturing  parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
2840         ses.  The names are just an additional way of identifying the parenthe-         ses. The names are just an additional way of identifying the  parenthe-
2841         ses, which still acquire numbers. Several convenience functions such as         ses, which still acquire numbers. Several convenience functions such as
2842         pcre_get_named_substring()  are  provided  for extracting captured sub-         pcre_get_named_substring() are provided for  extracting  captured  sub-
2843         strings by name. It is also possible to extract the data  directly,  by         strings  by  name. It is also possible to extract the data directly, by
2844         first  converting  the  name to a number in order to access the correct         first converting the name to a number in order to  access  the  correct
2845         pointers in the output vector (described with pcre_exec() below). To do         pointers in the output vector (described with pcre_exec() below). To do
2846         the  conversion,  you  need  to  use  the  name-to-number map, which is         the conversion, you need  to  use  the  name-to-number  map,  which  is
2847         described by these three values.         described by these three values.
2848    
2849         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2850         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2851         of each entry; both of these  return  an  int  value.  The  entry  size         of  each  entry;  both  of  these  return  an int value. The entry size
2852         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
2853         a pointer to the first entry of the table. This is a pointer to char in         a pointer to the first entry of the table. This is a pointer to char in
2854         the 8-bit library, where the first two bytes of each entry are the num-         the 8-bit library, where the first two bytes of each entry are the num-
2855         ber of the capturing parenthesis, most significant byte first.  In  the         ber  of  the capturing parenthesis, most significant byte first. In the
2856         16-bit  library,  the pointer points to 16-bit data units, the first of         16-bit library, the pointer points to 16-bit data units, the  first  of
2857         which contains the parenthesis  number.  In  the  32-bit  library,  the         which  contains  the  parenthesis  number.  In  the 32-bit library, the
2858         pointer  points  to  32-bit data units, the first of which contains the         pointer points to 32-bit data units, the first of  which  contains  the
2859         parenthesis number. The rest of the entry is  the  corresponding  name,         parenthesis  number.  The  rest of the entry is the corresponding name,
2860         zero terminated.         zero terminated.
2861    
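The entry layout just described for the 8-bit library can be walked by hand, as in the following minimal sketch. The helper find_group_number() is hypothetical and is not part of PCRE; it assumes that the table pointer, the entry count and the entry size have already been obtained with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE. It uses a simple linear scan and returns the first entry whose name matches. In most programs the library's own pcre_get_stringnumber() function is the more convenient route.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Search an 8-bit name table for a group name. Each fixed-size entry
   holds the group number in its first two bytes, most significant byte
   first, followed by the zero-terminated name. Returns the group
   number, or -1 if the name is not in the table. */
static int find_group_number(const unsigned char *table, int name_count,
                             int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

The same walk applies to the 16-bit and 32-bit libraries if the element type is changed to the corresponding data unit and the group number is read from the first unit of each entry instead of from two bytes.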
2862         The  names are in alphabetical order. Duplicate names may appear if (?|         The names are in alphabetical order. If (?| is used to create  multiple
2863         is used to create multiple groups with the same number, as described in         groups  with  the same number, as described in the section on duplicate
2864         the  section  on  duplicate subpattern numbers in the pcrepattern page.         subpattern numbers in the pcrepattern page, the groups may be given the
2865         Duplicate names for subpatterns with different  numbers  are  permitted         same  name,  but  there is only one entry in the table. Different names
2866         only  if  PCRE_DUPNAMES  is  set. In all cases of duplicate names, they         for groups of the same number are not permitted.  Duplicate  names  for
2867         appear in the table in the order in which they were found in  the  pat-         subpatterns with different numbers are permitted, but only if PCRE_DUP-
2868         tern.  In  the  absence  of (?| this is the order of increasing number;         NAMES is set. They appear in the table in the order in which they  were
2869         when (?| is used this is not necessarily the case because later subpat-         found  in  the  pattern.  In  the  absence  of (?| this is the order of
2870         terns may have lower numbers.         increasing number; when (?| is used this is not  necessarily  the  case
2871           because later subpatterns may have lower numbers.
2872    
2873         As  a  simple  example of the name/number table, consider the following         As  a  simple  example of the name/number table, consider the following
2874         pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is         pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
# Line 2923  INFORMATION ABOUT A PATTERN Line 2978  INFORMATION ABOUT A PATTERN
2978    
2979           PCRE_INFO_FIRSTCHARACTER           PCRE_INFO_FIRSTCHARACTER
2980    
2981         Return  the  fixed  first character value, if PCRE_INFO_FIRSTCHARACTER-         Return   the  fixed  first  character  value  in  the  situation  where
2982         FLAGS returned 1; otherwise returns 0. The fourth argument should point         PCRE_INFO_FIRSTCHARACTERFLAGS returns 1; otherwise return 0. The fourth
2983         to an uint_t variable.         argument should point to an uint_t variable.
2984    
2985         In  the 8-bit library, the value is always less than 256. In the 16-bit         In  the 8-bit library, the value is always less than 256. In the 16-bit
2986         library the value can be up to 0xffff. In the 32-bit library in  UTF-32         library the value can be up to 0xffff. In the 32-bit library in  UTF-32
2987         mode  the  value  can  be up to 0x10ffff, and up to 0xffffffff when not         mode  the  value  can  be up to 0x10ffff, and up to 0xffffffff when not
2988         using UTF-32 mode.         using UTF-32 mode.
2989    
        If there is no fixed first value, and if either  
   
        (a) the pattern was compiled with the PCRE_MULTILINE option, and  every  
        branch starts with "^", or  
   
        (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not  
        set (if it were set, the pattern would be anchored),  
   
        -1 is returned, indicating that the pattern matches only at  the  start  
        of  a  subject string or after any newline within the string. Otherwise  
        -2 is returned. For anchored patterns, -2 is returned.  
   
2990           PCRE_INFO_REQUIREDCHARFLAGS           PCRE_INFO_REQUIREDCHARFLAGS
2991    
2992         Returns 1 if there is a rightmost literal data unit that must exist  in         Returns 1 if there is a rightmost literal data unit that must exist  in
# Line 3132  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3175  MATCHING A PATTERN: THE TRADITIONAL FUNC
3175         The callout_data field is used in conjunction with the  "callout"  fea-         The callout_data field is used in conjunction with the  "callout"  fea-
3176         ture, and is described in the pcrecallout documentation.         ture, and is described in the pcrecallout documentation.
3177    
3178         The  tables  field  is  used  to  pass  a  character  tables pointer to         The  tables field is provided for use with patterns that have been pre-
3179         pcre_exec(); this overrides the value that is stored with the  compiled         compiled using custom character tables, saved to disc or elsewhere, and
3180         pattern.  A  non-NULL value is stored with the compiled pattern only if         then  reloaded,  because the tables that were used to compile a pattern
3181         custom tables were supplied to pcre_compile() via  its  tableptr  argu-         are not saved with it. See the pcreprecompile documentation for a  dis-
3182         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         cussion  of  saving  compiled patterns for later use. If NULL is passed
3183         PCRE's internal tables to be used. This facility is  helpful  when  re-         using this mechanism, it forces PCRE's internal tables to be used.
3184         using  patterns  that  have been saved after compiling with an external  
3185         set of tables, because the external tables  might  be  at  a  different         Warning: The tables that pcre_exec() uses must be  the  same  as  those
3186         address  when  pcre_exec() is called. See the pcreprecompile documenta-         that  were used when the pattern was compiled. If this is not the case,
3187         tion for a discussion of saving compiled patterns for later use.         the behaviour of pcre_exec() is undefined. Therefore, when a pattern is
3188           compiled  and  matched  in the same process, this field should never be
3189           set. In this (the most common) case, the correct table pointer is auto-
3190           matically  passed  with  the  compiled  pattern  from pcre_compile() to
3191           pcre_exec().
3192    
3193         If PCRE_EXTRA_MARK is set in the flags field, the mark  field  must  be         If PCRE_EXTRA_MARK is set in the flags field, the mark  field  must  be
3194         set  to point to a suitable variable. If the pattern contains any back-         set  to point to a suitable variable. If the pattern contains any back-
# Line 3350  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3397  MATCHING A PATTERN: THE TRADITIONAL FUNC
3397         points  to  the  start of a character (or the end of the subject). When         points  to  the  start of a character (or the end of the subject). When
3398         PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a         PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
3399         subject  or  an invalid value of startoffset is undefined. Your program         subject  or  an invalid value of startoffset is undefined. Your program
3400         may crash.         may crash or loop.
3401    
3402           PCRE_PARTIAL_HARD           PCRE_PARTIAL_HARD
3403           PCRE_PARTIAL_SOFT           PCRE_PARTIAL_SOFT
# Line 4130  MATCHING A PATTERN: THE ALTERNATIVE FUNC Line 4177  MATCHING A PATTERN: THE ALTERNATIVE FUNC
4177         filled  with  the  longest matches. Unlike pcre_exec(), pcre_dfa_exec()         filled  with  the  longest matches. Unlike pcre_exec(), pcre_dfa_exec()
4178         can use the entire ovector for returning matched strings.         can use the entire ovector for returning matched strings.
4179    
4180           NOTE: PCRE's "auto-possessification" optimization  usually  applies  to
4181           character  repeats at the end of a pattern (as well as internally). For
4182           example, the pattern "a\d+" is compiled as if it were  "a\d++"  because
4183           there is no point even considering the possibility of backtracking into
4184           the repeated digits. For DFA matching, this means that only one  possi-
4185           ble  match  is  found.  If  you really do want multiple matches in such
4186           cases,  either  use  an  ungreedy   repeat   ("a\d+?")   or   set   the
4187           PCRE_NO_AUTO_POSSESS option when compiling.
4188    
4189     Error returns from pcre_dfa_exec()     Error returns from pcre_dfa_exec()
4190    
4191         The pcre_dfa_exec() function returns a negative number when  it  fails.         The  pcre_dfa_exec()  function returns a negative number when it fails.
4192         Many  of  the  errors  are  the  same as for pcre_exec(), and these are         Many of the errors are the same  as  for  pcre_exec(),  and  these  are
4193         described above.  There are in addition the following errors  that  are         described  above.   There are in addition the following errors that are
4194         specific to pcre_dfa_exec():         specific to pcre_dfa_exec():
4195    
4196           PCRE_ERROR_DFA_UITEM      (-16)           PCRE_ERROR_DFA_UITEM      (-16)
4197    
4198         This  return is given if pcre_dfa_exec() encounters an item in the pat-         This return is given if pcre_dfa_exec() encounters an item in the  pat-
4199         tern that it does not support, for instance, the use of \C  or  a  back         tern  that  it  does not support, for instance, the use of \C or a back
4200         reference.         reference.
4201    
4202           PCRE_ERROR_DFA_UCOND      (-17)           PCRE_ERROR_DFA_UCOND      (-17)
4203    
4204         This  return  is  given  if pcre_dfa_exec() encounters a condition item         This return is given if pcre_dfa_exec()  encounters  a  condition  item
4205         that uses a back reference for the condition, or a test  for  recursion         that  uses  a back reference for the condition, or a test for recursion
4206         in a specific group. These are not supported.         in a specific group. These are not supported.
4207    
4208           PCRE_ERROR_DFA_UMLIMIT    (-18)           PCRE_ERROR_DFA_UMLIMIT    (-18)
4209    
4210         This  return  is given if pcre_dfa_exec() is called with an extra block         This return is given if pcre_dfa_exec() is called with an  extra  block
4211         that contains a setting of  the  match_limit  or  match_limit_recursion         that  contains  a  setting  of the match_limit or match_limit_recursion
4212         fields.  This  is  not  supported (these fields are meaningless for DFA         fields. This is not supported (these fields  are  meaningless  for  DFA
4213         matching).         matching).
4214    
4215           PCRE_ERROR_DFA_WSSIZE     (-19)           PCRE_ERROR_DFA_WSSIZE     (-19)
4216    
4217         This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the         This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
4218         workspace vector.         workspace vector.
4219    
4220           PCRE_ERROR_DFA_RECURSE    (-20)           PCRE_ERROR_DFA_RECURSE    (-20)
4221    
4222         When  a  recursive subpattern is processed, the matching function calls         When a recursive subpattern is processed, the matching  function  calls
4223         itself recursively, using private vectors for  ovector  and  workspace.         itself  recursively,  using  private vectors for ovector and workspace.
4224         This  error  is  given  if  the output vector is not large enough. This         This error is given if the output vector  is  not  large  enough.  This
4225         should be extremely rare, as a vector of size 1000 is used.         should be extremely rare, as a vector of size 1000 is used.
4226    
4227           PCRE_ERROR_DFA_BADRESTART (-30)           PCRE_ERROR_DFA_BADRESTART (-30)
4228    
4229         When pcre_dfa_exec() is called with the PCRE_DFA_RESTART  option,  some         When  pcre_dfa_exec()  is called with the PCRE_DFA_RESTART option, some
4230         plausibility  checks  are  made on the contents of the workspace, which         plausibility checks are made on the contents of  the  workspace,  which
4231         should contain data about the previous partial match. If any  of  these         should  contain  data about the previous partial match. If any of these
4232         checks fail, this error is given.         checks fail, this error is given.
4233    
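As a rough illustration of the list above, a caller might map the DFA-specific return codes to short messages. This is only a sketch: dfa_error_text() is a hypothetical helper, not part of PCRE. The numeric values are copied from the list above, and the guarded definitions exist only so that the fragment stands alone without <pcre.h>.

```c
#include <stddef.h>

/* Error numbers copied from the list above. In a real program they come
   from <pcre.h>; the guard only keeps this sketch self-contained. */
#ifndef PCRE_ERROR_DFA_UITEM
#define PCRE_ERROR_DFA_UITEM      (-16)
#define PCRE_ERROR_DFA_UCOND      (-17)
#define PCRE_ERROR_DFA_UMLIMIT    (-18)
#define PCRE_ERROR_DFA_WSSIZE     (-19)
#define PCRE_ERROR_DFA_RECURSE    (-20)
#define PCRE_ERROR_DFA_BADRESTART (-30)
#endif

/* Hypothetical helper: returns a short description of a DFA-specific
   error code, or NULL if rc is not one of the codes listed above. */
static const char *dfa_error_text(int rc)
{
  switch (rc)
    {
    case PCRE_ERROR_DFA_UITEM:      return "unsupported item in pattern";
    case PCRE_ERROR_DFA_UCOND:      return "unsupported condition";
    case PCRE_ERROR_DFA_UMLIMIT:    return "match limits not supported";
    case PCRE_ERROR_DFA_WSSIZE:     return "workspace vector too small";
    case PCRE_ERROR_DFA_RECURSE:    return "recursion output vector too small";
    case PCRE_ERROR_DFA_BADRESTART: return "bad workspace data on restart";
    default:                        return NULL;
    }
}
```

A NULL result means the code is not DFA-specific, so the caller would fall back to handling the errors shared with pcre_exec().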
4234    
4235  SEE ALSO  SEE ALSO
4236    
4237         pcre16(3),   pcre32(3),  pcrebuild(3),  pcrecallout(3),  pcrecpp(3),         pcre16(3),  pcre32(3),  pcrebuild(3),  pcrecallout(3),   pcrecpp(3),
4238         pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-         pcrematching(3), pcrepartial(3), pcreposix(3), pcreprecompile(3), pcre-
4239         sample(3), pcrestack(3).         sample(3), pcrestack(3).
4240    
# Line 4192  AUTHOR Line 4248  AUTHOR
4248    
4249  REVISION  REVISION
4250    
4251         Last updated: 12 June 2013         Last updated: 12 November 2013
4252         Copyright (c) 1997-2013 University of Cambridge.         Copyright (c) 1997-2013 University of Cambridge.
4253  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4254    
4255    
4256  PCRECALLOUT(3)             Library Functions Manual             PCRECALLOUT(3)  PCRECALLOUT(3)             Library Functions Manual             PCRECALLOUT(3)
4257    
4258    
# Line 4255  DESCRIPTION Line 4311  DESCRIPTION
4311         independent groups).         independent groups).
4312    
4313         Automatic callouts can be used for tracking  the  progress  of  pattern         Automatic callouts can be used for tracking  the  progress  of  pattern
4314         matching.  The pcretest command has an option that sets automatic call-         matching.   The pcretest program has a pattern qualifier (/C) that sets
4315         outs; when it is used, the output indicates how the pattern is matched.         automatic callouts; when it is used, the output indicates how the  pat-
4316         This  is useful information when you are trying to optimize the perfor-         tern  is  being matched. This is useful information when you are trying
4317         mance of a particular pattern.         to optimize the performance of a particular pattern.
4318    
4319    
4320  MISSING CALLOUTS  MISSING CALLOUTS
4321    
4322         You should be aware that, because of  optimizations  in  the  way  PCRE         You should be aware that, because of optimizations in the way PCRE com-
4323         matches  patterns  by  default,  callouts  sometimes do not happen. For         piles and matches patterns, callouts sometimes do not happen exactly as
4324         example, if the pattern is         you might expect.
4325    
4326           At compile time, PCRE "auto-possessifies" repeated items when it  knows
4327           that  what follows cannot be part of the repeat. For example, a+[bc] is
4328           compiled as if it were a++[bc]. The pcretest output when  this  pattern
4329           is  anchored  and  then  applied  with automatic callouts to the string
4330           "aaaa" is:
4331    
4332             --->aaaa
4333              +0 ^        ^
4334              +1 ^        a+
4335              +3 ^   ^    [bc]
4336             No match
4337    
4338           This indicates that when matching [bc] fails, there is no  backtracking
4339           into  a+  and  therefore the callouts that would be taken for the back-
4340           tracks do not occur.  You can disable the  auto-possessify  feature  by
4341           passing PCRE_NO_AUTO_POSSESS to pcre_compile(), or starting the pattern
4342           with (*NO_AUTO_POSSESS). If this is done  in  pcretest  (using  the  /O
4343           qualifier), the output changes to this:
4344    
4345             --->aaaa
4346              +0 ^        ^
4347              +1 ^        a+
4348              +3 ^   ^    [bc]
4349              +3 ^  ^     [bc]
4350              +3 ^ ^      [bc]
4351              +3 ^^       [bc]
4352             No match
4353    
4354           This time, when matching [bc] fails, the matcher backtracks into a+ and
4355           tries again, repeatedly, until a+ itself fails.
4356    
4357           Other optimizations that provide fast "no match"  results  also  affect
4358           callouts.  For example, if the pattern is
4359    
4360           ab(?C4)cd           ab(?C4)cd
4361    
4362         PCRE knows that any matching string must contain the letter "d". If the         PCRE knows that any matching string must contain the letter "d". If the
4363         subject  string  is "abyz", the lack of "d" means that matching doesn't         subject string is "abyz", the lack of "d" means that  matching  doesn't
4364         ever start, and the callout is never  reached.  However,  with  "abyd",         ever  start,  and  the  callout is never reached. However, with "abyd",
4365         though the result is still no match, the callout is obeyed.         though the result is still no match, the callout is obeyed.
4366    
4367         If  the pattern is studied, PCRE knows the minimum length of a matching         If the pattern is studied, PCRE knows the minimum length of a  matching
4368         string, and will immediately give a "no match" return without  actually         string,  and will immediately give a "no match" return without actually
4369         running  a  match if the subject is not long enough, or, for unanchored         running a match if the subject is not long enough, or,  for  unanchored
4370         patterns, if it has been scanned far enough.         patterns, if it has been scanned far enough.
4371    
4372         You can disable these optimizations by passing the  PCRE_NO_START_OPTI-         You  can disable these optimizations by passing the PCRE_NO_START_OPTI-
4373         MIZE  option  to the matching function, or by starting the pattern with         MIZE option to the matching function, or by starting the  pattern  with
4374         (*NO_START_OPT). This slows down the matching process, but does  ensure         (*NO_START_OPT).  This slows down the matching process, but does ensure
4375         that callouts such as the example above are obeyed.         that callouts such as the example above are obeyed.
4376    
4377    
4378  THE CALLOUT INTERFACE  THE CALLOUT INTERFACE
4379    
4380         During  matching, when PCRE reaches a callout point, the external func-         During matching, when PCRE reaches a callout point, the external  func-
4381         tion defined by pcre_callout or pcre[16|32]_callout is called (if it is         tion defined by pcre_callout or pcre[16|32]_callout is called (if it is
4382         set).  This  applies to both normal and DFA matching. The only argument         set). This applies to both normal and DFA matching. The  only  argument
4383         to  the  callout  function  is  a  pointer   to   a   pcre_callout   or         to   the   callout   function   is  a  pointer  to  a  pcre_callout  or
4384         pcre[16|32]_callout  block.   These  structures  contain  the following         pcre[16|32]_callout block.  These  structures  contain   the  following
4385         fields:         fields:
4386    
4387           int           version;           int           version;
# Line 4312  THE CALLOUT INTERFACE Line 4402  THE CALLOUT INTERFACE
4402           const PCRE_UCHAR16  *mark;       (16-bit version)           const PCRE_UCHAR16  *mark;       (16-bit version)
4403           const PCRE_UCHAR32  *mark;       (32-bit version)           const PCRE_UCHAR32  *mark;       (32-bit version)
4404    
4405         The version field is an integer containing the version  number  of  the         The  version  field  is an integer containing the version number of the
4406         block  format. The initial version was 0; the current version is 2. The         block format. The initial version was 0; the current version is 2.  The
4407         version number will change again in future  if  additional  fields  are         version  number  will  change  again in future if additional fields are
4408         added, but the intention is never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
4409    
4410         The  callout_number  field  contains the number of the callout, as com-         The callout_number field contains the number of the  callout,  as  com-
4411         piled into the pattern (that is, the number after ?C for  manual  call-         piled  into  the pattern (that is, the number after ?C for manual call-
4412         outs, and 255 for automatically generated callouts).         outs, and 255 for automatically generated callouts).
4413    
4414         The  offset_vector field is a pointer to the vector of offsets that was         The offset_vector field is a pointer to the vector of offsets that  was
4415         passed by the caller to the  matching  function.  When  pcre_exec()  or         passed  by  the  caller  to  the matching function. When pcre_exec() or
4416         pcre[16|32]_exec()  is used, the contents can be inspected, in order to         pcre[16|32]_exec() is used, the contents can be inspected, in order  to
4417         extract substrings that have been matched so far, in the  same  way  as         extract  substrings  that  have been matched so far, in the same way as
4418         for  extracting  substrings  after  a  match has completed. For the DFA         for extracting substrings after a match  has  completed.  For  the  DFA
4419         matching functions, this field is not useful.         matching functions, this field is not useful.
4420    
4421         The subject and subject_length fields contain copies of the values that         The subject and subject_length fields contain copies of the values that
4422         were passed to the matching function.         were passed to the matching function.
4423    
4424         The  start_match  field normally contains the offset within the subject         The start_match field normally contains the offset within  the  subject
4425         at which the current match attempt  started.  However,  if  the  escape         at  which  the  current  match  attempt started. However, if the escape
4426         sequence  \K has been encountered, this value is changed to reflect the         sequence \K has been encountered, this value is changed to reflect  the
4427         modified starting point. If the pattern is not  anchored,  the  callout         modified  starting  point.  If the pattern is not anchored, the callout
4428         function may be called several times from the same point in the pattern         function may be called several times from the same point in the pattern
4429         for different starting points in the subject.         for different starting points in the subject.
4430    
4431         The current_position field contains the offset within  the  subject  of         The  current_position  field  contains the offset within the subject of
4432         the current match pointer.         the current match pointer.
4433    
4434         When  pcre_exec()  or  pcre[16|32]_exec()  is  used,  the  capture_top         When  pcre_exec()  or  pcre[16|32]_exec()  is  used,  the  capture_top
4435         field contains one more than the number of the  highest  numbered  cap-         field  contains  one  more than the number of the highest numbered cap-
4436         tured  substring so far. If no substrings have been captured, the value         tured substring so far. If no substrings have been captured, the  value
4437         of capture_top is one. This is always the case when the  DFA  functions         of  capture_top  is one. This is always the case when the DFA functions
4438         are used, because they do not support captured substrings.         are used, because they do not support captured substrings.
4439    
4440         The  capture_last  field  contains the number of the most recently cap-         The capture_last field contains the number of the  most  recently  cap-
4441         tured substring. However, when a recursion exits, the value reverts  to         tured  substring. However, when a recursion exits, the value reverts to
4442         what  it  was  outside  the recursion, as do the values of all captured         what it was outside the recursion, as do the  values  of  all  captured
4443         substrings. If no substrings have been  captured,  the  value  of  cap-         substrings.  If  no  substrings  have  been captured, the value of cap-
4444         ture_last  is  -1.  This  is always the case for the DFA matching func-         ture_last is -1. This is always the case for  the  DFA  matching  func-
4445         tions.         tions.
4446    
4447         The callout_data field contains a value that is passed  to  a  matching         The  callout_data  field  contains a value that is passed to a matching
4448         function  specifically so that it can be passed back in callouts. It is         function specifically so that it can be passed back in callouts. It  is
4449         passed in the callout_data field of a pcre_extra  or  pcre[16|32]_extra         passed  in  the callout_data field of a pcre_extra or pcre[16|32]_extra
4450         data  structure.  If no such data was passed, the value of callout_data         data structure. If no such data was passed, the value  of  callout_data
4451         in a callout block is NULL. There is a description  of  the  pcre_extra         in  a  callout  block is NULL. There is a description of the pcre_extra
4452         structure in the pcreapi documentation.         structure in the pcreapi documentation.
4453    
4454         The  pattern_position  field  is  present from version 1 of the callout         The pattern_position field is present from version  1  of  the  callout
4455         structure. It contains the offset to the next item to be matched in the         structure. It contains the offset to the next item to be matched in the
4456         pattern string.         pattern string.
4457    
4458         The  next_item_length  field  is  present from version 1 of the callout         The next_item_length field is present from version  1  of  the  callout
4459         structure. It contains the length of the next item to be matched in the         structure. It contains the length of the next item to be matched in the
4460         pattern  string.  When  the callout immediately precedes an alternation         pattern string. When the callout immediately  precedes  an  alternation
4461         bar, a closing parenthesis, or the end of the pattern,  the  length  is         bar,  a  closing  parenthesis, or the end of the pattern, the length is
4462         zero.  When  the callout precedes an opening parenthesis, the length is         zero. When the callout precedes an opening parenthesis, the  length  is
4463         that of the entire subpattern.         that of the entire subpattern.
4464    
4465         The pattern_position and next_item_length fields are intended  to  help         The  pattern_position  and next_item_length fields are intended to help
4466         in  distinguishing between different automatic callouts, which all have         in distinguishing between different automatic callouts, which all  have
4467         the same callout number. However, they are set for all callouts.         the same callout number. However, they are set for all callouts.
4468    
4469         The mark field is present from version 2 of the callout  structure.  In         The  mark  field is present from version 2 of the callout structure. In
4470         callouts  from  pcre_exec() or pcre[16|32]_exec() it contains a pointer         callouts from pcre_exec() or pcre[16|32]_exec() it contains  a  pointer
4471         to the zero-terminated  name  of  the  most  recently  passed  (*MARK),         to  the  zero-terminated  name  of  the  most  recently passed (*MARK),
4472         (*PRUNE),  or  (*THEN) item in the match, or NULL if no such items have         (*PRUNE), or (*THEN) item in the match, or NULL if no such  items  have
4473         been passed. Instances of (*PRUNE) or (*THEN) without  a  name  do  not         been  passed.  Instances  of  (*PRUNE) or (*THEN) without a name do not
4474         obliterate  a previous (*MARK). In callouts from the DFA matching func-         obliterate a previous (*MARK). In callouts from the DFA matching  func-
4475         tions this field always contains NULL.         tions this field always contains NULL.
4476    
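The way offset_vector, subject, and capture_top fit together can be sketched as follows. The structure here is a reduced stand-in for the 8-bit callout block, holding only fields named in the descriptions above. Both demo_callout_block and demo_capture_so_far() are hypothetical names used for illustration; a real callout function receives a pointer to the block declared in <pcre.h>. The sketch assumes that a group that has not been set has negative offsets, as it does after a completed match.

```c
#include <stddef.h>

/* Reduced stand-in for the callout block (assumption: illustration only;
   the real declaration is in <pcre.h>). */
typedef struct {
  int         version;
  int         callout_number;
  int        *offset_vector;
  const char *subject;
  int         subject_length;
  int         start_match;
  int         current_position;
  int         capture_top;
  int         capture_last;
} demo_callout_block;

/* Hypothetical helper: if group n has been captured so far, store a
   pointer to its first character in *start and return its length.
   Otherwise return -1 and leave *start untouched. */
static int demo_capture_so_far(const demo_callout_block *cb, int n,
                               const char **start)
{
  int so, eo;

  /* capture_top is one more than the highest group captured so far. */
  if (n < 1 || n >= cb->capture_top) return -1;

  so = cb->offset_vector[2 * n];       /* start offset of group n */
  eo = cb->offset_vector[2 * n + 1];   /* end offset of group n   */
  if (so < 0 || eo < so) return -1;    /* group is unset          */

  *start = cb->subject + so;
  return eo - so;
}
```

This applies only to callouts from pcre_exec() and its 16-bit and 32-bit counterparts. For the DFA functions capture_top is always one, so the helper always returns -1.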
4477    
4478  RETURN VALUES  RETURN VALUES
4479    
4480         The external callout function returns an integer to PCRE. If the  value         The  external callout function returns an integer to PCRE. If the value
4481         is  zero,  matching  proceeds  as  normal. If the value is greater than         is zero, matching proceeds as normal. If  the  value  is  greater  than
4482         zero, matching fails at the current point, but  the  testing  of  other         zero,  matching  fails  at  the current point, but the testing of other
4483         matching possibilities goes ahead, just as if a lookahead assertion had         matching possibilities goes ahead, just as if a lookahead assertion had
4484         failed. If the value is less than zero, the match is abandoned, and the         failed. If the value is less than zero, the match is abandoned, and the
4485         matching function returns the negative value.         matching function returns the negative value.
4486    
4487         Negative   values   should   normally   be   chosen  from  the  set  of         Negative  values  should  normally  be   chosen   from   the   set   of
4488         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
4489         dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is         dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is
4490         reserved for use by callout functions; it will never be  used  by  PCRE         reserved  for  use  by callout functions; it will never be used by PCRE
4491         itself.         itself.
4492    
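The three kinds of return value can be sketched with a small callout function. This is a sketch under stated assumptions, not a definitive implementation: demo_block is a reduced stand-in for the real callout block, demo_callout() is a hypothetical function, and the use of callout_data as a pointer to an int "budget" is an arbitrary choice made for the example. The value -9 for PCRE_ERROR_CALLOUT is the one defined in pcre.h; it is repeated under a guard only so that the fragment stands alone.

```c
#include <stddef.h>

#ifndef PCRE_ERROR_CALLOUT
#define PCRE_ERROR_CALLOUT (-9)   /* normally supplied by <pcre.h> */
#endif

/* Reduced stand-in for the callout block (assumption: illustration only). */
typedef struct {
  int   callout_number;
  int   current_position;
  void *callout_data;
} demo_block;

/* Hypothetical callout function showing each kind of return value. */
static int demo_callout(demo_block *cb)
{
  int *budget = (int *)cb->callout_data;

  /* Negative: abandon the whole match once the caller's budget of
     callouts is used up. */
  if (budget != NULL && --(*budget) < 0)
    return PCRE_ERROR_CALLOUT;

  /* Positive: fail at this point only, so that other matching
     possibilities are still tried. */
  if (cb->callout_number == 4 && cb->current_position < 2)
    return 1;

  /* Zero: matching proceeds as normal. */
  return 0;
}
```

In a real program a function of this shape, taking a pointer to the real callout block, would be assigned to pcre_callout before the matching function is called.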
4493    
# Line 4410  AUTHOR Line 4500  AUTHOR
4500    
4501  REVISION  REVISION
4502    
4503         Last updated: 03 March 2013         Last updated: 12 November 2013
4504         Copyright (c) 1997-2013 University of Cambridge.         Copyright (c) 1997-2013 University of Cambridge.
4505  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4506    
4507    
4508  PCRECOMPAT(3)              Library Functions Manual              PCRECOMPAT(3)  PCRECOMPAT(3)              Library Functions Manual              PCRECOMPAT(3)
4509    
4510    
# Line 4532  DIFFERENCES BETWEEN PCRE AND PERL Line 4622  DIFFERENCES BETWEEN PCRE AND PERL
4622    
4623         15. Perl recognizes comments in some places that  PCRE  does  not,  for         15. Perl recognizes comments in some places that  PCRE  does  not,  for
4624         example,  between  the  ( and ? at the start of a subpattern. If the /x         example,  between  the  ( and ? at the start of a subpattern. If the /x
4625         modifier is set, Perl allows white space between ( and ? but PCRE never         modifier is set, Perl allows white space between ( and ?  (though  cur-
4626         does, even if the PCRE_EXTENDED option is set.         rent  Perls  warn that this is deprecated) but PCRE never does, even if
4627           the PCRE_EXTENDED option is set.
4628    
4629           16. Perl, when in warning mode, gives warnings  for  character  classes
4630           such  as  [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
4631           als. PCRE has no warning features, so it gives an error in these  cases
4632           because they are almost certainly user mistakes.
4633    
4634         16.  In  PCRE,  the upper/lower case character properties Lu and Ll are         17.  In  PCRE,  the upper/lower case character properties Lu and Ll are
4635         not affected when case-independent matching is specified. For  example,         not affected when case-independent matching is specified. For  example,
4636         \p{Lu} always matches an upper case letter. I think Perl has changed in         \p{Lu} always matches an upper case letter. I think Perl has changed in
4637         this respect; in the release at the time of writing (5.16), \p{Lu}  and         this respect; in the release at the time of writing (5.16), \p{Lu}  and
4638         \p{Ll} match all letters, regardless of case, when case independence is         \p{Ll} match all letters, regardless of case, when case independence is
4639         specified.         specified.
4640    
4641         17. PCRE provides some extensions to the Perl regular expression facil-         18. PCRE provides some extensions to the Perl regular expression facil-
4642         ities.   Perl  5.10  includes new features that are not in earlier ver-         ities.   Perl  5.10  includes new features that are not in earlier ver-
4643         sions of Perl, some of which (such as named parentheses) have  been  in         sions of Perl, some of which (such as named parentheses) have  been  in
4644         PCRE for some time. This list is with respect to Perl 5.10:         PCRE for some time. This list is with respect to Perl 5.10:
# Line 4599  AUTHOR Line 4695  AUTHOR
4695    
4696  REVISION  REVISION
4697    
4698         Last updated: 19 March 2013         Last updated: 10 November 2013
4699         Copyright (c) 1997-2013 University of Cambridge.         Copyright (c) 1997-2013 University of Cambridge.
4700  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
4701    
4702    
4703  PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)  PCREPATTERN(3)             Library Functions Manual             PCREPATTERN(3)
4704    
4705    
# Line 4678  SPECIAL START-OF-PATTERN ITEMS Line 4774  SPECIAL START-OF-PATTERN ITEMS
4774    
4775     Unicode property support     Unicode property support
4776    
4777         Another special sequence that may appear at the start of a pattern is         Another special sequence that may appear at the start of a  pattern  is
4778           (*UCP).   This  has  the same effect as setting the PCRE_UCP option: it
4779           (*UCP)         causes sequences such as \d and \w to use Unicode properties to  deter-
4780           mine character types, instead of recognizing only characters with codes
4781         This has the same effect as setting  the  PCRE_UCP  option:  it  causes         less than 128 via a lookup table.
4782         sequences  such  as  \d  and  \w to use Unicode properties to determine  
4783         character types, instead of recognizing only characters with codes less     Disabling auto-possessification
4784         than 128 via a lookup table.  
4785           If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect  as
4786           setting  the  PCRE_NO_AUTO_POSSESS  option  at compile time. This stops
4787           PCRE from making quantifiers possessive when what follows cannot  match
4788           the  repeated item. For example, by default a+b is treated as a++b. For
4789           more details, see the pcreapi documentation.
4790    
4791     Disabling start-up optimizations     Disabling start-up optimizations
4792    
4793         If  a  pattern  starts  with (*NO_START_OPT), it has the same effect as         If a pattern starts with (*NO_START_OPT), it has  the  same  effect  as
4794         setting the PCRE_NO_START_OPTIMIZE option either at compile or matching         setting the PCRE_NO_START_OPTIMIZE option either at compile or matching
4795         time.         time. This disables several  optimizations  for  quickly  reaching  "no
4796           match" results. For more details, see the pcreapi documentation.
4797    
4798     Newline conventions     Newline conventions
4799    
# Line 4745  SPECIAL START-OF-PATTERN ITEMS Line 4847  SPECIAL START-OF-PATTERN ITEMS
4847           (*LIMIT_RECURSION=d)           (*LIMIT_RECURSION=d)
4848    
4849         where d is any number of decimal digits. However, the value of the set-         where d is any number of decimal digits. However, the value of the set-
4850         ting must be less than the value set by the caller of  pcre_exec()  for         ting must be less than the value set (or defaulted) by  the  caller  of
4851         it to have any effect. In other words, the pattern writer can lower the         pcre_exec()  for  it  to  have  any effect. In other words, the pattern
4852         limit set by the programmer, but not raise it. If there  is  more  than         writer can lower the limits set by the programmer, but not raise  them.
4853         one setting of one of these limits, the lower value is used.         If  there  is  more  than one setting of one of these limits, the lower
4854           value is used.
4855    
4856    
4857  EBCDIC CHARACTER CODES  EBCDIC CHARACTER CODES
4858    
4859         PCRE  can  be compiled to run in an environment that uses EBCDIC as its         PCRE can be compiled to run in an environment that uses EBCDIC  as  its
4860         character code rather than ASCII or Unicode (typically a mainframe sys-         character code rather than ASCII or Unicode (typically a mainframe sys-
4861         tem).  In  the  sections below, character code values are ASCII or Uni-         tem). In the sections below, character code values are  ASCII  or  Uni-
4862         code; in an EBCDIC environment these characters may have different code         code; in an EBCDIC environment these characters may have different code
4863         values, and there are no code points greater than 255.         values, and there are no code points greater than 255.
4864    
4865    
4866  CHARACTERS AND METACHARACTERS  CHARACTERS AND METACHARACTERS
4867    
4868         A  regular  expression  is  a pattern that is matched against a subject         A regular expression is a pattern that is  matched  against  a  subject
4869         string from left to right. Most characters stand for  themselves  in  a         string  from  left  to right. Most characters stand for themselves in a
4870         pattern,  and  match  the corresponding characters in the subject. As a         pattern, and match the corresponding characters in the  subject.  As  a
4871         trivial example, the pattern         trivial example, the pattern
4872    
4873           The quick brown fox           The quick brown fox
4874    
4875         matches a portion of a subject string that is identical to itself. When         matches a portion of a subject string that is identical to itself. When
4876         caseless  matching is specified (the PCRE_CASELESS option), letters are         caseless matching is specified (the PCRE_CASELESS option), letters  are
4877         matched independently of case. In a UTF mode, PCRE  always  understands         matched  independently  of case. In a UTF mode, PCRE always understands
4878         the  concept  of case for characters whose values are less than 128, so         the concept of case for characters whose values are less than  128,  so
4879         caseless matching is always possible. For characters with  higher  val-         caseless  matching  is always possible. For characters with higher val-
4880         ues,  the concept of case is supported if PCRE is compiled with Unicode         ues, the concept of case is supported if PCRE is compiled with  Unicode
4881         property support, but not otherwise.   If  you  want  to  use  caseless         property  support,  but  not  otherwise.   If  you want to use caseless
4882         matching  for  characters  128  and above, you must ensure that PCRE is         matching for characters 128 and above, you must  ensure  that  PCRE  is
4883         compiled with Unicode property support as well as with UTF support.         compiled with Unicode property support as well as with UTF support.
4884    
4885         The power of regular expressions comes  from  the  ability  to  include         The  power  of  regular  expressions  comes from the ability to include
4886         alternatives  and  repetitions in the pattern. These are encoded in the         alternatives and repetitions in the pattern. These are encoded  in  the
4887         pattern by the use of metacharacters, which do not stand for themselves         pattern by the use of metacharacters, which do not stand for themselves
4888         but instead are interpreted in some special way.         but instead are interpreted in some special way.
4889    
4890         There  are  two different sets of metacharacters: those that are recog-         There are two different sets of metacharacters: those that  are  recog-
4891         nized anywhere in the pattern except within square brackets, and  those         nized  anywhere in the pattern except within square brackets, and those
4892         that  are  recognized  within square brackets. Outside square brackets,         that are recognized within square brackets.  Outside  square  brackets,
4893         the metacharacters are as follows:         the metacharacters are as follows:
4894    
4895           \      general escape character with several uses           \      general escape character with several uses
# Line 4805  CHARACTERS AND METACHARACTERS Line 4908  CHARACTERS AND METACHARACTERS
4908                  also "possessive quantifier"                  also "possessive quantifier"
4909           {      start min/max quantifier           {      start min/max quantifier
4910    
4911         Part of a pattern that is in square brackets  is  called  a  "character         Part  of  a  pattern  that is in square brackets is called a "character
4912         class". In a character class the only metacharacters are:         class". In a character class the only metacharacters are:
4913    
4914           \      general escape character           \      general escape character
# Line 4822  BACKSLASH Line 4925  BACKSLASH
4925    
4926         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
4927         a character that is not a number or a letter, it takes away any special         a character that is not a number or a letter, it takes away any special
4928         meaning  that  character  may  have. This use of backslash as an escape         meaning that character may have. This use of  backslash  as  an  escape
4929         character applies both inside and outside character classes.         character applies both inside and outside character classes.
4930    
4931         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
4932         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
4933         character would otherwise be interpreted as a metacharacter, so  it  is         character  would  otherwise be interpreted as a metacharacter, so it is
4934         always  safe  to  precede  a non-alphanumeric with backslash to specify         always safe to precede a non-alphanumeric  with  backslash  to  specify
4935         that it stands for itself. In particular, if you want to match a  back-         that  it stands for itself. In particular, if you want to match a back-
4936         slash, you write \\.         slash, you write \\.
4937    
4938         In  a UTF mode, only ASCII numbers and letters have any special meaning         In a UTF mode, only ASCII numbers and letters have any special  meaning
4939         after a backslash. All other characters  (in  particular,  those  whose         after  a  backslash.  All  other characters (in particular, those whose
4940         codepoints are greater than 127) are treated as literals.         codepoints are greater than 127) are treated as literals.
4941    
4942         If  a pattern is compiled with the PCRE_EXTENDED option, white space in         If a pattern is compiled with  the  PCRE_EXTENDED  option,  most  white
4943         the pattern (other than in a character class) and characters between  a         space  in the pattern (other than in a character class), and characters
4944         # outside a character class and the next newline are ignored. An escap-         between a # outside a character class and the next newline,  inclusive,
4945         ing backslash can be used to include a white space or  #  character  as         are ignored. An escaping backslash can be used to include a white space
4946         part of the pattern.         or # character as part of the pattern.
4947    
4948         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
4949         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
4950         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
4951         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
4952         tion. Note the following examples:         tion. Note the following examples:
4953    
4954           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 4855  BACKSLASH Line 4958  BACKSLASH
4958           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
4959           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
4960    
4961         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
4962         classes.  An isolated \E that is not preceded by \Q is ignored.  If  \Q         classes.   An  isolated \E that is not preceded by \Q is ignored. If \Q
4963         is  not followed by \E later in the pattern, the literal interpretation         is not followed by \E later in the pattern, the literal  interpretation
4964         continues to the end of the pattern (that is,  \E  is  assumed  at  the         continues  to  the  end  of  the pattern (that is, \E is assumed at the
4965         end).  If  the  isolated \Q is inside a character class, this causes an         end). If the isolated \Q is inside a character class,  this  causes  an
4966         error, because the character class is not terminated.         error, because the character class is not terminated.
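
The \Q...\E rules described above can be sketched as a small rewriting step. This is only an illustration in Python, not PCRE's own code: the function name expand_quoted is hypothetical, and the sketch ignores the character-class case mentioned in the last sentence. It rewrites a pattern so that everything between \Q and \E is backslash-escaped, drops an isolated \E, and treats an unterminated \Q as running to the end of the pattern.

```python
def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E spans as individually escaped characters.

    Hypothetical helper for illustration; it does not model character classes.
    """
    out = []
    quoting = False
    i = 0
    while i < len(pattern):
        pair = pattern[i:i + 2]
        if quoting:
            if pair == "\\E":
                # End of the literal span.
                quoting = False
                i += 2
                continue
            ch = pattern[i]
            # Escaping an ASCII non-alphanumeric is always safe.
            if ch.isascii() and not ch.isalnum():
                out.append("\\")
            out.append(ch)
            i += 1
        elif pair == "\\Q":
            quoting = True
            i += 2
        elif pair == "\\E":
            # An isolated \E that is not preceded by \Q is ignored.
            i += 2
        elif pattern[i] == "\\":
            # Copy an ordinary escape (backslash plus next character) unchanged.
            out.append(pair)
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

Under these assumptions, expand_quoted(r"\Qabc\E\$\Qxyz\E") gives abc\$xyz, which matches the subject text abc$xyz as in the last row of the table above.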
4967    
4968     Non-printing characters     Non-printing characters
4969    
4970         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
4971         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
4972         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
4973         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
4974         editing, it is  often  easier  to  use  one  of  the  following  escape         editing,  it  is  often  easier  to  use  one  of  the following escape
4975         sequences than the binary character it represents:         sequences than the binary character it represents:
4976    
4977           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
# Line 4878  BACKSLASH Line 4981  BACKSLASH
4981           \n        linefeed (hex 0A)           \n        linefeed (hex 0A)
4982           \r        carriage return (hex 0D)           \r        carriage return (hex 0D)
4983           \t        tab (hex 09)           \t        tab (hex 09)
4984             \0dd      character with octal code 0dd
4985           \ddd      character with octal code ddd, or back reference           \ddd      character with octal code ddd, or back reference
4986             \o{ddd..} character with octal code ddd..
4987           \xhh      character with hex code hh           \xhh      character with hex code hh
4988           \x{hhh..} character with hex code hhh.. (non-JavaScript mode)           \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
4989           \uhhhh    character with hex code hhhh (JavaScript mode only)           \uhhhh    character with hex code hhhh (JavaScript mode only)
4990    
4991         The  precise effect of \cx on ASCII characters is as follows: if x is a         The precise effect of \cx on ASCII characters is as follows: if x is  a
4992         lower case letter, it is converted to upper case. Then  bit  6  of  the         lower  case  letter,  it  is converted to upper case. Then bit 6 of the
4993         character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A         character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
4994         (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and  \c;  becomes         (A  is  41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
4995         hex  7B (; is 3B). If the data item (byte or 16-bit value) following \c         hex 7B (; is 3B). If the data item (byte or 16-bit value) following  \c
4996         has a value greater than 127, a compile-time error occurs.  This  locks         has  a  value greater than 127, a compile-time error occurs. This locks
4997         out non-ASCII characters in all modes.         out non-ASCII characters in all modes.
4998    
4999         The  \c  facility  was designed for use with ASCII characters, but with         The \c facility was designed for use with ASCII  characters,  but  with
5000         the extension to Unicode it is even less useful than it  once  was.  It         the  extension  to  Unicode it is even less useful than it once was. It
5001         is,  however,  recognized  when  PCRE is compiled in EBCDIC mode, where         is, however, recognized when PCRE is compiled  in  EBCDIC  mode,  where
5002         data items are always bytes. In this mode, all values are  valid  after         data  items  are always bytes. In this mode, all values are valid after
5003         \c.  If  the  next character is a lower case letter, it is converted to         \c. If the next character is a lower case letter, it  is  converted  to
5004         upper case. Then the 0xc0 bits of  the  byte  are  inverted.  Thus  \cA         upper  case.  Then  the  0xc0  bits  of the byte are inverted. Thus \cA
5005         becomes  hex  01, as in ASCII (A is C1), but because the EBCDIC letters         becomes hex 01, as in ASCII (A is C1), but because the  EBCDIC  letters
5006         are disjoint, \cZ becomes hex 29 (Z is E9), and other  characters  also         are  disjoint,  \cZ becomes hex 29 (Z is E9), and other characters also
5007         generate different values.         generate different values.
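
The arithmetic behind \cx can be checked with a short sketch. The two function names below are hypothetical and chosen only for this illustration; the EBCDIC variant assumes the byte has already been converted to upper case.

```python
def control_escape_ascii(ch: str) -> int:
    """Value produced by \\cx for an ASCII character x, following the rule above."""
    code = ord(ch)
    if code > 127:
        # The text says this case is a compile-time error.
        raise ValueError("\\c must be followed by an ASCII character")
    if "a" <= ch <= "z":
        code = ord(ch.upper())
    # Invert bit 6 (hex 40).
    return code ^ 0x40


def control_escape_ebcdic(byte: int) -> int:
    """EBCDIC rule: invert the 0xc0 bits of an (already upper-case) byte."""
    return byte ^ 0xC0
```

For instance, control_escape_ascii("{") is 0x3B and control_escape_ascii(";") is 0x7B, while control_escape_ebcdic(0xE9) (EBCDIC Z) is 0x29, in line with the values quoted above.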
5008    
5009         By  default,  after  \x,  from  zero to two hexadecimal digits are read         After \0 up to two further octal digits are read. If  there  are  fewer
5010         (letters can be in upper or lower case). Any number of hexadecimal dig-         than  two  digits,  just  those  that  are  present  are used. Thus the
        its may appear between \x{ and }, but the character code is constrained  
        as follows:  
   
          8-bit non-UTF mode    less than 0x100  
          8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint  
          16-bit non-UTF mode   less than 0x10000  
          16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint  
          32-bit non-UTF mode   less than 0x80000000  
          32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint  
   
        Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-  
        called "surrogate" codepoints), and 0xffef.  
   
        If  characters  other than hexadecimal digits appear between \x{ and },  
        or if there is no terminating }, this form of escape is not recognized.  
        Instead,  the  initial  \x  will  be interpreted as a basic hexadecimal  
        escape, with no following digits, giving a  character  whose  value  is  
        zero.  
   
        If  the  PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x  
        is as just described only when it is followed by two  hexadecimal  dig-  
        its.   Otherwise,  it  matches  a  literal "x" character. In JavaScript  
        mode, support for code points greater than 256 is provided by \u, which  
        must  be  followed  by  four hexadecimal digits; otherwise it matches a  
        literal "u" character.  Character codes specified by \u  in  JavaScript  
        mode  are  constrained in the same way as those specified by \x in non-  
        JavaScript mode.  
   
        Characters whose value is less than 256 can be defined by either of the  
        two  syntaxes for \x (or by \u in JavaScript mode). There is no differ-  
        ence in the way they are handled. For example, \xdc is exactly the same  
        as \x{dc} (or \u00dc in JavaScript mode).  
   
        After  \0  up  to two further octal digits are read. If there are fewer  
        than two digits, just  those  that  are  present  are  used.  Thus  the  
5011         sequence \0\x\07 specifies two binary zeros followed by a BEL character         sequence \0\x\07 specifies two binary zeros followed by a BEL character
5012         (code value 7). Make sure you supply two digits after the initial  zero         (code  value 7). Make sure you supply two digits after the initial zero
5013         if the pattern character that follows is itself an octal digit.         if the pattern character that follows is itself an octal digit.
5014    
5015           The escape \o must be followed by a sequence of octal digits,  enclosed
5016           in  braces.  An  error occurs if this is not the case. This escape is a
5017           recent addition to Perl; it provides a way of specifying character code
5018           points  as  octal  numbers  greater than 0777, and it also allows octal
5019           numbers and back references to be unambiguously specified.
5020    
5021           For greater clarity and unambiguity, it is best to avoid following \ by
5022           a digit greater than zero. Instead, use \o{} or \x{} to specify charac-
5023           ter numbers, and \g{} to specify back references. The  following  para-
5024           graphs describe the old, ambiguous syntax.
5025    
5026         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
5027         cated.  Outside a character class, PCRE reads it and any following dig-         cated, and Perl has changed in recent releases, causing  PCRE  also  to
5028         its  as  a  decimal  number. If the number is less than 10, or if there         change. Outside a character class, PCRE reads the digit and any follow-
5029         have been at least that many previous capturing left parentheses in the         ing digits as a decimal number. If the number is less  than  8,  or  if
5030         expression,  the  entire  sequence  is  taken  as  a  back reference. A         there  have been at least that many previous capturing left parentheses
5031         description of how this works is given later, following the  discussion         in the expression, the entire sequence is taken as a back reference.  A
5032           description  of how this works is given later, following the discussion
5033         of parenthesized subpatterns.         of parenthesized subpatterns.
5034    
5035         Inside  a  character  class, or if the decimal number is greater than 9         Inside a character class, or if  the  decimal  number  following  \  is
5036         and there have not been that many capturing subpatterns, PCRE  re-reads         greater than 7 and there have not been that many capturing subpatterns,
5037         up to three octal digits following the backslash, and uses them to gen-         PCRE handles \8 and \9 as the literal characters "8" and "9", and  oth-
5038         erate a data character. Any subsequent digits stand for themselves. The         erwise re-reads up to three octal digits following the backslash, using
5039         value  of  the  character  is constrained in the same way as characters         them to generate a data character.  Any  subsequent  digits  stand  for
5040         specified in hexadecimal.  For example:         themselves. For example:
5041    
5042           \040   is another way of writing an ASCII space           \040   is another way of writing an ASCII space
5043           \40    is the same, provided there are fewer than 40           \40    is the same, provided there are fewer than 40
# Line 4969  BACKSLASH Line 5051  BACKSLASH
5051                     character with octal code 113                     character with octal code 113
5052           \377   might be a back reference, otherwise           \377   might be a back reference, otherwise
5053                     the value 255 (decimal)                     the value 255 (decimal)
5054           \81    is either a back reference, or a binary zero           \81    is either a back reference, or the two
5055                     followed by the two characters "8" and "1"                     characters "8" and "1"
5056    
5057         Note that octal values of 100 or greater must not be  introduced  by  a         Note  that octal values of 100 or greater that are specified using this
5058         leading zero, because no more than three octal digits are ever read.         syntax must not be introduced by a leading zero, because no  more  than
5059           three octal digits are ever read.
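
The decision procedure for a backslash followed by a non-zero digit, as described in the revised (right-hand) text, can be sketched as follows. This is a simplified model rather than PCRE's implementation: the function name interpret_backslash_digits and its return tuples are hypothetical, the caller is assumed to pass only the run of decimal digits that follows the backslash, and the limits on character values are not enforced.

```python
def interpret_backslash_digits(digits: str, capture_count: int,
                               in_class: bool = False):
    """Classify the digits that follow a backslash (first digit is 1-9).

    Returns ("backref", n, "") or ("char", code, rest), where rest holds the
    trailing digits that stand for themselves.
    """
    number = int(digits)
    if not in_class and (number < 8 or number <= capture_count):
        # Small numbers, or numbers covered by earlier groups, are back references.
        return ("backref", number, "")
    if digits[0] in "89":
        # \8 and \9 are the literal characters "8" and "9".
        return ("char", ord(digits[0]), digits[1:])
    # Re-read up to three octal digits; later digits stand for themselves.
    end = 0
    while end < min(3, len(digits)) and digits[end] in "01234567":
        end += 1
    return ("char", int(digits[:end], 8), digits[end:])
```

With no capturing groups, interpret_backslash_digits("81", 0) yields the character "8" (code 56) followed by a literal "1", and interpret_backslash_digits("113", 0) yields code 75 (octal 113); with at least 11 groups, "11" becomes a back reference instead of a tab.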
5060    
5061           By  default, after \x that is not followed by {, from zero to two hexa-
5062           decimal digits are read (letters can be in upper or  lower  case).  Any
5063           number of hexadecimal digits may appear between \x{ and }. If a charac-
5064           ter other than a hexadecimal digit appears between \x{  and  },  or  if
5065           there is no terminating }, an error occurs.
5066    
5067           If  the  PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x
5068           is as just described only when it is followed by two  hexadecimal  dig-
5069           its.   Otherwise,  it  matches  a  literal "x" character. In JavaScript
5070           mode, support for code points greater than 256 is provided by \u, which
5071           must  be  followed  by  four hexadecimal digits; otherwise it matches a
5072           literal "u" character.
5073    
5074           Characters whose value is less than 256 can be defined by either of the
5075           two  syntaxes for \x (or by \u in JavaScript mode). There is no differ-
5076           ence in the way they are handled. For example, \xdc is exactly the same
5077           as \x{dc} (or \u00dc in JavaScript mode).
5078    
5079       Constraints on character values
5080    
5081           Characters  that  are  specified using octal or hexadecimal numbers are
5082           limited to certain values, as follows:
5083    
5084             8-bit non-UTF mode    less than 0x100
5085             8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
5086             16-bit non-UTF mode   less than 0x10000
5087             16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
5088             32-bit non-UTF mode   less than 0x100000000
5089             32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
5090    
5091           Invalid Unicode codepoints are the range  0xd800  to  0xdfff  (the  so-
5092           called "surrogate" codepoints), and 0xffef.
5093    
5094       Escape sequences in character classes
5095    
5096         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
5097         inside and outside character classes. In addition, inside  a  character         inside and outside character classes. In addition, inside  a  character
# Line 5038  BACKSLASH Line 5156  BACKSLASH
5156         the subject string, all of them fail, because there is no character  to         the subject string, all of them fail, because there is no character  to
5157         match.         match.
5158    
5159         For  compatibility  with Perl, \s does not match the VT character (code         For  compatibility with Perl, \s did not use to match the VT character
5160         11).  This makes it different from the POSIX "space" class. The  \s         (code 11), which made it different from the  POSIX  "space"  class.
5161         characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If         However,  Perl  added  VT  at  release  5.18, and PCRE followed suit at
5162         "use locale;" is included in a Perl script, \s may match the VT charac-         release 8.34. The default \s characters are now HT  (9),  LF  (10),  VT
5163         ter. In PCRE, it never does.         (11),  FF  (12),  CR  (13),  and space (32), which are defined as white
5164           space in the "C" locale. This list may vary if locale-specific matching
5165         A  "word"  character is an underscore or any character that is a letter         is  taking  place;  in  particular,  in  some locales the "non-breaking
5166         or digit.  By default, the definition of letters  and  digits  is  con-         space" character (\xA0) is recognized as white space.
5167         trolled  by PCRE's low-valued character tables, and may vary if locale-  
5168         specific matching is taking place (see "Locale support" in the  pcreapi         A "word" character is an underscore or any character that is  a  letter
5169         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like         or  digit.   By  default,  the definition of letters and digits is con-
5170         systems, or "french" in Windows, some character codes greater than  128         trolled by PCRE's low-valued character tables, and may vary if  locale-
5171         are  used  for  accented letters, and these are then matched by \w. The         specific  matching is taking place (see "Locale support" in the pcreapi
5172           page). For example, in a French locale such  as  "fr_FR"  in  Unix-like
5173           systems,  or "french" in Windows, some character codes greater than 127
5174           are used for accented letters, and these are then matched  by  \w.  The
5175         use of locales with Unicode is discouraged.         use of locales with Unicode is discouraged.
5176    
5177         By default, in a UTF mode, characters  with  values  greater  than  128         By  default,  characters  whose  code points are greater than 127 never
5178         never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These         match \d, \s, or \w, and always match \D, \S, and \W, although this may
5179         sequences retain their original meanings from before  UTF  support  was         vary  for characters in the range 128-255 when locale-specific matching
5180         available,  mainly for efficiency reasons. However, if PCRE is compiled         is happening.  These escape sequences retain  their  original  meanings
5181         with Unicode property support, and the PCRE_UCP option is set, the  be-         from  before  Unicode support was available, mainly for efficiency rea-
5182         haviour  is  changed  so  that Unicode properties are used to determine         sons. If PCRE is  compiled  with  Unicode  property  support,  and  the
5183         character types, as follows:         PCRE_UCP  option is set, the behaviour is changed so that Unicode prop-
5184           erties are used to determine character types, as follows:
5185           \d  any character that \p{Nd} matches (decimal digit)  
5186           \s  any character that \p{Z} matches, plus HT, LF, FF, CR           \d  any character that matches \p{Nd} (decimal digit)
5187           \w  any character that \p{L} or \p{N} matches, plus underscore           \s  any character that matches \p{Z} or \h or \v
5188             \w  any character that matches \p{L} or \p{N}, plus underscore
5189    
5190         The upper case escapes match the inverse sets of characters. Note  that         The upper case escapes match the inverse sets of characters. Note  that
5191         \d  matches  only decimal digits, whereas \w matches any Unicode digit,         \d  matches  only decimal digits, whereas \w matches any Unicode digit,
# Line 5074  BACKSLASH Line 5196  BACKSLASH
5196         The sequences \h, \H, \v, and \V are features that were added  to  Perl         The sequences \h, \H, \v, and \V are features that were added  to  Perl
5197         at  release  5.10. In contrast to the other sequences, which match only         at  release  5.10. In contrast to the other sequences, which match only
5198         ASCII characters by default, these  always  match  certain  high-valued         ASCII characters by default, these  always  match  certain  high-valued
5199         codepoints,  whether or not PCRE_UCP is set. The horizontal space char-         code points, whether or not PCRE_UCP is set. The horizontal space char-
5200         acters are:         acters are:
5201    
5202           U+0009     Horizontal tab (HT)           U+0009     Horizontal tab (HT)
# Line 5340  BACKSLASH Line 5462  BACKSLASH
5462    
5463         As well as the standard Unicode properties described above,  PCRE  sup-         As well as the standard Unicode properties described above,  PCRE  sup-
5464         ports  four  more  that  make it possible to convert traditional escape         ports  four  more  that  make it possible to convert traditional escape
5465         sequences such as \w and \s and POSIX character classes to use  Unicode         sequences such as \w and \s to use Unicode properties. PCRE uses  these
5466         properties.  PCRE  uses  these non-standard, non-Perl properties inter-         non-standard, non-Perl properties internally when PCRE_UCP is set. How-
5467         nally when PCRE_UCP is set. However, they may also be used  explicitly.         ever, they may also be used explicitly. These properties are:
        These properties are:  
5468    
5469           Xan   Any alphanumeric character           Xan   Any alphanumeric character
5470           Xps   Any POSIX space character           Xps   Any POSIX space character
5471           Xsp   Any Perl space character           Xsp   Any Perl space character
5472           Xwd   Any Perl "word" character           Xwd   Any Perl "word" character
5473    
5474         Xan  matches  characters that have either the L (letter) or the N (num-         Xan matches characters that have either the L (letter) or the  N  (num-
5475         ber) property. Xps matches the characters tab, linefeed, vertical  tab,         ber)  property. Xps matches the characters tab, linefeed, vertical tab,
5476         form  feed,  or carriage return, and any other character that has the Z         form feed, or carriage return, and any other character that has  the  Z
5477         (separator) property.  Xsp is the same as Xps, except that vertical tab         (separator)  property.  Xsp is the same as Xps; it used to exclude ver-
5478         is excluded. Xwd matches the same characters as Xan, plus underscore.         tical tab, for Perl compatibility, but Perl changed, and so  PCRE  fol-
5479           lowed  at  release  8.34.  Xwd matches the same characters as Xan, plus
5480         There  is another non-standard property, Xuc, which matches any charac-         underscore.
5481         ter that can be represented by a Universal Character Name  in  C++  and  
5482         other  programming  languages.  These are the characters $, @, ` (grave         There is another non-standard property, Xuc, which matches any  charac-
5483         accent), and all characters with Unicode code points  greater  than  or         ter  that  can  be represented by a Universal Character Name in C++ and
5484         equal  to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that         other programming languages. These are the characters $,  @,  `  (grave
5485         most base (ASCII) characters are excluded. (Universal  Character  Names         accent),  and  all  characters with Unicode code points greater than or
5486         are  of  the  form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.         equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note  that
5487           most  base  (ASCII) characters are excluded. (Universal Character Names
5488           are of the form \uHHHH or \UHHHHHHHH where H is  a  hexadecimal  digit.
5489         Note that the Xuc property does not match these sequences but the char-         Note that the Xuc property does not match these sequences but the char-
5490         acters that they represent.)         acters that they represent.)
5491    
5492     Resetting the match start     Resetting the match start
5493    
5494         The  escape sequence \K causes any previously matched characters not to         The escape sequence \K causes any previously matched characters not  to
5495         be included in the final matched sequence. For example, the pattern:         be included in the final matched sequence. For example, the pattern:
5496    
5497           foo\Kbar           foo\Kbar
5498    
5499         matches "foobar", but reports that it has matched "bar".  This  feature         matches  "foobar",  but reports that it has matched "bar". This feature
5500         is  similar  to  a lookbehind assertion (described below).  However, in         is similar to a lookbehind assertion (described  below).   However,  in
5501         this case, the part of the subject before the real match does not  have         this  case, the part of the subject before the real match does not have
5502         to  be of fixed length, as lookbehind assertions do. The use of \K does         to be of fixed length, as lookbehind assertions do. The use of \K  does
5503         not interfere with the setting of captured  substrings.   For  example,         not  interfere  with  the setting of captured substrings.  For example,
5504         when the pattern         when the pattern
5505    
5506           (foo)\Kbar           (foo)\Kbar
5507    
5508         matches "foobar", the first substring is still set to "foo".         matches "foobar", the first substring is still set to "foo".
5509    
5510         Perl  documents  that  the  use  of  \K  within assertions is "not well         Perl documents that the use  of  \K  within  assertions  is  "not  well
5511         defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive         defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
5512         assertions, but is ignored in negative assertions.         assertions, but is ignored in negative assertions.
5513    
5514     Simple assertions     Simple assertions
5515    
5516         The  final use of backslash is for certain simple assertions. An asser-         The final use of backslash is for certain simple assertions. An  asser-
5517         tion specifies a condition that has to be met at a particular point  in         tion  specifies a condition that has to be met at a particular point in
5518         a  match, without consuming any characters from the subject string. The         a match, without consuming any characters from the subject string.  The
5519         use of subpatterns for more complicated assertions is described  below.         use  of subpatterns for more complicated assertions is described below.
5520         The backslashed assertions are:         The backslashed assertions are:
5521    
5522           \b     matches at a word boundary           \b     matches at a word boundary
# Line 5404  BACKSLASH Line 5527  BACKSLASH
5527           \z     matches only at the end of the subject           \z     matches only at the end of the subject
5528           \G     matches at the first matching position in the subject           \G     matches at the first matching position in the subject
5529    
5530         Inside  a  character  class, \b has a different meaning; it matches the         Inside a character class, \b has a different meaning;  it  matches  the
5531         backspace character. If any other of  these  assertions  appears  in  a         backspace  character.  If  any  other  of these assertions appears in a
5532         character  class, by default it matches the corresponding literal char-         character class, by default it matches the corresponding literal  char-
5533         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
5534         PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener-         PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-
5535         ated instead.         ated instead.
5536    
5537         A word boundary is a position in the subject string where  the  current         A  word  boundary is a position in the subject string where the current
5538         character  and  the previous character do not both match \w or \W (i.e.         character and the previous character do not both match \w or  \W  (i.e.
5539         one matches \w and the other matches \W), or the start or  end  of  the         one  matches  \w  and the other matches \W), or the start or end of the
5540         string  if  the  first or last character matches \w, respectively. In a         string if the first or last character matches \w,  respectively.  In  a
5541         UTF mode, the meanings of \w and \W  can  be  changed  by  setting  the         UTF  mode,  the  meanings  of  \w  and \W can be changed by setting the
5542         PCRE_UCP  option. When this is done, it also affects \b and \B. Neither         PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
5543         PCRE nor Perl has a separate "start of word" or "end of  word"  metase-         PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
5544         quence.  However,  whatever follows \b normally determines which it is.         quence. However, whatever follows \b normally determines which  it  is.
5545         For example, the fragment \ba matches "a" at the start of a word.         For example, the fragment \ba matches "a" at the start of a word.
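
The word-boundary rule above can be sketched in plain C. This is an illustration only, not PCRE's implementation: the helper names is_word_char() and is_word_boundary() are hypothetical, and \w is assumed to be the default set (letters, digits and underscore in the C locale, without PCRE_UCP).

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: default (non-UCP) \w, assumed to be a letter,
   digit or underscore in the current C locale. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos (0..len) in subject is a word boundary:
   exactly one of the previous and current characters matches \w.
   Positions before the start and after the end count as non-word. */
static int is_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev_is_word, cur_is_word;

    assert(pos <= len);
    prev_is_word = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    cur_is_word = pos < len && is_word_char((unsigned char)subject[pos]);
    return prev_is_word != cur_is_word;
}
```

With the subject "foo bar", this reports boundaries at offsets 0, 3, 4 and 7, which are the positions where \b would be true and \B false.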
5546    
5547         The \A, \Z, and \z assertions differ from  the  traditional  circumflex         The  \A,  \Z,  and \z assertions differ from the traditional circumflex
5548         and dollar (described in the next section) in that they only ever match         and dollar (described in the next section) in that they only ever match
5549         at the very start and end of the subject string, whatever  options  are         at  the  very start and end of the subject string, whatever options are
5550         set.  Thus,  they are independent of multiline mode. These three asser-         set. Thus, they are independent of multiline mode. These  three  asser-
5551         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
5552         affect  only the behaviour of the circumflex and dollar metacharacters.         affect only the behaviour of the circumflex and dollar  metacharacters.
5553         However, if the startoffset argument of pcre_exec() is non-zero,  indi-         However,  if the startoffset argument of pcre_exec() is non-zero, indi-
5554         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
5555         the subject, \A can never match. The difference between \Z  and  \z  is         the  subject,  \A  can never match. The difference between \Z and \z is
5556         that \Z matches before a newline at the end of the string as well as at         that \Z matches before a newline at the end of the string as well as at
5557         the very end, whereas \z matches only at the end.         the very end, whereas \z matches only at the end.
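
The difference between \Z and \z can be written as two position tests. This is a minimal sketch, assuming that newline is the single character LF; the function names are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>
#include <stddef.h>

/* \z: true only at the very end of the subject. */
static int at_small_z(size_t len, size_t pos)
{
    assert(pos <= len);
    return pos == len;
}

/* \Z: true at the very end, and also immediately before a newline
   that is the last character of the subject (LF assumed). */
static int at_capital_z(const char *subject, size_t len, size_t pos)
{
    assert(pos <= len);
    return pos == len || (pos + 1 == len && subject[pos] == '\n');
}
```

For the subject "abc\n" (length 4), at_capital_z() is true at offsets 3 and 4, whereas at_small_z() is true only at offset 4.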
5558    
5559         The \G assertion is true only when the current matching position is  at         The  \G assertion is true only when the current matching position is at
5560         the  start point of the match, as specified by the startoffset argument         the start point of the match, as specified by the startoffset  argument
5561         of pcre_exec(). It differs from \A when the  value  of  startoffset  is         of  pcre_exec().  It  differs  from \A when the value of startoffset is
5562         non-zero.  By calling pcre_exec() multiple times with appropriate argu-         non-zero. By calling pcre_exec() multiple times with appropriate  argu-
5563         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
5564         mentation where \G can be useful.         mentation where \G can be useful.
5565    
5566         Note,  however,  that  PCRE's interpretation of \G, as the start of the         Note, however, that PCRE's interpretation of \G, as the  start  of  the
5567         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
5568         end  of  the  previous  match. In Perl, these can be different when the         end of the previous match. In Perl, these can  be  different  when  the
5569         previously matched string was empty. Because PCRE does just  one  match         previously  matched  string was empty. Because PCRE does just one match
5570         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
5571    
5572         If  all  the alternatives of a pattern begin with \G, the expression is         If all the alternatives of a pattern begin with \G, the  expression  is
5573         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
5574         in the compiled regular expression.         in the compiled regular expression.
5575    
5576    
5577  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
5578    
5579         The  circumflex  and  dollar  metacharacters are zero-width assertions.         The circumflex and dollar  metacharacters  are  zero-width  assertions.
5580         That is, they test for a particular condition being true  without  con-         That  is,  they test for a particular condition being true without con-
5581         suming any characters from the subject string.         suming any characters from the subject string.
5582    
5583         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
5584         character is an assertion that is true only  if  the  current  matching         character  is  an  assertion  that is true only if the current matching
5585         point  is  at the start of the subject string. If the startoffset argu-         point is at the start of the subject string. If the  startoffset  argu-
5586         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
5587         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
5588         has an entirely different meaning (see below).         has an entirely different meaning (see below).
5589    
5590         Circumflex need not be the first character of the pattern if  a  number         Circumflex  need  not be the first character of the pattern if a number
5591         of  alternatives are involved, but it should be the first thing in each         of alternatives are involved, but it should be the first thing in  each
5592         alternative in which it appears if the pattern is ever  to  match  that         alternative  in  which  it appears if the pattern is ever to match that
5593         branch.  If all possible alternatives start with a circumflex, that is,         branch. If all possible alternatives start with a circumflex, that  is,
5594         if the pattern is constrained to match only at the start  of  the  sub-         if  the  pattern  is constrained to match only at the start of the sub-
5595         ject,  it  is  said  to be an "anchored" pattern. (There are also other         ject, it is said to be an "anchored" pattern.  (There  are  also  other
5596         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
5597    
5598         The dollar character is an assertion that is true only if  the  current         The  dollar  character is an assertion that is true only if the current
5599         matching  point  is  at  the  end of the subject string, or immediately         matching point is at the end of  the  subject  string,  or  immediately
5600         before a newline at the end of the string (by default). Note,  however,         before  a newline at the end of the string (by default). Note, however,
5601         that  it  does  not  actually match the newline. Dollar need not be the         that it does not actually match the newline. Dollar  need  not  be  the
5602         last character of the pattern if a number of alternatives are involved,         last character of the pattern if a number of alternatives are involved,
5603         but  it should be the last item in any branch in which it appears. Dol-         but it should be the last item in any branch in which it appears.  Dol-
5604         lar has no special meaning in a character class.         lar has no special meaning in a character class.
5605    
5606         The meaning of dollar can be changed so that it  matches  only  at  the         The  meaning  of  dollar  can be changed so that it matches only at the
5607         very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at
5608         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
5609    
5610         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
5611         PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex
5612         matches immediately after internal newlines as well as at the start  of         matches  immediately after internal newlines as well as at the start of
5613         the  subject  string.  It  does not match after a newline that ends the         the subject string. It does not match after a  newline  that  ends  the
5614         string. A dollar matches before any newlines in the string, as well  as         string.  A dollar matches before any newlines in the string, as well as
5615         at  the very end, when PCRE_MULTILINE is set. When newline is specified         at the very end, when PCRE_MULTILINE is set. When newline is  specified
5616         as the two-character sequence CRLF, isolated CR and  LF  characters  do         as  the  two-character  sequence CRLF, isolated CR and LF characters do
5617         not indicate newlines.         not indicate newlines.
5618    
5619         For  example, the pattern /^abc$/ matches the subject string "def\nabc"         For example, the pattern /^abc$/ matches the subject string  "def\nabc"
5620         (where \n represents a newline) in multiline mode, but  not  otherwise.         (where  \n  represents a newline) in multiline mode, but not otherwise.
5621         Consequently,  patterns  that  are anchored in single line mode because         Consequently, patterns that are anchored in single  line  mode  because
5622         all branches start with ^ are not anchored in  multiline  mode,  and  a         all  branches  start  with  ^ are not anchored in multiline mode, and a
5623         match  for  circumflex  is  possible  when  the startoffset argument of         match for circumflex is  possible  when  the  startoffset  argument  of
5624         pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if         pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
5625         PCRE_MULTILINE is set.         PCRE_MULTILINE is set.
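
The effect of PCRE_MULTILINE on circumflex can be sketched as a position test. This is an illustration under stated assumptions, not PCRE's code: newline is taken to be the single character LF, the function name is hypothetical, and the PCRE_NOTBOL option and the startoffset argument are not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if circumflex would be true at position pos (0..len).
   Without multiline, only the start of the subject qualifies. With
   multiline, a position immediately after an internal newline also
   qualifies, but not one after a newline that ends the string. */
static int circumflex_matches(const char *subject, size_t len, size_t pos,
                              int multiline)
{
    assert(pos <= len);
    if (pos == 0)
        return 1;
    return multiline && pos < len && subject[pos - 1] == '\n';
}
```

For the subject "def\nabc", offset 4 qualifies only when multiline is non-zero, which is why /^abc$/ matches that subject in multiline mode but not otherwise.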
5626    
5627         Note  that  the sequences \A, \Z, and \z can be used to match the start         Note that the sequences \A, \Z, and \z can be used to match  the  start
5628         and end of the subject in both modes, and if all branches of a  pattern         and  end of the subject in both modes, and if all branches of a pattern
5629         start  with  \A it is always anchored, whether or not PCRE_MULTILINE is         start with \A it is always anchored, whether or not  PCRE_MULTILINE  is
5630         set.         set.
5631    
5632    
5633  FULL STOP (PERIOD, DOT) AND \N  FULL STOP (PERIOD, DOT) AND \N
5634    
5635         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
5636         ter  in  the subject string except (by default) a character that signi-         ter in the subject string except (by default) a character  that  signi-
5637         fies the end of a line.         fies the end of a line.
5638    
5639         When a line ending is defined as a single character, dot never  matches         When  a line ending is defined as a single character, dot never matches
5640         that  character; when the two-character sequence CRLF is used, dot does         that character; when the two-character sequence CRLF is used, dot  does
5641         not match CR if it is immediately followed  by  LF,  but  otherwise  it         not  match  CR  if  it  is immediately followed by LF, but otherwise it
5642         matches  all characters (including isolated CRs and LFs). When any Uni-         matches all characters (including isolated CRs and LFs). When any  Uni-
5643         code line endings are being recognized, dot does not match CR or LF  or         code  line endings are being recognized, dot does not match CR or LF or
5644         any of the other line ending characters.         any of the other line ending characters.
5645    
5646         The  behaviour  of  dot  with regard to newlines can be changed. If the         The behaviour of dot with regard to newlines can  be  changed.  If  the
5647         PCRE_DOTALL option is set, a dot matches  any  one  character,  without         PCRE_DOTALL  option  is  set,  a dot matches any one character, without
5648         exception. If the two-character sequence CRLF is present in the subject         exception. If the two-character sequence CRLF is present in the subject
5649         string, it takes two dots to match it.         string, it takes two dots to match it.
5650    
5651         The handling of dot is entirely independent of the handling of  circum-         The  handling of dot is entirely independent of the handling of circum-
5652         flex  and  dollar,  the  only relationship being that they both involve         flex and dollar, the only relationship being  that  they  both  involve
5653         newlines. Dot has no special meaning in a character class.         newlines. Dot has no special meaning in a character class.
5654    
5655         The escape sequence \N behaves like  a  dot,  except  that  it  is  not         The  escape  sequence  \N  behaves  like  a  dot, except that it is not
5656         affected  by  the  PCRE_DOTALL  option.  In other words, it matches any         affected by the PCRE_DOTALL option. In  other  words,  it  matches  any
5657         character except one that signifies the end of a line. Perl  also  uses         character  except  one that signifies the end of a line. Perl also uses
5658         \N to match characters by name; PCRE does not support this.         \N to match characters by name; PCRE does not support this.
5659    
5660    
5661  MATCHING A SINGLE DATA UNIT  MATCHING A SINGLE DATA UNIT
5662    
5663         Outside  a character class, the escape sequence \C matches any one data         Outside a character class, the escape sequence \C matches any one  data
5664         unit, whether or not a UTF mode is set. In the 8-bit library, one  data         unit,  whether or not a UTF mode is set. In the 8-bit library, one data
5665         unit  is  one  byte;  in the 16-bit library it is a 16-bit unit; in the         unit is one byte; in the 16-bit library it is a  16-bit  unit;  in  the
5666         32-bit library it is a 32-bit unit. Unlike a  dot,  \C  always  matches         32-bit  library  it  is  a 32-bit unit. Unlike a dot, \C always matches
5667         line-ending  characters.  The  feature  is provided in Perl in order to         line-ending characters. The feature is provided in  Perl  in  order  to
5668         match individual bytes in UTF-8 mode, but it is unclear how it can use-         match individual bytes in UTF-8 mode, but it is unclear how it can use-
5669         fully  be  used.  Because  \C breaks up characters into individual data         fully be used. Because \C breaks up  characters  into  individual  data
5670         units, matching one unit with \C in a UTF mode means that the  rest  of         units,  matching  one unit with \C in a UTF mode means that the rest of
5671         the string may start with a malformed UTF character. This has undefined         the string may start with a malformed UTF character. This has undefined
5672         results, because PCRE assumes that it is dealing with valid UTF strings         results, because PCRE assumes that it is dealing with valid UTF strings
5673         (and  by  default  it checks this at the start of processing unless the         (and by default it checks this at the start of  processing  unless  the
5674         PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or  PCRE_NO_UTF32_CHECK  option         PCRE_NO_UTF8_CHECK,  PCRE_NO_UTF16_CHECK  or PCRE_NO_UTF32_CHECK option
5675         is used).         is used).
5676    
5677         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE does not allow \C to appear in  lookbehind  assertions  (described
5678         below) in a UTF mode, because this would make it impossible  to  calcu-         below)  in  a UTF mode, because this would make it impossible to calcu-
5679         late the length of the lookbehind.         late the length of the lookbehind.
5680    
5681         In general, the \C escape sequence is best avoided. However, one way of         In general, the \C escape sequence is best avoided. However, one way of
5682         using it that avoids the problem of malformed UTF characters is to  use         using  it that avoids the problem of malformed UTF characters is to use
5683         a  lookahead to check the length of the next character, as in this pat-         a lookahead to check the length of the next character, as in this  pat-
5684         tern, which could be used with a UTF-8 string (ignore white  space  and         tern,  which  could be used with a UTF-8 string (ignore white space and
5685         line breaks):         line breaks):
5686    
5687           (?| (?=[\x00-\x7f])(\C) |           (?| (?=[\x00-\x7f])(\C) |
# Line 5566  MATCHING A SINGLE DATA UNIT Line 5689  MATCHING A SINGLE DATA UNIT
5689               (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |               (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
5690               (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))               (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
5691    
5692         A  group  that starts with (?| resets the capturing parentheses numbers         A group that starts with (?| resets the capturing  parentheses  numbers
5693         in each alternative (see "Duplicate  Subpattern  Numbers"  below).  The         in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
5694         assertions  at  the start of each branch check the next UTF-8 character         assertions at the start of each branch check the next  UTF-8  character
5695         for values whose encoding uses 1, 2, 3, or 4 bytes,  respectively.  The         for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
5696         character's  individual bytes are then captured by the appropriate num-         character's individual bytes are then captured by the appropriate  num-
5697         ber of groups.         ber of groups.
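
The lookaheads in that pattern select a branch according to how many bytes the next character occupies in UTF-8. The same mapping can be sketched in C; the function name is hypothetical, and the upper limit 0x1fffff is taken from the pattern above rather than from the Unicode limit of 0x10ffff.

```c
#include <assert.h>

/* Returns the number of bytes (and hence the number of \C items)
   needed for code point cp in UTF-8, or 0 if cp is above the
   largest value the pattern above caters for. */
static int utf8_encoded_length(unsigned long cp)
{
    int length;

    if (cp <= 0x7fUL)
        length = 1;
    else if (cp <= 0x7ffUL)
        length = 2;
    else if (cp <= 0xffffUL)
        length = 3;
    else if (cp <= 0x1fffffUL)
        length = 4;
    else
        length = 0;

    assert(length >= 0 && length <= 4);
    return length;
}
```

For example, U+00E9 needs two bytes, so the two-byte branch would capture it with two \C groups.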
5698    
5699    
# Line 5580  SQUARE BRACKETS AND CHARACTER CLASSES Line 5703  SQUARE BRACKETS AND CHARACTER CLASSES
5703         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
5704         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
5705         a lone closing square bracket causes a compile-time error. If a closing         a lone closing square bracket causes a compile-time error. If a closing
5706         square bracket is required as a member of the class, it should  be  the         square  bracket  is required as a member of the class, it should be the
5707         first  data  character  in  the  class (after an initial circumflex, if         first data character in the class  (after  an  initial  circumflex,  if
5708         present) or escaped with a backslash.         present) or escaped with a backslash.
5709    
5710         A character class matches a single character in the subject. In  a  UTF         A  character  class matches a single character in the subject. In a UTF
5711         mode,  the  character  may  be  more than one data unit long. A matched         mode, the character may be more than one  data  unit  long.  A  matched
5712         character must be in the set of characters defined by the class, unless         character must be in the set of characters defined by the class, unless
5713         the  first  character in the class definition is a circumflex, in which         the first character in the class definition is a circumflex,  in  which
5714         case the subject character must not be in the set defined by the class.         case the subject character must not be in the set defined by the class.
5715         If  a  circumflex is actually required as a member of the class, ensure         If a circumflex is actually required as a member of the  class,  ensure
5716         it is not the first character, or escape it with a backslash.         it is not the first character, or escape it with a backslash.
5717    
5718         For example, the character class [aeiou] matches any lower case  vowel,         For  example, the character class [aeiou] matches any lower case vowel,
5719         while  [^aeiou]  matches  any character that is not a lower case vowel.         while [^aeiou] matches any character that is not a  lower  case  vowel.
5720         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
5721         characters  that  are in the class by enumerating those that are not. A         characters that are in the class by enumerating those that are  not.  A
5722         class that starts with a circumflex is not an assertion; it still  con-         class  that starts with a circumflex is not an assertion; it still con-
5723         sumes  a  character  from the subject string, and therefore it fails if         sumes a character from the subject string, and therefore  it  fails  if
5724         the current pointer is at the end of the string.         the current pointer is at the end of the string.
5725    
5726         In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255         In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255
5727         (0xffff)  can be included in a class as a literal string of data units,         (0xffff) can be included in a class as a literal string of data  units,
5728         or by using the \x{ escaping mechanism.         or by using the \x{ escaping mechanism.
5729    
5730         When caseless matching is set, any letters in a  class  represent  both         When  caseless  matching  is set, any letters in a class represent both
5731         their  upper  case  and lower case versions, so for example, a caseless         their upper case and lower case versions, so for  example,  a  caseless
5732         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
5733         match  "A", whereas a caseful version would. In a UTF mode, PCRE always         match "A", whereas a caseful version would. In a UTF mode, PCRE  always
5734         understands the concept of case for characters whose  values  are  less         understands  the  concept  of case for characters whose values are less
5735         than  128, so caseless matching is always possible. For characters with         than 128, so caseless matching is always possible. For characters  with
5736         higher values, the concept of case is supported  if  PCRE  is  compiled         higher  values,  the  concept  of case is supported if PCRE is compiled
5737         with  Unicode  property support, but not otherwise.  If you want to use         with Unicode property support, but not otherwise.  If you want  to  use
5738         caseless matching in a UTF mode for characters 128 and above, you  must         caseless  matching in a UTF mode for characters 128 and above, you must
5739         ensure  that  PCRE is compiled with Unicode property support as well as         ensure that PCRE is compiled with Unicode property support as  well  as
5740         with UTF support.         with UTF support.
5741    
5742         Characters that might indicate line breaks are  never  treated  in  any         Characters  that  might  indicate  line breaks are never treated in any
5743         special  way  when  matching  character  classes,  whatever line-ending         special way  when  matching  character  classes,  whatever  line-ending
5744         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and         sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
5745         PCRE_MULTILINE options is used. A class such as [^a] always matches one         PCRE_MULTILINE options is used. A class such as [^a] always matches one
5746         of these characters.         of these characters.
5747    
5748         The minus (hyphen) character can be used to specify a range of  charac-         The  minus (hyphen) character can be used to specify a range of charac-
5749         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters in a character  class.  For  example,  [d-m]  matches  any  letter
5750         between d and m, inclusive. If a  minus  character  is  required  in  a         between  d  and  m,  inclusive.  If  a minus character is required in a
5751         class,  it  must  be  escaped  with a backslash or appear in a position         class, it must be escaped with a backslash  or  appear  in  a  position
5752         where it cannot be interpreted as indicating a range, typically as  the         where  it cannot be interpreted as indicating a range, typically as the
5753         first or last character in the class.         first or last character in the class, or immediately after a range. For
5754           example,  [b-d-z] matches letters in the range b to d, a hyphen charac-
5755           ter, or z.
5756    
5757         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
5758         ter of a range. A pattern such as [W-]46] is interpreted as a class  of         ter  of a range. A pattern such as [W-]46] is interpreted as a class of
5759         two  characters ("W" and "-") followed by a literal string "46]", so it         two characters ("W" and "-") followed by a literal string "46]", so  it
5760         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
5761         backslash  it is interpreted as the end of range, so [W-\]46] is inter-         backslash it is interpreted as the end of range, so [W-\]46] is  inter-
5762         preted as a class containing a range followed by two other  characters.         preted  as a class containing a range followed by two other characters.
5763         The  octal or hexadecimal representation of "]" can also be used to end         The octal or hexadecimal representation of "]" can also be used to  end
5764         a range.         a range.
5765    
5766           An  error  is  generated  if  a POSIX character class (see below) or an
5767           escape sequence other than one that defines a single character  appears
5768           at  a  point  where  a range ending character is expected. For example,
5769           [z-\xff] is valid, but [A-\d] and [A-[:digit:]] are not.
5770    
5771         Ranges operate in the collating sequence of character values. They  can         Ranges operate in the collating sequence of character values. They  can
5772         also   be  used  for  characters  specified  numerically,  for  example         also   be  used  for  characters  specified  numerically,  for  example
5773         [\000-\037]. Ranges can include any characters that are valid  for  the         [\000-\037]. Ranges can include any characters that are valid  for  the
# Line 5700  POSIX CHARACTER CLASSES Line 5830  POSIX CHARACTER CLASSES
5830           lower    lower case letters           lower    lower case letters
5831           print    printing characters, including space           print    printing characters, including space
5832           punct    printing characters, excluding letters and digits and space           punct    printing characters, excluding letters and digits and space
5833           space    white space (not quite the same as \s)           space    white space (the same as \s from PCRE 8.34)
5834           upper    upper case letters           upper    upper case letters
5835           word     "word" characters (same as \w)           word     "word" characters (same as \w)
5836           xdigit   hexadecimal digits           xdigit   hexadecimal digits
5837    
5838         The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),         The  default  "space" characters are HT (9), LF (10), VT (11), FF (12),
5839         and space (32). Notice that this list includes the VT  character  (code         CR (13), and space (32). If locale-specific matching is  taking  place,
5840         11). This makes "space" different to \s, which does not include VT (for         there  may be additional space characters. "Space" used to be different
5841         Perl compatibility).         to \s, which did not include VT, for Perl compatibility. However,  Perl
5842           changed at release 5.18, and PCRE followed at release 8.34. "Space" and
5843           \s now match the same set of characters.
5844    
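The default "space" set listed above can be checked by enumeration. The sketch below uses Python's re module as a stand-in, since it does not implement POSIX classes such as [:space:]; it relies only on the fact that Python's \s in ASCII mode covers the same six characters. It is an illustration, not a test of PCRE itself.

```python
import re

# The six default "space" characters listed above, by code point:
# HT, LF, VT, FF, CR, and space.
DEFAULT_SPACE = [9, 10, 11, 12, 13, 32]

# Assumption: Python's ASCII-mode \s is used here in place of PCRE's \s.
ascii_space = re.compile(r"\s", re.ASCII)

# Collect every code point below 128 that the stand-in \s accepts.
matched = [cp for cp in range(128) if ascii_space.fullmatch(chr(cp))]

print(matched)
```

The list includes VT (11), which is the character that "space" and \s used to disagree on.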
5845         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension
5846         from  Perl  5.8. Another Perl extension is negation, which is indicated         from  Perl  5.8. Another Perl extension is negation, which is indicated
# Line 5720  POSIX CHARACTER CLASSES Line 5852  POSIX CHARACTER CLASSES
5852         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
5853         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
5854    
5855         By default, in UTF modes, characters with values greater  than  128  do         By default, characters with values greater than 128 do not match any of
5856         not  match any of the POSIX character classes. However, if the PCRE_UCP         the  POSIX character classes. However, if the PCRE_UCP option is passed
5857         option is passed to pcre_compile(), some of the classes are changed  so         to pcre_compile(), some of the classes  are  changed  so  that  Unicode
5858         that Unicode character properties are used. This is achieved by replac-         character  properties  are  used. This is achieved by replacing certain
5859         ing the POSIX classes by other sequences, as follows:         POSIX classes by other sequences, as follows:
5860    
5861           [:alnum:]  becomes  \p{Xan}           [:alnum:]  becomes  \p{Xan}
5862           [:alpha:]  becomes  \p{L}           [:alpha:]  becomes  \p{L}
# Line 5735  POSIX CHARACTER CLASSES Line 5867  POSIX CHARACTER CLASSES
5867           [:upper:]  becomes  \p{Lu}           [:upper:]  becomes  \p{Lu}
5868           [:word:]   becomes  \p{Xwd}           [:word:]   becomes  \p{Xwd}
5869    
5870         Negated versions, such as [:^alpha:] use \P instead of  \p.  The  other         Negated versions, such as [:^alpha:] use \P instead of \p. Three  other
5871         POSIX classes are unchanged, and match only characters with code points         POSIX classes are handled specially in UCP mode:
5872         less than 128.  
5873           [:graph:] This  matches  characters that have glyphs that mark the page
5874                     when printed. In Unicode property terms, it matches all char-
5875                     acters with the L, M, N, P, S, or Cf properties, except for:
5876    
5877                       U+061C           Arabic Letter Mark
5878                       U+180E           Mongolian Vowel Separator
5879                       U+2066 - U+2069  Various "isolate"s
5880    
5881    
5882           [:print:] This  matches  the  same  characters  as [:graph:] plus space
5883                     characters that are not controls, that  is,  characters  with
5884                     the Zs property.
5885    
5886           [:punct:] This matches all characters that have the Unicode P (punctua-
5887                     tion) property, plus those characters whose code  points  are
5888                     less than 128 that have the S (Symbol) property.
5889    
5890           The  other  POSIX classes are unchanged, and match only characters with
5891           code points less than 128.
5892    
5893    
5894  VERTICAL BAR  VERTICAL BAR
# Line 5934  NAMED SUBPATTERNS Line 6085  NAMED SUBPATTERNS
6085         references,  recursion,  and conditions, can be made by name as well as         references,  recursion,  and conditions, can be made by name as well as
6086         by number.         by number.
6087    
6088         Names consist of up to  32  alphanumeric  characters  and  underscores.         Names consist of up to 32 alphanumeric characters and underscores,  but
6089         Named  capturing  parentheses  are  still  allocated numbers as well as         must  start  with  a  non-digit.  Named capturing parentheses are still
6090         names, exactly as if the names were not present. The PCRE API  provides         allocated numbers as well as names, exactly as if the  names  were  not
6091         function calls for extracting the name-to-number translation table from         present.  The PCRE API provides function calls for extracting the name-
6092         a compiled pattern. There is also a convenience function for extracting         to-number translation table from a compiled pattern. There  is  also  a
6093         a captured substring by name.         convenience function for extracting a captured substring by name.
6094    
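The naming rule in the new text (up to 32 alphanumeric characters and underscores, starting with a non-digit) can be sketched as a small validator. The helper name is hypothetical and not part of the PCRE API, and "alphanumeric" is assumed to mean ASCII letters and digits. The second half uses Python's re module, as an analogy only, to show that named groups keep their numbers.

```python
import re

# Hypothetical helper, not a PCRE function. Assumes "alphanumeric"
# means ASCII letters and digits.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]{0,31}")

def is_valid_group_name(name: str) -> bool:
    """Return True if name is 1 to 32 characters and starts with a non-digit."""
    return _NAME_RE.fullmatch(name) is not None

# Named groups are still numbered; Python exposes a comparable
# name-to-number table through the groupindex attribute.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
name_table = dict(date.groupindex)

# The same captured text is reachable by name or by number.
m = date.fullmatch("2013-11")
same_value = m.group("year") == m.group(1)
```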
6095         By  default, a name must be unique within a pattern, but it is possible         By  default, a name must be unique within a pattern, but it is possible
6096         to relax this constraint by setting the PCRE_DUPNAMES option at compile         to relax this constraint by setting the PCRE_DUPNAMES option at compile
# Line 5967  NAMED SUBPATTERNS Line 6118  NAMED SUBPATTERNS
6118         subpattern it was.         subpattern it was.
6119    
6120         If you make a back reference to  a  non-unique  named  subpattern  from         If you make a back reference to  a  non-unique  named  subpattern  from
6121         elsewhere  in the pattern, the one that corresponds to the first occur-         elsewhere  in the pattern, the subpatterns to which the name refers are
6122         rence of the name is used. In the absence of duplicate numbers (see the         checked in the order in which they appear in the overall  pattern.  The
6123         previous  section) this is the one with the lowest number. If you use a         first one that is set is used for the reference. For example, this pat-
6124         named reference in a condition test (see the section  about  conditions         tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
6125         below),  either  to check whether a subpattern has matched, or to check  
6126         for recursion, all subpatterns with the same name are  tested.  If  the           (?:(?<n>foo)|(?<n>bar))\k<n>
6127         condition  is  true for any one of them, the overall condition is true.  
6128         This is the same behaviour as testing by number. For further details of  
6129         the interfaces for handling named subpatterns, see the pcreapi documen-         If you make a subroutine call to a non-unique named subpattern, the one
6130         tation.         that  corresponds  to  the first occurrence of the name is used. In the
6131           absence of duplicate numbers (see the previous section) this is the one
6132           with the lowest number.
6133    
6134           If you use a named reference in a condition test (see the section about
6135           conditions below), either to check whether a subpattern has matched, or
6136           to  check for recursion, all subpatterns with the same name are tested.
6137           If the condition is true for any one of them, the overall condition  is
6138           true.  This  is  the  same  behaviour as testing by number. For further
6139           details of the interfaces  for  handling  named  subpatterns,  see  the
6140           pcreapi documentation.
6141    
6142         Warning: You cannot use different names to distinguish between two sub-         Warning: You cannot use different names to distinguish between two sub-
6143         patterns  with  the same number because PCRE uses only the numbers when         patterns with the same number because PCRE uses only the  numbers  when
6144         matching. For this reason, an error is given at compile time if differ-         matching. For this reason, an error is given at compile time if differ-
6145         ent  names  are given to subpatterns with the same number. However, you         ent names are given to subpatterns with the same number.  However,  you
6146         can give the same name to subpatterns with the same number,  even  when         can always give the same name to subpatterns with the same number, even
6147         PCRE_DUPNAMES is not set.         when PCRE_DUPNAMES is not set.
6148    
6149    
6150  REPETITION  REPETITION
6151    
6152         Repetition  is  specified  by  quantifiers, which can follow any of the         Repetition is specified by quantifiers, which can  follow  any  of  the
6153         following items:         following items:
6154    
6155           a literal data character           a literal data character
# Line 6002  REPETITION Line 6163  REPETITION
6163           a parenthesized subpattern (including assertions)           a parenthesized subpattern (including assertions)
6164           a subroutine call to a subpattern (recursive or otherwise)           a subroutine call to a subpattern (recursive or otherwise)
6165    
6166         The general repetition quantifier specifies a minimum and maximum  num-         The  general repetition quantifier specifies a minimum and maximum num-
6167         ber  of  permitted matches, by giving the two numbers in curly brackets         ber of permitted matches, by giving the two numbers in  curly  brackets
6168         (braces), separated by a comma. The numbers must be  less  than  65536,         (braces),  separated  by  a comma. The numbers must be less than 65536,
6169         and the first must be less than or equal to the second. For example:         and the first must be less than or equal to the second. For example:
6170    
6171           z{2,4}           z{2,4}
6172    
6173         matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a         matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a
6174         special character. If the second number is omitted, but  the  comma  is         special  character.  If  the second number is omitted, but the comma is
6175         present,  there  is  no upper limit; if the second number and the comma         present, there is no upper limit; if the second number  and  the  comma
6176         are both omitted, the quantifier specifies an exact number of  required         are  both omitted, the quantifier specifies an exact number of required
6177         matches. Thus         matches. Thus
6178    
6179           [aeiou]{3,}           [aeiou]{3,}
# Line 6021  REPETITION Line 6182  REPETITION
6182    
6183           \d{8}           \d{8}
6184    
6185         matches  exactly  8  digits. An opening curly bracket that appears in a         matches exactly 8 digits. An opening curly bracket that  appears  in  a
6186         position where a quantifier is not allowed, or one that does not  match         position  where a quantifier is not allowed, or one that does not match
6187         the  syntax of a quantifier, is taken as a literal character. For exam-         the syntax of a quantifier, is taken as a literal character. For  exam-
6188         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
6189    
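The three quantifier forms above ({min,max}, {min,} and {n}) can be tried out with a short sketch. It uses Python's re module, which agrees with the text for these forms; the {,6} case is left out on purpose, because Python reads it as {0,6} rather than as a literal string.

```python
import re

def fully_matches(pattern: str, subject: str) -> bool:
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# z{2,4}: between two and four repetitions.
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz")
           if fully_matches(r"z{2,4}", s)]

# [aeiou]{3,}: at least three vowels, no upper limit.
open_ended = fully_matches(r"[aeiou]{3,}", "aeiouaeiou")
too_short = fully_matches(r"[aeiou]{3,}", "ae")

# \d{8}: exactly eight digits.
exact = [fully_matches(r"\d{8}", s)
         for s in ("1234567", "12345678", "123456789")]
```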
6190         In UTF modes, quantifiers apply to characters rather than to individual         In UTF modes, quantifiers apply to characters rather than to individual
6191         data  units. Thus, for example, \x{100}{2} matches two characters, each         data units. Thus, for example, \x{100}{2} matches two characters,  each
6192         of which is represented by a two-byte sequence in a UTF-8 string. Simi-         of which is represented by a two-byte sequence in a UTF-8 string. Simi-
6193         larly,  \X{3} matches three Unicode extended grapheme clusters, each of         larly, \X{3} matches three Unicode extended grapheme clusters, each  of
6194         which may be several data units long (and  they  may  be  of  different         which  may  be  several  data  units long (and they may be of different
6195         lengths).         lengths).
6196    
6197         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
6198         the previous item and the quantifier were not present. This may be use-         the previous item and the quantifier were not present. This may be use-
6199         ful  for  subpatterns that are referenced as subroutines from elsewhere         ful for subpatterns that are referenced as subroutines  from  elsewhere
6200         in the pattern (but see also the section entitled "Defining subpatterns         in the pattern (but see also the section entitled "Defining subpatterns
6201         for  use  by  reference only" below). Items other than subpatterns that         for use by reference only" below). Items other  than  subpatterns  that
6202         have a {0} quantifier are omitted from the compiled pattern.         have a {0} quantifier are omitted from the compiled pattern.
6203    
6204         For convenience, the three most common quantifiers have  single-charac-         For  convenience, the three most common quantifiers have single-charac-
6205         ter abbreviations:         ter abbreviations:
6206    
6207           *    is equivalent to {0,}           *    is equivalent to {0,}
6208           +    is equivalent to {1,}           +    is equivalent to {1,}
6209           ?    is equivalent to {0,1}           ?    is equivalent to {0,1}
6210    
6211         It  is  possible  to construct infinite loops by following a subpattern         It is possible to construct infinite loops by  following  a  subpattern
6212         that can match no characters with a quantifier that has no upper limit,         that can match no characters with a quantifier that has no upper limit,
6213         for example:         for example:
6214    
6215           (a?)*           (a?)*
6216    
6217         Earlier versions of Perl and PCRE used to give an error at compile time         Earlier versions of Perl and PCRE used to give an error at compile time
6218         for such patterns. However, because there are cases where this  can  be         for  such  patterns. However, because there are cases where this can be
6219         useful,  such  patterns  are now accepted, but if any repetition of the         useful, such patterns are now accepted, but if any  repetition  of  the
6220         subpattern does in fact match no characters, the loop is forcibly  bro-         subpattern  does in fact match no characters, the loop is forcibly bro-
6221         ken.         ken.
6222    
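A pattern of this shape can be run without hanging. The sketch below uses Python's re module, which also accepts such patterns and stops the loop once an iteration matches nothing; it shows only that matching terminates, and makes no claim about how PCRE breaks the loop internally.

```python
import re

# (a?)* repeats a group that is able to match the empty string.
loop = re.compile(r"(a?)*")

# The loop consumes the two leading "a" characters and then stops.
consumed = loop.match("aab").group(0)

# With nothing to consume, the overall match is empty but still succeeds.
empty = loop.match("b").group(0)
```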
6223         By  default,  the quantifiers are "greedy", that is, they match as much         By default, the quantifiers are "greedy", that is, they match  as  much
6224         as possible (up to the maximum  number  of  permitted  times),  without         as  possible  (up  to  the  maximum number of permitted times), without
6225         causing  the  rest of the pattern to fail. The classic example of where         causing the rest of the pattern to fail. The classic example  of  where
6226         this gives problems is in trying to match comments in C programs. These         this gives problems is in trying to match comments in C programs. These
6227         appear  between  /*  and  */ and within the comment, individual * and /         appear between /* and */ and within the comment,  individual  *  and  /
6228         characters may appear. An attempt to match C comments by  applying  the         characters  may  appear. An attempt to match C comments by applying the
6229         pattern         pattern
6230    
6231           /\*.*\*/           /\*.*\*/
# Line 6073  REPETITION Line 6234  REPETITION
6234    
6235           /* first comment */  not comment  /* second comment */           /* first comment */  not comment  /* second comment */
6236    
6237         fails,  because it matches the entire string owing to the greediness of         fails, because it matches the entire string owing to the greediness  of
6238         the .*  item.         the .*  item.
6239    
6240         However, if a quantifier is followed by a question mark, it  ceases  to         However,  if  a quantifier is followed by a question mark, it ceases to
6241         be greedy, and instead matches the minimum number of times possible, so         be greedy, and instead matches the minimum number of times possible, so
6242         the pattern         the pattern
6243    
6244           /\*.*?\*/           /\*.*?\*/
6245    
6246         does the right thing with the C comments. The meaning  of  the  various         does  the  right  thing with the C comments. The meaning of the various
6247         quantifiers  is  not  otherwise  changed,  just the preferred number of         quantifiers is not otherwise changed,  just  the  preferred  number  of
6248         matches.  Do not confuse this use of question mark with its  use  as  a         matches.   Do  not  confuse this use of question mark with its use as a
6249         quantifier  in its own right. Because it has two uses, it can sometimes         quantifier in its own right. Because it has two uses, it can  sometimes
6250         appear doubled, as in         appear doubled, as in
6251    
6252           \d??\d           \d??\d
# Line 6093  REPETITION Line 6254  REPETITION
6254         which matches one digit by preference, but can match two if that is the         which matches one digit by preference, but can match two if that is the
6255         only way the rest of the pattern matches.         only way the rest of the pattern matches.
6256    
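The difference between greedy and minimizing quantifiers can be seen by running both comment patterns over the sample string. This is a minimal sketch using Python's re module, whose behaviour matches the description above for these patterns.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last */ in the subject, swallowing everything.
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Minimizing .*? stops at the first */ after each /*.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the
# pattern (here, the end of the subject) requires it.
preferred = re.match(r"\d??\d", "12").group(0)
forced = re.fullmatch(r"\d??\d", "12").group(0)
```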
6257         If  the PCRE_UNGREEDY option is set (an option that is not available in         If the PCRE_UNGREEDY option is set (an option that is not available  in
6258         Perl), the quantifiers are not greedy by default, but  individual  ones         Perl),  the  quantifiers are not greedy by default, but individual ones
6259         can  be  made  greedy  by following them with a question mark. In other         can be made greedy by following them with a  question  mark.  In  other
6260         words, it inverts the default behaviour.         words, it inverts the default behaviour.
6261    
6262         When a parenthesized subpattern is quantified  with  a  minimum  repeat         When  a  parenthesized  subpattern  is quantified with a minimum repeat
6263         count  that is greater than 1 or with a limited maximum, more memory is         count that is greater than 1 or with a limited maximum, more memory  is
6264         required for the compiled pattern, in proportion to  the  size  of  the         required  for  the  compiled  pattern, in proportion to the size of the
6265         minimum or maximum.         minimum or maximum.
6266    
6267         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
6268         alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,         alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,
6269         the  pattern  is  implicitly anchored, because whatever follows will be         the pattern is implicitly anchored, because whatever  follows  will  be
6270         tried against every character position in the subject string, so  there         tried  against every character position in the subject string, so there
6271         is  no  point  in  retrying the overall match at any position after the         is no point in retrying the overall match at  any  position  after  the
6272         first. PCRE normally treats such a pattern as though it  were  preceded         first.  PCRE  normally treats such a pattern as though it were preceded
6273         by \A.         by \A.
6274    
6275         In  cases  where  it  is known that the subject string contains no new-         In cases where it is known that the subject  string  contains  no  new-
6276         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-         lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-
6277         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
6278    
6279         However,  there  are  some cases where the optimization cannot be used.         However, there are some cases where the optimization  cannot  be  used.
6280         When .*  is inside capturing parentheses that are the subject of a back         When .*  is inside capturing parentheses that are the subject of a back
6281         reference elsewhere in the pattern, a match at the start may fail where         reference elsewhere in the pattern, a match at the start may fail where
6282         a later one succeeds. Consider, for example:         a later one succeeds. Consider, for example:
6283    
6284           (.*)abc\1           (.*)abc\1
6285    
6286         If the subject is "xyz123abc123" the match point is the fourth  charac-         If  the subject is "xyz123abc123" the match point is the fourth charac-
6287         ter. For this reason, such a pattern is not implicitly anchored.         ter. For this reason, such a pattern is not implicitly anchored.
6288    
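The back reference case can be checked directly. The sketch below uses Python's re module, which handles this pattern the same way: an attempt anchored at the start of the subject fails, while an unanchored search succeeds at offset 3, the fourth character.

```python
import re

pattern = re.compile(r"(.*)abc\1")
subject = "xyz123abc123"

# Anchored at the start: group 1 would have to be "xyz123", and the
# text after "abc" is only "123", so the back reference fails.
anchored = pattern.match(subject)

# Unanchored: the search advances until group 1 can be "123".
found = pattern.search(subject)
```

Here found.start() is 3 and group 1 is "123", which is why treating the pattern as if it began with \A would miss the match.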
6289         Another  case where implicit anchoring is not applied is when the lead-         Another case where implicit anchoring is not applied is when the  lead-
6290         ing .* is inside an atomic group. Once again, a match at the start  may         ing  .* is inside an atomic group. Once again, a match at the start may
6291         fail where a later one succeeds. Consider this pattern:         fail where a later one succeeds. Consider this pattern:
6292    
6293           (?>.*?a)b           (?>.*?a)b
6294    
6295         It  matches "ab" in the subject "aab". The use of the backtracking con-         It matches "ab" in the subject "aab". The use of the backtracking  con-
6296         trol verbs (*PRUNE) and (*SKIP) also disables this optimization.         trol verbs (*PRUNE) and (*SKIP) also disables this optimization.
6297    
6298         When a capturing subpattern is repeated, the value captured is the sub-         When a capturing subpattern is repeated, the value captured is the sub-
# Line 6140  REPETITION Line 6301  REPETITION
6301           (tweedle[dume]{3}\s*)+           (tweedle[dume]{3}\s*)+
6302    
6303         has matched "tweedledum tweedledee" the value of the captured substring         has matched "tweedledum tweedledee" the value of the captured substring
6304         is "tweedledee". However, if there are  nested  capturing  subpatterns,         is  "tweedledee".  However,  if there are nested capturing subpatterns,
6305         the  corresponding captured values may have been set in previous itera-         the corresponding captured values may have been set in previous  itera-
6306         tions. For example, after         tions. For example, after
6307    
6308           /(a|(b))+/           /(a|(b))+/
# Line 6151  REPETITION Line 6312  REPETITION
6312    
6313  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
6314    
6315         With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")         With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
6316         repetition,  failure  of what follows normally causes the repeated item         repetition, failure of what follows normally causes the  repeated  item
6317         to be re-evaluated to see if a different number of repeats  allows  the         to  be  re-evaluated to see if a different number of repeats allows the
6318         rest  of  the pattern to match. Sometimes it is useful to prevent this,         rest of the pattern to match. Sometimes it is useful to  prevent  this,
6319         either to change the nature of the match, or to cause it to fail earlier         either  to  change the nature of the match, or to cause it to fail earlier
6320         than  it otherwise might, when the author of the pattern knows there is         than it otherwise might, when the author of the pattern knows there  is
6321         no point in carrying on.         no point in carrying on.
6322    
6323         Consider, for example, the pattern \d+foo when applied to  the  subject         Consider,  for  example, the pattern \d+foo when applied to the subject
6324         line         line
6325    
6326           123456bar           123456bar
6327    
6328         After matching all 6 digits and then failing to match "foo", the normal         After matching all 6 digits and then failing to match "foo", the normal
6329         action of the matcher is to try again with only 5 digits  matching  the         action  of  the matcher is to try again with only 5 digits matching the
6330         \d+  item,  and  then  with  4,  and  so on, before ultimately failing.         \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
6331         "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides         "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
6332         the  means for specifying that once a subpattern has matched, it is not         the means for specifying that once a subpattern has matched, it is  not
6333         to be re-evaluated in this way.         to be re-evaluated in this way.
6334    
6335         If we use atomic grouping for the previous example, the  matcher  gives         If  we  use atomic grouping for the previous example, the matcher gives
6336         up  immediately  on failing to match "foo" the first time. The notation         up immediately on failing to match "foo" the first time.  The  notation
6337         is a kind of special parenthesis, starting with (?> as in this example:         is a kind of special parenthesis, starting with (?> as in this example:
6338    
6339           (?>\d+)foo           (?>\d+)foo
6340    
6341         This kind of parenthesis "locks up" the  part of the  pattern  it  con-         This  kind  of  parenthesis "locks up" the  part of the pattern it con-
6342         tains  once  it  has matched, and a failure further into the pattern is         tains once it has matched, and a failure further into  the  pattern  is
6343         prevented from backtracking into it. Backtracking past it  to  previous         prevented  from  backtracking into it. Backtracking past it to previous
6344         items, however, works as normal.         items, however, works as normal.
6345    
6346         An  alternative  description  is that a subpattern of this type matches         An alternative description is that a subpattern of  this  type  matches
6347         the string of characters that an  identical  standalone  pattern  would         the  string  of  characters  that an identical standalone pattern would
6348         match, if anchored at the current point in the subject string.         match, if anchored at the current point in the subject string.
6349    
6350         Atomic grouping subpatterns are not capturing subpatterns. Simple cases         Atomic grouping subpatterns are not capturing subpatterns. Simple cases
6351         such as the above example can be thought of as a maximizing repeat that         such as the above example can be thought of as a maximizing repeat that
6352         must  swallow  everything  it can. So, while both \d+ and \d+? are pre-         must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
6353         pared to adjust the number of digits they match in order  to  make  the         pared  to  adjust  the number of digits they match in order to make the
6354         rest of the pattern match, (?>\d+) can only match an entire sequence of         rest of the pattern match, (?>\d+) can only match an entire sequence of
6355         digits.         digits.
6356    
6357         Atomic groups in general can of course contain arbitrarily  complicated         Atomic  groups in general can of course contain arbitrarily complicated
6358         subpatterns,  and  can  be  nested. However, when the subpattern for an         subpatterns, and can be nested. However, when  the  subpattern  for  an
6359         atomic group is just a single repeated item, as in the example above, a         atomic group is just a single repeated item, as in the example above, a
6360         simpler  notation,  called  a "possessive quantifier" can be used. This         simpler notation, called a "possessive quantifier" can  be  used.  This
6361         consists of an additional + character  following  a  quantifier.  Using         consists  of  an  additional  + character following a quantifier. Using
6362         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
6363    
6364           \d++foo           \d++foo
# Line 6207  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 6368  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
6368    
6369           (abc|xyz){2,3}+           (abc|xyz){2,3}+
6370    
6371         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Possessive   quantifiers   are   always  greedy;  the  setting  of  the
6372         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
6373         simpler forms of atomic group. However, there is no difference  in  the         simpler  forms  of atomic group. However, there is no difference in the
6374         meaning  of  a  possessive  quantifier and the equivalent atomic group,         meaning of a possessive quantifier and  the  equivalent  atomic  group,
6375         though there may be a performance  difference;  possessive  quantifiers         though  there  may  be a performance difference; possessive quantifiers
6376         should be slightly faster.         should be slightly faster.
6377    
6378         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-         The possessive quantifier syntax is an extension to the Perl  5.8  syn-
6379         tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
6380         edition of his book. Mike McCloskey liked it, so implemented it when he         edition of his book. Mike McCloskey liked it, so implemented it when he
6381         built Sun's Java package, and PCRE copied it from there. It  ultimately         built  Sun's Java package, and PCRE copied it from there. It ultimately
6382         found its way into Perl at release 5.10.         found its way into Perl at release 5.10.
6383    
6384         PCRE has an optimization that automatically "possessifies" certain sim-         PCRE has an optimization that automatically "possessifies" certain sim-
6385         ple pattern constructs. For example, the sequence  A+B  is  treated  as         ple  pattern  constructs.  For  example, the sequence A+B is treated as
6386         A++B  because  there is no point in backtracking into a sequence of A's         A++B because there is no point in backtracking into a sequence  of  A's
6387         when B must follow.         when B must follow.
6388    
6389         When a pattern contains an unlimited repeat inside  a  subpattern  that         When  a  pattern  contains an unlimited repeat inside a subpattern that
6390         can  itself  be  repeated  an  unlimited number of times, the use of an         can itself be repeated an unlimited number of  times,  the  use  of  an
6391         atomic group is the only way to avoid some  failing  matches  taking  a         atomic  group  is  the  only way to avoid some failing matches taking a
6392         very long time indeed. The pattern         very long time indeed. The pattern
6393    
6394           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
6395    
6396         matches  an  unlimited number of substrings that either consist of non-         matches an unlimited number of substrings that either consist  of  non-
6397         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
6398         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
6399    
6400           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
6401    
6402         it  takes  a  long  time  before reporting failure. This is because the         it takes a long time before reporting  failure.  This  is  because  the
6403         string can be divided between the internal \D+ repeat and the  external         string  can be divided between the internal \D+ repeat and the external
6404         *  repeat  in  a  large  number of ways, and all have to be tried. (The         * repeat in a large number of ways, and all  have  to  be  tried.  (The
6405         example uses [!?] rather than a single character at  the  end,  because         example  uses  [!?]  rather than a single character at the end, because
6406         both  PCRE  and  Perl have an optimization that allows for fast failure         both PCRE and Perl have an optimization that allows  for  fast  failure
6407         when a single character is used. They remember the last single  charac-         when  a single character is used. They remember the last single charac-
6408         ter  that  is required for a match, and fail early if it is not present         ter that is required for a match, and fail early if it is  not  present
6409         in the string.) If the pattern is changed so that  it  uses  an  atomic         in  the  string.)  If  the pattern is changed so that it uses an atomic
6410         group, like this:         group, like this:
6411    
6412           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
# Line 6257  BACK REFERENCES Line 6418  BACK REFERENCES
6418    
6419         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
6420         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
6421         pattern  earlier  (that is, to its left) in the pattern, provided there         pattern earlier (that is, to its left) in the pattern,  provided  there
6422         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
6423    
6424         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
6425         it  is  always  taken  as a back reference, and causes an error only if         it is always taken as a back reference, and causes  an  error  only  if
6426         there are not that many capturing left parentheses in the  entire  pat-         there  are  not that many capturing left parentheses in the entire pat-
6427         tern.  In  other words, the parentheses that are referenced need not be         tern. In other words, the parentheses that are referenced need  not  be
6428         to the left of the reference for numbers less than 10. A "forward  back         to  the left of the reference for numbers less than 10. A "forward back
6429         reference"  of  this  type can make sense when a repetition is involved         reference" of this type can make sense when a  repetition  is  involved
6430         and the subpattern to the right has participated in an  earlier  itera-         and  the  subpattern to the right has participated in an earlier itera-
6431         tion.         tion.
6432    
6433         It  is  not  possible to have a numerical "forward back reference" to a         It is not possible to have a numerical "forward back  reference"  to  a
6434         subpattern whose number is 10 or  more  using  this  syntax  because  a         subpattern  whose  number  is  10  or  more using this syntax because a
6435         sequence  such  as  \50 is interpreted as a character defined in octal.         sequence such as \50 is interpreted as a character  defined  in  octal.
6436         See the subsection entitled "Non-printing characters" above for further         See the subsection entitled "Non-printing characters" above for further
6437         details  of  the  handling of digits following a backslash. There is no         details of the handling of digits following a backslash.  There  is  no
6438         such problem when named parentheses are used. A back reference  to  any         such  problem  when named parentheses are used. A back reference to any
6439         subpattern is possible using named parentheses (see below).         subpattern is possible using named parentheses (see below).
6440    
6441         Another  way  of  avoiding  the ambiguity inherent in the use of digits         Another way of avoiding the ambiguity inherent in  the  use  of  digits
6442         following a backslash is to use the \g  escape  sequence.  This  escape         following  a  backslash  is  to use the \g escape sequence. This escape
6443         must be followed by an unsigned number or a negative number, optionally         must be followed by an unsigned number or a negative number, optionally
6444         enclosed in braces. These examples are all identical:         enclosed in braces. These examples are all identical:
6445    
# Line 6286  BACK REFERENCES Line 6447  BACK REFERENCES
6447           (ring), \g1           (ring), \g1
6448           (ring), \g{1}           (ring), \g{1}
6449    
6450         An unsigned number specifies an absolute reference without the  ambigu-         An  unsigned number specifies an absolute reference without the ambigu-
6451         ity that is present in the older syntax. It is also useful when literal         ity that is present in the older syntax. It is also useful when literal
6452         digits follow the reference. A negative number is a relative reference.         digits follow the reference. A negative number is a relative reference.
6453         Consider this example:         Consider this example:
# Line 6295  BACK REFERENCES Line 6456  BACK REFERENCES
6456    
6457         The sequence \g{-1} is a reference to the most recently started captur-         The sequence \g{-1} is a reference to the most recently started captur-
6458         ing subpattern before \g, that is, it is equivalent to \2 in this exam-         ing subpattern before \g, that is, it is equivalent to \2 in this exam-
6459         ple.   Similarly, \g{-2} would be equivalent to \1. The use of relative         ple.  Similarly, \g{-2} would be equivalent to \1. The use of  relative
6460         references can be helpful in long patterns, and also in  patterns  that         references  can  be helpful in long patterns, and also in patterns that
6461         are  created  by  joining  together  fragments  that contain references         are created by  joining  together  fragments  that  contain  references
6462         within themselves.         within themselves.
6463    
6464         A back reference matches whatever actually matched the  capturing  sub-         A  back  reference matches whatever actually matched the capturing sub-
6465         pattern  in  the  current subject string, rather than anything matching         pattern in the current subject string, rather  than  anything  matching
6466         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
6467         of doing that). So the pattern         of doing that). So the pattern
6468    
6469           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
6470    
6471         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
6472         not "sense and responsibility". If caseful matching is in force at  the         not  "sense and responsibility". If caseful matching is in force at the
6473         time  of the back reference, the case of letters is relevant. For exam-         time of the back reference, the case of letters is relevant. For  exam-
6474         ple,         ple,
6475    
6476           ((?i)rah)\s+\1           ((?i)rah)\s+\1
6477    
6478         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
6479         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
6480    
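The "sense and sensibility" example above can be tried out with a short script. This is only a sketch: it uses Python's re module rather than PCRE (an assumption made for convenience), and the helper name matches() is hypothetical. For this particular pattern the two engines are expected to agree.

```python
import re

# Illustration only: Python's re module, not PCRE itself.
# \1 must repeat the text that group 1 actually captured,
# not merely something that group 1 could have matched.
pattern = re.compile(r"(sens|respons)e and \1ibility")

def matches(text):
    """Return True if the whole of text matches the pattern."""
    return pattern.fullmatch(text) is not None

print(matches("sense and sensibility"))        # group 1 = "sens", \1 = "sens"
print(matches("response and responsibility"))  # group 1 = "respons"
print(matches("sense and responsibility"))     # \1 = "sens", so "respons" is rejected
```
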
6481         There  are  several  different ways of writing back references to named         There are several different ways of writing back  references  to  named
6482         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
6483         \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
6484         unified back reference syntax, in which \g can be used for both numeric         unified back reference syntax, in which \g can be used for both numeric
6485         and  named  references,  is  also supported. We could rewrite the above         and named references, is also supported. We  could  rewrite  the  above
6486         example in any of the following ways:         example in any of the following ways:
6487    
6488           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
# Line 6329  BACK REFERENCES Line 6490  BACK REFERENCES
6490           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
6491           (?<p1>(?i)rah)\s+\g{p1}           (?<p1>(?i)rah)\s+\g{p1}
6492    
6493         A subpattern that is referenced by  name  may  appear  in  the  pattern         A  subpattern  that  is  referenced  by  name may appear in the pattern
6494         before or after the reference.         before or after the reference.
6495    
6496         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
6497         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
6498         references to it always fail by default. For example, the pattern         references to it always fail by default. For example, the pattern
6499    
6500           (a|(bc))\2           (a|(bc))\2
6501    
6502         always  fails  if  it starts to match "a" rather than "bc". However, if         always fails if it starts to match "a" rather than  "bc".  However,  if
6503         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
6504         ence to an unset value matches an empty string.         ence to an unset value matches an empty string.
6505    
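The default behaviour for a back reference to an unset group can be sketched as follows. The script uses Python's re module, not PCRE (an assumption for illustration); Python also fails such references, which corresponds to the default case described above and not to the PCRE_JAVASCRIPT_COMPAT case. The helper name first_match() is hypothetical.

```python
import re

# Illustration only: Python's re module, not PCRE itself.
# Group 2 is set only when the "bc" alternative is taken.
unset_ref = re.compile(r"(a|(bc))\2")

def first_match(text):
    """Return the first matched substring, or None if there is no match."""
    m = unset_ref.search(text)
    return m.group(0) if m else None

print(first_match("bcbc"))  # group 2 = "bc", so \2 matches the second "bc"
print(first_match("aa"))    # group 2 is unset after "a", so \2 fails
```
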
6506         Because  there may be many capturing parentheses in a pattern, all dig-         Because there may be many capturing parentheses in a pattern, all  dig-
6507         its following a backslash are taken as part of a potential back  refer-         its  following a backslash are taken as part of a potential back refer-
6508         ence  number.   If  the  pattern continues with a digit character, some         ence number.  If the pattern continues with  a  digit  character,  some
6509         delimiter must  be  used  to  terminate  the  back  reference.  If  the         delimiter  must  be  used  to  terminate  the  back  reference.  If the
6510         PCRE_EXTENDED  option  is  set, this can be white space. Otherwise, the         PCRE_EXTENDED option is set, this can be white  space.  Otherwise,  the
6511         \g{ syntax or an empty comment (see "Comments" below) can be used.         \g{ syntax or an empty comment (see "Comments" below) can be used.
6512    
6513     Recursive back references     Recursive back references
6514    
6515         A back reference that occurs inside the parentheses to which it  refers         A  back reference that occurs inside the parentheses to which it refers
6516         fails  when  the subpattern is first used, so, for example, (a\1) never         fails when the subpattern is first used, so, for example,  (a\1)  never
6517         matches.  However, such references can be useful inside  repeated  sub-         matches.   However,  such references can be useful inside repeated sub-
6518         patterns. For example, the pattern         patterns. For example, the pattern
6519    
6520           (a|b\1)+           (a|b\1)+
6521    
6522         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
6523         ation of the subpattern,  the  back  reference  matches  the  character         ation  of  the  subpattern,  the  back  reference matches the character
6524         string  corresponding  to  the previous iteration. In order for this to         string corresponding to the previous iteration. In order  for  this  to
6525         work, the pattern must be such that the first iteration does  not  need         work,  the  pattern must be such that the first iteration does not need
6526         to  match the back reference. This can be done using alternation, as in         to match the back reference. This can be done using alternation, as  in
6527         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
6528    
6529         Back references of this type cause the group that they reference to  be         Back  references of this type cause the group that they reference to be
6530         treated  as  an atomic group.  Once the whole group has been matched, a         treated as an atomic group.  Once the whole group has been  matched,  a
6531         subsequent matching failure cannot cause backtracking into  the  middle         subsequent  matching  failure cannot cause backtracking into the middle
6532         of the group.         of the group.
6533    
6534    
6535  ASSERTIONS  ASSERTIONS
6536    
6537         An  assertion  is  a  test on the characters following or preceding the         An assertion is a test on the characters  following  or  preceding  the
6538         current matching point that does not actually consume  any  characters.         current  matching  point that does not actually consume any characters.
6539         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
6540         described above.         described above.
6541    
6542         More complicated assertions are coded as  subpatterns.  There  are  two         More  complicated  assertions  are  coded as subpatterns. There are two
6543         kinds:  those  that  look  ahead of the current position in the subject         kinds: those that look ahead of the current  position  in  the  subject
6544         string, and those that look  behind  it.  An  assertion  subpattern  is         string,  and  those  that  look  behind  it. An assertion subpattern is
6545         matched  in  the  normal way, except that it does not cause the current         matched in the normal way, except that it does not  cause  the  current
6546         matching position to be changed.         matching position to be changed.
6547    
6548         Assertion subpatterns are not capturing subpatterns. If such an  asser-         Assertion  subpatterns are not capturing subpatterns. If such an asser-
6549         tion  contains  capturing  subpatterns within it, these are counted for         tion contains capturing subpatterns within it, these  are  counted  for
6550         the purposes of numbering the capturing subpatterns in the  whole  pat-         the  purposes  of numbering the capturing subpatterns in the whole pat-
6551         tern.  However,  substring  capturing  is carried out only for positive         tern. However, substring capturing is carried  out  only  for  positive
6552         assertions. (Perl sometimes, but not always, does do capturing in nega-         assertions. (Perl sometimes, but not always, does do capturing in nega-
6553         tive assertions.)         tive assertions.)
6554    
6555         For  compatibility  with  Perl,  assertion subpatterns may be repeated;         For compatibility with Perl, assertion  subpatterns  may  be  repeated;
6556         though it makes no sense to assert the same thing  several  times,  the         though  it  makes  no sense to assert the same thing several times, the
6557         side  effect  of  capturing  parentheses may occasionally be useful. In         side effect of capturing parentheses may  occasionally  be  useful.  In
6558         practice, there are only three cases:         practice, there are only three cases:
6559    
6560         (1) If the quantifier is {0}, the  assertion  is  never  obeyed  during         (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during
6561         matching.   However,  it  may  contain internal capturing parenthesized         matching.  However, it may  contain  internal  capturing  parenthesized
6562         groups that are called from elsewhere via the subroutine mechanism.         groups that are called from elsewhere via the subroutine mechanism.
6563    
6564         (2) If the quantifier is {0,n} where n is greater than zero, it is  treated         (2)  If the quantifier is {0,n} where n is greater than zero, it is treated
6565         as  if  it  were  {0,1}.  At run time, the rest of the pattern match is         as if it were {0,1}. At run time, the rest  of  the  pattern  match  is
6566         tried with and without the assertion, the order depending on the greed-         tried with and without the assertion, the order depending on the greed-
6567         iness of the quantifier.         iness of the quantifier.
6568    
6569         (3)  If  the minimum repetition is greater than zero, the quantifier is         (3) If the minimum repetition is greater than zero, the  quantifier  is
6570         ignored.  The assertion is obeyed just  once  when  encountered  during         ignored.   The  assertion  is  obeyed just once when encountered during
6571         matching.         matching.
6572    
6573     Lookahead assertions     Lookahead assertions
# Line 6416  ASSERTIONS Line 6577  ASSERTIONS
6577    
6578           \w+(?=;)           \w+(?=;)
6579    
6580         matches a word followed by a semicolon, but does not include the  semi-         matches  a word followed by a semicolon, but does not include the semi-
6581         colon in the match, and         colon in the match, and
6582    
6583           foo(?!bar)           foo(?!bar)
6584    
6585         matches  any  occurrence  of  "foo" that is not followed by "bar". Note         matches any occurrence of "foo" that is not  followed  by  "bar".  Note
6586         that the apparently similar pattern         that the apparently similar pattern
6587    
6588           (?!foo)bar           (?!foo)bar
6589    
6590         does not find an occurrence of "bar"  that  is  preceded  by  something         does  not  find  an  occurrence  of "bar" that is preceded by something
6591         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other than "foo"; it finds any occurrence of "bar" whatsoever,  because
6592         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
6593         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
6594    
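The difference between the lookahead patterns above, and the lookbehind that achieves the "other effect", can be checked with a small script. This is a sketch using Python's re module rather than PCRE (an assumption for illustration); the helper name spans() is hypothetical. These simple assertions are expected to behave the same way in both engines.

```python
import re

# Illustration only: Python's re module, not PCRE itself.
def spans(pattern, text):
    """Return the (start, end) offsets of every match of pattern in text."""
    return [m.span() for m in re.finditer(pattern, text)]

# The semicolon is required but is not part of the match.
print(spans(r"\w+(?=;)", "alpha; beta"))

# Only the "foo" that is not followed by "bar" is reported.
print(spans(r"foo(?!bar)", "foobar foobaz"))

# (?!foo) is tested at the start of "bar", where it is always true,
# so the "bar" in "foobar" is still found.
print(spans(r"(?!foo)bar", "foobar"))

# A lookbehind is what rejects "bar" preceded by "foo".
print(spans(r"(?<!foo)bar", "foobar"))
```
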
6595         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
6596         most convenient way to do it is  with  (?!)  because  an  empty  string         most  convenient  way  to  do  it  is with (?!) because an empty string
6597         always  matches, so an assertion that requires there not to be an empty         always matches, so an assertion that requires there not to be an  empty
6598         string must always fail.  The backtracking control verb (*FAIL) or (*F)         string must always fail.  The backtracking control verb (*FAIL) or (*F)
6599         is a synonym for (?!).         is a synonym for (?!).
6600    
6601     Lookbehind assertions     Lookbehind assertions
6602    
6603         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
6604         for negative assertions. For example,         for negative assertions. For example,
6605    
6606           (?<!foo)bar           (?<!foo)bar
6607    
6608         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
6609         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
6610         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
6611         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
6612         fixed length. Thus         fixed length. Thus
6613    
6614           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 6456  ASSERTIONS Line 6617  ASSERTIONS
6617    
6618           (?<!dogs?|cats?)           (?<!dogs?|cats?)
6619    
6620         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
6621         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
6622         This is an extension compared with Perl, which requires all branches to         This is an extension compared with Perl, which requires all branches to
6623         match the same length of string. An assertion such as         match the same length of string. An assertion such as
6624    
6625           (?<=ab(c|de))           (?<=ab(c|de))
6626    
6627         is  not  permitted,  because  its single top-level branch can match two         is not permitted, because its single top-level  branch  can  match  two
6628         different lengths, but it is acceptable to PCRE if rewritten to use two         different lengths, but it is acceptable to PCRE if rewritten to use two
6629         top-level branches:         top-level branches:
6630    
6631           (?<=abc|abde)           (?<=abc|abde)
6632    
6633         In  some  cases, the escape sequence \K (see above) can be used instead         In some cases, the escape sequence \K (see above) can be  used  instead
6634         of a lookbehind assertion to get round the fixed-length restriction.         of a lookbehind assertion to get round the fixed-length restriction.
6635    
6636         The implementation of lookbehind assertions is, for  each  alternative,         The  implementation  of lookbehind assertions is, for each alternative,
6637         to  temporarily  move the current position back by the fixed length and         to temporarily move the current position back by the fixed  length  and
6638         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
6639         rent position, the assertion fails.         rent position, the assertion fails.
6640    
6641         In  a UTF mode, PCRE does not allow the \C escape (which matches a sin-         In a UTF mode, PCRE does not allow the \C escape (which matches a  sin-
6642         gle data unit even in a UTF mode) to appear in  lookbehind  assertions,         gle  data  unit even in a UTF mode) to appear in lookbehind assertions,
6643         because  it  makes it impossible to calculate the length of the lookbe-         because it makes it impossible to calculate the length of  the  lookbe-
6644         hind. The \X and \R escapes, which can match different numbers of  data         hind.  The \X and \R escapes, which can match different numbers of data
6645         units, are also not permitted.         units, are also not permitted.
6646    
6647         "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in         "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
6648         lookbehinds, as long as the subpattern matches a  fixed-length  string.         lookbehinds,  as  long as the subpattern matches a fixed-length string.
6649         Recursion, however, is not supported.         Recursion, however, is not supported.
6650    
6651         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
6652         assertions to specify efficient matching of fixed-length strings at the         assertions to specify efficient matching of fixed-length strings at the
6653         end of subject strings. Consider a simple pattern such as         end of subject strings. Consider a simple pattern such as
6654    
6655           abcd$           abcd$
6656    
6657         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
6658         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
6659         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
6660         pattern is specified as         pattern is specified as
6661    
6662           ^.*abcd$           ^.*abcd$
6663    
6664         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
6665         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
6666         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
6667         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
6668         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
6669    
6670           ^.*+(?<=abcd)           ^.*+(?<=abcd)
6671    
6672         there can be no backtracking for the .*+ item; it can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
6673         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
6674         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
6675         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
6676         processing time.         processing time.
6677    
6678     Using multiple assertions     Using multiple assertions
# Line 6520  ASSERTIONS Line 6681  ASSERTIONS
6681    
6682           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
6683    
6684         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
6685         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
6686         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
6687         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
6688         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
6689         ceded  by  six  characters,  the first three of which are digits and the last         ceded by six characters, the first three of which are  digits  and  the  last
6690         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
6691         foo". A pattern to do that is         foo". A pattern to do that is
6692    
6693           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
6694    
6695         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
6696         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
6697         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
6698    
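The two patterns above differ only in how far back the first assertion looks. The sketch below compares them using Python's re module rather than PCRE (an assumption for illustration); the variable names and the helper found() are hypothetical. Both lookbehinds are fixed-length, so the two engines should treat them alike.

```python
import re

# Illustration only: Python's re module, not PCRE itself.
# Both assertions are applied at the same point, just before "foo".
last_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first assertion looks back six characters instead of three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def found(pattern, text):
    """Return True if pattern matches anywhere in text."""
    return pattern.search(text) is not None

print(found(last_three, "123foo"))     # three digits that are not "999"
print(found(last_three, "999foo"))     # rejected by (?<!999)
print(found(last_three, "123abcfoo"))  # "abc" are not digits
print(found(six_back, "123abcfoo"))    # digits, then three characters that are not "999"
print(found(six_back, "123999foo"))    # last three characters are "999"
```
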
# Line 6539  ASSERTIONS Line 6700  ASSERTIONS
6700    
6701           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
6702    
6703         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
6704         is not preceded by "foo", while         is not preceded by "foo", while
6705    
6706           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
6707    
6708         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
6709         three characters that are not "999".         three characters that are not "999".
6710    
6711    
6712  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
6713    
6714         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
6715         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
6716         on  the result of an assertion, or whether a specific capturing subpat-         on the result of an assertion, or whether a specific capturing  subpat-
6717         tern has already been matched. The two possible  forms  of  conditional         tern  has  already  been matched. The two possible forms of conditional
6718         subpattern are:         subpattern are:
6719    
6720           (?(condition)yes-pattern)           (?(condition)yes-pattern)
6721           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
6722    
6723         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
6724         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
6725         tives  in  the subpattern, a compile-time error occurs. Each of the two         tives in the subpattern, a compile-time error occurs. Each of  the  two
6726         alternatives may itself contain nested subpatterns of any form, includ-         alternatives may itself contain nested subpatterns of any form, includ-
6727         ing  conditional  subpatterns;  the  restriction  to  two  alternatives         ing  conditional  subpatterns;  the  restriction  to  two  alternatives
6728         applies only at the level of the condition. This pattern fragment is an         applies only at the level of the condition. This pattern fragment is an
# Line 6570  CONDITIONAL SUBPATTERNS Line 6731  CONDITIONAL SUBPATTERNS
6731           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
6732    
6733    
6734         There  are  four  kinds of condition: references to subpatterns, refer-         There are four kinds of condition: references  to  subpatterns,  refer-
6735         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
6736    
6737     Checking for a used subpattern by number     Checking for a used subpattern by number
6738    
6739         If the text between the parentheses consists of a sequence  of  digits,         If  the  text between the parentheses consists of a sequence of digits,
6740         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
6741         viously matched. If there is more than one  capturing  subpattern  with         viously  matched.  If  there is more than one capturing subpattern with
6742         the  same  number  (see  the earlier section about duplicate subpattern         the same number (see the earlier  section  about  duplicate  subpattern
6743         numbers), the condition is true if any of them have matched. An  alter-         numbers),  the condition is true if any of them have matched. An alter-
6744         native  notation is to precede the digits with a plus or minus sign. In         native notation is to precede the digits with a plus or minus sign.  In
6745         this case, the subpattern number is relative rather than absolute.  The         this  case, the subpattern number is relative rather than absolute. The
6746         most  recently opened parentheses can be referenced by (?(-1), the next         most recently opened parentheses can be referenced by (?(-1), the  next
6747         most recent by (?(-2), and so on. Inside loops it can also  make  sense         most  recent  by (?(-2), and so on. Inside loops it can also make sense
6748         to refer to subsequent groups. The next parentheses to be opened can be         to refer to subsequent groups. The next parentheses to be opened can be
6749         referenced as (?(+1), and so on. (The value zero in any of these  forms         referenced  as (?(+1), and so on. (The value zero in any of these forms
6750         is not used; it provokes a compile-time error.)         is not used; it provokes a compile-time error.)
6751    
6752         Consider  the  following  pattern, which contains non-significant white         Consider the following pattern, which  contains  non-significant  white
6753         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
6754         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
6755    
6756           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
6757    
6758         The  first  part  matches  an optional opening parenthesis, and if that         The first part matches an optional opening  parenthesis,  and  if  that
6759         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
6760         ond  part  matches one or more characters that are not parentheses. The         ond part matches one or more characters that are not  parentheses.  The
6761         third part is a conditional subpattern that tests whether  or  not  the         third  part  is  a conditional subpattern that tests whether or not the
6762         first set of parentheses matched.  If they did, that is, if the subject         first set of parentheses matched.  If they did, that is, if the subject
6763         started with an opening parenthesis, the condition is true, and so  the         started  with an opening parenthesis, the condition is true, and so the
6764         yes-pattern  is  executed and a closing parenthesis is required. Other-         yes-pattern is executed and a closing parenthesis is  required.  Other-
6765         wise, since no-pattern is not present, the subpattern matches  nothing.         wise,  since no-pattern is not present, the subpattern matches nothing.
6766         In  other  words,  this  pattern matches a sequence of non-parentheses,         In other words, this pattern matches  a  sequence  of  non-parentheses,
6767         optionally enclosed in parentheses.         optionally enclosed in parentheses.
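The behaviour described above can be sketched with Python's re module, which supports the same (?(1)...) conditional syntax. This is an illustration under that assumption, not a call into PCRE. Here re.VERBOSE plays the role of PCRE_EXTENDED, and the helper name whole_match is hypothetical.

```python
import re

# White space inside the pattern is ignored because of re.VERBOSE.
# Part 1: optional "(" captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: ")" is required only if group 1 took part in the match.
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

print(whole_match("(abc)"))  # enclosed in parentheses
print(whole_match("abc"))    # no parentheses at all
print(whole_match("(abc"))   # opening parenthesis without a closing one
print(whole_match("abc)"))   # closing parenthesis without an opening one
```

Matching the whole subject is used here so that an unbalanced parenthesis cannot simply be left outside the matched text.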
6768    
6769         If you were embedding this pattern in a larger one,  you  could  use  a         If  you  were  embedding  this pattern in a larger one, you could use a
6770         relative reference:         relative reference:
6771    
6772           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
6773    
6774         This  makes  the  fragment independent of the parentheses in the larger         This makes the fragment independent of the parentheses  in  the  larger
6775         pattern.         pattern.
6776    
6777     Checking for a used subpattern by name     Checking for a used subpattern by name
6778    
6779         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
6780         used  subpattern  by  name.  For compatibility with earlier versions of         used subpattern by name. For compatibility  with  earlier  versions  of
6781         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
6782         also  recognized. However, there is a possible ambiguity with this syn-         also recognized.
        tax, because subpattern names may  consist  entirely  of  digits.  PCRE  
        looks  first for a named subpattern; if it cannot find one and the name  
        consists entirely of digits, PCRE looks for a subpattern of  that  num-  
        ber,  which must be greater than zero. Using subpattern names that con-  
        sist entirely of digits is not recommended.  
6783    
6784         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
6785    
# Line 7032  CALLOUTS Line 7188  CALLOUTS
7188         tion is called. It is provided with the  number  of  the  callout,  the         tion is called. It is provided with the  number  of  the  callout,  the
7189         position  in  the pattern, and, optionally, one item of data originally         position  in  the pattern, and, optionally, one item of data originally
7190         supplied by the caller of the matching function. The  callout  function         supplied by the caller of the matching function. The  callout  function
7191         may  cause  matching to proceed, to backtrack, or to fail altogether. A         may cause matching to proceed, to backtrack, or to fail altogether.
7192         complete description of the interface to the callout function is  given  
7193         in the pcrecallout documentation.         By  default,  PCRE implements a number of optimizations at compile time
7194           and matching time, and one side-effect is that sometimes  callouts  are
7195           skipped.  If  you need all possible callouts to happen, you need to set
7196           options that disable the relevant optimizations. More  details,  and  a
7197           complete  description  of  the  interface  to the callout function, are
7198           given in the pcrecallout documentation.
7199    
7200    
7201  BACKTRACKING CONTROL  BACKTRACKING CONTROL
7202    
7203         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
7204         which are still described in the Perl  documentation  as  "experimental         which  are  still  described in the Perl documentation as "experimental
7205         and  subject to change or removal in a future version of Perl". It goes         and subject to change or removal in a future version of Perl". It  goes
7206         on to say: "Their usage in production code should  be  noted  to  avoid         on  to  say:  "Their  usage in production code should be noted to avoid
7207         problems  during upgrades." The same remarks apply to the PCRE features         problems during upgrades." The same remarks apply to the PCRE  features
7208         described in this section.         described in this section.
7209    
7210         The new verbs make use of what was previously invalid syntax: an  open-         The  new verbs make use of what was previously invalid syntax: an open-
7211         ing parenthesis followed by an asterisk. They are generally of the form         ing parenthesis followed by an asterisk. They are generally of the form
7212         (*VERB) or (*VERB:NAME). Some may take either form,  possibly  behaving         (*VERB)  or  (*VERB:NAME). Some may take either form, possibly behaving
7213         differently  depending  on  whether or not a name is present. A name is         differently depending on whether or not a name is present.  A  name  is
7214         any sequence of characters that does not include a closing parenthesis.         any sequence of characters that does not include a closing parenthesis.
7215         The maximum length of name is 255 in the 8-bit library and 65535 in the         The maximum length of name is 255 in the 8-bit library and 65535 in the
7216         16-bit and 32-bit libraries. If the name is  empty,  that  is,  if  the         16-bit  and  32-bit  libraries.  If  the name is empty, that is, if the
7217         closing  parenthesis immediately follows the colon, the effect is as if         closing parenthesis immediately follows the colon, the effect is as  if
7218         the colon were not there.  Any number of these verbs  may  occur  in  a         the  colon  were  not  there.  Any number of these verbs may occur in a
7219         pattern.         pattern.
7220    
7221         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
7222         them can be used only when the pattern is to be matched  using  one  of         them  can  be  used only when the pattern is to be matched using one of
7223         the  traditional  matching  functions, because these use a backtracking         the traditional matching functions, because these  use  a  backtracking
7224         algorithm. With the exception of (*FAIL), which behaves like a  failing         algorithm.  With the exception of (*FAIL), which behaves like a failing
7225         negative  assertion,  the  backtracking control verbs cause an error if         negative assertion, the backtracking control verbs cause  an  error  if
7226         encountered by a DFA matching function.         encountered by a DFA matching function.
7227    
7228         The behaviour of these verbs in repeated  groups,  assertions,  and  in         The  behaviour  of  these  verbs in repeated groups, assertions, and in
7229         subpatterns called as subroutines (whether or not recursively) is docu-         subpatterns called as subroutines (whether or not recursively) is docu-
7230         mented below.         mented below.
7231    
7232     Optimizations that affect backtracking verbs     Optimizations that affect backtracking verbs
7233    
7234         PCRE contains some optimizations that are used to speed up matching  by         PCRE  contains some optimizations that are used to speed up matching by
7235         running some checks at the start of each match attempt. For example, it         running some checks at the start of each match attempt. For example, it
7236         may know the minimum length of matching subject, or that  a  particular         may  know  the minimum length of matching subject, or that a particular
7237         character must be present. When one of these optimizations bypasses the         character must be present. When one of these optimizations bypasses the
7238         running of a match,  any  included  backtracking  verbs  will  not,  of         running  of  a  match,  any  included  backtracking  verbs will not, of
7239         course, be processed. You can suppress the start-of-match optimizations         course, be processed. You can suppress the start-of-match optimizations
7240         by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-         by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-
7241         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
7242         There is more discussion of this option in the section entitled "Option         There is more discussion of this option in the section entitled "Option
7243         bits for pcre_exec()" in the pcreapi documentation.         bits for pcre_exec()" in the pcreapi documentation.
7244    
7245         Experiments  with  Perl  suggest that it too has similar optimizations,         Experiments with Perl suggest that it too  has  similar  optimizations,
7246         sometimes leading to anomalous results.         sometimes leading to anomalous results.
7247    
7248     Verbs that act immediately     Verbs that act immediately
7249    
7250         The following verbs act as soon as they are encountered. They  may  not         The  following  verbs act as soon as they are encountered. They may not
7251         be followed by a name.         be followed by a name.
7252    
7253            (*ACCEPT)            (*ACCEPT)
7254    
7255         This  verb causes the match to end successfully, skipping the remainder         This verb causes the match to end successfully, skipping the  remainder
7256         of the pattern. However, when it is inside a subpattern that is  called         of  the pattern. However, when it is inside a subpattern that is called
7257         as  a  subroutine, only that subpattern is ended successfully. Matching         as a subroutine, only that subpattern is ended  successfully.  Matching
7258         then continues at the outer level. If (*ACCEPT) is triggered in a posi-         then continues at the outer level. If (*ACCEPT) is triggered in a posi-
7259         tive  assertion,  the  assertion succeeds; in a negative assertion, the         tive assertion, the assertion succeeds; in a  negative  assertion,  the
7260         assertion fails.         assertion fails.
7261    
7262         If (*ACCEPT) is inside capturing parentheses, the data so far  is  cap-         If  (*ACCEPT)  is inside capturing parentheses, the data so far is cap-
7263         tured. For example:         tured. For example:
7264    
7265           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
7266    
7267         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
7268         tured by the outer parentheses.         tured by the outer parentheses.
7269    
7270           (*FAIL) or (*F)           (*FAIL) or (*F)
7271    
7272         This verb causes a matching failure, forcing backtracking to occur.  It         This  verb causes a matching failure, forcing backtracking to occur. It
7273         is  equivalent to (?!) but easier to read. The Perl documentation notes         is equivalent to (?!) but easier to read. The Perl documentation  notes
7274         that it is probably useful only when combined  with  (?{})  or  (??{}).         that  it  is  probably  useful only when combined with (?{}) or (??{}).
7275         Those  are,  of course, Perl features that are not present in PCRE. The         Those are, of course, Perl features that are not present in  PCRE.  The
7276         nearest equivalent is the callout feature, as for example in this  pat-         nearest  equivalent is the callout feature, as for example in this pat-
7277         tern:         tern:
7278    
7279           a+(?C)(*FAIL)           a+(?C)(*FAIL)
7280    
7281         A  match  with the string "aaaa" always fails, but the callout is taken         A match with the string "aaaa" always fails, but the callout  is  taken
7282         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
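The figure of 10 follows from simple counting, assuming that no start-of-match optimization skips an attempt. At each starting position, a+ first takes the whole run of "a" characters and then gives them back one at a time, and the callout is reached once for each length tried. The sketch below is a hand-written model of that counting in Python, not a call into PCRE. The function name count_callouts is hypothetical.

```python
def count_callouts(subject, ch="a"):
    """Model of a+(?C)(*FAIL): count how often the callout would be reached."""
    total = 0
    for start in range(len(subject)):
        # Length of the run of ch beginning at this starting position.
        run = 0
        while start + run < len(subject) and subject[start + run] == ch:
            run += 1
        # a+ tries lengths run, run-1, ..., 1; each one reaches the callout
        # before (*FAIL) forces the next backtrack.
        total += run
    return total

print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1
```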
7283    
7284     Recording which path was taken     Recording which path was taken
7285    
7286         There is one verb whose main purpose  is  to  track  how  a  match  was         There  is  one  verb  whose  main  purpose  is to track how a match was
7287         arrived  at,  though  it  also  has a secondary use in conjunction with         arrived at, though it also has a  secondary  use  in  conjunction  with
7288         advancing the match starting point (see (*SKIP) below).         advancing the match starting point (see (*SKIP) below).
7289    
7290           (*MARK:NAME) or (*:NAME)           (*MARK:NAME) or (*:NAME)
7291    
7292         A name is always  required  with  this  verb.  There  may  be  as  many         A  name  is  always  required  with  this  verb.  There  may be as many
7293         instances  of  (*MARK) as you like in a pattern, and their names do not         instances of (*MARK) as you like in a pattern, and their names  do  not
7294         have to be unique.         have to be unique.
7295    
7296         When a match succeeds, the name of the  last-encountered  (*MARK:NAME),         When  a  match succeeds, the name of the last-encountered (*MARK:NAME),
7297         (*PRUNE:NAME),  or  (*THEN:NAME) on the matching path is passed back to         (*PRUNE:NAME), or (*THEN:NAME) on the matching path is passed  back  to
7298         the caller as  described  in  the  section  entitled  "Extra  data  for         the  caller  as  described  in  the  section  entitled  "Extra data for
7299         pcre_exec()"  in  the  pcreapi  documentation.  Here  is  an example of         pcre_exec()" in the  pcreapi  documentation.  Here  is  an  example  of
7300         pcretest output, where the /K modifier requests the retrieval and  out-         pcretest  output, where the /K modifier requests the retrieval and out-
7301         putting of (*MARK) data:         putting of (*MARK) data:
7302    
7303             re> /X(*MARK:A)Y|X(*MARK:B)Z/K             re> /X(*MARK:A)Y|X(*MARK:B)Z/K
# Line 7148  BACKTRACKING CONTROL Line 7309  BACKTRACKING CONTROL
7309           MK: B           MK: B
7310    
7311         The (*MARK) name is tagged with "MK:" in this output, and in this exam-         The (*MARK) name is tagged with "MK:" in this output, and in this exam-
7312         ple it indicates which of the two alternatives matched. This is a  more         ple  it indicates which of the two alternatives matched. This is a more
7313         efficient  way of obtaining this information than putting each alterna-         efficient way of obtaining this information than putting each  alterna-
7314         tive in its own capturing parentheses.         tive in its own capturing parentheses.
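For comparison, the capturing-parentheses approach mentioned above can be sketched in Python's re module. That module has no (*MARK) verb, so this only illustrates the alternative that (*MARK) is said to improve upon. The group names A and B are chosen to mirror the mark names in the pcretest example, and the helper name which_alternative is hypothetical.

```python
import re

# Each alternative sits in its own named capturing group.
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")

def which_alternative(subject):
    """Return the name of the group that matched, or None if nothing matched."""
    m = pattern.search(subject)
    return m.lastgroup if m else None

print(which_alternative("XY"))
print(which_alternative("XZ"))
print(which_alternative("XP"))
```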
7315    
7316         If a verb with a name is encountered in a positive  assertion  that  is         If  a  verb  with a name is encountered in a positive assertion that is
7317         true,  the  name  is recorded and passed back if it is the last-encoun-         true, the name is recorded and passed back if it  is  the  last-encoun-
7318         tered. This does not happen for negative assertions or failing positive         tered. This does not happen for negative assertions or failing positive
7319         assertions.         assertions.
7320    
7321         After  a  partial match or a failed match, the last encountered name in         After a partial match or a failed match, the last encountered  name  in
7322         the entire match process is returned. For example:         the entire match process is returned. For example:
7323    
7324             re> /X(*MARK:A)Y|X(*MARK:B)Z/K             re> /X(*MARK:A)Y|X(*MARK:B)Z/K
7325           data> XP           data> XP
7326           No match, mark = B           No match, mark = B
7327    
7328         Note that in this unanchored example the  mark  is  retained  from  the         Note  that  in  this  unanchored  example the mark is retained from the
7329         match attempt that started at the letter "X" in the subject. Subsequent         match attempt that started at the letter "X" in the subject. Subsequent
7330         match attempts starting at "P" and then with an empty string do not get         match attempts starting at "P" and then with an empty string do not get
7331         as far as the (*MARK) item, but nevertheless do not reset it.         as far as the (*MARK) item, but nevertheless do not reset it.
7332    
7333         If  you  are  interested  in  (*MARK)  values after failed matches, you         If you are interested in  (*MARK)  values  after  failed  matches,  you
7334         should probably set the PCRE_NO_START_OPTIMIZE option  (see  above)  to         should  probably  set  the PCRE_NO_START_OPTIMIZE option (see above) to
7335         ensure that the match is always attempted.         ensure that the match is always attempted.
7336    
7337     Verbs that act after backtracking     Verbs that act after backtracking
7338    
7339         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
7340         tinues with what follows, but if there is no subsequent match,  causing         tinues  with what follows, but if there is no subsequent match, causing
7341         a  backtrack  to  the  verb, a failure is forced. That is, backtracking         a backtrack to the verb, a failure is  forced.  That  is,  backtracking
7342         cannot pass to the left of the verb. However, when one of  these  verbs         cannot  pass  to the left of the verb. However, when one of these verbs
7343         appears inside an atomic group or an assertion that is true, its effect         appears inside an atomic group or an assertion that is true, its effect
7344         is confined to that group, because once the  group  has  been  matched,         is  confined  to  that  group, because once the group has been matched,
7345         there  is never any backtracking into it. In this situation, backtrack-         there is never any backtracking into it. In this situation,  backtrack-
7346         ing can "jump back" to the left of the entire atomic  group  or  asser-         ing  can  "jump  back" to the left of the entire atomic group or asser-
7347         tion.  (Remember  also,  as  stated  above, that this localization also         tion. (Remember also, as stated  above,  that  this  localization  also
7348         applies in subroutine calls.)         applies in subroutine calls.)
7349    
7350         These verbs differ in exactly what kind of failure  occurs  when  back-         These  verbs  differ  in exactly what kind of failure occurs when back-
7351         tracking  reaches  them.  The behaviour described below is what happens         tracking reaches them. The behaviour described below  is  what  happens
7352         when the verb is not in a subroutine or an assertion.  Subsequent  sec-         when  the  verb is not in a subroutine or an assertion. Subsequent sec-
7353         tions cover these special cases.         tions cover these special cases.
7354    
7355           (*COMMIT)           (*COMMIT)
7356    
7357         This  verb, which may not be followed by a name, causes the whole match         This verb, which may not be followed by a name, causes the whole  match
7358         to fail outright if there is a later matching failure that causes back-         to fail outright if there is a later matching failure that causes back-
7359         tracking  to  reach  it.  Even if the pattern is unanchored, no further         tracking to reach it. Even if the pattern  is  unanchored,  no  further
7360         attempts to find a match by advancing the starting point take place. If         attempts to find a match by advancing the starting point take place. If
7361         (*COMMIT)  is  the  only backtracking verb that is encountered, once it         (*COMMIT) is the only backtracking verb that is  encountered,  once  it
7362         has been passed pcre_exec() is committed to finding a match at the cur-         has been passed pcre_exec() is committed to finding a match at the cur-
7363         rent starting point, or not at all. For example:         rent starting point, or not at all. For example:
7364    
7365           a+(*COMMIT)b           a+(*COMMIT)b
7366    
7367         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
7368         of dynamic anchor, or "I've started, so I must finish." The name of the