Diff of /code/trunk/doc/pcre.txt

Old: revision 738 by ph10, Fri Oct 21 09:04:01 2011 UTC
New: revision 784 by ph10, Mon Dec 5 12:33:44 2011 UTC
# Line 120  REVISION Line 120  REVISION
120         Last updated: 24 August 2011         Last updated: 24 August 2011
121         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
122  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
123    
124    
125  PCREBUILD(3)                                                      PCREBUILD(3)  PCREBUILD(3)                                                      PCREBUILD(3)
126    
127    
# Line 484  REVISION Line 484  REVISION
484         Last updated: 06 September 2011         Last updated: 06 September 2011
485         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
486  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
487    
488    
489  PCREMATCHING(3)                                                PCREMATCHING(3)  PCREMATCHING(3)                                                PCREMATCHING(3)
490    
491    
# Line 633  THE ALTERNATIVE MATCHING ALGORITHM Line 633  THE ALTERNATIVE MATCHING ALGORITHM
633         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
634    
635         7.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The \C escape sequence, which (in the standard algorithm) matches a
636         single byte, even in UTF-8 mode, is not supported because the  alterna-         single byte, even in UTF-8  mode,  is  not  supported  in  UTF-8  mode,
637         tive  algorithm  moves  through  the  subject string one character at a         because  the alternative algorithm moves through the subject string one
638         time, for all active paths through the tree.         character at a time, for all active paths through the tree.
639    
640         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
641         are  not  supported.  (*FAIL)  is supported, and behaves like a failing         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
# Line 685  AUTHOR Line 685  AUTHOR
685    
686  REVISION  REVISION
687    
688         Last updated: 17 November 2010         Last updated: 19 November 2011
689         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
690  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
691    
692    
693  PCREAPI(3)                                                          PCREAPI(3)  PCREAPI(3)                                                          PCREAPI(3)
694    
695    
# Line 1256  COMPILING A PATTERN Line 1256  COMPILING A PATTERN
1256         set  (assuming  it can find an "a" in the subject), whereas it fails by         set  (assuming  it can find an "a" in the subject), whereas it fails by
1257         default, for Perl compatibility.         default, for Perl compatibility.
1258    
1259           (3) \U matches an upper case "U" character; by default \U causes a com-
1260           pile time error (Perl uses \U to upper case subsequent characters).
1261    
1262           (4) \u matches a lower case "u" character unless it is followed by four
1263           hexadecimal digits, in which case the hexadecimal  number  defines  the
1264           code  point  to match. By default, \u causes a compile time error (Perl
1265           uses it to upper case the following character).
1266    
1267           (5) \x matches a lower case "x" character unless it is followed by  two
1268           hexadecimal  digits,  in  which case the hexadecimal number defines the
1269           code point to match. By default, as in Perl, a  hexadecimal  number  is
1270           always expected after \x, but it may have zero, one, or two digits (so,
1271           for example, \xz matches a binary zero character followed by z).
1272    
1273           PCRE_MULTILINE           PCRE_MULTILINE
1274    
1275         By default, PCRE treats the subject string as consisting  of  a  single         By default, PCRE treats the subject string as consisting  of  a  single
# Line 1710  INFORMATION ABOUT A PATTERN Line 1724  INFORMATION ABOUT A PATTERN
1724         compiler could not handle this particular pattern. See the pcrejit doc-         compiler could not handle this particular pattern. See the pcrejit doc-
1725         umentation for details of what can and cannot be handled.         umentation for details of what can and cannot be handled.
1726    
1727             PCRE_INFO_JITSIZE
1728    
1729           If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE
1730           option, return the size of the  JIT  compiled  code,  otherwise  return
1731           zero. The fourth argument should point to a size_t variable.
1732    
1733           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1734    
1735         Return  the  value of the rightmost literal byte that must exist in any         Return  the  value of the rightmost literal byte that must exist in any
# Line 1818  INFORMATION ABOUT A PATTERN Line 1838  INFORMATION ABOUT A PATTERN
1838    
1839           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1840    
1841         Return  the  size  of the compiled pattern, that is, the value that was         Return  the  size  of  the compiled pattern. The fourth argument should
1842         passed as the argument to pcre_malloc() when PCRE was getting memory in         point to a size_t variable. This value does not include the size of the
1843         which to place the compiled data. The fourth argument should point to a         pcre  structure  that  is returned by pcre_compile(). The value that is
1844         size_t variable.         passed as the argument to pcre_malloc() when pcre_compile() is  getting
1845           memory  in  which  to  place the compiled data is the value returned by
1846           this option plus the size of the pcre structure.  Studying  a  compiled
1847           pattern, with or without JIT, does not alter the value returned by this
1848           option.
1849    
1850           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1851    
# Line 2980  AUTHOR Line 3004  AUTHOR
3004    
3005  REVISION  REVISION
3006    
3007         Last updated: 23 September 2011         Last updated: 02 December 2011
3008         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3009  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3010    
3011    
3012  PCRECALLOUT(3)                                                  PCRECALLOUT(3)  PCRECALLOUT(3)                                                  PCRECALLOUT(3)
3013    
3014    
# Line 3143  THE CALLOUT INTERFACE Line 3167  THE CALLOUT INTERFACE
3167    
3168         The mark field is present from version 2 of the pcre_callout structure.         The mark field is present from version 2 of the pcre_callout structure.
3169         In  callouts  from pcre_exec() it contains a pointer to the zero-termi-         In  callouts  from pcre_exec() it contains a pointer to the zero-termi-
3170         nated name of the most recently passed (*MARK) item in  the  match,  or         nated name of the most recently passed (*MARK),  (*PRUNE),  or  (*THEN)
3171         NULL if there are no (*MARK)s in the current matching path. In callouts         item in the match, or NULL if no such items have been passed. Instances
3172         from pcre_dfa_exec() this field always contains NULL.         of (*PRUNE) or (*THEN) without a name  do  not  obliterate  a  previous
3173           (*MARK).  In  callouts  from pcre_dfa_exec() this field always contains
3174           NULL.
3175    
3176    
3177  RETURN VALUES  RETURN VALUES
# Line 3173  AUTHOR Line 3199  AUTHOR
3199    
3200  REVISION  REVISION
3201    
3202         Last updated: 26 August 2011         Last updated: 30 November 2011
3203         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3204  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3205    
3206    
3207  PCRECOMPAT(3)                                                    PCRECOMPAT(3)  PCRECOMPAT(3)                                                    PCRECOMPAT(3)
3208    
3209    
# Line 3218  DIFFERENCES BETWEEN PCRE AND PERL Line 3244  DIFFERENCES BETWEEN PCRE AND PERL
3244         its own, matching a non-newline character, is supported.) In fact these         its own, matching a non-newline character, is supported.) In fact these
3245         are implemented by Perl's general string-handling and are not  part  of         are implemented by Perl's general string-handling and are not  part  of
3246         its  pattern  matching engine. If any of these are encountered by PCRE,         its  pattern  matching engine. If any of these are encountered by PCRE,
3247         an error is generated.         an error is generated by default. However, if the  PCRE_JAVASCRIPT_COM-
3248           PAT  option  is set, \U and \u are interpreted as JavaScript interprets
3249           them.
3250    
3251         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE         6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
3252         is  built  with Unicode character property support. The properties that         is  built  with Unicode character property support. The properties that
# Line 3345  AUTHOR Line 3373  AUTHOR
3373    
3374  REVISION  REVISION
3375    
3376         Last updated: 09 October 2011         Last updated: 14 November 2011
3377         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
3378  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
3379    
3380    
3381  PCREPATTERN(3)                                                  PCREPATTERN(3)  PCREPATTERN(3)                                                  PCREPATTERN(3)
3382    
3383    
# Line 3572  BACKSLASH Line 3600  BACKSLASH
3600           \t        tab (hex 09)           \t        tab (hex 09)
3601           \ddd      character with octal code ddd, or back reference           \ddd      character with octal code ddd, or back reference
3602           \xhh      character with hex code hh           \xhh      character with hex code hh
3603           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
3604             \uhhhh    character with hex code hhhh (JavaScript mode only)
3605    
3606         The precise effect of \cx is as follows: if x is a lower  case  letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
3607         it  is converted to upper case. Then bit 6 of the character (hex 40) is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
# Line 3583  BACKSLASH Line 3612  BACKSLASH
3612         is compiled in EBCDIC mode, all byte values are  valid.  A  lower  case         is compiled in EBCDIC mode, all byte values are  valid.  A  lower  case
3613         letter is converted to upper case, and then the 0xc0 bits are flipped.)         letter is converted to upper case, and then the 0xc0 bits are flipped.)
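
For the non-EBCDIC case, the \cx rule can be sketched as follows. This is a hypothetical Python helper (control_char is not a PCRE function), and it assumes that the sentence cut off by the diff context ends with bit 6 being inverted, which is what an exclusive-or with hex 40 does. EBCDIC handling is left out.

```python
def control_char(x):
    # Model of the \cx escape for ASCII input (assumption: not EBCDIC mode).
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 (hex 40) of the character code is inverted.
    return ord(x) ^ 0x40
```

Under this model control_char("z") is hex 1A, while control_char(";") is hex 7B.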
3614    
3615         After  \x, from zero to two hexadecimal digits are read (letters can be         By  default,  after  \x,  from  zero to two hexadecimal digits are read
3616         in upper or lower case). Any number of hexadecimal  digits  may  appear         (letters can be in upper or lower case). Any number of hexadecimal dig-
3617         between  \x{  and  },  but the value of the character code must be less         its  may  appear between \x{ and }, but the value of the character code
3618         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,         must be less than 256 in non-UTF-8 mode, and less than 2**31  in  UTF-8
3619         the  maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger         mode.  That is, the maximum value in hexadecimal is 7FFFFFFF. Note that
3620         than the largest Unicode code point, which is 10FFFF.         this is bigger than the largest Unicode code point, which is 10FFFF.
3621    
3622         If characters other than hexadecimal digits appear between \x{  and  },         If characters other than hexadecimal digits appear between \x{  and  },
3623         or if there is no terminating }, this form of escape is not recognized.         or if there is no terminating }, this form of escape is not recognized.
# Line 3596  BACKSLASH Line 3625  BACKSLASH
3625         escape,  with  no  following  digits, giving a character whose value is         escape,  with  no  following  digits, giving a character whose value is
3626         zero.         zero.
3627    
3628           If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation  of  \x
3629           is  as  just described only when it is followed by two hexadecimal dig-
3630           its.  Otherwise, it matches a  literal  "x"  character.  In  JavaScript
3631           mode, support for code points greater than 256 is provided by \u, which
3632           must be followed by four hexadecimal digits;  otherwise  it  matches  a
3633           literal "u" character.
3634    
3635         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
3636         two  syntaxes  for  \x. There is no difference in the way they are han-         two syntaxes for \x (or by \u in JavaScript mode). There is no  differ-
3637         dled. For example, \xdc is exactly the same as \x{dc}.         ence in the way they are handled. For example, \xdc is exactly the same
3638           as \x{dc} (or \u00dc in JavaScript mode).
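
The two readings of \x and \u described above can be compared with a small model. This is only a sketch in Python, not PCRE source: read_hex_escape and its javascript flag are hypothetical names, pos is assumed to point at the "x" or "u" just after the backslash, and the range checks (less than 256, or less than 2**31 in UTF-8 mode) are omitted.

```python
HEX = set("0123456789abcdefABCDEF")

def read_hex_escape(pattern, pos, javascript=False):
    """Return (character code, index of the first unread character)."""
    kind = pattern[pos]              # "x" or "u"
    rest = pos + 1
    if javascript:
        # JavaScript mode: \x needs exactly two hex digits, \u exactly four;
        # otherwise the letter is matched literally.
        width = 2 if kind == "x" else 4
        digits = pattern[rest:rest + width]
        if len(digits) == width and all(c in HEX for c in digits):
            return int(digits, 16), rest + width
        return ord(kind), rest
    if kind == "u":
        raise ValueError("\\u is a compile-time error by default")
    # Default mode, braced form \x{hhh..}: hex digits only, closing brace needed.
    if pattern.startswith("{", rest):
        close = pattern.find("}", rest)
        body = pattern[rest + 1:close] if close != -1 else ""
        # Assumption: empty braces are treated as "not recognized" here;
        # the text does not cover that case.
        if body and all(c in HEX for c in body):
            return int(body, 16), close + 1
    # Default mode, basic form: zero to two hex digits, no digits gives zero.
    end = rest
    while end < len(pattern) and end < rest + 2 and pattern[end] in HEX:
        end += 1
    value = int(pattern[rest:end], 16) if end > rest else 0
    return value, end
```

For example, read_hex_escape("xz", 0) yields code 0 with "z" still unread, whereas the same call with javascript=True yields the code of a literal "x".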
3639    
3640         After \0 up to two further octal digits are read. If  there  are  fewer         After \0 up to two further octal digits are read. If  there  are  fewer
3641         than  two  digits,  just  those  that  are  present  are used. Thus the         than  two  digits,  just  those  that  are  present  are used. Thus the
# Line 3642  BACKSLASH Line 3679  BACKSLASH
3679    
3680         All the sequences that define a single character value can be used both         All the sequences that define a single character value can be used both
3681         inside and outside character classes. In addition, inside  a  character         inside and outside character classes. In addition, inside  a  character
3682         class,  the  sequence \b is interpreted as the backspace character (hex         class, \b is interpreted as the backspace character (hex 08).
3683         08). The sequences \B, \N, \R, and \X are not special inside a  charac-  
3684         ter  class.  Like  any  other  unrecognized  escape sequences, they are         \N  is not allowed in a character class. \B, \R, and \X are not special
3685         treated as the literal characters "B", "N", "R", and  "X"  by  default,         inside a character class. Like  other  unrecognized  escape  sequences,
3686         but cause an error if the PCRE_EXTRA option is set. Outside a character         they  are  treated  as  the  literal  characters  "B",  "R", and "X" by
3687         class, these sequences have different meanings.         default, but cause an error if the PCRE_EXTRA option is set. Outside  a
3688           character class, these sequences have different meanings.
3689    
3690       Unsupported escape sequences
3691    
3692           In  Perl, the sequences \l, \L, \u, and \U are recognized by its string
3693           handler and used  to  modify  the  case  of  following  characters.  By
3694           default,  PCRE does not support these escape sequences. However, if the
3695           PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U"  character,  and
3696           \u can be used to define a character by code point, as described in the
3697           previous section.
3698    
3699     Absolute and relative back references     Absolute and relative back references
3700    
# Line 3682  BACKSLASH Line 3729  BACKSLASH
3729    
3730         There is also the single sequence \N, which matches a non-newline char-         There is also the single sequence \N, which matches a non-newline char-
3731         acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is         acter.   This  is the same as the "." metacharacter when PCRE_DOTALL is
3732         not set.         not set. Perl also uses \N to match characters by name; PCRE  does  not
3733           support this.
3734    
3735         Each pair of lower and upper case escape sequences partitions the  com-         Each  pair of lower and upper case escape sequences partitions the com-
3736         plete  set  of  characters  into two disjoint sets. Any given character         plete set of characters into two disjoint  sets.  Any  given  character
3737         matches one, and only one, of each pair. The sequences can appear  both         matches  one, and only one, of each pair. The sequences can appear both
3738         inside  and outside character classes. They each match one character of         inside and outside character classes. They each match one character  of
3739         the appropriate type. If the current matching point is at  the  end  of         the  appropriate  type.  If the current matching point is at the end of
3740         the  subject string, all of them fail, because there is no character to         the subject string, all of them fail, because there is no character  to
3741         match.         match.
3742    
3743         For compatibility with Perl, \s does not match the VT  character  (code         For  compatibility  with Perl, \s does not match the VT character (code
3744         11).   This makes it different from the POSIX "space" class. The \s         11).  This makes it different from the POSIX "space" class. The  \s
3745         characters are HT (9), LF (10), FF (12), CR (13), and  space  (32).  If         characters  are  HT  (9), LF (10), FF (12), CR (13), and space (32). If
3746         "use locale;" is included in a Perl script, \s may match the VT charac-         "use locale;" is included in a Perl script, \s may match the VT charac-
3747         ter. In PCRE, it never does.         ter. In PCRE, it never does.
3748    
3749         A "word" character is an underscore or any character that is  a  letter         A  "word"  character is an underscore or any character that is a letter
3750         or  digit.   By  default,  the definition of letters and digits is con-         or digit.  By default, the definition of letters  and  digits  is  con-
3751         trolled by PCRE's low-valued character tables, and may vary if  locale-         trolled  by PCRE's low-valued character tables, and may vary if locale-
3752         specific  matching is taking place (see "Locale support" in the pcreapi         specific matching is taking place (see "Locale support" in the  pcreapi
3753         page). For example, in a French locale such  as  "fr_FR"  in  Unix-like         page).  For  example,  in  a French locale such as "fr_FR" in Unix-like
3754         systems,  or "french" in Windows, some character codes greater than 128         systems, or "french" in Windows, some character codes greater than  128
3755         are used for accented letters, and these are then matched  by  \w.  The         are  used  for  accented letters, and these are then matched by \w. The
3756         use of locales with Unicode is discouraged.         use of locales with Unicode is discouraged.
3757    
3758         By  default,  in  UTF-8  mode,  characters with values greater than 128         By default, in UTF-8 mode, characters  with  values  greater  than  128
3759         never match \d, \s, or \w, and always  match  \D,  \S,  and  \W.  These         never  match  \d,  \s,  or  \w,  and always match \D, \S, and \W. These
3760         sequences  retain their original meanings from before UTF-8 support was         sequences retain their original meanings from before UTF-8 support  was
3761         available, mainly for efficiency reasons. However, if PCRE is  compiled         available,  mainly for efficiency reasons. However, if PCRE is compiled
3762         with  Unicode property support, and the PCRE_UCP option is set, the be-         with Unicode property support, and the PCRE_UCP option is set, the  be-
3763         haviour is changed so that Unicode properties  are  used  to  determine         haviour  is  changed  so  that Unicode properties are used to determine
3764         character types, as follows:         character types, as follows:
3765    
3766           \d  any character that \p{Nd} matches (decimal digit)           \d  any character that \p{Nd} matches (decimal digit)
3767           \s  any character that \p{Z} matches, plus HT, LF, FF, CR           \s  any character that \p{Z} matches, plus HT, LF, FF, CR
3768           \w  any character that \p{L} or \p{N} matches, plus underscore           \w  any character that \p{L} or \p{N} matches, plus underscore
3769    
3770         The  upper case escapes match the inverse sets of characters. Note that         The upper case escapes match the inverse sets of characters. Note  that
3771         \d matches only decimal digits, whereas \w matches any  Unicode  digit,         \d  matches  only decimal digits, whereas \w matches any Unicode digit,
3772         as  well as any Unicode letter, and underscore. Note also that PCRE_UCP         as well as any Unicode letter, and underscore. Note also that  PCRE_UCP
3773         affects \b, and \B because they are defined in  terms  of  \w  and  \W.         affects  \b,  and  \B  because  they are defined in terms of \w and \W.
3774         Matching these sequences is noticeably slower when PCRE_UCP is set.         Matching these sequences is noticeably slower when PCRE_UCP is set.
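
The PCRE_UCP definitions of \d, \s, and \w listed above can be approximated with Unicode general categories. The sketch below uses Python's standard unicodedata module as a stand-in for PCRE's own property tables, so results may differ for characters added in other Unicode versions; the function names are hypothetical.

```python
import unicodedata

def ucp_digit(ch):
    # \d: any character that \p{Nd} matches (decimal digit).
    return unicodedata.category(ch) == "Nd"

def ucp_space(ch):
    # \s: any character that \p{Z} matches, plus HT, LF, FF, CR (VT is excluded).
    return unicodedata.category(ch).startswith("Z") or ch in "\t\n\f\r"

def ucp_word(ch):
    # \w: any character that \p{L} or \p{N} matches, plus underscore.
    return unicodedata.category(ch)[0] in "LN" or ch == "_"
```

A character such as U+2167 (ROMAN NUMERAL EIGHT, category Nl) shows the difference noted above: it counts as a word character but not as a decimal digit.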
3775    
3776         The  sequences  \h, \H, \v, and \V are features that were added to Perl         The sequences \h, \H, \v, and \V are features that were added  to  Perl
3777         at release 5.10. In contrast to the other sequences, which  match  only         at  release  5.10. In contrast to the other sequences, which match only
3778         ASCII  characters  by  default,  these always match certain high-valued         ASCII characters by default, these  always  match  certain  high-valued
3779         codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The  horizon-         codepoints  in UTF-8 mode, whether or not PCRE_UCP is set. The horizon-
3780         tal space characters are:         tal space characters are:
3781    
3782           U+0009     Horizontal tab           U+0009     Horizontal tab
# Line 3763  BACKSLASH Line 3811  BACKSLASH
3811    
3812     Newline sequences     Newline sequences
3813    
3814         Outside  a  character class, by default, the escape sequence \R matches         Outside a character class, by default, the escape sequence  \R  matches
3815         any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the         any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the
3816         following:         following:
3817    
3818           (?>\r\n|\n|\x0b|\f|\r|\x85)           (?>\r\n|\n|\x0b|\f|\r|\x85)
3819    
3820         This  is  an  example  of an "atomic group", details of which are given         This is an example of an "atomic group", details  of  which  are  given
3821         below.  This particular group matches either the two-character sequence         below.  This particular group matches either the two-character sequence
3822         CR  followed  by  LF,  or  one  of  the single characters LF (linefeed,         CR followed by LF, or  one  of  the  single  characters  LF  (linefeed,
3823         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage         U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3824         return, U+000D), or NEL (next line, U+0085). The two-character sequence         return, U+000D), or NEL (next line, U+0085). The two-character sequence
3825         is treated as a single unit that cannot be split.         is treated as a single unit that cannot be split.
3826    
3827         In UTF-8 mode, two additional characters whose codepoints  are  greater         In  UTF-8  mode, two additional characters whose codepoints are greater
3828         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-         than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3829         rator, U+2029).  Unicode character property support is not  needed  for         rator,  U+2029).   Unicode character property support is not needed for
3830         these characters to be recognized.         these characters to be recognized.
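
As a rough illustration of the default \R behaviour, the following Python sketch reports how many characters \R would consume at a given position. It is a model of the description above, not PCRE code; newline_sequence_length and its utf8 flag are hypothetical, and the PCRE_BSR_ANYCRLF restriction is not modelled.

```python
def newline_sequence_length(subject, pos, utf8=False):
    # CR followed by LF is tried first and is taken as one unit, mirroring
    # the atomic group: it is never split into a lone CR.
    if subject.startswith("\r\n", pos):
        return 2
    # Single-character newlines: LF, VT, FF, CR, NEL.
    singles = {"\n", "\x0b", "\x0c", "\r", "\x85"}
    if utf8:
        # UTF-8 mode adds LS (U+2028) and PS (U+2029).
        singles |= {"\u2028", "\u2029"}
    if pos < len(subject) and subject[pos] in singles:
        return 1
    return 0  # no newline sequence at this position
```

Returning a single length for CRLF, with no fallback to a one-character match, is the design choice that stands in for the atomic group.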
3831    
3832         It is possible to restrict \R to match only CR, LF, or CRLF (instead of         It is possible to restrict \R to match only CR, LF, or CRLF (instead of
3833         the complete set  of  Unicode  line  endings)  by  setting  the  option         the  complete  set  of  Unicode  line  endings)  by  setting the option
3834         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.         PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
3835         (BSR is an abbreviation for "backslash R".) This can be made the default         (BSR is an abbreviation for "backslash R".) This can be made the default
3836         when  PCRE  is  built;  if this is the case, the other behaviour can be         when PCRE is built; if this is the case, the  other  behaviour  can  be
3837         requested via the PCRE_BSR_UNICODE option.   It  is  also  possible  to         requested  via  the  PCRE_BSR_UNICODE  option.   It is also possible to
3838         specify  these  settings  by  starting a pattern string with one of the         specify these settings by starting a pattern string  with  one  of  the
3839         following sequences:         following sequences:
3840    
3841           (*BSR_ANYCRLF)   CR, LF, or CRLF only           (*BSR_ANYCRLF)   CR, LF, or CRLF only
3842           (*BSR_UNICODE)   any Unicode newline sequence           (*BSR_UNICODE)   any Unicode newline sequence
3843    
3844         These override the default and the options given to  pcre_compile()  or         These  override  the default and the options given to pcre_compile() or
3845         pcre_compile2(),  but  they  can  be  overridden  by  options  given to         pcre_compile2(), but  they  can  be  overridden  by  options  given  to
3846         pcre_exec() or pcre_dfa_exec(). Note that these special settings, which         pcre_exec() or pcre_dfa_exec(). Note that these special settings, which
3847         are  not  Perl-compatible,  are  recognized only at the very start of a         are not Perl-compatible, are recognized only at the  very  start  of  a
3848         pattern, and that they must be in upper case. If more than one of  them         pattern,  and that they must be in upper case. If more than one of them
3849         is present, the last one is used. They can be combined with a change of         is present, the last one is used. They can be combined with a change of
3850         newline convention; for example, a pattern can start with:         newline convention; for example, a pattern can start with:
3851    
3852           (*ANY)(*BSR_ANYCRLF)           (*ANY)(*BSR_ANYCRLF)
3853    
3854         They can also be combined with the (*UTF8) or (*UCP) special sequences.         They can also be combined with the (*UTF8) or (*UCP) special sequences.
3855         Inside  a  character  class,  \R  is  treated as an unrecognized escape         Inside a character class, \R  is  treated  as  an  unrecognized  escape
3856         sequence, and so matches the letter "R" by default, but causes an error         sequence, and so matches the letter "R" by default, but causes an error
3857         if PCRE_EXTRA is set.         if PCRE_EXTRA is set.
3858    
3859     Unicode character properties     Unicode character properties
3860    
3861         When PCRE is built with Unicode character property support, three addi-         When PCRE is built with Unicode character property support, three addi-
3862         tional escape sequences that match characters with specific  properties         tional  escape sequences that match characters with specific properties
3863         are  available.   When not in UTF-8 mode, these sequences are of course         are available.  When not in UTF-8 mode, these sequences are  of  course
3864         limited to testing characters whose codepoints are less than  256,  but         limited  to  testing characters whose codepoints are less than 256, but
3865         they do work in this mode.  The extra escape sequences are:         they do work in this mode.  The extra escape sequences are:
3866    
3867           \p{xx}   a character with the xx property           \p{xx}   a character with the xx property
3868           \P{xx}   a character without the xx property           \P{xx}   a character without the xx property
3869           \X       an extended Unicode sequence           \X       an extended Unicode sequence
3870    
3871         The  property  names represented by xx above are limited to the Unicode         The property names represented by xx above are limited to  the  Unicode
3872         script names, the general category properties, "Any", which matches any         script names, the general category properties, "Any", which matches any
3873         character   (including  newline),  and  some  special  PCRE  properties         character  (including  newline),  and  some  special  PCRE   properties
3874         (described in the next section).  Other Perl properties such as  "InMu-         (described  in the next section).  Other Perl properties such as "InMu-
3875         sicalSymbols"  are  not  currently supported by PCRE. Note that \P{Any}         sicalSymbols" are not currently supported by PCRE.  Note  that  \P{Any}
3876         does not match any characters, so always causes a match failure.         does not match any characters, so always causes a match failure.
3877    
3878         Sets of Unicode characters are defined as belonging to certain scripts.         Sets of Unicode characters are defined as belonging to certain scripts.
3879         A  character from one of these sets can be matched using a script name.         A character from one of these sets can be matched using a script  name.
3880         For example:         For example:
3881    
3882           \p{Greek}           \p{Greek}
3883           \P{Han}           \P{Han}
3884    
3885         Those that are not part of an identified script are lumped together  as         Those  that are not part of an identified script are lumped together as
3886         "Common". The current list of scripts is:         "Common". The current list of scripts is:
3887    
3888         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,         Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille,
3889         Buginese, Buhid, Canadian_Aboriginal, Carian, Cham,  Cherokee,  Common,         Buginese,  Buhid,  Canadian_Aboriginal, Carian, Cham, Cherokee, Common,
3890         Coptic,   Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,  Egyp-         Coptic,  Cuneiform,  Cypriot,  Cyrillic,  Deseret,  Devanagari,   Egyp-
3891         tian_Hieroglyphs,  Ethiopic,  Georgian,  Glagolitic,   Gothic,   Greek,         tian_Hieroglyphs,   Ethiopic,   Georgian,  Glagolitic,  Gothic,  Greek,
3892         Gujarati,  Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana, Impe-         Gujarati, Gurmukhi,  Han,  Hangul,  Hanunoo,  Hebrew,  Hiragana,  Impe-
3893         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,         rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian,
3894         Javanese,  Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao,         Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer,  Lao,
3895         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,         Latin,  Lepcha,  Limbu,  Linear_B,  Lisu,  Lycian,  Lydian,  Malayalam,
3896         Meetei_Mayek,  Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic,         Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham,  Old_Italic,
3897         Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki,  Oriya,  Osmanya,         Old_Persian,  Old_South_Arabian,  Old_Turkic, Ol_Chiki, Oriya, Osmanya,
3898         Phags_Pa,  Phoenician,  Rejang,  Runic, Samaritan, Saurashtra, Shavian,         Phags_Pa, Phoenician, Rejang, Runic,  Samaritan,  Saurashtra,  Shavian,
3899         Sinhala, Sundanese, Syloti_Nagri, Syriac,  Tagalog,  Tagbanwa,  Tai_Le,         Sinhala,  Sundanese,  Syloti_Nagri,  Syriac, Tagalog, Tagbanwa, Tai_Le,
3900         Tai_Tham,  Tai_Viet,  Tamil,  Telugu,  Thaana, Thai, Tibetan, Tifinagh,         Tai_Tham, Tai_Viet, Tamil, Telugu,  Thaana,  Thai,  Tibetan,  Tifinagh,
3901         Ugaritic, Vai, Yi.         Ugaritic, Vai, Yi.
3902    
3903         Each character has exactly one Unicode general category property, spec-         Each character has exactly one Unicode general category property, spec-
3904         ified  by a two-letter abbreviation. For compatibility with Perl, nega-         ified by a two-letter abbreviation. For compatibility with Perl,  nega-
3905         tion can be specified by including a  circumflex  between  the  opening         tion  can  be  specified  by including a circumflex between the opening
3906         brace  and  the  property  name.  For  example,  \p{^Lu} is the same as         brace and the property name.  For  example,  \p{^Lu}  is  the  same  as
3907         \P{Lu}.         \P{Lu}.
3908    
3909         If only one letter is specified with \p or \P, it includes all the gen-         If only one letter is specified with \p or \P, it includes all the gen-
3910         eral  category properties that start with that letter. In this case, in         eral category properties that start with that letter. In this case,  in
3911         the absence of negation, the curly brackets in the escape sequence  are         the  absence of negation, the curly brackets in the escape sequence are
3912         optional; these two examples have the same effect:         optional; these two examples have the same effect:
3913    
3914           \p{L}           \p{L}
# Line 3912  BACKSLASH Line 3960  BACKSLASH
3960           Zp    Paragraph separator           Zp    Paragraph separator
3961           Zs    Space separator           Zs    Space separator
3962    
3963         The  special property L& is also supported: it matches a character that         The special property L& is also supported: it matches a character  that
3964         has the Lu, Ll, or Lt property, in other words, a letter  that  is  not         has  the  Lu,  Ll, or Lt property, in other words, a letter that is not
3965         classified as a modifier or "other".         classified as a modifier or "other".
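
The L& rule above can be sketched in Python using the standard unicodedata module. This is only an illustration of the stated definition: the helper name has_l_amp is hypothetical, and Python's Unicode tables may be a different version from the ones built into PCRE.

```python
import unicodedata

def has_l_amp(ch):
    """True if the single character ch has the Lu, Ll, or Lt general category."""
    return unicodedata.category(ch) in ("Lu", "Ll", "Lt")

# An upper case letter, a title case digraph, and a modifier letter.
for ch in ("A", "\u01c5", "\u02b0"):
    print(repr(ch), unicodedata.category(ch), has_l_amp(ch))
```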
3966    
3967         The  Cs  (Surrogate)  property  applies only to characters in the range         The Cs (Surrogate) property applies only to  characters  in  the  range
3968         U+D800 to U+DFFF. Such characters are not valid in UTF-8  strings  (see         U+D800  to  U+DFFF. Such characters are not valid in UTF-8 strings (see
3969         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-         RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
3970         ing has been turned off (see the discussion  of  PCRE_NO_UTF8_CHECK  in         ing  has  been  turned off (see the discussion of PCRE_NO_UTF8_CHECK in
3971         the pcreapi page). Perl does not support the Cs property.         the pcreapi page). Perl does not support the Cs property.
3972    
3973         The  long  synonyms  for  property  names  that  Perl supports (such as         The long synonyms for  property  names  that  Perl  supports  (such  as
3974         \p{Letter}) are not supported by PCRE, nor is it  permitted  to  prefix         \p{Letter})  are  not  supported by PCRE, nor is it permitted to prefix
3975         any of these properties with "Is".         any of these properties with "Is".
3976    
3977         No character that is in the Unicode table has the Cn (unassigned) prop-         No character that is in the Unicode table has the Cn (unassigned) prop-
3978         erty.  Instead, this property is assumed for any code point that is not         erty.  Instead, this property is assumed for any code point that is not
3979         in the Unicode table.         in the Unicode table.
3980    
3981         Specifying  caseless  matching  does not affect these escape sequences.         Specifying caseless matching does not affect  these  escape  sequences.
3982         For example, \p{Lu} always matches only upper case letters.         For example, \p{Lu} always matches only upper case letters.
3983    
3984         The \X escape matches any number of Unicode  characters  that  form  an         The  \X  escape  matches  any number of Unicode characters that form an
3985         extended Unicode sequence. \X is equivalent to         extended Unicode sequence. \X is equivalent to
3986    
3987           (?>\PM\pM*)           (?>\PM\pM*)
3988    
3989         That  is,  it matches a character without the "mark" property, followed         That is, it matches a character without the "mark"  property,  followed
3990         by zero or more characters with the "mark"  property,  and  treats  the         by  zero  or  more  characters with the "mark" property, and treats the
3991         sequence  as  an  atomic group (see below).  Characters with the "mark"         sequence as an atomic group (see below).  Characters  with  the  "mark"
3992         property are typically accents that  affect  the  preceding  character.         property  are  typically  accents  that affect the preceding character.
3993         None  of  them  have  codepoints less than 256, so in non-UTF-8 mode \X         None of them have codepoints less than 256, so  in  non-UTF-8  mode  \X
3994         matches any one character.         matches any one character.
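
The equivalence of \X to (?>\PM\pM*) can be sketched outside PCRE. The Python below is illustrative only, not PCRE's implementation. It assumes the standard unicodedata module as the source of the "mark" general category, and the helper name match_extended_sequence is hypothetical.

```python
import unicodedata

def is_mark(ch):
    """True if ch has a general category starting with M (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")

def match_extended_sequence(subject, pos=0):
    """Return the end offset of a \\PM\\pM* match starting at pos, or None.

    Hypothetical helper: one non-mark character, then any run of marks.
    """
    if pos >= len(subject) or is_mark(subject[pos]):
        return None          # \PM needs a character without the mark property
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1             # \pM* takes every following mark
    return end               # a single result: nothing is given back later

subject = "e\u0301a"         # "e" + COMBINING ACUTE ACCENT, then "a"
first_end = match_extended_sequence(subject, 0)
print(repr(subject[:first_end]))
```

Returning a single end offset, with no alternative shorter results, mirrors the atomic group: once the marks have been consumed they are never released.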
3995    
3996         Note that recent versions of Perl have changed \X to match what Unicode         Note that recent versions of Perl have changed \X to match what Unicode
3997         calls an "extended grapheme cluster", which has a more complicated def-         calls an "extended grapheme cluster", which has a more complicated def-
3998         inition.         inition.
3999    
4000         Matching characters by Unicode property is not fast, because  PCRE  has         Matching  characters  by Unicode property is not fast, because PCRE has
4001         to  search  a  structure  that  contains data for over fifteen thousand         to search a structure that contains  data  for  over  fifteen  thousand
4002         characters. That is why the traditional escape sequences such as \d and         characters. That is why the traditional escape sequences such as \d and
4003         \w  do  not  use  Unicode properties in PCRE by default, though you can         \w do not use Unicode properties in PCRE by  default,  though  you  can
4004         make them do so by setting the PCRE_UCP option for pcre_compile() or by         make them do so by setting the PCRE_UCP option for pcre_compile() or by
4005         starting the pattern with (*UCP).         starting the pattern with (*UCP).
4006    
4007     PCRE's additional properties     PCRE's additional properties
4008    
4009         As  well  as  the standard Unicode properties described in the previous         As well as the standard Unicode properties described  in  the  previous
4010         section, PCRE supports four more that make it possible to convert  tra-         section,  PCRE supports four more that make it possible to convert tra-
4011         ditional escape sequences such as \w and \s and POSIX character classes         ditional escape sequences such as \w and \s and POSIX character classes
4012         to use Unicode properties. PCRE uses these non-standard, non-Perl prop-         to use Unicode properties. PCRE uses these non-standard, non-Perl prop-
4013         erties internally when PCRE_UCP is set. They are:         erties internally when PCRE_UCP is set. They are:
# Line 3969  BACKSLASH Line 4017  BACKSLASH
4017           Xsp   Any Perl space character           Xsp   Any Perl space character
4018           Xwd   Any Perl "word" character           Xwd   Any Perl "word" character
4019    
4020         Xan  matches  characters that have either the L (letter) or the N (num-         Xan matches characters that have either the L (letter) or the  N  (num-
4021         ber) property. Xps matches the characters tab, linefeed, vertical  tab,         ber)  property. Xps matches the characters tab, linefeed, vertical tab,
4022         formfeed,  or  carriage  return, and any other character that has the Z         formfeed, or carriage return, and any other character that  has  the  Z
4023         (separator) property.  Xsp is the same as Xps, except that vertical tab         (separator) property.  Xsp is the same as Xps, except that vertical tab
4024         is excluded. Xwd matches the same characters as Xan, plus underscore.         is excluded. Xwd matches the same characters as Xan, plus underscore.
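
The four definitions above translate directly into predicates. The following Python sketch restates them using the standard unicodedata module. The function names are hypothetical, each function expects a single character, and the results depend on Python's Unicode tables rather than PCRE's.

```python
import unicodedata

def xan(ch):
    """Xan: general category L (letter) or N (number)."""
    return unicodedata.category(ch)[0] in ("L", "N")

def xps(ch):
    """Xps: tab, linefeed, vertical tab, formfeed, carriage return, or any Z."""
    return ch in ("\t", "\n", "\v", "\f", "\r") or unicodedata.category(ch)[0] == "Z"

def xsp(ch):
    """Xsp: the same as Xps, except that vertical tab is excluded."""
    return ch != "\v" and xps(ch)

def xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return ch == "_" or xan(ch)

for ch in ("a", "7", "_", "\v", "\u00a0"):
    print(repr(ch), xan(ch), xps(ch), xsp(ch), xwd(ch))
```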
4025    
4026     Resetting the match start     Resetting the match start
4027    
4028         The  escape sequence \K causes any previously matched characters not to         The escape sequence \K causes any previously matched characters not  to
4029         be included in the final matched sequence. For example, the pattern:         be included in the final matched sequence. For example, the pattern:
4030    
4031           foo\Kbar           foo\Kbar
4032    
4033         matches "foobar", but reports that it has matched "bar".  This  feature         matches  "foobar",  but reports that it has matched "bar". This feature
4034         is  similar  to  a lookbehind assertion (described below).  However, in         is similar to a lookbehind assertion (described  below).   However,  in
4035         this case, the part of the subject before the real match does not  have         this  case, the part of the subject before the real match does not have
4036         to  be of fixed length, as lookbehind assertions do. The use of \K does         to be of fixed length, as lookbehind assertions do. The use of \K  does
4037         not interfere with the setting of captured  substrings.   For  example,         not  interfere  with  the setting of captured substrings.  For example,
4038         when the pattern         when the pattern
4039    
4040           (foo)\Kbar           (foo)\Kbar
4041    
4042         matches "foobar", the first substring is still set to "foo".         matches "foobar", the first substring is still set to "foo".
4043    
4044         Perl  documents  that  the  use  of  \K  within assertions is "not well         Perl documents that the use  of  \K  within  assertions  is  "not  well
4045         defined". In PCRE, \K is acted upon  when  it  occurs  inside  positive         defined".  In  PCRE,  \K  is  acted upon when it occurs inside positive
4046         assertions, but is ignored in negative assertions.         assertions, but is ignored in negative assertions.
4047    
4048     Simple assertions     Simple assertions
4049    
4050         The  final use of backslash is for certain simple assertions. An asser-         The final use of backslash is for certain simple assertions. An  asser-
4051         tion specifies a condition that has to be met at a particular point  in         tion  specifies a condition that has to be met at a particular point in
4052         a  match, without consuming any characters from the subject string. The         a match, without consuming any characters from the subject string.  The
4053         use of subpatterns for more complicated assertions is described  below.         use  of subpatterns for more complicated assertions is described below.
4054         The backslashed assertions are:         The backslashed assertions are:
4055    
4056           \b     matches at a word boundary           \b     matches at a word boundary
# Line 4013  BACKSLASH Line 4061  BACKSLASH
4061           \z     matches only at the end of the subject           \z     matches only at the end of the subject
4062           \G     matches at the first matching position in the subject           \G     matches at the first matching position in the subject
4063    
4064         Inside  a  character  class, \b has a different meaning; it matches the         Inside a character class, \b has a different meaning;  it  matches  the
4065         backspace character. If any other of  these  assertions  appears  in  a         backspace  character.  If  any  other  of these assertions appears in a
4066         character  class, by default it matches the corresponding literal char-         character class, by default it matches the corresponding literal  char-
4067         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the         acter  (for  example,  \B  matches  the  letter  B).  However,  if  the
4068         PCRE_EXTRA  option is set, an "invalid escape sequence" error is gener-         PCRE_EXTRA option is set, an "invalid escape sequence" error is  gener-
4069         ated instead.         ated instead.
4070    
4071         A word boundary is a position in the subject string where  the  current         A  word  boundary is a position in the subject string where the current
4072         character  and  the previous character do not both match \w or \W (i.e.         character and the previous character do not both match \w or  \W  (i.e.
4073         one matches \w and the other matches \W), or the start or  end  of  the         one  matches  \w  and the other matches \W), or the start or end of the
4074         string  if  the  first  or  last character matches \w, respectively. In         string if the first or last  character  matches  \w,  respectively.  In
4075         UTF-8 mode, the meanings of \w and \W can be  changed  by  setting  the         UTF-8  mode,  the  meanings  of \w and \W can be changed by setting the
4076         PCRE_UCP  option. When this is done, it also affects \b and \B. Neither         PCRE_UCP option. When this is done, it also affects \b and \B.  Neither
4077         PCRE nor Perl has a separate "start of word" or "end of  word"  metase-         PCRE  nor  Perl has a separate "start of word" or "end of word" metase-
4078         quence.  However,  whatever follows \b normally determines which it is.         quence. However, whatever follows \b normally determines which  it  is.
4079         For example, the fragment \ba matches "a" at the start of a word.         For example, the fragment \ba matches "a" at the start of a word.
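
The word boundary rule above can be written as a small predicate. This Python sketch assumes the default, non-UCP meaning of \w (ASCII letters, digits, and underscore); the names WORD_CHARS and is_word_boundary are hypothetical and the code is not taken from PCRE.

```python
import string

# Assumed meaning of \w when PCRE_UCP is not set.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")

def is_word_boundary(subject, pos):
    """True if \\b would match at offset pos under the rule described above."""
    # Outside the string counts as "not a word character", which covers the
    # start-of-string and end-of-string cases.
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    return before != after

subject = "a cat"
boundaries = [i for i in range(len(subject) + 1) if is_word_boundary(subject, i)]
print(boundaries)
```

\B is simply the negation of this predicate at the same position.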
4080    
4081         The \A, \Z, and \z assertions differ from  the  traditional  circumflex         The  \A,  \Z,  and \z assertions differ from the traditional circumflex
4082         and dollar (described in the next section) in that they only ever match         and dollar (described in the next section) in that they only ever match
4083         at the very start and end of the subject string, whatever  options  are         at  the  very start and end of the subject string, whatever options are
4084         set.  Thus,  they are independent of multiline mode. These three asser-         set. Thus, they are independent of multiline mode. These  three  asser-
4085         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which         tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
4086         affect  only the behaviour of the circumflex and dollar metacharacters.         affect only the behaviour of the circumflex and dollar  metacharacters.
4087         However, if the startoffset argument of pcre_exec() is non-zero,  indi-         However,  if the startoffset argument of pcre_exec() is non-zero, indi-
4088         cating that matching is to start at a point other than the beginning of         cating that matching is to start at a point other than the beginning of
4089         the subject, \A can never match. The difference between \Z  and  \z  is         the  subject,  \A  can never match. The difference between \Z and \z is
4090         that \Z matches before a newline at the end of the string as well as at         that \Z matches before a newline at the end of the string as well as at
4091         the very end, whereas \z matches only at the end.         the very end, whereas \z matches only at the end.
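
The difference between \Z and \z comes down to which positions each accepts. The Python sketch below states the two conditions as predicates. The function names are hypothetical, and the newline parameter is an assumption standing in for whatever single newline sequence PCRE was configured to recognize.

```python
def matches_small_z(subject, pos):
    """\\z: true only at the very end of the subject."""
    return pos == len(subject)

def matches_big_z(subject, pos, newline="\n"):
    """\\Z: true at the very end, or just before a newline that ends the subject."""
    if pos == len(subject):
        return True
    return subject.endswith(newline) and pos == len(subject) - len(newline)

subject = "abc\n"
for pos in (3, 4):
    print(pos, matches_big_z(subject, pos), matches_small_z(subject, pos))
```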
4092    
4093         The \G assertion is true only when the current matching position is  at         The  \G assertion is true only when the current matching position is at
4094         the  start point of the match, as specified by the startoffset argument         the start point of the match, as specified by the startoffset  argument
4095         of pcre_exec(). It differs from \A when the  value  of  startoffset  is         of  pcre_exec().  It  differs  from \A when the value of startoffset is
4096         non-zero.  By calling pcre_exec() multiple times with appropriate argu-         non-zero. By calling pcre_exec() multiple times with appropriate  argu-
4097         ments, you can mimic Perl's /g option, and it is in this kind of imple-         ments, you can mimic Perl's /g option, and it is in this kind of imple-
4098         mentation where \G can be useful.         mentation where \G can be useful.
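
The kind of loop that mimics Perl's /g can be sketched as follows. This is an analogy only, written with Python's re module, which is a different engine and has no \G. Here Pattern.match(subject, offset) plays the role of calling pcre_exec() with a \G-anchored pattern and a non-zero startoffset. The function name iterate_anchored and the simple "advance by one after an empty match" step are assumptions made for the illustration.

```python
import re

def iterate_anchored(pattern, subject):
    """Collect successive matches, each required to begin where the last ended."""
    compiled = re.compile(pattern)
    results = []
    offset = 0
    while offset <= len(subject):
        m = compiled.match(subject, offset)   # anchored at offset, like \G
        if m is None:
            break                             # no match at this exact offset: stop
        results.append(m.group())
        # Step past an empty match so the loop always makes progress.
        offset = m.end() if m.end() > offset else offset + 1
    return results

print(iterate_anchored(r"\d+,?", "12,345,x,6"))
```

The final "6" is not reported because the chain of adjacent matches is broken at "x"; an unanchored search would have skipped ahead and found it.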
4099    
4100         Note,  however,  that  PCRE's interpretation of \G, as the start of the         Note, however, that PCRE's interpretation of \G, as the  start  of  the
4101         current match, is subtly different from Perl's, which defines it as the         current match, is subtly different from Perl's, which defines it as the
4102         end  of  the  previous  match. In Perl, these can be different when the         end of the previous match. In Perl, these can  be  different  when  the
4103         previously matched string was empty. Because PCRE does just  one  match         previously  matched  string was empty. Because PCRE does just one match
4104         at a time, it cannot reproduce this behaviour.         at a time, it cannot reproduce this behaviour.
4105    
4106         If  all  the alternatives of a pattern begin with \G, the expression is         If all the alternatives of a pattern begin with \G, the  expression  is
4107         anchored to the starting match position, and the "anchored" flag is set         anchored to the starting match position, and the "anchored" flag is set
4108         in the compiled regular expression.         in the compiled regular expression.
4109    
# Line 4063  BACKSLASH Line 4111  BACKSLASH
4111  CIRCUMFLEX AND DOLLAR  CIRCUMFLEX AND DOLLAR
4112    
4113         Outside a character class, in the default matching mode, the circumflex         Outside a character class, in the default matching mode, the circumflex
4114         character is an assertion that is true only  if  the  current  matching         character  is  an  assertion  that is true only if the current matching
4115         point  is  at the start of the subject string. If the startoffset argu-         point is at the start of the subject string. If the  startoffset  argu-
4116         ment of pcre_exec() is non-zero, circumflex  can  never  match  if  the         ment  of  pcre_exec()  is  non-zero,  circumflex can never match if the
4117         PCRE_MULTILINE  option  is  unset. Inside a character class, circumflex         PCRE_MULTILINE option is unset. Inside a  character  class,  circumflex
4118         has an entirely different meaning (see below).         has an entirely different meaning (see below).
4119    
4120         Circumflex need not be the first character of the pattern if  a  number         Circumflex  need  not be the first character of the pattern if a number
4121         of  alternatives are involved, but it should be the first thing in each         of alternatives are involved, but it should be the first thing in  each
4122         alternative in which it appears if the pattern is ever  to  match  that         alternative  in  which  it appears if the pattern is ever to match that
4123         branch.  If all possible alternatives start with a circumflex, that is,         branch. If all possible alternatives start with a circumflex, that  is,
4124         if the pattern is constrained to match only at the start  of  the  sub-         if  the  pattern  is constrained to match only at the start of the sub-
4125         ject,  it  is  said  to be an "anchored" pattern. (There are also other         ject, it is said to be an "anchored" pattern.  (There  are  also  other
4126         constructs that can cause a pattern to be anchored.)         constructs that can cause a pattern to be anchored.)
4127    
4128         A dollar character is an assertion that is true  only  if  the  current         A  dollar  character  is  an assertion that is true only if the current
4129         matching  point  is  at  the  end of the subject string, or immediately         matching point is at the end of  the  subject  string,  or  immediately
4130         before a newline at the end of the string (by default). Dollar need not         before a newline at the end of the string (by default). Dollar need not
4131         be  the  last  character of the pattern if a number of alternatives are         be the last character of the pattern if a number  of  alternatives  are
4132         involved, but it should be the last item in  any  branch  in  which  it         involved,  but  it  should  be  the last item in any branch in which it
4133         appears. Dollar has no special meaning in a character class.         appears. Dollar has no special meaning in a character class.
4134    
4135         The  meaning  of  dollar  can be changed so that it matches only at the         The meaning of dollar can be changed so that it  matches  only  at  the
4136         very end of the string, by setting the  PCRE_DOLLAR_ENDONLY  option  at         very  end  of  the string, by setting the PCRE_DOLLAR_ENDONLY option at
4137         compile time. This does not affect the \Z assertion.         compile time. This does not affect the \Z assertion.
4138    
4139         The meanings of the circumflex and dollar characters are changed if the         The meanings of the circumflex and dollar characters are changed if the
4140         PCRE_MULTILINE option is set. When  this  is  the  case,  a  circumflex         PCRE_MULTILINE  option  is  set.  When  this  is the case, a circumflex
4141         matches  immediately after internal newlines as well as at the start of         matches immediately after internal newlines as well as at the start  of
4142         the subject string. It does not match after a  newline  that  ends  the         the  subject  string.  It  does not match after a newline that ends the
4143         string.  A dollar matches before any newlines in the string, as well as         string. A dollar matches before any newlines in the string, as well  as
4144         at the very end, when PCRE_MULTILINE is set. When newline is  specified         at  the very end, when PCRE_MULTILINE is set. When newline is specified
4145         as  the  two-character  sequence CRLF, isolated CR and LF characters do         as the two-character sequence CRLF, isolated CR and  LF  characters  do
4146         not indicate newlines.         not indicate newlines.
4147    
4148         For example, the pattern /^abc$/ matches the subject string  "def\nabc"         For  example, the pattern /^abc$/ matches the subject string "def\nabc"
4149         (where  \n  represents a newline) in multiline mode, but not otherwise.         (where \n represents a newline) in multiline mode, but  not  otherwise.
4150         Consequently, patterns that are anchored in single  line  mode  because         Consequently,  patterns  that  are anchored in single line mode because
4151         all  branches  start  with  ^ are not anchored in multiline mode, and a         all branches start with ^ are not anchored in  multiline  mode,  and  a
4152         match for circumflex is  possible  when  the  startoffset  argument  of         match  for  circumflex  is  possible  when  the startoffset argument of
4153         pcre_exec()  is  non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if         pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is  ignored  if
4154         PCRE_MULTILINE is set.         PCRE_MULTILINE is set.
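
The /^abc$/ example can be tried in any engine that shares this behaviour. The sketch below uses Python's re module, which is not PCRE but agrees with the description for this particular subject. The second argument to Pattern.search() stands in for the startoffset argument of pcre_exec(), and all variable names are illustrative.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

print(single_line)            # no match: "abc" is not at the start of the subject
print(multi_line.span())      # the match begins just after the internal newline

# Starting the search at offset 4, immediately after the newline.
offset_single = re.compile(r"^abc$").search(subject, 4)
offset_multi = re.compile(r"^abc$", re.MULTILINE).search(subject, 4)
print(offset_single, offset_multi.group())
```

With a non-zero start offset the single-line pattern still cannot match, because circumflex there means the real start of the subject, while the multiline pattern can.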
4155    
4156         Note that the sequences \A, \Z, and \z can be used to match  the  start         Note  that  the sequences \A, \Z, and \z can be used to match the start
4157         and  end of the subject in both modes, and if all branches of a pattern         and end of the subject in both modes, and if all branches of a  pattern
4158         start with \A it is always anchored, whether or not  PCRE_MULTILINE  is         start  with  \A it is always anchored, whether or not PCRE_MULTILINE is
4159         set.         set.
4160    
4161    
4162  FULL STOP (PERIOD, DOT) AND \N  FULL STOP (PERIOD, DOT) AND \N
4163    
4164         Outside a character class, a dot in the pattern matches any one charac-         Outside a character class, a dot in the pattern matches any one charac-
4165         ter in the subject string except (by default) a character  that  signi-         ter  in  the subject string except (by default) a character that signi-
4166         fies  the  end  of  a line. In UTF-8 mode, the matched character may be         fies the end of a line. In UTF-8 mode, the  matched  character  may  be
4167         more than one byte long.         more than one byte long.
4168    
4169         When a line ending is defined as a single character, dot never  matches         When  a line ending is defined as a single character, dot never matches
4170         that  character; when the two-character sequence CRLF is used, dot does         that character; when the two-character sequence CRLF is used, dot  does
4171         not match CR if it is immediately followed  by  LF,  but  otherwise  it         not  match  CR  if  it  is immediately followed by LF, but otherwise it
4172         matches  all characters (including isolated CRs and LFs). When any Uni-         matches all characters (including isolated CRs and LFs). When any  Uni-
4173         code line endings are being recognized, dot does not match CR or LF  or         code  line endings are being recognized, dot does not match CR or LF or
4174         any of the other line ending characters.         any of the other line ending characters.
4175    
4176         The  behaviour  of  dot  with regard to newlines can be changed. If the         The behaviour of dot with regard to newlines can  be  changed.  If  the
4177         PCRE_DOTALL option is set, a dot matches  any  one  character,  without         PCRE_DOTALL  option  is  set,  a dot matches any one character, without
4178         exception. If the two-character sequence CRLF is present in the subject         exception. If the two-character sequence CRLF is present in the subject
4179         string, it takes two dots to match it.         string, it takes two dots to match it.
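
The rules for dot and newlines can be summarized as a predicate on one subject position. The Python sketch below restates them and is not PCRE code. The function name dot_matches is hypothetical, the newline parameter is an assumption standing in for the compile-time newline convention, and the "any Unicode line ending" case is not modelled.

```python
def dot_matches(subject, pos, newline="\n", dotall=False):
    """True if a dot would match the character at offset pos.

    newline is LF, CR, or the two-character sequence CRLF.
    """
    if pos >= len(subject):
        return False                  # nothing left to match
    if dotall:
        return True                   # PCRE_DOTALL: any one character
    if newline == "\r\n":
        # Only a CR immediately followed by LF is refused;
        # isolated CR and LF characters are matched.
        return subject[pos:pos + 2] != "\r\n"
    return subject[pos] != newline    # a single-character newline never matches

print(dot_matches("a\nb", 1))                      # LF newline
print(dot_matches("a\r\nb", 1, newline="\r\n"))    # CR of a CRLF pair
print(dot_matches("a\rb", 1, newline="\r\n"))      # isolated CR
```

Because the predicate tests one character at a time, a CRLF pair in the subject always needs two dots when PCRE_DOTALL is set.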
4180    
4181         The handling of dot is entirely independent of the handling of  circum-         The  handling of dot is entirely independent of the handling of circum-
4182         flex  and  dollar,  the  only relationship being that they both involve         flex and dollar, the only relationship being  that  they  both  involve
4183         newlines. Dot has no special meaning in a character class.         newlines. Dot has no special meaning in a character class.
4184    
4185         The escape sequence \N behaves like  a  dot,  except  that  it  is  not         The  escape  sequence  \N  behaves  like  a  dot, except that it is not
4186         affected  by  the  PCRE_DOTALL  option.  In other words, it matches any         affected by the PCRE_DOTALL option. In  other  words,  it  matches  any
4187         character except one that signifies the end of a line.         character  except  one that signifies the end of a line. Perl also uses
4188           \N to match characters by name; PCRE does not support this.
4189    
4190    
4191  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
# Line 4153  MATCHING A SINGLE BYTE Line 4202  MATCHING A SINGLE BYTE
4202         PCRE_NO_UTF8_CHECK option is used).         PCRE_NO_UTF8_CHECK option is used).
4203    
4204         PCRE  does  not  allow \C to appear in lookbehind assertions (described         PCRE  does  not  allow \C to appear in lookbehind assertions (described
4205         below), because in UTF-8 mode this would make it impossible  to  calcu-         below) in UTF-8 mode, because this would make it impossible  to  calcu-
4206         late the length of the lookbehind.         late the length of the lookbehind.
4207    
4208         In  general, the \C escape sequence is best avoided in UTF-8 mode. How-         In  general, the \C escape sequence is best avoided in UTF-8 mode. How-
# Line 5060  ASSERTIONS Line 5109  ASSERTIONS
5109         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
5110         rent position, the assertion fails.         rent position, the assertion fails.
5111    
5112         PCRE does not allow the \C escape (which matches a single byte in UTF-8         In  UTF-8 mode, PCRE does not allow the \C escape (which matches a sin-
5113         mode) to appear in lookbehind assertions, because it makes it  impossi-         gle byte, even in UTF-8  mode)  to  appear  in  lookbehind  assertions,
5114         ble  to  calculate the length of the lookbehind. The \X and \R escapes,         because  it  makes it impossible to calculate the length of the lookbe-
5115         which can match different numbers of bytes, are also not permitted.         hind. The \X and \R escapes,  which  can  match  different  numbers  of
5116           bytes, are also not permitted.
5117    
5118         "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in         "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in
5119         lookbehinds,  as  long as the subpattern matches a fixed-length string.         lookbehinds, as long as the subpattern matches a  fixed-length  string.
5120         Recursion, however, is not supported.         Recursion, however, is not supported.
5121    
5122         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind
5123         assertions to specify efficient matching of fixed-length strings at the         assertions to specify efficient matching of fixed-length strings at the
5124         end of subject strings. Consider a simple pattern such as         end of subject strings. Consider a simple pattern such as
5125    
5126           abcd$           abcd$
5127    
5128         when applied to a long string that does  not  match.  Because  matching         when  applied  to  a  long string that does not match. Because matching
5129         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
5130         and then see if what follows matches the rest of the  pattern.  If  the         and  then  see  if what follows matches the rest of the pattern. If the
5131         pattern is specified as         pattern is specified as
5132    
5133           ^.*abcd$           ^.*abcd$
5134    
5135         the  initial .* matches the entire string at first, but when this fails         the initial .* matches the entire string at first, but when this  fails
5136         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
5137         last  character,  then all but the last two characters, and so on. Once         last character, then all but the last two characters, and so  on.  Once
5138         again the search for "a" covers the entire string, from right to  left,         again  the search for "a" covers the entire string, from right to left,
5139         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
5140    
5141           ^.*+(?<=abcd)           ^.*+(?<=abcd)
5142    
5143         there  can  be  no backtracking for the .*+ item; it can match only the         there can be no backtracking for the .*+ item; it can  match  only  the
5144         entire string. The subsequent lookbehind assertion does a  single  test         entire  string.  The subsequent lookbehind assertion does a single test
5145         on  the last four characters. If it fails, the match fails immediately.         on the last four characters. If it fails, the match fails  immediately.
5146         For long strings, this approach makes a significant difference  to  the         For  long  strings, this approach makes a significant difference to the
5147         processing time.         processing time.
5148    
5149     Using multiple assertions     Using multiple assertions
# Line 5102  ASSERTIONS Line 5152  ASSERTIONS
5152    
5153           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
5154    
5155         matches  "foo" preceded by three digits that are not "999". Notice that         matches "foo" preceded by three digits that are not "999". Notice  that
5156         each of the assertions is applied independently at the  same  point  in         each  of  the  assertions is applied independently at the same point in
5157         the  subject  string.  First  there  is a check that the previous three         the subject string. First there is a  check  that  the  previous  three
5158         characters are all digits, and then there is  a  check  that  the  same         characters  are  all  digits,  and  then there is a check that the same
5159         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
5160         ceded by six characters, the first three of which are  digits  and  the  last         ceded  by  six  characters,  the first three of which are digits and the last
5161         three  of  which  are not "999". For example, it doesn't match "123abc-         three of which are not "999". For example, it  doesn't  match  "123abc-
5162         foo". A pattern to do that is         foo". A pattern to do that is
5163    
5164           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
5165    
5166         This time the first assertion looks at the  preceding  six  characters,         This  time  the  first assertion looks at the preceding six characters,
5167         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
5168         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
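
The behaviour of the two patterns above can be tried out in Python's re module, which accepts the same syntax for these fixed-length lookbehinds. This is only a sketch: Python's re is not PCRE, and the helper name found is invented for the example.

```python
import re

# Both lookbehinds are tested at the same point: just before "foo".
same_three = re.compile(r'(?<=\d{3})(?<!999)foo')

# Here the first lookbehind spans six characters (three digits, then any
# three characters); the second still looks at the last three only.
six_back = re.compile(r'(?<=\d{3}...)(?<!999)foo')

def found(pattern, subject):
    """Return True if the compiled pattern matches somewhere in subject."""
    return pattern.search(subject) is not None

print(found(same_three, "123foo"))     # True
print(found(same_three, "999foo"))     # False
print(found(same_three, "123abcfoo"))  # False
print(found(six_back, "123abcfoo"))    # True
print(found(six_back, "123999foo"))    # False
```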
5169    
# Line 5121  ASSERTIONS Line 5171  ASSERTIONS
5171    
5172           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
5173    
5174         matches an occurrence of "baz" that is preceded by "bar" which in  turn         matches  an occurrence of "baz" that is preceded by "bar" which in turn
5175         is not preceded by "foo", while         is not preceded by "foo", while
5176    
5177           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
5178    
5179         is  another pattern that matches "foo" preceded by three digits and any         is another pattern that matches "foo" preceded by three digits and  any
5180         three characters that are not "999".         three characters that are not "999".
5181    
5182    
5183  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
5184    
5185         It is possible to cause the matching process to obey a subpattern  con-         It  is possible to cause the matching process to obey a subpattern con-
5186         ditionally  or to choose between two alternative subpatterns, depending         ditionally or to choose between two alternative subpatterns,  depending
5187         on the result of an assertion, or whether a specific capturing  subpat-         on  the result of an assertion, or whether a specific capturing subpat-
5188         tern  has  already  been matched. The two possible forms of conditional         tern has already been matched. The two possible  forms  of  conditional
5189         subpattern are:         subpattern are:
5190    
5191           (?(condition)yes-pattern)           (?(condition)yes-pattern)
5192           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
5193    
5194         If the condition is satisfied, the yes-pattern is used;  otherwise  the         If  the  condition is satisfied, the yes-pattern is used; otherwise the
5195         no-pattern  (if  present)  is used. If there are more than two alterna-         no-pattern (if present) is used. If there are more  than  two  alterna-
5196         tives in the subpattern, a compile-time error occurs. Each of  the  two         tives  in  the subpattern, a compile-time error occurs. Each of the two
5197         alternatives may itself contain nested subpatterns of any form, includ-         alternatives may itself contain nested subpatterns of any form, includ-
5198         ing  conditional  subpatterns;  the  restriction  to  two  alternatives         ing  conditional  subpatterns;  the  restriction  to  two  alternatives
5199         applies only at the level of the condition. This pattern fragment is an         applies only at the level of the condition. This pattern fragment is an
# Line 5152  CONDITIONAL SUBPATTERNS Line 5202  CONDITIONAL SUBPATTERNS
5202           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
5203    
5204    
5205         There are four kinds of condition: references  to  subpatterns,  refer-         There  are  four  kinds of condition: references to subpatterns, refer-
5206         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
5207    
5208     Checking for a used subpattern by number     Checking for a used subpattern by number
5209    
5210         If  the  text between the parentheses consists of a sequence of digits,         If the text between the parentheses consists of a sequence  of  digits,
5211         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
5212         viously  matched.  If  there is more than one capturing subpattern with         viously matched. If there is more than one  capturing  subpattern  with
5213         the same number (see the earlier  section  about  duplicate  subpattern         the  same  number  (see  the earlier section about duplicate subpattern
5214         numbers),  the condition is true if any of them have matched. An alter-         numbers), the condition is true if any of them have matched. An  alter-
5215         native notation is to precede the digits with a plus or minus sign.  In         native  notation is to precede the digits with a plus or minus sign. In
5216         this  case, the subpattern number is relative rather than absolute. The         this case, the subpattern number is relative rather than absolute.  The
5217         most recently opened parentheses can be referenced by (?(-1), the  next         most  recently opened parentheses can be referenced by (?(-1), the next
5218         most  recent  by (?(-2), and so on. Inside loops it can also make sense         most recent by (?(-2), and so on. Inside loops it can also  make  sense
5219         to refer to subsequent groups. The next parentheses to be opened can be         to refer to subsequent groups. The next parentheses to be opened can be
5220         referenced  as (?(+1), and so on. (The value zero in any of these forms         referenced as (?(+1), and so on. (The value zero in any of these  forms
5221         is not used; it provokes a compile-time error.)         is not used; it provokes a compile-time error.)
5222    
5223         Consider the following pattern, which  contains  non-significant  white         Consider  the  following  pattern, which contains non-significant white
5224         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
5225         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
5226    
5227           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
5228    
5229         The first part matches an optional opening  parenthesis,  and  if  that         The  first  part  matches  an optional opening parenthesis, and if that
5230         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
5231         ond part matches one or more characters that are not  parentheses.  The         ond  part  matches one or more characters that are not parentheses. The
5232         third  part  is  a conditional subpattern that tests whether or not the         third part is a conditional subpattern that tests whether  or  not  the
5233         first set of parentheses matched. If they  did,  that  is,  if  the subject         first  set  of  parentheses  matched.  If they did, that is, if the subject
5234         started  with an opening parenthesis, the condition is true, and so the         started with an opening parenthesis, the condition is true, and so  the
5235         yes-pattern is executed and a closing parenthesis is  required.  Other-         yes-pattern  is  executed and a closing parenthesis is required. Other-
5236         wise,  since no-pattern is not present, the subpattern matches nothing.         wise, since no-pattern is not present, the subpattern matches  nothing.
5237         In other words, this pattern matches  a  sequence  of  non-parentheses,         In  other  words,  this  pattern matches a sequence of non-parentheses,
5238         optionally enclosed in parentheses.         optionally enclosed in parentheses.
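
As an illustration, the pattern above can be tested against a few subjects with Python's re module, which happens to accept the same (?(1)...) syntax. The flag re.VERBOSE stands in for PCRE_EXTENDED here; this is a sketch in a different engine, not a demonstration of PCRE itself.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED: white space in the
# pattern is ignored.
optional_parens = re.compile(r'( \( )?    [^()]+    (?(1) \) )', re.VERBOSE)

def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None

print(whole_match("abc"))    # True:  no opening parenthesis, none required at the end
print(whole_match("(abc)"))  # True:  group 1 matched, so ")" is required and present
print(whole_match("(abc"))   # False: group 1 matched, but the closing ")" is missing
print(whole_match("abc)"))   # False: group 1 did not match, so ")" is left over
```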
5239    
5240         If  you  were  embedding  this pattern in a larger one, you could use a         If you were embedding this pattern in a larger one,  you  could  use  a
5241         relative reference:         relative reference:
5242    
5243           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
5244    
5245         This makes the fragment independent of the parentheses  in  the  larger         This  makes  the  fragment independent of the parentheses in the larger
5246         pattern.         pattern.
5247    
5248     Checking for a used subpattern by name     Checking for a used subpattern by name
5249    
5250         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
5251         used subpattern by name. For compatibility  with  earlier  versions  of         used  subpattern  by  name.  For compatibility with earlier versions of
5252         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
5253         also recognized. However, there is a possible ambiguity with this  syn-         also  recognized. However, there is a possible ambiguity with this syn-
5254         tax,  because  subpattern  names  may  consist entirely of digits. PCRE         tax, because subpattern names may  consist  entirely  of  digits.  PCRE
5255         looks first for a named subpattern; if it cannot find one and the  name         looks  first for a named subpattern; if it cannot find one and the name
5256         consists  entirely  of digits, PCRE looks for a subpattern of that num-         consists entirely of digits, PCRE looks for a subpattern of  that  num-
5257         ber, which must be greater than zero. Using subpattern names that  con-         ber,  which must be greater than zero. Using subpattern names that con-
5258         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
5259    
5260         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
5261    
5262           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
5263    
5264         If  the  name used in a condition of this kind is a duplicate, the test         If the name used in a condition of this kind is a duplicate,  the  test
5265         is applied to all subpatterns of the same name, and is true if any  one         is  applied to all subpatterns of the same name, and is true if any one
5266         of them has matched.         of them has matched.
5267    
5268     Checking for pattern recursion     Checking for pattern recursion
5269    
5270         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
5271         name R, the condition is true if a recursive call to the whole  pattern         name  R, the condition is true if a recursive call to the whole pattern
5272         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
5273         sand follow the letter R, for example:         sand follow the letter R, for example:
5274    
# Line 5226  CONDITIONAL SUBPATTERNS Line 5276  CONDITIONAL SUBPATTERNS
5276    
5277         the condition is true if the most recent recursion is into a subpattern         the condition is true if the most recent recursion is into a subpattern
5278         whose number or name is given. This condition does not check the entire         whose number or name is given. This condition does not check the entire
5279         recursion stack. If the name used in a condition  of  this  kind  is  a         recursion  stack.  If  the  name  used in a condition of this kind is a
5280         duplicate, the test is applied to all subpatterns of the same name, and         duplicate, the test is applied to all subpatterns of the same name, and
5281         is true if any one of them is the most recent recursion.         is true if any one of them is the most recent recursion.
5282    
5283         At "top level", all these recursion test  conditions  are  false.   The         At  "top  level",  all  these recursion test conditions are false.  The
5284         syntax for recursive patterns is described below.         syntax for recursive patterns is described below.
5285    
5286     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
5287    
5288         If  the  condition  is  the string (DEFINE), and there is no subpattern         If the condition is the string (DEFINE), and  there  is  no  subpattern
5289         with the name DEFINE, the condition is  always  false.  In  this  case,         with  the  name  DEFINE,  the  condition is always false. In this case,
5290         there  may  be  only  one  alternative  in the subpattern. It is always         there may be only one alternative  in  the  subpattern.  It  is  always
5291         skipped if control reaches this point  in  the  pattern;  the  idea  of         skipped  if  control  reaches  this  point  in the pattern; the idea of
5292         DEFINE  is that it can be used to define subroutines that can be refer-         DEFINE is that it can be used to define subroutines that can be  refer-
5293         enced from elsewhere. (The use of subroutines is described below.)  For         enced  from elsewhere. (The use of subroutines is described below.) For
5294         example,  a  pattern  to match an IPv4 address such as "192.168.23.245"         example, a pattern to match an IPv4 address  such  as  "192.168.23.245"
5295         could be written like this (ignore whitespace and line breaks):         could be written like this (ignore whitespace and line breaks):
5296    
5297           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
5298           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
5299    
5300         The first part of the pattern is a DEFINE group inside which another         The  first part of the pattern is a DEFINE group inside which another
5301         group  named "byte" is defined. This matches an individual component of         group named "byte" is defined. This matches an individual component  of
5302         an IPv4 address (a number less than 256). When  matching  takes  place,         an  IPv4  address  (a number less than 256). When matching takes place,
5303         this  part  of  the pattern is skipped because DEFINE acts like a false         this part of the pattern is skipped because DEFINE acts  like  a  false
5304         condition. The rest of the pattern uses references to the  named  group         condition.  The  rest of the pattern uses references to the named group
5305         to  match the four dot-separated components of an IPv4 address, insist-         to match the four dot-separated components of an IPv4 address,  insist-
5306         ing on a word boundary at each end.         ing on a word boundary at each end.
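
Python's re module has neither DEFINE nor (?&name), so the following sketch only imitates the idea: the "byte" fragment is kept in a string and pasted in wherever the PCRE pattern would call it. The names BYTE, IPV4 and is_ipv4 are invented for the example, and textual substitution is not identical to a subroutine call in every respect.

```python
import re

# Stand-in for (?<byte> ...): the fragment is kept in a Python string.
BYTE = r'(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)'

# Stand-in for (?&byte): the fragment is pasted in wherever it is needed.
IPV4 = re.compile(r'\b' + BYTE + r'(?:\.' + BYTE + r'){3}\b')

def is_ipv4(subject):
    """Return True if the entire subject looks like a dotted IPv4 address."""
    return IPV4.fullmatch(subject) is not None

print(is_ipv4("192.168.23.245"))  # True
print(is_ipv4("256.1.1.1"))       # False: 256 is not less than 256
print(is_ipv4("1.2.3"))           # False: only three components
```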
5307    
5308     Assertion conditions     Assertion conditions
5309    
5310         If the condition is not in any of the above  formats,  it  must  be  an         If  the  condition  is  not  in any of the above formats, it must be an
5311         assertion.   This may be a positive or negative lookahead or lookbehind         assertion.  This may be a positive or negative lookahead or  lookbehind
5312         assertion. Consider  this  pattern,  again  containing  non-significant         assertion.  Consider  this  pattern,  again  containing non-significant
5313         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
5314    
5315           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
5316           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
5317    
5318         The  condition  is  a  positive  lookahead  assertion  that  matches an         The condition  is  a  positive  lookahead  assertion  that  matches  an
5319         optional sequence of non-letters followed by a letter. In other  words,         optional  sequence of non-letters followed by a letter. In other words,
5320         it  tests  for the presence of at least one letter in the subject. If a         it tests for the presence of at least one letter in the subject.  If  a
5321         letter is found, the subject is matched against the first  alternative;         letter  is found, the subject is matched against the first alternative;
5322         otherwise  it  is  matched  against  the  second.  This pattern matches         otherwise it is  matched  against  the  second.  This  pattern  matches
5323         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
5324         letters and dd are digits.         letters and dd are digits.
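
Python's re module does not accept an assertion as a condition, so this sketch rewrites the pattern above in an equivalent form: the lookahead guards the first alternative and its negation guards the second. It is an emulation for experimenting with the behaviour, not the PCRE syntax; the names date_like and whole_match are invented.

```python
import re

# "If lookahead then A else B" written as
# (lookahead A) | (negated lookahead B).
date_like = re.compile(
    r'(?: (?=[^a-z]*[a-z]) \d{2}-[a-z]{3}-\d{2}'
    r'  | (?![^a-z]*[a-z]) \d{2}-\d{2}-\d{2} )',
    re.VERBOSE,
)

def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return date_like.fullmatch(subject) is not None

print(whole_match("12-abc-34"))  # True:  a letter is present, first form
print(whole_match("12-34-56"))   # True:  no letter, second form
print(whole_match("12-ab-34"))   # False: a letter is present, but first form fails
```

The last case shows that the condition selects a single alternative: once a letter is seen, the digits-only form is never tried.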
5325    
5326    
# Line 5279  COMMENTS Line 5329  COMMENTS
5329         There are two ways of including comments in patterns that are processed         There are two ways of including comments in patterns that are processed
5330         by PCRE. In both cases, the start of the comment must not be in a char-         by PCRE. In both cases, the start of the comment must not be in a char-
5331         acter class, nor in the middle of any other sequence of related charac-         acter class, nor in the middle of any other sequence of related charac-
5332         ters such as (?: or a subpattern name or number.  The  characters  that         ters  such  as  (?: or a subpattern name or number. The characters that
5333         make up a comment play no part in the pattern matching.         make up a comment play no part in the pattern matching.
5334    
5335         The  sequence (?# marks the start of a comment that continues up to the         The sequence (?# marks the start of a comment that continues up to  the
5336         next closing parenthesis. Nested parentheses are not permitted. If  the         next  closing parenthesis. Nested parentheses are not permitted. If the
5337         PCRE_EXTENDED option is set, an unescaped # character also introduces a         PCRE_EXTENDED option is set, an unescaped # character also introduces a
5338         comment, which in this case continues to  immediately  after  the  next         comment,  which  in  this  case continues to immediately after the next
5339         newline  character  or character sequence in the pattern. Which charac-         newline character or character sequence in the pattern.  Which  charac-
5340         ters are interpreted as newlines is controlled by the options passed to         ters are interpreted as newlines is controlled by the options passed to
5341         pcre_compile() or by a special sequence at the start of the pattern, as         pcre_compile() or by a special sequence at the start of the pattern, as
5342         described in the section entitled  "Newline  conventions"  above.  Note         described  in  the  section  entitled "Newline conventions" above. Note
5343         that  the  end of this type of comment is a literal newline sequence in         that the end of this type of comment is a literal newline  sequence  in
5344         the pattern; escape sequences that happen to represent a newline do not         the pattern; escape sequences that happen to represent a newline do not
5345         count.  For  example,  consider this pattern when PCRE_EXTENDED is set,         count. For example, consider this pattern when  PCRE_EXTENDED  is  set,
5346         and the default newline convention is in force:         and the default newline convention is in force:
5347    
5348           abc #comment \n still comment           abc #comment \n still comment
5349    
5350         On encountering the # character, pcre_compile()  skips  along,  looking         On  encountering  the  # character, pcre_compile() skips along, looking
5351         for  a newline in the pattern. The sequence \n is still literal at this         for a newline in the pattern. The sequence \n is still literal at  this
5352         stage, so it does not terminate the comment. Only an  actual  character         stage,  so  it does not terminate the comment. Only an actual character
5353         with the code value 0x0a (the default newline) does so.         with the code value 0x0a (the default newline) does so.
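
The same distinction can be observed in Python's re module, whose re.VERBOSE mode treats # comments in a comparable way. This is a sketch in a different engine, assuming only that its comment handling is analogous for this case; the variable names are invented.

```python
import re

# Raw string: the pattern contains a backslash followed by "n", not a
# newline character, so the comment runs to the end of the pattern.
escaped = re.compile(r'abc #comment \n still comment', re.VERBOSE)

# Ordinary string: "\n" here is an actual 0x0a character, which ends the
# comment, so "def" is part of the pattern.
literal = re.compile('abc #comment \n def', re.VERBOSE)

print(escaped.fullmatch("abc") is not None)     # True
print(literal.fullmatch("abc") is not None)     # False
print(literal.fullmatch("abcdef") is not None)  # True
```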
5354    
5355    
5356  RECURSIVE PATTERNS  RECURSIVE PATTERNS
5357    
5358         Consider  the problem of matching a string in parentheses, allowing for         Consider the problem of matching a string in parentheses, allowing  for
5359         unlimited nested parentheses. Without the use of  recursion,  the  best         unlimited  nested  parentheses.  Without the use of recursion, the best
5360         that  can  be  done  is  to use a pattern that matches up to some fixed         that can be done is to use a pattern that  matches  up  to  some  fixed
5361         depth of nesting. It is not possible to  handle  an  arbitrary  nesting         depth  of  nesting.  It  is not possible to handle an arbitrary nesting
5362         depth.         depth.
5363    
5364         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
5365         sions to recurse (amongst other things). It does this by  interpolating         sions  to recurse (amongst other things). It does this by interpolating
5366         Perl  code in the expression at run time, and the code can refer to the         Perl code in the expression at run time, and the code can refer to  the
5367         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
5368         parentheses problem can be created like this:         parentheses problem can be created like this:
5369    
# Line 5323  RECURSIVE PATTERNS Line 5373  RECURSIVE PATTERNS
5373         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
5374    
5375         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
5376         it  supports  special  syntax  for recursion of the entire pattern, and         it supports special syntax for recursion of  the  entire  pattern,  and
5377         also for individual subpattern recursion.  After  its  introduction  in         also  for  individual  subpattern  recursion. After its introduction in
5378         PCRE  and  Python,  this  kind of recursion was subsequently introduced         PCRE and Python, this kind of  recursion  was  subsequently  introduced
5379         into Perl at release 5.10.         into Perl at release 5.10.
5380    
5381         A special item that consists of (? followed by a  number  greater  than         A  special  item  that consists of (? followed by a number greater than
5382         zero  and  a  closing parenthesis is a recursive subroutine call of the         zero and a closing parenthesis is a recursive subroutine  call  of  the
5383         subpattern of the given number, provided that  it  occurs  inside  that         subpattern  of  the  given  number, provided that it occurs inside that
5384         subpattern.  (If  not,  it is a non-recursive subroutine call, which is         subpattern. (If not, it is a non-recursive subroutine  call,  which  is
5385         described in the next section.) The special item  (?R)  or  (?0)  is  a         described  in  the  next  section.)  The special item (?R) or (?0) is a
5386         recursive call of the entire regular expression.         recursive call of the entire regular expression.
5387    
5388         This  PCRE  pattern  solves  the nested parentheses problem (assume the         This PCRE pattern solves the nested  parentheses  problem  (assume  the
5389         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
5390    
5391           \( ( [^()]++ | (?R) )* \)           \( ( [^()]++ | (?R) )* \)
5392    
5393         First it matches an opening parenthesis. Then it matches any number  of         First  it matches an opening parenthesis. Then it matches any number of
5394         substrings  which  can  either  be  a sequence of non-parentheses, or a         substrings which can either be a  sequence  of  non-parentheses,  or  a
5395         recursive match of the pattern itself (that is, a  correctly  parenthe-         recursive  match  of the pattern itself (that is, a correctly parenthe-
5396         sized substring).  Finally there is a closing parenthesis. Note the use         sized substring).  Finally there is a closing parenthesis. Note the use
5397         of a possessive quantifier to avoid backtracking into sequences of non-         of a possessive quantifier to avoid backtracking into sequences of non-
5398         parentheses.         parentheses.
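
Python's re module has no recursion syntax, so the structure of the pattern above is sketched here as a small hand-written recursive function instead. The function name match_group is invented; the code mirrors the three parts of the pattern and is not how PCRE implements recursion.

```python
def match_group(text, pos=0):
    """Match a balanced parenthesized group starting at text[pos].

    Returns the index just past the closing parenthesis, or -1 if
    there is no balanced group at that position.
    """
    # \(  : an opening parenthesis is required.
    if pos >= len(text) or text[pos] != "(":
        return -1
    pos += 1
    # ( [^()]++ | (?R) )*  : any number of non-parentheses or nested groups.
    while pos < len(text) and text[pos] != ")":
        if text[pos] == "(":
            pos = match_group(text, pos)  # the (?R) alternative
            if pos == -1:
                return -1
        else:
            pos += 1                      # one more character of [^()]++
    # \)  : a closing parenthesis is required.
    if pos < len(text):
        return pos + 1
    return -1

print(match_group("(ab(cd)ef)"))  # 10
print(match_group("()"))          # 2
print(match_group("(a(b)"))       # -1
```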
5399    
5400         If  this  were  part of a larger pattern, you would not want to recurse         If this were part of a larger pattern, you would not  want  to  recurse
5401         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
5402    
5403           ( \( ( [^()]++ | (?1) )* \) )           ( \( ( [^()]++ | (?1) )* \) )
5404    
5405         We have put the pattern into parentheses, and caused the  recursion  to         We  have  put the pattern into parentheses, and caused the recursion to
5406         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
5407    
5408         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be
5409         tricky. This is made easier by the use of relative references.  Instead         tricky.  This is made easier by the use of relative references. Instead
5410         of (?1) in the pattern above you can write (?-2) to refer to the second         of (?1) in the pattern above you can write (?-2) to refer to the second
5411         most recently opened parentheses  preceding  the  recursion.  In  other         most  recently  opened  parentheses  preceding  the recursion. In other
5412         words,  a  negative  number counts capturing parentheses leftwards from         words, a negative number counts capturing  parentheses  leftwards  from
5413         the point at which it is encountered.         the point at which it is encountered.
5414    
5415         It is also possible to refer to  subsequently  opened  parentheses,  by         It  is  also  possible  to refer to subsequently opened parentheses, by
5416         writing  references  such  as (?+2). However, these cannot be recursive         writing references such as (?+2). However, these  cannot  be  recursive
5417         because the reference is not inside the  parentheses  that  are  refer-         because  the  reference  is  not inside the parentheses that are refer-
5418         enced.  They are always non-recursive subroutine calls, as described in         enced. They are always non-recursive subroutine calls, as described  in
5419         the next section.         the next section.
5420    
5421         An alternative approach is to use named parentheses instead.  The  Perl         An  alternative  approach is to use named parentheses instead. The Perl
5422         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also
5423         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
5424    
5425           (?<pn> \( ( [^()]++ | (?&pn) )* \) )           (?<pn> \( ( [^()]++ | (?&pn) )* \) )
5426    
5427         If there is more than one subpattern with the same name,  the  earliest         If  there  is more than one subpattern with the same name, the earliest
5428         one is used.         one is used.
5429    
5430         This  particular  example pattern that we have been looking at contains         This particular example pattern that we have been looking  at  contains
5431         nested unlimited repeats, and so the use of a possessive quantifier for         nested unlimited repeats, and so the use of a possessive quantifier for
5432         matching strings of non-parentheses is important when applying the pat-         matching strings of non-parentheses is important when applying the pat-
5433         tern to strings that do not match. For example, when  this  pattern  is         tern  to  strings  that do not match. For example, when this pattern is
5434         applied to         applied to
5435    
5436           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
5437    
5438         it  yields  "no  match" quickly. However, if a possessive quantifier is         it yields "no match" quickly. However, if a  possessive  quantifier  is
5439         not used, the match runs for a very long time indeed because there  are         not  used, the match runs for a very long time indeed because there are
5440         so  many  different  ways the + and * repeats can carve up the subject,         so many different ways the + and * repeats can carve  up  the  subject,
5441         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
5442    
5443         At the end of a match, the values of capturing  parentheses  are  those         At  the  end  of a match, the values of capturing parentheses are those
5444         from  the outermost level. If you want to obtain intermediate values, a         from the outermost level. If you want to obtain intermediate values,  a
5445         callout function can be used (see below and the pcrecallout  documenta-         callout  function can be used (see below and the pcrecallout documenta-
5446         tion). If the pattern above is matched against         tion). If the pattern above is matched against
5447    
5448           (ab(cd)ef)           (ab(cd)ef)
5449    
5450         the  value  for  the  inner capturing parentheses (numbered 2) is "ef",         the value for the inner capturing parentheses  (numbered  2)  is  "ef",
5451         which is the last value taken on at the top level. If a capturing  sub-         which  is the last value taken on at the top level. If a capturing sub-
5452         pattern  is  not  matched at the top level, its final captured value is         pattern is not matched at the top level, its final  captured  value  is
5453         unset, even if it was (temporarily) set at a deeper  level  during  the         unset,  even  if  it was (temporarily) set at a deeper level during the
5454         matching process.         matching process.
5455    
5456         If  there are more than 15 capturing parentheses in a pattern, PCRE has         If there are more than 15 capturing parentheses in a pattern, PCRE  has
5457         to obtain extra memory to store data during a recursion, which it  does         to  obtain extra memory to store data during a recursion, which it does
5458         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
5459         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
5460    
5461         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
5462         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
5463         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
5464         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
5465         ted at the outer level.         ted at the outer level.
5466    
5467           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
5468    
5469         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
5470         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
5471         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
5472    
5473     Differences in recursion processing between PCRE and Perl     Differences in recursion processing between PCRE and Perl
5474    
5475         Recursion processing in PCRE differs from Perl in two  important  ways.         Recursion  processing  in PCRE differs from Perl in two important ways.
5476         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is
5477         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
5478         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
5479         alternatives and there is a subsequent matching failure.  This  can  be         alternatives  and  there  is a subsequent matching failure. This can be
5480         illustrated  by the following pattern, which purports to match a palin-         illustrated by the following pattern, which purports to match a  palin-
5481         dromic string that contains an odd number of characters  (for  example,         dromic  string  that contains an odd number of characters (for example,
5482         "a", "aba", "abcba", "abcdcba"):         "a", "aba", "abcba", "abcdcba"):
5483    
5484           ^(.|(.)(?1)\2)$           ^(.|(.)(?1)\2)$
5485    
5486         The idea is that it either matches a single character, or two identical         The idea is that it either matches a single character, or two identical
5487         characters surrounding a sub-palindrome. In Perl, this  pattern  works;         characters  surrounding  a sub-palindrome. In Perl, this pattern works;
5488         in PCRE it does not if the subject string is longer than three characters.         in PCRE it does not if the subject string is longer than three characters.
5489         Consider the subject string "abcba":         Consider the subject string "abcba":
5490    
5491         At the top level, the first character is matched, but as it is  not  at         At  the  top level, the first character is matched, but as it is not at
5492         the end of the string, the first alternative fails; the second alterna-         the end of the string, the first alternative fails; the second alterna-
5493         tive is taken and the recursion kicks in. The recursive call to subpat-         tive is taken and the recursion kicks in. The recursive call to subpat-
5494         tern  1  successfully  matches the next character ("b"). (Note that the         tern 1 successfully matches the next character ("b").  (Note  that  the
5495         beginning and end of line tests are not part of the recursion.)         beginning and end of line tests are not part of the recursion.)
5496    
5497         Back at the top level, the next character ("c") is compared  with  what         Back  at  the top level, the next character ("c") is compared with what
5498         subpattern  2 matched, which was "a". This fails. Because the recursion         subpattern 2 matched, which was "a". This fails. Because the  recursion
5499         is treated as an atomic group, there are now  no  backtracking  points,         is  treated  as  an atomic group, there are now no backtracking points,
5500         and  so  the  entire  match fails. (Perl is able, at this point, to re-         and so the entire match fails. (Perl is able, at  this  point,  to  re-
5501         enter the recursion and try the second alternative.)  However,  if  the         enter  the  recursion  and try the second alternative.) However, if the
5502         pattern is written with the alternatives in the other order, things are         pattern is written with the alternatives in the other order, things are
5503         different:         different:
5504    
5505           ^((.)(?1)\2|.)$           ^((.)(?1)\2|.)$
5506    
5507         This time, the recursing alternative is tried first, and  continues  to         This  time,  the recursing alternative is tried first, and continues to
5508         recurse  until  it runs out of characters, at which point the recursion         recurse until it runs out of characters, at which point  the  recursion
5509         fails. But this time we do have  another  alternative  to  try  at  the         fails.  But  this  time  we  do  have another alternative to try at the
5510         higher  level.  That  is  the  big difference: in the previous case the         higher level. That is the big difference:  in  the  previous  case  the
5511         remaining alternative is at a deeper recursion level, which PCRE cannot         remaining alternative is at a deeper recursion level, which PCRE cannot
5512         use.         use.
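The difference between the two orderings can be reproduced with a toy model. The sketch below is an assumption-laden illustration in Python, not PCRE or Perl code: the names group1 and matches are hypothetical. Setting atomic=True makes a recursive call keep only its first successful match, as described above for PCRE; atomic=False lets the caller re-enter the recursion, as Perl does. The flag recurse_first selects between ^(.|(.)(?1)\2)$ and ^((.)(?1)\2|.)$.

```python
def group1(s, i, atomic, recurse_first):
    """Yield end offsets for group 1 matched at offset i, in the order tried."""

    def single():                            # the alternative  .
        if i < len(s):
            yield i + 1

    def wrapped():                           # the alternative  (.)(?1)\2
        if i >= len(s):
            return
        ends = group1(s, i + 1, atomic, recurse_first)
        if atomic:
            # Atomic recursion: keep only the first way the call matched.
            first = next(ends, None)
            ends = [] if first is None else [first]
        for j in ends:
            if j < len(s) and s[j] == s[i]:  # \2 must equal what (.) captured
                yield j + 1

    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s, atomic, recurse_first=False):
    """True if ^ group1 $ matches the whole of s under the chosen model."""
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))


for subject in ("aba", "abcba"):
    print(subject,
          matches(subject, atomic=True),                       # PCRE-like, . first
          matches(subject, atomic=False),                      # Perl-like, . first
          matches(subject, atomic=True, recurse_first=True))   # PCRE-like, reordered
```

In this model the top-level group is never atomic, because it is not itself a recursive call; only the nested calls lose their untried alternatives.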
5513    
5514         To  change  the pattern so that it matches all palindromic strings, not         To change the pattern so that it matches all palindromic  strings,  not
5515         just those with an odd number of characters, it is tempting  to  change         just  those  with an odd number of characters, it is tempting to change
5516         the pattern to this:         the pattern to this:
5517    
5518           ^((.)(?1)\2|.?)$           ^((.)(?1)\2|.?)$
5519    
5520         Again,  this  works  in Perl, but not in PCRE, and for the same reason.         Again, this works in Perl, but not in PCRE, and for  the  same  reason.
5521         When a deeper recursion has matched a single character,  it  cannot  be         When  a  deeper  recursion has matched a single character, it cannot be
5522         entered  again  in  order  to match an empty string. The solution is to         entered again in order to match an empty string.  The  solution  is  to
5523         separate the two cases, and write out the odd and even cases as  alter-         separate  the two cases, and write out the odd and even cases as alter-
5524         natives at the higher level:         natives at the higher level:
5525    
5526           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
5527    
5528         If  you  want  to match typical palindromic phrases, the pattern has to         If you want to match typical palindromic phrases, the  pattern  has  to
5529         ignore all non-word characters, which can be done like this:         ignore all non-word characters, which can be done like this:
5530    
5531           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
5532    
5533         If run with the PCRE_CASELESS option, this pattern matches phrases such         If run with the PCRE_CASELESS option, this pattern matches phrases such
5534         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
5535         Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-         Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-
5536         ing  into  sequences of non-word characters. Without this, PCRE takes a         ing into sequences of non-word characters. Without this, PCRE  takes  a
5537         great deal longer (ten times or more) to  match  typical  phrases,  and         great  deal  longer  (ten  times or more) to match typical phrases, and
5538         Perl takes so long that you think it has gone into a loop.         Perl takes so long that you think it has gone into a loop.
5539    
5540         WARNING:  The  palindrome-matching patterns above work only if the sub-         WARNING: The palindrome-matching patterns above work only if  the  sub-
5541         ject string does not start with a palindrome that is shorter  than  the         ject  string  does not start with a palindrome that is shorter than the
5542         entire  string.  For example, although "abcba" is correctly matched, if         entire string.  For example, although "abcba" is correctly matched,  if
5543         the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,         the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,
5544         then  fails at top level because the end of the string does not follow.         then fails at top level because the end of the string does not  follow.
5545         Once again, it cannot jump back into the recursion to try other  alter-         Once  again, it cannot jump back into the recursion to try other alter-
5546         natives, so the entire match fails.         natives, so the entire match fails.
5547    
5548         The  second  way  in which PCRE and Perl differ in their recursion pro-         The second way in which PCRE and Perl differ in  their  recursion  pro-
5549         cessing is in the handling of captured values. In Perl, when a  subpat-         cessing  is in the handling of captured values. In Perl, when a subpat-
5550         tern  is  called recursively or as a subroutine (see the next section),         tern is called recursively or as a subroutine (see the  next  section),
5551         it has no access to any values that were captured  outside  the  recur-         it  has  no  access to any values that were captured outside the recur-
5552         sion,  whereas  in  PCRE  these values can be referenced. Consider this         sion, whereas in PCRE these values can  be  referenced.  Consider  this
5553         pattern:         pattern:
5554    
5555           ^(.)(\1|a(?2))           ^(.)(\1|a(?2))
5556    
5557         In PCRE, this pattern matches "bab". The  first  capturing  parentheses         In  PCRE,  this  pattern matches "bab". The first capturing parentheses
5558         match  "b",  then in the second group, when the back reference \1 fails         match "b", then in the second group, when the back reference  \1  fails
5559         to match "b", the second alternative matches "a" and then recurses.  In         to  match "b", the second alternative matches "a" and then recurses. In
5560         the  recursion,  \1 does now match "b" and so the whole match succeeds.         the recursion, \1 does now match "b" and so the whole  match  succeeds.
5561         In Perl, the pattern fails to match because inside the  recursive  call         In  Perl,  the pattern fails to match because inside the recursive call
5562         \1 cannot access the externally set value.         \1 cannot access the externally set value.
5563    
5564    
5565  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
5566    
5567         If  the  syntax for a recursive subpattern call (either by number or by         If the syntax for a recursive subpattern call (either by number  or  by
5568         name) is used outside the parentheses to which it refers,  it  operates         name)  is  used outside the parentheses to which it refers, it operates
5569         like  a subroutine in a programming language. The called subpattern may         like a subroutine in a programming language. The called subpattern  may
5570         be defined before or after the reference. A numbered reference  can  be         be  defined  before or after the reference. A numbered reference can be
5571         absolute or relative, as in these examples:         absolute or relative, as in these examples:
5572    
5573           (...(absolute)...)...(?2)...           (...(absolute)...)...(?2)...
# Line 5528  SUBPATTERNS AS SUBROUTINES Line 5578  SUBPATTERNS AS SUBROUTINES
5578    
5579           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
5580    
5581         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
5582         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
5583    
5584           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
5585    
5586         is used, it does match "sense and responsibility" as well as the  other         is  used, it does match "sense and responsibility" as well as the other
5587         two  strings.  Another  example  is  given  in the discussion of DEFINE         two strings. Another example is  given  in  the  discussion  of  DEFINE
5588         above.         above.
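The contrast between a back reference and a subroutine call can be tried out in Python, with one caveat: Python's standard re module supports \1 but has no (?1) syntax. The sketch below therefore emulates the subroutine call by repeating the text of the subpattern, which is an assumption made for illustration and not how PCRE implements it. The variable names are hypothetical.

```python
import re

GROUP = r"(sens|respons)"

# Back reference: \1 must match the same text that group 1 matched.
backref = re.compile(GROUP + r"e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again,
# so either alternative may match the second time.
subroutine_like = re.compile(GROUP + r"e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
for subject in subjects:
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine_like.fullmatch(subject)))
```

Only the third subject separates the two forms: the back reference rejects it, while the re-applied subpattern accepts it.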
5589    
5590         All subroutine calls, whether recursive or not, are always  treated  as         All  subroutine  calls, whether recursive or not, are always treated as
5591         atomic  groups. That is, once a subroutine has matched some of the sub-         atomic groups. That is, once a subroutine has matched some of the  sub-
5592         ject string, it is never re-entered, even if it contains untried alter-         ject string, it is never re-entered, even if it contains untried alter-
5593         natives  and  there  is  a  subsequent  matching failure. Any capturing         natives and there is  a  subsequent  matching  failure.  Any  capturing
5594         parentheses that are set during the subroutine  call  revert  to  their         parentheses  that  are  set  during the subroutine call revert to their
5595         previous values afterwards.         previous values afterwards.
5596    
5597         Processing  options  such as case-independence are fixed when a subpat-         Processing options such as case-independence are fixed when  a  subpat-
5598         tern is defined, so if it is used as a subroutine, such options  cannot         tern  is defined, so if it is used as a subroutine, such options cannot
5599         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
5600    
5601           (abc)(?i:(?-1))           (abc)(?i:(?-1))
5602    
5603         It  matches  "abcabc". It does not match "abcABC" because the change of         It matches "abcabc". It does not match "abcABC" because the  change  of
5604         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
5605    
5606    
5607  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
5608    
5609         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a
5610         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
5611         an alternative syntax for referencing a  subpattern  as  a  subroutine,         an  alternative  syntax  for  referencing a subpattern as a subroutine,
5612         possibly  recursively. Here are two of the examples used above, rewrit-         possibly recursively. Here are two of the examples used above,  rewrit-
5613         ten using this syntax:         ten using this syntax:
5614    
5615           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
5616           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
5617    
5618         PCRE supports an extension to Oniguruma: if a number is preceded  by  a         PCRE  supports  an extension to Oniguruma: if a number is preceded by a
5619         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
5620    
5621           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
5622    
5623         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not
5624         synonymous. The former is a back reference; the latter is a  subroutine         synonymous.  The former is a back reference; the latter is a subroutine
5625         call.         call.
5626    
5627    
5628  CALLOUTS  CALLOUTS
5629    
5630         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
5631         Perl code to be obeyed in the middle of matching a regular  expression.         Perl  code to be obeyed in the middle of matching a regular expression.
5632         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
5633         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
5634         tion.         tion.
5635    
5636         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
5637         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
5638         an  external function by putting its entry point in the global variable         an external function by putting its entry point in the global  variable
5639         pcre_callout.  By default, this variable contains NULL, which  disables         pcre_callout.   By default, this variable contains NULL, which disables
5640         all calling out.         all calling out.
5641    
5642         Within  a  regular  expression,  (?C) indicates the points at which the         Within a regular expression, (?C) indicates the  points  at  which  the
5643         external function is to be called. If you want  to  identify  different         external  function  is  to be called. If you want to identify different
5644         callout  points, you can put a number less than 256 after the letter C.         callout points, you can put a number less than 256 after the letter  C.
5645         The default value is zero.  For example, this pattern has  two  callout         The  default  value is zero.  For example, this pattern has two callout
5646         points:         points:
5647    
5648           (?C1)abc(?C2)def           (?C1)abc(?C2)def
5649    
5650         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
5651         automatically installed before each item in the pattern. They  are  all         automatically  installed  before each item in the pattern. They are all
5652         numbered 255.         numbered 255.
5653    
5654         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
5655         set), the external function is called. It is provided with  the  number         set),  the  external function is called. It is provided with the number
5656         of  the callout, the position in the pattern, and, optionally, one item         of the callout, the position in the pattern, and, optionally, one  item
5657         of data originally supplied by the caller of pcre_exec().  The  callout         of  data  originally supplied by the caller of pcre_exec(). The callout
5658         function  may cause matching to proceed, to backtrack, or to fail alto-         function may cause matching to proceed, to backtrack, or to fail  alto-
5659         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
5660         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
5661    
5662    
5663  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5664    
5665         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",
5666         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5667         ject  to  change or removal in a future version of Perl". It goes on to         ject to change or removal in a future version of Perl". It goes  on  to
5668         say: "Their usage in production code should be noted to avoid  problems         say:  "Their usage in production code should be noted to avoid problems
5669         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5670         in this section.         in this section.
5671    
5672         Since these verbs are specifically related  to  backtracking,  most  of         Since  these  verbs  are  specifically related to backtracking, most of
5673         them  can  be  used  only  when  the  pattern  is  to  be matched using         them can be  used  only  when  the  pattern  is  to  be  matched  using
5674         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5675         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5676         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5677    
5678         If any of these verbs are used in an assertion or in a subpattern  that         If  any of these verbs are used in an assertion or in a subpattern that
5679         is called as a subroutine (whether or not recursively), their effect is         is called as a subroutine (whether or not recursively), their effect is
5680         confined to that subpattern; it does not extend to the surrounding pat-         confined to that subpattern; it does not extend to the surrounding pat-
5681         tern,  with  one  exception:  a *MARK that is encountered in a positive         tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN)
5682         assertion is passed back (compare capturing parentheses in assertions).         that  is  encountered in a successful positive assertion is passed back
5683           when a match succeeds (compare capturing  parentheses  in  assertions).
5684         Note that such subpatterns are processed as anchored at the point where         Note that such subpatterns are processed as anchored at the point where
5685         they are tested. Note also that Perl's treatment of subroutines is dif-         they are tested. Note also that Perl's treatment of subroutines is dif-
5686         ferent in some cases.         ferent in some cases.
# Line 5652  BACKTRACKING CONTROL Line 5703  BACKTRACKING CONTROL
5703         by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-         by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-
5704         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
5705    
5706           Experiments  with  Perl  suggest that it too has similar optimizations,
5707           sometimes leading to anomalous results.
5708    
5709     Verbs that act immediately     Verbs that act immediately
5710    
5711         The  following  verbs act as soon as they are encountered. They may not         The following verbs act as soon as they are encountered. They  may  not
5712         be followed by a name.         be followed by a name.
5713    
5714            (*ACCEPT)            (*ACCEPT)
5715    
5716         This verb causes the match to end successfully, skipping the  remainder         This  verb causes the match to end successfully, skipping the remainder
5717         of  the pattern. However, when it is inside a subpattern that is called         of the pattern. However, when it is inside a subpattern that is  called
5718         as a subroutine, only that subpattern is ended  successfully.  Matching         as  a  subroutine, only that subpattern is ended successfully. Matching
5719         then  continues  at  the  outer level. If (*ACCEPT) is inside capturing         then continues at the outer level. If  (*ACCEPT)  is  inside  capturing
5720         parentheses, the data so far is captured. For example:         parentheses, the data so far is captured. For example:
5721    
5722           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
5723    
5724         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
5725         tured by the outer parentheses.         tured by the outer parentheses.
5726    
5727           (*FAIL) or (*F)           (*FAIL) or (*F)
5728    
5729         This  verb causes a matching failure, forcing backtracking to occur. It         This verb causes a matching failure, forcing backtracking to occur.  It
5730         is equivalent to (?!) but easier to read. The Perl documentation  notes         is  equivalent to (?!) but easier to read. The Perl documentation notes
5731         that  it  is  probably  useful only when combined with (?{}) or (??{}).         that it is probably useful only when combined  with  (?{})  or  (??{}).
5732         Those are, of course, Perl features that are not present in  PCRE.  The         Those  are,  of course, Perl features that are not present in PCRE. The
5733         nearest  equivalent is the callout feature, as for example in this pat-         nearest equivalent is the callout feature, as for example in this  pat-
5734         tern:         tern:
5735    
5736           a+(?C)(*FAIL)           a+(?C)(*FAIL)
5737    
5738         A match with the string "aaaa" always fails, but the callout  is  taken         A  match  with the string "aaaa" always fails, but the callout is taken
5739         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
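
The figure of 10 can be checked by hand: from each starting position, a+ first takes every available "a" and then gives them back one at a time, and the callout runs once for each length tried. The following is a minimal sketch of that arithmetic only. It does not call PCRE, and the function name count_callouts_model is made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Model only (hypothetical helper, not part of PCRE): counts how often the
   callout in a+(?C)(*FAIL) would be reached for a given subject. */
size_t count_callouts_model(const char *subject)
{
    size_t len = strlen(subject);
    size_t callouts = 0;

    /* The unanchored match is attempted at each starting position in turn. */
    for (size_t start = 0; start < len; start++) {
        size_t run = 0;

        /* Greedy a+ takes every consecutive 'a' from this position. */
        while (subject[start + run] == 'a')
            run++;

        /* (*FAIL) forces a backtrack after each callout, so a+ is retried
           with run, run-1, ..., 1 characters: one callout per length. */
        callouts += run;
    }
    return callouts;
}
```

For the subject "aaaa" the four starting positions contribute 4 + 3 + 2 + 1 callouts, which gives the 10 quoted above.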
5740    
5741     Recording which path was taken     Recording which path was taken
5742    
5743         There  is  one  verb  whose  main  purpose  is to track how a match was         There is one verb whose main purpose  is  to  track  how  a  match  was
5744         arrived at, though it also has a  secondary  use  in  conjunction  with         arrived  at,  though  it  also  has a secondary use in conjunction with
5745         advancing the match starting point (see (*SKIP) below).         advancing the match starting point (see (*SKIP) below).
5746    
5747           (*MARK:NAME) or (*:NAME)           (*MARK:NAME) or (*:NAME)
5748    
5749         A  name  is  always  required  with  this  verb.  There  may be as many         A name is always  required  with  this  verb.  There  may  be  as  many
5750         instances of (*MARK) as you like in a pattern, and their names  do  not         instances  of  (*MARK) as you like in a pattern, and their names do not
5751         have to be unique.         have to be unique.
5752    
5753         When  a  match  succeeds,  the  name of the last-encountered (*MARK) is         When a match succeeds, the name of the last-encountered (*MARK) on  the
5754         passed back to  the  caller  via  the  pcre_extra  data  structure,  as         matching  path  is  passed  back  to the caller via the pcre_extra data
5755         described in the section on pcre_extra in the pcreapi documentation. No         structure, as described in the section on  pcre_extra  in  the  pcreapi
5756         data is returned for a partial match. Here is an  example  of  pcretest         documentation. Here is an example of pcretest output, where the /K mod-
5757         output,  where the /K modifier requests the retrieval and outputting of         ifier requests the retrieval and outputting of (*MARK) data:
        (*MARK) data:  
5758    
5759           /X(*MARK:A)Y|X(*MARK:B)Z/K             re> /X(*MARK:A)Y|X(*MARK:B)Z/K
5760           XY           data> XY
5761            0: XY            0: XY
5762           MK: A           MK: A
5763           XZ           XZ
# Line 5720  BACKTRACKING CONTROL Line 5773  BACKTRACKING CONTROL
5773         and passed back if it is the last-encountered. This does not happen for         and passed back if it is the last-encountered. This does not happen for
5774         negative assertions.         negative assertions.
5775    
5776         A  name  may  also  be  returned after a failed match if the final path         After  a  partial match or a failed match, the name of the last encoun-
5777         through the pattern involves (*MARK). However, unless (*MARK) is used in         tered (*MARK) in the entire match process is returned. For example:
        conjunction  with  (*COMMIT),  this  is unlikely to happen for an unan-  
        chored pattern because, as the starting point for matching is advanced,  
        the final check is often with an empty string, causing a failure before  
        (*MARK) is reached. For example:  
   
          /X(*MARK:A)Y|X(*MARK:B)Z/K  
          XP  
          No match  
   
        There are three potential starting points for this match (starting with  
        X,  starting  with  P,  and  with  an  empty string). If the pattern is  
        anchored, the result is different:  
5778    
5779           /^X(*MARK:A)Y|^X(*MARK:B)Z/K             re> /X(*MARK:A)Y|X(*MARK:B)Z/K
5780           XP           data> XP
5781           No match, mark = B           No match, mark = B
5782    
5783         PCRE's start-of-match optimizations can also interfere with  this.  For         Note that in this unanchored example the  mark  is  retained  from  the
5784         example,  if, as a result of a call to pcre_study(), it knows the mini-         match attempt that started at the letter "X". Subsequent match attempts
5785         mum subject length for a match, a shorter subject will not  be  scanned         starting at "P" and then with an empty string do not get as far as  the
5786         at all.         (*MARK) item, but nevertheless do not reset it.
   
        Note that similar anomalies (though different in detail) exist in Perl,  
        no doubt for the same reasons. The use of (*MARK) data after  a  failed  
        match  of an unanchored pattern is not recommended, unless (*COMMIT) is  
        involved.  
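
The revised rule (the last-encountered mark is kept across starting positions) can be traced with a small hand-written model of the pattern X(*MARK:A)Y|X(*MARK:B)Z. This is a sketch under stated assumptions: it hard-codes that one pattern, it does not use the PCRE API, and the name last_mark_model is hypothetical.

```c
#include <stddef.h>

/* Model only (hypothetical helper, not part of PCRE): walks the pattern
   X(*MARK:A)Y|X(*MARK:B)Z over the subject. Sets *matched to 1 on success
   and returns the last mark encountered, or NULL if no mark was reached. */
const char *last_mark_model(const char *subject, int *matched)
{
    const char *mark = NULL;
    *matched = 0;

    for (size_t start = 0; ; start++) {
        if (subject[start] == 'X') {
            mark = "A";                        /* first alternative passes (*MARK:A) */
            if (subject[start + 1] == 'Y') {
                *matched = 1;
                return mark;
            }
            mark = "B";                        /* second alternative passes (*MARK:B) */
            if (subject[start + 1] == 'Z') {
                *matched = 1;
                return mark;
            }
        }
        /* Attempts that never reach a (*MARK) leave the earlier value alone. */
        if (subject[start] == '\0')
            break;
    }
    return mark;
}
```

With the subject "XP" the model reports no match and the mark "B", which agrees with the pcretest output shown in the right-hand column. The attempts starting at "P" and at the empty string never reach a (*MARK), so they do not reset it.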
5787    
5788     Verbs that act after backtracking     Verbs that act after backtracking
5789    
5790         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
5791         tinues  with what follows, but if there is no subsequent match, causing         tinues with what follows, but if there is no subsequent match,  causing
5792         a backtrack to the verb, a failure is  forced.  That  is,  backtracking         a  backtrack  to  the  verb, a failure is forced. That is, backtracking
5793         cannot  pass  to the left of the verb. However, when one of these verbs         cannot pass to the left of the verb. However, when one of  these  verbs
5794         appears inside an atomic group, its effect is confined to  that  group,         appears  inside  an atomic group, its effect is confined to that group,
5795         because  once the group has been matched, there is never any backtrack-         because once the group has been matched, there is never any  backtrack-
5796         ing into it. In this situation, backtracking can  "jump  back"  to  the         ing  into  it.  In  this situation, backtracking can "jump back" to the
5797         left  of the entire atomic group. (Remember also, as stated above, that         left of the entire atomic group. (Remember also, as stated above,  that
5798         this localization also applies in subroutine calls and assertions.)         this localization also applies in subroutine calls and assertions.)
5799    
5800         These verbs differ in exactly what kind of failure  occurs  when  back-         These  verbs  differ  in exactly what kind of failure occurs when back-
5801         tracking reaches them.         tracking reaches them.
5802    
5803           (*COMMIT)           (*COMMIT)
5804    
5805         This  verb, which may not be followed by a name, causes the whole match         This verb, which may not be followed by a name, causes the whole  match
5806         to fail outright if the rest of the pattern does not match. Even if the         to fail outright if the rest of the pattern does not match. Even if the
5807         pattern is unanchored, no further attempts to find a match by advancing         pattern is unanchored, no further attempts to find a match by advancing
5808         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
5809         pcre_exec()  is  committed  to  finding a match at the current starting         pcre_exec() is committed to finding a match  at  the  current  starting
5810         point, or not at all. For example:         point, or not at all. For example:
5811    
5812           a+(*COMMIT)b           a+(*COMMIT)b
5813    
5814         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind
5815         of dynamic anchor, or "I've started, so I must finish." The name of the         of dynamic anchor, or "I've started, so I must finish." The name of the
5816         most recently passed (*MARK) in the path is passed back when  (*COMMIT)         most  recently passed (*MARK) in the path is passed back when (*COMMIT)
5817         forces a match failure.         forces a match failure.
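
The difference between a+(*COMMIT)b and plain a+b can be seen in a hand-written model of the two patterns. The sketch below does not call PCRE and ignores start-of-match optimizations. Both function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Model only: a+(*COMMIT)b. Returns 1 if a match is found, 0 otherwise. */
int commit_model(const char *subject)
{
    size_t len = strlen(subject);

    for (size_t start = 0; start <= len; start++) {
        size_t n = 0;
        while (subject[start + n] == 'a')     /* greedy a+ */
            n++;
        if (n == 0)
            continue;                         /* (*COMMIT) not reached: bump along */

        /* (*COMMIT) has been passed. If "b" does not follow, the backtrack
           lands on (*COMMIT) and the whole match fails: no shorter a+ is
           tried and no later starting position is tried. */
        return subject[start + n] == 'b';
    }
    return 0;
}

/* Model only: plain a+b, for comparison. Ordinary bumpalong applies. */
int plain_model(const char *subject)
{
    size_t len = strlen(subject);

    for (size_t start = 0; start <= len; start++) {
        size_t n = 0;
        while (subject[start + n] == 'a')
            n++;
        /* Shorter lengths of a+ are followed by 'a', never 'b', so checking
           the character after the longest run is sufficient. */
        if (n > 0 && subject[start + n] == 'b')
            return 1;
    }
    return 0;
}
```

For "aacaab" the commit model fails at the first "aa" because "c" follows, whereas the plain model goes on to find "aab" at offset 3. Both models accept "xxaab".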
5818    
5819         Note  that  (*COMMIT)  at  the start of a pattern is not the same as an         Note that (*COMMIT) at the start of a pattern is not  the  same  as  an
5820         anchor, unless PCRE's start-of-match optimizations are turned  off,  as         anchor,  unless  PCRE's start-of-match optimizations are turned off, as
5821         shown in this pcretest example:         shown in this pcretest example:
5822    
5823           /(*COMMIT)abc/             re> /(*COMMIT)abc/
5824           xyzabc           data> xyzabc
5825            0: abc            0: abc
5826           xyzabc\Y           xyzabc\Y
5827           No match           No match
5828    
5829         PCRE  knows  that  any  match  must start with "a", so the optimization         PCRE knows that any match must start  with  "a",  so  the  optimization
5830         skips along the subject to "a" before running the first match  attempt,         skips  along the subject to "a" before running the first match attempt,
5831         which  succeeds.  When the optimization is disabled by the \Y escape in         which succeeds. When the optimization is disabled by the \Y  escape  in
5832         the second subject, the match starts at "x" and so the (*COMMIT) causes         the second subject, the match starts at "x" and so the (*COMMIT) causes
5833         it to fail without trying any other starting points.         it to fail without trying any other starting points.
5834    
5835           (*PRUNE) or (*PRUNE:NAME)           (*PRUNE) or (*PRUNE:NAME)
5836    
5837         This  verb causes the match to fail at the current starting position in         This verb causes the match to fail at the current starting position  in
5838         the subject if the rest of the pattern does not match. If  the  pattern         the  subject  if the rest of the pattern does not match. If the pattern
5839         is  unanchored,  the  normal  "bumpalong"  advance to the next starting         is unanchored, the normal "bumpalong"  advance  to  the  next  starting
5840         character then happens. Backtracking can occur as usual to the left  of         character  then happens. Backtracking can occur as usual to the left of
5841         (*PRUNE),  before  it  is  reached,  or  when  matching to the right of         (*PRUNE), before it is reached,  or  when  matching  to  the  right  of
5842         (*PRUNE), but if there is no match to the  right,  backtracking  cannot         (*PRUNE),  but  if  there is no match to the right, backtracking cannot
5843         cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-         cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter-
5844         native to an atomic group or possessive quantifier, but there are  some         native  to an atomic group or possessive quantifier, but there are some
5845         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
5846         iour of (*PRUNE:NAME) is the  same  as  (*MARK:NAME)(*PRUNE)  when  the         iour  of  (*PRUNE:NAME)  is  the  same  as  (*MARK:NAME)(*PRUNE). In an
5847         match  fails  completely;  the name is passed back if this is the final         anchored pattern (*PRUNE) has the same effect as (*COMMIT).
        attempt.  (*PRUNE:NAME) does not pass back a name  if  the  match  suc-  
        ceeds.  In  an  anchored pattern (*PRUNE) has the same effect as (*COM-  
        MIT).  
5848    
5849           (*SKIP)           (*SKIP)
5850    
# Line 5838  BACKTRACKING CONTROL Line 5871  BACKTRACKING CONTROL
5871         is searched for the most recent (*MARK) that has the same name. If  one         is searched for the most recent (*MARK) that has the same name. If  one
5872         is  found, the "bumpalong" advance is to the subject position that cor-         is  found, the "bumpalong" advance is to the subject position that cor-
5873         responds to that (*MARK) instead of to where (*SKIP)  was  encountered.         responds to that (*MARK) instead of to where (*SKIP)  was  encountered.
5874         If  no (*MARK) with a matching name is found, normal "bumpalong" of one         If no (*MARK) with a matching name is found, the (*SKIP) is ignored.
        character happens (that is, the (*SKIP) is ignored).  
5875    
5876           (*THEN) or (*THEN:NAME)           (*THEN) or (*THEN:NAME)
5877    
5878         This verb causes a skip to the next innermost alternative if  the  rest         This  verb  causes a skip to the next innermost alternative if the rest
5879         of  the  pattern does not match. That is, it cancels pending backtrack-         of the pattern does not match. That is, it cancels  pending  backtrack-
5880         ing, but only within the current alternative. Its name comes  from  the         ing,  but  only within the current alternative. Its name comes from the
5881         observation that it can be used for a pattern-based if-then-else block:         observation that it can be used for a pattern-based if-then-else block:
5882    
5883           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
5884    
5885         If  the COND1 pattern matches, FOO is tried (and possibly further items         If the COND1 pattern matches, FOO is tried (and possibly further  items
5886         after the end of the group if FOO succeeds); on  failure,  the  matcher         after  the  end  of the group if FOO succeeds); on failure, the matcher
5887         skips  to  the second alternative and tries COND2, without backtracking         skips to the second alternative and tries COND2,  without  backtracking
5888         into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as         into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as
5889         (*MARK:NAME)(*THEN)  if  the  overall  match  fails.  If (*THEN) is not         (*MARK:NAME)(*THEN).  If (*THEN) is not inside an alternation, it  acts
5890         inside an alternation, it acts like (*PRUNE).         like (*PRUNE).
5891    
5892         Note that a subpattern that does not contain a | character  is  just  a         Note  that  a  subpattern that does not contain a | character is just a
5893         part  of the enclosing alternative; it is not a nested alternation with         part of the enclosing alternative; it is not a nested alternation  with
5894         only one alternative. The effect of (*THEN) extends beyond such a  sub-         only  one alternative. The effect of (*THEN) extends beyond such a sub-
5895         pattern  to  the enclosing alternative. Consider this pattern, where A,         pattern to the enclosing alternative. Consider this pattern,  where  A,
5896         B, etc. are complex pattern fragments that do not contain any | charac-         B, etc. are complex pattern fragments that do not contain any | charac-
5897         ters at this level:         ters at this level:
5898    
5899           A (B(*THEN)C) | D           A (B(*THEN)C) | D
5900    
5901         If  A and B are matched, but there is a failure in C, matching does not         If A and B are matched, but there is a failure in C, matching does  not
5902         backtrack into A; instead it moves to the next alternative, that is, D.         backtrack into A; instead it moves to the next alternative, that is, D.
5903         However,  if the subpattern containing (*THEN) is given an alternative,         However, if the subpattern containing (*THEN) is given an  alternative,
5904         it behaves differently:         it behaves differently:
5905    
5906           A (B(*THEN)C | (*FAIL)) | D           A (B(*THEN)C | (*FAIL)) | D
5907    
5908         The effect of (*THEN) is now confined to the inner subpattern. After  a         The  effect of (*THEN) is now confined to the inner subpattern. After a
5909         failure in C, matching moves to (*FAIL), which causes the whole subpat-         failure in C, matching moves to (*FAIL), which causes the whole subpat-
5910         tern to fail because there are no more alternatives  to  try.  In  this         tern  to  fail  because  there are no more alternatives to try. In this
5911         case, matching does now backtrack into A.         case, matching does now backtrack into A.
5912    
5913         Note also that a conditional subpattern is not considered as having two         Note also that a conditional subpattern is not considered as having two
5914         alternatives, because only one is ever used.  In  other  words,  the  |         alternatives,  because  only  one  is  ever used. In other words, the |
5915         character in a conditional subpattern has a different meaning. Ignoring         character in a conditional subpattern has a different meaning. Ignoring
5916         white space, consider:         white space, consider:
5917    
5918           ^.*? (?(?=a) a | b(*THEN)c )           ^.*? (?(?=a) a | b(*THEN)c )
5919    
5920         If the subject is "ba", this pattern does not  match.  Because  .*?  is         If  the  subject  is  "ba", this pattern does not match. Because .*? is
5921         ungreedy,  it  initially  matches  zero characters. The condition (?=a)         ungreedy, it initially matches zero  characters.  The  condition  (?=a)
5922         then fails, the character "b" is matched,  but  "c"  is  not.  At  this         then  fails,  the  character  "b"  is  matched, but "c" is not. At this
5923         point,  matching does not backtrack to .*? as might perhaps be expected         point, matching does not backtrack to .*? as might perhaps be  expected
5924         from the presence of the | character.  The  conditional  subpattern  is         from  the  presence  of  the | character. The conditional subpattern is
5925         part of the single alternative that comprises the whole pattern, and so         part of the single alternative that comprises the whole pattern, and so
5926         the match fails. (If there was a backtrack into  .*?,  allowing  it  to         the  match  fails.  (If  there was a backtrack into .*?, allowing it to
5927         match "b", the match would succeed.)         match "b", the match would succeed.)
5928    
5929         The  verbs just described provide four different "strengths" of control         The verbs just described provide four different "strengths" of  control
5930         when subsequent matching fails. (*THEN) is the weakest, carrying on the         when subsequent matching fails. (*THEN) is the weakest, carrying on the
5931         match  at  the next alternative. (*PRUNE) comes next, failing the match         match at the next alternative. (*PRUNE) comes next, failing  the  match
5932         at the current starting position, but allowing an advance to  the  next         at  the  current starting position, but allowing an advance to the next
5933         character  (for an unanchored pattern). (*SKIP) is similar, except that         character (for an unanchored pattern). (*SKIP) is similar, except  that
5934         the advance may be more than one character. (*COMMIT) is the strongest,         the advance may be more than one character. (*COMMIT) is the strongest,
5935         causing the entire match to fail.         causing the entire match to fail.
5936    
# Line 5908  BACKTRACKING CONTROL Line 5940  BACKTRACKING CONTROL
5940    
5941           (A(*COMMIT)B(*THEN)C|D)           (A(*COMMIT)B(*THEN)C|D)
5942    
5943         Once  A  has  matched,  PCRE is committed to this match, at the current         Once A has matched, PCRE is committed to this  match,  at  the  current
5944         starting position. If subsequently B matches, but C does not, the  nor-         starting  position. If subsequently B matches, but C does not, the nor-
5945         mal (*THEN) action of trying the next alternative (that is, D) does not         mal (*THEN) action of trying the next alternative (that is, D) does not
5946         happen because (*COMMIT) overrides.         happen because (*COMMIT) overrides.
5947    
# Line 5928  AUTHOR Line 5960  AUTHOR
5960    
5961  REVISION  REVISION
5962    
5963         Last updated: 19 October 2011         Last updated: 29 November 2011
5964         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
5965  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5966    
5967    
5968  PCRESYNTAX(3)                                                    PCRESYNTAX(3)  PCRESYNTAX(3)                                                    PCRESYNTAX(3)
5969    
5970    
# Line 6301  REVISION Line 6333  REVISION
6333         Last updated: 21 November 2010         Last updated: 21 November 2010
6334         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
6335  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6336    
6337    
6338  PCREUNICODE(3)                                                  PCREUNICODE(3)  PCREUNICODE(3)                                                  PCREUNICODE(3)
6339    
6340    
# Line 6455  REVISION Line 6487  REVISION
6487         Last updated: 19 October 2011         Last updated: 19 October 2011
6488         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
6489  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6490    
6491    
6492  PCREJIT(3)                                                          PCREJIT(3)  PCREJIT(3)                                                          PCREJIT(3)
6493    
6494    
# Line 6497  AVAILABILITY OF JIT SUPPORT Line 6529  AVAILABILITY OF JIT SUPPORT
6529         been  fully  tested. If --enable-jit is set on an unsupported platform,         been  fully  tested. If --enable-jit is set on an unsupported platform,
6530         compilation fails.         compilation fails.
6531    
6532         A program can tell if JIT support is available by calling pcre_config()         A program that is linked with PCRE 8.20 or later can tell if  JIT  sup-
6533         with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available,         port  is  available  by  calling pcre_config() with the PCRE_CONFIG_JIT
6534         and 0 otherwise. However, a simple program does not need to check  this         option. The result is 1 when JIT is available, and  0  otherwise.  How-
6535         in order to use JIT. The API is implemented in a way that falls back to         ever, a simple program does not need to check this in order to use JIT.
6536         the ordinary PCRE code if JIT is not available.         The API is implemented in a way that falls back to  the  ordinary  PCRE
6537           code if JIT is not available.
6538    
6539           If  your program may sometimes be linked with versions of PCRE that are
6540           older than 8.20, but you want to use JIT when it is available, you  can
6541           test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT
6542           macro such as PCRE_CONFIG_JIT, for compile-time control of your code.
6543    
6544    
6545  SIMPLE USE OF JIT  SIMPLE USE OF JIT
# Line 6517  SIMPLE USE OF JIT Line 6555  SIMPLE USE OF JIT
6555               no longer needed instead of just freeing it yourself. This               no longer needed instead of just freeing it yourself. This
6556               ensures that any JIT data is also freed.               ensures that any JIT data is also freed.
6557    
6558           For  a  program  that may be linked with pre-8.20 versions of PCRE, you
6559           can insert
6560    
6561             #ifndef PCRE_STUDY_JIT_COMPILE
6562             #define PCRE_STUDY_JIT_COMPILE 0
6563             #endif
6564    
6565           so that no option is passed to pcre_study(),  and  then  use  something
6566           like this to free the study data:
6567    
6568             #ifdef PCRE_CONFIG_JIT
6569                 pcre_free_study(study_ptr);
6570             #else
6571                 pcre_free(study_ptr);
6572             #endif
6573    
6574         In  some circumstances you may need to call additional functions. These         In  some circumstances you may need to call additional functions. These
6575         are described in the  section  entitled  "Controlling  the  JIT  stack"         are described in the  section  entitled  "Controlling  the  JIT  stack"
6576         below.         below.
# Line 6555  UNSUPPORTED OPTIONS AND PATTERN ITEMS Line 6609  UNSUPPORTED OPTIONS AND PATTERN ITEMS
6609    
6610         The unsupported pattern items are:         The unsupported pattern items are:
6611    
6612           \C            match a single byte; not supported in UTF-8 mode           \C             match a single byte; not supported in UTF-8 mode
6613           (?Cn)          callouts           (?Cn)          callouts
          (?(<name>)...  conditional test on setting of a named subpattern  
          (?(R)...       conditional test on whole pattern recursion  
          (?(Rn)...      conditional test on recursion, by number  
          (?(R&name)...  conditional test on recursion, by name  
6614           (*COMMIT)      )           (*COMMIT)      )
6615           (*MARK)        )           (*MARK)        )
6616           (*PRUNE)       ) the backtracking control verbs           (*PRUNE)       ) the backtracking control verbs
# Line 6609  CONTROLLING THE JIT STACK Line 6659  CONTROLLING THE JIT STACK
6659         large  or  complicated  patterns  need  more  than  this.   The   error         large  or  complicated  patterns  need  more  than  this.   The   error
6660         PCRE_ERROR_JIT_STACKLIMIT  is  given  when  there  is not enough stack.         PCRE_ERROR_JIT_STACKLIMIT  is  given  when  there  is not enough stack.
6661         Three functions are provided for managing blocks of memory for  use  as         Three functions are provided for managing blocks of memory for  use  as
6662         JIT stacks.         JIT  stacks. There is further discussion about the use of JIT stacks in
6663           the section entitled "JIT stack FAQ" below.
6664    
6665         The  pcre_jit_stack_alloc() function creates a JIT stack. Its arguments         The pcre_jit_stack_alloc() function creates a JIT stack. Its  arguments
6666         are a starting size and a maximum size, and it returns a pointer to  an         are  a starting size and a maximum size, and it returns a pointer to an
6667         opaque  structure of type pcre_jit_stack, or NULL if there is an error.         opaque structure of type pcre_jit_stack, or NULL if there is an  error.
6668         The pcre_jit_stack_free() function can be used to free a stack that  is         The  pcre_jit_stack_free() function can be used to free a stack that is
6669         no  longer  needed.  (For  the technically minded: the address space is         no longer needed. (For the technically minded:  the  address  space  is
6670         allocated by mmap or VirtualAlloc.)         allocated by mmap or VirtualAlloc.)
6671    
6672         JIT uses far less memory for recursion than the interpretive code,  and         JIT  uses far less memory for recursion than the interpretive code, and
6673         a  maximum  stack size of 512K to 1M should be more than enough for any         a maximum stack size of 512K to 1M should be more than enough  for  any
6674         pattern.         pattern.
6675    
6676         The pcre_assign_jit_stack() function specifies  which  stack  JIT  code         The  pcre_assign_jit_stack()  function  specifies  which stack JIT code
6677         should use. Its arguments are as follows:         should use. Its arguments are as follows:
6678    
6679           pcre_extra         *extra           pcre_extra         *extra
6680           pcre_jit_callback  callback           pcre_jit_callback  callback
6681           void               *data           void               *data
6682    
6683         The  extra  argument  must  be  the  result  of studying a pattern with         The extra argument must be  the  result  of  studying  a  pattern  with
6684         PCRE_STUDY_JIT_COMPILE. There are three cases for  the  values  of  the         PCRE_STUDY_JIT_COMPILE.  There  are  three  cases for the values of the
6685         other two arguments:         other two arguments:
6686    
6687           (1) If callback is NULL and data is NULL, an internal 32K block           (1) If callback is NULL and data is NULL, an internal 32K block
# Line 6645  CONTROLLING THE JIT STACK Line 6696  CONTROLLING THE JIT STACK
6696               is used; otherwise the return value must be a valid JIT stack,               is used; otherwise the return value must be a valid JIT stack,
6697               the result of calling pcre_jit_stack_alloc().               the result of calling pcre_jit_stack_alloc().
6698    
6699         You  may  safely assign the same JIT stack to more than one pattern, as         You may safely assign the same JIT stack to more than one  pattern,  as
6700         long as they are all matched sequentially in the same thread. In a mul-         long as they are all matched sequentially in the same thread. In a mul-
6701         tithread application, each thread must use its own JIT stack.         tithread application, each thread must use its own JIT stack.
6702    
6703         Strictly  speaking, even more is allowed. You can assign the same stack         Strictly speaking, even more is allowed. You can assign the same  stack
6704         to any number of patterns as long as they are not used for matching  by         to  any number of patterns as long as they are not used for matching by
6705         multiple threads at the same time. For example, you can assign the same         multiple threads at the same time. For example, you can assign the same
6706         stack to all compiled patterns, and use a global mutex in the  callback         stack  to all compiled patterns, and use a global mutex in the callback
6707         to wait until the stack is available for use. However, this is an inef-         to wait until the stack is available for use. However, this is an inef-
6708         ficient solution, and not recommended.         ficient solution, and not recommended.
6709    
6710         This is a suggestion for how  a  typical  multithreaded  program  might         This  is  a  suggestion  for  how a typical multithreaded program might
6711         operate:         operate:
6712    
6713           During thread initialization           During thread initialization
# Line 6668  CONTROLLING THE JIT STACK Line 6719  CONTROLLING THE JIT STACK
6719           Use a one-line callback function           Use a one-line callback function
6720             return thread_local_var             return thread_local_var
6721    
6722         All  the  functions  described in this section do nothing if JIT is not         All the functions described in this section do nothing if  JIT  is  not
6723         available, and pcre_assign_jit_stack() does nothing  unless  the  extra         available,  and  pcre_assign_jit_stack()  does nothing unless the extra
6724         argument  is  non-NULL  and  points  to  a pcre_extra block that is the         argument is non-NULL and points to  a  pcre_extra  block  that  is  the
6725         result of a successful study with PCRE_STUDY_JIT_COMPILE.         result of a successful study with PCRE_STUDY_JIT_COMPILE.
6726    
6727    
6728    JIT STACK FAQ
6729    
6730           (1) Why do we need JIT stacks?
6731    
6732           PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack
6733           where the local data of the current node is pushed before checking its
6734           child nodes. Allocating real machine stack space is difficult on some
6735           platforms. For example, on PowerPC the stack chain must be updated
6736           every time the stack is extended. Although this is possible, the
6737           update overhead reduces performance. So we do the recursion in memory.
6738    
6739           (2) Why don't we simply allocate blocks of memory with malloc()?
6740    
6741           Modern operating systems have a  nice  feature:  they  can  reserve  an
6742           address space instead of allocating memory. We can safely allocate mem-
6743           ory pages inside this address space, so the stack  could  grow  without
6744           moving memory data (this is important because of pointers). Thus we can
6745           allocate 1M address space, and use only a single memory  page  (usually
6746           4K)  if  that is enough. However, we can still grow up to 1M anytime if
6747           needed.
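
The reserve-then-commit idea described in this answer can be sketched with
POSIX mmap() and mprotect(). This is only an illustration of the general
technique, not PCRE's actual implementation: the region type and the
region_* function names are hypothetical, and the sketch assumes a POSIX
system that supports MAP_ANONYMOUS.

```c
#define _DEFAULT_SOURCE          /* exposes MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical growable region; the names are illustrative, not PCRE's. */
typedef struct {
    unsigned char *base;   /* start of the reserved address range */
    size_t reserved;       /* bytes of address space reserved */
    size_t committed;      /* bytes currently readable and writable */
} region;

/* Reserve max_size bytes of address space without making them usable. */
static int region_reserve(region *r, size_t max_size)
{
    void *p = mmap(NULL, max_size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    r->base = p;
    r->reserved = max_size;
    r->committed = 0;
    return 0;
}

/* Make at least `needed` bytes usable, rounded up to whole pages.
   The base address never changes, so existing pointers stay valid. */
static int region_grow(region *r, size_t needed)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t target;

    assert(r->base != NULL);
    if (needed > r->reserved)
        return -1;                       /* beyond the reserved range */
    target = (needed + page - 1) / page * page;
    if (target > r->reserved)
        target = r->reserved;
    if (target <= r->committed)
        return 0;                        /* already large enough */
    if (mprotect(r->base + r->committed, target - r->committed,
                 PROT_READ | PROT_WRITE) != 0)
        return -1;
    r->committed = target;
    return 0;
}

/* Give the whole range back to the operating system. */
static void region_release(region *r)
{
    munmap(r->base, r->reserved);
    r->base = NULL;
    r->reserved = r->committed = 0;
}
```

Because the base address is fixed when the range is reserved, a pointer
taken after the first region_grow() call is still valid after later calls,
which realloc() would not guarantee.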
6748    
6749           (3) Who "owns" a JIT stack?
6750    
6751           The owner of the stack is the user program, not the JIT studied pattern
6752           or anything else. The user program must ensure that while a stack is
6753           in use by pcre_exec() (that is, it is assigned to the pattern currently
6754           running), it is not used by any other thread (to avoid overwriting the
6755           same memory area). The best practice for multithreaded programs is to
6756           allocate a stack for each thread, and return this stack through the
6757           JIT callback function.
6758    
6759           (4) When should a JIT stack be freed?
6760    
6761           You can free a JIT stack at any time, as long as it will not be used by
6762           pcre_exec()  again.  When  you  assign  the  stack to a pattern, only a
6763           pointer is set. There is no reference counting or any other magic.  You
6764           can  free  the  patterns  and stacks in any order, anytime. Just do not
6765           call pcre_exec() with a pattern pointing to an already freed stack, as
6766           that can cause a segmentation fault. (Also, do not free a stack that
6767           is currently in use by pcre_exec() in another thread.) You can also
6768           replace the stack for a pattern at any time. You can even free the
6769           previous stack before assigning a replacement.
6770    
6771           (5) Should I allocate/free a  stack  every  time  before/after  calling
6772           pcre_exec()?
6773    
6774           No, because this is too costly in terms of resources. However, you
6775           could implement a scheme that releases the stack if it has not been
6776           used for, say, two minutes. The JIT callback can help to achieve this
6777           without keeping a list of the currently JIT studied patterns.
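
One way the idle-release idea in this answer might look is sketched below.
The callback allocates a stack lazily and records when it was last
requested; a second function, called by the owning thread between matches,
frees the stack once it has been idle for too long. stack_alloc_standin()
and stack_free_standin() are hypothetical stand-ins for
pcre_jit_stack_alloc() and pcre_jit_stack_free() so that the sketch
compiles without PCRE, and the void * return type stands in for
pcre_jit_stack *. The other names are illustrative too, and C11
thread-local storage is assumed.

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

#define IDLE_LIMIT_SECONDS 120   /* "two minutes", as in the FAQ answer */

/* Hypothetical stand-ins for the real JIT stack allocation functions. */
static void *stack_alloc_standin(void)
{
    return malloc(32 * 1024);
}

static void stack_free_standin(void *s)
{
    assert(s != NULL);
    free(s);
}

/* Per-thread state: each thread owns at most one stack. */
static _Thread_local void *thread_stack;
static _Thread_local time_t last_used;

/* Shaped like a JIT callback: allocate on first use, note the time. */
static void *jit_stack_callback(void *unused)
{
    (void)unused;
    if (thread_stack == NULL)
        thread_stack = stack_alloc_standin();
    last_used = time(NULL);
    return thread_stack;
}

/* Call from the owning thread between matches, never during one.
   Returns 1 if the stack was released, 0 otherwise. */
static int release_if_idle(time_t now)
{
    if (thread_stack == NULL)
        return 0;
    if (difftime(now, last_used) < IDLE_LIMIT_SECONDS)
        return 0;
    stack_free_standin(thread_stack);
    thread_stack = NULL;   /* the next callback allocates a new stack */
    return 1;
}
```

No list of studied patterns is needed because the patterns hold only the
callback, not a stack pointer. release_if_idle() must be called only when
no match is running in that thread, since freeing a stack that pcre_exec()
is using is an error.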
6778    
6779           (6) OK, the stack is for long term memory allocation. But what  happens
6780           if  a pattern causes stack overflow with a stack of 1M? Is that 1M kept
6781           until the stack is freed?
6782    
6783           Especially on embedded systems, it might be a good idea to release
6784           memory sometimes without freeing the stack. There is no API for this
6785           at the moment. A function that returns the memory currently allocated
6786           for a stack, and another that releases memory (shrinking the stack),
6787           would probably be a good idea if someone needs this.
6788    
6789           (7) This is too much of a headache. Isn't there any better solution for
6790           JIT stack handling?
6791    
6792           No,  thanks to Windows. If POSIX threads were used everywhere, we could
6793           throw out this complicated API.
6794    
6795    
6796  EXAMPLE CODE  EXAMPLE CODE
6797    
6798         This is a single-threaded example that specifies a  JIT  stack  without         This is a single-threaded example that specifies a  JIT  stack  without
# Line 6705  SEE ALSO Line 6824  SEE ALSO
6824    
6825  AUTHOR  AUTHOR
6826    
6827         Philip Hazel         Philip Hazel (FAQ by Zoltan Herczeg)
6828         University Computing Service         University Computing Service
6829         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
6830    
6831    
6832  REVISION  REVISION
6833    
6834         Last updated: 19 October 2011         Last updated: 26 November 2011
6835         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
6836  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6837    
6838    
6839  PCREPARTIAL(3)                                                  PCREPARTIAL(3)  PCREPARTIAL(3)                                                  PCREPARTIAL(3)
6840    
6841    
# Line 7137  REVISION Line 7256  REVISION
7256         Last updated: 26 August 2011         Last updated: 26 August 2011
7257         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
7258  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7259    
7260    
7261  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)  PCREPRECOMPILE(3)                                            PCREPRECOMPILE(3)
7262    
7263    
# Line 7268  REVISION Line 7387  REVISION
7387         Last updated: 26 August 2011         Last updated: 26 August 2011
7388         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
7389  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7390    
7391    
7392  PCREPERFORM(3)                                                  PCREPERFORM(3)  PCREPERFORM(3)                                                  PCREPERFORM(3)
7393    
7394    
# Line 7436  REVISION Line 7555  REVISION
7555         Last updated: 16 May 2010         Last updated: 16 May 2010
7556         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
7557  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7558    
7559    
7560  PCREPOSIX(3)                                                      PCREPOSIX(3)  PCREPOSIX(3)                                                      PCREPOSIX(3)
7561    
7562    
# Line 7699  REVISION Line 7818  REVISION
7818         Last updated: 16 May 2010         Last updated: 16 May 2010
7819         Copyright (c) 1997-2010 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
7820  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
7821    
7822    
7823  PCRECPP(3)                                                          PCRECPP(3)  PCRECPP(3)                                                          PCRECPP(3)
7824    
7825    
# Line 8041  REVISION Line 8160  REVISION
8160         Last updated: 17 March 2009         Last updated: 17 March 2009
8161         Minor typo fixed: 25 July 2011         Minor typo fixed: 25 July 2011
8162  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
8163    
8164    
8165  PCRESAMPLE(3)                                                    PCRESAMPLE(3)  PCRESAMPLE(3)                                                    PCRESAMPLE(3)
8166    
8167    
# Line 8153  SIZE AND OTHER LIMITATIONS Line 8272  SIZE AND OTHER LIMITATIONS
8272         There is no limit to the number of parenthesized subpatterns, but there         There is no limit to the number of parenthesized subpatterns, but there
8273         can be no more than 65535 capturing subpatterns.         can be no more than 65535 capturing subpatterns.
8274    
8275           There is a limit to the number of forward references to subsequent sub-
8276           patterns of around 200,000.  Repeated  forward  references  with  fixed
8277           upper  limits,  for example, (?2){0,100} when subpattern number 2 is to
8278           the right, are included in the count. There is no limit to  the  number
8279           of backward references.
8280    
8281         The maximum length of name for a named subpattern is 32 characters, and         The maximum length of name for a named subpattern is 32 characters, and
8282         the maximum number of named subpatterns is 10000.         the maximum number of named subpatterns is 10000.
8283    
# Line 8173  AUTHOR Line 8298  AUTHOR
8298    
8299  REVISION  REVISION
8300    
8301         Last updated: 24 August 2011         Last updated: 30 November 2011
8302         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
8303  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
8304    
8305    
8306  PCRESTACK(3)                                                      PCRESTACK(3)  PCRESTACK(3)                                                      PCRESTACK(3)
8307    
8308    
# Line 8337  REVISION Line 8462  REVISION
8462         Last updated: 26 August 2011         Last updated: 26 August 2011
8463         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
8464  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
8465    
8466    
