Diff of /code/trunk/doc/pcre.txt

revision 733 by ph10, Tue Oct 11 10:29:36 2011 UTC | revision 738 by ph10, Fri Oct 21 09:04:01 2011 UTC
# Line 4142  FULL STOP (PERIOD, DOT) AND \N Line 4142  FULL STOP (PERIOD, DOT) AND \N
4142  MATCHING A SINGLE BYTE  MATCHING A SINGLE BYTE
4143    
4144         Outside a character class, the escape sequence \C matches any one byte,         Outside a character class, the escape sequence \C matches any one byte,
4145         both  in  and  out  of  UTF-8 mode. Unlike a dot, it always matches any         both  in  and  out of UTF-8 mode. Unlike a dot, it always matches line-
4146         line-ending characters. The feature is provided in  Perl  in  order  to         ending characters. The feature is provided in Perl in  order  to  match
4147         match  individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-         individual  bytes  in UTF-8 mode, but it is unclear how it can usefully
4148         acters into individual bytes, the rest of the string may start  with  a         be used. Because \C breaks up characters into individual bytes,  match-
4149         malformed  UTF-8  character. For this reason, the \C escape sequence is         ing  one  byte  with \C in UTF-8 mode means that the rest of the string
4150         best avoided.         may start with a malformed UTF-8 character. This has undefined results,
4151           because  PCRE  assumes that it is dealing with valid UTF-8 strings (and
4152           by default it checks  this  at  the  start  of  processing  unless  the
4153           PCRE_NO_UTF8_CHECK option is used).
4154    
4155         PCRE does not allow \C to appear in  lookbehind  assertions  (described         PCRE  does  not  allow \C to appear in lookbehind assertions (described
4156         below),  because  in UTF-8 mode this would make it impossible to calcu-         below), because in UTF-8 mode this would make it impossible  to  calcu-
4157         late the length of the lookbehind.         late the length of the lookbehind.
4158    
4159           In  general, the \C escape sequence is best avoided in UTF-8 mode. How-
4160           ever, one way of using it that avoids the problem  of  malformed  UTF-8
4161           characters  is to use a lookahead to check the length of the next char-
4162           acter, as in this pattern (ignore white space and line breaks):
4163    
4164             (?| (?=[\x00-\x7f])(\C) |
4165                 (?=[\x80-\x{7ff}])(\C)(\C) |
4166                 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
4167                 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
4168    
4169           A group that starts with (?| resets the capturing  parentheses  numbers
4170           in  each  alternative  (see  "Duplicate Subpattern Numbers" below). The
4171           assertions at the start of each branch check the next  UTF-8  character
4172           for  values  whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
4173           character's individual bytes are then captured by the appropriate  num-
4174           ber of groups.
4175    
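The four code point ranges tested by the lookaheads in the revised text correspond to the UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. The following is a minimal sketch in plain Python, not PCRE: the helper name utf8_length is hypothetical, and the sample characters are chosen only for illustration. It checks the range boundaries against Python's own UTF-8 encoder.

```python
# Sketch only: plain Python, not PCRE. utf8_length is a hypothetical helper
# whose branches mirror the four lookahead ranges in the pattern.
def utf8_length(code_point: int) -> int:
    """Return how many bytes the UTF-8 encoding of code_point uses."""
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def encoded_length(character: str) -> int:
    """Ask Python's encoder for the real UTF-8 length of one character."""
    return len(character.encode("utf-8"))


if __name__ == "__main__":
    # One sample character from each of the four ranges.
    for character in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        assert utf8_length(ord(character)) == encoded_length(character)
        print(f"U+{ord(character):04X} uses {encoded_length(character)} byte(s)")
```

Each branch of the (?| group captures exactly as many \C bytes as this helper returns for the character at the current position, which is why the match never stops in the middle of a character.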
4176    
4177  SQUARE BRACKETS AND CHARACTER CLASSES  SQUARE BRACKETS AND CHARACTER CLASSES
4178    
# Line 4160  SQUARE BRACKETS AND CHARACTER CLASSES Line 4180  SQUARE BRACKETS AND CHARACTER CLASSES
4180         closing square bracket. A closing square bracket on its own is not spe-         closing square bracket. A closing square bracket on its own is not spe-
4181         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,         cial by default.  However, if the PCRE_JAVASCRIPT_COMPAT option is set,
4182         a lone closing square bracket causes a compile-time error. If a closing         a lone closing square bracket causes a compile-time error. If a closing
4183         square bracket is required as a member of the class, it should  be  the         square  bracket  is required as a member of the class, it should be the
4184         first  data  character  in  the  class (after an initial circumflex, if         first data character in the class  (after  an  initial  circumflex,  if
4185         present) or escaped with a backslash.         present) or escaped with a backslash.
4186    
4187         A character class matches a single character in the subject.  In  UTF-8         A  character  class matches a single character in the subject. In UTF-8
4188         mode, the character may be more than one byte long. A matched character         mode, the character may be more than one byte long. A matched character
4189         must be in the set of characters defined by the class, unless the first         must be in the set of characters defined by the class, unless the first
4190         character  in  the  class definition is a circumflex, in which case the         character in the class definition is a circumflex, in  which  case  the
4191         subject character must not be in the set defined by  the  class.  If  a         subject  character  must  not  be in the set defined by the class. If a
4192         circumflex  is actually required as a member of the class, ensure it is         circumflex is actually required as a member of the class, ensure it  is
4193         not the first character, or escape it with a backslash.         not the first character, or escape it with a backslash.
4194    
4195         For example, the character class [aeiou] matches any lower case  vowel,         For  example, the character class [aeiou] matches any lower case vowel,
4196         while  [^aeiou]  matches  any character that is not a lower case vowel.         while [^aeiou] matches any character that is not a  lower  case  vowel.
4197         Note that a circumflex is just a convenient notation for specifying the         Note that a circumflex is just a convenient notation for specifying the
4198         characters  that  are in the class by enumerating those that are not. A         characters that are in the class by enumerating those that are  not.  A
4199         class that starts with a circumflex is not an assertion; it still  con-         class  that starts with a circumflex is not an assertion; it still con-
4200         sumes  a  character  from the subject string, and therefore it fails if         sumes a character from the subject string, and therefore  it  fails  if
4201         the current pointer is at the end of the string.         the current pointer is at the end of the string.
4202    
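The point that a negated class still consumes a character can be checked with a short sketch. It uses Python's re module as a stand-in, on the assumption that it behaves like PCRE for these simple classes. It is not PCRE itself, and the function name is hypothetical.

```python
import re

# Assumption: Python's re agrees with PCRE for these simple classes.
NON_VOWEL = re.compile(r"[^aeiou]")


def non_vowel_matches(subject: str) -> bool:
    """Return True if [^aeiou] matches the whole of subject."""
    return NON_VOWEL.fullmatch(subject) is not None


if __name__ == "__main__":
    print(non_vowel_matches("x"))  # a consonant is in the negated class
    print(non_vowel_matches("e"))  # a lower case vowel is excluded
    print(non_vowel_matches(""))   # nothing left to consume, so no match
```

The last call is the interesting one: the empty string contains no vowel, yet the class fails because there is no character for it to consume.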
4203         In UTF-8 mode, characters with values greater than 255 can be  included         In  UTF-8 mode, characters with values greater than 255 can be included
4204         in  a  class as a literal string of bytes, or by using the \x{ escaping         in a class as a literal string of bytes, or by using the  \x{  escaping
4205         mechanism.         mechanism.
4206    
4207         When caseless matching is set, any letters in a  class  represent  both         When  caseless  matching  is set, any letters in a class represent both
4208         their  upper  case  and lower case versions, so for example, a caseless         their upper case and lower case versions, so for  example,  a  caseless
4209         [aeiou] matches "A" as well as "a", and a caseless  [^aeiou]  does  not         [aeiou]  matches  "A"  as well as "a", and a caseless [^aeiou] does not
4210         match  "A", whereas a caseful version would. In UTF-8 mode, PCRE always         match "A", whereas a caseful version would. In UTF-8 mode, PCRE  always
4211         understands the concept of case for characters whose  values  are  less         understands  the  concept  of case for characters whose values are less
4212         than  128, so caseless matching is always possible. For characters with         than 128, so caseless matching is always possible. For characters  with
4213         higher values, the concept of case is supported  if  PCRE  is  compiled         higher  values,  the  concept  of case is supported if PCRE is compiled
4214         with  Unicode  property support, but not otherwise.  If you want to use         with Unicode property support, but not otherwise.  If you want  to  use
4215         caseless matching in UTF8-mode for characters 128 and above,  you  must         caseless  matching  in UTF8-mode for characters 128 and above, you must
4216         ensure  that  PCRE is compiled with Unicode property support as well as         ensure that PCRE is compiled with Unicode property support as  well  as
4217         with UTF-8 support.         with UTF-8 support.
4218    
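The caseless behaviour described for [aeiou] and [^aeiou] can be tried out as follows. This is a sketch using Python's re module, with re.IGNORECASE standing in for PCRE_CASELESS. That equivalence is an assumption that holds for ASCII letters, and the helper name is hypothetical.

```python
import re

# Assumption: re.IGNORECASE plays the role of PCRE_CASELESS for ASCII letters.
CASELESS_VOWEL = re.compile(r"[aeiou]", re.IGNORECASE)
CASELESS_NON_VOWEL = re.compile(r"[^aeiou]", re.IGNORECASE)
CASEFUL_NON_VOWEL = re.compile(r"[^aeiou]")


def accepts(pattern: re.Pattern, subject: str) -> bool:
    """Return True if pattern matches the whole of subject."""
    return pattern.fullmatch(subject) is not None


if __name__ == "__main__":
    print(accepts(CASELESS_VOWEL, "A"))      # caseless class accepts "A"
    print(accepts(CASELESS_NON_VOWEL, "A"))  # caseless negated class rejects "A"
    print(accepts(CASEFUL_NON_VOWEL, "A"))   # caseful negated class accepts "A"
```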
4219         Characters that might indicate line breaks are  never  treated  in  any         Characters  that  might  indicate  line breaks are never treated in any
4220         special  way  when  matching  character  classes,  whatever line-ending         special way  when  matching  character  classes,  whatever  line-ending
4221         sequence is in  use,  and  whatever  setting  of  the  PCRE_DOTALL  and         sequence  is  in  use,  and  whatever  setting  of  the PCRE_DOTALL and
4222         PCRE_MULTILINE options is used. A class such as [^a] always matches one         PCRE_MULTILINE options is used. A class such as [^a] always matches one
4223         of these characters.         of these characters.
4224    
4225         The minus (hyphen) character can be used to specify a range of  charac-         The  minus (hyphen) character can be used to specify a range of charac-
4226         ters  in  a  character  class.  For  example,  [d-m] matches any letter         ters in a character  class.  For  example,  [d-m]  matches  any  letter
4227         between d and m, inclusive. If a  minus  character  is  required  in  a         between  d  and  m,  inclusive.  If  a minus character is required in a
4228         class,  it  must  be  escaped  with a backslash or appear in a position         class, it must be escaped with a backslash  or  appear  in  a  position
4229         where it cannot be interpreted as indicating a range, typically as  the         where  it cannot be interpreted as indicating a range, typically as the
4230         first or last character in the class.         first or last character in the class.
4231    
4232         It is not possible to have the literal character "]" as the end charac-         It is not possible to have the literal character "]" as the end charac-
4233         ter of a range. A pattern such as [W-]46] is interpreted as a class  of         ter  of a range. A pattern such as [W-]46] is interpreted as a class of
4234         two  characters ("W" and "-") followed by a literal string "46]", so it         two characters ("W" and "-") followed by a literal string "46]", so  it
4235         would match "W46]" or "-46]". However, if the "]"  is  escaped  with  a         would  match  "W46]"  or  "-46]". However, if the "]" is escaped with a
4236         backslash  it is interpreted as the end of range, so [W-\]46] is inter-         backslash it is interpreted as the end of range, so [W-\]46] is  inter-
4237         preted as a class containing a range followed by two other  characters.         preted  as a class containing a range followed by two other characters.
4238         The  octal or hexadecimal representation of "]" can also be used to end         The octal or hexadecimal representation of "]" can also be used to  end
4239         a range.         a range.
4240    
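The difference between [W-]46] and [W-\]46] is easy to misread, so here is a small sketch. It relies on Python's re module parsing these two classes the same way PCRE does, which is an assumption about the stand-in. The variable names are hypothetical.

```python
import re

# Assumption: Python's re parses these two classes as PCRE does.
TWO_CHAR_CLASS = re.compile(r"[W-]46]")   # class of "W" and "-", then literal "46]"
RANGE_CLASS = re.compile(r"[W-\]46]")     # range "W" to "]", plus "4" and "6"

if __name__ == "__main__":
    # The unescaped form matches a four-character string.
    for subject in ["W46]", "-46]", "X46]"]:
        print(subject, TWO_CHAR_CLASS.fullmatch(subject) is not None)

    # The escaped form matches a single character from the range or "4"/"6".
    for subject in ["W", "Z", "[", "]", "4", "6", "-"]:
        print(subject, RANGE_CLASS.fullmatch(subject) is not None)
```

In the second loop only "-" is rejected, because once the "]" is escaped the hyphen forms a range and is no longer a member of the class.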
4241         Ranges operate in the collating sequence of character values. They  can         Ranges  operate in the collating sequence of character values. They can
4242         also   be  used  for  characters  specified  numerically,  for  example         also  be  used  for  characters  specified  numerically,  for   example
4243         [\000-\037]. In UTF-8 mode, ranges can include characters whose  values         [\000-\037].  In UTF-8 mode, ranges can include characters whose values
4244         are greater than 255, for example [\x{100}-\x{2ff}].         are greater than 255, for example [\x{100}-\x{2ff}].
4245    
4246         If a range that includes letters is used when caseless matching is set,         If a range that includes letters is used when caseless matching is set,
4247         it matches the letters in either case. For example, [W-c] is equivalent         it matches the letters in either case. For example, [W-c] is equivalent
4248         to  [][\\^_`wxyzabc],  matched  caselessly,  and  in non-UTF-8 mode, if         to [][\\^_`wxyzabc], matched caselessly,  and  in  non-UTF-8  mode,  if
4249         character tables for a French locale are in  use,  [\xc8-\xcb]  matches         character  tables  for  a French locale are in use, [\xc8-\xcb] matches
4250         accented  E  characters in both cases. In UTF-8 mode, PCRE supports the         accented E characters in both cases. In UTF-8 mode, PCRE  supports  the
4251         concept of case for characters with values greater than 128  only  when         concept  of  case for characters with values greater than 128 only when
4252         it is compiled with Unicode property support.         it is compiled with Unicode property support.
4253    
4254         The  character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v, \V,         The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,  \V,
4255         \w, and \W may appear in a character class, and add the characters that         \w, and \W may appear in a character class, and add the characters that
4256         they  match to the class. For example, [\dABCDEF] matches any hexadeci-         they match to the class. For example, [\dABCDEF] matches any  hexadeci-
4257         mal digit. In UTF-8 mode, the PCRE_UCP option affects the  meanings  of         mal  digit.  In UTF-8 mode, the PCRE_UCP option affects the meanings of
4258         \d,  \s,  \w  and  their upper case partners, just as it does when they         \d, \s, \w and their upper case partners, just as  it  does  when  they
4259         appear outside a character class, as described in the section  entitled         appear  outside a character class, as described in the section entitled
4260         "Generic character types" above. The escape sequence \b has a different         "Generic character types" above. The escape sequence \b has a different
4261         meaning inside a character class; it matches the  backspace  character.         meaning  inside  a character class; it matches the backspace character.
4262         The  sequences  \B,  \N,  \R, and \X are not special inside a character         The sequences \B, \N, \R, and \X are not  special  inside  a  character
4263         class. Like any other unrecognized escape sequences, they  are  treated         class.  Like  any other unrecognized escape sequences, they are treated
4264         as  the literal characters "B", "N", "R", and "X" by default, but cause         as the literal characters "B", "N", "R", and "X" by default, but  cause
4265         an error if the PCRE_EXTRA option is set.         an error if the PCRE_EXTRA option is set.
4266    
4267         A circumflex can conveniently be used with  the  upper  case  character         A  circumflex  can  conveniently  be used with the upper case character
4268         types  to specify a more restricted set of characters than the matching         types to specify a more restricted set of characters than the  matching
4269         lower case type.  For example, the class [^\W_] matches any  letter  or         lower  case  type.  For example, the class [^\W_] matches any letter or
4270         digit, but not underscore, whereas [\w] includes underscore. A positive         digit, but not underscore, whereas [\w] includes underscore. A positive
4271         character class should be read as "something OR something OR ..." and a         character class should be read as "something OR something OR ..." and a
4272         negative class as "NOT something AND NOT something AND NOT ...".         negative class as "NOT something AND NOT something AND NOT ...".
4273    
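The effect of [^\W_] compared with [\w] can be seen in a short sketch. It uses Python's re module with the re.ASCII flag, on the assumption that this keeps \w close to PCRE's default meaning without PCRE_UCP. The sample string and the names are hypothetical.

```python
import re

# Assumption: with re.ASCII, \w means [a-zA-Z0-9_], as in PCRE without PCRE_UCP.
LETTER_OR_DIGIT = re.compile(r"[^\W_]", re.ASCII)
WORD_CHARACTER = re.compile(r"[\w]", re.ASCII)


def members(pattern: re.Pattern, subject: str) -> list:
    """List every character of subject that the class matches."""
    return pattern.findall(subject)


if __name__ == "__main__":
    sample = "a_1 b-"
    print(members(LETTER_OR_DIGIT, sample))  # underscore is excluded
    print(members(WORD_CHARACTER, sample))   # underscore is included
```

Read as "NOT non-word AND NOT underscore", the negated class keeps exactly the letters and digits.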
4274         The  only  metacharacters  that are recognized in character classes are         The only metacharacters that are recognized in  character  classes  are
4275         backslash, hyphen (only where it can be  interpreted  as  specifying  a         backslash,  hyphen  (only  where  it can be interpreted as specifying a
4276         range),  circumflex  (only  at the start), opening square bracket (only         range), circumflex (only at the start), opening  square  bracket  (only
4277         when it can be interpreted as introducing a POSIX class name - see  the         when  it can be interpreted as introducing a POSIX class name - see the
4278         next  section),  and  the  terminating closing square bracket. However,         next section), and the terminating  closing  square  bracket.  However,
4279         escaping other non-alphanumeric characters does no harm.         escaping other non-alphanumeric characters does no harm.
4280    
4281    
4282  POSIX CHARACTER CLASSES  POSIX CHARACTER CLASSES
4283    
4284         Perl supports the POSIX notation for character classes. This uses names         Perl supports the POSIX notation for character classes. This uses names
4285         enclosed  by  [: and :] within the enclosing square brackets. PCRE also         enclosed by [: and :] within the enclosing square brackets.  PCRE  also
4286         supports this notation. For example,         supports this notation. For example,
4287    
4288           [01[:alpha:]%]           [01[:alpha:]%]
# Line 4285  POSIX CHARACTER CLASSES Line 4305  POSIX CHARACTER CLASSES
4305           word     "word" characters (same as \w)           word     "word" characters (same as \w)
4306           xdigit   hexadecimal digits           xdigit   hexadecimal digits
4307    
4308         The  "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),         The "space" characters are HT (9), LF (10), VT (11), FF (12), CR  (13),
4309         and space (32). Notice that this list includes the VT  character  (code         and  space  (32). Notice that this list includes the VT character (code
4310         11). This makes "space" different to \s, which does not include VT (for         11). This makes "space" different to \s, which does not include VT (for
4311         Perl compatibility).         Perl compatibility).
4312    
4313         The name "word" is a Perl extension, and "blank"  is  a  GNU  extension         The  name  "word"  is  a Perl extension, and "blank" is a GNU extension
4314         from  Perl  5.8. Another Perl extension is negation, which is indicated         from Perl 5.8. Another Perl extension is negation, which  is  indicated
4315         by a ^ character after the colon. For example,         by a ^ character after the colon. For example,
4316    
4317           [12[:^digit:]]           [12[:^digit:]]
4318    
4319         matches "1", "2", or any non-digit. PCRE (and Perl) also recognize  the         matches  "1", "2", or any non-digit. PCRE (and Perl) also recognize the
4320         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but         POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
4321         these are not supported, and an error is given if they are encountered.         these are not supported, and an error is given if they are encountered.
4322    
4323         By default, in UTF-8 mode, characters with values greater than  128  do         By  default,  in UTF-8 mode, characters with values greater than 128 do
4324         not  match any of the POSIX character classes. However, if the PCRE_UCP         not match any of the POSIX character classes. However, if the  PCRE_UCP
4325         option is passed to pcre_compile(), some of the classes are changed  so         option  is passed to pcre_compile(), some of the classes are changed so
4326         that Unicode character properties are used. This is achieved by replac-         that Unicode character properties are used. This is achieved by replac-
4327         ing the POSIX classes by other sequences, as follows:         ing the POSIX classes by other sequences, as follows:
4328    
# Line 4315  POSIX CHARACTER CLASSES Line 4335  POSIX CHARACTER CLASSES
4335           [:upper:]  becomes  \p{Lu}           [:upper:]  becomes  \p{Lu}
4336           [:word:]   becomes  \p{Xwd}           [:word:]   becomes  \p{Xwd}
4337    
4338         Negated versions, such as [:^alpha:] use \P instead of  \p.  The  other         Negated  versions,  such  as [:^alpha:] use \P instead of \p. The other
4339         POSIX classes are unchanged, and match only characters with code points         POSIX classes are unchanged, and match only characters with code points
4340         less than 128.         less than 128.
4341    
4342    
4343  VERTICAL BAR  VERTICAL BAR
4344    
4345         Vertical bar characters are used to separate alternative patterns.  For         Vertical  bar characters are used to separate alternative patterns. For
4346         example, the pattern         example, the pattern
4347    
4348           gilbert|sullivan           gilbert|sullivan
4349    
4350         matches  either "gilbert" or "sullivan". Any number of alternatives may         matches either "gilbert" or "sullivan". Any number of alternatives  may
4351         appear, and an empty  alternative  is  permitted  (matching  the  empty         appear,  and  an  empty  alternative  is  permitted (matching the empty
4352         string). The matching process tries each alternative in turn, from left         string). The matching process tries each alternative in turn, from left
4353         to right, and the first one that succeeds is used. If the  alternatives         to  right, and the first one that succeeds is used. If the alternatives
4354         are  within a subpattern (defined below), "succeeds" means matching the         are within a subpattern (defined below), "succeeds" means matching  the
4355         rest of the main pattern as well as the alternative in the subpattern.         rest of the main pattern as well as the alternative in the subpattern.
4356    
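The left-to-right rule, and what "succeeds" means inside a subpattern, can be sketched as follows. Python's re module is used here as a stand-in on the assumption that it orders alternatives as PCRE does. The helper name and the short patterns other than gilbert|sullivan are hypothetical.

```python
import re


def first_match(pattern: str, subject: str):
    """Return the text matched at the start of subject, or None."""
    found = re.match(pattern, subject)
    return found.group() if found else None


if __name__ == "__main__":
    # Either alternative may match.
    print(first_match(r"gilbert|sullivan", "sullivan"))

    # Alternatives are tried from left to right, so "a" wins over "ab".
    print(first_match(r"a|ab", "ab"))

    # Inside a subpattern the alternative must also let the rest match,
    # so the first branch is dropped in favour of "ab".
    print(re.match(r"(a|ab)c", "abc").group(1))

    # An empty alternative matches the empty string.
    print(repr(first_match(r"a|", "")))
```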
4357    
4358  INTERNAL OPTION SETTING  INTERNAL OPTION SETTING
4359    
4360         The settings of the  PCRE_CASELESS,  PCRE_MULTILINE,  PCRE_DOTALL,  and         The  settings  of  the  PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
4361         PCRE_EXTENDED  options  (which are Perl-compatible) can be changed from         PCRE_EXTENDED options (which are Perl-compatible) can be  changed  from
4362         within the pattern by  a  sequence  of  Perl  option  letters  enclosed         within  the  pattern  by  a  sequence  of  Perl option letters enclosed
4363         between "(?" and ")".  The option letters are         between "(?" and ")".  The option letters are
4364    
4365           i  for PCRE_CASELESS           i  for PCRE_CASELESS
# Line 4349  INTERNAL OPTION SETTING Line 4369  INTERNAL OPTION SETTING
4369    
4370         For example, (?im) sets caseless, multiline matching. It is also possi-         For example, (?im) sets caseless, multiline matching. It is also possi-
4371         ble to unset these options by preceding the letter with a hyphen, and a         ble to unset these options by preceding the letter with a hyphen, and a
4372         combined  setting and unsetting such as (?im-sx), which sets PCRE_CASE-         combined setting and unsetting such as (?im-sx), which sets  PCRE_CASE-
4373         LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and  PCRE_EXTENDED,         LESS  and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
4374         is  also  permitted.  If  a  letter  appears  both before and after the         is also permitted. If a  letter  appears  both  before  and  after  the
4375         hyphen, the option is unset.         hyphen, the option is unset.
4376    
4377         The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and  PCRE_EXTRA         The  PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
4378         can  be changed in the same way as the Perl-compatible options by using         can be changed in the same way as the Perl-compatible options by  using
4379         the characters J, U and X respectively.         the characters J, U and X respectively.
4380    
4381         When one of these option changes occurs at  top  level  (that  is,  not         When  one  of  these  option  changes occurs at top level (that is, not
4382         inside  subpattern parentheses), the change applies to the remainder of         inside subpattern parentheses), the change applies to the remainder  of
4383         the pattern that follows. If the change is placed right at the start of         the pattern that follows. If the change is placed right at the start of
4384         a pattern, PCRE extracts it into the global options (and it will there-         a pattern, PCRE extracts it into the global options (and it will there-
4385         fore show up in data extracted by the pcre_fullinfo() function).         fore show up in data extracted by the pcre_fullinfo() function).
4386    
4387         An option change within a subpattern (see below for  a  description  of         An  option  change  within a subpattern (see below for a description of
4388         subpatterns)  affects only that part of the subpattern that follows it,         subpatterns) affects only that part of the subpattern that follows  it,
4389         so         so
4390    
4391           (a(?i)b)c           (a(?i)b)c
4392    
4393         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not         matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
4394         used).   By  this means, options can be made to have different settings         used).  By this means, options can be made to have  different  settings
4395         in different parts of the pattern. Any changes made in one  alternative         in  different parts of the pattern. Any changes made in one alternative
4396         do  carry  on  into subsequent branches within the same subpattern. For         do carry on into subsequent branches within the  same  subpattern.  For
4397         example,         example,
4398    
4399           (a(?i)b|c)           (a(?i)b|c)
4400    
4401         matches "ab", "aB", "c", and "C", even though  when  matching  "C"  the         matches  "ab",  "aB",  "c",  and "C", even though when matching "C" the
4402         first  branch  is  abandoned before the option setting. This is because         first branch is abandoned before the option setting.  This  is  because
4403         the effects of option settings happen at compile time. There  would  be         the  effects  of option settings happen at compile time. There would be
4404         some very weird behaviour otherwise.         some very weird behaviour otherwise.
4405    
4406         Note:  There  are  other  PCRE-specific  options that can be set by the         Note: There are other PCRE-specific options that  can  be  set  by  the
4407         application when the compile or match functions  are  called.  In  some         application  when  the  compile  or match functions are called. In some
4408         cases the pattern can contain special leading sequences such as (*CRLF)         cases the pattern can contain special leading sequences such as (*CRLF)
4409         to override what the application has set or what  has  been  defaulted.         to  override  what  the application has set or what has been defaulted.
4410         Details  are  given  in the section entitled "Newline sequences" above.         Details are given in the section entitled  "Newline  sequences"  above.
4411         There are also the (*UTF8) and (*UCP) leading  sequences  that  can  be         There  are  also  the  (*UTF8) and (*UCP) leading sequences that can be
4412         used  to  set  UTF-8 and Unicode property modes; they are equivalent to         used to set UTF-8 and Unicode property modes; they  are  equivalent  to
4413         setting the PCRE_UTF8 and the PCRE_UCP options, respectively.         setting the PCRE_UTF8 and the PCRE_UCP options, respectively.
4414    
4415    
# Line 4402  SUBPATTERNS Line 4422  SUBPATTERNS
4422    
4423           cat(aract|erpillar|)           cat(aract|erpillar|)
4424    
4425         matches  "cataract",  "caterpillar", or "cat". Without the parentheses,         matches "cataract", "caterpillar", or "cat". Without  the  parentheses,
4426         it would match "cataract", "erpillar" or an empty string.         it would match "cataract", "erpillar" or an empty string.
4427    
4428         2. It sets up the subpattern as  a  capturing  subpattern.  This  means         2.  It  sets  up  the  subpattern as a capturing subpattern. This means
4429         that,  when  the  whole  pattern  matches,  that portion of the subject         that, when the whole pattern  matches,  that  portion  of  the  subject
4430         string that matched the subpattern is passed back to the caller via the         string that matched the subpattern is passed back to the caller via the
4431         ovector  argument  of pcre_exec(). Opening parentheses are counted from         ovector argument of pcre_exec(). Opening parentheses are  counted  from
4432         left to right (starting from 1) to obtain  numbers  for  the  capturing         left  to  right  (starting  from 1) to obtain numbers for the capturing
4433         subpatterns.  For  example,  if  the  string  "the red king" is matched         subpatterns. For example, if the  string  "the  red  king"  is  matched
4434         against the pattern         against the pattern
4435    
4436           the ((red|white) (king|queen))           the ((red|white) (king|queen))
# Line 4418  SUBPATTERNS Line 4438  SUBPATTERNS
4438         the captured substrings are "red king", "red", and "king", and are num-         the captured substrings are "red king", "red", and "king", and are num-
4439         bered 1, 2, and 3, respectively.         bered 1, 2, and 3, respectively.
4440    
4441         The  fact  that  plain  parentheses  fulfil two functions is not always         The fact that plain parentheses fulfil  two  functions  is  not  always
4442         helpful.  There are often times when a grouping subpattern is  required         helpful.   There are often times when a grouping subpattern is required
4443         without  a capturing requirement. If an opening parenthesis is followed         without a capturing requirement. If an opening parenthesis is  followed
4444         by a question mark and a colon, the subpattern does not do any  captur-         by  a question mark and a colon, the subpattern does not do any captur-
4445         ing,  and  is  not  counted when computing the number of any subsequent         ing, and is not counted when computing the  number  of  any  subsequent
4446         capturing subpatterns. For example, if the string "the white queen"  is         capturing  subpatterns. For example, if the string "the white queen" is
4447         matched against the pattern         matched against the pattern
4448    
4449           the ((?:red|white) (king|queen))           the ((?:red|white) (king|queen))
# Line 4431  SUBPATTERNS Line 4451  SUBPATTERNS
4451         the captured substrings are "white queen" and "queen", and are numbered         the captured substrings are "white queen" and "queen", and are numbered
4452         1 and 2. The maximum number of capturing subpatterns is 65535.         1 and 2. The maximum number of capturing subpatterns is 65535.
4453    
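The numbering of capturing and non-capturing subpatterns in the two examples can be checked with a sketch. It uses Python's re module rather than pcre_exec() and its ovector, on the assumption that group numbering works the same way. The helper name is hypothetical.

```python
import re


def captured(pattern: str, subject: str) -> tuple:
    """Return the captured substrings in group-number order (1, 2, ...)."""
    found = re.fullmatch(pattern, subject)
    return found.groups() if found else ()


if __name__ == "__main__":
    # Three capturing groups, numbered by their opening parentheses.
    print(captured(r"the ((red|white) (king|queen))", "the red king"))

    # (?: ... ) groups without capturing, so only two numbers are used.
    print(captured(r"the ((?:red|white) (king|queen))", "the white queen"))
```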
4454         As a convenient shorthand, if any option settings are required  at  the         As  a  convenient shorthand, if any option settings are required at the
4455         start  of  a  non-capturing  subpattern,  the option letters may appear         start of a non-capturing subpattern,  the  option  letters  may  appear
4456         between the "?" and the ":". Thus the two patterns         between the "?" and the ":". Thus the two patterns
4457    
4458           (?i:saturday|sunday)           (?i:saturday|sunday)
4459           (?:(?i)saturday|sunday)           (?:(?i)saturday|sunday)
4460    
4461         match exactly the same set of strings. Because alternative branches are         match exactly the same set of strings. Because alternative branches are
4462         tried  from  left  to right, and options are not reset until the end of         tried from left to right, and options are not reset until  the  end  of
4463         the subpattern is reached, an option setting in one branch does  affect         the  subpattern is reached, an option setting in one branch does affect
4464         subsequent  branches,  so  the above patterns match "SUNDAY" as well as         subsequent branches, so the above patterns match "SUNDAY"  as  well  as
4465         "Saturday".         "Saturday".
4466    
4467    
4468  DUPLICATE SUBPATTERN NUMBERS  DUPLICATE SUBPATTERN NUMBERS
4469    
4470         Perl 5.10 introduced a feature whereby each alternative in a subpattern         Perl 5.10 introduced a feature whereby each alternative in a subpattern
4471         uses  the same numbers for its capturing parentheses. Such a subpattern         uses the same numbers for its capturing parentheses. Such a  subpattern
4472         starts with (?| and is itself a non-capturing subpattern. For  example,         starts  with (?| and is itself a non-capturing subpattern. For example,
4473         consider this pattern:         consider this pattern:
4474    
4475           (?|(Sat)ur|(Sun))day           (?|(Sat)ur|(Sun))day
4476    
4477         Because  the two alternatives are inside a (?| group, both sets of cap-         Because the two alternatives are inside a (?| group, both sets of  cap-
4478         turing parentheses are numbered one. Thus, when  the  pattern  matches,         turing  parentheses  are  numbered one. Thus, when the pattern matches,
4479         you  can  look  at captured substring number one, whichever alternative         you can look at captured substring number  one,  whichever  alternative
4480         matched. This construct is useful when you want to  capture  part,  but         matched.  This  construct  is useful when you want to capture part, but
4481         not all, of one of a number of alternatives. Inside a (?| group, paren-         not all, of one of a number of alternatives. Inside a (?| group, paren-
4482         theses are numbered as usual, but the number is reset at the  start  of         theses  are  numbered as usual, but the number is reset at the start of
4483         each  branch.  The numbers of any capturing parentheses that follow the         each branch. The numbers of any capturing parentheses that  follow  the
4484         subpattern start after the highest number used in any branch. The  fol-         subpattern  start after the highest number used in any branch. The fol-
4485         lowing example is taken from the Perl documentation. The numbers under-         lowing example is taken from the Perl documentation. The numbers under-
4486         neath show in which buffer the captured content will be stored.         neath show in which buffer the captured content will be stored.
4487    
# Line 4469  DUPLICATE SUBPATTERN NUMBERS Line 4489  DUPLICATE SUBPATTERN NUMBERS
4489           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
4490           # 1            2         2  3        2     3     4           # 1            2         2  3        2     3     4
4491    
4492         A back reference to a numbered subpattern uses the  most  recent  value         A  back  reference  to a numbered subpattern uses the most recent value
4493         that  is  set  for that number by any subpattern. The following pattern         that is set for that number by any subpattern.  The  following  pattern
4494         matches "abcabc" or "defdef":         matches "abcabc" or "defdef":
4495    
4496           /(?|(abc)|(def))\1/           /(?|(abc)|(def))\1/
4497    
4498         In contrast, a subroutine call to a numbered subpattern  always  refers         In  contrast,  a subroutine call to a numbered subpattern always refers
4499         to  the  first  one in the pattern with the given number. The following         to the first one in the pattern with the given  number.  The  following
4500         pattern matches "abcabc" or "defabc":         pattern matches "abcabc" or "defabc":
4501    
4502           /(?|(abc)|(def))(?1)/           /(?|(abc)|(def))(?1)/
4503    
4504         If a condition test for a subpattern's having matched refers to a  non-         If  a condition test for a subpattern's having matched refers to a non-
4505         unique  number, the test is true if any of the subpatterns of that num-         unique number, the test is true if any of the subpatterns of that  num-
4506         ber have matched.         ber have matched.
4507    
4508         An alternative approach to using this "branch reset" feature is to  use         An  alternative approach to using this "branch reset" feature is to use
4509         duplicate named subpatterns, as described in the next section.         duplicate named subpatterns, as described in the next section.
4510    
4511    
4512  NAMED SUBPATTERNS  NAMED SUBPATTERNS
4513    
4514         Identifying  capturing  parentheses  by number is simple, but it can be         Identifying capturing parentheses by number is simple, but  it  can  be
4515         very hard to keep track of the numbers in complicated  regular  expres-         very  hard  to keep track of the numbers in complicated regular expres-
4516         sions.  Furthermore,  if  an  expression  is  modified, the numbers may         sions. Furthermore, if an  expression  is  modified,  the  numbers  may
4517         change. To help with this difficulty, PCRE supports the naming of  sub-         change.  To help with this difficulty, PCRE supports the naming of sub-
4518         patterns. This feature was not added to Perl until release 5.10. Python         patterns. This feature was not added to Perl until release 5.10. Python
4519         had the feature earlier, and PCRE introduced it at release  4.0,  using         had  the  feature earlier, and PCRE introduced it at release 4.0, using
4520         the  Python syntax. PCRE now supports both the Perl and the Python syn-         the Python syntax. PCRE now supports both the Perl and the Python  syn-
4521         tax. Perl allows identically numbered  subpatterns  to  have  different         tax.  Perl  allows  identically  numbered subpatterns to have different
4522         names, but PCRE does not.         names, but PCRE does not.
4523    
4524         In  PCRE,  a subpattern can be named in one of three ways: (?<name>...)         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
4525         or (?'name'...) as in Perl, or (?P<name>...) as in  Python.  References         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
4526         to  capturing parentheses from other parts of the pattern, such as back         to capturing parentheses from other parts of the pattern, such as  back
4527         references, recursion, and conditions, can be made by name as  well  as         references,  recursion,  and conditions, can be made by name as well as
4528         by number.         by number.
4529    
4530         Names  consist  of  up  to  32 alphanumeric characters and underscores.         Names consist of up to  32  alphanumeric  characters  and  underscores.
4531         Named capturing parentheses are still  allocated  numbers  as  well  as         Named  capturing  parentheses  are  still  allocated numbers as well as
4532         names,  exactly as if the names were not present. The PCRE API provides         names, exactly as if the names were not present. The PCRE API  provides
4533         function calls for extracting the name-to-number translation table from         function calls for extracting the name-to-number translation table from
4534         a compiled pattern. There is also a convenience function for extracting         a compiled pattern. There is also a convenience function for extracting
4535         a captured substring by name.         a captured substring by name.
4536    
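As a rough illustration of names and numbers coexisting, the sketch below uses Python's `re` module, which accepts the (?P<name>...) form. It is not the PCRE API: `groupindex` is Python's own name-to-number mapping and stands in here for the translation table described above. The sample pattern and subject are invented for the example.

```python
import re

# Two named groups followed by one unnamed group. Named groups are
# still numbered 1 and 2, exactly as if the names were absent.
pattern = re.compile(r"(?P<day>\d{2}) (?P<month>[A-Za-z]{3}) (\d{4})")

name_table = dict(pattern.groupindex)  # name -> group number
m = pattern.match("21 Oct 2011")

by_name = m.group("month")
by_number = m.group(name_table["month"])
year = m.group(3)

print(name_table, by_name, by_number, year)
```

Looking a substring up by name avoids hard-coding group numbers that may shift when the pattern is edited.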
4537         By default, a name must be unique within a pattern, but it is  possible         By  default, a name must be unique within a pattern, but it is possible
4538         to relax this constraint by setting the PCRE_DUPNAMES option at compile         to relax this constraint by setting the PCRE_DUPNAMES option at compile
4539         time. (Duplicate names are also always permitted for  subpatterns  with         time.  (Duplicate  names are also always permitted for subpatterns with
4540         the  same  number, set up as described in the previous section.) Dupli-         the same number, set up as described in the previous  section.)  Dupli-
4541         cate names can be useful for patterns where only one  instance  of  the         cate  names  can  be useful for patterns where only one instance of the
4542         named  parentheses  can  match. Suppose you want to match the name of a         named parentheses can match. Suppose you want to match the  name  of  a
4543         weekday, either as a 3-letter abbreviation or as the full name, and  in         weekday,  either as a 3-letter abbreviation or as the full name, and in
4544         both cases you want to extract the abbreviation. This pattern (ignoring         both cases you want to extract the abbreviation. This pattern (ignoring
4545         the line breaks) does the job:         the line breaks) does the job:
4546    
# Line 4530  NAMED SUBPATTERNS Line 4550  NAMED SUBPATTERNS
4550           (?<DN>Thu)(?:rsday)?|           (?<DN>Thu)(?:rsday)?|
4551           (?<DN>Sat)(?:urday)?           (?<DN>Sat)(?:urday)?
4552    
4553         There are five capturing substrings, but only one is ever set  after  a         There  are  five capturing substrings, but only one is ever set after a
4554         match.  (An alternative way of solving this problem is to use a "branch         match.  (An alternative way of solving this problem is to use a "branch
4555         reset" subpattern, as described in the previous section.)         reset" subpattern, as described in the previous section.)
4556    
4557         The convenience function for extracting the data by  name  returns  the         The  convenience  function  for extracting the data by name returns the
4558         substring  for  the first (and in this example, the only) subpattern of         substring for the first (and in this example, the only)  subpattern  of
4559         that name that matched. This saves searching  to  find  which  numbered         that  name  that  matched.  This saves searching to find which numbered
4560         subpattern it was.         subpattern it was.
4561    
4562         If  you  make  a  back  reference to a non-unique named subpattern from         If you make a back reference to  a  non-unique  named  subpattern  from
4563         elsewhere in the pattern, the one that corresponds to the first  occur-         elsewhere  in the pattern, the one that corresponds to the first occur-
4564         rence of the name is used. In the absence of duplicate numbers (see the         rence of the name is used. In the absence of duplicate numbers (see the
4565         previous section) this is the one with the lowest number. If you use  a         previous  section) this is the one with the lowest number. If you use a
4566         named  reference  in a condition test (see the section about conditions         named reference in a condition test (see the section  about  conditions
4567         below), either to check whether a subpattern has matched, or  to  check         below),  either  to check whether a subpattern has matched, or to check
4568         for  recursion,  all  subpatterns with the same name are tested. If the         for recursion, all subpatterns with the same name are  tested.  If  the
4569         condition is true for any one of them, the overall condition  is  true.         condition  is  true for any one of them, the overall condition is true.
4570         This is the same behaviour as testing by number. For further details of         This is the same behaviour as testing by number. For further details of
4571         the interfaces for handling named subpatterns, see the pcreapi documen-         the interfaces for handling named subpatterns, see the pcreapi documen-
4572         tation.         tation.
4573    
4574         Warning: You cannot use different names to distinguish between two sub-         Warning: You cannot use different names to distinguish between two sub-
4575         patterns with the same number because PCRE uses only the  numbers  when         patterns  with  the same number because PCRE uses only the numbers when
4576         matching. For this reason, an error is given at compile time if differ-         matching. For this reason, an error is given at compile time if differ-
4577         ent names are given to subpatterns with the same number.  However,  you         ent  names  are given to subpatterns with the same number. However, you
4578         can  give  the same name to subpatterns with the same number, even when         can give the same name to subpatterns with the same number,  even  when
4579         PCRE_DUPNAMES is not set.         PCRE_DUPNAMES is not set.
4580    
4581    
4582  REPETITION  REPETITION
4583    
4584         Repetition is specified by quantifiers, which can  follow  any  of  the         Repetition  is  specified  by  quantifiers, which can follow any of the
4585         following items:         following items:
4586    
4587           a literal data character           a literal data character
# Line 4575  REPETITION Line 4595  REPETITION
4595           a parenthesized subpattern (including assertions)           a parenthesized subpattern (including assertions)
4596           a subroutine call to a subpattern (recursive or otherwise)           a subroutine call to a subpattern (recursive or otherwise)
4597    
4598         The  general repetition quantifier specifies a minimum and maximum num-         The general repetition quantifier specifies a minimum and maximum  num-
4599         ber of permitted matches, by giving the two numbers in  curly  brackets         ber  of  permitted matches, by giving the two numbers in curly brackets
4600         (braces),  separated  by  a comma. The numbers must be less than 65536,         (braces), separated by a comma. The numbers must be  less  than  65536,
4601         and the first must be less than or equal to the second. For example:         and the first must be less than or equal to the second. For example:
4602    
4603           z{2,4}           z{2,4}
4604    
4605         matches "zz", "zzz", or "zzzz". A closing brace on its  own  is  not  a         matches  "zz",  "zzz",  or  "zzzz". A closing brace on its own is not a
4606         special  character.  If  the second number is omitted, but the comma is         special character. If the second number is omitted, but  the  comma  is
4607         present, there is no upper limit; if the second number  and  the  comma         present,  there  is  no upper limit; if the second number and the comma
4608         are  both omitted, the quantifier specifies an exact number of required         are both omitted, the quantifier specifies an exact number of  required
4609         matches. Thus         matches. Thus
4610    
4611           [aeiou]{3,}           [aeiou]{3,}
# Line 4594  REPETITION Line 4614  REPETITION
4614    
4615           \d{8}           \d{8}
4616    
4617         matches exactly 8 digits. An opening curly bracket that  appears  in  a         matches  exactly  8  digits. An opening curly bracket that appears in a
4618         position  where a quantifier is not allowed, or one that does not match         position where a quantifier is not allowed, or one that does not  match
4619         the syntax of a quantifier, is taken as a literal character. For  exam-         the  syntax of a quantifier, is taken as a literal character. For exam-
4620         ple, {,6} is not a quantifier, but a literal string of four characters.         ple, {,6} is not a quantifier, but a literal string of four characters.
4621    
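The three quantifier forms above can be tried out with the following sketch, again using Python's `re` module as a stand-in. One caveat: Python treats {,6} as a quantifier meaning {0,6}, so that particular case differs from the PCRE behaviour described here and is left out. The helper `full` is a hypothetical convenience, not part of any library.

```python
import re


def full(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


results = {
    "z{2,4} on z": full(r"z{2,4}", "z"),          # below the minimum
    "z{2,4} on zzz": full(r"z{2,4}", "zzz"),      # within range
    "z{2,4} on zzzzz": full(r"z{2,4}", "zzzzz"),  # above the maximum
    "[aeiou]{3,} on ae": full(r"[aeiou]{3,}", "ae"),
    "[aeiou]{3,} on aeiouaeiou": full(r"[aeiou]{3,}", "aeiouaeiou"),
    r"\d{8} on 20111021": full(r"\d{8}", "20111021"),
    r"\d{8} on 2011102": full(r"\d{8}", "2011102"),
}

for label, matched in results.items():
    print(label, matched)
```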
4622         In  UTF-8  mode,  quantifiers  apply to UTF-8 characters rather than to         In UTF-8 mode, quantifiers apply to UTF-8  characters  rather  than  to
4623         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-         individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
4624         acters, each of which is represented by a two-byte sequence. Similarly,         acters, each of which is represented by a two-byte sequence. Similarly,
4625         when Unicode property support is available, \X{3} matches three Unicode         when Unicode property support is available, \X{3} matches three Unicode
4626         extended  sequences,  each of which may be several bytes long (and they         extended sequences, each of which may be several bytes long  (and  they
4627         may be of different lengths).         may be of different lengths).
4628    
4629         The quantifier {0} is permitted, causing the expression to behave as if         The quantifier {0} is permitted, causing the expression to behave as if
4630         the previous item and the quantifier were not present. This may be use-         the previous item and the quantifier were not present. This may be use-
4631         ful for subpatterns that are referenced as subroutines  from  elsewhere         ful  for  subpatterns that are referenced as subroutines from elsewhere
4632         in the pattern (but see also the section entitled "Defining subpatterns         in the pattern (but see also the section entitled "Defining subpatterns
4633         for use by reference only" below). Items other  than  subpatterns  that         for  use  by  reference only" below). Items other than subpatterns that
4634         have a {0} quantifier are omitted from the compiled pattern.         have a {0} quantifier are omitted from the compiled pattern.
4635    
4636         For  convenience, the three most common quantifiers have single-charac-         For convenience, the three most common quantifiers have  single-charac-
4637         ter abbreviations:         ter abbreviations:
4638    
4639           *    is equivalent to {0,}           *    is equivalent to {0,}
4640           +    is equivalent to {1,}           +    is equivalent to {1,}
4641           ?    is equivalent to {0,1}           ?    is equivalent to {0,1}
4642    
4643         It is possible to construct infinite loops by  following  a  subpattern         It  is  possible  to construct infinite loops by following a subpattern
4644         that can match no characters with a quantifier that has no upper limit,         that can match no characters with a quantifier that has no upper limit,
4645         for example:         for example:
4646    
4647           (a?)*           (a?)*
4648    
4649         Earlier versions of Perl and PCRE used to give an error at compile time         Earlier versions of Perl and PCRE used to give an error at compile time
4650         for  such  patterns. However, because there are cases where this can be         for such patterns. However, because there are cases where this  can  be
4651         useful, such patterns are now accepted, but if any  repetition  of  the         useful,  such  patterns  are now accepted, but if any repetition of the
4652         subpattern  does in fact match no characters, the loop is forcibly bro-         subpattern does in fact match no characters, the loop is forcibly  bro-
4653         ken.         ken.
4654    
4655         By default, the quantifiers are "greedy", that is, they match  as  much         By  default,  the quantifiers are "greedy", that is, they match as much
4656         as  possible  (up  to  the  maximum number of permitted times), without         as possible (up to the maximum  number  of  permitted  times),  without
4657         causing the rest of the pattern to fail. The classic example  of  where         causing  the  rest of the pattern to fail. The classic example of where
4658         this gives problems is in trying to match comments in C programs. These         this gives problems is in trying to match comments in C programs. These
4659         appear between /* and */ and within the comment,  individual  *  and  /         appear  between  /*  and  */ and within the comment, individual * and /
4660         characters  may  appear. An attempt to match C comments by applying the         characters may appear. An attempt to match C comments by  applying  the
4661         pattern         pattern
4662    
4663           /\*.*\*/           /\*.*\*/
# Line 4646  REPETITION Line 4666  REPETITION
4666    
4667           /* first comment */  not comment  /* second comment */           /* first comment */  not comment  /* second comment */
4668    
4669         fails, because it matches the entire string owing to the greediness  of         fails,  because it matches the entire string owing to the greediness of
4670         the .*  item.         the .*  item.
4671    
4672         However,  if  a quantifier is followed by a question mark, it ceases to         However, if a quantifier is followed by a question mark, it  ceases  to
4673         be greedy, and instead matches the minimum number of times possible, so         be greedy, and instead matches the minimum number of times possible, so
4674         the pattern         the pattern
4675    
4676           /\*.*?\*/           /\*.*?\*/
4677    
4678         does  the  right  thing with the C comments. The meaning of the various         does the right thing with the C comments. The meaning  of  the  various
4679         quantifiers is not otherwise changed,  just  the  preferred  number  of         quantifiers  is  not  otherwise  changed,  just the preferred number of
4680         matches.   Do  not  confuse this use of question mark with its use as a         matches.  Do not confuse this use of question mark with its  use  as  a
4681         quantifier in its own right. Because it has two uses, it can  sometimes         quantifier  in its own right. Because it has two uses, it can sometimes
4682         appear doubled, as in         appear doubled, as in
4683    
4684           \d??\d           \d??\d
# Line 4666  REPETITION Line 4686  REPETITION
4686         which matches one digit by preference, but can match two if that is the         which matches one digit by preference, but can match two if that is the
4687         only way the rest of the pattern matches.         only way the rest of the pattern matches.
4688    
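The difference between greedy and minimizing repetition on the C comment example can be reproduced with a small sketch in Python's `re` module, which is assumed to share this behaviour with PCRE.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing .*? stops at the first "*/" that lets the match succeed.
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers one digit, but takes two when the rest of the match
# (here, reaching the end of the subject) requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(repr(greedy))
print(repr(lazy))
print(one_digit, two_digits)
```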
4689         If the PCRE_UNGREEDY option is set (an option that is not available  in         If  the PCRE_UNGREEDY option is set (an option that is not available in
4690         Perl),  the  quantifiers are not greedy by default, but individual ones         Perl), the quantifiers are not greedy by default, but  individual  ones
4691         can be made greedy by following them with a  question  mark.  In  other         can  be  made  greedy  by following them with a question mark. In other
4692         words, it inverts the default behaviour.         words, it inverts the default behaviour.
4693    
4694         When  a  parenthesized  subpattern  is quantified with a minimum repeat         When a parenthesized subpattern is quantified  with  a  minimum  repeat
4695         count that is greater than 1 or with a limited maximum, more memory  is         count  that is greater than 1 or with a limited maximum, more memory is
4696         required  for  the  compiled  pattern, in proportion to the size of the         required for the compiled pattern, in proportion to  the  size  of  the
4697         minimum or maximum.         minimum or maximum.
4698    
4699         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-         If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
4700         alent  to  Perl's  /s) is set, thus allowing the dot to match newlines,         alent to Perl's /s) is set, thus allowing the dot  to  match  newlines,
4701         the pattern is implicitly anchored, because whatever  follows  will  be         the  pattern  is  implicitly anchored, because whatever follows will be
4702         tried  against every character position in the subject string, so there         tried against every character position in the subject string, so  there
4703         is no point in retrying the overall match at  any  position  after  the         is  no  point  in  retrying the overall match at any position after the
4704         first.  PCRE  normally treats such a pattern as though it were preceded         first. PCRE normally treats such a pattern as though it  were  preceded
4705         by \A.         by \A.
4706    
4707         In cases where it is known that the subject  string  contains  no  new-         In  cases  where  it  is known that the subject string contains no new-
4708         lines,  it  is  worth setting PCRE_DOTALL in order to obtain this opti-         lines, it is worth setting PCRE_DOTALL in order to  obtain  this  opti-
4709         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
4710    
4711         However, there is one situation where the optimization cannot be  used.         However,  there is one situation where the optimization cannot be used.
4712         When .*  is inside capturing parentheses that are the subject of a back         When .*  is inside capturing parentheses that are the subject of a back
4713         reference elsewhere in the pattern, a match at the start may fail where         reference elsewhere in the pattern, a match at the start may fail where
4714         a later one succeeds. Consider, for example:         a later one succeeds. Consider, for example:
4715    
4716           (.*)abc\1           (.*)abc\1
4717    
4718         If  the subject is "xyz123abc123" the match point is the fourth charac-         If the subject is "xyz123abc123" the match point is the fourth  charac-
4719         ter. For this reason, such a pattern is not implicitly anchored.         ter. For this reason, such a pattern is not implicitly anchored.
4720    
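The match point in the back reference example can be confirmed with the sketch below. It uses Python's `re` module and only shows where the match starts; it says nothing about how PCRE implements or disables the anchoring optimization internally. Python offsets are zero-based, so the fourth character is offset 3.

```python
import re

# No attempt starting at offsets 0, 1 or 2 can succeed, because the
# text captured before "abc" would have to be repeated after it.
m = re.search(r"(.*)abc\1", "xyz123abc123")

start_offset = m.start()
captured = m.group(1)
matched = m.group(0)

print(start_offset, captured, matched)
```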
4721         When a capturing subpattern is repeated, the value captured is the sub-         When a capturing subpattern is repeated, the value captured is the sub-
# Line 4704  REPETITION Line 4724  REPETITION
4724           (tweedle[dume]{3}\s*)+           (tweedle[dume]{3}\s*)+
4725    
4726         has matched "tweedledum tweedledee" the value of the captured substring         has matched "tweedledum tweedledee" the value of the captured substring
4727         is "tweedledee". However, if there are  nested  capturing  subpatterns,         is  "tweedledee".  However,  if there are nested capturing subpatterns,
4728         the  corresponding captured values may have been set in previous itera-         the corresponding captured values may have been set in previous  itera-
4729         tions. For example, after         tions. For example, after
4730    
4731           /(a|(b))+/           /(a|(b))+/
# Line 4715  REPETITION Line 4735  REPETITION
4735    
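For the repeated capturing subpattern in the "tweedle" example above, a sketch in Python's `re` module (assumed to agree with PCRE on this point) shows that only the final iteration's text is kept.

```python
import re

m = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")

whole_match = m.group(0)
last_iteration = m.group(1)  # text from the final repetition only

print(repr(whole_match), repr(last_iteration))
```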
4736  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS  ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
4737    
4738         With both maximizing ("greedy") and minimizing ("ungreedy"  or  "lazy")         With  both  maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
4739         repetition,  failure  of what follows normally causes the repeated item         repetition, failure of what follows normally causes the  repeated  item
4740         to be re-evaluated to see if a different number of repeats  allows  the         to  be  re-evaluated to see if a different number of repeats allows the
4741         rest  of  the pattern to match. Sometimes it is useful to prevent this,         rest of the pattern to match. Sometimes it is useful to  prevent  this,
4742         either to change the nature of the match, or to cause it to fail earlier         either to change the nature of the match, or to cause it to fail earlier
4743         than  it otherwise might, when the author of the pattern knows there is         than it otherwise might, when the author of the pattern knows there  is
4744         no point in carrying on.         no point in carrying on.
4745    
4746         Consider, for example, the pattern \d+foo when applied to  the  subject         Consider,  for  example, the pattern \d+foo when applied to the subject
4747         line         line
4748    
4749           123456bar           123456bar
4750    
4751         After matching all 6 digits and then failing to match "foo", the normal         After matching all 6 digits and then failing to match "foo", the normal
4752         action of the matcher is to try again with only 5 digits  matching  the         action  of  the matcher is to try again with only 5 digits matching the
4753         \d+  item,  and  then  with  4,  and  so on, before ultimately failing.         \d+ item, and then with  4,  and  so  on,  before  ultimately  failing.
4754         "Atomic grouping" (a term taken from Jeffrey  Friedl's  book)  provides         "Atomic  grouping"  (a  term taken from Jeffrey Friedl's book) provides
4755         the  means for specifying that once a subpattern has matched, it is not         the means for specifying that once a subpattern has matched, it is  not
4756         to be re-evaluated in this way.         to be re-evaluated in this way.
4757    
4758         If we use atomic grouping for the previous example, the  matcher  gives         If  we  use atomic grouping for the previous example, the matcher gives
4759         up  immediately  on failing to match "foo" the first time. The notation         up immediately on failing to match "foo" the first time.  The  notation
4760         is a kind of special parenthesis, starting with (?> as in this example:         is a kind of special parenthesis, starting with (?> as in this example:
4761    
4762           (?>\d+)foo           (?>\d+)foo
4763    
4764         This kind of parenthesis "locks up" the  part of the  pattern  it  con-         This  kind  of  parenthesis "locks up" the  part of the pattern it con-
4765         tains  once  it  has matched, and a failure further into the pattern is         tains once it has matched, and a failure further into  the  pattern  is
4766         prevented from backtracking into it. Backtracking past it  to  previous         prevented  from  backtracking into it. Backtracking past it to previous
4767         items, however, works as normal.         items, however, works as normal.
4768    
4769         An  alternative  description  is that a subpattern of this type matches         An alternative description is that a subpattern of  this  type  matches
4770         the string of characters that an  identical  standalone  pattern  would         the  string  of  characters  that an identical standalone pattern would
4771         match, if anchored at the current point in the subject string.         match, if anchored at the current point in the subject string.
4772    
4773         Atomic grouping subpatterns are not capturing subpatterns. Simple cases         Atomic grouping subpatterns are not capturing subpatterns. Simple cases
4774         such as the above example can be thought of as a maximizing repeat that         such as the above example can be thought of as a maximizing repeat that
4775         must  swallow  everything  it can. So, while both \d+ and \d+? are pre-         must swallow everything it can. So, while both \d+ and  \d+?  are  pre-
4776         pared to adjust the number of digits they match in order  to  make  the         pared  to  adjust  the number of digits they match in order to make the
4777         rest of the pattern match, (?>\d+) can only match an entire sequence of         rest of the pattern match, (?>\d+) can only match an entire sequence of
4778         digits.         digits.
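       As a rough illustration of the difference just described, the sketch
       below uses Python's re module as a stand-in for PCRE (an assumption;
       nothing here is run through PCRE itself). Python accepts (?>...) only
       from version 3.11, so the atomic group is emulated with a lookahead
       plus a named back reference. That is a common workaround, not PCRE
       syntax, and the group name "run" is invented for the example.

```python
import re

subject = "12345"

# Ordinary greedy and lazy repeats both adjust how many digits they take
# so that the trailing literal 5 can still match.
greedy = re.search(r"\d+5", subject)
lazy = re.search(r"\d+?5", subject)

# Emulation of (?>\d+)5: the lookahead captures the whole digit run and is
# never re-entered on failure; the back reference then consumes that run.
atomic_like = re.search(r"(?=(?P<run>\d+))(?P=run)5", subject)

print(greedy.group())  # 12345
print(lazy.group())    # 12345
print(atomic_like)     # None: the digit run already swallowed the final 5
```

       The atomic form fails because, once the digit run has been taken as a
       whole, there is no shorter run to fall back on.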
4779    
4780         Atomic groups in general can of course contain arbitrarily  complicated         Atomic  groups in general can of course contain arbitrarily complicated
4781         subpatterns,  and  can  be  nested. However, when the subpattern for an         subpatterns, and can be nested. However, when  the  subpattern  for  an
4782         atomic group is just a single repeated item, as in the example above, a         atomic group is just a single repeated item, as in the example above, a
4783         simpler  notation,  called  a "possessive quantifier" can be used. This         simpler notation, called a "possessive quantifier" can  be  used.  This
4784         consists of an additional + character  following  a  quantifier.  Using         consists  of  an  additional  + character following a quantifier. Using
4785         this notation, the previous example can be rewritten as         this notation, the previous example can be rewritten as
4786    
4787           \d++foo           \d++foo
# Line 4771  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE Line 4791  ATOMIC GROUPING AND POSSESSIVE QUANTIFIE
4791    
4792           (abc|xyz){2,3}+           (abc|xyz){2,3}+
4793    
4794         Possessive  quantifiers  are  always  greedy;  the   setting   of   the         Possessive   quantifiers   are   always  greedy;  the  setting  of  the
4795         PCRE_UNGREEDY option is ignored. They are a convenient notation for the         PCRE_UNGREEDY option is ignored. They are a convenient notation for the
4796         simpler forms of atomic group. However, there is no difference  in  the         simpler  forms  of atomic group. However, there is no difference in the
4797         meaning  of  a  possessive  quantifier and the equivalent atomic group,         meaning of a possessive quantifier and  the  equivalent  atomic  group,
4798         though there may be a performance  difference;  possessive  quantifiers         though  there  may  be a performance difference; possessive quantifiers
4799         should be slightly faster.         should be slightly faster.
4800    
4801         The  possessive  quantifier syntax is an extension to the Perl 5.8 syn-         The possessive quantifier syntax is an extension to the Perl  5.8  syn-
4802         tax.  Jeffrey Friedl originated the idea (and the name)  in  the  first         tax.   Jeffrey  Friedl  originated the idea (and the name) in the first
4803         edition of his book. Mike McCloskey liked it, so implemented it when he         edition of his book. Mike McCloskey liked it, so implemented it when he
4804         built Sun's Java package, and PCRE copied it from there. It  ultimately         built  Sun's Java package, and PCRE copied it from there. It ultimately
4805         found its way into Perl at release 5.10.         found its way into Perl at release 5.10.
4806    
4807         PCRE has an optimization that automatically "possessifies" certain sim-         PCRE has an optimization that automatically "possessifies" certain sim-
4808         ple pattern constructs. For example, the sequence  A+B  is  treated  as         ple  pattern  constructs.  For  example, the sequence A+B is treated as
4809         A++B  because  there is no point in backtracking into a sequence of A's         A++B because there is no point in backtracking into a sequence  of  A's
4810         when B must follow.         when B must follow.
4811    
4812         When a pattern contains an unlimited repeat inside  a  subpattern  that         When  a  pattern  contains an unlimited repeat inside a subpattern that
4813         can  itself  be  repeated  an  unlimited number of times, the use of an         can itself be repeated an unlimited number of  times,  the  use  of  an
4814         atomic group is the only way to avoid some  failing  matches  taking  a         atomic  group  is  the  only way to avoid some failing matches taking a
4815         very long time indeed. The pattern         very long time indeed. The pattern
4816    
4817           (\D+|<\d+>)*[!?]           (\D+|<\d+>)*[!?]
4818    
4819         matches  an  unlimited number of substrings that either consist of non-         matches an unlimited number of substrings that either consist  of  non-
4820         digits, or digits enclosed in <>, followed by either ! or  ?.  When  it         digits,  or  digits  enclosed in <>, followed by either ! or ?. When it
4821         matches, it runs quickly. However, if it is applied to         matches, it runs quickly. However, if it is applied to
4822    
4823           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
4824    
4825         it  takes  a  long  time  before reporting failure. This is because the         it takes a long time before reporting  failure.  This  is  because  the
4826         string can be divided between the internal \D+ repeat and the  external         string  can be divided between the internal \D+ repeat and the external
4827         *  repeat  in  a  large  number of ways, and all have to be tried. (The         * repeat in a large number of ways, and all  have  to  be  tried.  (The
4828         example uses [!?] rather than a single character at  the  end,  because         example  uses  [!?]  rather than a single character at the end, because
4829         both  PCRE  and  Perl have an optimization that allows for fast failure         both PCRE and Perl have an optimization that allows  for  fast  failure
4830         when a single character is used. They remember the last single  charac-         when  a single character is used. They remember the last single charac-
4831         ter  that  is required for a match, and fail early if it is not present         ter that is required for a match, and fail early if it is  not  present
4832         in the string.) If the pattern is changed so that  it  uses  an  atomic         in  the  string.)  If  the pattern is changed so that it uses an atomic
4833         group, like this:         group, like this:
4834    
4835           ((?>\D+)|<\d+>)*[!?]           ((?>\D+)|<\d+>)*[!?]
# Line 4821  BACK REFERENCES Line 4841  BACK REFERENCES
4841    
4842         Outside a character class, a backslash followed by a digit greater than         Outside a character class, a backslash followed by a digit greater than
4843         0 (and possibly further digits) is a back reference to a capturing sub-         0 (and possibly further digits) is a back reference to a capturing sub-
4844         pattern  earlier  (that is, to its left) in the pattern, provided there         pattern earlier (that is, to its left) in the pattern,  provided  there
4845         have been that many previous capturing left parentheses.         have been that many previous capturing left parentheses.
4846    
4847         However, if the decimal number following the backslash is less than 10,         However, if the decimal number following the backslash is less than 10,
4848         it  is  always  taken  as a back reference, and causes an error only if         it is always taken as a back reference, and causes  an  error  only  if
4849         there are not that many capturing left parentheses in the  entire  pat-         there  are  not that many capturing left parentheses in the entire pat-
4850         tern.  In  other words, the parentheses that are referenced need not be         tern. In other words, the parentheses that are referenced need  not  be
4851         to the left of the reference for numbers less than 10. A "forward  back         to  the left of the reference for numbers less than 10. A "forward back
4852         reference"  of  this  type can make sense when a repetition is involved         reference" of this type can make sense when a  repetition  is  involved
4853         and the subpattern to the right has participated in an  earlier  itera-         and  the  subpattern to the right has participated in an earlier itera-
4854         tion.         tion.
4855    
4856         It  is  not  possible to have a numerical "forward back reference" to a         It is not possible to have a numerical "forward back  reference"  to  a
4857         subpattern whose number is 10 or  more  using  this  syntax  because  a         subpattern  whose  number  is  10  or  more using this syntax because a
4858         sequence  such  as  \50 is interpreted as a character defined in octal.         sequence such as \50 is interpreted as a character  defined  in  octal.
4859         See the subsection entitled "Non-printing characters" above for further         See the subsection entitled "Non-printing characters" above for further
4860         details  of  the  handling of digits following a backslash. There is no         details of the handling of digits following a backslash.  There  is  no
4861         such problem when named parentheses are used. A back reference  to  any         such  problem  when named parentheses are used. A back reference to any
4862         subpattern is possible using named parentheses (see below).         subpattern is possible using named parentheses (see below).
4863    
4864         Another  way  of  avoiding  the ambiguity inherent in the use of digits         Another way of avoiding the ambiguity inherent in  the  use  of  digits
4865         following a backslash is to use the \g  escape  sequence.  This  escape         following  a  backslash  is  to use the \g escape sequence. This escape
4866         must be followed by an unsigned number or a negative number, optionally         must be followed by an unsigned number or a negative number, optionally
4867         enclosed in braces. These examples are all identical:         enclosed in braces. These examples are all identical:
4868    
# Line 4850  BACK REFERENCES Line 4870  BACK REFERENCES
4870           (ring), \g1           (ring), \g1
4871           (ring), \g{1}           (ring), \g{1}
4872    
4873         An unsigned number specifies an absolute reference without the  ambigu-         An  unsigned number specifies an absolute reference without the ambigu-
4874         ity that is present in the older syntax. It is also useful when literal         ity that is present in the older syntax. It is also useful when literal
4875         digits follow the reference. A negative number is a relative reference.         digits follow the reference. A negative number is a relative reference.
4876         Consider this example:         Consider this example:
# Line 4859  BACK REFERENCES Line 4879  BACK REFERENCES
4879    
4880         The sequence \g{-1} is a reference to the most recently started captur-         The sequence \g{-1} is a reference to the most recently started captur-
4881         ing subpattern before \g, that is, it is equivalent to \2 in this exam-         ing subpattern before \g, that is, it is equivalent to \2 in this exam-
4882         ple.   Similarly, \g{-2} would be equivalent to \1. The use of relative         ple.  Similarly, \g{-2} would be equivalent to \1. The use of  relative
4883         references can be helpful in long patterns, and also in  patterns  that         references  can  be helpful in long patterns, and also in patterns that
4884         are  created  by  joining  together  fragments  that contain references         are created by  joining  together  fragments  that  contain  references
4885         within themselves.         within themselves.
4886    
4887         A back reference matches whatever actually matched the  capturing  sub-         A  back  reference matches whatever actually matched the capturing sub-
4888         pattern  in  the  current subject string, rather than anything matching         pattern in the current subject string, rather  than  anything  matching
4889         the subpattern itself (see "Subpatterns as subroutines" below for a way         the subpattern itself (see "Subpatterns as subroutines" below for a way
4890         of doing that). So the pattern         of doing that). So the pattern
4891    
4892           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
4893    
4894         matches  "sense and sensibility" and "response and responsibility", but         matches "sense and sensibility" and "response and responsibility",  but
4895         not "sense and responsibility". If caseful matching is in force at  the         not  "sense and responsibility". If caseful matching is in force at the
4896         time  of the back reference, the case of letters is relevant. For exam-         time of the back reference, the case of letters is relevant. For  exam-
4897         ple,         ple,
4898    
4899           ((?i)rah)\s+\1           ((?i)rah)\s+\1
4900    
4901         matches "rah rah" and "RAH RAH", but not "RAH  rah",  even  though  the         matches  "rah  rah"  and  "RAH RAH", but not "RAH rah", even though the
4902         original capturing subpattern is matched caselessly.         original capturing subpattern is matched caselessly.
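       The two examples above can be tried with Python's re module, used
       here as an assumed stand-in for PCRE. Recent Python versions do not
       accept (?i) in the middle of a pattern, so the scoped form (?i:rah)
       replaces the document's ((?i)rah); that substitution is specific to
       this sketch.

```python
import re

# \1 must repeat the text that group 1 actually matched.
pair = re.compile(r"(sens|respons)e and \1ibility")
same_first = pair.fullmatch("sense and sensibility") is not None
same_second = pair.fullmatch("response and responsibility") is not None
mixed = pair.fullmatch("sense and responsibility") is not None

# The group is matched caselessly, but the back reference sits outside the
# caseless scope, so it compares the captured text casefully.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = {s: rah.fullmatch(s) is not None
               for s in ("rah rah", "RAH RAH", "RAH rah")}

print(same_first, same_second, mixed)  # True True False
print(rah_results)
```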
4903    
4904         There  are  several  different ways of writing back references to named         There are several different ways of writing back  references  to  named
4905         subpatterns. The .NET syntax \k{name} and the Perl syntax  \k<name>  or         subpatterns.  The  .NET syntax \k{name} and the Perl syntax \k<name> or
4906         \k'name'  are supported, as is the Python syntax (?P=name). Perl 5.10's         \k'name' are supported, as is the Python syntax (?P=name). Perl  5.10's
4907         unified back reference syntax, in which \g can be used for both numeric         unified back reference syntax, in which \g can be used for both numeric
4908         and  named  references,  is  also supported. We could rewrite the above         and named references, is also supported. We  could  rewrite  the  above
4909         example in any of the following ways:         example in any of the following ways:
4910    
4911           (?<p1>(?i)rah)\s+\k<p1>           (?<p1>(?i)rah)\s+\k<p1>
# Line 4893  BACK REFERENCES Line 4913  BACK REFERENCES
4913           (?P<p1>(?i)rah)\s+(?P=p1)           (?P<p1>(?i)rah)\s+(?P=p1)
4914           (?<p1>(?i)rah)\s+\g{p1}           (?<p1>(?i)rah)\s+\g{p1}
4915    
4916         A subpattern that is referenced by  name  may  appear  in  the  pattern         A  subpattern  that  is  referenced  by  name may appear in the pattern
4917         before or after the reference.         before or after the reference.
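       Of the named forms listed above, Python's re module supports only the
       Python syntax (?P=name), so a sketch in Python (again an assumed
       stand-in for PCRE, with (?i:rah) in place of (?i)rah) is limited to
       that one variant. Python also requires the named group to appear
       before the reference, unlike PCRE.

```python
import re

# Named group p1, referenced later with the Python syntax (?P=p1).
named = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

upper = named.fullmatch("RAH RAH")
mixed_case = named.fullmatch("RAH rah")

print(upper.group("p1"))  # RAH
print(mixed_case)         # None
```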
4918    
4919         There  may be more than one back reference to the same subpattern. If a         There may be more than one back reference to the same subpattern. If  a
4920         subpattern has not actually been used in a particular match,  any  back         subpattern  has  not actually been used in a particular match, any back
4921         references to it always fail by default. For example, the pattern         references to it always fail by default. For example, the pattern
4922    
4923           (a|(bc))\2           (a|(bc))\2
4924    
4925         always  fails  if  it starts to match "a" rather than "bc". However, if         always fails if it starts to match "a" rather than  "bc".  However,  if
4926         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-         the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer-
4927         ence to an unset value matches an empty string.         ence to an unset value matches an empty string.
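       The default behaviour for an unset group can be sketched with
       Python's re module (an assumed stand-in; Python has no counterpart of
       PCRE_JAVASCRIPT_COMPAT, so only the failing case is shown).

```python
import re

unset = re.compile(r"(a|(bc))\2")

# Group 2 takes no part when the first alternative matches "a",
# so the back reference \2 fails and no match is found.
after_a = unset.search("a")

# When "bc" is matched, group 2 is set and \2 must repeat it.
after_bc = unset.fullmatch("bcbc")

print(after_a)            # None
print(after_bc.group(2))  # bc
```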
4928    
4929         Because  there may be many capturing parentheses in a pattern, all dig-         Because there may be many capturing parentheses in a pattern, all  dig-
4930         its following a backslash are taken as part of a potential back  refer-         its  following a backslash are taken as part of a potential back refer-
4931         ence  number.   If  the  pattern continues with a digit character, some         ence number.  If the pattern continues with  a  digit  character,  some
4932         delimiter must  be  used  to  terminate  the  back  reference.  If  the         delimiter  must  be  used  to  terminate  the  back  reference.  If the
4933         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
4934         syntax or an empty comment (see "Comments" below) can be used.         syntax or an empty comment (see "Comments" below) can be used.
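       A small sketch of the two delimiters that Python's re module shares
       with PCRE: an empty comment, and whitespace in verbose mode (re.VERBOSE
       is taken here as the counterpart of PCRE_EXTENDED). The \g{ syntax is
       not available in Python patterns, so it is left out.

```python
import re

# The empty comment (?#) ends the reference \1 before the literal digit 1.
with_comment = re.fullmatch(r"(a)\1(?#)1", "aa1")

# In verbose mode, the space after \1 is ignored but still ends the number.
with_space = re.fullmatch(r"(a)\1 1", "aa1", re.VERBOSE)

print(with_comment is not None, with_space is not None)  # True True
```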
4935    
4936     Recursive back references     Recursive back references
4937    
4938         A back reference that occurs inside the parentheses to which it  refers         A  back reference that occurs inside the parentheses to which it refers
4939         fails  when  the subpattern is first used, so, for example, (a\1) never         fails when the subpattern is first used, so, for example,  (a\1)  never
4940         matches.  However, such references can be useful inside  repeated  sub-         matches.   However,  such references can be useful inside repeated sub-
4941         patterns. For example, the pattern         patterns. For example, the pattern
4942    
4943           (a|b\1)+           (a|b\1)+
4944    
4945         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-         matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4946         ation of the subpattern,  the  back  reference  matches  the  character         ation  of  the  subpattern,  the  back  reference matches the character
4947         string  corresponding  to  the previous iteration. In order for this to         string corresponding to the previous iteration. In order  for  this  to
4948         work, the pattern must be such that the first iteration does  not  need         work,  the  pattern must be such that the first iteration does not need
4949         to  match the back reference. This can be done using alternation, as in         to match the back reference. This can be done using alternation, as  in
4950         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4951    
4952         Back references of this type cause the group that they reference to  be         Back  references of this type cause the group that they reference to be
4953         treated  as  an atomic group.  Once the whole group has been matched, a         treated as an atomic group.  Once the whole group has been  matched,  a
4954         subsequent matching failure cannot cause backtracking into  the  middle         subsequent  matching  failure cannot cause backtracking into the middle
4955         of the group.         of the group.
4956    
4957    
4958  ASSERTIONS  ASSERTIONS
4959    
4960         An  assertion  is  a  test on the characters following or preceding the         An assertion is a test on the characters  following  or  preceding  the
4961         current matching point that does not actually consume  any  characters.         current  matching  point that does not actually consume any characters.
4962         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are
4963         described above.         described above.
4964    
4965         More complicated assertions are coded as  subpatterns.  There  are  two         More  complicated  assertions  are  coded as subpatterns. There are two
4966         kinds:  those  that  look  ahead of the current position in the subject         kinds: those that look ahead of the current  position  in  the  subject
4967         string, and those that look  behind  it.  An  assertion  subpattern  is         string,  and  those  that  look  behind  it. An assertion subpattern is
4968         matched  in  the  normal way, except that it does not cause the current         matched in the normal way, except that it does not  cause  the  current
4969         matching position to be changed.         matching position to be changed.
4970    
4971         Assertion subpatterns are not capturing subpatterns. If such an  asser-         Assertion  subpatterns are not capturing subpatterns. If such an asser-
4972         tion  contains  capturing  subpatterns within it, these are counted for         tion contains capturing subpatterns within it, these  are  counted  for
4973         the purposes of numbering the capturing subpatterns in the  whole  pat-         the  purposes  of numbering the capturing subpatterns in the whole pat-
4974         tern.  However,  substring  capturing  is carried out only for positive         tern. However, substring capturing is carried  out  only  for  positive
4975         assertions, because it does not make sense for negative assertions.         assertions, because it does not make sense for negative assertions.
4976    
4977         For compatibility with Perl, assertion  subpatterns  may  be  repeated;         For  compatibility  with  Perl,  assertion subpatterns may be repeated;
4978         though  it  makes  no sense to assert the same thing several times, the         though it makes no sense to assert the same thing  several  times,  the
4979         side effect of capturing parentheses may  occasionally  be  useful.  In         side  effect  of  capturing  parentheses may occasionally be useful. In
4980         practice, there are only three cases:         practice, there are only three cases:
4981    
4982         (1)  If  the  quantifier  is  {0}, the assertion is never obeyed during         (1) If the quantifier is {0}, the  assertion  is  never  obeyed  during
4983         matching.  However, it may  contain  internal  capturing  parenthesized         matching.   However,  it  may  contain internal capturing parenthesized
4984         groups that are called from elsewhere via the subroutine mechanism.         groups that are called from elsewhere via the subroutine mechanism.
4985    
4986         (2)  If the quantifier is {0,n} where n is greater than zero, it is treated         (2) If the quantifier is {0,n} where n is greater than zero, it is  treated
4987         as if it were {0,1}. At run time, the rest  of  the  pattern  match  is         as  if  it  were  {0,1}.  At run time, the rest of the pattern match is
4988         tried with and without the assertion, the order depending on the greed-         tried with and without the assertion, the order depending on the greed-
4989         iness of the quantifier.         iness of the quantifier.
4990    
4991         (3) If the minimum repetition is greater than zero, the  quantifier  is         (3)  If  the minimum repetition is greater than zero, the quantifier is
4992         ignored.   The  assertion  is  obeyed just once when encountered during         ignored.  The assertion is obeyed just  once  when  encountered  during
4993         matching.         matching.
4994    
4995     Lookahead assertions     Lookahead assertions
# Line 4979  ASSERTIONS Line 4999  ASSERTIONS
4999    
5000           \w+(?=;)           \w+(?=;)
5001    
5002         matches  a word followed by a semicolon, but does not include the semi-         matches a word followed by a semicolon, but does not include the  semi-
5003         colon in the match, and         colon in the match, and
5004    
5005           foo(?!bar)           foo(?!bar)
5006    
5007         matches any occurrence of "foo" that is not  followed  by  "bar".  Note         matches  any  occurrence  of  "foo" that is not followed by "bar". Note
5008         that the apparently similar pattern         that the apparently similar pattern
5009    
5010           (?!foo)bar           (?!foo)bar
5011    
5012         does  not  find  an  occurrence  of "bar" that is preceded by something         does not find an occurrence of "bar"  that  is  preceded  by  something
5013         other than "foo"; it finds any occurrence of "bar" whatsoever,  because         other  than "foo"; it finds any occurrence of "bar" whatsoever, because
5014         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
5015         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
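       The lookahead examples above behave the same way in Python's re
       module, which is used here only as a convenient stand-in for PCRE.
       The subject strings are invented for the illustration, and the last
       line adds the lookbehind form mentioned above for contrast.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "total; rest").group()

# Only the second "foo" is not followed by "bar".
foo_at = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo) is tested where "bar" begins, so it is always true there.
bar_at = re.search(r"(?!foo)bar", "foobar").start()

# A negative lookbehind checks what precedes "bar" instead.
bar_behind = re.search(r"(?<!foo)bar", "foobar")

print(word, foo_at, bar_at, bar_behind)  # total 7 3 None
```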
5016    
5017         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
5018         most  convenient  way  to  do  it  is with (?!) because an empty string         most convenient way to do it is  with  (?!)  because  an  empty  string
5019         always matches, so an assertion that requires there not to be an  empty         always  matches, so an assertion that requires there not to be an empty
5020         string must always fail.  The backtracking control verb (*FAIL) or (*F)         string must always fail.  The backtracking control verb (*FAIL) or (*F)
5021         is a synonym for (?!).         is a synonym for (?!).
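       A brief sketch of (?!) as a forced failure, using Python's re module
       as an assumed stand-in for PCRE. The (*FAIL) verb has no Python
       equivalent and is not shown.

```python
import re

# An empty negative lookahead can never succeed.
never = re.search(r"(?!)", "anything")

# Forcing the first branch to fail leaves only the second one usable.
forced = re.search(r"a(?!)|b", "ab")

print(never)           # None
print(forced.group())  # b
```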
5022    
5023     Lookbehind assertions     Lookbehind assertions
5024    
5025         Lookbehind assertions start with (?<= for positive assertions and  (?<!         Lookbehind  assertions start with (?<= for positive assertions and (?<!
5026         for negative assertions. For example,         for negative assertions. For example,
5027    
5028           (?<!foo)bar           (?<!foo)bar
5029    
5030         does  find  an  occurrence  of "bar" that is not preceded by "foo". The         does find an occurrence of "bar" that is not  preceded  by  "foo".  The
5031         contents of a lookbehind assertion are restricted  such  that  all  the         contents  of  a  lookbehind  assertion are restricted such that all the
5032         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
5033         eral top-level alternatives, they do not all  have  to  have  the  same         eral  top-level  alternatives,  they  do  not all have to have the same
5034         fixed length. Thus         fixed length. Thus
5035    
5036           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 5019  ASSERTIONS Line 5039  ASSERTIONS
5039    
5040           (?<!dogs?|cats?)           (?<!dogs?|cats?)
5041    
5042         causes  an  error at compile time. Branches that match different length         causes an error at compile time. Branches that match  different  length
5043         strings are permitted only at the top level of a lookbehind  assertion.         strings  are permitted only at the top level of a lookbehind assertion.
5044         This is an extension compared with Perl, which requires all branches to         This is an extension compared with Perl, which requires all branches to
5045         match the same length of string. An assertion such as         match the same length of string. An assertion such as
5046    
5047           (?<=ab(c|de))           (?<=ab(c|de))
5048    
5049         is not permitted, because its single top-level  branch  can  match  two         is  not  permitted,  because  its single top-level branch can match two
5050         different lengths, but it is acceptable to PCRE if rewritten to use two         different lengths, but it is acceptable to PCRE if rewritten to use two
5051         top-level branches:         top-level branches:
5052    
5053           (?<=abc|abde)           (?<=abc|abde)
5054    
5055         In some cases, the escape sequence \K (see above) can be  used  instead         In  some  cases, the escape sequence \K (see above) can be used instead
5056         of a lookbehind assertion to get round the fixed-length restriction.         of a lookbehind assertion to get round the fixed-length restriction.
5057    
5058         The  implementation  of lookbehind assertions is, for each alternative,         The implementation of lookbehind assertions is, for  each  alternative,
5059         to temporarily move the current position back by the fixed  length  and         to  temporarily  move the current position back by the fixed length and
5060         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
5061         rent position, the assertion fails.         rent position, the assertion fails.
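       The effect of stepping back by the fixed length can be seen in a
       short sketch with Python's re module (an assumed stand-in; Python's
       own lookbehind rules are stricter than PCRE's, so only a single
       fixed-length branch is used).

```python
import re

behind = re.compile(r"(?<=abc)d")

# Three characters precede the "d", and they are "abc".
enough = behind.search("abcd")

# Only two characters precede the "d", so the assertion fails outright.
too_short = behind.search("bcd")

print(enough.start())  # 3
print(too_short)       # None
```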
5062    
5063         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
5064         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode) to appear in lookbehind assertions, because it makes it  impossi-
5065         ble to calculate the length of the lookbehind. The \X and  \R  escapes,         ble  to  calculate the length of the lookbehind. The \X and \R escapes,
5066         which can match different numbers of bytes, are also not permitted.         which can match different numbers of bytes, are also not permitted.
5067    
5068         "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in         "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
5069         lookbehinds, as long as the subpattern matches a  fixed-length  string.         lookbehinds,  as  long as the subpattern matches a fixed-length string.
5070         Recursion, however, is not supported.         Recursion, however, is not supported.
5071    
5072         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
5073         assertions to specify efficient matching of fixed-length strings at the         assertions to specify efficient matching of fixed-length strings at the
5074         end of subject strings. Consider a simple pattern such as         end of subject strings. Consider a simple pattern such as
5075    
5076           abcd$           abcd$
5077    
5078         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
5079         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
5080         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
5081         pattern is specified as         pattern is specified as
5082    
5083           ^.*abcd$           ^.*abcd$
5084    
5085         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
5086         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
5087         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
5088         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
5089         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
5090    
5091           ^.*+(?<=abcd)           ^.*+(?<=abcd)
5092    
5093         there can be no backtracking for the .*+ item; it can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
5094         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
5095         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
5096         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
5097         processing time.         processing time.
5098    
5099     Using multiple assertions     Using multiple assertions
# Line 5082  ASSERTIONS Line 5102  ASSERTIONS
5102    
5103           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
5104    
5105         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
5106         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
5107         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
5108         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
5109         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
5110         ceded  by  six  characters,  the first of which are digits and the last         ceded by six characters, the first of which are  digits  and  the  last
5111         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
5112         foo". A pattern to do that is         foo". A pattern to do that is
5113    
5114           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
5115    
5116         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
5117         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
5118         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
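
The two patterns above can be tried out with a short sketch. It uses Python's standard re module as a stand-in, on the assumption that it treats these fixed-length lookbehinds the same way PCRE does; the helper name matches() is made up for the illustration.

```python
import re

# Both lookbehinds are applied at the same point: immediately before "foo".
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first lookbehind inspects six characters, the second only three.
digits_then_any_three = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


print(matches(three_digits_not_999, "123foo"))
print(matches(three_digits_not_999, "999foo"))
print(matches(three_digits_not_999, "123abcfoo"))
print(matches(digits_then_any_three, "123abcfoo"))
```

The third call shows the point made in the text: "123abcfoo" is rejected by the first pattern because the three characters directly before "foo" are not digits.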
5119    
# Line 5101  ASSERTIONS Line 5121  ASSERTIONS
5121    
5122           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
5123    
5124         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
5125         is not preceded by "foo", while         is not preceded by "foo", while
5126    
5127           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
5128    
5129         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
5130         three characters that are not "999".         three characters that are not "999".
5131    
5132    
5133  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
5134    
5135         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
5136         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
5137         on  the result of an assertion, or whether a specific capturing subpat-         on the result of an assertion, or whether a specific capturing  subpat-
5138         tern has already been matched. The two possible  forms  of  conditional         tern  has  already  been matched. The two possible forms of conditional
5139         subpattern are:         subpattern are:
5140    
5141           (?(condition)yes-pattern)           (?(condition)yes-pattern)
5142           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
5143    
5144         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
5145         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
5146         tives  in  the subpattern, a compile-time error occurs. Each of the two         tives in the subpattern, a compile-time error occurs. Each of  the  two
5147         alternatives may itself contain nested subpatterns of any form, includ-         alternatives may itself contain nested subpatterns of any form, includ-
5148         ing  conditional  subpatterns;  the  restriction  to  two  alternatives         ing  conditional  subpatterns;  the  restriction  to  two  alternatives
5149         applies only at the level of the condition. This pattern fragment is an         applies only at the level of the condition. This pattern fragment is an
# Line 5132  CONDITIONAL SUBPATTERNS Line 5152  CONDITIONAL SUBPATTERNS
5152           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )           (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
5153    
5154    
5155         There  are  four  kinds of condition: references to subpatterns, refer-         There are four kinds of condition: references  to  subpatterns,  refer-
5156         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
5157    
5158     Checking for a used subpattern by number     Checking for a used subpattern by number
5159    
5160         If the text between the parentheses consists of a sequence  of  digits,         If  the  text between the parentheses consists of a sequence of digits,
5161         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
5162         viously matched. If there is more than one  capturing  subpattern  with         viously  matched.  If  there is more than one capturing subpattern with
5163         the  same  number  (see  the earlier section about duplicate subpattern         the same number (see the earlier  section  about  duplicate  subpattern
5164         numbers), the condition is true if any of them have matched. An  alter-         numbers),  the condition is true if any of them have matched. An alter-
5165         native  notation is to precede the digits with a plus or minus sign. In         native notation is to precede the digits with a plus or minus sign.  In
5166         this case, the subpattern number is relative rather than absolute.  The         this  case, the subpattern number is relative rather than absolute. The
5167         most  recently opened parentheses can be referenced by (?(-1), the next         most recently opened parentheses can be referenced by (?(-1), the  next
5168         most recent by (?(-2), and so on. Inside loops it can also  make  sense         most  recent  by (?(-2), and so on. Inside loops it can also make sense
5169         to refer to subsequent groups. The next parentheses to be opened can be         to refer to subsequent groups. The next parentheses to be opened can be
5170         referenced as (?(+1), and so on. (The value zero in any of these  forms         referenced  as (?(+1), and so on. (The value zero in any of these forms
5171         is not used; it provokes a compile-time error.)         is not used; it provokes a compile-time error.)
5172    
5173         Consider  the  following  pattern, which contains non-significant white         Consider the following pattern, which  contains  non-significant  white
5174         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
5175         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
5176    
5177           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
5178    
5179         The  first  part  matches  an optional opening parenthesis, and if that         The first part matches an optional opening  parenthesis,  and  if  that
5180         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
5181         ond  part  matches one or more characters that are not parentheses. The         ond part matches one or more characters that are not  parentheses.  The
5182         third part is a conditional subpattern that tests whether  or  not  the         third  part  is  a conditional subpattern that tests whether or not the
5183         first set of parentheses matched. If they did, that is, if the subject         first set of parentheses matched. If they did, that is, if the subject
5184         started with an opening parenthesis, the condition is true, and so the         started with an opening parenthesis, the condition is true, and so the
5185         yes-pattern  is  executed and a closing parenthesis is required. Other-         yes-pattern is executed and a closing parenthesis is  required.  Other-
5186         wise, since no-pattern is not present, the subpattern matches  nothing.         wise,  since no-pattern is not present, the subpattern matches nothing.
5187         In  other  words,  this  pattern matches a sequence of non-parentheses,         In other words, this pattern matches  a  sequence  of  non-parentheses,
5188         optionally enclosed in parentheses.         optionally enclosed in parentheses.
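
As a rough illustration, the same three-part pattern can be run with Python's re module, whose re.VERBOSE flag plays the role assumed here for PCRE_EXTENDED. This is a sketch of the behaviour, not PCRE itself; in particular, Python does not accept the relative form (?(-1)...). The names below are invented for the example.

```python
import re

# Part 1: optional "(" captured as group 1.
# Part 2: one or more non-parenthesis characters.
# Part 3: require ")" only if group 1 took part in the match.
OPTIONALLY_PARENTHESIZED = re.compile(
    r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE
)


def is_whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return OPTIONALLY_PARENTHESIZED.fullmatch(subject) is not None


for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, is_whole_match(subject))
```

Matching the whole subject makes the effect of the condition visible: an unbalanced parenthesis on either side causes the match to fail.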
5189    
5190         If you were embedding this pattern in a larger one,  you  could  use  a         If  you  were  embedding  this pattern in a larger one, you could use a
5191         relative reference:         relative reference:
5192    
5193           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
5194    
5195         This  makes  the  fragment independent of the parentheses in the larger         This makes the fragment independent of the parentheses  in  the  larger
5196         pattern.         pattern.
5197    
5198     Checking for a used subpattern by name     Checking for a used subpattern by name
5199    
5200         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a
5201         used  subpattern  by  name.  For compatibility with earlier versions of         used subpattern by name. For compatibility  with  earlier  versions  of
5202         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is
5203         also  recognized. However, there is a possible ambiguity with this syn-         also recognized. However, there is a possible ambiguity with this  syn-
5204         tax, because subpattern names may  consist  entirely  of  digits.  PCRE         tax,  because  subpattern  names  may  consist entirely of digits. PCRE
5205         looks  first for a named subpattern; if it cannot find one and the name         looks first for a named subpattern; if it cannot find one and the  name
5206         consists entirely of digits, PCRE looks for a subpattern of  that  num-         consists  entirely  of digits, PCRE looks for a subpattern of that num-
5207         ber,  which must be greater than zero. Using subpattern names that con-         ber, which must be greater than zero. Using subpattern names that  con-
5208         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
5209    
5210         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
5211    
5212           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
5213    
5214         If the name used in a condition of this kind is a duplicate,  the  test         If  the  name used in a condition of this kind is a duplicate, the test
5215         is  applied to all subpatterns of the same name, and is true if any one         is applied to all subpatterns of the same name, and is true if any  one
5216         of them has matched.         of them has matched.
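
A named condition can be sketched in Python's re module as well. Note the assumptions: Python spells the group (?P<OPEN>...) and the test (?(OPEN)...), which corresponds to the older (?(name)...) form mentioned above rather than to the Perl syntax, and the helper name open_paren() is hypothetical.

```python
import re

# Same pattern as before, but the condition refers to the group by name.
NAMED = re.compile(
    r"(?P<OPEN> \( )?    [^()]+    (?(OPEN) \) )", re.VERBOSE
)


def open_paren(subject):
    """Return what the OPEN group captured, or None if there was no match
    or the group did not take part in it."""
    m = NAMED.fullmatch(subject)
    return m.group("OPEN") if m else None


print(open_paren("(abc)"))
print(open_paren("abc"))
```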
5217    
5218     Checking for pattern recursion     Checking for pattern recursion
5219    
5220         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
5221         name  R, the condition is true if a recursive call to the whole pattern         name R, the condition is true if a recursive call to the whole  pattern
5222         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
5223         sand follow the letter R, for example:         sand follow the letter R, for example:
5224    
# Line 5206  CONDITIONAL SUBPATTERNS Line 5226  CONDITIONAL SUBPATTERNS
5226    
5227         the condition is true if the most recent recursion is into a subpattern         the condition is true if the most recent recursion is into a subpattern
5228         whose number or name is given. This condition does not check the entire         whose number or name is given. This condition does not check the entire
5229         recursion  stack.  If  the  name  used in a condition of this kind is a         recursion stack. If the name used in a condition  of  this  kind  is  a
5230         duplicate, the test is applied to all subpatterns of the same name, and         duplicate, the test is applied to all subpatterns of the same name, and
5231         is true if any one of them is the most recent recursion.         is true if any one of them is the most recent recursion.
5232    
5233         At  "top  level",  all  these recursion test conditions are false.  The         At "top level", all these recursion test  conditions  are  false.   The
5234         syntax for recursive patterns is described below.         syntax for recursive patterns is described below.
5235    
5236     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
5237    
5238         If the condition is the string (DEFINE), and  there  is  no  subpattern         If  the  condition  is  the string (DEFINE), and there is no subpattern
5239         with  the  name  DEFINE,  the  condition is always false. In this case,         with the name DEFINE, the condition is  always  false.  In  this  case,
5240         there may be only one alternative  in  the  subpattern.  It  is  always         there  may  be  only  one  alternative  in the subpattern. It is always
5241         skipped  if  control  reaches  this  point  in the pattern; the idea of         skipped if control reaches this point  in  the  pattern;  the  idea  of
5242         DEFINE is that it can be used to define subroutines that can be  refer-         DEFINE  is that it can be used to define subroutines that can be refer-
5243         enced  from elsewhere. (The use of subroutines is described below.) For         enced from elsewhere. (The use of subroutines is described below.)  For
5244         example, a pattern to match an IPv4 address  such  as  "192.168.23.245"         example,  a  pattern  to match an IPv4 address such as "192.168.23.245"
5245         could be written like this (ignore whitespace and line breaks):         could be written like this (ignore whitespace and line breaks):
5246    
5247           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
5248           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
5249    
5250         The first part of the pattern is a DEFINE group inside which another         The first part of the pattern is a DEFINE group inside which another
5251         group named "byte" is defined. This matches an individual component  of         group  named "byte" is defined. This matches an individual component of
5252         an  IPv4  address  (a number less than 256). When matching takes place,         an IPv4 address (a number less than 256). When  matching  takes  place,
5253         this part of the pattern is skipped because DEFINE acts  like  a  false         this  part  of  the pattern is skipped because DEFINE acts like a false
5254         condition.  The  rest of the pattern uses references to the named group         condition. The rest of the pattern uses references to the  named  group
5255         to match the four dot-separated components of an IPv4 address,  insist-         to  match the four dot-separated components of an IPv4 address, insist-
5256         ing on a word boundary at each end.         ing on a word boundary at each end.
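
Python's re module has neither DEFINE nor subroutine calls, so the sketch below only emulates the idea: the "byte" subpattern is written once as a string and substituted wherever (?&byte) appears above. The variable and function names are assumptions made for the example, not part of PCRE.

```python
import re

# Defined once, reused four times, in the spirit of (?(DEFINE)(?<byte>...)).
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Equivalent in intent to: \b (?&byte) (\.(?&byte)){3} \b
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(subject):
    """Return True if the whole subject is a dotted IPv4 address."""
    return IPV4.fullmatch(subject) is not None


for subject in ("192.168.23.245", "256.1.1.1", "1.2.3"):
    print(subject, is_ipv4(subject))
```

Textual substitution duplicates the subpattern in the compiled expression, whereas a PCRE subroutine call refers back to a single definition.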
5257    
5258     Assertion conditions     Assertion conditions
5259    
5260         If  the  condition  is  not  in any of the above formats, it must be an         If the condition is not in any of the above  formats,  it  must  be  an
5261         assertion.  This may be a positive or negative lookahead or  lookbehind         assertion.   This may be a positive or negative lookahead or lookbehind
5262         assertion.  Consider  this  pattern,  again  containing non-significant         assertion. Consider  this  pattern,  again  containing  non-significant
5263         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
5264    
5265           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
5266           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
5267    
5268         The condition  is  a  positive  lookahead  assertion  that  matches  an         The  condition  is  a  positive  lookahead  assertion  that  matches an
5269         optional  sequence of non-letters followed by a letter. In other words,         optional sequence of non-letters followed by a letter. In other  words,
5270         it tests for the presence of at least one letter in the subject.  If  a         it  tests  for the presence of at least one letter in the subject. If a
5271         letter  is found, the subject is matched against the first alternative;         letter is found, the subject is matched against the first  alternative;
5272         otherwise it is  matched  against  the  second.  This  pattern  matches         otherwise  it  is  matched  against  the  second.  This pattern matches
5273         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
5274         letters and dd are digits.         letters and dd are digits.
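
Python's re module accepts only group references as conditions, so an assertion condition cannot be written there directly. The sketch below emulates it, under that stated limitation, by guarding the first alternative with the lookahead and the second with its negation; all names are invented for the illustration.

```python
import re

CONDITION = r"[^a-z]*[a-z]"            # "there is a letter somewhere ahead"
WITH_LETTERS = r"\d{2}-[a-z]{3}-\d{2}"  # dd-aaa-dd
DIGITS_ONLY = r"\d{2}-\d{2}-\d{2}"      # dd-dd-dd

# (?(?=cond) yes | no) emulated as (?=cond)yes | (?!cond)no
EMULATED = re.compile(
    rf"(?:(?={CONDITION}){WITH_LETTERS}|(?!{CONDITION}){DIGITS_ONLY})"
)


def match_text(subject):
    """Return the text matched at the start of the subject, or None."""
    m = EMULATED.match(subject)
    return m.group(0) if m else None


print(match_text("12-abc-34"))
print(match_text("12-34-56"))
print(match_text("12-34-56 x"))
```

The last subject contains a letter, so only the first alternative is eligible; since it does not fit the dd-aaa-dd form, there is no match.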
5275    
5276    
# Line 5259  COMMENTS Line 5279  COMMENTS
5279         There are two ways of including comments in patterns that are processed         There are two ways of including comments in patterns that are processed
5280         by PCRE. In both cases, the start of the comment must not be in a char-         by PCRE. In both cases, the start of the comment must not be in a char-
5281         acter class, nor in the middle of any other sequence of related charac-         acter class, nor in the middle of any other sequence of related charac-
5282         ters  such  as  (?: or a subpattern name or number. The characters that         ters such as (?: or a subpattern name or number.  The  characters  that
5283         make up a comment play no part in the pattern matching.         make up a comment play no part in the pattern matching.
5284    
5285         The sequence (?# marks the start of a comment that continues up to  the         The  sequence (?# marks the start of a comment that continues up to the
5286         next  closing parenthesis. Nested parentheses are not permitted. If the         next closing parenthesis. Nested parentheses are not permitted. If  the
5287         PCRE_EXTENDED option is set, an unescaped # character also introduces a         PCRE_EXTENDED option is set, an unescaped # character also introduces a
5288         comment,  which  in  this  case continues to immediately after the next         comment, which in this case continues to  immediately  after  the  next
5289         newline character or character sequence in the pattern.  Which  charac-         newline  character  or character sequence in the pattern. Which charac-
5290         ters are interpreted as newlines is controlled by the options passed to         ters are interpreted as newlines is controlled by the options passed to
5291         pcre_compile() or by a special sequence at the start of the pattern, as         pcre_compile() or by a special sequence at the start of the pattern, as
5292         described  in  the  section  entitled "Newline conventions" above. Note         described in the section entitled  "Newline  conventions"  above.  Note
5293         that the end of this type of comment is a literal newline  sequence  in         that  the  end of this type of comment is a literal newline sequence in
5294         the pattern; escape sequences that happen to represent a newline do not         the pattern; escape sequences that happen to represent a newline do not
5295         count. For example, consider this pattern when  PCRE_EXTENDED  is  set,         count.  For  example,  consider this pattern when PCRE_EXTENDED is set,
5296         and the default newline convention is in force:         and the default newline convention is in force:
5297    
5298           abc #comment \n still comment           abc #comment \n still comment
5299    
5300         On  encountering  the  # character, pcre_compile() skips along, looking         On encountering the # character, pcre_compile()  skips  along,  looking
5301         for a newline in the pattern. The sequence \n is still literal at  this         for  a newline in the pattern. The sequence \n is still literal at this
5302         stage,  so  it does not terminate the comment. Only an actual character         stage, so it does not terminate the comment. Only an  actual  character
5303         with the code value 0x0a (the default newline) does so.         with the code value 0x0a (the default newline) does so.
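
Python's re.VERBOSE mode treats # comments in an analogous way, which allows a quick experiment. This is an illustration using a different library, not a test of pcre_compile(); the variable names are assumptions.

```python
import re

# Raw string: the pattern contains a backslash followed by "n", which is an
# escape sequence, not a newline, so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" places a real newline character (0x0a) in the pattern,
# which ends the comment, so "def" is part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

print(escaped.fullmatch("abc") is not None)
print(literal.fullmatch("abcdef") is not None)
```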
5304    
5305    
5306  RECURSIVE PATTERNS  RECURSIVE PATTERNS
5307    
5308         Consider the problem of matching a string in parentheses, allowing  for         Consider  the problem of matching a string in parentheses, allowing for
5309         unlimited  nested  parentheses.  Without the use of recursion, the best         unlimited nested parentheses. Without the use of  recursion,  the  best
5310         that can be done is to use a pattern that  matches  up  to  some  fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
5311         depth  of  nesting.  It  is not possible to handle an arbitrary nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
5312         depth.         depth.
5313    
5314         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
5315         sions  to recurse (amongst other things). It does this by interpolating         sions to recurse (amongst other things). It does this by  interpolating
5316         Perl code in the expression at run time, and the code can refer to  the         Perl  code in the expression at run time, and the code can refer to the
5317         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
5318         parentheses problem can be created like this:         parentheses problem can be created like this:
5319    
# Line 5303  RECURSIVE PATTERNS Line 5323  RECURSIVE PATTERNS
5323         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
5324    
5325         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
5326         it supports special syntax for recursion of  the  entire  pattern,  and         it  supports  special  syntax  for recursion of the entire pattern, and
5327         also  for  individual  subpattern  recursion. After its introduction in         also for individual subpattern recursion.  After  its  introduction  in
5328         PCRE and Python, this kind of  recursion  was  subsequently  introduced         PCRE  and  Python,  this  kind of recursion was subsequently introduced
5329         into Perl at release 5.10.         into Perl at release 5.10.
5330    
5331         A  special  item  that consists of (? followed by a number greater than         A special item that consists of (? followed by a  number  greater  than
5332         zero and a closing parenthesis is a recursive subroutine  call  of  the         zero  and  a  closing parenthesis is a recursive subroutine call of the
5333         subpattern  of  the  given  number, provided that it occurs inside that         subpattern of the given number, provided that  it  occurs  inside  that
5334         subpattern. (If not, it is a non-recursive subroutine  call,  which  is         subpattern.  (If  not,  it is a non-recursive subroutine call, which is
5335         described  in  the  next  section.)  The special item (?R) or (?0) is a         described in the next section.) The special item  (?R)  or  (?0)  is  a
5336         recursive call of the entire regular expression.         recursive call of the entire regular expression.
5337    
5338         This PCRE pattern solves the nested  parentheses  problem  (assume  the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
5339         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
5340    
5341           \( ( [^()]++ | (?R) )* \)           \( ( [^()]++ | (?R) )* \)
5342    
5343         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
5344         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
5345         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
5346         sized substring).  Finally there is a closing parenthesis. Note the use         sized substring).  Finally there is a closing parenthesis. Note the use
5347         of a possessive quantifier to avoid backtracking into sequences of non-         of a possessive quantifier to avoid backtracking into sequences of non-
5348         parentheses.         parentheses.
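
Python's re module does not support (?R), so the following sketch is not a regular expression at all. It is a small hand-written recursive function that accepts the same strings as the pattern above, intended only to show the shape of the recursion; the name match_parens() is hypothetical.

```python
def match_parens(subject, start=0):
    """Return the index just past a balanced group that begins at `start`,
    or -1 if there is none."""
    # \(  : the group must begin with an opening parenthesis.
    if start >= len(subject) or subject[start] != "(":
        return -1
    pos = start + 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # \)  : the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            # (?R) : a nested group is matched by calling ourselves.
            pos = match_parens(subject, pos)
            if pos == -1:
                return -1
        else:
            # [^()]++ : consume a non-parenthesis character.
            pos += 1
    return -1  # ran out of subject before finding the closing parenthesis


print(match_parens("(ab(cd)ef)"))
print(match_parens("(aaaa()"))
```

Each character is consumed once and never revisited, which loosely mirrors the effect of the possessive quantifier in the pattern.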
5349    
5350         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
5351         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
5352    
5353           ( \( ( [^()]++ | (?1) )* \) )           ( \( ( [^()]++ | (?1) )* \) )
5354    
5355         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
5356         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
5357    
5358         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
5359         tricky.  This is made easier by the use of relative references. Instead         tricky. This is made easier by the use of relative references.  Instead
5360         of (?1) in the pattern above you can write (?-2) to refer to the second         of (?1) in the pattern above you can write (?-2) to refer to the second
5361         most  recently  opened  parentheses  preceding  the recursion. In other         most recently opened parentheses  preceding  the  recursion.  In  other
5362         words, a negative number counts capturing  parentheses  leftwards  from         words,  a  negative  number counts capturing parentheses leftwards from
5363         the point at which it is encountered.         the point at which it is encountered.
5364    
5365         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
5366         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
5367         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
5368         enced. They are always non-recursive subroutine calls, as described  in         enced.  They are always non-recursive subroutine calls, as described in
5369         the next section.         the next section.
5370    
5371         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
5372         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
5373         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
5374    
5375           (?<pn> \( ( [^()]++ | (?&pn) )* \) )           (?<pn> \( ( [^()]++ | (?&pn) )* \) )
5376    
5377         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
5378         one is used.         one is used.
5379    
5380         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
5381         nested unlimited repeats, and so the use of a possessive quantifier for         nested unlimited repeats, and so the use of a possessive quantifier for
5382         matching strings of non-parentheses is important when applying the pat-         matching strings of non-parentheses is important when applying the pat-
5383         tern  to  strings  that do not match. For example, when this pattern is         tern to strings that do not match. For example, when  this  pattern  is
5384         applied to         applied to
5385    
5386           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
5387    
5388         it yields "no match" quickly. However, if a  possessive  quantifier  is         it  yields  "no  match" quickly. However, if a possessive quantifier is
5389         not  used, the match runs for a very long time indeed because there are         not used, the match runs for a very long time indeed because there  are
5390         so many different ways the + and * repeats can carve  up  the  subject,         so  many  different  ways the + and * repeats can carve up the subject,
5391         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
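The scale of the problem can be estimated with a small counting model. The Python sketch below is not PCRE code, and the name ways_to_split is hypothetical; it only counts how many ways a run of n non-parenthesis characters can be cut into non-empty pieces, which is roughly the set of cases that nested + and * repeats have to work through when no possessive quantifier is used.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def ways_to_split(n):
    """Count the ways to cut a run of n characters into non-empty chunks."""
    if n == 0:
        return 1  # nothing left to cut: one (empty) way to finish
    # Choose the length k of the first chunk, then split the remainder.
    return sum(ways_to_split(n - k) for k in range(1, n + 1))

for n in (5, 10, 20, 50):
    print(n, ways_to_split(n))
```

The count doubles with each extra character (it equals 2 to the power n-1), so a run of about fifty characters already produces far more cases than can be tested in a reasonable time.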
5392    
5393         At  the  end  of a match, the values of capturing parentheses are those         At the end of a match, the values of capturing  parentheses  are  those
5394         from the outermost level. If you want to obtain intermediate values,  a         from  the outermost level. If you want to obtain intermediate values, a
5395         callout  function can be used (see below and the pcrecallout documenta-         callout function can be used (see below and the pcrecallout  documenta-
5396         tion). If the pattern above is matched against         tion). If the pattern above is matched against
5397    
5398           (ab(cd)ef)           (ab(cd)ef)
5399    
5400         the value for the inner capturing parentheses  (numbered  2)  is  "ef",         the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
5401         which  is the last value taken on at the top level. If a capturing sub-         which is the last value taken on at the top level. If a capturing  sub-
5402         pattern is not matched at the top level, its final  captured  value  is         pattern  is  not  matched at the top level, its final captured value is
5403         unset,  even  if  it was (temporarily) set at a deeper level during the         unset, even if it was (temporarily) set at a deeper  level  during  the
5404         matching process.         matching process.
5405    
5406         If there are more than 15 capturing parentheses in a pattern, PCRE  has         If  there are more than 15 capturing parentheses in a pattern, PCRE has
5407         to  obtain extra memory to store data during a recursion, which it does         to obtain extra memory to store data during a recursion, which it  does
5408         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
5409         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
5410    
5411         Do  not  confuse  the (?R) item with the condition (R), which tests for         Do not confuse the (?R) item with the condition (R),  which  tests  for
5412         recursion.  Consider this pattern, which matches text in  angle  brack-         recursion.   Consider  this pattern, which matches text in angle brack-
5413         ets,  allowing for arbitrary nesting. Only digits are allowed in nested         ets, allowing for arbitrary nesting. Only digits are allowed in  nested
5414         brackets (that is, when recursing), whereas any characters are  permit-         brackets  (that is, when recursing), whereas any characters are permit-
5415         ted at the outer level.         ted at the outer level.
5416    
5417           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
5418    
5419         In  this  pattern, (?(R) is the start of a conditional subpattern, with         In this pattern, (?(R) is the start of a conditional  subpattern,  with
5420         two different alternatives for the recursive and  non-recursive  cases.         two  different  alternatives for the recursive and non-recursive cases.
5421         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
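The logic of this pattern can be sketched as a hand-written recursive function. This is an illustrative Python model, not how PCRE works internally; the function name match_brackets and its nested flag are hypothetical, the flag plays the part of the (R) condition, and the function models a single attempt anchored at offset i rather than a search along the subject.

```python
DIGITS = "0123456789"

def match_brackets(s, i=0, nested=False):
    """Return the end offset of a bracketed item starting at s[i], or None."""
    if i >= len(s) or s[i] != "<":
        return None
    i += 1
    while i < len(s) and s[i] != ">":
        if s[i] == "<":
            # Plays the part of (?R): inside the call the (R) test is true.
            i = match_brackets(s, i, nested=True)
            if i is None:
                return None
        elif nested and s[i] not in DIGITS:
            return None  # only digits are allowed when recursing
        else:
            i += 1
    # A closing ">" must be present for the item to match.
    return i + 1 if i < len(s) else None

print(match_brackets("<abc<123>de>"))
print(match_brackets("<abc<12x>de>"))
```

The first call accepts the whole subject because the nested brackets contain only digits; the second fails at the "x" inside the nested brackets.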
5422    
5423     Differences in recursion processing between PCRE and Perl     Differences in recursion processing between PCRE and Perl
5424    
5425         Recursion  processing  in PCRE differs from Perl in two important ways.         Recursion processing in PCRE differs from Perl in two  important  ways.
5426         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
5427         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
5428         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
5429         alternatives  and  there  is a subsequent matching failure. This can be         alternatives and there is a subsequent matching failure.  This  can  be
5430         illustrated by the following pattern, which purports to match a  palin-         illustrated  by the following pattern, which purports to match a palin-
5431         dromic  string  that contains an odd number of characters (for example,         dromic string that contains an odd number of characters  (for  example,
5432         "a", "aba", "abcba", "abcdcba"):         "a", "aba", "abcba", "abcdcba"):
5433    
5434           ^(.|(.)(?1)\2)$           ^(.|(.)(?1)\2)$
5435    
5436         The idea is that it either matches a single character, or two identical         The idea is that it either matches a single character, or two identical
5437         characters  surrounding  a sub-palindrome. In Perl, this pattern works;         characters surrounding a sub-palindrome. In Perl, this  pattern  works;
5438         in PCRE it does not if the subject is  longer  than  three  characters.         in  PCRE  it  does  not if the subject is longer than three characters.
5439         Consider the subject string "abcba":         Consider the subject string "abcba":
5440    
5441         At  the  top level, the first character is matched, but as it is not at         At the top level, the first character is matched, but as it is  not  at
5442         the end of the string, the first alternative fails; the second alterna-         the end of the string, the first alternative fails; the second alterna-
5443         tive is taken and the recursion kicks in. The recursive call to subpat-         tive is taken and the recursion kicks in. The recursive call to subpat-
5444         tern 1 successfully matches the next character ("b").  (Note  that  the         tern  1  successfully  matches the next character ("b"). (Note that the
5445         beginning and end of line tests are not part of the recursion).         beginning and end of line tests are not part of the recursion).
5446    
5447         Back  at  the top level, the next character ("c") is compared with what         Back at the top level, the next character ("c") is compared  with  what
5448         subpattern 2 matched, which was "a". This fails. Because the  recursion         subpattern  2 matched, which was "a". This fails. Because the recursion
5449         is  treated  as  an atomic group, there are now no backtracking points,         is treated as an atomic group, there are now  no  backtracking  points,
5450         and so the entire match fails. (Perl is able, at  this  point,  to  re-         and  so  the  entire  match fails. (Perl is able, at this point, to re-
5451         enter  the  recursion  and try the second alternative.) However, if the         enter the recursion and try the second alternative.)  However,  if  the
5452         pattern is written with the alternatives in the other order, things are         pattern is written with the alternatives in the other order, things are
5453         different:         different:
5454    
5455           ^((.)(?1)\2|.)$           ^((.)(?1)\2|.)$
5456    
5457         This  time,  the recursing alternative is tried first, and continues to         This time, the recursing alternative is tried first, and  continues  to
5458         recurse until it runs out of characters, at which point  the  recursion         recurse  until  it runs out of characters, at which point the recursion
5459         fails.  But  this  time  we  do  have another alternative to try at the         fails. But this time we do have  another  alternative  to  try  at  the
5460         higher level. That is the big difference:  in  the  previous  case  the         higher  level.  That  is  the  big difference: in the previous case the
5461         remaining alternative is at a deeper recursion level, which PCRE cannot         remaining alternative is at a deeper recursion level, which PCRE cannot
5462         use.         use.
5463    
5464         To change the pattern so that it matches all palindromic  strings,  not         To  change  the pattern so that it matches all palindromic strings, not
5465         just  those  with an odd number of characters, it is tempting to change         just those with an odd number of characters, it is tempting  to  change
5466         the pattern to this:         the pattern to this:
5467    
5468           ^((.)(?1)\2|.?)$           ^((.)(?1)\2|.?)$
5469    
5470         Again, this works in Perl, but not in PCRE, and for  the  same  reason.         Again,  this  works  in Perl, but not in PCRE, and for the same reason.
5471         When  a  deeper  recursion has matched a single character, it cannot be         When a deeper recursion has matched a single character,  it  cannot  be
5472         entered again in order to match an empty string.  The  solution  is  to         entered  again  in  order  to match an empty string. The solution is to
5473         separate  the two cases, and write out the odd and even cases as alter-         separate the two cases, and write out the odd and even cases as  alter-
5474         natives at the higher level:         natives at the higher level:
5475    
5476           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
5477    
5478         If you want to match typical palindromic phrases, the  pattern  has  to         If  you  want  to match typical palindromic phrases, the pattern has to
5479         ignore all non-word characters, which can be done like this:         ignore all non-word characters, which can be done like this:
5480    
5481           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
5482    
5483         If run with the PCRE_CASELESS option, this pattern matches phrases such         If run with the PCRE_CASELESS option, this pattern matches phrases such
5484         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
5485         Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-         Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
5486         ing into sequences of non-word characters. Without this, PCRE  takes  a         ing  into  sequences of non-word characters. Without this, PCRE takes a
5487         great  deal  longer  (ten  times or more) to match typical phrases, and         great deal longer (ten times or more) to  match  typical  phrases,  and
5488         Perl takes so long that you think it has gone into a loop.         Perl takes so long that you think it has gone into a loop.
5489    
5490         WARNING: The palindrome-matching patterns above work only if  the  sub-         WARNING:  The  palindrome-matching patterns above work only if the sub-
5491         ject  string  does not start with a palindrome that is shorter than the         ject string does not start with a palindrome that is shorter  than  the
5492         entire string.  For example, although "abcba" is correctly matched,  if         entire  string.  For example, although "abcba" is correctly matched, if
5493         the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,         the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
5494         then fails at top level because the end of the string does not  follow.         then  fails at top level because the end of the string does not follow.
5495         Once  again, it cannot jump back into the recursion to try other alter-         Once again, it cannot jump back into the recursion to try other  alter-
5496         natives, so the entire match fails.         natives, so the entire match fails.
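The effect of treating a recursive call as an atomic group can be modelled outside PCRE. The Python sketch below is an assumption-laden model, not either engine's implementation: group1 and whole_match are hypothetical names, each generator yields the end offsets at which group 1 could match, and the atomic flag keeps only the first result of a recursive call (the PCRE behaviour described above), whereas leaving it off lets the caller re-enter the recursion (the Perl behaviour). The single_first flag selects between the alternative orders of ^(.|(.)(?1)\2)$ and ^((.)(?1)\2|.)$.

```python
from itertools import islice

def group1(s, i, atomic, single_first):
    """Yield each end offset at which group 1 can match s starting at i."""
    def single():
        # Alternative "." : any one character
        if i < len(s):
            yield i + 1

    def wrapped():
        # Alternative "(.)(?1)\2" : a character, a recursive call, then the
        # same character again
        if i < len(s):
            inner = group1(s, i + 1, atomic, single_first)
            if atomic:
                inner = islice(inner, 1)  # keep only the first result
            for end in inner:
                if end < len(s) and s[end] == s[i]:
                    yield end + 1

    order = (single, wrapped) if single_first else (wrapped, single)
    for alternative in order:
        yield from alternative()

def whole_match(s, atomic, single_first):
    """Model ^(...)$ : the top-level group itself may backtrack freely."""
    return any(end == len(s) for end in group1(s, 0, atomic, single_first))

for subject in ("aba", "abcba", "ababa"):
    print(subject,
          whole_match(subject, atomic=True, single_first=True),
          whole_match(subject, atomic=True, single_first=False),
          whole_match(subject, atomic=False, single_first=False))
```

In this model "abcba" is rejected with the single-character alternative first and accepted with the alternatives reversed, while "ababa" is rejected in the atomic case either way and accepted only when the recursion can be re-entered.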
5497    
5498         The second way in which PCRE and Perl differ in  their  recursion  pro-         The  second  way  in which PCRE and Perl differ in their recursion pro-
5499         cessing  is in the handling of captured values. In Perl, when a subpat-         cessing is in the handling of captured values. In Perl, when a  subpat-
5500         tern is called recursively or as a subroutine (see the  next  section),         tern  is  called recursively or as a subroutine (see the next section),
5501         it  has  no  access to any values that were captured outside the recur-         it has no access to any values that were captured  outside  the  recur-
5502         sion, whereas in PCRE these values can  be  referenced.  Consider  this         sion,  whereas  in  PCRE  these values can be referenced. Consider this
5503         pattern:         pattern:
5504    
5505           ^(.)(\1|a(?2))           ^(.)(\1|a(?2))
5506    
5507         In  PCRE,  this  pattern matches "bab". The first capturing parentheses         In PCRE, this pattern matches "bab". The  first  capturing  parentheses
5508         match "b", then in the second group, when the back reference  \1  fails         match  "b",  then in the second group, when the back reference \1 fails
5509         to  match "b", the second alternative matches "a" and then recurses. In         to match "b", the second alternative matches "a" and then recurses.  In
5510         the recursion, \1 does now match "b" and so the whole  match  succeeds.         the  recursion,  \1 does now match "b" and so the whole match succeeds.
5511         In  Perl,  the pattern fails to match because inside the recursive call         In Perl, the pattern fails to match because inside the  recursive  call
5512         \1 cannot access the externally set value.         \1 cannot access the externally set value.
5513    
5514    
5515  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
5516    
5517         If the syntax for a recursive subpattern call (either by number  or  by         If  the  syntax for a recursive subpattern call (either by number or by
5518         name)  is  used outside the parentheses to which it refers, it operates         name) is used outside the parentheses to which it refers,  it  operates
5519         like a subroutine in a programming language. The called subpattern  may         like  a subroutine in a programming language. The called subpattern may
5520         be  defined  before or after the reference. A numbered reference can be         be defined before or after the reference. A numbered reference  can  be
5521         absolute or relative, as in these examples:         absolute or relative, as in these examples:
5522    
5523           (...(absolute)...)...(?2)...           (...(absolute)...)...(?2)...
# Line 5508  SUBPATTERNS AS SUBROUTINES Line 5528  SUBPATTERNS AS SUBROUTINES
5528    
5529           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
5530    
5531         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
5532         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
5533    
5534           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
5535    
5536         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
5537         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
5538         above.         above.
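The difference between a back reference and a subroutine call can be tried out in a language without subroutine calls by writing the group out a second time. The Python sketch below does that; Python's re module does not support (?1), so the second pattern is only a stand-in for it, and the helper name accepts is hypothetical.

```python
import re

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(sens|respons)e and \1ibility")

# Stand-in for (sens|respons)e and (?1)ibility: the subpattern is run again,
# so either alternative may match the second time.
inlined = re.compile(r"(sens|respons)e and (?:sens|respons)ibility")

def accepts(pattern, text):
    """Return True if the compiled pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None

for text in ("sense and sensibility", "sense and responsibility"):
    print(text, accepts(backref, text), accepts(inlined, text))
```

Only the second pattern accepts "sense and responsibility", because it re-runs the subpattern instead of comparing against the previously captured text.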
5539    
5540         All  subroutine  calls, whether recursive or not, are always treated as         All subroutine calls, whether recursive or not, are always  treated  as
5541         atomic groups. That is, once a subroutine has matched some of the  sub-         atomic  groups. That is, once a subroutine has matched some of the sub-
5542         ject string, it is never re-entered, even if it contains untried alter-         ject string, it is never re-entered, even if it contains untried alter-
5543         natives and there is  a  subsequent  matching  failure.  Any  capturing         natives  and  there  is  a  subsequent  matching failure. Any capturing
5544         parentheses  that  are  set  during the subroutine call revert to their         parentheses that are set during the subroutine  call  revert  to  their
5545         previous values afterwards.         previous values afterwards.
5546    
5547         Processing options such as case-independence are fixed when  a  subpat-         Processing  options  such as case-independence are fixed when a subpat-
5548         tern  is defined, so if it is used as a subroutine, such options cannot         tern is defined, so if it is used as a subroutine, such options  cannot
5549         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
5550    
5551           (abc)(?i:(?-1))           (abc)(?i:(?-1))
5552    
5553         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
5554         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
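The same behaviour can be imitated where subroutine calls are not available. In the Python sketch below (an emulation, since Python's re module has no (?-1)), the called group is written out again and its case sensitivity is pinned by hand with (?-i:...), which is the effect PCRE applies implicitly to a called subpattern. The variable names are hypothetical.

```python
import re

# Stand-in for (abc)(?i:(?-1)): the inlined copy keeps the case-sensitive
# setting that was in force where group 1 was defined.
fixed_at_definition = re.compile(r"(abc)(?i:(?-i:abc))")

# For contrast: what would happen if the caller's option did apply.
inherits_from_caller = re.compile(r"(abc)(?i:abc)")

for subject in ("abcabc", "abcABC"):
    print(subject,
          fixed_at_definition.fullmatch(subject) is not None,
          inherits_from_caller.fullmatch(subject) is not None)
```

Only the contrast pattern accepts "abcABC"; the first pattern behaves as described above for PCRE.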
5555    
5556    
5557  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
5558    
5559         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
5560         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
5561         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
5562         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
5563         ten using this syntax:         ten using this syntax:
5564    
5565           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
5566           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
5567    
5568         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
5569         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
5570    
5571           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
5572    
5573         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
5574         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
5575         call.         call.
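The correspondence between the two subroutine syntaxes is mechanical, as the following Python sketch suggests. It is a naive text rewrite under stated assumptions: the name onig_to_perl is hypothetical, and the function does not check whether \g appears inside a character class or after an escaped backslash. It leaves \g{...} alone, since that is a back reference.

```python
import re

# \g followed by a reference in angle brackets or single quotes
_ONIG_CALL = re.compile(r"\\g(?:<([^>]+)>|'([^']+)')")

def onig_to_perl(pattern):
    """Rewrite Oniguruma-style subroutine calls in Perl-style syntax."""
    def rewrite(m):
        ref = m.group(1) or m.group(2)
        if re.fullmatch(r"[+-]?\d+", ref):
            return "(?" + ref + ")"   # numbered: absolute or relative
        return "(?&" + ref + ")"      # named
    return _ONIG_CALL.sub(rewrite, pattern)

print(onig_to_perl(r"(sens|respons)e and \g'1'ibility"))
print(onig_to_perl(r"(abc)(?i:\g<-1>)"))
print(onig_to_perl(r"(a)\g{1}"))
```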
5576    
5577    
5578  CALLOUTS  CALLOUTS
5579    
5580         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
5581         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
5582         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
5583         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
5584         tion.         tion.
5585    
5586         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
5587         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
5588         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
5589         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
5590         all calling out.         all calling out.
5591    
5592         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
5593         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
5594         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
5595         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
5596         points:         points:
5597    
5598           (?C1)abc(?C2)def           (?C1)abc(?C2)def
5599    
5600         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
5601         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
5602         numbered 255.         numbered 255.
5603    
5604         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
5605         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
5606         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
5607         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
5608         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
5609         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
5610         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
5611    
5612    
5613  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5614    
5615         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
5616         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5617         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
5618         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
5619         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5620         in this section.         in this section.
5621    
5622         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
5623         them can be  used  only  when  the  pattern  is  to  be  matched  using         them  can  be  used  only  when  the  pattern  is  to  be matched using
5624         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5625         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5626         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5627    
5628         If  any of these verbs are used in an assertion or in a subpattern that         If any of these verbs are used in an assertion or in a subpattern  that
5629         is called as a subroutine (whether or not recursively), their effect is         is called as a subroutine (whether or not recursively), their effect is
5630         confined to that subpattern; it does not extend to the surrounding pat-         confined to that subpattern; it does not extend to the surrounding pat-
5631         tern, with one exception: a *MARK that is  encountered  in  a  positive         tern,  with  one  exception:  a *MARK that is encountered in a positive
5632         assertion is passed back (compare capturing parentheses in assertions).         assertion is passed back (compare capturing parentheses in assertions).
5633         Note that such subpatterns are processed as anchored at the point where         Note that such subpatterns are processed as anchored at the point where
5634         they are tested. Note also that Perl's treatment of subroutines is dif-         they are tested. Note also that Perl's treatment of subroutines is dif-
5635         ferent in some cases.         ferent in some cases.
5636    
5637         The new verbs make use of what was previously invalid syntax: an  open-         The  new verbs make use of what was previously invalid syntax: an open-
5638         ing parenthesis followed by an asterisk. They are generally of the form         ing parenthesis followed by an asterisk. They are generally of the form
5639         (*VERB) or (*VERB:NAME). Some may take either form, with differing  be-         (*VERB)  or (*VERB:NAME). Some may take either form, with differing be-
5640         haviour,  depending on whether or not an argument is present. A name is         haviour, depending on whether or not an argument is present. A name  is
5641         any sequence of characters that does not include a closing parenthesis.         any sequence of characters that does not include a closing parenthesis.
5642         If  the  name is empty, that is, if the closing parenthesis immediately         If the name is empty, that is, if the closing  parenthesis  immediately
5643         follows the colon, the effect is as if the colon were  not  there.  Any         follows  the  colon,  the effect is as if the colon were not there. Any
5644         number of these verbs may occur in a pattern.         number of these verbs may occur in a pattern.
5645    
5646         PCRE  contains some optimizations that are used to speed up matching by         PCRE contains some optimizations that are used to speed up matching  by
5647         running some checks at the start of each match attempt. For example, it         running some checks at the start of each match attempt. For example, it
5648         may  know  the minimum length of matching subject, or that a particular         may know the minimum length of matching subject, or that  a  particular
5649         character must be present. When one of these  optimizations  suppresses         character  must  be present. When one of these optimizations suppresses
5650         the  running  of  a match, any included backtracking verbs will not, of         the running of a match, any included backtracking verbs  will  not,  of
5651         course, be processed. You can suppress the start-of-match optimizations         course, be processed. You can suppress the start-of-match optimizations
5652         by  setting  the  PCRE_NO_START_OPTIMIZE  option when calling pcre_com-         by setting the PCRE_NO_START_OPTIMIZE  option  when  calling  pcre_com-
5653         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).         pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT).
5654    
5655     Verbs that act immediately     Verbs that act immediately
5656    
5657         The following verbs act as soon as they are encountered. They  may  not         The  following  verbs act as soon as they are encountered. They may not
5658         be followed by a name.         be followed by a name.
5659    
5660            (*ACCEPT)            (*ACCEPT)
5661    
5662         This  verb causes the match to end successfully, skipping the remainder         This verb causes the match to end successfully, skipping the  remainder
5663         of the pattern. However, when it is inside a subpattern that is  called         of  the pattern. However, when it is inside a subpattern that is called
5664         as  a  subroutine, only that subpattern is ended successfully. Matching         as a subroutine, only that subpattern is ended  successfully.  Matching
5665         then continues at the outer level. If  (*ACCEPT)  is  inside  capturing         then  continues  at  the  outer level. If (*ACCEPT) is inside capturing
5666         parentheses, the data so far is captured. For example:         parentheses, the data so far is captured. For example:
5667    
5668           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
5669    
5670         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
5671         tured by the outer parentheses.         tured by the outer parentheses.
5672    
5673           (*FAIL) or (*F)           (*FAIL) or (*F)
5674    
5675         This verb causes a matching failure, forcing backtracking to occur.  It         This  verb causes a matching failure, forcing backtracking to occur. It
5676         is  equivalent to (?!) but easier to read. The Perl documentation notes         is equivalent to (?!) but easier to read. The Perl documentation  notes
5677         that it is probably useful only when combined  with  (?{})  or  (??{}).         that  it  is  probably  useful only when combined with (?{}) or (??{}).
5678         Those  are,  of course, Perl features that are not present in PCRE. The         Those are, of course, Perl features that are not present in  PCRE.  The
5679         nearest equivalent is the callout feature, as for example in this  pat-         nearest  equivalent is the callout feature, as for example in this pat-
5680         tern:         tern:
5681    
5682           a+(?C)(*FAIL)           a+(?C)(*FAIL)
5683    
5684         A  match  with the string "aaaa" always fails, but the callout is taken         A match with the string "aaaa" always fails, but the callout  is  taken
5685         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
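
The figure of 10 callouts can be checked with a small model. The Python sketch below is not PCRE and does not call any PCRE function; it is an illustration that assumes a plain backtracking matcher with no start-of-match optimizations, and the name count_callouts is hypothetical. At each starting position it tries the greedy a+ at every length from the longest run down to one character, and each try reaches the callout before (*FAIL) forces the next backtrack.

```python
def count_callouts(subject: str) -> int:
    """Model of a+(?C)(*FAIL): count how often the callout would be reached.

    Assumes a simple backtracking matcher with no start-of-match
    optimizations; this is an illustration, not PCRE itself.
    """
    callouts = 0
    # "Bumpalong": every offset, including the end of the subject, is tried.
    for start in range(len(subject) + 1):
        # Length of the run of "a" characters beginning at this offset.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ is tried at each length from the full run down to 1.
        # Each try reaches (?C); (*FAIL) then forces a backtrack.
        for _length in range(run, 0, -1):
            callouts += 1
    return callouts


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10
```

The four starting positions inside "aaaa" contribute 4, 3, 2, and 1 callouts respectively; the attempt at the end of the subject contributes none because a+ cannot match there.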
5686    
5687     Recording which path was taken     Recording which path was taken
5688    
5689         There is one verb whose main purpose  is  to  track  how  a  match  was         There  is  one  verb  whose  main  purpose  is to track how a match was
5690         arrived  at,  though  it  also  has a secondary use in conjunction with         arrived at, though it also has a  secondary  use  in  conjunction  with
5691         advancing the match starting point (see (*SKIP) below).         advancing the match starting point (see (*SKIP) below).
5692    
5693           (*MARK:NAME) or (*:NAME)           (*MARK:NAME) or (*:NAME)
5694    
5695         A name is always  required  with  this  verb.  There  may  be  as  many         A  name  is  always  required  with  this  verb.  There  may be as many
5696         instances  of  (*MARK) as you like in a pattern, and their names do not         instances of (*MARK) as you like in a pattern, and their names  do  not
5697         have to be unique.         have to be unique.
5698    
5699         When a match succeeds, the name  of  the  last-encountered  (*MARK)  is         When  a  match  succeeds,  the  name of the last-encountered (*MARK) is
5700         passed  back  to  the  caller  via  the  pcre_extra  data structure, as         passed back to  the  caller  via  the  pcre_extra  data  structure,  as
5701         described in the section on pcre_extra in the pcreapi documentation. No         described in the section on pcre_extra in the pcreapi documentation. No
5702         data  is  returned  for a partial match. Here is an example of pcretest         data is returned for a partial match. Here is an  example  of  pcretest
5703         output, where the /K modifier requests the retrieval and outputting  of         output,  where the /K modifier requests the retrieval and outputting of
5704         (*MARK) data:         (*MARK) data:
5705    
5706           /X(*MARK:A)Y|X(*MARK:B)Z/K           /X(*MARK:A)Y|X(*MARK:B)Z/K
# Line 5692  BACKTRACKING CONTROL Line 5712  BACKTRACKING CONTROL
5712           MK: B           MK: B
5713    
5714         The (*MARK) name is tagged with "MK:" in this output, and in this exam-         The (*MARK) name is tagged with "MK:" in this output, and in this exam-
5715         ple it indicates which of the two alternatives matched. This is a  more         ple  it indicates which of the two alternatives matched. This is a more
5716         efficient  way of obtaining this information than putting each alterna-         efficient way of obtaining this information than putting each  alterna-
5717         tive in its own capturing parentheses.         tive in its own capturing parentheses.
5718    
5719         If (*MARK) is encountered in a positive assertion, its name is recorded         If (*MARK) is encountered in a positive assertion, its name is recorded
5720         and passed back if it is the last-encountered. This does not happen for         and passed back if it is the last-encountered. This does not happen for
5721         negative assertions.         negative assertions.
5722    
5723         A name may also be returned after a failed  match  if  the  final  path         A  name  may  also  be  returned after a failed match if the final path
5724         through the pattern involves (*MARK). However, unless (*MARK) is used in         through the pattern involves (*MARK). However, unless (*MARK) is used in
5725         conjunction with (*COMMIT), this is unlikely to  happen  for  an  unan-         conjunction  with  (*COMMIT),  this  is unlikely to happen for an unan-
5726         chored pattern because, as the starting point for matching is advanced,         chored pattern because, as the starting point for matching is advanced,
5727         the final check is often with an empty string, causing a failure before         the final check is often with an empty string, causing a failure before
5728         (*MARK) is reached. For example:         (*MARK) is reached. For example:
# Line 5712  BACKTRACKING CONTROL Line 5732  BACKTRACKING CONTROL
5732           No match           No match
5733    
5734         There are three potential starting points for this match (starting with         There are three potential starting points for this match (starting with
5735         X, starting with P, and with  an  empty  string).  If  the  pattern  is         X,  starting  with  P,  and  with  an  empty string). If the pattern is
5736         anchored, the result is different:         anchored, the result is different:
5737    
5738           /^X(*MARK:A)Y|^X(*MARK:B)Z/K           /^X(*MARK:A)Y|^X(*MARK:B)Z/K
5739           XP           XP
5740           No match, mark = B           No match, mark = B
5741    
5742         PCRE's  start-of-match  optimizations can also interfere with this. For         PCRE's start-of-match optimizations can also interfere with  this.  For
5743         example, if, as a result of a call to pcre_study(), it knows the  mini-         example,  if, as a result of a call to pcre_study(), it knows the mini-
5744         mum  subject  length for a match, a shorter subject will not be scanned         mum subject length for a match, a shorter subject will not  be  scanned
5745         at all.         at all.
5746    
5747         Note that similar anomalies (though different in detail) exist in Perl,         Note that similar anomalies (though different in detail) exist in Perl,
5748         no  doubt  for the same reasons. The use of (*MARK) data after a failed         no doubt for the same reasons. The use of (*MARK) data after  a  failed
5749         match of an unanchored pattern is not recommended, unless (*COMMIT)  is         match  of an unanchored pattern is not recommended, unless (*COMMIT) is
5750         involved.         involved.
5751    
5752     Verbs that act after backtracking     Verbs that act after backtracking
5753    
5754         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
5755         tinues with what follows, but if there is no subsequent match,  causing         tinues  with what follows, but if there is no subsequent match, causing
5756         a  backtrack  to  the  verb, a failure is forced. That is, backtracking         a backtrack to the verb, a failure is  forced.  That  is,  backtracking
5757         cannot pass to the left of the verb. However, when one of  these  verbs         cannot  pass  to the left of the verb. However, when one of these verbs
5758         appears  inside  an atomic group, its effect is confined to that group,         appears inside an atomic group, its effect is confined to  that  group,
5759         because once the group has been matched, there is never any  backtrack-         because  once the group has been matched, there is never any backtrack-
5760         ing  into  it.  In  this situation, backtracking can "jump back" to the         ing into it. In this situation, backtracking can  "jump  back"  to  the
5761         left of the entire atomic group. (Remember, as stated above, that         left of the entire atomic group. (Remember, as stated above, that
5762         this localization also applies in subroutine calls and assertions.)         this localization also applies in subroutine calls and assertions.)
5763    
5764         These  verbs  differ  in exactly what kind of failure occurs when back-         These verbs differ in exactly what kind of failure  occurs  when  back-
5765         tracking reaches them.         tracking reaches them.
5766    
5767           (*COMMIT)           (*COMMIT)
5768    
5769         This verb, which may not be followed by a name, causes the whole  match         This  verb, which may not be followed by a name, causes the whole match
5770         to fail outright if the rest of the pattern does not match. Even if the         to fail outright if the rest of the pattern does not match. Even if the
5771         pattern is unanchored, no further attempts to find a match by advancing         pattern is unanchored, no further attempts to find a match by advancing
5772         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,         the  starting  point  take  place.  Once  (*COMMIT)  has  been  passed,
5773         pcre_exec() is committed to finding a match  at  the  current  starting         pcre_exec()  is  committed  to  finding a match at the current starting
5774         point, or not at all. For example:         point, or not at all. For example:
5775    
5776           a+(*COMMIT)b           a+(*COMMIT)b
5777    
5778         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
5779         of dynamic anchor, or "I've started, so I must finish." The name of the         of dynamic anchor, or "I've started, so I must finish." The name of the
5780         most  recently passed (*MARK) in the path is passed back when (*COMMIT)         most recently passed (*MARK) in the path is passed back when  (*COMMIT)
5781         forces a match failure.         forces a match failure.
5782    
5783         Note that (*COMMIT) at the start of a pattern is not  the  same  as  an         Note  that  (*COMMIT)  at  the start of a pattern is not the same as an
5784         anchor,  unless  PCRE's start-of-match optimizations are turned off, as         anchor, unless PCRE's start-of-match optimizations are turned  off,  as
5785         shown in this pcretest example:         shown in this pcretest example:
5786    
5787           /(*COMMIT)abc/           /(*COMMIT)abc/
# Line 5770  BACKTRACKING CONTROL Line 5790  BACKTRACKING CONTROL
5790           xyzabc\Y           xyzabc\Y
5791           No match           No match
5792    
5793         PCRE knows that any match must start  with  "a",  so  the  optimization         PCRE  knows  that  any  match  must start with "a", so the optimization
5794         skips  along the subject to "a" before running the first match attempt,         skips along the subject to "a" before running the first match  attempt,
5795         which succeeds. When the optimization is disabled by the \Y  escape  in         which  succeeds.  When the optimization is disabled by the \Y escape in
5796         the second subject, the match starts at "x" and so the (*COMMIT) causes         the second subject, the match starts at "x" and so the (*COMMIT) causes
5797         it to fail without trying any other starting points.         it to fail without trying any other starting points.
5798    
5799           (*PRUNE) or (*PRUNE:NAME)           (*PRUNE) or (*PRUNE:NAME)
5800    
5801         This verb causes the match to fail at the current starting position  in         This  verb causes the match to fail at the current starting position in
5802         the  subject  if the rest of the pattern does not match. If the pattern         the subject if the rest of the pattern does not match. If  the  pattern
5803         is unanchored, the normal "bumpalong"  advance  to  the  next  starting         is  unanchored,  the  normal  "bumpalong"  advance to the next starting
5804         character  then happens. Backtracking can occur as usual to the left of         character then happens. Backtracking can occur as usual to the left  of
5805         (*PRUNE), before it is reached,  or  when  matching  to  the  right  of         (*PRUNE),  before  it  is  reached,  or  when  matching to the right of
5806         (*PRUNE),  but  if  there is no match to the right, backtracking cannot         (*PRUNE), but if there is no match to the  right,  backtracking  cannot
5807         cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an  alter-         cross  (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter-
5808         native  to an atomic group or possessive quantifier, but there are some         native to an atomic group or possessive quantifier, but there are  some
5809         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-         uses of (*PRUNE) that cannot be expressed in any other way.  The behav-
5810         iour  of  (*PRUNE:NAME)  is  the  same as (*MARK:NAME)(*PRUNE) when the         iour of (*PRUNE:NAME) is the  same  as  (*MARK:NAME)(*PRUNE)  when  the
5811         match fails completely; the name is passed back if this  is  the  final         match  fails  completely;  the name is passed back if this is the final
5812         attempt.   (*PRUNE:NAME)  does  not  pass back a name if the match suc-         attempt.  (*PRUNE:NAME) does not pass back a name  if  the  match  suc-
5813         ceeds. In an anchored pattern (*PRUNE) has the same  effect  as  (*COM-         ceeds.  In  an  anchored pattern (*PRUNE) has the same effect as (*COM-
5814         MIT).         MIT).
5815    
5816           (*SKIP)           (*SKIP)
5817    
5818         This  verb, when given without a name, is like (*PRUNE), except that if         This verb, when given without a name, is like (*PRUNE), except that  if
5819         the pattern is unanchored, the "bumpalong" advance is not to  the  next         the  pattern  is unanchored, the "bumpalong" advance is not to the next
5820         character, but to the position in the subject where (*SKIP) was encoun-         character, but to the position in the subject where (*SKIP) was encoun-
5821         tered. (*SKIP) signifies that whatever text was matched leading  up  to         tered.  (*SKIP)  signifies that whatever text was matched leading up to
5822         it cannot be part of a successful match. Consider:         it cannot be part of a successful match. Consider:
5823    
5824           a+(*SKIP)b           a+(*SKIP)b
5825    
5826         If  the  subject  is  "aaaac...",  after  the first match attempt fails         If the subject is "aaaac...",  after  the  first  match  attempt  fails
5827         (starting at the first character in the  string),  the  starting  point         (starting  at  the  first  character in the string), the starting point
5828         skips on to start the next attempt at "c". Note that a possessive quan-         skips on to start the next attempt at "c". Note that a possessive quan-
5829         tifier does not have the same effect as this example; although it would         tifier does not have the same effect as this example; although it would
5830         suppress  backtracking  during  the  first  match  attempt,  the second         suppress backtracking  during  the  first  match  attempt,  the  second
5831         attempt would start at the second character instead of skipping  on  to         attempt  would  start at the second character instead of skipping on to
5832         "c".         "c".
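
The difference between the two patterns lies in which starting positions are tried. The Python sketch below models that under stated assumptions: it is not PCRE, it ignores start-of-match optimizations, and start_positions is a hypothetical helper that hard-codes the patterns a+(*SKIP)b and a++b. It records each offset at which a match attempt begins.

```python
def start_positions(subject: str, use_skip: bool) -> list[int]:
    """Offsets at which a match attempt starts.

    use_skip=True models a+(*SKIP)b; use_skip=False models a++b.
    Illustrative only: a hand-written model, not the PCRE matcher.
    """
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        # Consume the whole run of "a" characters (greedy, no backtracking
        # into the run in either pattern).
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        if end > start and end < len(subject) and subject[end] == "b":
            break  # overall match found at this starting position
        if use_skip and end > start:
            # (*SKIP) was reached, so the next attempt starts where it
            # was encountered.
            start = end
        else:
            # Normal "bumpalong" of one character.
            start += 1
    return tried


print(start_positions("aaaac", use_skip=True))   # [0, 4, 5]
print(start_positions("aaaac", use_skip=False))  # [0, 1, 2, 3, 4, 5]
```

In this model both patterns fail on "aaaac", but the (*SKIP) version makes three attempts where the possessive version makes six. The final offset in each list is the attempt at the end of the subject.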
5833    
5834           (*SKIP:NAME)           (*SKIP:NAME)
5835    
5836         When  (*SKIP) has an associated name, its behaviour is modified. If the         When (*SKIP) has an associated name, its behaviour is modified. If  the
5837         following pattern fails to match, the previous path through the pattern         following pattern fails to match, the previous path through the pattern
5838         is  searched for the most recent (*MARK) that has the same name. If one         is searched for the most recent (*MARK) that has the same name. If  one
5839         is found, the "bumpalong" advance is to the subject position that  cor-         is  found, the "bumpalong" advance is to the subject position that cor-
5840         responds  to  that (*MARK) instead of to where (*SKIP) was encountered.         responds to that (*MARK) instead of to where (*SKIP)  was  encountered.
5841         If no (*MARK) with a matching name is found, normal "bumpalong" of  one         If  no (*MARK) with a matching name is found, normal "bumpalong" of one
5842         character happens (that is, the (*SKIP) is ignored).         character happens (that is, the (*SKIP) is ignored).
5843    
5844           (*THEN) or (*THEN:NAME)           (*THEN) or (*THEN:NAME)
5845    
5846         This  verb  causes a skip to the next innermost alternative if the rest         This verb causes a skip to the next innermost alternative if  the  rest
5847         of the pattern does not match. That is, it cancels  pending  backtrack-         of  the  pattern does not match. That is, it cancels pending backtrack-
5848         ing,  but  only within the current alternative. Its name comes from the         ing, but only within the current alternative. Its name comes  from  the
5849         observation that it can be used for a pattern-based if-then-else block:         observation that it can be used for a pattern-based if-then-else block:
5850    
5851           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
5852    
5853         If the COND1 pattern matches, FOO is tried (and possibly further  items         If  the COND1 pattern matches, FOO is tried (and possibly further items
5854         after  the  end  of the group if FOO succeeds); on failure, the matcher         after the end of the group if FOO succeeds); on  failure,  the  matcher
5855         skips to the second alternative and tries COND2,  without  backtracking         skips  to  the second alternative and tries COND2, without backtracking
5856         into  COND1.  The  behaviour  of  (*THEN:NAME)  is  exactly the same as         into COND1. The behaviour  of  (*THEN:NAME)  is  exactly  the  same  as
5857         (*MARK:NAME)(*THEN) if the overall  match  fails.  If  (*THEN)  is  not         (*MARK:NAME)(*THEN)  if  the  overall  match  fails.  If (*THEN) is not
5858         inside an alternation, it acts like (*PRUNE).         inside an alternation, it acts like (*PRUNE).
5859    
5860         Note  that  a  subpattern that does not contain a | character is just a         Note that a subpattern that does not contain a | character  is  just  a
5861         part of the enclosing alternative; it is not a nested alternation  with         part  of the enclosing alternative; it is not a nested alternation with
5862         only  one alternative. The effect of (*THEN) extends beyond such a sub-         only one alternative. The effect of (*THEN) extends beyond such a  sub-
5863         pattern to the enclosing alternative. Consider this pattern,  where  A,         pattern  to  the enclosing alternative. Consider this pattern, where A,
5864         B, etc. are complex pattern fragments that do not contain any | charac-         B, etc. are complex pattern fragments that do not contain any | charac-
5865         ters at this level:         ters at this level:
5866    
5867           A (B(*THEN)C) | D           A (B(*THEN)C) | D
5868    
5869         If A and B are matched, but there is a failure in C, matching does  not         If  A and B are matched, but there is a failure in C, matching does not
5870         backtrack into A; instead it moves to the next alternative, that is, D.         backtrack into A; instead it moves to the next alternative, that is, D.
5871         However, if the subpattern containing (*THEN) is given an  alternative,         However,  if the subpattern containing (*THEN) is given an alternative,
5872         it behaves differently:         it behaves differently:
5873    
5874           A (B(*THEN)C | (*FAIL)) | D           A (B(*THEN)C | (*FAIL)) | D
5875    
5876         The  effect of (*THEN) is now confined to the inner subpattern. After a         The effect of (*THEN) is now confined to the inner subpattern. After  a
5877         failure in C, matching moves to (*FAIL), which causes the whole subpat-         failure in C, matching moves to (*FAIL), which causes the whole subpat-
5878         tern  to  fail  because  there are no more alternatives to try. In this         tern to fail because there are no more alternatives  to  try.  In  this
5879         case, matching does now backtrack into A.         case, matching does now backtrack into A.
5880    
5881         Note also that a conditional subpattern is not considered as having two         Note also that a conditional subpattern is not considered as having two
5882         alternatives,  because  only  one  is  ever used. In other words, the |         alternatives, because only one is ever used.  In  other  words,  the  |
5883         character in a conditional subpattern has a different meaning. Ignoring         character in a conditional subpattern has a different meaning. Ignoring
5884         white space, consider:         white space, consider:
5885    
5886           ^.*? (?(?=a) a | b(*THEN)c )           ^.*? (?(?=a) a | b(*THEN)c )
5887    
5888         If  the  subject  is  "ba", this pattern does not match. Because .*? is         If the subject is "ba", this pattern does not  match.  Because  .*?  is
5889         ungreedy, it initially matches zero  characters.  The  condition  (?=a)         ungreedy,  it  initially  matches  zero characters. The condition (?=a)
5890         then  fails,  the  character  "b"  is  matched, but "c" is not. At this         then fails, the character "b" is matched,  but  "c"  is  not.  At  this
5891         point, matching does not backtrack to .*? as might perhaps be  expected         point,  matching does not backtrack to .*? as might perhaps be expected
5892         from  the  presence  of  the | character. The conditional subpattern is         from the presence of the | character.  The  conditional  subpattern  is
5893         part of the single alternative that comprises the whole pattern, and so         part of the single alternative that comprises the whole pattern, and so
5894         the  match  fails.  (If  there was a backtrack into .*?, allowing it to         the match fails. (If there was a backtrack into  .*?,  allowing  it  to
5895         match "b", the match would succeed.)         match "b", the match would succeed.)
5896    
5897         The verbs just described provide four different "strengths" of  control         The  verbs just described provide four different "strengths" of control
5898         when subsequent matching fails. (*THEN) is the weakest, carrying on the         when subsequent matching fails. (*THEN) is the weakest, carrying on the
5899         match at the next alternative. (*PRUNE) comes next, failing  the  match         match  at  the next alternative. (*PRUNE) comes next, failing the match
5900         at  the  current starting position, but allowing an advance to the next         at the current starting position, but allowing an advance to  the  next
5901         character (for an unanchored pattern). (*SKIP) is similar, except  that         character  (for an unanchored pattern). (*SKIP) is similar, except that
5902         the advance may be more than one character. (*COMMIT) is the strongest,         the advance may be more than one character. (*COMMIT) is the strongest,
5903         causing the entire match to fail.         causing the entire match to fail.
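
Three of these strengths can be compared side by side on a pattern of the form a+(*VERB)b. The Python sketch below is a hand-written model, not PCRE: the function search and its verb argument are hypothetical, start-of-match optimizations are ignored, and (*THEN) is left out because the pattern has no alternation. It returns the (start, end) offsets of the match, or None.

```python
def search(subject: str, verb: str):
    """Model of a+(*VERB)b for verb in "PRUNE", "SKIP", "COMMIT".

    Returns (start, end) offsets of the match, or None if there is none.
    Illustrative only; this does not call PCRE.
    """
    start = 0
    while start <= len(subject):
        # Greedy run of "a" characters from the current starting point.
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        reached_verb = end > start  # the verb is reached only if a+ matched
        if reached_verb and subject[end:end + 1] == "b":
            return (start, end + 1)
        if reached_verb and verb == "COMMIT":
            return None             # no further starting points are tried
        if reached_verb and verb == "SKIP":
            start = end             # advance to where (*SKIP) was encountered
        else:
            start += 1              # (*PRUNE), or verb not reached: bump by one
    return None


for verb in ("PRUNE", "SKIP", "COMMIT"):
    print(verb, search("aacaab", verb))
# PRUNE (3, 6)
# SKIP (3, 6)
# COMMIT None
```

With the subject "xxaab" all three verbs give (2, 5) in this model, because the first two attempts fail before the verb is reached.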
5904    
# Line 5888  BACKTRACKING CONTROL Line 5908  BACKTRACKING CONTROL
5908    
5909           (A(*COMMIT)B(*THEN)C|D)           (A(*COMMIT)B(*THEN)C|D)
5910    
5911         Once A has matched, PCRE is committed to this  match,  at  the  current         Once  A  has  matched,  PCRE is committed to this match, at the current
5912         starting  position. If subsequently B matches, but C does not, the nor-         starting position. If subsequently B matches, but C does not, the  nor-
5913         mal (*THEN) action of trying the next alternative (that is, D) does not         mal (*THEN) action of trying the next alternative (that is, D) does not
5914         happen because (*COMMIT) overrides.         happen because (*COMMIT) overrides.
5915    
# Line 5908  AUTHOR Line 5928  AUTHOR
5928    
5929  REVISION  REVISION
5930    
5931         Last updated: 09 October 2011         Last updated: 19 October 2011
5932         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
5933  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5934    
# Line 6383  UTF-8 AND UNICODE PROPERTY SUPPORT Line 6403  UTF-8 AND UNICODE PROPERTY SUPPORT
6403         gle byte.         gle byte.
6404    
6405         5.  The  escape sequence \C can be used to match a single byte in UTF-8         5.  The  escape sequence \C can be used to match a single byte in UTF-8
6406         mode, but its use can lead to some strange effects.  This  facility  is         mode, but its use can lead to some strange effects because it breaks up
6407         not  available  in  the alternative matching function, pcre_dfa_exec(),         multibyte characters (see the description of \C in the pcrepattern doc-
6408         nor is it supported by the JIT  optimization  of  pcre_exec().  If  JIT         umentation). The use of \C is not supported in the alternative matching
6409         optimization  is  requested for a pattern that contains \C, it will not         function  pcre_dfa_exec(), nor is it supported in UTF-8 mode by the JIT
6410         succeed, and so the matching will be carried out by the  normal  inter-         optimization of pcre_exec(). If JIT optimization  is  requested  for  a
6411         pretive function.         UTF-8  pattern that contains \C, it will not succeed, and so the match-
6412           ing will be carried out by the normal interpretive function.
6413    
6414         6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly         6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
6415         test characters of any code value, but, by default, the characters that         test characters of any code value, but, by default, the characters that
6416         PCRE  recognizes  as digits, spaces, or word characters remain the same         PCRE recognizes as digits, spaces, or word characters remain  the  same
6417         set as before, all with values less than 256. This  remains  true  even         set  as  before,  all with values less than 256. This remains true even
6418         when  PCRE  is built to include Unicode property support, because to do         when PCRE is built to include Unicode property support, because  to  do
6419         otherwise would slow down PCRE in many common cases. Note in particular         otherwise would slow down PCRE in many common cases. Note in particular
6420         that this applies to \b and \B, because they are defined in terms of \w         that this applies to \b and \B, because they are defined in terms of \w
6421         and \W. If you really want to test for a wider sense of, say,  "digit",         and  \W. If you really want to test for a wider sense of, say, "digit",
6422         you  can  use  explicit Unicode property tests such as \p{Nd}. Alterna-         you can use explicit Unicode property tests such  as  \p{Nd}.  Alterna-
6423         tively, if you set the PCRE_UCP option,  the  way  that  the  character         tively,  if  you  set  the  PCRE_UCP option, the way that the character
6424         escapes  work  is changed so that Unicode properties are used to deter-         escapes work is changed so that Unicode properties are used  to  deter-
6425         mine which characters match. There are more details in the  section  on         mine  which  characters match. There are more details in the section on
6426         generic character types in the pcrepattern documentation.         generic character types in the pcrepattern documentation.
6427    
6428         7.  Similarly,  characters that match the POSIX named character classes         7. Similarly, characters that match the POSIX named  character  classes
6429         are all low-valued characters, unless the PCRE_UCP option is set.         are all low-valued characters, unless the PCRE_UCP option is set.
6430    
6431         8. However, the horizontal and  vertical  whitespace  matching  escapes         8.  However,  the  horizontal  and vertical whitespace matching escapes
6432         (\h,  \H,  \v, and \V) do match all the appropriate Unicode characters,         (\h, \H, \v, and \V) do match all the appropriate  Unicode  characters,
6433         whether or not PCRE_UCP is set.         whether or not PCRE_UCP is set.
6434    
6435         9. Case-insensitive matching applies only to  characters  whose  values         9.  Case-insensitive  matching  applies only to characters whose values
6436         are  less than 128, unless PCRE is built with Unicode property support.         are less than 128, unless PCRE is built with Unicode property  support.
6437         Even when Unicode property support is available, PCRE  still  uses  its         Even  when  Unicode  property support is available, PCRE still uses its
6438         own  character  tables when checking the case of low-valued characters,         own character tables when checking the case of  low-valued  characters,
6439         so as not to degrade performance.  The Unicode property information  is         so  as not to degrade performance.  The Unicode property information is
6440         used only for characters with higher values. Furthermore, PCRE supports         used only for characters with higher values. Furthermore, PCRE supports
6441         case-insensitive matching only  when  there  is  a  one-to-one  mapping         case-insensitive  matching  only  when  there  is  a one-to-one mapping
6442         between  a letter's cases. There are a small number of many-to-one map-         between a letter's cases. There are a small number of many-to-one  map-
6443         pings in Unicode; these are not supported by PCRE.         pings in Unicode; these are not supported by PCRE.
6444    
6445    
# Line 6431  AUTHOR Line 6452  AUTHOR
6452    
6453  REVISION  REVISION
6454    
6455         Last updated: 06 September 2011         Last updated: 19 October 2011
6456         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
6457  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6458    
# Line 6534  UNSUPPORTED OPTIONS AND PATTERN ITEMS Line 6555  UNSUPPORTED OPTIONS AND PATTERN ITEMS
6555    
6556         The unsupported pattern items are:         The unsupported pattern items are:
6557    
6558           \C            match a single byte, even in UTF-8 mode           \C            match a single byte; not supported in UTF-8 mode
6559           (?Cn)          callouts           (?Cn)          callouts
6560           (?(<name>)...  conditional test on setting of a named subpattern           (?(<name>)...  conditional test on setting of a named subpattern
6561           (?(R)...       conditional test on whole pattern recursion           (?(R)...       conditional test on whole pattern recursion
# Line 6691  AUTHOR Line 6712  AUTHOR
6712    
6713  REVISION  REVISION
6714    
6715         Last updated: 05 October 2011         Last updated: 19 October 2011
6716         Copyright (c) 1997-2011 University of Cambridge.         Copyright (c) 1997-2011 University of Cambridge.
6717  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
6718    
