Diff of /code/trunk/doc/pcre.txt

revision 487 by ph10, Wed Jan 6 10:26:55 2010 UTC revision 488 by ph10, Mon Jan 11 15:29:42 2010 UTC
# Line 3246  BACKSLASH Line 3246  BACKSLASH
3246           \n        linefeed (hex 0A)           \n        linefeed (hex 0A)
3247           \r        carriage return (hex 0D)           \r        carriage return (hex 0D)
3248           \t        tab (hex 09)           \t        tab (hex 09)
3249           \ddd      character with octal code ddd, or backreference           \ddd      character with octal code ddd, or back reference
3250           \xhh      character with hex code hh           \xhh      character with hex code hh
3251           \x{hhh..} character with hex code hhh..           \x{hhh..} character with hex code hhh..
3252    
# Line 4051  DUPLICATE SUBPATTERN NUMBERS Line 4051  DUPLICATE SUBPATTERN NUMBERS
4051           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x           / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
4052           # 1            2         2  3        2     3     4           # 1            2         2  3        2     3     4
4053    
4054         A  backreference  to  a  numbered subpattern uses the most recent value         A  back  reference  to a numbered subpattern uses the most recent value
4055         that is set for that number by any subpattern.  The  following  pattern         that is set for that number by any subpattern.  The  following  pattern
4056         matches "abcabc" or "defdef":         matches "abcabc" or "defdef":
4057    
# Line 4085  NAMED SUBPATTERNS Line 4085  NAMED SUBPATTERNS
4085    
4086         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)         In PCRE, a subpattern can be named in one of three  ways:  (?<name>...)
4087         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References         or  (?'name'...)  as in Perl, or (?P<name>...) as in Python. References
4088         to capturing parentheses from other parts of the pattern, such as back-         to capturing parentheses from other parts of the pattern, such as  back
4089         references,  recursion,  and conditions, can be made by name as well as         references,  recursion,  and conditions, can be made by name as well as
4090         by number.         by number.
4091    
# Line 4121  NAMED SUBPATTERNS Line 4121  NAMED SUBPATTERNS
4121         that  name  that  matched.  This saves searching to find which numbered         that  name  that  matched.  This saves searching to find which numbered
4122         subpattern it was.         subpattern it was.
4123    
4124         If you make a backreference to a non-unique named subpattern from else-         If you make a back reference to  a  non-unique  named  subpattern  from
4125         where  in the pattern, the one that corresponds to the first occurrence         elsewhere  in the pattern, the one that corresponds to the first occur-
4126         of the name is used. In the absence of duplicate numbers (see the  pre-         rence of the name is used. In the absence of duplicate numbers (see the
4127         vious  section)  this  is  the one with the lowest number. If you use a         previous  section) this is the one with the lowest number. If you use a
4128         named reference in a condition test (see the section  about  conditions         named reference in a condition test (see the section  about  conditions
4129         below),  either  to check whether a subpattern has matched, or to check         below),  either  to check whether a subpattern has matched, or to check
4130         for recursion, all subpatterns with the same name are  tested.  If  the         for recursion, all subpatterns with the same name are  tested.  If  the
# Line 4270  REPETITION Line 4270  REPETITION
4270         mization, or alternatively using ^ to indicate anchoring explicitly.         mization, or alternatively using ^ to indicate anchoring explicitly.
4271    
4272         However, there is one situation where the optimization cannot be  used.         However, there is one situation where the optimization cannot be  used.
4273         When  .*   is  inside  capturing  parentheses that are the subject of a         When .*  is inside capturing parentheses that are the subject of a back
4274         backreference elsewhere in the pattern, a match at the start  may  fail         reference elsewhere in the pattern, a match at the start may fail where
4275         where a later one succeeds. Consider, for example:         a later one succeeds. Consider, for example:
4276    
4277           (.*)abc\1           (.*)abc\1
4278    
# Line 4494  BACK REFERENCES Line 4494  BACK REFERENCES
4494         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{         PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{
4495         syntax or an empty comment (see "Comments" below) can be used.         syntax or an empty comment (see "Comments" below) can be used.
4496    
4497       Recursive back references
4498    
4499         A back reference that occurs inside the parentheses to which it  refers         A back reference that occurs inside the parentheses to which it  refers
4500         fails  when  the subpattern is first used, so, for example, (a\1) never         fails  when  the subpattern is first used, so, for example, (a\1) never
4501         matches.  However, such references can be useful inside  repeated  sub-         matches.  However, such references can be useful inside  repeated  sub-
# Line 4508  BACK REFERENCES Line 4510  BACK REFERENCES
4510         to  match the back reference. This can be done using alternation, as in         to  match the back reference. This can be done using alternation, as in
4511         the example above, or by a quantifier with a minimum of zero.         the example above, or by a quantifier with a minimum of zero.
4512    
4513           Back references of this type cause the group that they reference to  be
4514           treated  as  an atomic group.  Once the whole group has been matched, a
4515           subsequent matching failure cannot cause backtracking into  the  middle
4516           of the group.
4517    
4518    
4519  ASSERTIONS  ASSERTIONS
4520    
4521         An assertion is a test on the characters  following  or  preceding  the         An  assertion  is  a  test on the characters following or preceding the
4522         current  matching  point that does not actually consume any characters.         current matching point that does not actually consume  any  characters.
4523         The simple assertions coded as \b, \B, \A, \G, \Z,  \z,  ^  and  $  are         The  simple  assertions  coded  as  \b, \B, \A, \G, \Z, \z, ^ and $ are
4524         described above.         described above.
4525    
4526         More  complicated  assertions  are  coded as subpatterns. There are two         More complicated assertions are coded as  subpatterns.  There  are  two
4527         kinds: those that look ahead of the current  position  in  the  subject         kinds:  those  that  look  ahead of the current position in the subject
4528         string,  and  those  that  look  behind  it. An assertion subpattern is         string, and those that look  behind  it.  An  assertion  subpattern  is
4529         matched in the normal way, except that it does not  cause  the  current         matched  in  the  normal way, except that it does not cause the current
4530         matching position to be changed.         matching position to be changed.
4531    
4532         Assertion  subpatterns  are  not  capturing subpatterns, and may not be         Assertion subpatterns are not capturing subpatterns,  and  may  not  be
4533         repeated, because it makes no sense to assert the  same  thing  several         repeated,  because  it  makes no sense to assert the same thing several
4534         times.  If  any kind of assertion contains capturing subpatterns within         times. If any kind of assertion contains capturing  subpatterns  within
4535         it, these are counted for the purposes of numbering the capturing  sub-         it,  these are counted for the purposes of numbering the capturing sub-
4536         patterns in the whole pattern.  However, substring capturing is carried         patterns in the whole pattern.  However, substring capturing is carried
4537         out only for positive assertions, because it does not  make  sense  for         out  only  for  positive assertions, because it does not make sense for
4538         negative assertions.         negative assertions.
4539    
4540     Lookahead assertions     Lookahead assertions
# Line 4537  ASSERTIONS Line 4544  ASSERTIONS
4544    
4545           \w+(?=;)           \w+(?=;)
4546    
4547         matches a word followed by a semicolon, but does not include the  semi-         matches  a word followed by a semicolon, but does not include the semi-
4548         colon in the match, and         colon in the match, and
4549    
4550           foo(?!bar)           foo(?!bar)
4551    
4552         matches  any  occurrence  of  "foo" that is not followed by "bar". Note         matches any occurrence of "foo" that is not  followed  by  "bar".  Note
4553         that the apparently similar pattern         that the apparently similar pattern
4554    
4555           (?!foo)bar           (?!foo)bar
4556    
4557         does not find an occurrence of "bar"  that  is  preceded  by  something         does  not  find  an  occurrence  of "bar" that is preceded by something
4558         other  than "foo"; it finds any occurrence of "bar" whatsoever, because         other than "foo"; it finds any occurrence of "bar" whatsoever,  because
4559         the assertion (?!foo) is always true when the next three characters are         the assertion (?!foo) is always true when the next three characters are
4560         "bar". A lookbehind assertion is needed to achieve the other effect.         "bar". A lookbehind assertion is needed to achieve the other effect.
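
Reviewer's illustration (not part of pcre.txt): the lookahead behaviour described in this hunk can be checked with a minimal sketch. It assumes Python's standard `re` module as a stand-in for PCRE, since the two agree on these particular patterns. The subject strings and variable names are made up for the example.

```python
import re

# \w+(?=;) matches a word followed by a semicolon, without consuming the semicolon.
word = re.search(r"\w+(?=;)", "int count;").group()

# foo(?!bar) matches "foo" only where it is not followed by "bar".
foo_not_before_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar finds every "bar": at the point where "bar" begins, the next
# three characters are "bar", so they can never be "foo".
every_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

# A negative lookbehind gives the "not preceded by foo" effect.
bar_not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar xbar")]

print(word, foo_not_before_bar, every_bar, bar_not_after_foo)
```

Only the last pattern rejects the "bar" inside "foobar". This is the distinction the paragraph above draws between (?!foo)bar and a lookbehind.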
4561    
4562         If you want to force a matching failure at some point in a pattern, the         If you want to force a matching failure at some point in a pattern, the
4563         most convenient way to do it is  with  (?!)  because  an  empty  string         most  convenient  way  to  do  it  is with (?!) because an empty string
4564         always  matches, so an assertion that requires there not to be an empty         always matches, so an assertion that requires there not to be an  empty
4565         string must always fail.   The  Perl  5.10  backtracking  control  verb         string  must  always  fail.   The  Perl  5.10 backtracking control verb
4566         (*FAIL) or (*F) is essentially a synonym for (?!).         (*FAIL) or (*F) is essentially a synonym for (?!).
4567    
4568     Lookbehind assertions     Lookbehind assertions
4569    
4570         Lookbehind  assertions start with (?<= for positive assertions and (?<!         Lookbehind assertions start with (?<= for positive assertions and  (?<!
4571         for negative assertions. For example,         for negative assertions. For example,
4572    
4573           (?<!foo)bar           (?<!foo)bar
4574    
4575         does find an occurrence of "bar" that is not  preceded  by  "foo".  The         does  find  an  occurrence  of "bar" that is not preceded by "foo". The
4576         contents  of  a  lookbehind  assertion are restricted such that all the         contents of a lookbehind assertion are restricted  such  that  all  the
4577         strings it matches must have a fixed length. However, if there are sev-         strings it matches must have a fixed length. However, if there are sev-
4578         eral  top-level  alternatives,  they  do  not all have to have the same         eral top-level alternatives, they do not all  have  to  have  the  same
4579         fixed length. Thus         fixed length. Thus
4580    
4581           (?<=bullock|donkey)           (?<=bullock|donkey)
# Line 4577  ASSERTIONS Line 4584  ASSERTIONS
4584    
4585           (?<!dogs?|cats?)           (?<!dogs?|cats?)
4586    
4587         causes an error at compile time. Branches that match  different  length         causes  an  error at compile time. Branches that match different length
4588         strings  are permitted only at the top level of a lookbehind assertion.         strings are permitted only at the top level of a lookbehind  assertion.
4589         This is an extension compared with Perl (5.8 and 5.10), which  requires         This  is an extension compared with Perl (5.8 and 5.10), which requires
4590         all branches to match the same length of string. An assertion such as         all branches to match the same length of string. An assertion such as
4591    
4592           (?<=ab(c|de))           (?<=ab(c|de))
4593    
4594         is  not  permitted,  because  its single top-level branch can match two         is not permitted, because its single top-level  branch  can  match  two
4595         different lengths, but it is acceptable to PCRE if rewritten to use two         different lengths, but it is acceptable to PCRE if rewritten to use two
4596         top-level branches:         top-level branches:
4597    
4598           (?<=abc|abde)           (?<=abc|abde)
4599    
4600         In some cases, the Perl 5.10 escape sequence \K (see above) can be used         In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4601         instead of  a  lookbehind  assertion  to  get  round  the  fixed-length         instead  of  a  lookbehind  assertion  to  get  round  the fixed-length
4602         restriction.         restriction.
4603    
4604         The  implementation  of lookbehind assertions is, for each alternative,         The implementation of lookbehind assertions is, for  each  alternative,
4605         to temporarily move the current position back by the fixed  length  and         to  temporarily  move the current position back by the fixed length and
4606         then try to match. If there are insufficient characters before the cur-         then try to match. If there are insufficient characters before the cur-
4607         rent position, the assertion fails.         rent position, the assertion fails.
4608    
4609         PCRE does not allow the \C escape (which matches a single byte in UTF-8         PCRE does not allow the \C escape (which matches a single byte in UTF-8
4610         mode)  to appear in lookbehind assertions, because it makes it impossi-         mode) to appear in lookbehind assertions, because it makes it  impossi-
4611         ble to calculate the length of the lookbehind. The \X and  \R  escapes,         ble  to  calculate the length of the lookbehind. The \X and \R escapes,
4612         which can match different numbers of bytes, are also not permitted.         which can match different numbers of bytes, are also not permitted.
4613    
4614         "Subroutine"  calls  (see below) such as (?2) or (?&X) are permitted in         "Subroutine" calls (see below) such as (?2) or (?&X) are  permitted  in
4615         lookbehinds, as long as the subpattern matches a  fixed-length  string.         lookbehinds,  as  long as the subpattern matches a fixed-length string.
4616         Recursion, however, is not supported.         Recursion, however, is not supported.
4617    
4618         Possessive  quantifiers  can  be  used  in  conjunction with lookbehind         Possessive quantifiers can  be  used  in  conjunction  with  lookbehind
4619         assertions to specify efficient matching of fixed-length strings at the         assertions to specify efficient matching of fixed-length strings at the
4620         end of subject strings. Consider a simple pattern such as         end of subject strings. Consider a simple pattern such as
4621    
4622           abcd$           abcd$
4623    
4624         when  applied  to  a  long string that does not match. Because matching         when applied to a long string that does  not  match.  Because  matching
4625         proceeds from left to right, PCRE will look for each "a" in the subject         proceeds from left to right, PCRE will look for each "a" in the subject
4626         and  then  see  if what follows matches the rest of the pattern. If the         and then see if what follows matches the rest of the  pattern.  If  the
4627         pattern is specified as         pattern is specified as
4628    
4629           ^.*abcd$           ^.*abcd$
4630    
4631         the initial .* matches the entire string at first, but when this  fails         the  initial .* matches the entire string at first, but when this fails
4632         (because there is no following "a"), it backtracks to match all but the         (because there is no following "a"), it backtracks to match all but the
4633         last character, then all but the last two characters, and so  on.  Once         last  character,  then all but the last two characters, and so on. Once
4634         again  the search for "a" covers the entire string, from right to left,         again the search for "a" covers the entire string, from right to  left,
4635         so we are no better off. However, if the pattern is written as         so we are no better off. However, if the pattern is written as
4636    
4637           ^.*+(?<=abcd)           ^.*+(?<=abcd)
4638    
4639         there can be no backtracking for the .*+ item; it can  match  only  the         there  can  be  no backtracking for the .*+ item; it can match only the
4640         entire  string.  The subsequent lookbehind assertion does a single test         entire string. The subsequent lookbehind assertion does a  single  test
4641         on the last four characters. If it fails, the match fails  immediately.         on  the last four characters. If it fails, the match fails immediately.
4642         For  long  strings, this approach makes a significant difference to the         For long strings, this approach makes a significant difference  to  the
4643         processing time.         processing time.
4644    
4645     Using multiple assertions     Using multiple assertions
# Line 4641  ASSERTIONS Line 4648  ASSERTIONS
4648    
4649           (?<=\d{3})(?<!999)foo           (?<=\d{3})(?<!999)foo
4650    
4651         matches "foo" preceded by three digits that are not "999". Notice  that         matches  "foo" preceded by three digits that are not "999". Notice that
4652         each  of  the  assertions is applied independently at the same point in         each of the assertions is applied independently at the  same  point  in
4653         the subject string. First there is a  check  that  the  previous  three         the  subject  string.  First  there  is a check that the previous three
4654         characters  are  all  digits,  and  then there is a check that the same         characters are all digits, and then there is  a  check  that  the  same
4655         three characters are not "999".  This pattern does not match "foo" pre-         three characters are not "999".  This pattern does not match "foo" pre-
4656         ceded  by  six  characters,  the first of which are digits and the last         ceded by six characters, the first of which are  digits  and  the  last
4657         three of which are not "999". For example, it  doesn't  match  "123abc-         three  of  which  are not "999". For example, it doesn't match "123abc-
4658         foo". A pattern to do that is         foo". A pattern to do that is
4659    
4660           (?<=\d{3}...)(?<!999)foo           (?<=\d{3}...)(?<!999)foo
4661    
4662         This  time  the  first assertion looks at the preceding six characters,         This time the first assertion looks at the  preceding  six  characters,
4663         checking that the first three are digits, and then the second assertion         checking that the first three are digits, and then the second assertion
4664         checks that the preceding three characters are not "999".         checks that the preceding three characters are not "999".
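
Reviewer's illustration (not part of pcre.txt): a minimal sketch of the two multiple-assertion patterns above. It assumes Python's `re` module behaves like PCRE here, which holds because both lookbehinds have a fixed length. The names `same_three` and `six_back` are hypothetical.

```python
import re

# Both assertions inspect the same three characters before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters (three digits, then any three);
# the second checks only the nearest three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

results = {
    "same_three/123foo": same_three.search("123foo") is not None,
    "same_three/999foo": same_three.search("999foo") is not None,
    "same_three/123abcfoo": same_three.search("123abcfoo") is not None,
    "six_back/123abcfoo": six_back.search("123abcfoo") is not None,
    "six_back/123999foo": six_back.search("123999foo") is not None,
}
for label, matched in results.items():
    print(label, matched)
```

Each assertion is evaluated independently at the same subject position, so the first pattern rejects "123abcfoo" while the second accepts it.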
4665    
# Line 4660  ASSERTIONS Line 4667  ASSERTIONS
4667    
4668           (?<=(?<!foo)bar)baz           (?<=(?<!foo)bar)baz
4669    
4670         matches  an occurrence of "baz" that is preceded by "bar" which in turn         matches an occurrence of "baz" that is preceded by "bar" which in  turn
4671         is not preceded by "foo", while         is not preceded by "foo", while
4672    
4673           (?<=\d{3}(?!999)...)foo           (?<=\d{3}(?!999)...)foo
4674    
4675         is another pattern that matches "foo" preceded by three digits and  any         is  another pattern that matches "foo" preceded by three digits and any
4676         three characters that are not "999".         three characters that are not "999".
4677    
4678    
4679  CONDITIONAL SUBPATTERNS  CONDITIONAL SUBPATTERNS
4680    
4681         It  is possible to cause the matching process to obey a subpattern con-         It is possible to cause the matching process to obey a subpattern  con-
4682         ditionally or to choose between two alternative subpatterns,  depending         ditionally  or to choose between two alternative subpatterns, depending
4683         on  the result of an assertion, or whether a specific capturing subpat-         on the result of an assertion, or whether a specific capturing  subpat-
4684         tern has already been matched. The two possible  forms  of  conditional         tern  has  already  been matched. The two possible forms of conditional
4685         subpattern are:         subpattern are:
4686    
4687           (?(condition)yes-pattern)           (?(condition)yes-pattern)
4688           (?(condition)yes-pattern|no-pattern)           (?(condition)yes-pattern|no-pattern)
4689    
4690         If  the  condition is satisfied, the yes-pattern is used; otherwise the         If the condition is satisfied, the yes-pattern is used;  otherwise  the
4691         no-pattern (if present) is used. If there are more  than  two  alterna-         no-pattern  (if  present)  is used. If there are more than two alterna-
4692         tives in the subpattern, a compile-time error occurs.         tives in the subpattern, a compile-time error occurs.
4693    
4694         There  are  four  kinds of condition: references to subpatterns, refer-         There are four kinds of condition: references  to  subpatterns,  refer-
4695         ences to recursion, a pseudo-condition called DEFINE, and assertions.         ences to recursion, a pseudo-condition called DEFINE, and assertions.
4696    
4697     Checking for a used subpattern by number     Checking for a used subpattern by number
4698    
4699         If the text between the parentheses consists of a sequence  of  digits,         If  the  text between the parentheses consists of a sequence of digits,
4700         the condition is true if a capturing subpattern of that number has pre-         the condition is true if a capturing subpattern of that number has pre-
4701         viously matched. If there is more than one  capturing  subpattern  with         viously  matched.  If  there is more than one capturing subpattern with
4702         the  same  number  (see  the earlier section about duplicate subpattern         the same number (see the earlier  section  about  duplicate  subpattern
4703         numbers), the condition is true if any of them have been set. An alter-         numbers), the condition is true if any of them have been set. An alter-
4704         native  notation is to precede the digits with a plus or minus sign. In         native notation is to precede the digits with a plus or minus sign.  In
4705         this case, the subpattern number is relative rather than absolute.  The         this  case, the subpattern number is relative rather than absolute. The
4706         most  recently opened parentheses can be referenced by (?(-1), the next         most recently opened parentheses can be referenced by (?(-1), the  next
4707         most recent by (?(-2), and so on. In looping  constructs  it  can  also         most  recent  by  (?(-2),  and so on. In looping constructs it can also
4708         make  sense  to  refer  to  subsequent  groups  with constructs such as         make sense to refer  to  subsequent  groups  with  constructs  such  as
4709         (?(+2).         (?(+2).
4710    
4711         Consider the following pattern, which  contains  non-significant  white         Consider  the  following  pattern, which contains non-significant white
4712         space to make it more readable (assume the PCRE_EXTENDED option) and to         space to make it more readable (assume the PCRE_EXTENDED option) and to
4713         divide it into three parts for ease of discussion:         divide it into three parts for ease of discussion:
4714    
4715           ( \( )?    [^()]+    (?(1) \) )           ( \( )?    [^()]+    (?(1) \) )
4716    
4717         The first part matches an optional opening  parenthesis,  and  if  that         The  first  part  matches  an optional opening parenthesis, and if that
4718         character is present, sets it as the first captured substring. The sec-         character is present, sets it as the first captured substring. The sec-
4719         ond part matches one or more characters that are not  parentheses.  The         ond  part  matches one or more characters that are not parentheses. The
4720         third part is a conditional subpattern that tests whether the first set         third part is a conditional subpattern that tests whether the first set
4721         of parentheses matched or not. If they did, that is, if subject started         of parentheses matched or not. If they did, that is, if subject started
4722         with an opening parenthesis, the condition is true, and so the yes-pat-         with an opening parenthesis, the condition is true, and so the yes-pat-
4723         tern is executed and a  closing  parenthesis  is  required.  Otherwise,         tern  is  executed  and  a  closing parenthesis is required. Otherwise,
4724         since  no-pattern  is  not  present, the subpattern matches nothing. In         since no-pattern is not present, the  subpattern  matches  nothing.  In
4725         other words,  this  pattern  matches  a  sequence  of  non-parentheses,         other  words,  this  pattern  matches  a  sequence  of non-parentheses,
4726         optionally enclosed in parentheses.         optionally enclosed in parentheses.
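
Reviewer's illustration (not part of pcre.txt): a minimal sketch of the three-part conditional pattern. It assumes Python's `re` module, whose (?(1)...) syntax matches PCRE's for this case. `re.VERBOSE` plays the role of PCRE_EXTENDED, and `fullmatch` is used so that unbalanced input is rejected rather than partially matched.

```python
import re

optional_parens = re.compile(
    r"""
    ( \( )?        # part 1: optional opening parenthesis, captured as group 1
    [^()]+         # part 2: one or more non-parenthesis characters
    (?(1) \) )     # part 3: require ")" only if group 1 took part in the match
    """,
    re.VERBOSE,
)

outcomes = {
    subject: optional_parens.fullmatch(subject) is not None
    for subject in ["abc", "(abc)", "(abc", "abc)"]
}
print(outcomes)
```

Because there is no no-pattern, the conditional matches nothing when group 1 is unset. That is why the bare "abc" is accepted.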
4727    
4728         If  you  were  embedding  this pattern in a larger one, you could use a         If you were embedding this pattern in a larger one,  you  could  use  a
4729         relative reference:         relative reference:
4730    
4731           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...           ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
4732    
4733         This makes the fragment independent of the parentheses  in  the  larger         This  makes  the  fragment independent of the parentheses in the larger
4734         pattern.         pattern.
4735    
4736     Checking for a used subpattern by name     Checking for a used subpattern by name
4737    
4738         Perl  uses  the  syntax  (?(<name>)...) or (?('name')...) to test for a         Perl uses the syntax (?(<name>)...) or (?('name')...)  to  test  for  a
4739         used subpattern by name. For compatibility  with  earlier  versions  of         used  subpattern  by  name.  For compatibility with earlier versions of
4740         PCRE,  which  had this facility before Perl, the syntax (?(name)...) is         PCRE, which had this facility before Perl, the syntax  (?(name)...)  is
4741         also recognized. However, there is a possible ambiguity with this  syn-         also  recognized. However, there is a possible ambiguity with this syn-
4742         tax,  because  subpattern  names  may  consist entirely of digits. PCRE         tax, because subpattern names may  consist  entirely  of  digits.  PCRE
4743         looks first for a named subpattern; if it cannot find one and the  name         looks  first for a named subpattern; if it cannot find one and the name
4744         consists  entirely  of digits, PCRE looks for a subpattern of that num-         consists entirely of digits, PCRE looks for a subpattern of  that  num-
4745         ber, which must be greater than zero. Using subpattern names that  con-         ber,  which must be greater than zero. Using subpattern names that con-
4746         sist entirely of digits is not recommended.         sist entirely of digits is not recommended.
4747    
4748         Rewriting the above example to use a named subpattern gives this:         Rewriting the above example to use a named subpattern gives this:
4749    
4750           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )           (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )
4751    
4752         If  the  name used in a condition of this kind is a duplicate, the test         If the name used in a condition of this kind is a duplicate,  the  test
4753         is applied to all subpatterns of the same name, and is true if any  one         is  applied to all subpatterns of the same name, and is true if any one
4754         of them has matched.         of them has matched.
4755    
4756     Checking for pattern recursion     Checking for pattern recursion
4757    
4758         If the condition is the string (R), and there is no subpattern with the         If the condition is the string (R), and there is no subpattern with the
4759         name R, the condition is true if a recursive call to the whole  pattern         name  R, the condition is true if a recursive call to the whole pattern
4760         or any subpattern has been made. If digits or a name preceded by amper-         or any subpattern has been made. If digits or a name preceded by amper-
4761         sand follow the letter R, for example:         sand follow the letter R, for example:
4762    
# Line 4757  CONDITIONAL SUBPATTERNS Line 4764  CONDITIONAL SUBPATTERNS
4764    
4765         the condition is true if the most recent recursion is into a subpattern         the condition is true if the most recent recursion is into a subpattern
4766         whose number or name is given. This condition does not check the entire         whose number or name is given. This condition does not check the entire
4767         recursion stack. If the name used in a condition  of  this  kind  is  a         recursion  stack.  If  the  name  used in a condition of this kind is a
4768         duplicate, the test is applied to all subpatterns of the same name, and         duplicate, the test is applied to all subpatterns of the same name, and
4769         is true if any one of them is the most recent recursion.         is true if any one of them is the most recent recursion.
4770    
4771         At "top level", all these recursion test  conditions  are  false.   The         At  "top  level",  all  these recursion test conditions are false.  The
4772         syntax for recursive patterns is described below.         syntax for recursive patterns is described below.
4773    
4774     Defining subpatterns for use by reference only     Defining subpatterns for use by reference only
4775    
4776         If  the  condition  is  the string (DEFINE), and there is no subpattern         If the condition is the string (DEFINE), and  there  is  no  subpattern
4777         with the name DEFINE, the condition is  always  false.  In  this  case,         with  the  name  DEFINE,  the  condition is always false. In this case,
4778         there  may  be  only  one  alternative  in the subpattern. It is always         there may be only one alternative  in  the  subpattern.  It  is  always
4779         skipped if control reaches this point  in  the  pattern;  the  idea  of         skipped  if  control  reaches  this  point  in the pattern; the idea of
4780         DEFINE  is that it can be used to define "subroutines" that can be ref-         DEFINE is that it can be used to define "subroutines" that can be  ref-
4781         erenced from elsewhere. (The use of "subroutines" is described  below.)         erenced  from elsewhere. (The use of "subroutines" is described below.)
4782         For  example,  a pattern to match an IPv4 address could be written like         For example, a pattern to match an IPv4 address could be  written  like
4783         this (ignore whitespace and line breaks):         this (ignore whitespace and line breaks):
4784    
4785           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )           (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
4786           \b (?&byte) (\.(?&byte)){3} \b           \b (?&byte) (\.(?&byte)){3} \b
4787    
4788         The first part of the pattern is a DEFINE group inside which another         The  first part of the pattern is a DEFINE group inside which another
4789         group  named "byte" is defined. This matches an individual component of         group named "byte" is defined. This matches an individual component  of
4790         an IPv4 address (a number less than 256). When  matching  takes  place,         an  IPv4  address  (a number less than 256). When matching takes place,
4791         this  part  of  the pattern is skipped because DEFINE acts like a false         this part of the pattern is skipped because DEFINE acts  like  a  false
4792         condition. The rest of the pattern uses references to the  named  group         condition.  The  rest of the pattern uses references to the named group
4793         to  match the four dot-separated components of an IPv4 address, insist-         to match the four dot-separated components of an IPv4 address,  insist-
4794         ing on a word boundary at each end.         ing on a word boundary at each end.
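
The DEFINE example above can be approximated in engines that lack subroutine calls. The following is a minimal sketch in Python, whose standard re module supports neither (?(DEFINE)...) nor (?&byte); the "byte" subpattern is therefore substituted textually at each point of use. The names BYTE, IPV4 and is_ipv4 are illustrative and are not part of PCRE. Textual substitution does not reproduce PCRE's treatment of a subroutine call as an atomic group, so this is an approximation of the pattern, not an equivalent.

```python
import re

# One IPv4 component (0-255): the same alternatives as the "byte" group.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# No DEFINE or (?&byte) in Python's re, so the subpattern text is pasted
# in wherever the PCRE pattern would "call" it.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def is_ipv4(text):
    """Return True if the whole of text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None
```

Using fullmatch() here checks the entire string; IPV4.search() would instead locate an address inside longer text, relying on the word boundaries as the PCRE pattern does.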
4795    
4796     Assertion conditions     Assertion conditions
4797    
4798         If the condition is not in any of the above  formats,  it  must  be  an         If  the  condition  is  not  in any of the above formats, it must be an
4799         assertion.   This may be a positive or negative lookahead or lookbehind         assertion.  This may be a positive or negative lookahead or  lookbehind
4800         assertion. Consider  this  pattern,  again  containing  non-significant         assertion.  Consider  this  pattern,  again  containing non-significant
4801         white space, and with the two alternatives on the second line:         white space, and with the two alternatives on the second line:
4802    
4803           (?(?=[^a-z]*[a-z])           (?(?=[^a-z]*[a-z])
4804           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )           \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )
4805    
4806         The  condition  is  a  positive  lookahead  assertion  that  matches an         The condition  is  a  positive  lookahead  assertion  that  matches  an
4807         optional sequence of non-letters followed by a letter. In other  words,         optional  sequence of non-letters followed by a letter. In other words,
4808         it  tests  for the presence of at least one letter in the subject. If a         it tests for the presence of at least one letter in the subject.  If  a
4809         letter is found, the subject is matched against the first  alternative;         letter  is found, the subject is matched against the first alternative;
4810         otherwise  it  is  matched  against  the  second.  This pattern matches         otherwise it is  matched  against  the  second.  This  pattern  matches
4811         strings in one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are         strings  in  one  of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
4812         letters and dd are digits.         letters and dd are digits.
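
For comparison, here is a sketch of the same test in Python. Python's re module accepts only a group number or name as a condition, not an assertion, so the conditional is emulated by guarding each alternative with the lookahead or its negation. This rewriting, and the names HAS_LETTER, PATTERN and matches, are assumptions made for illustration; they are not PCRE syntax.

```python
import re

# The condition: some lower case letter occurs at or after this point.
HAS_LETTER = r"[^a-z]*[a-z]"

# (?(?=C) yes | no) is emulated as (?: (?=C) yes | (?!C) no ).
PATTERN = re.compile(
    rf"(?:(?={HAS_LETTER})\d{{2}}-[a-z]{{3}}-\d{{2}}"
    rf"|(?!{HAS_LETTER})\d{{2}}-\d{{2}}-\d{{2}})"
)


def matches(text):
    """Return True if text starts with dd-aaa-dd or dd-dd-dd as selected
    by the letter test."""
    return PATTERN.match(text) is not None
```

Because the lookahead scans the rest of the subject, a string such as "12-34-56 x" selects the first alternative (a letter is present) and then fails to match, exactly as the description above implies.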
4813    
4814    
4815  COMMENTS  COMMENTS
4816    
4817         The  sequence (?# marks the start of a comment that continues up to the         The sequence (?# marks the start of a comment that continues up to  the
4818         next closing parenthesis. Nested parentheses  are  not  permitted.  The         next  closing  parenthesis.  Nested  parentheses are not permitted. The
4819         characters  that make up a comment play no part in the pattern matching         characters that make up a comment play no part in the pattern  matching
4820         at all.         at all.
4821    
4822         If the PCRE_EXTENDED option is set, an unescaped # character outside  a         If  the PCRE_EXTENDED option is set, an unescaped # character outside a
4823         character  class  introduces  a  comment  that continues to immediately         character class introduces a  comment  that  continues  to  immediately
4824         after the next newline in the pattern.         after the next newline in the pattern.
4825    
4826    
4827  RECURSIVE PATTERNS  RECURSIVE PATTERNS
4828    
4829         Consider the problem of matching a string in parentheses, allowing  for         Consider  the problem of matching a string in parentheses, allowing for
4830         unlimited  nested  parentheses.  Without the use of recursion, the best         unlimited nested parentheses. Without the use of  recursion,  the  best
4831         that can be done is to use a pattern that  matches  up  to  some  fixed         that  can  be  done  is  to use a pattern that matches up to some fixed
4832         depth  of  nesting.  It  is not possible to handle an arbitrary nesting         depth of nesting. It is not possible to  handle  an  arbitrary  nesting
4833         depth.         depth.
4834    
4835         For some time, Perl has provided a facility that allows regular expres-         For some time, Perl has provided a facility that allows regular expres-
4836         sions  to recurse (amongst other things). It does this by interpolating         sions to recurse (amongst other things). It does this by  interpolating
4837         Perl code in the expression at run time, and the code can refer to  the         Perl  code in the expression at run time, and the code can refer to the
4838         expression itself. A Perl pattern using code interpolation to solve the         expression itself. A Perl pattern using code interpolation to solve the
4839         parentheses problem can be created like this:         parentheses problem can be created like this:
4840    
# Line 4837  RECURSIVE PATTERNS Line 4844  RECURSIVE PATTERNS
4844         refers recursively to the pattern in which it appears.         refers recursively to the pattern in which it appears.
4845    
4846         Obviously, PCRE cannot support the interpolation of Perl code. Instead,         Obviously, PCRE cannot support the interpolation of Perl code. Instead,
4847         it supports special syntax for recursion of  the  entire  pattern,  and         it  supports  special  syntax  for recursion of the entire pattern, and
4848         also  for  individual  subpattern  recursion. After its introduction in         also for individual subpattern recursion.  After  its  introduction  in
4849         PCRE and Python, this kind of  recursion  was  subsequently  introduced         PCRE  and  Python,  this  kind of recursion was subsequently introduced
4850         into Perl at release 5.10.         into Perl at release 5.10.
4851    
4852         A  special  item  that consists of (? followed by a number greater than         A special item that consists of (? followed by a  number  greater  than
4853         zero and a closing parenthesis is a recursive call of the subpattern of         zero and a closing parenthesis is a recursive call of the subpattern of
4854         the  given  number, provided that it occurs inside that subpattern. (If         the given number, provided that it occurs inside that  subpattern.  (If
4855         not, it is a "subroutine" call, which is described  in  the  next  sec-         not,  it  is  a  "subroutine" call, which is described in the next sec-
4856         tion.)  The special item (?R) or (?0) is a recursive call of the entire         tion.) The special item (?R) or (?0) is a recursive call of the  entire
4857         regular expression.         regular expression.
4858    
4859         This PCRE pattern solves the nested  parentheses  problem  (assume  the         This  PCRE  pattern  solves  the nested parentheses problem (assume the
4860         PCRE_EXTENDED option is set so that white space is ignored):         PCRE_EXTENDED option is set so that white space is ignored):
4861    
4862           \( ( [^()]++ | (?R) )* \)           \( ( [^()]++ | (?R) )* \)
4863    
4864         First  it matches an opening parenthesis. Then it matches any number of         First it matches an opening parenthesis. Then it matches any number  of
4865         substrings which can either be a  sequence  of  non-parentheses,  or  a         substrings  which  can  either  be  a sequence of non-parentheses, or a
4866         recursive  match  of the pattern itself (that is, a correctly parenthe-         recursive match of the pattern itself (that is, a  correctly  parenthe-
4867         sized substring).  Finally there is a closing parenthesis. Note the use         sized substring).  Finally there is a closing parenthesis. Note the use
4868         of a possessive quantifier to avoid backtracking into sequences of non-         of a possessive quantifier to avoid backtracking into sequences of non-
4869         parentheses.         parentheses.
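
To make the recursion concrete, the steps just described can be written as an ordinary recursive function. This is a sketch in Python of what the pattern accepts when anchored at a given position, not a description of how PCRE works internally; the function name match_parens is hypothetical.

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced parenthesized group that
    starts at s[pos], or -1 if there is none."""
    if pos >= len(s) or s[pos] != "(":
        return -1                        # \( must match first
    pos += 1
    while pos < len(s):
        ch = s[pos]
        if ch == ")":
            return pos + 1               # the closing \)
        if ch == "(":
            end = match_parens(s, pos)   # plays the role of (?R)
            if end < 0:
                return -1
            pos = end
        else:
            pos += 1                     # one character of a [^()]++ run
    return -1                            # ran out of subject before \)
```

The loop never revisits a character it has already consumed, which mirrors the effect of the possessive quantifier in the pattern.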
4870    
4871         If this were part of a larger pattern, you would not  want  to  recurse         If  this  were  part of a larger pattern, you would not want to recurse
4872         the entire pattern, so instead you could use this:         the entire pattern, so instead you could use this:
4873    
4874           ( \( ( [^()]++ | (?1) )* \) )           ( \( ( [^()]++ | (?1) )* \) )
4875    
4876         We  have  put the pattern into parentheses, and caused the recursion to         We have put the pattern into parentheses, and caused the  recursion  to
4877         refer to them instead of the whole pattern.         refer to them instead of the whole pattern.
4878    
4879         In a larger pattern,  keeping  track  of  parenthesis  numbers  can  be         In  a  larger  pattern,  keeping  track  of  parenthesis numbers can be
4880         tricky.  This  is made easier by the use of relative references (a Perl         tricky. This is made easier by the use of relative references  (a  Perl
4881         5.10 feature).  Instead of (?1) in the  pattern  above  you  can  write         5.10  feature).   Instead  of  (?1)  in the pattern above you can write
4882         (?-2) to refer to the second most recently opened parentheses preceding         (?-2) to refer to the second most recently opened parentheses preceding
4883         the recursion. In other  words,  a  negative  number  counts  capturing         the  recursion.  In  other  words,  a  negative number counts capturing
4884         parentheses leftwards from the point at which it is encountered.         parentheses leftwards from the point at which it is encountered.
4885    
4886         It  is  also  possible  to refer to subsequently opened parentheses, by         It is also possible to refer to  subsequently  opened  parentheses,  by
4887         writing references such as (?+2). However, these  cannot  be  recursive         writing  references  such  as (?+2). However, these cannot be recursive
4888         because  the  reference  is  not inside the parentheses that are refer-         because the reference is not inside the  parentheses  that  are  refer-
4889         enced. They are always "subroutine" calls, as  described  in  the  next         enced.  They  are  always  "subroutine" calls, as described in the next
4890         section.         section.
4891    
4892         An  alternative  approach is to use named parentheses instead. The Perl         An alternative approach is to use named parentheses instead.  The  Perl
4893         syntax for this is (?&name); PCRE's earlier syntax  (?P>name)  is  also         syntax  for  this  is (?&name); PCRE's earlier syntax (?P>name) is also
4894         supported. We could rewrite the above example as follows:         supported. We could rewrite the above example as follows:
4895    
4896           (?<pn> \( ( [^()]++ | (?&pn) )* \) )           (?<pn> \( ( [^()]++ | (?&pn) )* \) )
4897    
4898         If  there  is more than one subpattern with the same name, the earliest         If there is more than one subpattern with the same name,  the  earliest
4899         one is used.         one is used.
4900    
4901         This particular example pattern that we have been looking  at  contains         This  particular  example pattern that we have been looking at contains
4902         nested unlimited repeats, and so the use of a possessive quantifier for         nested unlimited repeats, and so the use of a possessive quantifier for
4903         matching strings of non-parentheses is important when applying the pat-         matching strings of non-parentheses is important when applying the pat-
4904         tern  to  strings  that do not match. For example, when this pattern is         tern to strings that do not match. For example, when  this  pattern  is
4905         applied to         applied to
4906    
4907           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4908    
4909         it yields "no match" quickly. However, if a  possessive  quantifier  is         it  yields  "no  match" quickly. However, if a possessive quantifier is
4910         not  used, the match runs for a very long time indeed because there are         not used, the match runs for a very long time indeed because there  are
4911         so many different ways the + and * repeats can carve  up  the  subject,         so  many  different  ways the + and * repeats can carve up the subject,
4912         and all have to be tested before failure can be reported.         and all have to be tested before failure can be reported.
4913    
4914         At  the  end  of a match, the values of capturing parentheses are those         At the end of a match, the values of capturing  parentheses  are  those
4915         from the outermost level. If you want to obtain intermediate values,  a         from  the outermost level. If you want to obtain intermediate values, a
4916         callout  function can be used (see below and the pcrecallout documenta-         callout function can be used (see below and the pcrecallout  documenta-
4917         tion). If the pattern above is matched against         tion). If the pattern above is matched against
4918    
4919           (ab(cd)ef)           (ab(cd)ef)
4920    
4921         the value for the inner capturing parentheses  (numbered  2)  is  "ef",         the  value  for  the  inner capturing parentheses (numbered 2) is "ef",
4922         which  is the last value taken on at the top level. If a capturing sub-         which is the last value taken on at the top level. If a capturing  sub-
4923         pattern is not matched at the top level, its final value is unset, even         pattern is not matched at the top level, its final value is unset, even
4924         if it is (temporarily) set at a deeper level.         if it is (temporarily) set at a deeper level.
4925    
4926         If  there are more than 15 capturing parentheses in a pattern, PCRE has         If there are more than 15 capturing parentheses in a pattern, PCRE  has
4927         to obtain extra memory to store data during a recursion, which it  does         to  obtain extra memory to store data during a recursion, which it does
4928         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory         by using pcre_malloc, freeing it via pcre_free afterwards. If no memory
4929         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.         can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
4930    
4931         Do not confuse the (?R) item with the condition (R),  which  tests  for         Do  not  confuse  the (?R) item with the condition (R), which tests for
4932         recursion.   Consider  this pattern, which matches text in angle brack-         recursion.  Consider this pattern, which matches text in  angle  brack-
4933         ets, allowing for arbitrary nesting. Only digits are allowed in  nested         ets,  allowing for arbitrary nesting. Only digits are allowed in nested
4934         brackets  (that is, when recursing), whereas any characters are permit-         brackets (that is, when recursing), whereas any characters are  permit-
4935         ted at the outer level.         ted at the outer level.
4936    
4937           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >           < (?: (?(R) \d++  | [^<>]*+) | (?R)) * >
4938    
4939         In this pattern, (?(R) is the start of a conditional  subpattern,  with         In  this  pattern, (?(R) is the start of a conditional subpattern, with
4940         two  different  alternatives for the recursive and non-recursive cases.         two different alternatives for the recursive and  non-recursive  cases.
4941         The (?R) item is the actual recursive call.         The (?R) item is the actual recursive call.
4942    
4943     Recursion difference from Perl     Recursion difference from Perl
4944    
4945         In PCRE (like Python, but unlike Perl), a recursive subpattern call  is         In  PCRE (like Python, but unlike Perl), a recursive subpattern call is
4946         always treated as an atomic group. That is, once it has matched some of         always treated as an atomic group. That is, once it has matched some of
4947         the subject string, it is never re-entered, even if it contains untried         the subject string, it is never re-entered, even if it contains untried
4948         alternatives  and  there  is a subsequent matching failure. This can be         alternatives and there is a subsequent matching failure.  This  can  be
4949         illustrated by the following pattern, which purports to match a  palin-         illustrated  by the following pattern, which purports to match a palin-
4950         dromic  string  that contains an odd number of characters (for example,         dromic string that contains an odd number of characters  (for  example,
4951         "a", "aba", "abcba", "abcdcba"):         "a", "aba", "abcba", "abcdcba"):
4952    
4953           ^(.|(.)(?1)\2)$           ^(.|(.)(?1)\2)$
4954    
4955         The idea is that it either matches a single character, or two identical         The idea is that it either matches a single character, or two identical
4956         characters  surrounding  a sub-palindrome. In Perl, this pattern works;         characters surrounding a sub-palindrome. In Perl, this  pattern  works;
4957         in PCRE it does not if the subject is  longer  than  three  characters.         in  PCRE  it  does  not if the subject is longer than three characters.
4958         Consider the subject string "abcba":         Consider the subject string "abcba":
4959    
4960         At  the  top level, the first character is matched, but as it is not at         At the top level, the first character is matched, but as it is  not  at
4961         the end of the string, the first alternative fails; the second alterna-         the end of the string, the first alternative fails; the second alterna-
4962         tive is taken and the recursion kicks in. The recursive call to subpat-         tive is taken and the recursion kicks in. The recursive call to subpat-
4963         tern 1 successfully matches the next character ("b").  (Note  that  the         tern  1  successfully  matches the next character ("b"). (Note that the
4964         beginning and end of line tests are not part of the recursion).         beginning and end of line tests are not part of the recursion).
4965    
4966         Back  at  the top level, the next character ("c") is compared with what         Back at the top level, the next character ("c") is compared  with  what
4967         subpattern 2 matched, which was "a". This fails. Because the  recursion         subpattern  2 matched, which was "a". This fails. Because the recursion
4968         is  treated  as  an atomic group, there are now no backtracking points,         is treated as an atomic group, there are now  no  backtracking  points,
4969         and so the entire match fails. (Perl is able, at  this  point,  to  re-         and  so  the  entire  match fails. (Perl is able, at this point, to re-
4970         enter  the  recursion  and try the second alternative.) However, if the         enter the recursion and try the second alternative.)  However,  if  the
4971         pattern is written with the alternatives in the other order, things are         pattern is written with the alternatives in the other order, things are
4972         different:         different:
4973    
4974           ^((.)(?1)\2|.)$           ^((.)(?1)\2|.)$
4975    
4976         This  time,  the recursing alternative is tried first, and continues to         This time, the recursing alternative is tried first, and  continues  to
4977         recurse until it runs out of characters, at which point  the  recursion         recurse  until  it runs out of characters, at which point the recursion
4978         fails.  But  this  time  we  do  have another alternative to try at the         fails. But this time we do have  another  alternative  to  try  at  the
4979         higher level. That is the big difference:  in  the  previous  case  the         higher  level.  That  is  the  big difference: in the previous case the
4980         remaining alternative is at a deeper recursion level, which PCRE cannot         remaining alternative is at a deeper recursion level, which PCRE cannot
4981         use.         use.
4982    
4983         To change the pattern so that it matches all palindromic strings, not just         To change the pattern so that it matches all palindromic strings, not just
4984         those  with  an  odd number of characters, it is tempting to change the         those with an odd number of characters, it is tempting  to  change  the
4985         pattern to this:         pattern to this:
4986    
4987           ^((.)(?1)\2|.?)$           ^((.)(?1)\2|.?)$
4988    
4989         Again, this works in Perl, but not in PCRE, and for  the  same  reason.         Again,  this  works  in Perl, but not in PCRE, and for the same reason.
4990         When  a  deeper  recursion has matched a single character, it cannot be         When a deeper recursion has matched a single character,  it  cannot  be
4991         entered again in order to match an empty string.  The  solution  is  to         entered  again  in  order  to match an empty string. The solution is to
4992         separate  the two cases, and write out the odd and even cases as alter-         separate the two cases, and write out the odd and even cases as  alter-
4993         natives at the higher level:         natives at the higher level:
4994    
4995           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))           ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
4996    
4997         If you want to match typical palindromic phrases, the  pattern  has  to         If  you  want  to match typical palindromic phrases, the pattern has to
4998         ignore all non-word characters, which can be done like this:         ignore all non-word characters, which can be done like this:
4999    
5000           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$           ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
5001    
5002         If run with the PCRE_CASELESS option, this pattern matches phrases such         If run with the PCRE_CASELESS option, this pattern matches phrases such
5003         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and         as "A man, a plan, a canal: Panama!" and it works well in both PCRE and
5004         Perl.  Note the use of the possessive quantifier *+ to avoid backtrack-         Perl. Note the use of the possessive quantifier *+ to avoid  backtrack-
5005         ing into sequences of non-word characters. Without this, PCRE  takes  a         ing  into  sequences of non-word characters. Without this, PCRE takes a
5006         great  deal  longer  (ten  times or more) to match typical phrases, and         great deal longer (ten times or more) to  match  typical  phrases,  and
5007         Perl takes so long that you think it has gone into a loop.         Perl takes so long that you think it has gone into a loop.
5008    
5009         WARNING: The palindrome-matching patterns above work only if  the  sub-         WARNING:  The  palindrome-matching patterns above work only if the sub-
5010         ject  string  does not start with a palindrome that is shorter than the         ject string does not start with a palindrome that is shorter  than  the
5011         entire string.  For example, although "abcba" is correctly matched,  if         entire  string.  For example, although "abcba" is correctly matched, if
5012         the  subject  is "ababa", PCRE finds the palindrome "aba" at the start,         the subject is "ababa", PCRE finds the palindrome "aba" at  the  start,
5013         then fails at top level because the end of the string does not  follow.         then  fails at top level because the end of the string does not follow.
5014         Once  again, it cannot jump back into the recursion to try other alter-         Once again, it cannot jump back into the recursion to try other  alter-
5015         natives, so the entire match fails.         natives, so the entire match fails.
5016    
5017    
5018  SUBPATTERNS AS SUBROUTINES  SUBPATTERNS AS SUBROUTINES
5019    
5020         If the syntax for a recursive subpattern reference (either by number or         If the syntax for a recursive subpattern reference (either by number or
5021         by  name)  is used outside the parentheses to which it refers, it oper-         by name) is used outside the parentheses to which it refers,  it  oper-
5022         ates like a subroutine in a programming language. The "called"  subpat-         ates  like a subroutine in a programming language. The "called" subpat-
5023         tern may be defined before or after the reference. A numbered reference         tern may be defined before or after the reference. A numbered reference
5024         can be absolute or relative, as in these examples:         can be absolute or relative, as in these examples:
5025    
# Line 5024  SUBPATTERNS AS SUBROUTINES Line 5031  SUBPATTERNS AS SUBROUTINES
5031    
5032           (sens|respons)e and \1ibility           (sens|respons)e and \1ibility
5033    
5034         matches "sense and sensibility" and "response and responsibility",  but         matches  "sense and sensibility" and "response and responsibility", but
5035         not "sense and responsibility". If instead the pattern         not "sense and responsibility". If instead the pattern
5036    
5037           (sens|respons)e and (?1)ibility           (sens|respons)e and (?1)ibility
5038    
5039         is  used, it does match "sense and responsibility" as well as the other         is used, it does match "sense and responsibility" as well as the  other
5040         two strings. Another example is  given  in  the  discussion  of  DEFINE         two  strings.  Another  example  is  given  in the discussion of DEFINE
5041         above.         above.
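
The difference between a back reference and a subroutine call can be seen in a short Python sketch. Python's re module has back references but no (?1), so the subroutine call is imitated by repeating the text of the subpattern; that substitution, and the names WORD, backref and subroutine_like, are assumptions made for illustration.

```python
import re

WORD = r"sens|respons"

# \1 must match the same text that group 1 matched.
backref = re.compile(rf"({WORD})e and \1ibility")

# Repeating the subpattern re-runs the pattern, as (?1) does, so either
# alternative may match the second time.
subroutine_like = re.compile(rf"({WORD})e and (?:{WORD})ibility")
```

With these definitions, backref rejects "sense and responsibility" while subroutine_like accepts it; both accept "sense and sensibility" and "response and responsibility".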
5042    
5043         Like  recursive  subpatterns, a subroutine call is always treated as an         Like recursive subpatterns, a subroutine call is always treated  as  an
5044         atomic group. That is, once it has matched some of the subject  string,         atomic  group. That is, once it has matched some of the subject string,
5045         it  is  never  re-entered, even if it contains untried alternatives and         it is never re-entered, even if it contains  untried  alternatives  and
5046         there is a subsequent matching failure. Any capturing parentheses  that         there  is a subsequent matching failure. Any capturing parentheses that
5047         are  set  during  the  subroutine  call revert to their previous values         are set during the subroutine call  revert  to  their  previous  values
5048         afterwards.         afterwards.
5049    
5050         When a subpattern is used as a subroutine, processing options  such  as         When  a  subpattern is used as a subroutine, processing options such as
5051         case-independence are fixed when the subpattern is defined. They cannot         case-independence are fixed when the subpattern is defined. They cannot
5052         be changed for different calls. For example, consider this pattern:         be changed for different calls. For example, consider this pattern:
5053    
5054           (abc)(?i:(?-1))           (abc)(?i:(?-1))
5055    
5056         It matches "abcabc". It does not match "abcABC" because the  change  of         It  matches  "abcabc". It does not match "abcABC" because the change of
5057         processing option does not affect the called subpattern.         processing option does not affect the called subpattern.
5058    
5059    
5060  ONIGURUMA SUBROUTINE SYNTAX  ONIGURUMA SUBROUTINE SYNTAX
5061    
5062         For  compatibility with Oniguruma, the non-Perl syntax \g followed by a         For compatibility with Oniguruma, the non-Perl syntax \g followed by  a
5063         name or a number enclosed either in angle brackets or single quotes, is         name or a number enclosed either in angle brackets or single quotes, is
5064         an  alternative  syntax  for  referencing a subpattern as a subroutine,         an alternative syntax for referencing a  subpattern  as  a  subroutine,
5065         possibly recursively. Here are two of the examples used above,  rewrit-         possibly  recursively. Here are two of the examples used above, rewrit-
5066         ten using this syntax:         ten using this syntax:
5067    
5068           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )           (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
5069           (sens|respons)e and \g'1'ibility           (sens|respons)e and \g'1'ibility
5070    
5071         PCRE  supports  an extension to Oniguruma: if a number is preceded by a         PCRE supports an extension to Oniguruma: if a number is preceded  by  a
5072         plus or a minus sign it is taken as a relative reference. For example:         plus or a minus sign it is taken as a relative reference. For example:
5073    
5074           (abc)(?i:\g<-1>)           (abc)(?i:\g<-1>)
5075    
5076         Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are  not         Note  that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
5077         synonymous.  The former is a back reference; the latter is a subroutine         synonymous. The former is a back reference; the latter is a  subroutine
5078         call.         call.
5079    
5080    
5081  CALLOUTS  CALLOUTS
5082    
5083         Perl has a feature whereby using the sequence (?{...}) causes arbitrary         Perl has a feature whereby using the sequence (?{...}) causes arbitrary
5084         Perl  code to be obeyed in the middle of matching a regular expression.         Perl code to be obeyed in the middle of matching a regular  expression.
5085         This makes it possible, amongst other things, to extract different sub-         This makes it possible, amongst other things, to extract different sub-
5086         strings that match the same pair of parentheses when there is a repeti-         strings that match the same pair of parentheses when there is a repeti-
5087         tion.         tion.
5088    
5089         PCRE provides a similar feature, but of course it cannot obey arbitrary         PCRE provides a similar feature, but of course it cannot obey arbitrary
5090         Perl code. The feature is called "callout". The caller of PCRE provides         Perl code. The feature is called "callout". The caller of PCRE provides
5091         an external function by putting its entry point in the global  variable         an  external function by putting its entry point in the global variable
5092         pcre_callout.   By default, this variable contains NULL, which disables         pcre_callout.  By default, this variable contains NULL, which  disables
5093         all calling out.         all calling out.
5094    
5095         Within a regular expression, (?C) indicates the  points  at  which  the         Within  a  regular  expression,  (?C) indicates the points at which the
5096         external  function  is  to be called. If you want to identify different         external function is to be called. If you want  to  identify  different
5097         callout points, you can put a number less than 256 after the letter  C.         callout  points, you can put a number less than 256 after the letter C.
5098         The  default  value is zero.  For example, this pattern has two callout         The default value is zero.  For example, this pattern has  two  callout
5099         points:         points:
5100    
5101           (?C1)abc(?C2)def           (?C1)abc(?C2)def
5102    
5103         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are         If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
5104         automatically  installed  before each item in the pattern. They are all         automatically installed before each item in the pattern. They  are  all
5105         numbered 255.         numbered 255.
5106    
5107         During matching, when PCRE reaches a callout point (and pcre_callout is         During matching, when PCRE reaches a callout point (and pcre_callout is
5108         set),  the  external function is called. It is provided with the number         set), the external function is called. It is provided with  the  number
5109         of the callout, the position in the pattern, and, optionally, one  item         of  the callout, the position in the pattern, and, optionally, one item
5110         of  data  originally supplied by the caller of pcre_exec(). The callout         of data originally supplied by the caller of pcre_exec().  The  callout
5111         function may cause matching to proceed, to backtrack, or to fail  alto-         function  may cause matching to proceed, to backtrack, or to fail alto-
5112         gether. A complete description of the interface to the callout function         gether. A complete description of the interface to the callout function
5113         is given in the pcrecallout documentation.         is given in the pcrecallout documentation.
5114    
5115    
5116  BACKTRACKING CONTROL  BACKTRACKING CONTROL
5117    
5118         Perl 5.10 introduced a number of "Special Backtracking Control  Verbs",         Perl  5.10 introduced a number of "Special Backtracking Control Verbs",
5119         which are described in the Perl documentation as "experimental and sub-         which are described in the Perl documentation as "experimental and sub-
5120         ject to change or removal in a future version of Perl". It goes  on  to         ject  to  change or removal in a future version of Perl". It goes on to
5121         say:  "Their usage in production code should be noted to avoid problems         say: "Their usage in production code should be noted to avoid  problems
5122         during upgrades." The same remarks apply to the PCRE features described         during upgrades." The same remarks apply to the PCRE features described
5123         in this section.         in this section.
5124    
5125         Since  these  verbs  are  specifically related to backtracking, most of         Since these verbs are specifically related  to  backtracking,  most  of
5126         them can be  used  only  when  the  pattern  is  to  be  matched  using         them  can  be  used  only  when  the  pattern  is  to  be matched using
5127         pcre_exec(), which uses a backtracking algorithm. With the exception of         pcre_exec(), which uses a backtracking algorithm. With the exception of
5128         (*FAIL), which behaves like a failing negative assertion, they cause an         (*FAIL), which behaves like a failing negative assertion, they cause an
5129         error if encountered by pcre_dfa_exec().         error if encountered by pcre_dfa_exec().
5130    
5131         If any of these verbs are used in an assertion or subroutine subpattern         If any of these verbs are used in an assertion or subroutine subpattern
5132         (including recursive subpatterns), their effect  is  confined  to  that         (including  recursive  subpatterns),  their  effect is confined to that
5133         subpattern;  it  does  not extend to the surrounding pattern. Note that         subpattern; it does not extend to the surrounding  pattern.  Note  that
5134         such subpatterns are processed as anchored at the point where they  are         such  subpatterns are processed as anchored at the point where they are
5135         tested.         tested.
5136    
5137         The  new verbs make use of what was previously invalid syntax: an open-         The new verbs make use of what was previously invalid syntax: an  open-
5138         ing parenthesis followed by an asterisk. In Perl, they are generally of         ing parenthesis followed by an asterisk. In Perl, they are generally of
5139         the form (*VERB:ARG) but PCRE does not support the use of arguments, so         the form (*VERB:ARG) but PCRE does not support the use of arguments, so
5140         its general form is just (*VERB). Any number of these verbs  may  occur         its  general  form is just (*VERB). Any number of these verbs may occur
5141         in a pattern. There are two kinds:         in a pattern. There are two kinds:
5142    
5143     Verbs that act immediately     Verbs that act immediately
# Line 5139  BACKTRACKING CONTROL Line 5146  BACKTRACKING CONTROL
5146    
5147            (*ACCEPT)            (*ACCEPT)
5148    
5149         This  verb causes the match to end successfully, skipping the remainder         This verb causes the match to end successfully, skipping the  remainder
5150         of the pattern. When inside a recursion, only the innermost pattern  is         of  the pattern. When inside a recursion, only the innermost pattern is
5151         ended  immediately.  If  (*ACCEPT) is inside capturing parentheses, the         ended immediately. If (*ACCEPT) is inside  capturing  parentheses,  the
5152         data so far is captured. (This feature was added  to  PCRE  at  release         data  so  far  is  captured. (This feature was added to PCRE at release
5153         8.00.) For example:         8.00.) For example:
5154    
5155           A((?:A|B(*ACCEPT)|C)D)           A((?:A|B(*ACCEPT)|C)D)
5156    
5157         This  matches  "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-         This matches "AB", "AAD", or "ACD"; when it matches "AB", "B"  is  cap-
5158         tured by the outer parentheses.         tured by the outer parentheses.
5159    
5160           (*FAIL) or (*F)           (*FAIL) or (*F)
5161    
5162         This verb causes the match to fail, forcing backtracking to  occur.  It         This  verb  causes the match to fail, forcing backtracking to occur. It
5163         is  equivalent to (?!) but easier to read. The Perl documentation notes         is equivalent to (?!) but easier to read. The Perl documentation  notes
5164         that it is probably useful only when combined  with  (?{})  or  (??{}).         that  it  is  probably  useful only when combined with (?{}) or (??{}).
5165         Those  are,  of course, Perl features that are not present in PCRE. The         Those are, of course, Perl features that are not present in  PCRE.  The
5166         nearest equivalent is the callout feature, as for example in this  pat-         nearest  equivalent is the callout feature, as for example in this pat-
5167         tern:         tern:
5168    
5169           a+(?C)(*FAIL)           a+(?C)(*FAIL)
5170    
5171         A  match  with the string "aaaa" always fails, but the callout is taken         A match with the string "aaaa" always fails, but the callout  is  taken
5172         before each backtrack happens (in this example, 10 times).         before each backtrack happens (in this example, 10 times).
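
The figure of 10 callouts can be checked with a small toy model. The sketch below is plain Python and does not call PCRE; the function name count_callouts is hypothetical, and the model assumes only what the text states: a+ is greedy, gives back one character per backtrack, and every starting position in the subject is tried in turn.

```python
def count_callouts(subject: str) -> int:
    """Toy model of a+(?C)(*FAIL): count how often the callout point is reached.

    This is an illustration of the backtracking order, not PCRE itself.
    """
    calls = 0
    for start in range(len(subject)):  # normal bumpalong over starting points
        # Length of the run of "a" characters beginning at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ first takes the whole run, then gives characters back
        # one at a time down to a single "a". The callout point is reached
        # once for each length, after which (*FAIL) forces a backtrack.
        for _length in range(run, 0, -1):
            calls += 1
    return calls


print(count_callouts("aaaa"))  # prints 10 (4 + 3 + 2 + 1)
```

For "aaaa" the four starting positions contribute 4, 3, 2 and 1 callouts respectively.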
5173    
5174     Verbs that act after backtracking     Verbs that act after backtracking
5175    
5176         The following verbs do nothing when they are encountered. Matching con-         The following verbs do nothing when they are encountered. Matching con-
5177         tinues  with what follows, but if there is no subsequent match, a fail-         tinues with what follows, but if there is no subsequent match, a  fail-
5178         ure is forced.  The verbs  differ  in  exactly  what  kind  of  failure         ure  is  forced.   The  verbs  differ  in  exactly what kind of failure
5179         occurs.         occurs.
5180    
5181           (*COMMIT)           (*COMMIT)
5182    
5183         This  verb  causes  the whole match to fail outright if the rest of the         This verb causes the whole match to fail outright if the  rest  of  the
5184         pattern does not match. Even if the pattern is unanchored,  no  further         pattern  does  not match. Even if the pattern is unanchored, no further
5185         attempts  to  find  a match by advancing the starting point take place.         attempts to find a match by advancing the starting  point  take  place.
5186         Once (*COMMIT) has been passed, pcre_exec() is committed to  finding  a         Once  (*COMMIT)  has been passed, pcre_exec() is committed to finding a
5187         match at the current starting point, or not at all. For example:         match at the current starting point, or not at all. For example:
5188    
5189           a+(*COMMIT)b           a+(*COMMIT)b
5190    
5191         This  matches  "xxaab" but not "aacaab". It can be thought of as a kind         This matches "xxaab" but not "aacaab". It can be thought of as  a  kind
5192         of dynamic anchor, or "I've started, so I must finish."         of dynamic anchor, or "I've started, so I must finish."
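
The behaviour of this example can be sketched with a hand-written toy model. The code below is plain Python, not a PCRE call; match_commit is a hypothetical name, and the function hard-codes the single pattern a+(*COMMIT)b under the assumption that (*COMMIT) is only "passed" once a+ has matched at least one character.

```python
from typing import Optional, Tuple


def match_commit(subject: str) -> Optional[Tuple[int, int]]:
    """Toy model of a+(*COMMIT)b; returns (start, end) offsets or None."""
    for start in range(len(subject)):
        # Greedy a+ : measure the run of "a" characters at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        if run == 0:
            # a+ failed before (*COMMIT) was reached, so the usual
            # advance to the next starting point still applies.
            continue
        end = start + run
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1)
        # "b" did not match: backtracking reaches (*COMMIT), which makes
        # the whole match fail with no further starting points tried.
        return None
    return None


print(match_commit("xxaab"))   # prints (2, 5)
print(match_commit("aacaab"))  # prints None
```

In "aacaab" the later "aab" is never examined, because the first attempt already passed (*COMMIT) before failing on "c".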
5193    
5194           (*PRUNE)           (*PRUNE)
5195    
5196         This verb causes the match to fail at the current position if the  rest         This  verb causes the match to fail at the current position if the rest
5197         of the pattern does not match. If the pattern is unanchored, the normal         of the pattern does not match. If the pattern is unanchored, the normal
5198         "bumpalong" advance to the next starting character then happens.  Back-         "bumpalong"  advance to the next starting character then happens. Back-
5199         tracking  can  occur as usual to the left of (*PRUNE), or when matching         tracking can occur as usual to the left of (*PRUNE), or  when  matching
5200         to the right of (*PRUNE), but if there is no match to the right,  back-         to  the right of (*PRUNE), but if there is no match to the right, back-
5201         tracking  cannot  cross (*PRUNE).  In simple cases, the use of (*PRUNE)         tracking cannot cross (*PRUNE).  In simple cases, the use  of  (*PRUNE)
5202         is just an alternative to an atomic group or possessive quantifier, but         is just an alternative to an atomic group or possessive quantifier, but
5203         there  are  some uses of (*PRUNE) that cannot be expressed in any other         there are some uses of (*PRUNE) that cannot be expressed in  any  other
5204         way.         way.
5205    
5206           (*SKIP)           (*SKIP)
5207    
5208         This verb is like (*PRUNE), except that if the pattern  is  unanchored,         This  verb  is like (*PRUNE), except that if the pattern is unanchored,
5209         the  "bumpalong" advance is not to the next character, but to the posi-         the "bumpalong" advance is not to the next character, but to the  posi-
5210         tion in the subject where (*SKIP) was  encountered.  (*SKIP)  signifies         tion  in  the  subject where (*SKIP) was encountered. (*SKIP) signifies
5211         that  whatever  text  was  matched leading up to it cannot be part of a         that whatever text was matched leading up to it cannot  be  part  of  a
5212         successful match. Consider:         successful match. Consider:
5213    
5214           a+(*SKIP)b           a+(*SKIP)b
5215    
5216         If the subject is "aaaac...",  after  the  first  match  attempt  fails         If  the  subject  is  "aaaac...",  after  the first match attempt fails
5217         (starting  at  the  first  character in the string), the starting point         (starting at the first character in the  string),  the  starting  point
5218         skips on to start the next attempt at "c". Note that a possessive quan-         skips on to start the next attempt at "c". Note that a possessive quan-
5219         tifier does not have the same effect as this example; although it would         tifier does not have the same effect as this example; although it would
5220         suppress backtracking  during  the  first  match  attempt,  the  second         suppress  backtracking  during  the  first  match  attempt,  the second
5221         attempt  would  start at the second character instead of skipping on to         attempt would start at the second character instead of skipping  on  to
5222         "c".         "c".
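
The difference in starting points can be made concrete with a toy model. The sketch below is plain Python rather than a call into PCRE; attempt_starts and its use_skip flag are hypothetical names. With use_skip set it models a+(*SKIP)b, and without it a possessive a++b, under the assumption that neither form retries shorter runs of "a" within one attempt.

```python
from typing import List


def attempt_starts(subject: str, use_skip: bool) -> List[int]:
    """Return the subject offsets at which a match attempt begins."""
    starts = []
    pos = 0
    while pos < len(subject):
        starts.append(pos)
        # Measure the run of "a" characters at this starting point.
        run = 0
        while pos + run < len(subject) and subject[pos + run] == "a":
            run += 1
        end = pos + run
        if run > 0 and end < len(subject) and subject[end] == "b":
            break  # the pattern matched; no more attempts are needed
        if use_skip and run > 0:
            # (*SKIP) was reached at offset `end`; the next attempt
            # starts there instead of at the next character.
            pos = end
        else:
            pos += 1  # ordinary bumpalong by one character
    return starts


print(attempt_starts("aaaac", use_skip=True))   # prints [0, 4]
print(attempt_starts("aaaac", use_skip=False))  # prints [0, 1, 2, 3, 4]
```

Offset 4 is the "c" in "aaaac", so the (*SKIP) form goes straight from the first character to "c", while the possessive form retries at every intermediate "a".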
5223    
5224           (*THEN)           (*THEN)
5225    
5226         This verb causes a skip to the next alternation if the rest of the pat-         This verb causes a skip to the next alternation if the rest of the pat-
5227         tern does not match. That is, it cancels pending backtracking, but only         tern does not match. That is, it cancels pending backtracking, but only
5228         within the current alternation. Its name  comes  from  the  observation         within  the  current  alternation.  Its name comes from the observation
5229         that it can be used for a pattern-based if-then-else block:         that it can be used for a pattern-based if-then-else block:
5230    
5231           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...           ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
5232    
5233         If  the COND1 pattern matches, FOO is tried (and possibly further items         If the COND1 pattern matches, FOO is tried (and possibly further  items
5234         after the end of the group if FOO succeeds);  on  failure  the  matcher         after  the  end  of  the group if FOO succeeds); on failure the matcher
5235         skips  to  the second alternative and tries COND2, without backtracking         skips to the second alternative and tries COND2,  without  backtracking
5236         into COND1. If (*THEN) is used outside  of  any  alternation,  it  acts         into  COND1.  If  (*THEN)  is  used outside of any alternation, it acts
5237         exactly like (*PRUNE).         exactly like (*PRUNE).
5238    
5239    
# Line 5244  AUTHOR Line 5251  AUTHOR
5251    
5252  REVISION  REVISION
5253    
5254         Last updated: 18 October 2009         Last updated: 11 January 2010
5255         Copyright (c) 1997-2009 University of Cambridge.         Copyright (c) 1997-2010 University of Cambridge.
5256  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
5257    
5258    

