Diff of /code/trunk/doc/pcrepattern.3

revision 385 by ph10, Sun Mar 8 16:56:58 2009 UTC revision 454 by ph10, Tue Sep 22 09:42:11 2009 UTC
# Line 23  description of PCRE's regular expression Line 23  description of PCRE's regular expression
23  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
24  there is now also support for UTF-8 character strings. To use this, you must  there is now also support for UTF-8 character strings. To use this, you must
25  build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with  build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with
26  the PCRE_UTF8 option. How this affects pattern matching is mentioned in several  the PCRE_UTF8 option. There is also a special sequence that can be given at the
27  places below. There is also a summary of UTF-8 features in the  start of a pattern:
28    .sp
29      (*UTF8)
30    .sp
31    Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8
32    option. This feature is not Perl-compatible. How setting UTF-8 mode affects
33    pattern matching is mentioned in several places below. There is also a summary
34    of UTF-8 features in the
35  .\" HTML <a href="pcre.html#utf8support">  .\" HTML <a href="pcre.html#utf8support">
36  .\" </a>  .\" </a>
37  section on UTF-8 support  section on UTF-8 support
# Line 326  syntax for referencing a subpattern as a Line 333  syntax for referencing a subpattern as a
333  later.  later.
334  .\"  .\"
335  Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP  Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
336  synonymous. The former is a back reference; the latter is a subroutine call.  synonymous. The former is a back reference; the latter is a
337    .\" HTML <a href="#subpatternsassubroutines">
338    .\" </a>
339    subroutine
340    .\"
341    call.
342  .  .
343  .  .
344  .SS "Generic character types"  .SS "Generic character types"
# Line 364  In UTF-8 mode, characters with values gr Line 376  In UTF-8 mode, characters with values gr
376  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode
377  character property support is available. These sequences retain their original  character property support is available. These sequences retain their original
378  meanings from before UTF-8 support was available, mainly for efficiency  meanings from before UTF-8 support was available, mainly for efficiency
379  reasons.  reasons. Note that this also affects \eb, because it is defined in terms of \ew
380    and \eW.
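The knock-on effect on \eb can be seen by analogy in Python's standard `re` module. This is only an analogy and not PCRE itself: it assumes that Python's `re.ASCII` flag, which restricts \w to ASCII characters, is close enough to PCRE's treatment of characters above 128 to show the idea. The helper name `words` is hypothetical.

```python
import re

def words(text, ascii_only):
    r"""Return the runs matched by \b\w+\b, optionally with ASCII-only \w."""
    flags = re.ASCII if ascii_only else 0
    return re.findall(r"\b\w+\b", text, flags)

# U+00EF (i with diaeresis) is written as an escape to avoid encoding issues.
sample = "na\u00efve"

# With ASCII-only \w the accented letter is a non-word character, so a word
# boundary appears on each side of it and the word is split in two.
print(words(sample, ascii_only=True))   # ['na', 've']
print(words(sample, ascii_only=False))  # one word, accented letter included
```

Because \eb is just "a \ew next to a \eW (or the string edge)", narrowing \ew automatically moves the boundaries; no separate rule for \eb is involved.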
381  .P  .P
382  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
383  other sequences, these do match certain high-valued codepoints in UTF-8 mode.  other sequences, these do match certain high-valued codepoints in UTF-8 mode.
# Line 634  cannot be tested by PCRE, unless UTF-8 v Line 647  cannot be tested by PCRE, unless UTF-8 v
647  .\" HREF  .\" HREF
648  \fBpcreapi\fP  \fBpcreapi\fP
649  .\"  .\"
650  page).  page). Perl does not support the Cs property.
651  .P  .P
652  The long synonyms for these properties that Perl supports (such as \ep{Letter})  The long synonyms for property names that Perl supports (such as \ep{Letter})
653  are not supported by PCRE, nor is it permitted to prefix any of these  are not supported by PCRE, nor is it permitted to prefix any of these
654  properties with "Is".  properties with "Is".
655  .P  .P
# Line 1031  The PCRE-specific options PCRE_DUPNAMES, Line 1044  The PCRE-specific options PCRE_DUPNAMES,
1044  changed in the same way as the Perl-compatible options by using the characters  changed in the same way as the Perl-compatible options by using the characters
1045  J, U and X respectively.  J, U and X respectively.
1046  .P  .P
1047  When an option change occurs at top level (that is, not inside subpattern  When one of these option changes occurs at top level (that is, not inside
1048  parentheses), the change applies to the remainder of the pattern that follows.  subpattern parentheses), the change applies to the remainder of the pattern
1049  If the change is placed right at the start of a pattern, PCRE extracts it into  that follows. If the change is placed right at the start of a pattern, PCRE
1050  the global options (and it will therefore show up in data extracted by the  extracts it into the global options (and it will therefore show up in data
1051  \fBpcre_fullinfo()\fP function).  extracted by the \fBpcre_fullinfo()\fP function).
1052  .P  .P
1053  An option change within a subpattern (see below for a description of  An option change within a subpattern (see below for a description of
1054  subpatterns) affects only that part of the current pattern that follows it, so  subpatterns) affects only that part of the current pattern that follows it, so
# Line 1056  behaviour otherwise. Line 1069  behaviour otherwise.
1069  .P  .P
1070  \fBNote:\fP There are other PCRE-specific options that can be set by the  \fBNote:\fP There are other PCRE-specific options that can be set by the
1071  application when the compile or match functions are called. In some cases the  application when the compile or match functions are called. In some cases the
1072  pattern can contain special leading sequences to override what the application  pattern can contain special leading sequences such as (*CRLF) to override what
1073  has set or what has been defaulted. Details are given in the section entitled  the application has set or what has been defaulted. Details are given in the
1074    section entitled
1075  .\" HTML <a href="#newlineseq">  .\" HTML <a href="#newlineseq">
1076  .\" </a>  .\" </a>
1077  "Newline sequences"  "Newline sequences"
1078  .\"  .\"
1079  above.  above. There is also the (*UTF8) leading sequence that can be used to set UTF-8
1080    mode; this is equivalent to setting the PCRE_UTF8 option.
1081  .  .
1082  .  .
1083  .\" HTML <a name="subpattern"></a>  .\" HTML <a name="subpattern"></a>
# Line 1659  is permitted, but Line 1674  is permitted, but
1674  .sp  .sp
1675  causes an error at compile time. Branches that match different length strings  causes an error at compile time. Branches that match different length strings
1676  are permitted only at the top level of a lookbehind assertion. This is an  are permitted only at the top level of a lookbehind assertion. This is an
1677  extension compared with Perl (at least for 5.8), which requires all branches to  extension compared with Perl (5.8 and 5.10), which requires all branches to
1678  match the same length of string. An assertion such as  match the same length of string. An assertion such as
1679  .sp  .sp
1680    (?<=ab(c|de))    (?<=ab(c|de))
1681  .sp  .sp
1682  is not permitted, because its single top-level branch can match two different  is not permitted, because its single top-level branch can match two different
1683  lengths, but it is acceptable if rewritten to use two top-level branches:  lengths, but it is acceptable to PCRE if rewritten to use two top-level
1684    branches:
1685  .sp  .sp
1686    (?<=abc|abde)    (?<=abc|abde)
1687  .sp  .sp
# Line 1674  In some cases, the Perl 5.10 escape sequ Line 1690  In some cases, the Perl 5.10 escape sequ
1690  .\" </a>  .\" </a>
1691  (see above)  (see above)
1692  .\"  .\"
1693  can be used instead of a lookbehind assertion; this is not restricted to a  can be used instead of a lookbehind assertion to get round the fixed-length
1694  fixed-length.  restriction.
1695  .P  .P
1696  The implementation of lookbehind assertions is, for each alternative, to  The implementation of lookbehind assertions is, for each alternative, to
1697  temporarily move the current position back by the fixed length and then try to  temporarily move the current position back by the fixed length and then try to
# Line 1687  to appear in lookbehind assertions, beca Line 1703  to appear in lookbehind assertions, beca
1703  the length of the lookbehind. The \eX and \eR escapes, which can match  the length of the lookbehind. The \eX and \eR escapes, which can match
1704  different numbers of bytes, are also not permitted.  different numbers of bytes, are also not permitted.
1705  .P  .P
1706    .\" HTML <a href="#subpatternsassubroutines">
1707    .\" </a>
1708    "Subroutine"
1709    .\"
1710    calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
1711    as the subpattern matches a fixed-length string.
1712    .\" HTML <a href="#recursion">
1713    .\" </a>
1714    Recursion,
1715    .\"
1716    however, is not supported.
1717    .P
1718  Possessive quantifiers can be used in conjunction with lookbehind assertions to  Possessive quantifiers can be used in conjunction with lookbehind assertions to
1719  specify efficient matching at the end of the subject string. Consider a simple  specify efficient matching at the end of the subject string. Consider a simple
1720  pattern such as  pattern such as
# Line 1831  the condition is true if the most recent Line 1859  the condition is true if the most recent
1859  number or name is given. This condition does not check the entire recursion  number or name is given. This condition does not check the entire recursion
1860  stack.  stack.
1861  .P  .P
1862  At "top level", all these recursion test conditions are false. Recursive  At "top level", all these recursion test conditions are false.
1863  patterns are described below.  .\" HTML <a href="#recursion">
1864    .\" </a>
1865    Recursive patterns
1866    .\"
1867    are described below.
1868  .  .
1869  .SS "Defining subpatterns for use by reference only"  .SS "Defining subpatterns for use by reference only"
1870  .rs  .rs
# Line 1841  If the condition is the string (DEFINE), Line 1873  If the condition is the string (DEFINE),
1873  name DEFINE, the condition is always false. In this case, there may be only one  name DEFINE, the condition is always false. In this case, there may be only one
1874  alternative in the subpattern. It is always skipped if control reaches this  alternative in the subpattern. It is always skipped if control reaches this
1875  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
1876  "subroutines" that can be referenced from elsewhere. (The use of "subroutines"  "subroutines" that can be referenced from elsewhere. (The use of
1877    .\" HTML <a href="#subpatternsassubroutines">
1878    .\" </a>
1879    "subroutines"
1880    .\"
1881  is described below.) For example, a pattern to match an IPv4 address could be  is described below.) For example, a pattern to match an IPv4 address could be
1882  written like this (ignore whitespace and line breaks):  written like this (ignore whitespace and line breaks):
1883  .sp  .sp
# Line 1912  recursively to the pattern in which it a Line 1948  recursively to the pattern in which it a
1948  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
1949  supports special syntax for recursion of the entire pattern, and also for  supports special syntax for recursion of the entire pattern, and also for
1950  individual subpattern recursion. After its introduction in PCRE and Python,  individual subpattern recursion. After its introduction in PCRE and Python,
1951  this kind of recursion was introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
1952  .P  .P
1953  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
1954  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive call of the subpattern of the given number,
1955  provided that it occurs inside that subpattern. (If not, it is a "subroutine"  provided that it occurs inside that subpattern. (If not, it is a
1956    .\" HTML <a href="#subpatternsassubroutines">
1957    .\" </a>
1958    "subroutine"
1959    .\"
1960  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
1961  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
1962  .P  .P
 In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  
 treated as an atomic group. That is, once it has matched some of the subject  
 string, it is never re-entered, even if it contains untried alternatives and  
 there is a subsequent matching failure.  
 .P  
1963  This PCRE pattern solves the nested parentheses problem (assume the  This PCRE pattern solves the nested parentheses problem (assume the
1964  PCRE_EXTENDED option is set so that white space is ignored):  PCRE_EXTENDED option is set so that white space is ignored):
1965  .sp  .sp
# Line 1953  it is encountered. Line 1988  it is encountered.
1988  It is also possible to refer to subsequently opened parentheses, by writing  It is also possible to refer to subsequently opened parentheses, by writing
1989  references such as (?+2). However, these cannot be recursive because the  references such as (?+2). However, these cannot be recursive because the
1990  reference is not inside the parentheses that are referenced. They are always  reference is not inside the parentheses that are referenced. They are always
1991  "subroutine" calls, as described in the next section.  .\" HTML <a href="#subpatternsassubroutines">
1992    .\" </a>
1993    "subroutine"
1994    .\"
1995    calls, as described in the next section.
1996  .P  .P
1997  An alternative approach is to use named parentheses instead. The Perl syntax  An alternative approach is to use named parentheses instead. The Perl syntax
1998  for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We  for this is (?&name); PCRE's earlier syntax (?P>name) is also supported. We
# Line 2012  different alternatives for the recursive Line 2051  different alternatives for the recursive
2051  is the actual recursive call.  is the actual recursive call.
2052  .  .
2053  .  .
2054    .\" HTML <a name="recursiondifference"></a>
2055    .SS "Recursion difference from Perl"
2056    .rs
2057    .sp
2058    In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
2059    treated as an atomic group. That is, once it has matched some of the subject
2060    string, it is never re-entered, even if it contains untried alternatives and
2061    there is a subsequent matching failure. This can be illustrated by the
2062    following pattern, which purports to match a palindromic string that contains
2063    an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
2064    .sp
2065      ^(.|(.)(?1)\e2)$
2066    .sp
2067    The idea is that it either matches a single character, or two identical
2068    characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2069    it does not if the pattern is longer than three characters. Consider the
2070    subject string "abcba":
2071    .P
2072    At the top level, the first character is matched, but as it is not at the end
2073    of the string, the first alternative fails; the second alternative is taken
2074    and the recursion kicks in. The recursive call to subpattern 1 successfully
2075    matches the next character ("b"). (Note that the beginning and end of line
2076    tests are not part of the recursion.)
2077    .P
2078    Back at the top level, the next character ("c") is compared with what
2079    subpattern 2 matched, which was "a". This fails. Because the recursion is
2080    treated as an atomic group, there are now no backtracking points, and so the
2081    entire match fails. (Perl is able, at this point, to re-enter the recursion and
2082    try the second alternative.) However, if the pattern is written with the
2083    alternatives in the other order, things are different:
2084    .sp
2085      ^((.)(?1)\e2|.)$
2086    .sp
2087    This time, the recursing alternative is tried first, and continues to recurse
2088    until it runs out of characters, at which point the recursion fails. But this
2089    time we do have another alternative to try at the higher level. That is the big
2090    difference: in the previous case the remaining alternative is at a deeper
2091    recursion level, which PCRE cannot use.
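The difference between the two orderings can be reproduced with a small backtracking model written in plain Python. This is a sketch of the behaviour described above, not PCRE's implementation: the names `group1` and `matches` and the flags `atomic` and `recurse_first` are hypothetical, and "atomic" is modelled simply by keeping only the first result of each recursive call.

```python
from itertools import islice

def group1(s, i, atomic, recurse_first):
    r"""Yield every end offset at which subpattern 1 can match from offset i.

    Models  .|(.)(?1)\2  (recurse_first=False) or  (.)(?1)\2|.  (True).
    If atomic is True, a recursive call offers only its first match, as
    described for PCRE; otherwise it can be re-entered, as in Perl.
    """
    def single():
        # The alternative that matches exactly one character.
        if i < len(s):
            yield i + 1

    def recursing():
        # The alternative (.)(?1)\2
        if i < len(s):
            first = s[i]                          # what group 2 captured
            inner = group1(s, i + 1, atomic, recurse_first)
            if atomic:
                inner = islice(inner, 1)          # never re-enter the call
            for j in inner:
                if j < len(s) and s[j] == first:  # the \2 back reference
                    yield j + 1

    branches = (recursing, single) if recurse_first else (single, recursing)
    for branch in branches:
        yield from branch()

def matches(s, atomic, recurse_first):
    """True if ^(...)$ matches the whole of s under the chosen model."""
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))

print(matches("abcba", atomic=True, recurse_first=False))   # False
print(matches("abcba", atomic=False, recurse_first=False))  # True
print(matches("abcba", atomic=True, recurse_first=True))    # True
```

The top-level use of subpattern 1 is not a recursive call, so its alternatives stay available in both models; only the nested calls are cut down to a single result. That is why moving the recursing alternative first is enough to make the atomic model succeed.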
2092    .P
2093    To change the pattern so that it matches all palindromic strings, not just those
2094    with an odd number of characters, it is tempting to change the pattern to this:
2095    .sp
2096      ^((.)(?1)\e2|.?)$
2097    .sp
2098    Again, this works in Perl, but not in PCRE, and for the same reason. When a
2099    deeper recursion has matched a single character, it cannot be entered again in
2100    order to match an empty string. The solution is to separate the two cases, and
2101    write out the odd and even cases as alternatives at the higher level:
2102    .sp
2103      ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
2104    .sp
2105    If you want to match typical palindromic phrases, the pattern has to ignore all
2106    non-word characters, which can be done like this:
2107    .sp
2108      ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
2109    .sp
2110    If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2111    man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2112    the use of the possessive quantifier *+ to avoid backtracking into sequences of
2113    non-word characters. Without this, PCRE takes a great deal longer (ten times or
2114    more) to match typical phrases, and Perl takes so long that you think it has
2115    gone into a loop.
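As a cross-check on what the phrase pattern is meant to accept, the same test can be written without any regex recursion. This is a plain-Python reference under stated assumptions, not a translation of the pattern: the function name `is_palindromic_phrase` is hypothetical, `re.ASCII` is used so that \W means "not a letter, digit or underscore", and lower-casing stands in for PCRE_CASELESS.

```python
import re

def is_palindromic_phrase(phrase):
    """Drop non-word characters, ignore case, and compare the remaining
    characters with their reverse."""
    kept = re.sub(r"\W+", "", phrase, flags=re.ASCII).lower()
    return kept == kept[::-1]

print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))  # True
print(is_palindromic_phrase("A man, a plan, a canal: Suez!"))    # False
```

A reference like this can be handy when testing the regular expression itself, since any phrase on which the two disagree points at a problem in one of them.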
2116    .
2117    .
2118  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
2119  .SH "SUBPATTERNS AS SUBROUTINES"  .SH "SUBPATTERNS AS SUBROUTINES"
2120  .rs  .rs
# Line 2125  a backtracking algorithm. With the excep Line 2228  a backtracking algorithm. With the excep
2228  failing negative assertion, they cause an error if encountered by  failing negative assertion, they cause an error if encountered by
2229  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2230  .P  .P
2231    If any of these verbs are used in an assertion subpattern, their effect is
2232    confined to that subpattern; it does not extend to the surrounding pattern.
2233    Note that assertion subpatterns are processed as anchored at the point where
2234    they are tested.
2235    .P
2236  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2237  parenthesis followed by an asterisk. In Perl, they are generally of the form  parenthesis followed by an asterisk. In Perl, they are generally of the form
2238  (*VERB:ARG) but PCRE does not support the use of arguments, so its general  (*VERB:ARG) but PCRE does not support the use of arguments, so its general
# Line 2140  The following verbs act as soon as they Line 2248  The following verbs act as soon as they
2248  .sp  .sp
2249  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2250  pattern. When inside a recursion, only the innermost pattern is ended  pattern. When inside a recursion, only the innermost pattern is ended
2251  immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside  immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
2252  capturing parentheses. In Perl, the data so far is captured: in PCRE no data is  is captured. (This feature was added to PCRE at release 8.00.) For example:
 captured. For example:  
2253  .sp  .sp
2254    A(A|B(*ACCEPT)|C)D    A((?:A|B(*ACCEPT)|C)D)
2255  .sp  .sp
2256  This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
2257  captured.  the outer parentheses.
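The captures stated for the new example can be illustrated with an ordinary pattern that avoids the verb. Python's standard `re` module has no (*ACCEPT), so the sketch below uses a hand-written equivalent, A((?:A|C)D|B), which is an assumption made for illustration: it is intended to give the same overall match and the same group 1 contents for the three subjects mentioned, and nothing more.

```python
import re

# Verb-free stand-in for  A((?:A|B(*ACCEPT)|C)D) : either "B" ends the
# match at once, or "A" / "C" must be followed by "D".
STAND_IN = re.compile(r"A((?:A|C)D|B)")

def outer_capture(subject):
    """Return what group 1 holds if the whole subject matches, else None."""
    m = STAND_IN.fullmatch(subject)
    return m.group(1) if m else None

for subject in ("AB", "AAD", "ACD"):
    print(subject, "->", outer_capture(subject))
# AB -> B
# AAD -> AD
# ACD -> CD
```

In the real pattern the "B" case never reaches the trailing D because (*ACCEPT) ends the match first; the stand-in expresses that by making "B" a separate branch of the outer group.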
2258  .sp  .sp
2259    (*FAIL) or (*F)    (*FAIL) or (*F)
2260  .sp  .sp
# Line 2244  Cambridge CB2 3QH, England. Line 2351  Cambridge CB2 3QH, England.
2351  .rs  .rs
2352  .sp  .sp
2353  .nf  .nf
2354  Last updated: 08 March 2009  Last updated: 22 September 2009
2355  Copyright (c) 1997-2009 University of Cambridge.  Copyright (c) 1997-2009 University of Cambridge.
2356  .fi  .fi

