
Diff of /code/trunk/doc/pcrepattern.3


revision 394 by ph10, Wed Mar 18 16:38:23 2009 UTC revision 453 by ph10, Fri Sep 18 19:12:35 2009 UTC
# Line 23  description of PCRE's regular expression Line 23  description of PCRE's regular expression
23  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
24  there is now also support for UTF-8 character strings. To use this, you must  there is now also support for UTF-8 character strings. To use this, you must
25  build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with  build PCRE to include UTF-8 support, and then call \fBpcre_compile()\fP with
26  the PCRE_UTF8 option. How this affects pattern matching is mentioned in several  the PCRE_UTF8 option. There is also a special sequence that can be given at the
27  places below. There is also a summary of UTF-8 features in the  start of a pattern:
28    .sp
29      (*UTF8)
30    .sp
31    Starting a pattern with this sequence is equivalent to setting the PCRE_UTF8
32    option. This feature is not Perl-compatible. How setting UTF-8 mode affects
33    pattern matching is mentioned in several places below. There is also a summary
34    of UTF-8 features in the
35  .\" HTML <a href="pcre.html#utf8support">  .\" HTML <a href="pcre.html#utf8support">
36  .\" </a>  .\" </a>
37  section on UTF-8 support  section on UTF-8 support
# Line 364  In UTF-8 mode, characters with values gr Line 371  In UTF-8 mode, characters with values gr
371  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode  \ew, and always match \eD, \eS, and \eW. This is true even when Unicode
372  character property support is available. These sequences retain their original  character property support is available. These sequences retain their original
373  meanings from before UTF-8 support was available, mainly for efficiency  meanings from before UTF-8 support was available, mainly for efficiency
374  reasons. Note that this also affects \eb, because it is defined in terms of \ew  reasons. Note that this also affects \eb, because it is defined in terms of \ew
375  and \eW.  and \eW.
376  .P  .P
377  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the
# Line 635  cannot be tested by PCRE, unless UTF-8 v Line 642  cannot be tested by PCRE, unless UTF-8 v
642  .\" HREF  .\" HREF
643  \fBpcreapi\fP  \fBpcreapi\fP
644  .\"  .\"
645  page).  page). Perl does not support the Cs property.
646  .P  .P
647  The long synonyms for these properties that Perl supports (such as \ep{Letter})  The long synonyms for property names that Perl supports (such as \ep{Letter})
648  are not supported by PCRE, nor is it permitted to prefix any of these  are not supported by PCRE, nor is it permitted to prefix any of these
649  properties with "Is".  properties with "Is".
650  .P  .P
# Line 1032  The PCRE-specific options PCRE_DUPNAMES, Line 1039  The PCRE-specific options PCRE_DUPNAMES,
1039  changed in the same way as the Perl-compatible options by using the characters  changed in the same way as the Perl-compatible options by using the characters
1040  J, U and X respectively.  J, U and X respectively.
1041  .P  .P
1042  When an option change occurs at top level (that is, not inside subpattern  When one of these option changes occurs at top level (that is, not inside
1043  parentheses), the change applies to the remainder of the pattern that follows.  subpattern parentheses), the change applies to the remainder of the pattern
1044  If the change is placed right at the start of a pattern, PCRE extracts it into  that follows. If the change is placed right at the start of a pattern, PCRE
1045  the global options (and it will therefore show up in data extracted by the  extracts it into the global options (and it will therefore show up in data
1046  \fBpcre_fullinfo()\fP function).  extracted by the \fBpcre_fullinfo()\fP function).
1047  .P  .P
1048  An option change within a subpattern (see below for a description of  An option change within a subpattern (see below for a description of
1049  subpatterns) affects only that part of the current pattern that follows it, so  subpatterns) affects only that part of the current pattern that follows it, so
# Line 1057  behaviour otherwise. Line 1064  behaviour otherwise.
1064  .P  .P
1065  \fBNote:\fP There are other PCRE-specific options that can be set by the  \fBNote:\fP There are other PCRE-specific options that can be set by the
1066  application when the compile or match functions are called. In some cases the  application when the compile or match functions are called. In some cases the
1067  pattern can contain special leading sequences to override what the application  pattern can contain special leading sequences such as (*CRLF) to override what
1068  has set or what has been defaulted. Details are given in the section entitled  the application has set or what has been defaulted. Details are given in the
1069    section entitled
1070  .\" HTML <a href="#newlineseq">  .\" HTML <a href="#newlineseq">
1071  .\" </a>  .\" </a>
1072  "Newline sequences"  "Newline sequences"
1073  .\"  .\"
1074  above.  above. There is also the (*UTF8) leading sequence that can be used to set UTF-8
1075    mode; this is equivalent to setting the PCRE_UTF8 option.
1076  .  .
1077  .  .
1078  .\" HTML <a name="subpattern"></a>  .\" HTML <a name="subpattern"></a>
# Line 1913  recursively to the pattern in which it a Line 1922  recursively to the pattern in which it a
1922  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it  Obviously, PCRE cannot support the interpolation of Perl code. Instead, it
1923  supports special syntax for recursion of the entire pattern, and also for  supports special syntax for recursion of the entire pattern, and also for
1924  individual subpattern recursion. After its introduction in PCRE and Python,  individual subpattern recursion. After its introduction in PCRE and Python,
1925  this kind of recursion was introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
1926  .P  .P
1927  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
1928  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive call of the subpattern of the given number,
# Line 1921  provided that it occurs inside that subp Line 1930  provided that it occurs inside that subp
1930  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
1931  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
1932  .P  .P
 In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  
 treated as an atomic group. That is, once it has matched some of the subject  
 string, it is never re-entered, even if it contains untried alternatives and  
 there is a subsequent matching failure.  
 .P  
1933  This PCRE pattern solves the nested parentheses problem (assume the  This PCRE pattern solves the nested parentheses problem (assume the
1934  PCRE_EXTENDED option is set so that white space is ignored):  PCRE_EXTENDED option is set so that white space is ignored):
1935  .sp  .sp
# Line 2013  different alternatives for the recursive Line 2017  different alternatives for the recursive
2017  is the actual recursive call.  is the actual recursive call.
2018  .  .
2019  .  .
2020    .\" HTML <a name="recursiondifference"></a>
2021    .SS "Recursion difference from Perl"
2022    .rs
2023    .sp
2024    In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
2025    treated as an atomic group. That is, once it has matched some of the subject
2026    string, it is never re-entered, even if it contains untried alternatives and
2027    there is a subsequent matching failure. This can be illustrated by the
2028    following pattern, which purports to match a palindromic string that contains
2029    an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
2030    .sp
2031      ^(.|(.)(?1)\e2)$
2032    .sp
2033    The idea is that it either matches a single character, or two identical
2034    characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2035    it does not if the subject string is longer than three characters. Consider the
2036    subject string "abcba":
2037    .P
2038    At the top level, the first character is matched, but as it is not at the end
2039    of the string, the first alternative fails; the second alternative is taken
2040    and the recursion kicks in. The recursive call to subpattern 1 successfully
2041    matches the next character ("b"). (Note that the beginning and end of line
2042    tests are not part of the recursion.)
2043    .P
2044    Back at the top level, the next character ("c") is compared with what
2045    subpattern 2 matched, which was "a". This fails. Because the recursion is
2046    treated as an atomic group, there are now no backtracking points, and so the
2047    entire match fails. (Perl is able, at this point, to re-enter the recursion and
2048    try the second alternative.) However, if the pattern is written with the
2049    alternatives in the other order, things are different:
2050    .sp
2051      ^((.)(?1)\e2|.)$
2052    .sp
2053    This time, the recursing alternative is tried first, and continues to recurse
2054    until it runs out of characters, at which point the recursion fails. But this
2055    time we do have another alternative to try at the higher level. That is the big
2056    difference: in the previous case the remaining alternative is at a deeper
2057    recursion level, which PCRE cannot use.
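
The difference between the two orderings can be reproduced outside any regex engine. The following is a minimal Python sketch, not PCRE's or Perl's actual implementation: the helper names `group1` and `matches` are hypothetical, `atomic=True` models a recursive call that is never re-entered (the PCRE behaviour described above), `atomic=False` models one that can be backtracked into (the Perl behaviour), and "." is assumed to match any character.

```python
from itertools import islice

def group1(s, pos, atomic, recurse_first):
    """Yield each end position at which group 1 can match s from pos,
    in the order a backtracking matcher would try them."""
    def single():
        # Alternative ".": consume exactly one character.
        if pos < len(s):
            yield pos + 1

    def wrapped():
        # Alternative "(.)(?1)\2": a character, a recursive call to
        # group 1, then the same character again.
        if pos >= len(s):
            return
        first = s[pos]
        inner = group1(s, pos + 1, atomic, recurse_first)
        if atomic:
            # Atomic recursion: keep only the first way the call matched.
            inner = islice(inner, 1)
        for mid in inner:
            if mid < len(s) and s[mid] == first:
                yield mid + 1

    order = (wrapped, single) if recurse_first else (single, wrapped)
    for alternative in order:
        yield from alternative()

def matches(s, atomic, recurse_first):
    # The ^ and $ anchors: group 1 must span the whole subject string.
    return any(end == len(s) for end in group1(s, 0, atomic, recurse_first))

# ^(.|(.)(?1)\2)$ with atomic recursion fails on "abcba" ...
print(matches("abcba", atomic=True, recurse_first=False))   # False
# ... but succeeds if the recursion can be re-entered.
print(matches("abcba", atomic=False, recurse_first=False))  # True
# ^((.)(?1)\2|.)$ succeeds on "abcba" even with atomic recursion.
print(matches("abcba", atomic=True, recurse_first=True))    # True
```

Modelling each subpattern as a generator of end positions makes "re-entering" a recursion the same thing as asking the generator for its next value, so the atomic case is simply a generator truncated after its first result.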
2058    .P
2059    To change the pattern so that it matches all palindromic strings, not just those
2060    with an odd number of characters, it is tempting to change the pattern to this:
2061    .sp
2062      ^((.)(?1)\e2|.?)$
2063    .sp
2064    Again, this works in Perl, but not in PCRE, and for the same reason. When a
2065    deeper recursion has matched a single character, it cannot be entered again in
2066    order to match an empty string. The solution is to separate the two cases, and
2067    write out the odd and even cases as alternatives at the higher level:
2068    .sp
2069      ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
2070    .sp
2071    If you want to match typical palindromic phrases, the pattern has to ignore all
2072    non-word characters, which can be done like this:
2073    .sp
2074      ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
2075    .sp
2076    If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2077    man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2078    the use of the possessive quantifier *+ to avoid backtracking into sequences of
2079    non-word characters. Without this, PCRE takes a great deal longer (ten times or
2080    more) to match typical phrases, and Perl takes so long that you think it has
2081    gone into a loop.
2082    .
2083    .
2084  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
2085  .SH "SUBPATTERNS AS SUBROUTINES"  .SH "SUBPATTERNS AS SUBROUTINES"
2086  .rs  .rs
# Line 2126  a backtracking algorithm. With the excep Line 2194  a backtracking algorithm. With the excep
2194  failing negative assertion, they cause an error if encountered by  failing negative assertion, they cause an error if encountered by
2195  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2196  .P  .P
2197    If any of these verbs are used in an assertion subpattern, their effect is
2198    confined to that subpattern; it does not extend to the surrounding pattern.
2199    Note that assertion subpatterns are processed as anchored at the point where
2200    they are tested.
2201    .P
2202  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2203  parenthesis followed by an asterisk. In Perl, they are generally of the form  parenthesis followed by an asterisk. In Perl, they are generally of the form
2204  (*VERB:ARG) but PCRE does not support the use of arguments, so its general  (*VERB:ARG) but PCRE does not support the use of arguments, so its general
# Line 2141  The following verbs act as soon as they Line 2214  The following verbs act as soon as they
2214  .sp  .sp
2215  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2216  pattern. When inside a recursion, only the innermost pattern is ended  pattern. When inside a recursion, only the innermost pattern is ended
2217  immediately. PCRE differs from Perl in what happens if the (*ACCEPT) is inside  immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far
2218  capturing parentheses. In Perl, the data so far is captured: in PCRE no data is  is captured. (This feature was added to PCRE at release 8.00.) For example:
 captured. For example:  
2219  .sp  .sp
2220    A(A|B(*ACCEPT)|C)D    A((?:A|B(*ACCEPT)|C)D)
2221  .sp  .sp
2222  This matches "AB", "AAD", or "ACD", but when it matches "AB", no data is  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
2223  captured.  the outer parentheses.
2224  .sp  .sp
2225    (*FAIL) or (*F)    (*FAIL) or (*F)
2226  .sp  .sp
# Line 2245  Cambridge CB2 3QH, England. Line 2317  Cambridge CB2 3QH, England.
2317  .rs  .rs
2318  .sp  .sp
2319  .nf  .nf
2320  Last updated: 18 March 2009  Last updated: 18 September 2009
2321  Copyright (c) 1997-2009 University of Cambridge.  Copyright (c) 1997-2009 University of Cambridge.
2322  .fi  .fi

