Diff of /code/trunk/doc/pcrepattern.3

revision 678 by ph10, Sun Aug 28 15:23:03 2011 UTC  revision 738 by ph10, Fri Oct 21 09:04:01 2011 UTC
# Line 971  that signifies the end of a line. Line 971  that signifies the end of a line.
971  .rs  .rs
972  .sp  .sp
973  Outside a character class, the escape sequence \eC matches any one byte, both  Outside a character class, the escape sequence \eC matches any one byte, both
974  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
975  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
976  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the  in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
977  rest of the string may start with a malformed UTF-8 character. For this reason,  breaks up characters into individual bytes, matching one byte with \eC in UTF-8
978  the \eC escape sequence is best avoided.  mode means that the rest of the string may start with a malformed UTF-8
979    character. This has undefined results, because PCRE assumes that it is dealing
980    with valid UTF-8 strings (and by default it checks this at the start of
981    processing unless the PCRE_NO_UTF8_CHECK option is used).
982  .P  .P
983  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
984  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
# Line 984  PCRE does not allow \eC to appear in loo Line 987  PCRE does not allow \eC to appear in loo
987  .\"  .\"
988  because in UTF-8 mode this would make it impossible to calculate the length of  because in UTF-8 mode this would make it impossible to calculate the length of
989  the lookbehind.  the lookbehind.
990    .P
991    In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
992    way of using it that avoids the problem of malformed UTF-8 characters is to
993    use a lookahead to check the length of the next character, as in this pattern
994    (ignore white space and line breaks):
995    .sp
996      (?| (?=[\ex00-\ex7f])(\eC) |
997          (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
998          (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
999          (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1000    .sp
1001    A group that starts with (?| resets the capturing parentheses numbers in each
1002    alternative (see
1003    .\" HTML <a href="#dupsubpatternnumber">
1004    .\" </a>
1005    "Duplicate Subpattern Numbers"
1006    .\"
1007    below). The assertions at the start of each branch check the next UTF-8
1008    character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1009    character's individual bytes are then captured by the appropriate number of
1010    groups.
1011  .  .
1012  .  .
1013  .\" HTML <a name="characterclass"></a>  .\" HTML <a name="characterclass"></a>
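The byte-length ranges checked by the four lookaheads in the pattern added above can be sketched in Python. This is only a model of the idea and does not call the PCRE library; the names utf8_len, first_char_bytes and is_valid_utf8 are hypothetical helpers introduced for illustration. The last line also shows why skipping a single byte of a multi-byte character leaves a malformed string behind.

```python
def utf8_len(cp):
    """Return the number of bytes UTF-8 needs for code point cp.

    The thresholds mirror the four lookahead ranges in the pattern.
    """
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    if cp <= 0x1FFFFF:
        return 4
    raise ValueError("code point out of range")


def first_char_bytes(text):
    """Split the first character of text into its individual UTF-8 bytes."""
    n = utf8_len(ord(text[0]))
    encoded = text.encode("utf-8")
    return [encoded[i:i + 1] for i in range(n)]


def is_valid_utf8(data):
    """Report whether data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


euro = "\u20ac"                      # encodes as three bytes: E2 82 AC
print(first_char_bytes(euro + "x"))  # one list entry per captured byte
# Skipping a single byte leaves a string that starts mid-character.
print(is_valid_utf8(euro.encode("utf-8")[1:]))
```

Each list entry corresponds to what one (\eC) group would capture in the matching branch.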
# Line 1315  or "defdef": Line 1339  or "defdef":
1339  .sp  .sp
1340    /(?|(abc)|(def))\e1/    /(?|(abc)|(def))\e1/
1341  .sp  .sp
1342  In contrast, a recursive or "subroutine" call to a numbered subpattern always  In contrast, a subroutine call to a numbered subpattern always refers to the
1343  refers to the first one in the pattern with the given number. The following  first one in the pattern with the given number. The following pattern matches
1344  pattern matches "abcabc" or "defabc":  "abcabc" or "defabc":
1345  .sp  .sp
1346    /(?|(abc)|(def))(?1)/    /(?|(abc)|(def))(?1)/
1347  .sp  .sp
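The difference between the back reference and the subroutine call in the two patterns above can be approximated with Python's re module. Python supports neither (?| nor (?1), so both patterns below are hand-written equivalents, which is an assumption of this sketch rather than anything PCRE does internally.

```python
import re

# Emulates /(?|(abc)|(def))\1/: the back reference repeats whatever text
# the group actually captured. Python has no (?| so the two branches get
# separate group numbers and the reference tries both.
backref = re.compile(r"(?:(abc)|(def))(?:\1|\2)")

# Emulates /(?|(abc)|(def))(?1)/: the subroutine call re-runs the first
# group numbered 1, that is, the sub-pattern "abc", whatever was captured.
subroutine = re.compile(r"(?:abc|def)abc")

for subject in ("abcabc", "defdef", "defabc"):
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine.fullmatch(subject)))
```

Only "abcabc" is accepted by both forms; "defdef" needs the back reference and "defabc" needs the subroutine call.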
# Line 1434  items: Line 1458  items:
1458    a character class    a character class
1459    a back reference (see next section)    a back reference (see next section)
1460    a parenthesized subpattern (including assertions)    a parenthesized subpattern (including assertions)
1461    a recursive or "subroutine" call to a subpattern    a subroutine call to a subpattern (recursive or otherwise)
1462  .sp  .sp
1463  The general repetition quantifier specifies a minimum and maximum number of  The general repetition quantifier specifies a minimum and maximum number of
1464  permitted matches, by giving the two numbers in curly brackets (braces),  permitted matches, by giving the two numbers in curly brackets (braces),
# Line 2123  If the condition is the string (DEFINE), Line 2147  If the condition is the string (DEFINE),
2147  name DEFINE, the condition is always false. In this case, there may be only one  name DEFINE, the condition is always false. In this case, there may be only one
2148  alternative in the subpattern. It is always skipped if control reaches this  alternative in the subpattern. It is always skipped if control reaches this
2149  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
2150  "subroutines" that can be referenced from elsewhere. (The use of  subroutines that can be referenced from elsewhere. (The use of
2151  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2152  .\" </a>  .\" </a>
2153  "subroutines"  subroutines
2154  .\"  .\"
2155  is described below.) For example, a pattern to match an IPv4 address such as  is described below.) For example, a pattern to match an IPv4 address such as
2156  "192.168.23.245" could be written like this (ignore whitespace and line  "192.168.23.245" could be written like this (ignore whitespace and line
# Line 2221  individual subpattern recursion. After i Line 2245  individual subpattern recursion. After i
2245  this kind of recursion was subsequently introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
2246  .P  .P
2247  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
2248  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive subroutine call of the subpattern of the
2249  provided that it occurs inside that subpattern. (If not, it is a  given number, provided that it occurs inside that subpattern. (If not, it is a
2250  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2251  .\" </a>  .\" </a>
2252  "subroutine"  non-recursive subroutine
2253  .\"  .\"
2254  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
2255  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
# Line 2260  references such as (?+2). However, these Line 2284  references such as (?+2). However, these
2284  reference is not inside the parentheses that are referenced. They are always  reference is not inside the parentheses that are referenced. They are always
2285  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2286  .\" </a>  .\" </a>
2287  "subroutine"  non-recursive subroutine
2288  .\"  .\"
2289  calls, as described in the next section.  calls, as described in the next section.
2290  .P  .P
# Line 2297  documentation). If the pattern above is Line 2321  documentation). If the pattern above is
2321  .sp  .sp
2322  the value for the inner capturing parentheses (numbered 2) is "ef", which is  the value for the inner capturing parentheses (numbered 2) is "ef", which is
2323  the last value taken on at the top level. If a capturing subpattern is not  the last value taken on at the top level. If a capturing subpattern is not
2324  matched at the top level, its final value is unset, even if it is (temporarily)  matched at the top level, its final captured value is unset, even if it was
2325  set at a deeper level.  (temporarily) set at a deeper level during the matching process.
2326  .P  .P
2327  If there are more than 15 capturing parentheses in a pattern, PCRE has to  If there are more than 15 capturing parentheses in a pattern, PCRE has to
2328  obtain extra memory to store data during a recursion, which it does by using  obtain extra memory to store data during a recursion, which it does by using
# Line 2318  is the actual recursive call. Line 2342  is the actual recursive call.
2342  .  .
2343  .  .
2344  .\" HTML <a name="recursiondifference"></a>  .\" HTML <a name="recursiondifference"></a>
2345  .SS "Recursion difference from Perl"  .SS "Differences in recursion processing between PCRE and Perl"
2346  .rs  .rs
2347  .sp  .sp
2348  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  Recursion processing in PCRE differs from Perl in two important ways. In PCRE
2349  treated as an atomic group. That is, once it has matched some of the subject  (like Python, but unlike Perl), a recursive subpattern call is always treated
2350  string, it is never re-entered, even if it contains untried alternatives and  as an atomic group. That is, once it has matched some of the subject string, it
2351  there is a subsequent matching failure. This can be illustrated by the  is never re-entered, even if it contains untried alternatives and there is a
2352  following pattern, which purports to match a palindromic string that contains  subsequent matching failure. This can be illustrated by the following pattern,
2353  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):  which purports to match a palindromic string that contains an odd number of
2354    characters (for example, "a", "aba", "abcba", "abcdcba"):
2355  .sp  .sp
2356    ^(.|(.)(?1)\e2)$    ^(.|(.)(?1)\e2)$
2357  .sp  .sp
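The effect of treating the recursive call as an atomic group can be modelled with a small hand-written matcher for the pattern above. This is a sketch under stated assumptions, not PCRE's implementation: group1 and matches are hypothetical names, and the atomic flag simply limits the recursive call to the first way it can match, so that it is never re-entered.

```python
import itertools


def group1(subject, pos, atomic):
    """Yield every end position for (.|(.)(?1)\\2) starting at pos."""
    if pos >= len(subject):
        return
    # First alternative: a single character.
    yield pos + 1
    # Second alternative: (.) then the recursive call (?1) then \2.
    captured = subject[pos]
    inner = group1(subject, pos + 1, atomic)
    if atomic:
        # An atomic call keeps only its first match and is never re-entered.
        inner = itertools.islice(inner, 1)
    for end in inner:
        if subject.startswith(captured, end):
            yield end + 1


def matches(subject, atomic):
    """Model of ^(.|(.)(?1)\\2)$ applied to the whole subject."""
    return any(end == len(subject) for end in group1(subject, 0, atomic))


for subject in ("a", "aba", "abcba"):
    print(subject, matches(subject, atomic=True), matches(subject, atomic=False))
```

In this model "abcba" is rejected when the call is atomic, because the inner call settles for the single character "b" and cannot be asked for its longer alternative. With atomic=False the matcher re-enters the recursion and accepts the string.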
# Line 2387  For example, although "abcba" is correct Line 2412  For example, although "abcba" is correct
2412  PCRE finds the palindrome "aba" at the start, then fails at top level because  PCRE finds the palindrome "aba" at the start, then fails at top level because
2413  the end of the string does not follow. Once again, it cannot jump back into the  the end of the string does not follow. Once again, it cannot jump back into the
2414  recursion to try other alternatives, so the entire match fails.  recursion to try other alternatives, so the entire match fails.
2415    .P
2416    The second way in which PCRE and Perl differ in their recursion processing is
2417    in the handling of captured values. In Perl, when a subpattern is called
2418    recursively or as a subroutine (see the next section), it has no access to any
2419    values that were captured outside the recursion, whereas in PCRE these values
2420    can be referenced. Consider this pattern:
2421    .sp
2422      ^(.)(\e1|a(?2))
2423    .sp
2424    In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
2425    then in the second group, when the back reference \e1 fails to match "b", the
2426    second alternative matches "a" and then recurses. In the recursion, \e1 does
2427    now match "b" and so the whole match succeeds. In Perl, the pattern fails to
2428    match because inside the recursive call \e1 cannot access the externally set
2429    value.
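The difference in the handling of captured values can be modelled by hand for this one pattern. The sketch below is an illustration only: match_group2 and matches are hypothetical names, and the share_captures flag stands for whether a value captured outside the recursion is visible inside it (true for the PCRE behaviour described above, false for the Perl behaviour).

```python
def match_group2(subject, pos, cap1, share_captures, depth=0):
    """Model of group 2, (\\1|a(?2)); return the end position or None."""
    # Inside a recursive call, the externally set capture is visible
    # only when share_captures is true.
    visible = cap1 if (depth == 0 or share_captures) else None
    # First alternative: the back reference \1.
    if visible is not None and subject.startswith(visible, pos):
        return pos + len(visible)
    # Second alternative: a literal "a" followed by the recursive call (?2).
    if subject.startswith("a", pos):
        return match_group2(subject, pos + 1, cap1, share_captures, depth + 1)
    return None


def matches(subject, share_captures):
    """Model of ^(.)(\\1|a(?2)) applied at the start of subject."""
    if not subject:
        return False
    cap1 = subject[0]  # ^(.) captures the first character
    return match_group2(subject, 1, cap1, share_captures) is not None


print(matches("bab", share_captures=True))   # capture visible in recursion
print(matches("bab", share_captures=False))  # capture hidden in recursion
```

With share_captures set, the recursive call can see the "b" captured by the first group and the match succeeds; without it, the back reference inside the recursion has nothing to compare against and the match fails.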
2430  .  .
2431  .  .
2432  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
2433  .SH "SUBPATTERNS AS SUBROUTINES"  .SH "SUBPATTERNS AS SUBROUTINES"
2434  .rs  .rs
2435  .sp  .sp
2436  If the syntax for a recursive subpattern reference (either by number or by  If the syntax for a recursive subpattern call (either by number or by
2437  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
2438  subroutine in a programming language. The "called" subpattern may be defined  subroutine in a programming language. The called subpattern may be defined
2439  before or after the reference. A numbered reference can be absolute or  before or after the reference. A numbered reference can be absolute or
2440  relative, as in these examples:  relative, as in these examples:
2441  .sp  .sp
# Line 2415  matches "sense and sensibility" and "res Line 2455  matches "sense and sensibility" and "res
2455  is used, it does match "sense and responsibility" as well as the other two  is used, it does match "sense and responsibility" as well as the other two
2456  strings. Another example is given in the discussion of DEFINE above.  strings. Another example is given in the discussion of DEFINE above.
2457  .P  .P
2458  Like recursive subpatterns, a subroutine call is always treated as an atomic  All subroutine calls, whether recursive or not, are always treated as atomic
2459  group. That is, once it has matched some of the subject string, it is never  groups. That is, once a subroutine has matched some of the subject string, it
2460  re-entered, even if it contains untried alternatives and there is a subsequent  is never re-entered, even if it contains untried alternatives and there is a
2461  matching failure. Any capturing parentheses that are set during the subroutine  subsequent matching failure. Any capturing parentheses that are set during the
2462  call revert to their previous values afterwards.  subroutine call revert to their previous values afterwards.
2463  .P  .P
2464  When a subpattern is used as a subroutine, processing options such as  Processing options such as case-independence are fixed when a subpattern is
2465  case-independence are fixed when the subpattern is defined. They cannot be  defined, so if it is used as a subroutine, such options cannot be changed for
2466  changed for different calls. For example, consider this pattern:  different calls. For example, consider this pattern:
2467  .sp  .sp
2468    (abc)(?i:(?-1))    (abc)(?i:(?-1))
2469  .sp  .sp
# Line 2504  a backtracking algorithm. With the excep Line 2544  a backtracking algorithm. With the excep
2544  failing negative assertion, they cause an error if encountered by  failing negative assertion, they cause an error if encountered by
2545  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2546  .P  .P
2547  If any of these verbs are used in an assertion or subroutine subpattern  If any of these verbs are used in an assertion or in a subpattern that is
2548  (including recursive subpatterns), their effect is confined to that subpattern;  called as a subroutine (whether or not recursively), their effect is confined
2549  it does not extend to the surrounding pattern, with one exception: a *MARK that  to that subpattern; it does not extend to the surrounding pattern, with one
2550  is encountered in a positive assertion \fIis\fP passed back (compare capturing  exception: a *MARK that is encountered in a positive assertion \fIis\fP passed
2551  parentheses in assertions). Note that such subpatterns are processed as  back (compare capturing parentheses in assertions). Note that such subpatterns
2552  anchored at the point where they are tested.  are processed as anchored at the point where they are tested. Note also that
2553    Perl's treatment of subroutines is different in some cases.
2554  .P  .P
2555  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2556  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
2557  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
2558  depending on whether or not an argument is present. An name is a sequence of  depending on whether or not an argument is present. A name is any sequence of
2559  letters, digits, and underscores. If the name is empty, that is, if the closing  characters that does not include a closing parenthesis. If the name is empty,
2560  parenthesis immediately follows the colon, the effect is as if the colon were  that is, if the closing parenthesis immediately follows the colon, the effect
2561  not there. Any number of these verbs may occur in a pattern.  is as if the colon were not there. Any number of these verbs may occur in a
2562    pattern.
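The revised rule for verb names can be illustrated with a short Python sketch that splits an item of the form (*VERB) or (*VERB:NAME). The regular expression and the function parse_verb are assumptions made for this illustration; they cover only the general form described above and are not how PCRE itself parses verbs.

```python
import re

# A verb in upper-case letters, optionally followed by a colon and a
# name made of any characters other than a closing parenthesis.
VERB = re.compile(r"\(\*([A-Z]+)(?::([^)]*))?\)")


def parse_verb(item):
    """Return (verb, name) for a verb item, or None if it is not one."""
    m = VERB.fullmatch(item)
    if m is None:
        return None
    verb, name = m.group(1), m.group(2)
    # An empty name is treated as if the colon were not there.
    return (verb, name or None)


for item in ("(*COMMIT)", "(*MARK:a b-c)", "(*THEN:)", "(*MARK:a)b)"):
    print(item, parse_verb(item))
```

Under the new wording a name such as "a b-c" is accepted, whereas the old wording (letters, digits, and underscores) would have excluded it. A name cannot contain a closing parenthesis, so the last item is rejected.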
2563  .P  .P
2564  PCRE contains some optimizations that are used to speed up matching by running  PCRE contains some optimizations that are used to speed up matching by running
2565  some checks at the start of each match attempt. For example, it may know the  some checks at the start of each match attempt. For example, it may know the
# Line 2538  followed by a name. Line 2580  followed by a name.
2580     (*ACCEPT)     (*ACCEPT)
2581  .sp  .sp
2582  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2583  pattern. When inside a recursion, only the innermost pattern is ended  pattern. However, when it is inside a subpattern that is called as a
2584  immediately. If (*ACCEPT) is inside capturing parentheses, the data so far is  subroutine, only that subpattern is ended successfully. Matching then continues
2585  captured. (This feature was added to PCRE at release 8.00.) For example:  at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
2586    far is captured. For example:
2587  .sp  .sp
2588    A((?:A|B(*ACCEPT)|C)D)    A((?:A|B(*ACCEPT)|C)D)
2589  .sp  .sp
# Line 2549  the outer parentheses. Line 2592  the outer parentheses.
2592  .sp  .sp
2593    (*FAIL) or (*F)    (*FAIL) or (*F)
2594  .sp  .sp
2595  This verb causes the match to fail, forcing backtracking to occur. It is  This verb causes a matching failure, forcing backtracking to occur. It is
2596  equivalent to (?!) but easier to read. The Perl documentation notes that it is  equivalent to (?!) but easier to read. The Perl documentation notes that it is
2597  probably useful only when combined with (?{}) or (??{}). Those are, of course,  probably useful only when combined with (?{}) or (??{}). Those are, of course,
2598  Perl features that are not present in PCRE. The nearest equivalent is the  Perl features that are not present in PCRE. The nearest equivalent is the
# Line 2602  capturing parentheses. Line 2645  capturing parentheses.
2645  .P  .P
2646  If (*MARK) is encountered in a positive assertion, its name is recorded and  If (*MARK) is encountered in a positive assertion, its name is recorded and
2647  passed back if it is the last-encountered. This does not happen for negative  passed back if it is the last-encountered. This does not happen for negative
2648  assetions.  assertions.
2649  .P  .P
2650  A name may also be returned after a failed match if the final path through the  A name may also be returned after a failed match if the final path through the
2651  pattern involves (*MARK). However, unless (*MARK) is used in conjunction with  pattern involves (*MARK). However, unless (*MARK) is used in conjunction with
# Line 2716  following pattern fails to match, the pr Line 2759  following pattern fails to match, the pr
2759  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2760  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2761  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2762  matching name is found, normal "bumpalong" of one character happens (the  matching name is found, normal "bumpalong" of one character happens (that is,
2763  (*SKIP) is ignored).  the (*SKIP) is ignored).
2764  .sp  .sp
2765    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2766  .sp  .sp
2767  This verb causes a skip to the next alternation in the innermost enclosing  This verb causes a skip to the next innermost alternative if the rest of the
2768  group if the rest of the pattern does not match. That is, it cancels pending  pattern does not match. That is, it cancels pending backtracking, but only
2769  backtracking, but only within the current alternation. Its name comes from the  within the current alternative. Its name comes from the observation that it can
2770  observation that it can be used for a pattern-based if-then-else block:  be used for a pattern-based if-then-else block:
2771  .sp  .sp
2772    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2773  .sp  .sp
2774  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2775  the end of the group if FOO succeeds); on failure the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2776  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2777  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
2778  overall match fails. If (*THEN) is not directly inside an alternation, it acts  overall match fails. If (*THEN) is not inside an alternation, it acts like
2779  like (*PRUNE).  (*PRUNE).
 .  
 .P  
 The above verbs provide four different "strengths" of control when subsequent  
 matching fails. (*THEN) is the weakest, carrying on the match at the next  
 alternation. (*PRUNE) comes next, failing the match at the current starting  
 position, but allowing an advance to the next character (for an unanchored  
 pattern). (*SKIP) is similar, except that the advance may be more than one  
 character. (*COMMIT) is the strongest, causing the entire match to fail.  
2780  .P  .P
2781  If more than one is present in a pattern, the "stongest" one wins. For example,  Note that a subpattern that does not contain a | character is just a part of
2782  consider this pattern, where A, B, etc. are complex pattern fragments:  the enclosing alternative; it is not a nested alternation with only one
2783    alternative. The effect of (*THEN) extends beyond such a subpattern to the
2784    enclosing alternative. Consider this pattern, where A, B, etc. are complex
2785    pattern fragments that do not contain any | characters at this level:
2786    .sp
2787      A (B(*THEN)C) | D
2788    .sp
2789    If A and B are matched, but there is a failure in C, matching does not
2790    backtrack into A; instead it moves to the next alternative, that is, D.
2791    However, if the subpattern containing (*THEN) is given an alternative, it
2792    behaves differently:
2793    .sp
2794      A (B(*THEN)C | (*FAIL)) | D
2795    .sp
2796    The effect of (*THEN) is now confined to the inner subpattern. After a failure
2797    in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2798    because there are no more alternatives to try. In this case, matching does now
2799    backtrack into A.
2800    .P
2801    Note also that a conditional subpattern is not considered as having two
2802    alternatives, because only one is ever used. In other words, the | character in
2803    a conditional subpattern has a different meaning. Ignoring white space,
2804    consider:
2805    .sp
2806      ^.*? (?(?=a) a | b(*THEN)c )
2807    .sp
2808    If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2809    it initially matches zero characters. The condition (?=a) then fails, the
2810    character "b" is matched, but "c" is not. At this point, matching does not
2811    backtrack to .*? as might perhaps be expected from the presence of the |
2812    character. The conditional subpattern is part of the single alternative that
2813    comprises the whole pattern, and so the match fails. (If there was a backtrack
2814    into .*?, allowing it to match "b", the match would succeed.)
2815    .P
2816    The verbs just described provide four different "strengths" of control when
2817    subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
2818    next alternative. (*PRUNE) comes next, failing the match at the current
2819    starting position, but allowing an advance to the next character (for an
2820    unanchored pattern). (*SKIP) is similar, except that the advance may be more
2821    than one character. (*COMMIT) is the strongest, causing the entire match to
2822    fail.
2823    .P
2824    If more than one such verb is present in a pattern, the "strongest" one wins.
2825    For example, consider this pattern, where A, B, etc. are complex pattern
2826    fragments:
2827  .sp  .sp
2828    (A(*COMMIT)B(*THEN)C|D)    (A(*COMMIT)B(*THEN)C|D)
2829  .sp  .sp
2830  Once A has matched, PCRE is committed to this match, at the current starting  Once A has matched, PCRE is committed to this match, at the current starting
2831  position. If subsequently B matches, but C does not, the normal (*THEN) action  position. If subsequently B matches, but C does not, the normal (*THEN) action
2832  of trying the next alternation (that is, D) does not happen because (*COMMIT)  of trying the next alternative (that is, D) does not happen because (*COMMIT)
2833  overrides.  overrides.
2834  .  .
2835  .  .
# Line 2775  Cambridge CB2 3QH, England. Line 2854  Cambridge CB2 3QH, England.
2854  .rs  .rs
2855  .sp  .sp
2856  .nf  .nf
2857  Last updated: 24 August 2011  Last updated: 19 October 2011
2858  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2859  .fi  .fi
