Diff of /code/trunk/doc/pcrepattern.3

revision 678 by ph10, Sun Aug 28 15:23:03 2011 UTC | revision 733 by ph10, Tue Oct 11 10:29:36 2011 UTC
# Line 1315  or "defdef": Line 1315  or "defdef":
1315  .sp  .sp
1316    /(?|(abc)|(def))\e1/    /(?|(abc)|(def))\e1/
1317  .sp  .sp
1318  In contrast, a recursive or "subroutine" call to a numbered subpattern always  In contrast, a subroutine call to a numbered subpattern always refers to the
1319  refers to the first one in the pattern with the given number. The following  first one in the pattern with the given number. The following pattern matches
1320  pattern matches "abcabc" or "defabc":  "abcabc" or "defabc":
1321  .sp  .sp
1322    /(?|(abc)|(def))(?1)/    /(?|(abc)|(def))(?1)/
1323  .sp  .sp
# Line 1434  items: Line 1434  items:
1434    a character class    a character class
1435    a back reference (see next section)    a back reference (see next section)
1436    a parenthesized subpattern (including assertions)    a parenthesized subpattern (including assertions)
1437    a recursive or "subroutine" call to a subpattern    a subroutine call to a subpattern (recursive or otherwise)
1438  .sp  .sp
1439  The general repetition quantifier specifies a minimum and maximum number of  The general repetition quantifier specifies a minimum and maximum number of
1440  permitted matches, by giving the two numbers in curly brackets (braces),  permitted matches, by giving the two numbers in curly brackets (braces),
# Line 2123  If the condition is the string (DEFINE), Line 2123  If the condition is the string (DEFINE),
2123  name DEFINE, the condition is always false. In this case, there may be only one  name DEFINE, the condition is always false. In this case, there may be only one
2124  alternative in the subpattern. It is always skipped if control reaches this  alternative in the subpattern. It is always skipped if control reaches this
2125  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
2126  "subroutines" that can be referenced from elsewhere. (The use of  subroutines that can be referenced from elsewhere. (The use of
2127  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2128  .\" </a>  .\" </a>
2129  "subroutines"  subroutines
2130  .\"  .\"
2131  is described below.) For example, a pattern to match an IPv4 address such as  is described below.) For example, a pattern to match an IPv4 address such as
2132  "192.168.23.245" could be written like this (ignore whitespace and line  "192.168.23.245" could be written like this (ignore whitespace and line
# Line 2221  individual subpattern recursion. After i Line 2221  individual subpattern recursion. After i
2221  this kind of recursion was subsequently introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
2222  .P  .P
2223  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
2224  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive subroutine call of the subpattern of the
2225  provided that it occurs inside that subpattern. (If not, it is a  given number, provided that it occurs inside that subpattern. (If not, it is a
2226  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2227  .\" </a>  .\" </a>
2228  "subroutine"  non-recursive subroutine
2229  .\"  .\"
2230  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
2231  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
# Line 2260  references such as (?+2). However, these Line 2260  references such as (?+2). However, these
2260  reference is not inside the parentheses that are referenced. They are always  reference is not inside the parentheses that are referenced. They are always
2261  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2262  .\" </a>  .\" </a>
2263  "subroutine"  non-recursive subroutine
2264  .\"  .\"
2265  calls, as described in the next section.  calls, as described in the next section.
2266  .P  .P
# Line 2297  documentation). If the pattern above is Line 2297  documentation). If the pattern above is
2297  .sp  .sp
2298  the value for the inner capturing parentheses (numbered 2) is "ef", which is  the value for the inner capturing parentheses (numbered 2) is "ef", which is
2299  the last value taken on at the top level. If a capturing subpattern is not  the last value taken on at the top level. If a capturing subpattern is not
2300  matched at the top level, its final value is unset, even if it is (temporarily)  matched at the top level, its final captured value is unset, even if it was
2301  set at a deeper level.  (temporarily) set at a deeper level during the matching process.
2302  .P  .P
2303  If there are more than 15 capturing parentheses in a pattern, PCRE has to  If there are more than 15 capturing parentheses in a pattern, PCRE has to
2304  obtain extra memory to store data during a recursion, which it does by using  obtain extra memory to store data during a recursion, which it does by using
# Line 2318  is the actual recursive call. Line 2318  is the actual recursive call.
2318  .  .
2319  .  .
2320  .\" HTML <a name="recursiondifference"></a>  .\" HTML <a name="recursiondifference"></a>
2321  .SS "Recursion difference from Perl"  .SS "Differences in recursion processing between PCRE and Perl"
2322  .rs  .rs
2323  .sp  .sp
2324  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  Recursion processing in PCRE differs from Perl in two important ways. In PCRE
2325  treated as an atomic group. That is, once it has matched some of the subject  (like Python, but unlike Perl), a recursive subpattern call is always treated
2326  string, it is never re-entered, even if it contains untried alternatives and  as an atomic group. That is, once it has matched some of the subject string, it
2327  there is a subsequent matching failure. This can be illustrated by the  is never re-entered, even if it contains untried alternatives and there is a
2328  following pattern, which purports to match a palindromic string that contains  subsequent matching failure. This can be illustrated by the following pattern,
2329  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):  which purports to match a palindromic string that contains an odd number of
2330    characters (for example, "a", "aba", "abcba", "abcdcba"):
2331  .sp  .sp
2332    ^(.|(.)(?1)\e2)$    ^(.|(.)(?1)\e2)$
2333  .sp  .sp
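The effect of treating the recursive call as an atomic group can be modelled outside PCRE. The following Python sketch is not PCRE code and does not use any regex library; it hand-simulates the pattern ^(.|(.)(?1)\e2)$ (written ^(.|(.)(?1)\2)$ without the nroff escape). The function names and the `atomic` flag are hypothetical, chosen only for this illustration. With `atomic=True` the recursion keeps only its first successful match, as the text describes for PCRE; with `atomic=False` every alternative inside the recursion can be retried, as in Perl.

```python
def group1_ends(s, pos, atomic):
    # Yield every position at which group 1 of ^(.|(.)(?1)\2)$ can end
    # when it starts matching at `pos`.
    if pos >= len(s):
        return
    # First alternative: a single character.
    yield pos + 1
    # Second alternative: (.) then the recursive call (?1), then \2.
    ch = s[pos]
    inner = group1_ends(s, pos + 1, atomic)
    if atomic:
        # Atomic recursion: keep only the first way the call matched and
        # never re-enter it to try its remaining alternatives.
        first = next(inner, None)
        inner = [] if first is None else [first]
    for end in inner:
        # Back reference \2 must match the character captured by (.).
        if s.startswith(ch, end):
            yield end + 1


def matches_palindrome(s, atomic):
    # The pattern is anchored with ^ and $, so group 1 must start at 0
    # and end exactly at the end of the subject.
    return any(end == len(s) for end in group1_ends(s, 0, atomic))


print(matches_palindrome("aba", atomic=True))     # True
print(matches_palindrome("abcba", atomic=True))   # False
print(matches_palindrome("abcba", atomic=False))  # True
```

In the atomic model, "abcba" fails because the recursive call at the second character settles for the single-character alternative and cannot be asked for a longer match when the back reference then fails.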
# Line 2387  For example, although "abcba" is correct Line 2388  For example, although "abcba" is correct
2388  PCRE finds the palindrome "aba" at the start, then fails at top level because  PCRE finds the palindrome "aba" at the start, then fails at top level because
2389  the end of the string does not follow. Once again, it cannot jump back into the  the end of the string does not follow. Once again, it cannot jump back into the
2390  recursion to try other alternatives, so the entire match fails.  recursion to try other alternatives, so the entire match fails.
2391    .P
2392    The second way in which PCRE and Perl differ in their recursion processing is
2393    in the handling of captured values. In Perl, when a subpattern is called
2394    recursively or as a subroutine (see the next section), it has no access to any
2395    values that were captured outside the recursion, whereas in PCRE these values
2396    can be referenced. Consider this pattern:
2397    .sp
2398      ^(.)(\e1|a(?2))
2399    .sp
2400    In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
2401    then in the second group, when the back reference \e1 fails to match "b", the
2402    second alternative matches "a" and then recurses. In the recursion, \e1 does
2403    now match "b" and so the whole match succeeds. In Perl, the pattern fails to
2404    match because inside the recursive call \e1 cannot access the externally set
2405    value.
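The difference in capture visibility can also be modelled by hand. The Python sketch below is an illustration only: it does not call PCRE or Perl, and the function names and the `captures_visible` flag are hypothetical. It simulates the pattern ^(.)(\e1|a(?2)) (that is, ^(.)(\1|a(?2)) without the nroff escape), passing the value of group 1 into the subroutine call when `captures_visible` is true (the PCRE behaviour described above) and hiding it otherwise (the Perl behaviour described above).

```python
def group2_end(s, pos, cap1, captures_visible):
    # Return the position where group 2, (\1|a(?2)), ends when it starts
    # at `pos`, or None if it cannot match. `cap1` is the value of group 1
    # as seen at this level, or None if it is not accessible.
    # First alternative: the back reference \1.
    if cap1 is not None and s.startswith(cap1, pos):
        return pos + len(cap1)
    # Second alternative: a literal "a" followed by the call (?2).
    if s.startswith("a", pos):
        inner_cap1 = cap1 if captures_visible else None
        return group2_end(s, pos + 1, inner_cap1, captures_visible)
    return None


def matches_backref(s, captures_visible):
    # ^(.) captures the first character; group 2 then starts at offset 1.
    if not s:
        return False
    return group2_end(s, 1, s[0], captures_visible) is not None


print(matches_backref("bab", captures_visible=True))   # True
print(matches_backref("bab", captures_visible=False))  # False
```

A subject such as "bb" matches in both models, because there the back reference succeeds at the top level and no subroutine call is needed.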
2406  .  .
2407  .  .
2408  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
2409  .SH "SUBPATTERNS AS SUBROUTINES"  .SH "SUBPATTERNS AS SUBROUTINES"
2410  .rs  .rs
2411  .sp  .sp
2412  If the syntax for a recursive subpattern reference (either by number or by  If the syntax for a recursive subpattern call (either by number or by
2413  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
2414  subroutine in a programming language. The "called" subpattern may be defined  subroutine in a programming language. The called subpattern may be defined
2415  before or after the reference. A numbered reference can be absolute or  before or after the reference. A numbered reference can be absolute or
2416  relative, as in these examples:  relative, as in these examples:
2417  .sp  .sp
# Line 2415  matches "sense and sensibility" and "res Line 2431  matches "sense and sensibility" and "res
2431  is used, it does match "sense and responsibility" as well as the other two  is used, it does match "sense and responsibility" as well as the other two
2432  strings. Another example is given in the discussion of DEFINE above.  strings. Another example is given in the discussion of DEFINE above.
2433  .P  .P
2434  Like recursive subpatterns, a subroutine call is always treated as an atomic  All subroutine calls, whether recursive or not, are always treated as atomic
2435  group. That is, once it has matched some of the subject string, it is never  groups. That is, once a subroutine has matched some of the subject string, it
2436  re-entered, even if it contains untried alternatives and there is a subsequent  is never re-entered, even if it contains untried alternatives and there is a
2437  matching failure. Any capturing parentheses that are set during the subroutine  subsequent matching failure. Any capturing parentheses that are set during the
2438  call revert to their previous values afterwards.  subroutine call revert to their previous values afterwards.
2439  .P  .P
2440  When a subpattern is used as a subroutine, processing options such as  Processing options such as case-independence are fixed when a subpattern is
2441  case-independence are fixed when the subpattern is defined. They cannot be  defined, so if it is used as a subroutine, such options cannot be changed for
2442  changed for different calls. For example, consider this pattern:  different calls. For example, consider this pattern:
2443  .sp  .sp
2444    (abc)(?i:(?-1))    (abc)(?i:(?-1))
2445  .sp  .sp
# Line 2504  a backtracking algorithm. With the excep Line 2520  a backtracking algorithm. With the excep
2520  failing negative assertion, they cause an error if encountered by  failing negative assertion, they cause an error if encountered by
2521  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2522  .P  .P
2523  If any of these verbs are used in an assertion or subroutine subpattern  If any of these verbs are used in an assertion or in a subpattern that is
2524  (including recursive subpatterns), their effect is confined to that subpattern;  called as a subroutine (whether or not recursively), their effect is confined
2525  it does not extend to the surrounding pattern, with one exception: a *MARK that  to that subpattern; it does not extend to the surrounding pattern, with one
2526  is encountered in a positive assertion \fIis\fP passed back (compare capturing  exception: a *MARK that is encountered in a positive assertion \fIis\fP passed
2527  parentheses in assertions). Note that such subpatterns are processed as  back (compare capturing parentheses in assertions). Note that such subpatterns
2528  anchored at the point where they are tested.  are processed as anchored at the point where they are tested. Note also that
2529    Perl's treatment of subroutines is different in some cases.
2530  .P  .P
2531  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2532  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
2533  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
2534  depending on whether or not an argument is present. An name is a sequence of  depending on whether or not an argument is present. A name is any sequence of
2535  letters, digits, and underscores. If the name is empty, that is, if the closing  characters that does not include a closing parenthesis. If the name is empty,
2536  parenthesis immediately follows the colon, the effect is as if the colon were  that is, if the closing parenthesis immediately follows the colon, the effect
2537  not there. Any number of these verbs may occur in a pattern.  is as if the colon were not there. Any number of these verbs may occur in a
2538    pattern.
2539  .P  .P
2540  PCRE contains some optimizations that are used to speed up matching by running  PCRE contains some optimizations that are used to speed up matching by running
2541  some checks at the start of each match attempt. For example, it may know the  some checks at the start of each match attempt. For example, it may know the
# Line 2538  followed by a name. Line 2556  followed by a name.
2556     (*ACCEPT)     (*ACCEPT)
2557  .sp  .sp
2558  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2559  pattern. When inside a recursion, only the innermost pattern is ended  pattern. However, when it is inside a subpattern that is called as a
2560  immediately. If (*ACCEPT) is inside capturing parentheses, the data so far is  subroutine, only that subpattern is ended successfully. Matching then continues
2561  captured. (This feature was added to PCRE at release 8.00.) For example:  at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
2562    far is captured. For example:
2563  .sp  .sp
2564    A((?:A|B(*ACCEPT)|C)D)    A((?:A|B(*ACCEPT)|C)D)
2565  .sp  .sp
# Line 2549  the outer parentheses. Line 2568  the outer parentheses.
2568  .sp  .sp
2569    (*FAIL) or (*F)    (*FAIL) or (*F)
2570  .sp  .sp
2571  This verb causes the match to fail, forcing backtracking to occur. It is  This verb causes a matching failure, forcing backtracking to occur. It is
2572  equivalent to (?!) but easier to read. The Perl documentation notes that it is  equivalent to (?!) but easier to read. The Perl documentation notes that it is
2573  probably useful only when combined with (?{}) or (??{}). Those are, of course,  probably useful only when combined with (?{}) or (??{}). Those are, of course,
2574  Perl features that are not present in PCRE. The nearest equivalent is the  Perl features that are not present in PCRE. The nearest equivalent is the
# Line 2602  capturing parentheses. Line 2621  capturing parentheses.
2621  .P  .P
2622  If (*MARK) is encountered in a positive assertion, its name is recorded and  If (*MARK) is encountered in a positive assertion, its name is recorded and
2623  passed back if it is the last-encountered. This does not happen for negative  passed back if it is the last-encountered. This does not happen for negative
2624  assetions.  assertions.
2625  .P  .P
2626  A name may also be returned after a failed match if the final path through the  A name may also be returned after a failed match if the final path through the
2627  pattern involves (*MARK). However, unless (*MARK) is used in conjunction with  pattern involves (*MARK). However, unless (*MARK) is used in conjunction with
# Line 2716  following pattern fails to match, the pr Line 2735  following pattern fails to match, the pr
2735  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2736  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2737  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2738  matching name is found, normal "bumpalong" of one character happens (the  matching name is found, normal "bumpalong" of one character happens (that is,
2739  (*SKIP) is ignored).  the (*SKIP) is ignored).
2740  .sp  .sp
2741    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2742  .sp  .sp
2743  This verb causes a skip to the next alternation in the innermost enclosing  This verb causes a skip to the next innermost alternative if the rest of the
2744  group if the rest of the pattern does not match. That is, it cancels pending  pattern does not match. That is, it cancels pending backtracking, but only
2745  backtracking, but only within the current alternation. Its name comes from the  within the current alternative. Its name comes from the observation that it can
2746  observation that it can be used for a pattern-based if-then-else block:  be used for a pattern-based if-then-else block:
2747  .sp  .sp
2748    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2749  .sp  .sp
2750  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2751  the end of the group if FOO succeeds); on failure the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2752  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2753  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
2754  overall match fails. If (*THEN) is not directly inside an alternation, it acts  overall match fails. If (*THEN) is not inside an alternation, it acts like
2755  like (*PRUNE).  (*PRUNE).
 .  
 .P  
 The above verbs provide four different "strengths" of control when subsequent  
 matching fails. (*THEN) is the weakest, carrying on the match at the next  
 alternation. (*PRUNE) comes next, failing the match at the current starting  
 position, but allowing an advance to the next character (for an unanchored  
 pattern). (*SKIP) is similar, except that the advance may be more than one  
 character. (*COMMIT) is the strongest, causing the entire match to fail.  
2756  .P  .P
2757  If more than one is present in a pattern, the "stongest" one wins. For example,  Note that a subpattern that does not contain a | character is just a part of
2758  consider this pattern, where A, B, etc. are complex pattern fragments:  the enclosing alternative; it is not a nested alternation with only one
2759    alternative. The effect of (*THEN) extends beyond such a subpattern to the
2760    enclosing alternative. Consider this pattern, where A, B, etc. are complex
2761    pattern fragments that do not contain any | characters at this level:
2762    .sp
2763      A (B(*THEN)C) | D
2764    .sp
2765    If A and B are matched, but there is a failure in C, matching does not
2766    backtrack into A; instead it moves to the next alternative, that is, D.
2767    However, if the subpattern containing (*THEN) is given an alternative, it
2768    behaves differently:
2769    .sp
2770      A (B(*THEN)C | (*FAIL)) | D
2771    .sp
2772    The effect of (*THEN) is now confined to the inner subpattern. After a failure
2773    in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2774    because there are no more alternatives to try. In this case, matching does now
2775    backtrack into A.
2776    .P
2777    Note also that a conditional subpattern is not considered as having two
2778    alternatives, because only one is ever used. In other words, the | character in
2779    a conditional subpattern has a different meaning. Ignoring white space,
2780    consider:
2781    .sp
2782      ^.*? (?(?=a) a | b(*THEN)c )
2783    .sp
2784    If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2785    it initially matches zero characters. The condition (?=a) then fails, the
2786    character "b" is matched, but "c" is not. At this point, matching does not
2787    backtrack to .*? as might perhaps be expected from the presence of the |
2788    character. The conditional subpattern is part of the single alternative that
2789    comprises the whole pattern, and so the match fails. (If there was a backtrack
2790    into .*?, allowing it to match "b", the match would succeed.)
2791    .P
2792    The verbs just described provide four different "strengths" of control when
2793    subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
2794    next alternative. (*PRUNE) comes next, failing the match at the current
2795    starting position, but allowing an advance to the next character (for an
2796    unanchored pattern). (*SKIP) is similar, except that the advance may be more
2797    than one character. (*COMMIT) is the strongest, causing the entire match to
2798    fail.
2799    .P
2800    If more than one such verb is present in a pattern, the "strongest" one wins.
2801    For example, consider this pattern, where A, B, etc. are complex pattern
2802    fragments:
2803  .sp  .sp
2804    (A(*COMMIT)B(*THEN)C|D)    (A(*COMMIT)B(*THEN)C|D)
2805  .sp  .sp
2806  Once A has matched, PCRE is committed to this match, at the current starting  Once A has matched, PCRE is committed to this match, at the current starting
2807  position. If subsequently B matches, but C does not, the normal (*THEN) action  position. If subsequently B matches, but C does not, the normal (*THEN) action
2808  of trying the next alternation (that is, D) does not happen because (*COMMIT)  of trying the next alternative (that is, D) does not happen because (*COMMIT)
2809  overrides.  overrides.
2810  .  .
2811  .  .
# Line 2775  Cambridge CB2 3QH, England. Line 2830  Cambridge CB2 3QH, England.
2830  .rs  .rs
2831  .sp  .sp
2832  .nf  .nf
2833  Last updated: 24 August 2011  Last updated: 09 October 2011
2834  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2835  .fi  .fi

