Diff of /code/trunk/doc/html/pcrepattern.html

revision 732 by ph10, Sun Aug 28 15:23:03 2011 UTC revision 733 by ph10, Tue Oct 11 10:29:36 2011 UTC
# Line 1308  or "defdef": Line 1308  or "defdef":
1308  <pre>  <pre>
1309    /(?|(abc)|(def))\1/    /(?|(abc)|(def))\1/
1310  </pre>  </pre>
1311  In contrast, a recursive or "subroutine" call to a numbered subpattern always  In contrast, a subroutine call to a numbered subpattern always refers to the
1312  refers to the first one in the pattern with the given number. The following  first one in the pattern with the given number. The following pattern matches
1313  pattern matches "abcabc" or "defabc":  "abcabc" or "defabc":
1314  <pre>  <pre>
1315    /(?|(abc)|(def))(?1)/    /(?|(abc)|(def))(?1)/
1316  </pre>  </pre>
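
Editorial note on the hunk above: the difference between the back reference and the subroutine call can be sketched with Python's standard `re` module. Python supports neither `(?|...)` nor `(?1)`, so the two patterns below are hand-written equivalents of what the text describes, not PCRE syntax; the names `BACKREF_EQUIVALENT` and `SUBROUTINE_EQUIVALENT` are illustrative only.

```python
import re

# Hand-expanded stand-in for /(?|(abc)|(def))\1/ : the back reference
# repeats whichever alternative actually matched.
BACKREF_EQUIVALENT = re.compile(r"(?:(abc)\1|(def)\2)")

# Hand-expanded stand-in for /(?|(abc)|(def))(?1)/ : the subroutine call
# always reruns the first group with that number, that is, "abc".
SUBROUTINE_EQUIVALENT = re.compile(r"(?:abc|def)abc")


def full(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None


if __name__ == "__main__":
    for subject in ("abcabc", "defdef", "defabc"):
        print(subject,
              full(BACKREF_EQUIVALENT, subject),
              full(SUBROUTINE_EQUIVALENT, subject))
```

Under these assumptions, the back reference form accepts "abcabc" and "defdef", while the subroutine form accepts "abcabc" and "defabc", as the documentation text states.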
# Line 1412  items: Line 1412  items:
1412    a character class    a character class
1413    a back reference (see next section)    a back reference (see next section)
1414    a parenthesized subpattern (including assertions)    a parenthesized subpattern (including assertions)
1415    a recursive or "subroutine" call to a subpattern    a subroutine call to a subpattern (recursive or otherwise)
1416  </pre>  </pre>
1417  The general repetition quantifier specifies a minimum and maximum number of  The general repetition quantifier specifies a minimum and maximum number of
1418  permitted matches, by giving the two numbers in curly brackets (braces),  permitted matches, by giving the two numbers in curly brackets (braces),
# Line 2097  If the condition is the string (DEFINE), Line 2097  If the condition is the string (DEFINE),
2097  name DEFINE, the condition is always false. In this case, there may be only one  name DEFINE, the condition is always false. In this case, there may be only one
2098  alternative in the subpattern. It is always skipped if control reaches this  alternative in the subpattern. It is always skipped if control reaches this
2099  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
2100  "subroutines" that can be referenced from elsewhere. (The use of  subroutines that can be referenced from elsewhere. (The use of
2101  <a href="#subpatternsassubroutines">"subroutines"</a>  <a href="#subpatternsassubroutines">subroutines</a>
2102  is described below.) For example, a pattern to match an IPv4 address such as  is described below.) For example, a pattern to match an IPv4 address such as
2103  "192.168.23.245" could be written like this (ignore whitespace and line  "192.168.23.245" could be written like this (ignore whitespace and line
2104  breaks):  breaks):
# Line 2188  this kind of recursion was subsequently Line 2188  this kind of recursion was subsequently
2188  </P>  </P>
2189  <P>  <P>
2190  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
2191  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive subroutine call of the subpattern of the
2192  provided that it occurs inside that subpattern. (If not, it is a  given number, provided that it occurs inside that subpattern. (If not, it is a
2193  <a href="#subpatternsassubroutines">"subroutine"</a>  <a href="#subpatternsassubroutines">non-recursive subroutine</a>
2194  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
2195  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
2196  </P>  </P>
# Line 2226  capturing parentheses leftwards from the Line 2226  capturing parentheses leftwards from the
2226  It is also possible to refer to subsequently opened parentheses, by writing  It is also possible to refer to subsequently opened parentheses, by writing
2227  references such as (?+2). However, these cannot be recursive because the  references such as (?+2). However, these cannot be recursive because the
2228  reference is not inside the parentheses that are referenced. They are always  reference is not inside the parentheses that are referenced. They are always
2229  <a href="#subpatternsassubroutines">"subroutine"</a>  <a href="#subpatternsassubroutines">non-recursive subroutine</a>
2230  calls, as described in the next section.  calls, as described in the next section.
2231  </P>  </P>
2232  <P>  <P>
# Line 2263  documentation). If the pattern above is Line 2263  documentation). If the pattern above is
2263  </pre>  </pre>
2264  the value for the inner capturing parentheses (numbered 2) is "ef", which is  the value for the inner capturing parentheses (numbered 2) is "ef", which is
2265  the last value taken on at the top level. If a capturing subpattern is not  the last value taken on at the top level. If a capturing subpattern is not
2266  matched at the top level, its final value is unset, even if it is (temporarily)  matched at the top level, its final captured value is unset, even if it was
2267  set at a deeper level.  (temporarily) set at a deeper level during the matching process.
2268  </P>  </P>
2269  <P>  <P>
2270  If there are more than 15 capturing parentheses in a pattern, PCRE has to  If there are more than 15 capturing parentheses in a pattern, PCRE has to
# Line 2285  different alternatives for the recursive Line 2285  different alternatives for the recursive
2285  is the actual recursive call.  is the actual recursive call.
2286  <a name="recursiondifference"></a></P>  <a name="recursiondifference"></a></P>
2287  <br><b>  <br><b>
2288  Recursion difference from Perl  Differences in recursion processing between PCRE and Perl
2289  </b><br>  </b><br>
2290  <P>  <P>
2291  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  Recursion processing in PCRE differs from Perl in two important ways. In PCRE
2292  treated as an atomic group. That is, once it has matched some of the subject  (like Python, but unlike Perl), a recursive subpattern call is always treated
2293  string, it is never re-entered, even if it contains untried alternatives and  as an atomic group. That is, once it has matched some of the subject string, it
2294  there is a subsequent matching failure. This can be illustrated by the  is never re-entered, even if it contains untried alternatives and there is a
2295  following pattern, which purports to match a palindromic string that contains  subsequent matching failure. This can be illustrated by the following pattern,
2296  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):  which purports to match a palindromic string that contains an odd number of
2297    characters (for example, "a", "aba", "abcba", "abcdcba"):
2298  <pre>  <pre>
2299    ^(.|(.)(?1)\2)$    ^(.|(.)(?1)\2)$
2300  </pre>  </pre>
# Line 2358  For example, although "abcba" is correct Line 2359  For example, although "abcba" is correct
2359  PCRE finds the palindrome "aba" at the start, then fails at top level because  PCRE finds the palindrome "aba" at the start, then fails at top level because
2360  the end of the string does not follow. Once again, it cannot jump back into the  the end of the string does not follow. Once again, it cannot jump back into the
2361  recursion to try other alternatives, so the entire match fails.  recursion to try other alternatives, so the entire match fails.
2362    </P>
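
Editorial note on the hunk above: the effect of treating a recursive call as an atomic group can be illustrated with a small hand-written model of the pattern `^(.|(.)(?1)\2)$`. This is a sketch, not PCRE's matcher: the function names and the `atomic_calls` flag are hypothetical, and details such as "." not matching a newline are ignored. With `atomic_calls=True` the recursive call keeps only its first successful match (the behaviour the text ascribes to PCRE); with `False` it can be re-entered to try other alternatives (the behaviour ascribed to Perl).

```python
from itertools import islice


def group1_ends(subject, pos, atomic_calls):
    """Yield, in matching order, each end offset of group 1 starting at pos."""
    if pos >= len(subject):
        return
    # First alternative: "." matches one character.
    yield pos + 1
    # Second alternative: (.)(?1)\2
    first = subject[pos]
    inner = group1_ends(subject, pos + 1, atomic_calls)
    if atomic_calls:
        # Atomic call: only the first way of matching is ever used.
        inner = islice(inner, 1)
    for end in inner:
        # \2 must repeat the character captured by the inner (.)
        if end < len(subject) and subject[end] == first:
            yield end + 1


def matches(subject, atomic_calls):
    """Model ^...$ by requiring group 1 to end at the end of the subject."""
    return any(end == len(subject)
               for end in group1_ends(subject, 0, atomic_calls))


if __name__ == "__main__":
    for subject in ("a", "aba", "abcba"):
        print(subject, matches(subject, True), matches(subject, False))
```

In this model "abcba" fails when calls are atomic, because the inner call settles for the single character "b" and cannot be re-entered, but succeeds when the call is allowed to backtrack.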
2363    <P>
2364    The second way in which PCRE and Perl differ in their recursion processing is
2365    in the handling of captured values. In Perl, when a subpattern is called
2366    recursively or as a subpattern (see the next section), it has no access to any
2367    values that were captured outside the recursion, whereas in PCRE these values
2368    can be referenced. Consider this pattern:
2369    <pre>
2370      ^(.)(\1|a(?2))
2371    </pre>
2372    In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
2373    then in the second group, when the back reference \1 fails to match "b", the
2374    second alternative matches "a" and then recurses. In the recursion, \1 does
2375    now match "b" and so the whole match succeeds. In Perl, the pattern fails to
2376    match because inside the recursive call \1 cannot access the externally set
2377    value.
2378  <a name="subpatternsassubroutines"></a></P>  <a name="subpatternsassubroutines"></a></P>
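
Editorial note on the paragraph added in this hunk: the second difference, visibility of externally captured values inside a recursion, can be modelled for the pattern `^(.)(\1|a(?2))` in the same hand-written style. The sketch below is an assumption-laden illustration, not either engine's implementation; `group2_end`, `matches` and the `captures_visible` flag are hypothetical names. Setting `captures_visible=True` models the behaviour described for PCRE, and `False` models the behaviour described for Perl.

```python
def group2_end(subject, pos, group1, depth, captures_visible):
    """Return the end offset of group 2 starting at pos, or None on failure."""
    # Inside a recursive call, \1 sees the outer capture only if allowed to.
    visible = group1 if (depth == 0 or captures_visible) else None
    # First alternative: the back reference \1
    if visible is not None and subject.startswith(visible, pos):
        return pos + len(visible)
    # Second alternative: literal "a" followed by the recursive call (?2)
    if subject.startswith("a", pos):
        return group2_end(subject, pos + 1, group1, depth + 1, captures_visible)
    return None


def matches(subject, captures_visible):
    """Model ^(.)(\\1|a(?2)) anchored at the start of the subject."""
    if not subject:
        return False
    group1 = subject[0]  # (.) captures the first character
    return group2_end(subject, 1, group1, 0, captures_visible) is not None


if __name__ == "__main__":
    print(matches("bab", True))   # PCRE-style visibility
    print(matches("bab", False))  # Perl-style visibility, as described above
```

With "bab" as the subject, the model succeeds only when the recursive call is allowed to see the value captured by the first parentheses.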
2379  <br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>  <br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
2380  <P>  <P>
2381  If the syntax for a recursive subpattern reference (either by number or by  If the syntax for a recursive subpattern call (either by number or by
2382  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
2383  subroutine in a programming language. The "called" subpattern may be defined  subroutine in a programming language. The called subpattern may be defined
2384  before or after the reference. A numbered reference can be absolute or  before or after the reference. A numbered reference can be absolute or
2385  relative, as in these examples:  relative, as in these examples:
2386  <pre>  <pre>
# Line 2384  is used, it does match "sense and respon Line 2401  is used, it does match "sense and respon
2401  strings. Another example is given in the discussion of DEFINE above.  strings. Another example is given in the discussion of DEFINE above.
2402  </P>  </P>
2403  <P>  <P>
2404  Like recursive subpatterns, a subroutine call is always treated as an atomic  All subroutine calls, whether recursive or not, are always treated as atomic
2405  group. That is, once it has matched some of the subject string, it is never  groups. That is, once a subroutine has matched some of the subject string, it
2406  re-entered, even if it contains untried alternatives and there is a subsequent  is never re-entered, even if it contains untried alternatives and there is a
2407  matching failure. Any capturing parentheses that are set during the subroutine  subsequent matching failure. Any capturing parentheses that are set during the
2408  call revert to their previous values afterwards.  subroutine call revert to their previous values afterwards.
2409  </P>  </P>
2410  <P>  <P>
2411  When a subpattern is used as a subroutine, processing options such as  Processing options such as case-independence are fixed when a subpattern is
2412  case-independence are fixed when the subpattern is defined. They cannot be  defined, so if it is used as a subroutine, such options cannot be changed for
2413  changed for different calls. For example, consider this pattern:  different calls. For example, consider this pattern:
2414  <pre>  <pre>
2415    (abc)(?i:(?-1))    (abc)(?i:(?-1))
2416  </pre>  </pre>
# Line 2469  failing negative assertion, they cause a Line 2486  failing negative assertion, they cause a
2486  <b>pcre_dfa_exec()</b>.  <b>pcre_dfa_exec()</b>.
2487  </P>  </P>
2488  <P>  <P>
2489  If any of these verbs are used in an assertion or subroutine subpattern  If any of these verbs are used in an assertion or in a subpattern that is
2490  (including recursive subpatterns), their effect is confined to that subpattern;  called as a subroutine (whether or not recursively), their effect is confined
2491  it does not extend to the surrounding pattern, with one exception: a *MARK that  to that subpattern; it does not extend to the surrounding pattern, with one
2492  is encountered in a positive assertion <i>is</i> passed back (compare capturing  exception: a *MARK that is encountered in a positive assertion <i>is</i> passed
2493  parentheses in assertions). Note that such subpatterns are processed as  back (compare capturing parentheses in assertions). Note that such subpatterns
2494  anchored at the point where they are tested.  are processed as anchored at the point where they are tested. Note also that
2495    Perl's treatment of subroutines is different in some cases.
2496  </P>  </P>
2497  <P>  <P>
2498  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2499  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
2500  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
2501  depending on whether or not an argument is present. An name is a sequence of  depending on whether or not an argument is present. A name is any sequence of
2502  letters, digits, and underscores. If the name is empty, that is, if the closing  characters that does not include a closing parenthesis. If the name is empty,
2503  parenthesis immediately follows the colon, the effect is as if the colon were  that is, if the closing parenthesis immediately follows the colon, the effect
2504  not there. Any number of these verbs may occur in a pattern.  is as if the colon were not there. Any number of these verbs may occur in a
2505    pattern.
2506  </P>  </P>
2507  <P>  <P>
2508  PCRE contains some optimizations that are used to speed up matching by running  PCRE contains some optimizations that are used to speed up matching by running
# Line 2505  followed by a name. Line 2524  followed by a name.
2524     (*ACCEPT)     (*ACCEPT)
2525  </pre>  </pre>
2526  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2527  pattern. When inside a recursion, only the innermost pattern is ended  pattern. However, when it is inside a subpattern that is called as a
2528  immediately. If (*ACCEPT) is inside capturing parentheses, the data so far is  subroutine, only that subpattern is ended successfully. Matching then continues
2529  captured. (This feature was added to PCRE at release 8.00.) For example:  at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
2530    far is captured. For example:
2531  <pre>  <pre>
2532    A((?:A|B(*ACCEPT)|C)D)    A((?:A|B(*ACCEPT)|C)D)
2533  </pre>  </pre>
# Line 2516  the outer parentheses. Line 2536  the outer parentheses.
2536  <pre>  <pre>
2537    (*FAIL) or (*F)    (*FAIL) or (*F)
2538  </pre>  </pre>
2539  This verb causes the match to fail, forcing backtracking to occur. It is  This verb causes a matching failure, forcing backtracking to occur. It is
2540  equivalent to (?!) but easier to read. The Perl documentation notes that it is  equivalent to (?!) but easier to read. The Perl documentation notes that it is
2541  probably useful only when combined with (?{}) or (??{}). Those are, of course,  probably useful only when combined with (?{}) or (??{}). Those are, of course,
2542  Perl features that are not present in PCRE. The nearest equivalent is the  Perl features that are not present in PCRE. The nearest equivalent is the
# Line 2566  capturing parentheses. Line 2586  capturing parentheses.
2586  <P>  <P>
2587  If (*MARK) is encountered in a positive assertion, its name is recorded and  If (*MARK) is encountered in a positive assertion, its name is recorded and
2588  passed back if it is the last-encountered. This does not happen for negative  passed back if it is the last-encountered. This does not happen for negative
2589  assetions.  assertions.
2590  </P>  </P>
2591  <P>  <P>
2592  A name may also be returned after a failed match if the final path through the  A name may also be returned after a failed match if the final path through the
# Line 2684  following pattern fails to match, the pr Line 2704  following pattern fails to match, the pr
2704  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2705  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2706  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2707  matching name is found, normal "bumpalong" of one character happens (the  matching name is found, normal "bumpalong" of one character happens (that is,
2708  (*SKIP) is ignored).  the (*SKIP) is ignored).
2709  <pre>  <pre>
2710    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2711  </pre>  </pre>
2712  This verb causes a skip to the next alternation in the innermost enclosing  This verb causes a skip to the next innermost alternative if the rest of the
2713  group if the rest of the pattern does not match. That is, it cancels pending  pattern does not match. That is, it cancels pending backtracking, but only
2714  backtracking, but only within the current alternation. Its name comes from the  within the current alternative. Its name comes from the observation that it can
2715  observation that it can be used for a pattern-based if-then-else block:  be used for a pattern-based if-then-else block:
2716  <pre>  <pre>
2717    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2718  </pre>  </pre>
2719  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2720  the end of the group if FOO succeeds); on failure the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2721  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2722  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
2723  overall match fails. If (*THEN) is not directly inside an alternation, it acts  overall match fails. If (*THEN) is not inside an alternation, it acts like
2724  like (*PRUNE).  (*PRUNE).
2725    </P>
2726    <P>
2727    Note that a subpattern that does not contain a | character is just a part of
2728    the enclosing alternative; it is not a nested alternation with only one
2729    alternative. The effect of (*THEN) extends beyond such a subpattern to the
2730    enclosing alternative. Consider this pattern, where A, B, etc. are complex
2731    pattern fragments that do not contain any | characters at this level:
2732    <pre>
2733      A (B(*THEN)C) | D
2734    </pre>
2735    If A and B are matched, but there is a failure in C, matching does not
2736    backtrack into A; instead it moves to the next alternative, that is, D.
2737    However, if the subpattern containing (*THEN) is given an alternative, it
2738    behaves differently:
2739    <pre>
2740      A (B(*THEN)C | (*FAIL)) | D
2741    </pre>
2742    The effect of (*THEN) is now confined to the inner subpattern. After a failure
2743    in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2744    because there are no more alternatives to try. In this case, matching does now
2745    backtrack into A.
2746    </P>
2747    <P>
2748    Note also that a conditional subpattern is not considered as having two
2749    alternatives, because only one is ever used. In other words, the | character in
2750    a conditional subpattern has a different meaning. Ignoring white space,
2751    consider:
2752    <pre>
2753      ^.*? (?(?=a) a | b(*THEN)c )
2754    </pre>
2755    If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2756    it initially matches zero characters. The condition (?=a) then fails, the
2757    character "b" is matched, but "c" is not. At this point, matching does not
2758    backtrack to .*? as might perhaps be expected from the presence of the |
2759    character. The conditional subpattern is part of the single alternative that
2760    comprises the whole pattern, and so the match fails. (If there was a backtrack
2761    into .*?, allowing it to match "b", the match would succeed.)
2762  </P>  </P>
2763  <P>  <P>
2764  The above verbs provide four different "strengths" of control when subsequent  The verbs just described provide four different "strengths" of control when
2765  matching fails. (*THEN) is the weakest, carrying on the match at the next  subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
2766  alternation. (*PRUNE) comes next, failing the match at the current starting  next alternative. (*PRUNE) comes next, failing the match at the current
2767  position, but allowing an advance to the next character (for an unanchored  starting position, but allowing an advance to the next character (for an
2768  pattern). (*SKIP) is similar, except that the advance may be more than one  unanchored pattern). (*SKIP) is similar, except that the advance may be more
2769  character. (*COMMIT) is the strongest, causing the entire match to fail.  than one character. (*COMMIT) is the strongest, causing the entire match to
2770    fail.
2771  </P>  </P>
2772  <P>  <P>
2773  If more than one is present in a pattern, the "stongest" one wins. For example,  If more than one such verb is present in a pattern, the "strongest" one wins.
2774  consider this pattern, where A, B, etc. are complex pattern fragments:  For example, consider this pattern, where A, B, etc. are complex pattern
2775    fragments:
2776  <pre>  <pre>
2777    (A(*COMMIT)B(*THEN)C|D)    (A(*COMMIT)B(*THEN)C|D)
2778  </pre>  </pre>
2779  Once A has matched, PCRE is committed to this match, at the current starting  Once A has matched, PCRE is committed to this match, at the current starting
2780  position. If subsequently B matches, but C does not, the normal (*THEN) action  position. If subsequently B matches, but C does not, the normal (*THEN) action
2781  of trying the next alternation (that is, D) does not happen because (*COMMIT)  of trying the next alternative (that is, D) does not happen because (*COMMIT)
2782  overrides.  overrides.
2783  </P>  </P>
2784  <br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>  <br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
# Line 2738  Cambridge CB2 3QH, England. Line 2797  Cambridge CB2 3QH, England.
2797  </P>  </P>
2798  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2799  <P>  <P>
2800  Last updated: 24 August 2011  Last updated: 09 October 2011
2801  <br>  <br>
2802  Copyright &copy; 1997-2011 University of Cambridge.  Copyright &copy; 1997-2011 University of Cambridge.
2803  <br>  <br>
