Diff of /code/trunk/doc/pcrepattern.3

revision 716 by ph10, Tue Oct 4 16:38:05 2011 UTC revision 733 by ph10, Tue Oct 11 10:29:36 2011 UTC
# Line 2297  documentation). If the pattern above is Line 2297  documentation). If the pattern above is
2297  .sp  .sp
2298  the value for the inner capturing parentheses (numbered 2) is "ef", which is  the value for the inner capturing parentheses (numbered 2) is "ef", which is
2299  the last value taken on at the top level. If a capturing subpattern is not  the last value taken on at the top level. If a capturing subpattern is not
2300  matched at the top level, its final value is unset, even if it is (temporarily)  matched at the top level, its final captured value is unset, even if it was
2301  set at a deeper level.  (temporarily) set at a deeper level during the matching process.
2302  .P  .P
2303  If there are more than 15 capturing parentheses in a pattern, PCRE has to  If there are more than 15 capturing parentheses in a pattern, PCRE has to
2304  obtain extra memory to store data during a recursion, which it does by using  obtain extra memory to store data during a recursion, which it does by using
# Line 2318  is the actual recursive call. Line 2318  is the actual recursive call.
2318  .  .
2319  .  .
2320  .\" HTML <a name="recursiondifference"></a>  .\" HTML <a name="recursiondifference"></a>
2321  .SS "Recursion difference from Perl"  .SS "Differences in recursion processing between PCRE and Perl"
2322  .rs  .rs
2323  .sp  .sp
2324  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  Recursion processing in PCRE differs from Perl in two important ways. In PCRE
2325  treated as an atomic group. That is, once it has matched some of the subject  (like Python, but unlike Perl), a recursive subpattern call is always treated
2326  string, it is never re-entered, even if it contains untried alternatives and  as an atomic group. That is, once it has matched some of the subject string, it
2327  there is a subsequent matching failure. This can be illustrated by the  is never re-entered, even if it contains untried alternatives and there is a
2328  following pattern, which purports to match a palindromic string that contains  subsequent matching failure. This can be illustrated by the following pattern,
2329  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):  which purports to match a palindromic string that contains an odd number of
2330    characters (for example, "a", "aba", "abcba", "abcdcba"):
2331  .sp  .sp
2332    ^(.|(.)(?1)\e2)$    ^(.|(.)(?1)\e2)$
2333  .sp  .sp
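The atomic treatment of the recursive call can be made concrete with a small simulation. The sketch below is not PCRE code and does not call any PCRE API; it is a hand-written Python model of the pattern ^(.|(.)(?1)\e2)$, and the names group1 and palindrome_matches are hypothetical. It assumes "." matches any character and "$" matches only at the very end of the subject. With atomic=True the recursive call keeps only the first way it matched, as the text describes for PCRE; with atomic=False every alternative inside the call remains available, as in Perl.

```python
from itertools import islice
from typing import Iterator


def group1(s: str, i: int, atomic: bool) -> Iterator[int]:
    """Yield each end offset at which group 1 can match, starting at offset i."""
    if i >= len(s):
        return
    # First alternative: a single character.
    yield i + 1
    # Second alternative: (.) then the recursive call (?1) then a back reference.
    ch = s[i]
    ends = group1(s, i + 1, atomic)
    if atomic:
        # Atomic recursion: once the call has matched, it is never re-entered,
        # so only its first result is ever considered.
        ends = islice(ends, 1)
    for j in ends:
        if s[j:j + 1] == ch:
            yield j + 1


def palindrome_matches(s: str, atomic: bool) -> bool:
    """Model the anchored pattern: group 1 must end exactly at the end of s."""
    return any(end == len(s) for end in group1(s, 0, atomic))


if __name__ == "__main__":
    for subject in ("a", "aba", "abcba"):
        print(subject, palindrome_matches(subject, atomic=True),
              palindrome_matches(subject, atomic=False))
```

In this model "a" and "aba" match in both modes, while "abcba" matches only with atomic=False: in atomic mode the recursive call settles on the single character "b", the back reference then fails against "c", and the call cannot be re-entered to try its second alternative.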
# Line 2387  For example, although "abcba" is correct Line 2388  For example, although "abcba" is correct
2388  PCRE finds the palindrome "aba" at the start, then fails at top level because  PCRE finds the palindrome "aba" at the start, then fails at top level because
2389  the end of the string does not follow. Once again, it cannot jump back into the  the end of the string does not follow. Once again, it cannot jump back into the
2390  recursion to try other alternatives, so the entire match fails.  recursion to try other alternatives, so the entire match fails.
2391    .P
2392    The second way in which PCRE and Perl differ in their recursion processing is
2393    in the handling of captured values. In Perl, when a subpattern is called
2394    recursively or as a subroutine (see the next section), it has no access to any
2395    values that were captured outside the recursion, whereas in PCRE these values
2396    can be referenced. Consider this pattern:
2397    .sp
2398      ^(.)(\e1|a(?2))
2399    .sp
2400    In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
2401    then in the second group, when the back reference \e1 fails to match "b", the
2402    second alternative matches "a" and then recurses. In the recursion, \e1 does
2403    now match "b" and so the whole match succeeds. In Perl, the pattern fails to
2404    match because inside the recursive call \e1 cannot access the externally set
2405    value.
2406  .  .
2407  .  .
2408  .\" HTML <a name="subpatternsassubroutines"></a>  .\" HTML <a name="subpatternsassubroutines"></a>
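The difference in the handling of captured values, added in this revision, can be modelled in the same way. The sketch below is a plain Python simulation of the pattern ^(.)(\e1|a(?2)), not a call into PCRE or Perl; the names group2 and backref_matches and the flag outer_visible are hypothetical. It assumes "." matches any character. With outer_visible=True the recursive call can see the value captured by group 1 outside the recursion (the PCRE behaviour described above); with outer_visible=False that value is treated as unset inside the call (the Perl behaviour described above).

```python
from typing import Iterator, Optional


def group2(s: str, i: int, cap1: str, outer_visible: bool,
           depth: int = 0) -> Iterator[int]:
    """Yield each end offset at which group 2 can match, starting at offset i."""
    # Inside a recursive call (depth > 0), the value of group 1 is only
    # available when outer_visible is True.
    visible: Optional[str] = cap1 if (depth == 0 or outer_visible) else None
    # First alternative: the back reference to group 1.
    if visible is not None and s.startswith(visible, i):
        yield i + len(visible)
    # Second alternative: a literal "a" followed by the recursive call (?2).
    # Nothing follows the call in this pattern, so treating it as atomic
    # or not cannot change whether a match is found.
    if s.startswith("a", i):
        yield from group2(s, i + 1, cap1, outer_visible, depth + 1)


def backref_matches(s: str, outer_visible: bool) -> bool:
    """Model the pattern anchored at the start only (there is no trailing $)."""
    if not s:
        return False
    cap1 = s[0]  # group 1 captures the first character
    return any(True for _ in group2(s, 1, cap1, outer_visible))


if __name__ == "__main__":
    print(backref_matches("bab", outer_visible=True))
    print(backref_matches("bab", outer_visible=False))
```

For the subject "bab" the model reports a match with outer_visible=True and no match with outer_visible=False, which agrees with the description in the added paragraph.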
# Line 2746  pattern fragments that do not contain an Line 2762  pattern fragments that do not contain an
2762  .sp  .sp
2763    A (B(*THEN)C) | D    A (B(*THEN)C) | D
2764  .sp  .sp
2765  If A and B are matched, but there is a failure in C, matching does not  If A and B are matched, but there is a failure in C, matching does not
2766  backtrack into A; instead it moves to the next alternative, that is, D.  backtrack into A; instead it moves to the next alternative, that is, D.
2767  However, if the subpattern containing (*THEN) is given an alternative, it  However, if the subpattern containing (*THEN) is given an alternative, it
2768  behaves differently:  behaves differently:
# Line 2754  behaves differently: Line 2770  behaves differently:
2770    A (B(*THEN)C | (*FAIL)) | D    A (B(*THEN)C | (*FAIL)) | D
2771  .sp  .sp
2772  The effect of (*THEN) is now confined to the inner subpattern. After a failure  The effect of (*THEN) is now confined to the inner subpattern. After a failure
2773  in C, matching moves to (*FAIL), which causes the whole subpattern to fail  in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2774  because there are no more alternatives to try. In this case, matching does now  because there are no more alternatives to try. In this case, matching does now
2775  backtrack into A.  backtrack into A.
2776  .P  .P
2777  Note also that a conditional subpattern is not considered as having two  Note also that a conditional subpattern is not considered as having two
2778  alternatives, because only one is ever used. In other words, the | character in  alternatives, because only one is ever used. In other words, the | character in
2779  a conditional subpattern has a different meaning. Ignoring white space,  a conditional subpattern has a different meaning. Ignoring white space,
2780  consider:  consider:
2781  .sp  .sp
2782    ^.*? (?(?=a) a | b(*THEN)c )    ^.*? (?(?=a) a | b(*THEN)c )
2783  .sp  .sp
2784  If the subject is "ba", this pattern does not match. Because .*? is ungreedy,  If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2785  it initially matches zero characters. The condition (?=a) then fails, the  it initially matches zero characters. The condition (?=a) then fails, the
2786  character "b" is matched, but "c" is not. At this point, matching does not  character "b" is matched, but "c" is not. At this point, matching does not
2787  backtrack to .*? as might perhaps be expected from the presence of the |  backtrack to .*? as might perhaps be expected from the presence of the |
2788  character. The conditional subpattern is part of the single alternative that  character. The conditional subpattern is part of the single alternative that
2789  comprises the whole pattern, and so the match fails. (If there were a backtrack  comprises the whole pattern, and so the match fails. (If there were a backtrack
2790  into .*?, allowing it to match "b", the match would succeed.)  into .*?, allowing it to match "b", the match would succeed.)
2791  .P  .P
2792  The verbs just described provide four different "strengths" of control when  The verbs just described provide four different "strengths" of control when
# Line 2814  Cambridge CB2 3QH, England. Line 2830  Cambridge CB2 3QH, England.
2830  .rs  .rs
2831  .sp  .sp
2832  .nf  .nf
2833  Last updated: 04 October 2011  Last updated: 09 October 2011
2834  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2835  .fi  .fi
