Diff of /code/trunk/doc/pcrepattern.3

revision 678 by ph10, Sun Aug 28 15:23:03 2011 UTC  revision 716 by ph10, Tue Oct 4 16:38:05 2011 UTC
# Line 1315  or "defdef": Line 1315  or "defdef":
1315  .sp  .sp
1316    /(?|(abc)|(def))\e1/    /(?|(abc)|(def))\e1/
1317  .sp  .sp
1318  In contrast, a recursive or "subroutine" call to a numbered subpattern always  In contrast, a subroutine call to a numbered subpattern always refers to the
1319  refers to the first one in the pattern with the given number. The following  first one in the pattern with the given number. The following pattern matches
1320  pattern matches "abcabc" or "defabc":  "abcabc" or "defabc":
1321  .sp  .sp
1322    /(?|(abc)|(def))(?1)/    /(?|(abc)|(def))(?1)/
1323  .sp  .sp
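
The difference between a back reference and a subroutine call in this hunk can be sketched outside PCRE. Python's standard `re` module supports neither branch reset groups nor `(?1)` calls, so what follows is only an emulation under that assumption: the back reference is written as `(abc|def)\1`, and the subroutine call is emulated by repeating the text of the first group numbered 1, which is `abc`. The names `BACKREF`, `SUBROUTINE`, and `matches` are hypothetical.

```python
import re

# Emulation only: Python's re has no (?|...) or (?1) syntax.
# Back reference: repeats whatever text the group actually matched.
BACKREF = re.compile(r"(abc|def)\1")
# Subroutine call: re-runs the first group numbered 1, i.e. "abc".
SUBROUTINE = re.compile(r"(?:abc|def)(?:abc)")


def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None
```

Under this emulation the back reference form accepts "abcabc" and "defdef", while the subroutine form accepts "abcabc" and "defabc", which is the behaviour the text describes.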
# Line 1434  items: Line 1434  items:
1434    a character class    a character class
1435    a back reference (see next section)    a back reference (see next section)
1436    a parenthesized subpattern (including assertions)    a parenthesized subpattern (including assertions)
1437    a recursive or "subroutine" call to a subpattern    a subroutine call to a subpattern (recursive or otherwise)
1438  .sp  .sp
1439  The general repetition quantifier specifies a minimum and maximum number of  The general repetition quantifier specifies a minimum and maximum number of
1440  permitted matches, by giving the two numbers in curly brackets (braces),  permitted matches, by giving the two numbers in curly brackets (braces),
# Line 2123  If the condition is the string (DEFINE), Line 2123  If the condition is the string (DEFINE),
2123  name DEFINE, the condition is always false. In this case, there may be only one  name DEFINE, the condition is always false. In this case, there may be only one
2124  alternative in the subpattern. It is always skipped if control reaches this  alternative in the subpattern. It is always skipped if control reaches this
2125  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
2126  "subroutines" that can be referenced from elsewhere. (The use of  subroutines that can be referenced from elsewhere. (The use of
2127  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2128  .\" </a>  .\" </a>
2129  "subroutines"  subroutines
2130  .\"  .\"
2131  is described below.) For example, a pattern to match an IPv4 address such as  is described below.) For example, a pattern to match an IPv4 address such as
2132  "192.168.23.245" could be written like this (ignore whitespace and line  "192.168.23.245" could be written like this (ignore whitespace and line
# Line 2221  individual subpattern recursion. After i Line 2221  individual subpattern recursion. After i
2221  this kind of recursion was subsequently introduced into Perl at release 5.10.  this kind of recursion was subsequently introduced into Perl at release 5.10.
2222  .P  .P
2223  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
2224  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive subroutine call of the subpattern of the
2225  provided that it occurs inside that subpattern. (If not, it is a  given number, provided that it occurs inside that subpattern. (If not, it is a
2226  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2227  .\" </a>  .\" </a>
2228  "subroutine"  non-recursive subroutine
2229  .\"  .\"
2230  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
2231  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
# Line 2260  references such as (?+2). However, these Line 2260  references such as (?+2). However, these
2260  reference is not inside the parentheses that are referenced. They are always  reference is not inside the parentheses that are referenced. They are always
2261  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2262  .\" </a>  .\" </a>
2263  "subroutine"  non-recursive subroutine
2264  .\"  .\"
2265  calls, as described in the next section.  calls, as described in the next section.
2266  .P  .P
# Line 2393  recursion to try other alternatives, so Line 2393  recursion to try other alternatives, so
2393  .SH "SUBPATTERNS AS SUBROUTINES"  .SH "SUBPATTERNS AS SUBROUTINES"
2394  .rs  .rs
2395  .sp  .sp
2396  If the syntax for a recursive subpattern reference (either by number or by  If the syntax for a recursive subpattern call (either by number or by
2397  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
2398  subroutine in a programming language. The "called" subpattern may be defined  subroutine in a programming language. The called subpattern may be defined
2399  before or after the reference. A numbered reference can be absolute or  before or after the reference. A numbered reference can be absolute or
2400  relative, as in these examples:  relative, as in these examples:
2401  .sp  .sp
# Line 2415  matches "sense and sensibility" and "res Line 2415  matches "sense and sensibility" and "res
2415  is used, it does match "sense and responsibility" as well as the other two  is used, it does match "sense and responsibility" as well as the other two
2416  strings. Another example is given in the discussion of DEFINE above.  strings. Another example is given in the discussion of DEFINE above.
2417  .P  .P
2418  Like recursive subpatterns, a subroutine call is always treated as an atomic  All subroutine calls, whether recursive or not, are always treated as atomic
2419  group. That is, once it has matched some of the subject string, it is never  groups. That is, once a subroutine has matched some of the subject string, it
2420  re-entered, even if it contains untried alternatives and there is a subsequent  is never re-entered, even if it contains untried alternatives and there is a
2421  matching failure. Any capturing parentheses that are set during the subroutine  subsequent matching failure. Any capturing parentheses that are set during the
2422  call revert to their previous values afterwards.  subroutine call revert to their previous values afterwards.
2423  .P  .P
2424  When a subpattern is used as a subroutine, processing options such as  Processing options such as case-independence are fixed when a subpattern is
2425  case-independence are fixed when the subpattern is defined. They cannot be  defined, so if it is used as a subroutine, such options cannot be changed for
2426  changed for different calls. For example, consider this pattern:  different calls. For example, consider this pattern:
2427  .sp  .sp
2428    (abc)(?i:(?-1))    (abc)(?i:(?-1))
2429  .sp  .sp
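
A rough way to see the effect with Python's `re` module, which lacks `(?-1)`, is to expand the call by hand. This is an emulation, assuming that the called subpattern keeps the case-sensitive setting in force where `(abc)` was defined; that setting is spelled out here as `(?-i:abc)`. The names `EXPANDED` and `whole_match` are hypothetical.

```python
import re

# Emulation only: (?-1) is expanded by hand, and (?-i:...) pins the
# case-sensitive setting that was in force where (abc) was defined.
EXPANDED = re.compile(r"(abc)(?i:(?-i:abc))")


def whole_match(subject):
    """Return True if the entire subject matches EXPANDED."""
    return EXPANDED.fullmatch(subject) is not None
```

With this expansion "abcabc" is accepted and "abcABC" is not, consistent with the statement that options cannot be changed for different calls.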
# Line 2504  a backtracking algorithm. With the excep Line 2504  a backtracking algorithm. With the excep
2504  failing negative assertion, they cause an error if encountered by  failing negative assertion, they cause an error if encountered by
2505  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2506  .P  .P
2507  If any of these verbs are used in an assertion or subroutine subpattern  If any of these verbs are used in an assertion or in a subpattern that is
2508  (including recursive subpatterns), their effect is confined to that subpattern;  called as a subroutine (whether or not recursively), their effect is confined
2509  it does not extend to the surrounding pattern, with one exception: a *MARK that  to that subpattern; it does not extend to the surrounding pattern, with one
2510  is encountered in a positive assertion \fIis\fP passed back (compare capturing  exception: a *MARK that is encountered in a positive assertion \fIis\fP passed
2511  parentheses in assertions). Note that such subpatterns are processed as  back (compare capturing parentheses in assertions). Note that such subpatterns
2512  anchored at the point where they are tested.  are processed as anchored at the point where they are tested. Note also that
2513    Perl's treatment of subroutines is different in some cases.
2514  .P  .P
2515  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2516  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
2517  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
2518  depending on whether or not an argument is present. An name is a sequence of  depending on whether or not an argument is present. A name is any sequence of
2519  letters, digits, and underscores. If the name is empty, that is, if the closing  characters that does not include a closing parenthesis. If the name is empty,
2520  parenthesis immediately follows the colon, the effect is as if the colon were  that is, if the closing parenthesis immediately follows the colon, the effect
2521  not there. Any number of these verbs may occur in a pattern.  is as if the colon were not there. Any number of these verbs may occur in a
2522    pattern.
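
The revised syntax rule (right-hand column) can be illustrated with a small scanner. This is a sketch in Python, not how PCRE itself parses patterns: it assumes verbs are spelled in uppercase letters only, it ignores escaping, and it does not cover other items that begin with `(*`. The names `VERB_ITEM` and `parse_verbs` are hypothetical.

```python
import re

# Illustrative scanner for the (*VERB) / (*VERB:NAME) shape only.
# A name is any run of characters that does not include a closing
# parenthesis; the verb spelling [A-Z]+ is an assumption of this sketch.
VERB_ITEM = re.compile(r"\(\*([A-Z]+)(?::([^)]*))?\)")


def parse_verbs(pattern):
    """Return (verb, name) pairs found in pattern.

    An empty name is reported as None, as if the colon were not there.
    """
    return [(m.group(1), m.group(2) or None)
            for m in VERB_ITEM.finditer(pattern)]
```

For instance, `parse_verbs("(*MARK:a b-c)x(*THEN:)")` yields a `MARK` entry with the name `a b-c` and a `THEN` entry with no name, since the empty name after the colon is treated as absent.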
2523  .P  .P
2524  PCRE contains some optimizations that are used to speed up matching by running  PCRE contains some optimizations that are used to speed up matching by running
2525  some checks at the start of each match attempt. For example, it may know the  some checks at the start of each match attempt. For example, it may know the
# Line 2538  followed by a name. Line 2540  followed by a name.
2540     (*ACCEPT)     (*ACCEPT)
2541  .sp  .sp
2542  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2543  pattern. When inside a recursion, only the innermost pattern is ended  pattern. However, when it is inside a subpattern that is called as a
2544  immediately. If (*ACCEPT) is inside capturing parentheses, the data so far is  subroutine, only that subpattern is ended successfully. Matching then continues
2545  captured. (This feature was added to PCRE at release 8.00.) For example:  at the outer level. If (*ACCEPT) is inside capturing parentheses, the data so
2546    far is captured. For example:
2547  .sp  .sp
2548    A((?:A|B(*ACCEPT)|C)D)    A((?:A|B(*ACCEPT)|C)D)
2549  .sp  .sp
# Line 2549  the outer parentheses. Line 2552  the outer parentheses.
2552  .sp  .sp
2553    (*FAIL) or (*F)    (*FAIL) or (*F)
2554  .sp  .sp
2555  This verb causes the match to fail, forcing backtracking to occur. It is  This verb causes a matching failure, forcing backtracking to occur. It is
2556  equivalent to (?!) but easier to read. The Perl documentation notes that it is  equivalent to (?!) but easier to read. The Perl documentation notes that it is
2557  probably useful only when combined with (?{}) or (??{}). Those are, of course,  probably useful only when combined with (?{}) or (??{}). Those are, of course,
2558  Perl features that are not present in PCRE. The nearest equivalent is the  Perl features that are not present in PCRE. The nearest equivalent is the
# Line 2602  capturing parentheses. Line 2605  capturing parentheses.
2605  .P  .P
2606  If (*MARK) is encountered in a positive assertion, its name is recorded and  If (*MARK) is encountered in a positive assertion, its name is recorded and
2607  passed back if it is the last-encountered. This does not happen for negative  passed back if it is the last-encountered. This does not happen for negative
2608  assetions.  assertions.
2609  .P  .P
2610  A name may also be returned after a failed match if the final path through the  A name may also be returned after a failed match if the final path through the
2611  pattern involves (*MARK). However, unless (*MARK) used in conjunction with  pattern involves (*MARK). However, unless (*MARK) used in conjunction with
# Line 2716  following pattern fails to match, the pr Line 2719  following pattern fails to match, the pr
2719  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2720  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2721  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2722  matching name is found, normal "bumpalong" of one character happens (the  matching name is found, normal "bumpalong" of one character happens (that is,
2723  (*SKIP) is ignored).  the (*SKIP) is ignored).
2724  .sp  .sp
2725    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2726  .sp  .sp
2727  This verb causes a skip to the next alternation in the innermost enclosing  This verb causes a skip to the next innermost alternative if the rest of the
2728  group if the rest of the pattern does not match. That is, it cancels pending  pattern does not match. That is, it cancels pending backtracking, but only
2729  backtracking, but only within the current alternation. Its name comes from the  within the current alternative. Its name comes from the observation that it can
2730  observation that it can be used for a pattern-based if-then-else block:  be used for a pattern-based if-then-else block:
2731  .sp  .sp
2732    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...    ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
2733  .sp  .sp
2734  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2735  the end of the group if FOO succeeds); on failure the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2736  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2737  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the
2738  overall match fails. If (*THEN) is not directly inside an alternation, it acts  overall match fails. If (*THEN) is not inside an alternation, it acts like
2739  like (*PRUNE).  (*PRUNE).
 .  
 .P  
 The above verbs provide four different "strengths" of control when subsequent  
 matching fails. (*THEN) is the weakest, carrying on the match at the next  
 alternation. (*PRUNE) comes next, failing the match at the current starting  
 position, but allowing an advance to the next character (for an unanchored  
 pattern). (*SKIP) is similar, except that the advance may be more than one  
 character. (*COMMIT) is the strongest, causing the entire match to fail.  
2740  .P  .P
2741  If more than one is present in a pattern, the "stongest" one wins. For example,  Note that a subpattern that does not contain a | character is just a part of
2742  consider this pattern, where A, B, etc. are complex pattern fragments:  the enclosing alternative; it is not a nested alternation with only one
2743    alternative. The effect of (*THEN) extends beyond such a subpattern to the
2744    enclosing alternative. Consider this pattern, where A, B, etc. are complex
2745    pattern fragments that do not contain any | characters at this level:
2746    .sp
2747      A (B(*THEN)C) | D
2748    .sp
2749    If A and B are matched, but there is a failure in C, matching does not
2750    backtrack into A; instead it moves to the next alternative, that is, D.
2751    However, if the subpattern containing (*THEN) is given an alternative, it
2752    behaves differently:
2753    .sp
2754      A (B(*THEN)C | (*FAIL)) | D
2755    .sp
2756    The effect of (*THEN) is now confined to the inner subpattern. After a failure
2757    in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2758    because there are no more alternatives to try. In this case, matching does now
2759    backtrack into A.
2760    .P
2761    Note also that a conditional subpattern is not considered as having two
2762    alternatives, because only one is ever used. In other words, the | character in
2763    a conditional subpattern has a different meaning. Ignoring white space,
2764    consider:
2765    .sp
2766      ^.*? (?(?=a) a | b(*THEN)c )
2767    .sp
2768    If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2769    it initially matches zero characters. The condition (?=a) then fails, the
2770    character "b" is matched, but "c" is not. At this point, matching does not
2771    backtrack to .*? as might perhaps be expected from the presence of the |
2772    character. The conditional subpattern is part of the single alternative that
2773    comprises the whole pattern, and so the match fails. (If there was a backtrack
2774    into .*?, allowing it to match "b", the match would succeed.)
2775    .P
2776    The verbs just described provide four different "strengths" of control when
2777    subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
2778    next alternative. (*PRUNE) comes next, failing the match at the current
2779    starting position, but allowing an advance to the next character (for an
2780    unanchored pattern). (*SKIP) is similar, except that the advance may be more
2781    than one character. (*COMMIT) is the strongest, causing the entire match to
2782    fail.
2783    .P
2784    If more than one such verb is present in a pattern, the "strongest" one wins.
2785    For example, consider this pattern, where A, B, etc. are complex pattern
2786    fragments:
2787  .sp  .sp
2788    (A(*COMMIT)B(*THEN)C|D)    (A(*COMMIT)B(*THEN)C|D)
2789  .sp  .sp
2790  Once A has matched, PCRE is committed to this match, at the current starting  Once A has matched, PCRE is committed to this match, at the current starting
2791  position. If subsequently B matches, but C does not, the normal (*THEN) action  position. If subsequently B matches, but C does not, the normal (*THEN) action
2792  of trying the next alternation (that is, D) does not happen because (*COMMIT)  of trying the next alternative (that is, D) does not happen because (*COMMIT)
2793  overrides.  overrides.
2794  .  .
2795  .  .
# Line 2775  Cambridge CB2 3QH, England. Line 2814  Cambridge CB2 3QH, England.
2814  .rs  .rs
2815  .sp  .sp
2816  .nf  .nf
2817  Last updated: 24 August 2011  Last updated: 04 October 2011
2818  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2819  .fi  .fi

