Diff of /code/trunk/doc/pcrepattern.3

revision 835 by ph10, Wed Dec 28 16:10:09 2011 UTC revision 836 by ph10, Wed Dec 28 17:16:11 2011 UTC
# Line 242  one of the following escape sequences th Line 242  one of the following escape sequences th
242    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
243    \exhh      character with hex code hh    \exhh      character with hex code hh
244    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
245    \euhhhh    character with hex code hhhh (JavaScript mode only)    \euhhhh    character with hex code hhhh (JavaScript mode only)
246  .sp  .sp
247  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx is as follows: if x is a lower case letter, it
248  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
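
The \ecx rule quoted above (written \cx in an actual pattern) can be checked with a few lines of Python. This is only an illustrative sketch, not PCRE's code: the function name `control_escape` is hypothetical, and restricting the input to a single ASCII character is an assumption made here rather than something stated in this hunk.

```python
def control_escape(x):
    # Hypothetical helper: code point denoted by \cx under the rule above.
    # Assumption: x is exactly one ASCII character.
    if len(x) != 1 or ord(x) > 0x7F:
        raise ValueError("expected a single ASCII character")
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return ord(x) ^ 0x40


if __name__ == "__main__":
    for ch in ("z", "Z", "{", ";"):
        print(repr(ch), "->", hex(control_escape(ch)))
```

Because the case conversion happens first, "z" and "Z" give the same result, while non-letters such as "{" and ";" simply swap with each other.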
# Line 265  there is no terminating }, this form of Line 265  there is no terminating }, this form of
265  initial \ex will be interpreted as a basic hexadecimal escape, with no  initial \ex will be interpreted as a basic hexadecimal escape, with no
266  following digits, giving a character whose value is zero.  following digits, giving a character whose value is zero.
267  .P  .P
268  If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is  If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
269  as just described only when it is followed by two hexadecimal digits.  as just described only when it is followed by two hexadecimal digits.
270  Otherwise, it matches a literal "x" character. In JavaScript mode, support for  Otherwise, it matches a literal "x" character. In JavaScript mode, support for
271  code points greater than 256 is provided by \eu, which must be followed by  code points greater than 256 is provided by \eu, which must be followed by
272  four hexadecimal digits; otherwise it matches a literal "u" character.  four hexadecimal digits; otherwise it matches a literal "u" character.
273  .P  .P
274  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
275  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
276  way they are handled. For example, \exdc is exactly the same as \ex{dc} (or  way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
277  \eu00dc in JavaScript mode).  \eu00dc in JavaScript mode).
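
As a rough illustration of the JavaScript-mode rules in the two paragraphs above, the following Python sketch decides how the text after a backslash is read. It is a model of the documented behaviour only: the name `js_mode_escape` and its return convention are hypothetical, and it does not call PCRE.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def js_mode_escape(text):
    # Hypothetical helper. `text` is what follows the backslash and must
    # start with "x" or "u". Returns (character, characters consumed).
    lead = text[:1]
    width = {"x": 2, "u": 4}.get(lead)
    if width is None:
        raise ValueError("only the x and u escapes are modelled here")
    digits = text[1:1 + width]
    # x needs exactly two hex digits and u exactly four in JavaScript mode.
    if len(digits) == width and all(c in HEX_DIGITS for c in digits):
        return chr(int(digits, 16)), 1 + width
    # Otherwise the escape matches a literal "x" or "u" character.
    return lead, 1


if __name__ == "__main__":
    for sample in ("xdc", "u00dc", "x{dc}", "u12"):
        print(sample, "->", js_mode_escape(sample))
```

Under this model "xdc" and "u00dc" both produce the character with code hex dc, whereas "x{dc}" is read as a literal "x" because the brace form is not recognized in JavaScript mode.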
278  .P  .P
279  After \e0 up to two further octal digits are read. If there are fewer than two  After \e0 up to two further octal digits are read. If there are fewer than two
# Line 328  Note that octal values of 100 or greater Line 328  Note that octal values of 100 or greater
328  zero, because no more than three octal digits are ever read.  zero, because no more than three octal digits are ever read.
329  .P  .P
330  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
331  and outside character classes. In addition, inside a character class, the  and outside character classes. In addition, inside a character class, \eb is
332  sequence \eb is interpreted as the backspace character (hex 08). The sequences  interpreted as the backspace character (hex 08).
333  \eB, \eN, \eR, and \eX are not special inside a character class. Like any other  .P
334  unrecognized escape sequences, they are treated as the literal characters "B",  \eN is not allowed in a character class. \eB, \eR, and \eX are not special
335  "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is  inside a character class. Like other unrecognized escape sequences, they are
336  set. Outside a character class, these sequences have different meanings.  treated as the literal characters "B", "R", and "X" by default, but cause an
337    error if the PCRE_EXTRA option is set. Outside a character class, these
338    sequences have different meanings.
339  .  .
340  .  .
341  .SS "Unsupported escape sequences"  .SS "Unsupported escape sequences"
# Line 405  This is the same as Line 407  This is the same as
407  .\" </a>  .\" </a>
408  the "." metacharacter  the "." metacharacter
409  .\"  .\"
410  when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;  when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
411  PCRE does not support this.  PCRE does not support this.
412  .P  .P
413  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
# Line 2567  failing negative assertion, they cause a Line 2569  failing negative assertion, they cause a
2569  If any of these verbs are used in an assertion or in a subpattern that is  If any of these verbs are used in an assertion or in a subpattern that is
2570  called as a subroutine (whether or not recursively), their effect is confined  called as a subroutine (whether or not recursively), their effect is confined
2571  to that subpattern; it does not extend to the surrounding pattern, with one  to that subpattern; it does not extend to the surrounding pattern, with one
2572  exception: a *MARK that is encountered in a positive assertion \fIis\fP passed  exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in
2573  back (compare capturing parentheses in assertions). Note that such subpatterns  a successful positive assertion \fIis\fP passed back when a match succeeds
2574  are processed as anchored at the point where they are tested. Note also that  (compare capturing parentheses in assertions). Note that such subpatterns are
2575  Perl's treatment of subroutines is different in some cases.  processed as anchored at the point where they are tested. Note also that Perl's
2576    treatment of subroutines is different in some cases.
2577  .P  .P
2578  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2579  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
# Line 2589  included backtracking verbs will not, of Line 2592  included backtracking verbs will not, of
2592  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
2593  when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the  when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the
2594  pattern with (*NO_START_OPT).  pattern with (*NO_START_OPT).
2595    .P
2596    Experiments with Perl suggest that it too has similar optimizations, sometimes
2597    leading to anomalous results.
2598  .  .
2599  .  .
2600  .SS "Verbs that act immediately"  .SS "Verbs that act immediately"
# Line 2636  starting point (see (*SKIP) below). Line 2642  starting point (see (*SKIP) below).
2642  A name is always required with this verb. There may be as many instances of  A name is always required with this verb. There may be as many instances of
2643  (*MARK) as you like in a pattern, and their names do not have to be unique.  (*MARK) as you like in a pattern, and their names do not have to be unique.
2644  .P  .P
2645  When a match succeeds, the name of the last-encountered (*MARK) is passed back  When a match succeeds, the name of the last-encountered (*MARK) on the matching
2646  to the caller via the \fIpcre_extra\fP data structure, as described in the  path is passed back to the caller via the \fIpcre_extra\fP data structure, as
2647    described in the
2648  .\" HTML <a href="pcreapi.html#extradata">  .\" HTML <a href="pcreapi.html#extradata">
2649  .\" </a>  .\" </a>
2650  section on \fIpcre_extra\fP  section on \fIpcre_extra\fP
# Line 2646  in the Line 2653  in the
2653  .\" HREF  .\" HREF
2654  \fBpcreapi\fP  \fBpcreapi\fP
2655  .\"  .\"
2656  documentation. No data is returned for a partial match. Here is an example of  documentation. Here is an example of \fBpcretest\fP output, where the /K
2657  \fBpcretest\fP output, where the /K modifier requests the retrieval and  modifier requests the retrieval and outputting of (*MARK) data:
 outputting of (*MARK) data:  
2658  .sp  .sp
2659    /X(*MARK:A)Y|X(*MARK:B)Z/K      re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2660    XY    data> XY
2661     0: XY     0: XY
2662    MK: A    MK: A
2663    XZ    XZ
# Line 2667  If (*MARK) is encountered in a positive Line 2673  If (*MARK) is encountered in a positive
2673  passed back if it is the last-encountered. This does not happen for negative  passed back if it is the last-encountered. This does not happen for negative
2674  assertions.  assertions.
2675  .P  .P
2676  A name may also be returned after a failed match if the final path through the  After a partial match or a failed match, the name of the last encountered
2677  pattern involves (*MARK). However, unless (*MARK) is used in conjunction with  (*MARK) in the entire match process is returned. For example:
 (*COMMIT), this is unlikely to happen for an unanchored pattern because, as the  
 starting point for matching is advanced, the final check is often with an empty  
 string, causing a failure before (*MARK) is reached. For example:  
 .sp  
   /X(*MARK:A)Y|X(*MARK:B)Z/K  
   XP  
   No match  
 .sp  
 There are three potential starting points for this match (starting with X,  
 starting with P, and with an empty string). If the pattern is anchored, the  
 result is different:  
2678  .sp  .sp
2679    /^X(*MARK:A)Y|^X(*MARK:B)Z/K      re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2680    XP    data> XP
2681    No match, mark = B    No match, mark = B
2682  .sp  .sp
2683  PCRE's start-of-match optimizations can also interfere with this. For example,  Note that in this unanchored example the mark is retained from the match
2684  if, as a result of a call to \fBpcre_study()\fP, it knows the minimum  attempt that started at the letter "X". Subsequent match attempts starting at
2685  subject length for a match, a shorter subject will not be scanned at all.  "P" and then with an empty string do not get as far as the (*MARK) item, but
2686  .P  nevertheless do not reset it.
 Note that similar anomalies (though different in detail) exist in Perl, no  
 doubt for the same reasons. The use of (*MARK) data after a failed match of an  
 unanchored pattern is not recommended, unless (*COMMIT) is involved.  
2687  .  .
2688  .  .
2689  .SS "Verbs that act after backtracking"  .SS "Verbs that act after backtracking"
# Line 2728  Note that (*COMMIT) at the start of a pa Line 2720  Note that (*COMMIT) at the start of a pa
2720  unless PCRE's start-of-match optimizations are turned off, as shown in this  unless PCRE's start-of-match optimizations are turned off, as shown in this
2721  \fBpcretest\fP example:  \fBpcretest\fP example:
2722  .sp  .sp
2723    /(*COMMIT)abc/      re> /(*COMMIT)abc/
2724    xyzabc    data> xyzabc
2725     0: abc     0: abc
2726    xyzabc\eY    xyzabc\eY
2727    No match    No match
# Line 2750  reached, or when matching to the right o Line 2742  reached, or when matching to the right o
2742  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2743  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,
2744  but there are some uses of (*PRUNE) that cannot be expressed in any other way.  but there are some uses of (*PRUNE) that cannot be expressed in any other way.
2745  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
2746  match fails completely; the name is passed back if this is the final attempt.  anchored pattern (*PRUNE) has the same effect as (*COMMIT).
 (*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored  
 pattern (*PRUNE) has the same effect as (*COMMIT).  
2747  .sp  .sp
2748    (*SKIP)    (*SKIP)
2749  .sp  .sp
# Line 2779  following pattern fails to match, the pr Line 2769  following pattern fails to match, the pr
2769  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2770  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2771  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2772  matching name is found, normal "bumpalong" of one character happens (that is,  matching name is found, the (*SKIP) is ignored.
 the (*SKIP) is ignored).  
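
The "bumpalong" choice described above can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions, not PCRE's implementation: the function `next_start_after_skip`, its parameters, and the representation of marks as (name, subject position) pairs in order of encounter are all hypothetical.

```python
def next_start_after_skip(start, skip_pos, skip_name, marks):
    # Hypothetical model of where the next match attempt begins.
    #   start     - subject position where the failed attempt began
    #   skip_pos  - subject position where (*SKIP) was encountered
    #   skip_name - NAME from (*SKIP:NAME), or None for a plain (*SKIP)
    #   marks     - (name, position) pairs for each (*MARK) on the path
    if skip_name is None:
        # Plain (*SKIP): advance to where (*SKIP) was encountered.
        return skip_pos
    # Search for the most recent (*MARK) with the same name.
    for name, pos in reversed(marks):
        if name == skip_name:
            return pos
    # No matching (*MARK): the (*SKIP) is ignored, so the normal
    # one-character bumpalong applies.
    return start + 1


if __name__ == "__main__":
    path_marks = [("A", 1), ("B", 2), ("A", 3)]
    print(next_start_after_skip(0, 4, None, path_marks))
    print(next_start_after_skip(0, 4, "A", path_marks))
    print(next_start_after_skip(0, 4, "C", path_marks))
```

The third call shows the case reworded in this revision: with no (*MARK) named "C" on the path, the result is the same as if (*SKIP:C) had not been there.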
2773  .sp  .sp
2774    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2775  .sp  .sp
# Line 2794  be used for a pattern-based if-then-else Line 2783  be used for a pattern-based if-then-else
2783  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2784  the end of the group if FOO succeeds); on failure, the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2785  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2786  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
2787  overall match fails. If (*THEN) is not inside an alternation, it acts like  If (*THEN) is not inside an alternation, it acts like (*PRUNE).
 (*PRUNE).  
2788  .P  .P
2789  Note that a subpattern that does not contain a | character is just a part of  Note that a subpattern that does not contain a | character is just a part of
2790  the enclosing alternative; it is not a nested alternation with only one  the enclosing alternative; it is not a nested alternation with only one
# Line 2874  Cambridge CB2 3QH, England. Line 2862  Cambridge CB2 3QH, England.
2862  .rs  .rs
2863  .sp  .sp
2864  .nf  .nf
2865  Last updated: 19 November 2011  Last updated: 29 November 2011
2866  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2867  .fi  .fi
