Diff of /code/trunk/doc/pcrepattern.3

revision 733 by ph10, Tue Oct 11 10:29:36 2011 UTC | revision 771 by ph10, Tue Nov 29 15:34:12 2011 UTC
# Line 241  one of the following escape sequences th Line 241  one of the following escape sequences th
241    \et        tab (hex 09)    \et        tab (hex 09)
242    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
243    \exhh      character with hex code hh    \exhh      character with hex code hh
244    \ex{hhh..} character with hex code hhh..    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
245      \euhhhh    character with hex code hhhh (JavaScript mode only)
246  .sp  .sp
247  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx is as follows: if x is a lower case letter, it
248  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
# Line 252  both byte mode and UTF-8 mode. (When PCR Line 253  both byte mode and UTF-8 mode. (When PCR
253  values are valid. A lower case letter is converted to upper case, and then the  values are valid. A lower case letter is converted to upper case, and then the
254  0xc0 bits are flipped.)  0xc0 bits are flipped.)
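
The ASCII rule for \ecx quoted above (convert a lower case letter to upper case, then invert hex 40) can be checked with a few lines of Python. This is only an illustrative sketch: the function name control_escape is hypothetical, and the EBCDIC variant (which flips the 0xc0 bits) is not modelled.

```python
def control_escape(x: str) -> int:
    """Return the character code produced by \\cx under the ASCII rule.

    Hypothetical helper for illustration; it is not part of PCRE.
    """
    code = ord(x)
    if "a" <= x <= "z":
        # A lower case letter is converted to upper case first.
        code = ord(x.upper())
    # Then bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


print(hex(control_escape("z")))  # 0x1a
print(hex(control_escape("{")))  # 0x3b
print(hex(control_escape(";")))  # 0x7b
```

Because the bit is inverted rather than cleared, a character below hex 40 such as ";" maps upwards (to hex 7B) instead of to a control character.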
255  .P  .P
256  After \ex, from zero to two hexadecimal digits are read (letters can be in  By default, after \ex, from zero to two hexadecimal digits are read (letters
257  upper or lower case). Any number of hexadecimal digits may appear between \ex{  can be in upper or lower case). Any number of hexadecimal digits may appear
258  and }, but the value of the character code must be less than 256 in non-UTF-8  between \ex{ and }, but the value of the character code must be less than 256
259  mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in  in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
260  hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code  value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
261  point, which is 10FFFF.  Unicode code point, which is 10FFFF.
262  .P  .P
263  If characters other than hexadecimal digits appear between \ex{ and }, or if  If characters other than hexadecimal digits appear between \ex{ and }, or if
264  there is no terminating }, this form of escape is not recognized. Instead, the  there is no terminating }, this form of escape is not recognized. Instead, the
265  initial \ex will be interpreted as a basic hexadecimal escape, with no  initial \ex will be interpreted as a basic hexadecimal escape, with no
266  following digits, giving a character whose value is zero.  following digits, giving a character whose value is zero.
267  .P  .P
268    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
269    as just described only when it is followed by two hexadecimal digits.
270    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
271    code points greater than 256 is provided by \eu, which must be followed by
272    four hexadecimal digits; otherwise it matches a literal "u" character.
273    .P
274  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
275  syntaxes for \ex. There is no difference in the way they are handled. For  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
276  example, \exdc is exactly the same as \ex{dc}.  way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
277    \eu00dc in JavaScript mode).
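
The \ex and \eu rules in the new text can be summarised as a small decoder. The sketch below is a Python model of the rules as worded in this revision, not PCRE's implementation: the function name parse_x_or_u is hypothetical, the limits on the character value (256 in non-UTF-8 mode, 2**31 in UTF-8 mode) are not enforced, and requiring at least one digit inside \ex{...} is an assumption, since the text does not say what empty braces do.

```python
import re

HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_x_or_u(rest: str, javascript_compat: bool = False):
    """Decode an escape whose text after the backslash is `rest`.

    Returns (character_code, characters_consumed_after_the_backslash).
    Hypothetical helper that models the documented rules only.
    """
    kind = rest[0]
    if javascript_compat:
        # JavaScript mode: \x needs exactly two hex digits and \u exactly
        # four; otherwise the escape matches a literal "x" or "u".
        width = {"x": 2, "u": 4}[kind]
        digits = rest[1:1 + width]
        if len(digits) == width and all(c in HEX_DIGITS for c in digits):
            return int(digits, 16), 1 + width
        return ord(kind), 1
    if kind != "x":
        raise ValueError("only \\x is modelled outside JavaScript mode")
    # Default mode: \x{hhh..} is recognized only if the braces contain
    # nothing but hex digits and the closing brace is present.
    braced = re.match(r"x\{([0-9a-fA-F]+)\}", rest)
    if braced:
        return int(braced.group(1), 16), braced.end()
    # Otherwise zero to two hex digits are read; with none, the value is 0.
    basic = re.match(r"x([0-9a-fA-F]{0,2})", rest)
    digits = basic.group(1)
    return (int(digits, 16) if digits else 0), basic.end()


print(parse_x_or_u("xdc"))          # (220, 3)
print(parse_x_or_u("x{dc}"))        # (220, 5)
print(parse_x_or_u("x{dc"))         # (0, 1)
print(parse_x_or_u("u00dc", True))  # (220, 5)
print(parse_x_or_u("x4", True))     # (120, 1)
```

The third call shows the fallback described above: with no terminating }, the initial \ex is treated as a basic escape with no digits, giving the value zero.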
278  .P  .P
279  After \e0 up to two further octal digits are read. If there are fewer than two  After \e0 up to two further octal digits are read. If there are fewer than two
280  digits, just those that are present are used. Thus the sequence \e0\ex\e07  digits, just those that are present are used. Thus the sequence \e0\ex\e07
# Line 320  Note that octal values of 100 or greater Line 328  Note that octal values of 100 or greater
328  zero, because no more than three octal digits are ever read.  zero, because no more than three octal digits are ever read.
329  .P  .P
330  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
331  and outside character classes. In addition, inside a character class, the  and outside character classes. In addition, inside a character class, \eb is
332  sequence \eb is interpreted as the backspace character (hex 08). The sequences  interpreted as the backspace character (hex 08).
333  \eB, \eN, \eR, and \eX are not special inside a character class. Like any other  .P
334  unrecognized escape sequences, they are treated as the literal characters "B",  \eN is not allowed in a character class. \eB, \eR, and \eX are not special
335  "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is  inside a character class. Like other unrecognized escape sequences, they are
336  set. Outside a character class, these sequences have different meanings.  treated as the literal characters "B", "R", and "X" by default, but cause an
337    error if the PCRE_EXTRA option is set. Outside a character class, these
338    sequences have different meanings.
339    .
340    .
341    .SS "Unsupported escape sequences"
342    .rs
343    .sp
344    In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
345    handler and used to modify the case of following characters. By default, PCRE
346    does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
347    option is set, \eU matches a "U" character, and \eu can be used to define a
348    character by code point, as described in the previous section.
349  .  .
350  .  .
351  .SS "Absolute and relative back references"  .SS "Absolute and relative back references"
# Line 387  This is the same as Line 407  This is the same as
407  .\" </a>  .\" </a>
408  the "." metacharacter  the "." metacharacter
409  .\"  .\"
410  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
411    PCRE does not support this.
412  .P  .P
413  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
414  of characters into two disjoint sets. Any given character matches one, and only  of characters into two disjoint sets. Any given character matches one, and only
# Line 964  special meaning in a character class. Line 985  special meaning in a character class.
985  .P  .P
986  The escape sequence \eN behaves like a dot, except that it is not affected by  The escape sequence \eN behaves like a dot, except that it is not affected by
987  the PCRE_DOTALL option. In other words, it matches any character except one  the PCRE_DOTALL option. In other words, it matches any character except one
988  that signifies the end of a line.  that signifies the end of a line. Perl also uses \eN to match characters by
989    name; PCRE does not support this.
990  .  .
991  .  .
992  .SH "MATCHING A SINGLE BYTE"  .SH "MATCHING A SINGLE BYTE"
993  .rs  .rs
994  .sp  .sp
995  Outside a character class, the escape sequence \eC matches any one byte, both  Outside a character class, the escape sequence \eC matches any one byte, both
996  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
997  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
998  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the  in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
999  rest of the string may start with a malformed UTF-8 character. For this reason,  breaks up characters into individual bytes, matching one byte with \eC in UTF-8
1000  the \eC escape sequence is best avoided.  mode means that the rest of the string may start with a malformed UTF-8
1001    character. This has undefined results, because PCRE assumes that it is dealing
1002    with valid UTF-8 strings (and by default it checks this at the start of
1003    processing unless the PCRE_NO_UTF8_CHECK option is used).
1004  .P  .P
1005  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
1006  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
1007  .\" </a>  .\" </a>
1008  (described below),  (described below)
1009  .\"  .\"
1010  because in UTF-8 mode this would make it impossible to calculate the length of  in UTF-8 mode, because this would make it impossible to calculate the length of
1011  the lookbehind.  the lookbehind.
1012    .P
1013    In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
1014    way of using it that avoids the problem of malformed UTF-8 characters is to
1015    use a lookahead to check the length of the next character, as in this pattern
1016    (ignore white space and line breaks):
1017    .sp
1018      (?| (?=[\ex00-\ex7f])(\eC) |
1019          (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1020          (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1021          (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1022    .sp
1023    A group that starts with (?| resets the capturing parentheses numbers in each
1024    alternative (see
1025    .\" HTML <a href="#dupsubpatternnumber">
1026    .\" </a>
1027    "Duplicate Subpattern Numbers"
1028    .\"
1029    below). The assertions at the start of each branch check the next UTF-8
1030    character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1031    character's individual bytes are then captured by the appropriate number of
1032    groups.
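
The four lookahead ranges in the pattern above correspond to the code point ranges whose UTF-8 encodings are 1, 2, 3, and 4 bytes long. The Python sketch below checks that correspondence against Python's own encoder; it does not use PCRE, and the names utf8_length and split_bytes are hypothetical. Python only encodes code points up to 10FFFF, so the upper part of the last range (up to 1FFFFF) is covered by the range test alone.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes in the UTF-8 encoding, using the pattern's ranges."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("outside the ranges used in the pattern")


def split_bytes(character: str) -> list:
    """The individual bytes that the \\C groups would capture."""
    return list(character.encode("utf-8"))


# Boundary values of each range (surrogates D800-DFFF are avoided
# because they cannot be encoded as UTF-8).
for cp in (0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))

print(split_bytes("\u00e9"))  # [195, 169]
print(split_bytes("\u20ac"))  # [226, 130, 172]
```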
1033  .  .
1034  .  .
1035  .\" HTML <a name="characterclass"></a>  .\" HTML <a name="characterclass"></a>
# Line 1926  temporarily move the current position ba Line 1972  temporarily move the current position ba
1972  match. If there are insufficient characters before the current position, the  match. If there are insufficient characters before the current position, the
1973  assertion fails.  assertion fails.
1974  .P  .P
1975  PCRE does not allow the \eC escape (which matches a single byte in UTF-8 mode)  In UTF-8 mode, PCRE does not allow the \eC escape (which matches a single byte,
1976  to appear in lookbehind assertions, because it makes it impossible to calculate  even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
1977  the length of the lookbehind. The \eX and \eR escapes, which can match  impossible to calculate the length of the lookbehind. The \eX and \eR escapes,
1978  different numbers of bytes, are also not permitted.  which can match different numbers of bytes, are also not permitted.
1979  .P  .P
1980  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
1981  .\" </a>  .\" </a>
# Line 2523  failing negative assertion, they cause a Line 2569  failing negative assertion, they cause a
2569  If any of these verbs are used in an assertion or in a subpattern that is  If any of these verbs are used in an assertion or in a subpattern that is
2570  called as a subroutine (whether or not recursively), their effect is confined  called as a subroutine (whether or not recursively), their effect is confined
2571  to that subpattern; it does not extend to the surrounding pattern, with one  to that subpattern; it does not extend to the surrounding pattern, with one
2572  exception: a *MARK that is encountered in a positive assertion \fIis\fP passed  exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in
2573  back (compare capturing parentheses in assertions). Note that such subpatterns  a successful positive assertion \fIis\fP passed back when a match succeeds
2574  are processed as anchored at the point where they are tested. Note also that  (compare capturing parentheses in assertions). Note that such subpatterns are
2575  Perl's treatment of subroutines is different in some cases.  processed as anchored at the point where they are tested. Note also that Perl's
2576    treatment of subroutines is different in some cases.
2577  .P  .P
2578  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2579  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
# Line 2545  included backtracking verbs will not, of Line 2592  included backtracking verbs will not, of
2592  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
2593  when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the  when calling \fBpcre_compile()\fP or \fBpcre_exec()\fP, or by starting the
2594  pattern with (*NO_START_OPT).  pattern with (*NO_START_OPT).
2595    .P
2596    Experiments with Perl suggest that it too has similar optimizations, sometimes
2597    leading to anomalous results.
2598  .  .
2599  .  .
2600  .SS "Verbs that act immediately"  .SS "Verbs that act immediately"
# Line 2592  starting point (see (*SKIP) below). Line 2642  starting point (see (*SKIP) below).
2642  A name is always required with this verb. There may be as many instances of  A name is always required with this verb. There may be as many instances of
2643  (*MARK) as you like in a pattern, and their names do not have to be unique.  (*MARK) as you like in a pattern, and their names do not have to be unique.
2644  .P  .P
2645  When a match succeeds, the name of the last-encountered (*MARK) is passed back  When a match succeeds, the name of the last-encountered (*MARK) on the matching
2646  to the caller via the \fIpcre_extra\fP data structure, as described in the  path is passed back to the caller via the \fIpcre_extra\fP data structure, as
2647    described in the
2648  .\" HTML <a href="pcreapi.html#extradata">  .\" HTML <a href="pcreapi.html#extradata">
2649  .\" </a>  .\" </a>
2650  section on \fIpcre_extra\fP  section on \fIpcre_extra\fP
# Line 2602  in the Line 2653  in the
2653  .\" HREF  .\" HREF
2654  \fBpcreapi\fP  \fBpcreapi\fP
2655  .\"  .\"
2656  documentation. No data is returned for a partial match. Here is an example of  documentation. Here is an example of \fBpcretest\fP output, where the /K
2657  \fBpcretest\fP output, where the /K modifier requests the retrieval and  modifier requests the retrieval and outputting of (*MARK) data:
 outputting of (*MARK) data:  
2658  .sp  .sp
2659    /X(*MARK:A)Y|X(*MARK:B)Z/K      re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2660    XY    data> XY
2661     0: XY     0: XY
2662    MK: A    MK: A
2663    XZ    XZ
# Line 2623  If (*MARK) is encountered in a positive Line 2673  If (*MARK) is encountered in a positive
2673  passed back if it is the last-encountered. This does not happen for negative  passed back if it is the last-encountered. This does not happen for negative
2674  assertions.  assertions.
2675  .P  .P
2676  A name may also be returned after a failed match if the final path through the  After a partial match or a failed match, the name of the last encountered
2677  pattern involves (*MARK). However, unless (*MARK) used in conjunction with  (*MARK) in the entire match process is returned. For example:
 (*COMMIT), this is unlikely to happen for an unanchored pattern because, as the  
 starting point for matching is advanced, the final check is often with an empty  
 string, causing a failure before (*MARK) is reached. For example:  
 .sp  
   /X(*MARK:A)Y|X(*MARK:B)Z/K  
   XP  
   No match  
 .sp  
 There are three potential starting points for this match (starting with X,  
 starting with P, and with an empty string). If the pattern is anchored, the  
 result is different:  
2678  .sp  .sp
2679    /^X(*MARK:A)Y|^X(*MARK:B)Z/K      re> /X(*MARK:A)Y|X(*MARK:B)Z/K
2680    XP    data> XP
2681    No match, mark = B    No match, mark = B
2682  .sp  .sp
2683  PCRE's start-of-match optimizations can also interfere with this. For example,  Note that in this unanchored example the mark is retained from the match
2684  if, as a result of a call to \fBpcre_study()\fP, it knows the minimum  attempt that started at the letter "X". Subsequent match attempts starting at
2685  subject length for a match, a shorter subject will not be scanned at all.  "P" and then with an empty string do not get as far as the (*MARK) item, but
2686  .P  nevertheless do not reset it.
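
The revised rule (the last mark on the matching path after success; the last mark seen anywhere after failure) can be modelled with a toy matcher. This is a Python sketch under strong assumptions, not PCRE: it handles only alternatives made of literal characters and marks, it ignores start-of-match optimizations, and the name match_with_marks and the tuple representation of (*MARK:NAME) are hypothetical.

```python
def match_with_marks(alternatives, subject):
    """Toy model of (*MARK) reporting for an unanchored alternation.

    Each alternative is a list whose items are single literal characters
    or ("MARK", name) tuples. Returns (matched_text_or_None, mark_name).
    """
    last_mark_anywhere = None
    # Try every starting point, including the empty string at the end.
    for start in range(len(subject) + 1):
        for alternative in alternatives:
            position = start
            path_mark = None
            for item in alternative:
                if isinstance(item, tuple):
                    # Passing a mark records it; a later failure does
                    # not reset the "seen anywhere" value.
                    path_mark = item[1]
                    last_mark_anywhere = item[1]
                elif position < len(subject) and subject[position] == item:
                    position += 1
                else:
                    break
            else:
                # Success: report the mark on the matching path.
                return subject[start:position], path_mark
    # Failure: report the last mark met in the whole match process.
    return None, last_mark_anywhere


# Models /X(*MARK:A)Y|X(*MARK:B)Z/K
pattern = [["X", ("MARK", "A"), "Y"], ["X", ("MARK", "B"), "Z"]]
print(match_with_marks(pattern, "XY"))  # ('XY', 'A')
print(match_with_marks(pattern, "XZ"))  # ('XZ', 'B')
print(match_with_marks(pattern, "XP"))  # (None, 'B')
```

For the subject XP, the attempts that start at "P" and at the empty string never reach a mark, so the value B recorded during the first attempt is what is reported.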
 Note that similar anomalies (though different in detail) exist in Perl, no  
 doubt for the same reasons. The use of (*MARK) data after a failed match of an  
 unanchored pattern is not recommended, unless (*COMMIT) is involved.  
2687  .  .
2688  .  .
2689  .SS "Verbs that act after backtracking"  .SS "Verbs that act after backtracking"
# Line 2684  Note that (*COMMIT) at the start of a pa Line 2720  Note that (*COMMIT) at the start of a pa
2720  unless PCRE's start-of-match optimizations are turned off, as shown in this  unless PCRE's start-of-match optimizations are turned off, as shown in this
2721  \fBpcretest\fP example:  \fBpcretest\fP example:
2722  .sp  .sp
2723    /(*COMMIT)abc/      re> /(*COMMIT)abc/
2724    xyzabc    data> xyzabc
2725     0: abc     0: abc
2726    xyzabc\eY    xyzabc\eY
2727    No match    No match
# Line 2706  reached, or when matching to the right o Line 2742  reached, or when matching to the right o
2742  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2743  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,
2744  but there are some uses of (*PRUNE) that cannot be expressed in any other way.  but there are some uses of (*PRUNE) that cannot be expressed in any other way.
2745  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
2746  match fails completely; the name is passed back if this is the final attempt.  anchored pattern (*PRUNE) has the same effect as (*COMMIT).
 (*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored  
 pattern (*PRUNE) has the same effect as (*COMMIT).  
2747  .sp  .sp
2748    (*SKIP)    (*SKIP)
2749  .sp  .sp
# Line 2735  following pattern fails to match, the pr Line 2769  following pattern fails to match, the pr
2769  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2770  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2771  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2772  matching name is found, normal "bumpalong" of one character happens (that is,  matching name is found, the (*SKIP) is ignored.
 the (*SKIP) is ignored).  
2773  .sp  .sp
2774    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2775  .sp  .sp
# Line 2750  be used for a pattern-based if-then-else Line 2783  be used for a pattern-based if-then-else
2783  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2784  the end of the group if FOO succeeds); on failure, the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2785  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2786  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
2787  overall match fails. If (*THEN) is not inside an alternation, it acts like  If (*THEN) is not inside an alternation, it acts like (*PRUNE).
 (*PRUNE).  
2788  .P  .P
2789  Note that a subpattern that does not contain a | character is just a part of  Note that a subpattern that does not contain a | character is just a part of
2790  the enclosing alternative; it is not a nested alternation with only one  the enclosing alternative; it is not a nested alternation with only one
# Line 2830  Cambridge CB2 3QH, England. Line 2862  Cambridge CB2 3QH, England.
2862  .rs  .rs
2863  .sp  .sp
2864  .nf  .nf
2865  Last updated: 09 October 2011  Last updated: 29 November 2011
2866  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2867  .fi  .fi
