Diff of /code/trunk/doc/html/pcrepattern.html

revision 783 by ph10, Fri Oct 21 09:04:01 2011 UTC revision 784 by ph10, Mon Dec 5 12:33:44 2011 UTC
# Line 268  one of the following escape sequences th Line 268  one of the following escape sequences th
268    \t        tab (hex 09)    \t        tab (hex 09)
269    \ddd      character with octal code ddd, or back reference    \ddd      character with octal code ddd, or back reference
270    \xhh      character with hex code hh    \xhh      character with hex code hh
271    \x{hhh..} character with hex code hhh..    \x{hhh..} character with hex code hhh.. (non-JavaScript mode)
272      \uhhhh    character with hex code hhhh (JavaScript mode only)
273  </pre>  </pre>
274  The precise effect of \cx is as follows: if x is a lower case letter, it  The precise effect of \cx is as follows: if x is a lower case letter, it
275  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
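The \cx rule quoted in the context lines above can be sketched in Python. This is a minimal illustration of the two stated steps only (upper-casing, then inverting hex 40), not PCRE's implementation; the function name `control_escape` is hypothetical and the sketch assumes x is a single ASCII character.

```python
def control_escape(x: str) -> str:
    """Return the character that \\cx stands for, per the two steps in the text.

    Hypothetical helper for illustration; assumes x is one ASCII character.
    """
    code = ord(x)
    if not (len(x) == 1 and code < 128):
        raise ValueError("x must be a single ASCII character")
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return chr(code ^ 0x40)
```

With this sketch, `control_escape("z")` gives the character with hex code 1A, and `control_escape(";")` gives "{" (hex 3B becomes hex 7B).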
# Line 280  values are valid. A lower case letter is Line 281  values are valid. A lower case letter is
281  0xc0 bits are flipped.)  0xc0 bits are flipped.)
282  </P>  </P>
283  <P>  <P>
284  After \x, from zero to two hexadecimal digits are read (letters can be in  By default, after \x, from zero to two hexadecimal digits are read (letters
285  upper or lower case). Any number of hexadecimal digits may appear between \x{  can be in upper or lower case). Any number of hexadecimal digits may appear
286  and }, but the value of the character code must be less than 256 in non-UTF-8  between \x{ and }, but the value of the character code must be less than 256
287  mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in  in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
288  hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code  value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
289  point, which is 10FFFF.  Unicode code point, which is 10FFFF.
290  </P>  </P>
291  <P>  <P>
292  If characters other than hexadecimal digits appear between \x{ and }, or if  If characters other than hexadecimal digits appear between \x{ and }, or if
# Line 294  initial \x will be interpreted as a basi Line 295  initial \x will be interpreted as a basi
295  following digits, giving a character whose value is zero.  following digits, giving a character whose value is zero.
296  </P>  </P>
297  <P>  <P>
298    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is
299    as just described only when it is followed by two hexadecimal digits.
300    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
301    code points greater than 256 is provided by \u, which must be followed by
302    four hexadecimal digits; otherwise it matches a literal "u" character.
303    </P>
304    <P>
305  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
306  syntaxes for \x. There is no difference in the way they are handled. For  syntaxes for \x (or by \u in JavaScript mode). There is no difference in the
307  example, \xdc is exactly the same as \x{dc}.  way they are handled. For example, \xdc is exactly the same as \x{dc} (or
308    \u00dc in JavaScript mode).
309  </P>  </P>
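The \x and \u rules added in revision 784 can be sketched as a small decoder. This is an illustrative reading of the paragraphs in this hunk, not PCRE's code: the names `decode_escape`, `_take_hex`, and the `javascript_compat` parameter are hypothetical, an empty \x{} is treated as unrecognized (an assumption the text does not settle), and the range limits (256, or 2**31 in UTF-8 mode) are not checked.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _take_hex(s, start, limit):
    """Return the run of at most `limit` hex digits in s beginning at start."""
    end = start
    while end < len(s) and end - start < limit and s[end] in HEX_DIGITS:
        end += 1
    return s[start:end]


def decode_escape(s, javascript_compat=False):
    r"""Decode the \x or \u escape at the start of s.

    Returns (code_point, characters_consumed). Hypothetical sketch only.
    """
    kind = s[1]
    if javascript_compat:
        if kind not in ("x", "u"):
            raise ValueError("only \\x and \\u are handled in this sketch")
        # JavaScript mode: \x needs exactly two hex digits and \u exactly four;
        # otherwise the escape matches a literal "x" or "u".
        width = 2 if kind == "x" else 4
        digits = _take_hex(s, 2, width)
        if len(digits) == width:
            return int(digits, 16), 2 + width
        return ord(kind), 2
    if kind != "x":
        raise ValueError("\\u is only a code point escape in JavaScript mode")
    if s[2:3] == "{":
        close = s.find("}", 3)
        body = s[3:close] if close != -1 else ""
        if body and all(c in HEX_DIGITS for c in body):
            return int(body, 16), close + 1
        # Brace form not recognized: \x with no digits, giving value zero.
        return 0, 2
    # Default mode: from zero to two hexadecimal digits are read.
    digits = _take_hex(s, 2, 2)
    return (int(digits, 16) if digits else 0), 2 + len(digits)
```

Under these assumptions, `decode_escape(r"\xdc")`, `decode_escape(r"\x{dc}")`, and `decode_escape(r"\u00dc", javascript_compat=True)` all yield code point 0xdc, matching the equivalence stated above, while `decode_escape(r"\xg", javascript_compat=True)` yields the code point of a literal "x".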
310  <P>  <P>
311  After \0 up to two further octal digits are read. If there are fewer than two  After \0 up to two further octal digits are read. If there are fewer than two
# Line 338  zero, because no more than three octal d Line 347  zero, because no more than three octal d
347  </P>  </P>
348  <P>  <P>
349  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
350  and outside character classes. In addition, inside a character class, the  and outside character classes. In addition, inside a character class, \b is
351  sequence \b is interpreted as the backspace character (hex 08). The sequences  interpreted as the backspace character (hex 08).
352  \B, \N, \R, and \X are not special inside a character class. Like any other  </P>
353  unrecognized escape sequences, they are treated as the literal characters "B",  <P>
354  "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is  \N is not allowed in a character class. \B, \R, and \X are not special
355  set. Outside a character class, these sequences have different meanings.  inside a character class. Like other unrecognized escape sequences, they are
356    treated as the literal characters "B", "R", and "X" by default, but cause an
357    error if the PCRE_EXTRA option is set. Outside a character class, these
358    sequences have different meanings.
359    </P>
360    <br><b>
361    Unsupported escape sequences
362    </b><br>
363    <P>
364    In Perl, the sequences \l, \L, \u, and \U are recognized by its string
365    handler and used to modify the case of following characters. By default, PCRE
366    does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
367    option is set, \U matches a "U" character, and \u can be used to define a
368    character by code point, as described in the previous section.
369  </P>  </P>
370  <br><b>  <br><b>
371  Absolute and relative back references  Absolute and relative back references
# Line 389  Another use of backslash is for specifyi Line 411  Another use of backslash is for specifyi
411  There is also the single sequence \N, which matches a non-newline character.  There is also the single sequence \N, which matches a non-newline character.
412  This is the same as  This is the same as
413  <a href="#fullstopdot">the "." metacharacter</a>  <a href="#fullstopdot">the "." metacharacter</a>
414  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set. Perl also uses \N to match characters by name;
415    PCRE does not support this.
416  </P>  </P>
417  <P>  <P>
418  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
# Line 963  special meaning in a character class. Line 986  special meaning in a character class.
986  <P>  <P>
987  The escape sequence \N behaves like a dot, except that it is not affected by  The escape sequence \N behaves like a dot, except that it is not affected by
988  the PCRE_DOTALL option. In other words, it matches any character except one  the PCRE_DOTALL option. In other words, it matches any character except one
989  that signifies the end of a line.  that signifies the end of a line. Perl also uses \N to match characters by
990    name; PCRE does not support this.
991  </P>  </P>
992  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
993  <P>  <P>
# Line 979  processing unless the PCRE_NO_UTF8_CHECK Line 1003  processing unless the PCRE_NO_UTF8_CHECK
1003  </P>  </P>
1004  <P>  <P>
1005  PCRE does not allow \C to appear in lookbehind assertions  PCRE does not allow \C to appear in lookbehind assertions
1006  <a href="#lookbehind">(described below),</a>  <a href="#lookbehind">(described below)</a>
1007  because in UTF-8 mode this would make it impossible to calculate the length of  in UTF-8 mode, because this would make it impossible to calculate the length of
1008  the lookbehind.  the lookbehind.
1009  </P>  </P>
1010  <P>  <P>
# Line 1926  match. If there are insufficient charact Line 1950  match. If there are insufficient charact
1950  assertion fails.  assertion fails.
1951  </P>  </P>
1952  <P>  <P>
1953  PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)  In UTF-8 mode, PCRE does not allow the \C escape (which matches a single byte,
1954  to appear in lookbehind assertions, because it makes it impossible to calculate  even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
1955  the length of the lookbehind. The \X and \R escapes, which can match  impossible to calculate the length of the lookbehind. The \X and \R escapes,
1956  different numbers of bytes, are also not permitted.  which can match different numbers of bytes, are also not permitted.
1957  </P>  </P>
1958  <P>  <P>
1959  <a href="#subpatternsassubroutines">"Subroutine"</a>  <a href="#subpatternsassubroutines">"Subroutine"</a>
# Line 2511  failing negative assertion, they cause a Line 2535  failing negative assertion, they cause a
2535  If any of these verbs are used in an assertion or in a subpattern that is  If any of these verbs are used in an assertion or in a subpattern that is
2536  called as a subroutine (whether or not recursively), their effect is confined  called as a subroutine (whether or not recursively), their effect is confined
2537  to that subpattern; it does not extend to the surrounding pattern, with one  to that subpattern; it does not extend to the surrounding pattern, with one
2538  exception: a *MARK that is encountered in a positive assertion <i>is</i> passed  exception: the name from a (*MARK), (*PRUNE), or (*THEN) that is encountered in
2539  back (compare capturing parentheses in assertions). Note that such subpatterns  a successful positive assertion <i>is</i> passed back when a match succeeds
2540  are processed as anchored at the point where they are tested. Note also that  (compare capturing parentheses in assertions). Note that such subpatterns are
2541  Perl's treatment of subroutines is different in some cases.  processed as anchored at the point where they are tested. Note also that Perl's
2542    treatment of subroutines is different in some cases.
2543  </P>  </P>
2544  <P>  <P>
2545  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
# Line 2536  the start-of-match optimizations by sett Line 2561  the start-of-match optimizations by sett
2561  when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the  when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
2562  pattern with (*NO_START_OPT).  pattern with (*NO_START_OPT).
2563  </P>  </P>
2564    <P>
2565    Experiments with Perl suggest that it too has similar optimizations, sometimes
2566    leading to anomalous results.
2567    </P>
2568  <br><b>  <br><b>
2569  Verbs that act immediately  Verbs that act immediately
2570  </b><br>  </b><br>
# Line 2583  A name is always required with this verb Line 2612  A name is always required with this verb
2612  (*MARK) as you like in a pattern, and their names do not have to be unique.  (*MARK) as you like in a pattern, and their names do not have to be unique.
2613  </P>  </P>
2614  <P>  <P>
2615  When a match succeeds, the name of the last-encountered (*MARK) is passed back  When a match succeeds, the name of the last-encountered (*MARK) on the matching
2616  to the caller via the <i>pcre_extra</i> data structure, as described in the  path is passed back to the caller via the <i>pcre_extra</i> data structure, as
2617    described in the
2618  <a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a>  <a href="pcreapi.html#extradata">section on <i>pcre_extra</i></a>
2619  in the  in the
2620  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
2621  documentation. No data is returned for a partial match. Here is an example of  documentation. Here is an example of <b>pcretest</b> output, where the /K
2622  <b>pcretest</b> output, where the /K modifier requests the retrieval and  modifier requests the retrieval and outputting of (*MARK) data:
 outputting of (*MARK) data:  
2623  <pre>  <pre>
2624    /X(*MARK:A)Y|X(*MARK:B)Z/K      re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K
2625    XY    data&#62; XY
2626     0: XY     0: XY
2627    MK: A    MK: A
2628    XZ    XZ
# Line 2611  passed back if it is the last-encountere Line 2640  passed back if it is the last-encountere
2640  assertions.  assertions.
2641  </P>  </P>
2642  <P>  <P>
2643  A name may also be returned after a failed match if the final path through the  After a partial match or a failed match, the name of the last encountered
2644  pattern involves (*MARK). However, unless (*MARK) used in conjunction with  (*MARK) in the entire match process is returned. For example:
 (*COMMIT), this is unlikely to happen for an unanchored pattern because, as the  
 starting point for matching is advanced, the final check is often with an empty  
 string, causing a failure before (*MARK) is reached. For example:  
2645  <pre>  <pre>
2646    /X(*MARK:A)Y|X(*MARK:B)Z/K      re&#62; /X(*MARK:A)Y|X(*MARK:B)Z/K
2647    XP    data&#62; XP
   No match  
 </pre>  
 There are three potential starting points for this match (starting with X,  
 starting with P, and with an empty string). If the pattern is anchored, the  
 result is different:  
 <pre>  
   /^X(*MARK:A)Y|^X(*MARK:B)Z/K  
   XP  
2648    No match, mark = B    No match, mark = B
2649  </pre>  </pre>
2650  PCRE's start-of-match optimizations can also interfere with this. For example,  Note that in this unanchored example the mark is retained from the match
2651  if, as a result of a call to <b>pcre_study()</b>, it knows the minimum  attempt that started at the letter "X". Subsequent match attempts starting at
2652  subject length for a match, a shorter subject will not be scanned at all.  "P" and then with an empty string do not get as far as the (*MARK) item, but
2653  </P>  nevertheless do not reset it.
 <P>  
 Note that similar anomalies (though different in detail) exist in Perl, no  
 doubt for the same reasons. The use of (*MARK) data after a failed match of an  
 unanchored pattern is not recommended, unless (*COMMIT) is involved.  
2654  </P>  </P>
2655  <br><b>  <br><b>
2656  Verbs that act after backtracking  Verbs that act after backtracking
# Line 2675  Note that (*COMMIT) at the start of a pa Line 2689  Note that (*COMMIT) at the start of a pa
2689  unless PCRE's start-of-match optimizations are turned off, as shown in this  unless PCRE's start-of-match optimizations are turned off, as shown in this
2690  <b>pcretest</b> example:  <b>pcretest</b> example:
2691  <pre>  <pre>
2692    /(*COMMIT)abc/      re&#62; /(*COMMIT)abc/
2693    xyzabc    data&#62; xyzabc
2694     0: abc     0: abc
2695    xyzabc\Y    xyzabc\Y
2696    No match    No match
# Line 2697  reached, or when matching to the right o Line 2711  reached, or when matching to the right o
2711  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of  the right, backtracking cannot cross (*PRUNE). In simple cases, the use of
2712  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,  (*PRUNE) is just an alternative to an atomic group or possessive quantifier,
2713  but there are some uses of (*PRUNE) that cannot be expressed in any other way.  but there are some uses of (*PRUNE) that cannot be expressed in any other way.
2714  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the  The behaviour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an
2715  match fails completely; the name is passed back if this is the final attempt.  anchored pattern (*PRUNE) has the same effect as (*COMMIT).
 (*PRUNE:NAME) does not pass back a name if the match succeeds. In an anchored  
 pattern (*PRUNE) has the same effect as (*COMMIT).  
2716  <pre>  <pre>
2717    (*SKIP)    (*SKIP)
2718  </pre>  </pre>
# Line 2726  following pattern fails to match, the pr Line 2738  following pattern fails to match, the pr
2738  searched for the most recent (*MARK) that has the same name. If one is found,  searched for the most recent (*MARK) that has the same name. If one is found,
2739  the "bumpalong" advance is to the subject position that corresponds to that  the "bumpalong" advance is to the subject position that corresponds to that
2740  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a  (*MARK) instead of to where (*SKIP) was encountered. If no (*MARK) with a
2741  matching name is found, normal "bumpalong" of one character happens (that is,  matching name is found, the (*SKIP) is ignored.
 the (*SKIP) is ignored).  
2742  <pre>  <pre>
2743    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2744  </pre>  </pre>
# Line 2741  be used for a pattern-based if-then-else Line 2752  be used for a pattern-based if-then-else
2752  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2753  the end of the group if FOO succeeds); on failure, the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2754  second alternative and tries COND2, without backtracking into COND1. The  second alternative and tries COND2, without backtracking into COND1. The
2755  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN) if the  behaviour of (*THEN:NAME) is exactly the same as (*MARK:NAME)(*THEN).
2756  overall match fails. If (*THEN) is not inside an alternation, it acts like  If (*THEN) is not inside an alternation, it acts like (*PRUNE).
 (*PRUNE).  
2757  </P>  </P>
2758  <P>  <P>
2759  Note that a subpattern that does not contain a | character is just a part of  Note that a subpattern that does not contain a | character is just a part of
# Line 2819  Cambridge CB2 3QH, England. Line 2829  Cambridge CB2 3QH, England.
2829  </P>  </P>
2830  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2831  <P>  <P>
2832  Last updated: 19 October 2011  Last updated: 29 November 2011
2833  <br>  <br>
2834  Copyright &copy; 1997-2011 University of Cambridge.  Copyright &copy; 1997-2011 University of Cambridge.
2835  <br>  <br>
