Diff of /code/trunk/doc/pcrepattern.3

revision 724 by ph10, Sun Oct 9 16:23:45 2011 UTC revision 745 by ph10, Mon Nov 14 11:41:03 2011 UTC
# Line 241  one of the following escape sequences th Line 241  one of the following escape sequences th
241    \et        tab (hex 09)    \et        tab (hex 09)
242    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
243    \exhh      character with hex code hh    \exhh      character with hex code hh
244    \ex{hhh..} character with hex code hhh..    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
245      \euhhhh    character with hex code hhhh (JavaScript mode only)
246  .sp  .sp
247  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx is as follows: if x is a lower case letter, it
248  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
# Line 252  both byte mode and UTF-8 mode. (When PCR Line 253  both byte mode and UTF-8 mode. (When PCR
253  values are valid. A lower case letter is converted to upper case, and then the  values are valid. A lower case letter is converted to upper case, and then the
254  0xc0 bits are flipped.)  0xc0 bits are flipped.)
255  .P  .P
256  After \ex, from zero to two hexadecimal digits are read (letters can be in  By default, after \ex, from zero to two hexadecimal digits are read (letters
257  upper or lower case). Any number of hexadecimal digits may appear between \ex{  can be in upper or lower case). Any number of hexadecimal digits may appear
258  and }, but the value of the character code must be less than 256 in non-UTF-8  between \ex{ and }, but the value of the character code must be less than 256
259  mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in  in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
260  hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code  value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
261  point, which is 10FFFF.  Unicode code point, which is 10FFFF.
262  .P  .P
263  If characters other than hexadecimal digits appear between \ex{ and }, or if  If characters other than hexadecimal digits appear between \ex{ and }, or if
264  there is no terminating }, this form of escape is not recognized. Instead, the  there is no terminating }, this form of escape is not recognized. Instead, the
265  initial \ex will be interpreted as a basic hexadecimal escape, with no  initial \ex will be interpreted as a basic hexadecimal escape, with no
266  following digits, giving a character whose value is zero.  following digits, giving a character whose value is zero.
267  .P  .P
268    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
269    as just described only when it is followed by two hexadecimal digits.
270    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
271    code points greater than 256 is provided by \eu, which must be followed by
272    four hexadecimal digits; otherwise it matches a literal "u" character.
273    .P
274  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
275  syntaxes for \ex. There is no difference in the way they are handled. For  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
276  example, \exdc is exactly the same as \ex{dc}.  way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
277    \eu00dc in JavaScript mode).
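The \x and \u rules added in this revision can be summarised as a small decision procedure. The Python sketch below is a reading of the text in this hunk, not the PCRE implementation: `parse_hex_escape` and `_take_hex` are hypothetical names, the `utf8` and `javascript` flags stand in for UTF-8 mode and PCRE_JAVASCRIPT_COMPAT, and the handling of empty braces is an assumption because the text does not cover it.

```python
HEX_DIGITS = frozenset("0123456789abcdefABCDEF")


def _take_hex(text, start, limit):
    # Return the run of at most `limit` hex digits beginning at `start`.
    end = start
    while end < len(text) and end - start < limit and text[end] in HEX_DIGITS:
        end += 1
    return text[start:end]


def parse_hex_escape(pattern, pos=0, utf8=False, javascript=False):
    # pattern[pos] is the backslash and pattern[pos + 1] is "x" or "u".
    # Returns (character code, index of the first character after the escape).
    kind = pattern[pos + 1]
    body = pos + 2
    if javascript:
        # JavaScript mode: \x needs exactly two hex digits, \u exactly four;
        # otherwise the escape matches a literal "x" or "u".
        width = 2 if kind == "x" else 4
        digits = _take_hex(pattern, body, width)
        if len(digits) == width:
            return int(digits, 16), body + width
        return ord(kind), body
    if kind != "x":
        raise ValueError("\\u is not supported outside JavaScript mode")
    if pattern.startswith("{", body):
        close = pattern.find("}", body)
        inner = pattern[body + 1:close] if close != -1 else None
        if inner is not None and all(c in HEX_DIGITS for c in inner):
            # Empty braces are not covered by the text; treating them as
            # zero is an assumption of this sketch.
            value = int(inner, 16) if inner else 0
            limit = 2**31 if utf8 else 256
            if value >= limit:
                raise ValueError("character code too large for this mode")
            return value, close + 1
        # Non-hex content or no terminating "}": fall through to basic \x.
    digits = _take_hex(pattern, body, 2)  # zero to two hex digits
    return (int(digits, 16) if digits else 0), body + len(digits)
```

For example, `parse_hex_escape(r"\xdc")` and `parse_hex_escape(r"\x{dc}")` yield the same character code, and with `javascript=True` the input `r"\u00dc"` yields it as well, which matches the equivalence stated above. A malformed `r"\x{zz}"` in default mode falls back to a basic \x with no digits, giving the value zero and leaving the brace to be read as ordinary pattern text.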
278  .P  .P
279  After \e0 up to two further octal digits are read. If there are fewer than two  After \e0 up to two further octal digits are read. If there are fewer than two
280  digits, just those that are present are used. Thus the sequence \e0\ex\e07  digits, just those that are present are used. Thus the sequence \e0\ex\e07
# Line 328  unrecognized escape sequences, they are Line 336  unrecognized escape sequences, they are
336  set. Outside a character class, these sequences have different meanings.  set. Outside a character class, these sequences have different meanings.
337  .  .
338  .  .
339    .SS "Unsupported escape sequences"
340    .rs
341    .sp
342    In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
343    handler and used to modify the case of following characters. By default, PCRE
344    does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
345    option is set, \eU matches a "U" character, and \eu can be used to define a
346    character by code point, as described in the previous section.
347    .
348    .
349  .SS "Absolute and relative back references"  .SS "Absolute and relative back references"
350  .rs  .rs
351  .sp  .sp
# Line 387  This is the same as Line 405  This is the same as
405  .\" </a>  .\" </a>
406  the "." metacharacter  the "." metacharacter
407  .\"  .\"
408  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
409    PCRE does not support this.
410  .P  .P
411  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
412  of characters into two disjoint sets. Any given character matches one, and only  of characters into two disjoint sets. Any given character matches one, and only
# Line 964  special meaning in a character class. Line 983  special meaning in a character class.
983  .P  .P
984  The escape sequence \eN behaves like a dot, except that it is not affected by  The escape sequence \eN behaves like a dot, except that it is not affected by
985  the PCRE_DOTALL option. In other words, it matches any character except one  the PCRE_DOTALL option. In other words, it matches any character except one
986  that signifies the end of a line.  that signifies the end of a line. Perl also uses \eN to match characters by
987    name; PCRE does not support this.
988  .  .
989  .  .
990  .SH "MATCHING A SINGLE BYTE"  .SH "MATCHING A SINGLE BYTE"
991  .rs  .rs
992  .sp  .sp
993  Outside a character class, the escape sequence \eC matches any one byte, both  Outside a character class, the escape sequence \eC matches any one byte, both
994  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
995  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
996  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the  in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
997  rest of the string may start with a malformed UTF-8 character. For this reason,  breaks up characters into individual bytes, matching one byte with \eC in UTF-8
998  the \eC escape sequence is best avoided.  mode means that the rest of the string may start with a malformed UTF-8
999    character. This has undefined results, because PCRE assumes that it is dealing
1000    with valid UTF-8 strings (and by default it checks this at the start of
1001    processing unless the PCRE_NO_UTF8_CHECK option is used).
1002  .P  .P
1003  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
1004  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
# Line 984  PCRE does not allow \eC to appear in loo Line 1007  PCRE does not allow \eC to appear in loo
1007  .\"  .\"
1008  because in UTF-8 mode this would make it impossible to calculate the length of  because in UTF-8 mode this would make it impossible to calculate the length of
1009  the lookbehind.  the lookbehind.
1010    .P
1011    In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
1012    way of using it that avoids the problem of malformed UTF-8 characters is to
1013    use a lookahead to check the length of the next character, as in this pattern
1014    (ignore white space and line breaks):
1015    .sp
1016      (?| (?=[\ex00-\ex7f])(\eC) |
1017          (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1018          (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1019          (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1020    .sp
1021    A group that starts with (?| resets the capturing parentheses numbers in each
1022    alternative (see
1023    .\" HTML <a href="#dupsubpatternnumber">
1024    .\" </a>
1025    "Duplicate Subpattern Numbers"
1026    .\"
1027    below). The assertions at the start of each branch check the next UTF-8
1028    character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1029    character's individual bytes are then captured by the appropriate number of
1030    groups.
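The four lookahead ranges in the pattern above correspond to the code point ranges whose UTF-8 encodings take 1, 2, 3, or 4 bytes. The Python sketch below mirrors that branching outside of PCRE so the ranges can be checked against a real encoder; `utf8_length` and `split_bytes` are hypothetical helper names, and the sketch does not use \C or the PCRE API at all.

```python
def utf8_length(code_point):
    # Same ranges as the four lookahead assertions in the pattern.
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point outside the ranges used in the pattern")


def split_bytes(ch):
    # Analogue of the capturing groups: one captured byte per group,
    # with the number of groups chosen by the character's range.
    encoded = ch.encode("utf-8")
    count = utf8_length(ord(ch))
    return tuple(encoded[i:i + 1] for i in range(count))


if __name__ == "__main__":
    for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
        print(hex(ord(ch)), utf8_length(ord(ch)), split_bytes(ch))
```

Because the byte count is decided before any byte is consumed, each branch always consumes a whole character, which is the property that keeps the rest of the subject from starting in the middle of a UTF-8 sequence.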
1031  .  .
1032  .  .
1033  .\" HTML <a name="characterclass"></a>  .\" HTML <a name="characterclass"></a>
# Line 2389  PCRE finds the palindrome "aba" at the s Line 2433  PCRE finds the palindrome "aba" at the s
2433  the end of the string does not follow. Once again, it cannot jump back into the  the end of the string does not follow. Once again, it cannot jump back into the
2434  recursion to try other alternatives, so the entire match fails.  recursion to try other alternatives, so the entire match fails.
2435  .P  .P
2436  The second way in which PCRE and Perl differ in their recursion processing is  The second way in which PCRE and Perl differ in their recursion processing is
2437  in the handling of captured values. In Perl, when a subpattern is called  in the handling of captured values. In Perl, when a subpattern is called
2438  recursively or as a subpattern (see the next section), it has no access to any  recursively or as a subpattern (see the next section), it has no access to any
2439  values that were captured outside the recursion, whereas in PCRE these values  values that were captured outside the recursion, whereas in PCRE these values
2440  can be referenced. Consider this pattern:  can be referenced. Consider this pattern:
2441  .sp  .sp
2442    ^(.)(\e1|a(?2))    ^(.)(\e1|a(?2))
2443  .sp  .sp
2444  In PCRE, this pattern matches "bab". The first capturing parentheses match "b",  In PCRE, this pattern matches "bab". The first capturing parentheses match "b",
2445  then in the second group, when the back reference \e1 fails to match "b", the  then in the second group, when the back reference \e1 fails to match "b", the
2446  second alternative matches "a" and then recurses. In the recursion, \e1 does  second alternative matches "a" and then recurses. In the recursion, \e1 does
2447  now match "b" and so the whole match succeeds. In Perl, the pattern fails to  now match "b" and so the whole match succeeds. In Perl, the pattern fails to
# Line 2762  pattern fragments that do not contain an Line 2806  pattern fragments that do not contain an
2806  .sp  .sp
2807    A (B(*THEN)C) | D    A (B(*THEN)C) | D
2808  .sp  .sp
2809  If A and B are matched, but there is a failure in C, matching does not  If A and B are matched, but there is a failure in C, matching does not
2810  backtrack into A; instead it moves to the next alternative, that is, D.  backtrack into A; instead it moves to the next alternative, that is, D.
2811  However, if the subpattern containing (*THEN) is given an alternative, it  However, if the subpattern containing (*THEN) is given an alternative, it
2812  behaves differently:  behaves differently:
# Line 2770  behaves differently: Line 2814  behaves differently:
2814    A (B(*THEN)C | (*FAIL)) | D    A (B(*THEN)C | (*FAIL)) | D
2815  .sp  .sp
2816  The effect of (*THEN) is now confined to the inner subpattern. After a failure  The effect of (*THEN) is now confined to the inner subpattern. After a failure
2817  in C, matching moves to (*FAIL), which causes the whole subpattern to fail  in C, matching moves to (*FAIL), which causes the whole subpattern to fail
2818  because there are no more alternatives to try. In this case, matching does now  because there are no more alternatives to try. In this case, matching does now
2819  backtrack into A.  backtrack into A.
2820  .P  .P
2821  Note also that a conditional subpattern is not considered as having two  Note also that a conditional subpattern is not considered as having two
2822  alternatives, because only one is ever used. In other words, the | character in  alternatives, because only one is ever used. In other words, the | character in
2823  a conditional subpattern has a different meaning. Ignoring white space,  a conditional subpattern has a different meaning. Ignoring white space,
2824  consider:  consider:
2825  .sp  .sp
2826    ^.*? (?(?=a) a | b(*THEN)c )    ^.*? (?(?=a) a | b(*THEN)c )
2827  .sp  .sp
2828  If the subject is "ba", this pattern does not match. Because .*? is ungreedy,  If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
2829  it initially matches zero characters. The condition (?=a) then fails, the  it initially matches zero characters. The condition (?=a) then fails, the
2830  character "b" is matched, but "c" is not. At this point, matching does not  character "b" is matched, but "c" is not. At this point, matching does not
2831  backtrack to .*? as might perhaps be expected from the presence of the |  backtrack to .*? as might perhaps be expected from the presence of the |
2832  character. The conditional subpattern is part of the single alternative that  character. The conditional subpattern is part of the single alternative that
2833  comprises the whole pattern, and so the match fails. (If there was a backtrack  comprises the whole pattern, and so the match fails. (If there was a backtrack
2834  into .*?, allowing it to match "b", the match would succeed.)  into .*?, allowing it to match "b", the match would succeed.)
2835  .P  .P
2836  The verbs just described provide four different "strengths" of control when  The verbs just described provide four different "strengths" of control when
# Line 2830  Cambridge CB2 3QH, England. Line 2874  Cambridge CB2 3QH, England.
2874  .rs  .rs
2875  .sp  .sp
2876  .nf  .nf
2877  Last updated: 09 October 2011  Last updated: 14 November 2011
2878  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2879  .fi  .fi

