Diff of /code/trunk/doc/pcrepattern.3

revision 737 by ph10, Wed Oct 19 17:37:29 2011 UTC | revision 758 by ph10, Mon Nov 21 12:05:36 2011 UTC
# Line 241  one of the following escape sequences th Line 241  one of the following escape sequences th
241    \et        tab (hex 09)    \et        tab (hex 09)
242    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
243    \exhh      character with hex code hh    \exhh      character with hex code hh
244    \ex{hhh..} character with hex code hhh..    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
245      \euhhhh    character with hex code hhhh (JavaScript mode only)
246  .sp  .sp
247  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx is as follows: if x is a lower case letter, it
248  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
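
The \ecx rule quoted above (written \cx in the rendered page) is plain bit arithmetic, so it can be checked in a few lines. The following is a minimal sketch in Python, not PCRE's own code; the function name control_escape is hypothetical, and the sketch assumes x is a single ASCII character.

```python
def control_escape(x: str) -> int:
    """Return the character code that \\cx would produce for ASCII x.

    Hypothetical helper for illustration only; assumes len(x) == 1
    and that x is an ASCII character.
    """
    code = ord(x)
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


# "z" is hex 7A; upper-casing gives 5A, and flipping hex 40 gives 1A.
print(hex(control_escape("z")))
# "{" is hex 7B and is not a letter, so only the bit flip applies.
print(hex(control_escape("{")))
```

Because the second step is an XOR, applying it to a non-letter twice returns the original value: ";" (hex 3B) maps to hex 7B and "{" (hex 7B) maps back to hex 3B.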
# Line 252  both byte mode and UTF-8 mode. (When PCR Line 253  both byte mode and UTF-8 mode. (When PCR
253  values are valid. A lower case letter is converted to upper case, and then the  values are valid. A lower case letter is converted to upper case, and then the
254  0xc0 bits are flipped.)  0xc0 bits are flipped.)
255  .P  .P
256  After \ex, from zero to two hexadecimal digits are read (letters can be in  By default, after \ex, from zero to two hexadecimal digits are read (letters
257  upper or lower case). Any number of hexadecimal digits may appear between \ex{  can be in upper or lower case). Any number of hexadecimal digits may appear
258  and }, but the value of the character code must be less than 256 in non-UTF-8  between \ex{ and }, but the value of the character code must be less than 256
259  mode, and less than 2**31 in UTF-8 mode. That is, the maximum value in  in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, the maximum
260  hexadecimal is 7FFFFFFF. Note that this is bigger than the largest Unicode code  value in hexadecimal is 7FFFFFFF. Note that this is bigger than the largest
261  point, which is 10FFFF.  Unicode code point, which is 10FFFF.
262  .P  .P
263  If characters other than hexadecimal digits appear between \ex{ and }, or if  If characters other than hexadecimal digits appear between \ex{ and }, or if
264  there is no terminating }, this form of escape is not recognized. Instead, the  there is no terminating }, this form of escape is not recognized. Instead, the
265  initial \ex will be interpreted as a basic hexadecimal escape, with no  initial \ex will be interpreted as a basic hexadecimal escape, with no
266  following digits, giving a character whose value is zero.  following digits, giving a character whose value is zero.
267  .P  .P
268    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
269    as just described only when it is followed by two hexadecimal digits.
270    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
271    code points greater than 255 is provided by \eu, which must be followed by
272    four hexadecimal digits; otherwise it matches a literal "u" character.
273    .P
274  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
275  syntaxes for \ex. There is no difference in the way they are handled. For  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
276  example, \exdc is exactly the same as \ex{dc}.  way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
277    \eu00dc in JavaScript mode).
278  .P  .P
279  After \e0 up to two further octal digits are read. If there are fewer than two  After \e0 up to two further octal digits are read. If there are fewer than two
280  digits, just those that are present are used. Thus the sequence \e0\ex\e07  digits, just those that are present are used. Thus the sequence \e0\ex\e07
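
The rules for \ex and \eu in this hunk (\x and \u in the rendered page) depend on two settings: JavaScript compatibility and UTF-8 mode. The sketch below restates them as a small Python function. It is an illustration under stated assumptions, not PCRE's parser: the names read_hex_escape, js_compat, and utf8 are hypothetical, the text argument starts at the letter that follows the backslash, and treating an empty \x{} as "not recognized" is an assumption the text does not settle.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def _hex_run(text, start, limit):
    """Return the end index of at most `limit` hex digits beginning at `start`."""
    end = start
    while end < len(text) and end - start < limit and text[end] in HEX_DIGITS:
        end += 1
    return end


def read_hex_escape(text, pos, js_compat=False, utf8=False):
    """Interpret the escape whose letter ('x' or 'u') is at text[pos].

    Returns (character_code, index_of_next_unread_character).
    Hypothetical helper; js_compat stands in for PCRE_JAVASCRIPT_COMPAT.
    """
    letter, body = text[pos], pos + 1
    if letter == "u":
        if not js_compat:
            # By default PCRE does not support \u.
            raise ValueError("\\u is not supported outside JavaScript mode")
        end = _hex_run(text, body, 4)
        # Exactly four hex digits are required; otherwise a literal "u".
        return (int(text[body:end], 16), end) if end - body == 4 else (ord("u"), body)
    if letter != "x":
        raise ValueError("expected 'x' or 'u'")
    if js_compat:
        end = _hex_run(text, body, 2)
        # Exactly two hex digits are required; otherwise a literal "x".
        return (int(text[body:end], 16), end) if end - body == 2 else (ord("x"), body)
    if text.startswith("{", body):
        close = text.find("}", body + 1)
        digits = text[body + 1:close] if close != -1 else ""
        # Assumption: an empty pair of braces is treated as not recognized.
        if digits and all(c in HEX_DIGITS for c in digits):
            value = int(digits, 16)
            if value >= (2**31 if utf8 else 256):
                raise ValueError("character value too large for this mode")
            return value, close + 1
        # Brace form not recognized: \x with no digits, giving value zero.
        return 0, body
    end = _hex_run(text, body, 2)  # zero to two hex digits
    return (int(text[body:end], 16) if end > body else 0), end
```

With this sketch, "xdc" and "x{dc}" in the default mode and "u00dc" with js_compat=True all yield the code hex DC, matching the equivalence stated in the paragraph above.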
# Line 320  Note that octal values of 100 or greater Line 328  Note that octal values of 100 or greater
328  zero, because no more than three octal digits are ever read.  zero, because no more than three octal digits are ever read.
329  .P  .P
330  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
331  and outside character classes. In addition, inside a character class, the  and outside character classes. In addition, inside a character class, \eb is
332  sequence \eb is interpreted as the backspace character (hex 08). The sequences  interpreted as the backspace character (hex 08).
333  \eB, \eN, \eR, and \eX are not special inside a character class. Like any other  .P
334  unrecognized escape sequences, they are treated as the literal characters "B",  \eN is not allowed in a character class. \eB, \eR, and \eX are not special
335  "N", "R", and "X" by default, but cause an error if the PCRE_EXTRA option is  inside a character class. Like other unrecognized escape sequences, they are
336  set. Outside a character class, these sequences have different meanings.  treated as the literal characters "B", "R", and "X" by default, but cause an
337    error if the PCRE_EXTRA option is set. Outside a character class, these
338    sequences have different meanings.
339    .
340    .
341    .SS "Unsupported escape sequences"
342    .rs
343    .sp
344    In Perl, the sequences \el, \eL, \eu, and \eU are recognized by its string
345    handler and used to modify the case of following characters. By default, PCRE
346    does not support these escape sequences. However, if the PCRE_JAVASCRIPT_COMPAT
347    option is set, \eU matches a "U" character, and \eu can be used to define a
348    character by code point, as described in the previous section.
349  .  .
350  .  .
351  .SS "Absolute and relative back references"  .SS "Absolute and relative back references"
# Line 387  This is the same as Line 407  This is the same as
407  .\" </a>  .\" </a>
408  the "." metacharacter  the "." metacharacter
409  .\"  .\"
410  when PCRE_DOTALL is not set.  when PCRE_DOTALL is not set. Perl also uses \eN to match characters by name;
411    PCRE does not support this.
412  .P  .P
413  Each pair of lower and upper case escape sequences partitions the complete set  Each pair of lower and upper case escape sequences partitions the complete set
414  of characters into two disjoint sets. Any given character matches one, and only  of characters into two disjoint sets. Any given character matches one, and only
# Line 964  special meaning in a character class. Line 985  special meaning in a character class.
985  .P  .P
986  The escape sequence \eN behaves like a dot, except that it is not affected by  The escape sequence \eN behaves like a dot, except that it is not affected by
987  the PCRE_DOTALL option. In other words, it matches any character except one  the PCRE_DOTALL option. In other words, it matches any character except one
988  that signifies the end of a line.  that signifies the end of a line. Perl also uses \eN to match characters by
989    name; PCRE does not support this.
990  .  .
991  .  .
992  .SH "MATCHING A SINGLE BYTE"  .SH "MATCHING A SINGLE BYTE"
# Line 983  processing unless the PCRE_NO_UTF8_CHECK Line 1005  processing unless the PCRE_NO_UTF8_CHECK
1005  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
1006  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
1007  .\" </a>  .\" </a>
1008  (described below),  (described below)
1009  .\"  .\"
1010  because in UTF-8 mode this would make it impossible to calculate the length of  in UTF-8 mode, because this would make it impossible to calculate the length of
1011  the lookbehind.  the lookbehind.
1012  .P  .P
1013  In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one  In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
1014  way of using it that avoids the problem of malformed UTF-8 characters is to  way of using it that avoids the problem of malformed UTF-8 characters is to
1015  use a lookahead to check the length of the next character, as in this pattern  use a lookahead to check the length of the next character, as in this pattern
1016  (ignore white space and line breaks):  (ignore white space and line breaks):
1017  .sp  .sp
1018    (?| (?=[\ex00-\ex7f])(\eC) |    (?| (?=[\ex00-\ex7f])(\eC) |
1019        (?=[\ex80-\ex{7ff}])(\eC)(\eC) |        (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1020        (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |        (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1021        (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))        (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1022  .sp  .sp
1023  A group that starts with (?| resets the capturing parentheses numbers in each  A group that starts with (?| resets the capturing parentheses numbers in each
1024  alternative (see  alternative (see
1025  .\" HTML <a href="#dupsubpatternnumber">  .\" HTML <a href="#dupsubpatternnumber">
1026  .\" </a>  .\" </a>
1027  "Duplicate Subpattern Numbers"  "Duplicate Subpattern Numbers"
1028  .\"  .\"
1029  below). The assertions at the start of each branch check the next UTF-8  below). The assertions at the start of each branch check the next UTF-8
1030  character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The  character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1031  character's individual bytes are then captured by the appropriate number of  character's individual bytes are then captured by the appropriate number of
1032  groups.  groups.
1033  .  .
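
The four lookahead ranges in the pattern above correspond to the UTF-8 encoded lengths of 1, 2, 3, and 4 bytes. As a quick cross-check, the following Python sketch maps a code point to the same branch boundaries and compares the result with Python's own UTF-8 encoder. The function name utf8_length is hypothetical, and the sketch only mirrors the ranges used in the pattern; it does not call PCRE.

```python
def utf8_length(code_point: int) -> int:
    """Return the number of bytes the pattern's matching branch would capture.

    Hypothetical helper; the boundaries are copied from the four
    lookahead character classes in the pattern.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1  # [\x00-\x7f]
    if code_point <= 0x7FF:
        return 2  # [\x80-\x{7ff}]
    if code_point <= 0xFFFF:
        return 3  # [\x{800}-\x{ffff}]
    if code_point <= 0x1FFFFF:
        return 4  # [\x{10000}-\x{1fffff}]
    raise ValueError("outside the ranges covered by the pattern")


# Compare against Python's encoder at each range boundary.
for cp in (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

The comparison stops at 10FFFF because Python's chr() accepts only valid Unicode code points, whereas the last range in the pattern extends to 1FFFFF.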
# Line 1950  temporarily move the current position ba Line 1972  temporarily move the current position ba
1972  match. If there are insufficient characters before the current position, the  match. If there are insufficient characters before the current position, the
1973  assertion fails.  assertion fails.
1974  .P  .P
1975  PCRE does not allow the \eC escape (which matches a single byte in UTF-8 mode)  In UTF-8 mode, PCRE does not allow the \eC escape (which matches a single byte,
1976  to appear in lookbehind assertions, because it makes it impossible to calculate  even in UTF-8 mode) to appear in lookbehind assertions, because it makes it
1977  the length of the lookbehind. The \eX and \eR escapes, which can match  impossible to calculate the length of the lookbehind. The \eX and \eR escapes,
1978  different numbers of bytes, are also not permitted.  which can match different numbers of bytes, are also not permitted.
1979  .P  .P
1980  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
1981  .\" </a>  .\" </a>
# Line 2854  Cambridge CB2 3QH, England. Line 2876  Cambridge CB2 3QH, England.
2876  .rs  .rs
2877  .sp  .sp
2878  .nf  .nf
2879  Last updated: 19 October 2011  Last updated: 19 November 2011
2880  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2881  .fi  .fi
