Diff of /code/trunk/doc/pcrepattern.3

revision 1369 by ph10, Tue Oct 8 15:06:46 2013 UTC revision 1370 by ph10, Wed Oct 9 10:18:26 2013 UTC
# Line 300  one of the following escape sequences th Line 300  one of the following escape sequences th
300    \en        linefeed (hex 0A)    \en        linefeed (hex 0A)
301    \er        carriage return (hex 0D)    \er        carriage return (hex 0D)
302    \et        tab (hex 09)    \et        tab (hex 09)
303      \e0dd      character with octal code 0dd
304    \eddd      character with octal code ddd, or back reference    \eddd      character with octal code ddd, or back reference
305      \eo{ddd..} character with octal code ddd..
306    \exhh      character with hex code hh    \exhh      character with hex code hh
307    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
308    \euhhhh    character with hex code hhhh (JavaScript mode only)    \euhhhh    character with hex code hhhh (JavaScript mode only)
# Line 321  byte are inverted. Thus \ecA becomes hex Line 323  byte are inverted. Thus \ecA becomes hex
323  the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other  the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
324  characters also generate different values.  characters also generate different values.
325  .P  .P
 By default, after \ex, from zero to two hexadecimal digits are read (letters  
 can be in upper or lower case). Any number of hexadecimal digits may appear  
 between \ex{ and }, but the character code is constrained as follows:  
 .sp  
   8-bit non-UTF mode    less than 0x100  
   8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint  
   16-bit non-UTF mode   less than 0x10000  
   16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint  
   32-bit non-UTF mode   less than 0x80000000  
   32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint  
 .sp  
 Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called  
 "surrogate" codepoints), and 0xffef.  
 .P  
 If characters other than hexadecimal digits appear between \ex{ and }, or if  
 there is no terminating }, this form of escape is not recognized. Instead, the  
 initial \ex will be interpreted as a basic hexadecimal escape, with no  
 following digits, giving a character whose value is zero.  
 .P  
 If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is  
 as just described only when it is followed by two hexadecimal digits.  
 Otherwise, it matches a literal "x" character. In JavaScript mode, support for  
 code points greater than 256 is provided by \eu, which must be followed by  
 four hexadecimal digits; otherwise it matches a literal "u" character.  
 Character codes specified by \eu in JavaScript mode are constrained in the same  
 was as those specified by \ex in non-JavaScript mode.  
 .P  
 Characters whose value is less than 256 can be defined by either of the two  
 syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the  
 way they are handled. For example, \exdc is exactly the same as \ex{dc} (or  
 \eu00dc in JavaScript mode).  
 .P  
326  After \e0 up to two further octal digits are read. If there are fewer than two  After \e0 up to two further octal digits are read. If there are fewer than two
327  digits, just those that are present are used. Thus the sequence \e0\ex\e07  digits, just those that are present are used. Thus the sequence \e0\ex\e07
328  specifies two binary zeros followed by a BEL character (code value 7). Make  specifies two binary zeros followed by a BEL character (code value 7). Make
329  sure you supply two digits after the initial zero if the pattern character that  sure you supply two digits after the initial zero if the pattern character that
330  follows is itself an octal digit.  follows is itself an octal digit.
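
The rule for a backslash followed by 0 can be sketched in Python as below. This is a stand-alone illustration, not PCRE's own code: the name read_zero_escape is hypothetical, and the man page source writes a literal backslash as \e.

```python
OCTAL_DIGITS = "01234567"

def read_zero_escape(pattern, pos):
    """Read an escape that starts with backslash-zero at pattern[pos].

    Returns (character value, index of the first unread character).
    Hypothetical helper for illustration; it is not part of PCRE.
    """
    assert pattern[pos:pos + 2] == "\\0"
    end = pos + 2
    # Up to two further octal digits are consumed after the initial zero.
    while end < len(pattern) and end < pos + 4 and pattern[end] in OCTAL_DIGITS:
        end += 1
    # If fewer than two digits follow, just those that are present are used.
    return int(pattern[pos + 1:end], 8), end
```

Under this sketch, a backslash followed by 071 is read as the single value octal 71, whereas a backslash followed by 0071 yields BEL (value 7) and leaves the final "1" as an ordinary character, which is why the text advises supplying both digits.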
331  .P  .P
332    The escape \eo must be followed by a sequence of octal digits, enclosed in
333    braces. An error occurs if this is not the case. This escape is a recent
334    addition to Perl; it provides way of specifying character code points as octal
335    numbers greater than 0777, and it also allows octal numbers and back references
336    to be unambiguously specified.
337    .P
338    For greater clarity and unambiguity, it is best to avoid following \e by a
339    digit greater than zero. Instead, use \eo{} or \ex{} to specify character
340    numbers, and \eg{} to specify back references. The following paragraphs
341    describe the old, ambiguous syntax.
342    .P
343  The handling of a backslash followed by a digit other than 0 is complicated,  The handling of a backslash followed by a digit other than 0 is complicated,
344  and Perl has changed in recent releases, causing PCRE also to change. Outside a  and Perl has changed in recent releases, causing PCRE also to change. Outside a
345  character class, PCRE reads the digit and any following digits as a decimal  character class, PCRE reads the digit and any following digits as a decimal
# Line 379  Inside a character class, or if the deci Line 360  Inside a character class, or if the deci
360  7 and there have not been that many capturing subpatterns, PCRE handles \e8 and  7 and there have not been that many capturing subpatterns, PCRE handles \e8 and
361  \e9 as the literal characters "8" and "9", and otherwise re-reads up to three  \e9 as the literal characters "8" and "9", and otherwise re-reads up to three
362  octal digits following the backslash, using them to generate a data character.  octal digits following the backslash, using them to generate a data character.
363  Any subsequent digits stand for themselves. The value of the character is  Any subsequent digits stand for themselves. For example:
 constrained in the same way as characters specified in hexadecimal. For  
 example:  
364  .sp  .sp
365    \e040   is another way of writing an ASCII space    \e040   is another way of writing an ASCII space
366  .\" JOIN  .\" JOIN
# Line 403  example: Line 382  example:
382    \e81    is either a back reference, or the two    \e81    is either a back reference, or the two
383              characters "8" and "1"              characters "8" and "1"
384  .sp  .sp
385  Note that octal values of 100 or greater must not be introduced by a leading  Note that octal values of 100 or greater that are specified using this syntax
386  zero, because no more than three octal digits are ever read.  must not be introduced by a leading zero, because no more than three octal
387    digits are ever read.
388    .P
389    By default, after \ex that is not followed by {, from zero to two hexadecimal
390    digits are read (letters can be in upper or lower case). Any number of
391    hexadecimal digits may appear between \ex{ and }. If a character other than
392    a hexadecimal digit appears between \ex{ and }, or if there is no terminating
393    }, an error occurs.
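
A minimal Python sketch of the revision 1370 wording for the backslash-x escape in the default (non-JavaScript) mode follows. The function name read_hex_escape and the use of ValueError to signal the error are assumptions made for illustration; the treatment of an empty pair of braces is also an assumption, since the text does not address it.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def read_hex_escape(pattern, pos):
    """Read backslash-x at pattern[pos] in the default (non-JavaScript) mode.

    Returns (character value, index of the first unread character).
    Hypothetical helper for illustration; it is not part of PCRE.
    """
    assert pattern[pos:pos + 2] == "\\x"
    start = pos + 2
    if pattern[start:start + 1] == "{":
        close = pattern.find("}", start)
        digits = pattern[start + 1:close] if close != -1 else ""
        # A missing "}" or a non-hexadecimal character is an error.
        if close == -1 or any(ch not in HEX_DIGITS for ch in digits):
            raise ValueError("malformed \\x{...} escape")
        # Assumption: an empty "{}" is read as zero; the text does not say.
        return int(digits or "0", 16), close + 1
    end = start
    # Without a brace, from zero to two hexadecimal digits are read.
    while end < len(pattern) and end < start + 2 and pattern[end] in HEX_DIGITS:
        end += 1
    return int(pattern[start:end] or "0", 16), end
```

The braced branch raises where revision 1369 would instead have fallen back to a basic escape with no digits, giving a character whose value is zero.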
394  .P  .P
395    If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \ex is
396    as just described only when it is followed by two hexadecimal digits.
397    Otherwise, it matches a literal "x" character. In JavaScript mode, support for
398    code points greater than 256 is provided by \eu, which must be followed by
399    four hexadecimal digits; otherwise it matches a literal "u" character.
400    .P
401    Characters whose value is less than 256 can be defined by either of the two
402    syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
403    way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
404    \eu00dc in JavaScript mode).
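
The JavaScript-mode rules above can be sketched in Python as follows. The name read_js_escape is hypothetical, and the sketch only models the digit-count rule described in the text; it does not call PCRE or set PCRE_JAVASCRIPT_COMPAT.

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def read_js_escape(pattern, pos):
    """Read backslash-x or backslash-u as described for JavaScript mode.

    Returns (character value, index of the first unread character).
    Hypothetical helper for illustration; it is not part of PCRE.
    """
    letter = pattern[pos + 1]
    assert pattern[pos] == "\\" and letter in "xu"
    # \x needs exactly two hexadecimal digits, \u exactly four.
    width = 2 if letter == "x" else 4
    digits = pattern[pos + 2:pos + 2 + width]
    if len(digits) == width and all(ch in HEX_DIGITS for ch in digits):
        return int(digits, 16), pos + 2 + width
    # Otherwise the escape matches the literal letter "x" or "u".
    return ord(letter), pos + 2
```

In this sketch a braced form after backslash-x is not two hexadecimal digits, so it is read as a literal "x" followed by the remaining characters.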
405    .
406    .
407    .SS "Constraints on character values"
408    .rs
409    .sp
410    Characters that are specified using octal or hexadecimal numbers are
411    limited to certain values, as follows:
412    .sp
413      8-bit non-UTF mode    less than 0x100
414      8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
415      16-bit non-UTF mode   less than 0x10000
416      16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
417      32-bit non-UTF mode   less than 0x80000000
418      32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
419    .sp
420    Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
421    "surrogate" codepoints), and 0xffef.
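
The table of constraints can be expressed as a small lookup, sketched here in Python. The names LIMITS and value_allowed are hypothetical. The sketch follows the wording above literally, including the strict "less than" bounds and the listed invalid codepoints; it is not a statement about how PCRE itself performs the check.

```python
# Exclusive upper bounds copied from the table above, keyed by
# (code unit width in bits, whether UTF mode is enabled).
LIMITS = {
    (8, False): 0x100,
    (8, True): 0x10ffff,
    (16, False): 0x10000,
    (16, True): 0x10ffff,
    (32, False): 0x80000000,
    (32, True): 0x10ffff,
}

def value_allowed(value, width, utf):
    """Return True if the table, read literally, permits this value.

    Hypothetical helper for illustration; it is not part of PCRE.
    """
    if value < 0 or value >= LIMITS[(width, utf)]:
        return False
    # In UTF modes the value must also be a valid codepoint, as the
    # text defines it: not a surrogate and not 0xffef.
    if utf and (0xd800 <= value <= 0xdfff or value == 0xffef):
        return False
    return True
```

For instance, 0xd800 is accepted in 16-bit non-UTF mode but rejected in 16-bit UTF-16 mode, because the valid-codepoint requirement applies only to the UTF modes.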
422    .
423    .
424    .SS "Escape sequences in character classes"
425    .rs
426    .sp
427  All the sequences that define a single character value can be used both inside  All the sequences that define a single character value can be used both inside
428  and outside character classes. In addition, inside a character class, \eb is  and outside character classes. In addition, inside a character class, \eb is
429  interpreted as the backspace character (hex 08).  interpreted as the backspace character (hex 08).

