Diff of /code/trunk/doc/pcrepattern.3

revision 733 by ph10, Tue Oct 11 10:29:36 2011 UTC revision 737 by ph10, Wed Oct 19 17:37:29 2011 UTC
# Line 971  that signifies the end of a line. Line 971  that signifies the end of a line.
971  .rs  .rs
972  .sp  .sp
973  Outside a character class, the escape sequence \eC matches any one byte, both  Outside a character class, the escape sequence \eC matches any one byte, both
974  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
975  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
976  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the  in UTF-8 mode, but it is unclear how it can usefully be used. Because \eC
977  rest of the string may start with a malformed UTF-8 character. For this reason,  breaks up characters into individual bytes, matching one byte with \eC in UTF-8
978  the \eC escape sequence is best avoided.  mode means that the rest of the string may start with a malformed UTF-8
979    character. This has undefined results, because PCRE assumes that it is dealing
980    with valid UTF-8 strings (and by default it checks this at the start of
981    processing unless the PCRE_NO_UTF8_CHECK option is used).
982  .P  .P
983  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
984  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
# Line 984  PCRE does not allow \eC to appear in loo Line 987  PCRE does not allow \eC to appear in loo
987  .\"  .\"
988  because in UTF-8 mode this would make it impossible to calculate the length of  because in UTF-8 mode this would make it impossible to calculate the length of
989  the lookbehind.  the lookbehind.
990    .P
991    In general, the \eC escape sequence is best avoided in UTF-8 mode. However, one
992    way of using it that avoids the problem of malformed UTF-8 characters is to
993    use a lookahead to check the length of the next character, as in this pattern
994    (ignore white space and line breaks):
995    .sp
996      (?| (?=[\ex00-\ex7f])(\eC) |
997          (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
998          (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
999          (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1000    .sp
1001    A group that starts with (?| resets the capturing parentheses numbers in each
1002    alternative (see
1003    .\" HTML <a href="#dupsubpatternnumber">
1004    .\" </a>
1005    "Duplicate Subpattern Numbers"
1006    .\"
1007    below). The assertions at the start of each branch check the next UTF-8
1008    character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1009    character's individual bytes are then captured by the appropriate number of
1010    groups.
1011  .  .
1012  .  .
1013  .\" HTML <a name="characterclass"></a>  .\" HTML <a name="characterclass"></a>
# Line 2830  Cambridge CB2 3QH, England. Line 2854  Cambridge CB2 3QH, England.
2854  .rs  .rs
2855  .sp  .sp
2856  .nf  .nf
2857  Last updated: 09 October 2011  Last updated: 19 October 2011
2858  Copyright (c) 1997-2011 University of Cambridge.  Copyright (c) 1997-2011 University of Cambridge.
2859  .fi  .fi
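
The lookahead pattern added in v.737 picks a branch by code point range and then consumes 1, 2, 3, or 4 bytes with \eC. The following is a minimal Python sketch of that same range-to-length mapping, not PCRE itself: Python's re module supports neither \C nor (?|, so the helper names utf8_length and split_into_bytes are hypothetical and only emulate what the capture groups would hold for each character.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for a code point.

    The thresholds mirror the four lookahead ranges in the pattern:
    \\x00-\\x7f, \\x80-\\x{7ff}, \\x{800}-\\x{ffff}, and \\x{10000} upward.
    """
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


def split_into_bytes(text: str) -> list:
    """Emulate the captures: one tuple of byte values per character.

    Assumes text contains no lone surrogates, which cannot be
    encoded as UTF-8.
    """
    groups = []
    for ch in text:
        encoded = ch.encode("utf-8")
        # The range check and the real encoding must agree on the length.
        if len(encoded) != utf8_length(ord(ch)):
            raise AssertionError("range table disagrees with the encoder")
        groups.append(tuple(encoded))
    return groups


if __name__ == "__main__":
    # One character from each of the four ranges.
    for group in split_into_bytes("a\u00e9\u20ac\U0001F600"):
        print(len(group), [hex(b) for b in group])
```

Because the length is decided before any byte is consumed, every step ends on a character boundary, which is the property the lookaheads are there to guarantee.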

