Diff of /code/trunk/doc/html/pcrepattern.html

revision 737 by ph10, Tue Oct 11 10:29:36 2011 UTC revision 738 by ph10, Fri Oct 21 09:04:01 2011 UTC
# Line 968  that signifies the end of a line. Line 968  that signifies the end of a line.
968  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
969  <P>  <P>
970  Outside a character class, the escape sequence \C matches any one byte, both  Outside a character class, the escape sequence \C matches any one byte, both
971  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches line-ending
972  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
973  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the  in UTF-8 mode, but it is unclear how it can usefully be used. Because \C
974  rest of the string may start with a malformed UTF-8 character. For this reason,  breaks up characters into individual bytes, matching one byte with \C in UTF-8
975  the \C escape sequence is best avoided.  mode means that the rest of the string may start with a malformed UTF-8
976    character. This has undefined results, because PCRE assumes that it is dealing
977    with valid UTF-8 strings (and by default it checks this at the start of
978    processing unless the PCRE_NO_UTF8_CHECK option is used).
979  </P>  </P>
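The malformed-remainder problem described in the added text can be illustrated outside PCRE. The following is a minimal Python sketch, not PCRE code: Python's `re` module has no \C, so the sketch only emulates "consume some bytes, then look at what is left". The helper name `remainder_is_valid_utf8` and the sample subject are assumptions made for illustration.

```python
def remainder_is_valid_utf8(text: str, bytes_consumed: int) -> bool:
    """Return True if the bytes left after a consumed prefix still decode as UTF-8."""
    rest = text.encode("utf-8")[bytes_consumed:]
    try:
        rest.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Hypothetical subject: U+00E9, 't', U+00E9, encoded as the bytes C3 A9 74 C3 A9.
subject = "\u00e9t\u00e9"

# Consuming a single byte (as one \C would) leaves A9 74 C3 A9, which starts
# with a stray continuation byte.
print(remainder_is_valid_utf8(subject, 1))  # False

# Consuming both bytes of the first character leaves 74 C3 A9, which is valid.
print(remainder_is_valid_utf8(subject, 2))  # True
```

The first call corresponds to the situation the revised paragraph warns about: the rest of the string no longer begins on a character boundary.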
980  <P>  <P>
981  PCRE does not allow \C to appear in lookbehind assertions  PCRE does not allow \C to appear in lookbehind assertions
982  <a href="#lookbehind">(described below),</a>  <a href="#lookbehind">(described below),</a>
983  because in UTF-8 mode this would make it impossible to calculate the length of  because in UTF-8 mode this would make it impossible to calculate the length of
984  the lookbehind.  the lookbehind.
985    </P>
986    <P>
987    In general, the \C escape sequence is best avoided in UTF-8 mode. However, one
988    way of using it that avoids the problem of malformed UTF-8 characters is to
989    use a lookahead to check the length of the next character, as in this pattern
990    (ignore white space and line breaks):
991    <pre>
992      (?| (?=[\x00-\x7f])(\C) |
993          (?=[\x80-\x{7ff}])(\C)(\C) |
994          (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
995          (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
996    </pre>
997    A group that starts with (?| resets the capturing parentheses numbers in each
998    alternative (see
999    <a href="#dupsubpatternnumber">"Duplicate Subpattern Numbers"</a>
1000    below). The assertions at the start of each branch check the next UTF-8
1001    character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1002    character's individual bytes are then captured by the appropriate number of
1003    groups.
1004  <a name="characterclass"></a></P>  <a name="characterclass"></a></P>
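The branch selection performed by the four lookaheads can be sketched as ordinary code. This is a hedged Python illustration of the same ranges, not a use of the PCRE API; the function names `utf8_length` and `split_first_character` are hypothetical. Python only accepts code points up to 0x10FFFF and will not encode surrogates, so the upper bound 0x1FFFFF is kept here only to mirror the pattern.

```python
def utf8_length(code_point: int) -> int:
    """Number of bytes in the UTF-8 encoding, using the ranges from the pattern."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    if code_point <= 0x1FFFFF:
        return 4
    raise ValueError("code point outside the ranges covered by the pattern")


def split_first_character(text: str):
    """Return the first character's bytes one by one, plus the remaining bytes.

    This mimics what the capture groups hold after a match: one byte per
    group, with the remainder starting on a character boundary.
    """
    if not text:
        raise ValueError("empty subject")
    n = utf8_length(ord(text[0]))
    data = text.encode("utf-8")
    return [data[i:i + 1] for i in range(n)], data[n:]


# U+20AC (euro sign) falls in the 0x800-0xFFFF range, so three bytes are taken.
print(split_first_character("\u20acuro"))
# ([b'\xe2', b'\x82', b'\xac'], b'uro')
```

Because the number of bytes taken always equals the encoded length of the first character, the remainder is never left mid-character, which is the property the lookahead approach is meant to guarantee.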
1005  <br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>  <br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
1006  <P>  <P>
# Line 2797  Cambridge CB2 3QH, England. Line 2819  Cambridge CB2 3QH, England.
2819  </P>  </P>
2820  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2821  <P>  <P>
2822  Last updated: 09 October 2011  Last updated: 19 October 2011
2823  <br>  <br>
2824  Copyright &copy; 1997-2011 University of Cambridge.  Copyright &copy; 1997-2011 University of Cambridge.
2825  <br>  <br>
