Diff of /code/trunk/doc/pcrepattern.3

revision 1001 by ph10, Wed Aug 8 10:18:25 2012 UTC revision 1011 by ph10, Sat Aug 25 11:36:15 2012 UTC
# Line 576  The extra escape sequences are: Line 576  The extra escape sequences are:
576  .sp  .sp
577    \ep{\fIxx\fP}   a character with the \fIxx\fP property    \ep{\fIxx\fP}   a character with the \fIxx\fP property
578    \eP{\fIxx\fP}   a character without the \fIxx\fP property    \eP{\fIxx\fP}   a character without the \fIxx\fP property
579    \eX       an extended Unicode sequence    \eX       a Unicode extended grapheme cluster
580  .sp  .sp
581  The property names represented by \fIxx\fP above are limited to the Unicode  The property names represented by \fIxx\fP above are limited to the Unicode
582  script names, the general category properties, "Any", which matches any  script names, the general category properties, "Any", which matches any
# Line 786  Unicode table. Line 786  Unicode table.
786  Specifying caseless matching does not affect these escape sequences. For  Specifying caseless matching does not affect these escape sequences. For
787  example, \ep{Lu} always matches only upper case letters.  example, \ep{Lu} always matches only upper case letters.
788  .P  .P
789  The \eX escape matches any number of Unicode characters that form an extended  Matching characters by Unicode property is not fast, because PCRE has to do a
790  Unicode sequence. \eX is equivalent to  multistage table lookup in order to find a character's property. That is why
791  .sp  the traditional escape sequences such as \ed and \ew do not use Unicode
792    (?>\ePM\epM*)  properties in PCRE by default, though you can make them do so by setting the
793    PCRE_UCP option or by starting the pattern with (*UCP).
794    .
795    .
796    .SS Extended grapheme clusters
797    .rs
798  .sp  .sp
799  That is, it matches a character without the "mark" property, followed by zero  The \eX escape matches any number of Unicode characters that form an "extended
800  or more characters with the "mark" property, and treats the sequence as an  grapheme cluster", and treats the sequence as an atomic group
 atomic group  
801  .\" HTML <a href="#atomicgroup">  .\" HTML <a href="#atomicgroup">
802  .\" </a>  .\" </a>
803  (see below).  (see below).
804  .\"  .\"
805  Characters with the "mark" property are typically accents that affect the  Up to and including release 8.31, PCRE matched an earlier, simpler definition
806  preceding character. None of them have codepoints less than 256, so in  that was equivalent to
807  8-bit non-UTF-8 mode \eX matches any one character.  .sp
808  .P    (?>\ePM\epM*)
809  Note that recent versions of Perl have changed \eX to match what Unicode calls  .sp
810  an "extended grapheme cluster", which has a more complicated definition.  That is, it matched a character without the "mark" property, followed by zero
811  .P  or more characters with the "mark" property. Characters with the "mark"
812  Matching characters by Unicode property is not fast, because PCRE has to search  property are typically non-spacing accents that affect the preceding character.
813  a structure that contains data for over fifteen thousand characters. That is  .P
814  why the traditional escape sequences such as \ed and \ew do not use Unicode  This simple definition was extended in Unicode to include more complicated
815  properties in PCRE by default, though you can make them do so by setting the  kinds of composite character by giving each character a grapheme breaking
816  PCRE_UCP option or by starting the pattern with (*UCP).  property, and creating rules that use these properties to define the boundaries
817    of extended grapheme clusters. In releases of PCRE later than 8.31, \eX matches
818    one of these clusters.
819    .P
820    \eX always matches at least one character. Then it decides whether to add
821    additional characters according to the following rules for ending a cluster:
822    .P
823    1. End at the end of the subject string.
824    .P
825    2. Do not end between CR and LF; otherwise end after any control character.
826    .P
827    3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
828    are of five types: L, V, T, LV, and LVT. An L character may be followed by an
829    L, V, LV, or LVT character; an LV or V character may be followed by a V or T
830    character; an LVT or T character may be followed only by a T character.
831    .P
832    4. Do not end before extending characters or spacing marks. Characters with
833    the "mark" property always have the "extend" grapheme breaking property.
834    .P
835    5. Do not end after prepend characters.
836    .P
837    6. Otherwise, end the cluster.
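Rule 3 above can be read as a small transition table. The following is a minimal Python sketch of that table only, not PCRE's implementation. The helper names `hangul_type` and `hangul_no_break` are hypothetical, and the code point ranges are an assumption taken from the main Unicode Hangul Jamo block and the precomposed Hangul Syllables block, not from PCRE's own tables; the extended Jamo blocks are left out.

```python
# Sketch of rule 3 (Hangul syllable sequences). The ranges below are an
# assumption based on the main Hangul Jamo block and the precomposed
# syllables block; they are not copied from PCRE's tables.

def hangul_type(ch):
    # Return 'L', 'V', 'T', 'LV', 'LVT', or None for one character.
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"      # leading consonants
    if 0x1160 <= cp <= 0x11A7:
        return "V"      # vowels
    if 0x11A8 <= cp <= 0x11FF:
        return "T"      # trailing consonants
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None

# Which types may follow each type without ending the cluster (rule 3).
ALLOWED_NEXT = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}

def hangul_no_break(prev, nxt):
    # True if rule 3 forbids ending the cluster between prev and nxt.
    return hangul_type(nxt) in ALLOWED_NEXT.get(hangul_type(prev), set())
```

For example, `hangul_no_break("\u1100", "\u1161")` is true because an L character may be followed by a V character, whereas an LVT syllable followed by a V character is not covered by rule 3.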
838  .  .
839  .  .
840  .\" HTML <a name="extraprops"></a>  .\" HTML <a name="extraprops"></a>
841  .SS PCRE's additional properties  .SS PCRE's additional properties
842  .rs  .rs
843  .sp  .sp
844  As well as the standard Unicode properties described in the previous  As well as the standard Unicode properties described above, PCRE supports four
845  section, PCRE supports four more that make it possible to convert traditional  more that make it possible to convert traditional escape sequences such as \ew
846  escape sequences such as \ew and \es and POSIX character classes to use Unicode  and \es and POSIX character classes to use Unicode properties. PCRE uses these
847  properties. PCRE uses these non-standard, non-Perl properties internally when  non-standard, non-Perl properties internally when PCRE_UCP is set. They are:
 PCRE_UCP is set. They are:  
848  .sp  .sp
849    Xan   Any alphanumeric character    Xan   Any alphanumeric character
850    Xps   Any POSIX space character    Xps   Any POSIX space character
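The earlier, simpler \X definition quoted in this hunk, (?>\PM\pM*), can be illustrated without a regex engine. This is a rough Python sketch under stated assumptions: `is_mark` and `match_legacy_x` are hypothetical names, and the "mark" test uses the general category reported by Python's `unicodedata` module, whose Unicode version may differ from the tables built into a given PCRE release.

```python
import unicodedata

def is_mark(ch):
    # General categories Mn, Mc and Me together form the "mark" property.
    return unicodedata.category(ch).startswith("M")

def match_legacy_x(text, pos=0):
    # Mimic the pre-8.32 definition at position pos: one character without
    # the mark property, then as many mark characters as possible.
    # Returns the end index of the match, or None if there is no match.
    if pos >= len(text) or is_mark(text[pos]):
        return None
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    # Only this one end point is offered, which mirrors the atomic group:
    # the match is never shortened afterwards.
    return end
```

With the subject "e" followed by U+0301 (combining acute accent) and then "a", the sketch matches two characters at position 0, fails at position 1 because a mark cannot start a match, and matches one character at position 2.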
# Line 1541  quantifier, but a literal string of four Line 1565  quantifier, but a literal string of four
1565  In UTF modes, quantifiers apply to characters rather than to individual data  In UTF modes, quantifiers apply to characters rather than to individual data
1566  units. Thus, for example, \ex{100}{2} matches two characters, each of  units. Thus, for example, \ex{100}{2} matches two characters, each of
1567  which is represented by a two-byte sequence in a UTF-8 string. Similarly,  which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1568  \eX{3} matches three Unicode extended sequences, each of which may be several  \eX{3} matches three Unicode extended grapheme clusters, each of which may be
1569  data units long (and they may be of different lengths).  several data units long (and they may be of different lengths).
1570  .P  .P
1571  The quantifier {0} is permitted, causing the expression to behave as if the  The quantifier {0} is permitted, causing the expression to behave as if the
1572  previous item and the quantifier were not present. This may be useful for  previous item and the quantifier were not present. This may be useful for
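The distinction between characters and data units in this hunk can be checked with a few lines of Python. This sketch only counts UTF-8 bytes per character; it does not call PCRE, and `utf8_units` is a hypothetical helper name.

```python
def utf8_units(text):
    # Number of 8-bit data units (bytes) each character needs in UTF-8.
    return [len(ch.encode("utf-8")) for ch in text]

# Two copies of U+0100, the character written \x{100} in a pattern.
subject = "\u0100\u0100"

chars = len(subject)                   # characters a quantifier counts
units = len(subject.encode("utf-8"))   # data units in the UTF-8 subject
per_char = utf8_units(subject)         # data units per character
```

Here `chars` is 2 while `units` is 4, since U+0100 takes two bytes in UTF-8. A quantifier such as {2} counts the former, not the latter.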
# Line 2929  Cambridge CB2 3QH, England. Line 2953  Cambridge CB2 3QH, England.
2953  .rs  .rs
2954  .sp  .sp
2955  .nf  .nf
2956  Last updated: 08 August 2012  Last updated: 25 August 2012
2957  Copyright (c) 1997-2012 University of Cambridge.  Copyright (c) 1997-2012 University of Cambridge.
2958  .fi  .fi
