Diff of /code/trunk/doc/pcrepattern.3

revision 964 by ph10, Fri May 4 13:03:39 2012 UTC revision 1011 by ph10, Sat Aug 25 11:36:15 2012 UTC
# Line 198  In a UTF mode, only ASCII numbers and le Line 198  In a UTF mode, only ASCII numbers and le
198  backslash. All other characters (in particular, those whose codepoints are  backslash. All other characters (in particular, those whose codepoints are
199  greater than 127) are treated as literals.  greater than 127) are treated as literals.
200  .P  .P
201  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the  If a pattern is compiled with the PCRE_EXTENDED option, white space in the
202  pattern (other than in a character class) and characters between a # outside  pattern (other than in a character class) and characters between a # outside
203  a character class and the next newline are ignored. An escaping backslash can  a character class and the next newline are ignored. An escaping backslash can
204  be used to include a whitespace or # character as part of the pattern.  be used to include a white space or # character as part of the pattern.
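
The extended-mode rules above can be illustrated with Python's re.VERBOSE flag, which behaves similarly for these cases. This is an analogy in a different regex engine, not PCRE itself, and the pattern and subject strings are invented for the example.

```python
import re

# In verbose mode, unescaped white space and "#" comments are ignored.
# A backslash keeps a space or a hash as a literal part of the pattern.
PATTERN = re.compile(
    r"""
    a b      # the space between a and b is ignored
    \ c      # escaped space: a literal space must come before c
    \#       # escaped hash: matched literally
    """,
    re.VERBOSE,
)

# The pattern is effectively: ab, space, c, hash
matches_compact = PATTERN.fullmatch("ab c#") is not None
matches_spaced = PATTERN.fullmatch("a b c#") is not None
print(matches_compact, matches_spaced)
```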
205  .P  .P
206  If you want to remove the special meaning from a sequence of characters, you  If you want to remove the special meaning from a sequence of characters, you
207  can do so by putting them between \eQ and \eE. This is different from Perl in  can do so by putting them between \eQ and \eE. This is different from Perl in
# Line 237  one of the following escape sequences th Line 237  one of the following escape sequences th
237    \ea        alarm, that is, the BEL character (hex 07)    \ea        alarm, that is, the BEL character (hex 07)
238    \ecx       "control-x", where x is any ASCII character    \ecx       "control-x", where x is any ASCII character
239    \ee        escape (hex 1B)    \ee        escape (hex 1B)
240    \ef        formfeed (hex 0C)    \ef        form feed (hex 0C)
241    \en        linefeed (hex 0A)    \en        linefeed (hex 0A)
242    \er        carriage return (hex 0D)    \er        carriage return (hex 0D)
243    \et        tab (hex 09)    \et        tab (hex 09)
# Line 246  one of the following escape sequences th Line 246  one of the following escape sequences th
246    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)    \ex{hhh..} character with hex code hhh.. (non-JavaScript mode)
247    \euhhhh    character with hex code hhhh (JavaScript mode only)    \euhhhh    character with hex code hhhh (JavaScript mode only)
248  .sp  .sp
249  The precise effect of \ecx is as follows: if x is a lower case letter, it  The precise effect of \ecx on ASCII characters is as follows: if x is a lower
250  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  case letter, it is converted to upper case. Then bit 6 of the character (hex
251  Thus \ecz becomes hex 1A (z is 7A), but \ec{ becomes hex 3B ({ is 7B), while  40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
252  \ec; becomes hex 7B (; is 3B). If the byte following \ec has a value greater  but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
253  than 127, a compile-time error occurs. This locks out non-ASCII characters in  data item (byte or 16-bit value) following \ec has a value greater than 127, a
254  all modes. (When PCRE is compiled in EBCDIC mode, all byte values are valid. A  compile-time error occurs. This locks out non-ASCII characters in all modes.
255  lower case letter is converted to upper case, and then the 0xc0 bits are  .P
256  flipped.)  The \ec facility was designed for use with ASCII characters, but with the
257    extension to Unicode it is even less useful than it once was. It is, however,
258    recognized when PCRE is compiled in EBCDIC mode, where data items are always
259    bytes. In this mode, all values are valid after \ec. If the next character is a
260    lower case letter, it is converted to upper case. Then the 0xc0 bits of the
261    byte are inverted. Thus \ecA becomes hex 01, as in ASCII (A is C1), but because
262    the EBCDIC letters are disjoint, \ecZ becomes hex 29 (Z is E9), and other
263    characters also generate different values.
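
The control-character arithmetic described above can be checked with a short sketch. The function names are hypothetical, and the EBCDIC version assumes the usual letter positions (a-i at 81-89, j-r at 91-99, s-z at A2-A9, with upper case 0x40 higher); it models only the bit manipulation, not PCRE's parser.

```python
def ctrl_ascii(ch: str) -> int:
    """ASCII rule: upper-case a lower case letter, then invert bit 6 (hex 40)."""
    code = ord(ch)
    if code > 127:
        raise ValueError("non-ASCII character after the control escape")
    if "a" <= ch <= "z":
        code -= 0x20          # convert to upper case
    return code ^ 0x40        # invert bit 6


def ctrl_ebcdic(byte: int) -> int:
    """EBCDIC rule: upper-case a lower case letter, then invert the 0xc0 bits."""
    lower_ranges = ((0x81, 0x89), (0x91, 0x99), (0xA2, 0xA9))  # assumed layout
    if any(lo <= byte <= hi for lo, hi in lower_ranges):
        byte |= 0x40          # lower case to upper case in EBCDIC
    return byte ^ 0xC0        # invert the two top bits


print(hex(ctrl_ascii("z")), hex(ctrl_ascii("{")), hex(ctrl_ascii(";")))
print(hex(ctrl_ebcdic(0xC1)), hex(ctrl_ebcdic(0xE9)))
```

The ASCII values reproduce the examples in the text (z gives 1A, { gives 3B, ; gives 7B), and the EBCDIC values reproduce A giving 01 and Z giving 29.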
264  .P  .P
265  By default, after \ex, from zero to two hexadecimal digits are read (letters  By default, after \ex, from zero to two hexadecimal digits are read (letters
266  can be in upper or lower case). Any number of hexadecimal digits may appear  can be in upper or lower case). Any number of hexadecimal digits may appear
# Line 277  as just described only when it is follow Line 284  as just described only when it is follow
284  Otherwise, it matches a literal "x" character. In JavaScript mode, support for  Otherwise, it matches a literal "x" character. In JavaScript mode, support for
285  code points greater than 256 is provided by \eu, which must be followed by  code points greater than 256 is provided by \eu, which must be followed by
286  four hexadecimal digits; otherwise it matches a literal "u" character.  four hexadecimal digits; otherwise it matches a literal "u" character.
287    Character codes specified by \eu in JavaScript mode are constrained in the same
288    way as those specified by \ex in non-JavaScript mode.
289  .P  .P
290  Characters whose value is less than 256 can be defined by either of the two  Characters whose value is less than 256 can be defined by either of the two
291  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the  syntaxes for \ex (or by \eu in JavaScript mode). There is no difference in the
# Line 399  Another use of backslash is for specifyi Line 408  Another use of backslash is for specifyi
408  .sp  .sp
409    \ed     any decimal digit    \ed     any decimal digit
410    \eD     any character that is not a decimal digit    \eD     any character that is not a decimal digit
411    \eh     any horizontal whitespace character    \eh     any horizontal white space character
412    \eH     any character that is not a horizontal whitespace character    \eH     any character that is not a horizontal white space character
413    \es     any whitespace character    \es     any white space character
414    \eS     any character that is not a whitespace character    \eS     any character that is not a white space character
415    \ev     any vertical whitespace character    \ev     any vertical white space character
416    \eV     any character that is not a vertical whitespace character    \eV     any character that is not a vertical white space character
417    \ew     any "word" character    \ew     any "word" character
418    \eW     any "non-word" character    \eW     any "non-word" character
419  .sp  .sp
# Line 493  The vertical space characters are: Line 502  The vertical space characters are:
502  .sp  .sp
503    U+000A     Linefeed    U+000A     Linefeed
504    U+000B     Vertical tab    U+000B     Vertical tab
505    U+000C     Formfeed    U+000C     Form feed
506    U+000D     Carriage return    U+000D     Carriage return
507    U+0085     Next line    U+0085     Next line
508    U+2028     Line separator    U+2028     Line separator
# Line 520  below. Line 529  below.
529  .\"  .\"
530  This particular group matches either the two-character sequence CR followed by  This particular group matches either the two-character sequence CR followed by
531  LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,  LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
532  U+000B), FF (formfeed, U+000C), CR (carriage return, U+000D), or NEL (next  U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
533  line, U+0085). The two-character sequence is treated as a single unit that  line, U+0085). The two-character sequence is treated as a single unit that
534  cannot be split.  cannot be split.
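
A rough equivalent of the newline group described above, written for Python's re module as an assumption-laden sketch. It covers only the characters listed in this paragraph, and it is not atomic: unlike the PCRE group, a backtracking engine could in principle split CR from LF, so listing the two-character alternative first is what keeps CRLF together here.

```python
import re

# CRLF is tried first so that the pair is taken as a single unit;
# otherwise any one of LF, VT, FF, CR, or NEL is matched.
NEWLINE = re.compile(r"\r\n|[\n\x0b\x0c\r\x85]")


def find_newlines(text: str) -> list:
    """Return every newline sequence found in text, in order."""
    return NEWLINE.findall(text)


print(find_newlines("a\r\nb\nc\rd\x85"))
```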
535  .P  .P
# Line 567  The extra escape sequences are: Line 576  The extra escape sequences are:
576  .sp  .sp
577    \ep{\fIxx\fP}   a character with the \fIxx\fP property    \ep{\fIxx\fP}   a character with the \fIxx\fP property
578    \eP{\fIxx\fP}   a character without the \fIxx\fP property    \eP{\fIxx\fP}   a character without the \fIxx\fP property
579    \eX       an extended Unicode sequence    \eX       a Unicode extended grapheme cluster
580  .sp  .sp
581  The property names represented by \fIxx\fP above are limited to the Unicode  The property names represented by \fIxx\fP above are limited to the Unicode
582  script names, the general category properties, "Any", which matches any  script names, the general category properties, "Any", which matches any
# Line 777  Unicode table. Line 786  Unicode table.
786  Specifying caseless matching does not affect these escape sequences. For  Specifying caseless matching does not affect these escape sequences. For
787  example, \ep{Lu} always matches only upper case letters.  example, \ep{Lu} always matches only upper case letters.
788  .P  .P
789  The \eX escape matches any number of Unicode characters that form an extended  Matching characters by Unicode property is not fast, because PCRE has to do a
790  Unicode sequence. \eX is equivalent to  multistage table lookup in order to find a character's property. That is why
791  .sp  the traditional escape sequences such as \ed and \ew do not use Unicode
792    (?>\ePM\epM*)  properties in PCRE by default, though you can make them do so by setting the
793    PCRE_UCP option or by starting the pattern with (*UCP).
794    .
795    .
796    .SS Extended grapheme clusters
797    .rs
798  .sp  .sp
799  That is, it matches a character without the "mark" property, followed by zero  The \eX escape matches any number of Unicode characters that form an "extended
800  or more characters with the "mark" property, and treats the sequence as an  grapheme cluster", and treats the sequence as an atomic group
 atomic group  
801  .\" HTML <a href="#atomicgroup">  .\" HTML <a href="#atomicgroup">
802  .\" </a>  .\" </a>
803  (see below).  (see below).
804  .\"  .\"
805  Characters with the "mark" property are typically accents that affect the  Up to and including release 8.31, PCRE matched an earlier, simpler definition
806  preceding character. None of them have codepoints less than 256, so in  that was equivalent to
807  8-bit non-UTF-8 mode \eX matches any one character.  .sp
808  .P    (?>\ePM\epM*)
809  Note that recent versions of Perl have changed \eX to match what Unicode calls  .sp
810  an "extended grapheme cluster", which has a more complicated definition.  That is, it matched a character without the "mark" property, followed by zero
811  .P  or more characters with the "mark" property. Characters with the "mark"
812  Matching characters by Unicode property is not fast, because PCRE has to search  property are typically non-spacing accents that affect the preceding character.
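
The earlier, simpler definition (one character without the "mark" property followed by any number of characters with it) can be sketched with Python's unicodedata module. The function name is hypothetical, and the sketch assumes that the "mark" property corresponds to general categories beginning with M.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """True if ch has a general category starting with M (Mn, Mc, Me)."""
    return unicodedata.category(ch).startswith("M")


def match_simple_cluster(s: str, pos: int = 0):
    """Return the end index of a non-mark followed by marks, or None."""
    if pos >= len(s) or is_mark(s[pos]):
        return None           # a non-mark character is required first
    end = pos + 1
    while end < len(s) and is_mark(s[end]):
        end += 1              # greedily take following marks, never give back
    return end


print(match_simple_cluster("e\u0301x", 0))
```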
813  a structure that contains data for over fifteen thousand characters. That is  .P
814  why the traditional escape sequences such as \ed and \ew do not use Unicode  This simple definition was extended in Unicode to include more complicated
815  properties in PCRE by default, though you can make them do so by setting the  kinds of composite character by giving each character a grapheme breaking
816  PCRE_UCP option or by starting the pattern with (*UCP).  property, and creating rules that use these properties to define the boundaries
817    of extended grapheme clusters. In releases of PCRE later than 8.31, \eX matches
818    one of these clusters.
819    .P
820    \eX always matches at least one character. Then it decides whether to add
821    additional characters according to the following rules for ending a cluster:
822    .P
823    1. End at the end of the subject string.
824    .P
825    2. Do not end between CR and LF; otherwise end after any control character.
826    .P
827    3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
828    are of five types: L, V, T, LV, and LVT. An L character may be followed by an
829    L, V, LV, or LVT character; an LV or V character may be followed by a V or T
830    character; an LVT or T character may be followed only by a T character.
831    .P
832    4. Do not end before extending characters or spacing marks. Characters with
833    the "mark" property always have the "extend" grapheme breaking property.
834    .P
835    5. Do not end after prepend characters.
836    .P
837    6. Otherwise, end the cluster.
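
Rule 3 above can be written as a small successor table. The sketch below models only that rule, not the full cluster algorithm. The names are hypothetical, and the classification covers only the main Hangul Jamo and Hangul Syllables blocks (the extended jamo blocks are left out for brevity).

```python
# Which Hangul types may follow each type without ending the cluster.
ALLOWED_NEXT = {
    "L": {"L", "V", "LV", "LVT"},
    "V": {"V", "T"},
    "LV": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def hangul_type(ch: str):
    """Classify ch as L, V, T, LV, or LVT; return None for anything else."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"
    if 0x1160 <= cp <= 0x11A7:
        return "V"
    if 0x11A8 <= cp <= 0x11FF:
        return "T"
    if 0xAC00 <= cp <= 0xD7A3:
        # Precomposed syllables: every 28th one has no trailing consonant.
        return "LV" if (cp - 0xAC00) % 28 == 0 else "LVT"
    return None


def hangul_run_end(s: str, pos: int = 0) -> int:
    """Return the index just past the Hangul sequence that starts at pos."""
    prev = hangul_type(s[pos])
    end = pos + 1             # at least one character is always taken
    while prev is not None and end < len(s):
        nxt = hangul_type(s[end])
        if nxt not in ALLOWED_NEXT[prev]:
            break
        prev = nxt
        end += 1
    return end


print(hangul_run_end("\u1100\u1161\u11a8"))
```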
838  .  .
839  .  .
840  .\" HTML <a name="extraprops"></a>  .\" HTML <a name="extraprops"></a>
841  .SS PCRE's additional properties  .SS PCRE's additional properties
842  .rs  .rs
843  .sp  .sp
844  As well as the standard Unicode properties described in the previous  As well as the standard Unicode properties described above, PCRE supports four
845  section, PCRE supports four more that make it possible to convert traditional  more that make it possible to convert traditional escape sequences such as \ew
846  escape sequences such as \ew and \es and POSIX character classes to use Unicode  and \es and POSIX character classes to use Unicode properties. PCRE uses these
847  properties. PCRE uses these non-standard, non-Perl properties internally when  non-standard, non-Perl properties internally when PCRE_UCP is set. They are:
 PCRE_UCP is set. They are:  
848  .sp  .sp
849    Xan   Any alphanumeric character    Xan   Any alphanumeric character
850    Xps   Any POSIX space character    Xps   Any POSIX space character
# Line 819  PCRE_UCP is set. They are: Line 852  PCRE_UCP is set. They are:
852    Xwd   Any Perl "word" character    Xwd   Any Perl "word" character
853  .sp  .sp
854  Xan matches characters that have either the L (letter) or the N (number)  Xan matches characters that have either the L (letter) or the N (number)
855  property. Xps matches the characters tab, linefeed, vertical tab, formfeed, or  property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
856  carriage return, and any other character that has the Z (separator) property.  carriage return, and any other character that has the Z (separator) property.
857  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the  Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the
858  same characters as Xan, plus underscore.  same characters as Xan, plus underscore.
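
The four definitions above translate directly into predicates over Unicode general categories. This is a sketch using Python's unicodedata module with hypothetical function names; it mirrors the prose rather than PCRE's internal tables.

```python
import unicodedata


def is_xan(ch: str) -> bool:
    """Alphanumeric: general category L (letter) or N (number)."""
    return unicodedata.category(ch)[0] in "LN"


def is_xps(ch: str) -> bool:
    """POSIX space: tab, LF, VT, FF, CR, or any Z (separator) character."""
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


def is_xsp(ch: str) -> bool:
    """Same as Xps, except that vertical tab is excluded."""
    return ch != "\x0b" and is_xps(ch)


def is_xwd(ch: str) -> bool:
    """Word character: the same as Xan, plus underscore."""
    return ch == "_" or is_xan(ch)


print(is_xan("5"), is_xps("\x0b"), is_xsp("\x0b"), is_xwd("_"))
```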
# Line 1532  quantifier, but a literal string of four Line 1565  quantifier, but a literal string of four
1565  In UTF modes, quantifiers apply to characters rather than to individual data  In UTF modes, quantifiers apply to characters rather than to individual data
1566  units. Thus, for example, \ex{100}{2} matches two characters, each of  units. Thus, for example, \ex{100}{2} matches two characters, each of
1567  which is represented by a two-byte sequence in a UTF-8 string. Similarly,  which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1568  \eX{3} matches three Unicode extended sequences, each of which may be several  \eX{3} matches three Unicode extended grapheme clusters, each of which may be
1569  data units long (and they may be of different lengths).  several data units long (and they may be of different lengths).
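
The point that a quantifier counts characters rather than data units can be seen with U+0100, which occupies two bytes in UTF-8. The sketch uses Python's re module as a stand-in for a UTF mode; the variable names are illustrative only.

```python
import re

SUBJECT = "\u0100\u0100"

# {2} repeats the whole character U+0100, not one of its UTF-8 bytes.
match = re.fullmatch(r"\u0100{2}", SUBJECT)

char_count = len(match.group())
byte_count = len(match.group().encode("utf-8"))
print(char_count, byte_count)
```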
1570  .P  .P
1571  The quantifier {0} is permitted, causing the expression to behave as if the  The quantifier {0} is permitted, causing the expression to behave as if the
1572  previous item and the quantifier were not present. This may be useful for  previous item and the quantifier were not present. This may be useful for
# Line 1619  In cases where it is known that the subj Line 1652  In cases where it is known that the subj
1652  worth setting PCRE_DOTALL in order to obtain this optimization, or  worth setting PCRE_DOTALL in order to obtain this optimization, or
1653  alternatively using ^ to indicate anchoring explicitly.  alternatively using ^ to indicate anchoring explicitly.
1654  .P  .P
1655  However, there is one situation where the optimization cannot be used. When .*  However, there are some cases where the optimization cannot be used. When .*
1656  is inside capturing parentheses that are the subject of a back reference  is inside capturing parentheses that are the subject of a back reference
1657  elsewhere in the pattern, a match at the start may fail where a later one  elsewhere in the pattern, a match at the start may fail where a later one
1658  succeeds. Consider, for example:  succeeds. Consider, for example:
# Line 1629  succeeds. Consider, for example: Line 1662  succeeds. Consider, for example:
1662  If the subject is "xyz123abc123" the match point is the fourth character. For  If the subject is "xyz123abc123" the match point is the fourth character. For
1663  this reason, such a pattern is not implicitly anchored.  this reason, such a pattern is not implicitly anchored.
1664  .P  .P
1665    Another case where implicit anchoring is not applied is when the leading .* is
1666    inside an atomic group. Once again, a match at the start may fail where a later
1667    one succeeds. Consider this pattern:
1668    .sp
1669      (?>.*?a)b
1670    .sp
1671    It matches "ab" in the subject "aab". The use of the backtracking control verbs
1672    (*PRUNE) and (*SKIP) also disables this optimization.
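
Both cases can be reproduced with Python's re module, comparing a match forced to start at offset 0 (re.match) with an unanchored search. Two assumptions apply: the back reference pattern is not shown in this hunk, so (.*)abc\1 is used as a pattern that fits the described subject and match point, and atomic groups need Python 3.11 or later, so that part is skipped on older versions.

```python
import re
import sys

BACKREF = r"(.*)abc\1"          # assumed pattern; not shown in this hunk
SUBJECT = "xyz123abc123"

# Forcing the match to start at offset 0 fails...
anchored = re.match(BACKREF, SUBJECT)
# ...but an unanchored search succeeds at the fourth character (offset 3).
unanchored = re.search(BACKREF, SUBJECT)

ATOMIC_OK = sys.version_info >= (3, 11)
atomic = None
if ATOMIC_OK:
    # The atomic group keeps the first "a" it finds and cannot backtrack,
    # so the attempt at offset 0 fails and the one at offset 1 succeeds.
    atomic = re.search(r"(?>.*?a)b", "aab")

print(anchored, unanchored.start())
```

In both cases an engine that treated the leading .* as an implicit anchor would miss a match that exists later in the subject.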
1673    .P
1674  When a capturing subpattern is repeated, the value captured is the substring  When a capturing subpattern is repeated, the value captured is the substring
1675  that matched the final iteration. For example, after  that matched the final iteration. For example, after
1676  .sp  .sp
# Line 1843  Because there may be many capturing pare Line 1885  Because there may be many capturing pare
1885  following a backslash are taken as part of a potential back reference number.  following a backslash are taken as part of a potential back reference number.
1886  If the pattern continues with a digit character, some delimiter must be used to  If the pattern continues with a digit character, some delimiter must be used to
1887  terminate the back reference. If the PCRE_EXTENDED option is set, this can be  terminate the back reference. If the PCRE_EXTENDED option is set, this can be
1888  whitespace. Otherwise, the \eg{ syntax or an empty comment (see  white space. Otherwise, the \eg{ syntax or an empty comment (see
1889  .\" HTML <a href="#comments">  .\" HTML <a href="#comments">
1890  .\" </a>  .\" </a>
1891  "Comments"  "Comments"
# Line 2200  subroutines that can be referenced from Line 2242  subroutines that can be referenced from
2242  subroutines  subroutines
2243  .\"  .\"
2244  is described below.) For example, a pattern to match an IPv4 address such as  is described below.) For example, a pattern to match an IPv4 address such as
2245  "192.168.23.245" could be written like this (ignore whitespace and line  "192.168.23.245" could be written like this (ignore white space and line
2246  breaks):  breaks):
2247  .sp  .sp
2248    (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )    (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
# Line 2599  exception: the name from a *(MARK), (*PR Line 2641  exception: the name from a *(MARK), (*PR
2641  a successful positive assertion \fIis\fP passed back when a match succeeds  a successful positive assertion \fIis\fP passed back when a match succeeds
2642  (compare capturing parentheses in assertions). Note that such subpatterns are  (compare capturing parentheses in assertions). Note that such subpatterns are
2643  processed as anchored at the point where they are tested. Note also that Perl's  processed as anchored at the point where they are tested. Note also that Perl's
2644  treatment of subroutines is different in some cases.  treatment of subroutines and assertions is different in some cases.
2645  .P  .P
2646  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2647  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
# Line 2633  in the Line 2675  in the
2675  .\" HREF  .\" HREF
2676  \fBpcreapi\fP  \fBpcreapi\fP
2677  .\"  .\"
2678  documentation.  documentation.
2679  .P  .P
2680  Experiments with Perl suggest that it too has similar optimizations, sometimes  Experiments with Perl suggest that it too has similar optimizations, sometimes
2681  leading to anomalous results.  leading to anomalous results.
# Line 2727  attempts starting at "P" and then with a Line 2769  attempts starting at "P" and then with a
2769  (*MARK) item, but nevertheless do not reset it.  (*MARK) item, but nevertheless do not reset it.
2770  .P  .P
2771  If you are interested in (*MARK) values after failed matches, you should  If you are interested in (*MARK) values after failed matches, you should
2772  probably set the PCRE_NO_START_OPTIMIZE option  probably set the PCRE_NO_START_OPTIMIZE option
2773  .\" HTML <a href="#nooptimize">  .\" HTML <a href="#nooptimize">
2774  .\" </a>  .\" </a>
2775  (see above)  (see above)
2776  .\"  .\"
2777  to ensure that the match is always attempted.  to ensure that the match is always attempted.
2778  .  .
# Line 2911  Cambridge CB2 3QH, England. Line 2953  Cambridge CB2 3QH, England.
2953  .rs  .rs
2954  .sp  .sp
2955  .nf  .nf
2956  Last updated: 04 May 2012  Last updated: 25 August 2012
2957  Copyright (c) 1997-2012 University of Cambridge.  Copyright (c) 1997-2012 University of Cambridge.
2958  .fi  .fi
