Diff of /code/trunk/doc/html/pcrepattern.html

revision 1220 by ph10, Wed Oct 31 17:42:29 2012 UTC revision 1221 by ph10, Sun Nov 11 20:27:03 2012 UTC
# Line 63  description of PCRE's regular expression Line 63  description of PCRE's regular expression
63  <P>  <P>
64  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
65  there is now also support for UTF-8 strings in the original library, an  there is now also support for UTF-8 strings in the original library, an
66  extra library that supports 16-bit and UTF-16 character strings, and an  extra library that supports 16-bit and UTF-16 character strings, and a
67  extra library that supports 32-bit and UTF-32 character strings. To use these  third library that supports 32-bit and UTF-32 character strings. To use these
68  features, PCRE must be built to include appropriate support. When using UTF  features, PCRE must be built to include appropriate support. When using UTF
69  strings you must either call the compiling function with the PCRE_UTF8,  strings you must either call the compiling function with the PCRE_UTF8,
70  PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of  PCRE_UTF16, or PCRE_UTF32 option, or the pattern must start with one of
71  these special sequences:  these special sequences:
72  <pre>  <pre>
73    (*UTF8)    (*UTF8)
74    (*UTF16)    (*UTF16)
75    (*UTF32)    (*UTF32)
76      (*UTF)
77  </pre>  </pre>
78    (*UTF) is a generic sequence that can be used with any of the libraries.
79  Starting a pattern with such a sequence is equivalent to setting the relevant  Starting a pattern with such a sequence is equivalent to setting the relevant
80  option. This feature is not Perl-compatible. How setting a UTF mode affects  option. This feature is not Perl-compatible. How setting a UTF mode affects
81  pattern matching is mentioned in several places below. There is also a summary  pattern matching is mentioned in several places below. There is also a summary
# Line 83  page. Line 85  page.
85  </P>  </P>
86  <P>  <P>
87  Another special sequence that may appear at the start of a pattern or in  Another special sequence that may appear at the start of a pattern or in
88  combination with (*UTF8) or (*UTF16) or (*UTF32) is:  combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:
89  <pre>  <pre>
90    (*UCP)    (*UCP)
91  </pre>  </pre>
# Line 112  page. Line 114  page.
114  </P>  </P>
115  <br><a name="SEC2" href="#TOC1">EBCDIC CHARACTER CODES</a><br>  <br><a name="SEC2" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
116  <P>  <P>
117  PCRE can be compiled to run in an environment that uses EBCDIC as its character  PCRE can be compiled to run in an environment that uses EBCDIC as its character
118  code rather than ASCII or Unicode (typically a mainframe system). In the  code rather than ASCII or Unicode (typically a mainframe system). In the
119  sections below, character code values are ASCII or Unicode; in an EBCDIC  sections below, character code values are ASCII or Unicode; in an EBCDIC
120  environment these characters may have different code values, and there are no  environment these characters may have different code values, and there are no
121  code points greater than 255.  code points greater than 255.
122  <a name="newlines"></a></P>  <a name="newlines"></a></P>
123  <br><a name="SEC3" href="#TOC1">NEWLINE CONVENTIONS</a><br>  <br><a name="SEC3" href="#TOC1">NEWLINE CONVENTIONS</a><br>
# Line 152  they must be in upper case. If more than Line 154  they must be in upper case. If more than
154  is used.  is used.
155  </P>  </P>
156  <P>  <P>
157  The newline convention affects the interpretation of the dot metacharacter when  The newline convention affects where the circumflex and dollar assertions are
158  PCRE_DOTALL is not set, and also the behaviour of \N. However, it does not  true. It also affects the interpretation of the dot metacharacter when
159  affect what the \R escape sequence matches. By default, this is any Unicode  PCRE_DOTALL is not set, and the behaviour of \N. However, it does not affect
160  newline sequence, for Perl compatibility. However, this can be changed; see the  what the \R escape sequence matches. By default, this is any Unicode newline
161    sequence, for Perl compatibility. However, this can be changed; see the
162  description of \R in the section entitled  description of \R in the section entitled
163  <a href="#newlineseq">"Newline sequences"</a>  <a href="#newlineseq">"Newline sequences"</a>
164  below. A change of \R setting can be combined with a change of newline  below. A change of \R setting can be combined with a change of newline
# Line 298  recognized when PCRE is compiled in EBCD Line 301  recognized when PCRE is compiled in EBCD
301  bytes. In this mode, all values are valid after \c. If the next character is a  bytes. In this mode, all values are valid after \c. If the next character is a
302  lower case letter, it is converted to upper case. Then the 0xc0 bits of the  lower case letter, it is converted to upper case. Then the 0xc0 bits of the
303  byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because  byte are inverted. Thus \cA becomes hex 01, as in ASCII (A is C1), but because
304  the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other  the EBCDIC letters are disjoint, \cZ becomes hex 29 (Z is E9), and other
305  characters also generate different values.  characters also generate different values.
306  </P>  </P>
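
The \c rule for EBCDIC described above can be checked with a little arithmetic. The following is only a sketch, not PCRE's own code: the function name `ebcdic_ctrl` is hypothetical, and the lower-case byte ranges assume the usual EBCDIC letter layout (a-i at 0x81-0x89, j-r at 0x91-0x99, s-z at 0xA2-0xA9), which the text above does not spell out.

```python
def ebcdic_ctrl(byte: int) -> int:
    """Apply the \\c rule quoted above to a single EBCDIC byte value."""
    # Assumed layout: EBCDIC lower-case letters sit 0x40 below their
    # upper-case forms, in three disjoint ranges.
    lower_ranges = ((0x81, 0x89), (0x91, 0x99), (0xA2, 0xA9))
    if any(lo <= byte <= hi for lo, hi in lower_ranges):
        byte += 0x40  # convert lower case to upper case
    return byte ^ 0xC0  # invert the 0xc0 bits


ctrl_a = ebcdic_ctrl(0xC1)        # A is C1, so \cA gives 0x01
ctrl_z = ebcdic_ctrl(0xE9)        # Z is E9, so \cZ gives 0x29
ctrl_lower_a = ebcdic_ctrl(0x81)  # lower-case a is folded to A first
```

Because the letters are not contiguous in EBCDIC, the results for A and Z are not 25 apart as they are in ASCII, which is the point the paragraph makes.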
307  <P>  <P>
# Line 574  change of newline convention; for exampl Line 577  change of newline convention; for exampl
577  <pre>  <pre>
578    (*ANY)(*BSR_ANYCRLF)    (*ANY)(*BSR_ANYCRLF)
579  </pre>  </pre>
580  They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special  They can also be combined with the (*UTF8), (*UTF16), (*UTF32), (*UTF) or
581  sequences. Inside a character class, \R is treated as an unrecognized escape  (*UCP) special sequences. Inside a character class, \R is treated as an
582  sequence, and so matches the letter "R" by default, but causes an error if  unrecognized escape sequence, and so matches the letter "R" by default, but
583  PCRE_EXTRA is set.  causes an error if PCRE_EXTRA is set.
584  <a name="uniextseq"></a></P>  <a name="uniextseq"></a></P>
585  <br><b>  <br><b>
586  Unicode character properties  Unicode character properties
# Line 836  of extended grapheme clusters. In releas Line 839  of extended grapheme clusters. In releas
839  one of these clusters.  one of these clusters.
840  </P>  </P>
841  <P>  <P>
842  \X always matches at least one character. Then it decides whether to add  \X always matches at least one character. Then it decides whether to add
843  additional characters according to the following rules for ending a cluster:  additional characters according to the following rules for ending a cluster:
844  </P>  </P>
845  <P>  <P>
# Line 846  additional characters according to the f Line 849  additional characters according to the f
849  2. Do not end between CR and LF; otherwise end after any control character.  2. Do not end between CR and LF; otherwise end after any control character.
850  </P>  </P>
851  <P>  <P>
852  3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters  3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
853  are of five types: L, V, T, LV, and LVT. An L character may be followed by an  are of five types: L, V, T, LV, and LVT. An L character may be followed by an
854  L, V, LV, or LVT character; an LV or V character may be followed by a V or T  L, V, LV, or LVT character; an LV or V character may be followed by a V or T
855  character; an LVT or T character may be followed only by a T character.  character; an LVT or T character may be followed only by a T character.
856  </P>  </P>
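
Rule 3 amounts to a small "what may follow what" table. The sketch below encodes only that rule, in isolation from the other cluster-ending rules; the names `ALLOWED_NEXT`, `may_follow` and `cluster_length` are hypothetical and are not part of PCRE.

```python
# For each Hangul type, the types that may follow it without ending
# the cluster, transcribed from rule 3 above.
ALLOWED_NEXT = {
    "L": {"L", "V", "LV", "LVT"},
    "LV": {"V", "T"},
    "V": {"V", "T"},
    "LVT": {"T"},
    "T": {"T"},
}


def may_follow(prev: str, nxt: str) -> bool:
    """True if a character of type nxt may directly follow one of type prev."""
    return nxt in ALLOWED_NEXT.get(prev, set())


def cluster_length(types: list) -> int:
    """Count the leading items kept in one cluster under rule 3 alone."""
    if not types:
        return 0
    n = 1
    while n < len(types) and may_follow(types[n - 1], types[n]):
        n += 1
    return n


full_syllable = cluster_length(["L", "V", "T"])  # all three stay together
broken_early = cluster_length(["LVT", "V"])      # V may not follow LVT
```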
857  <P>  <P>
# Line 978  regular expression. Line 981  regular expression.
981  </P>  </P>
982  <br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>  <br><a name="SEC6" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
983  <P>  <P>
984    The circumflex and dollar metacharacters are zero-width assertions. That is,
985    they test for a particular condition being true without consuming any
986    characters from the subject string.
987    </P>
988    <P>
989  Outside a character class, in the default matching mode, the circumflex  Outside a character class, in the default matching mode, the circumflex
990  character is an assertion that is true only if the current matching point is  character is an assertion that is true only if the current matching point is at
991  at the start of the subject string. If the <i>startoffset</i> argument of  the start of the subject string. If the <i>startoffset</i> argument of
992  <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE  <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE
993  option is unset. Inside a character class, circumflex has an entirely different  option is unset. Inside a character class, circumflex has an entirely different
994  meaning  meaning
# Line 996  constrained to match only at the start o Line 1004  constrained to match only at the start o
1004  to be anchored.)  to be anchored.)
1005  </P>  </P>
1006  <P>  <P>
1007  A dollar character is an assertion that is true only if the current matching  The dollar character is an assertion that is true only if the current matching
1008  point is at the end of the subject string, or immediately before a newline  point is at the end of the subject string, or immediately before a newline at
1009  at the end of the string (by default). Dollar need not be the last character of  the end of the string (by default). Note, however, that it does not actually
1010  the pattern if a number of alternatives are involved, but it should be the last  match the newline. Dollar need not be the last character of the pattern if a
1011  item in any branch in which it appears. Dollar has no special meaning in a  number of alternatives are involved, but it should be the last item in any
1012  character class.  branch in which it appears. Dollar has no special meaning in a character class.
1013  </P>  </P>
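
The default behaviour of dollar can be illustrated briefly. This sketch uses Python's standard `re` module as a stand-in, on the assumption that its default handling of `$` matches the behaviour described above; it does not call PCRE itself.

```python
import re

# Dollar is true immediately before a newline at the very end of the
# subject, but the newline is not part of the match.
m = re.search(r"abc$", "abc\n")
span = m.span()
matched = m.group(0)

# A newline that is not at the end of the subject does not satisfy
# dollar in the default (non-multiline) mode.
mid_string = re.search(r"abc$", "abc\ndef")
```

Here `matched` is `"abc"` with span `(0, 3)`, so the trailing newline is left unconsumed, and `mid_string` is `None`.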
1014  <P>  <P>
1015  The meaning of dollar can be changed so that it matches only at the very end of  The meaning of dollar can be changed so that it matches only at the very end of
# Line 1344  the pattern can contain special leading Line 1352  the pattern can contain special leading
1352  what the application has set or what has been defaulted. Details are given in  what the application has set or what has been defaulted. Details are given in
1353  the section entitled  the section entitled
1354  <a href="#newlineseq">"Newline sequences"</a>  <a href="#newlineseq">"Newline sequences"</a>
1355  above. There are also the (*UTF8), (*UTF16), (*UTF32) and (*UCP) leading  above. There are also the (*UTF8), (*UTF16), (*UTF32), and (*UCP) leading
1356  sequences that can be used to set UTF and Unicode property modes; they are  sequences that can be used to set UTF and Unicode property modes; they are
1357  equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP  equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
1358  options, respectively.  options, respectively. The (*UTF) sequence is a generic version that can be
1359    used with any of the libraries.
1360  <a name="subpattern"></a></P>  <a name="subpattern"></a></P>
1361  <br><a name="SEC13" href="#TOC1">SUBPATTERNS</a><br>  <br><a name="SEC13" href="#TOC1">SUBPATTERNS</a><br>
1362  <P>  <P>
# Line 1674  one succeeds. Consider this pattern: Line 1683  one succeeds. Consider this pattern:
1683  <pre>  <pre>
1684    (?&#62;.*?a)b    (?&#62;.*?a)b
1685  </pre>  </pre>
1686  It matches "ab" in the subject "aab". The use of the backtracking control verbs  It matches "ab" in the subject "aab". The use of the backtracking control verbs
1687  (*PRUNE) and (*SKIP) also disables this optimization.  (*PRUNE) and (*SKIP) also disables this optimization.
1688  </P>  </P>
1689  <P>  <P>
# Line 2935  Cambridge CB2 3QH, England. Line 2944  Cambridge CB2 3QH, England.
2944  </P>  </P>
2945  <br><a name="SEC29" href="#TOC1">REVISION</a><br>  <br><a name="SEC29" href="#TOC1">REVISION</a><br>
2946  <P>  <P>
2947  Last updated: 10 September 2012  Last updated: 11 November 2012
2948  <br>  <br>
2949  Copyright &copy; 1997-2012 University of Cambridge.  Copyright &copy; 1997-2012 University of Cambridge.
2950  <br>  <br>

