
Diff of /code/trunk/doc/pcrepattern.3

revision 1033 by ph10, Mon Sep 10 11:02:48 2012 UTC revision 1055 by chpe, Tue Oct 16 15:53:30 2012 UTC
# Line 21  published by O'Reilly, covers regular ex Line 21  published by O'Reilly, covers regular ex
21  description of PCRE's regular expressions is intended as reference material.  description of PCRE's regular expressions is intended as reference material.
22  .P  .P
23  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
24  there is now also support for UTF-8 strings in the original library, and a  there is now also support for UTF-8 strings in the original library, an
25  second library that supports 16-bit and UTF-16 character strings. To use these  extra library that supports 16-bit and UTF-16 character strings, and an
26    extra library that supports 32-bit and UTF-32 character strings. To use these
27  features, PCRE must be built to include appropriate support. When using UTF  features, PCRE must be built to include appropriate support. When using UTF
28  strings you must either call the compiling function with the PCRE_UTF8 or  strings you must either call the compiling function with the PCRE_UTF8,
29  PCRE_UTF16 option, or the pattern must start with one of these special  PCRE_UTF16 or PCRE_UTF32 option, or the pattern must start with one of
30  sequences:  these special sequences:
31  .sp  .sp
32    (*UTF8)    (*UTF8)
33    (*UTF16)    (*UTF16)
34      (*UTF32)
35  .sp  .sp
36  Starting a pattern with such a sequence is equivalent to setting the relevant  Starting a pattern with such a sequence is equivalent to setting the relevant
37  option. This feature is not Perl-compatible. How setting a UTF mode affects  option. This feature is not Perl-compatible. How setting a UTF mode affects
# Line 41  of features in the Line 43  of features in the
43  page.  page.
44  .P  .P
45  Another special sequence that may appear at the start of a pattern or in  Another special sequence that may appear at the start of a pattern or in
46  combination with (*UTF8) or (*UTF16) is:  combination with (*UTF8) or (*UTF16) or (*UTF32) is:
47  .sp  .sp
48    (*UCP)    (*UCP)
49  .sp  .sp
# Line 57  of newlines; they are described below. Line 59  of newlines; they are described below.
59  .P  .P
60  The remainder of this document discusses the patterns that are supported by  The remainder of this document discusses the patterns that are supported by
61  PCRE when one of its main matching functions, \fBpcre_exec()\fP (8-bit) or  PCRE when one of its main matching functions, \fBpcre_exec()\fP (8-bit) or
62  \fBpcre16_exec()\fP (16-bit), is used. PCRE also has alternative matching  \fBpcre[16|32]_exec()\fP (16- or 32-bit), is used. PCRE also has alternative
63  functions, \fBpcre_dfa_exec()\fP and \fBpcre16_dfa_exec()\fP, which match using  matching functions, \fBpcre_dfa_exec()\fP and \fBpcre[16|32]_dfa_exec()\fP,
64  a different algorithm that is not Perl-compatible. Some of the features  which match using a different algorithm that is not Perl-compatible. Some of
65  discussed below are not available when DFA matching is used. The advantages and  the features discussed below are not available when DFA matching is used. The
66  disadvantages of the alternative functions, and how they differ from the normal  advantages and disadvantages of the alternative functions, and how they differ
67  functions, are discussed in the  from the normal functions, are discussed in the
68  .\" HREF  .\" HREF
69  \fBpcrematching\fP  \fBpcrematching\fP
70  .\"  .\"
# Line 280  between \ex{ and }, but the character co Line 282  between \ex{ and }, but the character co
282    8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint    8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
283    16-bit non-UTF mode   less than 0x10000    16-bit non-UTF mode   less than 0x10000
284    16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint    16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
285      32-bit non-UTF mode   less than 0x80000000
286      32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
287  .sp  .sp
288  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called  Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called
289  "surrogate" codepoints).  "surrogate" codepoints), and 0xffef.
290  .P  .P
291  If characters other than hexadecimal digits appear between \ex{ and }, or if  If characters other than hexadecimal digits appear between \ex{ and }, or if
292  there is no terminating }, this form of escape is not recognized. Instead, the  there is no terminating }, this form of escape is not recognized. Instead, the
# Line 568  change of newline convention; for exampl Line 572  change of newline convention; for exampl
572  .sp  .sp
573    (*ANY)(*BSR_ANYCRLF)    (*ANY)(*BSR_ANYCRLF)
574  .sp  .sp
575  They can also be combined with the (*UTF8), (*UTF16), or (*UCP) special  They can also be combined with the (*UTF8), (*UTF16), (*UTF32) or (*UCP) special
576  sequences. Inside a character class, \eR is treated as an unrecognized escape  sequences. Inside a character class, \eR is treated as an unrecognized escape
577  sequence, and so matches the letter "R" by default, but causes an error if  sequence, and so matches the letter "R" by default, but causes an error if
578  PCRE_EXTRA is set.  PCRE_EXTRA is set.
# Line 779  a modifier or "other". Line 783  a modifier or "other".
783  The Cs (Surrogate) property applies only to characters in the range U+D800 to  The Cs (Surrogate) property applies only to characters in the range U+D800 to
784  U+DFFF. Such characters are not valid in Unicode strings and so  U+DFFF. Such characters are not valid in Unicode strings and so
785  cannot be tested by PCRE, unless UTF validity checking has been turned off  cannot be tested by PCRE, unless UTF validity checking has been turned off
786  (see the discussion of PCRE_NO_UTF8_CHECK and PCRE_NO_UTF16_CHECK in the  (see the discussion of PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK and
787    PCRE_NO_UTF32_CHECK in the
788  .\" HREF  .\" HREF
789  \fBpcreapi\fP  \fBpcreapi\fP
790  .\"  .\"
# Line 1056  name; PCRE does not support this. Line 1061  name; PCRE does not support this.
1061  .sp  .sp
1062  Outside a character class, the escape sequence \eC matches any one data unit,  Outside a character class, the escape sequence \eC matches any one data unit,
1063  whether or not a UTF mode is set. In the 8-bit library, one data unit is one  whether or not a UTF mode is set. In the 8-bit library, one data unit is one
1064  byte; in the 16-bit library it is a 16-bit unit. Unlike a dot, \eC always  byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is
1065    a 32-bit unit. Unlike a dot, \eC always
1066  matches line-ending characters. The feature is provided in Perl in order to  matches line-ending characters. The feature is provided in Perl in order to
1067  match individual bytes in UTF-8 mode, but it is unclear how it can usefully be  match individual bytes in UTF-8 mode, but it is unclear how it can usefully be
1068  used. Because \eC breaks up characters into individual data units, matching one  used. Because \eC breaks up characters into individual data units, matching one
1069  unit with \eC in a UTF mode means that the rest of the string may start with a  unit with \eC in a UTF mode means that the rest of the string may start with a
1070  malformed UTF character. This has undefined results, because PCRE assumes that  malformed UTF character. This has undefined results, because PCRE assumes that
1071  it is dealing with valid UTF strings (and by default it checks this at the  it is dealing with valid UTF strings (and by default it checks this at the
1072  start of processing unless the PCRE_NO_UTF8_CHECK or PCRE_NO_UTF16_CHECK option  start of processing unless the PCRE_NO_UTF8_CHECK, PCRE_NO_UTF16_CHECK or
1073  is used).  PCRE_NO_UTF32_CHECK option is used).
1074  .P  .P
1075  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
1076  .\" HTML <a href="#lookbehind">  .\" HTML <a href="#lookbehind">
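
To make the notion of a data unit concrete, the following Python sketch splits a string into the units that \eC would step over one at a time in each library width. It uses Python's own codecs rather than PCRE, and the helper names are hypothetical. The second function shows why matching a single unit can leave the rest of a UTF-8 string starting with a malformed character.

```python
import struct


def data_units(text, width):
    """Split text into 8-, 16- or 32-bit data units (UTF-8/16/32 encoded)."""
    if width == 8:
        return list(text.encode("utf-8"))
    codec, fmt, size = {
        16: ("utf-16-le", "<%dH", 2),
        32: ("utf-32-le", "<%dI", 4),
    }[width]
    raw = text.encode(codec)
    return list(struct.unpack(fmt % (len(raw) // size), raw))


def remainder_is_valid_utf8(text):
    """Drop the first UTF-8 data unit and report whether the rest decodes."""
    rest = text.encode("utf-8")[1:]
    try:
        rest.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A character such as U+1F600 occupies four units in the 8-bit form, two in the 16-bit form and one in the 32-bit form, so the effect of \eC depends on which library is in use.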
# Line 1123  circumflex is not an assertion; it still Line 1129  circumflex is not an assertion; it still
1129  string, and therefore it fails if the current pointer is at the end of the  string, and therefore it fails if the current pointer is at the end of the
1130  string.  string.
1131  .P  .P
1132  In UTF-8 (UTF-16) mode, characters with values greater than 255 (0xffff) can be  In UTF-8 (UTF-16, UTF-32) mode, characters with values greater than 255 (0xffff)
1133  included in a class as a literal string of data units, or by using the \ex{  can be included in a class as a literal string of data units, or by using the
1134  escaping mechanism.  \ex{ escaping mechanism.
1135  .P  .P
1136  When caseless matching is set, any letters in a class represent both their  When caseless matching is set, any letters in a class represent both their
1137  upper case and lower case versions, so for example, a caseless [aeiou] matches  upper case and lower case versions, so for example, a caseless [aeiou] matches
# Line 1338  the section entitled Line 1344  the section entitled
1344  .\" </a>  .\" </a>
1345  "Newline sequences"  "Newline sequences"
1346  .\"  .\"
1347  above. There are also the (*UTF8), (*UTF16), and (*UCP) leading sequences that  above. There are also the (*UTF8), (*UTF16), (*UTF32) and (*UCP) leading
1348  can be used to set UTF and Unicode property modes; they are equivalent to  sequences that can be used to set UTF and Unicode property modes; they are
1349  setting the PCRE_UTF8, PCRE_UTF16, and the PCRE_UCP options, respectively.  equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
1350    options, respectively.
1351  .  .
1352  .  .
1353  .\" HTML <a name="subpattern"></a>  .\" HTML <a name="subpattern"></a>
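
The equivalence between the leading sequences and the compile options can be sketched in Python as a simple lookup. The function below is a hypothetical illustration, not part of PCRE: it only strips the four sequences named in this hunk and reports the option each one stands for. It does not check which sequence is legal in which library, and it ignores the newline-related sequences.

```python
# Hypothetical mapping from leading sequences to the option names quoted
# in the text; this is an illustration, not PCRE's own parser.
LEADING_SEQUENCES = {
    "(*UTF8)": "PCRE_UTF8",
    "(*UTF16)": "PCRE_UTF16",
    "(*UTF32)": "PCRE_UTF32",
    "(*UCP)": "PCRE_UCP",
}


def split_leading_sequences(pattern):
    """Return (list of implied option names, remaining pattern text)."""
    options = []
    matched = True
    while matched:
        matched = False
        for seq, option in LEADING_SEQUENCES.items():
            if pattern.startswith(seq):
                options.append(option)
                pattern = pattern[len(seq):]
                matched = True
                break
    return options, pattern
```

Under these assumptions, a pattern written as `(*UTF8)(*UCP)\w+` yields the options PCRE_UTF8 and PCRE_UCP together with the remaining pattern `\w+`.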
# Line 2602  same pair of parentheses when there is a Line 2609  same pair of parentheses when there is a
2609  PCRE provides a similar feature, but of course it cannot obey arbitrary Perl  PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
2610  code. The feature is called "callout". The caller of PCRE provides an external  code. The feature is called "callout". The caller of PCRE provides an external
2611  function by putting its entry point in the global variable \fIpcre_callout\fP  function by putting its entry point in the global variable \fIpcre_callout\fP
2612  (8-bit library) or \fIpcre16_callout\fP (16-bit library). By default, this  (8-bit library) or \fIpcre[16|32]_callout\fP (16-bit or 32-bit library).
2613  variable contains NULL, which disables all calling out.  By default, this variable contains NULL, which disables all calling out.
2614  .P  .P
2615  Within a regular expression, (?C) indicates the points at which the external  Within a regular expression, (?C) indicates the points at which the external
2616  function is to be called. If you want to identify different callout points, you  function is to be called. If you want to identify different callout points, you
# Line 2658  parenthesis followed by an asterisk. The Line 2665  parenthesis followed by an asterisk. The
2665  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,
2666  depending on whether or not an argument is present. A name is any sequence of  depending on whether or not an argument is present. A name is any sequence of
2667  characters that does not include a closing parenthesis. The maximum length of  characters that does not include a closing parenthesis. The maximum length of
2668  name is 255 in the 8-bit library and 65535 in the 16-bit library. If the name  name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit libraries.
2669  is empty, that is, if the closing parenthesis immediately follows the colon,  If the name is empty, that is, if the closing parenthesis immediately follows
2670  the effect is as if the colon were not there. Any number of these verbs may  the colon, the effect is as if the colon were not there. Any number of these
2671  occur in a pattern.  verbs may occur in a pattern.
2672  .  .
2673  .  .
2674  .\" HTML <a name="nooptimize"></a>  .\" HTML <a name="nooptimize"></a>
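
The rules for verb names can be summarised with a small Python sketch. The parser is hypothetical and far simpler than PCRE's: it accepts one (*VERB) or (*VERB:NAME) item, treats an empty name as if the colon were absent, and applies the length limits quoted above. Measuring the length in Python characters is an assumption; PCRE may count it differently.

```python
# Maximum name length per library width, as quoted in the text.
MAX_NAME = {8: 255, 16: 65535, 32: 65535}


def parse_verb(item, width=8):
    """Parse '(*VERB)' or '(*VERB:NAME)' into (verb, name_or_None)."""
    if not (item.startswith("(*") and item.endswith(")")):
        raise ValueError("not a (*VERB) item")
    body = item[2:-1]
    if ")" in body:
        raise ValueError("a name may not contain a closing parenthesis")
    verb, _, name = body.partition(":")
    # Assumption: length is counted in Python characters.
    if len(name) > MAX_NAME[width]:
        raise ValueError("name too long for this library width")
    # An empty name behaves as if the colon were not there.
    return verb, (name or None)
```

With this sketch, `(*PRUNE:)` and `(*PRUNE)` parse to the same result, and a 256-character name is rejected for the 8-bit width but accepted for the 16-bit and 32-bit widths.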
# Line 2946  overrides. Line 2953  overrides.
2953  .rs  .rs
2954  .sp  .sp
2955  \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),  \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
2956  \fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP.  \fBpcresyntax\fP(3), \fBpcre\fP(3), \fBpcre16(3)\fP, \fBpcre32(3)\fP.
2957  .  .
2958  .  .
2959  .SH AUTHOR  .SH AUTHOR
