Diff of /code/trunk/doc/html/pcrepattern.html

revision 1319 by ph10, Fri Mar 22 16:13:13 2013 UTC revision 1320 by ph10, Wed May 1 16:39:35 2013 UTC
# Line 14  man page, in case the conversion went wr Line 14  man page, in case the conversion went wr
14  <br>  <br>
15  <ul>  <ul>
16  <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>  <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
17  <li><a name="TOC2" href="#SEC2">EBCDIC CHARACTER CODES</a>  <li><a name="TOC2" href="#SEC2">SPECIAL START-OF-PATTERN ITEMS</a>
18  <li><a name="TOC3" href="#SEC3">NEWLINE CONVENTIONS</a>  <li><a name="TOC3" href="#SEC3">EBCDIC CHARACTER CODES</a>
19  <li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a>  <li><a name="TOC4" href="#SEC4">CHARACTERS AND METACHARACTERS</a>
20  <li><a name="TOC5" href="#SEC5">BACKSLASH</a>  <li><a name="TOC5" href="#SEC5">BACKSLASH</a>
21  <li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a>  <li><a name="TOC6" href="#SEC6">CIRCUMFLEX AND DOLLAR</a>
# Line 61  published by O'Reilly, covers regular ex Line 61  published by O'Reilly, covers regular ex
61  description of PCRE's regular expressions is intended as reference material.  description of PCRE's regular expressions is intended as reference material.
62  </P>  </P>
63  <P>  <P>
64    This document discusses the patterns that are supported by PCRE when one of its
65    main matching functions, <b>pcre_exec()</b> (8-bit) or <b>pcre[16|32]_exec()</b>
66    (16- or 32-bit), is used. PCRE also has alternative matching functions,
67    <b>pcre_dfa_exec()</b> and <b>pcre[16|32]_dfa_exec()</b>, which match using a
68    different algorithm that is not Perl-compatible. Some of the features discussed
69    below are not available when DFA matching is used. The advantages and
70    disadvantages of the alternative functions, and how they differ from the normal
71    functions, are discussed in the
72    <a href="pcrematching.html"><b>pcrematching</b></a>
73    page.
74    </P>
75    <br><a name="SEC2" href="#TOC1">SPECIAL START-OF-PATTERN ITEMS</a><br>
76    <P>
77    A number of options that can be passed to <b>pcre_compile()</b> can also be set
78    by special items at the start of a pattern. These are not Perl-compatible, but
79    are provided to make these options accessible to pattern writers who are not
80    able to change the program that processes the pattern. Any number of these
81    items may appear, but they must all be together right at the start of the
82    pattern string, and the letters must be in upper case.
83    </P>
84    <br><b>
85    UTF support
86    </b><br>
87    <P>
88  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
89  there is now also support for UTF-8 strings in the original library, an  there is now also support for UTF-8 strings in the original library, an
90  extra library that supports 16-bit and UTF-16 character strings, and a  extra library that supports 16-bit and UTF-16 character strings, and a
# Line 77  these special sequences: Line 101  these special sequences:
101  </pre>  </pre>
102  (*UTF) is a generic sequence that can be used with any of the libraries.  (*UTF) is a generic sequence that can be used with any of the libraries.
103  Starting a pattern with such a sequence is equivalent to setting the relevant  Starting a pattern with such a sequence is equivalent to setting the relevant
104  option. This feature is not Perl-compatible. How setting a UTF mode affects  option. How setting a UTF mode affects pattern matching is mentioned in several
105  pattern matching is mentioned in several places below. There is also a summary  places below. There is also a summary of features in the
 of features in the  
106  <a href="pcreunicode.html"><b>pcreunicode</b></a>  <a href="pcreunicode.html"><b>pcreunicode</b></a>
107  page.  page.
108  </P>  </P>
109  <P>  <P>
110  Another special sequence that may appear at the start of a pattern or in  Some applications that allow their users to supply patterns may wish to
111  combination with (*UTF8), (*UTF16), (*UTF32) or (*UTF) is:  restrict them to non-UTF data for security reasons. If the PCRE_NEVER_UTF
112    option is set at compile time, (*UTF) etc. are not allowed, and their
113    appearance causes an error.
114    </P>
115    <br><b>
116    Unicode property support
117    </b><br>
118    <P>
119    Another special sequence that may appear at the start of a pattern is
120  <pre>  <pre>
121    (*UCP)    (*UCP)
122  </pre>  </pre>
# Line 94  such as \d and \w to use Unicode propert Line 125  such as \d and \w to use Unicode propert
125  instead of recognizing only characters with codes less than 128 via a lookup  instead of recognizing only characters with codes less than 128 via a lookup
126  table.  table.
127  </P>  </P>
128    <br><b>
129    Disabling start-up optimizations
130    </b><br>
131  <P>  <P>
132  If a pattern starts with (*NO_START_OPT), it has the same effect as setting the  If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
133  PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are  PCRE_NO_START_OPTIMIZE option either at compile or matching time.
 also some more of these special sequences that are concerned with the handling  
 of newlines; they are described below.  
 </P>  
 <P>  
 The remainder of this document discusses the patterns that are supported by  
 PCRE when one of its main matching functions, <b>pcre_exec()</b> (8-bit) or  
 <b>pcre[16|32]_exec()</b> (16- or 32-bit), is used. PCRE also has alternative  
 matching functions, <b>pcre_dfa_exec()</b> and <b>pcre[16|32]_dfa_exec()</b>,  
 which match using a different algorithm that is not Perl-compatible. Some of  
 the features discussed below are not available when DFA matching is used. The  
 advantages and disadvantages of the alternative functions, and how they differ  
 from the normal functions, are discussed in the  
 <a href="pcrematching.html"><b>pcrematching</b></a>  
 page.  
 </P>  
 <br><a name="SEC2" href="#TOC1">EBCDIC CHARACTER CODES</a><br>  
 <P>  
 PCRE can be compiled to run in an environment that uses EBCDIC as its character  
 code rather than ASCII or Unicode (typically a mainframe system). In the  
 sections below, character code values are ASCII or Unicode; in an EBCDIC  
 environment these characters may have different code values, and there are no  
 code points greater than 255.  
134  <a name="newlines"></a></P>  <a name="newlines"></a></P>
135  <br><a name="SEC3" href="#TOC1">NEWLINE CONVENTIONS</a><br>  <br><b>
136    Newline conventions
137    </b><br>
138  <P>  <P>
139  PCRE supports five different conventions for indicating line breaks in  PCRE supports five different conventions for indicating line breaks in
140  strings: a single CR (carriage return) character, a single LF (linefeed)  strings: a single CR (carriage return) character, a single LF (linefeed)
# Line 148  example, on a Unix system where LF is th Line 162  example, on a Unix system where LF is th
162    (*CR)a.b    (*CR)a.b
163  </pre>  </pre>
164  changes the convention to CR. That pattern matches "a\nb" because LF is no  changes the convention to CR. That pattern matches "a\nb" because LF is no
165  longer a newline. Note that these special settings, which are not  longer a newline. If more than one of these settings is present, the last one
 Perl-compatible, are recognized only at the very start of a pattern, and that  
 they must be in upper case. If more than one of them is present, the last one  
166  is used.  is used.
167  </P>  </P>
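The (*CR) example above can be illustrated with a small sketch. Python's `re` module does not recognize (*CR), so this is only an emulation under stated assumptions: the dot is rewritten by hand as a negated character class that excludes whichever single character counts as a newline. The function name `matches_a_dot_b` and the `EMULATED_PATTERNS` table are hypothetical names used for illustration and are not part of PCRE.

```python
import re

# Hand-written stand-ins for the pattern a.b under two single-character
# newline conventions. By default a dot does not match a newline, so it is
# modelled here as "any character except the newline character".
# This is an emulation for illustration only, not how PCRE is implemented.
EMULATED_PATTERNS = {
    "LF": re.compile(r"a[^\n]b"),  # LF is the newline: dot refuses "\n"
    "CR": re.compile(r"a[^\r]b"),  # CR is the newline: dot refuses "\r"
}


def matches_a_dot_b(subject: str, convention: str) -> bool:
    """Report whether the emulated a.b finds a match in subject."""
    return EMULATED_PATTERNS[convention].search(subject) is not None


if __name__ == "__main__":
    for convention in ("LF", "CR"):
        print(convention, matches_a_dot_b("a\nb", convention))
```

Under the emulated CR convention the subject "a\nb" matches because LF is an ordinary character there, which mirrors the statement in the text.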
168  <P>  <P>
# Line 164  description of \R in the section entitle Line 176  description of \R in the section entitle
176  below. A change of \R setting can be combined with a change of newline  below. A change of \R setting can be combined with a change of newline
177  convention.  convention.
178  </P>  </P>
179    <br><b>
180    Setting match and recursion limits
181    </b><br>
182    <P>
183    The caller of <b>pcre_exec()</b> can set a limit on the number of times the
184    internal <b>match()</b> function is called and on the maximum depth of
185    recursive calls. These facilities are provided to catch runaway matches that
186    are provoked by patterns with huge matching trees (a typical example is a
187    pattern with nested unlimited repeats) and to avoid running out of system stack
188    by too much recursion. When one of these limits is reached, <b>pcre_exec()</b>
189    gives an error return. The limits can also be set by items at the start of the
190    pattern of the form
191    <pre>
192      (*LIMIT_MATCH=d)
193      (*LIMIT_RECURSION=d)
194    </pre>
195    where d is any number of decimal digits. However, the value of the setting must
196    be less than the value set by the caller of <b>pcre_exec()</b> for it to have
197    any effect. In other words, the pattern writer can lower the limit set by the
198    programmer, but not raise it. If there is more than one setting of one of these
199    limits, the lower value is used.
200    </P>
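The rule that a pattern can lower a limit but not raise it can be modelled in a few lines. The following is a minimal sketch in Python, not PCRE's own code: the function `effective_limits`, the regular expression used to scan the leading items, and the set of other start-of-pattern item names are assumptions made for illustration. It also assumes that scanning stops at the first thing that is not a recognized upper-case start item.

```python
import re

# One leading item of the form (*NAME) or (*NAME=digits), upper case only.
_ITEM = re.compile(r"\(\*([A-Z0-9_]+)(?:=(\d+))?\)")

# Other start-of-pattern items that may be mixed in with the limit settings.
# This list is an assumption for the sketch.
_OTHER_START_ITEMS = {
    "UTF8", "UTF16", "UTF32", "UTF", "UCP", "NO_START_OPT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "BSR_ANYCRLF", "BSR_UNICODE",
}


def effective_limits(pattern, caller_match_limit, caller_recursion_limit):
    """Return (match limit, recursion limit) after applying leading items."""
    limits = {
        "LIMIT_MATCH": caller_match_limit,
        "LIMIT_RECURSION": caller_recursion_limit,
    }
    pos = 0
    while True:
        item = _ITEM.match(pattern, pos)
        if item is None:
            break  # the run of start-of-pattern items has ended
        name, digits = item.group(1), item.group(2)
        if name in limits and digits is not None:
            # min() covers both rules: a value above the caller's limit has
            # no effect, and with several settings the lowest one wins.
            limits[name] = min(limits[name], int(digits))
        elif name in _OTHER_START_ITEMS and digits is None:
            pass  # some other start item; keep scanning
        else:
            break  # not a start item, so the real pattern begins here
        pos = item.end()
    return limits["LIMIT_MATCH"], limits["LIMIT_RECURSION"]


if __name__ == "__main__":
    print(effective_limits("(*LIMIT_MATCH=50)(*LIMIT_MATCH=20)a+", 100, 1000))
    print(effective_limits("(*LIMIT_RECURSION=5000)a+", 100, 1000))
```

Using `min()` keeps the sketch short: the caller's value is the starting point, and each setting in the pattern can only pull it down.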
201    <br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br>
202    <P>
203    PCRE can be compiled to run in an environment that uses EBCDIC as its character
204    code rather than ASCII or Unicode (typically a mainframe system). In the
205    sections below, character code values are ASCII or Unicode; in an EBCDIC
206    environment these characters may have different code values, and there are no
207    code points greater than 255.
208    </P>
209  <br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>  <br><a name="SEC4" href="#TOC1">CHARACTERS AND METACHARACTERS</a><br>
210  <P>  <P>
211  A regular expression is a pattern that is matched against a subject string from  A regular expression is a pattern that is matched against a subject string from
# Line 1368  above. There are also the (*UTF8), (*UTF Line 1410  above. There are also the (*UTF8), (*UTF
1410  sequences that can be used to set UTF and Unicode property modes; they are  sequences that can be used to set UTF and Unicode property modes; they are
1411  equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP  equivalent to setting the PCRE_UTF8, PCRE_UTF16, PCRE_UTF32 and the PCRE_UCP
1412  options, respectively. The (*UTF) sequence is a generic version that can be  options, respectively. The (*UTF) sequence is a generic version that can be
1413  used with any of the libraries.  used with any of the libraries. However, the application can set the
1414    PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences.
1415  <a name="subpattern"></a></P>  <a name="subpattern"></a></P>
1416  <br><a name="SEC13" href="#TOC1">SUBPATTERNS</a><br>  <br><a name="SEC13" href="#TOC1">SUBPATTERNS</a><br>
1417  <P>  <P>
# Line 2647  remarks apply to the PCRE features descr Line 2690  remarks apply to the PCRE features descr
2690  <P>  <P>
2691  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
2692  parenthesis followed by an asterisk. They are generally of the form  parenthesis followed by an asterisk. They are generally of the form
2693  (*VERB) or (*VERB:NAME). Some may take either form, with differing behaviour,  (*VERB) or (*VERB:NAME). Some may take either form, possibly behaving
2694  depending on whether or not a name is present. A name is any sequence of  differently depending on whether or not a name is present. A name is any
2695  characters that does not include a closing parenthesis. The maximum length of  sequence of characters that does not include a closing parenthesis. The maximum
2696  name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit libraries.  length of name is 255 in the 8-bit library and 65535 in the 16-bit and 32-bit
2697  If the name is empty, that is, if the closing parenthesis immediately follows  libraries. If the name is empty, that is, if the closing parenthesis
2698  the colon, the effect is as if the colon were not there. Any number of these  immediately follows the colon, the effect is as if the colon were not there.
2699  verbs may occur in a pattern.  Any number of these verbs may occur in a pattern.
2700  </P>  </P>
2701  <P>  <P>
2702  Since these verbs are specifically related to backtracking, most of them can be  Since these verbs are specifically related to backtracking, most of them can be
# Line 2767  of obtaining this information than putti Line 2810  of obtaining this information than putti
2810  capturing parentheses.  capturing parentheses.
2811  </P>  </P>
2812  <P>  <P>
2813  If a verb with a name is encountered in a positive assertion, its name is  If a verb with a name is encountered in a positive assertion that is true, the
2814  recorded and passed back if it is the last-encountered. This does not happen  name is recorded and passed back if it is the last-encountered. This does not
2815  for negative assertions.  happen for negative assertions or failing positive assertions.
2816  </P>  </P>
2817  <P>  <P>
2818  After a partial match or a failed match, the last encountered name in the  After a partial match or a failed match, the last encountered name in the
# Line 2798  The following verbs do nothing when they Line 2841  The following verbs do nothing when they
2841  with what follows, but if there is no subsequent match, causing a backtrack to  with what follows, but if there is no subsequent match, causing a backtrack to
2842  the verb, a failure is forced. That is, backtracking cannot pass to the left of  the verb, a failure is forced. That is, backtracking cannot pass to the left of
2843  the verb. However, when one of these verbs appears inside an atomic group or an  the verb. However, when one of these verbs appears inside an atomic group or an
2844  assertion, its effect is confined to that group, because once the group has  assertion that is true, its effect is confined to that group, because once the
2845  been matched, there is never any backtracking into it. In this situation,  group has been matched, there is never any backtracking into it. In this
2846  backtracking can "jump back" to the left of the entire atomic group or  situation, backtracking can "jump back" to the left of the entire atomic group
2847  assertion. (Remember also, as stated above, that this localization also applies  or assertion. (Remember also, as stated above, that this localization also
2848  in subroutine calls.)  applies in subroutine calls.)
2849  </P>  </P>
2850  <P>  <P>
2851  These verbs differ in exactly what kind of failure occurs when backtracking  These verbs differ in exactly what kind of failure occurs when backtracking
2852  reaches them.  reaches them. The behaviour described below is what happens when the verb is
2853    not in a subroutine or an assertion. Subsequent sections cover these special
2854    cases.
2855  <pre>  <pre>
2856    (*COMMIT)    (*COMMIT)
2857  </pre>  </pre>
# Line 2906  pattern-based if-then-else block: Line 2951  pattern-based if-then-else block:
2951  </pre>  </pre>
2952  If the COND1 pattern matches, FOO is tried (and possibly further items after  If the COND1 pattern matches, FOO is tried (and possibly further items after
2953  the end of the group if FOO succeeds); on failure, the matcher skips to the  the end of the group if FOO succeeds); on failure, the matcher skips to the
2954  second alternative and tries COND2, without backtracking into COND1.  second alternative and tries COND2, without backtracking into COND1. If that
2955  If (*THEN) is not inside an alternation, it acts like (*PRUNE).  succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
2956    more alternatives, so there is a backtrack to whatever came before the entire
2957    group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
2958  </P>  </P>
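The control flow of the if-then-else block described above can be sketched as ordinary code. This is a simplified model in Python, not PCRE's matcher: each condition and each body is reduced to a hypothetical callable that returns True or False, and the function name `run_then_block` is an assumption made for illustration. Items that follow the group, and any alternative ways a condition might match, are left out of the model.

```python
from typing import Callable, Iterable, Tuple

Step = Callable[[], bool]


def run_then_block(branches: Iterable[Tuple[Step, Step]]) -> bool:
    """Model (COND1 (*THEN) FOO | COND2 (*THEN) BAR | ...) as plain code.

    Returns True as soon as one alternative succeeds. Returns False when
    every alternative has failed, which corresponds to a backtrack to
    whatever came before the entire group.
    """
    for cond, body in branches:
        if not cond():
            continue  # condition failed: ordinary move to the next alternative
        if body():
            return True
        # Body failed after (*THEN): abandon this alternative and go to the
        # next one without trying the condition again.
    return False


if __name__ == "__main__":
    trace = []

    def step(name: str, result: bool) -> Step:
        def run() -> bool:
            trace.append(name)
            return result
        return run

    outcome = run_then_block([
        (step("COND1", True), step("FOO", False)),
        (step("COND2", True), step("BAR", True)),
        (step("COND3", True), step("BAZ", True)),
    ])
    print(outcome, trace)
```

Each condition runs at most once in this model, which is the point of (*THEN): a failure in the body moves on to the next alternative instead of backtracking into the condition.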
2959  <P>  <P>
2960  The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).  The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
# Line 3007  further processing. In a negative assert Line 3054  further processing. In a negative assert
3054  fail without any further processing.  fail without any further processing.
3055  </P>  </P>
3056  <P>  <P>
3057  The other backtracking verbs are not treated specially if they appear in an  The other backtracking verbs are not treated specially if they appear in a
3058  assertion. In particular, (*THEN) skips to the next alternative in the  positive assertion. In particular, (*THEN) skips to the next alternative in the
3059  innermost enclosing group that has alternations, whether or not this is within  innermost enclosing group that has alternations, whether or not this is within
3060  the assertion.  the assertion.
3061    </P>
3062    <P>
3063    Negative assertions are, however, different, in order to ensure that changing a
3064    positive assertion into a negative assertion changes its result. Backtracking
3065    into (*COMMIT), (*SKIP), or (*PRUNE) causes a negative assertion to be true,
3066    without considering any further alternative branches in the assertion.
3067    Backtracking into (*THEN) causes it to skip to the next enclosing alternative
3068    within the assertion (the normal behaviour), but if the assertion does not have
3069    such an alternative, (*THEN) behaves like (*PRUNE).
3070  <a name="btsub"></a></P>  <a name="btsub"></a></P>
3071  <br><b>  <br><b>
3072  Backtracking verbs in subroutines  Backtracking verbs in subroutines
# Line 3053  Cambridge CB2 3QH, England. Line 3109  Cambridge CB2 3QH, England.
3109  </P>  </P>
3110  <br><a name="SEC29" href="#TOC1">REVISION</a><br>  <br><a name="SEC29" href="#TOC1">REVISION</a><br>
3111  <P>  <P>
3112  Last updated: 22 March 2013  Last updated: 26 April 2013
3113  <br>  <br>
3114  Copyright &copy; 1997-2013 University of Cambridge.  Copyright &copy; 1997-2013 University of Cambridge.
3115  <br>  <br>

