Diff of /code/trunk/doc/html/pcrepattern.html

revision 578 by ph10, Wed Nov 17 17:55:57 2010 UTC revision 579 by ph10, Wed Nov 24 17:39:25 2010 UTC
# Line 89  instead of recognizing only characters w Line 89  instead of recognizing only characters w
89  table.  table.
90  </P>  </P>
91  <P>  <P>
92    If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
93    PCRE_NO_START_OPTIMIZE option either at compile or matching time. There are
94    also some more of these special sequences that are concerned with the handling
95    of newlines; they are described below.
96    </P>
97    <P>
98  The remainder of this document discusses the patterns that are supported by  The remainder of this document discusses the patterns that are supported by
99  PCRE when its main matching function, <b>pcre_exec()</b>, is used.  PCRE when its main matching function, <b>pcre_exec()</b>, is used.
100  From release 6.0, PCRE offers a second matching function,  From release 6.0, PCRE offers a second matching function,
# Line 204  The following sections describe the use Line 210  The following sections describe the use
210  <br><a name="SEC4" href="#TOC1">BACKSLASH</a><br>  <br><a name="SEC4" href="#TOC1">BACKSLASH</a><br>
211  <P>  <P>
212  The backslash character has several uses. Firstly, if it is followed by a  The backslash character has several uses. Firstly, if it is followed by a
213  non-alphanumeric character, it takes away any special meaning that character  character that is not a number or a letter, it takes away any special meaning
214  may have. This use of backslash as an escape character applies both inside and  that character may have. This use of backslash as an escape character applies
215  outside character classes.  both inside and outside character classes.
216  </P>  </P>
217  <P>  <P>
218  For example, if you want to match a * character, you write \* in the pattern.  For example, if you want to match a * character, you write \* in the pattern.
# Line 216  non-alphanumeric with backslash to speci Line 222  non-alphanumeric with backslash to speci
222  particular, if you want to match a backslash, you write \\.  particular, if you want to match a backslash, you write \\.
223  </P>  </P>
224  <P>  <P>
225    In UTF-8 mode, only ASCII numbers and letters have any special meaning after a
226    backslash. All other characters (in particular, those whose codepoints are
227    greater than 127) are treated as literals.
228    </P>
229    <P>
230  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the  If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
231  pattern (other than in a character class) and characters between a # outside  pattern (other than in a character class) and characters between a # outside
232  a character class and the next newline are ignored. An escaping backslash can  a character class and the next newline are ignored. An escaping backslash can
# Line 247  but when a pattern is being prepared by Line 258  but when a pattern is being prepared by
258  one of the following escape sequences than the binary character it represents:  one of the following escape sequences than the binary character it represents:
259  <pre>  <pre>
260    \a        alarm, that is, the BEL character (hex 07)    \a        alarm, that is, the BEL character (hex 07)
261    \cx       "control-x", where x is any character    \cx       "control-x", where x is any ASCII character
262    \e        escape (hex 1B)    \e        escape (hex 1B)
263    \f        formfeed (hex 0C)    \f        formfeed (hex 0C)
264    \n        linefeed (hex 0A)    \n        linefeed (hex 0A)
# Line 259  one of the following escape sequences th Line 270  one of the following escape sequences th
270  </pre>  </pre>
271  The precise effect of \cx is as follows: if x is a lower case letter, it  The precise effect of \cx is as follows: if x is a lower case letter, it
272  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.  is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
273  Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex  Thus \cz becomes hex 1A (z is 7A), but \c{ becomes hex 3B ({ is 7B), while
274  7B.  \c; becomes hex 7B (; is 3B). If the byte following \c has a value greater
275    than 127, a compile-time error occurs. This locks out non-ASCII characters in
276    both byte mode and UTF-8 mode. (When PCRE is compiled in EBCDIC mode, all byte
277    values are valid. A lower case letter is converted to upper case, and then the
278    0xc0 bits are flipped.)
279  </P>  </P>
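
The \cx rule in the new text can be checked with a few lines of arithmetic. The following is a minimal sketch in Python, not PCRE's own code; the function name `control_escape` is hypothetical, and the sketch covers only the ASCII case, not the EBCDIC variant mentioned in the parentheses.

```python
def control_escape(x: str) -> int:
    """Return the code value that \\cx stands for (ASCII rule only)."""
    code = ord(x)
    if code > 127:
        # Stands in for the compile-time error described for non-ASCII bytes.
        raise ValueError("\\c must be followed by an ASCII character")
    if ord("a") <= code <= ord("z"):
        code -= 0x20      # a lower case letter is converted to upper case
    return code ^ 0x40    # then bit 6 (hex 40) is inverted


for ch in "z{;":
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

The three characters printed are the ones used in the paragraph above: z (7A), { (7B), and ; (3B).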
280  <P>  <P>
281  After \x, from zero to two hexadecimal digits are read (letters can be in  After \x, from zero to two hexadecimal digits are read (letters can be in
# Line 421  any Unicode letter, and underscore. Note Line 436  any Unicode letter, and underscore. Note
436  is noticeably slower when PCRE_UCP is set.  is noticeably slower when PCRE_UCP is set.
437  </P>  </P>
438  <P>  <P>
439  The sequences \h, \H, \v, and \V are features that were added to Perl at  The sequences \h, \H, \v, and \V are features that were added to Perl at
440  release 5.10. In contrast to the other sequences, which match only ASCII  release 5.10. In contrast to the other sequences, which match only ASCII
441  characters by default, these always match certain high-valued codepoints in  characters by default, these always match certain high-valued codepoints in
442  UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters  UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters
# Line 940  dollar, the only relationship being that Line 955  dollar, the only relationship being that
955  special meaning in a character class.  special meaning in a character class.
956  </P>  </P>
957  <P>  <P>
958  The escape sequence \N behaves like a dot, except that it is not affected by  The escape sequence \N behaves like a dot, except that it is not affected by
959  the PCRE_DOTALL option. In other words, it matches any character except one  the PCRE_DOTALL option. In other words, it matches any character except one
960  that signifies the end of a line.  that signifies the end of a line.
961  </P>  </P>
# Line 1040  characters with values greater than 128 Line 1055  characters with values greater than 128
1055  property support.  property support.
1056  </P>  </P>
1057  <P>  <P>
1058  The character types \d, \D, \h, \H, \p, \P, \s, \S, \v, \V, \w, and  The character escape sequences \d, \D, \h, \H, \p, \P, \s, \S, \v,
1059  \W may also appear in a character class, and add the characters that they  \V, \w, and \W may appear in a character class, and add the characters that
1060  match to the class. For example, [\dABCDEF] matches any hexadecimal digit. A  they match to the class. For example, [\dABCDEF] matches any hexadecimal
1061  circumflex can conveniently be used with the upper case character types to  digit. In UTF-8 mode, the PCRE_UCP option affects the meanings of \d, \s, \w
1062    and their upper case partners, just as it does when they appear outside a
1063    character class, as described in the section entitled
1064    <a href="#genericchartypes">"Generic character types"</a>
1065    above. The escape sequence \b has a different meaning inside a character
1066    class; it matches the backspace character. The sequences \B, \N, \R, and \X
1067    are not special inside a character class. Like any other unrecognized escape
1068    sequences, they are treated as the literal characters "B", "N", "R", and "X" by
1069    default, but cause an error if the PCRE_EXTRA option is set.
1070    </P>
1071    <P>
1072    A circumflex can conveniently be used with the upper case character types to
1073  specify a more restricted set of characters than the matching lower case type.  specify a more restricted set of characters than the matching lower case type.
1074  For example, the class [^\W_] matches any letter or digit, but not underscore.  For example, the class [^\W_] matches any letter or digit, but not underscore,
1075    whereas [\w] includes underscore. A positive character class should be read as
1076    "something OR something OR ..." and a negative class as "NOT something AND NOT
1077    something AND NOT ...".
1078  </P>  </P>
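
The class examples in the new text can be tried out with a short script. This sketch uses Python's `re` module rather than PCRE, on the assumption that these particular class constructs behave the same way in both; `re.ASCII` is set to mirror PCRE's default ASCII-only meaning of \d and \w, and the helper `matches` is a hypothetical name.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)  # \d adds the digits to the class
alnum = re.compile(r"[^\W_]", re.ASCII)          # NOT non-word AND NOT underscore
word = re.compile(r"[\w]", re.ASCII)             # plain \w, underscore included
backspace = re.compile(r"[\b]")                  # \b inside a class is backspace


def matches(pattern: re.Pattern, ch: str) -> bool:
    """True if the single character ch is a member of the class."""
    return pattern.fullmatch(ch) is not None


print(matches(hex_digit, "C"), matches(hex_digit, "G"))
print(matches(alnum, "_"), matches(word, "_"))
print(matches(backspace, "\x08"))
```

The sequences \B, \N, \R, and \X are left out of the sketch on purpose, because Python's handling of unrecognized escapes differs from the PCRE default described above.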
1079  <P>  <P>
1080  The only metacharacters that are recognized in character classes are backslash,  The only metacharacters that are recognized in character classes are backslash,
# Line 1669  example: Line 1698  example:
1698    (abc(def)ghi)\g{-1}    (abc(def)ghi)\g{-1}
1699  </pre>  </pre>
1700  The sequence \g{-1} is a reference to the most recently started capturing  The sequence \g{-1} is a reference to the most recently started capturing
1701  subpattern before \g, that is, it is equivalent to \2. Similarly, \g{-2}  subpattern before \g, that is, it is equivalent to \2 in this example.
1702  would be equivalent to \1. The use of relative references can be helpful in  Similarly, \g{-2} would be equivalent to \1. The use of relative references
1703  long patterns, and also in patterns that are created by joining together  can be helpful in long patterns, and also in patterns that are created by
1704  fragments that contain references within themselves.  joining together fragments that contain references within themselves.
1705  </P>  </P>
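
The numbering rule for relative references comes down to simple arithmetic. The sketch below is an illustration in Python, not PCRE code: `resolve_relative` is a hypothetical helper, and the count of groups opened before the reference is supplied by hand. Python's `re` does not accept \g{-1} in a pattern, so the resolved absolute number is used to build an equivalent pattern.

```python
import re


def resolve_relative(groups_opened: int, relative: int) -> int:
    """Absolute number of the group that a relative reference such as -1 names.

    groups_opened is the number of capturing groups started before the
    reference; relative must be negative (-1 is the most recently started).
    """
    if relative >= 0:
        raise ValueError("this sketch handles only negative references")
    absolute = groups_opened + relative + 1
    if absolute < 1:
        raise ValueError("reference points before the first group")
    return absolute


# In (abc(def)ghi)\g{-1}, two groups are opened before the reference.
n = resolve_relative(2, -1)
pattern = re.compile(r"(abc(def)ghi)" + "\\" + str(n))
print(n, pattern.fullmatch("abcdefghidef") is not None)
```

Because the result depends only on how many groups precede the reference, a fragment that uses relative references keeps working when it is joined after other fragments that contain groups of their own.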
1706  <P>  <P>
1707  A back reference matches whatever actually matched the capturing subpattern in  A back reference matches whatever actually matched the capturing subpattern in
# Line 1802  lookbehind assertion is needed to achiev Line 1831  lookbehind assertion is needed to achiev
1831  If you want to force a matching failure at some point in a pattern, the most  If you want to force a matching failure at some point in a pattern, the most
1832  convenient way to do it is with (?!) because an empty string always matches, so  convenient way to do it is with (?!) because an empty string always matches, so
1833  an assertion that requires there not to be an empty string must always fail.  an assertion that requires there not to be an empty string must always fail.
1834  The backtracking control verb (*FAIL) or (*F) is essentially a synonym for  The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
 (?!).  
1835  <a name="lookbehind"></a></P>  <a name="lookbehind"></a></P>
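
A forced failure can be illustrated with a small script. This sketch uses Python's `re` module, which accepts (?!) but has no (*FAIL) verb, so it shows only the assertion form and assumes that Python and PCRE agree on this construct.

```python
import re

# The first branch can never succeed: after "a" the empty negative
# lookahead always fails, so only the "b" branch can produce a match.
pattern = re.compile(r"a(?!)|b")

print(pattern.search("aaa"))          # no match is possible
print(pattern.search("ab").group())   # the "b" branch matches
```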
1836  <br><b>  <br><b>
1837  Lookbehind assertions  Lookbehind assertions
# Line 1936  already been matched. The two possible f Line 1964  already been matched. The two possible f
1964  If the condition is satisfied, the yes-pattern is used; otherwise the  If the condition is satisfied, the yes-pattern is used; otherwise the
1965  no-pattern (if present) is used. If there are more than two alternatives in the  no-pattern (if present) is used. If there are more than two alternatives in the
1966  subpattern, a compile-time error occurs. Each of the two alternatives may  subpattern, a compile-time error occurs. Each of the two alternatives may
1967  itself contain nested subpatterns of any form, including conditional  itself contain nested subpatterns of any form, including conditional
1968  subpatterns; the restriction to two alternatives applies only at the level of  subpatterns; the restriction to two alternatives applies only at the level of
1969  the condition. This pattern fragment is an example where the alternatives are  the condition. This pattern fragment is an example where the alternatives are
1970  complex:  complex:
1971  <pre>  <pre>
1972    (?(1) (A|B|C) | (D | (?(2)E|F) | E) )    (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
# Line 1958  condition is true if a capturing subpatt Line 1986  condition is true if a capturing subpatt
1986  matched. If there is more than one capturing subpattern with the same number  matched. If there is more than one capturing subpattern with the same number
1987  (see the earlier  (see the earlier
1988  <a href="#recursion">section about duplicate subpattern numbers),</a>  <a href="#recursion">section about duplicate subpattern numbers),</a>
1989  the condition is true if any of them have been set. An alternative notation is  the condition is true if any of them have matched. An alternative notation is
1990  to precede the digits with a plus or minus sign. In this case, the subpattern  to precede the digits with a plus or minus sign. In this case, the subpattern
1991  number is relative rather than absolute. The most recently opened parentheses  number is relative rather than absolute. The most recently opened parentheses
1992  can be referenced by (?(-1), the next most recent by (?(-2), and so on. In  can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
1993  looping constructs it can also make sense to refer to subsequent groups with  loops it can also make sense to refer to subsequent groups. The next
1994  constructs such as (?(+2).  parentheses to be opened can be referenced as (?(+1), and so on. (The value
1995    zero in any of these forms is not used; it provokes a compile-time error.)
1996  </P>  </P>
1997  <P>  <P>
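
A condition on an absolute group number can be sketched as follows. The pattern here is a made-up illustration, not one from the manual, and it runs in Python's `re` module, which supports the (?(1)...) form; the signed, relative forms described above are not used in the sketch.

```python
import re

# Group 1 optionally captures "<". The condition (?(1)>) then demands a
# closing ">" only if group 1 took part in the match.
pattern = re.compile(r"(<)?\w+(?(1)>)")

for subject in ["<tag>", "tag", "<tag", "tag>"]:
    print(subject, pattern.fullmatch(subject) is not None)
```

The last two subjects are rejected for different reasons: "<tag" sets group 1 but lacks the required ">", while "tag>" leaves group 1 unset, so the trailing ">" is left over.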
1998  Consider the following pattern, which contains non-significant white space to  Consider the following pattern, which contains non-significant white space to
# Line 1975  three parts for ease of discussion: Line 2004  three parts for ease of discussion:
2004  The first part matches an optional opening parenthesis, and if that  The first part matches an optional opening parenthesis, and if that
2005  character is present, sets it as the first captured substring. The second part  character is present, sets it as the first captured substring. The second part
2006  matches one or more characters that are not parentheses. The third part is a  matches one or more characters that are not parentheses. The third part is a
2007  conditional subpattern that tests whether the first set of parentheses matched  conditional subpattern that tests whether or not the first set of parentheses
2008  or not. If they did, that is, if subject started with an opening parenthesis,  matched. If they did, that is, if subject started with an opening parenthesis,
2009  the condition is true, and so the yes-pattern is executed and a closing  the condition is true, and so the yes-pattern is executed and a closing
2010  parenthesis is required. Otherwise, since no-pattern is not present, the  parenthesis is required. Otherwise, since no-pattern is not present, the
2011  subpattern matches nothing. In other words, this pattern matches a sequence of  subpattern matches nothing. In other words, this pattern matches a sequence of
# Line 2044  alternative in the subpattern. It is alw Line 2073  alternative in the subpattern. It is alw
2073  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
2074  "subroutines" that can be referenced from elsewhere. (The use of  "subroutines" that can be referenced from elsewhere. (The use of
2075  <a href="#subpatternsassubroutines">"subroutines"</a>  <a href="#subpatternsassubroutines">"subroutines"</a>
2076  is described below.) For example, a pattern to match an IPv4 address could be  is described below.) For example, a pattern to match an IPv4 address such as
2077  written like this (ignore whitespace and line breaks):  "192.168.23.245" could be written like this (ignore whitespace and line
2078    breaks):
2079  <pre>  <pre>
2080    (?(DEFINE) (?&#60;byte&#62; 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )    (?(DEFINE) (?&#60;byte&#62; 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
2081    \b (?&byte) (\.(?&byte)){3} \b    \b (?&byte) (\.(?&byte)){3} \b
# Line 2078  dd-aaa-dd or dd-dd-dd, where aaa are let Line 2108  dd-aaa-dd or dd-dd-dd, where aaa are let
2108  <a name="comments"></a></P>  <a name="comments"></a></P>
2109  <br><a name="SEC20" href="#TOC1">COMMENTS</a><br>  <br><a name="SEC20" href="#TOC1">COMMENTS</a><br>
2110  <P>  <P>
2111  There are two ways of including comments in patterns that are processed by  There are two ways of including comments in patterns that are processed by
2112  PCRE. In both cases, the start of the comment must not be in a character class,  PCRE. In both cases, the start of the comment must not be in a character class,
2113  nor in the middle of any other sequence of related characters such as (?: or a  nor in the middle of any other sequence of related characters such as (?: or a
2114  subpattern name or number. The characters that make up a comment play no part  subpattern name or number. The characters that make up a comment play no part
# Line 2100  default newline convention is in force: Line 2130  default newline convention is in force:
2130  <pre>  <pre>
2131    abc #comment \n still comment    abc #comment \n still comment
2132  </pre>  </pre>
2133  On encountering the # character, <b>pcre_compile()</b> skips along, looking for  On encountering the # character, <b>pcre_compile()</b> skips along, looking for
2134  a newline in the pattern. The sequence \n is still literal at this stage, so  a newline in the pattern. The sequence \n is still literal at this stage, so
2135  it does not terminate the comment. Only an actual character with the code value  it does not terminate the comment. Only an actual character with the code value
2136  0x0a (the default newline) does so.  0x0a (the default newline) does so.
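
The difference between the two-character sequence \n and a real newline inside a # comment can be demonstrated with a short sketch. It uses Python's `re.VERBOSE` flag as a stand-in for PCRE_EXTENDED, on the assumption that Python's pattern parser treats this particular case the same way; it does not call <b>pcre_compile()</b>.

```python
import re

# Raw string: the pattern text contains a backslash followed by "n", so
# the comment runs to the end of the pattern and only "abc" remains.
literal = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: Python turns \n into a real 0x0a character before the
# pattern is compiled, so the comment ends there. The rest is pattern text,
# with its whitespace ignored because of the extended mode.
real = re.compile("abc #comment \n still comment", re.VERBOSE)

print(literal.fullmatch("abc") is not None)
print(real.fullmatch("abcstillcomment") is not None)
```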
# Line 2270  difference: in the previous case the rem Line 2300  difference: in the previous case the rem
2300  recursion level, which PCRE cannot use.  recursion level, which PCRE cannot use.
2301  </P>  </P>
2302  <P>  <P>
2303  To change the pattern so that matches all palindromic strings, not just those  To change the pattern so that it matches all palindromic strings, not just
2304  with an odd number of characters, it is tempting to change the pattern to this:  those with an odd number of characters, it is tempting to change the pattern to
2305    this:
2306  <pre>  <pre>
2307    ^((.)(?1)\2|.?)$    ^((.)(?1)\2|.?)$
2308  </pre>  </pre>
# Line 2433  minimum length of matching subject, or t Line 2464  minimum length of matching subject, or t
2464  present. When one of these optimizations suppresses the running of a match, any  present. When one of these optimizations suppresses the running of a match, any
2465  included backtracking verbs will not, of course, be processed. You can suppress  included backtracking verbs will not, of course, be processed. You can suppress
2466  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option  the start-of-match optimizations by setting the PCRE_NO_START_OPTIMIZE option
2467  when calling <b>pcre_exec()</b>.  when calling <b>pcre_compile()</b> or <b>pcre_exec()</b>, or by starting the
2468    pattern with (*NO_START_OPT).
2469  </P>  </P>
2470  <br><b>  <br><b>
2471  Verbs that act immediately  Verbs that act immediately
# Line 2624  matching name is found, normal "bumpalon Line 2656  matching name is found, normal "bumpalon
2656  <pre>  <pre>
2657    (*THEN) or (*THEN:NAME)    (*THEN) or (*THEN:NAME)
2658  </pre>  </pre>
2659  This verb causes a skip to the next alternation in the innermost enclosing  This verb causes a skip to the next alternation in the innermost enclosing
2660  group if the rest of the pattern does not match. That is, it cancels pending  group if the rest of the pattern does not match. That is, it cancels pending
2661  backtracking, but only within the current alternation. Its name comes from the  backtracking, but only within the current alternation. Its name comes from the
2662  observation that it can be used for a pattern-based if-then-else block:  observation that it can be used for a pattern-based if-then-else block:
# Line 2639  overall match fails. If (*THEN) is not d Line 2671  overall match fails. If (*THEN) is not d
2671  like (*PRUNE).  like (*PRUNE).
2672  </P>  </P>
2673  <P>  <P>
2674  The above verbs provide four different "strengths" of control when subsequent  The above verbs provide four different "strengths" of control when subsequent
2675  matching fails. (*THEN) is the weakest, carrying on the match at the next  matching fails. (*THEN) is the weakest, carrying on the match at the next
2676  alternation. (*PRUNE) comes next, failing the match at the current starting  alternation. (*PRUNE) comes next, failing the match at the current starting
2677  position, but allowing an advance to the next character (for an unanchored  position, but allowing an advance to the next character (for an unanchored
# Line 2673  Cambridge CB2 3QH, England. Line 2705  Cambridge CB2 3QH, England.
2705  </P>  </P>
2706  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2707  <P>  <P>
2708  Last updated: 17 November 2010  Last updated: 21 November 2010
2709  <br>  <br>
2710  Copyright &copy; 1997-2010 University of Cambridge.  Copyright &copy; 1997-2010 University of Cambridge.
2711  <br>  <br>
