Diff of /code/trunk/doc/pcrepattern.3

revision 562 by ph10, Sun Oct 31 14:06:43 2010 UTC  revision 572 by ph10, Wed Nov 17 17:55:57 2010 UTC
# Line 424  any Unicode letter, and underscore. Note Line 424  any Unicode letter, and underscore. Note
424  \eB because they are defined in terms of \ew and \eW. Matching these sequences  \eB because they are defined in terms of \ew and \eW. Matching these sequences
425  is noticeably slower when PCRE_UCP is set.  is noticeably slower when PCRE_UCP is set.
426  .P  .P
427  The sequences \eh, \eH, \ev, and \eV are Perl 5.10 features. In contrast to the  The sequences \eh, \eH, \ev, and \eV are features that were added to Perl at
428  other sequences, which match only ASCII characters by default, these always  release 5.10. In contrast to the other sequences, which match only ASCII
429  match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is  characters by default, these always match certain high-valued codepoints in
430  set. The horizontal space characters are:  UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters
431    are:
432  .sp  .sp
433    U+0009     Horizontal tab    U+0009     Horizontal tab
434    U+0020     Space    U+0020     Space
# Line 465  The vertical space characters are: Line 466  The vertical space characters are:
466  .rs  .rs
467  .sp  .sp
468  Outside a character class, by default, the escape sequence \eR matches any  Outside a character class, by default, the escape sequence \eR matches any
469  Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \eR is  Unicode newline sequence. In non-UTF-8 mode \eR is equivalent to the following:
 equivalent to the following:  
470  .sp  .sp
471    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)    (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
472  .sp  .sp
# Line 774  same characters as Xan, plus underscore. Line 774  same characters as Xan, plus underscore.
774  .SS "Resetting the match start"  .SS "Resetting the match start"
775  .rs  .rs
776  .sp  .sp
777  The escape sequence \eK, which is a Perl 5.10 feature, causes any previously  The escape sequence \eK causes any previously matched characters not to be
778  matched characters not to be included in the final matched sequence. For  included in the final matched sequence. For example, the pattern:
 example, the pattern:  
779  .sp  .sp
780    foo\eKbar    foo\eKbar
781  .sp  .sp
# Line 948  The handling of dot is entirely independ Line 947  The handling of dot is entirely independ
947  dollar, the only relationship being that they both involve newlines. Dot has no  dollar, the only relationship being that they both involve newlines. Dot has no
948  special meaning in a character class.  special meaning in a character class.
949  .P  .P
950  The escape sequence \eN always behaves as a dot does when PCRE_DOTALL is not  The escape sequence \eN behaves like a dot, except that it is not affected by
951  set. In other words, it matches any one character except one that signifies the  the PCRE_DOTALL option. In other words, it matches any character except one
952  end of a line.  that signifies the end of a line.
953  .  .
954  .  .
955  .SH "MATCHING A SINGLE BYTE"  .SH "MATCHING A SINGLE BYTE"
# Line 959  end of a line. Line 958  end of a line.
958  Outside a character class, the escape sequence \eC matches any one byte, both  Outside a character class, the escape sequence \eC matches any one byte, both
959  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
960  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
961  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes,  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
962  what remains in the string may be a malformed UTF-8 string. For this reason,  rest of the string may start with a malformed UTF-8 character. For this reason,
963  the \eC escape sequence is best avoided.  the \eC escape sequence is best avoided.
964  .P  .P
965  PCRE does not allow \eC to appear in lookbehind assertions  PCRE does not allow \eC to appear in lookbehind assertions
# Line 1173  extracts it into the global options (and Line 1172  extracts it into the global options (and
1172  extracted by the \fBpcre_fullinfo()\fP function).  extracted by the \fBpcre_fullinfo()\fP function).
1173  .P  .P
1174  An option change within a subpattern (see below for a description of  An option change within a subpattern (see below for a description of
1175  subpatterns) affects only that part of the current pattern that follows it, so  subpatterns) affects only that part of the subpattern that follows it, so
1176  .sp  .sp
1177    (a(?i)b)c    (a(?i)b)c
1178  .sp  .sp
# Line 1214  Turning part of a pattern into a subpatt Line 1213  Turning part of a pattern into a subpatt
1213  .sp  .sp
1214    cat(aract|erpillar|)    cat(aract|erpillar|)
1215  .sp  .sp
1216  matches one of the words "cat", "cataract", or "caterpillar". Without the  matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1217  parentheses, it would match "cataract", "erpillar" or an empty string.  match "cataract", "erpillar" or an empty string.
1218  .sp  .sp
1219  2. It sets up the subpattern as a capturing subpattern. This means that, when  2. It sets up the subpattern as a capturing subpattern. This means that, when
1220  the whole pattern matches, that portion of the subject string that matched the  the whole pattern matches, that portion of the subject string that matched the
1221  subpattern is passed back to the caller via the \fIovector\fP argument of  subpattern is passed back to the caller via the \fIovector\fP argument of
1222  \fBpcre_exec()\fP. Opening parentheses are counted from left to right (starting  \fBpcre_exec()\fP. Opening parentheses are counted from left to right (starting
1223  from 1) to obtain numbers for the capturing subpatterns.  from 1) to obtain numbers for the capturing subpatterns. For example, if the
1224  .P  string "the red king" is matched against the pattern
 For example, if the string "the red king" is matched against the pattern  
1225  .sp  .sp
1226    the ((red|white) (king|queen))    the ((red|white) (king|queen))
1227  .sp  .sp
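Illustrative note (not part of the diff): the two functions of parentheses described in this hunk can be sketched with Python's standard `re` module. Using Python instead of PCRE is an assumption made for convenience; for these particular patterns both engines group and number the parentheses in the same way. The variable names are hypothetical.

```python
import re

# 1. Parentheses localize the alternatives: "cat" followed by one of
#    "aract", "erpillar", or the empty string.
words = re.compile(r"cat(aract|erpillar|)")
localized = [
    bool(words.fullmatch(w))
    for w in ("cat", "cataract", "caterpillar", "erpillar")
]

# 2. Capturing groups are numbered by counting opening parentheses
#    from left to right, starting at 1.
m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()

print(localized)
print(captured)
```

Group 1 is the outermost pair because its opening parenthesis comes first, even though it closes last.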
# Line 1272  at captured substring number one, whiche Line 1270  at captured substring number one, whiche
1270  is useful when you want to capture part, but not all, of one of a number of  is useful when you want to capture part, but not all, of one of a number of
1271  alternatives. Inside a (?| group, parentheses are numbered as usual, but the  alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1272  number is reset at the start of each branch. The numbers of any capturing  number is reset at the start of each branch. The numbers of any capturing
1273  buffers that follow the subpattern start after the highest number used in any  parentheses that follow the subpattern start after the highest number used in
1274  branch. The following example is taken from the Perl documentation.  any branch. The following example is taken from the Perl documentation. The
1275  The numbers underneath show in which buffer the captured content will be  numbers underneath show in which buffer the captured content will be stored.
 stored.  
1276  .sp  .sp
1277    # before  ---------------branch-reset----------- after    # before  ---------------branch-reset----------- after
1278    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# Line 1402  items: Line 1399  items:
1399    the \eC escape sequence    the \eC escape sequence
1400    the \eX escape sequence (in UTF-8 mode with Unicode properties)    the \eX escape sequence (in UTF-8 mode with Unicode properties)
1401    the \eR escape sequence    the \eR escape sequence
1402    an escape such as \ed that matches a single character    an escape such as \ed or \epL that matches a single character
1403    a character class    a character class
1404    a back reference (see next section)    a back reference (see next section)
1405    a parenthesized subpattern (unless it is an assertion)    a parenthesized subpattern (unless it is an assertion)
# Line 1444  subpatterns that are referenced as Line 1441  subpatterns that are referenced as
1441  .\" </a>  .\" </a>
1442  subroutines  subroutines
1443  .\"  .\"
1444  from elsewhere in the pattern. Items other than subpatterns that have a {0}  from elsewhere in the pattern (but see also the section entitled
1445  quantifier are omitted from the compiled pattern.  .\" HTML <a href="#subdefine">
1446    .\" </a>
1447    "Defining subpatterns for use by reference only"
1448    .\"
1449    below). Items other than subpatterns that have a {0} quantifier are omitted
1450    from the compiled pattern.
1451  .P  .P
1452  For convenience, the three most common quantifiers have single-character  For convenience, the three most common quantifiers have single-character
1453  abbreviations:  abbreviations:
# Line 1670  no such problem when named parentheses a Line 1672  no such problem when named parentheses a
1672  subpattern is possible using named parentheses (see below).  subpattern is possible using named parentheses (see below).
1673  .P  .P
1674  Another way of avoiding the ambiguity inherent in the use of digits following a  Another way of avoiding the ambiguity inherent in the use of digits following a
1675  backslash is to use the \eg escape sequence, which is a feature introduced in  backslash is to use the \eg escape sequence. This escape must be followed by an
1676  Perl 5.10. This escape must be followed by an unsigned number or a negative  unsigned number or a negative number, optionally enclosed in braces. These
1677  number, optionally enclosed in braces. These examples are all identical:  examples are all identical:
1678  .sp  .sp
1679    (ring), \e1    (ring), \e1
1680    (ring), \eg1    (ring), \eg1
# Line 1686  example: Line 1688  example:
1688    (abc(def)ghi)\eg{-1}    (abc(def)ghi)\eg{-1}
1689  .sp  .sp
1690  The sequence \eg{-1} is a reference to the most recently started capturing  The sequence \eg{-1} is a reference to the most recently started capturing
1691  subpattern before \eg, that is, is it equivalent to \e2. Similarly, \eg{-2}  subpattern before \eg, that is, is it equivalent to \e2 in this example.
1692  would be equivalent to \e1. The use of relative references can be helpful in  Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
1693  long patterns, and also in patterns that are created by joining together  can be helpful in long patterns, and also in patterns that are created by
1694  fragments that contain references within themselves.  joining together fragments that contain references within themselves.
1695  .P  .P
1696  A back reference matches whatever actually matched the capturing subpattern in  A back reference matches whatever actually matched the capturing subpattern in
1697  the current subject string, rather than anything matching the subpattern  the current subject string, rather than anything matching the subpattern
# Line 1825  lookbehind assertion is needed to achiev Line 1827  lookbehind assertion is needed to achiev
1827  If you want to force a matching failure at some point in a pattern, the most  If you want to force a matching failure at some point in a pattern, the most
1828  convenient way to do it is with (?!) because an empty string always matches, so  convenient way to do it is with (?!) because an empty string always matches, so
1829  an assertion that requires there not to be an empty string must always fail.  an assertion that requires there not to be an empty string must always fail.
1830  The Perl 5.10 backtracking control verb (*FAIL) or (*F) is essentially a  The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
 synonym for (?!).  
1831  .  .
1832  .  .
1833  .\" HTML <a name="lookbehind"></a>  .\" HTML <a name="lookbehind"></a>
# Line 1851  is permitted, but Line 1852  is permitted, but
1852  .sp  .sp
1853  causes an error at compile time. Branches that match different length strings  causes an error at compile time. Branches that match different length strings
1854  are permitted only at the top level of a lookbehind assertion. This is an  are permitted only at the top level of a lookbehind assertion. This is an
1855  extension compared with Perl (5.8 and 5.10), which requires all branches to  extension compared with Perl, which requires all branches to match the same
1856  match the same length of string. An assertion such as  length of string. An assertion such as
1857  .sp  .sp
1858    (?<=ab(c|de))    (?<=ab(c|de))
1859  .sp  .sp
# Line 1862  branches: Line 1863  branches:
1863  .sp  .sp
1864    (?<=abc|abde)    (?<=abc|abde)
1865  .sp  .sp
1866  In some cases, the Perl 5.10 escape sequence \eK  In some cases, the escape sequence \eK
1867  .\" HTML <a href="#resetmatchstart">  .\" HTML <a href="#resetmatchstart">
1868  .\" </a>  .\" </a>
1869  (see above)  (see above)
# Line 1990  matched. If there is more than one captu Line 1991  matched. If there is more than one captu
1991  .\" </a>  .\" </a>
1992  section about duplicate subpattern numbers),  section about duplicate subpattern numbers),
1993  .\"  .\"
1994  the condition is true if any of them have been set. An alternative notation is  the condition is true if any of them have matched. An alternative notation is
1995  to precede the digits with a plus or minus sign. In this case, the subpattern  to precede the digits with a plus or minus sign. In this case, the subpattern
1996  number is relative rather than absolute. The most recently opened parentheses  number is relative rather than absolute. The most recently opened parentheses
1997  can be referenced by (?(-1), the next most recent by (?(-2), and so on. In  can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
1998  looping constructs it can also make sense to refer to subsequent groups with  loops it can also make sense to refer to subsequent groups. The next
1999  constructs such as (?(+2).  parentheses to be opened can be referenced as (?(+1), and so on. (The value
2000    zero in any of these forms is not used; it provokes a compile-time error.)
2001  .P  .P
2002  Consider the following pattern, which contains non-significant white space to  Consider the following pattern, which contains non-significant white space to
2003  make it more readable (assume the PCRE_EXTENDED option) and to divide it into  make it more readable (assume the PCRE_EXTENDED option) and to divide it into
# Line 2006  three parts for ease of discussion: Line 2008  three parts for ease of discussion:
2008  The first part matches an optional opening parenthesis, and if that  The first part matches an optional opening parenthesis, and if that
2009  character is present, sets it as the first captured substring. The second part  character is present, sets it as the first captured substring. The second part
2010  matches one or more characters that are not parentheses. The third part is a  matches one or more characters that are not parentheses. The third part is a
2011  conditional subpattern that tests whether the first set of parentheses matched  conditional subpattern that tests whether or not the first set of parentheses
2012  or not. If they did, that is, if subject started with an opening parenthesis,  matched. If they did, that is, if subject started with an opening parenthesis,
2013  the condition is true, and so the yes-pattern is executed and a closing  the condition is true, and so the yes-pattern is executed and a closing
2014  parenthesis is required. Otherwise, since no-pattern is not present, the  parenthesis is required. Otherwise, since no-pattern is not present, the
2015  subpattern matches nothing. In other words, this pattern matches a sequence of  subpattern matches nothing. In other words, this pattern matches a sequence of
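Illustrative note (not part of the diff): a minimal sketch of the conditional subpattern discussed in this hunk, using Python's standard `re` module, which also accepts the `(?(1)...)` form. The pattern text is reconstructed from the three-part description in the hunk and is an assumption, since the diff context does not show it; `re.VERBOSE` stands in for PCRE_EXTENDED, and the helper name is hypothetical.

```python
import re

# Part 1: optional "(" captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: if group 1 matched, require ")"; otherwise match nothing.
balanced = re.compile(r"( \( )?  [^()]+  (?(1) \) )", re.VERBOSE)

def is_balanced(subject: str) -> bool:
    """Return True if the whole subject is text optionally wrapped in ()."""
    return balanced.fullmatch(subject) is not None

for s in ("abc", "(abc)", "(abc", "abc)"):
    print(s, is_balanced(s))
```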
# Line 2063  The syntax for recursive patterns Line 2065  The syntax for recursive patterns
2065  .\"  .\"
2066  is described below.  is described below.
2067  .  .
2068    .\" HTML <a name="subdefine"></a>
2069  .SS "Defining subpatterns for use by reference only"  .SS "Defining subpatterns for use by reference only"
2070  .rs  .rs
2071  .sp  .sp
# Line 2075  point in the pattern; the idea of DEFINE Line 2078  point in the pattern; the idea of DEFINE
2078  .\" </a>  .\" </a>
2079  "subroutines"  "subroutines"
2080  .\"  .\"
2081  is described below.) For example, a pattern to match an IPv4 address could be  is described below.) For example, a pattern to match an IPv4 address such as
2082  written like this (ignore whitespace and line breaks):  "192.168.23.245" could be written like this (ignore whitespace and line
2083    breaks):
2084  .sp  .sp
2085    (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )    (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
2086    \eb (?&byte) (\e.(?&byte)){3} \eb    \eb (?&byte) (\e.(?&byte)){3} \eb
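Illustrative note (not part of the diff): Python's standard `re` module has no `(?(DEFINE)...)` group or `(?&name)` subroutine call, so the sketch below only approximates the idea by keeping the "byte" subpattern in a string and interpolating it. The names `BYTE`, `IPV4`, and `looks_like_ipv4` are hypothetical; this shows the reuse of one definition, not how PCRE implements DEFINE.

```python
import re

# The same alternatives as the (?<byte> ...) definition in the pattern above,
# wrapped in a non-capturing group so it can be reused safely.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One byte, then three repetitions of "." followed by a byte.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

def looks_like_ipv4(text: str) -> bool:
    """Return True if the whole string is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

for s in ("192.168.23.245", "0.0.0.0", "256.1.1.1", "1.2.3"):
    print(s, looks_like_ipv4(s))
```

String interpolation repeats the subpattern text at each use, whereas a DEFINE group is compiled once and called by reference.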
# Line 2124  this case continues to immediately after Line 2128  this case continues to immediately after
2128  character sequence in the pattern. Which characters are interpreted as newlines  character sequence in the pattern. Which characters are interpreted as newlines
2129  is controlled by the options passed to \fBpcre_compile()\fP or by a special  is controlled by the options passed to \fBpcre_compile()\fP or by a special
2130  sequence at the start of the pattern, as described in the section entitled  sequence at the start of the pattern, as described in the section entitled
2131  .\" HTML <a href="#recursion">  .\" HTML <a href="#newlines">
2132  .\" </a>  .\" </a>
2133  "Newline conventions"  "Newline conventions"
2134  .\"  .\"
2135  above. Note that end of this type of comment is a literal newline sequence in  above. Note that the end of this type of comment is a literal newline sequence
2136  the pattern; escape sequences that happen to represent a newline do not count.  in the pattern; escape sequences that happen to represent a newline do not
2137  For example, consider this pattern when PCRE_EXTENDED is set, and the default  count. For example, consider this pattern when PCRE_EXTENDED is set, and the
2138  newline convention is in force:  default newline convention is in force:
2139  .sp  .sp
2140    abc #comment \en still comment    abc #comment \en still comment
2141  .sp  .sp
# Line 2196  We have put the pattern into parentheses Line 2200  We have put the pattern into parentheses
2200  them instead of the whole pattern.  them instead of the whole pattern.
2201  .P  .P
2202  In a larger pattern, keeping track of parenthesis numbers can be tricky. This  In a larger pattern, keeping track of parenthesis numbers can be tricky. This
2203  is made easier by the use of relative references (a Perl 5.10 feature).  is made easier by the use of relative references. Instead of (?1) in the
2204  Instead of (?1) in the pattern above you can write (?-2) to refer to the second  pattern above you can write (?-2) to refer to the second most recently opened
2205  most recently opened parentheses preceding the recursion. In other words, a  parentheses preceding the recursion. In other words, a negative number counts
2206  negative number counts capturing parentheses leftwards from the point at which  capturing parentheses leftwards from the point at which it is encountered.
 it is encountered.  
2207  .P  .P
2208  It is also possible to refer to subsequently opened parentheses, by writing  It is also possible to refer to subsequently opened parentheses, by writing
2209  references such as (?+2). However, these cannot be recursive because the  references such as (?+2). However, these cannot be recursive because the
# Line 2303  time we do have another alternative to t Line 2306  time we do have another alternative to t
2306  difference: in the previous case the remaining alternative is at a deeper  difference: in the previous case the remaining alternative is at a deeper
2307  recursion level, which PCRE cannot use.  recursion level, which PCRE cannot use.
2308  .P  .P
2309  To change the pattern so that matches all palindromic strings, not just those  To change the pattern so that it matches all palindromic strings, not just
2310  with an odd number of characters, it is tempting to change the pattern to this:  those with an odd number of characters, it is tempting to change the pattern to
2311    this:
2312  .sp  .sp
2313    ^((.)(?1)\e2|.?)$    ^((.)(?1)\e2|.?)$
2314  .sp  .sp
# Line 2714  Cambridge CB2 3QH, England. Line 2718  Cambridge CB2 3QH, England.
2718  .rs  .rs
2719  .sp  .sp
2720  .nf  .nf
2721  Last updated: 31 October 2010  Last updated: 17 November 2010
2722  Copyright (c) 1997-2010 University of Cambridge.  Copyright (c) 1997-2010 University of Cambridge.
2723  .fi  .fi

