Diff of /code/trunk/doc/html/pcrepattern.html

revision 571 by ph10, Sat Nov 6 17:10:00 2010 UTC revision 572 by ph10, Wed Nov 17 17:55:57 2010 UTC
# Line 421  any Unicode letter, and underscore. Note Line 421  any Unicode letter, and underscore. Note
421  is noticeably slower when PCRE_UCP is set.  is noticeably slower when PCRE_UCP is set.
422  </P>  </P>
423  <P>  <P>
424  The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the  The sequences \h, \H, \v, and \V are features that were added to Perl at
425  other sequences, which match only ASCII characters by default, these always  release 5.10. In contrast to the other sequences, which match only ASCII
426  match certain high-valued codepoints in UTF-8 mode, whether or not PCRE_UCP is  characters by default, these always match certain high-valued codepoints in
427  set. The horizontal space characters are:  UTF-8 mode, whether or not PCRE_UCP is set. The horizontal space characters
428    are:
429  <pre>  <pre>
430    U+0009     Horizontal tab    U+0009     Horizontal tab
431    U+0020     Space    U+0020     Space
# Line 462  Newline sequences Line 463  Newline sequences
463  </b><br>  </b><br>
464  <P>  <P>
465  Outside a character class, by default, the escape sequence \R matches any  Outside a character class, by default, the escape sequence \R matches any
466  Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is  Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the following:
 equivalent to the following:  
467  <pre>  <pre>
468    (?&#62;\r\n|\n|\x0b|\f|\r|\x85)    (?&#62;\r\n|\n|\x0b|\f|\r|\x85)
469  </pre>  </pre>
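As a rough illustration of the alternation shown in this hunk, the sketch below spells out the same alternatives with Python's standard `re` module. This is a different engine from PCRE and has no `\R` escape of its own. The atomic group `(?>...)` is swapped for a plain non-capturing group, an assumption made so that the sketch also runs on Python versions older than 3.11; for this simple left-to-right scan the matches are the same. The name `NEWLINE_SEQ` is hypothetical.

```python
import re

# Non-UTF-8 expansion of \R taken from the text above, with the atomic group
# replaced by a non-capturing group (an assumption of this sketch).
NEWLINE_SEQ = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

subject = "one\r\ntwo\nthree\rfour\x85five"

# Each element is one newline sequence found in the subject.
found = NEWLINE_SEQ.findall(subject)
print(found)
```

Because `\r\n` is listed first, a CRLF pair is consumed as a single unit rather than as two separate newlines.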
# Line 769  same characters as Xan, plus underscore. Line 769  same characters as Xan, plus underscore.
769  Resetting the match start  Resetting the match start
770  </b><br>  </b><br>
771  <P>  <P>
772  The escape sequence \K, which is a Perl 5.10 feature, causes any previously  The escape sequence \K causes any previously matched characters not to be
773  matched characters not to be included in the final matched sequence. For  included in the final matched sequence. For example, the pattern:
 example, the pattern:  
774  <pre>  <pre>
775    foo\Kbar    foo\Kbar
776  </pre>  </pre>
# Line 941  dollar, the only relationship being that Line 940  dollar, the only relationship being that
940  special meaning in a character class.  special meaning in a character class.
941  </P>  </P>
942  <P>  <P>
943  The escape sequence \N always behaves as a dot does when PCRE_DOTALL is not  The escape sequence \N behaves like a dot, except that it is not affected by
944  set. In other words, it matches any one character except one that signifies the  the PCRE_DOTALL option. In other words, it matches any character except one
945  end of a line.  that signifies the end of a line.
946  </P>  </P>
947  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>  <br><a name="SEC7" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
948  <P>  <P>
949  Outside a character class, the escape sequence \C matches any one byte, both  Outside a character class, the escape sequence \C matches any one byte, both
950  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending  in and out of UTF-8 mode. Unlike a dot, it always matches any line-ending
951  characters. The feature is provided in Perl in order to match individual bytes  characters. The feature is provided in Perl in order to match individual bytes
952  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes,  in UTF-8 mode. Because it breaks up UTF-8 characters into individual bytes, the
953  what remains in the string may be a malformed UTF-8 string. For this reason,  rest of the string may start with a malformed UTF-8 character. For this reason,
954  the \C escape sequence is best avoided.  the \C escape sequence is best avoided.
955  </P>  </P>
956  <P>  <P>
# Line 1166  extracted by the <b>pcre_fullinfo()</b> Line 1165  extracted by the <b>pcre_fullinfo()</b>
1165  </P>  </P>
1166  <P>  <P>
1167  An option change within a subpattern (see below for a description of  An option change within a subpattern (see below for a description of
1168  subpatterns) affects only that part of the current pattern that follows it, so  subpatterns) affects only that part of the subpattern that follows it, so
1169  <pre>  <pre>
1170    (a(?i)b)c    (a(?i)b)c
1171  </pre>  </pre>
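A small sketch of the scoping rule, written for Python's `re` module rather than PCRE. Python does not accept a bare `(?i)` in the middle of a pattern, so the scoped form `(?i:b)` stands in for it here (an assumption of this sketch). The point illustrated is the same: only the `b` inside the subpattern is matched caselessly, while `a` and `c` stay case-sensitive.

```python
import re

# Stand-in for the PCRE pattern (a(?i)b)c: caseless matching applies to "b" only.
pattern = re.compile(r"(a(?i:b))c")

# Try the subject strings that differ in the case of a single letter.
results = {s: bool(pattern.fullmatch(s)) for s in ("abc", "aBc", "abC", "Abc")}
print(results)
```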
# Line 1203  Turning part of a pattern into a subpatt Line 1202  Turning part of a pattern into a subpatt
1202  <pre>  <pre>
1203    cat(aract|erpillar|)    cat(aract|erpillar|)
1204  </pre>  </pre>
1205  matches one of the words "cat", "cataract", or "caterpillar". Without the  matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1206  parentheses, it would match "cataract", "erpillar" or an empty string.  match "cataract", "erpillar" or an empty string.
1207  <br>  <br>
1208  <br>  <br>
1209  2. It sets up the subpattern as a capturing subpattern. This means that, when  2. It sets up the subpattern as a capturing subpattern. This means that, when
1210  the whole pattern matches, that portion of the subject string that matched the  the whole pattern matches, that portion of the subject string that matched the
1211  subpattern is passed back to the caller via the <i>ovector</i> argument of  subpattern is passed back to the caller via the <i>ovector</i> argument of
1212  <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting  <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
1213  from 1) to obtain numbers for the capturing subpatterns.  from 1) to obtain numbers for the capturing subpatterns. For example, if the
1214  </P>  string "the red king" is matched against the pattern
 <P>  
 For example, if the string "the red king" is matched against the pattern  
1215  <pre>  <pre>
1216    the ((red|white) (king|queen))    the ((red|white) (king|queen))
1217  </pre>  </pre>
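The numbering rule can be checked with a short sketch. It uses Python's `re` module, which is not PCRE but counts opening parentheses from left to right in the same way for this pattern.

```python
import re

# Groups are numbered by the position of their opening parenthesis:
# 1 = ((red|white) (king|queen)), 2 = (red|white), 3 = (king|queen)
match = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

groups = match.groups()
print(groups)
```

Group 1 is the outer pair, so it holds "red king"; groups 2 and 3 hold "red" and "king".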
# Line 1262  at captured substring number one, whiche Line 1259  at captured substring number one, whiche
1259  is useful when you want to capture part, but not all, of one of a number of  is useful when you want to capture part, but not all, of one of a number of
1260  alternatives. Inside a (?| group, parentheses are numbered as usual, but the  alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1261  number is reset at the start of each branch. The numbers of any capturing  number is reset at the start of each branch. The numbers of any capturing
1262  buffers that follow the subpattern start after the highest number used in any  parentheses that follow the subpattern start after the highest number used in
1263  branch. The following example is taken from the Perl documentation.  any branch. The following example is taken from the Perl documentation. The
1264  The numbers underneath show in which buffer the captured content will be  numbers underneath show in which buffer the captured content will be stored.
 stored.  
1265  <pre>  <pre>
1266    # before  ---------------branch-reset----------- after    # before  ---------------branch-reset----------- after
1267    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
# Line 1377  items: Line 1373  items:
1373    the \C escape sequence    the \C escape sequence
1374    the \X escape sequence (in UTF-8 mode with Unicode properties)    the \X escape sequence (in UTF-8 mode with Unicode properties)
1375    the \R escape sequence    the \R escape sequence
1376    an escape such as \d that matches a single character    an escape such as \d or \pL that matches a single character
1377    a character class    a character class
1378    a back reference (see next section)    a back reference (see next section)
1379    a parenthesized subpattern (unless it is an assertion)    a parenthesized subpattern (unless it is an assertion)
# Line 1418  The quantifier {0} is permitted, causing Line 1414  The quantifier {0} is permitted, causing
1414  previous item and the quantifier were not present. This may be useful for  previous item and the quantifier were not present. This may be useful for
1415  subpatterns that are referenced as  subpatterns that are referenced as
1416  <a href="#subpatternsassubroutines">subroutines</a>  <a href="#subpatternsassubroutines">subroutines</a>
1417  from elsewhere in the pattern. Items other than subpatterns that have a {0}  from elsewhere in the pattern (but see also the section entitled
1418  quantifier are omitted from the compiled pattern.  <a href="#subdefine">"Defining subpatterns for use by reference only"</a>
1419    below). Items other than subpatterns that have a {0} quantifier are omitted
1420    from the compiled pattern.
1421  </P>  </P>
1422  <P>  <P>
1423  For convenience, the three most common quantifiers have single-character  For convenience, the three most common quantifiers have single-character
# Line 1655  subpattern is possible using named paren Line 1653  subpattern is possible using named paren
1653  </P>  </P>
1654  <P>  <P>
1655  Another way of avoiding the ambiguity inherent in the use of digits following a  Another way of avoiding the ambiguity inherent in the use of digits following a
1656  backslash is to use the \g escape sequence, which is a feature introduced in  backslash is to use the \g escape sequence. This escape must be followed by an
1657  Perl 5.10. This escape must be followed by an unsigned number or a negative  unsigned number or a negative number, optionally enclosed in braces. These
1658  number, optionally enclosed in braces. These examples are all identical:  examples are all identical:
1659  <pre>  <pre>
1660    (ring), \1    (ring), \1
1661    (ring), \g1    (ring), \g1
# Line 1804  lookbehind assertion is needed to achiev Line 1802  lookbehind assertion is needed to achiev
1802  If you want to force a matching failure at some point in a pattern, the most  If you want to force a matching failure at some point in a pattern, the most
1803  convenient way to do it is with (?!) because an empty string always matches, so  convenient way to do it is with (?!) because an empty string always matches, so
1804  an assertion that requires there not to be an empty string must always fail.  an assertion that requires there not to be an empty string must always fail.
1805  The Perl 5.10 backtracking control verb (*FAIL) or (*F) is essentially a  The backtracking control verb (*FAIL) or (*F) is essentially a synonym for
1806  synonym for (?!).  (?!).
1807  <a name="lookbehind"></a></P>  <a name="lookbehind"></a></P>
1808  <br><b>  <br><b>
1809  Lookbehind assertions  Lookbehind assertions
# Line 1829  is permitted, but Line 1827  is permitted, but
1827  </pre>  </pre>
1828  causes an error at compile time. Branches that match different length strings  causes an error at compile time. Branches that match different length strings
1829  are permitted only at the top level of a lookbehind assertion. This is an  are permitted only at the top level of a lookbehind assertion. This is an
1830  extension compared with Perl (5.8 and 5.10), which requires all branches to  extension compared with Perl, which requires all branches to match the same
1831  match the same length of string. An assertion such as  length of string. An assertion such as
1832  <pre>  <pre>
1833    (?&#60;=ab(c|de))    (?&#60;=ab(c|de))
1834  </pre>  </pre>
# Line 1840  branches: Line 1838  branches:
1838  <pre>  <pre>
1839    (?&#60;=abc|abde)    (?&#60;=abc|abde)
1840  </pre>  </pre>
1841  In some cases, the Perl 5.10 escape sequence \K  In some cases, the escape sequence \K
1842  <a href="#resetmatchstart">(see above)</a>  <a href="#resetmatchstart">(see above)</a>
1843  can be used instead of a lookbehind assertion to get round the fixed-length  can be used instead of a lookbehind assertion to get round the fixed-length
1844  restriction.  restriction.
# Line 2035  the most recent recursion. Line 2033  the most recent recursion.
2033  At "top level", all these recursion test conditions are false.  At "top level", all these recursion test conditions are false.
2034  <a href="#recursion">The syntax for recursive patterns</a>  <a href="#recursion">The syntax for recursive patterns</a>
2035  is described below.  is described below.
2036  </P>  <a name="subdefine"></a></P>
2037  <br><b>  <br><b>
2038  Defining subpatterns for use by reference only  Defining subpatterns for use by reference only
2039  </b><br>  </b><br>
# Line 2094  this case continues to immediately after Line 2092  this case continues to immediately after
2092  character sequence in the pattern. Which characters are interpreted as newlines  character sequence in the pattern. Which characters are interpreted as newlines
2093  is controlled by the options passed to <b>pcre_compile()</b> or by a special  is controlled by the options passed to <b>pcre_compile()</b> or by a special
2094  sequence at the start of the pattern, as described in the section entitled  sequence at the start of the pattern, as described in the section entitled
2095  <a href="#recursion">"Newline conventions"</a>  <a href="#newlines">"Newline conventions"</a>
2096  above. Note that end of this type of comment is a literal newline sequence in  above. Note that the end of this type of comment is a literal newline sequence
2097  the pattern; escape sequences that happen to represent a newline do not count.  in the pattern; escape sequences that happen to represent a newline do not
2098  For example, consider this pattern when PCRE_EXTENDED is set, and the default  count. For example, consider this pattern when PCRE_EXTENDED is set, and the
2099  newline convention is in force:  default newline convention is in force:
2100  <pre>  <pre>
2101    abc #comment \n still comment    abc #comment \n still comment
2102  </pre>  </pre>
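The difference between an escape sequence and a literal newline inside a comment can be sketched with Python's `re.VERBOSE` flag. This is Python's counterpart of extended mode, not PCRE itself, and is used here on the assumption that it treats `#` comments the same way for this pattern.

```python
import re

# Raw string: the pattern text holds a backslash followed by "n", not a newline,
# so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real newline character, which ends the comment,
# so "still" becomes part of the pattern.
literal = re.compile("abc #comment \n still", re.VERBOSE)

escaped_matches_abc = bool(escaped.fullmatch("abc"))
literal_matches_abcstill = bool(literal.fullmatch("abcstill"))
print(escaped_matches_abc, literal_matches_abcstill)
```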
# Line 2163  them instead of the whole pattern. Line 2161  them instead of the whole pattern.
2161  </P>  </P>
2162  <P>  <P>
2163  In a larger pattern, keeping track of parenthesis numbers can be tricky. This  In a larger pattern, keeping track of parenthesis numbers can be tricky. This
2164  is made easier by the use of relative references (a Perl 5.10 feature).  is made easier by the use of relative references. Instead of (?1) in the
2165  Instead of (?1) in the pattern above you can write (?-2) to refer to the second  pattern above you can write (?-2) to refer to the second most recently opened
2166  most recently opened parentheses preceding the recursion. In other words, a  parentheses preceding the recursion. In other words, a negative number counts
2167  negative number counts capturing parentheses leftwards from the point at which  capturing parentheses leftwards from the point at which it is encountered.
 it is encountered.  
2168  </P>  </P>
2169  <P>  <P>
2170  It is also possible to refer to subsequently opened parentheses, by writing  It is also possible to refer to subsequently opened parentheses, by writing
# Line 2676  Cambridge CB2 3QH, England. Line 2673  Cambridge CB2 3QH, England.
2673  </P>  </P>
2674  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2675  <P>  <P>
2676  Last updated: 31 October 2010  Last updated: 17 November 2010
2677  <br>  <br>
2678  Copyright &copy; 1997-2010 University of Cambridge.  Copyright &copy; 1997-2010 University of Cambridge.
2679  <br>  <br>
