Diff of /code/trunk/doc/html/pcrepattern.html

revision 171 by ph10, Tue Apr 24 13:36:11 2007 UTC | revision 172 by ph10, Tue Jun 5 10:40:13 2007 UTC
# Line 63  The remainder of this document discusses Line 63  The remainder of this document discusses
63  PCRE when its main matching function, <b>pcre_exec()</b>, is used.  PCRE when its main matching function, <b>pcre_exec()</b>, is used.
64  From release 6.0, PCRE offers a second matching function,  From release 6.0, PCRE offers a second matching function,
65  <b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not  <b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not
66  Perl-compatible. The advantages and disadvantages of the alternative function,  Perl-compatible. Some of the features discussed below are not available when
67  and how it differs from the normal function, are discussed in the  <b>pcre_dfa_exec()</b> is used. The advantages and disadvantages of the
68    alternative function, and how it differs from the normal function, are
69    discussed in the
70  <a href="pcrematching.html"><b>pcrematching</b></a>  <a href="pcrematching.html"><b>pcrematching</b></a>
71  page.  page.
72  </P>  </P>
# Line 253  Absolute and relative back references Line 255  Absolute and relative back references
255  </b><br>  </b><br>
256  <P>  <P>
257  The sequence \g followed by a positive or negative number, optionally enclosed  The sequence \g followed by a positive or negative number, optionally enclosed
258  in braces, is an absolute or relative back reference. Back references are  in braces, is an absolute or relative back reference. A named back reference
259  discussed  can be coded as \g{name}. Back references are discussed
260  <a href="#backreferences">later,</a>  <a href="#backreferences">later,</a>
261  following the discussion of  following the discussion of
262  <a href="#subpattern">parenthesized subpatterns.</a>  <a href="#subpattern">parenthesized subpatterns.</a>
# Line 528  Matching characters by Unicode property Line 530  Matching characters by Unicode property
530  a structure that contains data for over fifteen thousand characters. That is  a structure that contains data for over fifteen thousand characters. That is
531  why the traditional escape sequences such as \d and \w do not use Unicode  why the traditional escape sequences such as \d and \w do not use Unicode
532  properties in PCRE.  properties in PCRE.
533    <a name="resetmatchstart"></a></P>
534    <br><b>
535    Resetting the match start
536    </b><br>
537    <P>
538    The escape sequence \K, which is a Perl 5.10 feature, causes any previously
539    matched characters not to be included in the final matched sequence. For
540    example, the pattern:
541    <pre>
542      foo\Kbar
543    </pre>
544    matches "foobar", but reports that it has matched "bar". This feature is
545    similar to a lookbehind assertion
546    <a href="#lookbehind">(described below).</a>
547    However, in this case, the part of the subject before the real match does not
548    have to be of fixed length, as lookbehind assertions do. The use of \K does
549    not interfere with the setting of
550    <a href="#subpattern">captured substrings.</a>
551    For example, when the pattern
552    <pre>
553      (foo)\Kbar
554    </pre>
555    matches "foobar", the first substring is still set to "foo".
556  <a name="smallassertions"></a></P>  <a name="smallassertions"></a></P>
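Reviewer's sketch, not part of the patch: the \K text added above says the prefix before the reported match need not be of fixed length. Python's standard re module does not implement \K, so the example below only emulates the reported result with a capturing group. The pattern fo+\Kbar and the helper names are hypothetical, chosen for illustration.

```python
import re

# Assumption: Python's standard `re` has no \K, so a capturing group
# stands in for "the part after \K". The PCRE pattern being imitated
# is the hypothetical  fo+\Kbar  (a variable-length prefix).
EMULATED = re.compile(r"fo+(bar)")


def match_after_prefix(subject):
    """Return (text, span) of the part PCRE would report, or None."""
    m = EMULATED.search(subject)
    return (m.group(1), m.span(1)) if m else None


def lookbehind_is_rejected():
    """True if `re` refuses the variable-length lookbehind (?<=fo+)bar."""
    try:
        re.compile(r"(?<=fo+)bar")
    except re.error:
        return True
    return False


if __name__ == "__main__":
    print(match_after_prefix("xfooobar"))
    print(lookbehind_is_rejected())
```

The second helper shows the restriction that \K sidesteps: a lookbehind with an unbounded repeat is refused at compile time, while the prefix-then-report form accepts it.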
557  <br><b>  <br><b>
558  Simple assertions  Simple assertions
# Line 1309  matches "rah rah" and "RAH RAH", but not Line 1334  matches "rah rah" and "RAH RAH", but not
1334  capturing subpattern is matched caselessly.  capturing subpattern is matched caselessly.
1335  </P>  </P>
1336  <P>  <P>
1337  Back references to named subpatterns use the Perl syntax \k&#60;name&#62; or \k'name'  There are several different ways of writing back references to named
1338  or the Python syntax (?P=name). We could rewrite the above example in either of  subpatterns. The .NET syntax \k{name} and the Perl syntax \k&#60;name&#62; or
1339    \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
1340    back reference syntax, in which \g can be used for both numeric and named
1341    references, is also supported. We could rewrite the above example in any of
1342  the following ways:  the following ways:
1343  <pre>  <pre>
1344    (?&#60;p1&#62;(?i)rah)\s+\k&#60;p1&#62;    (?&#60;p1&#62;(?i)rah)\s+\k&#60;p1&#62;
1345      (?'p1'(?i)rah)\s+\k{p1}
1346    (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)    (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
1347      (?&#60;p1&#62;(?i)rah)\s+\g{p1}
1348  </pre>  </pre>
1349  A subpattern that is referenced by name may appear in the pattern before or  A subpattern that is referenced by name may appear in the pattern before or
1350  after the reference.  after the reference.
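Reviewer's sketch, not part of the patch: of the four spellings listed in this hunk, Python's standard re module accepts only the Python one, (?P=name). The example below checks the "rah" behaviour with that spelling. It assumes the scoped form (?i:rah) in place of (?i)rah, because recent Python versions reject an inline global flag in the middle of a pattern. The names PATTERN and matches are hypothetical.

```python
import re

# Python-syntax named group and named back reference. Only the group
# body is caseless; the back reference must repeat the captured text
# exactly, which mirrors the behaviour described for PCRE.
PATTERN = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def matches(subject):
    """True if the whole subject matches the named back reference pattern."""
    return PATTERN.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ("rah rah", "RAH RAH", "RAH rah"):
        print(s, matches(s))
```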
# Line 1432  lengths, but it is acceptable if rewritt Line 1462  lengths, but it is acceptable if rewritt
1462  <pre>  <pre>
1463    (?&#60;=abc|abde)    (?&#60;=abc|abde)
1464  </pre>  </pre>
1465    In some cases, the Perl 5.10 escape sequence \K
1466    <a href="#resetmatchstart">(see above)</a>
1467    can be used instead of a lookbehind assertion; this is not restricted to a
1468    fixed length.
1469    </P>
1470    <P>
1471  The implementation of lookbehind assertions is, for each alternative, to  The implementation of lookbehind assertions is, for each alternative, to
1472  temporarily move the current position back by the fixed length and then try to  temporarily move the current position back by the fixed length and then try to
1473  match. If there are insufficient characters before the current position, the  match. If there are insufficient characters before the current position, the
# Line 1528  Checking for a used subpattern by number Line 1564  Checking for a used subpattern by number
1564  <P>  <P>
1565  If the text between the parentheses consists of a sequence of digits, the  If the text between the parentheses consists of a sequence of digits, the
1566  condition is true if the capturing subpattern of that number has previously  condition is true if the capturing subpattern of that number has previously
1567  matched.  matched. An alternative notation is to precede the digits with a plus or minus
1568    sign. In this case, the subpattern number is relative rather than absolute.
1569    The most recently opened parentheses can be referenced by (?(-1), the next most
1570    recent by (?(-2), and so on. In looping constructs it can also make sense to
1571    refer to subsequent groups with constructs such as (?(+2).
1572  </P>  </P>
1573  <P>  <P>
1574  Consider the following pattern, which contains non-significant white space to  Consider the following pattern, which contains non-significant white space to
# Line 1547  parenthesis is required. Otherwise, sinc Line 1587  parenthesis is required. Otherwise, sinc
1587  subpattern matches nothing. In other words, this pattern matches a sequence of  subpattern matches nothing. In other words, this pattern matches a sequence of
1588  non-parentheses, optionally enclosed in parentheses.  non-parentheses, optionally enclosed in parentheses.
1589  </P>  </P>
1590    <P>
1591    If you were embedding this pattern in a larger one, you could use a relative
1592    reference:
1593    <pre>
1594      ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...
1595    </pre>
1596    This makes the fragment independent of the parentheses in the larger pattern.
1597    </P>
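Reviewer's sketch, not part of the patch: the conditional pattern discussed in this hunk can be tried with Python's standard re module, which supports the absolute form (?(1)...) but not the relative form (?(-1)...) added here. The example therefore uses group number 1 and only illustrates the "optionally enclosed in parentheses" behaviour. The names PATTERN and is_balanced_fragment are hypothetical.

```python
import re

# Verbose mode lets the pattern keep the non-significant white space
# used in the documentation. (?(1) \) ) requires a closing parenthesis
# only if group 1 (the opening parenthesis) took part in the match.
PATTERN = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_balanced_fragment(subject):
    """True for non-parentheses, optionally wrapped in one pair of parentheses."""
    return PATTERN.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ("abc", "(abc)", "(abc", "abc)"):
        print(s, is_balanced_fragment(s))
```

With a relative reference, as in the added text, the same fragment could be pasted into a larger pattern without renumbering. That part is not exercised here because re has no equivalent.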
1598  <br><b>  <br><b>
1599  Checking for a used subpattern by name  Checking for a used subpattern by name
1600  </b><br>  </b><br>
# Line 1697  pattern, so instead you could use this: Line 1745  pattern, so instead you could use this:
1745    ( \( ( (?&#62;[^()]+) | (?1) )* \) )    ( \( ( (?&#62;[^()]+) | (?1) )* \) )
1746  </pre>  </pre>
1747  We have put the pattern into parentheses, and caused the recursion to refer to  We have put the pattern into parentheses, and caused the recursion to refer to
1748  them instead of the whole pattern. In a larger pattern, keeping track of  them instead of the whole pattern.
1749  parenthesis numbers can be tricky. It may be more convenient to use named  </P>
1750  parentheses instead. The Perl syntax for this is (?&name); PCRE's earlier  <P>
1751  syntax (?P&#62;name) is also supported. We could rewrite the above example as  In a larger pattern, keeping track of parenthesis numbers can be tricky. This
1752  follows:  is made easier by the use of relative references. (A Perl 5.10 feature.)
1753    Instead of (?1) in the pattern above you can write (?-2) to refer to the second
1754    most recently opened parentheses preceding the recursion. In other words, a
1755    negative number counts capturing parentheses leftwards from the point at which
1756    it is encountered.
1757    </P>
1758    <P>
1759    It is also possible to refer to subsequently opened parentheses, by writing
1760    references such as (?+2). However, these cannot be recursive because the
1761    reference is not inside the parentheses that are referenced. They are always
1762    "subroutine" calls, as described in the next section.
1763    </P>
1764    <P>
1765    An alternative approach is to use named parentheses instead. The Perl syntax
1766    for this is (?&name); PCRE's earlier syntax (?P&#62;name) is also supported. We
1767    could rewrite the above example as follows:
1768  <pre>  <pre>
1769    (?&#60;pn&#62; \( ( (?&#62;[^()]+) | (?&pn) )* \) )    (?&#60;pn&#62; \( ( (?&#62;[^()]+) | (?&pn) )* \) )
1770  </pre>  </pre>
1771  If there is more than one subpattern with the same name, the earliest one is  If there is more than one subpattern with the same name, the earliest one is
1772  used. This particular example pattern contains nested unlimited repeats, and so  used.
1773  the use of atomic grouping for matching strings of non-parentheses is important  </P>
1774  when applying the pattern to strings that do not match. For example, when this  <P>
1775  pattern is applied to  This particular example pattern that we have been looking at contains nested
1776    unlimited repeats, and so the use of atomic grouping for matching strings of
1777    non-parentheses is important when applying the pattern to strings that do not
1778    match. For example, when this pattern is applied to
1779  <pre>  <pre>
1780    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1781  </pre>  </pre>
# Line 1758  is the actual recursive call. Line 1824  is the actual recursive call.
1824  If the syntax for a recursive subpattern reference (either by number or by  If the syntax for a recursive subpattern reference (either by number or by
1825  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
1826  subroutine in a programming language. The "called" subpattern may be defined  subroutine in a programming language. The "called" subpattern may be defined
1827  before or after the reference. An earlier example pointed out that the pattern  before or after the reference. A numbered reference can be absolute or
1828    relative, as in these examples:
1829    <pre>
1830      (...(absolute)...)...(?2)...
1831      (...(relative)...)...(?-1)...
1832      (...(?+1)...(relative)...
1833    </pre>
1834    An earlier example pointed out that the pattern
1835  <pre>  <pre>
1836    (sens|respons)e and \1ibility    (sens|respons)e and \1ibility
1837  </pre>  </pre>
# Line 1781  When a subpattern is used as a subroutin Line 1854  When a subpattern is used as a subroutin
1854  case-independence are fixed when the subpattern is defined. They cannot be  case-independence are fixed when the subpattern is defined. They cannot be
1855  changed for different calls. For example, consider this pattern:  changed for different calls. For example, consider this pattern:
1856  <pre>  <pre>
1857    (abc)(?i:(?1))    (abc)(?i:(?-1))
1858  </pre>  </pre>
1859  It matches "abcabc". It does not match "abcABC" because the change of  It matches "abcabc". It does not match "abcABC" because the change of
1860  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
# Line 1836  Cambridge CB2 3QH, England. Line 1909  Cambridge CB2 3QH, England.
1909  </P>  </P>
1910  <br><a name="SEC24" href="#TOC1">REVISION</a><br>  <br><a name="SEC24" href="#TOC1">REVISION</a><br>
1911  <P>  <P>
1912  Last updated: 06 March 2007  Last updated: 29 May 2007
1913  <br>  <br>
1914  Copyright &copy; 1997-2007 University of Cambridge.  Copyright &copy; 1997-2007 University of Cambridge.
1915  <br>  <br>