
Diff of /code/trunk/doc/html/pcrepattern.html


revision 1411 by ph10, Tue Nov 19 15:36:57 2013 UTC revision 1412 by ph10, Sun Dec 15 17:01:46 2013 UTC
# Line 23  man page, in case the conversion went wr Line 23  man page, in case the conversion went wr
23  <li><a name="TOC8" href="#SEC8">MATCHING A SINGLE DATA UNIT</a>  <li><a name="TOC8" href="#SEC8">MATCHING A SINGLE DATA UNIT</a>
24  <li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>  <li><a name="TOC9" href="#SEC9">SQUARE BRACKETS AND CHARACTER CLASSES</a>
25  <li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>  <li><a name="TOC10" href="#SEC10">POSIX CHARACTER CLASSES</a>
26  <li><a name="TOC11" href="#SEC11">VERTICAL BAR</a>  <li><a name="TOC11" href="#SEC11">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a>
27  <li><a name="TOC12" href="#SEC12">INTERNAL OPTION SETTING</a>  <li><a name="TOC12" href="#SEC12">VERTICAL BAR</a>
28  <li><a name="TOC13" href="#SEC13">SUBPATTERNS</a>  <li><a name="TOC13" href="#SEC13">INTERNAL OPTION SETTING</a>
29  <li><a name="TOC14" href="#SEC14">DUPLICATE SUBPATTERN NUMBERS</a>  <li><a name="TOC14" href="#SEC14">SUBPATTERNS</a>
30  <li><a name="TOC15" href="#SEC15">NAMED SUBPATTERNS</a>  <li><a name="TOC15" href="#SEC15">DUPLICATE SUBPATTERN NUMBERS</a>
31  <li><a name="TOC16" href="#SEC16">REPETITION</a>  <li><a name="TOC16" href="#SEC16">NAMED SUBPATTERNS</a>
32  <li><a name="TOC17" href="#SEC17">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>  <li><a name="TOC17" href="#SEC17">REPETITION</a>
33  <li><a name="TOC18" href="#SEC18">BACK REFERENCES</a>  <li><a name="TOC18" href="#SEC18">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
34  <li><a name="TOC19" href="#SEC19">ASSERTIONS</a>  <li><a name="TOC19" href="#SEC19">BACK REFERENCES</a>
35  <li><a name="TOC20" href="#SEC20">CONDITIONAL SUBPATTERNS</a>  <li><a name="TOC20" href="#SEC20">ASSERTIONS</a>
36  <li><a name="TOC21" href="#SEC21">COMMENTS</a>  <li><a name="TOC21" href="#SEC21">CONDITIONAL SUBPATTERNS</a>
37  <li><a name="TOC22" href="#SEC22">RECURSIVE PATTERNS</a>  <li><a name="TOC22" href="#SEC22">COMMENTS</a>
38  <li><a name="TOC23" href="#SEC23">SUBPATTERNS AS SUBROUTINES</a>  <li><a name="TOC23" href="#SEC23">RECURSIVE PATTERNS</a>
39  <li><a name="TOC24" href="#SEC24">ONIGURUMA SUBROUTINE SYNTAX</a>  <li><a name="TOC24" href="#SEC24">SUBPATTERNS AS SUBROUTINES</a>
40  <li><a name="TOC25" href="#SEC25">CALLOUTS</a>  <li><a name="TOC25" href="#SEC25">ONIGURUMA SUBROUTINE SYNTAX</a>
41  <li><a name="TOC26" href="#SEC26">BACKTRACKING CONTROL</a>  <li><a name="TOC26" href="#SEC26">CALLOUTS</a>
42  <li><a name="TOC27" href="#SEC27">SEE ALSO</a>  <li><a name="TOC27" href="#SEC27">BACKTRACKING CONTROL</a>
43  <li><a name="TOC28" href="#SEC28">AUTHOR</a>  <li><a name="TOC28" href="#SEC28">SEE ALSO</a>
44  <li><a name="TOC29" href="#SEC29">REVISION</a>  <li><a name="TOC29" href="#SEC29">AUTHOR</a>
45    <li><a name="TOC30" href="#SEC30">REVISION</a>
46  </ul>  </ul>
47  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>  <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
48  <P>  <P>
# Line 536  For compatibility with Perl, \s did not Line 537  For compatibility with Perl, \s did not
537  added VT at release 5.18, and PCRE followed suit at release 8.34. The default  added VT at release 5.18, and PCRE followed suit at release 8.34. The default
538  \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space  \s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space
539  (32), which are defined as white space in the "C" locale. This list may vary if  (32), which are defined as white space in the "C" locale. This list may vary if
540  locale-specific matching is taking place; in particular, in some locales the  locale-specific matching is taking place. For example, in some locales the
541  "non-breaking space" character (\xA0) is recognized as white space.  "non-breaking space" character (\xA0) is recognized as white space, and in
542    others the VT character is not.
543  </P>  </P>
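The default \s list quoted above can be checked with a small sketch. It uses Python's `re` module rather than PCRE itself, on the assumption that `re.ASCII` mode is a reasonable stand-in for PCRE's default "C" locale behaviour. The helper name `ascii_space_codes` is made up for this illustration.

```python
import re

# Assumption: Python's re.ASCII mode approximates PCRE's default "C" locale tables.
WHITESPACE = re.compile(r"\s", re.ASCII)

def ascii_space_codes():
    """Return the code points below 256 that \\s matches in ASCII mode."""
    return [cp for cp in range(256) if WHITESPACE.fullmatch(chr(cp))]

# HT, LF, VT, FF, CR and space, as listed in the text.
print(ascii_space_codes())

# Without re.ASCII, Python matches Unicode white space, so \xA0 is accepted.
# This is only an analogy for the locale-dependent variation the text describes.
print(re.fullmatch(r"\s", "\xa0") is not None)
```

The second check is an analogy only: Python switches between ASCII and Unicode tables, whereas PCRE takes its tables from the locale in use.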
544  <P>  <P>
545  A "word" character is an underscore or any character that is a letter or digit.  A "word" character is an underscore or any character that is a letter or digit.
# Line 1315  something AND NOT ...". Line 1317  something AND NOT ...".
1317  The only metacharacters that are recognized in character classes are backslash,  The only metacharacters that are recognized in character classes are backslash,
1318  hyphen (only where it can be interpreted as specifying a range), circumflex  hyphen (only where it can be interpreted as specifying a range), circumflex
1319  (only at the start), opening square bracket (only when it can be interpreted as  (only at the start), opening square bracket (only when it can be interpreted as
1320  introducing a POSIX class name - see the next section), and the terminating  introducing a POSIX class name, or for a special compatibility feature - see
1321  closing square bracket. However, escaping other non-alphanumeric characters  the next two sections), and the terminating closing square bracket. However,
1322  does no harm.  escaping other non-alphanumeric characters does no harm.
1323  </P>  </P>
1324  <br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br>  <br><a name="SEC10" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
1325  <P>  <P>
# Line 1346  are: Line 1348  are:
1348    xdigit   hexadecimal digits    xdigit   hexadecimal digits
1349  </pre>  </pre>
1350  The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),  The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1351  and space (32). If locale-specific matching is taking place, there may be  and space (32). If locale-specific matching is taking place, the list of space
1352  additional space characters. "Space" used to be different to \s, which did not  characters may be different; there may be fewer or more of them. "Space" used
1353  include VT, for Perl compatibility. However, Perl changed at release 5.18, and  to be different to \s, which did not include VT, for Perl compatibility.
1354  PCRE followed at release 8.34. "Space" and \s now match the same set of  However, Perl changed at release 5.18, and PCRE followed at release 8.34.
1355  characters.  "Space" and \s now match the same set of characters.
1356  </P>  </P>
1357  <P>  <P>
1358  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl  The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
# Line 1409  plus those characters whose code points Line 1411  plus those characters whose code points
1411  The other POSIX classes are unchanged, and match only characters with code  The other POSIX classes are unchanged, and match only characters with code
1412  points less than 128.  points less than 128.
1413  </P>  </P>
1414  <br><a name="SEC11" href="#TOC1">VERTICAL BAR</a><br>  <br><a name="SEC11" href="#TOC1">COMPATIBILITY FEATURE FOR WORD BOUNDARIES</a><br>
1415    <P>
1416    In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
1417    syntax [[:&#60;:]] and [[:&#62;:]] is used for matching "start of word" and "end of
1418    word". PCRE treats these items as follows:
1419    <pre>
1420      [[:&#60;:]]  is converted to  \b(?=\w)
1421      [[:&#62;:]]  is converted to  \b(?&#60;=\w)
1422    </pre>
1423    Only these exact character sequences are recognized. A sequence such as
1424    [a[:&#60;:]b] provokes error for an unrecognized POSIX class name. This support is
1425    not compatible with Perl. It is provided to help migrations from other
1426    environments, and is best not used in any new patterns. Note that \b matches
1427    at the start and the end of a word (see
1428    <a href="#smallassertions">"Simple assertions"</a>
1429    above), and in a Perl-style pattern the preceding or following character
1430    normally shows which is wanted, without the need for the assertions that are
1431    used above in order to give exactly the POSIX behaviour.
1432    </P>
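The conversion described in the added section can be sketched as follows. This is an illustration in Python's `re` module, not PCRE's own implementation; it assumes `re` treats \b, \w and the lookaround assertions the same way for plain ASCII input. The table `BSD_TO_PERL` and the function `translate_bsd_word_boundaries` are hypothetical names.

```python
import re

# Hypothetical lookup table mirroring the two conversions listed in the text.
BSD_TO_PERL = {
    "[[:<:]]": r"\b(?=\w)",    # start of word
    "[[:>:]]": r"\b(?<=\w)",   # end of word
}

def translate_bsd_word_boundaries(pattern: str) -> str:
    """Rewrite the exact BSD word-boundary items into Perl-style assertions.

    This is a naive textual substitution: it does not parse the pattern,
    so it would also rewrite the sequences inside a larger character class.
    """
    for bsd, perl in BSD_TO_PERL.items():
        pattern = pattern.replace(bsd, perl)
    return pattern

translated = translate_bsd_word_boundaries("[[:<:]]cat[[:>:]]")
print(translated)
# Only the standalone words match; "concat" and "cats" do not.
print(re.findall(translated, "cat concat cats cat."))
```

In a pattern written for Perl or PCRE from the start, a plain \bcat\b gives the same result, which is why the text advises against using the BSD forms in new patterns.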
1433    <br><a name="SEC12" href="#TOC1">VERTICAL BAR</a><br>
1434  <P>  <P>
1435  Vertical bar characters are used to separate alternative patterns. For example,  Vertical bar characters are used to separate alternative patterns. For example,
1436  the pattern  the pattern
# Line 1424  that succeeds is used. If the alternativ Line 1445  that succeeds is used. If the alternativ
1445  "succeeds" means matching the rest of the main pattern as well as the  "succeeds" means matching the rest of the main pattern as well as the
1446  alternative in the subpattern.  alternative in the subpattern.
1447  </P>  </P>
1448  <br><a name="SEC12" href="#TOC1">INTERNAL OPTION SETTING</a><br>  <br><a name="SEC13" href="#TOC1">INTERNAL OPTION SETTING</a><br>
1449  <P>  <P>
1450  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and  The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
1451  PCRE_EXTENDED options (which are Perl-compatible) can be changed from within  PCRE_EXTENDED options (which are Perl-compatible) can be changed from within
# Line 1487  options, respectively. The (*UTF) sequen Line 1508  options, respectively. The (*UTF) sequen
1508  used with any of the libraries. However, the application can set the  used with any of the libraries. However, the application can set the
1509  PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences.  PCRE_NEVER_UTF option, which locks out the use of the (*UTF) sequences.
1510  <a name="subpattern"></a></P>  <a name="subpattern"></a></P>
1511  <br><a name="SEC13" href="#TOC1">SUBPATTERNS</a><br>  <br><a name="SEC14" href="#TOC1">SUBPATTERNS</a><br>
1512  <P>  <P>
1513  Subpatterns are delimited by parentheses (round brackets), which can be nested.  Subpatterns are delimited by parentheses (round brackets), which can be nested.
1514  Turning part of a pattern into a subpattern does two things:  Turning part of a pattern into a subpattern does two things:
# Line 1543  from left to right, and options are not Line 1564  from left to right, and options are not
1564  is reached, an option setting in one branch does affect subsequent branches, so  is reached, an option setting in one branch does affect subsequent branches, so
1565  the above patterns match "SUNDAY" as well as "Saturday".  the above patterns match "SUNDAY" as well as "Saturday".
1566  <a name="dupsubpatternnumber"></a></P>  <a name="dupsubpatternnumber"></a></P>
1567  <br><a name="SEC14" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>  <br><a name="SEC15" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
1568  <P>  <P>
1569  Perl 5.10 introduced a feature whereby each alternative in a subpattern uses  Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
1570  the same numbers for its capturing parentheses. Such a subpattern starts with  the same numbers for its capturing parentheses. Such a subpattern starts with
# Line 1587  true if any of the subpatterns of that n Line 1608  true if any of the subpatterns of that n
1608  An alternative approach to using this "branch reset" feature is to use  An alternative approach to using this "branch reset" feature is to use
1609  duplicate named subpatterns, as described in the next section.  duplicate named subpatterns, as described in the next section.
1610  </P>  </P>
1611  <br><a name="SEC15" href="#TOC1">NAMED SUBPATTERNS</a><br>  <br><a name="SEC16" href="#TOC1">NAMED SUBPATTERNS</a><br>
1612  <P>  <P>
1613  Identifying capturing parentheses by number is simple, but it can be very hard  Identifying capturing parentheses by number is simple, but it can be very hard
1614  to keep track of the numbers in complicated regular expressions. Furthermore,  to keep track of the numbers in complicated regular expressions. Furthermore,
# Line 1678  are given to subpatterns with the same n Line 1699  are given to subpatterns with the same n
1699  same name to subpatterns with the same number, even when PCRE_DUPNAMES is not  same name to subpatterns with the same number, even when PCRE_DUPNAMES is not
1700  set.  set.
1701  </P>  </P>
1702  <br><a name="SEC16" href="#TOC1">REPETITION</a><br>  <br><a name="SEC17" href="#TOC1">REPETITION</a><br>
1703  <P>  <P>
1704  Repetition is specified by quantifiers, which can follow any of the following  Repetition is specified by quantifiers, which can follow any of the following
1705  items:  items:
# Line 1846  example, after Line 1867  example, after
1867  </pre>  </pre>
1868  matches "aba" the value of the second captured substring is "b".  matches "aba" the value of the second captured substring is "b".
1869  <a name="atomicgroup"></a></P>  <a name="atomicgroup"></a></P>
1870  <br><a name="SEC17" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>  <br><a name="SEC18" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
1871  <P>  <P>
1872  With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")  With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
1873  repetition, failure of what follows normally causes the repeated item to be  repetition, failure of what follows normally causes the repeated item to be
# Line 1950  an atomic group, like this: Line 1971  an atomic group, like this:
1971  </pre>  </pre>
1972  sequences of non-digits cannot be broken, and failure happens quickly.  sequences of non-digits cannot be broken, and failure happens quickly.
1973  <a name="backreferences"></a></P>  <a name="backreferences"></a></P>
1974  <br><a name="SEC18" href="#TOC1">BACK REFERENCES</a><br>  <br><a name="SEC19" href="#TOC1">BACK REFERENCES</a><br>
1975  <P>  <P>
1976  Outside a character class, a backslash followed by a digit greater than 0 (and  Outside a character class, a backslash followed by a digit greater than 0 (and
1977  possibly further digits) is a back reference to a capturing subpattern earlier  possibly further digits) is a back reference to a capturing subpattern earlier
# Line 2078  as an Line 2099  as an
2099  Once the whole group has been matched, a subsequent matching failure cannot  Once the whole group has been matched, a subsequent matching failure cannot
2100  cause backtracking into the middle of the group.  cause backtracking into the middle of the group.
2101  <a name="bigassertions"></a></P>  <a name="bigassertions"></a></P>
2102  <br><a name="SEC19" href="#TOC1">ASSERTIONS</a><br>  <br><a name="SEC20" href="#TOC1">ASSERTIONS</a><br>
2103  <P>  <P>
2104  An assertion is a test on the characters following or preceding the current  An assertion is a test on the characters following or preceding the current
2105  matching point that does not actually consume any characters. The simple  matching point that does not actually consume any characters. The simple
# Line 2268  preceded by "foo", while Line 2289  preceded by "foo", while
2289  is another pattern that matches "foo" preceded by three digits and any three  is another pattern that matches "foo" preceded by three digits and any three
2290  characters that are not "999".  characters that are not "999".
2291  <a name="conditions"></a></P>  <a name="conditions"></a></P>
2292  <br><a name="SEC20" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>  <br><a name="SEC21" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
2293  <P>  <P>
2294  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
2295  conditionally or to choose between two alternative subpatterns, depending on  conditionally or to choose between two alternative subpatterns, depending on
# Line 2418  subject is matched against the first alt Line 2439  subject is matched against the first alt
2439  against the second. This pattern matches strings in one of the two forms  against the second. This pattern matches strings in one of the two forms
2440  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
2441  <a name="comments"></a></P>  <a name="comments"></a></P>
2442  <br><a name="SEC21" href="#TOC1">COMMENTS</a><br>  <br><a name="SEC22" href="#TOC1">COMMENTS</a><br>
2443  <P>  <P>
2444  There are two ways of including comments in patterns that are processed by  There are two ways of including comments in patterns that are processed by
2445  PCRE. In both cases, the start of the comment must not be in a character class,  PCRE. In both cases, the start of the comment must not be in a character class,
# Line 2447  a newline in the pattern. The sequence \ Line 2468  a newline in the pattern. The sequence \
2468  it does not terminate the comment. Only an actual character with the code value  it does not terminate the comment. Only an actual character with the code value
2469  0x0a (the default newline) does so.  0x0a (the default newline) does so.
2470  <a name="recursion"></a></P>  <a name="recursion"></a></P>
2471  <br><a name="SEC22" href="#TOC1">RECURSIVE PATTERNS</a><br>  <br><a name="SEC23" href="#TOC1">RECURSIVE PATTERNS</a><br>
2472  <P>  <P>
2473  Consider the problem of matching a string in parentheses, allowing for  Consider the problem of matching a string in parentheses, allowing for
2474  unlimited nested parentheses. Without the use of recursion, the best that can  unlimited nested parentheses. Without the use of recursion, the best that can
# Line 2662  now match "b" and so the whole match suc Line 2683  now match "b" and so the whole match suc
2683  match because inside the recursive call \1 cannot access the externally set  match because inside the recursive call \1 cannot access the externally set
2684  value.  value.
2685  <a name="subpatternsassubroutines"></a></P>  <a name="subpatternsassubroutines"></a></P>
2686  <br><a name="SEC23" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>  <br><a name="SEC24" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
2687  <P>  <P>
2688  If the syntax for a recursive subpattern call (either by number or by  If the syntax for a recursive subpattern call (either by number or by
2689  name) is used outside the parentheses to which it refers, it operates like a  name) is used outside the parentheses to which it refers, it operates like a
# Line 2703  different calls. For example, consider t Line 2724  different calls. For example, consider t
2724  It matches "abcabc". It does not match "abcABC" because the change of  It matches "abcabc". It does not match "abcABC" because the change of
2725  processing option does not affect the called subpattern.  processing option does not affect the called subpattern.
2726  <a name="onigurumasubroutines"></a></P>  <a name="onigurumasubroutines"></a></P>
2727  <br><a name="SEC24" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>  <br><a name="SEC25" href="#TOC1">ONIGURUMA SUBROUTINE SYNTAX</a><br>
2728  <P>  <P>
2729  For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or  For compatibility with Oniguruma, the non-Perl syntax \g followed by a name or
2730  a number enclosed either in angle brackets or single quotes, is an alternative  a number enclosed either in angle brackets or single quotes, is an alternative
# Line 2721  plus or a minus sign it is taken as a re Line 2742  plus or a minus sign it is taken as a re
2742  Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>  Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
2743  synonymous. The former is a back reference; the latter is a subroutine call.  synonymous. The former is a back reference; the latter is a subroutine call.
2744  </P>  </P>
2745  <br><a name="SEC25" href="#TOC1">CALLOUTS</a><br>  <br><a name="SEC26" href="#TOC1">CALLOUTS</a><br>
2746  <P>  <P>
2747  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl  Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
2748  code to be obeyed in the middle of matching a regular expression. This makes it  code to be obeyed in the middle of matching a regular expression. This makes it
# Line 2770  interface to the callout function, are g Line 2791  interface to the callout function, are g
2791  <a href="pcrecallout.html"><b>pcrecallout</b></a>  <a href="pcrecallout.html"><b>pcrecallout</b></a>
2792  documentation.  documentation.
2793  <a name="backtrackcontrol"></a></P>  <a name="backtrackcontrol"></a></P>
2794  <br><a name="SEC26" href="#TOC1">BACKTRACKING CONTROL</a><br>  <br><a name="SEC27" href="#TOC1">BACKTRACKING CONTROL</a><br>
2795  <P>  <P>
2796  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which  Perl 5.10 introduced a number of "Special Backtracking Control Verbs", which
2797  are still described in the Perl documentation as "experimental and subject to  are still described in the Perl documentation as "experimental and subject to
# Line 3184  the subroutine match to fail. Line 3205  the subroutine match to fail.
3205  the subpattern that has alternatives. If there is no such group within the  the subpattern that has alternatives. If there is no such group within the
3206  subpattern, (*THEN) causes the subroutine match to fail.  subpattern, (*THEN) causes the subroutine match to fail.
3207  </P>  </P>
3208  <br><a name="SEC27" href="#TOC1">SEE ALSO</a><br>  <br><a name="SEC28" href="#TOC1">SEE ALSO</a><br>
3209  <P>  <P>
3210  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3),  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3),
3211  <b>pcresyntax</b>(3), <b>pcre</b>(3), <b>pcre16(3)</b>, <b>pcre32(3)</b>.  <b>pcresyntax</b>(3), <b>pcre</b>(3), <b>pcre16(3)</b>, <b>pcre32(3)</b>.
3212  </P>  </P>
3213  <br><a name="SEC28" href="#TOC1">AUTHOR</a><br>  <br><a name="SEC29" href="#TOC1">AUTHOR</a><br>
3214  <P>  <P>
3215  Philip Hazel  Philip Hazel
3216  <br>  <br>
# Line 3198  University Computing Service Line 3219  University Computing Service
3219  Cambridge CB2 3QH, England.  Cambridge CB2 3QH, England.
3220  <br>  <br>
3221  </P>  </P>
3222  <br><a name="SEC29" href="#TOC1">REVISION</a><br>  <br><a name="SEC30" href="#TOC1">REVISION</a><br>
3223  <P>  <P>
3224  Last updated: 12 November 2013  Last updated: 03 December 2013
3225  <br>  <br>
3226  Copyright &copy; 1997-2013 University of Cambridge.  Copyright &copy; 1997-2013 University of Cambridge.
3227  <br>  <br>
