Diff of /code/trunk/doc/html/pcrepattern.html

revision 460 by ph10, Tue Sep 22 09:42:11 2009 UTC revision 461 by ph10, Mon Oct 5 10:59:35 2009 UTC
# Line 61  description of PCRE's regular expression Line 61  description of PCRE's regular expression
61  </P>  </P>
62  <P>  <P>
63  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
64  there is now also support for UTF-8 character strings. To use this, you must  there is now also support for UTF-8 character strings. To use this,
65  build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with  PCRE must be built to include UTF-8 support, and you must call
66  the PCRE_UTF8 option. There is also a special sequence that can be given at the  <b>pcre_compile()</b> or <b>pcre_compile2()</b> with the PCRE_UTF8 option. There
67  start of a pattern:  is also a special sequence that can be given at the start of a pattern:
68  <pre>  <pre>
69    (*UTF8)    (*UTF8)
70  </pre>  </pre>
# Line 111  string with one of the following five se Line 111  string with one of the following five se
111    (*ANYCRLF)   any of the three above    (*ANYCRLF)   any of the three above
112    (*ANY)       all Unicode newline sequences    (*ANY)       all Unicode newline sequences
113  </pre>  </pre>
114  These override the default and the options given to <b>pcre_compile()</b>. For  These override the default and the options given to <b>pcre_compile()</b> or
115  example, on a Unix system where LF is the default newline sequence, the pattern  <b>pcre_compile2()</b>. For example, on a Unix system where LF is the default
116    newline sequence, the pattern
117  <pre>  <pre>
118    (*CR)a.b    (*CR)a.b
119  </pre>  </pre>
# Line 228  Non-printing characters Line 229  Non-printing characters
229  A second use of backslash provides a way of encoding non-printing characters  A second use of backslash provides a way of encoding non-printing characters
230  in patterns in a visible manner. There is no restriction on the appearance of  in patterns in a visible manner. There is no restriction on the appearance of
231  non-printing characters, apart from the binary zero that terminates a pattern,  non-printing characters, apart from the binary zero that terminates a pattern,
232  but when a pattern is being prepared by text editing, it is usually easier to  but when a pattern is being prepared by text editing, it is often easier to use
233  use one of the following escape sequences than the binary character it  one of the following escape sequences than the binary character it represents:
 represents:  
234  <pre>  <pre>
235    \a        alarm, that is, the BEL character (hex 07)    \a        alarm, that is, the BEL character (hex 07)
236    \cx       "control-x", where x is any character    \cx       "control-x", where x is any character
# Line 334  a number enclosed either in angle bracke Line 334  a number enclosed either in angle bracke
334  syntax for referencing a subpattern as a "subroutine". Details are discussed  syntax for referencing a subpattern as a "subroutine". Details are discussed
335  <a href="#onigurumasubroutines">later.</a>  <a href="#onigurumasubroutines">later.</a>
336  Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>  Note that \g{...} (Perl syntax) and \g&#60;...&#62; (Oniguruma syntax) are <i>not</i>
337  synonymous. The former is a back reference; the latter is a  synonymous. The former is a back reference; the latter is a
338  <a href="#subpatternsassubroutines">subroutine</a>  <a href="#subpatternsassubroutines">subroutine</a>
339  call.  call.
340  </P>  </P>
# Line 465  one of the following sequences: Line 465  one of the following sequences:
465    (*BSR_ANYCRLF)   CR, LF, or CRLF only    (*BSR_ANYCRLF)   CR, LF, or CRLF only
466    (*BSR_UNICODE)   any Unicode newline sequence    (*BSR_UNICODE)   any Unicode newline sequence
467  </pre>  </pre>
468  These override the default and the options given to <b>pcre_compile()</b>, but  These override the default and the options given to <b>pcre_compile()</b> or
469  they can be overridden by options given to <b>pcre_exec()</b>. Note that these  <b>pcre_compile2()</b>, but they can be overridden by options given to
470  special settings, which are not Perl-compatible, are recognized only at the  <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. Note that these special settings,
471  very start of a pattern, and that they must be in upper case. If more than one  which are not Perl-compatible, are recognized only at the very start of a
472  of them is present, the last one is used. They can be combined with a change of  pattern, and that they must be in upper case. If more than one of them is
473  newline convention, for example, a pattern can start with:  present, the last one is used. They can be combined with a change of newline
474    convention, for example, a pattern can start with:
475  <pre>  <pre>
476    (*ANY)(*BSR_ANYCRLF)    (*ANY)(*BSR_ANYCRLF)
477  </pre>  </pre>
# Line 731  different meaning, namely the backspace Line 732  different meaning, namely the backspace
732  A word boundary is a position in the subject string where the current character  A word boundary is a position in the subject string where the current character
733  and the previous character do not both match \w or \W (i.e. one matches  and the previous character do not both match \w or \W (i.e. one matches
734  \w and the other matches \W), or the start or end of the string if the  \w and the other matches \W), or the start or end of the string if the
735  first or last character matches \w, respectively.  first or last character matches \w, respectively. Neither PCRE nor Perl has a
736    separate "start of word" or "end of word" metasequence. However, whatever
737    follows \b normally determines which it is. For example, the fragment
738    \ba matches "a" at the start of a word.
739  </P>  </P>
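The point about \b taking its meaning from what follows (or precedes) it can be sketched as below. This is an illustration only: it uses Python's standard `re` module rather than PCRE, on the assumption that \b behaves the same way for plain ASCII text; the sample string is made up.

```python
import re

text = "a banana and alpha"

# \ba finds "a" only where a word starts.
starts = [m.start() for m in re.finditer(r"\ba", text)]

# a\b finds "a" only where a word ends.
ends = [m.start() for m in re.finditer(r"a\b", text)]

print(starts, ends)
```

The same \b assertion acts as a "start of word" test in the first pattern and as an "end of word" test in the second, purely because of where the literal sits relative to it.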
740  <P>  <P>
741  The \A, \Z, and \z assertions differ from the traditional circumflex and  The \A, \Z, and \z assertions differ from the traditional circumflex and
# Line 862  the lookbehind. Line 866  the lookbehind.
866  <br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>  <br><a name="SEC8" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
867  <P>  <P>
868  An opening square bracket introduces a character class, terminated by a closing  An opening square bracket introduces a character class, terminated by a closing
869  square bracket. A closing square bracket on its own is not special. If a  square bracket. A closing square bracket on its own is not special by default.
870  closing square bracket is required as a member of the class, it should be the  However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
871  first data character in the class (after an initial circumflex, if present) or  bracket causes a compile-time error. If a closing square bracket is required as
872  escaped with a backslash.  a member of the class, it should be the first data character in the class
873    (after an initial circumflex, if present) or escaped with a backslash.
874  </P>  </P>
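A small sketch of the two ways to include a closing square bracket in a class. It uses Python's `re` module as a stand-in, assuming it follows the same default rules as PCRE for these cases; it does not model the PCRE_JAVASCRIPT_COMPAT behaviour, which has no counterpart in that module.

```python
import re

sample = "a]bc"

# "]" as the first data character is a literal member of the class.
closing_first = re.findall(r"[]ab]", sample)

# The same rule applies after an initial circumflex.
negated = re.findall(r"[^]ab]", sample)

# Escaping with a backslash works at any position in the class.
escaped = re.findall(r"[ab\]]", sample)

# Outside a class, a lone "]" is an ordinary character by default.
lone = re.findall(r"]", sample)

print(closing_first, negated, escaped, lone)
```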
875  <P>  <P>
876  A character class matches a single character in the subject. In UTF-8 mode, the  A character class matches a single character in the subject. In UTF-8 mode, the
877  character may occupy more than one byte. A matched character must be in the set  character may be more than one byte long. A matched character must be in the
878  of characters defined by the class, unless the first character in the class  set of characters defined by the class, unless the first character in the class
879  definition is a circumflex, in which case the subject character must not be in  definition is a circumflex, in which case the subject character must not be in
880  the set defined by the class. If a circumflex is actually required as a member  the set defined by the class. If a circumflex is actually required as a member
881  of the class, ensure it is not the first character, or escape it with a  of the class, ensure it is not the first character, or escape it with a
# Line 881  For example, the character class [aeiou] Line 886  For example, the character class [aeiou]
886  [^aeiou] matches any character that is not a lower case vowel. Note that a  [^aeiou] matches any character that is not a lower case vowel. Note that a
887  circumflex is just a convenient notation for specifying the characters that  circumflex is just a convenient notation for specifying the characters that
888  are in the class by enumerating those that are not. A class that starts with a  are in the class by enumerating those that are not. A class that starts with a
889  circumflex is not an assertion: it still consumes a character from the subject  circumflex is not an assertion; it still consumes a character from the subject
890  string, and therefore it fails if the current pointer is at the end of the  string, and therefore it fails if the current pointer is at the end of the
891  string.  string.
892  </P>  </P>
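The difference between a negated class and an assertion can be seen at the end of the subject. The sketch below uses Python's `re` module, on the assumption that it matches PCRE for these two constructs; the patterns are illustrative and not taken from the PCRE documentation.

```python
import re

negated_class = re.compile(r"a[^b]")   # must consume one more character
lookahead = re.compile(r"a(?!b)")      # only asserts, consumes nothing

# At the end of the string there is no character left to consume,
# so the negated class fails while the assertion succeeds.
class_at_end = negated_class.search("a")
assert_at_end = lookahead.search("a")

# When a non-"b" character follows, both succeed, but only the
# class includes that character in the match.
class_match = negated_class.search("ac").group()
assert_match = lookahead.search("ac").group()

print(class_at_end, assert_at_end, class_match, assert_match)
```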
# Line 897  caseful version would. In UTF-8 mode, PC Line 902  caseful version would. In UTF-8 mode, PC
902  case for characters whose values are less than 128, so caseless matching is  case for characters whose values are less than 128, so caseless matching is
903  always possible. For characters with higher values, the concept of case is  always possible. For characters with higher values, the concept of case is
904  supported if PCRE is compiled with Unicode property support, but not otherwise.  supported if PCRE is compiled with Unicode property support, but not otherwise.
905  If you want to use caseless matching for characters 128 and above, you must  If you want to use caseless matching in UTF-8 mode for characters 128 and above,
906  ensure that PCRE is compiled with Unicode property support as well as with  you must ensure that PCRE is compiled with Unicode property support as well as
907  UTF-8 support.  with UTF-8 support.
908  </P>  </P>
909  <P>  <P>
910  Characters that might indicate line breaks are never treated in any special way  Characters that might indicate line breaks are never treated in any special way
# Line 1127  match exactly the same set of strings. B Line 1132  match exactly the same set of strings. B
1132  from left to right, and options are not reset until the end of the subpattern  from left to right, and options are not reset until the end of the subpattern
1133  is reached, an option setting in one branch does affect subsequent branches, so  is reached, an option setting in one branch does affect subsequent branches, so
1134  the above patterns match "SUNDAY" as well as "Saturday".  the above patterns match "SUNDAY" as well as "Saturday".
1135  </P>  <a name="dupsubpatternnumber"></a></P>
1136  <br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>  <br><a name="SEC13" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br>
1137  <P>  <P>
1138  Perl 5.10 introduced a feature whereby each alternative in a subpattern uses  Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
# Line 1152  stored. Line 1157  stored.
1157    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1158    # 1            2         2  3        2     3     4    # 1            2         2  3        2     3     4
1159  </pre>  </pre>
1160  A backreference or a recursive call to a numbered subpattern always refers to  A backreference to a numbered subpattern uses the most recent value that is set
1161  the first one in the pattern with the given number.  for that number by any subpattern. The following pattern matches "abcabc" or
1162    "defdef":
1163    <pre>
1164      /(?|(abc)|(def))\1/
1165    </pre>
1166    In contrast, a recursive or "subroutine" call to a numbered subpattern always
1167    refers to the first one in the pattern with the given number. The following
1168    pattern matches "abcabc" or "defabc":
1169    <pre>
1170      /(?|(abc)|(def))(?1)/
1171    </pre>
1172    If a
1173    <a href="#conditions">condition test</a>
1174    for a subpattern's having matched refers to a non-unique number, the test is
1175    true if any of the subpatterns of that number have matched.
1176  </P>  </P>
1177  <P>  <P>
1178  An alternative approach to using this "branch reset" feature is to use  An alternative approach to using this "branch reset" feature is to use
# Line 1167  if an expression is modified, the number Line 1186  if an expression is modified, the number
1186  difficulty, PCRE supports the naming of subpatterns. This feature was not  difficulty, PCRE supports the naming of subpatterns. This feature was not
1187  added to Perl until release 5.10. Python had the feature earlier, and PCRE  added to Perl until release 5.10. Python had the feature earlier, and PCRE
1188  introduced it at release 4.0, using the Python syntax. PCRE now supports both  introduced it at release 4.0, using the Python syntax. PCRE now supports both
1189  the Perl and the Python syntax.  the Perl and the Python syntax. Perl allows identically numbered subpatterns to
1190    have different names, but PCRE does not.
1191  </P>  </P>
1192  <P>  <P>
1193  In PCRE, a subpattern can be named in one of three ways: (?&#60;name&#62;...) or  In PCRE, a subpattern can be named in one of three ways: (?&#60;name&#62;...) or
# Line 1188  is also a convenience function for extra Line 1208  is also a convenience function for extra
1208  </P>  </P>
1209  <P>  <P>
1210  By default, a name must be unique within a pattern, but it is possible to relax  By default, a name must be unique within a pattern, but it is possible to relax
1211  this constraint by setting the PCRE_DUPNAMES option at compile time. This can  this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
1212  be useful for patterns where only one instance of the named parentheses can  names are also always permitted for subpatterns with the same number, set up as
1213  match. Suppose you want to match the name of a weekday, either as a 3-letter  described in the previous section.) Duplicate names can be useful for patterns
1214  abbreviation or as the full name, and in both cases you want to extract the  where only one instance of the named parentheses can match. Suppose you want to
1215  abbreviation. This pattern (ignoring the line breaks) does the job:  match the name of a weekday, either as a 3-letter abbreviation or as the full
1216    name, and in both cases you want to extract the abbreviation. This pattern
1217    (ignoring the line breaks) does the job:
1218  <pre>  <pre>
1219    (?&#60;DN&#62;Mon|Fri|Sun)(?:day)?|    (?&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
1220    (?&#60;DN&#62;Tue)(?:sday)?|    (?&#60;DN&#62;Tue)(?:sday)?|
# Line 1207  subpattern, as described in the previous Line 1229  subpattern, as described in the previous
1229  <P>  <P>
1230  The convenience function for extracting the data by name returns the substring  The convenience function for extracting the data by name returns the substring
1231  for the first (and in this example, the only) subpattern of that name that  for the first (and in this example, the only) subpattern of that name that
1232  matched. This saves searching to find which numbered subpattern it was. If you  matched. This saves searching to find which numbered subpattern it was.
1233  make a reference to a non-unique named subpattern from elsewhere in the  </P>
1234  pattern, the one that corresponds to the lowest number is used. For further  <P>
1235  details of the interfaces for handling named subpatterns, see the  If you make a backreference to a non-unique named subpattern from elsewhere in
1236    the pattern, the one that corresponds to the first occurrence of the name is
1237    used. In the absence of duplicate numbers (see the previous section) this is
1238    the one with the lowest number. If you use a named reference in a condition
1239    test (see the
1240    <a href="#conditions">section about conditions</a>
1241    below), either to check whether a subpattern has matched, or to check for
1242    recursion, all subpatterns with the same name are tested. If the condition is
1243    true for any one of them, the overall condition is true. This is the same
1244    behaviour as testing by number. For further details of the interfaces for
1245    handling named subpatterns, see the
1246  <a href="pcreapi.html"><b>pcreapi</b></a>  <a href="pcreapi.html"><b>pcreapi</b></a>
1247  documentation.  documentation.
1248  </P>  </P>
1249  <P>  <P>
1250  <b>Warning:</b> You cannot use different names to distinguish between two  <b>Warning:</b> You cannot use different names to distinguish between two
1251  subpatterns with the same number (see the previous section) because PCRE uses  subpatterns with the same number because PCRE uses only the numbers when
1252  only the numbers when matching.  matching. For this reason, an error is given at compile time if different names
1253    are given to subpatterns with the same number. However, you can give the same
1254    name to subpatterns with the same number, even when PCRE_DUPNAMES is not set.
1255  </P>  </P>
1256  <br><a name="SEC15" href="#TOC1">REPETITION</a><br>  <br><a name="SEC15" href="#TOC1">REPETITION</a><br>
1257  <P>  <P>
# Line 1233  items: Line 1267  items:
1267    a character class    a character class
1268    a back reference (see next section)    a back reference (see next section)
1269    a parenthesized subpattern (unless it is an assertion)    a parenthesized subpattern (unless it is an assertion)
1270      a recursive or "subroutine" call to a subpattern
1271  </pre>  </pre>
1272  The general repetition quantifier specifies a minimum and maximum number of  The general repetition quantifier specifies a minimum and maximum number of
1273  permitted matches, by giving the two numbers in curly brackets (braces),  permitted matches, by giving the two numbers in curly brackets (braces),
# Line 1564  after the reference. Line 1599  after the reference.
1599  <P>  <P>
1600  There may be more than one back reference to the same subpattern. If a  There may be more than one back reference to the same subpattern. If a
1601  subpattern has not actually been used in a particular match, any back  subpattern has not actually been used in a particular match, any back
1602  references to it always fail. For example, the pattern  references to it always fail by default. For example, the pattern
1603  <pre>  <pre>
1604    (a|(bc))\2    (a|(bc))\2
1605  </pre>  </pre>
1606  always fails if it starts to match "a" rather than "bc". Because there may be  always fails if it starts to match "a" rather than "bc". However, if the
1607  many capturing parentheses in a pattern, all digits following the backslash are  PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
1608  taken as part of a potential back reference number. If the pattern continues  unset value matches an empty string.
1609  with a digit character, some delimiter must be used to terminate the back  </P>
1610  reference. If the PCRE_EXTENDED option is set, this can be whitespace.  <P>
1611  Otherwise an empty comment (see  Because there may be many capturing parentheses in a pattern, all digits
1612    following a backslash are taken as part of a potential back reference number.
1613    If the pattern continues with a digit character, some delimiter must be used to
1614    terminate the back reference. If the PCRE_EXTENDED option is set, this can be
1615    whitespace. Otherwise, the \g{ syntax or an empty comment (see
1616  <a href="#comments">"Comments"</a>  <a href="#comments">"Comments"</a>
1617  below) can be used.  below) can be used.
1618  </P>  </P>
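The default behaviour of a back reference to an unset group, and the use of an empty comment as a delimiter, can be sketched as follows. Python's `re` module is used here as an assumption-laden stand-in for PCRE: it shares the default "unset reference fails" rule and the (?#) comment syntax, but it has no equivalent of PCRE_JAVASCRIPT_COMPAT and does not accept the \g{ form inside a pattern.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is set only when the "bc" alternative is taken.
set_case = pattern.search("bcbc")

# When "a" is matched, group 2 is unset, so \2 fails.
unset_case = pattern.search("aa")

# An empty comment ends the reference, so the following "1"
# is a literal digit rather than part of the group number.
delimited = re.compile(r"(a)\1(?#)1")
delimited_case = delimited.fullmatch("aa1")

print(set_case, unset_case, delimited_case)
```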
# Line 1641  lookbehind assertion is needed to achiev Line 1680  lookbehind assertion is needed to achiev
1680  If you want to force a matching failure at some point in a pattern, the most  If you want to force a matching failure at some point in a pattern, the most
1681  convenient way to do it is with (?!) because an empty string always matches, so  convenient way to do it is with (?!) because an empty string always matches, so
1682  an assertion that requires there not to be an empty string must always fail.  an assertion that requires there not to be an empty string must always fail.
1683    The Perl 5.10 backtracking control verb (*FAIL) or (*F) is essentially a
1684    synonym for (?!).
1685  <a name="lookbehind"></a></P>  <a name="lookbehind"></a></P>
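A brief sketch of (?!) as a forced failure, using Python's `re` module on the assumption that an empty negative lookahead behaves as it does in PCRE. The (*FAIL) and (*F) verbs are not available in that module, so only the (?!) form is shown.

```python
import re

# An empty negative lookahead can never succeed, whatever the subject.
always_fails = re.compile(r"(?!)")
results = [always_fails.search(s) for s in ("", "a", "anything")]

# Forcing failure in one branch makes the matcher fall through
# to the next alternative.
branch = re.compile(r"a(?!)|b")
fallthrough = branch.search("ab")

print(results, fallthrough)
```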
1686  <br><b>  <br><b>
1687  Lookbehind assertions  Lookbehind assertions
# Line 1677  branches: Line 1718  branches:
1718  </pre>  </pre>
1719  In some cases, the Perl 5.10 escape sequence \K  In some cases, the Perl 5.10 escape sequence \K
1720  <a href="#resetmatchstart">(see above)</a>  <a href="#resetmatchstart">(see above)</a>
1721  can be used instead of a lookbehind assertion to get round the fixed-length  can be used instead of a lookbehind assertion to get round the fixed-length
1722  restriction.  restriction.
1723  </P>  </P>
1724  <P>  <P>
# Line 1695  different numbers of bytes, are also not Line 1736  different numbers of bytes, are also not
1736  <P>  <P>
1737  <a href="#subpatternsassubroutines">"Subroutine"</a>  <a href="#subpatternsassubroutines">"Subroutine"</a>
1738  calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long  calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
1739  as the subpattern matches a fixed-length string.  as the subpattern matches a fixed-length string.
1740  <a href="#recursion">Recursion,</a>  <a href="#recursion">Recursion,</a>
1741  however, is not supported.  however, is not supported.
1742  </P>  </P>
1743  <P>  <P>
1744  Possessive quantifiers can be used in conjunction with lookbehind assertions to  Possessive quantifiers can be used in conjunction with lookbehind assertions to
1745  specify efficient matching at the end of the subject string. Consider a simple  specify efficient matching of fixed-length strings at the end of subject
1746  pattern such as  strings. Consider a simple pattern such as
1747  <pre>  <pre>
1748    abcd$    abcd$
1749  </pre>  </pre>
# Line 1764  characters that are not "999". Line 1805  characters that are not "999".
1805  <P>  <P>
1806  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
1807  conditionally or to choose between two alternative subpatterns, depending on  conditionally or to choose between two alternative subpatterns, depending on
1808  the result of an assertion, or whether a previous capturing subpattern matched  the result of an assertion, or whether a specific capturing subpattern has
1809  or not. The two possible forms of conditional subpattern are  already been matched. The two possible forms of conditional subpattern are:
1810  <pre>  <pre>
1811    (?(condition)yes-pattern)    (?(condition)yes-pattern)
1812    (?(condition)yes-pattern|no-pattern)    (?(condition)yes-pattern|no-pattern)
# Line 1783  Checking for a used subpattern by number Line 1824  Checking for a used subpattern by number
1824  </b><br>  </b><br>
1825  <P>  <P>
1826  If the text between the parentheses consists of a sequence of digits, the  If the text between the parentheses consists of a sequence of digits, the
1827  condition is true if the capturing subpattern of that number has previously  condition is true if a capturing subpattern of that number has previously
1828  matched. An alternative notation is to precede the digits with a plus or minus  matched. If there is more than one capturing subpattern with the same number
1829  sign. In this case, the subpattern number is relative rather than absolute.  (see the earlier
1830  The most recently opened parentheses can be referenced by (?(-1), the next most  <a href="#recursion">section about duplicate subpattern numbers),</a>
1831  recent by (?(-2), and so on. In looping constructs it can also make sense to  the condition is true if any of them have been set. An alternative notation is
1832  refer to subsequent groups with constructs such as (?(+2).  to precede the digits with a plus or minus sign. In this case, the subpattern
1833    number is relative rather than absolute. The most recently opened parentheses
1834    can be referenced by (?(-1), the next most recent by (?(-2), and so on. In
1835    looping constructs it can also make sense to refer to subsequent groups with
1836    constructs such as (?(+2).
1837  </P>  </P>
1838  <P>  <P>
1839  Consider the following pattern, which contains non-significant white space to  Consider the following pattern, which contains non-significant white space to
# Line 1832  names that consist entirely of digits is Line 1877  names that consist entirely of digits is
1877  Rewriting the above example to use a named subpattern gives this:  Rewriting the above example to use a named subpattern gives this:
1878  <pre>  <pre>
1879    (?&#60;OPEN&#62; \( )?    [^()]+    (?(&#60;OPEN&#62;) \) )    (?&#60;OPEN&#62; \( )?    [^()]+    (?(&#60;OPEN&#62;) \) )
1880    </pre>
1881  </PRE>  If the name used in a condition of this kind is a duplicate, the test is
1882    applied to all subpatterns of the same name, and is true if any one of them has
1883    matched.
1884  </P>  </P>
1885  <br><b>  <br><b>
1886  Checking for pattern recursion  Checking for pattern recursion
# Line 1846  letter R, for example: Line 1893  letter R, for example:
1893  <pre>  <pre>
1894    (?(R3)...) or (?(R&name)...)    (?(R3)...) or (?(R&name)...)
1895  </pre>  </pre>
1896  the condition is true if the most recent recursion is into the subpattern whose  the condition is true if the most recent recursion is into a subpattern whose
1897  number or name is given. This condition does not check the entire recursion  number or name is given. This condition does not check the entire recursion
1898  stack.  stack. If the name used in a condition of this kind is a duplicate, the test is
1899    applied to all subpatterns of the same name, and is true if any one of them is
1900    the most recent recursion.
1901  </P>  </P>
1902  <P>  <P>
1903  At "top level", all these recursion test conditions are false.  At "top level", all these recursion test conditions are false.
1904  <a href="#recursion">Recursive patterns</a>  <a href="#recursion">The syntax for recursive patterns</a>
1905  are described below.  is described below.
1906  </P>  </P>
1907  <br><b>  <br><b>
1908  Defining subpatterns for use by reference only  Defining subpatterns for use by reference only
# Line 1863  If the condition is the string (DEFINE), Line 1912  If the condition is the string (DEFINE),
1912  name DEFINE, the condition is always false. In this case, there may be only one  name DEFINE, the condition is always false. In this case, there may be only one
1913  alternative in the subpattern. It is always skipped if control reaches this  alternative in the subpattern. It is always skipped if control reaches this
1914  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
1915  "subroutines" that can be referenced from elsewhere. (The use of  "subroutines" that can be referenced from elsewhere. (The use of
1916  <a href="#subpatternsassubroutines">"subroutines"</a>  <a href="#subpatternsassubroutines">"subroutines"</a>
1917  is described below.) For example, a pattern to match an IPv4 address could be  is described below.) For example, a pattern to match an IPv4 address could be
1918  written like this (ignore whitespace and line breaks):  written like this (ignore whitespace and line breaks):
# Line 1874  written like this (ignore whitespace and Line 1923  written like this (ignore whitespace and
1923  The first part of the pattern is a DEFINE group inside which another group  The first part of the pattern is a DEFINE group inside which another group
1924  named "byte" is defined. This matches an individual component of an IPv4  named "byte" is defined. This matches an individual component of an IPv4
1925  address (a number less than 256). When matching takes place, this part of the  address (a number less than 256). When matching takes place, this part of the
1926  pattern is skipped because DEFINE acts like a false condition.  pattern is skipped because DEFINE acts like a false condition. The rest of the
1927  </P>  pattern uses references to the named group to match the four dot-separated
1928  <P>  components of an IPv4 address, insisting on a word boundary at each end.
 The rest of the pattern uses references to the named group to match the four  
 dot-separated components of an IPv4 address, insisting on a word boundary at  
 each end.  
1929  </P>  </P>
1930  <br><b>  <br><b>
1931  Assertion conditions  Assertion conditions
# Line 1939  this kind of recursion was subsequently Line 1985  this kind of recursion was subsequently
1985  <P>  <P>
1986  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
1987  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive call of the subpattern of the given number,
1988  provided that it occurs inside that subpattern. (If not, it is a  provided that it occurs inside that subpattern. (If not, it is a
1989  <a href="#subpatternsassubroutines">"subroutine"</a>  <a href="#subpatternsassubroutines">"subroutine"</a>
1990  call, which is described in the next section.) The special item (?R) or (?0) is  call, which is described in the next section.) The special item (?R) or (?0) is
1991  a recursive call of the entire regular expression.  a recursive call of the entire regular expression.
# Line 1948  a recursive call of the entire regular e Line 1994  a recursive call of the entire regular e
1994  This PCRE pattern solves the nested parentheses problem (assume the  This PCRE pattern solves the nested parentheses problem (assume the
1995  PCRE_EXTENDED option is set so that white space is ignored):  PCRE_EXTENDED option is set so that white space is ignored):
1996  <pre>  <pre>
1997    \( ( (?&#62;[^()]+) | (?R) )* \)    \( ( [^()]++ | (?R) )* \)
1998  </pre>  </pre>
1999  First it matches an opening parenthesis. Then it matches any number of  First it matches an opening parenthesis. Then it matches any number of
2000  substrings which can either be a sequence of non-parentheses, or a recursive  substrings which can either be a sequence of non-parentheses, or a recursive
2001  match of the pattern itself (that is, a correctly parenthesized substring).  match of the pattern itself (that is, a correctly parenthesized substring).
2002  Finally there is a closing parenthesis.  Finally there is a closing parenthesis. Note the use of a possessive quantifier
2003    to avoid backtracking into sequences of non-parentheses.
2004  </P>  </P>
2005  <P>  <P>
2006  If this were part of a larger pattern, you would not want to recurse the entire  If this were part of a larger pattern, you would not want to recurse the entire
2007  pattern, so instead you could use this:  pattern, so instead you could use this:
2008  <pre>  <pre>
2009    ( \( ( (?&#62;[^()]+) | (?1) )* \) )    ( \( ( [^()]++ | (?1) )* \) )
2010  </pre>  </pre>
2011  We have put the pattern into parentheses, and caused the recursion to refer to  We have put the pattern into parentheses, and caused the recursion to refer to
2012  them instead of the whole pattern.  them instead of the whole pattern.
2013  </P>  </P>
2014  <P>  <P>
2015  In a larger pattern, keeping track of parenthesis numbers can be tricky. This  In a larger pattern, keeping track of parenthesis numbers can be tricky. This
2016  is made easier by the use of relative references. (A Perl 5.10 feature.)  is made easier by the use of relative references (a Perl 5.10 feature).
2017  Instead of (?1) in the pattern above you can write (?-2) to refer to the second  Instead of (?1) in the pattern above you can write (?-2) to refer to the second
2018  most recently opened parentheses preceding the recursion. In other words, a  most recently opened parentheses preceding the recursion. In other words, a
2019  negative number counts capturing parentheses leftwards from the point at which  negative number counts capturing parentheses leftwards from the point at which
# Line 1984  An alternative approach is to use named Line 2031  An alternative approach is to use named
2031  for this is (?&name); PCRE's earlier syntax (?P&#62;name) is also supported. We  for this is (?&name); PCRE's earlier syntax (?P&#62;name) is also supported. We
2032  could rewrite the above example as follows:  could rewrite the above example as follows:
2033  <pre>  <pre>
2034    (?&#60;pn&#62; \( ( (?&#62;[^()]+) | (?&pn) )* \) )    (?&#60;pn&#62; \( ( [^()]++ | (?&pn) )* \) )
2035  </pre>  </pre>
2036  If there is more than one subpattern with the same name, the earliest one is  If there is more than one subpattern with the same name, the earliest one is
2037  used.  used.
2038  </P>  </P>
2039  <P>  <P>
2040  This particular example pattern that we have been looking at contains nested  This particular example pattern that we have been looking at contains nested
2041  unlimited repeats, and so the use of atomic grouping for matching strings of  unlimited repeats, and so the use of a possessive quantifier for matching
2042  non-parentheses is important when applying the pattern to strings that do not  strings of non-parentheses is important when applying the pattern to strings
2043  match. For example, when this pattern is applied to  that do not match. For example, when this pattern is applied to
2044  <pre>  <pre>
2045    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2046  </pre>  </pre>
2047  it yields "no match" quickly. However, if atomic grouping is not used,  it yields "no match" quickly. However, if a possessive quantifier is not used,
2048  the match runs for a very long time indeed because there are so many different  the match runs for a very long time indeed because there are so many different
2049  ways the + and * repeats can carve up the subject, and all have to be tested  ways the + and * repeats can carve up the subject, and all have to be tested
2050  before failure can be reported.  before failure can be reported.
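To give a feel for "so many different ways", the sketch below counts how many ways a run of n non-parenthesis characters can be divided into consecutive non-empty pieces, which is what a non-possessive + nested inside * is free to do. This is a simplified counting model in Python, not a measurement of PCRE itself; the name count_splits is hypothetical, and the work a real matching engine does differs in detail.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_splits(n):
    """Number of ways to cut a run of n characters into non-empty pieces."""
    if n == 0:
        return 1                          # nothing left: one way (no pieces)
    # the first piece takes k characters, the rest is split recursively
    return sum(count_splits(n - k) for k in range(1, n + 1))
```

For n >= 1 the result is 2**(n-1); for example, count_splits(10) returns 512. The count doubles with each extra character, which is why a long run of "a" characters is so expensive when the match ultimately fails.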
# Line 2015  documentation). If the pattern above is Line 2062  documentation). If the pattern above is
2062  the value for the capturing parentheses is "ef", which is the last value taken  the value for the capturing parentheses is "ef", which is the last value taken
2063  on at the top level. If additional parentheses are added, giving  on at the top level. If additional parentheses are added, giving
2064  <pre>  <pre>
2065    \( ( ( (?&#62;[^()]+) | (?R) )* ) \)    \( ( ( [^()]++ | (?R) )* ) \)
2066       ^                        ^       ^                        ^
2067       ^                        ^       ^                        ^
2068  </pre>  </pre>
# Line 2044  Recursion difference from Perl Line 2091  Recursion difference from Perl
2091  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
2092  treated as an atomic group. That is, once it has matched some of the subject  treated as an atomic group. That is, once it has matched some of the subject
2093  string, it is never re-entered, even if it contains untried alternatives and  string, it is never re-entered, even if it contains untried alternatives and
2094  there is a subsequent matching failure. This can be illustrated by the  there is a subsequent matching failure. This can be illustrated by the
2095  following pattern, which purports to match a palindromic string that contains  following pattern, which purports to match a palindromic string that contains
2096  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
2097  <pre>  <pre>
2098    ^(.|(.)(?1)\2)$    ^(.|(.)(?1)\2)$
2099  </pre>  </pre>
2100  The idea is that it either matches a single character, or two identical  The idea is that it either matches a single character, or two identical
2101  characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE  characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2102  it does not if the subject string is longer than three characters. Consider the  it does not if the subject string is longer than three characters. Consider the
2103  subject string "abcba":  subject string "abcba":
2104  </P>  </P>
2105  <P>  <P>
2106  At the top level, the first character is matched, but as it is not at the end  At the top level, the first character is matched, but as it is not at the end
2107  of the string, the first alternative fails; the second alternative is taken  of the string, the first alternative fails; the second alternative is taken
2108  and the recursion kicks in. The recursive call to subpattern 1 successfully  and the recursion kicks in. The recursive call to subpattern 1 successfully
2109  matches the next character ("b"). (Note that the beginning and end of line  matches the next character ("b"). (Note that the beginning and end of line
# Line 2064  tests are not part of the recursion). Line 2111  tests are not part of the recursion).
2111  </P>  </P>
2112  <P>  <P>
2113  Back at the top level, the next character ("c") is compared with what  Back at the top level, the next character ("c") is compared with what
2114  subpattern 2 matched, which was "a". This fails. Because the recursion is  subpattern 2 matched, which was "a". This fails. Because the recursion is
2115  treated as an atomic group, there are now no backtracking points, and so the  treated as an atomic group, there are now no backtracking points, and so the
2116  entire match fails. (Perl is able, at this point, to re-enter the recursion and  entire match fails. (Perl is able, at this point, to re-enter the recursion and
2117  try the second alternative.) However, if the pattern is written with the  try the second alternative.) However, if the pattern is written with the
# Line 2072  alternatives in the other order, things Line 2119  alternatives in the other order, things
2119  <pre>  <pre>
2120    ^((.)(?1)\2|.)$    ^((.)(?1)\2|.)$
2121  </pre>  </pre>
2122  This time, the recursing alternative is tried first, and continues to recurse  This time, the recursing alternative is tried first, and continues to recurse
2123  until it runs out of characters, at which point the recursion fails. But this  until it runs out of characters, at which point the recursion fails. But this
2124  time we do have another alternative to try at the higher level. That is the big  time we do have another alternative to try at the higher level. That is the big
2125  difference: in the previous case the remaining alternative is at a deeper  difference: in the previous case the remaining alternative is at a deeper
2126  recursion level, which PCRE cannot use.  recursion level, which PCRE cannot use.
2127  </P>  </P>
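The effect of treating a recursive call as an atomic group can be modelled with a small backtracking matcher. The sketch below is a simplified Python model of the two odd-length palindrome patterns above, not PCRE's implementation; the names sub1 and matches and the flags atomic and single_first are hypothetical. With atomic set, only the first match found by the (?1) call is kept, as described for PCRE; with it clear, the call can be re-entered to try further alternatives, as described for Perl.

```python
from itertools import islice

def sub1(s, pos, atomic, single_first):
    """Yield, in the order tried, each end position where subpattern 1 matches."""
    def single():                         # the "." alternative
        if pos < len(s):
            yield pos + 1

    def wrapped():                        # the "(.)(?1)\2" alternative
        if pos < len(s):
            c = s[pos]                    # what group 2 captures at this level
            inner = sub1(s, pos + 1, atomic, single_first)   # the (?1) call
            if atomic:
                inner = islice(inner, 1)  # atomic: keep only the first match
            for end in inner:
                if end < len(s) and s[end] == c:             # the \2 reference
                    yield end + 1

    order = (single, wrapped) if single_first else (wrapped, single)
    for alternative in order:
        yield from alternative()

def matches(s, atomic, single_first):
    """Model ^( ... )$: the top-level group itself may backtrack freely."""
    return any(end == len(s) for end in sub1(s, 0, atomic, single_first))
```

In this model, matches("abcba", atomic=True, single_first=True) is False and matches("abcba", atomic=False, single_first=True) is True, which corresponds to the first pattern failing in PCRE but working in Perl. With the alternatives reversed, matches("abcba", atomic=True, single_first=False) is True.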
2128  <P>  <P>
2129  To change the pattern so that it matches all palindromic strings, not just those  To change the pattern so that it matches all palindromic strings, not just those
2130  with an odd number of characters, it is tempting to change the pattern to this:  with an odd number of characters, it is tempting to change the pattern to this:
2131  <pre>  <pre>
2132    ^((.)(?1)\2|.?)$    ^((.)(?1)\2|.?)$
2133  </pre>  </pre>
2134  Again, this works in Perl, but not in PCRE, and for the same reason. When a  Again, this works in Perl, but not in PCRE, and for the same reason. When a
2135  deeper recursion has matched a single character, it cannot be entered again in  deeper recursion has matched a single character, it cannot be entered again in
2136  order to match an empty string. The solution is to separate the two cases, and  order to match an empty string. The solution is to separate the two cases, and
2137  write out the odd and even cases as alternatives at the higher level:  write out the odd and even cases as alternatives at the higher level:
2138  <pre>  <pre>
2139    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))    ^(?:((.)(?1)\2|)|((.)(?3)\4|.))
2140  </pre>  </pre>
2141  If you want to match typical palindromic phrases, the pattern has to ignore all  If you want to match typical palindromic phrases, the pattern has to ignore all
2142  non-word characters, which can be done like this:  non-word characters, which can be done like this:
2143  <pre>  <pre>
2144    ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$    ^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$
2145  </pre>  </pre>
2146  If run with the PCRE_CASELESS option, this pattern matches phrases such as "A  If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2147  man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note  man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2148  the use of the possessive quantifier *+ to avoid backtracking into sequences of  the use of the possessive quantifier *+ to avoid backtracking into sequences of
2149  non-word characters. Without this, PCRE takes a great deal longer (ten times or  non-word characters. Without this, PCRE takes a great deal longer (ten times or
2150  more) to match typical phrases, and Perl takes so long that you think it has  more) to match typical phrases, and Perl takes so long that you think it has
2151  gone into a loop.  gone into a loop.
2152    </P>
2153    <P>
2154    <b>WARNING</b>: The palindrome-matching patterns above work only if the subject
2155    string does not start with a palindrome that is shorter than the entire string.
2156    For example, although "abcba" is correctly matched, if the subject is "ababa",
2157    PCRE finds the palindrome "aba" at the start, then fails at top level because
2158    the end of the string does not follow. Once again, it cannot jump back into the
2159    recursion to try other alternatives, so the entire match fails.
2160  <a name="subpatternsassubroutines"></a></P>  <a name="subpatternsassubroutines"></a></P>
2161  <br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>  <br><a name="SEC22" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
2162  <P>  <P>
# Line 2212  failing negative assertion, they cause a Line 2267  failing negative assertion, they cause a
2267  <b>pcre_dfa_exec()</b>.  <b>pcre_dfa_exec()</b>.
2268  </P>  </P>
2269  <P>  <P>
2270  If any of these verbs are used in an assertion subpattern, their effect is  If any of these verbs are used in an assertion subpattern, their effect is
2271  confined to that subpattern; it does not extend to the surrounding pattern.  confined to that subpattern; it does not extend to the surrounding pattern.
2272  Note that assertion subpatterns are processed as anchored at the point where  Note that assertion subpatterns are processed as anchored at the point where
2273  they are tested.  they are tested.
2274  </P>  </P>
2275  <P>  <P>
# Line 2234  The following verbs act as soon as they Line 2289  The following verbs act as soon as they
2289  </pre>  </pre>
2290  This verb causes the match to end successfully, skipping the remainder of the  This verb causes the match to end successfully, skipping the remainder of the
2291  pattern. When inside a recursion, only the innermost pattern is ended  pattern. When inside a recursion, only the innermost pattern is ended
2292  immediately. If the (*ACCEPT) is inside capturing parentheses, the data so far  immediately. If (*ACCEPT) is inside capturing parentheses, the data so far is
2293  is captured. (This feature was added to PCRE at release 8.00.) For example:  captured. (This feature was added to PCRE at release 8.00.) For example:
2294  <pre>  <pre>
2295    A((?:A|B(*ACCEPT)|C)D)    A((?:A|B(*ACCEPT)|C)D)
2296  </pre>  </pre>
2297  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
2298  the outer parentheses.  the outer parentheses.
2299  <pre>  <pre>
2300    (*FAIL) or (*F)    (*FAIL) or (*F)
# Line 2267  The verbs differ in exactly what kind of Line 2322  The verbs differ in exactly what kind of
2322  </pre>  </pre>
2323  This verb causes the whole match to fail outright if the rest of the pattern  This verb causes the whole match to fail outright if the rest of the pattern
2324  does not match. Even if the pattern is unanchored, no further attempts to find  does not match. Even if the pattern is unanchored, no further attempts to find
2325  a match by advancing the start point take place. Once (*COMMIT) has been  a match by advancing the starting point take place. Once (*COMMIT) has been
2326  passed, <b>pcre_exec()</b> is committed to finding a match at the current  passed, <b>pcre_exec()</b> is committed to finding a match at the current
2327  starting point, or not at all. For example:  starting point, or not at all. For example:
2328  <pre>  <pre>
# Line 2299  was matched leading up to it cannot be p Line 2354  was matched leading up to it cannot be p
2354  If the subject is "aaaac...", after the first match attempt fails (starting at  If the subject is "aaaac...", after the first match attempt fails (starting at
2355  the first character in the string), the starting point skips on to start the  the first character in the string), the starting point skips on to start the
2356  next attempt at "c". Note that a possessive quantifier does not have the same  next attempt at "c". Note that a possessive quantifier does not have the same
2357  effect in this example; although it would suppress backtracking during the  effect as this example; although it would suppress backtracking during the
2358  first match attempt, the second attempt would start at the second character  first match attempt, the second attempt would start at the second character
2359  instead of skipping on to "c".  instead of skipping on to "c".
2360  <pre>  <pre>
# Line 2319  is used outside of any alternation, it a Line 2374  is used outside of any alternation, it a
2374  </P>  </P>
2375  <br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>  <br><a name="SEC26" href="#TOC1">SEE ALSO</a><br>
2376  <P>  <P>
2377  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3).  <b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3),
2378    <b>pcresyntax</b>(3), <b>pcre</b>(3).
2379  </P>  </P>
2380  <br><a name="SEC27" href="#TOC1">AUTHOR</a><br>  <br><a name="SEC27" href="#TOC1">AUTHOR</a><br>
2381  <P>  <P>
# Line 2332  Cambridge CB2 3QH, England. Line 2388  Cambridge CB2 3QH, England.
2388  </P>  </P>
2389  <br><a name="SEC28" href="#TOC1">REVISION</a><br>  <br><a name="SEC28" href="#TOC1">REVISION</a><br>
2390  <P>  <P>
2391  Last updated: 22 September 2009  Last updated: 04 October 2009
2392  <br>  <br>
2393  Copyright &copy; 1997-2009 University of Cambridge.  Copyright &copy; 1997-2009 University of Cambridge.
2394  <br>  <br>
