Diff of /code/trunk/doc/pcrepattern.3

revision 459 by ph10, Sun Oct 4 09:21:39 2009 UTC revision 461 by ph10, Mon Oct 5 10:59:35 2009 UTC
# Line 21  published by O'Reilly, covers regular ex Line 21  published by O'Reilly, covers regular ex
21  description of PCRE's regular expressions is intended as reference material.  description of PCRE's regular expressions is intended as reference material.
22  .P  .P
23  The original operation of PCRE was on strings of one-byte characters. However,  The original operation of PCRE was on strings of one-byte characters. However,
24  there is now also support for UTF-8 character strings. To use this,  there is now also support for UTF-8 character strings. To use this,
25  PCRE must be built to include UTF-8 support, and you must call  PCRE must be built to include UTF-8 support, and you must call
26  \fBpcre_compile()\fP or \fBpcre_compile2()\fP with the PCRE_UTF8 option. There  \fBpcre_compile()\fP or \fBpcre_compile2()\fP with the PCRE_UTF8 option. There
27  is also a special sequence that can be given at the start of a pattern:  is also a special sequence that can be given at the start of a pattern:
# Line 83  string with one of the following five se Line 83  string with one of the following five se
83    (*ANYCRLF)   any of the three above    (*ANYCRLF)   any of the three above
84    (*ANY)       all Unicode newline sequences    (*ANY)       all Unicode newline sequences
85  .sp  .sp
86  These override the default and the options given to \fBpcre_compile()\fP or  These override the default and the options given to \fBpcre_compile()\fP or
87  \fBpcre_compile2()\fP. For example, on a Unix system where LF is the default  \fBpcre_compile2()\fP. For example, on a Unix system where LF is the default
88  newline sequence, the pattern  newline sequence, the pattern
89  .sp  .sp
# Line 333  syntax for referencing a subpattern as a Line 333  syntax for referencing a subpattern as a
333  later.  later.
334  .\"  .\"
335  Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP  Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
336  synonymous. The former is a back reference; the latter is a  synonymous. The former is a back reference; the latter is a
337  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
338  .\" </a>  .\" </a>
339  subroutine  subroutine
# Line 468  one of the following sequences: Line 468  one of the following sequences:
468    (*BSR_ANYCRLF)   CR, LF, or CRLF only    (*BSR_ANYCRLF)   CR, LF, or CRLF only
469    (*BSR_UNICODE)   any Unicode newline sequence    (*BSR_UNICODE)   any Unicode newline sequence
470  .sp  .sp
471  These override the default and the options given to \fBpcre_compile()\fP or  These override the default and the options given to \fBpcre_compile()\fP or
472  \fBpcre_compile2()\fP, but they can be overridden by options given to  \fBpcre_compile2()\fP, but they can be overridden by options given to
473  \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. Note that these special settings,  \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. Note that these special settings,
474  which are not Perl-compatible, are recognized only at the very start of a  which are not Perl-compatible, are recognized only at the very start of a
# Line 741  different meaning, namely the backspace Line 741  different meaning, namely the backspace
741  A word boundary is a position in the subject string where the current character  A word boundary is a position in the subject string where the current character
742  and the previous character do not both match \ew or \eW (i.e. one matches  and the previous character do not both match \ew or \eW (i.e. one matches
743  \ew and the other matches \eW), or the start or end of the string if the  \ew and the other matches \eW), or the start or end of the string if the
744  first or last character matches \ew, respectively. Neither PCRE nor Perl has a  first or last character matches \ew, respectively. Neither PCRE nor Perl has a
745  separate "start of word" or "end of word" metasequence. However, whatever  separate "start of word" or "end of word" metasequence. However, whatever
746  follows \eb normally determines which it is. For example, the fragment  follows \eb normally determines which it is. For example, the fragment
747  \eba matches "a" at the start of a word.  \eba matches "a" at the start of a word.
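
The effect of \eb followed by a letter can be tried out quickly. The following is a minimal sketch that uses Python's re module as a stand-in for PCRE (an assumption: it is a different engine, though \b has the same meaning for plain ASCII subjects like this one). The helper name word_start_offsets is hypothetical and not part of any PCRE interface.

```python
import re

def word_start_offsets(subject, letter="a"):
    """Return the offsets at which `letter` begins a word.

    The pattern is \\b followed by the letter, i.e. the "\\ba" fragment
    from the text when letter == "a".
    """
    pattern = re.compile(r"\b" + re.escape(letter))
    return [m.start() for m in pattern.finditer(subject)]

# "banana" contains several "a" characters, but none starts a word.
print(word_start_offsets("a banana and an apple"))
```

Only an "a" preceded by a non-word character (or by the start of the string) is reported; what follows \b is what makes it a start-of-word test here.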
748  .P  .P
749  The \eA, \eZ, and \ez assertions differ from the traditional circumflex and  The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
# Line 876  the lookbehind. Line 876  the lookbehind.
876  .rs  .rs
877  .sp  .sp
878  An opening square bracket introduces a character class, terminated by a closing  An opening square bracket introduces a character class, terminated by a closing
879  square bracket. A closing square bracket on its own is not special by default.  square bracket. A closing square bracket on its own is not special by default.
880  However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square  However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
881  bracket causes a compile-time error. If a closing square bracket is required as  bracket causes a compile-time error. If a closing square bracket is required as
882  a member of the class, it should be the first data character in the class  a member of the class, it should be the first data character in the class
883  (after an initial circumflex, if present) or escaped with a backslash.  (after an initial circumflex, if present) or escaped with a backslash.
# Line 1163  stored. Line 1163  stored.
1163    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x    / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1164    # 1            2         2  3        2     3     4    # 1            2         2  3        2     3     4
1165  .sp  .sp
1166  A backreference to a numbered subpattern uses the most recent value that is set  A backreference to a numbered subpattern uses the most recent value that is set
1167  for that number by any subpattern. The following pattern matches "abcabc" or  for that number by any subpattern. The following pattern matches "abcabc" or
1168  "defdef":  "defdef":
1169  .sp  .sp
1170    /(?|(abc)|(def))\1/    /(?|(abc)|(def))\e1/
1171  .sp  .sp
1172  In contrast, a recursive or "subroutine" call to a numbered subpattern always  In contrast, a recursive or "subroutine" call to a numbered subpattern always
1173  refers to the first one in the pattern with the given number. The following  refers to the first one in the pattern with the given number. The following
1174  pattern matches "abcabc" or "defabc":  pattern matches "abcabc" or "defabc":
1175  .sp  .sp
1176    /(?|(abc)|(def))(?1)/    /(?|(abc)|(def))(?1)/
# Line 1225  is also a convenience function for extra Line 1225  is also a convenience function for extra
1225  .P  .P
1226  By default, a name must be unique within a pattern, but it is possible to relax  By default, a name must be unique within a pattern, but it is possible to relax
1227  this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate  this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
1228  names are also always permitted for subpatterns with the same number, set up as  names are also always permitted for subpatterns with the same number, set up as
1229  described in the previous section.) Duplicate names can be useful for patterns  described in the previous section.) Duplicate names can be useful for patterns
1230  where only one instance of the named parentheses can match. Suppose you want to  where only one instance of the named parentheses can match. Suppose you want to
1231  match the name of a weekday, either as a 3-letter abbreviation or as the full  match the name of a weekday, either as a 3-letter abbreviation or as the full
# Line 1244  subpattern, as described in the previous Line 1244  subpattern, as described in the previous
1244  .P  .P
1245  The convenience function for extracting the data by name returns the substring  The convenience function for extracting the data by name returns the substring
1246  for the first (and in this example, the only) subpattern of that name that  for the first (and in this example, the only) subpattern of that name that
1247  matched. This saves searching to find which numbered subpattern it was.  matched. This saves searching to find which numbered subpattern it was.
1248  .P  .P
1249  If you make a backreference to a non-unique named subpattern from elsewhere in  If you make a backreference to a non-unique named subpattern from elsewhere in
1250  the pattern, the one that corresponds to the first occurrence of the name is  the pattern, the one that corresponds to the first occurrence of the name is
# Line 1256  test (see the Line 1256  test (see the
1256  .\" </a>  .\" </a>
1257  section about conditions  section about conditions
1258  .\"  .\"
1259  below), either to check whether a subpattern has matched, or to check for  below), either to check whether a subpattern has matched, or to check for
1260  recursion, all subpatterns with the same name are tested. If the condition is  recursion, all subpatterns with the same name are tested. If the condition is
1261  true for any one of them, the overall condition is true. This is the same  true for any one of them, the overall condition is true. This is the same
1262  behaviour as testing by number. For further details of the interfaces for  behaviour as testing by number. For further details of the interfaces for
# Line 1288  items: Line 1288  items:
1288    a character class    a character class
1289    a back reference (see next section)    a back reference (see next section)
1290    a parenthesized subpattern (unless it is an assertion)    a parenthesized subpattern (unless it is an assertion)
1291    a recursive or "subroutine" call to a subpattern    a recursive or "subroutine" call to a subpattern
1292  .sp  .sp
1293  The general repetition quantifier specifies a minimum and maximum number of  The general repetition quantifier specifies a minimum and maximum number of
1294  permitted matches, by giving the two numbers in curly brackets (braces),  permitted matches, by giving the two numbers in curly brackets (braces),
# Line 1614  references to it always fail by default. Line 1614  references to it always fail by default.
1614  .sp  .sp
1615    (a|(bc))\e2    (a|(bc))\e2
1616  .sp  .sp
1617  always fails if it starts to match "a" rather than "bc". However, if the  always fails if it starts to match "a" rather than "bc". However, if the
1618  PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an  PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
1619  unset value matches an empty string.  unset value matches an empty string.
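
The default behaviour for a back reference to an unset group can be illustrated with a short sketch. It uses Python's re module rather than PCRE (an assumption made for convenience; Python also fails the match in this case, and has no counterpart to PCRE_JAVASCRIPT_COMPAT). The helper name matches is hypothetical.

```python
import re

# Group 2 is set only when the second alternative ("bc") is taken.
# If the first alternative ("a") is taken, \2 refers to an unset group
# and cannot match, so that path fails.
pattern = re.compile(r"(a|(bc))\2")

def matches(subject):
    """Return True if the pattern matches anywhere in `subject`."""
    return pattern.search(subject) is not None

print(matches("bcbc"))  # second alternative taken, \2 repeats "bc"
print(matches("aa"))    # first alternative taken, \2 is unset
```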
1620  .P  .P
1621  Because there may be many capturing parentheses in a pattern, all digits  Because there may be many capturing parentheses in a pattern, all digits
# Line 1737  In some cases, the Perl 5.10 escape sequ Line 1737  In some cases, the Perl 5.10 escape sequ
1737  .\" </a>  .\" </a>
1738  (see above)  (see above)
1739  .\"  .\"
1740  can be used instead of a lookbehind assertion to get round the fixed-length  can be used instead of a lookbehind assertion to get round the fixed-length
1741  restriction.  restriction.
1742  .P  .P
1743  The implementation of lookbehind assertions is, for each alternative, to  The implementation of lookbehind assertions is, for each alternative, to
# Line 1755  different numbers of bytes, are also not Line 1755  different numbers of bytes, are also not
1755  "Subroutine"  "Subroutine"
1756  .\"  .\"
1757  calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long  calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
1758  as the subpattern matches a fixed-length string.  as the subpattern matches a fixed-length string.
1759  .\" HTML <a href="#recursion">  .\" HTML <a href="#recursion">
1760  .\" </a>  .\" </a>
1761  Recursion,  Recursion,
# Line 1828  characters that are not "999". Line 1828  characters that are not "999".
1828  .sp  .sp
1829  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
1830  conditionally or to choose between two alternative subpatterns, depending on  conditionally or to choose between two alternative subpatterns, depending on
1831  the result of an assertion, or whether a specific capturing subpattern has  the result of an assertion, or whether a specific capturing subpattern has
1832  already been matched. The two possible forms of conditional subpattern are:  already been matched. The two possible forms of conditional subpattern are:
1833  .sp  .sp
1834    (?(condition)yes-pattern)    (?(condition)yes-pattern)
# Line 1846  recursion, a pseudo-condition called DEF Line 1846  recursion, a pseudo-condition called DEF
1846  .sp  .sp
1847  If the text between the parentheses consists of a sequence of digits, the  If the text between the parentheses consists of a sequence of digits, the
1848  condition is true if a capturing subpattern of that number has previously  condition is true if a capturing subpattern of that number has previously
1849  matched. If there is more than one capturing subpattern with the same number  matched. If there is more than one capturing subpattern with the same number
1850  (see the earlier  (see the earlier
1851  .\"  .\"
1852  .\" HTML <a href="#recursion">  .\" HTML <a href="#recursion">
1853  .\" </a>  .\" </a>
# Line 1899  Rewriting the above example to use a nam Line 1899  Rewriting the above example to use a nam
1899  .sp  .sp
1900    (?<OPEN> \e( )?    [^()]+    (?(<OPEN>) \e) )    (?<OPEN> \e( )?    [^()]+    (?(<OPEN>) \e) )
1901  .sp  .sp
1902  If the name used in a condition of this kind is a duplicate, the test is  If the name used in a condition of this kind is a duplicate, the test is
1903  applied to all subpatterns of the same name, and is true if any one of them has  applied to all subpatterns of the same name, and is true if any one of them has
1904  matched.  matched.
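
As a rough illustration of the named-condition pattern above, here is a sketch in Python's re module, which is not PCRE (an assumption: Python spells the named group (?P<OPEN>...) and the condition (?(OPEN)...), and the white space of the PCRE_EXTENDED form is removed). The function name balanced_or_bare is hypothetical.

```python
import re

# Optional opening parenthesis, then one or more non-parentheses, then a
# closing parenthesis that is required only if the group OPEN was set.
pattern = re.compile(r"(?P<OPEN>\()?[^()]+(?(OPEN)\))")

def balanced_or_bare(subject):
    """True if `subject` is entirely "text" or "(text)"."""
    return pattern.fullmatch(subject) is not None

for s in ["(abc)", "abc", "(abc", "abc)"]:
    print(s, balanced_or_bare(s))
```

Anchoring the whole subject with fullmatch is a choice made for this sketch; without it, the pattern would simply find the unparenthesized part of "(abc".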
1905  .  .
1906  .SS "Checking for pattern recursion"  .SS "Checking for pattern recursion"
# Line 1915  letter R, for example: Line 1915  letter R, for example:
1915  .sp  .sp
1916  the condition is true if the most recent recursion is into a subpattern whose  the condition is true if the most recent recursion is into a subpattern whose
1917  number or name is given. This condition does not check the entire recursion  number or name is given. This condition does not check the entire recursion
1918  stack. If the name used in a condition of this kind is a duplicate, the test is  stack. If the name used in a condition of this kind is a duplicate, the test is
1919  applied to all subpatterns of the same name, and is true if any one of them is  applied to all subpatterns of the same name, and is true if any one of them is
1920  the most recent recursion.  the most recent recursion.
1921  .P  .P
1922  At "top level", all these recursion test conditions are false.  At "top level", all these recursion test conditions are false.
1923  .\" HTML <a href="#recursion">  .\" HTML <a href="#recursion">
1924  .\" </a>  .\" </a>
1925  The syntax for recursive patterns  The syntax for recursive patterns
# Line 1933  If the condition is the string (DEFINE), Line 1933  If the condition is the string (DEFINE),
1933  name DEFINE, the condition is always false. In this case, there may be only one  name DEFINE, the condition is always false. In this case, there may be only one
1934  alternative in the subpattern. It is always skipped if control reaches this  alternative in the subpattern. It is always skipped if control reaches this
1935  point in the pattern; the idea of DEFINE is that it can be used to define  point in the pattern; the idea of DEFINE is that it can be used to define
1936  "subroutines" that can be referenced from elsewhere. (The use of  "subroutines" that can be referenced from elsewhere. (The use of
1937  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
1938  .\" </a>  .\" </a>
1939  "subroutines"  "subroutines"
# Line 2010  this kind of recursion was subsequently Line 2010  this kind of recursion was subsequently
2010  .P  .P
2011  A special item that consists of (? followed by a number greater than zero and a  A special item that consists of (? followed by a number greater than zero and a
2012  closing parenthesis is a recursive call of the subpattern of the given number,  closing parenthesis is a recursive call of the subpattern of the given number,
2013  provided that it occurs inside that subpattern. (If not, it is a  provided that it occurs inside that subpattern. (If not, it is a
2014  .\" HTML <a href="#subpatternsassubroutines">  .\" HTML <a href="#subpatternsassubroutines">
2015  .\" </a>  .\" </a>
2016  "subroutine"  "subroutine"
# Line 2026  PCRE_EXTENDED option is set so that whit Line 2026  PCRE_EXTENDED option is set so that whit
2026  First it matches an opening parenthesis. Then it matches any number of  First it matches an opening parenthesis. Then it matches any number of
2027  substrings which can either be a sequence of non-parentheses, or a recursive  substrings which can either be a sequence of non-parentheses, or a recursive
2028  match of the pattern itself (that is, a correctly parenthesized substring).  match of the pattern itself (that is, a correctly parenthesized substring).
2029  Finally there is a closing parenthesis. Note the use of a possessive quantifier  Finally there is a closing parenthesis. Note the use of a possessive quantifier
2030  to avoid backtracking into sequences of non-parentheses.  to avoid backtracking into sequences of non-parentheses.
2031  .P  .P
2032  If this were part of a larger pattern, you would not want to recurse the entire  If this were part of a larger pattern, you would not want to recurse the entire
# Line 2117  is the actual recursive call. Line 2117  is the actual recursive call.
2117  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always  In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
2118  treated as an atomic group. That is, once it has matched some of the subject  treated as an atomic group. That is, once it has matched some of the subject
2119  string, it is never re-entered, even if it contains untried alternatives and  string, it is never re-entered, even if it contains untried alternatives and
2120  there is a subsequent matching failure. This can be illustrated by the  there is a subsequent matching failure. This can be illustrated by the
2121  following pattern, which purports to match a palindromic string that contains  following pattern, which purports to match a palindromic string that contains
2122  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):  an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
2123  .sp  .sp
2124    ^(.|(.)(?1)\e2)$    ^(.|(.)(?1)\e2)$
2125  .sp  .sp
2126  The idea is that it either matches a single character, or two identical  The idea is that it either matches a single character, or two identical
2127  characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE  characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
2128  it does not if the pattern is longer than three characters. Consider the  it does not if the pattern is longer than three characters. Consider the
2129  subject string "abcba":  subject string "abcba":
2130  .P  .P
2131  At the top level, the first character is matched, but as it is not at the end  At the top level, the first character is matched, but as it is not at the end
2132  of the string, the first alternative fails; the second alternative is taken  of the string, the first alternative fails; the second alternative is taken
2133  and the recursion kicks in. The recursive call to subpattern 1 successfully  and the recursion kicks in. The recursive call to subpattern 1 successfully
2134  matches the next character ("b"). (Note that the beginning and end of line  matches the next character ("b"). (Note that the beginning and end of line
2135  tests are not part of the recursion).  tests are not part of the recursion).
2136  .P  .P
2137  Back at the top level, the next character ("c") is compared with what  Back at the top level, the next character ("c") is compared with what
2138  subpattern 2 matched, which was "a". This fails. Because the recursion is  subpattern 2 matched, which was "a". This fails. Because the recursion is
2139  treated as an atomic group, there are now no backtracking points, and so the  treated as an atomic group, there are now no backtracking points, and so the
2140  entire match fails. (Perl is able, at this point, to re-enter the recursion and  entire match fails. (Perl is able, at this point, to re-enter the recursion and
2141  try the second alternative.) However, if the pattern is written with the  try the second alternative.) However, if the pattern is written with the
# Line 2143  alternatives in the other order, things Line 2143  alternatives in the other order, things
2143  .sp  .sp
2144    ^((.)(?1)\e2|.)$    ^((.)(?1)\e2|.)$
2145  .sp  .sp
2146  This time, the recursing alternative is tried first, and continues to recurse  This time, the recursing alternative is tried first, and continues to recurse
2147  until it runs out of characters, at which point the recursion fails. But this  until it runs out of characters, at which point the recursion fails. But this
2148  time we do have another alternative to try at the higher level. That is the big  time we do have another alternative to try at the higher level. That is the big
2149  difference: in the previous case the remaining alternative is at a deeper  difference: in the previous case the remaining alternative is at a deeper
2150  recursion level, which PCRE cannot use.  recursion level, which PCRE cannot use.
2151  .P  .P
2152  To change the pattern so that it matches all palindromic strings, not just those  To change the pattern so that it matches all palindromic strings, not just those
2153  with an odd number of characters, it is tempting to change the pattern to this:  with an odd number of characters, it is tempting to change the pattern to this:
2154  .sp  .sp
2155    ^((.)(?1)\e2|.?)$    ^((.)(?1)\e2|.?)$
2156  .sp  .sp
2157  Again, this works in Perl, but not in PCRE, and for the same reason. When a  Again, this works in Perl, but not in PCRE, and for the same reason. When a
2158  deeper recursion has matched a single character, it cannot be entered again in  deeper recursion has matched a single character, it cannot be entered again in
2159  order to match an empty string. The solution is to separate the two cases, and  order to match an empty string. The solution is to separate the two cases, and
2160  write out the odd and even cases as alternatives at the higher level:  write out the odd and even cases as alternatives at the higher level:
2161  .sp  .sp
2162    ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))    ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
2163  .sp  .sp
2164  If you want to match typical palindromic phrases, the pattern has to ignore all  If you want to match typical palindromic phrases, the pattern has to ignore all
2165  non-word characters, which can be done like this:  non-word characters, which can be done like this:
2166  .sp  .sp
2167    ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\4|\eW*+.\eW*+))\eW*+$    ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
2168  .sp  .sp
2169  If run with the PCRE_CASELESS option, this pattern matches phrases such as "A  If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
2170  man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note  man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
2171  the use of the possessive quantifier *+ to avoid backtracking into sequences of  the use of the possessive quantifier *+ to avoid backtracking into sequences of
2172  non-word characters. Without this, PCRE takes a great deal longer (ten times or  non-word characters. Without this, PCRE takes a great deal longer (ten times or
2173  more) to match typical phrases, and Perl takes so long that you think it has  more) to match typical phrases, and Perl takes so long that you think it has
2174  gone into a loop.  gone into a loop.
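
When experimenting with the palindromic-phrase pattern, it can help to have an independent check of which phrases ought to match. The sketch below is such a reference check in plain Python; it does not use regex recursion at all (Python's re module does not support it), and the function name is_palindromic_phrase is hypothetical. Lower-casing stands in for PCRE_CASELESS, and dropping \W characters stands in for the \eW*+ items.

```python
import re

def is_palindromic_phrase(phrase):
    """Reference check: ignore non-word characters and case, then compare
    the remaining characters with their reverse."""
    letters = re.sub(r"\W+", "", phrase).lower()
    return letters == letters[::-1]

print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
print(is_palindromic_phrase("Not a palindrome"))
```

A list of phrases classified this way could serve as test subjects for the recursive pattern in PCRE or Perl.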
# Line 2294  a backtracking algorithm. With the excep Line 2294  a backtracking algorithm. With the excep
2294  failing negative assertion, they cause an error if encountered by  failing negative assertion, they cause an error if encountered by
2295  \fBpcre_dfa_exec()\fP.  \fBpcre_dfa_exec()\fP.
2296  .P  .P
2297  If any of these verbs are used in an assertion subpattern, their effect is  If any of these verbs are used in an assertion subpattern, their effect is
2298  confined to that subpattern; it does not extend to the surrounding pattern.  confined to that subpattern; it does not extend to the surrounding pattern.
2299  Note that assertion subpatterns are processed as anchored at the point where  Note that assertion subpatterns are processed as anchored at the point where
2300  they are tested.  they are tested.
2301  .P  .P
2302  The new verbs make use of what was previously invalid syntax: an opening  The new verbs make use of what was previously invalid syntax: an opening
# Line 2319  captured. (This feature was added to PCR Line 2319  captured. (This feature was added to PCR
2319  .sp  .sp
2320    A((?:A|B(*ACCEPT)|C)D)    A((?:A|B(*ACCEPT)|C)D)
2321  .sp  .sp
2322  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by  This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
2323  the outer parentheses.  the outer parentheses.
2324  .sp  .sp
2325    (*FAIL) or (*F)    (*FAIL) or (*F)
# Line 2400  is used outside of any alternation, it a Line 2400  is used outside of any alternation, it a
2400  .SH "SEE ALSO"  .SH "SEE ALSO"
2401  .rs  .rs
2402  .sp  .sp
2403  \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),  \fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
2404  \fBpcresyntax\fP(3), \fBpcre\fP(3).  \fBpcresyntax\fP(3), \fBpcre\fP(3).
2405  .  .
2406  .  .
