description of PCRE's regular expressions is intended as reference material.
.P
The original operation of PCRE was on strings of one-byte characters. However,
there is now also support for UTF-8 character strings. To use this,
PCRE must be built to include UTF-8 support, and you must call
\fBpcre_compile()\fP or \fBpcre_compile2()\fP with the PCRE_UTF8 option. There
is also a special sequence that can be given at the start of a pattern:
  (*ANYCRLF)   any of the three above
  (*ANY)       all Unicode newline sequences
.sp
These override the default and the options given to \fBpcre_compile()\fP or
\fBpcre_compile2()\fP. For example, on a Unix system where LF is the default
newline sequence, the pattern
.sp
later.
.\"
Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
synonymous. The former is a back reference; the latter is a
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
subroutine
  (*BSR_ANYCRLF)   CR, LF, or CRLF only
  (*BSR_UNICODE)   any Unicode newline sequence
.sp
These override the default and the options given to \fBpcre_compile()\fP or
\fBpcre_compile2()\fP, but they can be overridden by options given to
\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP. Note that these special settings,
which are not Perl-compatible, are recognized only at the very start of a
A word boundary is a position in the subject string where the current character
and the previous character do not both match \ew or \eW (i.e. one matches
\ew and the other matches \eW), or the start or end of the string if the
first or last character matches \ew, respectively. Neither PCRE nor Perl has a
separate "start of word" or "end of word" metasequence. However, whatever
follows \eb normally determines which it is. For example, the fragment
\eba matches "a" at the start of a word.
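.P
The way the item after \eb selects a word start or a word end can be sketched
as follows. This is only an illustration: it uses Python's standard re module
rather than PCRE, on the assumption that \eb behaves the same way for plain
ASCII text, and the subject string is invented for the example.

```python
import re

# \b is a zero-width assertion; the item next to it decides whether the
# boundary acts as a "start of word" or an "end of word" test.
subject = "a banana and an apple"

# "\ba": an "a" that begins a word.
starts = [m.start() for m in re.finditer(r"\ba", subject)]

# "a\b": an "a" that ends a word.
ends = [m.start() for m in re.finditer(r"a\b", subject)]

print(starts)
print(ends)
```

The "a" characters inside "banana" that are neither first nor last in the word
appear in neither list.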
.P
The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
.rs
.sp
An opening square bracket introduces a character class, terminated by a closing
square bracket. A closing square bracket on its own is not special by default.
However, if the PCRE_JAVASCRIPT_COMPAT option is set, a lone closing square
bracket causes a compile-time error. If a closing square bracket is required as
a member of the class, it should be the first data character in the class
(after an initial circumflex, if present) or escaped with a backslash.
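.P
The placement rules for a closing square bracket can be tried out with the
sketch below. It uses Python's standard re module as a stand-in, on the
assumption that it follows the same default rules as PCRE here; it has no
counterpart of PCRE_JAVASCRIPT_COMPAT, so that case is not shown. The subject
string is invented for the example.

```python
import re

subject = "x]y^z"

# "]" placed first in the class is an ordinary member of the class.
first = re.findall(r"[]x]", subject)

# "]" escaped with a backslash may appear anywhere in the class.
escaped = re.findall(r"[x\]]", subject)

# After an initial circumflex, a leading "]" is still a member, so this
# negated class matches every character except "]" and "x".
negated = re.findall(r"[^]x]", subject)

# Outside a class, a lone "]" is an ordinary character by default.
lone = re.findall(r"]", subject)

print(first, escaped, negated, lone)
```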
  / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1           2         2  3        2     3     4
.sp
A backreference to a numbered subpattern uses the most recent value that is set
for that number by any subpattern. The following pattern matches "abcabc" or
"defdef":
.sp
  /(?|(abc)|(def))\e1/
.sp
In contrast, a recursive or "subroutine" call to a numbered subpattern always
refers to the first one in the pattern with the given number. The following
pattern matches "abcabc" or "defabc":
.sp
  /(?|(abc)|(def))(?1)/
.P
By default, a name must be unique within a pattern, but it is possible to relax
this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate
names are also always permitted for subpatterns with the same number, set up as
described in the previous section.) Duplicate names can be useful for patterns
where only one instance of the named parentheses can match. Suppose you want to
match the name of a weekday, either as a 3-letter abbreviation or as the full
.P
The convenience function for extracting the data by name returns the substring
for the first (and in this example, the only) subpattern of that name that
matched. This saves searching to find which numbered subpattern it was.
.P
If you make a backreference to a non-unique named subpattern from elsewhere in
the pattern, the one that corresponds to the first occurrence of the name is
.\" </a>
section about conditions
.\"
below), either to check whether a subpattern has matched, or to check for
recursion, all subpatterns with the same name are tested. If the condition is
true for any one of them, the overall condition is true. This is the same
behaviour as testing by number. For further details of the interfaces for
  a character class
  a back reference (see next section)
  a parenthesized subpattern (unless it is an assertion)
  a recursive or "subroutine" call to a subpattern
.sp
The general repetition quantifier specifies a minimum and maximum number of
permitted matches, by giving the two numbers in curly brackets (braces),
.sp
  (a|(bc))\e2
.sp
always fails if it starts to match "a" rather than "bc". However, if the
PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back reference to an
unset value matches an empty string.
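.P
The default behaviour can be observed with a short sketch. It is written
against Python's standard re module, which also makes a back reference to an
unset group fail; it is a stand-in and does not demonstrate the
PCRE_JAVASCRIPT_COMPAT behaviour. The subject strings are invented for the
example.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes part in the match, so \2 has a value to compare against.
via_bc = pattern.match("bcbc")

# The first alternative matches "a", group 2 stays unset, and \2 fails;
# the only other alternative needs "bc", so there is no match at all.
via_a = pattern.match("aa")

print(via_bc.group(0) if via_bc else None)
print(via_a)
```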
.P
Because there may be many capturing parentheses in a pattern, all digits
.\" </a>
(see above)
.\"
can be used instead of a lookbehind assertion to get round the fixed-length
restriction.
.P
The implementation of lookbehind assertions is, for each alternative, to
"Subroutine"
.\"
calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
as the subpattern matches a fixed-length string.
.\" HTML <a href="#recursion">
.\" </a>
Recursion,
.sp
It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns, depending on
the result of an assertion, or whether a specific capturing subpattern has
already been matched. The two possible forms of conditional subpattern are:
.sp
  (?(condition)yes-pattern)
.sp
If the text between the parentheses consists of a sequence of digits, the
condition is true if a capturing subpattern of that number has previously
matched. If there is more than one capturing subpattern with the same number
(see the earlier
.\"
.\" HTML <a href="#recursion">
.\" </a>
.sp
  (?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) )
.sp
If the name used in a condition of this kind is a duplicate, the test is
applied to all subpatterns of the same name, and is true if any one of them has
matched.
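.P
A named condition of this kind can be sketched with Python's standard re
module. Note the assumptions: Python spells the named group (?P<OPEN>...) and
the test (?(OPEN)...), which differs from the PCRE syntax shown above, and
Python does not allow duplicate names, so only the single-name case is
illustrated. The subject strings are invented for the example.

```python
import re

pattern = re.compile(
    r"""
    (?P<OPEN> \( )?   # optional opening parenthesis, remembered as OPEN
    [^()]+            # one or more characters that are not parentheses
    (?(OPEN) \) )     # a closing parenthesis is required only if OPEN matched
    """,
    re.VERBOSE,
)

results = {
    s: pattern.fullmatch(s) is not None
    for s in ["(abc)", "abc", "(abc", "abc)"]
}
print(results)
```

Only the balanced and the bare forms are accepted; a parenthesis on one side
alone is rejected.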
.
.SS "Checking for pattern recursion"
.sp
the condition is true if the most recent recursion is into a subpattern whose
number or name is given. This condition does not check the entire recursion
stack. If the name used in a condition of this kind is a duplicate, the test is
applied to all subpatterns of the same name, and is true if any one of them is
the most recent recursion.
.P
At "top level", all these recursion test conditions are false.
.\" HTML <a href="#recursion">
.\" </a>
The syntax for recursive patterns
name DEFINE, the condition is always false. In this case, there may be only one
alternative in the subpattern. It is always skipped if control reaches this
point in the pattern; the idea of DEFINE is that it can be used to define
"subroutines" that can be referenced from elsewhere. (The use of
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"subroutines"
.P
A special item that consists of (? followed by a number greater than zero and a
closing parenthesis is a recursive call of the subpattern of the given number,
provided that it occurs inside that subpattern. (If not, it is a
.\" HTML <a href="#subpatternsassubroutines">
.\" </a>
"subroutine"
First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a recursive
match of the pattern itself (that is, a correctly parenthesized substring).
Finally there is a closing parenthesis. Note the use of a possessive quantifier
to avoid backtracking into sequences of non-parentheses.
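.P
The three steps just described can be restated as a small hand-written
matcher. This is a sketch in plain Python, not PCRE and not a regular
expression: Python's standard re module has no recursion, so the recursive
step is an ordinary function call. The function name match_parens is
hypothetical and chosen only for this illustration.

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    # Step 1: an opening parenthesis.
    if pos >= len(s) or s[pos] != "(":
        return None
    pos += 1
    # Step 2: any number of items, each a run of non-parentheses or a
    # recursive match of the whole rule.
    while pos < len(s):
        if s[pos] == ")":
            # Step 3: the closing parenthesis.
            return pos + 1
        if s[pos] == "(":
            end = match_parens(s, pos)
            if end is None:
                return None
            pos = end
        else:
            # Consume the whole run and never give any of it back, which
            # mirrors the role of the possessive quantifier.
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    return None  # ran out of input before the closing parenthesis


print(match_parens("(ab(cd)ef)"))
print(match_parens("(a(b)"))
```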
.P
If this were part of a larger pattern, you would not want to recurse the entire
In PCRE (like Python, but unlike Perl), a recursive subpattern call is always
treated as an atomic group. That is, once it has matched some of the subject
string, it is never re-entered, even if it contains untried alternatives and
there is a subsequent matching failure. This can be illustrated by the
following pattern, which purports to match a palindromic string that contains
an odd number of characters (for example, "a", "aba", "abcba", "abcdcba"):
.sp
  ^(.|(.)(?1)\e2)$
.sp
The idea is that it either matches a single character, or two identical
characters surrounding a sub-palindrome. In Perl, this pattern works; in PCRE
it does not if the subject string is longer than three characters. Consider the
subject string "abcba":
.P
At the top level, the first character is matched, but as it is not at the end
of the string, the first alternative fails; the second alternative is taken
and the recursion kicks in. The recursive call to subpattern 1 successfully
matches the next character ("b"). (Note that the beginning and end of line
tests are not part of the recursion.)
.P
Back at the top level, the next character ("c") is compared with what
subpattern 2 matched, which was "a". This fails. Because the recursion is
treated as an atomic group, there are now no backtracking points, and so the
entire match fails. (Perl is able, at this point, to re-enter the recursion and
try the second alternative.) However, if the pattern is written with the
.sp
  ^((.)(?1)\e2|.)$
.sp
This time, the recursing alternative is tried first, and continues to recurse
until it runs out of characters, at which point the recursion fails. But this
time we do have another alternative to try at the higher level. That is the big
difference: in the previous case the remaining alternative is at a deeper
recursion level, which PCRE cannot use.
.P
To change the pattern so that it matches all palindromic strings, not just those
with an odd number of characters, it is tempting to change the pattern to this:
.sp
  ^((.)(?1)\e2|.?)$
.sp
Again, this works in Perl, but not in PCRE, and for the same reason. When a
deeper recursion has matched a single character, it cannot be entered again in
order to match an empty string. The solution is to separate the two cases, and
write out the odd and even cases as alternatives at the higher level:
.sp
  ^(?:((.)(?1)\e2|)|((.)(?3)\e4|.))
.sp
If you want to match typical palindromic phrases, the pattern has to ignore all
non-word characters, which can be done like this:
.sp
  ^\eW*+(?:((.)\eW*+(?1)\eW*+\e2|)|((.)\eW*+(?3)\eW*+\e4|\eW*+.\eW*+))\eW*+$
.sp
If run with the PCRE_CASELESS option, this pattern matches phrases such as "A
man, a plan, a canal: Panama!" and it works well in both PCRE and Perl. Note
the use of the possessive quantifier *+ to avoid backtracking into sequences of
non-word characters. Without this, PCRE takes a great deal longer (ten times or
more) to match typical phrases, and Perl takes so long that you think it has
gone into a loop.
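.P
For comparison, the property that the phrase pattern is meant to test can be
written directly in plain Python. This is a sketch of the intended result
only, not an emulation of PCRE's matching: it assumes ASCII word characters
and simple lower-casing in place of PCRE_CASELESS, and the function name
is_palindromic_phrase is hypothetical.

```python
import re


def is_palindromic_phrase(text):
    """True if the word characters of text read the same in both directions."""
    # Drop every non-word character, as the \W*+ items in the pattern do.
    letters = re.findall(r"\w", text, re.ASCII)
    # Fold case, standing in for the caseless option.
    folded = [c.lower() for c in letters]
    return folded == folded[::-1]


print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
print(is_palindromic_phrase("Not a palindrome"))
```

A direct check like this is handy for cross-checking a recursive pattern
against sample phrases.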
failing negative assertion, they cause an error if encountered by
\fBpcre_dfa_exec()\fP.
.P
If any of these verbs are used in an assertion subpattern, their effect is
confined to that subpattern; it does not extend to the surrounding pattern.
Note that assertion subpatterns are processed as anchored at the point where
they are tested.
.P
The new verbs make use of what was previously invalid syntax: an opening
.sp
  A((?:A|B(*ACCEPT)|C)D)
.sp
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
the outer parentheses.
.sp
   (*FAIL) or (*F)
.SH "SEE ALSO"
.rs
.sp
\fBpcreapi\fP(3), \fBpcrecallout\fP(3), \fBpcrematching\fP(3),
\fBpcresyntax\fP(3), \fBpcre\fP(3).
.
.