24 |
<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a> |
<li><a name="TOC9" href="#SEC9">VERTICAL BAR</a> |
25 |
<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a> |
<li><a name="TOC10" href="#SEC10">INTERNAL OPTION SETTING</a> |
26 |
<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a> |
<li><a name="TOC11" href="#SEC11">SUBPATTERNS</a> |
27 |
<li><a name="TOC12" href="#SEC12">NAMED SUBPATTERNS</a> |
<li><a name="TOC12" href="#SEC12">DUPLICATE SUBPATTERN NUMBERS</a> |
28 |
<li><a name="TOC13" href="#SEC13">REPETITION</a> |
<li><a name="TOC13" href="#SEC13">NAMED SUBPATTERNS</a> |
29 |
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> |
<li><a name="TOC14" href="#SEC14">REPETITION</a> |
30 |
<li><a name="TOC15" href="#SEC15">BACK REFERENCES</a> |
<li><a name="TOC15" href="#SEC15">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a> |
31 |
<li><a name="TOC16" href="#SEC16">ASSERTIONS</a> |
<li><a name="TOC16" href="#SEC16">BACK REFERENCES</a> |
32 |
<li><a name="TOC17" href="#SEC17">CONDITIONAL SUBPATTERNS</a> |
<li><a name="TOC17" href="#SEC17">ASSERTIONS</a> |
33 |
<li><a name="TOC18" href="#SEC18">COMMENTS</a> |
<li><a name="TOC18" href="#SEC18">CONDITIONAL SUBPATTERNS</a> |
34 |
<li><a name="TOC19" href="#SEC19">RECURSIVE PATTERNS</a> |
<li><a name="TOC19" href="#SEC19">COMMENTS</a> |
35 |
<li><a name="TOC20" href="#SEC20">SUBPATTERNS AS SUBROUTINES</a> |
<li><a name="TOC20" href="#SEC20">RECURSIVE PATTERNS</a> |
36 |
<li><a name="TOC21" href="#SEC21">CALLOUTS</a> |
<li><a name="TOC21" href="#SEC21">SUBPATTERNS AS SUBROUTINES</a> |
37 |
<li><a name="TOC22" href="#SEC22">SEE ALSO</a> |
<li><a name="TOC22" href="#SEC22">CALLOUTS</a> |
38 |
<li><a name="TOC23" href="#SEC23">AUTHOR</a> |
<li><a name="TOC23" href="#SEC23">SEE ALSO</a> |
39 |
<li><a name="TOC24" href="#SEC24">REVISION</a> |
<li><a name="TOC24" href="#SEC24">AUTHOR</a> |
40 |
|
<li><a name="TOC25" href="#SEC25">REVISION</a> |
41 |
</ul> |
</ul> |
42 |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br> |
43 |
<P> |
<P> |
271 |
<pre> |
<pre> |
272 |
\d any decimal digit |
\d any decimal digit |
273 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
274 |
|
\h any horizontal whitespace character |
275 |
|
\H any character that is not a horizontal whitespace character |
276 |
\s any whitespace character |
\s any whitespace character |
277 |
\S any character that is not a whitespace character |
\S any character that is not a whitespace character |
278 |
|
\v any vertical whitespace character |
279 |
|
\V any character that is not a vertical whitespace character |
280 |
\w any "word" character |
\w any "word" character |
281 |
\W any "non-word" character |
\W any "non-word" character |
282 |
</pre> |
</pre> |
292 |
<P> |
<P> |
293 |
For compatibility with Perl, \s does not match the VT character (code 11). |
For compatibility with Perl, \s does not match the VT character (code 11). |
294 |
This makes it different from the POSIX "space" class. The \s characters |
This makes it different from the POSIX "space" class. The \s characters |
295 |
are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is |
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is |
296 |
included in a Perl script, \s may match the VT character. In PCRE, it never |
included in a Perl script, \s may match the VT character. In PCRE, it never |
297 |
does.) |
does. |
298 |
|
</P> |
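<P>
The \s set described above can be sketched outside PCRE. The following is a
minimal illustration in Python, not PCRE code: the names PCRE_SPACE and
pcre_s_matches are hypothetical, and Python's own \s is used only as a
contrast, because it does match VT.
</P>

```python
import re

# The five characters the text lists for \s: HT (9), LF (10), FF (12),
# CR (13), and space (32). VT (11) is deliberately left out.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")


def pcre_s_matches(ch):
    """Return True if the single character ch is in the \\s set listed above."""
    return PCRE_SPACE.fullmatch(ch) is not None


if __name__ == "__main__":
    for name, ch in [("HT", "\t"), ("LF", "\n"), ("VT", "\x0b"), ("space", " ")]:
        print(name, pcre_s_matches(ch))
    # Python's built-in \s treats VT as whitespace, unlike the set above.
    print("Python \\s on VT:", re.fullmatch(r"\s", "\x0b") is not None)
```

<P>
An explicit character class is used here so that the result does not depend on
the host language's definition of whitespace.
</P>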
299 |
|
<P> |
300 |
|
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
301 |
|
\w, and always match \D, \S, and \W. This is true even when Unicode |
302 |
|
character property support is available. These sequences retain their original |
303 |
|
meanings from before UTF-8 support was available, mainly for efficiency |
304 |
|
reasons. |
305 |
|
</P> |
306 |
|
<P> |
307 |
|
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to the |
308 |
|
other sequences, these do match certain high-valued codepoints in UTF-8 mode. |
309 |
|
The horizontal space characters are: |
310 |
|
<pre> |
311 |
|
U+0009 Horizontal tab |
312 |
|
U+0020 Space |
313 |
|
U+00A0 Non-break space |
314 |
|
U+1680 Ogham space mark |
315 |
|
U+180E Mongolian vowel separator |
316 |
|
U+2000 En quad |
317 |
|
U+2001 Em quad |
318 |
|
U+2002 En space |
319 |
|
U+2003 Em space |
320 |
|
U+2004 Three-per-em space |
321 |
|
U+2005 Four-per-em space |
322 |
|
U+2006 Six-per-em space |
323 |
|
U+2007 Figure space |
324 |
|
U+2008 Punctuation space |
325 |
|
U+2009 Thin space |
326 |
|
U+200A Hair space |
327 |
|
U+202F Narrow no-break space |
328 |
|
U+205F Medium mathematical space |
329 |
|
U+3000 Ideographic space |
330 |
|
</pre> |
331 |
|
The vertical space characters are: |
332 |
|
<pre> |
333 |
|
U+000A Linefeed |
334 |
|
U+000B Vertical tab |
335 |
|
U+000C Formfeed |
336 |
|
U+000D Carriage return |
337 |
|
U+0085 Next line |
338 |
|
U+2028 Line separator |
339 |
|
U+2029 Paragraph separator |
340 |
|
</PRE> |
341 |
</P> |
</P> |
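<P>
The two lists above can be turned into explicit character classes. The
following is a minimal sketch in Python, whose standard re module has no \h or
\v in this sense (its \v is the vertical tab escape). The names HORIZONTAL,
VERTICAL, char_class, H, NOT_H, V, and NOT_V are hypothetical and exist only
for this illustration.
</P>

```python
import re

# Code points copied from the horizontal and vertical lists above.
HORIZONTAL = (
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))  # U+2000 En quad .. U+200A Hair space
    + [0x202F, 0x205F, 0x3000]
)
VERTICAL = [0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029]


def char_class(code_points, negate=False):
    """Build a regex character class from a list of code points."""
    body = "".join("\\u%04x" % cp for cp in code_points)
    return "[" + ("^" if negate else "") + body + "]"


H = char_class(HORIZONTAL)                 # stands in for \h
NOT_H = char_class(HORIZONTAL, negate=True)  # stands in for \H
V = char_class(VERTICAL)                   # stands in for \v
NOT_V = char_class(VERTICAL, negate=True)    # stands in for \V

if __name__ == "__main__":
    # Tab, space, non-break space, and ideographic space are all horizontal.
    print(bool(re.fullmatch(H + "+", "\t \u00a0\u3000")))
    # A line separator is vertical, not horizontal.
    print(bool(re.fullmatch(V, "\u2028")), bool(re.fullmatch(H, "\u2028")))
```

<P>
Because the classes are built from the listed code points, they include the
high-valued characters that \s does not match in UTF-8 mode.
</P>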
342 |
<P> |
<P> |
343 |
A "word" character is an underscore or any character less than 256 that is a |
A "word" character is an underscore or any character less than 256 that is a |
349 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
350 |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
351 |
or "french" in Windows, some character codes greater than 128 are used for |
or "french" in Windows, some character codes greater than 128 are used for |
352 |
accented letters, and these are matched by \w. |
accented letters, and these are matched by \w. The use of locales with Unicode |
353 |
</P> |
is discouraged. |
|
<P> |
|
|
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
|
|
\w, and always match \D, \S, and \W. This is true even when Unicode |
|
|
character property support is available. The use of locales with Unicode is |
|
|
discouraged. |
|
354 |
</P> |
</P> |
355 |
<br><b> |
<br><b> |
356 |
Newline sequences |
Newline sequences |
357 |
</b><br> |
</b><br> |
358 |
<P> |
<P> |
359 |
Outside a character class, the escape sequence \R matches any Unicode newline |
Outside a character class, the escape sequence \R matches any Unicode newline |
360 |
sequence. This is an extension to Perl. In non-UTF-8 mode \R is equivalent to |
sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is equivalent to |
361 |
the following: |
the following: |
362 |
<pre> |
<pre> |
363 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
1009 |
is reached, an option setting in one branch does affect subsequent branches, so |
is reached, an option setting in one branch does affect subsequent branches, so |
1010 |
the above patterns match "SUNDAY" as well as "Saturday". |
the above patterns match "SUNDAY" as well as "Saturday". |
1011 |
</P> |
</P> |
1012 |
<br><a name="SEC12" href="#TOC1">NAMED SUBPATTERNS</a><br> |
<br><a name="SEC12" href="#TOC1">DUPLICATE SUBPATTERN NUMBERS</a><br> |
1013 |
|
<P> |
1014 |
|
Perl 5.10 introduced a feature whereby each alternative in a subpattern uses |
1015 |
|
the same numbers for its capturing parentheses. Such a subpattern starts with |
1016 |
|
(?| and is itself a non-capturing subpattern. For example, consider this |
1017 |
|
pattern: |
1018 |
|
<pre> |
1019 |
|
(?|(Sat)ur|(Sun))day |
1020 |
|
</pre> |
1021 |
|
Because the two alternatives are inside a (?| group, both sets of capturing |
1022 |
|
parentheses are numbered one. Thus, when the pattern matches, you can look |
1023 |
|
at captured substring number one, whichever alternative matched. This construct |
1024 |
|
is useful when you want to capture part, but not all, of one of a number of |
1025 |
|
alternatives. Inside a (?| group, parentheses are numbered as usual, but the |
1026 |
|
number is reset at the start of each branch. The numbers of any capturing |
1027 |
|
buffers that follow the subpattern start after the highest number used in any |
1028 |
|
branch. The following example is taken from the Perl documentation. |
1029 |
|
The numbers underneath show in which buffer the captured content will be |
1030 |
|
stored. |
1031 |
|
<pre> |
1032 |
|
# before ---------------branch-reset----------- after |
1033 |
|
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
1034 |
|
# 1 2 2 3 2 3 4 |
1035 |
|
</pre> |
1036 |
|
A backreference or a recursive call to a numbered subpattern always refers to |
1037 |
|
the first one in the pattern with the given number. |
1038 |
|
</P> |
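<P>
The problem that (?| solves can be seen by numbering the same alternatives in
the ordinary way. The following is a minimal sketch in Python, whose standard
re module does not support (?|, so it shows only the situation without branch
reset. The names PLAIN and day_prefix are hypothetical.
</P>

```python
import re

# Same alternatives as the example above, but in an ordinary non-capturing
# group: (Sat) is group 1 and (Sun) is group 2.
PLAIN = re.compile(r"(?:(Sat)ur|(Sun))day")


def day_prefix(text):
    """Return the captured prefix, whichever alternative matched."""
    m = PLAIN.fullmatch(text)
    if m is None:
        return None
    # Only one of the two groups takes part in a match, so the caller has
    # to check both. With a (?| group both would be number one.
    return m.group(1) or m.group(2)


if __name__ == "__main__":
    print(PLAIN.groups)            # number of capturing groups in the pattern
    print(day_prefix("Saturday"))
    print(day_prefix("Sunday"))
```

<P>
With the branch reset form, the check of two separate group numbers is not
needed, because captured substring number one is set by either alternative.
</P>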
1039 |
|
<P> |
1040 |
|
An alternative approach to using this "branch reset" feature is to use |
1041 |
|
duplicate named subpatterns, as described in the next section. |
1042 |
|
</P> |
1043 |
|
<br><a name="SEC13" href="#TOC1">NAMED SUBPATTERNS</a><br> |
1044 |
<P> |
<P> |
1045 |
Identifying capturing parentheses by number is simple, but it can be very hard |
Identifying capturing parentheses by number is simple, but it can be very hard |
1046 |
to keep track of the numbers in complicated regular expressions. Furthermore, |
to keep track of the numbers in complicated regular expressions. Furthermore, |
1082 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
1083 |
</pre> |
</pre> |
1084 |
There are five capturing substrings, but only one is ever set after a match. |
There are five capturing substrings, but only one is ever set after a match. |
1085 |
|
(An alternative way of solving this problem is to use a "branch reset" |
1086 |
|
subpattern, as described in the previous section.) |
1087 |
|
</P> |
1088 |
|
<P> |
1089 |
The convenience function for extracting the data by name returns the substring |
The convenience function for extracting the data by name returns the substring |
1090 |
for the first (and in this example, the only) subpattern of that name that |
for the first (and in this example, the only) subpattern of that name that |
1091 |
matched. This saves searching to find which numbered subpattern it was. If you |
matched. This saves searching to find which numbered subpattern it was. If you |
1095 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
1096 |
documentation. |
documentation. |
1097 |
</P> |
</P> |
1098 |
<br><a name="SEC13" href="#TOC1">REPETITION</a><br> |
<br><a name="SEC14" href="#TOC1">REPETITION</a><br> |
1099 |
<P> |
<P> |
1100 |
Repetition is specified by quantifiers, which can follow any of the following |
Repetition is specified by quantifiers, which can follow any of the following |
1101 |
items: |
items: |
1246 |
</pre> |
</pre> |
1247 |
matches "aba" the value of the second captured substring is "b". |
matches "aba" the value of the second captured substring is "b". |
1248 |
<a name="atomicgroup"></a></P> |
<a name="atomicgroup"></a></P> |
1249 |
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> |
<br><a name="SEC15" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br> |
1250 |
<P> |
<P> |
1251 |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
1252 |
repetition, failure of what follows normally causes the repeated item to be |
repetition, failure of what follows normally causes the repeated item to be |
1345 |
</pre> |
</pre> |
1346 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
1347 |
<a name="backreferences"></a></P> |
<a name="backreferences"></a></P> |
1348 |
<br><a name="SEC15" href="#TOC1">BACK REFERENCES</a><br> |
<br><a name="SEC16" href="#TOC1">BACK REFERENCES</a><br> |
1349 |
<P> |
<P> |
1350 |
Outside a character class, a backslash followed by a digit greater than 0 (and |
Outside a character class, a backslash followed by a digit greater than 0 (and |
1351 |
possibly further digits) is a back reference to a capturing subpattern earlier |
possibly further digits) is a back reference to a capturing subpattern earlier |
1458 |
done using alternation, as in the example above, or by a quantifier with a |
done using alternation, as in the example above, or by a quantifier with a |
1459 |
minimum of zero. |
minimum of zero. |
1460 |
<a name="bigassertions"></a></P> |
<a name="bigassertions"></a></P> |
1461 |
<br><a name="SEC16" href="#TOC1">ASSERTIONS</a><br> |
<br><a name="SEC17" href="#TOC1">ASSERTIONS</a><br> |
1462 |
<P> |
<P> |
1463 |
An assertion is a test on the characters following or preceding the current |
An assertion is a test on the characters following or preceding the current |
1464 |
matching point that does not actually consume any characters. The simple |
matching point that does not actually consume any characters. The simple |
1618 |
is another pattern that matches "foo" preceded by three digits and any three |
is another pattern that matches "foo" preceded by three digits and any three |
1619 |
characters that are not "999". |
characters that are not "999". |
1620 |
<a name="conditions"></a></P> |
<a name="conditions"></a></P> |
1621 |
<br><a name="SEC17" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> |
<br><a name="SEC18" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br> |
1622 |
<P> |
<P> |
1623 |
It is possible to cause the matching process to obey a subpattern |
It is possible to cause the matching process to obey a subpattern |
1624 |
conditionally or to choose between two alternative subpatterns, depending on |
conditionally or to choose between two alternative subpatterns, depending on |
1756 |
against the second. This pattern matches strings in one of the two forms |
against the second. This pattern matches strings in one of the two forms |
1757 |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
1758 |
<a name="comments"></a></P> |
<a name="comments"></a></P> |
1759 |
<br><a name="SEC18" href="#TOC1">COMMENTS</a><br> |
<br><a name="SEC19" href="#TOC1">COMMENTS</a><br> |
1760 |
<P> |
<P> |
1761 |
The sequence (?# marks the start of a comment that continues up to the next |
The sequence (?# marks the start of a comment that continues up to the next |
1762 |
closing parenthesis. Nested parentheses are not permitted. The characters |
closing parenthesis. Nested parentheses are not permitted. The characters |
1767 |
character class introduces a comment that continues to immediately after the |
character class introduces a comment that continues to immediately after the |
1768 |
next newline in the pattern. |
next newline in the pattern. |
1769 |
<a name="recursion"></a></P> |
<a name="recursion"></a></P> |
1770 |
<br><a name="SEC19" href="#TOC1">RECURSIVE PATTERNS</a><br> |
<br><a name="SEC20" href="#TOC1">RECURSIVE PATTERNS</a><br> |
1771 |
<P> |
<P> |
1772 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
1773 |
unlimited nested parentheses. Without the use of recursion, the best that can |
unlimited nested parentheses. Without the use of recursion, the best that can |
1897 |
different alternatives for the recursive and non-recursive cases. The (?R) item |
different alternatives for the recursive and non-recursive cases. The (?R) item |
1898 |
is the actual recursive call. |
is the actual recursive call. |
1899 |
<a name="subpatternsassubroutines"></a></P> |
<a name="subpatternsassubroutines"></a></P> |
1900 |
<br><a name="SEC20" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> |
<br><a name="SEC21" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br> |
1901 |
<P> |
<P> |
1902 |
If the syntax for a recursive subpattern reference (either by number or by |
If the syntax for a recursive subpattern reference (either by number or by |
1903 |
name) is used outside the parentheses to which it refers, it operates like a |
name) is used outside the parentheses to which it refers, it operates like a |
1937 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
1938 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
1939 |
</P> |
</P> |
1940 |
<br><a name="SEC21" href="#TOC1">CALLOUTS</a><br> |
<br><a name="SEC22" href="#TOC1">CALLOUTS</a><br> |
1941 |
<P> |
<P> |
1942 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
1943 |
code to be obeyed in the middle of matching a regular expression. This makes it |
code to be obeyed in the middle of matching a regular expression. This makes it |
1972 |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
1973 |
documentation. |
documentation. |
1974 |
</P> |
</P> |
1975 |
<br><a name="SEC22" href="#TOC1">SEE ALSO</a><br> |
<br><a name="SEC23" href="#TOC1">SEE ALSO</a><br> |
1976 |
<P> |
<P> |
1977 |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
<b>pcreapi</b>(3), <b>pcrecallout</b>(3), <b>pcrematching</b>(3), <b>pcre</b>(3). |
1978 |
</P> |
</P> |
1979 |
<br><a name="SEC23" href="#TOC1">AUTHOR</a><br> |
<br><a name="SEC24" href="#TOC1">AUTHOR</a><br> |
1980 |
<P> |
<P> |
1981 |
Philip Hazel |
Philip Hazel |
1982 |
<br> |
<br> |
1985 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
1986 |
<br> |
<br> |
1987 |
</P> |
</P> |
1988 |
<br><a name="SEC24" href="#TOC1">REVISION</a><br> |
<br><a name="SEC25" href="#TOC1">REVISION</a><br> |
1989 |
<P> |
<P> |
1990 |
Last updated: 29 May 2007 |
Last updated: 13 June 2007 |
1991 |
<br> |
<br> |
1992 |
Copyright © 1997-2007 University of Cambridge. |
Copyright © 1997-2007 University of Cambridge. |
1993 |
<br> |
<br> |