120 |
Last updated: 24 August 2011 |
Last updated: 24 August 2011 |
121 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
122 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
123 |
|
|
124 |
|
|
125 |
PCREBUILD(3) PCREBUILD(3) |
PCREBUILD(3) PCREBUILD(3) |
126 |
|
|
127 |
|
|
484 |
Last updated: 06 September 2011 |
Last updated: 06 September 2011 |
485 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
486 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
487 |
|
|
488 |
|
|
489 |
PCREMATCHING(3) PCREMATCHING(3) |
PCREMATCHING(3) PCREMATCHING(3) |
490 |
|
|
491 |
|
|
633 |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
634 |
|
|
635 |
7. The \C escape sequence, which (in the standard algorithm) matches a |
7. The \C escape sequence, which (in the standard algorithm) matches a |
636 |
single byte, even in UTF-8 mode, is not supported because the alterna- |
single byte, even in UTF-8 mode, is not supported in UTF-8 mode, |
637 |
tive algorithm moves through the subject string one character at a |
because the alternative algorithm moves through the subject string one |
638 |
time, for all active paths through the tree. |
character at a time, for all active paths through the tree. |
639 |
|
|
640 |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
641 |
are not supported. (*FAIL) is supported, and behaves like a failing |
are not supported. (*FAIL) is supported, and behaves like a failing |
685 |
|
|
686 |
REVISION |
REVISION |
687 |
|
|
688 |
Last updated: 17 November 2010 |
Last updated: 19 November 2011 |
689 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
690 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
691 |
|
|
692 |
|
|
693 |
PCREAPI(3) PCREAPI(3) |
PCREAPI(3) PCREAPI(3) |
694 |
|
|
695 |
|
|
1256 |
set (assuming it can find an "a" in the subject), whereas it fails by |
set (assuming it can find an "a" in the subject), whereas it fails by |
1257 |
default, for Perl compatibility. |
default, for Perl compatibility. |
1258 |
|
|
1259 |
|
(3) \U matches an upper case "U" character; by default \U causes a com- |
1260 |
|
pile time error (Perl uses \U to upper case subsequent characters). |
1261 |
|
|
1262 |
|
(4) \u matches a lower case "u" character unless it is followed by four |
1263 |
|
hexadecimal digits, in which case the hexadecimal number defines the |
1264 |
|
code point to match. By default, \u causes a compile time error (Perl |
1265 |
|
uses it to upper case the following character). |
1266 |
|
|
1267 |
|
(5) \x matches a lower case "x" character unless it is followed by two |
1268 |
|
hexadecimal digits, in which case the hexadecimal number defines the |
1269 |
|
code point to match. By default, as in Perl, a hexadecimal number is |
1270 |
|
always expected after \x, but it may have zero, one, or two digits (so, |
1271 |
|
for example, \xz matches a binary zero character followed by z). |
1272 |
|
|
1273 |
PCRE_MULTILINE |
PCRE_MULTILINE |
1274 |
|
|
1275 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
1724 |
compiler could not handle this particular pattern. See the pcrejit doc- |
compiler could not handle this particular pattern. See the pcrejit doc- |
1725 |
umentation for details of what can and cannot be handled. |
umentation for details of what can and cannot be handled. |
1726 |
|
|
1727 |
|
PCRE_INFO_JITSIZE |
1728 |
|
|
1729 |
|
If the pattern was successfully studied with the PCRE_STUDY_JIT_COMPILE |
1730 |
|
option, return the size of the JIT compiled code, otherwise return |
1731 |
|
zero. The fourth argument should point to a size_t variable. |
1732 |
|
|
1733 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
1734 |
|
|
1735 |
Return the value of the rightmost literal byte that must exist in any |
Return the value of the rightmost literal byte that must exist in any |
1838 |
|
|
1839 |
PCRE_INFO_SIZE |
PCRE_INFO_SIZE |
1840 |
|
|
1841 |
Return the size of the compiled pattern, that is, the value that was |
Return the size of the compiled pattern. The fourth argument should |
1842 |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
point to a size_t variable. This value does not include the size of the |
1843 |
which to place the compiled data. The fourth argument should point to a |
pcre structure that is returned by pcre_compile(). The value that is |
1844 |
size_t variable. |
passed as the argument to pcre_malloc() when pcre_compile() is getting |
1845 |
|
memory in which to place the compiled data is the value returned by |
1846 |
|
this option plus the size of the pcre structure. Studying a compiled |
1847 |
|
pattern, with or without JIT, does not alter the value returned by this |
1848 |
|
option. |
1849 |
|
|
1850 |
PCRE_INFO_STUDYSIZE |
PCRE_INFO_STUDYSIZE |
1851 |
|
|
3004 |
|
|
3005 |
REVISION |
REVISION |
3006 |
|
|
3007 |
Last updated: 23 September 2011 |
Last updated: 02 December 2011 |
3008 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
3009 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
3010 |
|
|
3011 |
|
|
3012 |
PCRECALLOUT(3) PCRECALLOUT(3) |
PCRECALLOUT(3) PCRECALLOUT(3) |
3013 |
|
|
3014 |
|
|
3167 |
|
|
3168 |
The mark field is present from version 2 of the pcre_callout structure. |
The mark field is present from version 2 of the pcre_callout structure. |
3169 |
In callouts from pcre_exec() it contains a pointer to the zero-termi- |
In callouts from pcre_exec() it contains a pointer to the zero-termi- |
3170 |
nated name of the most recently passed (*MARK) item in the match, or |
nated name of the most recently passed (*MARK), (*PRUNE), or (*THEN) |
3171 |
NULL if there are no (*MARK)s in the current matching path. In callouts |
item in the match, or NULL if no such items have been passed. Instances |
3172 |
from pcre_dfa_exec() this field always contains NULL. |
of (*PRUNE) or (*THEN) without a name do not obliterate a previous |
3173 |
|
(*MARK). In callouts from pcre_dfa_exec() this field always contains |
3174 |
|
NULL. |
3175 |
|
|
3176 |
|
|
3177 |
RETURN VALUES |
RETURN VALUES |
3199 |
|
|
3200 |
REVISION |
REVISION |
3201 |
|
|
3202 |
Last updated: 26 August 2011 |
Last updated: 30 November 2011 |
3203 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
3204 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
3205 |
|
|
3206 |
|
|
3207 |
PCRECOMPAT(3) PCRECOMPAT(3) |
PCRECOMPAT(3) PCRECOMPAT(3) |
3208 |
|
|
3209 |
|
|
3244 |
its own, matching a non-newline character, is supported.) In fact these |
its own, matching a non-newline character, is supported.) In fact these |
3245 |
are implemented by Perl's general string-handling and are not part of |
are implemented by Perl's general string-handling and are not part of |
3246 |
its pattern matching engine. If any of these are encountered by PCRE, |
its pattern matching engine. If any of these are encountered by PCRE, |
3247 |
an error is generated. |
an error is generated by default. However, if the PCRE_JAVASCRIPT_COM- |
3248 |
|
PAT option is set, \U and \u are interpreted as JavaScript interprets |
3249 |
|
them. |
3250 |
|
|
3251 |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
3252 |
is built with Unicode character property support. The properties that |
is built with Unicode character property support. The properties that |
3373 |
|
|
3374 |
REVISION |
REVISION |
3375 |
|
|
3376 |
Last updated: 09 October 2011 |
Last updated: 14 November 2011 |
3377 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
3378 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
3379 |
|
|
3380 |
|
|
3381 |
PCREPATTERN(3) PCREPATTERN(3) |
PCREPATTERN(3) PCREPATTERN(3) |
3382 |
|
|
3383 |
|
|
3600 |
\t tab (hex 09) |
\t tab (hex 09) |
3601 |
\ddd character with octal code ddd, or back reference |
\ddd character with octal code ddd, or back reference |
3602 |
\xhh character with hex code hh |
\xhh character with hex code hh |
3603 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
3604 |
|
\uhhhh character with hex code hhhh (JavaScript mode only) |
3605 |
|
|
3606 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
3607 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
3612 |
is compiled in EBCDIC mode, all byte values are valid. A lower case |
is compiled in EBCDIC mode, all byte values are valid. A lower case |
3613 |
letter is converted to upper case, and then the 0xc0 bits are flipped.) |
letter is converted to upper case, and then the 0xc0 bits are flipped.) |
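
The ASCII rule for \cx described above can be sketched in C as follows. This is
only an illustration: control_escape_value() is a hypothetical helper, not a
PCRE function, and it assumes that x is an ASCII character (0 to 127). The
EBCDIC variant, which flips the 0xc0 bits, is not covered.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper, not part of PCRE: the character value that \cx
 * stands for when x is an ASCII character. */
static int control_escape_value(int x)
{
    assert(x >= 0 && x <= 127);      /* assumption: ASCII input only */

    if (islower(x))
        x = toupper(x);              /* lower case letter -> upper case */

    return x ^ 0x40;                 /* invert bit 6 (hex 40) */
}
```

With this sketch, "z" (hex 7A) gives hex 1A, "{" (hex 7B) gives hex 3B, and
";" (hex 3B) gives hex 7B.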
3614 |
|
|
3615 |
After \x, from zero to two hexadecimal digits are read (letters can be |
By default, after \x, from zero to two hexadecimal digits are read |
3616 |
in upper or lower case). Any number of hexadecimal digits may appear |
(letters can be in upper or lower case). Any number of hexadecimal dig- |
3617 |
between \x{ and }, but the value of the character code must be less |
its may appear between \x{ and }, but the value of the character code |
3618 |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
must be less than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 |
3619 |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
mode. That is, the maximum value in hexadecimal is 7FFFFFFF. Note that |
3620 |
than the largest Unicode code point, which is 10FFFF. |
this is bigger than the largest Unicode code point, which is 10FFFF. |
3621 |
|
|
3622 |
If characters other than hexadecimal digits appear between \x{ and }, |
If characters other than hexadecimal digits appear between \x{ and }, |
3623 |
or if there is no terminating }, this form of escape is not recognized. |
or if there is no terminating }, this form of escape is not recognized. |
3625 |
escape, with no following digits, giving a character whose value is |
escape, with no following digits, giving a character whose value is |
3626 |
zero. |
zero. |
3627 |
|
|
3628 |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x |
3629 |
|
is as just described only when it is followed by two hexadecimal dig- |
3630 |
|
its. Otherwise, it matches a literal "x" character. In JavaScript |
3631 |
|
mode, support for code points greater than 256 is provided by \u, which |
3632 |
|
must be followed by four hexadecimal digits; otherwise it matches a |
3633 |
|
literal "u" character. |
3634 |
|
|
3635 |
Characters whose value is less than 256 can be defined by either of the |
Characters whose value is less than 256 can be defined by either of the |
3636 |
two syntaxes for \x. There is no difference in the way they are han- |
two syntaxes for \x (or by \u in JavaScript mode). There is no differ- |
3637 |
dled. For example, \xdc is exactly the same as \x{dc}. |
ence in the way they are handled. For example, \xdc is exactly the same |
3638 |
|
as \x{dc} (or \u00dc in JavaScript mode). |
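
The difference between the default and the JavaScript interpretation of \x can
be sketched in C as follows. This is a simplified illustration, not PCRE's own
code: parse_backslash_x() and struct hex_escape are hypothetical names, the
\x{...} form is deliberately left out, and the caller is assumed to pass a
pointer to the text that follows "\x" in a zero-terminated pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical result type: what \x stands for and how much it consumed. */
struct hex_escape {
    int value;      /* character value that the escape represents */
    size_t used;    /* pattern characters consumed after "\x" */
};

/* Hypothetical helper, not part of PCRE. p points just past "\x". */
static struct hex_escape parse_backslash_x(const char *p, int javascript_mode)
{
    struct hex_escape r = { 0, 0 };

    assert(p != NULL);

    /* Read at most two hexadecimal digits, in upper or lower case. */
    while (r.used < 2 && isxdigit((unsigned char)p[r.used])) {
        int c = tolower((unsigned char)p[r.used]);
        r.value = r.value * 16 + (isdigit(c) ? c - '0' : c - 'a' + 10);
        r.used++;
    }

    /* JavaScript mode: unless exactly two digits follow, \x is a literal
     * "x" and nothing after it is consumed. */
    if (javascript_mode && r.used != 2) {
        r.value = 'x';
        r.used = 0;
    }

    return r;       /* default mode with zero digits yields the value zero */
}
```

In default mode the text "z" yields the value zero with nothing consumed, so
\xz is a binary zero followed by "z"; in JavaScript mode the same text yields a
literal "x".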
3639 |
|
|
3640 |
After \0 up to two further octal digits are read. If there are fewer |
After \0 up to two further octal digits are read. If there are fewer |
3641 |
than two digits, just those that are present are used. Thus the |
than two digits, just those that are present are used. Thus the |
3679 |
|
|
3680 |
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
3681 |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
3682 |
class, the sequence \b is interpreted as the backspace character (hex |
class, \b is interpreted as the backspace character (hex 08). |
3683 |
08). The sequences \B, \N, \R, and \X are not special inside a charac- |
|
3684 |
ter class. Like any other unrecognized escape sequences, they are |
\N is not allowed in a character class. \B, \R, and \X are not special |
3685 |
treated as the literal characters "B", "N", "R", and "X" by default, |
inside a character class. Like other unrecognized escape sequences, |
3686 |
but cause an error if the PCRE_EXTRA option is set. Outside a character |
they are treated as the literal characters "B", "R", and "X" by |
3687 |
class, these sequences have different meanings. |
default, but cause an error if the PCRE_EXTRA option is set. Outside a |
3688 |
|
character class, these sequences have different meanings. |
3689 |
|
|
3690 |
|
Unsupported escape sequences |
3691 |
|
|
3692 |
|
In Perl, the sequences \l, \L, \u, and \U are recognized by its string |
3693 |
|
handler and used to modify the case of following characters. By |
3694 |
|
default, PCRE does not support these escape sequences. However, if the |
3695 |
|
PCRE_JAVASCRIPT_COMPAT option is set, \U matches a "U" character, and |
3696 |
|
\u can be used to define a character by code point, as described in the |
3697 |
|
previous section. |
3698 |
|
|
3699 |
Absolute and relative back references |
Absolute and relative back references |
3700 |
|
|
3729 |
|
|
3730 |
There is also the single sequence \N, which matches a non-newline char- |
There is also the single sequence \N, which matches a non-newline char- |
3731 |
acter. This is the same as the "." metacharacter when PCRE_DOTALL is |
acter. This is the same as the "." metacharacter when PCRE_DOTALL is |
3732 |
not set. |
not set. Perl also uses \N to match characters by name; PCRE does not |
3733 |
|
support this. |
3734 |
|
|
3735 |
Each pair of lower and upper case escape sequences partitions the com- |
Each pair of lower and upper case escape sequences partitions the com- |
3736 |
plete set of characters into two disjoint sets. Any given character |
plete set of characters into two disjoint sets. Any given character |
3737 |
matches one, and only one, of each pair. The sequences can appear both |
matches one, and only one, of each pair. The sequences can appear both |
3738 |
inside and outside character classes. They each match one character of |
inside and outside character classes. They each match one character of |
3739 |
the appropriate type. If the current matching point is at the end of |
the appropriate type. If the current matching point is at the end of |
3740 |
the subject string, all of them fail, because there is no character to |
the subject string, all of them fail, because there is no character to |
3741 |
match. |
match. |
3742 |
|
|
3743 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
3744 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
3745 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
3746 |
"use locale;" is included in a Perl script, \s may match the VT charac- |
"use locale;" is included in a Perl script, \s may match the VT charac- |
3747 |
ter. In PCRE, it never does. |
ter. In PCRE, it never does. |
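
As an illustration, the default \s set listed above can be written as a small C
predicate. Both function names are hypothetical and are not part of PCRE. The
second helper compares the predicate with isspace() in the default "C" locale,
which is used here only as a stand-in for the POSIX "space" class.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: the default \s characters, HT (9), LF (10),
 * FF (12), CR (13), and space (32). VT (11) is deliberately absent. */
static int pcre_default_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Hypothetical helper: returns 1 when isspace() and the list above
 * disagree about c. The program is assumed to run in the "C" locale. */
static int differs_from_isspace(int c)
{
    assert(c >= 0 && c <= 255);      /* isspace() needs an unsigned char value */
    return (isspace(c) != 0) != pcre_default_space(c);
}
```

Under these assumptions the two predicates disagree only for VT (code 11).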
3748 |
|
|
3749 |
A "word" character is an underscore or any character that is a letter |
A "word" character is an underscore or any character that is a letter |
3750 |
or digit. By default, the definition of letters and digits is con- |
or digit. By default, the definition of letters and digits is con- |
3751 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
3752 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
3753 |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
3754 |
systems, or "french" in Windows, some character codes greater than 128 |
systems, or "french" in Windows, some character codes greater than 128 |
3755 |
are used for accented letters, and these are then matched by \w. The |
are used for accented letters, and these are then matched by \w. The |
3756 |
use of locales with Unicode is discouraged. |
use of locales with Unicode is discouraged. |
3757 |
|
|
3758 |
By default, in UTF-8 mode, characters with values greater than 128 |
By default, in UTF-8 mode, characters with values greater than 128 |
3759 |
never match \d, \s, or \w, and always match \D, \S, and \W. These |
never match \d, \s, or \w, and always match \D, \S, and \W. These |
3760 |
sequences retain their original meanings from before UTF-8 support was |
sequences retain their original meanings from before UTF-8 support was |
3761 |
available, mainly for efficiency reasons. However, if PCRE is compiled |
available, mainly for efficiency reasons. However, if PCRE is compiled |
3762 |
with Unicode property support, and the PCRE_UCP option is set, the be- |
with Unicode property support, and the PCRE_UCP option is set, the be- |
3763 |
haviour is changed so that Unicode properties are used to determine |
haviour is changed so that Unicode properties are used to determine |
3764 |
character types, as follows: |
character types, as follows: |
3765 |
|
|
3766 |
\d any character that \p{Nd} matches (decimal digit) |
\d any character that \p{Nd} matches (decimal digit) |
3767 |
\s any character that \p{Z} matches, plus HT, LF, FF, CR |
\s any character that \p{Z} matches, plus HT, LF, FF, CR |
3768 |
\w any character that \p{L} or \p{N} matches, plus underscore |
\w any character that \p{L} or \p{N} matches, plus underscore |
3769 |
|
|
3770 |
The upper case escapes match the inverse sets of characters. Note that |
The upper case escapes match the inverse sets of characters. Note that |
3771 |
\d matches only decimal digits, whereas \w matches any Unicode digit, |
\d matches only decimal digits, whereas \w matches any Unicode digit, |
3772 |
as well as any Unicode letter, and underscore. Note also that PCRE_UCP |
as well as any Unicode letter, and underscore. Note also that PCRE_UCP |
3773 |
affects \b and \B, because they are defined in terms of \w and \W. |
affects \b and \B, because they are defined in terms of \w and \W. |
3774 |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
3775 |
|
|
3776 |
The sequences \h, \H, \v, and \V are features that were added to Perl |
The sequences \h, \H, \v, and \V are features that were added to Perl |
3777 |
at release 5.10. In contrast to the other sequences, which match only |
at release 5.10. In contrast to the other sequences, which match only |
3778 |
ASCII characters by default, these always match certain high-valued |
ASCII characters by default, these always match certain high-valued |
3779 |
codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon- |
codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon- |
3780 |
tal space characters are: |
tal space characters are: |
3781 |
|
|
3782 |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
3811 |
|
|
3812 |
Newline sequences |
Newline sequences |
3813 |
|
|
3814 |
Outside a character class, by default, the escape sequence \R matches |
Outside a character class, by default, the escape sequence \R matches |
3815 |
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the |
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the |
3816 |
following: |
following: |
3817 |
|
|
3818 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
3819 |
|
|
3820 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
3821 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
3822 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
3823 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
3824 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
3825 |
is treated as a single unit that cannot be split. |
is treated as a single unit that cannot be split. |
3826 |
|
|
3827 |
In UTF-8 mode, two additional characters whose codepoints are greater |
In UTF-8 mode, two additional characters whose codepoints are greater |
3828 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
3829 |
rator, U+2029). Unicode character property support is not needed for |
rator, U+2029). Unicode character property support is not needed for |
3830 |
these characters to be recognized. |
these characters to be recognized. |
3831 |
|
|
3832 |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
3833 |
the complete set of Unicode line endings) by setting the option |
the complete set of Unicode line endings) by setting the option |
3834 |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
3835 |
(BSR is an abbreviation for "backslash R".) This can be made the default |
(BSR is an abbreviation for "backslash R".) This can be made the default |
3836 |
when PCRE is built; if this is the case, the other behaviour can be |
when PCRE is built; if this is the case, the other behaviour can be |
3837 |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
3838 |
specify these settings by starting a pattern string with one of the |
specify these settings by starting a pattern string with one of the |
3839 |
following sequences: |
following sequences: |
3840 |
|
|
3841 |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
3842 |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
3843 |
|
|
3844 |
These override the default and the options given to pcre_compile() or |
These override the default and the options given to pcre_compile() or |
3845 |
pcre_compile2(), but they can be overridden by options given to |
pcre_compile2(), but they can be overridden by options given to |
3846 |
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which |
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which |
3847 |
are not Perl-compatible, are recognized only at the very start of a |
are not Perl-compatible, are recognized only at the very start of a |
3848 |
pattern, and that they must be in upper case. If more than one of them |
pattern, and that they must be in upper case. If more than one of them |
3849 |
is present, the last one is used. They can be combined with a change of |
is present, the last one is used. They can be combined with a change of |
3850 |
newline convention; for example, a pattern can start with: |
newline convention; for example, a pattern can start with: |
3851 |
|
|
3852 |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
3853 |
|
|
3854 |
They can also be combined with the (*UTF8) or (*UCP) special sequences. |
They can also be combined with the (*UTF8) or (*UCP) special sequences. |
3855 |
Inside a character class, \R is treated as an unrecognized escape |
Inside a character class, \R is treated as an unrecognized escape |
3856 |
sequence, and so matches the letter "R" by default, but causes an error |
sequence, and so matches the letter "R" by default, but causes an error |
3857 |
if PCRE_EXTRA is set. |
if PCRE_EXTRA is set. |
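
The non-UTF-8 behaviour of \R outside a character class can be sketched in C as
follows. This is an illustration only: match_backslash_r() is a hypothetical
helper rather than a PCRE function, and its anycrlf_only argument merely models
the effect of PCRE_BSR_ANYCRLF. The UTF-8 additions LS and PS are not handled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: the number of bytes that \R would match at the
 * start of s (len bytes long), or 0 if there is no newline sequence. */
static size_t match_backslash_r(const unsigned char *s, size_t len,
                                int anycrlf_only)
{
    assert(s != NULL || len == 0);

    if (len == 0)
        return 0;                    /* end of subject: nothing to match */

    /* CR followed by LF is taken as one unit and is never split. */
    if (s[0] == '\r')
        return (len > 1 && s[1] == '\n') ? 2 : 1;

    if (s[0] == '\n')
        return 1;

    if (anycrlf_only)
        return 0;                    /* only CR, LF, or CRLF are allowed */

    /* VT (hex 0B), FF (hex 0C), and NEL (hex 85). */
    return (s[0] == 0x0b || s[0] == 0x0c || s[0] == 0x85) ? 1 : 0;
}
```

Testing for CRLF before a lone CR mirrors the ordering of the alternatives in
the atomic group shown earlier.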
3858 |
|
|
3859 |
Unicode character properties |
Unicode character properties |
3860 |
|
|
3861 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
3862 |
tional escape sequences that match characters with specific properties |
tional escape sequences that match characters with specific properties |
3863 |
are available. When not in UTF-8 mode, these sequences are of course |
are available. When not in UTF-8 mode, these sequences are of course |
3864 |
limited to testing characters whose codepoints are less than 256, but |
limited to testing characters whose codepoints are less than 256, but |
3865 |
they do work in this mode. The extra escape sequences are: |
they do work in this mode. The extra escape sequences are: |
3866 |
|
|
3867 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
3868 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
3869 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
3870 |
|
|
3871 |
The property names represented by xx above are limited to the Unicode |
The property names represented by xx above are limited to the Unicode |
3872 |
script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
3873 |
character (including newline), and some special PCRE properties |
character (including newline), and some special PCRE properties |
3874 |
(described in the next section). Other Perl properties such as "InMu- |
(described in the next section). Other Perl properties such as "InMu- |
3875 |
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} |
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} |
3876 |
does not match any characters, so always causes a match failure. |
does not match any characters, so always causes a match failure. |
3877 |
|
|
3878 |
Sets of Unicode characters are defined as belonging to certain scripts. |
Sets of Unicode characters are defined as belonging to certain scripts. |
3879 |
A character from one of these sets can be matched using a script name. |
A character from one of these sets can be matched using a script name. |
3880 |
For example: |
For example: |
3881 |
|
|
3882 |
\p{Greek} |
\p{Greek} |
3883 |
\P{Han} |
\P{Han} |
3884 |
|
|
3885 |
Those that are not part of an identified script are lumped together as |
Those that are not part of an identified script are lumped together as |
3886 |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
3887 |
|
|
3888 |
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, |
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, |
3889 |
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, |
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, |
3890 |
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp- |
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp- |
3891 |
tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, |
tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, |
3892 |
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe- |
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe- |
3893 |
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, |
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, |
3894 |
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, |
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, |
3895 |
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, |
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, |
3896 |
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, |
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, |
3897 |
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, |
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, |
3898 |
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, |
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, |
3899 |
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, |
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, |
3900 |
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
3901 |
Ugaritic, Vai, Yi. |
Ugaritic, Vai, Yi. |
3902 |
|
|
3903 |
Each character has exactly one Unicode general category property, spec- |
Each character has exactly one Unicode general category property, spec- |
3904 |
ified by a two-letter abbreviation. For compatibility with Perl, nega- |
ified by a two-letter abbreviation. For compatibility with Perl, nega- |
3905 |
tion can be specified by including a circumflex between the opening |
tion can be specified by including a circumflex between the opening |
3906 |
brace and the property name. For example, \p{^Lu} is the same as |
brace and the property name. For example, \p{^Lu} is the same as |
3907 |
\P{Lu}. |
\P{Lu}. |
3908 |
|
|
3909 |
If only one letter is specified with \p or \P, it includes all the gen- |
If only one letter is specified with \p or \P, it includes all the gen- |
3910 |
eral category properties that start with that letter. In this case, in |
eral category properties that start with that letter. In this case, in |
3911 |
the absence of negation, the curly brackets in the escape sequence are |
the absence of negation, the curly brackets in the escape sequence are |
3912 |
optional; these two examples have the same effect: |
optional; these two examples have the same effect: |
3913 |
|
|
3914 |
\p{L} |
\p{L} |
3960 |
Zp Paragraph separator |
Zp Paragraph separator |
3961 |
Zs Space separator |
Zs Space separator |
3962 |
|
|
3963 |
The special property L& is also supported: it matches a character that |
The special property L& is also supported: it matches a character that |
3964 |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
3965 |
classified as a modifier or "other". |
classified as a modifier or "other". |
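
The general category rules above (two-letter names, single-letter groups, the
special L& property, and circumflex negation) can be sketched in C as follows.
general_category_matches() is a hypothetical helper, not a PCRE function. It
assumes the caller already knows the character's two-letter category, and it
does not handle script names or "Any". Allowing a circumflex in front of L& is
a simplification of this sketch, not a statement about PCRE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper. category is a character's two-letter general
 * category, such as "Lu". spec is the text between the braces of
 * \p{...}, such as "Lu", "L", "L&", or "^Lu". Returns 1 on a match. */
static int general_category_matches(const char *category, const char *spec)
{
    int negated = 0;
    int hit;

    assert(category != NULL && spec != NULL && strlen(category) == 2);

    if (spec[0] == '^') {            /* \p{^Lu} is the same as \P{Lu} */
        negated = 1;
        spec++;
    }

    if (strcmp(spec, "L&") == 0)     /* Lu, Ll, or Lt */
        hit = strcmp(category, "Lu") == 0 ||
              strcmp(category, "Ll") == 0 ||
              strcmp(category, "Lt") == 0;
    else if (strlen(spec) == 1)      /* one letter covers the whole group */
        hit = category[0] == spec[0];
    else                             /* exact two-letter category */
        hit = strcmp(category, spec) == 0;

    return negated ? !hit : hit;
}
```

For example, a character in category Lm matches the spec "L" but not "L&".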
3966 |
|
|
3967 |
The Cs (Surrogate) property applies only to characters in the range |
The Cs (Surrogate) property applies only to characters in the range |
3968 |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
3969 |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
3970 |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
3971 |
the pcreapi page). Perl does not support the Cs property. |
the pcreapi page). Perl does not support the Cs property. |
3972 |
|
|
3973 |
The long synonyms for property names that Perl supports (such as |
The long synonyms for property names that Perl supports (such as |
3974 |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
3975 |
any of these properties with "Is". |
any of these properties with "Is". |
3976 |
|
|
3977 |
No character that is in the Unicode table has the Cn (unassigned) prop- |
No character that is in the Unicode table has the Cn (unassigned) prop- |
3978 |
erty. Instead, this property is assumed for any code point that is not |
erty. Instead, this property is assumed for any code point that is not |
3979 |
in the Unicode table. |
in the Unicode table. |
3980 |
|
|
3981 |
Specifying caseless matching does not affect these escape sequences. |
Specifying caseless matching does not affect these escape sequences. |
3982 |
For example, \p{Lu} always matches only upper case letters. |
For example, \p{Lu} always matches only upper case letters. |
3983 |
|
|
3984 |
The \X escape matches any number of Unicode characters that form an |
The \X escape matches any number of Unicode characters that form an |
3985 |
extended Unicode sequence. \X is equivalent to |
extended Unicode sequence. \X is equivalent to |
3986 |
|
|
3987 |
(?>\PM\pM*) |
(?>\PM\pM*) |
3988 |
|
|
3989 |
That is, it matches a character without the "mark" property, followed |
That is, it matches a character without the "mark" property, followed |
3990 |
by zero or more characters with the "mark" property, and treats the |
by zero or more characters with the "mark" property, and treats the |
3991 |
sequence as an atomic group (see below). Characters with the "mark" |
sequence as an atomic group (see below). Characters with the "mark" |
3992 |
property are typically accents that affect the preceding character. |
property are typically accents that affect the preceding character. |
3993 |
None of them have code points less than 256, so in non-UTF-8 mode \X |
None of them have code points less than 256, so in non-UTF-8 mode \X |
3994 |
matches any one character. |
matches any one character. |
3995 |
|
|
3996 |
Note that recent versions of Perl have changed \X to match what Unicode |
Note that recent versions of Perl have changed \X to match what Unicode |
3997 |
calls an "extended grapheme cluster", which has a more complicated def- |
calls an "extended grapheme cluster", which has a more complicated def- |
3998 |
inition. |
inition. |
3999 |
|
|
4000 |
Matching characters by Unicode property is not fast, because PCRE has |
Matching characters by Unicode property is not fast, because PCRE has |
4001 |
to search a structure that contains data for over fifteen thousand |
to search a structure that contains data for over fifteen thousand |
4002 |
characters. That is why the traditional escape sequences such as \d and |
characters. That is why the traditional escape sequences such as \d and |
4003 |
\w do not use Unicode properties in PCRE by default, though you can |
\w do not use Unicode properties in PCRE by default, though you can |
4004 |
make them do so by setting the PCRE_UCP option for pcre_compile() or by |
make them do so by setting the PCRE_UCP option for pcre_compile() or by |
4005 |
starting the pattern with (*UCP). |
starting the pattern with (*UCP). |
4006 |
|
|
4007 |
PCRE's additional properties |
PCRE's additional properties |
4008 |
|
|
4009 |
As well as the standard Unicode properties described in the previous |
As well as the standard Unicode properties described in the previous |
4010 |
section, PCRE supports four more that make it possible to convert tra- |
section, PCRE supports four more that make it possible to convert tra- |
4011 |
ditional escape sequences such as \w and \s and POSIX character classes |
ditional escape sequences such as \w and \s and POSIX character classes |
4012 |
to use Unicode properties. PCRE uses these non-standard, non-Perl prop- |
to use Unicode properties. PCRE uses these non-standard, non-Perl prop- |
4013 |
erties internally when PCRE_UCP is set. They are: |
erties internally when PCRE_UCP is set. They are: |
4017 |
Xsp Any Perl space character |
Xsp Any Perl space character |
4018 |
Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
4019 |
|
|
4020 |
Xan matches characters that have either the L (letter) or the N (num- |
Xan matches characters that have either the L (letter) or the N (num- |
4021 |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
4022 |
formfeed, or carriage return, and any other character that has the Z |
formfeed, or carriage return, and any other character that has the Z |
4023 |
(separator) property. Xsp is the same as Xps, except that vertical tab |
(separator) property. Xsp is the same as Xps, except that vertical tab |
4024 |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
4025 |
|
|
4026 |
Resetting the match start |
Resetting the match start |
4027 |
|
|
4028 |
The escape sequence \K causes any previously matched characters not to |
The escape sequence \K causes any previously matched characters not to |
4029 |
be included in the final matched sequence. For example, the pattern: |
be included in the final matched sequence. For example, the pattern: |
4030 |
|
|
4031 |
foo\Kbar |
foo\Kbar |
4032 |
|
|
4033 |
matches "foobar", but reports that it has matched "bar". This feature |
matches "foobar", but reports that it has matched "bar". This feature |
4034 |
is similar to a lookbehind assertion (described below). However, in |
is similar to a lookbehind assertion (described below). However, in |
4035 |
this case, the part of the subject before the real match does not have |
this case, the part of the subject before the real match does not have |
4036 |
to be of fixed length, as lookbehind assertions do. The use of \K does |
to be of fixed length, as lookbehind assertions do. The use of \K does |
4037 |
not interfere with the setting of captured substrings. For example, |
not interfere with the setting of captured substrings. For example, |
4038 |
when the pattern |
when the pattern |
4039 |
|
|
4040 |
(foo)\Kbar |
(foo)\Kbar |
4041 |
|
|
4042 |
matches "foobar", the first substring is still set to "foo". |
matches "foobar", the first substring is still set to "foo". |
4043 |
|
|
4044 |
Perl documents that the use of \K within assertions is "not well |
Perl documents that the use of \K within assertions is "not well |
4045 |
defined". In PCRE, \K is acted upon when it occurs inside positive |
defined". In PCRE, \K is acted upon when it occurs inside positive |
4046 |
assertions, but is ignored in negative assertions. |
assertions, but is ignored in negative assertions. |
4047 |
|
|
4048 |
Simple assertions |
Simple assertions |
4049 |
|
|
4050 |
The final use of backslash is for certain simple assertions. An asser- |
The final use of backslash is for certain simple assertions. An asser- |
4051 |
tion specifies a condition that has to be met at a particular point in |
tion specifies a condition that has to be met at a particular point in |
4052 |
a match, without consuming any characters from the subject string. The |
a match, without consuming any characters from the subject string. The |
4053 |
use of subpatterns for more complicated assertions is described below. |
use of subpatterns for more complicated assertions is described below. |
4054 |
The backslashed assertions are: |
The backslashed assertions are: |
4055 |
|
|
4056 |
\b matches at a word boundary |
\b matches at a word boundary |
4061 |
\z matches only at the end of the subject |
\z matches only at the end of the subject |
4062 |
\G matches at the first matching position in the subject |
\G matches at the first matching position in the subject |
4063 |
|
|
4064 |
Inside a character class, \b has a different meaning; it matches the |
Inside a character class, \b has a different meaning; it matches the |
4065 |
backspace character. If any other of these assertions appears in a |
backspace character. If any other of these assertions appears in a |
4066 |
character class, by default it matches the corresponding literal char- |
character class, by default it matches the corresponding literal char- |
4067 |
acter (for example, \B matches the letter B). However, if the |
acter (for example, \B matches the letter B). However, if the |
4068 |
PCRE_EXTRA option is set, an "invalid escape sequence" error is gener- |
PCRE_EXTRA option is set, an "invalid escape sequence" error is gener- |
4069 |
ated instead. |
ated instead. |
4070 |
|
|
4071 |
A word boundary is a position in the subject string where the current |
A word boundary is a position in the subject string where the current |
4072 |
character and the previous character do not both match \w or \W (i.e. |
character and the previous character do not both match \w or \W (i.e. |
4073 |
one matches \w and the other matches \W), or the start or end of the |
one matches \w and the other matches \W), or the start or end of the |
4074 |
string if the first or last character matches \w, respectively. In |
string if the first or last character matches \w, respectively. In |
4075 |
UTF-8 mode, the meanings of \w and \W can be changed by setting the |
UTF-8 mode, the meanings of \w and \W can be changed by setting the |
4076 |
PCRE_UCP option. When this is done, it also affects \b and \B. Neither |
PCRE_UCP option. When this is done, it also affects \b and \B. Neither |
4077 |
PCRE nor Perl has a separate "start of word" or "end of word" metase- |
PCRE nor Perl has a separate "start of word" or "end of word" metase- |
4078 |
quence. However, whatever follows \b normally determines which it is. |
quence. However, whatever follows \b normally determines which it is. |
4079 |
For example, the fragment \ba matches "a" at the start of a word. |
For example, the fragment \ba matches "a" at the start of a word. |
4080 |
|
|
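The start-of-word and end-of-word idiom described above can be sketched as follows. This is an illustration only: it uses Python's re module rather than PCRE (an assumption made so the example is self-contained), and the sample text is invented. For ASCII subjects like this one, \b behaves the same way in both engines.

```python
import re

# Hypothetical sample text, chosen only for illustration.
text = "an apple, a banana"

# \b followed by "a": an "a" that begins a word.
starts = [m.start() for m in re.finditer(r"\ba", text)]

# "a" followed by \b: an "a" that ends a word.
ends = [m.start() for m in re.finditer(r"a\b", text)]

print(starts)  # [0, 3, 10]
print(ends)    # [10, 17]
```

The "a" at offset 10 is a complete word, so it appears in both lists.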
4081 |
The \A, \Z, and \z assertions differ from the traditional circumflex |
The \A, \Z, and \z assertions differ from the traditional circumflex |
4082 |
and dollar (described in the next section) in that they only ever match |
and dollar (described in the next section) in that they only ever match |
4083 |
at the very start and end of the subject string, whatever options are |
at the very start and end of the subject string, whatever options are |
4084 |
set. Thus, they are independent of multiline mode. These three asser- |
set. Thus, they are independent of multiline mode. These three asser- |
4085 |
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
4086 |
affect only the behaviour of the circumflex and dollar metacharacters. |
affect only the behaviour of the circumflex and dollar metacharacters. |
4087 |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
4088 |
cating that matching is to start at a point other than the beginning of |
cating that matching is to start at a point other than the beginning of |
4089 |
the subject, \A can never match. The difference between \Z and \z is |
the subject, \A can never match. The difference between \Z and \z is |
4090 |
that \Z matches before a newline at the end of the string as well as at |
that \Z matches before a newline at the end of the string as well as at |
4091 |
the very end, whereas \z matches only at the end. |
the very end, whereas \z matches only at the end. |
4092 |
|
|
4093 |
The \G assertion is true only when the current matching position is at |
The \G assertion is true only when the current matching position is at |
4094 |
the start point of the match, as specified by the startoffset argument |
the start point of the match, as specified by the startoffset argument |
4095 |
of pcre_exec(). It differs from \A when the value of startoffset is |
of pcre_exec(). It differs from \A when the value of startoffset is |
4096 |
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
4097 |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
4098 |
mentation where \G can be useful. |
mentation where \G can be useful. |
4099 |
|
|
4100 |
Note, however, that PCRE's interpretation of \G, as the start of the |
Note, however, that PCRE's interpretation of \G, as the start of the |
4101 |
current match, is subtly different from Perl's, which defines it as the |
current match, is subtly different from Perl's, which defines it as the |
4102 |
end of the previous match. In Perl, these can be different when the |
end of the previous match. In Perl, these can be different when the |
4103 |
previously matched string was empty. Because PCRE does just one match |
previously matched string was empty. Because PCRE does just one match |
4104 |
at a time, it cannot reproduce this behaviour. |
at a time, it cannot reproduce this behaviour. |
4105 |
|
|
4106 |
If all the alternatives of a pattern begin with \G, the expression is |
If all the alternatives of a pattern begin with \G, the expression is |
4107 |
anchored to the starting match position, and the "anchored" flag is set |
anchored to the starting match position, and the "anchored" flag is set |
4108 |
in the compiled regular expression. |
in the compiled regular expression. |
4109 |
|
|
4111 |
CIRCUMFLEX AND DOLLAR |
CIRCUMFLEX AND DOLLAR |
4112 |
|
|
4113 |
Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
4114 |
character is an assertion that is true only if the current matching |
character is an assertion that is true only if the current matching |
4115 |
point is at the start of the subject string. If the startoffset argu- |
point is at the start of the subject string. If the startoffset argu- |
4116 |
ment of pcre_exec() is non-zero, circumflex can never match if the |
ment of pcre_exec() is non-zero, circumflex can never match if the |
4117 |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
4118 |
has an entirely different meaning (see below). |
has an entirely different meaning (see below). |
4119 |
|
|
4120 |
Circumflex need not be the first character of the pattern if a number |
Circumflex need not be the first character of the pattern if a number |
4121 |
of alternatives are involved, but it should be the first thing in each |
of alternatives are involved, but it should be the first thing in each |
4122 |
alternative in which it appears if the pattern is ever to match that |
alternative in which it appears if the pattern is ever to match that |
4123 |
branch. If all possible alternatives start with a circumflex, that is, |
branch. If all possible alternatives start with a circumflex, that is, |
4124 |
if the pattern is constrained to match only at the start of the sub- |
if the pattern is constrained to match only at the start of the sub- |
4125 |
ject, it is said to be an "anchored" pattern. (There are also other |
ject, it is said to be an "anchored" pattern. (There are also other |
4126 |
constructs that can cause a pattern to be anchored.) |
constructs that can cause a pattern to be anchored.) |
4127 |
|
|
4128 |
A dollar character is an assertion that is true only if the current |
A dollar character is an assertion that is true only if the current |
4129 |
matching point is at the end of the subject string, or immediately |
matching point is at the end of the subject string, or immediately |
4130 |
before a newline at the end of the string (by default). Dollar need not |
before a newline at the end of the string (by default). Dollar need not |
4131 |
be the last character of the pattern if a number of alternatives are |
be the last character of the pattern if a number of alternatives are |
4132 |
involved, but it should be the last item in any branch in which it |
involved, but it should be the last item in any branch in which it |
4133 |
appears. Dollar has no special meaning in a character class. |
appears. Dollar has no special meaning in a character class. |
4134 |
|
|
4135 |
The meaning of dollar can be changed so that it matches only at the |
The meaning of dollar can be changed so that it matches only at the |
4136 |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
4137 |
compile time. This does not affect the \Z assertion. |
compile time. This does not affect the \Z assertion. |
4138 |
|
|
4139 |
The meanings of the circumflex and dollar characters are changed if the |
The meanings of the circumflex and dollar characters are changed if the |
4140 |
PCRE_MULTILINE option is set. When this is the case, a circumflex |
PCRE_MULTILINE option is set. When this is the case, a circumflex |
4141 |
matches immediately after internal newlines as well as at the start of |
matches immediately after internal newlines as well as at the start of |
4142 |
the subject string. It does not match after a newline that ends the |
the subject string. It does not match after a newline that ends the |
4143 |
string. A dollar matches before any newlines in the string, as well as |
string. A dollar matches before any newlines in the string, as well as |
4144 |
at the very end, when PCRE_MULTILINE is set. When newline is specified |
at the very end, when PCRE_MULTILINE is set. When newline is specified |
4145 |
as the two-character sequence CRLF, isolated CR and LF characters do |
as the two-character sequence CRLF, isolated CR and LF characters do |
4146 |
not indicate newlines. |
not indicate newlines. |
4147 |
|
|
4148 |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
4149 |
(where \n represents a newline) in multiline mode, but not otherwise. |
(where \n represents a newline) in multiline mode, but not otherwise. |
4150 |
Consequently, patterns that are anchored in single line mode because |
Consequently, patterns that are anchored in single line mode because |
4151 |
all branches start with ^ are not anchored in multiline mode, and a |
all branches start with ^ are not anchored in multiline mode, and a |
4152 |
match for circumflex is possible when the startoffset argument of |
match for circumflex is possible when the startoffset argument of |
4153 |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
4154 |
PCRE_MULTILINE is set. |
PCRE_MULTILINE is set. |
4155 |
|
|
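The /^abc$/ example above can be reproduced in a short sketch. It uses Python's re module, not PCRE, as a stand-in (an assumption for the sake of a runnable example); re.MULTILINE plays the role of PCRE_MULTILINE, and for this pattern and subject the two engines agree.

```python
import re

subject = "def\nabc"

# Default mode: circumflex matches only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: circumflex also matches just after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

print(single_line)        # None
print(multi_line.span())  # (4, 7)

# Even without multiline mode, dollar matches just before a final newline.
trailing = re.search(r"abc$", "abc\n")
print(trailing.span())    # (0, 3)
```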
4156 |
Note that the sequences \A, \Z, and \z can be used to match the start |
Note that the sequences \A, \Z, and \z can be used to match the start |
4157 |
and end of the subject in both modes, and if all branches of a pattern |
and end of the subject in both modes, and if all branches of a pattern |
4158 |
start with \A it is always anchored, whether or not PCRE_MULTILINE is |
start with \A it is always anchored, whether or not PCRE_MULTILINE is |
4159 |
set. |
set. |
4160 |
|
|
4161 |
|
|
4162 |
FULL STOP (PERIOD, DOT) AND \N |
FULL STOP (PERIOD, DOT) AND \N |
4163 |
|
|
4164 |
Outside a character class, a dot in the pattern matches any one charac- |
Outside a character class, a dot in the pattern matches any one charac- |
4165 |
ter in the subject string except (by default) a character that signi- |
ter in the subject string except (by default) a character that signi- |
4166 |
fies the end of a line. In UTF-8 mode, the matched character may be |
fies the end of a line. In UTF-8 mode, the matched character may be |
4167 |
more than one byte long. |
more than one byte long. |
4168 |
|
|
4169 |
When a line ending is defined as a single character, dot never matches |
When a line ending is defined as a single character, dot never matches |
4170 |
that character; when the two-character sequence CRLF is used, dot does |
that character; when the two-character sequence CRLF is used, dot does |
4171 |
not match CR if it is immediately followed by LF, but otherwise it |
not match CR if it is immediately followed by LF, but otherwise it |
4172 |
matches all characters (including isolated CRs and LFs). When any Uni- |
matches all characters (including isolated CRs and LFs). When any Uni- |
4173 |
code line endings are being recognized, dot does not match CR or LF or |
code line endings are being recognized, dot does not match CR or LF or |
4174 |
any of the other line ending characters. |
any of the other line ending characters. |
4175 |
|
|
4176 |
The behaviour of dot with regard to newlines can be changed. If the |
The behaviour of dot with regard to newlines can be changed. If the |
4177 |
PCRE_DOTALL option is set, a dot matches any one character, without |
PCRE_DOTALL option is set, a dot matches any one character, without |
4178 |
exception. If the two-character sequence CRLF is present in the subject |
exception. If the two-character sequence CRLF is present in the subject |
4179 |
string, it takes two dots to match it. |
string, it takes two dots to match it. |
4180 |
|
|
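A small sketch of the dot behaviour described above, again using Python's re module as a stand-in for PCRE (an assumption; Python always treats LF alone as the line ending, which corresponds to only one of PCRE's newline settings). re.DOTALL plays the role of PCRE_DOTALL.

```python
import re

# By default, dot does not match the newline character.
default_dot = re.fullmatch(r"a.b", "a\nb")

# With the dotall option, dot matches any one character, newline included.
dotall_dot = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# CRLF is two characters, so one dot is not enough even with dotall ...
one_dot_crlf = re.fullmatch(r"a.b", "a\r\nb", re.DOTALL)

# ... and it takes two dots to match it.
two_dots_crlf = re.fullmatch(r"a..b", "a\r\nb", re.DOTALL)

print(default_dot is None)        # True
print(dotall_dot is not None)     # True
print(one_dot_crlf is None)       # True
print(two_dots_crlf is not None)  # True
```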
4181 |
The handling of dot is entirely independent of the handling of circum- |
The handling of dot is entirely independent of the handling of circum- |
4182 |
flex and dollar, the only relationship being that they both involve |
flex and dollar, the only relationship being that they both involve |
4183 |
newlines. Dot has no special meaning in a character class. |
newlines. Dot has no special meaning in a character class. |
4184 |
|
|
4185 |
The escape sequence \N behaves like a dot, except that it is not |
The escape sequence \N behaves like a dot, except that it is not |
4186 |
affected by the PCRE_DOTALL option. In other words, it matches any |
affected by the PCRE_DOTALL option. In other words, it matches any |
4187 |
character except one that signifies the end of a line. |
character except one that signifies the end of a line. Perl also uses |
4188 |
|
\N to match characters by name; PCRE does not support this. |
4189 |
|
|
4190 |
|
|
4191 |
MATCHING A SINGLE BYTE |
MATCHING A SINGLE BYTE |
4202 |
PCRE_NO_UTF8_CHECK option is used). |
PCRE_NO_UTF8_CHECK option is used). |
4203 |
|
|
4204 |
PCRE does not allow \C to appear in lookbehind assertions (described |
PCRE does not allow \C to appear in lookbehind assertions (described |
4205 |
below), because in UTF-8 mode this would make it impossible to calcu- |
below) in UTF-8 mode, because this would make it impossible to calcu- |
4206 |
late the length of the lookbehind. |
late the length of the lookbehind. |
4207 |
|
|
4208 |
In general, the \C escape sequence is best avoided in UTF-8 mode. How- |
In general, the \C escape sequence is best avoided in UTF-8 mode. How- |
5109 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
5110 |
rent position, the assertion fails. |
rent position, the assertion fails. |
5111 |
|
|
5112 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
In UTF-8 mode, PCRE does not allow the \C escape (which matches a sin- |
5113 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
gle byte, even in UTF-8 mode) to appear in lookbehind assertions, |
5114 |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
because it makes it impossible to calculate the length of the lookbe- |
5115 |
which can match different numbers of bytes, are also not permitted. |
hind. The \X and \R escapes, which can match different numbers of |
5116 |
|
bytes, are also not permitted. |
5117 |
|
|
5118 |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
5119 |
lookbehinds, as long as the subpattern matches a fixed-length string. |
lookbehinds, as long as the subpattern matches a fixed-length string. |
5120 |
Recursion, however, is not supported. |
Recursion, however, is not supported. |
5121 |
|
|
5122 |
Possessive quantifiers can be used in conjunction with lookbehind |
Possessive quantifiers can be used in conjunction with lookbehind |
5123 |
assertions to specify efficient matching of fixed-length strings at the |
assertions to specify efficient matching of fixed-length strings at the |
5124 |
end of subject strings. Consider a simple pattern such as |
end of subject strings. Consider a simple pattern such as |
5125 |
|
|
5126 |
abcd$ |
abcd$ |
5127 |
|
|
5128 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
5129 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
5130 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
5131 |
pattern is specified as |
pattern is specified as |
5132 |
|
|
5133 |
^.*abcd$ |
^.*abcd$ |
5134 |
|
|
5135 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
5136 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
5137 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
5138 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
5139 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
5140 |
|
|
5141 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
5142 |
|
|
5143 |
there can be no backtracking for the .*+ item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
5144 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
5145 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
5146 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
5147 |
processing time. |
processing time. |
5148 |
|
|
5149 |
Using multiple assertions |
Using multiple assertions |
5152 |
|
|
5153 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
5154 |
|
|
5155 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
5156 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
5157 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
5158 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
5159 |
three characters are not "999". This pattern does not match "foo" pre- |
three characters are not "999". This pattern does not match "foo" pre- |
5160 |
ceded by six characters, the first three of which are digits and the |
ceded by six characters, the first three of which are digits and the |
5161 |
last three of which are not "999". For example, it doesn't match |
last three of which are not "999". For example, it doesn't match |
5162 |
"123abcfoo". A pattern to do that is |
"123abcfoo". A pattern to do that is |
5163 |
|
|
5164 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
5165 |
|
|
5166 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
5167 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
5168 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
5169 |
|
|
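The difference between the two patterns above can be checked with a short sketch. It uses Python's re module in place of PCRE (an assumption made for illustration); both engines accept these fixed-length lookbehinds and give the same results for the subjects shown. The variable names are hypothetical.

```python
import re

# Both assertions look at the same three characters before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

print(three_back.search("123foo") is not None)    # True
print(three_back.search("999foo") is None)        # True
print(three_back.search("123abcfoo") is None)     # True

print(six_back.search("123abcfoo").span())        # (6, 9)
print(six_back.search("123999foo") is None)       # True
```

In each case the reported match is just "foo"; the assertions constrain what precedes it without consuming any characters.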
5171 |
|
|
5172 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
5173 |
|
|
5174 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
5175 |
is not preceded by "foo", while |
is not preceded by "foo", while |
5176 |
|
|
5177 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
5178 |
|
|
5179 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
5180 |
three characters that are not "999". |
three characters that are not "999". |
5181 |
|
|
5182 |
|
|
5183 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
5184 |
|
|
5185 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
5186 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
5187 |
on the result of an assertion, or whether a specific capturing subpat- |
on the result of an assertion, or whether a specific capturing subpat- |
5188 |
tern has already been matched. The two possible forms of conditional |
tern has already been matched. The two possible forms of conditional |
5189 |
subpattern are: |
subpattern are: |
5190 |
|
|
5191 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
5192 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
5193 |
|
|
5194 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
5195 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
5196 |
tives in the subpattern, a compile-time error occurs. Each of the two |
tives in the subpattern, a compile-time error occurs. Each of the two |
5197 |
alternatives may itself contain nested subpatterns of any form, includ- |
alternatives may itself contain nested subpatterns of any form, includ- |
5198 |
ing conditional subpatterns; the restriction to two alternatives |
ing conditional subpatterns; the restriction to two alternatives |
5199 |
applies only at the level of the condition. This pattern fragment is an |
applies only at the level of the condition. This pattern fragment is an |
5202 |
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
(?(1) (A|B|C) | (D | (?(2)E|F) | E) ) |
5203 |
|
|
5204 |
|
|
5205 |
There are four kinds of condition: references to subpatterns, refer- |
There are four kinds of condition: references to subpatterns, refer- |
5206 |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
5207 |
|
|
5208 |
Checking for a used subpattern by number |
Checking for a used subpattern by number |
5209 |
|
|
5210 |
If the text between the parentheses consists of a sequence of digits, |
If the text between the parentheses consists of a sequence of digits, |
5211 |
the condition is true if a capturing subpattern of that number has pre- |
the condition is true if a capturing subpattern of that number has pre- |
5212 |
viously matched. If there is more than one capturing subpattern with |
viously matched. If there is more than one capturing subpattern with |
5213 |
the same number (see the earlier section about duplicate subpattern |
the same number (see the earlier section about duplicate subpattern |
5214 |
numbers), the condition is true if any of them have matched. An alter- |
numbers), the condition is true if any of them have matched. An alter- |
5215 |
native notation is to precede the digits with a plus or minus sign. In |
native notation is to precede the digits with a plus or minus sign. In |
5216 |
this case, the subpattern number is relative rather than absolute. The |
this case, the subpattern number is relative rather than absolute. The |
5217 |
most recently opened parentheses can be referenced by (?(-1), the next |
most recently opened parentheses can be referenced by (?(-1), the next |
5218 |
most recent by (?(-2), and so on. Inside loops it can also make sense |
most recent by (?(-2), and so on. Inside loops it can also make sense |
5219 |
to refer to subsequent groups. The next parentheses to be opened can be |
to refer to subsequent groups. The next parentheses to be opened can be |
5220 |
referenced as (?(+1), and so on. (The value zero in any of these forms |
referenced as (?(+1), and so on. (The value zero in any of these forms |
5221 |
is not used; it provokes a compile-time error.) |
is not used; it provokes a compile-time error.) |
5222 |
|
|
5223 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
5224 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
5225 |
divide it into three parts for ease of discussion: |
divide it into three parts for ease of discussion: |
5226 |
|
|
5227 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
5228 |
|
|
5229 |
The first part matches an optional opening parenthesis, and if that |
The first part matches an optional opening parenthesis, and if that |
5230 |
character is present, sets it as the first captured substring. The sec- |
character is present, sets it as the first captured substring. The sec- |
5231 |
ond part matches one or more characters that are not parentheses. The |
ond part matches one or more characters that are not parentheses. The |
5232 |
third part is a conditional subpattern that tests whether or not the |
third part is a conditional subpattern that tests whether or not the |
5233 |
first set of parentheses matched. If they did, that is, if the subject |
first set of parentheses matched. If they did, that is, if the subject |
5234 |
started with an opening parenthesis, the condition is true, and so the |
started with an opening parenthesis, the condition is true, and so the |
5235 |
yes-pattern is executed and a closing parenthesis is required. Other- |
yes-pattern is executed and a closing parenthesis is required. Other- |
5236 |
wise, since no-pattern is not present, the subpattern matches nothing. |
wise, since no-pattern is not present, the subpattern matches nothing. |
5237 |
In other words, this pattern matches a sequence of non-parentheses, |
In other words, this pattern matches a sequence of non-parentheses, |
5238 |
optionally enclosed in parentheses. |
optionally enclosed in parentheses. |
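
The behaviour described above can be tried out with a minimal sketch. It assumes Python's re module rather than PCRE itself; re happens to accept the same (?(1)...) condition syntax, and re.VERBOSE stands in for PCRE_EXTENDED. The names OPTIONAL_PARENS and is_optionally_parenthesized are hypothetical.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED: white space in the pattern is ignored.
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(subject):
    """True if the whole subject is a run of non-parentheses, optionally wrapped in ( )."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

With this sketch, "(abc)" and "abc" are accepted, while "(abc" is rejected because the condition on group 1 then demands a closing parenthesis. Relative conditions such as (?(-1) are not available in Python's re, so they are not shown.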
5239 |
|
|
5240 |
If you were embedding this pattern in a larger one, you could use a |
If you were embedding this pattern in a larger one, you could use a |
5241 |
relative reference: |
relative reference: |
5242 |
|
|
5243 |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
5244 |
|
|
5245 |
This makes the fragment independent of the parentheses in the larger |
This makes the fragment independent of the parentheses in the larger |
5246 |
pattern. |
pattern. |
5247 |
|
|
5248 |
Checking for a used subpattern by name |
Checking for a used subpattern by name |
5249 |
|
|
5250 |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
5251 |
used subpattern by name. For compatibility with earlier versions of |
used subpattern by name. For compatibility with earlier versions of |
5252 |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
5253 |
also recognized. However, there is a possible ambiguity with this syn- |
also recognized. However, there is a possible ambiguity with this syn- |
5254 |
tax, because subpattern names may consist entirely of digits. PCRE |
tax, because subpattern names may consist entirely of digits. PCRE |
5255 |
looks first for a named subpattern; if it cannot find one and the name |
looks first for a named subpattern; if it cannot find one and the name |
5256 |
consists entirely of digits, PCRE looks for a subpattern of that num- |
consists entirely of digits, PCRE looks for a subpattern of that num- |
5257 |
ber, which must be greater than zero. Using subpattern names that con- |
ber, which must be greater than zero. Using subpattern names that con- |
5258 |
sist entirely of digits is not recommended. |
sist entirely of digits is not recommended. |
5259 |
|
|
5260 |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
5261 |
|
|
5262 |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
5263 |
|
|
5264 |
If the name used in a condition of this kind is a duplicate, the test |
If the name used in a condition of this kind is a duplicate, the test |
5265 |
is applied to all subpatterns of the same name, and is true if any one |
is applied to all subpatterns of the same name, and is true if any one |
5266 |
of them has matched. |
of them has matched. |
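
A named condition can be sketched in the same way, again assuming Python's re module rather than PCRE. Python spells the group as (?P<OPEN>...) and the test as (?(OPEN)...), which corresponds to the older (?(name)...) form mentioned above. The names NAMED and opening_paren are hypothetical.

```python
import re

# Same pattern as the numbered version, but the condition tests the group by name.
NAMED = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def opening_paren(subject):
    """Return what the OPEN group captured, or None if it did not take part
    in the match or the subject does not match at all."""
    m = NAMED.fullmatch(subject)
    return m.group("OPEN") if m else None
```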
5267 |
|
|
5268 |
Checking for pattern recursion |
Checking for pattern recursion |
5269 |
|
|
5270 |
If the condition is the string (R), and there is no subpattern with the |
If the condition is the string (R), and there is no subpattern with the |
5271 |
name R, the condition is true if a recursive call to the whole pattern |
name R, the condition is true if a recursive call to the whole pattern |
5272 |
or any subpattern has been made. If digits or a name preceded by amper- |
or any subpattern has been made. If digits or a name preceded by amper- |
5273 |
sand follow the letter R, for example: |
sand follow the letter R, for example: |
5274 |
|
|
5276 |
|
|
5277 |
the condition is true if the most recent recursion is into a subpattern |
the condition is true if the most recent recursion is into a subpattern |
5278 |
whose number or name is given. This condition does not check the entire |
whose number or name is given. This condition does not check the entire |
5279 |
recursion stack. If the name used in a condition of this kind is a |
recursion stack. If the name used in a condition of this kind is a |
5280 |
duplicate, the test is applied to all subpatterns of the same name, and |
duplicate, the test is applied to all subpatterns of the same name, and |
5281 |
is true if any one of them is the most recent recursion. |
is true if any one of them is the most recent recursion. |
5282 |
|
|
5283 |
At "top level", all these recursion test conditions are false. The |
At "top level", all these recursion test conditions are false. The |
5284 |
syntax for recursive patterns is described below. |
syntax for recursive patterns is described below. |
5285 |
|
|
5286 |
Defining subpatterns for use by reference only |
Defining subpatterns for use by reference only |
5287 |
|
|
5288 |
If the condition is the string (DEFINE), and there is no subpattern |
If the condition is the string (DEFINE), and there is no subpattern |
5289 |
with the name DEFINE, the condition is always false. In this case, |
with the name DEFINE, the condition is always false. In this case, |
5290 |
there may be only one alternative in the subpattern. It is always |
there may be only one alternative in the subpattern. It is always |
5291 |
skipped if control reaches this point in the pattern; the idea of |
skipped if control reaches this point in the pattern; the idea of |
5292 |
DEFINE is that it can be used to define subroutines that can be refer- |
DEFINE is that it can be used to define subroutines that can be refer- |
5293 |
enced from elsewhere. (The use of subroutines is described below.) For |
enced from elsewhere. (The use of subroutines is described below.) For |
5294 |
example, a pattern to match an IPv4 address such as "192.168.23.245" |
example, a pattern to match an IPv4 address such as "192.168.23.245" |
5295 |
could be written like this (ignore whitespace and line breaks): |
could be written like this (ignore whitespace and line breaks): |
5296 |
|
|
5297 |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
5298 |
\b (?&byte) (\.(?&byte)){3} \b |
\b (?&byte) (\.(?&byte)){3} \b |
5299 |
|
|
5300 |
The first part of the pattern is a DEFINE group inside which another |
The first part of the pattern is a DEFINE group inside which another |
5301 |
group named "byte" is defined. This matches an individual component of |
group named "byte" is defined. This matches an individual component of |
5302 |
an IPv4 address (a number less than 256). When matching takes place, |
an IPv4 address (a number less than 256). When matching takes place, |
5303 |
this part of the pattern is skipped because DEFINE acts like a false |
this part of the pattern is skipped because DEFINE acts like a false |
5304 |
condition. The rest of the pattern uses references to the named group |
condition. The rest of the pattern uses references to the named group |
5305 |
to match the four dot-separated components of an IPv4 address, insist- |
to match the four dot-separated components of an IPv4 address, insist- |
5306 |
ing on a word boundary at each end. |
ing on a word boundary at each end. |
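
Python's re module has neither DEFINE nor subroutine calls, so the following is only an approximation of the idea, not the PCRE mechanism: the "byte" subpattern is written once and reused by textual interpolation. The names BYTE, IPV4 and find_ipv4 are hypothetical.

```python
import re

# One component of an IPv4 address (0-255): the body of the "byte" group.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Textual reuse stands in for the (?&byte) subroutine calls.
IPV4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")


def find_ipv4(text):
    """Return the first dotted-quad found in text, or None."""
    m = IPV4.search(text)
    return m.group(0) if m else None
```

The difference from DEFINE is that interpolation copies the subpattern text into the final pattern four times, whereas a subroutine call refers to a single compiled group.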
5307 |
|
|
5308 |
Assertion conditions |
Assertion conditions |
5309 |
|
|
5310 |
If the condition is not in any of the above formats, it must be an |
If the condition is not in any of the above formats, it must be an |
5311 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
5312 |
assertion. Consider this pattern, again containing non-significant |
assertion. Consider this pattern, again containing non-significant |
5313 |
white space, and with the two alternatives on the second line: |
white space, and with the two alternatives on the second line: |
5314 |
|
|
5315 |
(?(?=[^a-z]*[a-z]) |
(?(?=[^a-z]*[a-z]) |
5316 |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
5317 |
|
|
5318 |
The condition is a positive lookahead assertion that matches an |
The condition is a positive lookahead assertion that matches an |
5319 |
optional sequence of non-letters followed by a letter. In other words, |
optional sequence of non-letters followed by a letter. In other words, |
5320 |
it tests for the presence of at least one letter in the subject. If a |
it tests for the presence of at least one letter in the subject. If a |
5321 |
letter is found, the subject is matched against the first alternative; |
letter is found, the subject is matched against the first alternative; |
5322 |
otherwise it is matched against the second. This pattern matches |
otherwise it is matched against the second. This pattern matches |
5323 |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
5324 |
letters and dd are digits. |
letters and dd are digits. |
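
An assertion condition of the form (?(?=A)X|Y) behaves like the alternation (?=A)X|(?!A)Y. The sketch below uses that rewriting because Python's re module, assumed here in place of PCRE, does not accept assertions as conditions. The name DATE_LIKE is hypothetical.

```python
import re

DATE_LIKE = re.compile(
    r"""
    (?= [^a-z]*[a-z] ) \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead: dd-aaa-dd
    |
    (?! [^a-z]*[a-z] ) \d{2}-\d{2}-\d{2}      # no letter ahead: dd-dd-dd
    """,
    re.VERBOSE,
)


def is_date_like(subject):
    """True if the whole subject has one of the two forms."""
    return DATE_LIKE.fullmatch(subject) is not None
```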
5325 |
|
|
5326 |
|
|
5329 |
There are two ways of including comments in patterns that are processed |
There are two ways of including comments in patterns that are processed |
5330 |
by PCRE. In both cases, the start of the comment must not be in a char- |
by PCRE. In both cases, the start of the comment must not be in a char- |
5331 |
acter class, nor in the middle of any other sequence of related charac- |
acter class, nor in the middle of any other sequence of related charac- |
5332 |
ters such as (?: or a subpattern name or number. The characters that |
ters such as (?: or a subpattern name or number. The characters that |
5333 |
make up a comment play no part in the pattern matching. |
make up a comment play no part in the pattern matching. |
5334 |
|
|
5335 |
The sequence (?# marks the start of a comment that continues up to the |
The sequence (?# marks the start of a comment that continues up to the |
5336 |
next closing parenthesis. Nested parentheses are not permitted. If the |
next closing parenthesis. Nested parentheses are not permitted. If the |
5337 |
PCRE_EXTENDED option is set, an unescaped # character also introduces a |
PCRE_EXTENDED option is set, an unescaped # character also introduces a |
5338 |
comment, which in this case continues to immediately after the next |
comment, which in this case continues to immediately after the next |
5339 |
newline character or character sequence in the pattern. Which charac- |
newline character or character sequence in the pattern. Which charac- |
5340 |
ters are interpreted as newlines is controlled by the options passed to |
ters are interpreted as newlines is controlled by the options passed to |
5341 |
pcre_compile() or by a special sequence at the start of the pattern, as |
pcre_compile() or by a special sequence at the start of the pattern, as |
5342 |
described in the section entitled "Newline conventions" above. Note |
described in the section entitled "Newline conventions" above. Note |
5343 |
that the end of this type of comment is a literal newline sequence in |
that the end of this type of comment is a literal newline sequence in |
5344 |
the pattern; escape sequences that happen to represent a newline do not |
the pattern; escape sequences that happen to represent a newline do not |
5345 |
count. For example, consider this pattern when PCRE_EXTENDED is set, |
count. For example, consider this pattern when PCRE_EXTENDED is set, |
5346 |
and the default newline convention is in force: |
and the default newline convention is in force: |
5347 |
|
|
5348 |
abc #comment \n still comment |
abc #comment \n still comment |
5349 |
|
|
5350 |
On encountering the # character, pcre_compile() skips along, looking |
On encountering the # character, pcre_compile() skips along, looking |
5351 |
for a newline in the pattern. The sequence \n is still literal at this |
for a newline in the pattern. The sequence \n is still literal at this |
5352 |
stage, so it does not terminate the comment. Only an actual character |
stage, so it does not terminate the comment. Only an actual character |
5353 |
with the code value 0x0a (the default newline) does so. |
with the code value 0x0a (the default newline) does so. |
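
Python's re.VERBOSE mode treats # comments in the same way, so the distinction can be sketched there. This is an illustration with a different engine, not a statement about pcre_compile() internals. The names escaped and literal are hypothetical.

```python
import re

# Raw string: the pattern holds a backslash followed by "n", not a newline,
# so the comment runs to the end of the pattern.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real 0x0a character, which ends the comment,
# so " still comment" is pattern text again (its white space is ignored).
literal = re.compile("abc #comment \n still comment", re.VERBOSE)
```

In this sketch, escaped matches only "abc", while literal matches "abcstillcomment".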
5354 |
|
|
5355 |
|
|
5356 |
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
5357 |
|
|
5358 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
5359 |
unlimited nested parentheses. Without the use of recursion, the best |
unlimited nested parentheses. Without the use of recursion, the best |
5360 |
that can be done is to use a pattern that matches up to some fixed |
that can be done is to use a pattern that matches up to some fixed |
5361 |
depth of nesting. It is not possible to handle an arbitrary nesting |
depth of nesting. It is not possible to handle an arbitrary nesting |
5362 |
depth. |
depth. |
5363 |
|
|
5364 |
For some time, Perl has provided a facility that allows regular expres- |
For some time, Perl has provided a facility that allows regular expres- |
5365 |
sions to recurse (amongst other things). It does this by interpolating |
sions to recurse (amongst other things). It does this by interpolating |
5366 |
Perl code in the expression at run time, and the code can refer to the |
Perl code in the expression at run time, and the code can refer to the |
5367 |
expression itself. A Perl pattern using code interpolation to solve the |
expression itself. A Perl pattern using code interpolation to solve the |
5368 |
parentheses problem can be created like this: |
parentheses problem can be created like this: |
5369 |
|
|
5373 |
refers recursively to the pattern in which it appears. |
refers recursively to the pattern in which it appears. |
5374 |
|
|
5375 |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
5376 |
it supports special syntax for recursion of the entire pattern, and |
it supports special syntax for recursion of the entire pattern, and |
5377 |
also for individual subpattern recursion. After its introduction in |
also for individual subpattern recursion. After its introduction in |
5378 |
PCRE and Python, this kind of recursion was subsequently introduced |
PCRE and Python, this kind of recursion was subsequently introduced |
5379 |
into Perl at release 5.10. |
into Perl at release 5.10. |
5380 |
|
|
5381 |
A special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
5382 |
zero and a closing parenthesis is a recursive subroutine call of the |
zero and a closing parenthesis is a recursive subroutine call of the |
5383 |
subpattern of the given number, provided that it occurs inside that |
subpattern of the given number, provided that it occurs inside that |
5384 |
subpattern. (If not, it is a non-recursive subroutine call, which is |
subpattern. (If not, it is a non-recursive subroutine call, which is |
5385 |
described in the next section.) The special item (?R) or (?0) is a |
described in the next section.) The special item (?R) or (?0) is a |
5386 |
recursive call of the entire regular expression. |
recursive call of the entire regular expression. |
5387 |
|
|
5388 |
This PCRE pattern solves the nested parentheses problem (assume the |
This PCRE pattern solves the nested parentheses problem (assume the |
5389 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
5390 |
|
|
5391 |
\( ( [^()]++ | (?R) )* \) |
\( ( [^()]++ | (?R) )* \) |
5392 |
|
|
5393 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
5394 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
5395 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
5396 |
sized substring). Finally there is a closing parenthesis. Note the use |
sized substring). Finally there is a closing parenthesis. Note the use |
5397 |
of a possessive quantifier to avoid backtracking into sequences of non- |
of a possessive quantifier to avoid backtracking into sequences of non- |
5398 |
parentheses. |
parentheses. |
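
The structure of this pattern can be mirrored by a small recursive function. The sketch below is plain Python, not PCRE, and only models the shape of the recursion: an opening parenthesis, then any mix of non-parentheses and nested groups, then a closing parenthesis. The name match_nested is hypothetical.

```python
def match_nested(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                     # the closing \)
        if ch == "(":
            pos = match_nested(subject, pos)   # the (?R) alternative
            if pos == -1:
                return -1
        else:
            pos += 1                           # part of a [^()]+ run
    return -1                                  # ran out of subject before \)
```

Because the function never revisits a character it has already consumed, it fails quickly on unbalanced input, which is the effect the possessive quantifier has in the pattern.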
5399 |
|
|
5400 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
5401 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
5402 |
|
|
5403 |
( \( ( [^()]++ | (?1) )* \) ) |
( \( ( [^()]++ | (?1) )* \) ) |
5404 |
|
|
5405 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
5406 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
5407 |
|
|
5408 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
5409 |
tricky. This is made easier by the use of relative references. Instead |
tricky. This is made easier by the use of relative references. Instead |
5410 |
of (?1) in the pattern above you can write (?-2) to refer to the second |
of (?1) in the pattern above you can write (?-2) to refer to the second |
5411 |
most recently opened parentheses preceding the recursion. In other |
most recently opened parentheses preceding the recursion. In other |
5412 |
words, a negative number counts capturing parentheses leftwards from |
words, a negative number counts capturing parentheses leftwards from |
5413 |
the point at which it is encountered. |
the point at which it is encountered. |
5414 |
|
|
5415 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
5416 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
5417 |
because the reference is not inside the parentheses that are refer- |
because the reference is not inside the parentheses that are refer- |
5418 |
enced. They are always non-recursive subroutine calls, as described in |
enced. They are always non-recursive subroutine calls, as described in |
5419 |
the next section. |
the next section. |
5420 |
|
|
5421 |
An alternative approach is to use named parentheses instead. The Perl |
An alternative approach is to use named parentheses instead. The Perl |
5422 |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
5423 |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
5424 |
|
|
5425 |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
5426 |
|
|
5427 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
5428 |
one is used. |
one is used. |
5429 |
|
|
5430 |
This particular example pattern that we have been looking at contains |
This particular example pattern that we have been looking at contains |
5431 |
nested unlimited repeats, and so the use of a possessive quantifier for |
nested unlimited repeats, and so the use of a possessive quantifier for |
5432 |
matching strings of non-parentheses is important when applying the pat- |
matching strings of non-parentheses is important when applying the pat- |
5433 |
tern to strings that do not match. For example, when this pattern is |
tern to strings that do not match. For example, when this pattern is |
5434 |
applied to |
applied to |
5435 |
|
|
5436 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
5437 |
|
|
5438 |
it yields "no match" quickly. However, if a possessive quantifier is |
it yields "no match" quickly. However, if a possessive quantifier is |
5439 |
not used, the match runs for a very long time indeed because there are |
not used, the match runs for a very long time indeed because there are |
5440 |
so many different ways the + and * repeats can carve up the subject, |
so many different ways the + and * repeats can carve up the subject, |
5441 |
and all have to be tested before failure can be reported. |
and all have to be tested before failure can be reported. |
5442 |
|
|
5443 |
At the end of a match, the values of capturing parentheses are those |
At the end of a match, the values of capturing parentheses are those |
5444 |
from the outermost level. If you want to obtain intermediate values, a |
from the outermost level. If you want to obtain intermediate values, a |
5445 |
callout function can be used (see below and the pcrecallout documenta- |
callout function can be used (see below and the pcrecallout documenta- |
5446 |
tion). If the pattern above is matched against |
tion). If the pattern above is matched against |
5447 |
|
|
5448 |
(ab(cd)ef) |
(ab(cd)ef) |
5449 |
|
|
5450 |
the value for the inner capturing parentheses (numbered 2) is "ef", |
the value for the inner capturing parentheses (numbered 2) is "ef", |
5451 |
which is the last value taken on at the top level. If a capturing sub- |
which is the last value taken on at the top level. If a capturing sub- |
5452 |
pattern is not matched at the top level, its final captured value is |
pattern is not matched at the top level, its final captured value is |
5453 |
unset, even if it was (temporarily) set at a deeper level during the |
unset, even if it was (temporarily) set at a deeper level during the |
5454 |
matching process. |
matching process. |
5455 |
|
|
5456 |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
5457 |
to obtain extra memory to store data during a recursion, which it does |
to obtain extra memory to store data during a recursion, which it does |
5458 |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
5459 |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
5460 |
|
|
5461 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
5462 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
5463 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
5464 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
5465 |
ted at the outer level. |
ted at the outer level. |
5466 |
|
|
5467 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
5468 |
|
|
5469 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
5470 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
5471 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
5472 |
|
|
5473 |
Differences in recursion processing between PCRE and Perl |
Differences in recursion processing between PCRE and Perl |
5474 |
|
|
5475 |
Recursion processing in PCRE differs from Perl in two important ways. |
Recursion processing in PCRE differs from Perl in two important ways. |
5476 |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
5477 |
always treated as an atomic group. That is, once it has matched some of |
always treated as an atomic group. That is, once it has matched some of |
5478 |
the subject string, it is never re-entered, even if it contains untried |
the subject string, it is never re-entered, even if it contains untried |
5479 |
alternatives and there is a subsequent matching failure. This can be |
alternatives and there is a subsequent matching failure. This can be |
5480 |
illustrated by the following pattern, which purports to match a palin- |
illustrated by the following pattern, which purports to match a palin- |
5481 |
dromic string that contains an odd number of characters (for example, |
dromic string that contains an odd number of characters (for example, |
5482 |
"a", "aba", "abcba", "abcdcba"): |
"a", "aba", "abcba", "abcdcba"): |
5483 |
|
|
5484 |
^(.|(.)(?1)\2)$ |
^(.|(.)(?1)\2)$ |
5485 |
|
|
5486 |
The idea is that it either matches a single character, or two identical |
The idea is that it either matches a single character, or two identical |
5487 |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
5488 |
in PCRE it does not if the subject string is longer than three characters. |
in PCRE it does not if the subject string is longer than three characters. |
5489 |
Consider the subject string "abcba": |
Consider the subject string "abcba": |
5490 |
|
|
5491 |
At the top level, the first character is matched, but as it is not at |
At the top level, the first character is matched, but as it is not at |
5492 |
the end of the string, the first alternative fails; the second alterna- |
the end of the string, the first alternative fails; the second alterna- |
5493 |
tive is taken and the recursion kicks in. The recursive call to subpat- |
tive is taken and the recursion kicks in. The recursive call to subpat- |
5494 |
tern 1 successfully matches the next character ("b"). (Note that the |
tern 1 successfully matches the next character ("b"). (Note that the |
5495 |
beginning and end of line tests are not part of the recursion.) |
beginning and end of line tests are not part of the recursion.) |
5496 |
|
|
5497 |
Back at the top level, the next character ("c") is compared with what |
Back at the top level, the next character ("c") is compared with what |
5498 |
subpattern 2 matched, which was "a". This fails. Because the recursion |
subpattern 2 matched, which was "a". This fails. Because the recursion |
5499 |
is treated as an atomic group, there are now no backtracking points, |
is treated as an atomic group, there are now no backtracking points, |
5500 |
and so the entire match fails. (Perl is able, at this point, to re- |
and so the entire match fails. (Perl is able, at this point, to re- |
5501 |
enter the recursion and try the second alternative.) However, if the |
enter the recursion and try the second alternative.) However, if the |
5502 |
pattern is written with the alternatives in the other order, things are |
pattern is written with the alternatives in the other order, things are |
5503 |
different: |
different: |
5504 |
|
|
5505 |
^((.)(?1)\2|.)$ |
^((.)(?1)\2|.)$ |
5506 |
|
|
5507 |
This time, the recursing alternative is tried first, and continues to |
This time, the recursing alternative is tried first, and continues to |
5508 |
recurse until it runs out of characters, at which point the recursion |
recurse until it runs out of characters, at which point the recursion |
5509 |
fails. But this time we do have another alternative to try at the |
fails. But this time we do have another alternative to try at the |
5510 |
higher level. That is the big difference: in the previous case the |
higher level. That is the big difference: in the previous case the |
5511 |
remaining alternative is at a deeper recursion level, which PCRE cannot |
remaining alternative is at a deeper recursion level, which PCRE cannot |
5512 |
use. |
use. |
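
The effect of atomic recursion can be explored with a toy model. The sketch below is plain Python and makes no claim about how PCRE or Perl is implemented. It enumerates the ways group 1 of the palindrome pattern can match, and when atomic is true it keeps only the first way each recursive call succeeded. The names group1_ends and matches are hypothetical, and recurse_first selects the second ordering of the alternatives.

```python
def group1_ends(s, i, atomic, recurse_first):
    """Yield each position where group 1 can end when it starts at s[i]."""

    def single_char():                        # alternative: .
        if i < len(s):
            yield i + 1

    def wrapped():                            # alternative: (.)(?1)\2
        if i >= len(s):
            return
        inner = group1_ends(s, i + 1, atomic, recurse_first)
        if atomic:
            # Atomic recursion: keep only the first way the call matched.
            first = next(inner, None)
            inner = iter(()) if first is None else iter((first,))
        for j in inner:
            if j < len(s) and s[j] == s[i]:   # \2 must equal what (.) captured
                yield j + 1

    order = (wrapped, single_char) if recurse_first else (single_char, wrapped)
    for alternative in order:
        yield from alternative()


def matches(s, atomic, recurse_first=False):
    """Model ^(...)$ around group 1: some ending must land on the end of s."""
    return any(end == len(s) for end in group1_ends(s, 0, atomic, recurse_first))
```

In this model, "abcba" fails with atomic recursion when the single-character alternative comes first, succeeds when recursion is allowed to backtrack, and succeeds with atomic recursion once the recursing alternative is tried first.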
5513 |
|
|
5514 |
To change the pattern so that it matches all palindromic strings, not |
To change the pattern so that it matches all palindromic strings, not |
5515 |
just those with an odd number of characters, it is tempting to change |
just those with an odd number of characters, it is tempting to change |
5516 |
the pattern to this: |
the pattern to this: |
5517 |
|
|
5518 |
^((.)(?1)\2|.?)$ |
^((.)(?1)\2|.?)$ |
5519 |
|
|
5520 |
Again, this works in Perl, but not in PCRE, and for the same reason. |
Again, this works in Perl, but not in PCRE, and for the same reason. |
5521 |
When a deeper recursion has matched a single character, it cannot be |
When a deeper recursion has matched a single character, it cannot be |
5522 |
entered again in order to match an empty string. The solution is to |
entered again in order to match an empty string. The solution is to |
5523 |
separate the two cases, and write out the odd and even cases as alter- |
separate the two cases, and write out the odd and even cases as alter- |
5524 |
natives at the higher level: |
natives at the higher level: |
5525 |
|
|
5526 |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
5527 |
|
|
5528 |
If you want to match typical palindromic phrases, the pattern has to |
If you want to match typical palindromic phrases, the pattern has to |
5529 |
ignore all non-word characters, which can be done like this: |
ignore all non-word characters, which can be done like this: |
5530 |
|
|
5531 |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
5532 |
|
|
5533 |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
5534 |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
5535 |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
5536 |
ing into sequences of non-word characters. Without this, PCRE takes a |
ing into sequences of non-word characters. Without this, PCRE takes a |
5537 |
great deal longer (ten times or more) to match typical phrases, and |
great deal longer (ten times or more) to match typical phrases, and |
5538 |
Perl takes so long that you think it has gone into a loop. |
Perl takes so long that you think it has gone into a loop. |
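
As a cross-check when testing the phrase pattern above, the property it is meant to accept can be written in ordinary code. The following is a minimal Python sketch, not the regular expression itself: the function name is hypothetical, lower-casing stands in for PCRE_CASELESS, and Python's \W is Unicode-aware, whereas PCRE's default \W covers ASCII only.

```python
import re


def is_palindromic_phrase(text):
    """Return True if text reads the same both ways once non-word
    characters are dropped and case is ignored."""
    # Drop everything that \W matches (punctuation, spaces, ...).
    letters = re.sub(r"\W+", "", text).lower()
    return letters == letters[::-1]


if __name__ == "__main__":
    print(is_palindromic_phrase("A man, a plan, a canal: Panama!"))
```

A function like this describes only the intended result; it says nothing about how the matcher backtracks, which is where PCRE and Perl differ.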
5539 |
|
|
5540 |
WARNING: The palindrome-matching patterns above work only if the sub- |
WARNING: The palindrome-matching patterns above work only if the sub- |
5541 |
ject string does not start with a palindrome that is shorter than the |
ject string does not start with a palindrome that is shorter than the |
5542 |
entire string. For example, although "abcba" is correctly matched, if |
entire string. For example, although "abcba" is correctly matched, if |
5543 |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
5544 |
then fails at top level because the end of the string does not follow. |
then fails at top level because the end of the string does not follow. |
5545 |
Once again, it cannot jump back into the recursion to try other alter- |
Once again, it cannot jump back into the recursion to try other alter- |
5546 |
natives, so the entire match fails. |
natives, so the entire match fails. |
5547 |
|
|
5548 |
The second way in which PCRE and Perl differ in their recursion pro- |
The second way in which PCRE and Perl differ in their recursion pro- |
5549 |
cessing is in the handling of captured values. In Perl, when a subpat- |
cessing is in the handling of captured values. In Perl, when a subpat- |
5550 |
tern is called recursively or as a subroutine (see the next section), |
tern is called recursively or as a subroutine (see the next section), |
5551 |
it has no access to any values that were captured outside the recur- |
it has no access to any values that were captured outside the recur- |
5552 |
sion, whereas in PCRE these values can be referenced. Consider this |
sion, whereas in PCRE these values can be referenced. Consider this |
5553 |
pattern: |
pattern: |
5554 |
|
|
5555 |
^(.)(\1|a(?2)) |
^(.)(\1|a(?2)) |
5556 |
|
|
5557 |
In PCRE, this pattern matches "bab". The first capturing parentheses |
In PCRE, this pattern matches "bab". The first capturing parentheses |
5558 |
match "b", then in the second group, when the back reference \1 fails |
match "b", then in the second group, when the back reference \1 fails |
5559 |
to match "b", the second alternative matches "a" and then recurses. In |
to match "b", the second alternative matches "a" and then recurses. In |
5560 |
the recursion, \1 does now match "b" and so the whole match succeeds. |
the recursion, \1 does now match "b" and so the whole match succeeds. |
5561 |
In Perl, the pattern fails to match because inside the recursive call |
In Perl, the pattern fails to match because inside the recursive call |
5562 |
\1 cannot access the externally set value. |
\1 cannot access the externally set value. |
5563 |
|
|
5564 |
|
|
5565 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
5566 |
|
|
5567 |
If the syntax for a recursive subpattern call (either by number or by |
If the syntax for a recursive subpattern call (either by number or by |
5568 |
name) is used outside the parentheses to which it refers, it operates |
name) is used outside the parentheses to which it refers, it operates |
5569 |
like a subroutine in a programming language. The called subpattern may |
like a subroutine in a programming language. The called subpattern may |
5570 |
be defined before or after the reference. A numbered reference can be |
be defined before or after the reference. A numbered reference can be |
5571 |
absolute or relative, as in these examples: |
absolute or relative, as in these examples: |
5572 |
|
|
5573 |
(...(absolute)...)...(?2)... |
(...(absolute)...)...(?2)... |
5578 |
|
|
5579 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
5580 |
|
|
5581 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
5582 |
not "sense and responsibility". If instead the pattern |
not "sense and responsibility". If instead the pattern |
5583 |
|
|
5584 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
5585 |
|
|
5586 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
5587 |
two strings. Another example is given in the discussion of DEFINE |
two strings. Another example is given in the discussion of DEFINE |
5588 |
above. |
above. |
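
The difference between the back reference and the subroutine call can be tried out in Python, with one caveat: Python's re module has no (?1) syntax, so the sketch below emulates the subroutine call by repeating the text of group 1 as a non-capturing group. The variable and function names are illustrative only.

```python
import re

GROUP = r"(sens|respons)"

# \1 must match the same text that group 1 captured.
backref = re.compile(GROUP + r"e and \1ibility")

# Emulation of (?1): the subpattern is applied again, so either
# alternative may match, whatever group 1 captured.
subroutine_like = re.compile(GROUP + r"e and (?:sens|respons)ibility")


def matches(pattern, subject):
    """True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ("sense and sensibility",
              "response and responsibility",
              "sense and responsibility"):
        print(s, matches(backref, s), matches(subroutine_like, s))
```

Repeating the group text is equivalent only for non-recursive calls such as this one; it cannot express the recursive patterns of the previous section.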
5589 |
|
|
5590 |
All subroutine calls, whether recursive or not, are always treated as |
All subroutine calls, whether recursive or not, are always treated as |
5591 |
atomic groups. That is, once a subroutine has matched some of the sub- |
atomic groups. That is, once a subroutine has matched some of the sub- |
5592 |
ject string, it is never re-entered, even if it contains untried alter- |
ject string, it is never re-entered, even if it contains untried alter- |
5593 |
natives and there is a subsequent matching failure. Any capturing |
natives and there is a subsequent matching failure. Any capturing |
5594 |
parentheses that are set during the subroutine call revert to their |
parentheses that are set during the subroutine call revert to their |
5595 |
previous values afterwards. |
previous values afterwards. |
5596 |
|
|
5597 |
Processing options such as case-independence are fixed when a subpat- |
Processing options such as case-independence are fixed when a subpat- |
5598 |
tern is defined, so if it is used as a subroutine, such options cannot |
tern is defined, so if it is used as a subroutine, such options cannot |
5599 |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
5600 |
|
|
5601 |
(abc)(?i:(?-1)) |
(abc)(?i:(?-1)) |
5602 |
|
|
5603 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
5604 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
5605 |
|
|
5606 |
|
|
5607 |
ONIGURUMA SUBROUTINE SYNTAX |
ONIGURUMA SUBROUTINE SYNTAX |
5608 |
|
|
5609 |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
5610 |
name or a number enclosed either in angle brackets or single quotes, is |
name or a number enclosed either in angle brackets or single quotes, is |
5611 |
an alternative syntax for referencing a subpattern as a subroutine, |
an alternative syntax for referencing a subpattern as a subroutine, |
5612 |
possibly recursively. Here are two of the examples used above, rewrit- |
possibly recursively. Here are two of the examples used above, rewrit- |
5613 |
ten using this syntax: |
ten using this syntax: |
5614 |
|
|
5615 |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
5616 |
(sens|respons)e and \g'1'ibility |
(sens|respons)e and \g'1'ibility |
5617 |
|
|
5618 |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
5619 |
plus or a minus sign it is taken as a relative reference. For example: |
plus or a minus sign it is taken as a relative reference. For example: |
5620 |
|
|
5621 |
(abc)(?i:\g<-1>) |
(abc)(?i:\g<-1>) |
5622 |
|
|
5623 |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
5624 |
synonymous. The former is a back reference; the latter is a subroutine |
synonymous. The former is a back reference; the latter is a subroutine |
5625 |
call. |
call. |
5626 |
|
|
5627 |
|
|
5628 |
CALLOUTS |
CALLOUTS |
5629 |
|
|
5630 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
5631 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
5632 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
5633 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
5634 |
tion. |
tion. |
5635 |
|
|
5636 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
5637 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
5638 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
5639 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
5640 |
all calling out. |
all calling out. |
5641 |
|
|
5642 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
5643 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
5644 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
5645 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
5646 |
points: |
points: |
5647 |
|
|
5648 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
5649 |
|
|
5650 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
5651 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
5652 |
numbered 255. |
numbered 255. |
5653 |
|
|
5654 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
5655 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
5656 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
5657 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
5658 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
5659 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
5660 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
5661 |
|
|
5662 |
|
|
5663 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
5664 |
|
|
5665 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
5666 |
which are described in the Perl documentation as "experimental and sub- |
which are described in the Perl documentation as "experimental and sub- |
5667 |
ject to change or removal in a future version of Perl". It goes on to |
ject to change or removal in a future version of Perl". It goes on to |
5668 |
say: "Their usage in production code should be noted to avoid problems |
say: "Their usage in production code should be noted to avoid problems |
5669 |
during upgrades." The same remarks apply to the PCRE features described |
during upgrades." The same remarks apply to the PCRE features described |
5670 |
in this section. |
in this section. |
5671 |
|
|
5672 |
Since these verbs are specifically related to backtracking, most of |
Since these verbs are specifically related to backtracking, most of |
5673 |
them can be used only when the pattern is to be matched using |
them can be used only when the pattern is to be matched using |
5674 |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
5675 |
(*FAIL), which behaves like a failing negative assertion, they cause an |
(*FAIL), which behaves like a failing negative assertion, they cause an |
5676 |
error if encountered by pcre_dfa_exec(). |
error if encountered by pcre_dfa_exec(). |
5677 |
|
|
5678 |
If any of these verbs are used in an assertion or in a subpattern that |
If any of these verbs are used in an assertion or in a subpattern that |
5679 |
is called as a subroutine (whether or not recursively), their effect is |
is called as a subroutine (whether or not recursively), their effect is |
5680 |
confined to that subpattern; it does not extend to the surrounding pat- |
confined to that subpattern; it does not extend to the surrounding pat- |
5681 |
tern, with one exception: a *MARK that is encountered in a positive |
tern, with one exception: the name from a (*MARK), (*PRUNE), or (*THEN) |
5682 |
assertion is passed back (compare capturing parentheses in assertions). |
that is encountered in a successful positive assertion is passed back |
5683 |
|
when a match succeeds (compare capturing parentheses in assertions). |
5684 |
Note that such subpatterns are processed as anchored at the point where |
Note that such subpatterns are processed as anchored at the point where |
5685 |
they are tested. Note also that Perl's treatment of subroutines is dif- |
they are tested. Note also that Perl's treatment of subroutines is dif- |
5686 |
ferent in some cases. |
ferent in some cases. |
5703 |
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com- |
by setting the PCRE_NO_START_OPTIMIZE option when calling pcre_com- |
5704 |
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT). |
pile() or pcre_exec(), or by starting the pattern with (*NO_START_OPT). |
5705 |
|
|
5706 |
|
Experiments with Perl suggest that it too has similar optimizations, |
5707 |
|
sometimes leading to anomalous results. |
5708 |
|
|
5709 |
Verbs that act immediately |
Verbs that act immediately |
5710 |
|
|
5711 |
The following verbs act as soon as they are encountered. They may not |
The following verbs act as soon as they are encountered. They may not |
5712 |
be followed by a name. |
be followed by a name. |
5713 |
|
|
5714 |
(*ACCEPT) |
(*ACCEPT) |
5715 |
|
|
5716 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
5717 |
of the pattern. However, when it is inside a subpattern that is called |
of the pattern. However, when it is inside a subpattern that is called |
5718 |
as a subroutine, only that subpattern is ended successfully. Matching |
as a subroutine, only that subpattern is ended successfully. Matching |
5719 |
then continues at the outer level. If (*ACCEPT) is inside capturing |
then continues at the outer level. If (*ACCEPT) is inside capturing |
5720 |
parentheses, the data so far is captured. For example: |
parentheses, the data so far is captured. For example: |
5721 |
|
|
5722 |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
5723 |
|
|
5724 |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
5725 |
tured by the outer parentheses. |
tured by the outer parentheses. |
5726 |
|
|
5727 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
5728 |
|
|
5729 |
This verb causes a matching failure, forcing backtracking to occur. It |
This verb causes a matching failure, forcing backtracking to occur. It |
5730 |
is equivalent to (?!) but easier to read. The Perl documentation notes |
is equivalent to (?!) but easier to read. The Perl documentation notes |
5731 |
that it is probably useful only when combined with (?{}) or (??{}). |
that it is probably useful only when combined with (?{}) or (??{}). |
5732 |
Those are, of course, Perl features that are not present in PCRE. The |
Those are, of course, Perl features that are not present in PCRE. The |
5733 |
nearest equivalent is the callout feature, as for example in this pat- |
nearest equivalent is the callout feature, as for example in this pat- |
5734 |
tern: |
tern: |
5735 |
|
|
5736 |
a+(?C)(*FAIL) |
a+(?C)(*FAIL) |
5737 |
|
|
5738 |
A match with the string "aaaa" always fails, but the callout is taken |
A match with the string "aaaa" always fails, but the callout is taken |
5739 |
before each backtrack happens (in this example, 10 times). |
before each backtrack happens (in this example, 10 times). |
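
The figure of 10 can be reproduced with a small model of the backtracking, sketched here in Python. This is an arithmetic illustration under stated assumptions, not a call into PCRE: the function name is hypothetical, and the model assumes that a+ is greedy, gives back one character per backtrack, and is retried at every starting position in the subject.

```python
def count_callouts(subject):
    """Count how often (?C) would be reached for a+(?C)(*FAIL)."""
    calls = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this start.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ first takes the whole run, then gives back one character
        # at a time down to a single "a"; each length reaches (?C).
        calls += run
    return calls


if __name__ == "__main__":
    print(count_callouts("aaaa"))
```

For "aaaa" the four starting positions contribute 4, 3, 2 and 1 callouts respectively.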
5740 |
|
|
5741 |
Recording which path was taken |
Recording which path was taken |
5742 |
|
|
5743 |
There is one verb whose main purpose is to track how a match was |
There is one verb whose main purpose is to track how a match was |
5744 |
arrived at, though it also has a secondary use in conjunction with |
arrived at, though it also has a secondary use in conjunction with |
5745 |
advancing the match starting point (see (*SKIP) below). |
advancing the match starting point (see (*SKIP) below). |
5746 |
|
|
5747 |
(*MARK:NAME) or (*:NAME) |
(*MARK:NAME) or (*:NAME) |
5748 |
|
|
5749 |
A name is always required with this verb. There may be as many |
A name is always required with this verb. There may be as many |
5750 |
instances of (*MARK) as you like in a pattern, and their names do not |
instances of (*MARK) as you like in a pattern, and their names do not |
5751 |
have to be unique. |
have to be unique. |
5752 |
|
|
5753 |
When a match succeeds, the name of the last-encountered (*MARK) is |
When a match succeeds, the name of the last-encountered (*MARK) on the |
5754 |
passed back to the caller via the pcre_extra data structure, as |
matching path is passed back to the caller via the pcre_extra data |
5755 |
described in the section on pcre_extra in the pcreapi documentation. No |
structure, as described in the section on pcre_extra in the pcreapi |
5756 |
data is returned for a partial match. Here is an example of pcretest |
documentation. Here is an example of pcretest output, where the /K mod- |
5757 |
output, where the /K modifier requests the retrieval and outputting of |
ifier requests the retrieval and outputting of (*MARK) data: |
|
(*MARK) data: |
|
5758 |
|
|
5759 |
/X(*MARK:A)Y|X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
5760 |
XY |
data> XY |
5761 |
0: XY |
0: XY |
5762 |
MK: A |
MK: A |
5763 |
XZ |
XZ |
5773 |
and passed back if it is the last-encountered. This does not happen for |
and passed back if it is the last-encountered. This does not happen for |
5774 |
negative assertions. |
negative assertions. |
5775 |
|
|
5776 |
A name may also be returned after a failed match if the final path |
After a partial match or a failed match, the name of the last encoun- |
5777 |
through the pattern involves (*MARK). However, unless (*MARK) is used in |
tered (*MARK) in the entire match process is returned. For example: |
|
conjunction with (*COMMIT), this is unlikely to happen for an unan- |
|
|
chored pattern because, as the starting point for matching is advanced, |
|
|
the final check is often with an empty string, causing a failure before |
|
|
(*MARK) is reached. For example: |
|
|
|
|
|
/X(*MARK:A)Y|X(*MARK:B)Z/K |
|
|
XP |
|
|
No match |
|
|
|
|
|
There are three potential starting points for this match (starting with |
|
|
X, starting with P, and with an empty string). If the pattern is |
|
|
anchored, the result is different: |
|
5778 |
|
|
5779 |
/^X(*MARK:A)Y|^X(*MARK:B)Z/K |
re> /X(*MARK:A)Y|X(*MARK:B)Z/K |
5780 |
XP |
data> XP |
5781 |
No match, mark = B |
No match, mark = B |
5782 |
|
|
5783 |
PCRE's start-of-match optimizations can also interfere with this. For |
Note that in this unanchored example the mark is retained from the |
5784 |
example, if, as a result of a call to pcre_study(), it knows the mini- |
match attempt that started at the letter "X". Subsequent match attempts |
5785 |
mum subject length for a match, a shorter subject will not be scanned |
starting at "P" and then with an empty string do not get as far as the |
5786 |
at all. |
(*MARK) item, but nevertheless do not reset it. |
|
|
|
|
Note that similar anomalies (though different in detail) exist in Perl, |
|
|
no doubt for the same reasons. The use of (*MARK) data after a failed |
|
|
match of an unanchored pattern is not recommended, unless (*COMMIT) is |
|
|
involved. |
|
5787 |
|
|
5788 |
Verbs that act after backtracking |
Verbs that act after backtracking |
5789 |
|
|
5790 |
The following verbs do nothing when they are encountered. Matching con- |
The following verbs do nothing when they are encountered. Matching con- |
5791 |
tinues with what follows, but if there is no subsequent match, causing |
tinues with what follows, but if there is no subsequent match, causing |
5792 |
a backtrack to the verb, a failure is forced. That is, backtracking |
a backtrack to the verb, a failure is forced. That is, backtracking |
5793 |
cannot pass to the left of the verb. However, when one of these verbs |
cannot pass to the left of the verb. However, when one of these verbs |
5794 |
appears inside an atomic group, its effect is confined to that group, |
appears inside an atomic group, its effect is confined to that group, |
5795 |
because once the group has been matched, there is never any backtrack- |
because once the group has been matched, there is never any backtrack- |
5796 |
ing into it. In this situation, backtracking can "jump back" to the |
ing into it. In this situation, backtracking can "jump back" to the |
5797 |
left of the entire atomic group. (Remember also, as stated above, that |
left of the entire atomic group. (Remember also, as stated above, that |
5798 |
this localization also applies in subroutine calls and assertions.) |
this localization also applies in subroutine calls and assertions.) |
5799 |
|
|
5800 |
These verbs differ in exactly what kind of failure occurs when back- |
These verbs differ in exactly what kind of failure occurs when back- |
5801 |
tracking reaches them. |
tracking reaches them. |
5802 |
|
|
5803 |
(*COMMIT) |
(*COMMIT) |
5804 |
|
|
5805 |
This verb, which may not be followed by a name, causes the whole match |
This verb, which may not be followed by a name, causes the whole match |
5806 |
to fail outright if the rest of the pattern does not match. Even if the |
to fail outright if the rest of the pattern does not match. Even if the |
5807 |
pattern is unanchored, no further attempts to find a match by advancing |
pattern is unanchored, no further attempts to find a match by advancing |
5808 |
the starting point take place. Once (*COMMIT) has been passed, |
the starting point take place. Once (*COMMIT) has been passed, |
5809 |
pcre_exec() is committed to finding a match at the current starting |
pcre_exec() is committed to finding a match at the current starting |
5810 |
point, or not at all. For example: |
point, or not at all. For example: |
5811 |
|
|
5812 |
a+(*COMMIT)b |
a+(*COMMIT)b |
5813 |
|
|
5814 |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
5815 |
of dynamic anchor, or "I've started, so I must finish." The name of the |
of dynamic anchor, or "I've started, so I must finish." The name of the |
5816 |
most recently passed (*MARK) in the path is passed back when (*COMMIT) |
most recently passed (*MARK) in the path is passed back when (*COMMIT) |
5817 |
forces a match failure. |
forces a match failure. |
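
The effect of (*COMMIT) in a+(*COMMIT)b can be imitated in Python, whose re module has no backtracking verbs. The sketch below is an emulation under that assumption, with a hypothetical function name: it advances the starting point by hand and gives up altogether once a+ has matched and "b" does not follow.

```python
import re

A_RUN = re.compile(r"a+")


def commit_search(subject):
    """Emulate a+(*COMMIT)b; return the matched text or None."""
    for start in range(len(subject) + 1):
        m = A_RUN.match(subject, start)
        if m is None:
            # Failed before reaching (*COMMIT): normal bumpalong.
            continue
        # (*COMMIT) has been passed: match here or not at all.
        if subject.startswith("b", m.end()):
            return subject[start:m.end() + 1]
        return None
    return None


if __name__ == "__main__":
    print(commit_search("xxaab"))
    print(commit_search("aacaab"))
    # Without the verb, an ordinary search finds the later "aab".
    print(re.search(r"a+b", "aacaab").group())
```

In "aacaab" the first attempt matches "aa" and then meets "c", so the emulation stops there, whereas the plain search moves on and succeeds further along the subject.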
5818 |
|
|
5819 |
Note that (*COMMIT) at the start of a pattern is not the same as an |
Note that (*COMMIT) at the start of a pattern is not the same as an |
5820 |
anchor, unless PCRE's start-of-match optimizations are turned off, as |
anchor, unless PCRE's start-of-match optimizations are turned off, as |
5821 |
shown in this pcretest example: |
shown in this pcretest example: |
5822 |
|
|
5823 |
/(*COMMIT)abc/ |
re> /(*COMMIT)abc/ |
5824 |
xyzabc |
data> xyzabc |
5825 |
0: abc |
0: abc |
5826 |
xyzabc\Y |
xyzabc\Y |
5827 |
No match |
No match |
5828 |
|
|
5829 |
PCRE knows that any match must start with "a", so the optimization |
PCRE knows that any match must start with "a", so the optimization |
5830 |
skips along the subject to "a" before running the first match attempt, |
skips along the subject to "a" before running the first match attempt, |
5831 |
which succeeds. When the optimization is disabled by the \Y escape in |
which succeeds. When the optimization is disabled by the \Y escape in |
5832 |
the second subject, the match starts at "x" and so the (*COMMIT) causes |
the second subject, the match starts at "x" and so the (*COMMIT) causes |
5833 |
it to fail without trying any other starting points. |
it to fail without trying any other starting points. |
5834 |
|
|
5835 |
(*PRUNE) or (*PRUNE:NAME) |
(*PRUNE) or (*PRUNE:NAME) |
5836 |
|
|
5837 |
This verb causes the match to fail at the current starting position in |
This verb causes the match to fail at the current starting position in |
5838 |
the subject if the rest of the pattern does not match. If the pattern |
the subject if the rest of the pattern does not match. If the pattern |
5839 |
is unanchored, the normal "bumpalong" advance to the next starting |
is unanchored, the normal "bumpalong" advance to the next starting |
5840 |
character then happens. Backtracking can occur as usual to the left of |
character then happens. Backtracking can occur as usual to the left of |
5841 |
(*PRUNE), before it is reached, or when matching to the right of |
(*PRUNE), before it is reached, or when matching to the right of |
5842 |
(*PRUNE), but if there is no match to the right, backtracking cannot |
(*PRUNE), but if there is no match to the right, backtracking cannot |
5843 |
cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter- |
cross (*PRUNE). In simple cases, the use of (*PRUNE) is just an alter- |
5844 |
native to an atomic group or possessive quantifier, but there are some |
native to an atomic group or possessive quantifier, but there are some |
5845 |
uses of (*PRUNE) that cannot be expressed in any other way. The behav- |
uses of (*PRUNE) that cannot be expressed in any other way. The behav- |
5846 |
iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE) when the |
iour of (*PRUNE:NAME) is the same as (*MARK:NAME)(*PRUNE). In an |
5847 |
match fails completely; the name is passed back if this is the final |
anchored pattern (*PRUNE) has the same effect as (*COMMIT). |
|
attempt. (*PRUNE:NAME) does not pass back a name if the match suc- |
|
|
ceeds. In an anchored pattern (*PRUNE) has the same effect as (*COM- |
|
|
MIT). |
|
5848 |
|
|
5849 |
(*SKIP) |
(*SKIP) |
5850 |
|
|
5871 |
is searched for the most recent (*MARK) that has the same name. If one |
is searched for the most recent (*MARK) that has the same name. If one |
5872 |
is found, the "bumpalong" advance is to the subject position that cor- |
is found, the "bumpalong" advance is to the subject position that cor- |
5873 |
responds to that (*MARK) instead of to where (*SKIP) was encountered. |
responds to that (*MARK) instead of to where (*SKIP) was encountered. |
5874 |
If no (*MARK) with a matching name is found, normal "bumpalong" of one |
If no (*MARK) with a matching name is found, the (*SKIP) is ignored. |
|
character happens (that is, the (*SKIP) is ignored). |
|
5875 |
|
|
5876 |
(*THEN) or (*THEN:NAME) |
(*THEN) or (*THEN:NAME) |
5877 |
|
|
5878 |
This verb causes a skip to the next innermost alternative if the rest |
This verb causes a skip to the next innermost alternative if the rest |
5879 |
of the pattern does not match. That is, it cancels pending backtrack- |
of the pattern does not match. That is, it cancels pending backtrack- |
5880 |
ing, but only within the current alternative. Its name comes from the |
ing, but only within the current alternative. Its name comes from the |
5881 |
observation that it can be used for a pattern-based if-then-else block: |
observation that it can be used for a pattern-based if-then-else block: |
5882 |
|
|
5883 |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
5884 |
|
|
5885 |
If the COND1 pattern matches, FOO is tried (and possibly further items |
If the COND1 pattern matches, FOO is tried (and possibly further items |
5886 |
after the end of the group if FOO succeeds); on failure, the matcher |
after the end of the group if FOO succeeds); on failure, the matcher |
5887 |
skips to the second alternative and tries COND2, without backtracking |
skips to the second alternative and tries COND2, without backtracking |
5888 |
into COND1. The behaviour of (*THEN:NAME) is exactly the same as |
into COND1. The behaviour of (*THEN:NAME) is exactly the same as |
5889 |
(*MARK:NAME)(*THEN) if the overall match fails. If (*THEN) is not |
(*MARK:NAME)(*THEN). If (*THEN) is not inside an alternation, it acts |
5890 |
inside an alternation, it acts like (*PRUNE). |
like (*PRUNE). |
5891 |
|
|
5892 |
Note that a subpattern that does not contain a | character is just a |
Note that a subpattern that does not contain a | character is just a |
5893 |
part of the enclosing alternative; it is not a nested alternation with |
part of the enclosing alternative; it is not a nested alternation with |
5894 |
only one alternative. The effect of (*THEN) extends beyond such a sub- |
only one alternative. The effect of (*THEN) extends beyond such a sub- |
5895 |
pattern to the enclosing alternative. Consider this pattern, where A, |
pattern to the enclosing alternative. Consider this pattern, where A, |
5896 |
B, etc. are complex pattern fragments that do not contain any | charac- |
B, etc. are complex pattern fragments that do not contain any | charac- |
5897 |
ters at this level: |
ters at this level: |
5898 |
|
|
5899 |
A (B(*THEN)C) | D |
A (B(*THEN)C) | D |
5900 |
|
|
5901 |
If A and B are matched, but there is a failure in C, matching does not |
If A and B are matched, but there is a failure in C, matching does not |
5902 |
backtrack into A; instead it moves to the next alternative, that is, D. |
backtrack into A; instead it moves to the next alternative, that is, D. |
5903 |
However, if the subpattern containing (*THEN) is given an alternative, |
However, if the subpattern containing (*THEN) is given an alternative, |
5904 |
it behaves differently: |
it behaves differently: |
5905 |
|
|
5906 |
A (B(*THEN)C | (*FAIL)) | D |
A (B(*THEN)C | (*FAIL)) | D |
5907 |
|
|
5908 |
The effect of (*THEN) is now confined to the inner subpattern. After a |
The effect of (*THEN) is now confined to the inner subpattern. After a |
5909 |
failure in C, matching moves to (*FAIL), which causes the whole subpat- |
failure in C, matching moves to (*FAIL), which causes the whole subpat- |
5910 |
tern to fail because there are no more alternatives to try. In this |
tern to fail because there are no more alternatives to try. In this |
5911 |
case, matching does now backtrack into A. |
case, matching does now backtrack into A. |
5912 |
|
|
5913 |
Note also that a conditional subpattern is not considered as having two |
Note also that a conditional subpattern is not considered as having two |
5914 |
alternatives, because only one is ever used. In other words, the | |
alternatives, because only one is ever used. In other words, the | |
5915 |
character in a conditional subpattern has a different meaning. Ignoring |
character in a conditional subpattern has a different meaning. Ignoring |
5916 |
white space, consider: |
white space, consider: |
5917 |
|
|
5918 |
^.*? (?(?=a) a | b(*THEN)c ) |
^.*? (?(?=a) a | b(*THEN)c ) |
5919 |
|
|
5920 |
If the subject is "ba", this pattern does not match. Because .*? is |
If the subject is "ba", this pattern does not match. Because .*? is |
5921 |
ungreedy, it initially matches zero characters. The condition (?=a) |
ungreedy, it initially matches zero characters. The condition (?=a) |
5922 |
then fails, the character "b" is matched, but "c" is not. At this |
then fails, the character "b" is matched, but "c" is not. At this |
5923 |
point, matching does not backtrack to .*? as might perhaps be expected |
point, matching does not backtrack to .*? as might perhaps be expected |
5924 |
from the presence of the | character. The conditional subpattern is |
from the presence of the | character. The conditional subpattern is |
5925 |
part of the single alternative that comprises the whole pattern, and so |
part of the single alternative that comprises the whole pattern, and so |
5926 |
the match fails. (If there was a backtrack into .*?, allowing it to |
the match fails. (If there was a backtrack into .*?, allowing it to |
5927 |
match "b", the match would succeed.) |
match "b", the match would succeed.) |
5928 |
|
|
5929 |
The verbs just described provide four different "strengths" of control |
The verbs just described provide four different "strengths" of control |
5930 |
when subsequent matching fails. (*THEN) is the weakest, carrying on the |
when subsequent matching fails. (*THEN) is the weakest, carrying on the |
5931 |
match at the next alternative. (*PRUNE) comes next, failing the match |
match at the next alternative. (*PRUNE) comes next, failing the match |
5932 |
at the current starting position, but allowing an advance to the next |
at the current starting position, but allowing an advance to the next |
5933 |
character (for an unanchored pattern). (*SKIP) is similar, except that |
character (for an unanchored pattern). (*SKIP) is similar, except that |
5934 |
the advance may be more than one character. (*COMMIT) is the strongest, |
the advance may be more than one character. (*COMMIT) is the strongest, |
5935 |
causing the entire match to fail. |
causing the entire match to fail. |
5936 |
|
|
5940 |
|
|
5941 |
(A(*COMMIT)B(*THEN)C|D) |
(A(*COMMIT)B(*THEN)C|D) |
5942 |
|
|
5943 |
Once A has matched, PCRE is committed to this match, at the current |
Once A has matched, PCRE is committed to this match, at the current |
5944 |
starting position. If subsequently B matches, but C does not, the nor- |
starting position. If subsequently B matches, but C does not, the nor- |
5945 |
mal (*THEN) action of trying the next alternative (that is, D) does not |
mal (*THEN) action of trying the next alternative (that is, D) does not |
5946 |
happen because (*COMMIT) overrides. |
happen because (*COMMIT) overrides. |
5947 |
|
|
5960 |
|
|
5961 |
REVISION |
REVISION |
5962 |
|
|
5963 |
Last updated: 19 October 2011 |
Last updated: 29 November 2011 |
5964 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
5965 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5966 |
|
|
5967 |
|
|
5968 |
PCRESYNTAX(3) PCRESYNTAX(3) |
PCRESYNTAX(3) PCRESYNTAX(3) |
5969 |
|
|
5970 |
|
|
6333 |
Last updated: 21 November 2010 |
Last updated: 21 November 2010 |
6334 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
6335 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6336 |
|
|
6337 |
|
|
6338 |
PCREUNICODE(3) PCREUNICODE(3) |
PCREUNICODE(3) PCREUNICODE(3) |
6339 |
|
|
6340 |
|
|
6487 |
Last updated: 19 October 2011 |
Last updated: 19 October 2011 |
6488 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
6489 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6490 |
|
|
6491 |
|
|
6492 |
PCREJIT(3) PCREJIT(3) |
PCREJIT(3) PCREJIT(3) |
6493 |
|
|
6494 |
|
|
6529 |
been fully tested. If --enable-jit is set on an unsupported platform, |
been fully tested. If --enable-jit is set on an unsupported platform, |
6530 |
compilation fails. |
compilation fails. |
6531 |
|
|
6532 |
A program can tell if JIT support is available by calling pcre_config() |
A program that is linked with PCRE 8.20 or later can tell if JIT sup- |
6533 |
with the PCRE_CONFIG_JIT option. The result is 1 when JIT is available, |
port is available by calling pcre_config() with the PCRE_CONFIG_JIT |
6534 |
and 0 otherwise. However, a simple program does not need to check this |
option. The result is 1 when JIT is available, and 0 otherwise. How- |
6535 |
in order to use JIT. The API is implemented in a way that falls back to |
ever, a simple program does not need to check this in order to use JIT. |
6536 |
the ordinary PCRE code if JIT is not available. |
The API is implemented in a way that falls back to the ordinary PCRE |
6537 |
|
code if JIT is not available. |
6538 |
|
|
6539 |
|
If your program may sometimes be linked with versions of PCRE that are |
6540 |
|
older than 8.20, but you want to use JIT when it is available, you can |
6541 |
|
test the values of PCRE_MAJOR and PCRE_MINOR, or the existence of a JIT |
6542 |
|
macro such as PCRE_CONFIG_JIT, for compile-time control of your code. |
6543 |
|
|
6544 |
|
|
6545 |
SIMPLE USE OF JIT |
SIMPLE USE OF JIT |
6555 |
no longer needed instead of just freeing it yourself. This |
no longer needed instead of just freeing it yourself. This |
6556 |
ensures that any JIT data is also freed. |
ensures that any JIT data is also freed. |
6557 |
|
|
6558 |
|
For a program that may be linked with pre-8.20 versions of PCRE, you |
6559 |
|
can insert |
6560 |
|
|
6561 |
|
#ifndef PCRE_STUDY_JIT_COMPILE |
6562 |
|
#define PCRE_STUDY_JIT_COMPILE 0 |
6563 |
|
#endif |
6564 |
|
|
6565 |
|
so that no option is passed to pcre_study(), and then use something |
6566 |
|
like this to free the study data: |
6567 |
|
|
6568 |
|
#ifdef PCRE_CONFIG_JIT |
6569 |
|
pcre_free_study(study_ptr); |
6570 |
|
#else |
6571 |
|
pcre_free(study_ptr); |
6572 |
|
#endif |
6573 |
|
|
6574 |
In some circumstances you may need to call additional functions. These |
In some circumstances you may need to call additional functions. These |
6575 |
are described in the section entitled "Controlling the JIT stack" |
are described in the section entitled "Controlling the JIT stack" |
6576 |
below. |
below. |
6609 |
|
|
6610 |
The unsupported pattern items are: |
The unsupported pattern items are: |
6611 |
|
|
6612 |
\C match a single byte; not supported in UTF-8 mode |
\C match a single byte; not supported in UTF-8 mode |
6613 |
(?Cn) callouts |
(?Cn) callouts |
|
(?(<name>)... conditional test on setting of a named subpattern |
|
|
(?(R)... conditional test on whole pattern recursion |
|
|
(?(Rn)... conditional test on recursion, by number |
|
|
(?(R&name)... conditional test on recursion, by name |
|
6614 |
(*COMMIT) ) |
(*COMMIT) ) |
6615 |
(*MARK) ) |
(*MARK) ) |
6616 |
(*PRUNE) ) the backtracking control verbs |
(*PRUNE) ) the backtracking control verbs |
6659 |
large or complicated patterns need more than this. The error |
large or complicated patterns need more than this. The error |
6660 |
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack. |
PCRE_ERROR_JIT_STACKLIMIT is given when there is not enough stack. |
6661 |
Three functions are provided for managing blocks of memory for use as |
Three functions are provided for managing blocks of memory for use as |
6662 |
JIT stacks. |
JIT stacks. There is further discussion about the use of JIT stacks in |
6663 |
|
the section entitled "JIT stack FAQ" below. |
6664 |
|
|
6665 |
The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments |
The pcre_jit_stack_alloc() function creates a JIT stack. Its arguments |
6666 |
are a starting size and a maximum size, and it returns a pointer to an |
are a starting size and a maximum size, and it returns a pointer to an |
6667 |
opaque structure of type pcre_jit_stack, or NULL if there is an error. |
opaque structure of type pcre_jit_stack, or NULL if there is an error. |
6668 |
The pcre_jit_stack_free() function can be used to free a stack that is |
The pcre_jit_stack_free() function can be used to free a stack that is |
6669 |
no longer needed. (For the technically minded: the address space is |
no longer needed. (For the technically minded: the address space is |
6670 |
allocated by mmap or VirtualAlloc.) |
allocated by mmap or VirtualAlloc.) |
6671 |
|
|
6672 |
JIT uses far less memory for recursion than the interpretive code, and |
JIT uses far less memory for recursion than the interpretive code, and |
6673 |
a maximum stack size of 512K to 1M should be more than enough for any |
a maximum stack size of 512K to 1M should be more than enough for any |
6674 |
pattern. |
pattern. |
6675 |
|
|
6676 |
The pcre_assign_jit_stack() function specifies which stack JIT code |
The pcre_assign_jit_stack() function specifies which stack JIT code |
6677 |
should use. Its arguments are as follows: |
should use. Its arguments are as follows: |
6678 |
|
|
6679 |
pcre_extra *extra |
pcre_extra *extra |
6680 |
pcre_jit_callback callback |
pcre_jit_callback callback |
6681 |
void *data |
void *data |
6682 |
|
|
6683 |
The extra argument must be the result of studying a pattern with |
The extra argument must be the result of studying a pattern with |
6684 |
PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the |
PCRE_STUDY_JIT_COMPILE. There are three cases for the values of the |
6685 |
other two arguments: |
other two arguments: |
6686 |
|
|
6687 |
(1) If callback is NULL and data is NULL, an internal 32K block |
(1) If callback is NULL and data is NULL, an internal 32K block |
6696 |
is used; otherwise the return value must be a valid JIT stack, |
is used; otherwise the return value must be a valid JIT stack, |
6697 |
the result of calling pcre_jit_stack_alloc(). |
the result of calling pcre_jit_stack_alloc(). |
6698 |
|
|
6699 |
You may safely assign the same JIT stack to more than one pattern, as |
You may safely assign the same JIT stack to more than one pattern, as |
6700 |
long as they are all matched sequentially in the same thread. In a mul- |
long as they are all matched sequentially in the same thread. In a mul- |
6701 |
tithread application, each thread must use its own JIT stack. |
tithread application, each thread must use its own JIT stack. |
6702 |
|
|
6703 |
Strictly speaking, even more is allowed. You can assign the same stack |
Strictly speaking, even more is allowed. You can assign the same stack |
6704 |
to any number of patterns as long as they are not used for matching by |
to any number of patterns as long as they are not used for matching by |
6705 |
multiple threads at the same time. For example, you can assign the same |
multiple threads at the same time. For example, you can assign the same |
6706 |
stack to all compiled patterns, and use a global mutex in the callback |
stack to all compiled patterns, and use a global mutex in the callback |
6707 |
to wait until the stack is available for use. However, this is an inef- |
to wait until the stack is available for use. However, this is an inef- |
6708 |
ficient solution, and not recommended. |
ficient solution, and not recommended. |
6709 |
|
|
6710 |
This is a suggestion for how a typical multithreaded program might |
This is a suggestion for how a typical multithreaded program might |
6711 |
operate: |
operate: |
6712 |
|
|
6713 |
During thread initialization |
During thread initialization |
6719 |
Use a one-line callback function |
Use a one-line callback function |
6720 |
return thread_local_var |
return thread_local_var |
6721 |
|
|
6722 |
All the functions described in this section do nothing if JIT is not |
All the functions described in this section do nothing if JIT is not |
6723 |
available, and pcre_assign_jit_stack() does nothing unless the extra |
available, and pcre_assign_jit_stack() does nothing unless the extra |
6724 |
argument is non-NULL and points to a pcre_extra block that is the |
argument is non-NULL and points to a pcre_extra block that is the |
6725 |
result of a successful study with PCRE_STUDY_JIT_COMPILE. |
result of a successful study with PCRE_STUDY_JIT_COMPILE. |
6726 |
|
|
6727 |
|
|
6728 |
|
JIT STACK FAQ |
6729 |
|
|
6730 |
|
(1) Why do we need JIT stacks? |
6731 |
|
|
6732 |
|
PCRE (and JIT) is a recursive, depth-first engine, so it needs a stack |
6733 |
|
where the local data of the current node is pushed before checking its |
6734 |
|
child nodes. Allocating real machine stack on some platforms is diffi- |
6735 |
|
cult. For example, the stack chain needs to be updated every time if we |
6736 |
|
extend the stack on PowerPC. Although it is possible, its updating |
6737 |
|
time overhead decreases performance. So we do the recursion in memory. |
6738 |
|
|
6739 |
|
(2) Why don't we simply allocate blocks of memory with malloc()? |
6740 |
|
|
6741 |
|
Modern operating systems have a nice feature: they can reserve an |
6742 |
|
address space instead of allocating memory. We can safely allocate mem- |
6743 |
|
ory pages inside this address space, so the stack could grow without |
6744 |
|
moving memory data (this is important because of pointers). Thus we can |
6745 |
|
allocate 1M address space, and use only a single memory page (usually |
6746 |
|
4K) if that is enough. However, we can still grow up to 1M anytime if |
6747 |
|
needed. |
6748 |
|
|
6749 |
|
(3) Who "owns" a JIT stack? |
6750 |
|
|
6751 |
|
The owner of the stack is the user program, not the JIT studied pattern |
6752 |
|
or anything else. The user program must ensure that if a stack is used |
6753 |
|
by pcre_exec(), (that is, it is assigned to the pattern currently run- |
6754 |
|
ning), that stack must not be used by any other threads (to avoid over- |
6755 |
|
writing the same memory area). The best practice for multithreaded pro- |
6756 |
|
grams is to allocate a stack for each thread, and return this stack |
6757 |
|
through the JIT callback function. |
6758 |
|
|
6759 |
|
(4) When should a JIT stack be freed? |
6760 |
|
|
6761 |
|
You can free a JIT stack at any time, as long as it will not be used by |
6762 |
|
pcre_exec() again. When you assign the stack to a pattern, only a |
6763 |
|
pointer is set. There is no reference counting or any other magic. You |
6764 |
|
can free the patterns and stacks in any order, anytime. Just do not |
6765 |
|
call pcre_exec() with a pattern pointing to an already freed stack, as |
6766 |
|
that will cause SEGFAULT. (Also, do not free a stack currently used by |
6767 |
|
pcre_exec() in another thread). You can also replace the stack for a |
6768 |
|
pattern at any time. You can even free the previous stack before |
6769 |
|
assigning a replacement. |
6770 |
|
|
6771 |
|
(5) Should I allocate/free a stack every time before/after calling |
6772 |
|
pcre_exec()? |
6773 |
|
|
6774 |
|
No, because this is too costly in terms of resources. However, you |
6775 |
|
could implement some clever idea that releases the stack if it is not |
6776 |
|
used in let's say two minutes. The JIT callback can help to achieve this |
6777 |
|
without keeping a list of the currently JIT studied patterns. |
6778 |
|
|
6779 |
|
(6) OK, the stack is for long term memory allocation. But what happens |
6780 |
|
if a pattern causes stack overflow with a stack of 1M? Is that 1M kept |
6781 |
|
until the stack is freed? |
6782 |
|
|
6783 |
|
Especially on embedded systems, it might be a good idea to release mem- |
6784 |
|
ory sometimes without freeing the stack. There is no API for this at |
6785 |
|
the moment. Probably a function call which returns with the currently |
6786 |
|
allocated memory for any stack and another which allows releasing mem- |
6787 |
|
ory (shrinking the stack) would be a good idea if someone needs this. |
6788 |
|
|
6789 |
|
(7) This is too much of a headache. Isn't there any better solution for |
6790 |
|
JIT stack handling? |
6791 |
|
|
6792 |
|
No, thanks to Windows. If POSIX threads were used everywhere, we could |
6793 |
|
throw out this complicated API. |
6794 |
|
|
6795 |
|
|
6796 |
EXAMPLE CODE |
EXAMPLE CODE |
6797 |
|
|
6798 |
This is a single-threaded example that specifies a JIT stack without |
This is a single-threaded example that specifies a JIT stack without |
6824 |
|
|
6825 |
AUTHOR |
AUTHOR |
6826 |
|
|
6827 |
Philip Hazel |
Philip Hazel (FAQ by Zoltan Herczeg) |
6828 |
University Computing Service |
University Computing Service |
6829 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
6830 |
|
|
6831 |
|
|
6832 |
REVISION |
REVISION |
6833 |
|
|
6834 |
Last updated: 19 October 2011 |
Last updated: 26 November 2011 |
6835 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
6836 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6837 |
|
|
6838 |
|
|
6839 |
PCREPARTIAL(3) PCREPARTIAL(3) |
PCREPARTIAL(3) PCREPARTIAL(3) |
6840 |
|
|
6841 |
|
|
7256 |
Last updated: 26 August 2011 |
Last updated: 26 August 2011 |
7257 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
7258 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
7259 |
|
|
7260 |
|
|
7261 |
PCREPRECOMPILE(3) PCREPRECOMPILE(3) |
PCREPRECOMPILE(3) PCREPRECOMPILE(3) |
7262 |
|
|
7263 |
|
|
7387 |
Last updated: 26 August 2011 |
Last updated: 26 August 2011 |
7388 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
7389 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
7390 |
|
|
7391 |
|
|
7392 |
PCREPERFORM(3) PCREPERFORM(3) |
PCREPERFORM(3) PCREPERFORM(3) |
7393 |
|
|
7394 |
|
|
7555 |
Last updated: 16 May 2010 |
Last updated: 16 May 2010 |
7556 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
7557 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
7558 |
|
|
7559 |
|
|
7560 |
PCREPOSIX(3) PCREPOSIX(3) |
PCREPOSIX(3) PCREPOSIX(3) |
7561 |
|
|
7562 |
|
|
7818 |
Last updated: 16 May 2010 |
Last updated: 16 May 2010 |
7819 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
7820 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
7821 |
|
|
7822 |
|
|
7823 |
PCRECPP(3) PCRECPP(3) |
PCRECPP(3) PCRECPP(3) |
7824 |
|
|
7825 |
|
|
8160 |
Last updated: 17 March 2009 |
Last updated: 17 March 2009 |
8161 |
Minor typo fixed: 25 July 2011 |
Minor typo fixed: 25 July 2011 |
8162 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
8163 |
|
|
8164 |
|
|
8165 |
PCRESAMPLE(3) PCRESAMPLE(3) |
PCRESAMPLE(3) PCRESAMPLE(3) |
8166 |
|
|
8167 |
|
|
8272 |
There is no limit to the number of parenthesized subpatterns, but there |
There is no limit to the number of parenthesized subpatterns, but there |
8273 |
can be no more than 65535 capturing subpatterns. |
can be no more than 65535 capturing subpatterns. |
8274 |
|
|
8275 |
|
There is a limit to the number of forward references to subsequent sub- |
8276 |
|
patterns of around 200,000. Repeated forward references with fixed |
8277 |
|
upper limits, for example, (?2){0,100} when subpattern number 2 is to |
8278 |
|
the right, are included in the count. There is no limit to the number |
8279 |
|
of backward references. |
8280 |
|
|
8281 |
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
8282 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
8283 |
|
|
8298 |
|
|
8299 |
REVISION |
REVISION |
8300 |
|
|
8301 |
Last updated: 24 August 2011 |
Last updated: 30 November 2011 |
8302 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
8303 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
8304 |
|
|
8305 |
|
|
8306 |
PCRESTACK(3) PCRESTACK(3) |
PCRESTACK(3) PCRESTACK(3) |
8307 |
|
|
8308 |
|
|
8462 |
Last updated: 26 August 2011 |
Last updated: 26 August 2011 |
8463 |
Copyright (c) 1997-2011 University of Cambridge. |
Copyright (c) 1997-2011 University of Cambridge. |
8464 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
8465 |
|
|
8466 |
|
|