4904 |
so many different ways the + and * repeats can carve up the subject, |
so many different ways the + and * repeats can carve up the subject, |
4905 |
and all have to be tested before failure can be reported. |
and all have to be tested before failure can be reported. |
4906 |
|
|
4907 |
At the end of a match, the values set for any capturing subpatterns are |
At the end of a match, the values of capturing parentheses are those |
4908 |
those from the outermost level of the recursion at which the subpattern |
from the outermost level. If you want to obtain intermediate values, a |
4909 |
value is set. If you want to obtain intermediate values, a callout |
callout function can be used (see below and the pcrecallout documenta- |
4910 |
function can be used (see below and the pcrecallout documentation). If |
tion). If the pattern above is matched against |
|
the pattern above is matched against |
|
4911 |
|
|
4912 |
(ab(cd)ef) |
(ab(cd)ef) |
4913 |
|
|
4914 |
the value for the capturing parentheses is "ef", which is the last |
the value for the inner capturing parentheses (numbered 2) is "ef", |
4915 |
value taken on at the top level. If additional parentheses are added, |
which is the last value taken on at the top level. If a capturing sub- |
4916 |
giving |
pattern is not matched at the top level, its final value is unset, even |
4917 |
|
if it is (temporarily) set at a deeper level. |
4918 |
\( ( ( [^()]++ | (?R) )* ) \) |
|
4919 |
^ ^ |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
4920 |
^ ^ |
to obtain extra memory to store data during a recursion, which it does |
4921 |
|
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
4922 |
the string they capture is "ab(cd)ef", the contents of the top level |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
|
parentheses. If there are more than 15 capturing parentheses in a pat- |
|
|
tern, PCRE has to obtain extra memory to store data during a recursion, |
|
|
which it does by using pcre_malloc, freeing it via pcre_free after- |
|
|
wards. If no memory can be obtained, the match fails with the |
|
|
PCRE_ERROR_NOMEMORY error. |
|
4923 |
|
|
4924 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
4925 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
5033 |
two strings. Another example is given in the discussion of DEFINE |
two strings. Another example is given in the discussion of DEFINE |
5034 |
above. |
above. |
5035 |
|
|
5036 |
Like recursive subpatterns, a "subroutine" call is always treated as an |
Like recursive subpatterns, a subroutine call is always treated as an |
5037 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
5038 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
5039 |
there is a subsequent matching failure. |
there is a subsequent matching failure. Any capturing parentheses that |
5040 |
|
are set during the subroutine call revert to their previous values |
5041 |
|
afterwards. |
5042 |
|
|
5043 |
When a subpattern is used as a subroutine, processing options such as |
When a subpattern is used as a subroutine, processing options such as |
5044 |
case-independence are fixed when the subpattern is defined. They cannot |
case-independence are fixed when the subpattern is defined. They cannot |
5121 |
(*FAIL), which behaves like a failing negative assertion, they cause an |
(*FAIL), which behaves like a failing negative assertion, they cause an |
5122 |
error if encountered by pcre_dfa_exec(). |
error if encountered by pcre_dfa_exec(). |
5123 |
|
|
5124 |
If any of these verbs are used in an assertion subpattern, their effect |
If any of these verbs are used in an assertion or subroutine subpattern |
5125 |
is confined to that subpattern; it does not extend to the surrounding |
(including recursive subpatterns), their effect is confined to that |
5126 |
pattern. Note that assertion subpatterns are processed as anchored at |
subpattern; it does not extend to the surrounding pattern. Note that |
5127 |
the point where they are tested. |
such subpatterns are processed as anchored at the point where they are |
5128 |
|
tested. |
5129 |
|
|
5130 |
The new verbs make use of what was previously invalid syntax: an open- |
The new verbs make use of what was previously invalid syntax: an open- |
5131 |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
5132 |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
5133 |
its general form is just (*VERB). Any number of these verbs may occur |
its general form is just (*VERB). Any number of these verbs may occur |
5134 |
in a pattern. There are two kinds: |
in a pattern. There are two kinds: |
5135 |
|
|
5136 |
Verbs that act immediately |
Verbs that act immediately |
5139 |
|
|
5140 |
(*ACCEPT) |
(*ACCEPT) |
5141 |
|
|
5142 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
5143 |
of the pattern. When inside a recursion, only the innermost pattern is |
of the pattern. When inside a recursion, only the innermost pattern is |
5144 |
ended immediately. If (*ACCEPT) is inside capturing parentheses, the |
ended immediately. If (*ACCEPT) is inside capturing parentheses, the |
5145 |
data so far is captured. (This feature was added to PCRE at release |
data so far is captured. (This feature was added to PCRE at release |
5146 |
8.00.) For example: |
8.00.) For example: |
5147 |
|
|
5148 |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
5149 |
|
|
5150 |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
5151 |
tured by the outer parentheses. |
tured by the outer parentheses. |
5152 |
|
|
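The capture behaviour in this example can be reproduced without the verb. The sketch below uses Python's re module, which is not PCRE and has no (*ACCEPT); the rewritten pattern A(B|(?:A|C)D) is an assumption that holds for this particular pattern only, because the three alternatives start with different characters. The helper name outer_capture is hypothetical.

```python
import re

# Verb-free rewrite of A((?:A|B(*ACCEPT)|C)D): the B branch ends the
# match at once, so it is lifted out of the part that requires "D".
REWRITTEN = re.compile(r"A(B|(?:A|C)D)")

def outer_capture(subject):
    """Return (matched text, outer group) or None when there is no match."""
    m = REWRITTEN.match(subject)
    return (m.group(0), m.group(1)) if m else None

for subject in ("AB", "AAD", "ACD", "AD"):
    print(subject, outer_capture(subject))
```

For "AB" the outer group holds "B", which is what the outer parentheses capture in the (*ACCEPT) version.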
5153 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
5154 |
|
|
5155 |
This verb causes the match to fail, forcing backtracking to occur. It |
This verb causes the match to fail, forcing backtracking to occur. It |
5156 |
is equivalent to (?!) but easier to read. The Perl documentation notes |
is equivalent to (?!) but easier to read. The Perl documentation notes |
5157 |
that it is probably useful only when combined with (?{}) or (??{}). |
that it is probably useful only when combined with (?{}) or (??{}). |
5158 |
Those are, of course, Perl features that are not present in PCRE. The |
Those are, of course, Perl features that are not present in PCRE. The |
5159 |
nearest equivalent is the callout feature, as for example in this pat- |
nearest equivalent is the callout feature, as for example in this pat- |
5160 |
tern: |
tern: |
5161 |
|
|
5162 |
a+(?C)(*FAIL) |
a+(?C)(*FAIL) |
5163 |
|
|
5164 |
A match with the string "aaaa" always fails, but the callout is taken |
A match with the string "aaaa" always fails, but the callout is taken |
5165 |
before each backtrack happens (in this example, 10 times). |
before each backtrack happens (in this example, 10 times). |
5166 |
|
|
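The figure of 10 can be checked by hand. The following Python sketch does not call PCRE; it only simulates how often the callout in a+(?C)(*FAIL) would be reached, assuming a greedy a+ and no start-of-match optimizations. The function name count_callouts is hypothetical.

```python
def count_callouts(subject):
    """Count how often the callout after a+ is reached before (*FAIL)."""
    callouts = 0
    for start in range(len(subject)):
        # Length of the run of "a" characters beginning at this start offset.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Greedy a+ first takes the whole run, then gives back one
        # character at a time; every length from run down to 1 reaches
        # the callout once and is then rejected by (*FAIL).
        callouts += run
    return callouts

print(count_callouts("aaaa"))
```

For "aaaa" the four starting offsets contribute 4, 3, 2 and 1 callouts respectively.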
5167 |
Verbs that act after backtracking |
Verbs that act after backtracking |
5168 |
|
|
5169 |
The following verbs do nothing when they are encountered. Matching con- |
The following verbs do nothing when they are encountered. Matching con- |
5170 |
tinues with what follows, but if there is no subsequent match, a fail- |
tinues with what follows, but if there is no subsequent match, a fail- |
5171 |
ure is forced. The verbs differ in exactly what kind of failure |
ure is forced. The verbs differ in exactly what kind of failure |
5172 |
occurs. |
occurs. |
5173 |
|
|
5174 |
(*COMMIT) |
(*COMMIT) |
5175 |
|
|
5176 |
This verb causes the whole match to fail outright if the rest of the |
This verb causes the whole match to fail outright if the rest of the |
5177 |
pattern does not match. Even if the pattern is unanchored, no further |
pattern does not match. Even if the pattern is unanchored, no further |
5178 |
attempts to find a match by advancing the starting point take place. |
attempts to find a match by advancing the starting point take place. |
5179 |
Once (*COMMIT) has been passed, pcre_exec() is committed to finding a |
Once (*COMMIT) has been passed, pcre_exec() is committed to finding a |
5180 |
match at the current starting point, or not at all. For example: |
match at the current starting point, or not at all. For example: |
5181 |
|
|
5182 |
a+(*COMMIT)b |
a+(*COMMIT)b |
5183 |
|
|
5184 |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
5185 |
of dynamic anchor, or "I've started, so I must finish." |
of dynamic anchor, or "I've started, so I must finish." |
5186 |
|
|
5187 |
(*PRUNE) |
(*PRUNE) |
5188 |
|
|
5189 |
This verb causes the match to fail at the current position if the rest |
This verb causes the match to fail at the current position if the rest |
5190 |
of the pattern does not match. If the pattern is unanchored, the normal |
of the pattern does not match. If the pattern is unanchored, the normal |
5191 |
"bumpalong" advance to the next starting character then happens. Back- |
"bumpalong" advance to the next starting character then happens. Back- |
5192 |
tracking can occur as usual to the left of (*PRUNE), or when matching |
tracking can occur as usual to the left of (*PRUNE), or when matching |
5193 |
to the right of (*PRUNE), but if there is no match to the right, back- |
to the right of (*PRUNE), but if there is no match to the right, back- |
5194 |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
5195 |
is just an alternative to an atomic group or possessive quantifier, but |
is just an alternative to an atomic group or possessive quantifier, but |
5196 |
there are some uses of (*PRUNE) that cannot be expressed in any other |
there are some uses of (*PRUNE) that cannot be expressed in any other |
5197 |
way. |
way. |
5198 |
|
|
5199 |
(*SKIP) |
(*SKIP) |
5200 |
|
|
5201 |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
5202 |
the "bumpalong" advance is not to the next character, but to the posi- |
the "bumpalong" advance is not to the next character, but to the posi- |
5203 |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
5204 |
that whatever text was matched leading up to it cannot be part of a |
that whatever text was matched leading up to it cannot be part of a |
5205 |
successful match. Consider: |
successful match. Consider: |
5206 |
|
|
5207 |
a+(*SKIP)b |
a+(*SKIP)b |
5208 |
|
|
5209 |
If the subject is "aaaac...", after the first match attempt fails |
If the subject is "aaaac...", after the first match attempt fails |
5210 |
(starting at the first character in the string), the starting point |
(starting at the first character in the string), the starting point |
5211 |
skips on to start the next attempt at "c". Note that a possessive quan- |
skips on to start the next attempt at "c". Note that a possessive quan- |
5212 |
tifier does not have the same effect as this example; although it would |
tifier does not have the same effect as this example; although it would |
5213 |
suppress backtracking during the first match attempt, the second |
suppress backtracking during the first match attempt, the second |
5214 |
attempt would start at the second character instead of skipping on to |
attempt would start at the second character instead of skipping on to |
5215 |
"c". |
"c". |
5216 |
|
|
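The difference between (*COMMIT), (*PRUNE) and (*SKIP) is easiest to see in the starting offsets that are tried. The sketch below is a hand-written simulation of the single pattern a+(VERB)b in Python, not a call into PCRE; the function name search and its verb argument are hypothetical, and start-of-match optimizations are assumed to be absent.

```python
def search(subject, verb=None):
    """Simulate a+(VERB)b; return (match span or None, start offsets tried)."""
    tried = []
    start = 0
    while start < len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                      # greedy a+ takes the whole run
        if end == start:                  # a+ failed, the verb was never reached
            start += 1
            continue
        if end < len(subject) and subject[end] == "b":
            return (start, end + 1), tried
        # "b" failed after the verb. Giving back "a" characters cannot help
        # with this pattern, because a shorter run is followed by "a".
        if verb == "COMMIT":
            return None, tried            # no further starting points
        if verb == "SKIP":
            start = end                   # resume where (*SKIP) was encountered
        else:
            start += 1                    # plain pattern or (*PRUNE): bumpalong
    return None, tried

for verb in (None, "PRUNE", "SKIP", "COMMIT"):
    print(verb, search("aaaac", verb))
print(search("xxaab", "COMMIT"), search("aacaab", "COMMIT"))
```

With the subject "aaaac", the plain and (*PRUNE) runs try every offset from 0 to 4, the (*SKIP) run tries only offsets 0 and 4, and the (*COMMIT) run stops after offset 0.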
5217 |
(*THEN) |
(*THEN) |
5218 |
|
|
5219 |
This verb causes a skip to the next alternation if the rest of the pat- |
This verb causes a skip to the next alternation if the rest of the pat- |
5220 |
tern does not match. That is, it cancels pending backtracking, but only |
tern does not match. That is, it cancels pending backtracking, but only |
5221 |
within the current alternation. Its name comes from the observation |
within the current alternation. Its name comes from the observation |
5222 |
that it can be used for a pattern-based if-then-else block: |
that it can be used for a pattern-based if-then-else block: |
5223 |
|
|
5224 |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
5225 |
|
|
5226 |
If the COND1 pattern matches, FOO is tried (and possibly further items |
If the COND1 pattern matches, FOO is tried (and possibly further items |
5227 |
after the end of the group if FOO succeeds); on failure the matcher |
after the end of the group if FOO succeeds); on failure the matcher |
5228 |
skips to the second alternative and tries COND2, without backtracking |
skips to the second alternative and tries COND2, without backtracking |
5229 |
into COND1. If (*THEN) is used outside of any alternation, it acts |
into COND1. If (*THEN) is used outside of any alternation, it acts |
5230 |
exactly like (*PRUNE). |
exactly like (*PRUNE). |
5231 |
|
|
5232 |
|
|
5244 |
|
|
5245 |
REVISION |
REVISION |
5246 |
|
|
5247 |
Last updated: 04 October 2009 |
Last updated: 18 October 2009 |
5248 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5249 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5250 |
|
|
5751 |
|
|
5752 |
PARTIAL MATCHING AND WORD BOUNDARIES |
PARTIAL MATCHING AND WORD BOUNDARIES |
5753 |
|
|
5754 |
If a pattern ends with one of sequences \w or \W, which test for word |
If a pattern ends with one of the sequences \b or \B, which test for word |
5755 |
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter- |
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter- |
5756 |
intuitive results. Consider this pattern: |
intuitive results. Consider this pattern: |
5757 |
|
|
5858 |
data> The date is 23ja\P |
data> The date is 23ja\P |
5859 |
Partial match: 23ja |
Partial match: 23ja |
5860 |
|
|
5861 |
The this stage, an application could discard the text preceding "23ja", |
At this stage, an application could discard the text preceding "23ja", |
5862 |
add on text from the next segment, and call pcre_exec() again. Unlike |
add on text from the next segment, and call pcre_exec() again. Unlike |
5863 |
pcre_dfa_exec(), the entire matching string must always be available, |
pcre_dfa_exec(), the entire matching string must always be available, |
5864 |
and the complete matching process occurs for each call, so more memory |
and the complete matching process occurs for each call, so more memory |
5935 |
|
|
5936 |
4. Patterns that contain alternatives at the top level which do not all |
4. Patterns that contain alternatives at the top level which do not all |
5937 |
start with the same pattern item may not work as expected when |
start with the same pattern item may not work as expected when |
5938 |
pcre_dfa_exec() is used. For example, consider this pattern: |
PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider |
5939 |
|
this pattern: |
5940 |
|
|
5941 |
1234|3789 |
1234|3789 |
5942 |
|
|
5943 |
If the first part of the subject is "ABC123", a partial match of the |
If the first part of the subject is "ABC123", a partial match of the |
5944 |
first alternative is found at offset 3. There is no partial match for |
first alternative is found at offset 3. There is no partial match for |
5945 |
the second alternative, because such a match does not start at the same |
the second alternative, because such a match does not start at the same |
5946 |
point in the subject string. Attempting to continue with the string |
point in the subject string. Attempting to continue with the string |
5947 |
"7890" does not yield a match because only those alternatives that |
"7890" does not yield a match because only those alternatives that |
5948 |
match at one point in the subject are remembered. The problem arises |
match at one point in the subject are remembered. The problem arises |
5949 |
because the start of the second alternative matches within the first |
because the start of the second alternative matches within the first |
5950 |
alternative. There is no problem with anchored patterns or patterns |
alternative. There is no problem with anchored patterns or patterns |
5951 |
such as: |
such as: |
5952 |
|
|
5953 |
1234|ABCD |
1234|ABCD |
5954 |
|
|
5955 |
where no string can be a partial match for both alternatives. This is |
where no string can be a partial match for both alternatives. This is |
5956 |
not a problem if pcre_exec() is used, because the entire match has to |
not a problem if pcre_exec() is used, because the entire match has to |
5957 |
be rerun each time: |
be rerun each time: |
5958 |
|
|
5959 |
re> /1234|3789/ |
re> /1234|3789/ |
5962 |
data> 1237890 |
data> 1237890 |
5963 |
0: 3789 |
0: 3789 |
5964 |
|
|
5965 |
|
Of course, instead of using PCRE_DFA_RESTART, the same technique of re- |
5966 |
|
running the entire match can also be used with pcre_dfa_exec(). Another |
5967 |
|
possibility is to work with two buffers. If a partial match at offset n |
5968 |
|
in the first buffer is followed by "no match" when PCRE_DFA_RESTART is |
5969 |
|
used on the second buffer, you can then try a new match starting at |
5970 |
|
offset n+1 in the first buffer. |
5971 |
|
|
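The technique of re-running the entire match can be sketched as follows. This is Python using the re module rather than PCRE, so partial matching is imitated by a hypothetical helper, partial_offset, that only works for the literal alternatives of the pattern 1234|3789; the names ALTERNATIVES, PATTERN and scan are likewise assumptions made for the illustration.

```python
import re

ALTERNATIVES = ["1234", "3789"]          # literal alternatives of the pattern
PATTERN = re.compile("|".join(re.escape(alt) for alt in ALTERNATIVES))

def partial_offset(text):
    """Earliest offset whose remaining text is a proper prefix of an alternative."""
    for i in range(len(text)):
        tail = text[i:]
        if any(alt.startswith(tail) and len(tail) < len(alt) for alt in ALTERNATIVES):
            return i
    return None

def scan(segments):
    """Feed segments one at a time, re-running the whole match on each call."""
    buffer = ""
    for segment in segments:
        buffer += segment
        m = PATTERN.search(buffer)
        if m:
            return m.group()
        # No complete match: keep only the text from the partial match
        # onwards, discarding everything that precedes it.
        keep = partial_offset(buffer)
        buffer = "" if keep is None else buffer[keep:]
    return None

print(partial_offset("ABC123"))
print(scan(["ABC123", "7890"]))
```

After the first segment the retained text is "123", so the second call searches "1237890" from the beginning and can find the second alternative even though the partial match was for the first one.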
5972 |
|
|
5973 |
AUTHOR |
AUTHOR |
5974 |
|
|
5979 |
|
|
5980 |
REVISION |
REVISION |
5981 |
|
|
5982 |
Last updated: 29 September 2009 |
Last updated: 19 October 2009 |
5983 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5984 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5985 |
|
|