3246 |
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
3247 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
3248 |
\t tab (hex 09) |
\t tab (hex 09) |
3249 |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or back reference |
3250 |
\xhh character with hex code hh |
\xhh character with hex code hh |
3251 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
3252 |
|
|
4051 |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
4052 |
# 1 2 2 3 2 3 4 |
# 1 2 2 3 2 3 4 |
4053 |
|
|
4054 |
A backreference to a numbered subpattern uses the most recent value |
A back reference to a numbered subpattern uses the most recent value |
4055 |
that is set for that number by any subpattern. The following pattern |
that is set for that number by any subpattern. The following pattern |
4056 |
matches "abcabc" or "defdef": |
matches "abcabc" or "defdef": |
4057 |
|
|
4085 |
|
|
4086 |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
4087 |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
4088 |
to capturing parentheses from other parts of the pattern, such as back- |
to capturing parentheses from other parts of the pattern, such as back |
4089 |
references, recursion, and conditions, can be made by name as well as |
references, recursion, and conditions, can be made by name as well as |
4090 |
by number. |
by number. |
4091 |
|
|
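As a rough illustration of referring to a group by name rather than by number, the sketch below uses Python's re module, which accepts the (?P<name>...) form listed above together with its own (?P=name) back reference syntax. Python's engine is not PCRE, so this shows the idea, not PCRE's exact behaviour; the pattern and subject string are assumptions made up for the example.

```python
import re

# (?P<word>...) names the group; (?P=word) is Python's named back reference.
doubled = re.compile(r"\b(?P<word>\w+) (?P=word)\b")

m = doubled.search("it was the the best of times")
repeated_word = m.group("word")   # captured text, retrieved by name
repeated_span = m.group(0)        # the whole doubled phrase
print(repeated_word, "|", repeated_span)
```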
4121 |
that name that matched. This saves searching to find which numbered |
that name that matched. This saves searching to find which numbered |
4122 |
subpattern it was. |
subpattern it was. |
4123 |
|
|
4124 |
If you make a backreference to a non-unique named subpattern from else- |
If you make a back reference to a non-unique named subpattern from |
4125 |
where in the pattern, the one that corresponds to the first occurrence |
elsewhere in the pattern, the one that corresponds to the first occur- |
4126 |
of the name is used. In the absence of duplicate numbers (see the pre- |
rence of the name is used. In the absence of duplicate numbers (see the |
4127 |
vious section) this is the one with the lowest number. If you use a |
previous section) this is the one with the lowest number. If you use a |
4128 |
named reference in a condition test (see the section about conditions |
named reference in a condition test (see the section about conditions |
4129 |
below), either to check whether a subpattern has matched, or to check |
below), either to check whether a subpattern has matched, or to check |
4130 |
for recursion, all subpatterns with the same name are tested. If the |
for recursion, all subpatterns with the same name are tested. If the |
4270 |
mization, or alternatively using ^ to indicate anchoring explicitly. |
mization, or alternatively using ^ to indicate anchoring explicitly. |
4271 |
|
|
4272 |
However, there is one situation where the optimization cannot be used. |
However, there is one situation where the optimization cannot be used. |
4273 |
When .* is inside capturing parentheses that are the subject of a |
When .* is inside capturing parentheses that are the subject of a back |
4274 |
backreference elsewhere in the pattern, a match at the start may fail |
reference elsewhere in the pattern, a match at the start may fail where |
4275 |
where a later one succeeds. Consider, for example: |
a later one succeeds. Consider, for example: |
4276 |
|
|
4277 |
(.*)abc\1 |
(.*)abc\1 |
4278 |
|
|
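A small sketch of why a match at the start can fail where a later one succeeds, using Python's re module as a stand-in for PCRE. The subject string is an assumption chosen for the example.

```python
import re

# At offsets 0, 1 and 2 the group would have to capture text ending just
# before "abc" (for example "xyz123"), and that text does not recur after
# "abc". Only from offset 3 does the captured "123" recur.
m = re.search(r"(.*)abc\1", "xyz123abc123")

match_start = m.start()
captured = m.group(1)
print(match_start, captured)
```

Because the first successful match begins at offset 3, an engine that tried this pattern only at the start of the subject would wrongly report no match.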
4494 |
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
4495 |
syntax or an empty comment (see "Comments" below) can be used. |
syntax or an empty comment (see "Comments" below) can be used. |
4496 |
|
|
4497 |
|
Recursive back references |
4498 |
|
|
4499 |
A back reference that occurs inside the parentheses to which it refers |
A back reference that occurs inside the parentheses to which it refers |
4500 |
fails when the subpattern is first used, so, for example, (a\1) never |
fails when the subpattern is first used, so, for example, (a\1) never |
4501 |
matches. However, such references can be useful inside repeated sub- |
matches. However, such references can be useful inside repeated sub- |
4510 |
to match the back reference. This can be done using alternation, as in |
to match the back reference. This can be done using alternation, as in |
4511 |
the example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of zero. |
4512 |
|
|
4513 |
|
Back references of this type cause the group that they reference to be |
4514 |
|
treated as an atomic group. Once the whole group has been matched, a |
4515 |
|
subsequent matching failure cannot cause backtracking into the middle |
4516 |
|
of the group. |
4517 |
|
|
4518 |
|
|
4519 |
ASSERTIONS |
ASSERTIONS |
4520 |
|
|
4521 |
An assertion is a test on the characters following or preceding the |
An assertion is a test on the characters following or preceding the |
4522 |
current matching point that does not actually consume any characters. |
current matching point that does not actually consume any characters. |
4523 |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
4524 |
described above. |
described above. |
4525 |
|
|
4526 |
More complicated assertions are coded as subpatterns. There are two |
More complicated assertions are coded as subpatterns. There are two |
4527 |
kinds: those that look ahead of the current position in the subject |
kinds: those that look ahead of the current position in the subject |
4528 |
string, and those that look behind it. An assertion subpattern is |
string, and those that look behind it. An assertion subpattern is |
4529 |
matched in the normal way, except that it does not cause the current |
matched in the normal way, except that it does not cause the current |
4530 |
matching position to be changed. |
matching position to be changed. |
4531 |
|
|
4532 |
Assertion subpatterns are not capturing subpatterns, and may not be |
Assertion subpatterns are not capturing subpatterns, and may not be |
4533 |
repeated, because it makes no sense to assert the same thing several |
repeated, because it makes no sense to assert the same thing several |
4534 |
times. If any kind of assertion contains capturing subpatterns within |
times. If any kind of assertion contains capturing subpatterns within |
4535 |
it, these are counted for the purposes of numbering the capturing sub- |
it, these are counted for the purposes of numbering the capturing sub- |
4536 |
patterns in the whole pattern. However, substring capturing is carried |
patterns in the whole pattern. However, substring capturing is carried |
4537 |
out only for positive assertions, because it does not make sense for |
out only for positive assertions, because it does not make sense for |
4538 |
negative assertions. |
negative assertions. |
4539 |
|
|
4540 |
Lookahead assertions |
Lookahead assertions |
4544 |
|
|
4545 |
\w+(?=;) |
\w+(?=;) |
4546 |
|
|
4547 |
matches a word followed by a semicolon, but does not include the semi- |
matches a word followed by a semicolon, but does not include the semi- |
4548 |
colon in the match, and |
colon in the match, and |
4549 |
|
|
4550 |
foo(?!bar) |
foo(?!bar) |
4551 |
|
|
4552 |
matches any occurrence of "foo" that is not followed by "bar". Note |
matches any occurrence of "foo" that is not followed by "bar". Note |
4553 |
that the apparently similar pattern |
that the apparently similar pattern |
4554 |
|
|
4555 |
(?!foo)bar |
(?!foo)bar |
4556 |
|
|
4557 |
does not find an occurrence of "bar" that is preceded by something |
does not find an occurrence of "bar" that is preceded by something |
4558 |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
4559 |
the assertion (?!foo) is always true when the next three characters are |
the assertion (?!foo) is always true when the next three characters are |
4560 |
"bar". A lookbehind assertion is needed to achieve the other effect. |
"bar". A lookbehind assertion is needed to achieve the other effect. |
4561 |
|
|
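The three lookahead patterns above can be tried out with a short sketch. It uses Python's re module, whose lookahead syntax matches the one described here; the subject strings are assumptions chosen for the example.

```python
import re

# \w+(?=;) : the semicolon must follow, but it is not part of the match.
word = re.search(r"\w+(?=;)", "int count;").group()

# foo(?!bar) : only the "foo" that is not followed by "bar" is found.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar : still finds the "bar" in "foobar", because at the point where
# "bar" begins, the next three characters are "bar", not "foo".
bar_start = re.search(r"(?!foo)bar", "foobar").start()

print(word, foo_starts, bar_start)
```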
4562 |
If you want to force a matching failure at some point in a pattern, the |
If you want to force a matching failure at some point in a pattern, the |
4563 |
most convenient way to do it is with (?!) because an empty string |
most convenient way to do it is with (?!) because an empty string |
4564 |
always matches, so an assertion that requires there not to be an empty |
always matches, so an assertion that requires there not to be an empty |
4565 |
string must always fail. The Perl 5.10 backtracking control verb |
string must always fail. The Perl 5.10 backtracking control verb |
4566 |
(*FAIL) or (*F) is essentially a synonym for (?!). |
(*FAIL) or (*F) is essentially a synonym for (?!). |
4567 |
|
|
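A minimal sketch of forcing a failure with (?!), again using Python's re module as a stand-in for PCRE. The (*FAIL) verb is not available there, so only the (?!) form is shown; the patterns are assumptions for illustration.

```python
import re

always_fails = re.compile(r"(?!)")

# The assertion fails at every position, even in an empty subject.
on_empty = always_fails.search("")
on_text = always_fails.search("any subject")

# Appending (?!) to one branch of an alternation disables that branch.
disabled_branch = re.search(r"a|b(?!)", "b")

print(on_empty, on_text, disabled_branch)
```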
4568 |
Lookbehind assertions |
Lookbehind assertions |
4569 |
|
|
4570 |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
4571 |
for negative assertions. For example, |
for negative assertions. For example, |
4572 |
|
|
4573 |
(?<!foo)bar |
(?<!foo)bar |
4574 |
|
|
4575 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
4576 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
4577 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
4578 |
eral top-level alternatives, they do not all have to have the same |
eral top-level alternatives, they do not all have to have the same |
4579 |
fixed length. Thus |
fixed length. Thus |
4580 |
|
|
4581 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
4584 |
|
|
4585 |
(?<!dogs?|cats?) |
(?<!dogs?|cats?) |
4586 |
|
|
4587 |
causes an error at compile time. Branches that match different length |
causes an error at compile time. Branches that match different length |
4588 |
strings are permitted only at the top level of a lookbehind assertion. |
strings are permitted only at the top level of a lookbehind assertion. |
4589 |
This is an extension compared with Perl (5.8 and 5.10), which requires |
This is an extension compared with Perl (5.8 and 5.10), which requires |
4590 |
all branches to match the same length of string. An assertion such as |
all branches to match the same length of string. An assertion such as |
4591 |
|
|
4592 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
4593 |
|
|
4594 |
is not permitted, because its single top-level branch can match two |
is not permitted, because its single top-level branch can match two |
4595 |
different lengths, but it is acceptable to PCRE if rewritten to use two |
different lengths, but it is acceptable to PCRE if rewritten to use two |
4596 |
top-level branches: |
top-level branches: |
4597 |
|
|
4598 |
(?<=abc|abde) |
(?<=abc|abde) |
4599 |
|
|
4600 |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
4601 |
instead of a lookbehind assertion to get round the fixed-length |
instead of a lookbehind assertion to get round the fixed-length |
4602 |
restriction. |
restriction. |
4603 |
|
|
4604 |
The implementation of lookbehind assertions is, for each alternative, |
The implementation of lookbehind assertions is, for each alternative, |
4605 |
to temporarily move the current position back by the fixed length and |
to temporarily move the current position back by the fixed length and |
4606 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
4607 |
rent position, the assertion fails. |
rent position, the assertion fails. |
4608 |
|
|
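The step-back behaviour described above can be observed with a short sketch in Python's re module. Note one assumption about the stand-in engine: Python's re is stricter than PCRE and rejects lookbehinds whose top-level alternatives differ in length, so the example sticks to single fixed-length lookbehinds. The subject strings are made up for the example.

```python
import re

# Negative lookbehind: "bar" is accepted only where "foo" does not precede it.
not_after_foo = re.compile(r"(?<!foo)bar")
starts = [m.start() for m in not_after_foo.finditer("foobar bazbar")]

# Positive lookbehind with too few characters before the current position:
# the engine cannot step back three characters, so the assertion fails.
too_short = re.search(r"(?<=foo)bar", "oobar")

# The same shortage makes the negative form succeed at the very start.
at_start = re.search(r"(?<!foo)bar", "bar").start()

print(starts, too_short, at_start)
```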
4609 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
4610 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
mode) to appear in lookbehind assertions, because it makes it impossi- |
4611 |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
4612 |
which can match different numbers of bytes, are also not permitted. |
which can match different numbers of bytes, are also not permitted. |
4613 |
|
|
4614 |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
"Subroutine" calls (see below) such as (?2) or (?&X) are permitted in |
4615 |
lookbehinds, as long as the subpattern matches a fixed-length string. |
lookbehinds, as long as the subpattern matches a fixed-length string. |
4616 |
Recursion, however, is not supported. |
Recursion, however, is not supported. |
4617 |
|
|
4618 |
Possessive quantifiers can be used in conjunction with lookbehind |
Possessive quantifiers can be used in conjunction with lookbehind |
4619 |
assertions to specify efficient matching of fixed-length strings at the |
assertions to specify efficient matching of fixed-length strings at the |
4620 |
end of subject strings. Consider a simple pattern such as |
end of subject strings. Consider a simple pattern such as |
4621 |
|
|
4622 |
abcd$ |
abcd$ |
4623 |
|
|
4624 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
4625 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
4626 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
4627 |
pattern is specified as |
pattern is specified as |
4628 |
|
|
4629 |
^.*abcd$ |
^.*abcd$ |
4630 |
|
|
4631 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
4632 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
4633 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
4634 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
4635 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
4636 |
|
|
4637 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
4638 |
|
|
4639 |
there can be no backtracking for the .*+ item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
4640 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
4641 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
4642 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
4643 |
processing time. |
processing time. |
4644 |
|
|
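The following sketch compares the plain pattern with the lookbehind form. It rests on two assumptions. First, it uses Python's re module, not PCRE. Second, because possessive quantifiers are only available in Python's re from version 3.11, the possessive .*+ is emulated here with the idiom (?=(.*))\1: the lookahead captures the rest of the line and is not re-entered on failure, so the engine cannot give characters back. The subject strings are invented for the example, and the sketch checks results only, not timing.

```python
import re

plain = re.compile(r"abcd$")

# Emulation of ^.*+(?<=abcd): consume the whole line with no backtracking,
# then test the last four characters once.
anchored = re.compile(r"^(?=(.*))\1(?<=abcd)")

subjects = [
    "x" * 50 + "abcd",   # ends with abcd
    "x" * 50 + "abce",   # near miss at the end
    "abcd" + "x" * 50,   # abcd present, but not at the end
]

plain_results = [bool(plain.search(s)) for s in subjects]
anchored_results = [bool(anchored.search(s)) for s in subjects]
print(plain_results, anchored_results)
```

Both forms agree on every subject; the difference lies in how much of the string is scanned before a failure is reported.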
4645 |
Using multiple assertions |
Using multiple assertions |
4648 |
|
|
4649 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
4650 |
|
|
4651 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
4652 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
4653 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
4654 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
4655 |
three characters are not "999". This pattern does not match "foo" |
three characters are not "999". This pattern does not match "foo" |
4656 |
preceded by six characters, the first three of which are digits and |
preceded by six characters, the first three of which are digits and |
4657 |
the last three of which are not "999". For example, it doesn't match |
the last three of which are not "999". For example, it doesn't match |
4658 |
"123abcfoo". A pattern to do that is |
"123abcfoo". A pattern to do that is |
4659 |
|
|
4660 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
4661 |
|
|
4662 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
4663 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
4664 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
4665 |
|
|
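A short sketch of the two patterns above, run with Python's re module as a stand-in for PCRE. It shows that both assertions are tested at the same point in the subject, immediately before "foo". The subject strings are assumptions chosen for the example.

```python
import re

# Both lookbehinds inspect the three characters directly before "foo".
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first lookbehind reaches back six characters instead.
digits_then_any_three = re.compile(r"(?<=\d{3}...)(?<!999)foo")

first = {s: bool(three_digits_not_999.search(s))
         for s in ("123foo", "999foo", "123abcfoo")}
second = {s: bool(digits_then_any_three.search(s))
          for s in ("123abcfoo", "123999foo")}
print(first)
print(second)
```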
4667 |
|
|
4668 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
4669 |
|
|
4670 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
4671 |
is not preceded by "foo", while |
is not preceded by "foo", while |
4672 |
|
|
4673 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
4674 |
|
|
4675 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
4676 |
three characters that are not "999". |
three characters that are not "999". |
4677 |
|
|
4678 |
|
|
4679 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
4680 |
|
|
4681 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
4682 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
4683 |
on the result of an assertion, or whether a specific capturing subpat- |
on the result of an assertion, or whether a specific capturing subpat- |
4684 |
tern has already been matched. The two possible forms of conditional |
tern has already been matched. The two possible forms of conditional |
4685 |
subpattern are: |
subpattern are: |
4686 |
|
|
4687 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
4688 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
4689 |
|
|
4690 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
4691 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
4692 |
tives in the subpattern, a compile-time error occurs. |
tives in the subpattern, a compile-time error occurs. |
4693 |
|
|
4694 |
There are four kinds of condition: references to subpatterns, refer- |
There are four kinds of condition: references to subpatterns, refer- |
4695 |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
4696 |
|
|
4697 |
Checking for a used subpattern by number |
Checking for a used subpattern by number |
4698 |
|
|
4699 |
If the text between the parentheses consists of a sequence of digits, |
If the text between the parentheses consists of a sequence of digits, |
4700 |
the condition is true if a capturing subpattern of that number has pre- |
the condition is true if a capturing subpattern of that number has pre- |
4701 |
viously matched. If there is more than one capturing subpattern with |
viously matched. If there is more than one capturing subpattern with |
4702 |
the same number (see the earlier section about duplicate subpattern |
the same number (see the earlier section about duplicate subpattern |
4703 |
numbers), the condition is true if any of them have been set. An alter- |
numbers), the condition is true if any of them have been set. An alter- |
4704 |
native notation is to precede the digits with a plus or minus sign. In |
native notation is to precede the digits with a plus or minus sign. In |
4705 |
this case, the subpattern number is relative rather than absolute. The |
this case, the subpattern number is relative rather than absolute. The |
4706 |
most recently opened parentheses can be referenced by (?(-1), the next |
most recently opened parentheses can be referenced by (?(-1), the next |
4707 |
most recent by (?(-2), and so on. In looping constructs it can also |
most recent by (?(-2), and so on. In looping constructs it can also |
4708 |
make sense to refer to subsequent groups with constructs such as |
make sense to refer to subsequent groups with constructs such as |
4709 |
(?(+2). |
(?(+2). |
4710 |
|
|
4711 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
4712 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
4713 |
divide it into three parts for ease of discussion: |
divide it into three parts for ease of discussion: |
4714 |
|
|
4715 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
4716 |
|
|
4717 |
The first part matches an optional opening parenthesis, and if that |
The first part matches an optional opening parenthesis, and if that |
4718 |
character is present, sets it as the first captured substring. The sec- |
character is present, sets it as the first captured substring. The sec- |
4719 |
ond part matches one or more characters that are not parentheses. The |
ond part matches one or more characters that are not parentheses. The |
4720 |
third part is a conditional subpattern that tests whether the first set |
third part is a conditional subpattern that tests whether the first set |
4721 |
of parentheses matched or not. If they did, that is, if the subject started |
of parentheses matched or not. If they did, that is, if the subject started |
4722 |
with an opening parenthesis, the condition is true, and so the yes-pat- |
with an opening parenthesis, the condition is true, and so the yes-pat- |
4723 |
tern is executed and a closing parenthesis is required. Otherwise, |
tern is executed and a closing parenthesis is required. Otherwise, |
4724 |
since no-pattern is not present, the subpattern matches nothing. In |
since no-pattern is not present, the subpattern matches nothing. In |
4725 |
other words, this pattern matches a sequence of non-parentheses, |
other words, this pattern matches a sequence of non-parentheses, |
4726 |
optionally enclosed in parentheses. |
optionally enclosed in parentheses. |
4727 |
|
|
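The three-part pattern can be exercised with a small sketch. Python's re module supports the (?(1)...) condition and, through re.VERBOSE, the same non-significant white space; it is used here as a stand-in for PCRE with PCRE_EXTENDED. The relative form (?(-1)...) is not available in Python's re, so only the absolute form is shown. The subject strings are assumptions.

```python
import re

optionally_parenthesized = re.compile(
    r"( \( )? [^()]+ (?(1) \) )",
    re.VERBOSE,
)

subjects = ["(abc)", "abc", "(abc", "abc)"]

# fullmatch is used so that an unbalanced parenthesis cannot simply be
# left outside the match.
results = {s: bool(optionally_parenthesized.fullmatch(s)) for s in subjects}
print(results)
```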
4728 |
If you were embedding this pattern in a larger one, you could use a |
If you were embedding this pattern in a larger one, you could use a |
4729 |
relative reference: |
relative reference: |
4730 |
|
|
4731 |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
4732 |
|
|
4733 |
This makes the fragment independent of the parentheses in the larger |
This makes the fragment independent of the parentheses in the larger |
4734 |
pattern. |
pattern. |
4735 |
|
|
4736 |
Checking for a used subpattern by name |
Checking for a used subpattern by name |
4737 |
|
|
4738 |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
4739 |
used subpattern by name. For compatibility with earlier versions of |
used subpattern by name. For compatibility with earlier versions of |
4740 |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
4741 |
also recognized. However, there is a possible ambiguity with this syn- |
also recognized. However, there is a possible ambiguity with this syn- |
4742 |
tax, because subpattern names may consist entirely of digits. PCRE |
tax, because subpattern names may consist entirely of digits. PCRE |
4743 |
looks first for a named subpattern; if it cannot find one and the name |
looks first for a named subpattern; if it cannot find one and the name |
4744 |
consists entirely of digits, PCRE looks for a subpattern of that num- |
consists entirely of digits, PCRE looks for a subpattern of that num- |
4745 |
ber, which must be greater than zero. Using subpattern names that con- |
ber, which must be greater than zero. Using subpattern names that con- |
4746 |
sist entirely of digits is not recommended. |
sist entirely of digits is not recommended. |
4747 |
|
|
4748 |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
4749 |
|
|
4750 |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
4751 |
|
|
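For comparison, here is a hedged sketch of the same named test in Python's re module. Python does not accept the Perl forms; it names the group with (?P<OPEN>...) and tests it with (?(OPEN)...), which corresponds to the older (?(name)...) syntax mentioned above. The subject strings are assumptions.

```python
import re

named = re.compile(
    r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )",
    re.VERBOSE,
)

with_parens = named.fullmatch("(abc)")
without_parens = named.fullmatch("abc")
unbalanced = named.fullmatch("(abc")

# The named group is set only when the opening parenthesis was present.
open_captured = with_parens.group("OPEN")
open_missing = without_parens.group("OPEN")
print(open_captured, open_missing, unbalanced)
```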
4752 |
If the name used in a condition of this kind is a duplicate, the test |
If the name used in a condition of this kind is a duplicate, the test |
4753 |
is applied to all subpatterns of the same name, and is true if any one |
is applied to all subpatterns of the same name, and is true if any one |
4754 |
of them has matched. |
of them has matched. |
4755 |
|
|
4756 |
Checking for pattern recursion |
Checking for pattern recursion |
4757 |
|
|
4758 |
If the condition is the string (R), and there is no subpattern with the |
If the condition is the string (R), and there is no subpattern with the |
4759 |
name R, the condition is true if a recursive call to the whole pattern |
name R, the condition is true if a recursive call to the whole pattern |
4760 |
or any subpattern has been made. If digits or a name preceded by amper- |
or any subpattern has been made. If digits or a name preceded by amper- |
4761 |
sand follow the letter R, for example: |
sand follow the letter R, for example: |
4762 |
|
|
4764 |
|
|
4765 |
the condition is true if the most recent recursion is into a subpattern |
the condition is true if the most recent recursion is into a subpattern |
4766 |
whose number or name is given. This condition does not check the entire |
whose number or name is given. This condition does not check the entire |
4767 |
recursion stack. If the name used in a condition of this kind is a |
recursion stack. If the name used in a condition of this kind is a |
4768 |
duplicate, the test is applied to all subpatterns of the same name, and |
duplicate, the test is applied to all subpatterns of the same name, and |
4769 |
is true if any one of them is the most recent recursion. |
is true if any one of them is the most recent recursion. |
4770 |
|
|
4771 |
At "top level", all these recursion test conditions are false. The |
At "top level", all these recursion test conditions are false. The |
4772 |
syntax for recursive patterns is described below. |
syntax for recursive patterns is described below. |
4773 |
|
|
4774 |
Defining subpatterns for use by reference only |
Defining subpatterns for use by reference only |
4775 |
|
|
4776 |
If the condition is the string (DEFINE), and there is no subpattern |
If the condition is the string (DEFINE), and there is no subpattern |
4777 |
with the name DEFINE, the condition is always false. In this case, |
with the name DEFINE, the condition is always false. In this case, |
4778 |
there may be only one alternative in the subpattern. It is always |
there may be only one alternative in the subpattern. It is always |
4779 |
skipped if control reaches this point in the pattern; the idea of |
skipped if control reaches this point in the pattern; the idea of |
4780 |
DEFINE is that it can be used to define "subroutines" that can be ref- |
DEFINE is that it can be used to define "subroutines" that can be ref- |
4781 |
erenced from elsewhere. (The use of "subroutines" is described below.) |
erenced from elsewhere. (The use of "subroutines" is described below.) |
4782 |
For example, a pattern to match an IPv4 address could be written like |
For example, a pattern to match an IPv4 address could be written like |
4783 |
this (ignore whitespace and line breaks): |
this (ignore whitespace and line breaks): |
4784 |
|
|
4785 |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
4786 |
\b (?&byte) (\.(?&byte)){3} \b |
\b (?&byte) (\.(?&byte)){3} \b |
4787 |
|
|
4788 |
The first part of the pattern is a DEFINE group inside which another |
The first part of the pattern is a DEFINE group inside which another |
4789 |
group named "byte" is defined. This matches an individual component of |
group named "byte" is defined. This matches an individual component of |
4790 |
an IPv4 address (a number less than 256). When matching takes place, |
an IPv4 address (a number less than 256). When matching takes place, |
4791 |
this part of the pattern is skipped because DEFINE acts like a false |
this part of the pattern is skipped because DEFINE acts like a false |
4792 |
condition. The rest of the pattern uses references to the named group |
condition. The rest of the pattern uses references to the named group |
4793 |
to match the four dot-separated components of an IPv4 address, insist- |
to match the four dot-separated components of an IPv4 address, insist- |
4794 |
ing on a word boundary at each end. |
ing on a word boundary at each end. |
4795 |
|
|
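Python's re module has neither (?(DEFINE)...) nor (?&name) calls, so the sketch below swaps in a different technique: the "byte" fragment is kept in an ordinary string and interpolated wherever the PCRE pattern would call the subroutine. The name BYTE and the subject strings are assumptions for illustration; only the alternatives inside the fragment are taken from the pattern above.

```python
import re

# Same alternatives as the "byte" group, written as a non-capturing group.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# One byte, then three more preceded by dots, with word boundaries at
# both ends; {{3}} is a literal {3} inside the f-string.
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

found = ipv4.search("host 10.0.0.255 up").group()
rejected = ipv4.fullmatch("256.1.1.1")
print(found, rejected)
```

Unlike a true subroutine call, interpolation repeats the fragment's text four times in the compiled pattern; the matching behaviour for this example is the same.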
4796 |
Assertion conditions |
Assertion conditions |
4797 |
|
|
4798 |
If the condition is not in any of the above formats, it must be an |
If the condition is not in any of the above formats, it must be an |
4799 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
4800 |
assertion. Consider this pattern, again containing non-significant |
assertion. Consider this pattern, again containing non-significant |
4801 |
white space, and with the two alternatives on the second line: |
white space, and with the two alternatives on the second line: |
4802 |
|
|
4803 |
(?(?=[^a-z]*[a-z]) |
(?(?=[^a-z]*[a-z]) |
4804 |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
4805 |
|
|
4806 |
The condition is a positive lookahead assertion that matches an |
The condition is a positive lookahead assertion that matches an |
4807 |
optional sequence of non-letters followed by a letter. In other words, |
optional sequence of non-letters followed by a letter. In other words, |
4808 |
it tests for the presence of at least one letter in the subject. If a |
it tests for the presence of at least one letter in the subject. If a |
4809 |
letter is found, the subject is matched against the first alternative; |
letter is found, the subject is matched against the first alternative; |
4810 |
otherwise it is matched against the second. This pattern matches |
otherwise it is matched against the second. This pattern matches |
4811 |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are |
4812 |
letters and dd are digits. |
letters and dd are digits. |
4813 |
|
|
4814 |
|
|
4815 |
COMMENTS |
COMMENTS |
4816 |
|
|
4817 |
The sequence (?# marks the start of a comment that continues up to the |
The sequence (?# marks the start of a comment that continues up to the |
4818 |
next closing parenthesis. Nested parentheses are not permitted. The |
next closing parenthesis. Nested parentheses are not permitted. The |
4819 |
characters that make up a comment play no part in the pattern matching |
characters that make up a comment play no part in the pattern matching |
4820 |
at all. |
at all. |
4821 |
|
|
4822 |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
If the PCRE_EXTENDED option is set, an unescaped # character outside a |
4823 |
character class introduces a comment that continues to immediately |
character class introduces a comment that continues to immediately |
4824 |
after the next newline in the pattern. |
after the next newline in the pattern. |
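
Both comment forms can be tried out in Python's standard re module, which accepts (?#...) and treats its re.VERBOSE flag much like PCRE_EXTENDED. This is a sketch in a different engine, so details may differ from PCRE; the names INLINE and EXTENDED are hypothetical.

```python
import re

# Inline (?#...) comment: ignored by the matcher, ends at the first ")".
INLINE = re.compile(r"\d{2}(?# two-digit day)-\d{2}")

# re.VERBOSE plays the role of PCRE_EXTENDED here: an unescaped "#" outside
# a character class starts a comment that runs to the end of the line.
EXTENDED = re.compile(
    r"""
    \d{2}   # two-digit day
    -       # separator
    [#]     # a literal "#": inside a character class it is not a comment
    \d{2}   # two-digit month
    """,
    re.VERBOSE,
)
```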
4825 |
|
|
4826 |
|
|
4827 |
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
4828 |
|
|
4829 |
Consider the problem of matching a string in parentheses, allowing for |
Consider the problem of matching a string in parentheses, allowing for |
4830 |
unlimited nested parentheses. Without the use of recursion, the best |
unlimited nested parentheses. Without the use of recursion, the best |
4831 |
that can be done is to use a pattern that matches up to some fixed |
that can be done is to use a pattern that matches up to some fixed |
4832 |
depth of nesting. It is not possible to handle an arbitrary nesting |
depth of nesting. It is not possible to handle an arbitrary nesting |
4833 |
depth. |
depth. |
4834 |
|
|
4835 |
For some time, Perl has provided a facility that allows regular expres- |
For some time, Perl has provided a facility that allows regular expres- |
4836 |
sions to recurse (amongst other things). It does this by interpolating |
sions to recurse (amongst other things). It does this by interpolating |
4837 |
Perl code in the expression at run time, and the code can refer to the |
Perl code in the expression at run time, and the code can refer to the |
4838 |
expression itself. A Perl pattern using code interpolation to solve the |
expression itself. A Perl pattern using code interpolation to solve the |
4839 |
parentheses problem can be created like this: |
parentheses problem can be created like this: |
4840 |
|
|
4844 |
refers recursively to the pattern in which it appears. |
refers recursively to the pattern in which it appears. |
4845 |
|
|
4846 |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
4847 |
it supports special syntax for recursion of the entire pattern, and |
it supports special syntax for recursion of the entire pattern, and |
4848 |
also for individual subpattern recursion. After its introduction in |
also for individual subpattern recursion. After its introduction in |
4849 |
PCRE and Python, this kind of recursion was subsequently introduced |
PCRE and Python, this kind of recursion was subsequently introduced |
4850 |
into Perl at release 5.10. |
into Perl at release 5.10. |
4851 |
|
|
4852 |
A special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
4853 |
zero and a closing parenthesis is a recursive call of the subpattern of |
zero and a closing parenthesis is a recursive call of the subpattern of |
4854 |
the given number, provided that it occurs inside that subpattern. (If |
the given number, provided that it occurs inside that subpattern. (If |
4855 |
not, it is a "subroutine" call, which is described in the next sec- |
not, it is a "subroutine" call, which is described in the next sec- |
4856 |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
4857 |
regular expression. |
regular expression. |
4858 |
|
|
4859 |
This PCRE pattern solves the nested parentheses problem (assume the |
This PCRE pattern solves the nested parentheses problem (assume the |
4860 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
4861 |
|
|
4862 |
\( ( [^()]++ | (?R) )* \) |
\( ( [^()]++ | (?R) )* \) |
4863 |
|
|
4864 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
4865 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
4866 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
4867 |
sized substring). Finally there is a closing parenthesis. Note the use |
sized substring). Finally there is a closing parenthesis. Note the use |
4868 |
of a possessive quantifier to avoid backtracking into sequences of non- |
of a possessive quantifier to avoid backtracking into sequences of non- |
4869 |
parentheses. |
parentheses. |
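
The control flow of this recursive pattern can be sketched as an ordinary recursive function. The Python below does not use a regex engine (Python's standard re module has no (?R) item); match_parens is a hypothetical helper that mirrors the pattern anchored at a given position, with the recursive call standing in for (?R).

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    # \( : the group must open here
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # \) : closing parenthesis ends this level
            return pos + 1
        if ch == "(":
            # (?R) : a nested, correctly parenthesized substring
            pos = match_parens(subject, pos)
            if pos == -1:
                return -1
        else:
            # [^()]++ : consume a non-parenthesis, never giving it back
            pos += 1
    return -1  # ran out of subject before the closing parenthesis
```

For example, match_parens("(ab(cd)ef)") returns 10, the length of the subject, while match_parens("(ab(cd)ef") returns -1.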
4870 |
|
|
4871 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
4872 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
4873 |
|
|
4874 |
( \( ( [^()]++ | (?1) )* \) ) |
( \( ( [^()]++ | (?1) )* \) ) |
4875 |
|
|
4876 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
4877 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
4878 |
|
|
4879 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
4880 |
tricky. This is made easier by the use of relative references (a Perl |
tricky. This is made easier by the use of relative references (a Perl |
4881 |
5.10 feature). Instead of (?1) in the pattern above you can write |
5.10 feature). Instead of (?1) in the pattern above you can write |
4882 |
(?-2) to refer to the second most recently opened parentheses preceding |
(?-2) to refer to the second most recently opened parentheses preceding |
4883 |
the recursion. In other words, a negative number counts capturing |
the recursion. In other words, a negative number counts capturing |
4884 |
parentheses leftwards from the point at which it is encountered. |
parentheses leftwards from the point at which it is encountered. |
4885 |
|
|
4886 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
4887 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
4888 |
because the reference is not inside the parentheses that are refer- |
because the reference is not inside the parentheses that are refer- |
4889 |
enced. They are always "subroutine" calls, as described in the next |
enced. They are always "subroutine" calls, as described in the next |
4890 |
section. |
section. |
4891 |
|
|
4892 |
An alternative approach is to use named parentheses instead. The Perl |
An alternative approach is to use named parentheses instead. The Perl |
4893 |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
4894 |
supported. We could rewrite the above example as follows: |
supported. We could rewrite the above example as follows: |
4895 |
|
|
4896 |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
(?<pn> \( ( [^()]++ | (?&pn) )* \) ) |
4897 |
|
|
4898 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
4899 |
one is used. |
one is used. |
4900 |
|
|
4901 |
This particular example pattern that we have been looking at contains |
This particular example pattern that we have been looking at contains |
4902 |
nested unlimited repeats, and so the use of a possessive quantifier for |
nested unlimited repeats, and so the use of a possessive quantifier for |
4903 |
matching strings of non-parentheses is important when applying the pat- |
matching strings of non-parentheses is important when applying the pat- |
4904 |
tern to strings that do not match. For example, when this pattern is |
tern to strings that do not match. For example, when this pattern is |
4905 |
applied to |
applied to |
4906 |
|
|
4907 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
4908 |
|
|
4909 |
it yields "no match" quickly. However, if a possessive quantifier is |
it yields "no match" quickly. However, if a possessive quantifier is |
4910 |
not used, the match runs for a very long time indeed because there are |
not used, the match runs for a very long time indeed because there are |
4911 |
so many different ways the + and * repeats can carve up the subject, |
so many different ways the + and * repeats can carve up the subject, |
4912 |
and all have to be tested before failure can be reported. |
and all have to be tested before failure can be reported. |
4913 |
|
|
4914 |
At the end of a match, the values of capturing parentheses are those |
At the end of a match, the values of capturing parentheses are those |
4915 |
from the outermost level. If you want to obtain intermediate values, a |
from the outermost level. If you want to obtain intermediate values, a |
4916 |
callout function can be used (see below and the pcrecallout documenta- |
callout function can be used (see below and the pcrecallout documenta- |
4917 |
tion). If the pattern above is matched against |
tion). If the pattern above is matched against |
4918 |
|
|
4919 |
(ab(cd)ef) |
(ab(cd)ef) |
4920 |
|
|
4921 |
the value for the inner capturing parentheses (numbered 2) is "ef", |
the value for the inner capturing parentheses (numbered 2) is "ef", |
4922 |
which is the last value taken on at the top level. If a capturing sub- |
which is the last value taken on at the top level. If a capturing sub- |
4923 |
pattern is not matched at the top level, its final value is unset, even |
pattern is not matched at the top level, its final value is unset, even |
4924 |
if it is (temporarily) set at a deeper level. |
if it is (temporarily) set at a deeper level. |
4925 |
|
|
4926 |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
If there are more than 15 capturing parentheses in a pattern, PCRE has |
4927 |
to obtain extra memory to store data during a recursion, which it does |
to obtain extra memory to store data during a recursion, which it does |
4928 |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
by using pcre_malloc, freeing it via pcre_free afterwards. If no memory |
4929 |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error. |
4930 |
|
|
4931 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
4932 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
4933 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
4934 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
4935 |
ted at the outer level. |
ted at the outer level. |
4936 |
|
|
4937 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
4938 |
|
|
4939 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
4940 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
4941 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
4942 |
|
|
4943 |
Recursion difference from Perl |
Recursion difference from Perl |
4944 |
|
|
4945 |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
4946 |
always treated as an atomic group. That is, once it has matched some of |
always treated as an atomic group. That is, once it has matched some of |
4947 |
the subject string, it is never re-entered, even if it contains untried |
the subject string, it is never re-entered, even if it contains untried |
4948 |
alternatives and there is a subsequent matching failure. This can be |
alternatives and there is a subsequent matching failure. This can be |
4949 |
illustrated by the following pattern, which purports to match a palin- |
illustrated by the following pattern, which purports to match a palin- |
4950 |
dromic string that contains an odd number of characters (for example, |
dromic string that contains an odd number of characters (for example, |
4951 |
"a", "aba", "abcba", "abcdcba"): |
"a", "aba", "abcba", "abcdcba"): |
4952 |
|
|
4953 |
^(.|(.)(?1)\2)$ |
^(.|(.)(?1)\2)$ |
4954 |
|
|
4955 |
The idea is that it either matches a single character, or two identical |
The idea is that it either matches a single character, or two identical |
4956 |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
characters surrounding a sub-palindrome. In Perl, this pattern works; |
4957 |
in PCRE it does not if the subject string is longer than three characters. |
in PCRE it does not if the subject string is longer than three characters. |
4958 |
Consider the subject string "abcba": |
Consider the subject string "abcba": |
4959 |
|
|
4960 |
At the top level, the first character is matched, but as it is not at |
At the top level, the first character is matched, but as it is not at |
4961 |
the end of the string, the first alternative fails; the second alterna- |
the end of the string, the first alternative fails; the second alterna- |
4962 |
tive is taken and the recursion kicks in. The recursive call to subpat- |
tive is taken and the recursion kicks in. The recursive call to subpat- |
4963 |
tern 1 successfully matches the next character ("b"). (Note that the |
tern 1 successfully matches the next character ("b"). (Note that the |
4964 |
beginning and end of line tests are not part of the recursion.) |
beginning and end of line tests are not part of the recursion.) |
4965 |
|
|
4966 |
Back at the top level, the next character ("c") is compared with what |
Back at the top level, the next character ("c") is compared with what |
4967 |
subpattern 2 matched, which was "a". This fails. Because the recursion |
subpattern 2 matched, which was "a". This fails. Because the recursion |
4968 |
is treated as an atomic group, there are now no backtracking points, |
is treated as an atomic group, there are now no backtracking points, |
4969 |
and so the entire match fails. (Perl is able, at this point, to re- |
and so the entire match fails. (Perl is able, at this point, to re- |
4970 |
enter the recursion and try the second alternative.) However, if the |
enter the recursion and try the second alternative.) However, if the |
4971 |
pattern is written with the alternatives in the other order, things are |
pattern is written with the alternatives in the other order, things are |
4972 |
different: |
different: |
4973 |
|
|
4974 |
^((.)(?1)\2|.)$ |
^((.)(?1)\2|.)$ |
4975 |
|
|
4976 |
This time, the recursing alternative is tried first, and continues to |
This time, the recursing alternative is tried first, and continues to |
4977 |
recurse until it runs out of characters, at which point the recursion |
recurse until it runs out of characters, at which point the recursion |
4978 |
fails. But this time we do have another alternative to try at the |
fails. But this time we do have another alternative to try at the |
4979 |
higher level. That is the big difference: in the previous case the |
higher level. That is the big difference: in the previous case the |
4980 |
remaining alternative is at a deeper recursion level, which PCRE cannot |
remaining alternative is at a deeper recursion level, which PCRE cannot |
4981 |
use. |
use. |
4982 |
|
|
4983 |
To change the pattern so that it matches all palindromic strings, not just |
To change the pattern so that it matches all palindromic strings, not just |
4984 |
those with an odd number of characters, it is tempting to change the |
those with an odd number of characters, it is tempting to change the |
4985 |
pattern to this: |
pattern to this: |
4986 |
|
|
4987 |
^((.)(?1)\2|.?)$ |
^((.)(?1)\2|.?)$ |
4988 |
|
|
4989 |
Again, this works in Perl, but not in PCRE, and for the same reason. |
Again, this works in Perl, but not in PCRE, and for the same reason. |
4990 |
When a deeper recursion has matched a single character, it cannot be |
When a deeper recursion has matched a single character, it cannot be |
4991 |
entered again in order to match an empty string. The solution is to |
entered again in order to match an empty string. The solution is to |
4992 |
separate the two cases, and write out the odd and even cases as alter- |
separate the two cases, and write out the odd and even cases as alter- |
4993 |
natives at the higher level: |
natives at the higher level: |
4994 |
|
|
4995 |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
^(?:((.)(?1)\2|)|((.)(?3)\4|.)) |
4996 |
|
|
4997 |
If you want to match typical palindromic phrases, the pattern has to |
If you want to match typical palindromic phrases, the pattern has to |
4998 |
ignore all non-word characters, which can be done like this: |
ignore all non-word characters, which can be done like this: |
4999 |
|
|
5000 |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
^\W*+(?:((.)\W*+(?1)\W*+\2|)|((.)\W*+(?3)\W*+\4|\W*+.\W*+))\W*+$ |
5001 |
|
|
5002 |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
If run with the PCRE_CASELESS option, this pattern matches phrases such |
5003 |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
as "A man, a plan, a canal: Panama!" and it works well in both PCRE and |
5004 |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
Perl. Note the use of the possessive quantifier *+ to avoid backtrack- |
5005 |
ing into sequences of non-word characters. Without this, PCRE takes a |
ing into sequences of non-word characters. Without this, PCRE takes a |
5006 |
great deal longer (ten times or more) to match typical phrases, and |
great deal longer (ten times or more) to match typical phrases, and |
5007 |
Perl takes so long that you think it has gone into a loop. |
Perl takes so long that you think it has gone into a loop. |
5008 |
|
|
5009 |
WARNING: The palindrome-matching patterns above work only if the sub- |
WARNING: The palindrome-matching patterns above work only if the sub- |
5010 |
ject string does not start with a palindrome that is shorter than the |
ject string does not start with a palindrome that is shorter than the |
5011 |
entire string. For example, although "abcba" is correctly matched, if |
entire string. For example, although "abcba" is correctly matched, if |
5012 |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
the subject is "ababa", PCRE finds the palindrome "aba" at the start, |
5013 |
then fails at top level because the end of the string does not follow. |
then fails at top level because the end of the string does not follow. |
5014 |
Once again, it cannot jump back into the recursion to try other alter- |
Once again, it cannot jump back into the recursion to try other alter- |
5015 |
natives, so the entire match fails. |
natives, so the entire match fails. |
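
When testing palindrome patterns against cases like these, an engine-independent reference check is useful for comparison. The Python sketch below is not a translation of the patterns above; is_palindromic_phrase is a hypothetical helper that strips non-word characters, folds case, and compares the result with its reverse.

```python
import re


def is_palindromic_phrase(subject):
    """Reference check: ignore non-word characters and letter case."""
    # re.ASCII keeps \W aligned with an ASCII-only notion of word characters.
    letters = re.sub(r"\W+", "", subject, flags=re.ASCII).lower()
    return letters == letters[::-1]
```

This check accepts "ababa", so it can be used to spot subjects for which a recursive pattern gives a different answer.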
5016 |
|
|
5017 |
|
|
5018 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
5019 |
|
|
5020 |
If the syntax for a recursive subpattern reference (either by number or |
If the syntax for a recursive subpattern reference (either by number or |
5021 |
by name) is used outside the parentheses to which it refers, it oper- |
by name) is used outside the parentheses to which it refers, it oper- |
5022 |
ates like a subroutine in a programming language. The "called" subpat- |
ates like a subroutine in a programming language. The "called" subpat- |
5023 |
tern may be defined before or after the reference. A numbered reference |
tern may be defined before or after the reference. A numbered reference |
5024 |
can be absolute or relative, as in these examples: |
can be absolute or relative, as in these examples: |
5025 |
|
|
5031 |
|
|
5032 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
5033 |
|
|
5034 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
5035 |
not "sense and responsibility". If instead the pattern |
not "sense and responsibility". If instead the pattern |
5036 |
|
|
5037 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
5038 |
|
|
5039 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
5040 |
two strings. Another example is given in the discussion of DEFINE |
two strings. Another example is given in the discussion of DEFINE |
5041 |
above. |
above. |
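
The difference between a back reference and a subroutine call can be reproduced in an engine that has no (?1) item by repeating the text of the subpattern. The sketch below uses Python's standard re module; the names STEM, BACKREF, and SUBROUTINE are hypothetical, and the textual repetition is only an approximation of a PCRE subroutine call.

```python
import re

STEM = r"sens|respons"  # the text of subpattern 1

# Back reference: \1 must repeat the exact text that group 1 matched.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Emulated subroutine call: the subpattern itself is applied again, so the
# second occurrence may match a different alternative.
SUBROUTINE = re.compile(rf"({STEM})e and (?:{STEM})ibility")
```

Textual repetition is not a full substitute: a PCRE subroutine call is treated as an atomic group and keeps the processing options that were in force where the subpattern was defined, whereas a repeated copy is compiled in its new context.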
5042 |
|
|
5043 |
Like recursive subpatterns, a subroutine call is always treated as an |
Like recursive subpatterns, a subroutine call is always treated as an |
5044 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
5045 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
5046 |
there is a subsequent matching failure. Any capturing parentheses that |
there is a subsequent matching failure. Any capturing parentheses that |
5047 |
are set during the subroutine call revert to their previous values |
are set during the subroutine call revert to their previous values |
5048 |
afterwards. |
afterwards. |
5049 |
|
|
5050 |
When a subpattern is used as a subroutine, processing options such as |
When a subpattern is used as a subroutine, processing options such as |
5051 |
case-independence are fixed when the subpattern is defined. They cannot |
case-independence are fixed when the subpattern is defined. They cannot |
5052 |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
5053 |
|
|
5054 |
(abc)(?i:(?-1)) |
(abc)(?i:(?-1)) |
5055 |
|
|
5056 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
5057 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
5058 |
|
|
5059 |
|
|
5060 |
ONIGURUMA SUBROUTINE SYNTAX |
ONIGURUMA SUBROUTINE SYNTAX |
5061 |
|
|
5062 |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
5063 |
name or a number enclosed either in angle brackets or single quotes, is |
name or a number enclosed either in angle brackets or single quotes, is |
5064 |
an alternative syntax for referencing a subpattern as a subroutine, |
an alternative syntax for referencing a subpattern as a subroutine, |
5065 |
possibly recursively. Here are two of the examples used above, rewrit- |
possibly recursively. Here are two of the examples used above, rewrit- |
5066 |
ten using this syntax: |
ten using this syntax: |
5067 |
|
|
5068 |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
5069 |
(sens|respons)e and \g'1'ibility |
(sens|respons)e and \g'1'ibility |
5070 |
|
|
5071 |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
PCRE supports an extension to Oniguruma: if a number is preceded by a |
5072 |
plus or a minus sign it is taken as a relative reference. For example: |
plus or a minus sign it is taken as a relative reference. For example: |
5073 |
|
|
5074 |
(abc)(?i:\g<-1>) |
(abc)(?i:\g<-1>) |
5075 |
|
|
5076 |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
5077 |
synonymous. The former is a back reference; the latter is a subroutine |
synonymous. The former is a back reference; the latter is a subroutine |
5078 |
call. |
call. |
5079 |
|
|
5080 |
|
|
5081 |
CALLOUTS |
CALLOUTS |
5082 |
|
|
5083 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
5084 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
5085 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
5086 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
5087 |
tion. |
tion. |
5088 |
|
|
5089 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
5090 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
5091 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
5092 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
5093 |
all calling out. |
all calling out. |
5094 |
|
|
5095 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
5096 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
5097 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
5098 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
5099 |
points: |
points: |
5100 |
|
|
5101 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
5102 |
|
|
5103 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
5104 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
5105 |
numbered 255. |
numbered 255. |
5106 |
|
|
5107 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
5108 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
5109 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
5110 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
5111 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
5112 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
5113 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
5114 |
|
|
5115 |
|
|
5116 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
5117 |
|
|
5118 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
5119 |
which are described in the Perl documentation as "experimental and sub- |
which are described in the Perl documentation as "experimental and sub- |
5120 |
ject to change or removal in a future version of Perl". It goes on to |
ject to change or removal in a future version of Perl". It goes on to |
5121 |
say: "Their usage in production code should be noted to avoid problems |
say: "Their usage in production code should be noted to avoid problems |
5122 |
during upgrades." The same remarks apply to the PCRE features described |
during upgrades." The same remarks apply to the PCRE features described |
5123 |
in this section. |
in this section. |
5124 |
|
|
5125 |
Since these verbs are specifically related to backtracking, most of |
Since these verbs are specifically related to backtracking, most of |
5126 |
them can be used only when the pattern is to be matched using |
them can be used only when the pattern is to be matched using |
5127 |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
5128 |
(*FAIL), which behaves like a failing negative assertion, they cause an |
(*FAIL), which behaves like a failing negative assertion, they cause an |
5129 |
error if encountered by pcre_dfa_exec(). |
error if encountered by pcre_dfa_exec(). |
5130 |
|
|
5131 |
If any of these verbs are used in an assertion or subroutine subpattern |
If any of these verbs are used in an assertion or subroutine subpattern |
5132 |
(including recursive subpatterns), their effect is confined to that |
(including recursive subpatterns), their effect is confined to that |
5133 |
subpattern; it does not extend to the surrounding pattern. Note that |
subpattern; it does not extend to the surrounding pattern. Note that |
5134 |
such subpatterns are processed as anchored at the point where they are |
such subpatterns are processed as anchored at the point where they are |
5135 |
tested. |
tested. |
5136 |
|
|
5137 |
The new verbs make use of what was previously invalid syntax: an open- |
The new verbs make use of what was previously invalid syntax: an open- |
5138 |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
5139 |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
the form (*VERB:ARG) but PCRE does not support the use of arguments, so |
5140 |
its general form is just (*VERB). Any number of these verbs may occur |
its general form is just (*VERB). Any number of these verbs may occur |
5141 |
in a pattern. There are two kinds: |
in a pattern. There are two kinds: |
5142 |
|
|
5143 |
Verbs that act immediately |
Verbs that act immediately |
5146 |
|
|
5147 |
(*ACCEPT) |
(*ACCEPT) |
5148 |
|
|
5149 |
This verb causes the match to end successfully, skipping the remainder |
This verb causes the match to end successfully, skipping the remainder |
5150 |
of the pattern. When inside a recursion, only the innermost pattern is |
of the pattern. When inside a recursion, only the innermost pattern is |
5151 |
ended immediately. If (*ACCEPT) is inside capturing parentheses, the |
ended immediately. If (*ACCEPT) is inside capturing parentheses, the |
5152 |
data so far is captured. (This feature was added to PCRE at release |
data so far is captured. (This feature was added to PCRE at release |
5153 |
8.00.) For example: |
8.00.) For example: |
5154 |
|
|
5155 |
A((?:A|B(*ACCEPT)|C)D) |
A((?:A|B(*ACCEPT)|C)D) |
5156 |
|
|
5157 |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap- |
5158 |
tured by the outer parentheses. |
tured by the outer parentheses. |
5159 |
|
|
5160 |
(*FAIL) or (*F) |
(*FAIL) or (*F) |
5161 |
|
|
5162 |
This verb causes the match to fail, forcing backtracking to occur. It |
This verb causes the match to fail, forcing backtracking to occur. It |
5163 |
is equivalent to (?!) but easier to read. The Perl documentation notes |
is equivalent to (?!) but easier to read. The Perl documentation notes |
5164 |
that it is probably useful only when combined with (?{}) or (??{}). |
that it is probably useful only when combined with (?{}) or (??{}). |
5165 |
Those are, of course, Perl features that are not present in PCRE. The |
Those are, of course, Perl features that are not present in PCRE. The |
5166 |
nearest equivalent is the callout feature, as for example in this pat- |
nearest equivalent is the callout feature, as for example in this pat- |
5167 |
tern: |
tern: |
5168 |
|
|
5169 |
a+(?C)(*FAIL) |
a+(?C)(*FAIL) |
5170 |
|
|
5171 |
A match with the string "aaaa" always fails, but the callout is taken |
A match with the string "aaaa" always fails, but the callout is taken |
5172 |
before each backtrack happens (in this example, 10 times). |
before each backtrack happens (in this example, 10 times). |
5173 |
|
|
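The figure of 10 callouts follows from simple counting. At each of the four starting positions, a+ first takes every remaining "a" and then gives them back one at a time, and every one of those attempts reaches the callout before (*FAIL) rejects it. The Python sketch below models only that counting and does not call PCRE. The helper name count_callouts is hypothetical, and the sketch assumes that no start-of-match optimization skips any attempt. Python's re module has no (*FAIL), but it does accept the equivalent (?!), which is used for the final check.

```python
import re


def count_callouts(subject: str) -> int:
    """Count how often a callout placed after a+ would be reached.

    Hypothetical model of a+(?C)(*FAIL): for every starting position,
    a+ matches the longest run of "a" and then backtracks one character
    at a time; each of those attempts reaches the callout once.
    """
    total = 0
    for start in range(len(subject)):
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # Run lengths run, run-1, ..., 1 are each tried exactly once.
        total += run
    return total


print(count_callouts("aaaa"))  # 4 + 3 + 2 + 1 = 10

# (?!) is an empty negative lookahead that can never succeed,
# so the overall search fails, as it would with (*FAIL).
print(re.search(r"a+(?!)", "aaaa"))
```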
5174 |
Verbs that act after backtracking |
Verbs that act after backtracking |
5175 |
|
|
5176 |
The following verbs do nothing when they are encountered. Matching con- |
The following verbs do nothing when they are encountered. Matching con- |
5177 |
tinues with what follows, but if there is no subsequent match, a fail- |
tinues with what follows, but if there is no subsequent match, a fail- |
5178 |
ure is forced. The verbs differ in exactly what kind of failure |
ure is forced. The verbs differ in exactly what kind of failure |
5179 |
occurs. |
occurs. |
5180 |
|
|
5181 |
(*COMMIT) |
(*COMMIT) |
5182 |
|
|
5183 |
This verb causes the whole match to fail outright if the rest of the |
This verb causes the whole match to fail outright if the rest of the |
5184 |
pattern does not match. Even if the pattern is unanchored, no further |
pattern does not match. Even if the pattern is unanchored, no further |
5185 |
attempts to find a match by advancing the starting point take place. |
attempts to find a match by advancing the starting point take place. |
5186 |
Once (*COMMIT) has been passed, pcre_exec() is committed to finding a |
Once (*COMMIT) has been passed, pcre_exec() is committed to finding a |
5187 |
match at the current starting point, or not at all. For example: |
match at the current starting point, or not at all. For example: |
5188 |
|
|
5189 |
a+(*COMMIT)b |
a+(*COMMIT)b |
5190 |
|
|
5191 |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
This matches "xxaab" but not "aacaab". It can be thought of as a kind |
5192 |
of dynamic anchor, or "I've started, so I must finish." |
of dynamic anchor, or "I've started, so I must finish." |
5193 |
|
|
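The "dynamic anchor" behaviour can be approximated in Python, whose re module does not implement (*COMMIT). For this particular pattern, the first starting point at which a+ succeeds is the first "a" in the subject, so committing there is the same as anchoring a+b at that position. The function name commit_search is hypothetical, and the sketch is specific to a+(*COMMIT)b, not a general emulation of the verb.

```python
import re


def commit_search(subject: str):
    """Approximate a+(*COMMIT)b for this one pattern (hypothetical helper)."""
    first = re.search(r"a+", subject)
    if first is None:
        return None  # (*COMMIT) is never reached, so no match at all
    # Once committed, only the current starting point may be used.
    return re.compile(r"a+b").match(subject, first.start())


for subject in ("xxaab", "aacaab"):
    committed = commit_search(subject)
    plain = re.search(r"a+b", subject)
    print(subject,
          committed.group() if committed else None,
          plain.group() if plain else None)
```

An ordinary unanchored search for a+b still finds "aab" inside "aacaab", because it is free to advance the starting point past the failed attempt at offset 0. The committed version is not.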
5194 |
(*PRUNE) |
(*PRUNE) |
5195 |
|
|
5196 |
This verb causes the match to fail at the current position if the rest |
This verb causes the match to fail at the current position if the rest |
5197 |
of the pattern does not match. If the pattern is unanchored, the normal |
of the pattern does not match. If the pattern is unanchored, the normal |
5198 |
"bumpalong" advance to the next starting character then happens. Back- |
"bumpalong" advance to the next starting character then happens. Back- |
5199 |
tracking can occur as usual to the left of (*PRUNE), or when matching |
tracking can occur as usual to the left of (*PRUNE), or when matching |
5200 |
to the right of (*PRUNE), but if there is no match to the right, back- |
to the right of (*PRUNE), but if there is no match to the right, back- |
5201 |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
tracking cannot cross (*PRUNE). In simple cases, the use of (*PRUNE) |
5202 |
is just an alternative to an atomic group or possessive quantifier, but |
is just an alternative to an atomic group or possessive quantifier, but |
5203 |
there are some uses of (*PRUNE) that cannot be expressed in any other |
there are some uses of (*PRUNE) that cannot be expressed in any other |
5204 |
way. |
way. |
5205 |
|
|
5206 |
(*SKIP) |
(*SKIP) |
5207 |
|
|
5208 |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
This verb is like (*PRUNE), except that if the pattern is unanchored, |
5209 |
the "bumpalong" advance is not to the next character, but to the posi- |
the "bumpalong" advance is not to the next character, but to the posi- |
5210 |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
tion in the subject where (*SKIP) was encountered. (*SKIP) signifies |
5211 |
that whatever text was matched leading up to it cannot be part of a |
that whatever text was matched leading up to it cannot be part of a |
5212 |
successful match. Consider: |
successful match. Consider: |
5213 |
|
|
5214 |
a+(*SKIP)b |
a+(*SKIP)b |
5215 |
|
|
5216 |
If the subject is "aaaac...", after the first match attempt fails |
If the subject is "aaaac...", after the first match attempt fails |
5217 |
(starting at the first character in the string), the starting point |
(starting at the first character in the string), the starting point |
5218 |
skips on to start the next attempt at "c". Note that a possessive quan- |
skips on to start the next attempt at "c". Note that a possessive quan- |
5219 |
tifier does not have the same effect as this example; although it would |
tifier does not have the same effect as this example; although it would |
5220 |
suppress backtracking during the first match attempt, the second |
suppress backtracking during the first match attempt, the second |
5221 |
attempt would start at the second character instead of skipping on to |
attempt would start at the second character instead of skipping on to |
5222 |
"c". |
"c". |
5223 |
|
|
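The difference between (*SKIP) and a possessive quantifier is easiest to see by listing the starting offsets that each one tries. The Python sketch below is a hand-written model of a+(*SKIP)b and of a possessive a++b, not a call into PCRE. The function name start_offsets and the sample subject "aaaacaab" are assumptions made for illustration, and start-of-match optimizations are ignored.

```python
import re

A_RUN = re.compile(r"a+")


def start_offsets(subject: str, skip: bool):
    """Return (offsets tried, match span or None) for a run of "a" then "b".

    skip=True models a+(*SKIP)b; skip=False models a possessive a++b with
    the normal one-character bumpalong. Hypothetical helper.
    """
    tried = []
    pos = 0
    while pos < len(subject):
        tried.append(pos)
        run = A_RUN.match(subject, pos)
        if run and subject[run.end():run.end() + 1] == "b":
            return tried, (pos, run.end() + 1)
        if skip and run:
            pos = run.end()  # jump to where (*SKIP) was encountered
        else:
            pos += 1         # ordinary bumpalong
    return tried, None


print(start_offsets("aaaacaab", skip=True))
print(start_offsets("aaaacaab", skip=False))
```

With skip=True the offsets tried are 0, 4 and 5. With skip=False they are 0 through 5. Both find the same match, "aab" at offsets 5 to 8, but the (*SKIP) model never retries offsets 1 to 3, which cannot succeed.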
5224 |
(*THEN) |
(*THEN) |
5225 |
|
|
5226 |
This verb causes a skip to the next alternation if the rest of the pat- |
This verb causes a skip to the next alternation if the rest of the pat- |
5227 |
tern does not match. That is, it cancels pending backtracking, but only |
tern does not match. That is, it cancels pending backtracking, but only |
5228 |
within the current alternation. Its name comes from the observation |
within the current alternation. Its name comes from the observation |
5229 |
that it can be used for a pattern-based if-then-else block: |
that it can be used for a pattern-based if-then-else block: |
5230 |
|
|
5231 |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... |
5232 |
|
|
5233 |
If the COND1 pattern matches, FOO is tried (and possibly further items |
If the COND1 pattern matches, FOO is tried (and possibly further items |
5234 |
after the end of the group if FOO succeeds); on failure the matcher |
after the end of the group if FOO succeeds); on failure the matcher |
5235 |
skips to the second alternative and tries COND2, without backtracking |
skips to the second alternative and tries COND2, without backtracking |
5236 |
into COND1. If (*THEN) is used outside of any alternation, it acts |
into COND1. If (*THEN) is used outside of any alternation, it acts |
5237 |
exactly like (*PRUNE). |
exactly like (*PRUNE). |
5238 |
|
|
5239 |
|
|
5251 |
|
|
5252 |
REVISION |
REVISION |
5253 |
|
|
5254 |
Last updated: 18 October 2009 |
Last updated: 11 January 2010 |
5255 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
5256 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5257 |
|
|
5258 |
|
|