Details of exactly which Perl regular expression features are and are
not supported by PCRE are given in separate documents. See the
pcrepattern and pcrecompat pages. There is a syntax summary in the
pcresyntax page.

Some features of PCRE can be included, excluded, or changed when the
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. The features
themselves are described in the pcrebuild page. Documentation about
building PCRE for various operating systems can be found in the README
file in the source distribution.

The library contains a number of undocumented internal functions and
data tables that are used by more than one of the exported external
functions, but which are not intended for use by external callers.
Their names all begin with "_pcre_", which hopefully will not provoke
any name clashes. In some environments, it is possible to control which
external symbols are exported when a shared library is built, and in
these cases the undocumented symbols are not exported.


USER DOCUMENTATION

The user documentation for PCRE comprises a number of different
sections. In the "man" format, each of these is a separate "man page".
In the HTML format, each is a separate page, linked from the index
page. In the plain text format, all the sections are concatenated, for
ease of searching. The sections are as follows:

  pcre            this document
  pcrepartial     details of the partial matching facility
  pcrepattern     syntax and semantics of supported
                  regular expressions
  pcresyntax      quick syntax reference
  pcreperform     discussion of performance issues
  pcreposix       the POSIX-compatible C API
  pcreprecompile  details of saving and re-using precompiled patterns
  pcrestack       discussion of stack usage
  pcretest        description of the pcretest testing command

In addition, in the "man" and HTML formats, there is a short page for
each C library function, listing its arguments and results.


LIMITATIONS

There are some size limitations in PCRE but it is hoped that they will
never in practice be relevant.

The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want to
process regular expressions that are truly enormous, you can compile
PCRE with an internal linkage size of 3 or 4 (see the README file in
the source distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However, the speed
of execution is slower.

All values in repeating quantifiers must be less than 65536.

There is no limit to the number of parenthesized subpatterns, but there
can be no more than 65535 capturing subpatterns.

The maximum length of a name for a named subpattern is 32 characters,
and the maximum number of named subpatterns is 10000.


REVISION

Last updated: 06 August 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------

subpatterns are not required to be unique. Normally, patterns with
duplicate names are such that in any one match, only one of the named
subpatterns participates. An example is shown in the pcrepattern
documentation.

When duplicates are present, pcre_copy_named_substring() and
pcre_get_named_substring() return the first substring corresponding to
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
(-7) is returned; no data is returned. The pcre_get_stringnumber()
function returns one of the numbers that are associated with the name,
but it is not defined which it is.
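
The lookup rule for duplicate names can be modelled in a few lines. The
sketch below is Python, not the C API. It assumes that the name table
is a list of (name, group number) pairs and that the captures are held
in a list in which unset groups are None; both layouts, and the
function name, are hypothetical and serve only to illustrate the rule.

```python
PCRE_ERROR_NOSUBSTRING = -7  # value quoted in the text above


def first_set_substring(name_table, captures, name):
    """Return the first substring whose group carries `name` and took
    part in the match; otherwise return the error code."""
    for entry_name, number in name_table:
        if entry_name == name and captures[number] is not None:
            return captures[number]
    return PCRE_ERROR_NOSUBSTRING


# Hypothetical pattern in which the name "dn" is attached to groups 1 and 2.
name_table = [("dn", 1), ("dn", 2)]

# Index 0 is the whole match; group 1 did not participate, group 2 did.
result = first_set_substring(name_table, ["Sat", None, "Sat"], "dn")

# Neither group with that name is set, so only the error code comes back.
missing = first_set_substring(name_table, ["x", None, None], "dn")
```

Returning the error code in place of a string mirrors the statement
that no data is returned when none of the duplicates is set.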

If you want to get full details of all captured substrings for a given
name, you must use the pcre_get_stringtable_entries() function. The


PCRE REGULAR EXPRESSION DETAILS

The syntax and semantics of the regular expressions that are supported
by PCRE are described in detail below. There is a quick-reference
syntax summary in the pcresyntax page. Perl's regular expressions are
described in its own documentation, and regular expressions in general
are covered in a number of books, some of which have copious examples.
Jeffrey Friedl's "Mastering Regular Expressions", published by
O'Reilly, covers regular expressions in great detail. This description
of PCRE's regular expressions is intended as reference material.

The original operation of PCRE was on strings of one-byte characters.
However, there is now also support for UTF-8 character strings. To use

Absolute and relative back references

The sequence \g followed by an unsigned or a negative number,
optionally enclosed in braces, is an absolute or relative back
reference. A named back reference can be coded as \g{name}. Back
references are discussed later, following the discussion of
parenthesized subpatterns.

Generic character types

  \d++foo

Note that a possessive quantifier can be used with an entire group, for
example:

  (abc|xyz){2,3}+

Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation for the
simpler forms of atomic group. However, there is no difference in the
meaning of a possessive quantifier and the equivalent atomic group,
though there may be a performance difference; possessive quantifiers
should be slightly faster.

The possessive quantifier syntax is an extension to the Perl 5.8
syntax. Jeffrey Friedl originated the idea (and the name) in the first
edition of his book. Mike McCloskey liked it, so implemented it when he
built Sun's Java package, and PCRE copied it from there. It ultimately
found its way into Perl at release 5.10.

PCRE has an optimization that automatically "possessifies" certain
simple pattern constructs. For example, the sequence A+B is treated as
A++B because there is no point in backtracking into a sequence of A's
when B must follow.

When a pattern contains an unlimited repeat inside a subpattern that
can itself be repeated an unlimited number of times, the use of an
atomic group is the only way to avoid some failing matches taking a
very long time indeed. The pattern

  (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?. When
it matches, it runs quickly. However, if it is applied to

  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

it takes a long time before reporting failure. This is because the
string can be divided between the internal \D+ repeat and the external
* repeat in a large number of ways, and all have to be tried. (The
example uses [!?] rather than a single character at the end, because
both PCRE and Perl have an optimization that allows for fast failure
when a single character is used. They remember the last single
character that is required for a match, and fail early if it is not
present in the string.) If the pattern is changed so that it uses an
atomic group, like this:

  ((?>\D+)|<\d+>)*[!?]

sequences of non-digits cannot be broken, and failure happens quickly.
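
The effect can be observed with any backtracking matcher. The following
is a rough sketch in Python, whose re module is a different engine from
PCRE; it is assumed here only that it backtracks in the same way for
these two patterns. Atomic groups are accepted by re from Python 3.11.
For older versions the sketch falls back on a lookahead followed by a
back reference, which is a commonly used emulation and not something
PCRE does. The helper name seconds() is made up for the illustration.

```python
import re
import time

SLOW = re.compile(r"(\D+|<\d+>)*[!?]")
try:
    # The atomic form from the text; accepted from Python 3.11 onwards.
    FAST = re.compile(r"((?>\D+)|<\d+>)*[!?]")
except re.error:
    # Emulation for older versions: the lookahead captures the whole run
    # of non-digits and the back reference consumes it in one piece.
    FAST = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")


def seconds(regex, subject):
    """Return (match object or None, elapsed seconds) for one attempt."""
    start = time.perf_counter()
    found = regex.match(subject)
    return found, time.perf_counter() - start


short_subject = "a" * 18  # kept short: the unprotected form grows exponentially
long_subject = "a" * 52   # a run of the same kind as the subject in the text

slow_found, slow_time = seconds(SLOW, short_subject)
fast_found, fast_time = seconds(FAST, long_subject)
print(f"unprotected, 18 chars: {slow_time:.4f}s")
print(f"atomic, 52 chars:      {fast_time:.4f}s")

# Both forms still match when the final ! or ? follows a <...> item.
both_match = SLOW.match("<12>!") is not None and FAST.match("<12>!") is not None

# The atomic form cannot give back a ! that \D+ has already consumed.
changed = SLOW.match("abc!") is not None and FAST.match("abc!") is None
```

Timings depend on the machine, so none are quoted here. The last two
lines show that the atomic form is not a drop-in replacement: a
subject such as abc! matches the original pattern from its start but
not the atomic one.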


BACK REFERENCES

Outside a character class, a backslash followed by a digit greater than
0 (and possibly further digits) is a back reference to a capturing
subpattern earlier (that is, to its left) in the pattern, provided
there have been that many previous capturing left parentheses.

However, if the decimal number following the backslash is less than 10,
it is always taken as a back reference, and causes an error only if
there are not that many capturing left parentheses in the entire
pattern. In other words, the parentheses that are referenced need not
be to the left of the reference for numbers less than 10. A "forward
back reference" of this type can make sense when a repetition is
involved and the subpattern to the right has participated in an earlier
iteration.

It is not possible to have a numerical "forward back reference" to a
subpattern whose number is 10 or more using this syntax because a
sequence such as \50 is interpreted as a character defined in octal.
See the subsection entitled "Non-printing characters" above for further
details of the handling of digits following a backslash. There is no
such problem when named parentheses are used. A back reference to any
subpattern is possible using named parentheses (see below).

Another way of avoiding the ambiguity inherent in the use of digits
following a backslash is to use the \g escape sequence, which is a
feature introduced in Perl 5.10. This escape must be followed by an
unsigned number or a negative number, optionally enclosed in braces.
These examples are all identical:

  (ring), \1
  (ring), \g1
  (ring), \g{1}

An unsigned number specifies an absolute reference without the
ambiguity that is present in the older syntax. It is also useful when
literal digits follow the reference. A negative number is a relative
reference. Consider this example:

  (abc(def)ghi)\g{-1}

The sequence \g{-1} is a reference to the most recently started
capturing subpattern before \g, that is, it is equivalent to \2.
Similarly, \g{-2} would be equivalent to \1. The use of relative
references can be helpful in long patterns, and also in patterns that
are created by joining together fragments that contain references
within themselves.
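
The counting rule for relative references can be made concrete with a
small converter. The sketch below is Python and purely illustrative:
Python's re module does not accept \g inside a pattern, so the
hypothetical helper absolutize() rewrites each relative reference as
the absolute one it stands for. It counts capturing left parentheses
naively and ignores parentheses inside character classes, which a real
pattern compiler would have to handle.

```python
import re


def absolutize(pattern):
    """Rewrite \\g{-n} (or \\g-n) as an absolute numbered back reference."""
    out = []
    opened = 0  # capturing groups started so far
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            m = re.match(r"\\g\{?-(\d+)\}?", pattern[i:])
            if m:
                # -1 is the most recently started group, -2 the one before.
                target = opened - int(m.group(1)) + 1
                if target < 1:
                    raise ValueError("relative reference precedes group 1")
                # Wrapping keeps any literal digits that follow separate.
                out.append("(?:\\%d)" % target)
                i += m.end()
            else:
                out.append(pattern[i:i + 2])  # copy other escapes untouched
                i += 2
            continue
        if ch == "(":
            rest = pattern[i + 1:]
            # Plain and named groups capture; other (?...) forms do not.
            if not rest.startswith("?") or re.match(r"\?(P?<[A-Za-z_]|')", rest):
                opened += 1
        out.append(ch)
        i += 1
    return "".join(out)


converted = absolutize(r"(abc(def)ghi)\g{-1}")
outer = absolutize(r"(abc(def)ghi)\g{-2}")
matched = re.fullmatch(converted, "abcdefghidef")
```

Here converted is (abc(def)ghi)(?:\2) and outer is
(abc(def)ghi)(?:\1), in line with the equivalences given above.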

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything matching
the subpattern itself (see "Subpatterns as subroutines" below for a way
of doing that). So the pattern

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If caseful matching is in force at the
time of the back reference, the case of letters is relevant. For
example,

  ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
original capturing subpattern is matched caselessly.
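
Both examples can be tried out in Python, on the assumption that its
re module treats numbered back references in the same way for these
patterns. One adjustment is needed: Python does not give a mid-pattern
(?i) the local meaning it has in PCRE, so the sketch uses the scoped
form (?i:...) to make only the group caseless.

```python
import re

# Group 1 records the text it matched; \1 then demands that same text.
pair = re.compile(r"(sens|respons)e and \1ibility")

# (?i:...) stands in for ((?i)rah): the group is matched caselessly,
# while the back reference outside it is matched casefully.
cheer = re.compile(r"((?i:rah))\s+\1")

pair_results = [bool(pair.fullmatch(s)) for s in (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)]
cheer_results = [bool(cheer.fullmatch(s))
                 for s in ("rah rah", "RAH RAH", "RAH rah")]

print(pair_results)   # [True, True, False]
print(cheer_results)  # [True, True, False]
```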

There are several different ways of writing back references to named
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
unified back reference syntax, in which \g can be used for both numeric
and named references, is also supported. We could rewrite the above
example in any of the following ways:

  (?<p1>(?i)rah)\s+\k<p1>
  (?P<p1>(?i)rah)\s+(?P=p1)
  (?<p1>(?i)rah)\s+\g{p1}

A subpattern that is referenced by name may appear in the pattern
before or after the reference.
|
|
4010 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
4011 |
subpattern has not actually been used in a particular match, any back |
subpattern has not actually been used in a particular match, any back |
4012 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
4013 |
|
|
4014 |
(a|(bc))\2 |
(a|(bc))\2 |
4015 |
|
|
4016 |
always fails if it starts to match "a" rather than "bc". Because there |
always fails if it starts to match "a" rather than "bc". Because there |
4017 |
may be many capturing parentheses in a pattern, all digits following |
may be many capturing parentheses in a pattern, all digits following |
4018 |
the backslash are taken as part of a potential back reference number. |
the backslash are taken as part of a potential back reference number. |
4019 |
If the pattern continues with a digit character, some delimiter must be |
If the pattern continues with a digit character, some delimiter must be |
4020 |
used to terminate the back reference. If the PCRE_EXTENDED option is |
used to terminate the back reference. If the PCRE_EXTENDED option is |
4021 |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
4022 |
ments" below) can be used. |
ments" below) can be used. |
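
Both points can be illustrated in Python, assuming that its re module
agrees with PCRE on these particular constructs. In the sketch,
re.VERBOSE plays the part of PCRE_EXTENDED, and (?#) is the empty
comment.

```python
import re

unset = re.compile(r"(a|(bc))\2")
hit = unset.fullmatch("bcbc")  # group 2 is set to "bc", so \2 can match
miss = unset.search("aa")      # on the "a" branch group 2 is unset: failure

# A literal 1 is wanted after the reference to group 1. An empty comment
# ends the reference so that the digit is not read as part of its number.
with_comment = re.compile(r"(a)\1(?#)1")

# With extended syntax, ignored whitespace serves the same purpose.
with_space = re.compile(r"(a)\1 1", re.VERBOSE)

comment_ok = with_comment.fullmatch("aa1") is not None
space_ok = with_space.fullmatch("aa1") is not None
```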

A back reference that occurs inside the parentheses to which it refers
fails when the subpattern is first used, so, for example, (a\1) never
matches. However, such references can be useful inside repeated
subpatterns. For example, the pattern

  (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At each
iteration of the subpattern, the back reference matches the character
string corresponding to the previous iteration. In order for this to
work, the pattern must be such that the first iteration does not need
to match the back reference. This can be done using alternation, as in
the example above, or by a quantifier with a minimum of zero.


ASSERTIONS

An assertion is a test on the characters following or preceding the
current matching point that does not actually consume any characters.
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
described above.

More complicated assertions are coded as subpatterns. There are two
kinds: those that look ahead of the current position in the subject
string, and those that look behind it. An assertion subpattern is
matched in the normal way, except that it does not cause the current
matching position to be changed.

Assertion subpatterns are not capturing subpatterns, and may not be
repeated, because it makes no sense to assert the same thing several
times. If any kind of assertion contains capturing subpatterns within
it, these are counted for the purposes of numbering the capturing
subpatterns in the whole pattern. However, substring capturing is
carried out only for positive assertions, because it does not make
sense for negative assertions.
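
The two properties described above, that an assertion consumes nothing
and that only positive assertions record captures, can be seen in a
short Python sketch. It assumes that Python's re module behaves like
PCRE for these simple cases; the patterns are made up for the purpose.

```python
import re

# Positive lookahead: the group inside is captured, yet the match itself
# is only the single character consumed by the trailing \w.
positive = re.match(r"(?=(\w+);)\w", "foo;bar")

# Negative lookahead: the group is numbered, but it never holds a value.
negative_pattern = re.compile(r"(?!(\d))\w+")
negative = negative_pattern.match("foo")

captured = positive.group(1)            # "foo"
consumed = positive.end()               # 1
group_count = negative_pattern.groups   # 1
unset = negative.group(1)               # None
```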

Lookahead assertions

Lookahead assertions start with (?= for positive assertions and (?! for
negative assertions. For example,

  \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

  foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar". Note
that the apparently similar pattern

  (?!foo)bar

does not find an occurrence of "bar" that is preceded by something
other than "foo"; it finds any occurrence of "bar" whatsoever, because
the assertion (?!foo) is always true when the next three characters are
"bar". A lookbehind assertion is needed to achieve the other effect.

If you want to force a matching failure at some point in a pattern, the
most convenient way to do it is with (?!) because an empty string
always matches, so an assertion that requires there not to be an empty
string must always fail.
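
The lookahead patterns above can be exercised in Python, on the
assumption that its re module agrees with PCRE for these constructs.
The subject strings are invented for the illustration.

```python
import re

# The semicolon is required but is not part of what is matched.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Only the "foo" that is not followed by "bar" is found.
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" begins, so it can never reject a "bar".
misplaced = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bazbar")]

# (?!) always fails, so the first alternative is abandoned after "a".
forced = re.search(r"a(?!)|b", "ab")

print(words)        # ['alpha', 'gamma']
print(foo_not_bar)  # [7]
print(misplaced)    # [3, 10]
```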
4086 |
|
|
4087 |
Lookbehind assertions |
Lookbehind assertions |
4088 |
|
|
4089 |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
4090 |
for negative assertions. For example, |
for negative assertions. For example, |
4091 |
|
|
4092 |
(?<!foo)bar |
(?<!foo)bar |
4093 |
|
|
4094 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
4095 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
4096 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
4097 |
eral top-level alternatives, they do not all have to have the same |
eral top-level alternatives, they do not all have to have the same |
4098 |
fixed length. Thus |
fixed length. Thus |
4099 |
|
|
4100 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
4103 |
|
|
4104 |
(?<!dogs?|cats?) |
(?<!dogs?|cats?) |
4105 |
|
|
causes an error at compile time. Branches that match different length
strings are permitted only at the top level of a lookbehind assertion.
This is an extension compared with Perl (at least for 5.8), which
requires all branches to match the same length of string. An assertion
such as

  (?<=ab(c|de))

is not permitted, because its single top-level branch can match two
different lengths, but it is acceptable if rewritten to use two
top-level branches:

  (?<=abc|abde)

In some cases, the Perl 5.10 escape sequence \K (see above) can be used
instead of a lookbehind assertion; this is not restricted to a fixed
length.

The implementation of lookbehind assertions is, for each alternative,
to temporarily move the current position back by the fixed length and
then try to match. If there are insufficient characters before the
current position, the assertion fails.

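The behaviour described above can be tried out with a short sketch. It
uses Python's re module, which is a different engine from PCRE and is
assumed here only because it accepts the same (?<= and (?<! syntax.
Unlike PCRE, Python requires every alternative in a lookbehind to have
the same length, so only single-branch assertions are shown. The
variable names are illustrative.

```python
import re

# (?<!foo)bar: "bar" that is not immediately preceded by "foo".
NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")

# Only the second "bar" qualifies; the first one follows "foo".
first_hit = NOT_AFTER_FOO.search("foobar bar").start()

# A positive lookbehind of length three needs three characters before
# "d". Only two are available here, so the assertion fails.
too_short = re.search(r"(?<=abc)d", "bcd")

print(first_hit, too_short)
```

The second search shows the "insufficient characters" case: the engine
cannot step back three characters from the "d", so no match is found.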
PCRE does not allow the \C escape (which matches a single byte in UTF-8
mode) to appear in lookbehind assertions, because it makes it
impossible to calculate the length of the lookbehind. The \X and \R
escapes, which can match different numbers of bytes, are also not
permitted.

Possessive quantifiers can be used in conjunction with lookbehind
assertions to specify efficient matching at the end of the subject
string. Consider a simple pattern such as

  abcd$

when applied to a long string that does not match. Because matching
proceeds from left to right, PCRE will look for each "a" in the subject
and then see if what follows matches the rest of the pattern. If the
pattern is specified as

  ^.*abcd$

the initial .* matches the entire string at first, but when this fails
(because there is no following "a"), it backtracks to match all but the
last character, then all but the last two characters, and so on. Once
again the search for "a" covers the entire string, from right to left,
so we are no better off. However, if the pattern is written as

  ^.*+(?<=abcd)

there can be no backtracking for the .*+ item; it can match only the
entire string. The subsequent lookbehind assertion does a single test
on the last four characters. If it fails, the match fails immediately.
For long strings, this approach makes a significant difference to the
processing time.

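The following is a rough sketch of the same idea in Python's re module,
not in PCRE. Possessive quantifiers are not available in all Python
versions, so the sketch substitutes a different technique: a lookahead
that captures the whole line, followed by a backreference to it. This
assumes that a completed lookahead is not re-entered on backtracking,
which makes the pair behave like the possessive .*+ item. The names
are hypothetical.

```python
import re

# Stand-in for PCRE's ^.*+(?<=abcd): the lookahead captures the whole
# line as group 1, \1 then consumes exactly that text, and the
# lookbehind tests the last four characters once.
ENDS_WITH_ABCD = re.compile(r"^(?=(.*))\1(?<=abcd)")


def ends_with_abcd(subject):
    """Return True if the single-line subject ends with "abcd"."""
    return ENDS_WITH_ABCD.match(subject) is not None


print(ends_with_abcd("xxabcd"), ends_with_abcd("abcdxx"))
```

Because . does not match a newline by default, the sketch is only
meaningful for single-line subjects.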
Using multiple assertions

Several assertions (of any sort) may occur in succession. For example,

  (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999". Notice that
each of the assertions is applied independently at the same point in
the subject string. First there is a check that the previous three
characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is

  (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six characters,
checking that the first three are digits, and then the second assertion
checks that the preceding three characters are not "999".

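The difference between the two patterns can be checked with a small
sketch. It uses Python's re module as a stand-in for PCRE, on the
assumption that both engines treat these fixed-length lookbehinds the
same way. The helper name is illustrative.

```python
import re

# Both assertions inspect the same three characters before "foo".
SAME_THREE = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion looks back six characters, the second only three.
SIX_BEFORE = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the compiled pattern is found anywhere in subject."""
    return pattern.search(subject) is not None


for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, matches(SAME_THREE, subject), matches(SIX_BEFORE, subject))
```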
Assertions can be nested in any combination. For example,

  (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which in turn
is not preceded by "foo", while

  (?<=\d{3}(?!999)...)foo

is another pattern that matches "foo" preceded by three digits and any
three characters that are not "999".


CONDITIONAL SUBPATTERNS

It is possible to cause the matching process to obey a subpattern
conditionally or to choose between two alternative subpatterns,
depending on the result of an assertion, or whether a previous
capturing subpattern matched or not. The two possible forms of
conditional subpattern are

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used; otherwise the
no-pattern (if present) is used. If there are more than two
alternatives in the subpattern, a compile-time error occurs.

There are four kinds of condition: references to subpatterns,
references to recursion, a pseudo-condition called DEFINE, and
assertions.

Checking for a used subpattern by number

If the text between the parentheses consists of a sequence of digits,
the condition is true if the capturing subpattern of that number has
previously matched. An alternative notation is to precede the digits
with a plus or minus sign. In this case, the subpattern number is
relative rather than absolute. The most recently opened parentheses can
be referenced by (?(-1), the next most recent by (?(-2), and so on. In
looping constructs it can also make sense to refer to subsequent groups
with constructs such as (?(+2).

Consider the following pattern, which contains non-significant white
space to make it more readable (assume the PCRE_EXTENDED option) and to
divide it into three parts for ease of discussion:

  ( \( )?    [^()]+    (?(1) \) )

The first part matches an optional opening parenthesis, and if that
character is present, sets it as the first captured substring. The
second part matches one or more characters that are not parentheses.
The third part is a conditional subpattern that tests whether the first
set of parentheses matched or not. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required.
Otherwise, since no-pattern is not present, the subpattern matches
nothing. In other words, this pattern matches a sequence of
non-parentheses, optionally enclosed in parentheses.

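The three parts can be exercised with a short sketch. Python's re
module is used here in place of PCRE; it accepts the same (?(1)...)
form, and its re.VERBOSE flag plays the role assumed for PCRE_EXTENDED.
The function name is illustrative.

```python
import re

OPTIONAL_PARENS = re.compile(r"""
    ( \( )?       # part 1: optional opening parenthesis, group 1
    [^()]+        # part 2: one or more non-parenthesis characters
    (?(1) \) )    # part 3: closing parenthesis only if group 1 matched
""", re.VERBOSE)


def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


for subject in ("(abc)", "abc", "(abc", "abc)"):
    print(subject, whole_match(subject))
```

Anchoring with fullmatch makes the unbalanced cases visible; an
unanchored search would still find "abc" inside "(abc".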
If you were embedding this pattern in a larger one, you could use a
relative reference:

  ...other stuff... ( \( )?    [^()]+    (?(-1) \) ) ...

This makes the fragment independent of the parentheses in the larger
pattern.

Checking for a used subpattern by name

Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
used subpattern by name. For compatibility with earlier versions of
PCRE, which had this facility before Perl, the syntax (?(name)...) is
also recognized. However, there is a possible ambiguity with this
syntax, because subpattern names may consist entirely of digits. PCRE
looks first for a named subpattern; if it cannot find one and the name
consists entirely of digits, PCRE looks for a subpattern of that
number, which must be greater than zero. Using subpattern names that
consist entirely of digits is not recommended.

Rewriting the above example to use a named subpattern gives this:

  (?<OPEN> \( )?    [^()]+    (?(<OPEN>) \) )

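A named condition can be sketched in Python's re module, used here
only as an approximation of PCRE. Python writes the group definition
as (?P<OPEN>...) and the test in the bare (?(name)...) form mentioned
above, so the spelling differs slightly from the Perl syntax. The
variable names are illustrative.

```python
import re

# Python spelling: (?P<OPEN>...) defines the group, (?(OPEN)...) tests it.
NAMED = re.compile(r"(?P<OPEN> \( )?  [^()]+  (?(OPEN) \) )", re.VERBOSE)

with_parens = NAMED.fullmatch("(abc)")
without_parens = NAMED.fullmatch("abc")
unbalanced = NAMED.fullmatch("(abc")

print(with_parens.group("OPEN"), without_parens.group("OPEN"), unbalanced)
```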
Checking for pattern recursion

If the condition is the string (R), and there is no subpattern with the
name R, the condition is true if a recursive call to the whole pattern
or any subpattern has been made. If digits or a name preceded by
ampersand follow the letter R, for example:

  (?(R3)...) or (?(R&name)...)

the condition is true if the most recent recursion is into the
subpattern whose number or name is given. This condition does not check
the entire recursion stack.

At "top level", all these recursion test conditions are false.
Recursive patterns are described below.

Defining subpatterns for use by reference only

If the condition is the string (DEFINE), and there is no subpattern
with the name DEFINE, the condition is always false. In this case,
there may be only one alternative in the subpattern. It is always
skipped if control reaches this point in the pattern; the idea of
DEFINE is that it can be used to define "subroutines" that can be
referenced from elsewhere. (The use of "subroutines" is described
below.) For example, a pattern to match an IPv4 address could be
written like this (ignore whitespace and line breaks):

  (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
  \b (?&byte) (\.(?&byte)){3} \b

The first part of the pattern is a DEFINE group inside which another
group named "byte" is defined. This matches an individual component of
an IPv4 address (a number less than 256). When matching takes place,
this part of the pattern is skipped because DEFINE acts like a false
condition.

The rest of the pattern uses references to the named group to match the
four dot-separated components of an IPv4 address, insisting on a word
boundary at each end.

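Python's re module has neither DEFINE nor (?&name), so the sketch
below does not use the mechanism described above. It approximates the
same reuse by pasting the "byte" fragment into the pattern as a string,
which is a different technique with a similar effect for this example.
BYTE, IPV4 and is_ipv4 are hypothetical names.

```python
import re

# Plays the role of the "byte" group defined inside DEFINE.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# \b byte (\.byte){3} \b, with the shared fragment pasted in twice.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")


def is_ipv4(text):
    """Return True if text as a whole looks like a dotted IPv4 address."""
    return IPV4.fullmatch(text) is not None


for candidate in ("192.168.1.255", "0.0.0.0", "256.1.1.1", "1.2.3"):
    print(candidate, is_ipv4(candidate))
```

String pasting duplicates the fragment in the compiled pattern, whereas
a DEFINE group is compiled once and called by reference.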
Assertion conditions

If the condition is not in any of the above formats, it must be an
assertion. This may be a positive or negative lookahead or lookbehind
assertion. Consider this pattern, again containing non-significant
white space, and with the two alternatives on the second line:

  (?(?=[^a-z]*[a-z])
  \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other words,
it tests for the presence of at least one letter in the subject. If a
letter is found, the subject is matched against the first alternative;
otherwise it is matched against the second. This pattern matches
strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
letters and dd are digits.


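Python's re module does not accept an assertion as a condition, so the
sketch below is an emulation rather than the PCRE construct itself. It
rewrites "if the lookahead succeeds use A, otherwise use B" as two
guarded alternatives, (?=cond)A|(?!cond)B, which should select the same
branch for a lookahead condition like this one. All names are
hypothetical.

```python
import re

# The condition from the pattern above: some letter lies ahead.
HAS_LETTER = r"[^a-z]*[a-z]"

# (?=cond) yes-pattern | (?!cond) no-pattern
DATE_LIKE = re.compile(
    r"(?:(?=" + HAS_LETTER + r")\d{2}-[a-z]{3}-\d{2}"
    r"|(?!" + HAS_LETTER + r")\d{2}-\d{2}-\d{2})"
)


def find(subject):
    """Return the first matching substring, or None."""
    m = DATE_LIKE.search(subject)
    return m.group(0) if m else None


for subject in ("12-abc-34", "12-34-56", "12-34-56 x"):
    print(repr(subject), find(subject))
```

The last subject is worth noting: the trailing "x" makes the condition
true, so only the dd-aaa-dd form is tried and the all-digit text is not
matched.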
COMMENTS

The sequence (?# marks the start of a comment that continues up to the
next closing parenthesis. Nested parentheses are not permitted. The
characters that make up a comment play no part in the pattern matching
at all.

If the PCRE_EXTENDED option is set, an unescaped # character outside a
character class introduces a comment that continues to immediately
after the next newline in the pattern.


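Both comment styles can be seen in a brief sketch. It uses Python's re
module, which accepts the (?#...) form and treats re.VERBOSE much as
this section describes PCRE_EXTENDED; the equivalence is an assumption
of the sketch, and the names are illustrative.

```python
import re

# Inline comment: everything from (?# to the next ) is ignored.
INLINE = re.compile(r"\d+(?# run of digits)-\d+")

# Extended mode: an unescaped # starts a comment that runs to the newline.
EXTENDED = re.compile(r"""
    \d+   # run of digits
    -     # literal hyphen
    \d+   # another run of digits
""", re.VERBOSE)

inline_ok = INLINE.fullmatch("12-34") is not None
extended_ok = EXTENDED.fullmatch("12-34") is not None
print(inline_ok, extended_ok)
```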
RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses, allowing for
unlimited nested parentheses. Without the use of recursion, the best
that can be done is to use a pattern that matches up to some fixed
depth of nesting. It is not possible to handle an arbitrary nesting
depth.

For some time, Perl has provided a facility that allows regular
expressions to recurse (amongst other things). It does this by
interpolating Perl code in the expression at run time, and the code can
refer to the expression itself. A Perl pattern using code interpolation
to solve the parentheses problem can be created like this:

  $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in this case
refers recursively to the pattern in which it appears.

Obviously, PCRE cannot support the interpolation of Perl code. Instead,
it supports special syntax for recursion of the entire pattern, and
also for individual subpattern recursion. After its introduction in
PCRE and Python, this kind of recursion was introduced into Perl at
release 5.10.

A special item that consists of (? followed by a number greater than
zero and a closing parenthesis is a recursive call of the subpattern of
the given number, provided that it occurs inside that subpattern. (If
not, it is a "subroutine" call, which is described in the next
section.) The special item (?R) or (?0) is a recursive call of the
entire regular expression.

In PCRE (like Python, but unlike Perl), a recursive subpattern call is
always treated as an atomic group. That is, once it has matched some of
the subject string, it is never re-entered, even if it contains untried
alternatives and there is a subsequent matching failure.

This PCRE pattern solves the nested parentheses problem (assume the
PCRE_EXTENDED option is set so that white space is ignored):

  \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any number of
substrings which can either be a sequence of non-parentheses, or a
recursive match of the pattern itself (that is, a correctly
parenthesized substring). Finally there is a closing parenthesis.

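To make the structure of the recursive pattern concrete, the sketch
below spells it out as a hand-written recursive function in Python.
This is not how PCRE implements recursion, and it does not use a regex
engine at all; it only mirrors the three steps described above. The
function name match_parens is hypothetical.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or -1."""
    # \(  : an opening parenthesis is required.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    # ( (?>[^()]+) | (?R) )*  : any number of runs or nested groups.
    while pos < len(subject):
        if subject[pos] == ")":
            # \)  : the closing parenthesis ends this level.
            return pos + 1
        if subject[pos] == "(":
            # (?R): recurse into a nested group.
            pos = match_parens(subject, pos)
            if pos == -1:
                return -1
        else:
            # (?>[^()]+): consume a whole run of non-parentheses at once.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return -1  # ran out of subject before the closing parenthesis


print(match_parens("(ab(cd)ef)"), match_parens("(a(b)"))
```

Like the atomic group in the pattern, the inner loop consumes each run
of non-parentheses once and never reconsiders how to split it.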
If this were part of a larger pattern, you would not want to recurse
the entire pattern, so instead you could use this:

  ( \( ( (?>[^()]+) | (?1) )* \) )

We have put the pattern into parentheses, and caused the recursion to
refer to them instead of the whole pattern.

In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references. (A Perl
5.10 feature.) Instead of (?1) in the pattern above you can write
(?-2) to refer to the second most recently opened parentheses preceding
the recursion. In other words, a negative number counts capturing
parentheses leftwards from the point at which it is encountered.

It is also possible to refer to subsequently opened parentheses, by
writing references such as (?+2). However, these cannot be recursive
because the reference is not inside the parentheses that are
referenced. They are always "subroutine" calls, as described in the
next section.

An alternative approach is to use named parentheses instead. The Perl
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
supported. We could rewrite the above example as follows:

  (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )

If there is more than one subpattern with the same name, the earliest
one is used.

This particular example pattern that we have been looking at contains
nested unlimited repeats, and so the use of atomic grouping for
matching strings of non-parentheses is important when applying the
pattern to strings that do not match. For example, when this pattern is
applied to

  (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if atomic grouping is not used,
the match runs for a very long time indeed because there are so many
different ways the + and * repeats can carve up the subject, and all
have to be tested before failure can be reported.

At the end of a match, the values set for any capturing subpatterns are
those from the outermost level of the recursion at which the subpattern
value is set. If you want to obtain intermediate values, a callout
function can be used (see below and the pcrecallout documentation). If
the pattern above is matched against

  (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the last
value taken on at the top level. If additional parentheses are added,
giving

  \( ( ( (?>[^()]+) | (?R) )* ) \)
     ^                        ^
     ^                        ^

the string they capture is "ab(cd)ef", the contents of the top level
parentheses. If there are more than 15 capturing parentheses in a
pattern, PCRE has to obtain extra memory to store data during a
recursion, which it does by using pcre_malloc, freeing it via pcre_free
afterwards. If no memory can be obtained, the match fails with the
PCRE_ERROR_NOMEMORY error.

4450 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
4451 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
4452 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
4453 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
4454 |
ted at the outer level. |
ted at the outer level. |
4455 |
|
|
4456 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
4457 |
|
|
4458 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
4459 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
4460 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
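
The two roles can be made concrete with a hand-written matcher. What follows is a minimal Python sketch, not PCRE code: the function name match_angle and its recursing flag are hypothetical, and the function only attempts a match anchored at the given position rather than searching the subject. The flag stands in for the (R) condition, and the nested call stands in for (?R).

```python
DIGITS = "0123456789"

def match_angle(subject, pos=0, recursing=False):
    """Try to match one bracketed group starting at subject[pos].

    Returns the index just past the closing '>' on success, else None.
    """
    if pos >= len(subject) or subject[pos] != "<":
        return None
    i = pos + 1
    while i < len(subject) and subject[i] != ">":
        if subject[i] == "<":
            # Counterpart of (?R): match a nested group recursively.
            end = match_angle(subject, i, recursing=True)
            if end is None:
                return None
            i = end
        elif recursing and subject[i] not in DIGITS:
            # Counterpart of the (R) condition: digits only when nested.
            return None
        else:
            i += 1
    # Success only if the loop stopped on a closing '>'.
    return i + 1 if i < len(subject) else None
```

Under these assumptions, match_angle("<abc<123>def>") returns 13, whereas match_angle("<abc<1x3>def>") returns None because the nested group contains a non-digit.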


SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern reference (either by number or
by name) is used outside the parentheses to which it refers, it
operates like a subroutine in a programming language. The "called"
subpattern may be defined before or after the reference. A numbered
reference can be absolute or relative, as in these examples:

[...]

  (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and responsibility", but
not "sense and responsibility". If instead the pattern

  (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the other
two strings. Another example is given in the discussion of DEFINE
above.
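
The contrast can be approximated outside PCRE. Python's standard re module is a different engine and has no (?1) call, so the sketch below stands in for it by splicing the source text of the first group into the pattern a second time. The names STEM, BACKREF, CALLED and matches are illustrative only, and unlike a genuine subroutine call the spliced copy is not treated as an atomic group.

```python
import re

STEM = r"sens|respons"  # source text of the first group

# Back reference: \1 must repeat the text that group 1 actually matched.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Stand-in for (?1): the subpattern itself is applied again, so the
# second occurrence may pick a different alternative.
CALLED = re.compile(rf"({STEM})e and (?:{STEM})ibility")

def matches(pattern, subject):
    """Return True if the whole subject matches the compiled pattern."""
    return pattern.fullmatch(subject) is not None
```

With these definitions, matches(BACKREF, "sense and responsibility") is False, while matches(CALLED, "sense and responsibility") is True.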

Like recursive subpatterns, a "subroutine" call is always treated as an
atomic group. That is, once it has matched some of the subject string,
it is never re-entered, even if it contains untried alternatives and
there is a subsequent matching failure.

When a subpattern is used as a subroutine, processing options such as
case-independence are fixed when the subpattern is defined. They cannot
be changed for different calls. For example, consider this pattern:

  (abc)(?i:(?-1))

It matches "abcabc". It does not match "abcABC" because the change of
processing option does not affect the called subpattern.
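
This is where a subroutine call differs from simply pasting the subpattern text a second time. The sketch below uses Python's re module, which is not PCRE and lacks (?-1), so both patterns are hand-written stand-ins with illustrative names: COPIED shows what a naive textual copy would do, and PINNED switches caselessness back off to mimic the behaviour described above.

```python
import re

# Naive textual copy: the copy inherits the surrounding (?i:...) setting,
# which is NOT how a PCRE subroutine call behaves.
COPIED = re.compile(r"(abc)(?i:abc)")

# Closer stand-in for (abc)(?i:(?-1)): the copy turns caselessness off
# again, mirroring the options in force where group 1 was defined.
PINNED = re.compile(r"(abc)(?i:(?-i:abc))")
```

COPIED accepts "abcABC", whereas PINNED accepts "abcabc" and rejects "abcABC", which is the result the PCRE pattern gives.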


CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes arbitrary
Perl code to be obeyed in the middle of matching a regular expression.
This makes it possible, amongst other things, to extract different
substrings that match the same pair of parentheses when there is a
repetition.

PCRE provides a similar feature, but of course it cannot obey arbitrary
Perl code. The feature is called "callout". The caller of PCRE provides
an external function by putting its entry point in the global variable
pcre_callout. By default, this variable contains NULL, which disables
all calling out.

Within a regular expression, (?C) indicates the points at which the
external function is to be called. If you want to identify different
callout points, you can put a number less than 256 after the letter C.
The default value is zero. For example, this pattern has two callout
points:

  (?C1)abc(?C2)def

If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
automatically installed before each item in the pattern. They are all
numbered 255.

During matching, when PCRE reaches a callout point (and pcre_callout is
set), the external function is called. It is provided with the number
of the callout, the position in the pattern, and, optionally, one item
of data originally supplied by the caller of pcre_exec(). The callout
function may cause matching to proceed, to backtrack, or to fail
altogether. A complete description of the interface to the callout
function is given in the pcrecallout documentation.

[...]

REVISION

Last updated: 06 August 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------


PCRESYNTAX(3)                                                    PCRESYNTAX(3)


NAME
PCRE - Perl-compatible regular expressions


PCRE REGULAR EXPRESSION SYNTAX SUMMARY

The full syntax and semantics of the regular expressions that are
supported by PCRE are described in the pcrepattern documentation. This
document contains just a quick-reference summary of the syntax.


QUOTING

  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal


CHARACTERS

  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..


CHARACTER TYPES

  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one byte, even in UTF-8 mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \p{xx}     a character with the xx property
  \P{xx}     a character without the xx property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence

In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.


GENERAL CATEGORY PROPERTY CODES FOR \p and \P

  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator


SCRIPT NAMES FOR \p AND \P

Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew,
Hiragana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.


CHARACTER CLASSES

  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit

In PCRE, POSIX character set names recognize only ASCII characters. You
can use \Q...\E inside a character class.


QUANTIFIERS

  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
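
The difference between greedy and lazy forms is easiest to see on a short subject. This is a small sketch using Python's re module, which is a separate engine from PCRE but treats these particular quantifiers the same way. The possessive forms are left out because older Python versions do not accept them. The variable names are illustrative.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back only what it must.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, extending only when forced to.
lazy = re.match(r"<.+?>", subject).group()

# The same distinction applies to bounded repeats.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()
```

Here greedy is "<a><b>" and lazy is "<a>"; the bounded forms give "aaa" and "aa" respectively.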


ANCHORS AND SIMPLE ASSERTIONS

  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject


MATCH POINT RESET

  \K          reset start of match


ALTERNATION

  expr|expr|expr...


CAPTURING

  (...)          capturing group
  (?<name>...)   named capturing group (Perl)
  (?'name'...)   named capturing group (Perl)
  (?P<name>...)  named capturing group (Python)
  (?:...)        non-capturing group
  (?|...)        non-capturing group; reset group numbers for
                  capturing groups in each alternative


ATOMIC GROUPS

  (?>...)        atomic, non-capturing group


COMMENT

  (?#....)       comment (not nestable)


OPTION SETTING

  (?i)           caseless
  (?J)           allow duplicate names
  (?m)           multiline
  (?s)           single line (dotall)
  (?U)           default ungreedy (lazy)
  (?x)           extended (ignore white space)
  (?-...)        unset option(s)


LOOKAHEAD AND LOOKBEHIND ASSERTIONS

  (?=...)        positive look ahead
  (?!...)        negative look ahead
  (?<=...)       positive look behind
  (?<!...)       negative look behind

Each top-level branch of a look behind must be of a fixed length.
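
The four assertions can be tried side by side. The following sketch uses Python's re module, which shares this syntax with PCRE but is a different engine with its own, stricter, fixed-width rule for look behind. The sample text and variable names are made up for illustration.

```python
import re

text = "price: $30, weight: 40kg, discount: $5"

# Positive look behind: digits immediately preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", text)

# Negative look behind: digits preceded by neither '$' nor another digit.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", text)

# Positive look ahead: digits immediately followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", text)

# Negative look ahead: digits followed by neither a digit nor "kg".
not_before_kg = re.findall(r"\d+(?!\d|kg)", text)
```

With this text, after_dollar and not_before_kg are both ['30', '5'], while not_after_dollar and before_kg are both ['40']. None of the assertions consumes the "$" or "kg" it tests for.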


BACKREFERENCES

  \n             reference by number (can be ambiguous)
  \gn            reference by number
  \g{n}          reference by number
  \g{-n}         relative reference by number
  \k<name>       reference by name (Perl)
  \k'name'       reference by name (Perl)
  \g{name}       reference by name (Perl)
  \k{name}       reference by name (.NET)
  (?P=name)      reference by name (Python)
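
As a brief illustration of the numbered and Python-style named forms, here is a sketch using Python's re module. That module accepts \n and (?P=name) but not the \g{...} or \k forms listed above, so only the shared syntax is shown. The pattern names are illustrative.

```python
import re

# Named back reference: the closing quote must be the same character
# that the group called "quote" matched.
QUOTED = re.compile(r"""(?P<quote>['"])(.*?)(?P=quote)""")

# Numbered back reference: the second word must repeat the first.
DOUBLED = re.compile(r"(\w+) \1")

single = QUOTED.search("say 'hi' now").group(2)
mixed = QUOTED.search("a \"b'c\" d").group(2)
```

Here single is "hi" and mixed is "b'c", because the lone apostrophe inside the double quotes cannot close them. QUOTED finds nothing in a subject such as 'abc" whose quotes do not pair up.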


SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)

  (?R)           recurse whole pattern
  (?n)           call subpattern by absolute number
  (?+n)          call subpattern by relative number
  (?-n)          call subpattern by relative number
  (?&name)       call subpattern by name (Perl)
  (?P>name)      call subpattern by name (Python)


CONDITIONAL PATTERNS

  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...         absolute reference condition
  (?(+n)...        relative reference condition
  (?(-n)...        relative reference condition
  (?(<name>)...    named reference condition (Perl)
  (?('name')...    named reference condition (Perl)
  (?(name)...      named reference condition (PCRE)
  (?(R)...         overall recursion condition
  (?(Rn)...        specific group recursion condition
  (?(R&name)...    specific recursion condition
  (?(DEFINE)...    define subpattern for reference
  (?(assert)...    assertion condition
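
The absolute reference condition is the simplest of these to demonstrate. The sketch below uses Python's re module, which supports the (?(n)yes-pattern) form but is not PCRE and lacks most of the other conditions in the table. The pattern accepts text that is either unparenthesized or wrapped in one matching pair of parentheses; the name OPTIONAL_PARENS is illustrative.

```python
import re

# Group 1 optionally matches an opening parenthesis. The condition (?(1)
# then demands a closing parenthesis only if group 1 took part in the match.
OPTIONAL_PARENS = re.compile(r"(\()?[^()]+(?(1)\))")

def accepted(subject):
    """Return True if the whole subject matches OPTIONAL_PARENS."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

Under this definition, accepted("abc") and accepted("(abc)") are True, while accepted("(abc") and accepted("abc)") are False.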


CALLOUTS

  (?C)           callout
  (?Cn)          callout with data n


SEE ALSO

pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).


AUTHOR

Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.


REVISION

Last updated: 06 August 2007
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------