30 |
|
|
31 |
const unsigned char *pcre_maketables(void); |
const unsigned char *pcre_maketables(void); |
32 |
|
|
33 |
|
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
34 |
|
int what, void *where); |
35 |
|
|
36 |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
37 |
|
|
38 |
char *pcre_version(void); |
char *pcre_version(void); |
49 |
lar expression pattern matching using the same syntax and |
lar expression pattern matching using the same syntax and |
50 |
semantics as Perl 5, with just a few differences (see |
semantics as Perl 5, with just a few differences (see |
51 |
below). The current implementation corresponds to Perl |
below). The current implementation corresponds to Perl |
52 |
5.005. |
5.005, with some additional features from the Perl develop- |
53 |
|
ment release. |
54 |
|
|
55 |
PCRE has its own native API, which is described in this |
PCRE has its own native API, which is described in this |
56 |
document. There is also a set of wrapper functions that |
document. There is also a set of wrapper functions that |
57 |
correspond to the POSIX API. These are described in the |
correspond to the POSIX regular expression API. These are |
58 |
pcreposix documentation. |
described in the pcreposix documentation. |
59 |
|
|
60 |
The native API function prototypes are defined in the header |
The native API function prototypes are defined in the header |
61 |
file pcre.h, and on Unix systems the library itself is |
file pcre.h, and on Unix systems the library itself is |
62 |
called libpcre.a, so can be accessed by adding -lpcre to the |
called libpcre.a, so can be accessed by adding -lpcre to the |
63 |
command for linking an application which calls it. |
command for linking an application which calls it. The |
64 |
|
header file defines the macros PCRE_MAJOR and PCRE_MINOR to |
65 |
|
contain the major and minor release numbers for the library. |
66 |
|
Applications can use these to include support for different |
67 |
|
releases. |
68 |
|
|
69 |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
70 |
are used for compiling and matching regular expressions, |
are used for compiling and matching regular expressions, |
75 |
to build a set of character tables in the current locale for |
to build a set of character tables in the current locale for |
76 |
passing to pcre_compile(). |
passing to pcre_compile(). |
77 |
|
|
78 |
The function pcre_info() is used to find out information |
The function pcre_fullinfo() is used to find out information |
79 |
about a compiled pattern, while the function pcre_version() |
about a compiled pattern; pcre_info() is an obsolete version |
80 |
returns a pointer to a string containing the version of PCRE |
which returns only some of the available information, but is |
81 |
and its date of release. |
retained for backwards compatibility. The function |
82 |
|
pcre_version() returns a pointer to a string containing the |
83 |
|
version of PCRE and its date of release. |
84 |
|
|
85 |
The global variables pcre_malloc and pcre_free initially |
The global variables pcre_malloc and pcre_free initially |
86 |
contain the entry points of the standard malloc() and free() |
contain the entry points of the standard malloc() and free() |
103 |
|
|
104 |
|
|
105 |
|
|
106 |
|
|
107 |
COMPILING A PATTERN |
COMPILING A PATTERN |
108 |
The function pcre_compile() is called to compile a pattern |
The function pcre_compile() is called to compile a pattern |
109 |
into an internal form. The pattern is a C string terminated |
into an internal form. The pattern is a C string terminated |
199 |
|
|
200 |
PCRE_EXTRA |
PCRE_EXTRA |
201 |
|
|
202 |
This option turns on additional functionality of PCRE that |
This option was invented in order to turn on additional |
203 |
is incompatible with Perl. Any backslash in a pattern that |
functionality of PCRE that is incompatible with Perl, but it |
204 |
is followed by a letter that has no special meaning causes |
is currently of very little use. When set, any backslash in |
205 |
an error, thus reserving these combinations for future |
a pattern that is followed by a letter that has no special |
206 |
expansion. By default, as in Perl, a backslash followed by a |
meaning causes an error, thus reserving these combinations |
207 |
letter with no special meaning is treated as a literal. |
for future expansion. By default, as in Perl, a backslash |
208 |
There are at present no other features controlled by this |
followed by a letter with no special meaning is treated as a |
209 |
option. |
literal. There are at present no other features controlled |
210 |
|
by this option. It can also be set by a (?X) option setting |
211 |
|
within a pattern. |
212 |
|
|
213 |
PCRE_MULTILINE |
PCRE_MULTILINE |
214 |
|
|
221 |
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl. |
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl. |
222 |
|
|
223 |
When PCRE_MULTILINE is set, the "start of line" and "end |
When PCRE_MULTILINE is set, the "start of line" and "end |
224 |
of line" constructs match immediately following or |
of line" constructs match immediately following or immedi- |
225 |
immediately before any newline in the subject string, |
ately before any newline in the subject string, respec- |
226 |
respectively, as well as at the very start and end. This is |
tively, as well as at the very start and end. This is |
227 |
equivalent to Perl's /m option. If there are no "\n" charac- |
equivalent to Perl's /m option. If there are no "\n" charac- |
228 |
ters in a subject string, or no occurrences of ^ or $ in a |
ters in a subject string, or no occurrences of ^ or $ in a |
229 |
pattern, setting PCRE_MULTILINE has no effect. |
pattern, setting PCRE_MULTILINE has no effect. |
298 |
|
|
299 |
|
|
300 |
INFORMATION ABOUT A PATTERN |
INFORMATION ABOUT A PATTERN |
301 |
The pcre_info() function returns information about a com- |
The pcre_fullinfo() function returns information about a |
302 |
piled pattern. Its yield is the number of capturing subpat- |
compiled pattern. It replaces the obsolete pcre_info() func- |
303 |
terns, or one of the following negative numbers: |
tion, which is nevertheless retained for backwards compatibil- |
304 |
|
ity (and is documented below). |
305 |
|
|
306 |
|
The first argument for pcre_fullinfo() is a pointer to the |
307 |
|
compiled pattern. The second argument is the result of |
308 |
|
pcre_study(), or NULL if the pattern was not studied. The |
309 |
|
third argument specifies which piece of information is |
310 |
|
required, while the fourth argument is a pointer to a vari- |
311 |
|
able to receive the data. The yield of the function is zero |
312 |
|
for success, or one of the following negative numbers: |
313 |
|
|
314 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
315 |
|
the argument where was NULL |
316 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
317 |
|
PCRE_ERROR_BADOPTION the value of what was invalid |
318 |
|
|
319 |
If the optptr argument is not NULL, a copy of the options |
The possible values for the third argument are defined in |
320 |
with which the pattern was compiled is placed in the integer |
pcre.h, and are as follows: |
321 |
it points to. These option bits are those specified in the |
|
322 |
|
PCRE_INFO_OPTIONS |
323 |
|
|
324 |
|
Return a copy of the options with which the pattern was com- |
325 |
|
piled. The fourth argument should point to an unsigned long |
326 |
|
int variable. These option bits are those specified in the |
327 |
call to pcre_compile(), modified by any top-level option |
call to pcre_compile(), modified by any top-level option |
328 |
settings within the pattern itself, and with the |
settings within the pattern itself, and with the |
329 |
PCRE_ANCHORED bit set if the form of the pattern implies |
PCRE_ANCHORED bit forcibly set if the form of the pattern |
330 |
that it can match only at the start of a subject string. |
implies that it can match only at the start of a subject |
331 |
|
string. |
332 |
|
|
333 |
If the pattern is not anchored and the firstcharptr argument |
PCRE_INFO_SIZE |
334 |
is not NULL, it is used to pass back information about the |
|
335 |
first character of any matched string. If there is a fixed |
Return the size of the compiled pattern, that is, the value |
336 |
first character, e.g. from a pattern such as |
that was passed as the argument to pcre_malloc() when PCRE |
337 |
|
was getting memory in which to place the compiled data. The |
338 |
|
fourth argument should point to a size_t variable. |
339 |
|
|
340 |
|
PCRE_INFO_CAPTURECOUNT |
341 |
|
|
342 |
|
Return the number of capturing subpatterns in the pattern. |
343 |
|
The fourth argument should point to an int variable. |
344 |
|
|
345 |
|
PCRE_INFO_BACKREFMAX |
346 |
|
|
347 |
|
Return the number of the highest back reference in the pat- |
348 |
|
tern. The fourth argument should point to an int variable. |
349 |
|
Zero is returned if there are no back references. |
350 |
|
|
351 |
|
PCRE_INFO_FIRSTCHAR |
352 |
|
|
353 |
|
Return information about the first character of any matched |
354 |
|
string, for a non-anchored pattern. If there is a fixed |
355 |
|
first character, e.g. from a pattern such as |
356 |
(cat|cow|coyote), then it is returned in the integer pointed |
(cat|cow|coyote), then it is returned in the integer pointed |
357 |
to by firstcharptr. Otherwise, if either |
to by where. Otherwise, if either |
358 |
|
|
359 |
(a) the pattern was compiled with the PCRE_MULTILINE option, |
(a) the pattern was compiled with the PCRE_MULTILINE option, |
360 |
and every branch starts with "^", or |
and every branch starts with "^", or |
362 |
(b) every branch of the pattern starts with ".*" and |
(b) every branch of the pattern starts with ".*" and |
363 |
PCRE_DOTALL is not set (if it were set, the pattern would be |
PCRE_DOTALL is not set (if it were set, the pattern would be |
364 |
anchored), |
anchored), |
365 |
|
|
366 |
then -1 is returned, indicating that the pattern matches |
then -1 is returned, indicating that the pattern matches |
367 |
only at the start of a subject string or after any "\n" |
only at the start of a subject string or after any "\n" |
368 |
within the string. Otherwise -2 is returned. |
within the string. Otherwise -2 is returned. For anchored |
369 |
|
patterns, -2 is returned. |
370 |
|
|
371 |
|
PCRE_INFO_FIRSTTABLE |
372 |
|
|
373 |
|
If the pattern was studied, and this resulted in the con- |
374 |
|
struction of a 256-bit table indicating a fixed set of char- |
375 |
|
acters for the first character in any matching string, a |
376 |
|
pointer to the table is returned. Otherwise NULL is |
377 |
|
returned. The fourth argument should point to an unsigned |
378 |
|
char * variable. |
379 |
|
|
380 |
|
PCRE_INFO_LASTLITERAL |
381 |
|
|
382 |
|
For a non-anchored pattern, return the value of the right- |
383 |
|
most literal character which must exist in any matched |
384 |
|
string, other than at its start. The fourth argument should |
385 |
|
point to an int variable. If there is no such character, or |
386 |
|
if the pattern is anchored, -1 is returned. For example, for |
387 |
|
the pattern /a\d+z\d+/ the returned value is 'z'. |
388 |
|
|
389 |
|
The pcre_info() function is now obsolete because its inter- |
390 |
|
face is too restrictive to return all the available data |
391 |
|
about a compiled pattern. New programs should use |
392 |
|
pcre_fullinfo() instead. The yield of pcre_info() is the |
393 |
|
number of capturing subpatterns, or one of the following |
394 |
|
negative numbers: |
395 |
|
|
396 |
|
PCRE_ERROR_NULL the argument code was NULL |
397 |
|
PCRE_ERROR_BADMAGIC the "magic number" was not found |
398 |
|
|
399 |
|
If the optptr argument is not NULL, a copy of the options |
400 |
|
with which the pattern was compiled is placed in the integer |
401 |
|
it points to (see PCRE_INFO_OPTIONS above). |
402 |
|
|
403 |
|
If the pattern is not anchored and the firstcharptr argument |
404 |
|
is not NULL, it is used to pass back information about the |
405 |
|
first character of any matched string (see |
406 |
|
PCRE_INFO_FIRSTCHAR above). |
407 |
|
|
408 |
|
|
409 |
|
|
729 |
6. The Perl \G assertion is not supported as it is not |
6. The Perl \G assertion is not supported as it is not |
730 |
relevant to single pattern matches. |
relevant to single pattern matches. |
731 |
|
|
732 |
7. Fairly obviously, PCRE does not support the (?{code}) |
7. Fairly obviously, PCRE does not support the (?{code}) and |
733 |
construction. |
(?p{code}) constructions. However, there is some experimen- |
734 |
|
tal support for recursive patterns using the non-Perl item |
735 |
|
(?R). |
736 |
8. There are at the time of writing some oddities in Perl |
8. There are at the time of writing some oddities in Perl |
737 |
5.005_02 concerned with the settings of captured strings |
5.005_02 concerned with the settings of captured strings |
738 |
when part of a pattern is repeated. For example, matching |
when part of a pattern is repeated. For example, matching |
765 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
766 |
with no special meaning is faulted. |
with no special meaning is faulted. |
767 |
|
|
768 |
(d) If PCRE_UNGREEDY is set, the greediness of the |
(d) If PCRE_UNGREEDY is set, the greediness of the repeti- |
769 |
repetition quantifiers is inverted, that is, by default they |
tion quantifiers is inverted, that is, by default they are |
770 |
are not greedy, but if followed by a question mark they are. |
not greedy, but if followed by a question mark they are. |
771 |
|
|
772 |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
773 |
only at the start of the subject. |
only at the start of the subject. |
775 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options |
776 |
for pcre_exec() have no Perl equivalents. |
for pcre_exec() have no Perl equivalents. |
777 |
|
|
778 |
|
(g) The (?R) construct allows for recursive pattern matching |
779 |
|
(Perl 5.6 can do this using the (?p{code}) construct, which |
780 |
|
PCRE cannot of course support.) |
781 |
|
|
782 |
|
|
783 |
|
|
784 |
REGULAR EXPRESSION DETAILS |
REGULAR EXPRESSION DETAILS |
785 |
The syntax and semantics of the regular expressions sup- |
The syntax and semantics of the regular expressions sup- |
786 |
ported by PCRE are described below. Regular expressions are |
ported by PCRE are described below. Regular expressions are |
787 |
also described in the Perl documentation and in a number of |
also described in the Perl documentation and in a number of |
788 |
|
|
789 |
other books, some of which have copious examples. Jeffrey |
other books, some of which have copious examples. Jeffrey |
790 |
Friedl's "Mastering Regular Expressions", published by |
Friedl's "Mastering Regular Expressions", published by |
791 |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
792 |
The description here is intended as reference documentation. |
The description here is intended as reference documentation. |
793 |
|
|
794 |
A regular expression is a pattern that is matched against a |
A regular expression is a pattern that is matched against a |
875 |
\f formfeed (hex 0C) |
\f formfeed (hex 0C) |
876 |
\n newline (hex 0A) |
\n newline (hex 0A) |
877 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
878 |
|
\t tab (hex 09) |
|
tab (hex 09) |
|
879 |
\xhh character with hex code hh |
\xhh character with hex code hh |
880 |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or backreference |
881 |
|
|
927 |
Note that octal values of 100 or greater must not be intro- |
Note that octal values of 100 or greater must not be intro- |
928 |
duced by a leading zero, because no more than three octal |
duced by a leading zero, because no more than three octal |
929 |
digits are ever read. |
digits are ever read. |
930 |
|
|
931 |
All the sequences that define a single byte value can be |
All the sequences that define a single byte value can be |
932 |
used both inside and outside character classes. In addition, |
used both inside and outside character classes. In addition, |
933 |
inside a character class, the sequence "\b" is interpreted |
inside a character class, the sequence "\b" is interpreted |
980 |
These assertions may not appear in character classes (but |
These assertions may not appear in character classes (but |
981 |
note that "\b" has a different meaning, namely the backspace |
note that "\b" has a different meaning, namely the backspace |
982 |
character, inside a character class). |
character, inside a character class). |
983 |
|
|
984 |
A word boundary is a position in the subject string where |
A word boundary is a position in the subject string where |
985 |
the current character and the previous character do not both |
the current character and the previous character do not both |
986 |
match \w or \W (i.e. one matches \w and the other matches |
match \w or \W (i.e. one matches \w and the other matches |
1142 |
|
|
1143 |
|
|
1144 |
|
|
1145 |
|
POSIX CHARACTER CLASSES |
1146 |
|
Perl 5.6 (not yet released at the time of writing) is going |
1147 |
|
to support the POSIX notation for character classes, which |
1148 |
|
uses names enclosed by [: and :] within the enclosing |
1149 |
|
square brackets. PCRE supports this notation. For example, |
1150 |
|
|
1151 |
|
[01[:alpha:]%] |
1152 |
|
|
1153 |
|
matches "0", "1", any alphabetic character, or "%". The sup- |
1154 |
|
ported class names are |
1155 |
|
|
1156 |
|
alnum letters and digits |
1157 |
|
alpha letters |
1158 |
|
ascii character codes 0 - 127 |
1159 |
|
cntrl control characters |
1160 |
|
digit decimal digits (same as \d) |
1161 |
|
graph printing characters, excluding space |
1162 |
|
lower lower case letters |
1163 |
|
print printing characters, including space |
1164 |
|
punct printing characters, excluding letters and digits |
1165 |
|
space white space (same as \s) |
1166 |
|
upper upper case letters |
1167 |
|
word "word" characters (same as \w) |
1168 |
|
xdigit hexadecimal digits |
1169 |
|
|
1170 |
|
The names "ascii" and "word" are Perl extensions. Another |
1171 |
|
Perl extension is negation, which is indicated by a ^ char- |
1172 |
|
acter after the colon. For example, |
1173 |
|
|
1174 |
|
[12[:^digit:]] |
1175 |
|
|
1176 |
|
matches "1", "2", or any non-digit. PCRE (and Perl) also |
1177 |
|
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a |
1178 |
|
"collating element", but these are not supported, and an |
1179 |
|
error is given if they are encountered. |
1180 |
|
|
1181 |
|
|
1182 |
|
|
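The class names listed above can be read as predicates on a single byte. The following is a minimal sketch only, not PCRE's implementation: it assumes the C locale and assumes that each name behaves like the similarly named <ctype.h> function (so "space" here is whatever isspace() accepts). The helper names in_posix_class, match_example_class and match_negated_example are hypothetical and exist only for this illustration.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: nonzero if byte c belongs to the named class.
   The mapping onto <ctype.h> is an assumption made for illustration. */
int in_posix_class(const char *name, int c)
{
    unsigned char u = (unsigned char)c;

    if (strcmp(name, "alnum") == 0)  return isalnum(u) != 0;
    if (strcmp(name, "alpha") == 0)  return isalpha(u) != 0;
    if (strcmp(name, "ascii") == 0)  return u < 128;
    if (strcmp(name, "cntrl") == 0)  return iscntrl(u) != 0;
    if (strcmp(name, "digit") == 0)  return isdigit(u) != 0;
    if (strcmp(name, "graph") == 0)  return isgraph(u) != 0;
    if (strcmp(name, "lower") == 0)  return islower(u) != 0;
    if (strcmp(name, "print") == 0)  return isprint(u) != 0;
    if (strcmp(name, "punct") == 0)  return ispunct(u) != 0;
    if (strcmp(name, "space") == 0)  return isspace(u) != 0;
    if (strcmp(name, "upper") == 0)  return isupper(u) != 0;
    if (strcmp(name, "word") == 0)   return isalnum(u) != 0 || u == '_';
    if (strcmp(name, "xdigit") == 0) return isxdigit(u) != 0;
    return 0;  /* unknown class name */
}

/* Equivalent of the class [01[:alpha:]%] for one byte. */
int match_example_class(int c)
{
    return c == '0' || c == '1' || in_posix_class("alpha", c) || c == '%';
}

/* Equivalent of the class [12[:^digit:]] for one byte. */
int match_negated_example(int c)
{
    return c == '1' || c == '2' || !in_posix_class("digit", c);
}
```

Under these assumptions, match_example_class('x') is nonzero and match_example_class('2') is zero, while match_negated_example('7') is zero because 7 is a digit other than 1 or 2.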
1183 |
VERTICAL BAR |
VERTICAL BAR |
1184 |
Vertical bar characters are used to separate alternative |
Vertical bar characters are used to separate alternative |
1185 |
patterns. For example, the pattern |
patterns. For example, the pattern |
1331 |
Repetition is specified by quantifiers, which can follow any |
Repetition is specified by quantifiers, which can follow any |
1332 |
of the following items: |
of the following items: |
1333 |
|
|
|
|
|
1334 |
a single character, possibly escaped |
a single character, possibly escaped |
1335 |
the . metacharacter |
the . metacharacter |
1336 |
a character class |
a character class |
1517 |
A back reference that occurs inside the parentheses to which |
A back reference that occurs inside the parentheses to which |
1518 |
it refers fails when the subpattern is first used, so, for |
it refers fails when the subpattern is first used, so, for |
1519 |
example, (a\1) never matches. However, such references can |
example, (a\1) never matches. However, such references can |
1520 |
be useful inside repeated subpatterns. For example, the pat- |
be useful inside repeated subpatterns. For example, the |
1521 |
tern |
pattern |
1522 |
|
|
1523 |
(a|b\1)+ |
(a|b\1)+ |
1524 |
|
|
1540 |
cated assertions are coded as subpatterns. There are two |
cated assertions are coded as subpatterns. There are two |
1541 |
kinds: those that look ahead of the current position in the |
kinds: those that look ahead of the current position in the |
1542 |
subject string, and those that look behind it. |
subject string, and those that look behind it. |
1543 |
|
|
1544 |
An assertion subpattern is matched in the normal way, except |
An assertion subpattern is matched in the normal way, except |
1545 |
that it does not cause the current matching position to be |
that it does not cause the current matching position to be |
1546 |
changed. Lookahead assertions start with (?= for positive |
changed. Lookahead assertions start with (?= for positive |
1706 |
|
|
1707 |
abcd$ |
abcd$ |
1708 |
|
|
1709 |
when applied to a long string which does not match it. |
when applied to a long string which does not match. Because |
1710 |
Because matching proceeds from left to right, PCRE will look |
matching proceeds from left to right, PCRE will look for |
1711 |
for each "a" in the subject and then see if what follows |
each "a" in the subject and then see if what follows matches |
1712 |
matches the rest of the pattern. If the pattern is specified |
the rest of the pattern. If the pattern is specified as |
|
as |
|
1713 |
|
|
1714 |
^.*abcd$ |
^.*abcd$ |
1715 |
|
|
1716 |
then the initial .* matches the entire string at first, but |
then the initial .* matches the entire string at first, but |
1717 |
when this fails, it backtracks to match all but the last |
when this fails (because there is no following "a"), it |
1718 |
character, then all but the last two characters, and so on. |
backtracks to match all but the last character, then all but |
1719 |
Once again the search for "a" covers the entire string, from |
the last two characters, and so on. Once again the search |
1720 |
right to left, so we are no better off. However, if the pat- |
for "a" covers the entire string, from right to left, so we |
1721 |
tern is written as |
are no better off. However, if the pattern is written as |
1722 |
|
|
1723 |
^(?>.*)(?<=abcd) |
^(?>.*)(?<=abcd) |
1724 |
|
|
1729 |
this approach makes a significant difference to the process- |
this approach makes a significant difference to the process- |
1730 |
ing time. |
ing time. |
1731 |
|
|
1732 |
|
When a pattern contains an unlimited repeat inside a subpat- |
1733 |
|
tern that can itself be repeated an unlimited number of |
1734 |
|
times, the use of a once-only subpattern is the only way to |
1735 |
|
avoid some failing matches taking a very long time indeed. |
1736 |
|
The pattern |
1737 |
|
|
1738 |
|
(\D+|<\d+>)*[!?] |
1739 |
|
|
1740 |
|
matches an unlimited number of substrings that either con- |
1741 |
|
sist of non-digits, or digits enclosed in <>, followed by |
1742 |
|
either ! or ?. When it matches, it runs quickly. However, if |
1743 |
|
it is applied to |
1744 |
|
|
1745 |
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
1746 |
|
|
1747 |
|
it takes a long time before reporting failure. This is |
1748 |
|
because the string can be divided between the two repeats in |
1749 |
|
a large number of ways, and all have to be tried. (The exam- |
1750 |
|
ple used [!?] rather than a single character at the end, |
1751 |
|
because both PCRE and Perl have an optimization that allows |
1752 |
|
for fast failure when a single character is used. They |
1753 |
|
remember the last single character that is required for a |
1754 |
|
match, and fail early if it is not present in the string.) |
1755 |
|
If the pattern is changed to |
1756 |
|
|
1757 |
|
((?>\D+)|<\d+>)*[!?] |
1758 |
|
|
1759 |
|
sequences of non-digits cannot be broken, and failure hap- |
1760 |
|
pens quickly. |
1761 |
|
|
1762 |
|
|
1763 |
|
|
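To get a feel for why the failing match is slow, it may help to count the ways a run of n non-digit characters can be cut into consecutive non-empty pieces, each piece being one iteration of \D+ inside the outer repeat. The sketch below is a rough model only (it does not describe PCRE's internal search order), and count_splits is a hypothetical name introduced for this illustration.

```c
/* Number of ways to cut n characters into consecutive non-empty runs.
   ways[len] sums over every possible length of the first run.
   Returns 0 if n is too large for this sketch's fixed table. */
unsigned long long count_splits(unsigned n)
{
    unsigned long long ways[64] = {0};
    unsigned len, first;

    if (n >= 64)
        return 0;

    ways[0] = 1;  /* the empty string has exactly one (empty) division */
    for (len = 1; len <= n; len++)
        for (first = 1; first <= len; first++)
            ways[len] += ways[len - first];

    return ways[n];
}
```

The count doubles with every extra character (it is 2 to the power n-1 for n of at least 1), so a subject of 50 such characters already admits 562949953421312 divisions. With the once-only form (?>\D+) a run can only be taken whole, which leaves a single division to try.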
1764 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
1831 |
|
|
1832 |
|
|
1833 |
|
|
1834 |
|
RECURSIVE PATTERNS |
1835 |
|
Consider the problem of matching a string in parentheses, |
1836 |
|
allowing for unlimited nested parentheses. Without the use |
1837 |
|
of recursion, the best that can be done is to use a pattern |
1838 |
|
that matches up to some fixed depth of nesting. It is not |
1839 |
|
possible to handle an arbitrary nesting depth. Perl 5.6 has |
1840 |
|
provided an experimental facility that allows regular |
1841 |
|
expressions to recurse (amongst other things). It does this |
1842 |
|
by interpolating Perl code in the expression at run time, |
1843 |
|
and the code can refer to the expression itself. A Perl pat- |
1844 |
|
tern to solve the parentheses problem can be created like |
1845 |
|
this: |
1846 |
|
|
1847 |
|
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
1848 |
|
|
1849 |
|
The (?p{...}) item interpolates Perl code at run time, and |
1850 |
|
in this case refers recursively to the pattern in which it |
1851 |
|
appears. Obviously, PCRE cannot support the interpolation of |
1852 |
|
Perl code. Instead, the special item (?R) is provided for |
1853 |
|
the specific case of recursion. This PCRE pattern solves the |
1854 |
|
parentheses problem (assume the PCRE_EXTENDED option is set |
1855 |
|
so that white space is ignored): |
1856 |
|
|
1857 |
|
\( ( (?>[^()]+) | (?R) )* \) |
1858 |
|
|
1859 |
|
First it matches an opening parenthesis. Then it matches any |
1860 |
|
number of substrings which can either be a sequence of non- |
1861 |
|
parentheses, or a recursive match of the pattern itself |
1862 |
|
(i.e. a correctly parenthesized substring). Finally there is |
1863 |
|
a closing parenthesis. |
1864 |
|
|
1865 |
|
This particular example pattern contains nested unlimited |
1866 |
|
repeats, and so the use of a once-only subpattern for match- |
1867 |
|
ing strings of non-parentheses is important when applying |
1868 |
|
the pattern to strings that do not match. For example, when |
1869 |
|
it is applied to |
1870 |
|
|
1871 |
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
1872 |
|
|
1873 |
|
it yields "no match" quickly. However, if a once-only sub- |
1874 |
|
pattern is not used, the match runs for a very long time |
1875 |
|
indeed because there are so many different ways the + and * |
1876 |
|
repeats can carve up the subject, and all have to be tested |
1877 |
|
before failure can be reported. |
1878 |
|
|
1879 |
|
The values set for any capturing subpatterns are those from |
1880 |
|
the outermost level of the recursion at which the subpattern |
1881 |
|
value is set. If the pattern above is matched against |
1882 |
|
|
1883 |
|
(ab(cd)ef) |
1884 |
|
|
1885 |
|
the value for the capturing parentheses is "ef", which is |
1886 |
|
the last value taken on at the top level. If additional |
1887 |
|
parentheses are added, giving |
1888 |
|
|
1889 |
|
\( ( ( (?>[^()]+) | (?R) )* ) \) |
1890 |
|
   ^                        ^ |
1891 |
|
   ^                        ^ then the string they capture |
1892 |
|
is "ab(cd)ef", the contents of the top level parentheses. If |
1893 |
|
there are more than 15 capturing parentheses in a pattern, |
1894 |
|
PCRE has to obtain extra memory to store data during a |
1895 |
|
recursion, which it does by using pcre_malloc, freeing it |
1896 |
|
via pcre_free afterwards. If no memory can be obtained, it |
1897 |
|
saves data for the first 15 capturing parentheses only, as |
1898 |
|
there is no way to give an out-of-memory error from within a |
1899 |
|
recursion. |
1900 |
|
|
1901 |
|
|
1902 |
|
|
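The set of strings accepted by the recursive parentheses pattern can also be described by a small hand-written recursive function. This is a sketch under stated assumptions, not a use of the PCRE API: it tries a match only at the start of the given string (it does not search along the subject), it captures nothing, and the name match_group is hypothetical.

```c
#include <stddef.h>

/* Returns a pointer just past the parenthesized group that starts at s,
   or NULL if s does not begin with a correctly nested group. */
const char *match_group(const char *s)
{
    if (*s != '(')                      /* \( */
        return NULL;
    s++;

    for (;;) {
        if (*s == '(') {                /* (?R): a nested group */
            s = match_group(s);
            if (s == NULL)
                return NULL;
        } else if (*s != ')' && *s != '\0') {
            /* (?>[^()]+): take the whole run and never give any back */
            while (*s != '(' && *s != ')' && *s != '\0')
                s++;
        } else {
            break;                      /* neither alternative applies */
        }
    }

    return (*s == ')') ? s + 1 : NULL;  /* \) */
}
```

Applied to "(ab(cd)ef)" the function returns a pointer to the end of the string. Applied to an unbalanced subject such as "(aaaa()" it returns NULL after a single left-to-right pass, which mirrors the quick failure of the once-only form of the pattern.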
1903 |
PERFORMANCE |
PERFORMANCE |
1904 |
Certain items that may appear in patterns are more efficient |
Certain items that may appear in patterns are more efficient |
1905 |
than others. It is more efficient to use a character class |
than others. It is more efficient to use a character class |
1974 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
1975 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
1976 |
|
|
1977 |
Last updated: 29 July 1999 |
Last updated: 27 January 2000 |
1978 |
Copyright (c) 1997-1999 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |