25 |
<LI><A NAME="TOC15" HREF="#SEC15">CIRCUMFLEX AND DOLLAR</A> |
<LI><A NAME="TOC15" HREF="#SEC15">CIRCUMFLEX AND DOLLAR</A> |
26 |
<LI><A NAME="TOC16" HREF="#SEC16">FULL STOP (PERIOD, DOT)</A> |
<LI><A NAME="TOC16" HREF="#SEC16">FULL STOP (PERIOD, DOT)</A> |
27 |
<LI><A NAME="TOC17" HREF="#SEC17">SQUARE BRACKETS</A> |
<LI><A NAME="TOC17" HREF="#SEC17">SQUARE BRACKETS</A> |
28 |
<LI><A NAME="TOC18" HREF="#SEC18">VERTICAL BAR</A> |
<LI><A NAME="TOC18" HREF="#SEC18">POSIX CHARACTER CLASSES</A> |
29 |
<LI><A NAME="TOC19" HREF="#SEC19">INTERNAL OPTION SETTING</A> |
<LI><A NAME="TOC19" HREF="#SEC19">VERTICAL BAR</A> |
30 |
<LI><A NAME="TOC20" HREF="#SEC20">SUBPATTERNS</A> |
<LI><A NAME="TOC20" HREF="#SEC20">INTERNAL OPTION SETTING</A> |
31 |
<LI><A NAME="TOC21" HREF="#SEC21">REPETITION</A> |
<LI><A NAME="TOC21" HREF="#SEC21">SUBPATTERNS</A> |
32 |
<LI><A NAME="TOC22" HREF="#SEC22">BACK REFERENCES</A> |
<LI><A NAME="TOC22" HREF="#SEC22">REPETITION</A> |
33 |
<LI><A NAME="TOC23" HREF="#SEC23">ASSERTIONS</A> |
<LI><A NAME="TOC23" HREF="#SEC23">BACK REFERENCES</A> |
34 |
<LI><A NAME="TOC24" HREF="#SEC24">ONCE-ONLY SUBPATTERNS</A> |
<LI><A NAME="TOC24" HREF="#SEC24">ASSERTIONS</A> |
35 |
<LI><A NAME="TOC25" HREF="#SEC25">CONDITIONAL SUBPATTERNS</A> |
<LI><A NAME="TOC25" HREF="#SEC25">ONCE-ONLY SUBPATTERNS</A> |
36 |
<LI><A NAME="TOC26" HREF="#SEC26">COMMENTS</A> |
<LI><A NAME="TOC26" HREF="#SEC26">CONDITIONAL SUBPATTERNS</A> |
37 |
<LI><A NAME="TOC27" HREF="#SEC27">PERFORMANCE</A> |
<LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A> |
38 |
<LI><A NAME="TOC28" HREF="#SEC28">AUTHOR</A> |
<LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A> |
39 |
|
<LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A> |
40 |
|
<LI><A NAME="TOC30" HREF="#SEC30">AUTHOR</A> |
41 |
</UL> |
</UL> |
42 |
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A> |
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A> |
43 |
<P> |
<P> |
79 |
<B>const unsigned char *pcre_maketables(void);</B> |
<B>const unsigned char *pcre_maketables(void);</B> |
80 |
</P> |
</P> |
81 |
<P> |
<P> |
82 |
|
<B>int pcre_fullinfo(const pcre *<I>code</I>, const pcre_extra *<I>extra</I>,</B> |
83 |
|
<B>int <I>what</I>, void *<I>where</I>);</B> |
84 |
|
</P> |
85 |
|
<P> |
86 |
<B>int pcre_info(const pcre *<I>code</I>, int *<I>optptr</I>, int</B> |
<B>int pcre_info(const pcre *<I>code</I>, int *<I>optptr</I>, int</B> |
87 |
<B>*<I>firstcharptr</I>);</B> |
<B>*<I>firstcharptr</I>);</B> |
88 |
</P> |
</P> |
99 |
<P> |
<P> |
100 |
The PCRE library is a set of functions that implement regular expression |
The PCRE library is a set of functions that implement regular expression |
101 |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
102 |
differences (see below). The current implementation corresponds to Perl 5.005. |
differences (see below). The current implementation corresponds to Perl 5.005, |
103 |
|
with some additional features from the Perl development release. |
104 |
</P> |
</P> |
105 |
<P> |
<P> |
106 |
PCRE has its own native API, which is described in this document. There is also |
PCRE has its own native API, which is described in this document. There is also |
107 |
a set of wrapper functions that correspond to the POSIX API. These are |
a set of wrapper functions that correspond to the POSIX regular expression API. |
108 |
described in the <B>pcreposix</B> documentation. |
These are described in the <B>pcreposix</B> documentation. |
109 |
</P> |
</P> |
110 |
<P> |
<P> |
111 |
The native API function prototypes are defined in the header file <B>pcre.h</B>, |
The native API function prototypes are defined in the header file <B>pcre.h</B>, |
112 |
and on Unix systems the library itself is called <B>libpcre.a</B>, so can be |
and on Unix systems the library itself is called <B>libpcre.a</B>, so can be |
113 |
accessed by adding <B>-lpcre</B> to the command for linking an application which |
accessed by adding <B>-lpcre</B> to the command for linking an application which |
114 |
calls it. |
calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to |
115 |
|
contain the major and minor release numbers for the library. Applications can |
116 |
|
use these to include support for different releases. |
117 |
</P> |
</P> |
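<P>
As a rough sketch of how an application might use these macros, the fragment
below selects a code path at compile time. The fallback values, the helper
macro PCRE_AT_LEAST, the function info_function_name(), and the 3.0 threshold
are all assumptions made for illustration; none of them is part of PCRE. A real
program would take PCRE_MAJOR and PCRE_MINOR from <B>pcre.h</B>.
</P>

```c
/* A real program would get these from <pcre.h>; the fallback values
   below are stand-ins so that this sketch compiles on its own. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 3
#define PCRE_MINOR 0
#endif

/* Hypothetical helper macro (not part of PCRE): true if the header
   describes release maj.min or a later one. */
#define PCRE_AT_LEAST(maj, min) \
  (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Name the information function a program would call. The 3.0
   threshold is an assumption chosen for illustration only. */
static const char *info_function_name(void)
{
#if PCRE_AT_LEAST(3, 0)
  return "pcre_fullinfo";
#else
  return "pcre_info";
#endif
}
```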
118 |
<P> |
<P> |
119 |
The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B> |
The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B> |
125 |
in the current locale for passing to <B>pcre_compile()</B>. |
in the current locale for passing to <B>pcre_compile()</B>. |
126 |
</P> |
</P> |
127 |
<P> |
<P> |
128 |
The function <B>pcre_info()</B> is used to find out information about a compiled |
The function <B>pcre_fullinfo()</B> is used to find out information about a |
129 |
pattern, while the function <B>pcre_version()</B> returns a pointer to a string |
compiled pattern; <B>pcre_info()</B> is an obsolete version which returns only |
130 |
containing the version of PCRE and its date of release. |
some of the available information, but is retained for backwards compatibility. |
131 |
|
The function <B>pcre_version()</B> returns a pointer to a string containing the |
132 |
|
version of PCRE and its date of release. |
133 |
</P> |
</P> |
134 |
<P> |
<P> |
135 |
The global variables <B>pcre_malloc</B> and <B>pcre_free</B> initially contain |
The global variables <B>pcre_malloc</B> and <B>pcre_free</B> initially contain |
257 |
</PRE> |
</PRE> |
258 |
</P> |
</P> |
259 |
<P> |
<P> |
260 |
This option turns on additional functionality of PCRE that is incompatible with |
This option was invented in order to turn on additional functionality of PCRE |
261 |
Perl. Any backslash in a pattern that is followed by a letter that has no |
that is incompatible with Perl, but it is currently of very little use. When |
262 |
|
set, any backslash in a pattern that is followed by a letter that has no |
263 |
special meaning causes an error, thus reserving these combinations for future |
special meaning causes an error, thus reserving these combinations for future |
264 |
expansion. By default, as in Perl, a backslash followed by a letter with no |
expansion. By default, as in Perl, a backslash followed by a letter with no |
265 |
special meaning is treated as a literal. There are at present no other features |
special meaning is treated as a literal. There are at present no other features |
266 |
controlled by this option. |
controlled by this option. It can also be set by a (?X) option setting within a |
267 |
|
pattern. |
268 |
</P> |
</P> |
269 |
<P> |
<P> |
270 |
<PRE> |
<PRE> |
355 |
</P> |
</P> |
356 |
<LI><A NAME="SEC8" HREF="#TOC1">INFORMATION ABOUT A PATTERN</A> |
<LI><A NAME="SEC8" HREF="#TOC1">INFORMATION ABOUT A PATTERN</A> |
357 |
<P> |
<P> |
358 |
The <B>pcre_info()</B> function returns information about a compiled pattern. |
The <B>pcre_fullinfo()</B> function returns information about a compiled |
359 |
Its yield is the number of capturing subpatterns, or one of the following |
pattern. It replaces the obsolete <B>pcre_info()</B> function, which is |
360 |
negative numbers: |
nevertheless retained for backwards compatibility (and is documented below). |
361 |
|
</P> |
362 |
|
<P> |
363 |
|
The first argument for <B>pcre_fullinfo()</B> is a pointer to the compiled |
364 |
|
pattern. The second argument is the result of <B>pcre_study()</B>, or NULL if |
365 |
|
the pattern was not studied. The third argument specifies which piece of |
366 |
|
information is required, while the fourth argument is a pointer to a variable |
367 |
|
to receive the data. The yield of the function is zero for success, or one of |
368 |
|
the following negative numbers: |
369 |
</P> |
</P> |
370 |
<P> |
<P> |
371 |
<PRE> |
<PRE> |
372 |
PCRE_ERROR_NULL the argument <I>code</I> was NULL |
PCRE_ERROR_NULL the argument <I>code</I> was NULL |
373 |
|
the argument <I>where</I> was NULL |
374 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
375 |
|
PCRE_ERROR_BADOPTION the value of <I>what</I> was invalid |
376 |
</PRE> |
</PRE> |
377 |
</P> |
</P> |
378 |
<P> |
<P> |
379 |
If the <I>optptr</I> argument is not NULL, a copy of the options with which the |
The possible values for the third argument are defined in <B>pcre.h</B>, and are |
380 |
pattern was compiled is placed in the integer it points to. These option bits |
as follows: |
381 |
|
</P> |
382 |
|
<P> |
383 |
|
<PRE> |
384 |
|
PCRE_INFO_OPTIONS |
385 |
|
</PRE> |
386 |
|
</P> |
387 |
|
<P> |
388 |
|
Return a copy of the options with which the pattern was compiled. The fourth |
389 |
|
argument should point to an <B>unsigned long int</B> variable. These option bits |
390 |
are those specified in the call to <B>pcre_compile()</B>, modified by any |
are those specified in the call to <B>pcre_compile()</B>, modified by any |
391 |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
392 |
bit set if the form of the pattern implies that it can match only at the start |
bit forcibly set if the form of the pattern implies that it can match only at |
393 |
of a subject string. |
the start of a subject string. |
394 |
</P> |
</P> |
395 |
<P> |
<P> |
396 |
If the pattern is not anchored and the <I>firstcharptr</I> argument is not NULL, |
<PRE> |
397 |
it is used to pass back information about the first character of any matched |
PCRE_INFO_SIZE |
398 |
string. If there is a fixed first character, e.g. from a pattern such as |
</PRE> |
399 |
(cat|cow|coyote), then it is returned in the integer pointed to by |
</P> |
400 |
<I>firstcharptr</I>. Otherwise, if either |
<P> |
401 |
|
Return the size of the compiled pattern, that is, the value that was passed as |
402 |
|
the argument to <B>pcre_malloc()</B> when PCRE was getting memory in which to |
403 |
|
place the compiled data. The fourth argument should point to a <B>size_t</B> |
404 |
|
variable. |
405 |
|
</P> |
406 |
|
<P> |
407 |
|
<PRE> |
408 |
|
PCRE_INFO_CAPTURECOUNT |
409 |
|
</PRE> |
410 |
|
</P> |
411 |
|
<P> |
412 |
|
Return the number of capturing subpatterns in the pattern. The fourth argument |
413 |
|
should point to an <B>int</B> variable. |
414 |
|
</P> |
415 |
|
<P> |
416 |
|
<PRE> |
417 |
|
PCRE_INFO_BACKREFMAX |
418 |
|
</PRE> |
419 |
|
</P> |
420 |
|
<P> |
421 |
|
Return the number of the highest back reference in the pattern. The fourth |
422 |
|
argument should point to an <B>int</B> variable. Zero is returned if there are |
423 |
|
no back references. |
424 |
|
</P> |
425 |
|
<P> |
426 |
|
<PRE> |
427 |
|
PCRE_INFO_FIRSTCHAR |
428 |
|
</PRE> |
429 |
|
</P> |
430 |
|
<P> |
431 |
|
Return information about the first character of any matched string, for a |
432 |
|
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
433 |
|
such as (cat|cow|coyote), then it is returned in the integer pointed to by |
434 |
|
<I>where</I>. Otherwise, if either |
435 |
</P> |
</P> |
436 |
<P> |
<P> |
437 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
444 |
<P> |
<P> |
445 |
then -1 is returned, indicating that the pattern matches only at the |
then -1 is returned, indicating that the pattern matches only at the |
446 |
start of a subject string or after any "\n" within the string. Otherwise -2 is |
start of a subject string or after any "\n" within the string. Otherwise -2 is |
447 |
returned. |
returned. For anchored patterns, -2 is returned. |
448 |
|
</P> |
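<P>
The three kinds of result described above can be told apart with a small
helper. This is only a sketch: the enum and the function classify_firstchar()
are hypothetical names, and the function takes the integer that
PCRE_INFO_FIRSTCHAR is described as storing rather than calling PCRE itself.
</P>

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTCHAR result. */
enum firstchar_kind {
  FIRST_FIXED,       /* value is the code of a fixed first character */
  FIRST_LINE_START,  /* -1: matches only at start of subject or after "\n" */
  FIRST_UNKNOWN      /* -2: nothing is known, or the pattern is anchored */
};

/* Map the integer stored through the "where" argument to one of the
   three cases listed in the text. */
static enum firstchar_kind classify_firstchar(int value)
{
  if (value >= 0) return FIRST_FIXED;
  if (value == -1) return FIRST_LINE_START;
  return FIRST_UNKNOWN;
}
```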
449 |
|
<P> |
450 |
|
<PRE> |
451 |
|
PCRE_INFO_FIRSTTABLE |
452 |
|
</PRE> |
453 |
|
</P> |
454 |
|
<P> |
455 |
|
If the pattern was studied, and this resulted in the construction of a 256-bit |
456 |
|
table indicating a fixed set of characters for the first character in any |
457 |
|
matching string, a pointer to the table is returned. Otherwise NULL is |
458 |
|
returned. The fourth argument should point to an <B>unsigned char *</B> |
459 |
|
variable. |
460 |
|
</P> |
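<P>
A 256-bit table occupies 32 bytes, one bit per character code. The sketch below
shows one way such a table could be tested. The bit layout (bit c % 8 of byte
c / 8) is an assumption, since the text above does not specify it, and both
function names are hypothetical. The setter exists only so that the example can
build a table to look at.
</P>

```c
/* Return nonzero if character c is marked in a 32-byte (256-bit) table.
   Assumed layout: bit (c % 8) of byte (c / 8). */
static int table_has_char(const unsigned char *table, unsigned char c)
{
  return (table[c / 8] >> (c % 8)) & 1;
}

/* Mark character c in the table, using the same assumed layout. */
static void table_set_char(unsigned char *table, unsigned char c)
{
  table[c / 8] |= (unsigned char)(1u << (c % 8));
}
```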
461 |
|
<P> |
462 |
|
<PRE> |
463 |
|
PCRE_INFO_LASTLITERAL |
464 |
|
</PRE> |
465 |
|
</P> |
466 |
|
<P> |
467 |
|
For a non-anchored pattern, return the value of the rightmost literal character |
468 |
|
which must exist in any matched string, other than at its start. The fourth |
469 |
|
argument should point to an <B>int</B> variable. If there is no such character, |
470 |
|
or if the pattern is anchored, -1 is returned. For example, for the pattern |
471 |
|
/a\d+z\d+/ the returned value is 'z'. |
472 |
|
</P> |
473 |
|
<P> |
474 |
|
The <B>pcre_info()</B> function is now obsolete because its interface is too |
475 |
|
restrictive to return all the available data about a compiled pattern. New |
476 |
|
programs should use <B>pcre_fullinfo()</B> instead. The yield of |
477 |
|
<B>pcre_info()</B> is the number of capturing subpatterns, or one of the |
478 |
|
following negative numbers: |
479 |
|
</P> |
480 |
|
<P> |
481 |
|
<PRE> |
482 |
|
PCRE_ERROR_NULL the argument <I>code</I> was NULL |
483 |
|
PCRE_ERROR_BADMAGIC the "magic number" was not found |
484 |
|
</PRE> |
485 |
|
</P> |
486 |
|
<P> |
487 |
|
If the <I>optptr</I> argument is not NULL, a copy of the options with which the |
488 |
|
pattern was compiled is placed in the integer it points to (see |
489 |
|
PCRE_INFO_OPTIONS above). |
490 |
|
</P> |
491 |
|
<P> |
492 |
|
If the pattern is not anchored and the <I>firstcharptr</I> argument is not NULL, |
493 |
|
it is used to pass back information about the first character of any matched |
494 |
|
string (see PCRE_INFO_FIRSTCHAR above). |
495 |
</P> |
</P> |
496 |
<LI><A NAME="SEC9" HREF="#TOC1">MATCHING A PATTERN</A> |
<LI><A NAME="SEC9" HREF="#TOC1">MATCHING A PATTERN</A> |
497 |
<P> |
<P> |
848 |
pattern matches. |
pattern matches. |
849 |
</P> |
</P> |
850 |
<P> |
<P> |
851 |
7. Fairly obviously, PCRE does not support the (?{code}) construction. |
7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) |
852 |
|
constructions. However, there is some experimental support for recursive |
853 |
|
patterns using the non-Perl item (?R). |
854 |
</P> |
</P> |
855 |
<P> |
<P> |
856 |
8. There are at the time of writing some oddities in Perl 5.005_02 concerned |
8. There are at the time of writing some oddities in Perl 5.005_02 concerned |
898 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for |
899 |
<B>pcre_exec()</B> have no Perl equivalents. |
<B>pcre_exec()</B> have no Perl equivalents. |
900 |
</P> |
</P> |
901 |
|
<P> |
902 |
|
(g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do |
903 |
|
this using the (?p{code}) construct, which PCRE cannot of course support.) |
904 |
|
</P> |
905 |
<LI><A NAME="SEC13" HREF="#TOC1">REGULAR EXPRESSION DETAILS</A> |
<LI><A NAME="SEC13" HREF="#TOC1">REGULAR EXPRESSION DETAILS</A> |
906 |
<P> |
<P> |
907 |
The syntax and semantics of the regular expressions supported by PCRE are |
The syntax and semantics of the regular expressions supported by PCRE are |
908 |
described below. Regular expressions are also described in the Perl |
described below. Regular expressions are also described in the Perl |
909 |
documentation and in a number of other books, some of which have copious |
documentation and in a number of other books, some of which have copious |
910 |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
911 |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description |
912 |
here is intended as reference documentation. |
here is intended as reference documentation. |
913 |
</P> |
</P> |
914 |
<P> |
<P> |
1263 |
terminating ] are non-special in character classes, but it does no harm if they |
terminating ] are non-special in character classes, but it does no harm if they |
1264 |
are escaped. |
are escaped. |
1265 |
</P> |
</P> |
1266 |
<LI><A NAME="SEC18" HREF="#TOC1">VERTICAL BAR</A> |
<LI><A NAME="SEC18" HREF="#TOC1">POSIX CHARACTER CLASSES</A> |
1267 |
|
<P> |
1268 |
|
Perl 5.6 (not yet released at the time of writing) is going to support the |
1269 |
|
POSIX notation for character classes, which uses names enclosed by [: and :] |
1270 |
|
within the enclosing square brackets. PCRE supports this notation. For example, |
1271 |
|
</P> |
1272 |
|
<P> |
1273 |
|
<PRE> |
1274 |
|
[01[:alpha:]%] |
1275 |
|
</PRE> |
1276 |
|
</P> |
1277 |
|
<P> |
1278 |
|
matches "0", "1", any alphabetic character, or "%". The supported class names |
1279 |
|
are |
1280 |
|
</P> |
1281 |
|
<P> |
1282 |
|
<PRE> |
1283 |
|
alnum letters and digits |
1284 |
|
alpha letters |
1285 |
|
ascii character codes 0 - 127 |
1286 |
|
cntrl control characters |
1287 |
|
digit decimal digits (same as \d) |
1288 |
|
graph printing characters, excluding space |
1289 |
|
lower lower case letters |
1290 |
|
print printing characters, including space |
1291 |
|
punct printing characters, excluding letters and digits |
1292 |
|
space white space (same as \s) |
1293 |
|
upper upper case letters |
1294 |
|
word "word" characters (same as \w) |
1295 |
|
xdigit hexadecimal digits |
1296 |
|
</PRE> |
1297 |
|
</P> |
1298 |
|
<P> |
1299 |
|
The names "ascii" and "word" are Perl extensions. Another Perl extension is |
1300 |
|
negation, which is indicated by a ^ character after the colon. For example, |
1301 |
|
</P> |
1302 |
|
<P> |
1303 |
|
<PRE> |
1304 |
|
[12[:^digit:]] |
1305 |
|
</PRE> |
1306 |
|
</P> |
1307 |
|
<P> |
1308 |
|
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX |
1309 |
|
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
1310 |
|
supported, and an error is given if they are encountered. |
1311 |
|
</P> |
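<P>
For readers more familiar with the C library, the two example classes above
behave roughly like the hand-written tests below, which use the standard
ctype.h functions. This is a sketch under the assumption of the default "C"
locale. The function names are hypothetical, and PCRE itself works from its own
character tables rather than calling these functions at match time.
</P>

```c
#include <ctype.h>

/* Hand-written equivalent of the class [01[:alpha:]%]. */
static int in_class_01_alpha_percent(int c)
{
  unsigned char uc = (unsigned char)c;
  return uc == '0' || uc == '1' || uc == '%' || isalpha(uc) != 0;
}

/* Hand-written equivalent of the class [12[:^digit:]]: "1", "2",
   or any character that is not a decimal digit. */
static int in_class_12_nondigit(int c)
{
  unsigned char uc = (unsigned char)c;
  return uc == '1' || uc == '2' || !isdigit(uc);
}
```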
1312 |
|
<LI><A NAME="SEC19" HREF="#TOC1">VERTICAL BAR</A> |
1313 |
<P> |
<P> |
1314 |
Vertical bar characters are used to separate alternative patterns. For example, |
Vertical bar characters are used to separate alternative patterns. For example, |
1315 |
the pattern |
the pattern |
1327 |
subpattern (defined below), "succeeds" means matching the rest of the main |
subpattern (defined below), "succeeds" means matching the rest of the main |
1328 |
pattern as well as the alternative in the subpattern. |
pattern as well as the alternative in the subpattern. |
1329 |
</P> |
</P> |
1330 |
<LI><A NAME="SEC19" HREF="#TOC1">INTERNAL OPTION SETTING</A> |
<LI><A NAME="SEC20" HREF="#TOC1">INTERNAL OPTION SETTING</A> |
1331 |
<P> |
<P> |
1332 |
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED |
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED |
1333 |
can be changed from within the pattern by a sequence of Perl option letters |
can be changed from within the pattern by a sequence of Perl option letters |
1403 |
earlier in the pattern than any of the additional features it turns on, even |
earlier in the pattern than any of the additional features it turns on, even |
1404 |
when it is at top level. It is best put at the start. |
when it is at top level. It is best put at the start. |
1405 |
</P> |
</P> |
1406 |
<LI><A NAME="SEC20" HREF="#TOC1">SUBPATTERNS</A> |
<LI><A NAME="SEC21" HREF="#TOC1">SUBPATTERNS</A> |
1407 |
<P> |
<P> |
1408 |
Subpatterns are delimited by parentheses (round brackets), which can be nested. |
Subpatterns are delimited by parentheses (round brackets), which can be nested. |
1409 |
Marking part of a pattern as a subpattern does two things: |
Marking part of a pattern as a subpattern does two things: |
1474 |
is reached, an option setting in one branch does affect subsequent branches, so |
is reached, an option setting in one branch does affect subsequent branches, so |
1475 |
the above patterns match "SUNDAY" as well as "Saturday". |
the above patterns match "SUNDAY" as well as "Saturday". |
1476 |
</P> |
</P> |
1477 |
<LI><A NAME="SEC21" HREF="#TOC1">REPETITION</A> |
<LI><A NAME="SEC22" HREF="#TOC1">REPETITION</A> |
1478 |
<P> |
<P> |
1479 |
Repetition is specified by quantifiers, which can follow any of the following |
Repetition is specified by quantifiers, which can follow any of the following |
1480 |
items: |
items: |
1649 |
<P> |
<P> |
1650 |
matches "aba" the value of the second captured substring is "b". |
matches "aba" the value of the second captured substring is "b". |
1651 |
</P> |
</P> |
1652 |
<LI><A NAME="SEC22" HREF="#TOC1">BACK REFERENCES</A> |
<LI><A NAME="SEC23" HREF="#TOC1">BACK REFERENCES</A> |
1653 |
<P> |
<P> |
1654 |
Outside a character class, a backslash followed by a digit greater than 0 (and |
Outside a character class, a backslash followed by a digit greater than 0 (and |
1655 |
possibly further digits) is a back reference to a capturing subpattern earlier |
possibly further digits) is a back reference to a capturing subpattern earlier |
1725 |
done using alternation, as in the example above, or by a quantifier with a |
done using alternation, as in the example above, or by a quantifier with a |
1726 |
minimum of zero. |
minimum of zero. |
1727 |
</P> |
</P> |
1728 |
<LI><A NAME="SEC23" HREF="#TOC1">ASSERTIONS</A> |
<LI><A NAME="SEC24" HREF="#TOC1">ASSERTIONS</A> |
1729 |
<P> |
<P> |
1730 |
An assertion is a test on the characters following or preceding the current |
An assertion is a test on the characters following or preceding the current |
1731 |
matching point that does not actually consume any characters. The simple |
matching point that does not actually consume any characters. The simple |
1883 |
<P> |
<P> |
1884 |
Assertions count towards the maximum of 200 parenthesized subpatterns. |
Assertions count towards the maximum of 200 parenthesized subpatterns. |
1885 |
</P> |
</P> |
1886 |
<LI><A NAME="SEC24" HREF="#TOC1">ONCE-ONLY SUBPATTERNS</A> |
<LI><A NAME="SEC25" HREF="#TOC1">ONCE-ONLY SUBPATTERNS</A> |
1887 |
<P> |
<P> |
1888 |
With both maximizing and minimizing repetition, failure of what follows |
With both maximizing and minimizing repetition, failure of what follows |
1889 |
normally causes the repeated item to be re-evaluated to see if a different |
normally causes the repeated item to be re-evaluated to see if a different |
1947 |
</PRE> |
</PRE> |
1948 |
</P> |
</P> |
1949 |
<P> |
<P> |
1950 |
when applied to a long string which does not match it. Because matching |
when applied to a long string which does not match. Because matching proceeds |
1951 |
proceeds from left to right, PCRE will look for each "a" in the subject and |
from left to right, PCRE will look for each "a" in the subject and then see if |
1952 |
then see if what follows matches the rest of the pattern. If the pattern is |
what follows matches the rest of the pattern. If the pattern is specified as |
|
specified as |
|
1953 |
</P> |
</P> |
1954 |
<P> |
<P> |
1955 |
<PRE> |
<PRE> |
1957 |
</PRE> |
</PRE> |
1958 |
</P> |
</P> |
1959 |
<P> |
<P> |
1960 |
then the initial .* matches the entire string at first, but when this fails, it |
then the initial .* matches the entire string at first, but when this fails |
1961 |
backtracks to match all but the last character, then all but the last two |
(because there is no following "a"), it backtracks to match all but the last |
1962 |
characters, and so on. Once again the search for "a" covers the entire string, |
character, then all but the last two characters, and so on. Once again the |
1963 |
from right to left, so we are no better off. However, if the pattern is written |
search for "a" covers the entire string, from right to left, so we are no |
1964 |
as |
better off. However, if the pattern is written as |
1965 |
</P> |
</P> |
1966 |
<P> |
<P> |
1967 |
<PRE> |
<PRE> |
1974 |
characters. If it fails, the match fails immediately. For long strings, this |
characters. If it fails, the match fails immediately. For long strings, this |
1975 |
approach makes a significant difference to the processing time. |
approach makes a significant difference to the processing time. |
1976 |
</P> |
</P> |
1977 |
<LI><A NAME="SEC25" HREF="#TOC1">CONDITIONAL SUBPATTERNS</A> |
<P> |
1978 |
|
When a pattern contains an unlimited repeat inside a subpattern that can itself |
1979 |
|
be repeated an unlimited number of times, the use of a once-only subpattern is |
1980 |
|
the only way to avoid some failing matches taking a very long time indeed. |
1981 |
|
The pattern |
1982 |
|
</P> |
1983 |
|
<P> |
1984 |
|
<PRE> |
1985 |
|
(\D+|<\d+>)*[!?] |
1986 |
|
</PRE> |
1987 |
|
</P> |
1988 |
|
<P> |
1989 |
|
matches an unlimited number of substrings that either consist of non-digits, or |
1990 |
|
digits enclosed in <>, followed by either ! or ?. When it matches, it runs |
1991 |
|
quickly. However, if it is applied to |
1992 |
|
</P> |
1993 |
|
<P> |
1994 |
|
<PRE> |
1995 |
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
1996 |
|
</PRE> |
1997 |
|
</P> |
1998 |
|
<P> |
1999 |
|
it takes a long time before reporting failure. This is because the string can |
2000 |
|
be divided between the two repeats in a large number of ways, and all have to |
2001 |
|
be tried. (The example used [!?] rather than a single character at the end, |
2002 |
|
because both PCRE and Perl have an optimization that allows for fast failure |
2003 |
|
when a single character is used. They remember the last single character that |
2004 |
|
is required for a match, and fail early if it is not present in the string.) |
2005 |
|
If the pattern is changed to |
2006 |
|
</P> |
2007 |
|
<P> |
2008 |
|
<PRE> |
2009 |
|
((?>\D+)|<\d+>)*[!?] |
2010 |
|
</PRE> |
2011 |
|
</P> |
2012 |
|
<P> |
2013 |
|
sequences of non-digits cannot be broken, and failure happens quickly. |
2014 |
|
</P> |
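<P>
To give a feel for the numbers involved, the sketch below counts the ways in
which a run of n non-digits can be cut into consecutive non-empty pieces, each
piece being one pass of \D+ inside the outer repeat. It is a simplified model
and not PCRE's actual matching code; the function name count_splits() is
hypothetical. The count doubles with each extra character, whereas with the
once-only subpattern a run cannot be cut at all.
</P>

```c
/* Count the ways a run of n non-digit characters can be cut into
   consecutive non-empty pieces, each matched by one pass of \D+. */
static unsigned long long count_splits(unsigned n)
{
  unsigned long long total = 0;
  unsigned first;

  if (n == 0) return 1;                /* nothing left: one complete split */
  for (first = 1; first <= n; first++)
    total += count_splits(n - first);  /* first piece takes "first" chars */
  return total;
}
```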
2015 |
|
<LI><A NAME="SEC26" HREF="#TOC1">CONDITIONAL SUBPATTERNS</A> |
2016 |
<P> |
<P> |
2017 |
It is possible to cause the matching process to obey a subpattern |
It is possible to cause the matching process to obey a subpattern |
2018 |
conditionally or to choose between two alternative subpatterns, depending on |
conditionally or to choose between two alternative subpatterns, depending on |
2074 |
against the second. This pattern matches strings in one of the two forms |
against the second. This pattern matches strings in one of the two forms |
2075 |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. |
2076 |
</P> |
</P> |
2077 |
<LI><A NAME="SEC26" HREF="#TOC1">COMMENTS</A> |
<LI><A NAME="SEC27" HREF="#TOC1">COMMENTS</A> |
2078 |
<P> |
<P> |
2079 |
The sequence (?# marks the start of a comment which continues up to the next |
The sequence (?# marks the start of a comment which continues up to the next |
2080 |
closing parenthesis. Nested parentheses are not permitted. The characters |
closing parenthesis. Nested parentheses are not permitted. The characters |
2085 |
character class introduces a comment that continues up to the next newline |
character class introduces a comment that continues up to the next newline |
2086 |
character in the pattern. |
character in the pattern. |
2087 |
</P> |
</P> |
2088 |
<LI><A NAME="SEC27" HREF="#TOC1">PERFORMANCE</A> |
<LI><A NAME="SEC28" HREF="#TOC1">RECURSIVE PATTERNS</A> |
2089 |
|
<P> |
2090 |
|
Consider the problem of matching a string in parentheses, allowing for |
2091 |
|
unlimited nested parentheses. Without the use of recursion, the best that can |
2092 |
|
be done is to use a pattern that matches up to some fixed depth of nesting. It |
2093 |
|
is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an |
2094 |
|
experimental facility that allows regular expressions to recurse (amongst other |
2095 |
|
things). It does this by interpolating Perl code in the expression at run time, |
2096 |
|
and the code can refer to the expression itself. A Perl pattern to solve the |
2097 |
|
parentheses problem can be created like this: |
2098 |
|
</P> |
2099 |
|
<P> |
2100 |
|
<PRE> |
2101 |
|
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
2102 |
|
</PRE> |
2103 |
|
</P> |
2104 |
|
<P> |
2105 |
|
The (?p{...}) item interpolates Perl code at run time, and in this case refers |
2106 |
|
recursively to the pattern in which it appears. Obviously, PCRE cannot support |
2107 |
|
the interpolation of Perl code. Instead, the special item (?R) is provided for |
2108 |
|
the specific case of recursion. This PCRE pattern solves the parentheses |
2109 |
|
problem (assume the PCRE_EXTENDED option is set so that white space is |
2110 |
|
ignored): |
2111 |
|
</P> |
2112 |
|
<P> |
2113 |
|
<PRE> |
2114 |
|
\( ( (?>[^()]+) | (?R) )* \) |
2115 |
|
</PRE> |
2116 |
|
</P> |
2117 |
|
<P> |
2118 |
|
First it matches an opening parenthesis. Then it matches any number of |
2119 |
|
substrings which can either be a sequence of non-parentheses, or a recursive |
2120 |
|
match of the pattern itself (i.e. a correctly parenthesized substring). Finally |
2121 |
|
there is a closing parenthesis. |
2122 |
|
</P> |
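<P>
The same three steps can be written by hand as a recursive function, which may
help in seeing what (?R) does. This is only an illustrative sketch and not how
PCRE is implemented; the function name match_parens() is hypothetical. Unlike a
pattern match, it tests only at the start of the string it is given and does
not search along it.
</P>

```c
#include <stddef.h>

/* Match a correctly parenthesized substring starting at s[0].
   Returns the number of characters matched, or 0 on failure. */
static size_t match_parens(const char *s)
{
  size_t i = 0;

  if (s[i] != '(') return 0;            /* the opening \( */
  i++;
  for (;;) {
    if (s[i] == '(') {                  /* the (?R) branch: recurse */
      size_t inner = match_parens(s + i);
      if (inner == 0) return 0;
      i += inner;
    } else if (s[i] == ')') {           /* the closing \) */
      return i + 1;
    } else if (s[i] == '\0') {          /* subject ended before \) */
      return 0;
    } else {
      i++;                              /* part of a [^()]+ run */
    }
  }
}
```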
2123 |
|
<P> |
2124 |
|
This particular example pattern contains nested unlimited repeats, and so the |
2125 |
|
use of a once-only subpattern for matching strings of non-parentheses is |
2126 |
|
important when applying the pattern to strings that do not match. For example, |
2127 |
|
when it is applied to |
2128 |
|
</P> |
2129 |
|
<P> |
2130 |
|
<PRE> |
2131 |
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
2132 |
|
</PRE> |
2133 |
|
</P> |
2134 |
|
<P> |
2135 |
|
it yields "no match" quickly. However, if a once-only subpattern is not used, |
2136 |
|
the match runs for a very long time indeed because there are so many different |
2137 |
|
ways the + and * repeats can carve up the subject, and all have to be tested |
2138 |
|
before failure can be reported. |
2139 |
|
</P> |
2140 |
|
<P> |
2141 |
|
The values set for any capturing subpatterns are those from the outermost level |
2142 |
|
of the recursion at which the subpattern value is set. If the pattern above is |
2143 |
|
matched against |
2144 |
|
</P> |
2145 |
|
<P> |
2146 |
|
<PRE> |
2147 |
|
(ab(cd)ef) |
2148 |
|
</PRE> |
2149 |
|
</P> |
2150 |
|
<P> |
2151 |
|
the value for the capturing parentheses is "ef", which is the last value taken |
2152 |
|
on at the top level. If additional parentheses are added, giving |
2153 |
|
</P> |
2154 |
|
<P> |
2155 |
|
<PRE> |
2156 |
|
\( ( ( (?>[^()]+) | (?R) )* ) \) |
2157 |
|
^ ^ |
2158 |
|
^ ^ |
2159 |
|
</PRE> |
2160 |
|
then the string they capture is "ab(cd)ef", the contents of the top level |
2161 |
|
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
2162 |
|
has to obtain extra memory to store data during a recursion, which it does by |
2163 |
|
using <B>pcre_malloc</B>, freeing it via <B>pcre_free</B> afterwards. If no |
2164 |
|
memory can be obtained, it saves data for the first 15 capturing parentheses |
2165 |
|
only, as there is no way to give an out-of-memory error from within a |
2166 |
|
recursion. |
2167 |
|
</P> |
2168 |
|
<LI><A NAME="SEC29" HREF="#TOC1">PERFORMANCE</A> |
2169 |
<P> |
<P> |
2170 |
Certain items that may appear in patterns are more efficient than others. It is |
Certain items that may appear in patterns are more efficient than others. It is |
2171 |
more efficient to use a character class like [aeiou] than a set of alternatives |
more efficient to use a character class like [aeiou] than a set of alternatives |
2241 |
applied to a whole line of "a" characters, whereas the latter takes an |
applied to a whole line of "a" characters, whereas the latter takes an |
2242 |
appreciable time with strings longer than about 20 characters. |
appreciable time with strings longer than about 20 characters. |
2243 |
</P> |
</P> |
2244 |
<LI><A NAME="SEC28" HREF="#TOC1">AUTHOR</A> |
<LI><A NAME="SEC30" HREF="#TOC1">AUTHOR</A> |
2245 |
<P> |
<P> |
2246 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
2247 |
<BR> |
<BR> |
2254 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
2255 |
</P> |
</P> |
2256 |
<P> |
<P> |
2257 |
Last updated: 29 July 1999 |
Last updated: 27 January 2000 |
2258 |
<BR> |
<BR> |
2259 |
Copyright (c) 1997-1999 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |