44 |
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" |
.B int *\fIovector\fR, int \fIstringcount\fR, "const char ***\fIlistptr\fR);" |
45 |
.PP |
.PP |
46 |
.br |
.br |
47 |
|
.B void pcre_free_substring(const char *\fIstringptr\fR); |
48 |
|
.PP |
49 |
|
.br |
50 |
|
.B void pcre_free_substring_list(const char **\fIstringptr\fR); |
51 |
|
.PP |
52 |
|
.br |
53 |
.B const unsigned char *pcre_maketables(void); |
.B const unsigned char *pcre_maketables(void); |
54 |
.PP |
.PP |
55 |
.br |
.br |
56 |
|
.B int pcre_fullinfo(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR," |
57 |
|
.ti +5n |
58 |
|
.B int \fIwhat\fR, void *\fIwhere\fR); |
59 |
|
.PP |
60 |
|
.br |
61 |
.B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int |
.B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int |
62 |
.B *\fIfirstcharptr\fR); |
.B *\fIfirstcharptr\fR); |
63 |
.PP |
.PP |
75 |
.SH DESCRIPTION |
.SH DESCRIPTION |
76 |
The PCRE library is a set of functions that implement regular expression |
The PCRE library is a set of functions that implement regular expression |
77 |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
78 |
differences (see below). The current implementation corresponds to Perl 5.005. |
differences (see below). The current implementation corresponds to Perl 5.005, |
79 |
|
with some additional features from later versions. This includes some |
80 |
|
experimental, incomplete support for UTF-8 encoded strings. Details of exactly |
81 |
|
what is and what is not supported are given below. |
82 |
|
|
83 |
PCRE has its own native API, which is described in this document. There is also |
PCRE has its own native API, which is described in this document. There is also |
84 |
a set of wrapper functions that correspond to the POSIX API. These are |
a set of wrapper functions that correspond to the POSIX regular expression API. |
85 |
described in the \fBpcreposix\fR documentation. |
These are described in the \fBpcreposix\fR documentation. |
86 |
|
|
87 |
The native API function prototypes are defined in the header file \fBpcre.h\fR, |
The native API function prototypes are defined in the header file \fBpcre.h\fR, |
88 |
and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be |
and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be |
89 |
accessed by adding \fB-lpcre\fR to the command for linking an application which |
accessed by adding \fB-lpcre\fR to the command for linking an application which |
90 |
calls it. |
calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to |
91 |
|
contain the major and minor release numbers for the library. Applications can |
92 |
|
use these to include support for different releases. |
93 |
|
|
94 |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
95 |
are used for compiling and matching regular expressions, while |
are used for compiling and matching regular expressions. |
96 |
\fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
|
97 |
|
The functions \fBpcre_copy_substring()\fR, \fBpcre_get_substring()\fR, and |
98 |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
\fBpcre_get_substring_list()\fR are convenience functions for extracting |
99 |
captured substrings from a matched subject string. The function |
captured substrings from a matched subject string; \fBpcre_free_substring()\fR |
100 |
\fBpcre_maketables()\fR is used (optionally) to build a set of character tables |
and \fBpcre_free_substring_list()\fR are also provided, to free the memory used |
101 |
in the current locale for passing to \fBpcre_compile()\fR. |
for extracted strings. |
102 |
|
|
103 |
The function \fBpcre_info()\fR is used to find out information about a compiled |
The function \fBpcre_maketables()\fR is used (optionally) to build a set of |
104 |
pattern, while the function \fBpcre_version()\fR returns a pointer to a string |
character tables in the current locale for passing to \fBpcre_compile()\fR. |
105 |
containing the version of PCRE and its date of release. |
|
106 |
|
The function \fBpcre_fullinfo()\fR is used to find out information about a |
107 |
|
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only |
108 |
|
some of the available information, but is retained for backwards compatibility. |
109 |
|
The function \fBpcre_version()\fR returns a pointer to a string containing the |
110 |
|
version of PCRE and its date of release. |
111 |
|
|
112 |
The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain |
The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain |
113 |
the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions |
the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions |
204 |
|
|
205 |
PCRE_EXTRA |
PCRE_EXTRA |
206 |
|
|
207 |
This option turns on additional functionality of PCRE that is incompatible with |
This option was invented in order to turn on additional functionality of PCRE |
208 |
Perl. Any backslash in a pattern that is followed by a letter that has no |
that is incompatible with Perl, but it is currently of very little use. When |
209 |
|
set, any backslash in a pattern that is followed by a letter that has no |
210 |
special meaning causes an error, thus reserving these combinations for future |
special meaning causes an error, thus reserving these combinations for future |
211 |
expansion. By default, as in Perl, a backslash followed by a letter with no |
expansion. By default, as in Perl, a backslash followed by a letter with no |
212 |
special meaning is treated as a literal. There are at present no other features |
special meaning is treated as a literal. There are at present no other features |
213 |
controlled by this option. |
controlled by this option. It can also be set by a (?X) option setting within a |
214 |
|
pattern. |
215 |
|
|
216 |
PCRE_MULTILINE |
PCRE_MULTILINE |
217 |
|
|
235 |
greedy by default, but become greedy if followed by "?". It is not compatible |
greedy by default, but become greedy if followed by "?". It is not compatible |
236 |
with Perl. It can also be set by a (?U) option setting within the pattern. |
with Perl. It can also be set by a (?U) option setting within the pattern. |
237 |
|
|
238 |
|
PCRE_UTF8 |
239 |
|
|
240 |
|
This option causes PCRE to regard both the pattern and the subject as strings |
241 |
|
of UTF-8 characters instead of just byte strings. However, it is available only |
242 |
|
if PCRE has been built to include UTF-8 support. If not, the use of this option |
243 |
|
provokes an error. Support for UTF-8 is new, experimental, and incomplete. |
244 |
|
Details of exactly what it entails are given below. |
245 |
|
|
246 |
|
|
247 |
.SH STUDYING A PATTERN |
.SH STUDYING A PATTERN |
248 |
When a pattern is going to be used several times, it is worth spending more |
When a pattern is going to be used several times, it is worth spending more |
293 |
|
|
294 |
|
|
295 |
.SH INFORMATION ABOUT A PATTERN |
.SH INFORMATION ABOUT A PATTERN |
296 |
The \fBpcre_info()\fR function returns information about a compiled pattern. |
The \fBpcre_fullinfo()\fR function returns information about a compiled |
297 |
Its yield is the number of capturing subpatterns, or one of the following |
pattern. It replaces the obsolete \fBpcre_info()\fR function, which is |
298 |
negative numbers: |
nevertheless retained for backwards compatibility (and is documented below). |
299 |
|
|
300 |
|
The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled |
301 |
|
pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if |
302 |
|
the pattern was not studied. The third argument specifies which piece of |
303 |
|
information is required, while the fourth argument is a pointer to a variable |
304 |
|
to receive the data. The yield of the function is zero for success, or one of |
305 |
|
the following negative numbers: |
306 |
|
|
307 |
PCRE_ERROR_NULL the argument \fIcode\fR was NULL |
PCRE_ERROR_NULL the argument \fIcode\fR was NULL |
308 |
|
the argument \fIwhere\fR was NULL |
309 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
310 |
|
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid |
311 |
|
|
312 |
If the \fIoptptr\fR argument is not NULL, a copy of the options with which the |
The possible values for the third argument are defined in \fBpcre.h\fR, and are |
313 |
pattern was compiled is placed in the integer it points to. These option bits |
as follows: |
314 |
|
|
315 |
|
PCRE_INFO_OPTIONS |
316 |
|
|
317 |
|
Return a copy of the options with which the pattern was compiled. The fourth |
318 |
|
argument should point to an \fBunsigned long int\fR variable. These option bits |
319 |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
320 |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
321 |
bit set if the form of the pattern implies that it can match only at the start |
bit forcibly set if the form of the pattern implies that it can match only at |
322 |
of a subject string. |
the start of a subject string. |
323 |
|
|
324 |
If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, |
PCRE_INFO_SIZE |
325 |
it is used to pass back information about the first character of any matched |
|
326 |
string. If there is a fixed first character, e.g. from a pattern such as |
Return the size of the compiled pattern, that is, the value that was passed as |
327 |
(cat|cow|coyote), then it is returned in the integer pointed to by |
the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to |
328 |
\fIfirstcharptr\fR. Otherwise, if either |
place the compiled data. The fourth argument should point to a \fBsize_t\fR |
329 |
|
variable. |
330 |
|
|
331 |
|
PCRE_INFO_CAPTURECOUNT |
332 |
|
|
333 |
|
Return the number of capturing subpatterns in the pattern. The fourth argument |
334 |
|
should point to an \fBint\fR variable. |
335 |
|
|
336 |
|
PCRE_INFO_BACKREFMAX |
337 |
|
|
338 |
|
Return the number of the highest back reference in the pattern. The fourth |
339 |
|
argument should point to an \fBint\fR variable. Zero is returned if there are |
340 |
|
no back references. |
341 |
|
|
342 |
|
PCRE_INFO_FIRSTCHAR |
343 |
|
|
344 |
|
Return information about the first character of any matched string, for a |
345 |
|
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
346 |
|
such as (cat|cow|coyote), it is returned in the integer pointed to by |
347 |
|
\fIwhere\fR. Otherwise, if either |
348 |
|
|
349 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
350 |
starts with "^", or |
starts with "^", or |
352 |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
353 |
(if it were set, the pattern would be anchored), |
(if it were set, the pattern would be anchored), |
354 |
|
|
355 |
then -1 is returned, indicating that the pattern matches only at the |
-1 is returned, indicating that the pattern matches only at the start of a |
356 |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
subject string or after any "\\n" within the string. Otherwise -2 is returned. |
357 |
returned. |
For anchored patterns, -2 is returned. |
358 |
|
|
359 |
|
PCRE_INFO_FIRSTTABLE |
360 |
|
|
361 |
|
If the pattern was studied, and this resulted in the construction of a 256-bit |
362 |
|
table indicating a fixed set of characters for the first character in any |
363 |
|
matching string, a pointer to the table is returned. Otherwise NULL is |
364 |
|
returned. The fourth argument should point to an \fBunsigned char *\fR |
365 |
|
variable. |
366 |
|
|
367 |
|
PCRE_INFO_LASTLITERAL |
368 |
|
|
369 |
|
For a non-anchored pattern, return the value of the rightmost literal character |
370 |
|
which must exist in any matched string, other than at its start. The fourth |
371 |
|
argument should point to an \fBint\fR variable. If there is no such character, |
372 |
|
or if the pattern is anchored, -1 is returned. For example, for the pattern |
373 |
|
/a\\d+z\\d+/ the returned value is 'z'. |
374 |
|
|
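The 256-bit table described under PCRE_INFO_FIRSTTABLE occupies 32 bytes, one bit per possible first byte value. The following is a minimal sketch of how a caller might test a character against such a table. The function name \fIfirst_char_allowed\fR is hypothetical, and the bit layout (bit \fIc\fR & 7 of byte \fIc\fR / 8) is an assumption that is not stated in this document.

```c
#include <stddef.h>

/* Hypothetical helper: test whether byte c may start a match, given a
   32-byte (256-bit) table such as the one returned for
   PCRE_INFO_FIRSTTABLE. The bit layout is an assumption: bit (c & 7)
   of byte (c / 8). */
int first_char_allowed(const unsigned char *table, unsigned char c)
{
    if (table == NULL)
        return 1;   /* no table available: no restriction is known */
    return (table[c / 8] & (1u << (c & 7))) != 0;
}
```

A NULL table is treated as "no restriction", matching the case where the pattern was not studied or no table was built.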
375 |
|
The \fBpcre_info()\fR function is now obsolete because its interface is too |
376 |
|
restrictive to return all the available data about a compiled pattern. New |
377 |
|
programs should use \fBpcre_fullinfo()\fR instead. The yield of |
378 |
|
\fBpcre_info()\fR is the number of capturing subpatterns, or one of the |
379 |
|
following negative numbers: |
380 |
|
|
381 |
|
PCRE_ERROR_NULL the argument \fIcode\fR was NULL |
382 |
|
PCRE_ERROR_BADMAGIC the "magic number" was not found |
383 |
|
|
384 |
|
If the \fIoptptr\fR argument is not NULL, a copy of the options with which the |
385 |
|
pattern was compiled is placed in the integer it points to (see |
386 |
|
PCRE_INFO_OPTIONS above). |
387 |
|
|
388 |
|
If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, |
389 |
|
it is used to pass back information about the first character of any matched |
390 |
|
string (see PCRE_INFO_FIRSTCHAR above). |
391 |
|
|
392 |
|
|
393 |
.SH MATCHING A PATTERN |
.SH MATCHING A PATTERN |
570 |
were captured by the match, including the substring that matched the entire |
were captured by the match, including the substring that matched the entire |
571 |
regular expression. This is the value returned by \fBpcre_exec\fR if it |
regular expression. This is the value returned by \fBpcre_exec\fR if it |
572 |
is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it |
is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it |
573 |
ran out of space in \fIovector\fR, then the value passed as |
ran out of space in \fIovector\fR, the value passed as \fIstringcount\fR should |
574 |
\fIstringcount\fR should be the size of the vector divided by three. |
be the size of the vector divided by three. |
575 |
|
|
576 |
The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR |
The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR |
577 |
extract a single substring, whose number is given as \fIstringnumber\fR. A |
extract a single substring, whose number is given as \fIstringnumber\fR. A |
578 |
value of zero extracts the substring that matched the entire pattern, while |
value of zero extracts the substring that matched the entire pattern, while |
579 |
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, |
higher values extract the captured substrings. For \fBpcre_copy_substring()\fR, |
580 |
the string is placed in \fIbuffer\fR, whose length is given by |
the string is placed in \fIbuffer\fR, whose length is given by |
581 |
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of store is |
\fIbuffersize\fR, while for \fBpcre_get_substring()\fR a new block of memory is |
582 |
obtained via \fBpcre_malloc\fR, and its address is returned via |
obtained via \fBpcre_malloc\fR, and its address is returned via |
583 |
\fIstringptr\fR. The yield of the function is the length of the string, not |
\fIstringptr\fR. The yield of the function is the length of the string, not |
584 |
including the terminating zero, or one of |
including the terminating zero, or one of |
610 |
inspecting the appropriate offset in \fIovector\fR, which is negative for unset |
inspecting the appropriate offset in \fIovector\fR, which is negative for unset |
611 |
substrings. |
substrings. |
612 |
|
|
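As an illustration of inspecting \fIovector\fR for unset substrings, here is a small sketch. It assumes that substring \fIn\fR is described by the pair of offsets \fIovector[2n]\fR (start) and \fIovector[2n+1]\fR (end), a layout that is not spelled out in this part of the document. The function name \fIcaptured_length\fR is hypothetical and is not part of the PCRE API.

```c
/* Hypothetical helper, not a PCRE function. Returns the length of
   captured substring n, or -1 if n is out of range or the substring
   is unset. Assumes offsets are stored in pairs: ovector[2*n] is the
   start and ovector[2*n + 1] is the end of substring n. */
int captured_length(const int *ovector, int stringcount, int n)
{
    if (n < 0 || n >= stringcount)
        return -1;              /* no such substring */
    if (ovector[2 * n] < 0)
        return -1;              /* negative offset: substring is unset */
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

Here \fIstringcount\fR has the same meaning as in the convenience functions: the number of substrings captured by the match, including the one for the whole pattern.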
613 |
|
The two convenience functions \fBpcre_free_substring()\fR and |
614 |
|
\fBpcre_free_substring_list()\fR can be used to free the memory returned by |
615 |
|
a previous call of \fBpcre_get_substring()\fR or |
616 |
|
\fBpcre_get_substring_list()\fR, respectively. They do nothing more than call |
617 |
|
the function pointed to by \fBpcre_free\fR, which of course could be called |
618 |
|
directly from a C program. However, PCRE is used in some situations where it is |
619 |
|
linked via a special interface to another programming language which cannot use |
620 |
|
\fBpcre_free\fR directly; it is for these cases that the functions are |
621 |
|
provided. |
622 |
|
|
623 |
|
|
624 |
.SH LIMITATIONS |
.SH LIMITATIONS |
671 |
6. The Perl \\G assertion is not supported as it is not relevant to single |
6. The Perl \\G assertion is not supported as it is not relevant to single |
672 |
pattern matches. |
pattern matches. |
673 |
|
|
674 |
7. Fairly obviously, PCRE does not support the (?{code}) construction. |
7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) |
675 |
|
constructions. However, there is some experimental support for recursive |
676 |
|
patterns using the non-Perl item (?R). |
677 |
|
|
678 |
8. There are at the time of writing some oddities in Perl 5.005_02 concerned |
8. There are at the time of writing some oddities in Perl 5.005_02 concerned |
679 |
with the settings of captured strings when part of a pattern is repeated. For |
with the settings of captured strings when part of a pattern is repeated. For |
680 |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
681 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
682 |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. |
683 |
|
|
684 |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
685 |
future Perl changes to a consistent state that is different, PCRE may change to |
future Perl changes to a consistent state that is different, PCRE may change to |
711 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for |
712 |
\fBpcre_exec()\fR have no Perl equivalents. |
\fBpcre_exec()\fR have no Perl equivalents. |
713 |
|
|
714 |
|
(g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do |
715 |
|
this using the (?p{code}) construct, which PCRE cannot of course support). |
716 |
|
|
717 |
|
|
718 |
.SH REGULAR EXPRESSION DETAILS |
.SH REGULAR EXPRESSION DETAILS |
719 |
The syntax and semantics of the regular expressions supported by PCRE are |
The syntax and semantics of the regular expressions supported by PCRE are |
720 |
described below. Regular expressions are also described in the Perl |
described below. Regular expressions are also described in the Perl |
721 |
documentation and in a number of other books, some of which have copious |
documentation and in a number of other books, some of which have copious |
722 |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
723 |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. |
724 |
here is intended as reference documentation. |
|
725 |
|
The description here is intended as reference documentation. The basic |
726 |
|
operation of PCRE is on strings of bytes. However, there are the beginnings of |
727 |
|
some support for UTF-8 character strings. To use this support you must |
728 |
|
configure PCRE to include it, and then call \fBpcre_compile()\fR with the |
729 |
|
PCRE_UTF8 option. How this affects the pattern matching is described in the |
730 |
|
final section of this document. |
731 |
|
|
732 |
A regular expression is a pattern that is matched against a subject string from |
A regular expression is a pattern that is matched against a subject string from |
733 |
left to right. Most characters stand for themselves in a pattern, and match the |
left to right. Most characters stand for themselves in a pattern, and match the |
955 |
.SH FULL STOP (PERIOD, DOT) |
.SH FULL STOP (PERIOD, DOT) |
956 |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
957 |
the subject, including a non-printing character, but not (by default) newline. |
the subject, including a non-printing character, but not (by default) newline. |
958 |
If the PCRE_DOTALL option is set, then dots match newlines as well. The |
If the PCRE_DOTALL option is set, dots match newlines as well. The handling of |
959 |
handling of dot is entirely independent of the handling of circumflex and |
dot is entirely independent of the handling of circumflex and dollar, the only |
960 |
dollar, the only relationship being that they both involve newline characters. |
relationship being that they both involve newline characters. Dot has no |
961 |
Dot has no special meaning in a character class. |
special meaning in a character class. |
962 |
|
|
963 |
|
|
964 |
.SH SQUARE BRACKETS |
.SH SQUARE BRACKETS |
1024 |
are escaped. |
are escaped. |
1025 |
|
|
1026 |
|
|
1027 |
|
.SH POSIX CHARACTER CLASSES |
1028 |
|
Perl 5.6 (not yet released at the time of writing) is going to support the |
1029 |
|
POSIX notation for character classes, which uses names enclosed by [: and :] |
1030 |
|
within the enclosing square brackets. PCRE supports this notation. For example, |
1031 |
|
|
1032 |
|
[01[:alpha:]%] |
1033 |
|
|
1034 |
|
matches "0", "1", any alphabetic character, or "%". The supported class names |
1035 |
|
are |
1036 |
|
|
1037 |
|
alnum letters and digits |
1038 |
|
alpha letters |
1039 |
|
ascii character codes 0 - 127 |
1040 |
|
cntrl control characters |
1041 |
|
digit decimal digits (same as \\d) |
1042 |
|
graph printing characters, excluding space |
1043 |
|
lower lower case letters |
1044 |
|
print printing characters, including space |
1045 |
|
punct printing characters, excluding letters and digits |
1046 |
|
space white space (same as \\s) |
1047 |
|
upper upper case letters |
1048 |
|
word "word" characters (same as \\w) |
1049 |
|
xdigit hexadecimal digits |
1050 |
|
|
1051 |
|
The names "ascii" and "word" are Perl extensions. Another Perl extension is |
1052 |
|
negation, which is indicated by a ^ character after the colon. For example, |
1053 |
|
|
1054 |
|
[12[:^digit:]] |
1055 |
|
|
1056 |
|
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX |
1057 |
|
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
1058 |
|
supported, and an error is given if they are encountered. |
1059 |
|
|
1060 |
|
|
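For readers more familiar with the C library, the two example classes above can be compared with the \fB<ctype.h>\fR classification functions. This is only a rough sketch of the single-character tests involved, not a description of how PCRE implements them. The function names are hypothetical, and the result of \fBisalpha()\fR depends on the current locale.

```c
#include <ctype.h>

/* Rough equivalent of the class [01[:alpha:]%] for a single byte. */
int in_class_01_alpha_percent(int c)
{
    return c == '0' || c == '1' || c == '%' ||
           isalpha((unsigned char)c);
}

/* Rough equivalent of [12[:^digit:]]: "1", "2", or any non-digit. */
int in_class_12_nondigit(int c)
{
    return c == '1' || c == '2' || !isdigit((unsigned char)c);
}
```

The cast to \fBunsigned char\fR is needed because the \fB<ctype.h>\fR functions are defined only for values representable as \fBunsigned char\fR (or EOF).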
1061 |
.SH VERTICAL BAR |
.SH VERTICAL BAR |
1062 |
Vertical bar characters are used to separate alternative patterns. For example, |
Vertical bar characters are used to separate alternative patterns. For example, |
1063 |
the pattern |
the pattern |
1245 |
|
|
1246 |
/* first comment */ not comment /* second comment */ |
/* first comment */ not comment /* second comment */ |
1247 |
|
|
1248 |
fails, because it matches the entire string due to the greediness of the .* |
fails, because it matches the entire string owing to the greediness of the .* |
1249 |
item. |
item. |
1250 |
|
|
1251 |
However, if a quantifier is followed by a question mark, then it ceases to be |
However, if a quantifier is followed by a question mark, it ceases to be |
1252 |
greedy, and instead matches the minimum number of times possible, so the |
greedy, and instead matches the minimum number of times possible, so the |
1253 |
pattern |
pattern |
1254 |
|
|
1264 |
which matches one digit by preference, but can match two if that is the only |
which matches one digit by preference, but can match two if that is the only |
1265 |
way the rest of the pattern matches. |
way the rest of the pattern matches. |
1266 |
|
|
1267 |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl) |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl), |
1268 |
then the quantifiers are not greedy by default, but individual ones can be made |
the quantifiers are not greedy by default, but individual ones can be made |
1269 |
greedy by following them with a question mark. In other words, it inverts the |
greedy by following them with a question mark. In other words, it inverts the |
1270 |
default behaviour. |
default behaviour. |
1271 |
|
|
1274 |
compiled pattern, in proportion to the size of the minimum or maximum. |
compiled pattern, in proportion to the size of the minimum or maximum. |
1275 |
|
|
1276 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
1277 |
to Perl's /s) is set, thus allowing the . to match newlines, then the pattern |
to Perl's /s) is set, thus allowing the . to match newlines, the pattern is |
1278 |
is implicitly anchored, because whatever follows will be tried against every |
implicitly anchored, because whatever follows will be tried against every |
1279 |
character position in the subject string, so there is no point in retrying the |
character position in the subject string, so there is no point in retrying the |
1280 |
overall match at any position after the first. PCRE treats such a pattern as |
overall match at any position after the first. PCRE treats such a pattern as |
1281 |
though it were preceded by \\A. In cases where it is known that the subject |
though it were preceded by \\A. In cases where it is known that the subject |
1319 |
|
|
1320 |
matches "sense and sensibility" and "response and responsibility", but not |
matches "sense and sensibility" and "response and responsibility", but not |
1321 |
"sense and responsibility". If caseful matching is in force at the time of the |
"sense and responsibility". If caseful matching is in force at the time of the |
1322 |
back reference, then the case of letters is relevant. For example, |
back reference, the case of letters is relevant. For example, |
1323 |
|
|
1324 |
((?i)rah)\\s+\\1 |
((?i)rah)\\s+\\1 |
1325 |
|
|
1327 |
capturing subpattern is matched caselessly. |
capturing subpattern is matched caselessly. |
1328 |
|
|
1329 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
1330 |
subpattern has not actually been used in a particular match, then any back |
subpattern has not actually been used in a particular match, any back |
1331 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
1332 |
|
|
1333 |
(a|(bc))\\2 |
(a|(bc))\\2 |
1335 |
always fails if it starts to match "a" rather than "bc". Because there may be |
always fails if it starts to match "a" rather than "bc". Because there may be |
1336 |
up to 99 back references, all digits following the backslash are taken |
up to 99 back references, all digits following the backslash are taken |
1337 |
as part of a potential back reference number. If the pattern continues with a |
as part of a potential back reference number. If the pattern continues with a |
1338 |
digit character, then some delimiter must be used to terminate the back |
digit character, some delimiter must be used to terminate the back reference. |
1339 |
reference. If the PCRE_EXTENDED option is set, this can be whitespace. |
If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty |
1340 |
Otherwise an empty comment can be used. |
comment can be used. |
1341 |
|
|
1342 |
A back reference that occurs inside the parentheses to which it refers fails |
A back reference that occurs inside the parentheses to which it refers fails |
1343 |
when the subpattern is first used, so, for example, (a\\1) never matches. |
when the subpattern is first used, so, for example, (a\\1) never matches. |
1346 |
|
|
1347 |
(a|b\\1)+ |
(a|b\\1)+ |
1348 |
|
|
1349 |
matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of |
1350 |
the subpattern, the back reference matches the character string corresponding |
the subpattern, the back reference matches the character string corresponding |
1351 |
to the previous iteration. In order for this to work, the pattern must be such |
to the previous iteration. In order for this to work, the pattern must be such |
1352 |
that the first iteration does not need to match the back reference. This can be |
that the first iteration does not need to match the back reference. This can be |
1425 |
matches "foo" preceded by three digits that are not "999". Notice that each of |
matches "foo" preceded by three digits that are not "999". Notice that each of |
1426 |
the assertions is applied independently at the same point in the subject |
the assertions is applied independently at the same point in the subject |
1427 |
string. First there is a check that the previous three characters are all |
string. First there is a check that the previous three characters are all |
1428 |
digits, then there is a check that the same three characters are not "999". |
digits, and then there is a check that the same three characters are not "999". |
1429 |
This pattern does \fInot\fR match "foo" preceded by six characters, the first |
This pattern does \fInot\fR match "foo" preceded by six characters, the first |
1430 |
three of which are digits and the last three of which are not "999". For example, it |
three of which are digits and the last three of which are not "999". For example, it |
1431 |
doesn't match "123abcfoo". A pattern to do that is |
doesn't match "123abcfoo". A pattern to do that is |
1504 |
|
|
1505 |
abcd$ |
abcd$ |
1506 |
|
|
1507 |
when applied to a long string which does not match it. Because matching |
when applied to a long string which does not match. Because matching proceeds |
1508 |
proceeds from left to right, PCRE will look for each "a" in the subject and |
from left to right, PCRE will look for each "a" in the subject and then see if |
1509 |
then see if what follows matches the rest of the pattern. If the pattern is |
what follows matches the rest of the pattern. If the pattern is specified as |
|
specified as |
|
1510 |
|
|
1511 |
^.*abcd$ |
^.*abcd$ |
1512 |
|
|
1513 |
then the initial .* matches the entire string at first, but when this fails, it |
the initial .* matches the entire string at first, but when this fails (because |
1514 |
backtracks to match all but the last character, then all but the last two |
there is no following "a"), it backtracks to match all but the last character, |
1515 |
characters, and so on. Once again the search for "a" covers the entire string, |
then all but the last two characters, and so on. Once again the search for "a" |
1516 |
from right to left, so we are no better off. However, if the pattern is written |
covers the entire string, from right to left, so we are no better off. However, |
1517 |
as |
if the pattern is written as |
1518 |
|
|
1519 |
^(?>.*)(?<=abcd) |
^(?>.*)(?<=abcd) |
1520 |
|
|
1521 |
then there can be no backtracking for the .* item; it can match only the entire |
there can be no backtracking for the .* item; it can match only the entire |
1522 |
string. The subsequent lookbehind assertion does a single test on the last four |
string. The subsequent lookbehind assertion does a single test on the last four |
1523 |
characters. If it fails, the match fails immediately. For long strings, this |
characters. If it fails, the match fails immediately. For long strings, this |
1524 |
approach makes a significant difference to the processing time. |
approach makes a significant difference to the processing time. |
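The saving can be pictured in plain C. What follows is only a sketch and is not
PCRE code: the function name ends_with_abcd is invented for illustration, and
it assumes a subject that contains no newline characters. It performs the one
comparison that the lookbehind assertion amounts to once the .* item has taken
the whole string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the subject (of length len) ends with
   "abcd", otherwise 0. One comparison of the last four bytes, however long
   the subject is. */
static int ends_with_abcd(const char *subject, size_t len)
{
    assert(subject != NULL);
    return len >= 4 && memcmp(subject + len - 4, "abcd", 4) == 0;
}
```

The cost of the test does not depend on the length of the subject, which is the
property the once-only subpattern and lookbehind combination is aiming for.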
1525 |
|
|
1526 |
|
When a pattern contains an unlimited repeat inside a subpattern that can itself |
1527 |
|
be repeated an unlimited number of times, the use of a once-only subpattern is |
1528 |
|
the only way to avoid some failing matches taking a very long time indeed. |
1529 |
|
The pattern |
1530 |
|
|
1531 |
|
(\\D+|<\\d+>)*[!?] |
1532 |
|
|
1533 |
|
matches an unlimited number of substrings that either consist of non-digits, or |
1534 |
|
digits enclosed in <>, followed by either ! or ?. When it matches, it runs |
1535 |
|
quickly. However, if it is applied to |
1536 |
|
|
1537 |
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
1538 |
|
|
1539 |
|
it takes a long time before reporting failure. This is because the string can |
1540 |
|
be divided between the two repeats in a large number of ways, and all have to |
1541 |
|
be tried. (The example used [!?] rather than a single character at the end, |
1542 |
|
because both PCRE and Perl have an optimization that allows for fast failure |
1543 |
|
when a single character is used. They remember the last single character that |
1544 |
|
is required for a match, and fail early if it is not present in the string.) |
1545 |
|
If the pattern is changed to |
1546 |
|
|
1547 |
|
((?>\\D+)|<\\d+>)*[!?] |
1548 |
|
|
1549 |
|
sequences of non-digits cannot be broken, and failure happens quickly. |
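To get a feel for the "large number of ways", the following C sketch counts how
many ways a run of n non-digits can be cut into consecutive non-empty pieces,
each piece being one match of the inner repeat. It is an illustration only: the
name count_divisions is invented here, and the count is a rough measure of the
work a backtracking matcher may face, not a statement of what PCRE does
internally.

```c
#include <assert.h>

/* Hypothetical helper: number of ways to divide a run of n characters into
   consecutive non-empty pieces. The first piece can have any length from 1 to
   len, and the remainder is divided in the same way, so the count doubles
   with each extra character. */
static unsigned long long count_divisions(unsigned int n)
{
    unsigned long long ways[64] = { 1 };  /* ways[0] = 1: nothing left to cut */
    unsigned int len, first;

    assert(n <= 63);                      /* larger values would overflow */
    for (len = 1; len <= n; len++)
        for (first = 1; first <= len; first++)
            ways[len] += ways[len - first];
    return ways[n];
}
```

For n greater than zero the result is 2 to the power n-1, so a run of only a
few dozen characters already gives an impractically large number of divisions.
With the once-only subpattern, each run can be taken in just one way.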
1550 |
|
|
1551 |
|
|
1552 |
.SH CONDITIONAL SUBPATTERNS |
.SH CONDITIONAL SUBPATTERNS |
1553 |
It is possible to cause the matching process to obey a subpattern |
It is possible to cause the matching process to obey a subpattern |
1563 |
subpattern, a compile-time error occurs. |
subpattern, a compile-time error occurs. |
1564 |
|
|
1565 |
There are two kinds of condition. If the text between the parentheses consists |
There are two kinds of condition. If the text between the parentheses consists |
1566 |
of a sequence of digits, then the condition is satisfied if the capturing |
of a sequence of digits, the condition is satisfied if the capturing subpattern |
1567 |
subpattern of that number has previously matched. Consider the following |
of that number has previously matched. The number must be greater than zero. |
1568 |
pattern, which contains non-significant white space to make it more readable |
Consider the following pattern, which contains non-significant white space to |
1569 |
(assume the PCRE_EXTENDED option) and to divide it into three parts for ease |
make it more readable (assume the PCRE_EXTENDED option) and to divide it into |
1570 |
of discussion: |
three parts for ease of discussion: |
1571 |
|
|
1572 |
( \\( )? [^()]+ (?(1) \\) ) |
( \\( )? [^()]+ (?(1) \\) ) |
1573 |
|
|
1607 |
character in the pattern. |
character in the pattern. |
1608 |
|
|
1609 |
|
|
1610 |
|
.SH RECURSIVE PATTERNS |
1611 |
|
Consider the problem of matching a string in parentheses, allowing for |
1612 |
|
unlimited nested parentheses. Without the use of recursion, the best that can |
1613 |
|
be done is to use a pattern that matches up to some fixed depth of nesting. It |
1614 |
|
is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an |
1615 |
|
experimental facility that allows regular expressions to recurse (amongst other |
1616 |
|
things). It does this by interpolating Perl code in the expression at run time, |
1617 |
|
and the code can refer to the expression itself. A Perl pattern to solve the |
1618 |
|
parentheses problem can be created like this: |
1619 |
|
|
1620 |
|
$re = qr{\\( (?: (?>[^()]+) | (?p{$re}) )* \\)}x; |
1621 |
|
|
1622 |
|
The (?p{...}) item interpolates Perl code at run time, and in this case refers |
1623 |
|
recursively to the pattern in which it appears. Obviously, PCRE cannot support |
1624 |
|
the interpolation of Perl code. Instead, the special item (?R) is provided for |
1625 |
|
the specific case of recursion. This PCRE pattern solves the parentheses |
1626 |
|
problem (assume the PCRE_EXTENDED option is set so that white space is |
1627 |
|
ignored): |
1628 |
|
|
1629 |
|
\\( ( (?>[^()]+) | (?R) )* \\) |
1630 |
|
|
1631 |
|
First it matches an opening parenthesis. Then it matches any number of |
1632 |
|
substrings which can either be a sequence of non-parentheses, or a recursive |
1633 |
|
match of the pattern itself (i.e. a correctly parenthesized substring). Finally |
1634 |
|
there is a closing parenthesis. |
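The three steps just described can be mirrored by a small hand-written
recursive function in C. This is a sketch of what the pattern accepts, not of
how PCRE implements (?R); the name match_balanced is invented for illustration,
and the function tests for a match only at the given starting point.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: if a correctly parenthesized substring starts at s,
   returns a pointer just past its closing parenthesis; otherwise NULL. */
static const char *match_balanced(const char *s)
{
    assert(s != NULL);
    if (*s != '(')                      /* the opening parenthesis */
        return NULL;
    s++;
    for (;;) {
        /* A run of non-parentheses is taken whole and never given back,
           like the once-only subpattern. */
        size_t run = strcspn(s, "()");
        if (run > 0) {
            s += run;
            continue;
        }
        if (*s == '(') {                /* the recursive alternative */
            const char *after = match_balanced(s);
            if (after == NULL)
                return NULL;
            s = after;
            continue;
        }
        break;                          /* neither alternative applies */
    }
    return (*s == ')') ? s + 1 : NULL;  /* the closing parenthesis */
}
```

Applied to "(ab(cd)ef)", the function consumes all ten characters. Applied to a
string whose outer parenthesis is never closed, it returns NULL after a single
pass over the text.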
1635 |
|
|
1636 |
|
This particular example pattern contains nested unlimited repeats, and so the |
1637 |
|
use of a once-only subpattern for matching strings of non-parentheses is |
1638 |
|
important when applying the pattern to strings that do not match. For example, |
1639 |
|
when it is applied to |
1640 |
|
|
1641 |
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
1642 |
|
|
1643 |
|
it yields "no match" quickly. However, if a once-only subpattern is not used, |
1644 |
|
the match runs for a very long time indeed because there are so many different |
1645 |
|
ways the + and * repeats can carve up the subject, and all have to be tested |
1646 |
|
before failure can be reported. |
1647 |
|
|
1648 |
|
The values set for any capturing subpatterns are those from the outermost level |
1649 |
|
of the recursion at which the subpattern value is set. If the pattern above is |
1650 |
|
matched against |
1651 |
|
|
1652 |
|
(ab(cd)ef) |
1653 |
|
|
1654 |
|
the value for the capturing parentheses is "ef", which is the last value taken |
1655 |
|
on at the top level. If additional parentheses are added, giving |
1656 |
|
|
1657 |
|
\\( ( ( (?>[^()]+) | (?R) )* ) \\) |
1658 |
|
    ^                        ^
1659 |

    ^                        ^
1660 |
|
the string they capture is "ab(cd)ef", the contents of the top level |
1661 |
|
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
1662 |
|
has to obtain extra memory to store data during a recursion, which it does by |
1663 |
|
using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no |
1664 |
|
memory can be obtained, it saves data for the first 15 capturing parentheses |
1665 |
|
only, as there is no way to give an out-of-memory error from within a |
1666 |
|
recursion. |
1667 |
|
|
1668 |
|
|
1669 |
.SH PERFORMANCE |
.SH PERFORMANCE |
1670 |
Certain items that may appear in patterns are more efficient than others. It is |
Certain items that may appear in patterns are more efficient than others. It is |
1671 |
more efficient to use a character class like [aeiou] than a set of alternatives |
more efficient to use a character class like [aeiou] than a set of alternatives |
1721 |
applied to a whole line of "a" characters, whereas the latter takes an |
applied to a whole line of "a" characters, whereas the latter takes an |
1722 |
appreciable time with strings longer than about 20 characters. |
appreciable time with strings longer than about 20 characters. |
1723 |
|
|
1724 |
|
|
1725 |
|
.SH UTF-8 SUPPORT |
1726 |
|
Starting at release 3.3, PCRE has some support for character strings encoded |
1727 |
|
in the UTF-8 format. This is incomplete, and is regarded as experimental. In |
1728 |
|
order to use it, you must configure PCRE to include UTF-8 support in the code, |
1729 |
|
and, in addition, you must call \fBpcre_compile()\fR with the PCRE_UTF8 option |
1730 |
|
flag. When you do this, both the pattern and any subject strings that are |
1731 |
|
matched against it are treated as UTF-8 strings instead of just strings of |
1732 |
|
bytes, but only in the cases that are mentioned below. |
1733 |
|
|
1734 |
|
If you compile PCRE with UTF-8 support, but do not use it at run time, the |
1735 |
|
library will be a bit bigger, but the additional run time overhead is limited |
1736 |
|
to testing the PCRE_UTF8 flag in several places, so should not be very large. |
1737 |
|
|
1738 |
|
PCRE assumes that the strings it is given contain valid UTF-8 codes. It does |
1739 |
|
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE, |
1740 |
|
the results are undefined. |
1741 |
|
|
1742 |
|
Running with PCRE_UTF8 set causes these changes in the way PCRE works: |
1743 |
|
|
1744 |
|
1. In a pattern, the escape sequence \\x{...}, where the contents of the braces |
1745 |
|
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose |
1746 |
|
code number is the given hexadecimal number, for example: \\x{1234}. This |
1747 |
|
inserts from one to six literal bytes into the pattern, using the UTF-8 |
1748 |
|
encoding. If a non-hexadecimal digit appears between the braces, the item is |
1749 |
|
not recognized. |
1750 |
|
|
1751 |
|
2. The original hexadecimal escape sequence, \\xhh, generates a two-byte UTF-8 |
1752 |
|
character if its value is greater than 127. |
1753 |
|
|
1754 |
|
3. Repeat quantifiers are NOT correctly handled if they follow a multibyte |
1755 |
|
character. For example, \\x{100}* and \\xc3+ do not work. If you want to |
1756 |
|
repeat such characters, you must enclose them in non-capturing parentheses, |
1757 |
|
for example (?:\\x{100}), at present. |
1758 |
|
|
1759 |
|
4. The dot metacharacter matches one UTF-8 character instead of a single byte. |
1760 |
|
|
1761 |
|
5. Unlike literal UTF-8 characters, the dot metacharacter followed by a |
1762 |
|
repeat quantifier does operate correctly on UTF-8 characters instead of |
1763 |
|
single bytes. |
1764 |
|
|
1765 |
|
6. Although the \\x{...} escape is permitted in a character class, characters
1766 |

whose values are greater than 255 cannot be included in a class.
1767 |


1768 |

7. A class is matched against a UTF-8 character instead of just a single byte,
1769 |

but it can match only characters whose values are less than 256. Characters
1770 |

with greater values always fail to match a class.
1771 |


1772 |

8. Repeated classes work correctly on multiple characters.
1773 |


1774 |

9. Classes containing just a single character whose value is greater than 127
1775 |

(but less than 256), for example, [\\x80] or [^\\x{93}], do not work because
1776 |

these are optimized into single byte matches. In the first case, of course,
1777 |

the class brackets are just redundant.
1778 |


1779 |

10. Lookbehind assertions move backwards in the subject by a fixed number of
1780 |

characters instead of a fixed number of bytes. Simple cases have been tested
1781 |

to work correctly, but there may be hidden gotchas herein.
1782 |


1783 |

11. The character types such as \\d and \\w do not work correctly with UTF-8
1784 |

characters. They continue to test a single byte.
1785 |


1786 |

12. Anything not explicitly mentioned here continues to work in bytes rather
1787 |
|
than in characters. |
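Two of the points above concern the byte layout of UTF-8: the hexadecimal
escapes insert from one to six bytes for a character, and lookbehind assertions
step backwards by characters instead of bytes. The C sketch below illustrates
both operations. It is not taken from the PCRE source; the names utf8_encode
and utf8_back are invented for illustration, and utf8_back assumes valid UTF-8
input, as PCRE itself does.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: writes the UTF-8 encoding of code point cp into out
   (which must have room for 6 bytes) and returns the number of bytes used,
   or 0 if cp needs more than 31 bits. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    static const unsigned char lead[] = { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    size_t n, i;

    assert(out != NULL);
    if (cp > 0x7FFFFFFFUL) return 0;
    if (cp < 0x80UL) n = 1;
    else if (cp < 0x800UL) n = 2;
    else if (cp < 0x10000UL) n = 3;
    else if (cp < 0x200000UL) n = 4;
    else if (cp < 0x4000000UL) n = 5;
    else n = 6;

    /* Fill the continuation bytes from the end, six bits at a time. */
    for (i = n - 1; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[n - 1] | cp);
    return n;
}

/* Hypothetical helper: moves p backwards by nchars UTF-8 characters without
   passing start. Returns the new position, or NULL if there are too few
   characters. Continuation bytes have the form 10xxxxxx. */
static const unsigned char *utf8_back(const unsigned char *p,
                                      const unsigned char *start,
                                      size_t nchars)
{
    assert(p != NULL && start != NULL);
    while (nchars-- > 0) {
        if (p == start) return NULL;
        p--;
        while (p > start && (*p & 0xC0) == 0x80) p--;
    }
    return p;
}
```

With this encoder the code point 0x1234 becomes the three bytes E1 88 B4, and
0xC3 becomes the two bytes C3 83, which is why a value greater than 127 no
longer occupies a single byte.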
1788 |
|
|
1789 |
|
The following UTF-8 features of Perl 5.6 are not implemented: |
1790 |
|
|
1791 |
|
1. The escape sequence \\C to match a single byte. |
1792 |
|
|
1793 |
|
2. The use of Unicode tables and properties and escapes \\p, \\P, and \\X. |
1794 |
|
|
1795 |
.SH AUTHOR |
.SH AUTHOR |
1796 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
1797 |
.br |
.br |
1803 |
.br |
.br |
1804 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
1805 |
|
|
1806 |
Last updated: 29 July 1999 |
Last updated: 28 August 2000, |
1807 |
|
.br |
1808 |
|
the 250th anniversary of the death of J.S. Bach. |
1809 |
.br |
.br |
1810 |
Copyright (c) 1997-1999 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |