28 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
29 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
30 |
|
|
31 |
|
void pcre_free_substring(const char *stringptr); |
32 |
|
|
33 |
|
void pcre_free_substring_list(const char **stringptr); |
34 |
|
|
35 |
const unsigned char *pcre_maketables(void); |
const unsigned char *pcre_maketables(void); |
36 |
|
|
37 |
|
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
38 |
|
int what, void *where); |
39 |
|
|
40 |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
41 |
|
|
42 |
char *pcre_version(void); |
char *pcre_version(void); |
52 |
The PCRE library is a set of functions that implement regu- |
The PCRE library is a set of functions that implement regu- |
53 |
lar expression pattern matching using the same syntax and |
lar expression pattern matching using the same syntax and |
54 |
semantics as Perl 5, with just a few differences (see |
semantics as Perl 5, with just a few differences (see |
55 |
|
|
56 |
below). The current implementation corresponds to Perl |
below). The current implementation corresponds to Perl |
57 |
5.005. |
5.005, with some additional features from later versions. |
58 |
|
This includes some experimental, incomplete support for |
59 |
|
UTF-8 encoded strings. Details of exactly what is and what |
60 |
|
is not supported are given below. |
61 |
|
|
62 |
PCRE has its own native API, which is described in this |
PCRE has its own native API, which is described in this |
63 |
document. There is also a set of wrapper functions that |
document. There is also a set of wrapper functions that |
64 |
correspond to the POSIX API. These are described in the |
correspond to the POSIX regular expression API. These are |
65 |
pcreposix documentation. |
described in the pcreposix documentation. |
66 |
|
|
67 |
The native API function prototypes are defined in the header |
The native API function prototypes are defined in the header |
68 |
file pcre.h, and on Unix systems the library itself is |
file pcre.h, and on Unix systems the library itself is |
69 |
called libpcre.a, so can be accessed by adding -lpcre to the |
called libpcre.a, so can be accessed by adding -lpcre to the |
70 |
command for linking an application which calls it. |
command for linking an application which calls it. The |
71 |
|
header file defines the macros PCRE_MAJOR and PCRE_MINOR to |
72 |
|
contain the major and minor release numbers for the library. |
73 |
|
Applications can use these to include support for different |
74 |
|
releases. |
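
As a rough sketch of how an application might branch on these macros: the helper PCRE_AT_LEAST, the function release_is_at_least(), and the APP_PCRE_NOTE strings below are hypothetical names, not part of pcre.h. The fallback values and the 2.0 threshold are arbitrary placeholders, present only so that the fragment compiles without the header.

```c
/* In a real program PCRE_MAJOR and PCRE_MINOR come from #include <pcre.h>.
   The fallback values are placeholders (an assumption of this sketch). */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 3
#define PCRE_MINOR 0
#endif

/* Hypothetical helper: true if the header describes release maj.min or later. */
#define PCRE_AT_LEAST(maj, min) \
    (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Compile-time selection; the 2.0 threshold is purely illustrative. */
#if PCRE_AT_LEAST(2, 0)
#define APP_PCRE_NOTE "release 2.0 or later"
#else
#define APP_PCRE_NOTE "older than release 2.0"
#endif

/* The same comparison as a run-time check, convenient for diagnostics. */
static int release_is_at_least(int maj, int min)
{
    return PCRE_AT_LEAST(maj, min) ? 1 : 0;
}
```

The comparison is done on the major number first and on the minor number only when the major numbers are equal, so a release such as 3.0 counts as later than 2.9.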
75 |
|
|
76 |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
77 |
are used for compiling and matching regular expressions, |
are used for compiling and matching regular expressions. |
78 |
while pcre_copy_substring(), pcre_get_substring(), and |
|
79 |
pcre_get_substring_list() are convenience functions for |
The functions pcre_copy_substring(), pcre_get_substring(), |
80 |
|
and pcre_get_substring_list() are convenience functions for |
81 |
extracting captured substrings from a matched subject |
extracting captured substrings from a matched subject |
82 |
string. The function pcre_maketables() is used (optionally) |
string; pcre_free_substring() and pcre_free_substring_list() |
83 |
to build a set of character tables in the current locale for |
are also provided, to free the memory used for extracted |
84 |
passing to pcre_compile(). |
strings. |
85 |
|
|
86 |
The function pcre_info() is used to find out information |
The function pcre_maketables() is used (optionally) to build |
87 |
about a compiled pattern, while the function pcre_version() |
a set of character tables in the current locale for passing |
88 |
returns a pointer to a string containing the version of PCRE |
to pcre_compile(). |
89 |
and its date of release. |
|
90 |
|
The function pcre_fullinfo() is used to find out information |
91 |
|
about a compiled pattern; pcre_info() is an obsolete version |
92 |
|
which returns only some of the available information, but is |
93 |
|
retained for backwards compatibility. The function |
94 |
|
pcre_version() returns a pointer to a string containing the |
95 |
|
version of PCRE and its date of release. |
96 |
|
|
97 |
The global variables pcre_malloc and pcre_free initially |
The global variables pcre_malloc and pcre_free initially |
98 |
contain the entry points of the standard malloc() and free() |
contain the entry points of the standard malloc() and free() |
104 |
|
|
105 |
|
|
106 |
MULTI-THREADING |
MULTI-THREADING |
107 |
The PCRE functions can be used in multi-threading applica- |
The PCRE functions can be used in multi-threading |
108 |
tions, with the proviso that the memory management functions |
|
109 |
pointed to by pcre_malloc and pcre_free are shared by all |
|
110 |
threads. |
|
111 |
|
|
112 |
|
|
113 |
|
SunOS 5.8 Last change: 2 |
114 |
|
|
115 |
|
|
116 |
|
|
117 |
|
applications, with the proviso that the memory management |
118 |
|
functions pointed to by pcre_malloc and pcre_free are shared |
119 |
|
by all threads. |
120 |
|
|
121 |
The compiled form of a regular expression is not altered |
The compiled form of a regular expression is not altered |
122 |
during matching, so the same compiled pattern can safely be |
during matching, so the same compiled pattern can safely be |
219 |
|
|
220 |
PCRE_EXTRA |
PCRE_EXTRA |
221 |
|
|
222 |
This option turns on additional functionality of PCRE that |
This option was invented in order to turn on additional |
223 |
is incompatible with Perl. Any backslash in a pattern that |
functionality of PCRE that is incompatible with Perl, but it |
224 |
is followed by a letter that has no special meaning causes |
is currently of very little use. When set, any backslash in |
225 |
an error, thus reserving these combinations for future |
a pattern that is followed by a letter that has no special |
226 |
expansion. By default, as in Perl, a backslash followed by a |
meaning causes an error, thus reserving these combinations |
227 |
letter with no special meaning is treated as a literal. |
for future expansion. By default, as in Perl, a backslash |
228 |
There are at present no other features controlled by this |
followed by a letter with no special meaning is treated as a |
229 |
option. |
literal. There are at present no other features controlled |
230 |
|
by this option. It can also be set by a (?X) option setting |
231 |
|
within a pattern. |
232 |
|
|
233 |
PCRE_MULTILINE |
PCRE_MULTILINE |
234 |
|
|
241 |
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl. |
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl. |
242 |
|
|
243 |
When PCRE_MULTILINE is set, the "start of line" and "end |
When PCRE_MULTILINE is set, the "start of line" and "end |
244 |
of line" constructs match immediately following or |
of line" constructs match immediately following or immedi- |
245 |
immediately before any newline in the subject string, |
ately before any newline in the subject string, respec- |
246 |
respectively, as well as at the very start and end. This is |
tively, as well as at the very start and end. This is |
247 |
equivalent to Perl's /m option. If there are no "\n" charac- |
equivalent to Perl's /m option. If there are no "\n" charac- |
248 |
ters in a subject string, or no occurrences of ^ or $ in a |
ters in a subject string, or no occurrences of ^ or $ in a |
249 |
pattern, setting PCRE_MULTILINE has no effect. |
pattern, setting PCRE_MULTILINE has no effect. |
255 |
followed by "?". It is not compatible with Perl. It can also |
followed by "?". It is not compatible with Perl. It can also |
256 |
be set by a (?U) option setting within the pattern. |
be set by a (?U) option setting within the pattern. |
257 |
|
|
258 |
|
PCRE_UTF8 |
259 |
|
|
260 |
|
This option causes PCRE to regard both the pattern and the |
261 |
|
subject as strings of UTF-8 characters instead of just byte |
262 |
|
strings. However, it is available only if PCRE has been |
263 |
|
built to include UTF-8 support. If not, the use of this |
264 |
|
option provokes an error. Support for UTF-8 is new, experi- |
265 |
|
mental, and incomplete. Details of exactly what it entails |
266 |
|
are given below. |
267 |
|
|
268 |
|
|
269 |
|
|
270 |
STUDYING A PATTERN |
STUDYING A PATTERN |
271 |
When a pattern is going to be used several times, it is |
When a pattern is going to be used several times, it is |
272 |
worth spending more time analyzing it in order to speed up |
worth spending more time analyzing it in order to speed up |
273 |
the time taken for matching. The function pcre_study() takes |
the time taken for matching. The function pcre_study() takes |
274 |
|
|
275 |
a pointer to a compiled pattern as its first argument, and |
a pointer to a compiled pattern as its first argument, and |
276 |
returns a pointer to a pcre_extra block (another void |
returns a pointer to a pcre_extra block (another void |
277 |
typedef) containing additional information about the pat- |
typedef) containing additional information about the pat- |
329 |
|
|
330 |
|
|
331 |
INFORMATION ABOUT A PATTERN |
INFORMATION ABOUT A PATTERN |
332 |
The pcre_info() function returns information about a com- |
The pcre_fullinfo() function returns information about a |
333 |
piled pattern. Its yield is the number of capturing subpat- |
compiled pattern. It replaces the obsolete pcre_info() func- |
334 |
terns, or one of the following negative numbers: |
tion, which is nevertheless retained for backwards compatibil- |
335 |
|
ity (and is documented below). |
336 |
|
|
337 |
|
The first argument for pcre_fullinfo() is a pointer to the |
338 |
|
compiled pattern. The second argument is the result of |
339 |
|
pcre_study(), or NULL if the pattern was not studied. The |
340 |
|
third argument specifies which piece of information is |
341 |
|
required, while the fourth argument is a pointer to a vari- |
342 |
|
able to receive the data. The yield of the function is zero |
343 |
|
for success, or one of the following negative numbers: |
344 |
|
|
345 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
346 |
|
the argument where was NULL |
347 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
348 |
|
PCRE_ERROR_BADOPTION the value of what was invalid |
349 |
|
|
350 |
If the optptr argument is not NULL, a copy of the options |
The possible values for the third argument are defined in |
351 |
with which the pattern was compiled is placed in the integer |
pcre.h, and are as follows: |
352 |
it points to. These option bits are those specified in the |
|
353 |
|
PCRE_INFO_OPTIONS |
354 |
|
|
355 |
|
Return a copy of the options with which the pattern was com- |
356 |
|
piled. The fourth argument should point to an unsigned long |
357 |
|
int variable. These option bits are those specified in the |
358 |
call to pcre_compile(), modified by any top-level option |
call to pcre_compile(), modified by any top-level option |
359 |
settings within the pattern itself, and with the |
settings within the pattern itself, and with the |
360 |
PCRE_ANCHORED bit set if the form of the pattern implies |
PCRE_ANCHORED bit forcibly set if the form of the pattern |
361 |
that it can match only at the start of a subject string. |
implies that it can match only at the start of a subject |
362 |
|
string. |
363 |
|
|
364 |
If the pattern is not anchored and the firstcharptr argument |
PCRE_INFO_SIZE |
365 |
is not NULL, it is used to pass back information about the |
|
366 |
first character of any matched string. If there is a fixed |
Return the size of the compiled pattern, that is, the value |
367 |
first character, e.g. from a pattern such as |
that was passed as the argument to pcre_malloc() when PCRE |
368 |
(cat|cow|coyote), then it is returned in the integer pointed |
was getting memory in which to place the compiled data. The |
369 |
to by firstcharptr. Otherwise, if either |
fourth argument should point to a size_t variable. |
370 |
|
|
371 |
|
PCRE_INFO_CAPTURECOUNT |
372 |
|
|
373 |
|
Return the number of capturing subpatterns in the pattern. |
374 |
|
The fourth argument should point to an int variable. |
375 |
|
|
376 |
|
PCRE_INFO_BACKREFMAX |
377 |
|
|
378 |
|
Return the number of the highest back reference in the |
379 |
|
pattern. The fourth argument should point to an int vari- |
380 |
|
able. Zero is returned if there are no back references. |
381 |
|
|
382 |
|
PCRE_INFO_FIRSTCHAR |
383 |
|
|
384 |
|
Return information about the first character of any matched |
385 |
|
string, for a non-anchored pattern. If there is a fixed |
386 |
|
first character, e.g. from a pattern such as |
387 |
|
(cat|cow|coyote), it is returned in the integer pointed to |
388 |
|
by where. Otherwise, if either |
389 |
|
|
390 |
(a) the pattern was compiled with the PCRE_MULTILINE option, |
(a) the pattern was compiled with the PCRE_MULTILINE option, |
391 |
and every branch starts with "^", or |
and every branch starts with "^", or |
393 |
(b) every branch of the pattern starts with ".*" and |
(b) every branch of the pattern starts with ".*" and |
394 |
PCRE_DOTALL is not set (if it were set, the pattern would be |
PCRE_DOTALL is not set (if it were set, the pattern would be |
395 |
anchored), |
anchored), |
396 |
then -1 is returned, indicating that the pattern matches |
|
397 |
only at the start of a subject string or after any "\n" |
-1 is returned, indicating that the pattern matches only at |
398 |
within the string. Otherwise -2 is returned. |
the start of a subject string or after any "\n" within the |
399 |
|
string. Otherwise -2 is returned. For anchored patterns, -2 |
400 |
|
is returned. |
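
The three kinds of result described for PCRE_INFO_FIRSTCHAR can be told apart as in the following sketch. The enum and the function classify_firstchar() are hypothetical names introduced for illustration; the argument is assumed to be the integer that pcre_fullinfo() stored in the variable pointed to by where.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTCHAR result. */
enum first_char_kind {
    FIRST_FIXED,          /* value is the fixed first character */
    FIRST_AFTER_NEWLINE,  /* -1: match only at start or after "\n" */
    FIRST_UNKNOWN         /* -2: nothing known, or pattern is anchored */
};

static enum first_char_kind classify_firstchar(int value)
{
    if (value >= 0)
        return FIRST_FIXED;
    if (value == -1)
        return FIRST_AFTER_NEWLINE;
    return FIRST_UNKNOWN;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be 'c', which this helper classifies as FIRST_FIXED.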
401 |
|
|
402 |
|
PCRE_INFO_FIRSTTABLE |
403 |
|
|
404 |
|
If the pattern was studied, and this resulted in the con- |
405 |
|
struction of a 256-bit table indicating a fixed set of char- |
406 |
|
acters for the first character in any matching string, a |
407 |
|
pointer to the table is returned. Otherwise NULL is |
408 |
|
returned. The fourth argument should point to an unsigned |
409 |
|
char * variable. |
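
A 256-bit table occupies 32 bytes, one bit per possible byte value. The sketch below shows one way to query such a table. The bit layout (bit c stored in byte c/8 at position c%8, least significant bit first) is an assumption of this example, since the text does not specify it, and both function names are hypothetical.

```c
/* Test whether byte value c is marked in a 256-bit (32-byte) table.
   ASSUMPTION: bit c lives in byte c/8 at bit position c%8. */
static int table_has_char(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] >> (c % 8)) & 1;
}

/* Mark byte value c in a table, using the same assumed layout. This is
   only here so the lookup can be exercised without the library. */
static void table_set_char(unsigned char *table, unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c % 8));
}
```

An application could use such a lookup to skip subject positions whose character cannot start a match, which is the purpose the table serves inside the matcher.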
410 |
|
|
411 |
|
PCRE_INFO_LASTLITERAL |
412 |
|
|
413 |
|
For a non-anchored pattern, return the value of the right- |
414 |
|
most literal character which must exist in any matched |
415 |
|
string, other than at its start. The fourth argument should |
416 |
|
point to an int variable. If there is no such character, or |
417 |
|
if the pattern is anchored, -1 is returned. For example, for |
418 |
|
the pattern /a\d+z\d+/ the returned value is 'z'. |
419 |
|
|
420 |
|
The pcre_info() function is now obsolete because its inter- |
421 |
|
face is too restrictive to return all the available data |
422 |
|
about a compiled pattern. New programs should use |
423 |
|
pcre_fullinfo() instead. The yield of pcre_info() is the |
424 |
|
number of capturing subpatterns, or one of the following |
425 |
|
negative numbers: |
426 |
|
|
427 |
|
PCRE_ERROR_NULL the argument code was NULL |
428 |
|
PCRE_ERROR_BADMAGIC the "magic number" was not found |
429 |
|
|
430 |
|
If the optptr argument is not NULL, a copy of the options |
431 |
|
with which the pattern was compiled is placed in the integer |
432 |
|
it points to (see PCRE_INFO_OPTIONS above). |
433 |
|
|
434 |
|
If the pattern is not anchored and the firstcharptr argument |
435 |
|
is not NULL, it is used to pass back information about the |
436 |
|
first character of any matched string (see |
437 |
|
PCRE_INFO_FIRSTCHAR above). |
438 |
|
|
439 |
|
|
440 |
|
|
636 |
|
|
637 |
EXTRACTING CAPTURED SUBSTRINGS |
EXTRACTING CAPTURED SUBSTRINGS |
638 |
Captured substrings can be accessed directly by using the |
Captured substrings can be accessed directly by using the |
639 |
|
|
640 |
|
|
641 |
|
|
642 |
|
|
643 |
|
|
644 |
|
|
645 |
|
|
646 |
|
|
647 |
|
|
648 |
offsets returned by pcre_exec() in ovector. For convenience, |
offsets returned by pcre_exec() in ovector. For convenience, |
649 |
the functions pcre_copy_substring(), pcre_get_substring(), |
the functions pcre_copy_substring(), pcre_get_substring(), |
650 |
and pcre_get_substring_list() are provided for extracting |
and pcre_get_substring_list() are provided for extracting |
662 |
entire regular expression. This is the value returned by |
entire regular expression. This is the value returned by |
663 |
pcre_exec if it is greater than zero. If pcre_exec() |
pcre_exec if it is greater than zero. If pcre_exec() |
664 |
returned zero, indicating that it ran out of space in ovec- |
returned zero, indicating that it ran out of space in ovec- |
665 |
tor, then the value passed as stringcount should be the size |
tor, the value passed as stringcount should be the size of |
666 |
of the vector divided by three. |
the vector divided by three. |
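
The rule for choosing stringcount can be written down as a small helper. The function name stringcount_from_exec() is hypothetical; rc is assumed to be the value returned by pcre_exec() and ovecsize the number of ints in ovector.

```c
/* Hypothetical helper: derive the stringcount argument for the substring
   functions. Returns -1 when rc is negative (no match or an error), in
   which case there is nothing to extract. */
static int stringcount_from_exec(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* pcre_exec() reported the count itself */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small: use its capacity */
    return -1;
}
```

With a 30-element ovector, a return value of 0 from pcre_exec() therefore gives a stringcount of 10.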
667 |
|
|
668 |
The functions pcre_copy_substring() and pcre_get_substring() |
The functions pcre_copy_substring() and pcre_get_substring() |
669 |
extract a single substring, whose number is given as string- |
extract a single substring, whose number is given as string- |
671 |
the entire pattern, while higher values extract the captured |
the entire pattern, while higher values extract the captured |
672 |
substrings. For pcre_copy_substring(), the string is placed |
substrings. For pcre_copy_substring(), the string is placed |
673 |
in buffer, whose length is given by buffersize, while for |
in buffer, whose length is given by buffersize, while for |
674 |
pcre_get_substring() a new block of store is obtained via |
pcre_get_substring() a new block of memory is obtained via |
675 |
pcre_malloc, and its address is returned via stringptr. The |
pcre_malloc, and its address is returned via stringptr. The |
676 |
yield of the function is the length of the string, not |
yield of the function is the length of the string, not |
677 |
including the terminating zero, or one of |
including the terminating zero, or one of |
705 |
inspecting the appropriate offset in ovector, which is nega- |
inspecting the appropriate offset in ovector, which is nega- |
706 |
tive for unset substrings. |
tive for unset substrings. |
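
To show what direct access through ovector involves, here is a minimal sketch in plain C. It assumes that ovector holds pairs of byte offsets, with capture n occupying ovector[2*n] (start) and ovector[2*n+1] (end). The function copy_capture() and its negative return codes are hypothetical and are not the library's own; in particular, this sketch chooses to report an unset group explicitly.

```c
#include <string.h>

/* Hypothetical stand-in for copying capture n out of the subject.
   Returns the length copied, or a negative code chosen for this sketch:
   -1 no such substring, -2 substring unset, -3 buffer too small. */
static int copy_capture(const char *subject, const int *ovector,
                        int stringcount, int n, char *buffer, int buffersize)
{
    int start, length;

    if (n < 0 || n >= stringcount)
        return -1;
    start = ovector[2 * n];
    if (start < 0)                      /* negative offset: group unset */
        return -2;
    length = ovector[2 * n + 1] - start;
    if (length + 1 > buffersize)        /* need room for the terminator */
        return -3;
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

For the subject "hello world" and an ovector beginning 0, 11, 6, 11, -1, -1, substring 1 is "world" and substring 2 is reported as unset.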
707 |
|
|
708 |
|
The two convenience functions pcre_free_substring() and |
709 |
|
pcre_free_substring_list() can be used to free the memory |
710 |
|
returned by a previous call of pcre_get_substring() or |
711 |
|
pcre_get_substring_list(), respectively. They do nothing |
712 |
|
more than call the function pointed to by pcre_free, which |
713 |
|
of course could be called directly from a C program. How- |
714 |
|
ever, PCRE is used in some situations where it is linked via |
715 |
|
a special interface to another programming language which |
716 |
|
cannot use pcre_free directly; it is for these cases that |
717 |
|
the functions are provided. |
718 |
|
|
719 |
|
|
720 |
|
|
779 |
6. The Perl \G assertion is not supported as it is not |
6. The Perl \G assertion is not supported as it is not |
780 |
relevant to single pattern matches. |
relevant to single pattern matches. |
781 |
|
|
782 |
7. Fairly obviously, PCRE does not support the (?{code}) |
7. Fairly obviously, PCRE does not support the (?{code}) and |
783 |
construction. |
(?p{code}) constructions. However, there is some experimen- |
784 |
|
tal support for recursive patterns using the non-Perl item |
785 |
|
(?R). |
786 |
|
|
787 |
8. There are at the time of writing some oddities in Perl |
8. There are at the time of writing some oddities in Perl |
788 |
5.005_02 concerned with the settings of captured strings |
5.005_02 concerned with the settings of captured strings |
790 |
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
791 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 |
792 |
unset. However, if the pattern is changed to |
unset. However, if the pattern is changed to |
793 |
/^(aa(b(b))?)+$/ then $2 (and $3) get set. |
/^(aa(b(b))?)+$/ then $2 (and $3) are set. |
794 |
|
|
795 |
In Perl 5.004 $2 is set in both cases, and that is also true |
In Perl 5.004 $2 is set in both cases, and that is also true |
796 |
of PCRE. If in the future Perl changes to a consistent state |
of PCRE. If in the future Perl changes to a consistent state |
816 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
817 |
with no special meaning is faulted. |
with no special meaning is faulted. |
818 |
|
|
819 |
(d) If PCRE_UNGREEDY is set, the greediness of the |
(d) If PCRE_UNGREEDY is set, the greediness of the repeti- |
820 |
repetition quantifiers is inverted, that is, by default they |
tion quantifiers is inverted, that is, by default they are |
821 |
are not greedy, but if followed by a question mark they are. |
not greedy, but if followed by a question mark they are. |
822 |
|
|
823 |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
824 |
only at the start of the subject. |
only at the start of the subject. |
826 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options |
827 |
for pcre_exec() have no Perl equivalents. |
for pcre_exec() have no Perl equivalents. |
828 |
|
|
829 |
|
(g) The (?R) construct allows for recursive pattern matching |
830 |
|
(Perl 5.6 can do this using the (?p{code}) construct, which |
831 |
|
PCRE cannot of course support.) |
832 |
|
|
833 |
|
|
834 |
|
|
835 |
REGULAR EXPRESSION DETAILS |
REGULAR EXPRESSION DETAILS |
838 |
also described in the Perl documentation and in a number of |
also described in the Perl documentation and in a number of |
839 |
other books, some of which have copious examples. Jeffrey |
other books, some of which have copious examples. Jeffrey |
840 |
Friedl's "Mastering Regular Expressions", published by |
Friedl's "Mastering Regular Expressions", published by |
841 |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
842 |
|
|
843 |
The description here is intended as reference documentation. |
The description here is intended as reference documentation. |
844 |
|
The basic operation of PCRE is on strings of bytes. However, |
845 |
|
there are the beginnings of some support for UTF-8 character |
846 |
|
strings. To use this support you must configure PCRE to |
847 |
|
include it, and then call pcre_compile() with the PCRE_UTF8 |
848 |
|
option. How this affects the pattern matching is described |
849 |
|
in the final section of this document. |
850 |
|
|
851 |
A regular expression is a pattern that is matched against a |
A regular expression is a pattern that is matched against a |
852 |
subject string from left to right. Most characters stand for |
subject string from left to right. Most characters stand for |
932 |
\f formfeed (hex 0C) |
\f formfeed (hex 0C) |
933 |
\n newline (hex 0A) |
\n newline (hex 0A) |
934 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
935 |
\t tab (hex 09) |
\t tab (hex 09) |
936 |
\xhh character with hex code hh |
\xhh character with hex code hh |
937 |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or backreference |
938 |
|
|
984 |
Note that octal values of 100 or greater must not be intro- |
Note that octal values of 100 or greater must not be intro- |
985 |
duced by a leading zero, because no more than three octal |
duced by a leading zero, because no more than three octal |
986 |
digits are ever read. |
digits are ever read. |
987 |
|
|
988 |
All the sequences that define a single byte value can be |
All the sequences that define a single byte value can be |
989 |
used both inside and outside character classes. In addition, |
used both inside and outside character classes. In addition, |
990 |
inside a character class, the sequence "\b" is interpreted |
inside a character class, the sequence "\b" is interpreted |
1037 |
These assertions may not appear in character classes (but |
These assertions may not appear in character classes (but |
1038 |
note that "\b" has a different meaning, namely the backspace |
note that "\b" has a different meaning, namely the backspace |
1039 |
character, inside a character class). |
character, inside a character class). |
1040 |
|
|
1041 |
A word boundary is a position in the subject string where |
A word boundary is a position in the subject string where |
1042 |
the current character and the previous character do not both |
the current character and the previous character do not both |
1043 |
match \w or \W (i.e. one matches \w and the other matches |
match \w or \W (i.e. one matches \w and the other matches |
1061 |
Outside a character class, in the default matching mode, the |
Outside a character class, in the default matching mode, the |
1062 |
circumflex character is an assertion which is true only if |
circumflex character is an assertion which is true only if |
1063 |
the current matching point is at the start of the subject |
the current matching point is at the start of the subject |
1064 |
|
|
1065 |
string. If the startoffset argument of pcre_exec() is non- |
string. If the startoffset argument of pcre_exec() is non- |
1066 |
zero, circumflex can never match. Inside a character class, |
zero, circumflex can never match. Inside a character class, |
1067 |
circumflex has an entirely different meaning (see below). |
circumflex has an entirely different meaning (see below). |
1114 |
Outside a character class, a dot in the pattern matches any |
Outside a character class, a dot in the pattern matches any |
1115 |
one character in the subject, including a non-printing char- |
one character in the subject, including a non-printing char- |
1116 |
acter, but not (by default) newline. If the PCRE_DOTALL |
acter, but not (by default) newline. If the PCRE_DOTALL |
1117 |
option is set, then dots match newlines as well. The han- |
|
1118 |
dling of dot is entirely independent of the handling of cir- |
option is set, dots match newlines as well. The handling of |
1119 |
cumflex and dollar, the only relationship being that they |
dot is entirely independent of the handling of circumflex |
1120 |
both involve newline characters. Dot has no special meaning |
and dollar, the only relationship being that they both |
1121 |
in a character class. |
involve newline characters. Dot has no special meaning in a |
1122 |
|
character class. |
1123 |
|
|
1124 |
|
|
1125 |
|
|
1201 |
|
|
1202 |
|
|
1203 |
|
|
1204 |
|
POSIX CHARACTER CLASSES |
1205 |
|
Perl 5.6 (not yet released at the time of writing) is going |
1206 |
|
to support the POSIX notation for character classes, which |
1207 |
|
uses names enclosed by [: and :] within the enclosing |
1208 |
|
square brackets. PCRE supports this notation. For example, |
1209 |
|
|
1210 |
|
[01[:alpha:]%] |
1211 |
|
|
1212 |
|
matches "0", "1", any alphabetic character, or "%". The sup- |
1213 |
|
ported class names are |
1214 |
|
|
1215 |
|
alnum letters and digits |
1216 |
|
alpha letters |
1217 |
|
ascii character codes 0 - 127 |
1218 |
|
cntrl control characters |
1219 |
|
digit decimal digits (same as \d) |
1220 |
|
graph printing characters, excluding space |
1221 |
|
lower lower case letters |
1222 |
|
print printing characters, including space |
1223 |
|
punct printing characters, excluding letters and digits |
1224 |
|
space white space (same as \s) |
1225 |
|
upper upper case letters |
1226 |
|
word "word" characters (same as \w) |
1227 |
|
xdigit hexadecimal digits |
1228 |
|
|
1229 |
|
The names "ascii" and "word" are Perl extensions. Another |
1230 |
|
Perl extension is negation, which is indicated by a ^ char- |
1231 |
|
acter after the colon. For example, |
1232 |
|
|
1233 |
|
[12[:^digit:]] |
1234 |
|
|
1235 |
|
matches "1", "2", or any non-digit. PCRE (and Perl) also |
1236 |
|
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a |
1237 |
|
"collating element", but these are not supported, and an |
1238 |
|
error is given if they are encountered. |
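
The meaning of the two example classes can be approximated for a single character with the standard <ctype.h> functions, as sketched below. This is only an illustration of the semantics in the current C locale, not a description of how PCRE evaluates classes internally, and the function names are hypothetical.

```c
#include <ctype.h>

/* Approximates [01[:alpha:]%]: "0", "1", any alphabetic character, or "%". */
static int in_first_class(unsigned char c)
{
    return c == '0' || c == '1' || isalpha(c) || c == '%';
}

/* Approximates [12[:^digit:]]: "1", "2", or any character that is not
   a decimal digit. */
static int in_second_class(unsigned char c)
{
    return c == '1' || c == '2' || !isdigit(c);
}
```

Note that in the second class the digits "1" and "2" are accepted only because they are listed explicitly; every other digit is rejected by the negated name.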
1239 |
|
|
1240 |
|
|
1241 |
|
|
1242 |
VERTICAL BAR |
VERTICAL BAR |
1243 |
Vertical bar characters are used to separate alternative |
Vertical bar characters are used to separate alternative |
1244 |
patterns. For example, the pattern |
patterns. For example, the pattern |
1390 |
Repetition is specified by quantifiers, which can follow any |
Repetition is specified by quantifiers, which can follow any |
1391 |
of the following items: |
of the following items: |
1392 |
|
|
|
|
|
1393 |
a single character, possibly escaped |
a single character, possibly escaped |
1394 |
the . metacharacter |
the . metacharacter |
1395 |
a character class |
a character class |
1462 |
|
|
1463 |
/* first command */ not comment /* second comment */ |
/* first command */ not comment /* second comment */ |
1464 |
|
|
1465 |
fails, because it matches the entire string due to the |
fails, because it matches the entire string owing to the |
1466 |
greediness of the .* item. |
greediness of the .* item. |
1467 |
|
|
1468 |
However, if a quantifier is followed by a question mark, |
However, if a quantifier is followed by a question mark, it |
1469 |
then it ceases to be greedy, and instead matches the minimum |
ceases to be greedy, and instead matches the minimum number |
1470 |
number of times possible, so the pattern |
of times possible, so the pattern |
1471 |
|
|
1472 |
/\*.*?\*/ |
/\*.*?\*/ |
1473 |
|
|
1484 |
that is the only way the rest of the pattern matches. |
that is the only way the rest of the pattern matches. |
1485 |
|
|
1486 |
If the PCRE_UNGREEDY option is set (an option which is not |
If the PCRE_UNGREEDY option is set (an option which is not |
1487 |
available in Perl) then the quantifiers are not greedy by |
available in Perl), the quantifiers are not greedy by |
1488 |
default, but individual ones can be made greedy by following |
default, but individual ones can be made greedy by following |
1489 |
them with a question mark. In other words, it inverts the |
them with a question mark. In other words, it inverts the |
1490 |
default behaviour. |
default behaviour. |
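
The difference between greedy and non-greedy matching can be made concrete without a regular expression engine. The sketch below hand-codes what /\*.*\*/ and /\*.*?\*/ would match when applied at the start of a single-line string that begins with a C comment. Both function names are hypothetical, and the code assumes the input contains no newlines (which the dot would not match by default).

```c
#include <string.h>

/* Length matched by the greedy form: extend to the LAST "*" "/" pair. */
static int greedy_comment_len(const char *s)
{
    const char *end = NULL, *p;

    if (strncmp(s, "/*", 2) != 0)
        return -1;
    for (p = strstr(s + 2, "*/"); p != NULL; p = strstr(p + 1, "*/"))
        end = p;
    return end ? (int)(end - s) + 2 : -1;
}

/* Length matched by the non-greedy form: stop at the FIRST "*" "/" pair. */
static int lazy_comment_len(const char *s)
{
    const char *p;

    if (strncmp(s, "/*", 2) != 0)
        return -1;
    p = strstr(s + 2, "*/");
    return p ? (int)(p - s) + 2 : -1;
}
```

Applied to a string holding two comments separated by other text, the greedy version covers everything from the first opening delimiter to the final closing one, while the non-greedy version covers only the first comment.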
1496 |
|
|
1497 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL |
1498 |
option (equivalent to Perl's /s) is set, thus allowing the . |
option (equivalent to Perl's /s) is set, thus allowing the . |
1499 |
to match newlines, then the pattern is implicitly anchored, |
to match newlines, the pattern is implicitly anchored, |
1500 |
because whatever follows will be tried against every charac- |
because whatever follows will be tried against every charac- |
1501 |
ter position in the subject string, so there is no point in |
ter position in the subject string, so there is no point in |
1502 |
retrying the overall match at any position after the first. |
retrying the overall match at any position after the first. |
1549 |
|
|
1550 |
matches "sense and sensibility" and "response and responsi- |
matches "sense and sensibility" and "response and responsi- |
1551 |
bility", but not "sense and responsibility". If caseful |
bility", but not "sense and responsibility". If caseful |
1552 |
matching is in force at the time of the back reference, then |
matching is in force at the time of the back reference, the |
1553 |
the case of letters is relevant. For example, |
case of letters is relevant. For example, |
1554 |
|
|
1555 |
((?i)rah)\s+\1 |
((?i)rah)\s+\1 |
1556 |
|
|
1560 |
|
|
1561 |
There may be more than one back reference to the same sub- |
There may be more than one back reference to the same sub- |
1562 |
pattern. If a subpattern has not actually been used in a |
pattern. If a subpattern has not actually been used in a |
1563 |
particular match, then any back references to it always |
particular match, any back references to it always fail. For |
1564 |
fail. For example, the pattern |
example, the pattern |
1565 |
|
|
1566 |
(a|(bc))\2 |
(a|(bc))\2 |
1567 |
|
|
1569 |
Because there may be up to 99 back references, all digits |
Because there may be up to 99 back references, all digits |
1570 |
following the backslash are taken as part of a potential |
following the backslash are taken as part of a potential |
1571 |
back reference number. If the pattern continues with a digit |
back reference number. If the pattern continues with a digit |
1572 |
character, then some delimiter must be used to terminate the |
character, some delimiter must be used to terminate the back |
1573 |
back reference. If the PCRE_EXTENDED option is set, this can |
reference. If the PCRE_EXTENDED option is set, this can be |
1574 |
be whitespace. Otherwise an empty comment can be used. |
whitespace. Otherwise an empty comment can be used. |
1575 |
|
|
1576 |
A back reference that occurs inside the parentheses to which |
A back reference that occurs inside the parentheses to which |
1577 |
it refers fails when the subpattern is first used, so, for |
it refers fails when the subpattern is first used, so, for |
1581 |
|
|
1582 |
(a|b\1)+ |
(a|b\1)+ |
1583 |
|
|
1584 |
matches any number of "a"s and also "aba", "ababaa" etc. At |
matches any number of "a"s and also "aba", "ababbaa" etc. At |
1585 |
each iteration of the subpattern, the back reference matches |
each iteration of the subpattern, the back reference matches |
1586 |
the character string corresponding to the previous itera- |
the character string corresponding to the previous |
1587 |
tion. In order for this to work, the pattern must be such |
iteration. In order for this to work, the pattern must be |
1588 |
that the first iteration does not need to match the back |
such that the first iteration does not need to match the |
1589 |
reference. This can be done using alternation, as in the |
back reference. This can be done using alternation, as in |
1590 |
example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of |
1591 |
|
zero. |
1592 |
|
|
1593 |
|
|
1594 |
|
|
1600 |
cated assertions are coded as subpatterns. There are two |
cated assertions are coded as subpatterns. There are two |
1601 |
kinds: those that look ahead of the current position in the |
kinds: those that look ahead of the current position in the |
1602 |
subject string, and those that look behind it. |
subject string, and those that look behind it. |
1603 |
|
|
1604 |
An assertion subpattern is matched in the normal way, except |
An assertion subpattern is matched in the normal way, except |
1605 |
that it does not cause the current matching position to be |
that it does not cause the current matching position to be |
1606 |
changed. Lookahead assertions start with (?= for positive |
changed. Lookahead assertions start with (?= for positive |
1672 |
matches "foo" preceded by three digits that are not "999". |
matches "foo" preceded by three digits that are not "999". |
1673 |
Notice that each of the assertions is applied independently |
Notice that each of the assertions is applied independently |
1674 |
at the same point in the subject string. First there is a |
at the same point in the subject string. First there is a |
1675 |
check that the previous three characters are all digits, |
check that the previous three characters are all digits, and |
1676 |
then there is a check that the same three characters are not |
then there is a check that the same three characters are not |
1677 |
"999". This pattern does not match "foo" preceded by six |
"999". This pattern does not match "foo" preceded by six |
1678 |
characters, the first of which are digits and the last three |
characters, the first of which are digits and the last three |
1741 |
|
|
1742 |
This kind of parenthesis "locks up" the part of the pattern |
This kind of parenthesis "locks up" the part of the pattern |
1743 |
it contains once it has matched, and a failure further into |
it contains once it has matched, and a failure further into |
1744 |
the pattern is prevented from backtracking into it. Back- |
the pattern is prevented from backtracking into it. |
1745 |
tracking past it to previous items, however, works as nor- |
Backtracking past it to previous items, however, works as |
1746 |
mal. |
normal. |
1747 |
|
|
1748 |
An alternative description is that a subpattern of this type |
An alternative description is that a subpattern of this type |
1749 |
matches the string of characters that an identical stan- |
matches the string of characters that an identical stan- |
1766 |
|
|
1767 |
abcd$ |
abcd$ |
1768 |
|
|
1769 |
when applied to a long string which does not match it. |
when applied to a long string which does not match. Because |
1770 |
Because matching proceeds from left to right, PCRE will look |
matching proceeds from left to right, PCRE will look for |
1771 |
for each "a" in the subject and then see if what follows |
each "a" in the subject and then see if what follows matches |
1772 |
matches the rest of the pattern. If the pattern is specified |
the rest of the pattern. If the pattern is specified as |
|
as |
|
1773 |
|
|
1774 |
^.*abcd$ |
^.*abcd$ |
1775 |
|
|
1776 |
then the initial .* matches the entire string at first, but |
the initial .* matches the entire string at first, but when |
1777 |
when this fails, it backtracks to match all but the last |
this fails (because there is no following "a"), it back- |
1778 |
character, then all but the last two characters, and so on. |
tracks to match all but the last character, then all but the |
1779 |
Once again the search for "a" covers the entire string, from |
last two characters, and so on. Once again the search for |
1780 |
right to left, so we are no better off. However, if the pat- |
"a" covers the entire string, from right to left, so we are |
1781 |
tern is written as |
no better off. However, if the pattern is written as |
1782 |
|
|
1783 |
^(?>.*)(?<=abcd) |
^(?>.*)(?<=abcd) |
1784 |
|
|
1785 |
then there can be no backtracking for the .* item; it can |
there can be no backtracking for the .* item; it can match |
1786 |
match only the entire string. The subsequent lookbehind |
only the entire string. The subsequent lookbehind assertion |
1787 |
assertion does a single test on the last four characters. If |
does a single test on the last four characters. If it fails, |
1788 |
it fails, the match fails immediately. For long strings, |
the match fails immediately. For long strings, this approach |
1789 |
this approach makes a significant difference to the process- |
makes a significant difference to the processing time. |
1790 |
ing time. |
|
1791 |
|
When a pattern contains an unlimited repeat inside a subpat- |
1792 |
|
tern that can itself be repeated an unlimited number of |
1793 |
|
times, the use of a once-only subpattern is the only way to |
1794 |
|
avoid some failing matches taking a very long time indeed. |
1795 |
|
The pattern |
1796 |
|
|
1797 |
|
(\D+|<\d+>)*[!?] |
1798 |
|
|
1799 |
|
matches an unlimited number of substrings that either con- |
1800 |
|
sist of non-digits, or digits enclosed in <>, followed by |
1801 |
|
either ! or ?. When it matches, it runs quickly. However, if |
1802 |
|
it is applied to |
1803 |
|
|
1804 |
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
1805 |
|
|
1806 |
|
it takes a long time before reporting failure. This is |
1807 |
|
because the string can be divided between the two repeats in |
1808 |
|
a large number of ways, and all have to be tried. (The exam- |
1809 |
|
ple used [!?] rather than a single character at the end, |
1810 |
|
because both PCRE and Perl have an optimization that allows |
1811 |
|
for fast failure when a single character is used. They |
1812 |
|
remember the last single character that is required for a |
1813 |
|
match, and fail early if it is not present in the string.) |
1814 |
|
If the pattern is changed to |
1815 |
|
|
1816 |
|
((?>\D+)|<\d+>)*[!?] |
1817 |
|
|
1818 |
|
sequences of non-digits cannot be broken, and failure hap- |
1819 |
|
pens quickly. |
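The effect of the once-only subpattern can be tried out with the following minimal sketch. It uses Python's re module as a stand-in, not PCRE itself. Because the (?>...) syntax is only accepted by Python 3.11 and later, the sketch emulates it with a capturing lookahead followed by a back reference, (?=(\D+))\2, which is a substitute technique and not the syntax described above. The variable names are illustrative only.

```python
import re

# Plain version from the text: the two nested repeats can split a run of
# non-digits in very many ways before the match is reported as failed.
plain = re.compile(r'(\D+|<\d+>)*[!?]')

# Emulated once-only group: the lookahead captures the longest run of
# non-digits and \2 consumes exactly that run, so it cannot be shortened.
once_only = re.compile(r'((?=(\D+))\2|<\d+>)*[!?]')

short = 'a' * 12    # small enough for the plain pattern to give up quickly
long_ = 'a' * 52    # same length as the subject shown in the text

plain_short = plain.search(short)
once_only_short = once_only.search(short)
once_only_long = once_only.search(long_)
print(plain_short, once_only_short, once_only_long)
```

The plain pattern is deliberately run only on the short subject; on the 52-character subject it would have to try an enormous number of splits before failing.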
1820 |
|
|
1821 |
|
|
1822 |
|
|
1836 |
error occurs. |
error occurs. |
1837 |
|
|
1838 |
There are two kinds of condition. If the text between the |
There are two kinds of condition. If the text between the |
1839 |
parentheses consists of a sequence of digits, then the |
parentheses consists of a sequence of digits, the condition |
1840 |
condition is satisfied if the capturing subpattern of that |
is satisfied if the capturing subpattern of that number has |
1841 |
number has previously matched. Consider the following pat- |
previously matched. The number must be greater than zero. |
1842 |
tern, which contains non-significant white space to make it |
Consider the following pattern, which contains non- |
1843 |
more readable (assume the PCRE_EXTENDED option) and to |
significant white space to make it more readable (assume the |
1844 |
divide it into three parts for ease of discussion: |
PCRE_EXTENDED option) and to divide it into three parts for |
1845 |
|
ease of discussion: |
1846 |
|
|
1847 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
1848 |
|
|
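The conditional pattern above can be exercised with a short sketch. It uses Python's re module, whose (?(1)...) syntax is assumed here to behave like PCRE's for this simple case, and re.VERBOSE plays the part of PCRE_EXTENDED. The helper name whole_match is hypothetical, and fullmatch is used so that the whole subject has to be consumed.

```python
import re

# Group 1 is the optional opening parenthesis; the condition (?(1) ...)
# demands a closing parenthesis only if group 1 took part in the match.
cond = re.compile(r'( \( )? [^()]+ (?(1) \) )', re.VERBOSE)

def whole_match(subject):
    """Return True if the entire subject matches the conditional pattern."""
    return cond.fullmatch(subject) is not None

results = {s: whole_match(s) for s in ('abc', '(abc)', '(abc', 'abc)')}
print(results)
```

Subjects with no parentheses or with a balanced pair are accepted; a subject with only one of the two parentheses is rejected.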
1891 |
|
|
1892 |
|
|
1893 |
|
|
1894 |
|
RECURSIVE PATTERNS |
1895 |
|
Consider the problem of matching a string in parentheses, |
1896 |
|
allowing for unlimited nested parentheses. Without the use |
1897 |
|
of recursion, the best that can be done is to use a pattern |
1898 |
|
that matches up to some fixed depth of nesting. It is not |
1899 |
|
possible to handle an arbitrary nesting depth. Perl 5.6 has |
1900 |
|
provided an experimental facility that allows regular |
1901 |
|
expressions to recurse (amongst other things). It does this |
1902 |
|
by interpolating Perl code in the expression at run time, |
1903 |
|
and the code can refer to the expression itself. A Perl pat- |
1904 |
|
tern to solve the parentheses problem can be created like |
1905 |
|
this: |
1906 |
|
|
1907 |
|
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
1908 |
|
|
1909 |
|
The (?p{...}) item interpolates Perl code at run time, and |
1910 |
|
in this case refers recursively to the pattern in which it |
1911 |
|
appears. Obviously, PCRE cannot support the interpolation of |
1912 |
|
Perl code. Instead, the special item (?R) is provided for |
1913 |
|
the specific case of recursion. This PCRE pattern solves the |
1914 |
|
parentheses problem (assume the PCRE_EXTENDED option is set |
1915 |
|
so that white space is ignored): |
1916 |
|
|
1917 |
|
\( ( (?>[^()]+) | (?R) )* \) |
1918 |
|
|
1919 |
|
First it matches an opening parenthesis. Then it matches any |
1920 |
|
number of substrings which can either be a sequence of non- |
1921 |
|
parentheses, or a recursive match of the pattern itself |
1922 |
|
(i.e. a correctly parenthesized substring). Finally there is |
1923 |
|
a closing parenthesis. |
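What the recursive pattern recognizes can be sketched as a small hand-written function. This is only an illustration of the three steps just described, written in Python; it says nothing about how PCRE implements (?R) internally, and the name match_balanced is hypothetical.

```python
def match_balanced(subject, pos=0):
    """Return the end offset of a balanced "(...)" starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1                                  # \(  the opening parenthesis
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":                         # \)  the closing parenthesis
            return pos + 1
        if ch == "(":                         # (?R)  a nested, balanced group
            pos = match_balanced(subject, pos)
            if pos is None:
                return None
        else:                                 # (?>[^()]+)  run of non-parentheses
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None                               # ran out of subject before \)


print(match_balanced("(ab(cd)ef)"))
print(match_balanced("(" + "a" * 53 + "()"))
```

Each run of non-parentheses is consumed in one go and never revisited, which corresponds to the once-only subpattern in the pattern.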
1924 |
|
|
1925 |
|
This particular example pattern contains nested unlimited |
1926 |
|
repeats, and so the use of a once-only subpattern for match- |
1927 |
|
ing strings of non-parentheses is important when applying |
1928 |
|
the pattern to strings that do not match. For example, when |
1929 |
|
it is applied to |
1930 |
|
|
1931 |
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
1932 |
|
|
1933 |
|
it yields "no match" quickly. However, if a once-only sub- |
1934 |
|
pattern is not used, the match runs for a very long time |
1935 |
|
indeed because there are so many different ways the + and * |
1936 |
|
repeats can carve up the subject, and all have to be tested |
1937 |
|
before failure can be reported. |
1938 |
|
|
1939 |
|
The values set for any capturing subpatterns are those from |
1940 |
|
the outermost level of the recursion at which the subpattern |
1941 |
|
value is set. If the pattern above is matched against |
1942 |
|
|
1943 |
|
(ab(cd)ef) |
1944 |
|
|
1945 |
|
the value for the capturing parentheses is "ef", which is |
1946 |
|
the last value taken on at the top level. If additional |
1947 |
|
parentheses are added, giving |
1948 |
|
|
1949 |
|
\( ( ( (?>[^()]+) | (?R) )* ) \) |
1950 |
|
   ^                        ^ |
1951 |
|
   ^                        ^ the string they capture is |
1952 |
|
"ab(cd)ef", the contents of the top level parentheses. If |
1953 |
|
there are more than 15 capturing parentheses in a pattern, |
1954 |
|
PCRE has to obtain extra memory to store data during a |
1955 |
|
recursion, which it does by using pcre_malloc, freeing it |
1956 |
|
via pcre_free afterwards. If no memory can be obtained, it |
1957 |
|
saves data for the first 15 capturing parentheses only, as |
1958 |
|
there is no way to give an out-of-memory error from within a |
1959 |
|
recursion. |
1960 |
|
|
1961 |
|
|
1962 |
|
|
1963 |
PERFORMANCE |
PERFORMANCE |
1964 |
Certain items that may appear in patterns are more efficient |
Certain items that may appear in patterns are more efficient |
1965 |
than others. It is more efficient to use a character class |
than others. It is more efficient to use a character class |
2027 |
|
|
2028 |
|
|
2029 |
|
|
2030 |
|
UTF-8 SUPPORT |
2031 |
|
Starting at release 3.3, PCRE has some support for character |
2032 |
|
strings encoded in the UTF-8 format. This is incomplete, and |
2033 |
|
is regarded as experimental. In order to use it, you must |
2034 |
|
configure PCRE to include UTF-8 support in the code, and, in |
2035 |
|
addition, you must call pcre_compile() with the PCRE_UTF8 |
2036 |
|
option flag. When you do this, both the pattern and any sub- |
2037 |
|
ject strings that are matched against it are treated as |
2038 |
|
UTF-8 strings instead of just strings of bytes, but only in |
2039 |
|
the cases that are mentioned below. |
2040 |
|
|
2041 |
|
If you compile PCRE with UTF-8 support, but do not use it at |
2042 |
|
run time, the library will be a bit bigger, but the addi- |
2043 |
|
tional run time overhead is limited to testing the PCRE_UTF8 |
2044 |
|
flag in several places, so should not be very large. |
2045 |
|
|
2046 |
|
PCRE assumes that the strings it is given contain valid |
2047 |
|
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
2048 |
|
you pass invalid UTF-8 strings to PCRE, the results are |
2049 |
|
undefined. |
2050 |
|
|
2051 |
|
Running with PCRE_UTF8 set causes these changes in the way |
2052 |
|
PCRE works: |
2053 |
|
|
2054 |
|
1. In a pattern, the escape sequence \x{...}, where the con- |
2055 |
|
tents of the braces is a string of hexadecimal digits, is |
2056 |
|
interpreted as a UTF-8 character whose code number is the |
2057 |
|
given hexadecimal number, for example: \x{1234}. This |
2058 |
|
inserts from one to six literal bytes into the pattern, |
2059 |
|
using the UTF-8 encoding. If a non-hexadecimal digit appears |
2060 |
|
between the braces, the item is not recognized. |
2061 |
|
|
2062 |
|
2. The original hexadecimal escape sequence, \xhh, generates |
2063 |
|
a two-byte UTF-8 character if its value is greater than 127. |
2064 |
|
|
2065 |
|
3. Repeat quantifiers are NOT correctly handled if they fol- |
2066 |
|
low a multibyte character. For example, \x{100}* and \xc3+ |
2067 |
|
do not work. If you want to repeat such characters, you must |
2068 |
|
enclose them in non-capturing parentheses, for example |
2069 |
|
(?:\x{100}), at present. |
2070 |
|
|
2071 |
|
4. The dot metacharacter matches one UTF-8 character instead |
2072 |
|
of a single byte. |
2073 |
|
|
2074 |
|
5. Unlike literal UTF-8 characters, the dot metacharacter |
2075 |
|
followed by a repeat quantifier does operate correctly on |
2076 |
|
UTF-8 characters instead of single bytes. |
2077 |
|
|
2078 |
|
6. Although the \x{...} escape is permitted in a character |
2079 |
|
class, characters whose values are greater than 255 cannot |
2080 |
|
be included in a class. |
2081 |
|
|
2082 |
|
7. A class is matched against a UTF-8 character instead of |
2083 |
|
just a single byte, but it can match only characters whose |
2084 |
|
values are less than 256. Characters with greater values |
2085 |
|
always fail to match a class. |
2086 |
|
|
2087 |
|
8. Repeated classes work correctly on multiple characters. |
2088 |
|
|
2089 |
|
9. Classes containing just a single character whose value is |
2090 |
|
greater than 127 (but less than 256), for example, [\x80] or |
2091 |
|
[^\x{93}], do not work because these are optimized into sin- |
2092 |
|
gle byte matches. In the first case, of course, the class |
2093 |
|
brackets are just redundant. |
2094 |
|
|
2095 |
|
10. Lookbehind assertions move backwards in the subject by a |
2096 |
|
fixed number of characters instead of a fixed number of |
2097 |
|
bytes. Simple cases have been tested to work correctly, but |
2098 |
|
there may be hidden gotchas herein. |
2099 |
|
|
2100 |
|
11. The character types such as \d and \w do not work |
2101 |
|
correctly with UTF-8 characters. They continue to test a |
2102 |
|
single byte. |
2103 |
|
|
2104 |
|
12. Anything not explicitly mentioned here continues to work |
2105 |
|
in bytes rather than in characters. |
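The byte sequences mentioned in items 1 and 2 of the list above can be worked out with a short sketch. It is written in Python and assumes the original UTF-8 scheme of one to six bytes per character, covering code numbers up to 0x7FFFFFFF; the function name utf8_encode is hypothetical and is not part of PCRE.

```python
def utf8_encode(code):
    """Encode a code number using the original 1-6 byte UTF-8 scheme."""
    if code < 0x80:
        return bytes([code])                  # one byte, same as ASCII
    # Exclusive upper bounds for sequences of 2, 3, 4, 5 and 6 bytes.
    limits = [0x800, 0x10000, 0x200000, 0x4000000, 0x80000000]
    for extra, limit in enumerate(limits, start=1):
        if code < limit:
            break
    else:
        raise ValueError("code number too large for UTF-8")
    tail = []
    for _ in range(extra):                    # continuation bytes: 10xxxxxx
        tail.append(0x80 | (code & 0x3F))
        code >>= 6
    lead = ((0xFF << (7 - extra)) & 0xFF) | code   # 110xxxxx, 1110xxxx, ...
    return bytes([lead] + tail[::-1])


for value in (0x41, 0xC3, 0x100, 0x1234):
    print(hex(value), utf8_encode(value).hex())
```

Under this scheme \x{1234} corresponds to three bytes, and a value above 127 such as \xc3 corresponds to two bytes, in line with item 2.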
2106 |
|
|
2107 |
|
The following UTF-8 features of Perl 5.6 are not imple- |
2108 |
|
mented: |
2109 |
|
1. The escape sequence \C to match a single byte. |
2110 |
|
|
2111 |
|
2. The use of Unicode tables and properties and escapes \p, |
2112 |
|
\P, and \X. |
2113 |
|
|
2114 |
|
|
2115 |
|
|
2116 |
AUTHOR |
AUTHOR |
2117 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
2118 |
University Computing Service, |
University Computing Service, |
2120 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
2121 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
2122 |
|
|
2123 |
Last updated: 29 July 1999 |
Last updated: 28 August 2000, |
2124 |
Copyright (c) 1997-1999 University of Cambridge. |
the 250th anniversary of the death of J.S. Bach. |
2125 |
|
Copyright (c) 1997-2000 University of Cambridge. |