1 |
|
This file contains a concatenation of the PCRE man pages, converted to plain |
2 |
|
text format for ease of searching with a text editor, or for use on systems |
3 |
|
that do not have a man page processor. The small individual files that give |
4 |
|
synopses of each function in the library have not been included. There are |
5 |
|
separate text files for the pcregrep and pcretest commands. |
6 |
|
----------------------------------------------------------------------------- |
7 |
|
|
8 |
|
NAME |
9 |
|
PCRE - Perl-compatible regular expressions |
10 |
|
|
11 |
|
|
12 |
|
DESCRIPTION |
13 |
|
|
14 |
|
The PCRE library is a set of functions that implement regu- |
15 |
|
lar expression pattern matching using the same syntax and |
16 |
|
semantics as Perl, with just a few differences. The current |
17 |
|
implementation of PCRE (release 4.x) corresponds approxi- |
18 |
|
mately with Perl 5.8, including support for UTF-8 encoded |
19 |
|
strings. However, this support has to be explicitly |
20 |
|
enabled; it is not the default. |
21 |
|
|
22 |
|
PCRE is written in C and released as a C library. However, a |
23 |
|
number of people have written wrappers and interfaces of |
24 |
|
various kinds. A C++ class is included in these contribu- |
25 |
|
tions, which can be found in the Contrib directory at the |
26 |
|
primary FTP site, which is: |
27 |
|
|
28 |
|
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre |
29 |
|
|
30 |
|
Details of exactly which Perl regular expression features |
31 |
|
are and are not supported by PCRE are given in separate |
32 |
|
documents. See the pcrepattern and pcrecompat pages. |
33 |
|
|
34 |
|
Some features of PCRE can be included, excluded, or changed |
35 |
|
when the library is built. The pcre_config() function makes |
36 |
|
it possible for a client to discover which features are |
37 |
|
available. Documentation about building PCRE for various |
38 |
|
operating systems can be found in the README file in the |
39 |
|
source distribution. |
40 |
|
|
41 |
|
|
42 |
|
USER DOCUMENTATION |
43 |
|
|
44 |
|
The user documentation for PCRE has been split up into a |
45 |
|
number of different sections. In the "man" format, each of |
46 |
|
these is a separate "man page". In the HTML format, each is |
47 |
|
a separate page, linked from the index page. In the plain |
48 |
|
text format, all the sections are concatenated, for ease of |
49 |
|
searching. The sections are as follows: |
50 |
|
|
51 |
|
pcre this document |
52 |
|
pcreapi details of PCRE's native API |
53 |
|
pcrebuild options for building PCRE |
54 |
|
pcrecallout details of the callout feature |
55 |
|
pcrecompat discussion of Perl compatibility |
56 |
|
pcregrep description of the pcregrep command |
57 |
|
pcrepattern syntax and semantics of supported |
58 |
|
regular expressions |
59 |
|
pcreperform discussion of performance issues |
60 |
|
pcreposix the POSIX-compatible API |
61 |
|
pcresample discussion of the sample program |
62 |
|
pcretest the pcretest testing command |
63 |
|
|
64 |
|
In addition, in the "man" and HTML formats, there is a short |
65 |
|
page for each library function, listing its arguments and |
66 |
|
results. |
67 |
|
|
68 |
|
|
69 |
|
LIMITATIONS |
70 |
|
|
71 |
|
There are some size limitations in PCRE but it is hoped that |
72 |
|
they will never in practice be relevant. |
73 |
|
|
74 |
|
The maximum length of a compiled pattern is 65539 (sic) |
75 |
|
bytes if PCRE is compiled with the default internal linkage |
76 |
|
size of 2. If you want to process regular expressions that |
77 |
|
are truly enormous, you can compile PCRE with an internal |
78 |
|
linkage size of 3 or 4 (see the README file in the source |
79 |
|
distribution and the pcrebuild documentation for details). |
80 |
|
In these cases the limit is substantially larger. However, |
81 |
|
the speed of execution will be slower. |
82 |
|
|
83 |
|
All values in repeating quantifiers must be less than 65536. |
84 |
|
The maximum number of capturing subpatterns is 65535. |
85 |
|
|
86 |
|
There is no limit to the number of non-capturing subpat- |
87 |
|
terns, but the maximum depth of nesting of all kinds of |
88 |
|
parenthesized subpattern, including capturing subpatterns, |
89 |
|
assertions, and other types of subpattern, is 200. |
90 |
|
|
91 |
|
The maximum length of a subject string is the largest posi- |
92 |
|
tive number that an integer variable can hold. However, PCRE |
93 |
|
uses recursion to handle subpatterns and indefinite repeti- |
94 |
|
tion. This means that the available stack space may limit |
95 |
|
the size of a subject string that can be processed by cer- |
96 |
|
tain patterns. |
97 |
|
|
98 |
|
|
99 |
|
UTF-8 SUPPORT |
100 |
|
|
101 |
|
Starting at release 3.3, PCRE has had some support for char- |
102 |
|
acter strings encoded in the UTF-8 format. For release 4.0 |
103 |
|
this has been greatly extended to cover most common require- |
104 |
|
ments. |
105 |
|
|
106 |
|
In order to process UTF-8 strings, you must build PCRE to |
107 |
|
include UTF-8 support in the code, and, in addition, you |
108 |
|
must call pcre_compile() with the PCRE_UTF8 option flag. |
109 |
|
When you do this, both the pattern and any subject strings |
110 |
|
that are matched against it are treated as UTF-8 strings |
111 |
|
instead of just strings of bytes. |
112 |
|
|
113 |
|
If you compile PCRE with UTF-8 support, but do not use it at |
114 |
|
run time, the library will be a bit bigger, but the addi- |
115 |
|
tional run time overhead is limited to testing the PCRE_UTF8 |
116 |
|
flag in several places, so should not be very large. |
117 |
|
|
118 |
|
The following comments apply when PCRE is running in UTF-8 |
119 |
|
mode: |
120 |
|
|
121 |
|
1. PCRE assumes that the strings it is given contain valid |
122 |
|
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
123 |
|
you pass invalid UTF-8 strings to PCRE, the results are |
124 |
|
undefined. |
125 |
|
|
126 |
|
2. In a pattern, the escape sequence \x{...}, where the con- |
127 |
|
tents of the braces is a string of hexadecimal digits, is |
128 |
|
interpreted as a UTF-8 character whose code number is the |
129 |
|
given hexadecimal number, for example: \x{1234}. If a non- |
130 |
|
hexadecimal digit appears between the braces, the item is |
131 |
|
not recognized. This escape sequence can be used either as |
132 |
|
a literal, or within a character class. |
133 |
|
|
134 |
|
3. The original hexadecimal escape sequence, \xhh, matches a |
135 |
|
two-byte UTF-8 character if the value is greater than 127. |
136 |
|
|
137 |
|
4. Repeat quantifiers apply to complete UTF-8 characters, |
138 |
|
not to individual bytes, for example: \x{100}{3}. |
139 |
|
|
140 |
|
5. The dot metacharacter matches one UTF-8 character instead |
141 |
|
of a single byte. |
142 |
|
|
143 |
|
6. The escape sequence \C can be used to match a single byte |
144 |
|
in UTF-8 mode, but its use can lead to some strange effects. |
145 |
|
|
146 |
|
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W |
147 |
|
correctly test characters of any code value, but the charac- |
148 |
|
ters that PCRE recognizes as digits, spaces, or word charac- |
149 |
|
ters remain the same set as before, all with values less |
150 |
|
than 256. |
151 |
|
|
152 |
|
8. Case-insensitive matching applies only to characters |
153 |
|
whose values are less than 256. PCRE does not support the |
154 |
|
notion of "case" for higher-valued characters. |
155 |
|
|
156 |
|
9. PCRE does not support the use of Unicode tables and pro- |
157 |
|
perties or the Perl escapes \p, \P, and \X. |
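Items 2 and 3 above both rest on how a code number maps to UTF-8 bytes. The following is a minimal sketch of that mapping, not code taken from PCRE: the helper name utf8_encode() is hypothetical, and it assumes the modern four-byte limit (code numbers up to 0x10FFFF) rather than whatever range a given PCRE release accepts. It does not check for values that are invalid for other reasons.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Encodes the code number cp into buf (room for 4 bytes required).
   Returns the number of bytes written, or 0 if cp is above 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {                      /* 0xxxxxxx */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                     /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                 /* 11110xxx followed by three 10xxxxxx */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                             /* out of the assumed range */
}
```

With this mapping, the value in \x{1234} occupies three bytes in the subject (E1 88 B4), while any value from 128 to 255, such as the one written \xe9, occupies two (C3 A9). That is why item 3 speaks of a two-byte character.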
158 |
|
|
159 |
|
|
160 |
|
AUTHOR |
161 |
|
|
162 |
|
Philip Hazel <ph10@cam.ac.uk> |
163 |
|
University Computing Service, |
164 |
|
Cambridge CB2 3QG, England. |
165 |
|
Phone: +44 1223 334714 |
166 |
|
|
167 |
|
Last updated: 04 February 2003 |
168 |
|
Copyright (c) 1997-2003 University of Cambridge. |
169 |
|
----------------------------------------------------------------------------- |
170 |
|
|
171 |
|
NAME |
172 |
|
PCRE - Perl-compatible regular expressions |
173 |
|
|
174 |
|
|
175 |
|
PCRE BUILD-TIME OPTIONS |
176 |
|
|
177 |
|
This document describes the optional features of PCRE that |
178 |
|
can be selected when the library is compiled. They are all |
179 |
|
selected, or deselected, by providing options to the config- |
180 |
|
ure script which is run before the make command. The com- |
181 |
|
plete list of options for configure (which includes the |
182 |
|
standard ones such as the selection of the installation |
183 |
|
directory) can be obtained by running |
184 |
|
|
185 |
|
./configure --help |
186 |
|
|
187 |
|
The following sections describe certain options whose names |
188 |
|
begin with --enable or --disable. These settings specify |
189 |
|
changes to the defaults for the configure command. Because |
190 |
|
of the way that configure works, --enable and --disable |
191 |
|
always come in pairs, so the complementary option always |
192 |
|
exists as well, but as it specifies the default, it is not |
193 |
|
described. |
194 |
|
|
195 |
|
|
196 |
|
UTF-8 SUPPORT |
197 |
|
|
198 |
|
To build PCRE with support for UTF-8 character strings, add |
199 |
|
|
200 |
|
--enable-utf8 |
201 |
|
|
202 |
|
to the configure command. Of itself, this does not make PCRE |
203 |
|
treat strings as UTF-8. As well as compiling PCRE with this |
204 |
|
option, you also have to set the PCRE_UTF8 option when |
205 |
|
you call the pcre_compile() function. |
206 |
|
|
207 |
|
|
208 |
|
CODE VALUE OF NEWLINE |
209 |
|
|
210 |
|
By default, PCRE treats character 10 (linefeed) as the new- |
211 |
|
line character. This is the normal newline character on |
212 |
|
Unix-like systems. You can compile PCRE to use character 13 |
213 |
|
(carriage return) instead by adding |
214 |
|
|
215 |
|
--enable-newline-is-cr |
216 |
|
|
217 |
|
to the configure command. For completeness there is also a |
218 |
|
--enable-newline-is-lf option, which explicitly specifies |
219 |
|
linefeed as the newline character. |
220 |
|
|
221 |
|
|
222 |
|
BUILDING SHARED AND STATIC LIBRARIES |
223 |
|
|
224 |
|
The PCRE building process uses libtool to build both shared |
225 |
|
and static Unix libraries by default. You can suppress one |
226 |
|
of these by adding one of |
227 |
|
|
228 |
|
--disable-shared |
229 |
|
--disable-static |
230 |
|
|
231 |
|
to the configure command, as required. |
232 |
|
|
233 |
|
|
234 |
|
POSIX MALLOC USAGE |
235 |
|
|
236 |
|
When PCRE is called through the POSIX interface (see the |
237 |
|
pcreposix documentation), additional working storage is |
238 |
|
required for holding the pointers to capturing substrings |
239 |
|
because PCRE requires three integers per substring, whereas |
240 |
|
the POSIX interface provides only two. If the number of |
241 |
|
expected substrings is small, the wrapper function uses |
242 |
|
space on the stack, because this is faster than using mal- |
243 |
|
loc() for each call. The default threshold above which the |
244 |
|
stack is no longer used is 10; it can be changed by adding a |
245 |
|
setting such as |
246 |
|
|
247 |
|
--with-posix-malloc-threshold=20 |
248 |
|
|
249 |
|
to the configure command. |
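The stack-or-malloc() decision described above can be sketched as follows. This is an illustration of the technique only, not the wrapper's actual source: MALLOC_THRESHOLD, run_with_ovector(), and the storage enum are hypothetical names, and the matching call itself is replaced by a loop that merely touches the buffer.

```c
#include <stdlib.h>

#define MALLOC_THRESHOLD 10   /* assumed to mirror the documented default */

enum storage { USED_STACK, USED_HEAP, ALLOC_FAILED };

/* Hypothetical stand-in for a POSIX-style wrapper call that needs
   three integers of working storage per expected substring.
   Returns which kind of storage was used. */
enum storage run_with_ovector(size_t nmatch)
{
    int small_ovector[MALLOC_THRESHOLD * 3];  /* cheap: lives on the stack */
    int *ovector = small_ovector;
    enum storage used = USED_STACK;
    size_t i;

    if (nmatch > MALLOC_THRESHOLD) {
        /* too many substrings for the fixed buffer: fall back to malloc() */
        ovector = malloc(sizeof(int) * nmatch * 3);
        if (ovector == NULL) return ALLOC_FAILED;
        used = USED_HEAP;
    }

    /* a real wrapper would run the match here; this sketch only
       initializes the three integers per substring */
    for (i = 0; i < nmatch * 3; i++) ovector[i] = -1;

    if (used == USED_HEAP) free(ovector);
    return used;
}
```

In this sketch a request for 10 substrings stays on the stack and a request for 11 goes to the heap, which is the behaviour a threshold of 10 implies.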
250 |
|
|
251 |
|
|
252 |
|
LIMITING PCRE RESOURCE USAGE |
253 |
|
|
254 |
|
Internally, PCRE has a function called match() which it |
255 |
|
calls repeatedly (possibly recursively) when performing a |
256 |
|
matching operation. By limiting the number of times this |
257 |
|
function may be called, a limit can be placed on the |
258 |
|
resources used by a single call to pcre_exec(). The limit |
259 |
|
can be changed at run time, as described in the pcreapi |
260 |
|
documentation. The default is 10 million, but this can be |
261 |
|
changed by adding a setting such as |
262 |
|
|
263 |
|
--with-match-limit=500000 |
264 |
|
|
265 |
|
to the configure command. |
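To show why counting calls bounds the work done, here is a toy backtracking matcher with a call limit. It is only a sketch under stated assumptions: it is not PCRE's match() function, it supports just literal characters, "." and "*", it matches only at the start of the text, and every name in it (toy_match, match_here, TOY_LIMIT and so on) is hypothetical.

```c
#define TOY_NOMATCH  0
#define TOY_MATCH    1
#define TOY_LIMIT  (-1)   /* the call limit was exceeded */

struct limit_state {
    unsigned long calls;  /* calls made so far */
    unsigned long limit;  /* maximum number of calls allowed */
};

/* Recursive matcher: every entry counts against the limit. */
static int match_here(const char *re, const char *text,
                      struct limit_state *st)
{
    if (++st->calls > st->limit) return TOY_LIMIT;
    if (re[0] == '\0') return TOY_MATCH;      /* pattern used up */

    if (re[1] == '*') {
        /* try zero, one, two, ... repetitions of re[0] */
        const char *t = text;
        for (;;) {
            int rc = match_here(re + 2, t, st);
            if (rc != TOY_NOMATCH) return rc; /* match, or limit reached */
            if (*t == '\0' || (re[0] != '.' && re[0] != *t))
                return TOY_NOMATCH;
            t++;
        }
    }

    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(re + 1, text + 1, st);
    return TOY_NOMATCH;
}

/* Match re at the start of text, giving up after limit calls. */
int toy_match(const char *re, const char *text, unsigned long limit)
{
    struct limit_state st;
    st.calls = 0;
    st.limit = limit;
    return match_here(re, text, &st);
}
```

A pattern such as "a*a*a*a*b" applied to a long run of "a" characters with no "b" forces a great deal of backtracking. With a small limit the toy matcher gives up and reports TOY_LIMIT instead of exploring every combination, which is the effect the match limit is intended to have on pcre_exec().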
266 |
|
|
267 |
|
|
268 |
|
HANDLING VERY LARGE PATTERNS |
269 |
|
|
270 |
|
Within a compiled pattern, offset values are used to point |
271 |
|
from one part to another (for example, from an opening |
272 |
|
parenthesis to an alternation metacharacter). By default |
273 |
|
two-byte values are used for these offsets, leading to a |
274 |
|
maximum size for a compiled pattern of around 64K. This is |
275 |
|
sufficient to handle all but the most gigantic patterns. |
276 |
|
Nevertheless, some people do want to process enormous pat- |
277 |
|
terns, so it is possible to compile PCRE to use three-byte |
278 |
|
or four-byte offsets by adding a setting such as |
279 |
|
|
280 |
|
--with-link-size=3 |
281 |
|
|
282 |
|
to the configure command. The value given must be 2, 3, or |
283 |
|
4. Using longer offsets slows down the operation of PCRE |
284 |
|
because it has to load additional bytes when handling them. |
285 |
|
|
286 |
|
If you build PCRE with an increased link size, test 2 (and |
287 |
|
test 5 if you are using UTF-8) will fail. Part of the output |
288 |
|
of these tests is a representation of the compiled pattern, |
289 |
|
and this changes with the link size. |
290 |
|
|
291 |
|
Last updated: 21 January 2003 |
292 |
|
Copyright (c) 1997-2003 University of Cambridge. |
293 |
|
----------------------------------------------------------------------------- |
294 |
|
|
295 |
NAME |
NAME |
296 |
pcre - Perl-compatible regular expressions. |
PCRE - Perl-compatible regular expressions |
297 |
|
|
298 |
|
|
299 |
|
SYNOPSIS OF PCRE API |
300 |
|
|
|
SYNOPSIS |
|
301 |
#include <pcre.h> |
#include <pcre.h> |
302 |
|
|
303 |
pcre *pcre_compile(const char *pattern, int options, |
pcre *pcre_compile(const char *pattern, int options, |
311 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
312 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
313 |
|
|
314 |
|
int pcre_copy_named_substring(const pcre *code, |
315 |
|
const char *subject, int *ovector, |
316 |
|
int stringcount, const char *stringname, |
317 |
|
char *buffer, int buffersize); |
318 |
|
|
319 |
int pcre_copy_substring(const char *subject, int *ovector, |
int pcre_copy_substring(const char *subject, int *ovector, |
320 |
int stringcount, int stringnumber, char *buffer, |
int stringcount, int stringnumber, char *buffer, |
321 |
int buffersize); |
int buffersize); |
322 |
|
|
323 |
|
int pcre_get_named_substring(const pcre *code, |
324 |
|
const char *subject, int *ovector, |
325 |
|
int stringcount, const char *stringname, |
326 |
|
const char **stringptr); |
327 |
|
|
328 |
|
int pcre_get_stringnumber(const pcre *code, |
329 |
|
const char *name); |
330 |
|
|
331 |
int pcre_get_substring(const char *subject, int *ovector, |
int pcre_get_substring(const char *subject, int *ovector, |
332 |
int stringcount, int stringnumber, |
int stringcount, int stringnumber, |
333 |
const char **stringptr); |
const char **stringptr); |
335 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
336 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
337 |
|
|
338 |
|
void pcre_free_substring(const char *stringptr); |
339 |
|
|
340 |
|
void pcre_free_substring_list(const char **stringptr); |
341 |
|
|
342 |
const unsigned char *pcre_maketables(void); |
const unsigned char *pcre_maketables(void); |
343 |
|
|
344 |
|
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
345 |
|
int what, void *where); |
346 |
|
|
347 |
|
|
348 |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
349 |
|
|
350 |
|
int pcre_config(int what, void *where); |
351 |
|
|
352 |
char *pcre_version(void); |
char *pcre_version(void); |
353 |
|
|
354 |
void *(*pcre_malloc)(size_t); |
void *(*pcre_malloc)(size_t); |
355 |
|
|
356 |
void (*pcre_free)(void *); |
void (*pcre_free)(void *); |
357 |
|
|
358 |
|
int (*pcre_callout)(pcre_callout_block *); |
359 |
|
|
360 |
|
|
361 |
|
PCRE API |
|
DESCRIPTION |
|
|
The PCRE library is a set of functions that implement regu- |
|
|
lar expression pattern matching using the same syntax and |
|
|
semantics as Perl 5, with just a few differences (see |
|
|
below). The current implementation corresponds to Perl |
|
|
5.005. |
|
362 |
|
|
363 |
PCRE has its own native API, which is described in this |
PCRE has its own native API, which is described in this |
364 |
document. There is also a set of wrapper functions that |
document. There is also a set of wrapper functions that |
365 |
correspond to the POSIX API. These are described in the |
correspond to the POSIX regular expression API. These are |
366 |
pcreposix documentation. |
described in the pcreposix documentation. |
367 |
|
|
368 |
The native API function prototypes are defined in the header |
The native API function prototypes are defined in the header |
369 |
file pcre.h, and on Unix systems the library itself is |
file pcre.h, and on Unix systems the library itself is |
370 |
called libpcre.a, so can be accessed by adding -lpcre to the |
called libpcre.a, so can be accessed by adding -lpcre to the |
371 |
command for linking an application which calls it. |
command for linking an application which calls it. The |
372 |
|
header file defines the macros PCRE_MAJOR and PCRE_MINOR to |
373 |
|
contain the major and minor release numbers for the library. |
374 |
|
Applications can use these to include support for different |
375 |
|
releases. |
376 |
|
|
377 |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
378 |
are used for compiling and matching regular expressions, |
are used for compiling and matching regular expressions. A |
379 |
while pcre_copy_substring(), pcre_get_substring(), and |
sample program that demonstrates the simplest way of using |
380 |
pcre_get_substring_list() are convenience functions for |
them is given in the file pcredemo.c. The pcresample docu- |
381 |
extracting captured substrings from a matched subject |
mentation describes how to run it. |
382 |
string. The function pcre_maketables() is used (optionally) |
|
383 |
to build a set of character tables in the current locale for |
There are convenience functions for extracting captured sub- |
384 |
passing to pcre_compile(). |
strings from a matched subject string. They are: |
385 |
|
|
386 |
The function pcre_info() is used to find out information |
pcre_copy_substring() |
387 |
about a compiled pattern, while the function pcre_version() |
pcre_copy_named_substring() |
388 |
returns a pointer to a string containing the version of PCRE |
pcre_get_substring() |
389 |
and its date of release. |
pcre_get_named_substring() |
390 |
|
pcre_get_substring_list() |
391 |
|
|
392 |
|
pcre_free_substring() and pcre_free_substring_list() are |
393 |
|
also provided, to free the memory used for extracted |
394 |
|
strings. |
395 |
|
|
396 |
|
The function pcre_maketables() is used (optionally) to build |
397 |
|
a set of character tables in the current locale for passing |
398 |
|
to pcre_compile(). |
399 |
|
|
400 |
|
The function pcre_fullinfo() is used to find out information |
401 |
|
about a compiled pattern; pcre_info() is an obsolete version |
402 |
|
which returns only some of the available information, but is |
403 |
|
retained for backwards compatibility. The function |
404 |
|
pcre_version() returns a pointer to a string containing the |
405 |
|
version of PCRE and its date of release. |
406 |
|
|
407 |
The global variables pcre_malloc and pcre_free initially |
The global variables pcre_malloc and pcre_free initially |
408 |
contain the entry points of the standard malloc() and free() |
contain the entry points of the standard malloc() and free() |
411 |
replace them if it wishes to intercept the calls. This |
replace them if it wishes to intercept the calls. This |
412 |
should be done before calling any PCRE functions. |
should be done before calling any PCRE functions. |
413 |
|
|
414 |
|
The global variable pcre_callout initially contains NULL. It |
415 |
|
can be set by the caller to a "callout" function, which PCRE |
416 |
|
will then call at specified points during a matching opera- |
417 |
|
tion. Details are given in the pcrecallout documentation. |
418 |
|
|
419 |
|
|
420 |
|
MULTITHREADING |
421 |
|
|
|
MULTI-THREADING |
|
422 |
The PCRE functions can be used in multi-threading applica- |
The PCRE functions can be used in multi-threading applica- |
423 |
tions, with the proviso that the memory management functions |
tions, with the proviso that the memory management functions |
424 |
pointed to by pcre_malloc and pcre_free are shared by all |
pointed to by pcre_malloc and pcre_free, and the callout |
425 |
|
function pointed to by pcre_callout, are shared by all |
426 |
threads. |
threads. |
427 |
|
|
428 |
The compiled form of a regular expression is not altered |
The compiled form of a regular expression is not altered |
430 |
used by several threads at once. |
used by several threads at once. |
431 |
|
|
432 |
|
|
433 |
|
CHECKING BUILD-TIME OPTIONS |
434 |
|
|
435 |
|
int pcre_config(int what, void *where); |
436 |
|
|
437 |
|
The function pcre_config() makes it possible for a PCRE |
438 |
|
client to discover which optional features have been com- |
439 |
|
piled into the PCRE library. The pcrebuild documentation has |
440 |
|
more details about these optional features. |
441 |
|
|
442 |
|
The first argument for pcre_config() is an integer, specify- |
443 |
|
ing which information is required; the second argument is a |
444 |
|
pointer to a variable into which the information is placed. |
445 |
|
The following information is available: |
446 |
|
|
447 |
|
PCRE_CONFIG_UTF8 |
448 |
|
|
449 |
|
The output is an integer that is set to one if UTF-8 support |
450 |
|
is available; otherwise it is set to zero. |
451 |
|
|
452 |
|
PCRE_CONFIG_NEWLINE |
453 |
|
|
454 |
|
The output is an integer that is set to the value of the |
455 |
|
code that is used for the newline character. It is either |
456 |
|
linefeed (10) or carriage return (13), and should normally |
457 |
|
be the standard character for your operating system. |
458 |
|
|
459 |
|
PCRE_CONFIG_LINK_SIZE |
460 |
|
|
461 |
|
The output is an integer that contains the number of bytes |
462 |
|
used for internal linkage in compiled regular expressions. |
463 |
|
The value is 2, 3, or 4. Larger values allow larger regular |
464 |
|
expressions to be compiled, at the expense of slower match- |
465 |
|
ing. The default value of 2 is sufficient for all but the |
466 |
|
most massive patterns, since it allows the compiled pattern |
467 |
|
to be up to 64K in size. |
468 |
|
|
469 |
|
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD |
470 |
|
|
471 |
|
The output is an integer that contains the threshold above |
472 |
|
which the POSIX interface uses malloc() for output vectors. |
473 |
|
Further details are given in the pcreposix documentation. |
474 |
|
|
475 |
|
PCRE_CONFIG_MATCH_LIMIT |
476 |
|
|
477 |
|
The output is an integer that gives the default limit for |
478 |
|
the number of internal matching function calls in a |
479 |
|
pcre_exec() execution. Further details are given with |
480 |
|
pcre_exec() below. |
481 |
|
|
482 |
|
|
483 |
COMPILING A PATTERN |
COMPILING A PATTERN |
484 |
|
|
485 |
|
pcre *pcre_compile(const char *pattern, int options, |
486 |
|
const char **errptr, int *erroffset, |
487 |
|
const unsigned char *tableptr); |
488 |
|
|
489 |
The function pcre_compile() is called to compile a pattern |
The function pcre_compile() is called to compile a pattern |
490 |
into an internal form. The pattern is a C string terminated |
into an internal form. The pattern is a C string terminated |
491 |
by a binary zero, and is passed in the argument pattern. A |
by a binary zero, and is passed in the argument pattern. A |
492 |
pointer to a single block of memory that is obtained via |
pointer to a single block of memory that is obtained via |
493 |
pcre_malloc is returned. This contains the compiled code and |
pcre_malloc is returned. This contains the compiled code and |
494 |
related data. The pcre type is defined for this for conveni- |
related data. The pcre type is defined for the returned |
495 |
ence, but in fact pcre is just a typedef for void, since the |
block; this is a typedef for a structure whose contents are |
496 |
contents of the block are not externally defined. It is up |
not externally defined. It is up to the caller to free the |
497 |
to the caller to free the memory when it is no longer |
memory when it is no longer required. |
498 |
required. |
|
499 |
|
Although the compiled code of a PCRE regex is relocatable, |
500 |
The size of a compiled pattern is roughly proportional to |
that is, it does not depend on memory location, the complete |
501 |
the length of the pattern string, except that each character |
pcre data block is not fully relocatable, because it con- |
502 |
class (other than those containing just a single character, |
tains a copy of the tableptr argument, which is an address |
503 |
negated or not) requires 33 bytes, and repeat quantifiers |
(see below). |
|
with a minimum greater than one or a bounded maximum cause |
|
|
the relevant portions of the compiled pattern to be repli- |
|
|
cated. |
|
|
|
|
504 |
The options argument contains independent bits that affect |
The options argument contains independent bits that affect |
505 |
the compilation. It should be zero if no options are |
the compilation. It should be zero if no options are |
506 |
required. Some of the options, in particular, those that are |
required. Some of the options, in particular, those that are |
507 |
compatible with Perl, can also be set and unset from within |
compatible with Perl, can also be set and unset from within |
508 |
the pattern (see the detailed description of regular expres- |
the pattern (see the detailed description of regular expres- |
509 |
sions below). For these options, the contents of the options |
sions in the pcrepattern documentation). For these options, |
510 |
argument specifies their initial settings at the start of |
the contents of the options argument specifies their initial |
511 |
compilation and execution. The PCRE_ANCHORED option can be |
settings at the start of compilation and execution. The |
512 |
set at the time of matching as well as at compile time. |
PCRE_ANCHORED option can be set at the time of matching as |
513 |
|
well as at compile time. |
514 |
|
|
515 |
If errptr is NULL, pcre_compile() returns NULL immediately. |
If errptr is NULL, pcre_compile() returns NULL immediately. |
516 |
Otherwise, if compilation of a pattern fails, pcre_compile() |
Otherwise, if compilation of a pattern fails, pcre_compile() |
527 |
must be the result of a call to pcre_maketables(). See the |
must be the result of a call to pcre_maketables(). See the |
528 |
section on locale support below. |
section on locale support below. |
529 |
|
|
530 |
The following option bits are defined in the header file: |
This code fragment shows a typical straightforward call to |
531 |
|
pcre_compile(): |
532 |
|
|
533 |
|
pcre *re; |
534 |
|
const char *error; |
535 |
|
int erroffset; |
536 |
|
re = pcre_compile( |
537 |
|
"^A.*Z", /* the pattern */ |
538 |
|
0, /* default options */ |
539 |
|
&error, /* for error message */ |
540 |
|
&erroffset, /* for error offset */ |
541 |
|
NULL); /* use default character tables */ |
542 |
|
|
543 |
|
The following option bits are defined: |
544 |
|
|
545 |
PCRE_ANCHORED |
PCRE_ANCHORED |
546 |
|
|
547 |
If this bit is set, the pattern is forced to be "anchored", |
If this bit is set, the pattern is forced to be "anchored", |
548 |
that is, it is constrained to match only at the start of the |
that is, it is constrained to match only at the first match- |
549 |
string which is being searched (the "subject string"). This |
ing point in the string which is being searched (the "sub- |
550 |
effect can also be achieved by appropriate constructs in the |
ject string"). This effect can also be achieved by appropri- |
551 |
pattern itself, which is the only way to do it in Perl. |
ate constructs in the pattern itself, which is the only way |
552 |
|
to do it in Perl. |
553 |
|
|
554 |
PCRE_CASELESS |
PCRE_CASELESS |
555 |
|
|
556 |
If this bit is set, letters in the pattern match both upper |
If this bit is set, letters in the pattern match both upper |
557 |
and lower case letters. It is equivalent to Perl's /i |
and lower case letters. It is equivalent to Perl's /i |
558 |
option. |
option, and it can be changed within a pattern by a (?i) |
559 |
|
option setting. |
560 |
|
|
561 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
562 |
|
|
566 |
character if it is a newline (but not before any other new- |
character if it is a newline (but not before any other new- |
567 |
lines). The PCRE_DOLLAR_ENDONLY option is ignored if |
lines). The PCRE_DOLLAR_ENDONLY option is ignored if |
568 |
PCRE_MULTILINE is set. There is no equivalent to this option |
PCRE_MULTILINE is set. There is no equivalent to this option |
569 |
in Perl. |
in Perl, and no way to set it within a pattern. |
570 |
|
|
571 |
PCRE_DOTALL |
PCRE_DOTALL |
572 |
|
|
573 |
If this bit is set, a dot metacharacter in the pattern
If this bit is set, a dot metacharacter in the pattern
574 |
matches all characters, including newlines. Without it, new- |
matches all characters, including newlines. Without it, new- |
575 |
lines are excluded. This option is equivalent to Perl's /s |
lines are excluded. This option is equivalent to Perl's /s |
576 |
option. A negative class such as [^a] always matches a new- |
option, and it can be changed within a pattern by a (?s) |
577 |
line character, independent of the setting of this option. |
option setting. A negative class such as [^a] always matches |
578 |
|
a newline character, independent of the setting of this |
579 |
|
option. |
580 |
|
|
581 |
PCRE_EXTENDED |
PCRE_EXTENDED |
582 |
|
|
583 |
If this bit is set, whitespace data characters in the pat- |
If this bit is set, whitespace data characters in the pat- |
584 |
tern are totally ignored except when escaped or inside a |
tern are totally ignored except when escaped or inside a |
585 |
character class, and characters between an unescaped # out- |
character class. Whitespace does not include the VT charac- |
586 |
side a character class and the next newline character, |
ter (code 11). In addition, characters between an unescaped |
587 |
|
# outside a character class and the next newline character, |
588 |
inclusive, are also ignored. This is equivalent to Perl's /x |
inclusive, are also ignored. This is equivalent to Perl's /x |
589 |
option, and makes it possible to include comments inside |
option, and it can be changed within a pattern by a (?x) |
590 |
complicated patterns. Note, however, that this applies only |
option setting. |
591 |
to data characters. Whitespace characters may never appear |
|
592 |
|
This option makes it possible to include comments inside |
593 |
|
complicated patterns. Note, however, that this applies only |
594 |
|
to data characters. Whitespace characters may never appear |
595 |
within special character sequences in a pattern, for example |
within special character sequences in a pattern, for example |
596 |
within the sequence (?( which introduces a conditional sub- |
within the sequence (?( which introduces a conditional sub- |
597 |
pattern. |
pattern. |
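The effect of PCRE_EXTENDED on a pattern's data characters can be approximated by the sketch below. It is an illustration only: strip_extended() is a hypothetical helper, not a PCRE function, and PCRE applies these rules while compiling rather than in a separate pass. The sketch also simplifies character classes (it treats the first "]" after "[" as the end of the class).

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Copies pattern to out, dropping unescaped whitespace and
   "#" comments that lie outside character classes.
   out must have room for strlen(pattern) + 1 bytes.
   Returns the length of the result. */
size_t strip_extended(const char *pattern, char *out)
{
    size_t n = 0;
    int in_class = 0;
    const char *p = pattern;

    while (*p != '\0') {
        char c = *p++;

        if (c == '\\') {                 /* escaped: keep both characters */
            out[n++] = c;
            if (*p != '\0') out[n++] = *p++;
            continue;
        }
        if (in_class) {                  /* inside [...]: keep everything */
            if (c == ']') in_class = 0;
            out[n++] = c;
            continue;
        }
        if (c == '[') {
            in_class = 1;
            out[n++] = c;
            continue;
        }
        /* whitespace here deliberately excludes VT (code 11) */
        if (c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r')
            continue;
        if (c == '#') {                  /* skip through the next newline */
            while (*p != '\0' && *p != '\n') p++;
            if (*p == '\n') p++;
            continue;
        }
        out[n++] = c;
    }
    out[n] = '\0';
    return n;
}
```

For instance, a pattern consisting of "a b", a "#" comment ending in a newline, and then "[ #]" comes out as "ab[ #]": the space and "#" inside the class survive, while the space between "a" and "b" and the whole comment are dropped.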
598 |
|
|
599 |
PCRE_EXTRA |
PCRE_EXTRA |
600 |
|
|
601 |
This option turns on additional functionality of PCRE that |
This option was invented in order to turn on additional |
602 |
is incompatible with Perl. Any backslash in a pattern that |
functionality of PCRE that is incompatible with Perl, but it |
603 |
is followed by a letter that has no special meaning causes |
is currently of very little use. When set, any backslash in |
604 |
an error, thus reserving these combinations for future |
a pattern that is followed by a letter that has no special |
605 |
expansion. By default, as in Perl, a backslash followed by a |
meaning causes an error, thus reserving these combinations |
606 |
letter with no special meaning is treated as a literal. |
for future expansion. By default, as in Perl, a backslash |
607 |
There are at present no other features controlled by this |
followed by a letter with no special meaning is treated as a |
608 |
option. |
literal. There are at present no other features controlled |
609 |
|
by this option. It can also be set by a (?X) option setting |
610 |
|
within a pattern. |
611 |
|
|
612 |
PCRE_MULTILINE |
PCRE_MULTILINE |
613 |
|
|
620 |
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl. |
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl. |
621 |
|
|
622 |
When PCRE_MULTILINE is set, the "start of line" and "end |
When PCRE_MULTILINE is set, the "start of line" and "end |
623 |
of line" constructs match immediately following or |
of line" constructs match immediately following or immedi- |
624 |
immediately before any newline in the subject string, |
ately before any newline in the subject string, respec- |
625 |
respectively, as well as at the very start and end. This is |
tively, as well as at the very start and end. This is |
626 |
equivalent to Perl's /m option. If there are no "\n" charac- |
equivalent to Perl's /m option, and it can be changed within |
627 |
ters in a subject string, or no occurrences of ^ or $ in a |
a pattern by a (?m) option setting. If there are no "\n" |
628 |
pattern, setting PCRE_MULTILINE has no effect. |
characters in a subject string, or no occurrences of ^ or $ |
629 |
|
in a pattern, setting PCRE_MULTILINE has no effect. |
630 |
|
|
631 |
|
PCRE_NO_AUTO_CAPTURE |
632 |
|
|
633 |
|
If this option is set, it disables the use of numbered cap- |
634 |
|
turing parentheses in the pattern. Any opening parenthesis |
635 |
|
that is not followed by ? behaves as if it were followed by |
636 |
|
?: but named parentheses can still be used for capturing |
637 |
|
(and they acquire numbers in the usual way). There is no |
638 |
|
equivalent of this option in Perl. |
639 |
|
|
640 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
641 |
|
|
644 |
followed by "?". It is not compatible with Perl. It can also |
followed by "?". It is not compatible with Perl. It can also |
645 |
be set by a (?U) option setting within the pattern. |
be set by a (?U) option setting within the pattern. |
646 |
|
|
647 |
|
PCRE_UTF8 |
648 |
|
|
649 |
|
This option causes PCRE to regard both the pattern and the |
650 |
|
subject as strings of UTF-8 characters instead of single- |
651 |
|
byte character strings. However, it is available only if |
652 |
|
PCRE has been built to include UTF-8 support. If not, the |
653 |
|
use of this option provokes an error. Details of how this |
654 |
|
option changes the behaviour of PCRE are given in the sec- |
655 |
|
tion on UTF-8 support in the main pcre page. |
656 |
|
|
657 |
|
|
658 |
STUDYING A PATTERN |
STUDYING A PATTERN |
659 |
|
|
660 |
|
pcre_extra *pcre_study(const pcre *code, int options, |
661 |
|
const char **errptr); |
662 |
|
|
663 |
When a pattern is going to be used several times, it is |
When a pattern is going to be used several times, it is |
664 |
worth spending more time analyzing it in order to speed up |
worth spending more time analyzing it in order to speed up |
665 |
the time taken for matching. The function pcre_study() takes |
the time taken for matching. The function pcre_study() takes |
666 |
a pointer to a compiled pattern as its first argument, and |
a pointer to a compiled pattern as its first argument. If |
667 |
returns a pointer to a pcre_extra block (another void |
studying the pattern produces additional information that |
668 |
typedef) containing additional information about the pat- |
will help speed up matching, pcre_study() returns a pointer |
669 |
tern; this can be passed to pcre_exec(). If no additional |
to a pcre_extra block, in which the study_data field points |
670 |
information is available, NULL is returned. |
to the results of the study. |
671 |
|
|
672 |
|
The value returned from pcre_study() can be passed |
673 |
|
directly to pcre_exec(). However, the pcre_extra block also |
674 |
|
contains other fields that can be set by the caller before |
675 |
|
the block is passed; these are described below. If studying |
676 |
|
the pattern does not produce any additional information, |
677 |
|
pcre_study() returns NULL. In that circumstance, if the cal- |
678 |
|
ling program wants to pass some of the other fields to |
679 |
|
pcre_exec(), it must set up its own pcre_extra block. |
680 |
|
|
681 |
The second argument contains option bits. At present, no |
The second argument contains option bits. At present, no |
682 |
options are defined for pcre_study(), and this argument |
options are defined for pcre_study(), and this argument |
683 |
should always be zero. |
should always be zero. |
684 |
|
|
685 |
The third argument for pcre_study() is a pointer to an error |
The third argument for pcre_study() is a pointer for an |
686 |
message. If studying succeeds (even if no data is returned), |
error message. If studying succeeds (even if no data is |
687 |
the variable it points to is set to NULL. Otherwise it |
returned), the variable it points to is set to NULL. Other- |
688 |
points to a textual error message. |
wise it points to a textual error message. You should there- |
689 |
|
fore test the error pointer for NULL after calling |
690 |
|
pcre_study(), to be sure that it has run successfully. |
691 |
|
|
692 |
|
This is a typical call to pcre_study(): |
693 |
|
|
694 |
|
pcre_extra *pe; |
695 |
|
pe = pcre_study( |
696 |
|
re, /* result of pcre_compile() */ |
697 |
|
0, /* no options exist */ |
698 |
|
&error); /* set to NULL or points to a message */ |
699 |
|
|
700 |
At present, studying a pattern is useful only for non- |
At present, studying a pattern is useful only for non- |
701 |
anchored patterns that do not have a single fixed starting |
anchored patterns that do not have a single fixed starting |
703 |
created. |
created. |
704 |
|
|
705 |
|
|
|
|
|
706 |
LOCALE SUPPORT |
LOCALE SUPPORT |
707 |
|
|
708 |
PCRE handles caseless matching, and determines whether char- |
PCRE handles caseless matching, and determines whether char- |
709 |
acters are letters, digits, or whatever, by reference to a |
acters are letters, digits, or whatever, by reference to a |
710 |
set of tables. The library contains a default set of tables |
set of tables. When running in UTF-8 mode, this applies only |
711 |
which is created in the default C locale when PCRE is com- |
to characters with codes less than 256. The library contains |
712 |
piled. This is used when the final argument of |
a default set of tables that is created in the default C |
713 |
pcre_compile() is NULL, and is sufficient for many applica- |
locale when PCRE is compiled. This is used when the final |
714 |
tions. |
argument of pcre_compile() is NULL, and is sufficient for |
715 |
|
many applications. |
716 |
|
|
717 |
An alternative set of tables can, however, be supplied. Such |
An alternative set of tables can, however, be supplied. Such |
718 |
tables are built by calling the pcre_maketables() function, |
tables are built by calling the pcre_maketables() function, |
730 |
The tables are built in memory that is obtained via |
The tables are built in memory that is obtained via |
731 |
pcre_malloc. The pointer that is passed to pcre_compile is |
pcre_malloc. The pointer that is passed to pcre_compile is |
732 |
saved with the compiled pattern, and the same tables are |
saved with the compiled pattern, and the same tables are |
733 |
used via this pointer by pcre_study() and pcre_exec(). Thus |
used via this pointer by pcre_study() and pcre_exec(). Thus, |
734 |
for any single pattern, compilation, studying and matching |
for any single pattern, compilation, studying and matching |
735 |
all happen in the same locale, but different patterns can be |
all happen in the same locale, but different patterns can be |
736 |
compiled in different locales. It is the caller's responsi- |
compiled in different locales. It is the caller's responsi- |
738 |
remains available for as long as it is needed. |
remains available for as long as it is needed. |
739 |
|
|
740 |
|
|
|
|
|
741 |
INFORMATION ABOUT A PATTERN |
INFORMATION ABOUT A PATTERN |
742 |
The pcre_info() function returns information about a com- |
|
743 |
piled pattern. Its yield is the number of capturing subpat- |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
744 |
terns, or one of the following negative numbers: |
int what, void *where); |
745 |
|
|
746 |
|
The pcre_fullinfo() function returns information about a |
747 |
|
compiled pattern. It replaces the obsolete pcre_info() func- |
748 |
|
tion, which is nevertheless retained for backwards compati- |
749 |
|
bility (and is documented below). |
750 |
|
|
751 |
|
The first argument for pcre_fullinfo() is a pointer to the |
752 |
|
compiled pattern. The second argument is the result of |
753 |
|
pcre_study(), or NULL if the pattern was not studied. The |
754 |
|
third argument specifies which piece of information is |
755 |
|
required, and the fourth argument is a pointer to a variable |
756 |
|
to receive the data. The yield of the function is zero for |
757 |
|
success, or one of the following negative numbers: |
758 |
|
|
759 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
760 |
|
the argument where was NULL |
761 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
762 |
|
PCRE_ERROR_BADOPTION the value of what was invalid |
763 |
|
|
764 |
If the optptr argument is not NULL, a copy of the options |
Here is a typical call of pcre_fullinfo(), to obtain the |
765 |
with which the pattern was compiled is placed in the integer |
length of the compiled pattern: |
|
it points to. These option bits are those specified in the |
|
|
call to pcre_compile(), modified by any top-level option |
|
|
settings within the pattern itself, and with the |
|
|
PCRE_ANCHORED bit set if the form of the pattern implies |
|
|
that it can match only at the start of a subject string. |
|
766 |
|
|
767 |
If the pattern is not anchored and the firstcharptr argument |
int rc; |
768 |
is not NULL, it is used to pass back information about the |
size_t length; |
769 |
first character of any matched string. If there is a fixed |
rc = pcre_fullinfo( |
770 |
first character, e.g. from a pattern such as |
re, /* result of pcre_compile() */ |
771 |
(cat|cow|coyote), then it is returned in the integer pointed |
pe, /* result of pcre_study(), or NULL */ |
772 |
to by firstcharptr. Otherwise, if either |
PCRE_INFO_SIZE, /* what is required */ |
773 |
|
&length); /* where to put the data */ |
774 |
|
|
775 |
|
The possible values for the third argument are defined in |
776 |
|
pcre.h, and are as follows: |
777 |
|
|
778 |
|
PCRE_INFO_BACKREFMAX |
779 |
|
|
780 |
|
Return the number of the highest back reference in the pat- |
781 |
|
tern. The fourth argument should point to an int variable. |
782 |
|
Zero is returned if there are no back references. |
783 |
|
|
784 |
|
PCRE_INFO_CAPTURECOUNT |
785 |
|
|
786 |
|
Return the number of capturing subpatterns in the pattern. |
787 |
|
The fourth argument should point to an int variable. |
788 |
|
|
789 |
|
PCRE_INFO_FIRSTBYTE |
790 |
|
|
791 |
|
Return information about the first byte of any matched |
792 |
|
string, for a non-anchored pattern. (This option used to be |
793 |
|
called PCRE_INFO_FIRSTCHAR; the old name is still recognized |
794 |
|
for backwards compatibility.) |
795 |
|
|
796 |
|
If there is a fixed first byte, e.g. from a pattern such as |
797 |
|
(cat|cow|coyote), it is returned in the integer pointed to |
798 |
|
by where. Otherwise, if either |
799 |
|
|
800 |
(a) the pattern was compiled with the PCRE_MULTILINE option, |
(a) the pattern was compiled with the PCRE_MULTILINE option, |
801 |
and every branch starts with "^", or |
and every branch starts with "^", or |
803 |
(b) every branch of the pattern starts with ".*" and |
(b) every branch of the pattern starts with ".*" and |
804 |
PCRE_DOTALL is not set (if it were set, the pattern would be |
PCRE_DOTALL is not set (if it were set, the pattern would be |
805 |
anchored), |
anchored), |
|
then -1 is returned, indicating that the pattern matches |
|
|
only at the start of a subject string or after any "\n" |
|
|
within the string. Otherwise -2 is returned. |
|
806 |
|
|
807 |
|
-1 is returned, indicating that the pattern matches only at |
808 |
|
the start of a subject string or after any newline within |
809 |
|
the string. Otherwise -2 is returned. For anchored patterns, |
810 |
|
-2 is returned. |
811 |
|
|
812 |
|
PCRE_INFO_FIRSTTABLE |
813 |
|
|
814 |
|
If the pattern was studied, and this resulted in the con- |
815 |
|
struction of a 256-bit table indicating a fixed set of bytes |
816 |
|
for the first byte in any matching string, a pointer to the |
817 |
|
table is returned. Otherwise NULL is returned. The fourth |
818 |
|
argument should point to an unsigned char * variable. |
819 |
|
|
820 |
|
PCRE_INFO_LASTLITERAL |
821 |
|
|
822 |
|
Return the value of the rightmost literal byte that must |
823 |
|
exist in any matched string, other than at its start, if |
824 |
|
such a byte has been recorded. The fourth argument should |
825 |
|
point to an int variable. If there is no such byte, -1 is |
826 |
|
returned. For anchored patterns, a last literal byte is |
827 |
|
recorded only if it follows something of variable length. |
828 |
|
For example, for the pattern /^a\d+z\d+/ the returned value |
829 |
|
is "z", but for /^a\dz\d/ the returned value is -1. |
830 |
|
|
831 |
|
PCRE_INFO_NAMECOUNT |
832 |
|
PCRE_INFO_NAMEENTRYSIZE |
833 |
|
PCRE_INFO_NAMETABLE |
834 |
|
|
835 |
|
PCRE supports the use of named as well as numbered capturing |
836 |
|
parentheses. The names are just an additional way of identi- |
837 |
|
fying the parentheses, which still acquire a number. A |
838 |
|
caller that wants to extract data from a named subpattern |
839 |
|
must convert the name to a number in order to access the |
840 |
|
correct pointers in the output vector (described with |
841 |
|
pcre_exec() below). In order to do this, it must first use |
842 |
|
these three values to obtain the name-to-number mapping |
843 |
|
table for the pattern. |
844 |
|
|
845 |
|
The map consists of a number of fixed-size entries. |
846 |
|
PCRE_INFO_NAMECOUNT gives the number of entries, and |
847 |
|
PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both |
848 |
|
of these return an int value. The entry size depends on the |
849 |
|
length of the longest name. PCRE_INFO_NAMETABLE returns a |
850 |
|
pointer to the first entry of the table (a pointer to char). |
851 |
|
The first two bytes of each entry are the number of the cap- |
852 |
|
turing parenthesis, most significant byte first. The rest of |
853 |
|
the entry is the corresponding name, zero terminated. The |
854 |
|
names are in alphabetical order. For example, consider the |
855 |
|
following pattern (assume PCRE_EXTENDED is set, so white |
856 |
|
space - including newlines - is ignored): |
857 |
|
|
858 |
|
(?P<date> (?P<year>(\d\d)?\d\d) - |
859 |
|
(?P<month>\d\d) - (?P<day>\d\d) ) |
860 |
|
|
861 |
|
There are four named subpatterns, so the table has four |
862 |
|
entries, and each entry in the table is eight bytes long. |
863 |
|
The table is as follows, with non-printing bytes shown in |
864 |
|
hex, and undefined bytes shown as ??: |
865 |
|
|
866 |
|
00 01 d a t e 00 ?? |
867 |
|
00 05 d a y 00 ?? ?? |
868 |
|
00 04 m o n t h 00 |
869 |
|
00 02 y e a r 00 ?? |
870 |
|
|
871 |
|
When writing code to extract data from named subpatterns, |
872 |
|
remember that the length of each entry may be different for |
873 |
|
each compiled pattern. |
874 |
|
|
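The table layout described above can be walked with a few lines of C. The following is a minimal sketch, not part of PCRE: find_group_number() and example_table are hypothetical names introduced for illustration, and the table bytes are copied by hand from the four-entry example, with the undefined bytes set to zero. In a real program the table pointer, entry count, and entry size would come from pcre_fullinfo().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Each entry is entry_size bytes:
   a two-byte group number (most significant byte first) followed by a
   zero-terminated name. Returns the group number, or -1 if the name is
   not in the table. */
int find_group_number(const char *table, int name_count,
                      int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The example table from the text: four entries of eight bytes each.
   Bytes shown as ?? in the text are written as zero here. */
const char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

A linear scan is used for simplicity. Because the names are stored in alphabetical order, a binary search would also work for large tables.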
875 |
|
PCRE_INFO_OPTIONS |
876 |
|
|
877 |
|
Return a copy of the options with which the pattern was com- |
878 |
|
piled. The fourth argument should point to an unsigned long |
879 |
|
int variable. These option bits are those specified in the |
880 |
|
call to pcre_compile(), modified by any top-level option |
881 |
|
settings within the pattern itself. |
882 |
|
|
883 |
|
A pattern is automatically anchored by PCRE if all of its |
884 |
|
top-level alternatives begin with one of the following: |
885 |
|
|
886 |
|
^ unless PCRE_MULTILINE is set |
887 |
|
\A always |
888 |
|
\G always |
889 |
|
.* if PCRE_DOTALL is set and there are no back |
890 |
|
references to the subpattern in which .* appears |
891 |
|
|
892 |
|
For such patterns, the PCRE_ANCHORED bit is set in the |
893 |
|
options returned by pcre_fullinfo(). |
894 |
|
|
895 |
|
PCRE_INFO_SIZE |
896 |
|
|
897 |
|
Return the size of the compiled pattern, that is, the value |
898 |
|
that was passed as the argument to pcre_malloc() when PCRE |
899 |
|
was getting memory in which to place the compiled data. The |
900 |
|
fourth argument should point to a size_t variable. |
901 |
|
|
902 |
|
PCRE_INFO_STUDYSIZE |
903 |
|
|
904 |
|
Returns the size of the data block pointed to by the |
905 |
|
study_data field in a pcre_extra block. That is, it is the |
906 |
|
value that was passed to pcre_malloc() when PCRE was getting |
907 |
|
memory into which to place the data created by pcre_study(). |
908 |
|
The fourth argument should point to a size_t variable. |
909 |
|
|
910 |
|
|
911 |
|
OBSOLETE INFO FUNCTION |
912 |
|
|
913 |
|
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
914 |
|
|
915 |
|
The pcre_info() function is now obsolete because its inter- |
916 |
|
face is too restrictive to return all the available data |
917 |
|
about a compiled pattern. New programs should use |
918 |
|
pcre_fullinfo() instead. The yield of pcre_info() is the |
919 |
|
number of capturing subpatterns, or one of the following |
920 |
|
negative numbers: |
921 |
|
|
922 |
|
PCRE_ERROR_NULL the argument code was NULL |
923 |
|
PCRE_ERROR_BADMAGIC the "magic number" was not found |
924 |
|
|
925 |
|
If the optptr argument is not NULL, a copy of the options |
926 |
|
with which the pattern was compiled is placed in the integer |
927 |
|
it points to (see PCRE_INFO_OPTIONS above). |
928 |
|
|
929 |
|
If the pattern is not anchored and the firstcharptr argument |
930 |
|
is not NULL, it is used to pass back information about the |
931 |
|
first character of any matched string (see |
932 |
|
PCRE_INFO_FIRSTBYTE above). |
933 |
|
|
934 |
|
|
935 |
MATCHING A PATTERN |
MATCHING A PATTERN |
936 |
|
|
937 |
|
int pcre_exec(const pcre *code, const pcre_extra *extra, |
938 |
|
const char *subject, int length, int startoffset, |
939 |
|
int options, int *ovector, int ovecsize); |
940 |
|
|
941 |
The function pcre_exec() is called to match a subject string |
The function pcre_exec() is called to match a subject string |
942 |
against a pre-compiled pattern, which is passed in the code |
against a pre-compiled pattern, which is passed in the code |
943 |
argument. If the pattern has been studied, the result of the |
argument. If the pattern has been studied, the result of the |
944 |
study should be passed in the extra argument. Otherwise this |
study should be passed in the extra argument. |
945 |
must be NULL. |
|
946 |
|
Here is an example of a simple call to pcre_exec(): |
947 |
|
|
948 |
|
int rc; |
949 |
|
int ovector[30]; |
950 |
|
rc = pcre_exec( |
951 |
|
re, /* result of pcre_compile() */ |
952 |
|
NULL, /* we didn't study the pattern */ |
953 |
|
"some string", /* the subject string */ |
954 |
|
11, /* the length of the subject string */ |
955 |
|
0, /* start at offset 0 in the subject */ |
956 |
|
0, /* default options */ |
957 |
|
ovector, /* vector for substring information */ |
958 |
|
30); /* number of elements in the vector */ |
959 |
|
|
960 |
|
If the extra argument is not NULL, it must point to a |
961 |
|
pcre_extra data block. The pcre_study() function returns |
962 |
|
such a block (when it doesn't return NULL), but you can also |
963 |
|
create one for yourself, and pass additional information in |
964 |
|
it. The fields in the block are as follows: |
965 |
|
|
966 |
|
unsigned long int flags; |
967 |
|
void *study_data; |
968 |
|
unsigned long int match_limit; |
969 |
|
void *callout_data; |
970 |
|
|
971 |
|
The flags field is a bitmap that specifies which of the |
972 |
|
other fields are set. The flag bits are: |
973 |
|
|
974 |
|
PCRE_EXTRA_STUDY_DATA |
975 |
|
PCRE_EXTRA_MATCH_LIMIT |
976 |
|
PCRE_EXTRA_CALLOUT_DATA |
977 |
|
|
978 |
|
Other flag bits should be set to zero. The study_data field |
979 |
|
is set in the pcre_extra block that is returned by |
980 |
|
pcre_study(), together with the appropriate flag bit. You |
981 |
|
should not set this yourself, but you can add to the block |
982 |
|
by setting the other fields. |
983 |
|
|
984 |
|
The match_limit field provides a means of preventing PCRE |
985 |
|
from using up a vast amount of resources when running pat- |
986 |
|
terns that are not going to match, but which have a very |
987 |
|
large number of possibilities in their search trees. The |
988 |
|
classic example is the use of nested unlimited repeats. |
989 |
|
Internally, PCRE uses a function called match() which it |
990 |
|
calls repeatedly (sometimes recursively). The limit is |
991 |
|
imposed on the number of times this function is called dur- |
992 |
|
ing a match, which has the effect of limiting the amount of |
993 |
|
recursion and backtracking that can take place. For patterns |
994 |
|
that are not anchored, the count starts from zero for each |
995 |
|
position in the subject string. |
996 |
|
|
997 |
|
The default limit for the library can be set when PCRE is |
998 |
|
built; the default default is 10 million, which handles all |
999 |
|
but the most extreme cases. You can reduce the default by |
1000 |
|
supplying pcre_exec() with a pcre_extra block in which |
1001 |
|
match_limit is set to a smaller value, and |
1002 |
|
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the |
1003 |
|
limit is exceeded, pcre_exec() returns |
1004 |
|
PCRE_ERROR_MATCHLIMIT. |
1005 |
|
|
1006 |
|
The callout_data field is used in conjunction with the "cal- |
1007 |
|
lout" feature, which is described in the pcrecallout docu- |
1008 |
|
mentation. |
1009 |
|
|
1010 |
The PCRE_ANCHORED option can be passed in the options argu- |
The PCRE_ANCHORED option can be passed in the options argu- |
1011 |
ment, whose unused bits must be zero. However, if a pattern |
ment, whose unused bits must be zero. This limits |
1012 |
was compiled with PCRE_ANCHORED, or turned out to be |
pcre_exec() to matching at the first matching position. How- |
1013 |
anchored by virtue of its contents, it cannot be made |
ever, if a pattern was compiled with PCRE_ANCHORED, or |
1014 |
unanchored at matching time. |
turned out to be anchored by virtue of its contents, it can- |
1015 |
|
not be made unanchored at matching time. |
1016 |
|
|
1017 |
There are also three further options that can be set only at |
There are also three further options that can be set only at |
1018 |
matching time: |
matching time: |
1056 |
advancing the starting offset (see below) and trying an |
advancing the starting offset (see below) and trying an |
1057 |
ordinary match again. |
ordinary match again. |
1058 |
|
|
1059 |
The subject string is passed as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in |
1060 |
length in length, and a starting offset in startoffset. |
subject, a length in length, and a starting offset in star- |
1061 |
Unlike the pattern string, it may contain binary zero char- |
toffset. Unlike the pattern string, the subject may contain |
1062 |
acters. When the starting offset is zero, the search for a |
binary zero bytes. When the starting offset is zero, the |
1063 |
match starts at the beginning of the subject, and this is by |
search for a match starts at the beginning of the subject, |
1064 |
far the most common case. |
and this is by far the most common case. |
1065 |
|
|
1066 |
|
If the pattern was compiled with the PCRE_UTF8 option, the |
1067 |
|
subject must be a sequence of bytes that is a valid UTF-8 |
1068 |
|
string. If an invalid UTF-8 string is passed, PCRE's |
1069 |
|
behaviour is not defined. |
1070 |
|
|
1071 |
A non-zero starting offset is useful when searching for |
A non-zero starting offset is useful when searching for |
1072 |
another match in the same subject by calling pcre_exec() |
another match in the same subject by calling pcre_exec() |
1159 |
Note that pcre_info() can be used to find out how many cap- |
Note that pcre_info() can be used to find out how many cap- |
1160 |
turing subpatterns there are in a compiled pattern. The |
turing subpatterns there are in a compiled pattern. The |
1161 |
smallest size for ovector that will allow for n captured |
smallest size for ovector that will allow for n captured |
1162 |
substrings in addition to the offsets of the substring |
substrings, in addition to the offsets of the substring |
1163 |
matched by the whole pattern is (n+1)*3. |
matched by the whole pattern, is (n+1)*3. |
1164 |
|
|
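The sizing rule can be written as a one-line helper. This is only a sketch: min_ovector_size() is a hypothetical name, not a PCRE function, and the capture count is assumed to have been obtained separately from the compiled pattern.

```c
#include <assert.h>

/* Hypothetical helper: smallest ovector size for a pattern with
   capture_count capturing subpatterns, using the (n+1)*3 rule. */
int min_ovector_size(int capture_count)
{
    assert(capture_count >= 0);
    return (capture_count + 1) * 3;
}
```

For a pattern with nine capturing subpatterns this gives 30, the vector size used in the pcre_exec() example earlier in this page.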
1165 |
If pcre_exec() fails, it returns a negative number. The fol- |
If pcre_exec() fails, it returns a negative number. The fol- |
1166 |
lowing are defined in the header file: |
lowing are defined in the header file: |
1200 |
pcre_malloc() fails, this error is given. The memory is |
pcre_malloc() fails, this error is given. The memory is |
1201 |
freed at the end of matching. |
freed at the end of matching. |
1202 |
|
|
1203 |
|
PCRE_ERROR_NOSUBSTRING (-7) |
1204 |
|
|
1205 |
|
This error is used by the pcre_copy_substring(), |
1206 |
|
pcre_get_substring(), and pcre_get_substring_list() func- |
1207 |
|
tions (see below). It is never returned by pcre_exec(). |
1208 |
|
|
1209 |
|
PCRE_ERROR_MATCHLIMIT (-8) |
1210 |
|
|
1211 |
|
The recursion and backtracking limit, as specified by the |
1212 |
|
match_limit field in a pcre_extra structure (or defaulted) |
1213 |
|
was reached. See the description above. |
1214 |
|
|
1215 |
|
PCRE_ERROR_CALLOUT (-9) |
1216 |
|
|
1217 |
|
This error is never generated by pcre_exec() itself. It is |
1218 |
|
provided for use by callout functions that want to yield a |
1219 |
|
distinctive error code. See the pcrecallout documentation |
1220 |
|
for details. |
1221 |
|
|
1222 |
|
|
1223 |
EXTRACTING CAPTURED SUBSTRINGS |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
1224 |
|
|
1225 |
|
int pcre_copy_substring(const char *subject, int *ovector, |
1226 |
|
int stringcount, int stringnumber, char *buffer, |
1227 |
|
int buffersize); |
1228 |
|
|
1229 |
|
int pcre_get_substring(const char *subject, int *ovector, |
1230 |
|
int stringcount, int stringnumber, |
1231 |
|
const char **stringptr); |
1232 |
|
|
1233 |
|
int pcre_get_substring_list(const char *subject, |
1234 |
|
int *ovector, int stringcount, const char ***listptr); |
1235 |
|
|
1236 |
Captured substrings can be accessed directly by using the |
Captured substrings can be accessed directly by using the |
1237 |
offsets returned by pcre_exec() in ovector. For convenience, |
offsets returned by pcre_exec() in ovector. For convenience, |
1238 |
the functions pcre_copy_substring(), pcre_get_substring(), |
the functions pcre_copy_substring(), pcre_get_substring(), |
1239 |
and pcre_get_substring_list() are provided for extracting |
and pcre_get_substring_list() are provided for extracting |
1240 |
captured substrings as new, separate, zero-terminated |
captured substrings as new, separate, zero-terminated |
1241 |
|
strings. These functions identify substrings by number. The |
1242 |
|
next section describes functions for extracting named sub- |
1243 |
strings. A substring that contains a binary zero is |
strings. A substring that contains a binary zero is |
1244 |
correctly extracted and has a further zero added on the end, |
correctly extracted and has a further zero added on the end, |
1245 |
but the result does not, of course, function as a C string. |
but the result is not, of course, a C string. |
1246 |
|
|
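For comparison, direct access through the offsets might look like the sketch below. It assumes the usual pcre_exec() layout, in which ovector[2*n] and ovector[2*n+1] hold the start and end offsets of substring n and an unset substring has negative offsets. copy_by_offsets() and its return codes are hypothetical and are not the PCRE error codes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function. Copies substring n into buffer
   and adds a terminating zero. Returns the length copied, -1 if n is out
   of range or the substring is unset, or -2 if the buffer is too small. */
int copy_by_offsets(const char *subject, const int *ovector, int stringcount,
                    int n, char *buffer, int buffersize)
{
    int start, len;
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    if (n < 0 || n >= stringcount) return -1;
    start = ovector[2 * n];
    if (start < 0) return -1;              /* unset substring */
    len = ovector[2 * n + 1] - start;
    if (len + 1 > buffersize) return -2;   /* no room for data plus zero */
    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

Because the copy uses memcpy() with an explicit length, a substring containing binary zeros is copied in full, which matches the behaviour described for the library functions.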
1247 |
The first three arguments are the same for all three func- |
The first three arguments are the same for all three of |
1248 |
tions: subject is the subject string which has just been |
these functions: subject is the subject string which has |
1249 |
successfully matched, ovector is a pointer to the vector of |
just been successfully matched, ovector is a pointer to the |
1250 |
integer offsets that was passed to pcre_exec(), and |
vector of integer offsets that was passed to pcre_exec(), |
1251 |
stringcount is the number of substrings that were captured |
and stringcount is the number of substrings that were cap- |
1252 |
by the match, including the substring that matched the |
tured by the match, including the substring that matched the |
1253 |
entire regular expression. This is the value returned by |
entire regular expression. This is the value returned by |
1254 |
pcre_exec() if it is greater than zero. If pcre_exec() |
pcre_exec() if it is greater than zero. If pcre_exec() |
1255 |
returned zero, indicating that it ran out of space in ovec- |
returned zero, indicating that it ran out of space in ovec- |
1256 |
tor, then the value passed as stringcount should be the size |
tor, the value passed as stringcount should be the size of |
1257 |
of the vector divided by three. |
the vector divided by three. |
1258 |
|
|
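The rule for choosing stringcount can be captured in a small helper. This is a sketch under the assumptions stated in the paragraph above; stringcount_from_result() is a hypothetical name, and it expects a non-negative pcre_exec() result (a negative result means the match failed and there is nothing to extract).

```c
#include <assert.h>

/* Hypothetical helper: derive the stringcount argument from the value
   returned by pcre_exec() (rc) and the size of the ovector. */
int stringcount_from_result(int rc, int ovecsize)
{
    assert(rc >= 0 && ovecsize >= 0);
    /* rc == 0 means ovector was too small: use its capacity instead. */
    return rc > 0 ? rc : ovecsize / 3;
}
```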
1259 |
The functions pcre_copy_substring() and pcre_get_substring() |
The functions pcre_copy_substring() and pcre_get_substring() |
1260 |
extract a single substring, whose number is given as string- |
extract a single substring, whose number is given as string- |
1262 |
the entire pattern, while higher values extract the captured |
the entire pattern, while higher values extract the captured |
1263 |
substrings. For pcre_copy_substring(), the string is placed |
substrings. For pcre_copy_substring(), the string is placed |
1264 |
in buffer, whose length is given by buffersize, while for |
in buffer, whose length is given by buffersize, while for |
1265 |
pcre_get_substring() a new block of store is obtained via |
pcre_get_substring() a new block of memory is obtained via |
1266 |
pcre_malloc, and its address is returned via stringptr. The |
pcre_malloc, and its address is returned via stringptr. The |
1267 |
yield of the function is the length of the string, not |
yield of the function is the length of the string, not |
1268 |
including the terminating zero, or one of |
including the terminating zero, or one of |
1296 |
inspecting the appropriate offset in ovector, which is nega- |
inspecting the appropriate offset in ovector, which is nega- |
1297 |
tive for unset substrings. |
tive for unset substrings. |
1298 |
|
|
1299 |
|
The two convenience functions pcre_free_substring() and |
1300 |
|
pcre_free_substring_list() can be used to free the memory |
1301 |
|
returned by a previous call of pcre_get_substring() or |
1302 |
|
pcre_get_substring_list(), respectively. They do nothing |
1303 |
|
more than call the function pointed to by pcre_free, which |
1304 |
|
of course could be called directly from a C program. How- |
1305 |
|
ever, PCRE is used in some situations where it is linked via |
1306 |
|
a special interface to another programming language which |
1307 |
|
cannot use pcre_free directly; it is for these cases that |
1308 |
|
the functions are provided. |
1309 |
|
|
1310 |
|
|
1311 |
|
EXTRACTING CAPTURED SUBSTRINGS BY NAME |
1312 |
|
|
1313 |
|
int pcre_copy_named_substring(const pcre *code, |
1314 |
|
const char *subject, int *ovector, |
1315 |
|
int stringcount, const char *stringname, |
1316 |
|
char *buffer, int buffersize); |
1317 |
|
|
1318 |
|
int pcre_get_stringnumber(const pcre *code, |
1319 |
|
const char *name); |
1320 |
|
|
1321 |
|
int pcre_get_named_substring(const pcre *code, |
1322 |
|
const char *subject, int *ovector, |
1323 |
|
int stringcount, const char *stringname, |
1324 |
|
const char **stringptr); |
1325 |
|
|
1326 |
|
To extract a substring by name, you first have to find its |
1327 |
|
associated number. This can be done by calling |
1328 |
|
pcre_get_stringnumber(). The first argument is the compiled |
1329 |
|
pattern, and the second is the name. For example, for this |
1330 |
|
pattern |
1331 |
|
|
1332 |
|
ab(?<xxx>\d+)... |
1333 |
|
|
1334 |
LIMITATIONS |
the number of the subpattern called "xxx" is 1. Given the |
1335 |
There are some size limitations in PCRE but it is hoped that |
number, you can then extract the substring directly, or use |
1336 |
they will never in practice be relevant. The maximum length |
one of the functions described in the previous section. For |
1337 |
of a compiled pattern is 65539 (sic) bytes. All values in |
convenience, there are also two functions that do the whole |
1338 |
repeating quantifiers must be less than 65536. The maximum |
job. |
1339 |
number of capturing subpatterns is 99. The maximum number |
|
1340 |
of all parenthesized subpatterns, including capturing sub- |
Most of the arguments of pcre_copy_named_substring() and |
1341 |
patterns, assertions, and other types of subpattern, is 200. |
pcre_get_named_substring() are the same as those for the |
1342 |
|
functions that extract by number, and so are not re- |
1343 |
|
described here. There are just two differences. |
1344 |
|
|
1345 |
|
First, instead of a substring number, a substring name is |
1346 |
|
given. Second, there is an extra argument, given at the |
1347 |
|
start, which is a pointer to the compiled pattern. This is |
1348 |
|
needed in order to gain access to the name-to-number trans- |
1349 |
|
lation table. |
1350 |
|
|
1351 |
|
These functions call pcre_get_stringnumber(), and if it |
1352 |
|
succeeds, they then call pcre_copy_substring() or |
1353 |
|
pcre_get_substring(), as appropriate. |
1354 |
|
|
1355 |
|
Last updated: 03 February 2003 |
1356 |
|
Copyright (c) 1997-2003 University of Cambridge. |
1357 |
|
----------------------------------------------------------------------------- |
1358 |
|
|
1359 |
The maximum length of a subject string is the largest posi- |
NAME |
1360 |
tive number that an integer variable can hold. However, PCRE |
PCRE - Perl-compatible regular expressions |
1361 |
uses recursion to handle subpatterns and indefinite repeti- |
|
1362 |
tion. This means that the available stack space may limit |
|
1363 |
the size of a subject string that can be processed by cer- |
PCRE CALLOUTS |
1364 |
tain patterns. |
|
1365 |
|
int (*pcre_callout)(pcre_callout_block *); |
1366 |
|
|
1367 |
|
PCRE provides a feature called "callout", which is a means |
1368 |
|
of temporarily passing control to the caller of PCRE in the |
1369 |
|
middle of pattern matching. The caller of PCRE provides an |
1370 |
|
external function by putting its entry point in the global |
1371 |
|
variable pcre_callout. By default, this variable contains |
1372 |
|
NULL, which disables all calling out. |
1373 |
|
|
1374 |
|
Within a regular expression, (?C) indicates the points at |
1375 |
|
which the external function is to be called. Different cal- |
1376 |
|
lout points can be identified by putting a number less than |
1377 |
|
256 after the letter C. The default value is zero. For |
1378 |
|
example, this pattern has two callout points: |
1379 |
|
|
1380 |
|
(?C1)9abc(?C2)def |
1381 |
|
|
1382 |
|
During matching, when PCRE reaches a callout point (and |
1383 |
|
pcre_callout is set), the external function is called. Its |
1384 |
|
only argument is a pointer to a pcre_callout block. This |
1385 |
|
contains the following variables: |
1386 |
|
|
1387 |
|
int version; |
1388 |
|
int callout_number; |
1389 |
|
int *offset_vector; |
1390 |
|
const char *subject; |
1391 |
|
int subject_length; |
1392 |
|
int start_match; |
1393 |
|
int current_position; |
1394 |
|
int capture_top; |
1395 |
|
int capture_last; |
1396 |
|
void *callout_data; |
1397 |
|
|
1398 |
|
The version field is an integer containing the version |
1399 |
|
number of the block format. The current version is zero. The |
1400 |
|
version number may change in future if additional fields are |
1401 |
|
added, but the intention is never to remove any of the |
1402 |
|
existing fields. |
1403 |
|
|
1404 |
|
The callout_number field contains the number of the callout, |
1405 |
|
as compiled into the pattern (that is, the number after ?C). |
1406 |
|
|
1407 |
|
The offset_vector field is a pointer to the vector of |
1408 |
|
offsets that was passed by the caller to pcre_exec(). The |
1409 |
|
contents can be inspected in order to extract substrings |
1410 |
|
that have been matched so far, in the same way as for |
1411 |
|
extracting substrings after a match has completed. |
1412 |
|
The subject and subject_length fields contain copies of the |
1413 |
|
values that were passed to pcre_exec(). |
1414 |
|
|
1415 |
|
The start_match field contains the offset within the subject |
1416 |
|
at which the current match attempt started. If the pattern |
1417 |
|
is not anchored, the callout function may be called several |
1418 |
|
times for different starting points. |
1419 |
|
|
1420 |
|
The current_position field contains the offset within the |
1421 |
|
subject of the current match pointer. |
1422 |
|
|
1423 |
|
The capture_top field contains the number of the highest |
1424 |
|
captured substring so far. |
1425 |
|
|
1426 |
|
The capture_last field contains the number of the most |
1427 |
|
recently captured substring. |
1428 |
|
|
1429 |
|
The callout_data field contains a value that is passed to |
1430 |
|
pcre_exec() by the caller specifically so that it can be |
1431 |
|
passed back in callouts. It is passed in the callout_data |
1432 |
|
field of the pcre_extra data structure. If no such data was |
1433 |
|
passed, the value of callout_data in a pcre_callout block is |
1434 |
|
NULL. There is a description of the pcre_extra structure in |
1435 |
|
the pcreapi documentation. |
1436 |
|
|
1437 |
|
|
1438 |
|
|
1439 |
|
RETURN VALUES |
1440 |
|
|
1441 |
|
The callout function returns an integer. If the value is |
1442 |
|
zero, matching proceeds as normal. If the value is greater |
1443 |
|
than zero, matching fails at the current point, but back- |
1444 |
|
tracking to test other possibilities goes ahead, just as if |
1445 |
|
a lookahead assertion had failed. If the value is less than |
1446 |
|
zero, the match is abandoned, and pcre_exec() returns the |
1447 |
|
value. |
1448 |
|
|
1449 |
|
Negative values should normally be chosen from the set of |
1450 |
|
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH |
1451 |
|
forces a standard "no match" failure. The error number |
1452 |
|
PCRE_ERROR_CALLOUT is reserved for use by callout functions; |
1453 |
|
it will never be used by PCRE itself. |
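
The three kinds of return value described above can be sketched as follows. This is only an illustration under stated assumptions: callout_block_sketch is a stand-in that mirrors the fields listed earlier (a real program would use the block type declared in pcre.h), SKETCH_ABANDON is a hypothetical placeholder for a negative PCRE_ERROR_xxx value, and the policy of treating callout_data as a pointer to an int position limit is invented for the example.

```c
#include <stddef.h>

/* Stand-in that mirrors the fields listed in the text; not the real
   declaration from pcre.h. */
typedef struct {
    int version;
    int callout_number;
    int *offset_vector;
    const char *subject;
    int subject_length;
    int start_match;
    int current_position;
    int capture_top;
    int capture_last;
    void *callout_data;
} callout_block_sketch;

/* Hypothetical negative code; real code would normally choose one of
   the PCRE_ERROR_xxx values. */
#define SKETCH_ABANDON (-1)

/* Assumed policy: callout_data, if set, points to an int that limits how
   far into the subject matching may go. */
int example_callout(callout_block_sketch *cb)
{
    const int *limit = (const int *)cb->callout_data;

    if (limit == NULL)
        return 0;                   /* zero: matching proceeds as normal */
    if (cb->current_position > *limit)
        return SKETCH_ABANDON;      /* negative: the match is abandoned */
    if (cb->callout_number == 2)
        return 1;                   /* positive: fail here, then backtrack */
    return 0;
}
```

With the real library, a function of this shape would be installed by storing its address in pcre_callout before calling pcre_exec().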
1454 |
|
|
1455 |
|
Last updated: 21 January 2003 |
1456 |
|
Copyright (c) 1997-2003 University of Cambridge. |
1457 |
|
----------------------------------------------------------------------------- |
1458 |
|
|
1459 |
|
NAME |
1460 |
|
PCRE - Perl-compatible regular expressions |
1461 |
|
|
1462 |
|
|
1463 |
DIFFERENCES FROM PERL |
DIFFERENCES FROM PERL |
|
The differences described here are with respect to Perl |
|
|
5.005. |
|
1464 |
|
|
1465 |
1. By default, a whitespace character is any character that |
This document describes the differences in the ways that |
1466 |
the C library function isspace() recognizes, though it is |
PCRE and Perl handle regular expressions. The differences |
1467 |
possible to compile PCRE with alternative character type |
described here are with respect to Perl 5.8. |
|
tables. Normally isspace() matches space, formfeed, newline, |
|
|
carriage return, horizontal tab, and vertical tab. Perl 5 no |
|
|
longer includes vertical tab in its set of whitespace char- |
|
|
acters. The \v escape that was in the Perl documentation for |
|
|
a long time was never in fact recognized. However, the char- |
|
|
acter itself was treated as whitespace at least up to 5.002. |
|
|
In 5.004 and 5.005 it does not match \s. |
|
1468 |
|
|
1469 |
2. PCRE does not allow repeat quantifiers on lookahead |
1. PCRE does not allow repeat quantifiers on lookahead |
1470 |
assertions. Perl permits them, but they do not mean what you |
assertions. Perl permits them, but they do not mean what you |
1471 |
might think. For example, (?!a){3} does not assert that the |
might think. For example, (?!a){3} does not assert that the |
1472 |
next three characters are not "a". It just asserts that the |
next three characters are not "a". It just asserts that the |
1473 |
next character is not "a" three times. |
next character is not "a" three times. |
1474 |
|
|
1475 |
3. Capturing subpatterns that occur inside negative looka- |
2. Capturing subpatterns that occur inside negative looka- |
1476 |
head assertions are counted, but their entries in the |
head assertions are counted, but their entries in the |
1477 |
offsets vector are never set. Perl sets its numerical vari- |
offsets vector are never set. Perl sets its numerical vari- |
1478 |
ables from any such patterns that are matched before the |
ables from any such patterns that are matched before the |
1480 |
only if the negative lookahead assertion contains just one |
only if the negative lookahead assertion contains just one |
1481 |
branch. |
branch. |
1482 |
|
|
1483 |
4. Though binary zero characters are supported in the sub- |
3. Though binary zero characters are supported in the sub- |
1484 |
ject string, they are not allowed in a pattern string |
ject string, they are not allowed in a pattern string |
1485 |
because it is passed as a normal C string, terminated by |
because it is passed as a normal C string, terminated by |
1486 |
zero. The escape sequence "\0" can be used in the pattern to |
zero. The escape sequence "\0" can be used in the pattern to |
1487 |
represent a binary zero. |
represent a binary zero. |
1488 |
|
|
1489 |
5. The following Perl escape sequences are not supported: |
4. The following Perl escape sequences are not supported: |
1490 |
\l, \u, \L, \U, \E, \Q. In fact these are implemented by |
\l, \u, \L, \U, \P, \p, and \X. In fact these are imple- |
1491 |
Perl's general string-handling and are not part of its pat- |
mented by Perl's general string-handling and are not part of |
1492 |
tern matching engine. |
its pattern matching engine. If any of these are encountered |
1493 |
|
by PCRE, an error is generated. |
1494 |
6. The Perl \G assertion is not supported as it is not |
|
1495 |
relevant to single pattern matches. |
5. PCRE does support the \Q...\E escape for quoting sub- |
1496 |
|
strings. Characters in between are treated as literals. This |
1497 |
7. Fairly obviously, PCRE does not support the (?{code}) |
is slightly different from Perl in that $ and @ are also |
1498 |
construction. |
handled as literals inside the quotes. In Perl, they cause |
1499 |
|
variable interpolation (but of course PCRE does not have |
1500 |
8. There are at the time of writing some oddities in Perl |
variables). Note the following examples: |
1501 |
5.005_02 concerned with the settings of captured strings |
|
1502 |
when part of a pattern is repeated. For example, matching |
Pattern PCRE matches Perl matches |
1503 |
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
|
1504 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 |
\Qabc$xyz\E abc$xyz abc followed by the |
1505 |
unset. However, if the pattern is changed to |
contents of $xyz |
1506 |
/^(aa(b(b))?)+$/ then $2 (and $3) get set. |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
1507 |
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
1508 |
In Perl 5.004 $2 is set in both cases, and that is also true |
|
1509 |
of PCRE. If in the future Perl changes to a consistent state |
In PCRE, the \Q...\E mechanism is not recognized inside a |
1510 |
that is different, PCRE may change to follow. |
character class. |
1511 |
|
|
1512 |
9. Another as yet unresolved discrepancy is that in Perl |
8. Fairly obviously, PCRE does not support the (?{code}) and |
1513 |
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string |
(?p{code}) constructions. However, there is some experimen- |
1514 |
"a", whereas in PCRE it does not. However, in both Perl and |
tal support for recursive patterns using the non-Perl items |
1515 |
PCRE /^(a)?a/ matched against "a" leaves $1 unset. |
(?R), (?number) and (?P>name). Also, the PCRE "callout" |
1516 |
|
feature allows an external function to be called during pat- |
1517 |
|
tern matching. |
1518 |
|
|
1519 |
|
9. There are some differences that are concerned with the |
1520 |
|
settings of captured strings when part of a pattern is |
1521 |
|
repeated. For example, matching "aba" against the pattern |
1522 |
|
/^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set |
1523 |
|
to "b". |
1524 |
|
|
1525 |
10. PCRE provides some extensions to the Perl regular |
10. PCRE provides some extensions to the Perl regular |
1526 |
expression facilities: |
expression facilities: |
1527 |
|
|
1528 |
(a) Although lookbehind assertions must match fixed length |
(a) Although lookbehind assertions must match fixed length |
1529 |
strings, each alternative branch of a lookbehind assertion |
strings, each alternative branch of a lookbehind assertion |
1530 |
can match a different length of string. Perl 5.005 requires |
can match a different length of string. Perl requires them |
1531 |
them all to have the same length. |
all to have the same length. |
1532 |
|
|
1533 |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not |
1534 |
set, the $ meta- character matches only at the very end of |
set, the $ meta-character matches only at the very end of |
1535 |
the string. |
the string. |
1536 |
|
|
1537 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
1538 |
with no special meaning is faulted. |
with no special meaning is faulted. |
1539 |
|
|
1540 |
(d) If PCRE_UNGREEDY is set, the greediness of the |
(d) If PCRE_UNGREEDY is set, the greediness of the repeti- |
1541 |
repetition quantifiers is inverted, that is, by default they |
tion quantifiers is inverted, that is, by default they are |
1542 |
are not greedy, but if followed by a question mark they are. |
not greedy, but if followed by a question mark they are. |
1543 |
|
|
1544 |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
1545 |
only at the start of the subject. |
only at the first matching position in the subject string. |
1546 |
|
|
1547 |
|
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and |
1548 |
|
PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl |
1549 |
|
equivalents. |
1550 |
|
|
1551 |
|
(g) The (?R), (?number), and (?P>name) constructs allow for |
1552 |
|
recursive pattern matching (Perl can do this using the |
1553 |
|
(?p{code}) construct, which PCRE cannot support.) |
1554 |
|
|
1555 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options |
(h) PCRE supports named capturing substrings, using the |
1556 |
for pcre_exec() have no Perl equivalents. |
Python syntax. |
1557 |
|
|
1558 |
|
(i) PCRE supports the possessive quantifier "++" syntax, |
1559 |
|
taken from Sun's Java package. |
1560 |
|
|
1561 |
|
(j) The (R) condition, for testing recursion, is a PCRE |
1562 |
|
extension. |
1563 |
|
|
1564 |
|
(k) The callout facility is PCRE-specific. |
1565 |
|
|
1566 |
|
Last updated: 03 February 2003 |
1567 |
|
Copyright (c) 1997-2003 University of Cambridge. |
1568 |
|
----------------------------------------------------------------------------- |
1569 |
|
|
1570 |
|
NAME |
1571 |
|
PCRE - Perl-compatible regular expressions |
1572 |
|
|
1573 |
|
|
1574 |
|
PCRE REGULAR EXPRESSION DETAILS |
1575 |
|
|
|
REGULAR EXPRESSION DETAILS |
|
1576 |
The syntax and semantics of the regular expressions sup- |
The syntax and semantics of the regular expressions sup- |
1577 |
ported by PCRE are described below. Regular expressions are |
ported by PCRE are described below. Regular expressions are |
1578 |
also described in the Perl documentation and in a number of |
also described in the Perl documentation and in a number of |
1579 |
other books, some of which have copious examples. Jeffrey |
other books, some of which have copious examples. Jeffrey |
1580 |
Friedl's "Mastering Regular Expressions", published by |
Friedl's "Mastering Regular Expressions", published by |
1581 |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. |
O'Reilly, covers them in great detail. The description here |
1582 |
The description here is intended as reference documentation. |
is intended as reference documentation. |
1583 |
|
|
1584 |
|
The basic operation of PCRE is on strings of bytes. However, |
1585 |
|
there is also support for UTF-8 character strings. To use |
1586 |
|
this support you must build PCRE to include UTF-8 support, |
1587 |
|
and then call pcre_compile() with the PCRE_UTF8 option. How |
1588 |
|
this affects the pattern matching is mentioned in several |
1589 |
|
places below. There is also a summary of UTF-8 features in |
1590 |
|
the section on UTF-8 support in the main pcre page. |
1591 |
|
|
1592 |
A regular expression is a pattern that is matched against a |
A regular expression is a pattern that is matched against a |
1593 |
subject string from left to right. Most characters stand for |
subject string from left to right. Most characters stand for |
1609 |
Outside square brackets, the meta-characters are as follows: |
Outside square brackets, the meta-characters are as follows: |
1610 |
|
|
1611 |
\ general escape character with several uses |
\ general escape character with several uses |
1612 |
^ assert start of subject (or line, in multiline |
^ assert start of string (or line, in multiline mode) |
1613 |
mode) |
$ assert end of string (or line, in multiline mode) |
|
$ assert end of subject (or line, in multiline mode) |
|
1614 |
. match any character except newline (by default) |
. match any character except newline (by default) |
1615 |
[ start character class definition |
[ start character class definition |
1616 |
| start of alternative branch |
| start of alternative branch |
1621 |
also quantifier minimizer |
also quantifier minimizer |
1622 |
* 0 or more quantifier |
* 0 or more quantifier |
1623 |
+ 1 or more quantifier |
+ 1 or more quantifier |
1624 |
|
also "possessive quantifier" |
1625 |
{ start min/max quantifier |
{ start min/max quantifier |
1626 |
|
|
1627 |
Part of a pattern that is in square brackets is called a |
Part of a pattern that is in square brackets is called a |
1631 |
\ general escape character |
\ general escape character |
1632 |
^ negate the class, but only if the first character |
^ negate the class, but only if the first character |
1633 |
- indicates character range |
- indicates character range |
1634 |
|
[ POSIX character class (only if followed by POSIX |
1635 |
|
syntax) |
1636 |
] terminates the character class |
] terminates the character class |
1637 |
|
|
1638 |
The following sections describe the use of each of the |
The following sections describe the use of each of the |
1639 |
meta-characters. |
meta-characters. |
1640 |
|
|
1641 |
|
|
|
|
|
1642 |
BACKSLASH |
BACKSLASH |
1643 |
|
|
1644 |
The backslash character has several uses. Firstly, if it is |
The backslash character has several uses. Firstly, if it is |
1645 |
followed by a non-alphameric character, it takes away any |
followed by a non-alphameric character, it takes away any |
1646 |
special meaning that character may have. This use of |
special meaning that character may have. This use of |
1647 |
backslash as an escape character applies both inside and |
backslash as an escape character applies both inside and |
1648 |
outside character classes. |
outside character classes. |
1649 |
|
|
1650 |
For example, if you want to match a "*" character, you write |
For example, if you want to match a * character, you write |
1651 |
"\*" in the pattern. This applies whether or not the follow- |
\* in the pattern. This escaping action applies whether or |
1652 |
ing character would otherwise be interpreted as a meta- |
not the following character would otherwise be interpreted |
1653 |
character, so it is always safe to precede a non-alphameric |
as a meta-character, so it is always safe to precede a non- |
1654 |
with "\" to specify that it stands for itself. In particu- |
alphameric with backslash to specify that it stands for |
1655 |
lar, if you want to match a backslash, you write "\\". |
itself. In particular, if you want to match a backslash, you |
1656 |
|
write \\. |
1657 |
|
|
1658 |
If a pattern is compiled with the PCRE_EXTENDED option, whi- |
If a pattern is compiled with the PCRE_EXTENDED option, whi- |
1659 |
tespace in the pattern (other than in a character class) and |
tespace in the pattern (other than in a character class) and |
1660 |
characters between a "#" outside a character class and the |
characters between a # outside a character class and the |
1661 |
next newline character are ignored. An escaping backslash |
next newline character are ignored. An escaping backslash |
1662 |
can be used to include a whitespace or "#" character as part |
can be used to include a whitespace or # character as part |
1663 |
of the pattern. |
of the pattern. |
1664 |
|
|
1665 |
|
If you want to remove the special meaning from a sequence of |
1666 |
|
characters, you can do so by putting them between \Q and \E. |
1667 |
|
This is different from Perl in that $ and @ are handled as |
1668 |
|
literals in \Q...\E sequences in PCRE, whereas in Perl, $ |
1669 |
|
and @ cause variable interpolation. Note the following exam- |
1670 |
|
ples: |
1671 |
|
|
1672 |
|
Pattern PCRE matches Perl matches |
1673 |
|
|
1674 |
|
\Qabc$xyz\E abc$xyz abc followed by the |
1675 |
|
|
1676 |
|
contents of $xyz |
1677 |
|
\Qabc\$xyz\E abc\$xyz abc\$xyz |
1678 |
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
1679 |
|
|
1680 |
|
The \Q...\E sequence is recognized both inside and outside |
1681 |
|
character classes. |
1682 |
|
|
1683 |
A second use of backslash provides a way of encoding non- |
A second use of backslash provides a way of encoding non- |
1684 |
printing characters in patterns in a visible manner. There |
printing characters in patterns in a visible manner. There |
1685 |
is no restriction on the appearance of non-printing charac- |
is no restriction on the appearance of non-printing charac- |
1688 |
usually easier to use one of the following escape sequences |
usually easier to use one of the following escape sequences |
1689 |
than the binary character it represents: |
than the binary character it represents: |
1690 |
|
|
1691 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
1692 |
\cx "control-x", where x is any character |
\cx "control-x", where x is any character |
1693 |
\e escape (hex 1B) |
\e escape (hex 1B) |
1694 |
\f formfeed (hex 0C) |
\f formfeed (hex 0C) |
1695 |
\n newline (hex 0A) |
\n newline (hex 0A) |
1696 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
1697 |
|
\t tab (hex 09) |
1698 |
\t tab (hex 09) |
\ddd character with octal code ddd, or backreference |
1699 |
\xhh character with hex code hh |
\xhh character with hex code hh |
1700 |
\ddd character with octal code ddd, or backreference |
\x{hhh..} character with hex code hhh... (UTF-8 mode only) |
1701 |
|
|
1702 |
The precise effect of "\cx" is as follows: if "x" is a lower |
The precise effect of \cx is as follows: if x is a lower |
1703 |
case letter, it is converted to upper case. Then bit 6 of |
case letter, it is converted to upper case. Then bit 6 of |
1704 |
the character (hex 40) is inverted. Thus "\cz" becomes hex |
the character (hex 40) is inverted. Thus \cz becomes hex |
1705 |
1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B. |
1A, but \c{ becomes hex 3B, while \c; becomes hex 7B. |
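
The two steps just described for \cx can be written out directly. The function name control_escape_value is made up for this sketch and is not part of the PCRE API; the case folding uses the default C locale.

```c
#include <ctype.h>

/* Returns the byte that \cx stands for, following the two steps in the
   text: fold a lower case letter to upper case, then invert bit 6
   (hex 40). */
int control_escape_value(int x)
{
    if (islower((unsigned char)x))
        x = toupper((unsigned char)x);
    return x ^ 0x40;
}
```

Called with 'z', '{' and ';' this yields hex 1A, 3B and 7B respectively, which agrees with the values given above.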
1706 |
|
|
1707 |
After "\x", up to two hexadecimal digits are read (letters |
After \x, from zero to two hexadecimal digits are read |
1708 |
can be in upper or lower case). |
(letters can be in upper or lower case). In UTF-8 mode, any |
1709 |
|
number of hexadecimal digits may appear between \x{ and }, |
1710 |
|
but the value of the character code must be less than 2**31 |
1711 |
|
(that is, the maximum hexadecimal value is 7FFFFFFF). If |
1712 |
|
characters other than hexadecimal digits appear between \x{ |
1713 |
|
and }, or if there is no terminating }, this form of escape |
1714 |
|
is not recognized. Instead, the initial \x will be inter- |
1715 |
|
preted as a basic hexadecimal escape, with no following |
1716 |
|
digits, giving a byte whose value is zero. |
1717 |
|
|
1718 |
|
Characters whose value is less than 256 can be defined by |
1719 |
|
either of the two syntaxes for \x when PCRE is in UTF-8 |
1720 |
|
mode. There is no difference in the way they are handled. |
1721 |
|
For example, \xdc is exactly the same as \x{dc}. |
1722 |
|
|
1723 |
After "\0" up to two further octal digits are read. In both |
After \0 up to two further octal digits are read. In both |
1724 |
cases, if there are fewer than two digits, just those that |
cases, if there are fewer than two digits, just those that |
1725 |
are present are used. Thus the sequence "\0\x\07" specifies |
are present are used. Thus the sequence \0\x\07 specifies |
1726 |
two binary zeros followed by a BEL character. Make sure you |
two binary zeros followed by a BEL character (code value 7). |
1727 |
supply two digits after the initial zero if the character |
Make sure you supply two digits after the initial zero if |
1728 |
that follows is itself an octal digit. |
the character that follows is itself an octal digit. |
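
The digit-counting rules for \x and \0 can be illustrated with a small decoder. This is a sketch of the rules as stated here, not PCRE's own code: decode_escape and hex_digit_value are hypothetical names, and the \x{...} form used in UTF-8 mode is deliberately left out.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit, upper or lower case. */
static int hex_digit_value(int c)
{
    if (isdigit((unsigned char)c))
        return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/* Decodes one \x or \0 escape starting at p. Returns the byte value and
   stores the number of pattern characters consumed in *len. Returns -1
   if p does not start with one of these two forms. */
int decode_escape(const char *p, size_t *len)
{
    size_t i = 2;
    int value = 0;
    int digits = 0;

    if (p[0] != '\\')
        return -1;
    if (p[1] == 'x') {
        /* zero, one or two hexadecimal digits */
        while (digits < 2 && isxdigit((unsigned char)p[i])) {
            value = value * 16 + hex_digit_value(p[i]);
            i++;
            digits++;
        }
    } else if (p[1] == '0') {
        /* up to two further octal digits after the initial zero */
        while (digits < 2 && p[i] >= '0' && p[i] <= '7') {
            value = value * 8 + (p[i] - '0');
            i++;
            digits++;
        }
    } else {
        return -1;
    }
    *len = i;
    return value;
}
```

Applied three times in succession to the pattern text \0\x\07, the decoder gives 0, 0 and 7, that is, two binary zeros followed by BEL.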
1729 |
|
|
1730 |
The handling of a backslash followed by a digit other than 0 |
The handling of a backslash followed by a digit other than 0 |
1731 |
is complicated. Outside a character class, PCRE reads it |
is complicated. Outside a character class, PCRE reads it |
1751 |
writing a tab |
writing a tab |
1752 |
\011 is always a tab |
\011 is always a tab |
1753 |
\0113 is a tab followed by the character "3" |
\0113 is a tab followed by the character "3" |
1754 |
\113 is the character with octal code 113 (since there |
\113 might be a back reference, otherwise the |
1755 |
can be no more than 99 back references) |
character with octal code 113 |
1756 |
\377 is a byte consisting entirely of 1 bits |
\377 might be a back reference, otherwise |
1757 |
|
the byte consisting entirely of 1 bits |
1758 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
1759 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
1760 |
|
|
1761 |
Note that octal values of 100 or greater must not be intro- |
Note that octal values of 100 or greater must not be intro- |
1762 |
duced by a leading zero, because no more than three octal |
duced by a leading zero, because no more than three octal |
1763 |
digits are ever read. |
digits are ever read. |
1764 |
All the sequences that define a single byte value can be |
|
1765 |
used both inside and outside character classes. In addition, |
All the sequences that define a single byte value or a sin- |
1766 |
inside a character class, the sequence "\b" is interpreted |
gle UTF-8 character (in UTF-8 mode) can be used both inside |
1767 |
as the backspace character (hex 08). Outside a character |
and outside character classes. In addition, inside a charac- |
1768 |
class it has a different meaning (see below). |
ter class, the sequence \b is interpreted as the backspace |
1769 |
|
character (hex 08). Outside a character class it has a dif- |
1770 |
|
ferent meaning (see below). |
1771 |
|
|
1772 |
The third use of backslash is for specifying generic charac- |
The third use of backslash is for specifying generic charac- |
1773 |
ter types: |
ter types: |
1777 |
\s any whitespace character |
\s any whitespace character |
1778 |
\S any character that is not a whitespace character |
\S any character that is not a whitespace character |
1779 |
\w any "word" character |
\w any "word" character |
1780 |
\W any "non-word" character |
\W any "non-word" character |
1781 |
|
|
1782 |
Each pair of escape sequences partitions the complete set of |
Each pair of escape sequences partitions the complete set of |
1783 |
characters into two disjoint sets. Any given character |
characters into two disjoint sets. Any given character |
1784 |
matches one, and only one, of each pair. |
matches one, and only one, of each pair. |
1785 |
|
|
1786 |
|
In UTF-8 mode, characters with values greater than 255 never |
1787 |
|
match \d, \s, or \w, and always match \D, \S, and \W. |
1788 |
|
|
1789 |
|
For compatibility with Perl, \s does not match the VT |
1790 |
|
character (code 11). This makes it different from the POSIX |
1791 |
|
"space" class. The \s characters are HT (9), LF (10), FF |
1792 |
|
(12), CR (13), and space (32). |
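
The difference from the C library's isspace() can be made concrete with a short sketch. It assumes the default C locale and PCRE's default character tables; is_pcre_space and differs_from_isspace are names invented for the illustration.

```c
#include <ctype.h>

/* The \s set listed above: HT (9), LF (10), FF (12), CR (13), space (32). */
int is_pcre_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Returns 1 when isspace() and the \s set disagree about c. */
int differs_from_isspace(int c)
{
    return (isspace((unsigned char)c) != 0) != is_pcre_space(c);
}
```

In the C locale the only 7-bit code on which the two disagree is VT (11).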
1793 |
|
|
1794 |
A "word" character is any letter or digit or the underscore |
A "word" character is any letter or digit or the underscore |
1795 |
character, that is, any character which can be part of a |
character, that is, any character which can be part of a |
1796 |
Perl "word". The definition of letters and digits is con- |
Perl "word". The definition of letters and digits is con- |
1797 |
trolled by PCRE's character tables, and may vary if locale- |
trolled by PCRE's character tables, and may vary if locale- |
1798 |
specific matching is taking place (see "Locale support" |
specific matching is taking place (see "Locale support" in |
1799 |
above). For example, in the "fr" (French) locale, some char- |
the pcreapi page). For example, in the "fr" (French) locale, |
1800 |
acter codes greater than 128 are used for accented letters, |
some character codes greater than 128 are used for accented |
1801 |
and these are matched by \w. |
letters, and these are matched by \w. |
1802 |
|
|
1803 |
These character type sequences can appear both inside and |
These character type sequences can appear both inside and |
1804 |
outside character classes. They each match one character of |
outside character classes. They each match one character of |
1813 |
for more complicated assertions is described below. The |
for more complicated assertions is described below. The |
1814 |
backslashed assertions are |
backslashed assertions are |
1815 |
|
|
1816 |
\b word boundary |
\b matches at a word boundary |
1817 |
\B not a word boundary |
\B matches when not at a word boundary |
1818 |
\A start of subject (independent of multiline mode) |
\A matches at start of subject |
1819 |
\Z end of subject or newline at end (independent of |
\Z matches at end of subject or before newline at end |
1820 |
multiline mode) |
\z matches at end of subject |
1821 |
\z end of subject (independent of multiline mode) |
\G matches at first matching position in subject |
1822 |
|
|
1823 |
These assertions may not appear in character classes (but |
These assertions may not appear in character classes (but |
1824 |
note that "\b" has a different meaning, namely the backspace |
note that \b has a different meaning, namely the backspace |
1825 |
character, inside a character class). |
character, inside a character class). |
1826 |
|
|
1827 |
A word boundary is a position in the subject string where |
A word boundary is a position in the subject string where |
1828 |
the current character and the previous character do not both |
the current character and the previous character do not both |
1829 |
match \w or \W (i.e. one matches \w and the other matches |
match \w or \W (i.e. one matches \w and the other matches |
1830 |
\W), or the start or end of the string if the first or last |
\W), or the start or end of the string if the first or last |
1831 |
character matches \w, respectively. |
character matches \w, respectively. |
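The word boundary rule can be sketched as follows. This is only an illustration: it uses Python's re module rather than PCRE itself, on the assumption that \b and \B behave the same way in both for plain ASCII subjects, and the subject string is invented for the example.

```python
import re

# \b is true where one neighbouring character matches \w and the other
# matches \W, or at the start/end of the string next to a \w character.
subject = "cat concat cat."

# "cat" as a whole word: found at offsets 0 and 11; the "cat" inside
# "concat" is skipped because "n" and "c" both match \w.
whole_words = [m.start() for m in re.finditer(r"\bcat\b", subject)]

# \B requires that the position is NOT a word boundary, so only the
# "cat" that follows "con" qualifies.
inside_words = [m.start() for m in re.finditer(r"\Bcat", subject)]

print(whole_words, inside_words)
```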
|
|
|
1832 |
The \A, \Z, and \z assertions differ from the traditional |
The \A, \Z, and \z assertions differ from the traditional |
1833 |
circumflex and dollar (described below) in that they only |
circumflex and dollar (described below) in that they only |
1834 |
ever match at the very start and end of the subject string, |
ever match at the very start and end of the subject string, |
1835 |
whatever options are set. They are not affected by the |
whatever options are set. Thus, they are independent of mul- |
1836 |
PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu- |
tiline mode. |
1837 |
ment of pcre_exec() is non-zero, \A can never match. The |
|
1838 |
|
They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL |
1839 |
|
options. If the startoffset argument of pcre_exec() is non- |
1840 |
|
zero, indicating that matching is to start at a point other |
1841 |
|
than the beginning of the subject, \A can never match. The |
1842 |
difference between \Z and \z is that \Z matches before a |
difference between \Z and \z is that \Z matches before a |
1843 |
newline that is the last character of the string as well as |
newline that is the last character of the string as well as |
1844 |
at the end of the string, whereas \z matches only at the |
at the end of the string, whereas \z matches only at the |
1845 |
end. |
end. |
1846 |
|
|
1847 |
|
The \G assertion is true only when the current matching |
1848 |
|
position is at the start point of the match, as specified by |
1849 |
|
the startoffset argument of pcre_exec(). It differs from \A |
1850 |
|
when the value of startoffset is non-zero. By calling |
1851 |
|
pcre_exec() multiple times with appropriate arguments, you |
1852 |
|
can mimic Perl's /g option, and it is in this kind of imple- |
1853 |
|
mentation where \G can be useful. |
1854 |
|
|
1855 |
|
Note, however, that PCRE's interpretation of \G, as the |
1856 |
|
start of the current match, is subtly different from Perl's, |
1857 |
|
which defines it as the end of the previous match. In Perl, |
1858 |
|
these can be different when the previously matched string |
1859 |
|
was empty. Because PCRE does just one match at a time, it |
1860 |
|
cannot reproduce this behaviour. |
1861 |
|
|
1862 |
|
If all the alternatives of a pattern begin with \G, the |
1863 |
|
expression is anchored to the starting match position, and |
1864 |
|
the "anchored" flag is set in the compiled regular expres- |
1865 |
|
sion. |
1866 |
|
|
1867 |
|
|
1868 |
CIRCUMFLEX AND DOLLAR |
CIRCUMFLEX AND DOLLAR |
1869 |
|
|
1870 |
Outside a character class, in the default matching mode, the |
Outside a character class, in the default matching mode, the |
1871 |
circumflex character is an assertion which is true only if |
circumflex character is an assertion which is true only if |
1872 |
the current matching point is at the start of the subject |
the current matching point is at the start of the subject |
1873 |
string. If the startoffset argument of pcre_exec() is non- |
string. If the startoffset argument of pcre_exec() is non- |
1874 |
zero, circumflex can never match. Inside a character class, |
zero, circumflex can never match if the PCRE_MULTILINE |
1875 |
circumflex has an entirely different meaning (see below). |
option is unset. Inside a character class, circumflex has an |
1876 |
|
entirely different meaning (see below). |
1877 |
|
|
1878 |
Circumflex need not be the first character of the pattern if |
Circumflex need not be the first character of the pattern if |
1879 |
a number of alternatives are involved, but it should be the |
a number of alternatives are involved, but it should be the |
1895 |
|
|
1896 |
The meaning of dollar can be changed so that it matches only |
The meaning of dollar can be changed so that it matches only |
1897 |
at the very end of the string, by setting the |
at the very end of the string, by setting the |
1898 |
PCRE_DOLLAR_ENDONLY option at compile or matching time. This |
PCRE_DOLLAR_ENDONLY option at compile time. This does not |
1899 |
does not affect the \Z assertion. |
affect the \Z assertion. |
1900 |
|
|
1901 |
The meanings of the circumflex and dollar characters are |
The meanings of the circumflex and dollar characters are |
1902 |
changed if the PCRE_MULTILINE option is set. When this is |
changed if the PCRE_MULTILINE option is set. When this is |
1903 |
the case, they match immediately after and immediately |
the case, they match immediately after and immediately |
1904 |
before an internal "\n" character, respectively, in addition |
before an internal newline character, respectively, in addi- |
1905 |
to matching at the start and end of the subject string. For |
tion to matching at the start and end of the subject string. |
1906 |
example, the pattern /^abc$/ matches the subject string |
For example, the pattern /^abc$/ matches the subject string |
1907 |
"def\nabc" in multiline mode, but not otherwise. Conse- |
"def\nabc" in multiline mode, but not otherwise. Conse- |
1908 |
quently, patterns that are anchored in single line mode |
quently, patterns that are anchored in single line mode |
1909 |
because all branches start with "^" are not anchored in mul- |
because all branches start with ^ are not anchored in multi- |
1910 |
tiline mode, and a match for circumflex is possible when the |
line mode, and a match for circumflex is possible when the |
1911 |
startoffset argument of pcre_exec() is non-zero. The |
startoffset argument of pcre_exec() is non-zero. The |
1912 |
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is |
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is |
1913 |
set. |
set. |
1914 |
|
|
1915 |
Note that the sequences \A, \Z, and \z can be used to match |
Note that the sequences \A, \Z, and \z can be used to match |
1916 |
the start and end of the subject in both modes, and if all |
the start and end of the subject in both modes, and if all |
1917 |
branches of a pattern start with \A is it always anchored, |
branches of a pattern start with \A it is always anchored, |
1918 |
whether PCRE_MULTILINE is set or not. |
whether PCRE_MULTILINE is set or not. |
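A minimal sketch of the /^abc$/ example, written with Python's re module as a stand-in for PCRE. It assumes that re.MULTILINE corresponds to PCRE_MULTILINE for this case; the variable names are illustrative only.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject, so the
# "abc" on the second line is not found.
default_match = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after the internal newline.
multiline_match = re.search(r"^abc$", subject, re.MULTILINE)

# \A is independent of multiline mode: it still matches only at the start
# of the subject, so this search fails even with the option set.
anchored_match = re.search(r"\Aabc", subject, re.MULTILINE)

print(default_match, multiline_match, anchored_match)
```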
1919 |
|
|
1920 |
|
|
|
|
|
1921 |
FULL STOP (PERIOD, DOT) |
FULL STOP (PERIOD, DOT) |
1922 |
|
|
1923 |
Outside a character class, a dot in the pattern matches any |
Outside a character class, a dot in the pattern matches any |
1924 |
one character in the subject, including a non-printing char- |
one character in the subject, including a non-printing char- |
1925 |
acter, but not (by default) newline. If the PCRE_DOTALL |
acter, but not (by default) newline. In UTF-8 mode, a dot |
1926 |
option is set, then dots match newlines as well. The han- |
matches any UTF-8 character, which might be more than one |
1927 |
dling of dot is entirely independent of the handling of cir- |
byte long, except (by default) for newline. If the |
1928 |
cumflex and dollar, the only relationship being that they |
PCRE_DOTALL option is set, dots match newlines as well. The |
1929 |
both involve newline characters. Dot has no special meaning |
handling of dot is entirely independent of the handling of |
1930 |
|
circumflex and dollar, the only relationship being that they |
1931 |
|
both involve newline characters. Dot has no special meaning |
1932 |
in a character class. |
in a character class. |
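The effect of the dot with and without the "dotall" setting might be sketched like this. Python's re module is used in place of PCRE, assuming that re.DOTALL plays the role of PCRE_DOTALL; the subject strings are invented.

```python
import re

# By default a dot does not match a newline.
without_dotall = re.search(r"a.b", "a\nb")

# With the dotall option set, the newline is matched as well.
with_dotall = re.search(r"a.b", "a\nb", re.DOTALL)

# Other non-printing characters, such as a tab, are matched in either mode.
tab_match = re.search(r"a.b", "a\tb")

print(without_dotall, with_dotall, tab_match)
```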
1933 |
|
|
1934 |
|
|
1935 |
|
|
1936 |
|
MATCHING A SINGLE BYTE |
1937 |
|
|
1938 |
|
Outside a character class, the escape sequence \C matches |
1939 |
|
any one byte, both in and out of UTF-8 mode. Unlike a dot, |
1940 |
|
it always matches a newline. The feature is provided in Perl |
1941 |
|
in order to match individual bytes in UTF-8 mode. Because |
1942 |
|
it breaks up UTF-8 characters into individual bytes, what |
1943 |
|
remains in the string may be a malformed UTF-8 string. For |
1944 |
|
this reason it is best avoided. |
1945 |
|
|
1946 |
|
PCRE does not allow \C to appear in lookbehind assertions |
1947 |
|
(see below), because in UTF-8 mode it makes it impossible to |
1948 |
|
calculate the length of the lookbehind. |
1949 |
|
|
1950 |
|
|
1951 |
SQUARE BRACKETS |
SQUARE BRACKETS |
1952 |
|
|
1953 |
An opening square bracket introduces a character class, ter- |
An opening square bracket introduces a character class, ter- |
1954 |
minated by a closing square bracket. A closing square |
minated by a closing square bracket. A closing square |
1955 |
bracket on its own is not special. If a closing square |
bracket on its own is not special. If a closing square |
1957 |
the first data character in the class (after an initial cir- |
the first data character in the class (after an initial cir- |
1958 |
cumflex, if present) or escaped with a backslash. |
cumflex, if present) or escaped with a backslash. |
1959 |
|
|
1960 |
A character class matches a single character in the subject; |
A character class matches a single character in the subject. |
1961 |
the character must be in the set of characters defined by |
In UTF-8 mode, the character may occupy more than one byte. |
1962 |
the class, unless the first character in the class is a cir- |
A matched character must be in the set of characters defined |
1963 |
cumflex, in which case the subject character must not be in |
by the class, unless the first character in the class defin- |
1964 |
the set defined by the class. If a circumflex is actually |
ition is a circumflex, in which case the subject character |
1965 |
required as a member of the class, ensure it is not the |
must not be in the set defined by the class. If a circumflex |
1966 |
first character, or escape it with a backslash. |
is actually required as a member of the class, ensure it is |
1967 |
|
not the first character, or escape it with a backslash. |
1968 |
|
|
1969 |
For example, the character class [aeiou] matches any lower |
For example, the character class [aeiou] matches any lower |
1970 |
case vowel, while [^aeiou] matches any character that is not |
case vowel, while [^aeiou] matches any character that is not |
1975 |
string, and fails if the current pointer is at the end of |
string, and fails if the current pointer is at the end of |
1976 |
the string. |
the string. |
1977 |
|
|
1978 |
|
In UTF-8 mode, characters with values greater than 255 can |
1979 |
|
be included in a class as a literal string of bytes, or by |
1980 |
|
using the \x{ escaping mechanism. |
1981 |
|
|
1982 |
When caseless matching is set, any letters in a class |
When caseless matching is set, any letters in a class |
1983 |
represent both their upper case and lower case versions, so |
represent both their upper case and lower case versions, so |
1984 |
for example, a caseless [aeiou] matches "A" as well as "a", |
for example, a caseless [aeiou] matches "A" as well as "a", |
1985 |
and a caseless [^aeiou] does not match "A", whereas a case- |
and a caseless [^aeiou] does not match "A", whereas a case- |
1986 |
ful version would. |
ful version would. PCRE does not support the concept of case |
1987 |
|
for characters with values greater than 255. |
1988 |
The newline character is never treated in any special way in |
The newline character is never treated in any special way in |
1989 |
character classes, whatever the setting of the PCRE_DOTALL |
character classes, whatever the setting of the PCRE_DOTALL |
1990 |
or PCRE_MULTILINE options is. A class such as [^a] will |
or PCRE_MULTILINE options is. A class such as [^a] will |
2008 |
separate characters. The octal or hexadecimal representation |
separate characters. The octal or hexadecimal representation |
2009 |
of "]" can also be used to end a range. |
of "]" can also be used to end a range. |
2010 |
|
|
2011 |
Ranges operate in ASCII collating sequence. They can also be |
Ranges operate in the collating sequence of character |
2012 |
used for characters specified numerically, for example |
values. They can also be used for characters specified |
2013 |
[\000-\037]. If a range that includes letters is used when |
numerically, for example [\000-\037]. In UTF-8 mode, ranges |
2014 |
caseless matching is set, it matches the letters in either |
can include characters whose values are greater than 255, |
2015 |
case. For example, [W-c] is equivalent to [][\^_`wxyzabc], |
for example [\x{100}-\x{2ff}]. |
2016 |
matched caselessly, and if character tables for the "fr" |
|
2017 |
locale are in use, [\xc8-\xcb] matches accented E characters |
If a range that includes letters is used when caseless |
2018 |
in both cases. |
matching is set, it matches the letters in either case. For |
2019 |
|
example, [W-c] is equivalent to [][\^_`wxyzabc], matched |
2020 |
|
caselessly, and if character tables for the "fr" locale are |
2021 |
|
in use, [\xc8-\xcb] matches accented E characters in both |
2022 |
|
cases. |
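The [W-c] example can be checked with a short sketch. It relies on Python's re module instead of PCRE and assumes that re.IGNORECASE treats ASCII ranges the same way as PCRE_CASELESS; the test characters are chosen only for illustration.

```python
import re

# The range W-c covers W X Y Z [ \ ] ^ _ ` a b c in ASCII order.
# Matched caselessly, the letters in it also stand for w x y z and A B C.
caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

inside = "WwZz[]^_`aAcC"
outside = "VvdD@{"

accepted = [ch for ch in inside if caseless_range.fullmatch(ch)]
wrongly_accepted = [ch for ch in outside if caseless_range.fullmatch(ch)]

print(accepted, wrongly_accepted)
```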
2023 |
|
|
2024 |
The character types \d, \D, \s, \S, \w, and \W may also |
The character types \d, \D, \s, \S, \w, and \W may also |
2025 |
appear in a character class, and add the characters that |
appear in a character class, and add the characters that |
2035 |
classes, but it does no harm if they are escaped. |
classes, but it does no harm if they are escaped. |
2036 |
|
|
2037 |
|
|
2038 |
|
POSIX CHARACTER CLASSES |
2039 |
|
|
2040 |
|
Perl supports the POSIX notation for character classes, |
2041 |
|
which uses names enclosed by [: and :] within the enclosing |
2042 |
|
square brackets. PCRE also supports this notation. For exam- |
2043 |
|
ple, |
2044 |
|
|
2045 |
|
[01[:alpha:]%] |
2046 |
|
|
2047 |
|
matches "0", "1", any alphabetic character, or "%". The sup- |
2048 |
|
ported class names are |
2049 |
|
|
2050 |
|
alnum letters and digits |
2051 |
|
alpha letters |
2052 |
|
ascii character codes 0 - 127 |
2053 |
|
blank space or tab only |
2054 |
|
cntrl control characters |
2055 |
|
digit decimal digits (same as \d) |
2056 |
|
graph printing characters, excluding space |
2057 |
|
lower lower case letters |
2058 |
|
print printing characters, including space |
2059 |
|
punct printing characters, excluding letters and digits |
2060 |
|
space white space (not quite the same as \s) |
2061 |
|
upper upper case letters |
2062 |
|
word "word" characters (same as \w) |
2063 |
|
xdigit hexadecimal digits |
2064 |
|
|
2065 |
|
The "space" characters are HT (9), LF (10), VT (11), FF |
2066 |
|
(12), CR (13), and space (32). Notice that this list |
2067 |
|
includes the VT character (code 11). This makes "space" dif- |
2068 |
|
ferent to \s, which does not include VT (for Perl compati- |
2069 |
|
bility). |
2070 |
|
|
2071 |
|
The name "word" is a Perl extension, and "blank" is a GNU |
2072 |
|
extension from Perl 5.8. Another Perl extension is negation, |
2073 |
|
which is indicated by a ^ character after the colon. For |
2074 |
|
example, |
2075 |
|
|
2076 |
|
[12[:^digit:]] |
2077 |
|
|
2078 |
|
matches "1", "2", or any non-digit. PCRE (and Perl) also |
2079 |
|
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a |
2080 |
|
"collating element", but these are not supported, and an |
2081 |
|
error is given if they are encountered. |
2082 |
|
|
2083 |
|
In UTF-8 mode, characters with values greater than 255 do |
2084 |
|
not match any of the POSIX character classes. |
2085 |
|
|
2086 |
|
|
2087 |
VERTICAL BAR |
VERTICAL BAR |
2088 |
|
|
2089 |
Vertical bar characters are used to separate alternative |
Vertical bar characters are used to separate alternative |
2090 |
patterns. For example, the pattern |
patterns. For example, the pattern |
2091 |
|
|
2101 |
subpattern. |
subpattern. |
2102 |
|
|
2103 |
|
|
|
|
|
2104 |
INTERNAL OPTION SETTING |
INTERNAL OPTION SETTING |
2105 |
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, |
|
2106 |
and PCRE_EXTENDED can be changed from within the pattern by |
The settings of the PCRE_CASELESS, PCRE_MULTILINE, |
2107 |
a sequence of Perl option letters enclosed between "(?" and |
PCRE_DOTALL, and PCRE_EXTENDED options can be changed from |
2108 |
")". The option letters are |
within the pattern by a sequence of Perl option letters |
2109 |
|
enclosed between "(?" and ")". The option letters are |
2110 |
|
|
2111 |
i for PCRE_CASELESS |
i for PCRE_CASELESS |
2112 |
m for PCRE_MULTILINE |
m for PCRE_MULTILINE |
2121 |
If a letter appears both before and after the hyphen, the |
If a letter appears both before and after the hyphen, the |
2122 |
option is unset. |
option is unset. |
2123 |
|
|
2124 |
The scope of these option changes depends on where in the |
When an option change occurs at top level (that is, not |
2125 |
pattern the setting occurs. For settings that are outside |
inside subpattern parentheses), the change applies to the |
2126 |
any subpattern (defined below), the effect is the same as if |
remainder of the pattern that follows. If the change is |
2127 |
the options were set or unset at the start of matching. The |
placed right at the start of a pattern, PCRE extracts it |
2128 |
following patterns all behave in exactly the same way: |
into the global options (and it will therefore show up in |
2129 |
|
data extracted by the pcre_fullinfo() function). |
2130 |
(?i)abc |
|
2131 |
a(?i)bc |
An option change within a subpattern affects only that part |
2132 |
ab(?i)c |
of the current pattern that follows it, so |
|
abc(?i) |
|
|
|
|
|
which in turn is the same as compiling the pattern abc with |
|
|
PCRE_CASELESS set. In other words, such "top level" set- |
|
|
tings apply to the whole pattern (unless there are other |
|
|
changes inside subpatterns). If there is more than one set- |
|
|
ting of the same option at top level, the rightmost setting |
|
|
is used. |
|
|
|
|
|
If an option change occurs inside a subpattern, the effect |
|
|
is different. This is a change of behaviour in Perl 5.005. |
|
|
An option change inside a subpattern affects only that part |
|
|
of the subpattern that follows it, so |
|
2133 |
|
|
2134 |
(a(?i)b)c |
(a(?i)b)c |
2135 |
|
|
2156 |
even when it is at top level. It is best put at the start. |
even when it is at top level. It is best put at the start. |
2157 |
|
|
2158 |
|
|
|
|
|
2159 |
SUBPATTERNS |
SUBPATTERNS |
2160 |
|
|
2161 |
Subpatterns are delimited by parentheses (round brackets), |
Subpatterns are delimited by parentheses (round brackets), |
2162 |
which can be nested. Marking part of a pattern as a subpat- |
which can be nested. Marking part of a pattern as a subpat- |
2163 |
tern does two things: |
tern does two things: |
2185 |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
2186 |
|
|
2187 |
the captured substrings are "red king", "red", and "king", |
the captured substrings are "red king", "red", and "king", |
2188 |
and are numbered 1, 2, and 3. |
and are numbered 1, 2, and 3, respectively. |
2189 |
|
|
2190 |
The fact that plain parentheses fulfil two functions is not |
The fact that plain parentheses fulfil two functions is not |
2191 |
always helpful. There are often times when a grouping sub- |
always helpful. There are often times when a grouping sub- |
2192 |
pattern is required without a capturing requirement. If an |
pattern is required without a capturing requirement. If an |
2193 |
opening parenthesis is followed by "?:", the subpattern does |
opening parenthesis is followed by a question mark and a |
2194 |
not do any capturing, and is not counted when computing the |
colon, the subpattern does not do any capturing, and is not |
2195 |
number of any subsequent capturing subpatterns. For example, |
counted when computing the number of any subsequent captur- |
2196 |
if the string "the white queen" is matched against the pat- |
ing subpatterns. For example, if the string "the white |
2197 |
tern |
queen" is matched against the pattern |
2198 |
|
|
2199 |
the ((?:red|white) (king|queen)) |
the ((?:red|white) (king|queen)) |
2200 |
|
|
2201 |
the captured substrings are "white queen" and "queen", and |
the captured substrings are "white queen" and "queen", and |
2202 |
are numbered 1 and 2. The maximum number of captured sub- |
are numbered 1 and 2. The maximum number of capturing sub- |
2203 |
strings is 99, and the maximum number of all subpatterns, |
patterns is 65535, and the maximum depth of nesting of all |
2204 |
both capturing and non-capturing, is 200. |
subpatterns, both capturing and non-capturing, is 200. |
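The numbering of captured substrings in the two patterns above can be illustrated with a small sketch. It uses Python's re module, on the assumption that capturing and (?: grouping are numbered identically in PCRE.

```python
import re

# Plain parentheses: three capturing subpatterns, numbered by the
# position of their opening parenthesis.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# With (?: the colour alternation groups without capturing, so only two
# subpatterns are numbered.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

print(capturing.groups())
print(non_capturing.groups())
```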
2205 |
|
|
2206 |
As a convenient shorthand, if any option settings are |
As a convenient shorthand, if any option settings are |
2207 |
required at the start of a non-capturing subpattern, the |
required at the start of a non-capturing subpattern, the |
2218 |
the above patterns match "SUNDAY" as well as "Saturday". |
the above patterns match "SUNDAY" as well as "Saturday". |
2219 |
|
|
2220 |
|
|
2221 |
|
NAMED SUBPATTERNS |
2222 |
|
|
2223 |
|
Identifying capturing parentheses by number is simple, but |
2224 |
|
it can be very hard to keep track of the numbers in compli- |
2225 |
|
cated regular expressions. Furthermore, if an expression is |
2226 |
|
modified, the numbers may change. To help with the diffi- |
2227 |
|
culty, PCRE supports the naming of subpatterns, something |
2228 |
|
that Perl does not provide. The Python syntax (?P<name>...) |
2229 |
|
is used. Names consist of alphanumeric characters and under- |
2230 |
|
scores, and must be unique within a pattern. |
2231 |
|
|
2232 |
|
Named capturing parentheses are still allocated numbers as |
2233 |
|
well as names. The PCRE API provides function calls for |
2234 |
|
extracting the name-to-number translation table from a com- |
2235 |
|
piled pattern. For further details see the pcreapi documen- |
2236 |
|
tation. |
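Because the (?P<name>...) syntax comes from Python, the idea can be sketched directly with Python's re module. The pattern, the names "year" and "month", and the subject string are hypothetical, and the groupindex attribute is Python's own name-to-number table, not part of the PCRE API.

```python
import re

# Two named capturing subpatterns; each is still allocated a number.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = date.fullmatch("2003-02")

# Python exposes the name-to-number translation table as groupindex.
name_to_number = dict(date.groupindex)

# A named subpattern can be read by name or by its number.
year_by_name = match.group("year")
year_by_number = match.group(1)

print(name_to_number, year_by_name, year_by_number)
```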
2237 |
|
|
2238 |
|
|
2239 |
REPETITION |
REPETITION |
2240 |
|
|
2241 |
Repetition is specified by quantifiers, which can follow any |
Repetition is specified by quantifiers, which can follow any |
2242 |
of the following items: |
of the following items: |
2243 |
|
|
2244 |
|
a literal data character |
|
a single character, possibly escaped |
|
2245 |
the . metacharacter |
the . metacharacter |
2246 |
|
the \C escape sequence |
2247 |
|
escapes such as \d that match single characters |
2248 |
a character class |
a character class |
2249 |
a back reference (see next section) |
a back reference (see next section) |
2250 |
a parenthesized subpattern (unless it is an assertion - |
a parenthesized subpattern (unless it is an assertion) |
|
see below) |
|
2251 |
|
|
2252 |
The general repetition quantifier specifies a minimum and |
The general repetition quantifier specifies a minimum and |
2253 |
maximum number of permitted matches, by giving the two |
maximum number of permitted matches, by giving the two |
2276 |
as a literal character. For example, {,6} is not a quantif- |
as a literal character. For example, {,6} is not a quantif- |
2277 |
ier, but a literal string of four characters. |
ier, but a literal string of four characters. |
2278 |
|
|
2279 |
|
In UTF-8 mode, quantifiers apply to UTF-8 characters rather |
2280 |
|
than to individual bytes. Thus, for example, \x{100}{2} |
2281 |
|
matches two UTF-8 characters, each of which is represented |
2282 |
|
by a two-byte sequence. |
2283 |
|
|
2284 |
The quantifier {0} is permitted, causing the expression to |
The quantifier {0} is permitted, causing the expression to |
2285 |
behave as if the previous item and the quantifier were not |
behave as if the previous item and the quantifier were not |
2286 |
present. |
present. |
2319 |
|
|
2320 |
/* first comment */ not comment /* second comment */ |
/* first comment */ not comment /* second comment */ |
2321 |
|
|
2322 |
fails, because it matches the entire string due to the |
fails, because it matches the entire string owing to the |
2323 |
greediness of the .* item. |
greediness of the .* item. |
2324 |
|
|
2325 |
However, if a quantifier is followed by a question mark, |
However, if a quantifier is followed by a question mark, it |
2326 |
then it ceases to be greedy, and instead matches the minimum |
ceases to be greedy, and instead matches the minimum number |
2327 |
number of times possible, so the pattern |
of times possible, so the pattern |
2328 |
|
|
2329 |
/\*.*?\*/ |
/\*.*?\*/ |
2330 |
|
|
2341 |
that is the only way the rest of the pattern matches. |
that is the only way the rest of the pattern matches. |
2342 |
|
|
2343 |
If the PCRE_UNGREEDY option is set (an option which is not |
If the PCRE_UNGREEDY option is set (an option which is not |
2344 |
available in Perl) then the quantifiers are not greedy by |
available in Perl), the quantifiers are not greedy by |
2345 |
default, but individual ones can be made greedy by following |
default, but individual ones can be made greedy by following |
2346 |
them with a question mark. In other words, it inverts the |
them with a question mark. In other words, it inverts the |
2347 |
default behaviour. |
default behaviour. |
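The difference between greedy and non-greedy repetition can be sketched as follows, using Python's re module as a stand-in for PCRE. The subject string is invented, and no equivalent of PCRE_UNGREEDY is shown because Python's re module is not assumed to have one.

```python
import re

subject = "/* one */ code /* two */"

# Greedy .* runs to the last "*/", so the match spans the whole string.
greedy = re.search(r"/\*.*\*/", subject).group()

# Following the quantifier with a question mark makes it match as few
# characters as possible, so each comment is found separately.
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy)
print(lazy)
```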
2350 |
repeat count that is greater than 1 or with a limited max- |
repeat count that is greater than 1 or with a limited max- |
2351 |
imum, more store is required for the compiled pattern, in |
imum, more store is required for the compiled pattern, in |
2352 |
proportion to the size of the minimum or maximum. |
proportion to the size of the minimum or maximum. |
|
|
|
2353 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL |
2354 |
option (equivalent to Perl's /s) is set, thus allowing the . |
option (equivalent to Perl's /s) is set, thus allowing the . |
2355 |
to match newlines, then the pattern is implicitly anchored, |
to match newlines, the pattern is implicitly anchored, |
2356 |
because whatever follows will be tried against every charac- |
because whatever follows will be tried against every charac- |
2357 |
ter position in the subject string, so there is no point in |
ter position in the subject string, so there is no point in |
2358 |
retrying the overall match at any position after the first. |
retrying the overall match at any position after the first. |
2359 |
PCRE treats such a pattern as though it were preceded by \A. |
PCRE normally treats such a pattern as though it were pre- |
2360 |
In cases where it is known that the subject string contains |
ceded by \A. |
2361 |
no newlines, it is worth setting PCRE_DOTALL when the pat- |
|
2362 |
tern begins with .* in order to obtain this optimization, or |
In cases where it is known that the subject string contains |
2363 |
alternatively using ^ to indicate anchoring explicitly. |
no newlines, it is worth setting PCRE_DOTALL in order to |
2364 |
|
obtain this optimization, or alternatively using ^ to indi- |
2365 |
|
cate anchoring explicitly. |
2366 |
|
|
2367 |
|
However, there is one situation where the optimization can- |
2368 |
|
not be used. When .* is inside capturing parentheses that |
2369 |
|
are the subject of a backreference elsewhere in the pattern, |
2370 |
|
a match at the start may fail, and a later one succeed. Con- |
2371 |
|
sider, for example: |
2372 |
|
|
2373 |
|
(.*)abc\1 |
2374 |
|
|
2375 |
|
If the subject is "xyz123abc123" the match point is the |
2376 |
|
fourth character. For this reason, such a pattern is not |
2377 |
|
implicitly anchored. |
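The claim about the match point can be checked with a short sketch. It uses Python's re module rather than PCRE, assuming the same backtracking behaviour for this pattern, with re.DOTALL standing in for PCRE_DOTALL.

```python
import re

# Starting at offset 0, (.*) would have to capture "xyz123" for "abc" to
# follow, but then \1 cannot match the trailing "123". The overall match
# must therefore be retried at later start positions.
match = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)

start_offset = match.start()   # zero-based, so 3 is the fourth character
captured = match.group(1)

print(start_offset, captured, match.group())
```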
2378 |
|
|
2379 |
When a capturing subpattern is repeated, the value captured |
When a capturing subpattern is repeated, the value captured |
2380 |
is the substring that matched the final iteration. For exam- |
is the substring that matched the final iteration. For exam- |
2394 |
"b". |
"b". |
2395 |
|
|
2396 |
|
|
2397 |
|
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
2398 |
|
|
2399 |
|
With both maximizing and minimizing repetition, failure of |
2400 |
|
what follows normally causes the repeated item to be re- |
2401 |
|
evaluated to see if a different number of repeats allows the |
2402 |
|
rest of the pattern to match. Sometimes it is useful to |
2403 |
|
prevent this, either to change the nature of the match, or |
2404 |
|
to cause it to fail earlier than it otherwise might, when the |
2405 |
|
author of the pattern knows there is no point in carrying |
2406 |
|
on. |
2407 |
|
|
2408 |
|
Consider, for example, the pattern \d+foo when applied to |
2409 |
|
the subject line |
2410 |
|
|
2411 |
|
123456bar |
2412 |
|
|
2413 |
|
After matching all 6 digits and then failing to match "foo", |
2414 |
|
the normal action of the matcher is to try again with only 5 |
2415 |
|
digits matching the \d+ item, and then with 4, and so on, |
2416 |
|
before ultimately failing. "Atomic grouping" (a term taken |
2417 |
|
from Jeffrey Friedl's book) provides the means for specify- |
2418 |
|
ing that once a subpattern has matched, it is not to be re- |
2419 |
|
evaluated in this way. |
2420 |
|
|
2421 |
|
If we use atomic grouping for the previous example, the |
2422 |
|
matcher would give up immediately on failing to match "foo" |
2423 |
|
the first time. The notation is a kind of special |
2424 |
|
parenthesis, starting with (?> as in this example: |
2425 |
|
|
2426 |
|
(?>\d+)foo |
2427 |
|
|
2428 |
|
This kind of parenthesis "locks up" the part of the pattern |
2429 |
|
it contains once it has matched, and a failure further into |
2430 |
|
the pattern is prevented from backtracking into it. Back- |
2431 |
|
tracking past it to previous items, however, works as nor- |
2432 |
|
mal. |
2433 |
|
|
2434 |
|
An alternative description is that a subpattern of this type |
2435 |
|
matches the string of characters that an identical stan- |
2436 |
|
dalone pattern would match, if anchored at the current point |
2437 |
|
in the subject string. |
2438 |
|
|
2439 |
|
Atomic grouping subpatterns are not capturing subpatterns. |
2440 |
|
Simple cases such as the above example can be thought of as |
2441 |
|
a maximizing repeat that must swallow everything it can. So, |
2442 |
|
while both \d+ and \d+? are prepared to adjust the number of |
2443 |
|
digits they match in order to make the rest of the pattern |
2444 |
|
match, (?>\d+) can only match an entire sequence of digits. |
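The "swallow everything" behaviour can be sketched in Python, with one substitution that should be kept in mind: because older versions of Python's re module lack the (?> syntax, the sketch emulates an atomic group with the lookahead-plus-back-reference idiom (?=(\d+))\1. This emulation is an assumption made for the example and is not how PCRE implements atomic groups; unlike a real atomic group, it also uses up a capturing subpattern.

```python
import re

subject = "123456"

# Ordinary greedy repeat: \d+ first takes all six digits, then gives one
# back so that the trailing \d can match.
backtracking = re.search(r"\d+\d", subject)

# Emulated atomic group: the lookahead captures the longest digit run and
# is never re-entered on failure, so \1 must consume every remaining
# digit and the trailing \d has nothing left to match.
atomic_like = re.search(r"(?=(\d+))\1\d", subject)

print(backtracking, atomic_like)
```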
2445 |
|
|
2446 |
|
Atomic groups in general can of course contain arbitrarily |
2447 |
|
complicated subpatterns, and can be nested. However, when |
2448 |
|
the subpattern for an atomic group is just a single repeated |
2449 |
|
item, as in the example above, a simpler notation, called a |
2450 |
|
"possessive quantifier" can be used. This consists of an |
2451 |
|
additional + character following a quantifier. Using this |
2452 |
|
notation, the previous example can be rewritten as |
2453 |
|
|
2454 |
|
\d++foo |
2455 |
|
|
2456 |
|
Possessive quantifiers are always greedy; the setting of the |
2457 |
|
PCRE_UNGREEDY option is ignored. They are a convenient nota- |
2458 |
|
tion for the simpler forms of atomic group. However, there |
2459 |
|
is no difference in the meaning or processing of a posses- |
2460 |
|
sive quantifier and the equivalent atomic group. |
2461 |
|
|
2462 |
|
The possessive quantifier syntax is an extension to the Perl |
2463 |
|
syntax. It originates in Sun's Java package. |
2464 |
|
|
2465 |
|
When a pattern contains an unlimited repeat inside a subpat- |
2466 |
|
tern that can itself be repeated an unlimited number of |
2467 |
|
times, the use of an atomic group is the only way to avoid |
2468 |
|
some failing matches taking a very long time indeed. The |
2469 |
|
pattern |
2470 |
|
|
2471 |
|
(\D+|<\d+>)*[!?] |
2472 |
|
|
2473 |
|
matches an unlimited number of substrings that either con- |
2474 |
|
sist of non-digits, or digits enclosed in <>, followed by |
2475 |
|
either ! or ?. When it matches, it runs quickly. However, if |
2476 |
|
it is applied to |
2477 |
|
|
2478 |
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
2479 |
|
|
2480 |
|
it takes a long time before reporting failure. This is |
2481 |
|
because the string can be divided between the two repeats in |
2482 |
|
a large number of ways, and all have to be tried. (The exam- |
2483 |
|
ple used [!?] rather than a single character at the end, |
2484 |
|
because both PCRE and Perl have an optimization that allows |
2485 |
|
for fast failure when a single character is used. They |
2486 |
|
remember the last single character that is required for a |
2487 |
|
match, and fail early if it is not present in the string.) |
2488 |
|
If the pattern is changed to |
2489 |
|
|
2490 |
|
((?>\D+)|<\d+>)*[!?] |
2491 |
|
|
2492 |
|
sequences of non-digits cannot be broken, and failure hap- |
2493 |
|
pens quickly. |
2494 |
|
|
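The remark that the string "can be divided between the two repeats in a large number of ways" can be made concrete with a little arithmetic. The sketch below is a rough model only: it counts the ways a run of n non-digits can be cut into consecutive non-empty pieces, one piece per \D+ match inside the outer repeat. The function name count_divisions is hypothetical, and a real matcher need not visit each division exactly once.

```python
def count_divisions(n: int) -> int:
    r"""Count the ways a run of n non-digit characters can be cut into
    consecutive non-empty pieces, each piece standing for one \D+ match
    inside the outer (...)* repeat."""
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix can be divided in exactly one way
    for end in range(1, n + 1):
        # The last piece may start at any earlier position.
        ways[end] = sum(ways[start] for start in range(end))
    return ways[n]
```

Under this model the count doubles with every extra character, which is why even a short failing subject can take a long time. The atomic group removes the choice because each run of non-digits is taken whole.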
2495 |
|
|
2496 |
BACK REFERENCES |
BACK REFERENCES |
2497 |
|
|
2498 |
Outside a character class, a backslash followed by a digit |
Outside a character class, a backslash followed by a digit |
2499 |
greater than 0 (and possibly further digits) is a back |
greater than 0 (and possibly further digits) is a back |
2500 |
reference to a capturing subpattern earlier (i.e. to its |
reference to a capturing subpattern earlier (that is, to its |
2501 |
left) in the pattern, provided there have been that many |
left) in the pattern, provided there have been that many |
2502 |
previous capturing left parentheses. |
previous capturing left parentheses. |
2503 |
|
|
2512 |
|
|
2513 |
A back reference matches whatever actually matched the cap- |
A back reference matches whatever actually matched the cap- |
2514 |
turing subpattern in the current subject string, rather than |
turing subpattern in the current subject string, rather than |
2515 |
anything matching the subpattern itself. So the pattern |
anything matching the subpattern itself (see "Subpatterns as |
2516 |
|
subroutines" below for a way of doing that). So the pattern |
2517 |
|
|
2518 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
2519 |
|
|
2520 |
matches "sense and sensibility" and "response and responsi- |
matches "sense and sensibility" and "response and responsi- |
2521 |
bility", but not "sense and responsibility". If caseful |
bility", but not "sense and responsibility". If caseful |
2522 |
matching is in force at the time of the back reference, then |
matching is in force at the time of the back reference, the |
2523 |
the case of letters is relevant. For example, |
case of letters is relevant. For example, |
2524 |
|
|
2525 |
((?i)rah)\s+\1 |
((?i)rah)\s+\1 |
2526 |
|
|
2528 |
though the original capturing subpattern is matched case- |
though the original capturing subpattern is matched case- |
2529 |
lessly. |
lessly. |
2530 |
|
|
2531 |
|
Back references to named subpatterns use the Python syntax |
2532 |
|
(?P=name). We could rewrite the above example as follows: |
2533 |
|
|
2534 |
|
(?P<p1>(?i)rah)\s+(?P=p1) |
2535 |
|
|
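The behaviour of numbered and named back references can be tried out in any engine with compatible syntax. The following is a minimal sketch that assumes Python's re module as a stand-in for PCRE. The names NUMBERED, NAMED and matches are hypothetical, and the scoped form (?i:...) is Python's spelling of the inline (?i) used above.

```python
import re

# Numbered back reference: \1 must repeat the text that group 1 actually matched.
NUMBERED = re.compile(r"(sens|respons)e and \1ibility")

# Named back reference in the Python-style syntax. The group is matched
# caselessly, but the reference lies outside the (?i:...) scope and is caseful.
NAMED = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def matches(regex, subject):
    """True if the whole subject matches the compiled pattern."""
    return regex.fullmatch(subject) is not None
```

With these definitions, matches(NUMBERED, "sense and responsibility") is false, and matches(NAMED, "RAH rah") is false even though the group itself accepts either case.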
2536 |
There may be more than one back reference to the same sub- |
There may be more than one back reference to the same sub- |
2537 |
pattern. If a subpattern has not actually been used in a |
pattern. If a subpattern has not actually been used in a |
2538 |
particular match, then any back references to it always |
particular match, any back references to it always fail. For |
2539 |
fail. For example, the pattern |
example, the pattern |
2540 |
|
|
2541 |
(a|(bc))\2 |
(a|(bc))\2 |
2542 |
|
|
2543 |
always fails if it starts to match "a" rather than "bc". |
always fails if it starts to match "a" rather than "bc". |
2544 |
Because there may be up to 99 back references, all digits |
Because there may be many capturing parentheses in a pat- |
2545 |
following the backslash are taken as part of a potential |
tern, all digits following the backslash are taken as part |
2546 |
back reference number. If the pattern continues with a digit |
of a potential back reference number. If the pattern contin- |
2547 |
character, then some delimiter must be used to terminate the |
ues with a digit character, some delimiter must be used to |
2548 |
back reference. If the PCRE_EXTENDED option is set, this can |
terminate the back reference. If the PCRE_EXTENDED option is |
2549 |
be whitespace. Otherwise an empty comment can be used. |
set, this can be whitespace. Otherwise an empty comment can |
2550 |
|
be used. |
2551 |
|
|
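Both points in the paragraph above, the unset subpattern and the delimiter before a following digit, can be sketched briefly. This again assumes Python's re module rather than PCRE itself; the names UNSET, DELIMITED and full are hypothetical.

```python
import re

# Group 2 is set only when the "bc" alternative is taken, so \2 cannot
# match after the "a" alternative.
UNSET = re.compile(r"(a|(bc))\2")

# The empty comment (?#) ends the reference \1 before the literal digit 0.
DELIMITED = re.compile(r"(a)\1(?#)0")


def full(regex, subject):
    """True if the whole subject matches the compiled pattern."""
    return regex.fullmatch(subject) is not None
```

Without the empty comment, the digits after the backslash would be read together as one reference number.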
2552 |
A back reference that occurs inside the parentheses to which |
A back reference that occurs inside the parentheses to which |
2553 |
it refers fails when the subpattern is first used, so, for |
it refers fails when the subpattern is first used, so, for |
2557 |
|
|
2558 |
(a|b\1)+ |
(a|b\1)+ |
2559 |
|
|
2560 |
matches any number of "a"s and also "aba", "ababaa" etc. At |
matches any number of "a"s and also "aba", "ababbaa" etc. At |
2561 |
each iteration of the subpattern, the back reference matches |
each iteration of the subpattern, the back reference matches |
2562 |
the character string corresponding to the previous itera- |
the character string corresponding to the previous itera- |
2563 |
tion. In order for this to work, the pattern must be such |
tion. In order for this to work, the pattern must be such |
2566 |
example above, or by a quantifier with a minimum of zero. |
example above, or by a quantifier with a minimum of zero. |
2567 |
|
|
2568 |
|
|
|
|
|
2569 |
ASSERTIONS |
ASSERTIONS |
2570 |
|
|
2571 |
An assertion is a test on the characters following or |
An assertion is a test on the characters following or |
2572 |
preceding the current matching point that does not actually |
preceding the current matching point that does not actually |
2573 |
consume any characters. The simple assertions coded as \b, |
consume any characters. The simple assertions coded as \b, |
2574 |
\B, \A, \Z, \z, ^ and $ are described above. More compli- |
\B, \A, \G, \Z, \z, ^ and $ are described above. More com- |
2575 |
cated assertions are coded as subpatterns. There are two |
plicated assertions are coded as subpatterns. There are two |
2576 |
kinds: those that look ahead of the current position in the |
kinds: those that look ahead of the current position in the |
2577 |
subject string, and those that look behind it. |
subject string, and those that look behind it. |
2578 |
|
|
2579 |
An assertion subpattern is matched in the normal way, except |
An assertion subpattern is matched in the normal way, except |
2580 |
that it does not cause the current matching position to be |
that it does not cause the current matching position to be |
2581 |
changed. Lookahead assertions start with (?= for positive |
changed. Lookahead assertions start with (?= for positive |
2599 |
when the next three characters are "bar". A lookbehind |
when the next three characters are "bar". A lookbehind |
2600 |
assertion is needed to achieve this effect. |
assertion is needed to achieve this effect. |
2601 |
|
|
2602 |
|
If you want to force a matching failure at some point in a |
2603 |
|
pattern, the most convenient way to do it is with (?!) |
2604 |
|
because an empty string always matches, so an assertion that |
2605 |
|
requires there not to be an empty string must always fail. |
2606 |
|
|
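A small sketch of forcing failure with (?!), assuming Python's re module as a stand-in for PCRE (the name FORCED is hypothetical):

```python
import re

# The first alternative can never succeed: after "b" the empty negative
# lookahead (?!) always fails, so only the "ac" alternative can match.
FORCED = re.compile(r"a(?:b(?!)|c)")
```

Here FORCED.search("ab") finds nothing, while FORCED.search("ac") succeeds.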
2607 |
Lookbehind assertions start with (?<= for positive asser- |
Lookbehind assertions start with (?<= for positive asser- |
2608 |
tions and (?<! for negative assertions. For example, |
tions and (?<! for negative assertions. For example, |
2609 |
|
|
2624 |
causes an error at compile time. Branches that match dif- |
causes an error at compile time. Branches that match dif- |
2625 |
ferent length strings are permitted only at the top level of |
ferent length strings are permitted only at the top level of |
2626 |
a lookbehind assertion. This is an extension compared with |
a lookbehind assertion. This is an extension compared with |
2627 |
Perl 5.005, which requires all branches to match the same |
Perl (at least for 5.8), which requires all branches to |
2628 |
length of string. An assertion such as |
match the same length of string. An assertion such as |
2629 |
|
|
2630 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
2631 |
|
|
2639 |
alternative, to temporarily move the current position back |
alternative, to temporarily move the current position back |
2640 |
by the fixed width and then try to match. If there are |
by the fixed width and then try to match. If there are |
2641 |
insufficient characters before the current position, the |
insufficient characters before the current position, the |
2642 |
match is deemed to fail. Lookbehinds in conjunction with |
match is deemed to fail. |
2643 |
once-only subpatterns can be particularly useful for match- |
|
2644 |
ing at the ends of strings; an example is given at the end |
PCRE does not allow the \C escape (which matches a single |
2645 |
of the section on once-only subpatterns. |
byte in UTF-8 mode) to appear in lookbehind assertions, |
2646 |
|
because it makes it impossible to calculate the length of |
2647 |
|
the lookbehind. |
2648 |
|
|
2649 |
|
Atomic groups can be used in conjunction with lookbehind |
2650 |
|
assertions to specify efficient matching at the end of the |
2651 |
|
subject string. Consider a simple pattern such as |
2652 |
|
|
2653 |
|
abcd$ |
2654 |
|
|
2655 |
|
when applied to a long string that does not match. Because |
2656 |
|
matching proceeds from left to right, PCRE will look for |
2657 |
|
each "a" in the subject and then see if what follows matches |
2658 |
|
the rest of the pattern. If the pattern is specified as |
2659 |
|
|
2660 |
|
^.*abcd$ |
2661 |
|
|
2662 |
|
the initial .* matches the entire string at first, but when |
2663 |
|
this fails (because there is no following "a"), it back- |
2664 |
|
tracks to match all but the last character, then all but the |
2665 |
|
last two characters, and so on. Once again the search for |
2666 |
|
"a" covers the entire string, from right to left, so we are |
2667 |
|
no better off. However, if the pattern is written as |
2668 |
|
|
2669 |
|
^(?>.*)(?<=abcd) |
2670 |
|
|
2671 |
|
or, equivalently, |
2672 |
|
|
2673 |
|
^.*+(?<=abcd) |
2674 |
|
|
2675 |
|
there can be no backtracking for the .* item; it can match |
2676 |
|
only the entire string. The subsequent lookbehind assertion |
2677 |
|
does a single test on the last four characters. If it fails, |
2678 |
|
the match fails immediately. For long strings, this approach |
2679 |
|
makes a significant difference to the processing time. |
2680 |
|
|
2681 |
Several assertions (of any sort) may occur in succession. |
Several assertions (of any sort) may occur in succession. |
2682 |
For example, |
For example, |
2686 |
matches "foo" preceded by three digits that are not "999". |
matches "foo" preceded by three digits that are not "999". |
2687 |
Notice that each of the assertions is applied independently |
Notice that each of the assertions is applied independently |
2688 |
at the same point in the subject string. First there is a |
at the same point in the subject string. First there is a |
2689 |
check that the previous three characters are all digits, |
check that the previous three characters are all digits, and |
2690 |
then there is a check that the same three characters are not |
then there is a check that the same three characters are not |
2691 |
"999". This pattern does not match "foo" preceded by six |
"999". This pattern does not match "foo" preceded by six |
2692 |
characters, the first of which are digits and the last three |
characters, the first of which are digits and the last three |
2721 |
for positive assertions, because it does not make sense for |
for positive assertions, because it does not make sense for |
2722 |
negative assertions. |
negative assertions. |
2723 |
|
|
|
Assertions count towards the maximum of 200 parenthesized |
|
|
subpatterns. |
|
|
|
|
|
|
|
|
|
|
|
ONCE-ONLY SUBPATTERNS |
|
|
With both maximizing and minimizing repetition, failure of |
|
|
what follows normally causes the repeated item to be re- |
|
|
evaluated to see if a different number of repeats allows the |
|
|
rest of the pattern to match. Sometimes it is useful to |
|
|
prevent this, either to change the nature of the match, or |
|
|
to cause it to fail earlier than it otherwise might, when the |
|
|
author of the pattern knows there is no point in carrying |
|
|
on. |
|
|
|
|
|
Consider, for example, the pattern \d+foo when applied to |
|
|
the subject line |
|
|
|
|
|
123456bar |
|
|
|
|
|
After matching all 6 digits and then failing to match "foo", |
|
|
the normal action of the matcher is to try again with only 5 |
|
|
digits matching the \d+ item, and then with 4, and so on, |
|
|
before ultimately failing. Once-only subpatterns provide the |
|
|
means for specifying that once a portion of the pattern has |
|
|
matched, it is not to be re-evaluated in this way, so the |
|
|
matcher would give up immediately on failing to match "foo" |
|
|
the first time. The notation is another kind of special |
|
|
parenthesis, starting with (?> as in this example: |
|
|
|
|
|
(?>\d+)foo |
|
|
|
|
|
This kind of parenthesis "locks up" the part of the pattern |
|
|
it contains once it has matched, and a failure further into |
|
|
the pattern is prevented from backtracking into it. Back- |
|
|
tracking past it to previous items, however, works as nor- |
|
|
mal. |
|
|
|
|
|
An alternative description is that a subpattern of this type |
|
|
matches the string of characters that an identical stan- |
|
|
dalone pattern would match, if anchored at the current point |
|
|
in the subject string. |
|
|
|
|
|
Once-only subpatterns are not capturing subpatterns. Simple |
|
|
cases such as the above example can be thought of as a max- |
|
|
imizing repeat that must swallow everything it can. So, |
|
|
while both \d+ and \d+? are prepared to adjust the number of |
|
|
digits they match in order to make the rest of the pattern |
|
|
match, (?>\d+) can only match an entire sequence of digits. |
|
|
|
|
|
This construction can of course contain arbitrarily compli- |
|
|
cated subpatterns, and it can be nested. |
|
|
|
|
|
Once-only subpatterns can be used in conjunction with look- |
|
|
behind assertions to specify efficient matching at the end |
|
|
of the subject string. Consider a simple pattern such as |
|
|
|
|
|
abcd$ |
|
|
|
|
|
when applied to a long string which does not match it. |
|
|
Because matching proceeds from left to right, PCRE will look |
|
|
for each "a" in the subject and then see if what follows |
|
|
matches the rest of the pattern. If the pattern is specified |
|
|
as |
|
|
|
|
|
^.*abcd$ |
|
|
|
|
|
then the initial .* matches the entire string at first, but |
|
|
when this fails, it backtracks to match all but the last |
|
|
character, then all but the last two characters, and so on. |
|
|
Once again the search for "a" covers the entire string, from |
|
|
right to left, so we are no better off. However, if the pat- |
|
|
tern is written as |
|
|
|
|
|
^(?>.*)(?<=abcd) |
|
|
|
|
|
then there can be no backtracking for the .* item; it can |
|
|
match only the entire string. The subsequent lookbehind |
|
|
assertion does a single test on the last four characters. If |
|
|
it fails, the match fails immediately. For long strings, |
|
|
this approach makes a significant difference to the process- |
|
|
ing time. |
|
|
|
|
|
|
|
2724 |
|
|
2725 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
2726 |
|
|
2727 |
It is possible to cause the matching process to obey a sub- |
It is possible to cause the matching process to obey a sub- |
2728 |
pattern conditionally or to choose between two alternative |
pattern conditionally or to choose between two alternative |
2729 |
subpatterns, depending on the result of an assertion, or |
subpatterns, depending on the result of an assertion, or |
2738 |
more than two alternatives in the subpattern, a compile-time |
more than two alternatives in the subpattern, a compile-time |
2739 |
error occurs. |
error occurs. |
2740 |
|
|
2741 |
There are two kinds of condition. If the text between the |
There are three kinds of condition. If the text between the |
2742 |
parentheses consists of a sequence of digits, then the |
parentheses consists of a sequence of digits, the condition |
2743 |
condition is satisfied if the capturing subpattern of that |
is satisfied if the capturing subpattern of that number has |
2744 |
number has previously matched. Consider the following pat- |
previously matched. The number must be greater than zero. |
2745 |
tern, which contains non-significant white space to make it |
Consider the following pattern, which contains non- |
2746 |
more readable (assume the PCRE_EXTENDED option) and to |
significant white space to make it more readable (assume the |
2747 |
divide it into three parts for ease of discussion: |
PCRE_EXTENDED option) and to divide it into three parts for |
2748 |
|
ease of discussion: |
2749 |
|
|
2750 |
( \( )? [^()]+ (?(1) \) ) |
( \( )? [^()]+ (?(1) \) ) |
2751 |
|
|
2762 |
matches a sequence of non-parentheses, optionally enclosed |
matches a sequence of non-parentheses, optionally enclosed |
2763 |
in parentheses. |
in parentheses. |
2764 |
|
|
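The conditional pattern above can be exercised as follows. This is a sketch that assumes Python's re module, whose re.VERBOSE flag plays the role of PCRE_EXTENDED here; the names OPTIONAL_PARENS and is_optionally_parenthesized are hypothetical.

```python
import re

# ( \( )?      optionally match an opening parenthesis as group 1
# [^()]+       one or more characters that are not parentheses
# (?(1) \) )   require a closing parenthesis only if group 1 was set
OPTIONAL_PARENS = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)


def is_optionally_parenthesized(subject: str) -> bool:
    """True if the whole subject is a parenthesis-free run, either bare
    or enclosed in one matching pair of parentheses."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None
```

Both "abc" and "(abc)" are accepted, whereas "(abc" and "abc)" are rejected because the closing parenthesis is required exactly when the opening one was present.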
2765 |
If the condition is not a sequence of digits, it must be an |
If the condition is the string (R), it is satisfied if a |
2766 |
assertion. This may be a positive or negative lookahead or |
recursive call to the pattern or subpattern has been made. |
2767 |
lookbehind assertion. Consider this pattern, again contain- |
At "top level", the condition is false. This is a PCRE |
2768 |
ing non-significant white space, and with the two alterna- |
extension. Recursive patterns are described in the next |
2769 |
tives on the second line: |
section. |
2770 |
|
|
2771 |
|
If the condition is not a sequence of digits or (R), it must |
2772 |
|
be an assertion. This may be a positive or negative looka- |
2773 |
|
head or lookbehind assertion. Consider this pattern, again |
2774 |
|
containing non-significant white space, and with the two |
2775 |
|
alternatives on the second line: |
2776 |
|
|
2777 |
(?(?=[^a-z]*[a-z]) |
(?(?=[^a-z]*[a-z]) |
2778 |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
2787 |
letters and dd are digits. |
letters and dd are digits. |
2788 |
|
|
2789 |
|
|
|
|
|
2790 |
COMMENTS |
COMMENTS |
2791 |
|
|
2792 |
The sequence (?# marks the start of a comment which contin- |
The sequence (?# marks the start of a comment which contin- |
2793 |
ues up to the next closing parenthesis. Nested parentheses |
ues up to the next closing parenthesis. Nested parentheses |
2794 |
are not permitted. The characters that make up a comment |
are not permitted. The characters that make up a comment |
2799 |
ues up to the next newline character in the pattern. |
ues up to the next newline character in the pattern. |
2800 |
|
|
2801 |
|
|
2802 |
|
RECURSIVE PATTERNS |
2803 |
|
|
2804 |
|
Consider the problem of matching a string in parentheses, |
2805 |
|
allowing for unlimited nested parentheses. Without the use |
2806 |
|
of recursion, the best that can be done is to use a pattern |
2807 |
|
that matches up to some fixed depth of nesting. It is not |
2808 |
|
possible to handle an arbitrary nesting depth. Perl has pro- |
2809 |
|
vided an experimental facility that allows regular expres- |
2810 |
|
sions to recurse (amongst other things). It does this by |
2811 |
|
interpolating Perl code in the expression at run time, and |
2812 |
|
the code can refer to the expression itself. A Perl pattern |
2813 |
|
to solve the parentheses problem can be created like this: |
2814 |
|
|
2815 |
|
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
2816 |
|
|
2817 |
|
The (?p{...}) item interpolates Perl code at run time, and |
2818 |
|
in this case refers recursively to the pattern in which it |
2819 |
|
appears. Obviously, PCRE cannot support the interpolation of |
2820 |
|
Perl code. Instead, it supports some special syntax for |
2821 |
|
recursion of the entire pattern, and also for individual |
2822 |
|
subpattern recursion. |
2823 |
|
|
2824 |
|
The special item that consists of (? followed by a number |
2825 |
|
greater than zero and a closing parenthesis is a recursive |
2826 |
|
call of the subpattern of the given number, provided that it |
2827 |
|
occurs inside that subpattern. (If not, it is a "subroutine" |
2828 |
|
call, which is described in the next section.) The special |
2829 |
|
item (?R) is a recursive call of the entire regular expres- |
2830 |
|
sion. |
2831 |
|
|
2832 |
|
For example, this PCRE pattern solves the nested parentheses |
2833 |
|
problem (assume the PCRE_EXTENDED option is set so that |
2834 |
|
white space is ignored): |
2835 |
|
|
2836 |
|
\( ( (?>[^()]+) | (?R) )* \) |
2837 |
|
|
2838 |
|
First it matches an opening parenthesis. Then it matches any |
2839 |
|
number of substrings which can either be a sequence of non- |
2840 |
|
parentheses, or a recursive match of the pattern itself |
2841 |
|
(that is, a correctly parenthesized substring).  Finally |
2842 |
|
there is a closing parenthesis. |
2843 |
|
|
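The three steps just described can be mirrored by a hand-written recursive matcher. The sketch below is not PCRE and uses no regular expression library (Python's re module has no (?R) item); it only illustrates the control flow of the pattern. The function name match_parens is hypothetical.

```python
def match_parens(subject: str, pos: int = 0):
    """Return the index just past a balanced parenthesized group that
    starts at pos, or None if there is no such group."""
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1                              # the opening \(
    while pos < len(subject):
        if subject[pos] == ")":           # the closing \)
            return pos + 1
        if subject[pos] == "(":           # the (?R) alternative: recurse
            pos = match_parens(subject, pos)
            if pos is None:
                return None
        else:                             # the (?>[^()]+) alternative
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return None                           # ran out of subject before \)
```

Because the inner loop takes each run of non-parentheses whole and never gives characters back, it behaves like the atomic group in the pattern and fails quickly on unbalanced input.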
2844 |
|
If this were part of a larger pattern, you would not want to |
2845 |
|
recurse the entire pattern, so instead you could use this: |
2846 |
|
|
2847 |
|
( \( ( (?>[^()]+) | (?1) )* \) ) |
2848 |
|
|
2849 |
|
We have put the pattern into parentheses, and caused the |
2850 |
|
recursion to refer to them instead of the whole pattern. In |
2851 |
|
a larger pattern, keeping track of parenthesis numbers can |
2852 |
|
be tricky. It may be more convenient to use named |
2853 |
|
parentheses instead. For this, PCRE uses (?P>name), which is |
2854 |
|
an extension to the Python syntax that PCRE uses for named |
2855 |
|
parentheses (Perl does not provide named parentheses). We |
2856 |
|
could rewrite the above example as follows: |
2857 |
|
|
2858 |
|
(?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) ) |
2859 |
|
|
2860 |
|
This particular example pattern contains nested unlimited |
2861 |
|
repeats, and so the use of atomic grouping for matching |
2862 |
|
strings of non-parentheses is important when applying the |
2863 |
|
pattern to strings that do not match. For example, when this |
2864 |
|
pattern is applied to |
2865 |
|
|
2866 |
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
2867 |
|
|
2868 |
|
it yields "no match" quickly. However, if atomic grouping is |
2869 |
|
not used, the match runs for a very long time indeed because |
2870 |
|
there are so many different ways the + and * repeats can |
2871 |
|
carve up the subject, and all have to be tested before |
2872 |
|
failure can be reported. |
2873 |
|
At the end of a match, the values set for any capturing sub- |
2874 |
|
patterns are those from the outermost level of the recursion |
2875 |
|
at which the subpattern value is set. If you want to obtain |
2876 |
|
intermediate values, a callout function can be used (see |
2877 |
|
below and the pcrecallout documentation). If the pattern |
2878 |
|
above is matched against |
2879 |
|
|
2880 |
|
(ab(cd)ef) |
2881 |
|
|
2882 |
|
the value for the capturing parentheses is "ef", which is |
2883 |
|
the last value taken on at the top level. If additional |
2884 |
|
parentheses are added, giving |
2885 |
|
|
2886 |
|
\( ( ( (?>[^()]+) | (?R) )* ) \) |
2887 |
|
   ^                        ^ |
2888 |
|
   ^                        ^ |
2889 |
|
|
2890 |
|
the string they capture is "ab(cd)ef", the contents of the |
2891 |
|
top level parentheses. If there are more than 15 capturing |
2892 |
|
parentheses in a pattern, PCRE has to obtain extra memory to |
2893 |
|
store data during a recursion, which it does by using |
2894 |
|
pcre_malloc, freeing it via pcre_free afterwards. If no |
2895 |
|
memory can be obtained, the match fails with the |
2896 |
|
PCRE_ERROR_NOMEMORY error. |
2897 |
|
|
2898 |
|
Do not confuse the (?R) item with the condition (R), which |
2899 |
|
tests for recursion. Consider this pattern, which matches |
2900 |
|
text in angle brackets, allowing for arbitrary nesting. Only |
2901 |
|
digits are allowed in nested brackets (that is, when recurs- |
2902 |
|
ing), whereas any characters are permitted at the outer |
2903 |
|
level. |
2904 |
|
|
2905 |
|
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
2906 |
|
|
2907 |
|
In this pattern, (?(R) is the start of a conditional subpat- |
2908 |
|
tern, with two different alternatives for the recursive and |
2909 |
|
non-recursive cases. The (?R) item is the actual recursive |
2910 |
|
call. |
2911 |
|
|
2912 |
|
|
2913 |
|
SUBPATTERNS AS SUBROUTINES |
2914 |
|
|
2915 |
|
If the syntax for a recursive subpattern reference (either |
2916 |
|
by number or by name) is used outside the parentheses to |
2917 |
|
which it refers, it operates like a subroutine in a program- |
2918 |
|
ming language. An earlier example pointed out that the pat- |
2919 |
|
tern |
2920 |
|
|
2921 |
|
(sens|respons)e and \1ibility |
2922 |
|
|
2923 |
|
matches "sense and sensibility" and "response and responsi- |
2924 |
|
bility", but not "sense and responsibility". If instead the |
2925 |
|
pattern |
2926 |
|
|
2927 |
|
(sens|respons)e and (?1)ibility |
2928 |
|
|
2929 |
|
is used, it does match "sense and responsibility" as well as |
2930 |
|
the other two strings. Such references must, however, follow |
2931 |
|
the subpattern to which they refer. |
2932 |
|
|
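The difference between a back reference and a subroutine call can be approximated in an engine that lacks the (?1) item by repeating the text of the subpattern. The sketch below assumes Python's re module, which has no subroutine syntax. The names STEM, BACKREF and REUSED are hypothetical, and textual reuse is only a stand-in for a true call.

```python
import re

STEM = r"(?:sens|respons)"

# Back reference: the second stem must equal whatever the first one matched.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Stand-in for (?1): the subpattern itself is applied again, so either
# stem may appear the second time.
REUSED = re.compile(rf"({STEM})e and {STEM}ibility")
```

REUSED accepts "sense and responsibility" as well as the two strings that BACKREF accepts.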
2933 |
|
|
2934 |
|
CALLOUTS |
2935 |
|
|
2936 |
|
Perl has a feature whereby using the sequence (?{...}) |
2937 |
|
causes arbitrary Perl code to be obeyed in the middle of |
2938 |
|
matching a regular expression. This makes it possible, |
2939 |
|
amongst other things, to extract different substrings that |
2940 |
|
match the same pair of parentheses when there is a repeti- |
2941 |
|
tion. |
2942 |
|
|
2943 |
|
PCRE provides a similar feature, but of course it cannot |
2944 |
|
obey arbitrary Perl code. The feature is called "callout". |
2945 |
|
The caller of PCRE provides an external function by putting |
2946 |
|
its entry point in the global variable pcre_callout. By |
2947 |
|
default, this variable contains NULL, which disables all |
2948 |
|
calling out. |
2949 |
|
|
2950 |
|
Within a regular expression, (?C) indicates the points at |
2951 |
|
which the external function is to be called. If you want to |
2952 |
|
identify different callout points, you can put a number less |
2953 |
|
than 256 after the letter C. The default value is zero. For |
2954 |
|
example, this pattern has two callout points: |
2955 |
|
|
2956 |
|
(?C1)9abc(?C2)def |
2957 |
|
|
2958 |
|
During matching, when PCRE reaches a callout point (and |
2959 |
|
pcre_callout is set), the external function is called. It is |
2960 |
|
provided with the number of the callout, and, optionally, |
2961 |
|
one item of data originally supplied by the caller of |
2962 |
|
pcre_exec(). The callout function may cause matching to |
2963 |
|
backtrack, or to fail altogether. A complete description of |
2964 |
|
the interface to the callout function is given in the pcre- |
2965 |
|
callout documentation. |
2966 |
|
|
2967 |
|
Last updated: 03 February 2003 |
2968 |
|
Copyright (c) 1997-2003 University of Cambridge. |
2969 |
|
----------------------------------------------------------------------------- |
2970 |
|
|
2971 |
PERFORMANCE |
NAME |
2972 |
Certain items that may appear in patterns are more efficient |
PCRE - Perl-compatible regular expressions |
2973 |
than others. It is more efficient to use a character class |
|
2974 |
like [aeiou] than a set of alternatives such as (a|e|i|o|u). |
|
2975 |
In general, the simplest construction that provides the |
PCRE PERFORMANCE |
2976 |
required behaviour is usually the most efficient. Jeffrey |
|
2977 |
Friedl's book contains a lot of discussion about optimizing |
Certain items that may appear in regular expression patterns |
2978 |
regular expressions for efficient performance. |
are more efficient than others. It is more efficient to use |
2979 |
|
a character class like [aeiou] than a set of alternatives |
2980 |
When a pattern begins with .* and the PCRE_DOTALL option is |
such as (a|e|i|o|u). In general, the simplest construction |
2981 |
set, the pattern is implicitly anchored by PCRE, since it |
that provides the required behaviour is usually the most |
2982 |
can match only at the start of a subject string. However, if |
efficient. Jeffrey Friedl's book contains a lot of discus- |
2983 |
PCRE_DOTALL is not set, PCRE cannot make this optimization, |
sion about optimizing regular expressions for efficient per- |
2984 |
because the . metacharacter does not then match a newline, |
formance. |
2985 |
and if the subject string contains newlines, the pattern may |
|
2986 |
match from the character immediately following one of them |
When a pattern begins with .* not in parentheses, or in |
2987 |
instead of from the very start. For example, the pattern |
parentheses that are not the subject of a backreference, and |
2988 |
|
the PCRE_DOTALL option is set, the pattern is implicitly |
2989 |
|
anchored by PCRE, since it can match only at the start of a |
2990 |
|
subject string. However, if PCRE_DOTALL is not set, PCRE |
2991 |
|
cannot make this optimization, because the . metacharacter |
2992 |
|
does not then match a newline, and if the subject string |
2993 |
|
contains newlines, the pattern may match from the character |
2994 |
|
immediately following one of them instead of from the very |
2995 |
|
start. For example, the pattern |
2996 |
|
|
2997 |
(.*) second |
.*second |
2998 |
|
|
2999 |
matches the subject "first\nand second" (where \n stands for |
matches the subject "first\nand second" (where \n stands for |
3000 |
a newline character) with the first captured substring being |
a newline character), with the match starting at the seventh |
3001 |
"and". In order to do this, PCRE has to retry the match |
character. In order to do this, PCRE has to retry the match |
3002 |
starting after every newline in the subject. |
starting after every newline in the subject. |
3003 |
|
|
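The effect of the dot-matches-newline setting on where the match starts can be observed directly. This sketch assumes Python's re module, whose re.DOTALL flag corresponds in spirit to PCRE_DOTALL; the names SUBJECT and match_start are hypothetical.

```python
import re

SUBJECT = "first\nand second"


def match_start(flags: int = 0) -> int:
    """Zero-based offset at which .*second first matches SUBJECT."""
    return re.search(r".*second", SUBJECT, flags).start()
```

Without the flag the match begins at offset 6, the seventh character, just after the newline. With re.DOTALL it begins at offset 0, which is why such a pattern can be treated as anchored.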
3004 |
If you are using such a pattern with subject strings that do |
If you are using such a pattern with subject strings that do |
3021 |
that the entire match is going to fail, PCRE has in princi- |
that the entire match is going to fail, PCRE has in princi- |
3022 |
ple to try every possible variation, and this can take an |
ple to try every possible variation, and this can take an |
3023 |
extremely long time. |
extremely long time. |
|
|
|
3024 |
An optimization catches some of the more simple cases such |
An optimization catches some of the more simple cases such |
3025 |
as |
as |
3026 |
|
|
3040 |
whereas the latter takes an appreciable time with strings |
whereas the latter takes an appreciable time with strings |
3041 |
longer than about 20 characters. |
longer than about 20 characters. |
3042 |
|
|
3043 |
|
Last updated: 03 February 2003 |
3044 |
|
Copyright (c) 1997-2003 University of Cambridge. |
3045 |
|
----------------------------------------------------------------------------- |
3046 |
|
|
3047 |
|
NAME |
3048 |
|
PCRE - Perl-compatible regular expressions. |
3049 |
|
|
3050 |
|
|
3051 |
|
SYNOPSIS OF POSIX API |
3052 |
|
#include <pcreposix.h> |
3053 |
|
|
3054 |
|
int regcomp(regex_t *preg, const char *pattern, |
3055 |
|
int cflags); |
3056 |
|
|
3057 |
|
int regexec(regex_t *preg, const char *string, |
3058 |
|
size_t nmatch, regmatch_t pmatch[], int eflags); |
3059 |
|
|
3060 |
|
size_t regerror(int errcode, const regex_t *preg, |
3061 |
|
char *errbuf, size_t errbuf_size); |
3062 |
|
|
3063 |
|
void regfree(regex_t *preg); |
3064 |
|
|
3065 |
|
|
DESCRIPTION

This set of functions provides a POSIX-style API to the PCRE
regular expression package. See the pcreapi documentation for
a description of the native API, which contains additional
functionality.

The functions described here are just wrapper functions that
ultimately call the PCRE native API. Their prototypes are
defined in the pcreposix.h header file, and on Unix systems
the library itself is called libpcreposix.a, so it can be
accessed by adding -lpcreposix to the command for linking an
application which uses them. Because the POSIX functions call
the native ones, it is also necessary to add -lpcre.

I have implemented only those option bits that can be
reasonably mapped to PCRE native options. In addition, the
options REG_EXTENDED and REG_NOSUB are defined with the value
zero. They have no effect, but since programs that are written
to the POSIX interface often use them, this makes it easier to
slot in PCRE as a replacement library. Other POSIX options are
not even defined.

When PCRE is called via these functions, it is only the API
that is POSIX-like in style. The syntax and semantics of the
regular expressions themselves are still those of Perl,
subject to the setting of various PCRE options, as described
below. "POSIX-like in style" means that the API approximates
to the POSIX definition; it is not fully POSIX-compatible, and
in multi-byte encoding domains it is probably even less
compatible.

The header for these functions is supplied as pcreposix.h to
avoid any potential clash with other POSIX libraries. It can,
of course, be renamed or aliased as regex.h, which is the
"correct" name. It provides two structure types, regex_t for
compiled internal forms, and regmatch_t for returning captured
substrings. It also defines some constants whose names start
with "REG_"; these are used for setting options and
identifying error codes.


COMPILING A PATTERN

The function regcomp() is called to compile a pattern into an
internal form. The pattern is a C string terminated by a
binary zero, and is passed in the argument pattern. The preg
argument is a pointer to a regex_t structure which is used as
a base for storing information about the compiled expression.

The argument cflags is either zero, or contains one or more of
the bits defined by the following macros:

  REG_ICASE

The PCRE_CASELESS option is set when the expression is passed
for compilation to the native function.

  REG_NEWLINE

The PCRE_MULTILINE option is set when the expression is passed
for compilation to the native function. Note that this does
not mimic the defined POSIX behaviour for REG_NEWLINE (see the
following section).

In the absence of these flags, no options are passed to the
native function. This means that the regex is compiled with
PCRE default semantics. In particular, the way it handles
newline characters in the subject string is the Perl way, not
the POSIX way. Note that setting PCRE_MULTILINE has only some
of the effects specified for REG_NEWLINE. It does not affect
the way newlines are matched by . (they aren't) or by a
negative class such as [^a] (they are).

The yield of regcomp() is zero on success, and non-zero
otherwise. The preg structure is filled in on success, and one
member of the structure is public: re_nsub contains the number
of capturing subpatterns in the regular expression. Various
error codes are defined in the header file.


MATCHING NEWLINE CHARACTERS

This area is not simple, because POSIX and Perl take different
views of things. It is not possible to get PCRE to obey POSIX
semantics, but then PCRE was never intended to be a POSIX
engine. The following table lists the different possibilities
for matching newline characters in PCRE:

                            Default   Change with

  . matches newline           no      PCRE_DOTALL
  newline matches [^a]        yes     not changeable
  $ matches \n at end         yes     PCRE_DOLLARENDONLY
  $ matches \n in middle      no      PCRE_MULTILINE
  ^ matches \n in middle      no      PCRE_MULTILINE

This is the equivalent table for POSIX:

                            Default   Change with

  . matches newline           yes     REG_NEWLINE
  newline matches [^a]        yes     REG_NEWLINE
  $ matches \n at end         no      REG_NEWLINE
  $ matches \n in middle      no      REG_NEWLINE
  ^ matches \n in middle      no      REG_NEWLINE

PCRE's behaviour is the same as Perl's, except that there is
no equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE and
Perl, there is no way to stop newline from matching [^a].

The default POSIX newline handling can be obtained by setting
PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way to
make PCRE behave exactly as for the REG_NEWLINE action.
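
The last two rows read the same in both tables: ^ and $ do not
match at a newline in the middle of the subject by default,
and REG_NEWLINE (PCRE_MULTILINE under this wrapper) changes
that. The sketch below exercises only those two rows; it does
not cover the rows where the tables differ. The helper name
matches_with_flags is hypothetical, and <regex.h> is assumed
to be either the system header or pcreposix.h under that name.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* assumption: system header, or pcreposix.h aliased as regex.h */

/* Hypothetical helper: returns 1 if `pattern` matches `subject` when compiled
   with the extra `cflags`, 0 if it does not, and -1 on a compile error. */
int matches_with_flags(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);

    if (regcomp(&re, pattern, REG_EXTENDED | cflags) != 0)
        return -1;

    rc = regexec(&re, subject, 0, NULL, 0);  /* no substrings requested */
    regfree(&re);
    return rc == 0;
}
```

With the subject "cat\ndog", the pattern ^dog should fail when
cflags is zero and succeed when cflags is REG_NEWLINE; the
pattern cat$ behaves the same way.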


MATCHING A PATTERN

The function regexec() is called to match a pre-compiled
pattern preg against a given string, which is terminated by a
zero byte, subject to the options in eflags. These can be:

  REG_NOTBOL

The PCRE_NOTBOL option is set when calling the underlying PCRE
matching function.

  REG_NOTEOL

The PCRE_NOTEOL option is set when calling the underlying PCRE
matching function.

The portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument,
which points to an array of nmatch structures of type
regmatch_t, containing the members rm_so and rm_eo. These
contain the offset to the first character of each substring
and the offset to the first character after the end of each
substring, respectively. The 0th element of the vector relates
to the entire portion of string that was matched; subsequent
elements relate to the capturing subpatterns of the regular
expression. Unused entries in the array have both structure
members set to -1.

A successful match yields a zero return; various error codes
are defined in the header file, of which REG_NOMATCH is the
"expected" failure code.
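
To make the rm_so and rm_eo offsets concrete, here is a
minimal sketch that copies one captured substring out of the
subject. The helper name extract_group and the MAX_GROUPS
limit are hypothetical, and <regex.h> is assumed to be either
the system header or pcreposix.h under that name.

```c
#include <assert.h>
#include <string.h>
#include <regex.h>   /* assumption: system header, or pcreposix.h aliased as regex.h */

#define MAX_GROUPS 10   /* arbitrary illustrative size for the pmatch array */

/* Hypothetical helper: match `pattern` against `subject` and copy substring
   number `group` (0 = whole match) into `out`, truncating if necessary.
   Returns the regcomp()/regexec() code: 0 on a match, REG_NOMATCH otherwise. */
int extract_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    int rc;

    assert(out != NULL && out_size > 0);
    out[0] = '\0';

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, subject, MAX_GROUPS, pm, 0);
    if (rc == 0 && group < MAX_GROUPS && pm[group].rm_so != -1) {
        /* rm_eo is the offset just past the end, so the difference is the length */
        size_t len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (len >= out_size)
            len = out_size - 1;
        memcpy(out, subject + pm[group].rm_so, len);
        out[len] = '\0';
    }

    regfree(&re);
    return rc;
}
```

For the pattern (cat|dog) sat and the subject "the dog sat on
the cat", element 0 should cover "dog sat" and element 1
should cover "dog".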


ERROR MESSAGES

The regerror() function maps a non-zero error code from either
regcomp() or regexec() to a printable message. If preg is not
NULL, the error should have arisen from the use of that
structure. A message terminated by a binary zero is placed in
errbuf. The length of the message, including the zero, is
limited to errbuf_size. The yield of the function is the size
of the buffer needed to hold the whole message.
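
Because the yield is the size needed for the whole message, a
caller can probe with a small buffer and then allocate exactly
enough. The following is a sketch of that idea only; the
helper name error_text and the probe size are hypothetical,
and <regex.h> is assumed to be either the system header or
pcreposix.h under that name.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <regex.h>   /* assumption: system header, or pcreposix.h aliased as regex.h */

/* Hypothetical helper: return a malloc'd copy of the full message for
   `errcode`, or NULL if memory cannot be obtained. The caller frees it. */
char *error_text(int errcode, const regex_t *preg)
{
    char probe[8];      /* deliberately small; the message may be truncated */
    size_t needed;
    char *msg;

    assert(errcode != 0);

    /* The return value is the size required for the whole message,
       including the terminating zero, even if probe was too small. */
    needed = regerror(errcode, preg, probe, sizeof probe);
    if (needed == 0)
        return NULL;

    msg = malloc(needed);
    if (msg == NULL)
        return NULL;

    if (needed <= sizeof probe)
        memcpy(msg, probe, needed);             /* probe already held it all */
    else
        regerror(errcode, preg, msg, needed);   /* fetch the untruncated text */
    return msg;
}
```

A typical use is to call error_text() with the non-zero value
returned by regcomp(), print the result, and then free() it.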


STORAGE

Compiling a regular expression causes memory to be allocated
and associated with the preg structure. The function regfree()
frees all such memory, after which preg may no longer be used
as a compiled expression.
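
One consequence is that a pattern can be compiled once, used
for any number of regexec() calls, and freed once at the end.
A minimal sketch of that lifetime follows; the helper name
count_matching is hypothetical, and <regex.h> is assumed to be
either the system header or pcreposix.h under that name.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* assumption: system header, or pcreposix.h aliased as regex.h */

/* Hypothetical helper: compile `pattern` once, test it against `n` subject
   strings, and free it. Returns the number of subjects that matched, or -1
   if the pattern does not compile. */
int count_matching(const char *pattern, const char *const subjects[], size_t n)
{
    regex_t re;
    size_t i;
    int hits = 0;

    assert(pattern != NULL && (subjects != NULL || n == 0));

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;                  /* nothing to free on failure */

    for (i = 0; i < n; i++) {
        if (regexec(&re, subjects[i], 0, NULL, 0) == 0)
            hits++;
    }

    regfree(&re);                   /* re must not be used after this */
    return hits;
}
```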


AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions


PCRE SAMPLE PROGRAM

A simple, complete demonstration program, to get you started
with using PCRE, is supplied in the file pcredemo.c in the
PCRE distribution.

The program compiles the regular expression that is its first
argument, and matches it against the subject string in its
second argument. No PCRE options are set, and default
character tables are used. If matching succeeds, the program
outputs the portion of the subject that matched, together with
the contents of any captured substrings.

If the -g option is given on the command line, the program
then goes on to check for further matches of the same regular
expression in the same subject string. The logic is a little
bit tricky because of the possibility of matching an empty
string. Comments in the code explain what is going on.

On a Unix system that has PCRE installed in /usr/local, you
can compile the demonstration program using a command like
this:

  gcc -o pcredemo pcredemo.c -I/usr/local/include \
      -L/usr/local/lib -lpcre

Then you can run simple tests like this:

  ./pcredemo 'cat|dog' 'the cat sat on the mat'
  ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

Note that there is a much more comprehensive test program,
called pcretest, which supports many more facilities for
testing regular expressions and the PCRE library. The pcredemo
program is provided as a simple coding example.

On some operating systems (e.g. Solaris) you may get an error
like this when you try to run pcredemo:

  ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such
  file or directory

This is caused by the way shared library support works on
those systems. You need to add

  -R/usr/local/lib

to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
