1 |
|
This file contains a concatenation of the PCRE man pages, converted to plain |
2 |
|
text format for ease of searching with a text editor, or for use on systems |
3 |
|
that do not have a man page processor. The small individual files that give |
4 |
|
synopses of each function in the library have not been included. There are |
5 |
|
separate text files for the pcregrep and pcretest commands. |
6 |
|
----------------------------------------------------------------------------- |
7 |
|
|
8 |
|
NAME |
9 |
|
PCRE - Perl-compatible regular expressions |
10 |
|
|
11 |
|
|
12 |
|
DESCRIPTION |
13 |
|
|
14 |
|
The PCRE library is a set of functions that implement regu- |
15 |
|
lar expression pattern matching using the same syntax and |
16 |
|
semantics as Perl, with just a few differences. The current |
17 |
|
implementation of PCRE (release 4.x) corresponds approxi- |
18 |
|
mately with Perl 5.8, including support for UTF-8 encoded |
19 |
|
strings. However, this support has to be explicitly |
20 |
|
enabled; it is not the default. |
21 |
|
|
22 |
|
PCRE is written in C and released as a C library. However, a |
23 |
|
number of people have written wrappers and interfaces of |
24 |
|
various kinds. A C++ class is included in these contribu- |
25 |
|
tions, which can be found in the Contrib directory at the |
26 |
|
primary FTP site, which is: |
27 |
|
|
28 |
|
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre |
29 |
|
|
30 |
|
Details of exactly which Perl regular expression features |
31 |
|
are and are not supported by PCRE are given in separate |
32 |
|
documents. See the pcrepattern and pcrecompat pages. |
33 |
|
|
34 |
|
Some features of PCRE can be included, excluded, or changed |
35 |
|
when the library is built. The pcre_config() function makes |
36 |
|
it possible for a client to discover which features are |
37 |
|
available. Documentation about building PCRE for various |
38 |
|
operating systems can be found in the README file in the |
39 |
|
source distribution. |
40 |
|
|
41 |
|
|
42 |
|
USER DOCUMENTATION |
43 |
|
|
44 |
|
The user documentation for PCRE has been split up into a |
45 |
|
number of different sections. In the "man" format, each of |
46 |
|
these is a separate "man page". In the HTML format, each is |
47 |
|
a separate page, linked from the index page. In the plain |
48 |
|
text format, all the sections are concatenated, for ease of |
49 |
|
searching. The sections are as follows: |
50 |
|
|
51 |
|
pcre this document |
52 |
|
pcreapi details of PCRE's native API |
53 |
|
pcrebuild options for building PCRE |
54 |
|
pcrecallout details of the callout feature |
55 |
|
pcrecompat discussion of Perl compatibility |
56 |
|
pcregrep description of the pcregrep command |
57 |
|
pcrepattern syntax and semantics of supported |
58 |
|
regular expressions |
59 |
|
pcreperform discussion of performance issues |
60 |
|
pcreposix the POSIX-compatible API |
61 |
|
pcresample discussion of the sample program |
62 |
|
pcretest the pcretest testing command |
63 |
|
|
64 |
|
In addition, in the "man" and HTML formats, there is a short |
65 |
|
page for each library function, listing its arguments and |
66 |
|
results. |
67 |
|
|
68 |
|
|
69 |
|
LIMITATIONS |
70 |
|
|
71 |
|
There are some size limitations in PCRE but it is hoped that |
72 |
|
they will never in practice be relevant. |
73 |
|
|
74 |
|
The maximum length of a compiled pattern is 65539 (sic) |
75 |
|
bytes if PCRE is compiled with the default internal linkage |
76 |
|
size of 2. If you want to process regular expressions that |
77 |
|
are truly enormous, you can compile PCRE with an internal |
78 |
|
linkage size of 3 or 4 (see the README file in the source |
79 |
|
distribution and the pcrebuild documentation for details). |
80 |
|
In these cases the limit is substantially larger. However, |
81 |
|
the speed of execution will be slower. |
82 |
|
|
83 |
|
All values in repeating quantifiers must be less than 65536. |
84 |
|
The maximum number of capturing subpatterns is 65535. |
85 |
|
|
86 |
|
There is no limit to the number of non-capturing subpat- |
87 |
|
terns, but the maximum depth of nesting of all kinds of |
88 |
|
parenthesized subpattern, including capturing subpatterns, |
89 |
|
assertions, and other types of subpattern, is 200. |
90 |
|
|
91 |
|
The maximum length of a subject string is the largest posi- |
92 |
|
tive number that an integer variable can hold. However, PCRE |
93 |
|
uses recursion to handle subpatterns and indefinite repeti- |
94 |
|
tion. This means that the available stack space may limit |
95 |
|
the size of a subject string that can be processed by cer- |
96 |
|
tain patterns. |
97 |
|
|
98 |
|
|
99 |
|
UTF-8 SUPPORT |
100 |
|
|
101 |
|
Starting at release 3.3, PCRE has had some support for char- |
102 |
|
acter strings encoded in the UTF-8 format. For release 4.0 |
103 |
|
this has been greatly extended to cover most common require- |
104 |
|
ments. |
105 |
|
|
106 |
|
In order to process UTF-8 strings, you must build PCRE to |
107 |
|
include UTF-8 support in the code, and, in addition, you |
108 |
|
must call pcre_compile() with the PCRE_UTF8 option flag. |
109 |
|
When you do this, both the pattern and any subject strings |
110 |
|
that are matched against it are treated as UTF-8 strings |
111 |
|
instead of just strings of bytes. |
112 |
|
|
113 |
|
If you compile PCRE with UTF-8 support, but do not use it at |
114 |
|
run time, the library will be a bit bigger, but the addi- |
115 |
|
tional run time overhead is limited to testing the PCRE_UTF8 |
116 |
|
flag in several places, so should not be very large. |
117 |
|
|
118 |
|
The following comments apply when PCRE is running in UTF-8 |
119 |
|
mode: |
120 |
|
|
121 |
|
1. When you set the PCRE_UTF8 flag, the strings passed as |
122 |
|
patterns and subjects are checked for validity on entry to |
123 |
|
the relevant functions. If an invalid UTF-8 string is |
124 |
|
passed, an error return is given. In some situations, you |
125 |
|
may already know that your strings are valid, and therefore |
126 |
|
want to skip these checks in order to improve performance. |
127 |
|
If you set the PCRE_NO_UTF8_CHECK flag at compile time or at |
128 |
|
run time, PCRE assumes that the pattern or subject it is |
129 |
|
given (respectively) contains only valid UTF-8 codes. In |
130 |
|
this case, it does not diagnose an invalid UTF-8 string. If |
131 |
|
you pass an invalid UTF-8 string to PCRE when |
132 |
|
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your |
133 |
|
program may crash. |
134 |
|
|
135 |
|
2. In a pattern, the escape sequence \x{...}, where the con- |
136 |
|
tents of the braces is a string of hexadecimal digits, is |
137 |
|
interpreted as a UTF-8 character whose code number is the |
138 |
|
given hexadecimal number, for example: \x{1234}. If a non- |
139 |
|
hexadecimal digit appears between the braces, the item is |
140 |
|
not recognized. This escape sequence can be used either as |
141 |
|
a literal, or within a character class. |
142 |
|
|
143 |
|
3. The original hexadecimal escape sequence, \xhh, matches a |
144 |
|
two-byte UTF-8 character if the value is greater than 127. |
145 |
|
|
146 |
|
4. Repeat quantifiers apply to complete UTF-8 characters, |
147 |
|
not to individual bytes, for example: \x{100}{3}. |
148 |
|
|
149 |
|
5. The dot metacharacter matches one UTF-8 character instead |
150 |
|
of a single byte. |
151 |
|
|
152 |
|
6. The escape sequence \C can be used to match a single byte |
153 |
|
in UTF-8 mode, but its use can lead to some strange effects. |
154 |
|
|
155 |
|
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W |
156 |
|
correctly test characters of any code value, but the charac- |
157 |
|
ters that PCRE recognizes as digits, spaces, or word charac- |
158 |
|
ters remain the same set as before, all with values less |
159 |
|
than 256. |
160 |
|
|
161 |
|
8. Case-insensitive matching applies only to characters |
162 |
|
whose values are less than 256. PCRE does not support the |
163 |
|
notion of "case" for higher-valued characters. |
164 |
|
|
165 |
|
9. PCRE does not support the use of Unicode tables and pro- |
166 |
|
perties or the Perl escapes \p, \P, and \X. |
167 |
|
|
168 |
|
|
169 |
|
AUTHOR |
170 |
|
|
171 |
|
Philip Hazel <ph10@cam.ac.uk> |
172 |
|
University Computing Service, |
173 |
|
Cambridge CB2 3QG, England. |
174 |
|
Phone: +44 1223 334714 |
175 |
|
|
176 |
|
Last updated: 20 August 2003 |
177 |
|
Copyright (c) 1997-2003 University of Cambridge. |
178 |
|
----------------------------------------------------------------------------- |
179 |
|
|
180 |
|
NAME |
181 |
|
PCRE - Perl-compatible regular expressions |
182 |
|
|
183 |
|
|
184 |
|
PCRE BUILD-TIME OPTIONS |
185 |
|
|
186 |
|
This document describes the optional features of PCRE that |
187 |
|
can be selected when the library is compiled. They are all |
188 |
|
selected, or deselected, by providing options to the config- |
189 |
|
ure script which is run before the make command. The com- |
190 |
|
plete list of options for configure (which includes the |
191 |
|
standard ones such as the selection of the installation |
192 |
|
directory) can be obtained by running |
193 |
|
|
194 |
|
./configure --help |
195 |
|
|
196 |
|
The following sections describe certain options whose names |
197 |
|
begin with --enable or --disable. These settings specify |
198 |
|
changes to the defaults for the configure command. Because |
199 |
|
of the way that configure works, --enable and --disable |
200 |
|
always come in pairs, so the complementary option always |
201 |
|
exists as well, but as it specifies the default, it is not |
202 |
|
described. |
203 |
|
|
204 |
|
|
205 |
|
UTF-8 SUPPORT |
206 |
|
|
207 |
|
To build PCRE with support for UTF-8 character strings, add |
208 |
|
|
209 |
|
--enable-utf8 |
210 |
|
|
211 |
|
to the configure command. Of itself, this does not make PCRE |
212 |
|
treat strings as UTF-8. As well as compiling PCRE with this |
213 |
|
option, you also have to set the PCRE_UTF8 option when |
214 |
|
you call the pcre_compile() function. |
215 |
|
|
216 |
|
|
217 |
|
CODE VALUE OF NEWLINE |
218 |
|
|
219 |
|
By default, PCRE treats character 10 (linefeed) as the new- |
220 |
|
line character. This is the normal newline character on |
221 |
|
Unix-like systems. You can compile PCRE to use character 13 |
222 |
|
(carriage return) instead by adding |
223 |
|
|
224 |
|
--enable-newline-is-cr |
225 |
|
|
226 |
|
to the configure command. For completeness there is also a |
227 |
|
--enable-newline-is-lf option, which explicitly specifies |
228 |
|
linefeed as the newline character. |
229 |
|
|
230 |
|
|
231 |
|
BUILDING SHARED AND STATIC LIBRARIES |
232 |
|
|
233 |
|
The PCRE building process uses libtool to build both shared |
234 |
|
and static Unix libraries by default. You can suppress one |
235 |
|
of these by adding one of |
236 |
|
|
237 |
|
--disable-shared |
238 |
|
--disable-static |
239 |
|
|
240 |
|
to the configure command, as required. |
241 |
|
|
242 |
|
|
243 |
|
POSIX MALLOC USAGE |
244 |
|
|
245 |
|
When PCRE is called through the POSIX interface (see the |
246 |
|
pcreposix documentation), additional working storage is |
247 |
|
required for holding the pointers to capturing substrings |
248 |
|
because PCRE requires three integers per substring, whereas |
249 |
|
the POSIX interface provides only two. If the number of |
250 |
|
expected substrings is small, the wrapper function uses |
251 |
|
space on the stack, because this is faster than using mal- |
252 |
|
loc() for each call. The default threshold above which the |
253 |
|
stack is no longer used is 10; it can be changed by adding a |
254 |
|
setting such as |
255 |
|
|
256 |
|
--with-posix-malloc-threshold=20 |
257 |
|
|
258 |
|
to the configure command. |
259 |
|
|
260 |
|
|
261 |
|
LIMITING PCRE RESOURCE USAGE |
262 |
|
|
263 |
|
Internally, PCRE has a function called match() which it |
264 |
|
calls repeatedly (possibly recursively) when performing a |
265 |
|
matching operation. By limiting the number of times this |
266 |
|
function may be called, a limit can be placed on the |
267 |
|
resources used by a single call to pcre_exec(). The limit |
268 |
|
can be changed at run time, as described in the pcreapi |
269 |
|
documentation. The default is 10 million, but this can be |
270 |
|
changed by adding a setting such as |
271 |
|
|
272 |
|
--with-match-limit=500000 |
273 |
|
|
274 |
|
to the configure command. |
275 |
|
|
276 |
|
|
277 |
|
HANDLING VERY LARGE PATTERNS |
278 |
|
|
279 |
|
Within a compiled pattern, offset values are used to point |
280 |
|
from one part to another (for example, from an opening |
281 |
|
parenthesis to an alternation metacharacter). By default |
282 |
|
two-byte values are used for these offsets, leading to a |
283 |
|
maximum size for a compiled pattern of around 64K. This is |
284 |
|
sufficient to handle all but the most gigantic patterns. |
285 |
|
Nevertheless, some people do want to process enormous pat- |
286 |
|
terns, so it is possible to compile PCRE to use three-byte |
287 |
|
or four-byte offsets by adding a setting such as |
288 |
|
|
289 |
|
--with-link-size=3 |
290 |
|
|
291 |
|
to the configure command. The value given must be 2, 3, or |
292 |
|
4. Using longer offsets slows down the operation of PCRE |
293 |
|
because it has to load additional bytes when handling them. |
294 |
|
|
295 |
|
If you build PCRE with an increased link size, test 2 (and |
296 |
|
test 5 if you are using UTF-8) will fail. Part of the output |
297 |
|
of these tests is a representation of the compiled pattern, |
298 |
|
and this changes with the link size. |
299 |
|
|
300 |
|
Last updated: 21 January 2003 |
301 |
|
Copyright (c) 1997-2003 University of Cambridge. |
302 |
|
----------------------------------------------------------------------------- |
303 |
|
|
304 |
NAME |
NAME |
305 |
pcre - Perl-compatible regular expressions. |
PCRE - Perl-compatible regular expressions |
306 |
|
|
307 |
|
|
308 |
|
SYNOPSIS OF PCRE API |
309 |
|
|
|
SYNOPSIS |
|
310 |
#include <pcre.h> |
#include <pcre.h> |
311 |
|
|
312 |
pcre *pcre_compile(const char *pattern, int options, |
pcre *pcre_compile(const char *pattern, int options, |
320 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
321 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
322 |
|
|
323 |
|
int pcre_copy_named_substring(const pcre *code, |
324 |
|
const char *subject, int *ovector, |
325 |
|
int stringcount, const char *stringname, |
326 |
|
char *buffer, int buffersize); |
327 |
|
|
328 |
int pcre_copy_substring(const char *subject, int *ovector, |
int pcre_copy_substring(const char *subject, int *ovector, |
329 |
int stringcount, int stringnumber, char *buffer, |
int stringcount, int stringnumber, char *buffer, |
330 |
int buffersize); |
int buffersize); |
331 |
|
|
332 |
|
int pcre_get_named_substring(const pcre *code, |
333 |
|
const char *subject, int *ovector, |
334 |
|
int stringcount, const char *stringname, |
335 |
|
const char **stringptr); |
336 |
|
|
337 |
|
int pcre_get_stringnumber(const pcre *code, |
338 |
|
const char *name); |
339 |
|
|
340 |
int pcre_get_substring(const char *subject, int *ovector, |
int pcre_get_substring(const char *subject, int *ovector, |
341 |
int stringcount, int stringnumber, |
int stringcount, int stringnumber, |
342 |
const char **stringptr); |
const char **stringptr); |
353 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
354 |
int what, void *where); |
int what, void *where); |
355 |
|
|
356 |
|
|
357 |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
358 |
|
|
359 |
|
int pcre_config(int what, void *where); |
360 |
|
|
361 |
char *pcre_version(void); |
char *pcre_version(void); |
362 |
|
|
363 |
void *(*pcre_malloc)(size_t); |
void *(*pcre_malloc)(size_t); |
364 |
|
|
365 |
void (*pcre_free)(void *); |
void (*pcre_free)(void *); |
366 |
|
|
367 |
|
int (*pcre_callout)(pcre_callout_block *); |
368 |
|
|
369 |
|
|
370 |
|
PCRE API |
|
DESCRIPTION |
|
|
The PCRE library is a set of functions that implement regu- |
|
|
lar expression pattern matching using the same syntax and |
|
|
semantics as Perl 5, with just a few differences (see |
|
|
|
|
|
below). The current implementation corresponds to Perl |
|
|
5.005, with some additional features from later versions. |
|
|
This includes some experimental, incomplete support for |
|
|
UTF-8 encoded strings. Details of exactly what is and what |
|
|
is not supported are given below. |
|
371 |
|
|
372 |
PCRE has its own native API, which is described in this |
PCRE has its own native API, which is described in this |
373 |
document. There is also a set of wrapper functions that |
document. There is also a set of wrapper functions that |
386 |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
387 |
are used for compiling and matching regular expressions. A |
are used for compiling and matching regular expressions. A |
388 |
sample program that demonstrates the simplest way of using |
sample program that demonstrates the simplest way of using |
389 |
them is given in the file pcredemo.c. The last section of |
them is given in the file pcredemo.c. The pcresample docu- |
390 |
this man page describes how to run it. |
mentation describes how to run it. |
391 |
|
|
392 |
The functions pcre_copy_substring(), pcre_get_substring(), |
There are convenience functions for extracting captured sub- |
393 |
and pcre_get_substring_list() are convenience functions for |
strings from a matched subject string. They are: |
394 |
extracting captured substrings from a matched subject |
|
395 |
string; pcre_free_substring() and pcre_free_substring_list() |
pcre_copy_substring() |
396 |
are also provided, to free the memory used for extracted |
pcre_copy_named_substring() |
397 |
|
pcre_get_substring() |
398 |
|
pcre_get_named_substring() |
399 |
|
pcre_get_substring_list() |
400 |
|
|
401 |
|
pcre_free_substring() and pcre_free_substring_list() are |
402 |
|
also provided, to free the memory used for extracted |
403 |
strings. |
strings. |
404 |
|
|
405 |
The function pcre_maketables() is used (optionally) to build |
The function pcre_maketables() is used (optionally) to build |
420 |
replace them if it wishes to intercept the calls. This |
replace them if it wishes to intercept the calls. This |
421 |
should be done before calling any PCRE functions. |
should be done before calling any PCRE functions. |
422 |
|
|
423 |
|
The global variable pcre_callout initially contains NULL. It |
424 |
|
can be set by the caller to a "callout" function, which PCRE |
425 |
|
will then call at specified points during a matching opera- |
426 |
|
tion. Details are given in the pcrecallout documentation. |
427 |
|
|
428 |
|
|
429 |
MULTI-THREADING |
MULTITHREADING |
430 |
|
|
431 |
The PCRE functions can be used in multi-threading applica- |
The PCRE functions can be used in multi-threading applica- |
432 |
tions, with the proviso that the memory management functions |
tions, with the proviso that the memory management functions |
433 |
pointed to by pcre_malloc and pcre_free are shared by all |
pointed to by pcre_malloc and pcre_free, and the callout |
434 |
|
function pointed to by pcre_callout, are shared by all |
435 |
threads. |
threads. |
436 |
|
|
437 |
The compiled form of a regular expression is not altered |
The compiled form of a regular expression is not altered |
439 |
used by several threads at once. |
used by several threads at once. |
440 |
|
|
441 |
|
|
442 |
|
CHECKING BUILD-TIME OPTIONS |
443 |
|
|
444 |
|
int pcre_config(int what, void *where); |
445 |
|
|
446 |
|
The function pcre_config() makes it possible for a PCRE |
447 |
|
client to discover which optional features have been com- |
448 |
|
piled into the PCRE library. The pcrebuild documentation has |
449 |
|
more details about these optional features. |
450 |
|
|
451 |
|
The first argument for pcre_config() is an integer, specify- |
452 |
|
ing which information is required; the second argument is a |
453 |
|
pointer to a variable into which the information is placed. |
454 |
|
The following information is available: |
455 |
|
|
456 |
|
PCRE_CONFIG_UTF8 |
457 |
|
|
458 |
|
The output is an integer that is set to one if UTF-8 support |
459 |
|
is available; otherwise it is set to zero. |
460 |
|
|
461 |
|
PCRE_CONFIG_NEWLINE |
462 |
|
|
463 |
|
The output is an integer that is set to the value of the |
464 |
|
code that is used for the newline character. It is either |
465 |
|
linefeed (10) or carriage return (13), and should normally |
466 |
|
be the standard character for your operating system. |
467 |
|
|
468 |
|
PCRE_CONFIG_LINK_SIZE |
469 |
|
|
470 |
|
The output is an integer that contains the number of bytes |
471 |
|
used for internal linkage in compiled regular expressions. |
472 |
|
The value is 2, 3, or 4. Larger values allow larger regular |
473 |
|
expressions to be compiled, at the expense of slower match- |
474 |
|
ing. The default value of 2 is sufficient for all but the |
475 |
|
most massive patterns, since it allows the compiled pattern |
476 |
|
to be up to 64K in size. |
477 |
|
|
478 |
|
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD |
479 |
|
|
480 |
|
The output is an integer that contains the threshold above |
481 |
|
which the POSIX interface uses malloc() for output vectors. |
482 |
|
Further details are given in the pcreposix documentation. |
483 |
|
|
484 |
|
PCRE_CONFIG_MATCH_LIMIT |
485 |
|
|
486 |
|
The output is an integer that gives the default limit for |
487 |
|
the number of internal matching function calls in a |
488 |
|
pcre_exec() execution. Further details are given with |
489 |
|
pcre_exec() below. |
490 |
|
|
491 |
|
|
492 |
COMPILING A PATTERN |
COMPILING A PATTERN |
493 |
|
|
494 |
|
pcre *pcre_compile(const char *pattern, int options, |
495 |
|
const char **errptr, int *erroffset, |
496 |
|
const unsigned char *tableptr); |
497 |
|
|
498 |
The function pcre_compile() is called to compile a pattern |
The function pcre_compile() is called to compile a pattern |
499 |
into an internal form. The pattern is a C string terminated |
into an internal form. The pattern is a C string terminated |
500 |
by a binary zero, and is passed in the argument pattern. A |
by a binary zero, and is passed in the argument pattern. A |
510 |
pcre data block is not fully relocatable, because it con- |
pcre data block is not fully relocatable, because it con- |
511 |
tains a copy of the tableptr argument, which is an address |
tains a copy of the tableptr argument, which is an address |
512 |
(see below). |
(see below). |
|
|
|
|
The size of a compiled pattern is roughly proportional to |
|
|
the length of the pattern string, except that each character |
|
|
class (other than those containing just a single character, |
|
|
negated or not) requires 33 bytes, and repeat quantifiers |
|
|
with a minimum greater than one or a bounded maximum cause |
|
|
the relevant portions of the compiled pattern to be repli- |
|
|
cated. |
|
|
|
|
513 |
The options argument contains independent bits that affect |
The options argument contains independent bits that affect |
514 |
the compilation. It should be zero if no options are |
the compilation. It should be zero if no options are |
515 |
required. Some of the options, in particular, those that are |
required. Some of the options, in particular, those that are |
516 |
compatible with Perl, can also be set and unset from within |
compatible with Perl, can also be set and unset from within |
517 |
the pattern (see the detailed description of regular expres- |
the pattern (see the detailed description of regular expres- |
518 |
sions below). For these options, the contents of the options |
sions in the pcrepattern documentation). For these options, |
519 |
argument specifies their initial settings at the start of |
the contents of the options argument specifies their initial |
520 |
compilation and execution. The PCRE_ANCHORED option can be |
settings at the start of compilation and execution. The |
521 |
set at the time of matching as well as at compile time. |
PCRE_ANCHORED option can be set at the time of matching as |
522 |
|
well as at compile time. |
523 |
|
|
524 |
If errptr is NULL, pcre_compile() returns NULL immediately. |
If errptr is NULL, pcre_compile() returns NULL immediately. |
525 |
Otherwise, if compilation of a pattern fails, pcre_compile() |
Otherwise, if compilation of a pattern fails, pcre_compile() |
549 |
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
550 |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
551 |
|
|
552 |
The following option bits are defined in the header file: |
The following option bits are defined: |
553 |
|
|
554 |
PCRE_ANCHORED |
PCRE_ANCHORED |
555 |
|
|
556 |
If this bit is set, the pattern is forced to be "anchored", |
If this bit is set, the pattern is forced to be "anchored", |
557 |
that is, it is constrained to match only at the start of the |
that is, it is constrained to match only at the first match- |
558 |
string which is being searched (the "subject string"). This |
ing point in the string which is being searched (the "sub- |
559 |
effect can also be achieved by appropriate constructs in the |
ject string"). This effect can also be achieved by appropri- |
560 |
pattern itself, which is the only way to do it in Perl. |
ate constructs in the pattern itself, which is the only way |
561 |
|
to do it in Perl. |
562 |
|
|
563 |
PCRE_CASELESS |
PCRE_CASELESS |
564 |
|
|
565 |
If this bit is set, letters in the pattern match both upper |
If this bit is set, letters in the pattern match both upper |
566 |
and lower case letters. It is equivalent to Perl's /i |
and lower case letters. It is equivalent to Perl's /i |
567 |
option. |
option, and it can be changed within a pattern by a (?i) |
568 |
|
option setting. |
569 |
|
|
570 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
571 |
|
|
575 |
character if it is a newline (but not before any other new- |
character if it is a newline (but not before any other new- |
576 |
lines). The PCRE_DOLLAR_ENDONLY option is ignored if |
lines). The PCRE_DOLLAR_ENDONLY option is ignored if |
577 |
PCRE_MULTILINE is set. There is no equivalent to this option |
PCRE_MULTILINE is set. There is no equivalent to this option |
578 |
in Perl. |
in Perl, and no way to set it within a pattern. |
579 |
|
|
580 |
PCRE_DOTALL |
PCRE_DOTALL |
581 |
|
|
582 |
If this bit is set, a dot metacharacter in the pattern
If this bit is set, a dot metacharacter in the pattern
583 |
matches all characters, including newlines. Without it, new- |
matches all characters, including newlines. Without it, new- |
584 |
lines are excluded. This option is equivalent to Perl's /s |
lines are excluded. This option is equivalent to Perl's /s |
585 |
option. A negative class such as [^a] always matches a new- |
option, and it can be changed within a pattern by a (?s) |
586 |
line character, independent of the setting of this option. |
option setting. A negative class such as [^a] always matches |
587 |
|
a newline character, independent of the setting of this |
588 |
|
option. |
589 |
|
|
590 |
PCRE_EXTENDED |
PCRE_EXTENDED |
591 |
|
|
592 |
If this bit is set, whitespace data characters in the pat- |
If this bit is set, whitespace data characters in the pat- |
593 |
tern are totally ignored except when escaped or inside a |
tern are totally ignored except when escaped or inside a |
594 |
character class, and characters between an unescaped # out- |
character class. Whitespace does not include the VT charac- |
595 |
side a character class and the next newline character, |
ter (code 11). In addition, characters between an unescaped |
596 |
|
# outside a character class and the next newline character, |
597 |
inclusive, are also ignored. This is equivalent to Perl's /x |
inclusive, are also ignored. This is equivalent to Perl's /x |
598 |
option, and makes it possible to include comments inside |
option, and it can be changed within a pattern by a (?x) |
599 |
complicated patterns. Note, however, that this applies only |
option setting. |
600 |
to data characters. Whitespace characters may never appear |
|
601 |
|
This option makes it possible to include comments inside |
602 |
|
complicated patterns. Note, however, that this applies only |
603 |
|
to data characters. Whitespace characters may never appear |
604 |
within special character sequences in a pattern, for example |
within special character sequences in a pattern, for example |
605 |
within the sequence (?( which introduces a conditional sub- |
within the sequence (?( which introduces a conditional sub- |
606 |
pattern. |
pattern. |
607 |
|
|
608 |
PCRE_EXTRA |
PCRE_EXTRA |
632 |
of line" constructs match immediately following or immedi- |
of line" constructs match immediately following or immedi- |
633 |
ately before any newline in the subject string, respec- |
ately before any newline in the subject string, respec- |
634 |
tively, as well as at the very start and end. This is |
tively, as well as at the very start and end. This is |
635 |
equivalent to Perl's /m option. If there are no "\n" charac- |
equivalent to Perl's /m option, and it can be changed within |
636 |
ters in a subject string, or no occurrences of ^ or $ in a |
a pattern by a (?m) option setting. If there are no "\n" |
637 |
pattern, setting PCRE_MULTILINE has no effect. |
characters in a subject string, or no occurrences of ^ or $ |
638 |
|
in a pattern, setting PCRE_MULTILINE has no effect. |
639 |
|
|
640 |
|
PCRE_NO_AUTO_CAPTURE |
641 |
|
|
642 |
|
If this option is set, it disables the use of numbered cap- |
643 |
|
turing parentheses in the pattern. Any opening parenthesis |
644 |
|
that is not followed by ? behaves as if it were followed by |
645 |
|
?: but named parentheses can still be used for capturing |
646 |
|
(and they acquire numbers in the usual way). There is no |
647 |
|
equivalent of this option in Perl. |
648 |
|
|
649 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
650 |
|
|
656 |
PCRE_UTF8 |
PCRE_UTF8 |
657 |
|
|
658 |
This option causes PCRE to regard both the pattern and the |
This option causes PCRE to regard both the pattern and the |
659 |
subject as strings of UTF-8 characters instead of just byte |
subject as strings of UTF-8 characters instead of single- |
660 |
strings. However, it is available only if PCRE has been |
byte character strings. However, it is available only if |
661 |
built to include UTF-8 support. If not, the use of this |
PCRE has been built to include UTF-8 support. If not, the |
662 |
option provokes an error. Support for UTF-8 is new, experi- |
use of this option provokes an error. Details of how this |
663 |
mental, and incomplete. Details of exactly what it entails |
option changes the behaviour of PCRE are given in the sec- |
664 |
are given below. |
tion on UTF-8 support in the main pcre page. |
665 |
|
|
666 |
|
PCRE_NO_UTF8_CHECK |
667 |
|
|
668 |
|
When PCRE_UTF8 is set, the validity of the pattern as a |
669 |
|
UTF-8 string is automatically checked. If an invalid UTF-8 |
670 |
|
sequence of bytes is found, pcre_compile() returns an error. |
671 |
|
If you already know that your pattern is valid, and you want |
672 |
|
to skip this check for performance reasons, you can set the |
673 |
|
PCRE_NO_UTF8_CHECK option. When it is set, the effect of |
674 |
|
passing an invalid UTF-8 string as a pattern is undefined. |
675 |
|
It may cause your program to crash. Note that there is a |
676 |
|
similar option for suppressing the checking of subject |
677 |
|
strings passed to pcre_exec(). |
678 |
|
|
679 |
|
|
680 |
|
|
681 |
STUDYING A PATTERN |
STUDYING A PATTERN |
682 |
|
|
683 |
|
pcre_extra *pcre_study(const pcre *code, int options, |
684 |
|
const char **errptr); |
685 |
|
|
686 |
When a pattern is going to be used several times, it is |
When a pattern is going to be used several times, it is |
687 |
worth spending more time analyzing it in order to speed up |
worth spending more time analyzing it in order to speed up |
688 |
the time taken for matching. The function pcre_study() takes |
the time taken for matching. The function pcre_study() takes |
689 |
a pointer to a compiled pattern as its first argument, and |
a pointer to a compiled pattern as its first argument. If |
690 |
returns a pointer to a pcre_extra block (another typedef for |
studying the pattern produces additional information that |
691 |
a structure with hidden contents) containing additional |
will help speed up matching, pcre_study() returns a pointer |
692 |
information about the pattern; this can be passed to |
to a pcre_extra block, in which the study_data field points |
693 |
pcre_exec(). If no additional information is available, NULL |
to the results of the study. |
694 |
is returned. |
|
695 |
|
The value returned by pcre_study() can be passed |
696 |
|
directly to pcre_exec(). However, the pcre_extra block also |
697 |
|
contains other fields that can be set by the caller before |
698 |
|
the block is passed; these are described below. If studying |
699 |
|
the pattern does not produce any additional information, |
700 |
|
pcre_study() returns NULL. In that circumstance, if the cal- |
701 |
|
ling program wants to pass some of the other fields to |
702 |
|
pcre_exec(), it must set up its own pcre_extra block. |
703 |
|
|
704 |
The second argument contains option bits. At present, no |
The second argument contains option bits. At present, no |
705 |
options are defined for pcre_study(), and this argument |
options are defined for pcre_study(), and this argument |
706 |
should always be zero. |
should always be zero. |
707 |
|
|
708 |
The third argument for pcre_study() is a pointer to an error |
The third argument for pcre_study() is a pointer for an |
709 |
message. If studying succeeds (even if no data is returned), |
error message. If studying succeeds (even if no data is |
710 |
the variable it points to is set to NULL. Otherwise it |
returned), the variable it points to is set to NULL. Other- |
711 |
points to a textual error message. |
wise it points to a textual error message. You should there- |
712 |
|
fore test the error pointer for NULL after calling |
713 |
|
pcre_study(), to be sure that it has run successfully. |
714 |
|
|
715 |
This is a typical call to pcre_study(): |
This is a typical call to pcre_study(): |
716 |
|
|
726 |
created. |
created. |
727 |
|
|
728 |
|
|
|
|
|
729 |
LOCALE SUPPORT |
LOCALE SUPPORT |
730 |
|
|
731 |
PCRE handles caseless matching, and determines whether char- |
PCRE handles caseless matching, and determines whether char- |
732 |
acters are letters, digits, or whatever, by reference to a |
acters are letters, digits, or whatever, by reference to a |
733 |
set of tables. The library contains a default set of tables |
set of tables. When running in UTF-8 mode, this applies only |
734 |
which is created in the default C locale when PCRE is com- |
to characters with codes less than 256. The library contains |
735 |
piled. This is used when the final argument of |
a default set of tables that is created in the default C |
736 |
pcre_compile() is NULL, and is sufficient for many applica- |
locale when PCRE is compiled. This is used when the final |
737 |
tions. |
argument of pcre_compile() is NULL, and is sufficient for |
738 |
|
many applications. |
739 |
|
|
740 |
An alternative set of tables can, however, be supplied. Such |
An alternative set of tables can, however, be supplied. Such |
741 |
tables are built by calling the pcre_maketables() function, |
tables are built by calling the pcre_maketables() function, |
753 |
The tables are built in memory that is obtained via |
The tables are built in memory that is obtained via |
754 |
pcre_malloc. The pointer that is passed to pcre_compile is |
pcre_malloc. The pointer that is passed to pcre_compile is |
755 |
saved with the compiled pattern, and the same tables are |
saved with the compiled pattern, and the same tables are |
756 |
used via this pointer by pcre_study() and pcre_exec(). Thus |
used via this pointer by pcre_study() and pcre_exec(). Thus, |
757 |
for any single pattern, compilation, studying and matching |
for any single pattern, compilation, studying and matching |
758 |
all happen in the same locale, but different patterns can be |
all happen in the same locale, but different patterns can be |
759 |
compiled in different locales. It is the caller's responsi- |
compiled in different locales. It is the caller's responsi- |
761 |
remains available for as long as it is needed. |
remains available for as long as it is needed. |
762 |
|
|
763 |
|
|
|
|
|
764 |
INFORMATION ABOUT A PATTERN |
INFORMATION ABOUT A PATTERN |
765 |
|
|
766 |
|
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
767 |
|
int what, void *where); |
768 |
|
|
769 |
The pcre_fullinfo() function returns information about a |
The pcre_fullinfo() function returns information about a |
770 |
compiled pattern. It replaces the obsolete pcre_info() func- |
compiled pattern. It replaces the obsolete pcre_info() func- |
771 |
tion, which is nevertheless retained for backwards compati- |
tion, which is nevertheless retained for backwards compati- |
772 |
bility (and is documented below). |
bility (and is documented below). |
|
|
|
773 |
The first argument for pcre_fullinfo() is a pointer to the |
The first argument for pcre_fullinfo() is a pointer to the |
774 |
compiled pattern. The second argument is the result of |
compiled pattern. The second argument is the result of |
775 |
pcre_study(), or NULL if the pattern was not studied. The |
pcre_study(), or NULL if the pattern was not studied. The |
776 |
third argument specifies which piece of information is |
third argument specifies which piece of information is |
777 |
required, while the fourth argument is a pointer to a vari- |
required, and the fourth argument is a pointer to a variable |
778 |
able to receive the data. The yield of the function is zero |
to receive the data. The yield of the function is zero for |
779 |
for success, or one of the following negative numbers: |
success, or one of the following negative numbers: |
780 |
|
|
781 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
782 |
the argument where was NULL |
the argument where was NULL |
797 |
The possible values for the third argument are defined in |
The possible values for the third argument are defined in |
798 |
pcre.h, and are as follows: |
pcre.h, and are as follows: |
799 |
|
|
800 |
PCRE_INFO_OPTIONS |
PCRE_INFO_BACKREFMAX |
|
|
|
|
Return a copy of the options with which the pattern was com- |
|
|
piled. The fourth argument should point to an unsigned long |
|
|
int variable. These option bits are those specified in the |
|
|
call to pcre_compile(), modified by any top-level option |
|
|
settings within the pattern itself, and with the |
|
|
PCRE_ANCHORED bit forcibly set if the form of the pattern |
|
|
implies that it can match only at the start of a subject |
|
|
string. |
|
|
|
|
|
PCRE_INFO_SIZE |
|
801 |
|
|
802 |
Return the size of the compiled pattern, that is, the value |
Return the number of the highest back reference in the pat- |
803 |
that was passed as the argument to pcre_malloc() when PCRE |
tern. The fourth argument should point to an int variable. |
804 |
was getting memory in which to place the compiled data. The |
Zero is returned if there are no back references. |
|
fourth argument should point to a size_t variable. |
|
805 |
|
|
806 |
PCRE_INFO_CAPTURECOUNT |
PCRE_INFO_CAPTURECOUNT |
807 |
|
|
808 |
Return the number of capturing subpatterns in the pattern. |
Return the number of capturing subpatterns in the pattern. |
809 |
The fourth argument should point to an int variable. |
The fourth argument should point to an int variable. |
810 |
|
|
811 |
PCRE_INFO_BACKREFMAX |
PCRE_INFO_FIRSTBYTE |
|
|
|
|
Return the number of the highest back reference in the pat- |
|
|
tern. The fourth argument should point to an int variable. |
|
|
Zero is returned if there are no back references. |
|
812 |
|
|
813 |
PCRE_INFO_FIRSTCHAR |
Return information about the first byte of any matched |
814 |
|
string, for a non-anchored pattern. (This option used to be |
815 |
|
called PCRE_INFO_FIRSTCHAR; the old name is still recognized |
816 |
|
for backwards compatibility.) |
817 |
|
|
818 |
Return information about the first character of any matched |
If there is a fixed first byte, e.g. from a pattern such as |
|
string, for a non-anchored pattern. If there is a fixed |
|
|
first character, e.g. from a pattern such as |
|
819 |
(cat|cow|coyote), it is returned in the integer pointed to |
(cat|cow|coyote), it is returned in the integer pointed to |
820 |
by where. Otherwise, if either |
by where. Otherwise, if either |
821 |
|
|
827 |
anchored), |
anchored), |
828 |
|
|
829 |
-1 is returned, indicating that the pattern matches only at |
-1 is returned, indicating that the pattern matches only at |
830 |
the start of a subject string or after any "\n" within the |
the start of a subject string or after any newline within |
831 |
string. Otherwise -2 is returned. For anchored patterns, -2 |
the string. Otherwise -2 is returned. For anchored patterns, |
832 |
is returned. |
-2 is returned. |
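
The three kinds of result described above for PCRE_INFO_FIRSTBYTE can be told apart with a small helper. This is only a sketch: classify_first_byte() and the enum names are hypothetical and not part of PCRE. It assumes the integer has already been obtained through the fourth argument of pcre_fullinfo().

```c
/* Hypothetical helper, not part of PCRE: interpret the integer that a
   PCRE_INFO_FIRSTBYTE request stores through the "where" argument. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRST_BYTE_LINE_START,  /* -1: match only at subject start or after a newline */
    FIRST_BYTE_UNKNOWN      /* -2: nothing recorded (also given for anchored patterns) */
};

enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_UNKNOWN;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be the code of "c", which this helper reports as FIRST_BYTE_FIXED.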
833 |
|
|
834 |
PCRE_INFO_FIRSTTABLE |
PCRE_INFO_FIRSTTABLE |
835 |
|
|
836 |
If the pattern was studied, and this resulted in the con- |
If the pattern was studied, and this resulted in the con- |
837 |
struction of a 256-bit table indicating a fixed set of char- |
struction of a 256-bit table indicating a fixed set of bytes |
838 |
acters for the first character in any matching string, a |
for the first byte in any matching string, a pointer to the |
839 |
pointer to the table is returned. Otherwise NULL is |
table is returned. Otherwise NULL is returned. The fourth |
840 |
returned. The fourth argument should point to an unsigned |
argument should point to an unsigned char * variable. |
|
char * variable. |
|
841 |
|
|
842 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
843 |
|
|
844 |
For a non-anchored pattern, return the value of the right- |
Return the value of the rightmost literal byte that must |
845 |
most literal character which must exist in any matched |
exist in any matched string, other than at its start, if |
846 |
string, other than at its start. The fourth argument should |
such a byte has been recorded. The fourth argument should |
847 |
point to an int variable. If there is no such character, or |
point to an int variable. If there is no such byte, -1 is |
848 |
if the pattern is anchored, -1 is returned. For example, for |
returned. For anchored patterns, a last literal byte is |
849 |
the pattern /a\d+z\d+/ the returned value is 'z'. |
recorded only if it follows something of variable length. |
850 |
|
For example, for the pattern /^a\d+z\d+/ the returned value |
851 |
|
is "z", but for /^a\dz\d/ the returned value is -1. |
852 |
|
|
853 |
|
PCRE_INFO_NAMECOUNT |
854 |
|
PCRE_INFO_NAMEENTRYSIZE |
855 |
|
PCRE_INFO_NAMETABLE |
856 |
|
|
857 |
|
PCRE supports the use of named as well as numbered capturing |
858 |
|
parentheses. The names are just an additional way of identi- |
859 |
|
fying the parentheses, which still acquire a number. A |
860 |
|
caller that wants to extract data from a named subpattern |
861 |
|
must convert the name to a number in order to access the |
862 |
|
correct pointers in the output vector (described with |
863 |
|
pcre_exec() below). In order to do this, it must first use |
864 |
|
these three values to obtain the name-to-number mapping |
865 |
|
table for the pattern. |
866 |
|
|
867 |
|
The map consists of a number of fixed-size entries. |
868 |
|
PCRE_INFO_NAMECOUNT gives the number of entries, and |
869 |
|
PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both |
870 |
|
of these return an int value. The entry size depends on the |
871 |
|
length of the longest name. PCRE_INFO_NAMETABLE returns a |
872 |
|
pointer to the first entry of the table (a pointer to char). |
873 |
|
The first two bytes of each entry are the number of the cap- |
874 |
|
turing parenthesis, most significant byte first. The rest of |
875 |
|
the entry is the corresponding name, zero terminated. The |
876 |
|
names are in alphabetical order. For example, consider the |
877 |
|
following pattern (assume PCRE_EXTENDED is set, so white |
878 |
|
space - including newlines - is ignored): |
879 |
|
|
880 |
|
(?P<date> (?P<year>(\d\d)?\d\d) - |
881 |
|
(?P<month>\d\d) - (?P<day>\d\d) ) |
882 |
|
|
883 |
|
There are four named subpatterns, so the table has four |
884 |
|
entries, and each entry in the table is eight bytes long. |
885 |
|
The table is as follows, with non-printing bytes shown in |
886 |
|
hex, and undefined bytes shown as ??: |
887 |
|
|
888 |
|
00 01 d a t e 00 ?? |
889 |
|
00 05 d a y 00 ?? ?? |
890 |
|
00 04 m o n t h 00 |
891 |
|
00 02 y e a r 00 ?? |
892 |
|
|
893 |
|
When writing code to extract data from named subpatterns, |
894 |
|
remember that the length of each entry may be different for |
895 |
|
each compiled pattern. |
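
The table layout described above can be searched with a short loop. The following is a minimal sketch, not PCRE's own code: lookup_group_number() and example_table are hypothetical names, the table is assumed to have been obtained already via PCRE_INFO_NAMETABLE, and the undefined (??) bytes of the example are filled with zeros here purely for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: find a name in a name-to-number
   table. Each entry is entry_size bytes long: a two-byte group number,
   most significant byte first, followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not in the table. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four-entry, eight-bytes-per-entry example table from the text. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

The entry size is passed in rather than hard-coded because, as noted above, it can differ for each compiled pattern. A linear scan is used for simplicity; since the names are in alphabetical order, a binary search would also work.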
896 |
|
|
897 |
|
PCRE_INFO_OPTIONS |
898 |
|
|
899 |
|
Return a copy of the options with which the pattern was com- |
900 |
|
piled. The fourth argument should point to an unsigned long |
901 |
|
int variable. These option bits are those specified in the |
902 |
|
call to pcre_compile(), modified by any top-level option |
903 |
|
settings within the pattern itself. |
904 |
|
|
905 |
|
A pattern is automatically anchored by PCRE if all of its |
906 |
|
top-level alternatives begin with one of the following: |
907 |
|
|
908 |
|
^ unless PCRE_MULTILINE is set |
909 |
|
\A always |
910 |
|
\G always |
911 |
|
.* if PCRE_DOTALL is set and there are no back |
912 |
|
references to the subpattern in which .* appears |
913 |
|
|
914 |
|
For such patterns, the PCRE_ANCHORED bit is set in the |
915 |
|
options returned by pcre_fullinfo(). |
916 |
|
|
917 |
|
PCRE_INFO_SIZE |
918 |
|
|
919 |
|
Return the size of the compiled pattern, that is, the value |
920 |
|
that was passed as the argument to pcre_malloc() when PCRE |
921 |
|
was getting memory in which to place the compiled data. The |
922 |
|
fourth argument should point to a size_t variable. |
923 |
|
|
924 |
|
PCRE_INFO_STUDYSIZE |
925 |
|
|
926 |
|
Returns the size of the data block pointed to by the |
927 |
|
study_data field in a pcre_extra block. That is, it is the |
928 |
|
value that was passed to pcre_malloc() when PCRE was getting |
929 |
|
memory into which to place the data created by pcre_study(). |
930 |
|
The fourth argument should point to a size_t variable. |
931 |
|
|
932 |
|
|
933 |
|
OBSOLETE INFO FUNCTION |
934 |
|
|
935 |
|
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
936 |
|
|
937 |
The pcre_info() function is now obsolete because its inter- |
The pcre_info() function is now obsolete because its inter- |
938 |
face is too restrictive to return all the available data |
face is too restrictive to return all the available data |
951 |
If the pattern is not anchored and the firstcharptr argument |
If the pattern is not anchored and the firstcharptr argument |
952 |
is not NULL, it is used to pass back information about the |
is not NULL, it is used to pass back information about the |
953 |
first character of any matched string (see |
first character of any matched string (see |
954 |
PCRE_INFO_FIRSTCHAR above). |
PCRE_INFO_FIRSTBYTE above). |
|
|
|
955 |
|
|
956 |
|
|
957 |
MATCHING A PATTERN |
MATCHING A PATTERN |
|
The function pcre_exec() is called to match a subject string |
|
958 |
|
|
959 |
|
int pcre_exec(const pcre *code, const pcre_extra *extra, |
960 |
|
const char *subject, int length, int startoffset, |
961 |
|
int options, int *ovector, int ovecsize); |
962 |
|
|
963 |
|
The function pcre_exec() is called to match a subject string |
964 |
against a pre-compiled pattern, which is passed in the code |
against a pre-compiled pattern, which is passed in the code |
965 |
argument. If the pattern has been studied, the result of the |
argument. If the pattern has been studied, the result of the |
966 |
study should be passed in the extra argument. Otherwise this |
study should be passed in the extra argument. |
|
must be NULL. |
|
967 |
|
|
968 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_exec(): |
969 |
|
|
979 |
ovector, /* vector for substring information */ |
ovector, /* vector for substring information */ |
980 |
30); /* number of elements in the vector */ |
30); /* number of elements in the vector */ |
981 |
|
|
982 |
|
If the extra argument is not NULL, it must point to a |
983 |
|
pcre_extra data block. The pcre_study() function returns |
984 |
|
such a block (when it doesn't return NULL), but you can also |
985 |
|
create one for yourself, and pass additional information in |
986 |
|
it. The fields in the block are as follows: |
987 |
|
|
988 |
|
unsigned long int flags; |
989 |
|
void *study_data; |
990 |
|
unsigned long int match_limit; |
991 |
|
void *callout_data; |
992 |
|
|
993 |
|
The flags field is a bitmap that specifies which of the |
994 |
|
other fields are set. The flag bits are: |
995 |
|
|
996 |
|
PCRE_EXTRA_STUDY_DATA |
997 |
|
PCRE_EXTRA_MATCH_LIMIT |
998 |
|
PCRE_EXTRA_CALLOUT_DATA |
999 |
|
|
1000 |
|
Other flag bits should be set to zero. The study_data field |
1001 |
|
is set in the pcre_extra block that is returned by |
1002 |
|
pcre_study(), together with the appropriate flag bit. You |
1003 |
|
should not set this yourself, but you can add to the block |
1004 |
|
by setting the other fields. |
1005 |
|
|
1006 |
|
The match_limit field provides a means of preventing PCRE |
1007 |
|
from using up a vast amount of resources when running pat- |
1008 |
|
terns that are not going to match, but which have a very |
1009 |
|
large number of possibilities in their search trees. The |
1010 |
|
classic example is the use of nested unlimited repeats. |
1011 |
|
Internally, PCRE uses a function called match() which it |
1012 |
|
calls repeatedly (sometimes recursively). The limit is |
1013 |
|
imposed on the number of times this function is called dur- |
1014 |
|
ing a match, which has the effect of limiting the amount of |
1015 |
|
recursion and backtracking that can take place. For patterns |
1016 |
|
that are not anchored, the count starts from zero for each |
1017 |
|
position in the subject string. |
1018 |
|
|
1019 |
|
The default limit for the library can be set when PCRE is |
1020 |
|
built; the default default is 10 million, which handles all |
1021 |
|
but the most extreme cases. You can reduce the default by |
1022 |
|
supplying pcre_exec() with a pcre_extra block in which |
1023 |
|
match_limit is set to a smaller value, and |
1024 |
|
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the |
1025 |
|
limit is exceeded, pcre_exec() returns |
1026 |
|
PCRE_ERROR_MATCHLIMIT. |
1027 |
|
|
1028 |
|
The callout_data field is used in conjunction with the "cal- |
1029 |
|
lout" feature, which is described in the pcrecallout docu- |
1030 |
|
mentation. |
1031 |
|
|
1032 |
The PCRE_ANCHORED option can be passed in the options argu- |
The PCRE_ANCHORED option can be passed in the options argu- |
1033 |
ment, whose unused bits must be zero. However, if a pattern |
ment, whose unused bits must be zero. This limits |
1034 |
was compiled with PCRE_ANCHORED, or turned out to be |
pcre_exec() to matching at the first matching position. How- |
1035 |
anchored by virtue of its contents, it cannot be made |
ever, if a pattern was compiled with PCRE_ANCHORED, or |
1036 |
unanchored at matching time. |
turned out to be anchored by virtue of its contents, it can- |
1037 |
|
not be made unanchored at matching time. |
1038 |
|
|
1039 |
|
When PCRE_UTF8 was set at compile time, the validity of the |
1040 |
|
subject as a UTF-8 string is automatically checked. If an |
1041 |
|
invalid UTF-8 sequence of bytes is found, pcre_exec() |
1042 |
|
returns the error PCRE_ERROR_BADUTF8. If you already know |
1043 |
|
that your subject is valid, and you want to skip this check |
1044 |
|
for performance reasons, you can set the PCRE_NO_UTF8_CHECK |
1045 |
|
option when calling pcre_exec(). When this option is set, |
1046 |
|
the effect of passing an invalid UTF-8 string as a subject |
1047 |
|
is undefined. It may cause your program to crash. |
1048 |
|
|
1049 |
There are also three further options that can be set only at |
There are also three further options that can be set only at |
1050 |
matching time: |
matching time: |
1088 |
advancing the starting offset (see below) and trying an |
advancing the starting offset (see below) and trying an |
1089 |
ordinary match again. |
ordinary match again. |
1090 |
|
|
1091 |
The subject string is passed as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in |
1092 |
length in length, and a starting offset in startoffset. |
subject, a length in length, and a starting offset in star- |
1093 |
Unlike the pattern string, the subject may contain binary |
toffset. Unlike the pattern string, the subject may contain |
1094 |
zero characters. When the starting offset is zero, the |
binary zero bytes. When the starting offset is zero, the |
1095 |
search for a match starts at the beginning of the subject, |
search for a match starts at the beginning of the subject, |
1096 |
and this is by far the most common case. |
and this is by far the most common case. |
1097 |
|
|
1098 |
|
If the pattern was compiled with the PCRE_UTF8 option, the |
1099 |
|
subject must be a sequence of bytes that is a valid UTF-8 |
1100 |
|
string. If an invalid UTF-8 string is passed, PCRE's |
1101 |
|
behaviour is not defined. |
1102 |
|
|
1103 |
A non-zero starting offset is useful when searching for |
A non-zero starting offset is useful when searching for |
1104 |
another match in the same subject by calling pcre_exec() |
another match in the same subject by calling pcre_exec() |
1105 |
again after a previous success. Setting startoffset differs |
again after a previous success. Setting startoffset differs |
1135 |
used for a fragment of a pattern that picks out a substring. |
used for a fragment of a pattern that picks out a substring. |
1136 |
PCRE supports several other kinds of parenthesized subpat- |
PCRE supports several other kinds of parenthesized subpat- |
1137 |
tern that do not cause substrings to be captured. |
tern that do not cause substrings to be captured. |
|
|
|
1138 |
Captured substrings are returned to the caller via a vector |
Captured substrings are returned to the caller via a vector |
1139 |
of integer offsets whose address is passed in ovector. The |
of integer offsets whose address is passed in ovector. The |
1140 |
number of elements in the vector is passed in ovecsize. The |
number of elements in the vector is passed in ovecsize. The |
1190 |
Note that pcre_info() can be used to find out how many cap- |
Note that pcre_info() can be used to find out how many cap- |
1191 |
turing subpatterns there are in a compiled pattern. The |
turing subpatterns there are in a compiled pattern. The |
1192 |
smallest size for ovector that will allow for n captured |
smallest size for ovector that will allow for n captured |
1193 |
substrings in addition to the offsets of the substring |
substrings, in addition to the offsets of the substring |
1194 |
matched by the whole pattern is (n+1)*3. |
matched by the whole pattern, is (n+1)*3. |
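
The sizing rule above is simple arithmetic. As a sketch, with min_ovector_size() being a hypothetical name rather than a PCRE function:

```c
/* Hypothetical helper, not part of PCRE: the smallest ovector length that
   has room for capture_count captured substrings plus the offsets of the
   substring matched by the whole pattern. */
int min_ovector_size(int capture_count)
{
    return (capture_count + 1) * 3;
}
```

By this rule, a vector of 30 elements is large enough for a pattern with up to nine capturing subpatterns.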
1195 |
|
|
1196 |
If pcre_exec() fails, it returns a negative number. The fol- |
If pcre_exec() fails, it returns a negative number. The fol- |
1197 |
lowing are defined in the header file: |
lowing are defined in the header file: |
1231 |
pcre_malloc() fails, this error is given. The memory is |
pcre_malloc() fails, this error is given. The memory is |
1232 |
freed at the end of matching. |
freed at the end of matching. |
1233 |
|
|
1234 |
|
PCRE_ERROR_NOSUBSTRING (-7) |
1235 |
|
|
1236 |
|
This error is used by the pcre_copy_substring(), |
1237 |
|
pcre_get_substring(), and pcre_get_substring_list() func- |
1238 |
|
tions (see below). It is never returned by pcre_exec(). |
1239 |
|
|
1240 |
|
PCRE_ERROR_MATCHLIMIT (-8) |
1241 |
|
|
1242 |
|
The recursion and backtracking limit, as specified by the |
1243 |
|
match_limit field in a pcre_extra structure (or defaulted) |
1244 |
|
was reached. See the description above. |
1245 |
|
|
1246 |
|
PCRE_ERROR_CALLOUT (-9) |
1247 |
|
|
1248 |
|
This error is never generated by pcre_exec() itself. It is |
1249 |
|
provided for use by callout functions that want to yield a |
1250 |
|
distinctive error code. See the pcrecallout documentation |
1251 |
|
for details. |
1252 |
|
|
1253 |
|
PCRE_ERROR_BADUTF8 (-10) |
1254 |
|
|
1255 |
|
A string that contains an invalid UTF-8 byte sequence was |
1256 |
|
passed as a subject. |
1257 |
|
|
1258 |
|
|
1259 |
|
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
1260 |
|
|
1261 |
|
int pcre_copy_substring(const char *subject, int *ovector, |
1262 |
|
int stringcount, int stringnumber, char *buffer, |
1263 |
|
int buffersize); |
1264 |
|
|
1265 |
|
int pcre_get_substring(const char *subject, int *ovector, |
1266 |
|
int stringcount, int stringnumber, |
1267 |
|
const char **stringptr); |
1268 |
|
|
1269 |
|
int pcre_get_substring_list(const char *subject, |
1270 |
|
int *ovector, int stringcount, const char ***listptr); |
1271 |
|
|
|
EXTRACTING CAPTURED SUBSTRINGS |
|
1272 |
Captured substrings can be accessed directly by using the |
Captured substrings can be accessed directly by using the |
1273 |
offsets returned by pcre_exec() in ovector. For convenience, |
offsets returned by pcre_exec() in ovector. For convenience, |
1274 |
the functions pcre_copy_substring(), pcre_get_substring(), |
the functions pcre_copy_substring(), pcre_get_substring(), |
1275 |
and pcre_get_substring_list() are provided for extracting |
and pcre_get_substring_list() are provided for extracting |
1276 |
captured substrings as new, separate, zero-terminated |
captured substrings as new, separate, zero-terminated |
1277 |
|
strings. These functions identify substrings by number. The |
1278 |
|
next section describes functions for extracting named sub- |
1279 |
strings. A substring that contains a binary zero is |
strings. A substring that contains a binary zero is |
1280 |
correctly extracted and has a further zero added on the end, |
correctly extracted and has a further zero added on the end, |
1281 |
but the result does not, of course, function as a C string. |
but the result is not, of course, a C string. |
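
Direct access through the offsets can be sketched as follows. This is a simplified stand-in, not the library's implementation: copy_capture() is a hypothetical name, and the code relies on the ovector convention that substring n starts at ovector[2*n] and ends just before ovector[2*n+1]. Unlike the real functions, it treats an unset substring (offsets of -1) as an error.

```c
#include <string.h>

/* Hypothetical stand-in, not a PCRE function: copy captured substring
   `number` out of `subject` into `buffer`, adding a terminating zero.
   Returns the substring length, or -1 if the number is out of range,
   the substring is unset, or the buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int number, char *buffer, int buffersize)
{
    int start, length;

    if (number < 0 || number >= stringcount)
        return -1;

    start = ovector[2 * number];
    length = ovector[2 * number + 1] - start;
    if (start < 0 || length < 0 || length + 1 > buffersize)
        return -1;

    /* memcpy is used rather than strcpy because a captured substring
       may itself contain binary zero bytes. */
    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

The length is returned so that a caller can still handle a substring containing a binary zero, which cannot be measured with strlen().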
1282 |
|
|
1283 |
The first three arguments are the same for all three func- |
The first three arguments are the same for all three of |
1284 |
tions: subject is the subject string which has just been |
these functions: subject is the subject string which has |
1285 |
successfully matched, ovector is a pointer to the vector of |
just been successfully matched, ovector is a pointer to the |
1286 |
integer offsets that was passed to pcre_exec(), and |
vector of integer offsets that was passed to pcre_exec(), |
1287 |
stringcount is the number of substrings that were captured |
and stringcount is the number of substrings that were cap- |
1288 |
by the match, including the substring that matched the |
tured by the match, including the substring that matched the |
1289 |
entire regular expression. This is the value returned by |
entire regular expression. This is the value returned by |
1290 |
pcre_exec if it is greater than zero. If pcre_exec() |
pcre_exec if it is greater than zero. If pcre_exec() |
1291 |
returned zero, indicating that it ran out of space in ovec- |
returned zero, indicating that it ran out of space in ovec- |
1292 |
tor, the value passed as stringcount should be the size of |
tor, the value passed as stringcount should be the size of |
1293 |
the vector divided by three. |
the vector divided by three. |
|
|
|
1294 |
The functions pcre_copy_substring() and pcre_get_substring() |
The functions pcre_copy_substring() and pcre_get_substring() |
1295 |
extract a single substring, whose number is given as string- |
extract a single substring, whose number is given as string- |
1296 |
number. A value of zero extracts the substring that matched |
number. A value of zero extracts the substring that matched |
1343 |
the functions are provided. |
the functions are provided. |
1344 |
|
|
1345 |
|
|
1346 |
|
EXTRACTING CAPTURED SUBSTRINGS BY NAME |
1347 |
|
|
1348 |
LIMITATIONS |
int pcre_copy_named_substring(const pcre *code, |
1349 |
There are some size limitations in PCRE but it is hoped that |
const char *subject, int *ovector, |
1350 |
they will never in practice be relevant. The maximum length |
int stringcount, const char *stringname, |
1351 |
of a compiled pattern is 65539 (sic) bytes. All values in |
char *buffer, int buffersize); |
1352 |
repeating quantifiers must be less than 65536. The max- |
|
1353 |
imum number of capturing subpatterns is 65535. There is no |
int pcre_get_stringnumber(const pcre *code, |
1354 |
limit to the number of non-capturing subpatterns, but the |
const char *name); |
1355 |
maximum depth of nesting of all kinds of parenthesized sub- |
|
1356 |
pattern, including capturing subpatterns, assertions, and |
int pcre_get_named_substring(const pcre *code, |
1357 |
other types of subpattern, is 200. |
const char *subject, int *ovector, |
1358 |
|
int stringcount, const char *stringname, |
1359 |
|
const char **stringptr); |
1360 |
|
|
1361 |
The maximum length of a subject string is the largest posi- |
To extract a substring by name, you first have to find the |
1362 |
tive number that an integer variable can hold. However, PCRE |
associated number. This can be done by calling |
1363 |
uses recursion to handle subpatterns and indefinite repeti- |
pcre_get_stringnumber(). The first argument is the compiled |
1364 |
tion. This means that the available stack space may limit |
pattern, and the second is the name. For example, for this |
1365 |
the size of a subject string that can be processed by cer- |
pattern |
1366 |
tain patterns. |
|
1367 |
|
ab(?P<xxx>\d+)... |
1368 |
|
|
1369 |
|
the number of the subpattern called "xxx" is 1. Given the |
1370 |
|
number, you can then extract the substring directly, or use |
1371 |
|
one of the functions described in the previous section. For |
1372 |
|
convenience, there are also two functions that do the whole |
1373 |
|
job. |
1374 |
|
|
1375 |
|
Most of the arguments of pcre_copy_named_substring() and |
1376 |
|
pcre_get_named_substring() are the same as those for the |
1377 |
|
functions that extract by number, and so are not re- |
1378 |
|
described here. There are just two differences. |
1379 |
|
|
1380 |
|
First, instead of a substring number, a substring name is |
1381 |
|
given. Second, there is an extra argument, given at the |
1382 |
|
start, which is a pointer to the compiled pattern. This is |
1383 |
|
needed in order to gain access to the name-to-number trans- |
1384 |
|
lation table. |
1385 |
|
|
1386 |
|
These functions call pcre_get_stringnumber(), and if it |
1387 |
|
succeeds, they then call pcre_copy_substring() or |
1388 |
|
pcre_get_substring(), as appropriate. |
1389 |
|
|
1390 |
|
Last updated: 20 August 2003 |
1391 |
|
Copyright (c) 1997-2003 University of Cambridge. |
1392 |
|
----------------------------------------------------------------------------- |
1393 |
|
|
1394 |
|
NAME |
1395 |
|
PCRE - Perl-compatible regular expressions |
1396 |
|
|
1397 |
|
|
1398 |
|
PCRE CALLOUTS |
1399 |
|
|
1400 |
|
int (*pcre_callout)(pcre_callout_block *); |
1401 |
|
|
1402 |
|
PCRE provides a feature called "callout", which is a means |
1403 |
|
of temporarily passing control to the caller of PCRE in the |
1404 |
|
middle of pattern matching. The caller of PCRE provides an |
1405 |
|
external function by putting its entry point in the global |
1406 |
|
variable pcre_callout. By default, this variable contains |
1407 |
|
NULL, which disables all calling out. |
1408 |
|
|
1409 |
|
Within a regular expression, (?C) indicates the points at |
1410 |
|
which the external function is to be called. Different cal- |
1411 |
|
lout points can be identified by putting a number less than |
1412 |
|
256 after the letter C. The default value is zero. For |
1413 |
|
example, this pattern has two callout points: |
1414 |
|
|
1415 |
|
(?C1)9abc(?C2)def |
1416 |
|
|
1417 |
|
During matching, when PCRE reaches a callout point (and |
1418 |
|
pcre_callout is set), the external function is called. Its |
1419 |
|
only argument is a pointer to a pcre_callout block. This |
1420 |
|
contains the following variables: |
1421 |
|
|
1422 |
|
int version; |
1423 |
|
int callout_number; |
1424 |
|
int *offset_vector; |
1425 |
|
const char *subject; |
1426 |
|
int subject_length; |
1427 |
|
int start_match; |
1428 |
|
int current_position; |
1429 |
|
int capture_top; |
1430 |
|
int capture_last; |
1431 |
|
void *callout_data; |
1432 |
|
|
1433 |
|
The version field is an integer containing the version |
1434 |
|
number of the block format. The current version is zero. The |
1435 |
|
version number may change in future if additional fields are |
1436 |
|
added, but the intention is never to remove any of the |
1437 |
|
existing fields. |
1438 |
|
|
1439 |
|
The callout_number field contains the number of the callout, |
1440 |
|
as compiled into the pattern (that is, the number after ?C). |
1441 |
|
|
1442 |
|
The offset_vector field is a pointer to the vector of |
1443 |
|
offsets that was passed by the caller to pcre_exec(). The |
1444 |
|
contents can be inspected in order to extract substrings |
1445 |
|
that have been matched so far, in the same way as for |
1446 |
|
extracting substrings after a match has completed. |
1447 |
|
The subject and subject_length fields contain copies of the |
1448 |
|
values that were passed to pcre_exec(). |
1449 |
|
|
1450 |
|
The start_match field contains the offset within the subject |
1451 |
|
at which the current match attempt started. If the pattern |
1452 |
|
is not anchored, the callout function may be called several |
1453 |
|
times for different starting points. |
1454 |
|
|
1455 |
|
The current_position field contains the offset within the |
1456 |
|
subject of the current match pointer. |
1457 |
|
|
1458 |
|
The capture_top field contains one more than the number of |
1459 |
|
the highest numbered captured substring so far. If no sub- |
1460 |
|
strings have been captured, the value of capture_top is one. |
1461 |
|
|
1462 |
|
The capture_last field contains the number of the most |
1463 |
|
recently captured substring. |
1464 |
|
|
1465 |
|
The callout_data field contains a value that is passed to |
1466 |
|
pcre_exec() by the caller specifically so that it can be |
1467 |
|
passed back in callouts. It is passed in the callout_data |
1468 |
|
field of the pcre_extra data structure. If no such data was |
1469 |
|
passed, the value of callout_data in a pcre_callout block is |
1470 |
|
NULL. There is a description of the pcre_extra structure in |
1471 |
|
the pcreapi documentation. |
1472 |
|
|
1473 |
|
|
1474 |
|
|
1475 |
|
RETURN VALUES |
1476 |
|
|
1477 |
|
The callout function returns an integer. If the value is |
1478 |
|
zero, matching proceeds as normal. If the value is greater |
1479 |
|
than zero, matching fails at the current point, but back- |
1480 |
|
tracking to test other possibilities goes ahead, just as if |
1481 |
|
a lookahead assertion had failed. If the value is less than |
1482 |
|
zero, the match is abandoned, and pcre_exec() returns the |
1483 |
|
value. |
1484 |
|
|
1485 |
|
Negative values should normally be chosen from the set of |
1486 |
|
PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH |
1487 |
|
forces a standard "no match" failure. The error number |
1488 |
|
PCRE_ERROR_CALLOUT is reserved for use by callout functions; |
1489 |
|
it will never be used by PCRE itself. |
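
The three kinds of return value described above can be sketched in C as
follows. This is only an illustration, not code from the PCRE
distribution: the structure sketch_callout_block is a local stand-in
that mirrors the fields listed earlier (real code would include pcre.h
and use the library's own block type), SKETCH_ABANDON is a placeholder
for a negative code such as PCRE_ERROR_CALLOUT, and the "budget"
counter passed through callout_data is an assumption made for the
example.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in mirroring the fields listed in this document.
 * Real code would use the block type declared in pcre.h. */
typedef struct sketch_callout_block {
    int          version;
    int          callout_number;
    int         *offset_vector;
    const char  *subject;
    int          subject_length;
    int          start_match;
    int          current_position;
    int          capture_top;
    int          capture_last;
    void        *callout_data;
} sketch_callout_block;

/* Placeholder negative code; real code would use PCRE_ERROR_CALLOUT. */
#define SKETCH_ABANDON (-9)

/* Hypothetical callout function showing the three outcomes:
 *   negative -> the whole match is abandoned,
 *   positive -> matching fails at this point, backtracking continues,
 *   zero     -> matching proceeds as normal. */
static int sketch_callout(sketch_callout_block *cb)
{
    assert(cb != NULL);

    /* Assumption: the caller passed a pointer to an int "budget"
     * of permitted callouts via callout_data (may be NULL). */
    int *budget = (int *)cb->callout_data;
    if (budget != NULL && --(*budget) < 0)
        return SKETCH_ABANDON;

    /* Reject this path at callout point 2 if the attempt has already
     * consumed more than 8 characters since start_match. */
    if (cb->callout_number == 2 &&
        cb->current_position - cb->start_match > 8)
        return 1;

    return 0;
}
```

In a real program the entry point of such a function would be stored
in the global variable pcre_callout before pcre_exec() is called.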
1490 |
|
|
1491 |
|
Last updated: 21 January 2003 |
1492 |
|
Copyright (c) 1997-2003 University of Cambridge. |
1493 |
|
----------------------------------------------------------------------------- |
1494 |
|
|
1495 |
|
NAME |
1496 |
|
PCRE - Perl-compatible regular expressions |
1497 |
|
|
1498 |
|
|
1499 |
DIFFERENCES FROM PERL |
DIFFERENCES FROM PERL |
|
The differences described here are with respect to Perl |
|
|
5.005. |
|
1500 |
|
|
1501 |
1. By default, a whitespace character is any character that |
This document describes the differences in the ways that |
1502 |
the C library function isspace() recognizes, though it is |
PCRE and Perl handle regular expressions. The differences |
1503 |
possible to compile PCRE with alternative character type |
described here are with respect to Perl 5.8. |
|
tables. Normally isspace() matches space, formfeed, newline, |
|
|
carriage return, horizontal tab, and vertical tab. Perl 5 no |
|
|
longer includes vertical tab in its set of whitespace char- |
|
|
acters. The \v escape that was in the Perl documentation for |
|
|
a long time was never in fact recognized. However, the char- |
|
|
acter itself was treated as whitespace at least up to 5.002. |
|
|
In 5.004 and 5.005 it does not match \s. |
|
1504 |
|
|
1505 |
2. PCRE does not allow repeat quantifiers on lookahead |
1. PCRE does not allow repeat quantifiers on lookahead |
1506 |
assertions. Perl permits them, but they do not mean what you |
assertions. Perl permits them, but they do not mean what you |
1507 |
might think. For example, (?!a){3} does not assert that the |
might think. For example, (?!a){3} does not assert that the |
1508 |
next three characters are not "a". It just asserts that the |
next three characters are not "a". It just asserts that the |
1509 |
next character is not "a" three times. |
next character is not "a" three times. |
1510 |
|
|
1511 |
3. Capturing subpatterns that occur inside negative looka- |
2. Capturing subpatterns that occur inside negative looka- |
1512 |
head assertions are counted, but their entries in the |
head assertions are counted, but their entries in the |
1513 |
offsets vector are never set. Perl sets its numerical vari- |
offsets vector are never set. Perl sets its numerical vari- |
1514 |
ables from any such patterns that are matched before the |
ables from any such patterns that are matched before the |
1516 |
only if the negative lookahead assertion contains just one |
only if the negative lookahead assertion contains just one |
1517 |
branch. |
branch. |
1518 |
|
|
1519 |
4. Though binary zero characters are supported in the sub- |
3. Though binary zero characters are supported in the sub- |
1520 |
ject string, they are not allowed in a pattern string |
ject string, they are not allowed in a pattern string |
1521 |
because it is passed as a normal C string, terminated by |
because it is passed as a normal C string, terminated by |
1522 |
zero. The escape sequence "\0" can be used in the pattern to |
zero. The escape sequence "\0" can be used in the pattern to |
1523 |
represent a binary zero. |
represent a binary zero. |
1524 |
|
|
1525 |
5. The following Perl escape sequences are not supported: |
4. The following Perl escape sequences are not supported: |
1526 |
\l, \u, \L, \U, \E, \Q. In fact these are implemented by |
\l, \u, \L, \U, \P, \p, and \X. In fact these are imple- |
1527 |
Perl's general string-handling and are not part of its pat- |
mented by Perl's general string-handling and are not part of |
1528 |
tern matching engine. |
its pattern matching engine. If any of these are encountered |
1529 |
|
by PCRE, an error is generated. |
1530 |
|
|
1531 |
|
5. PCRE does support the \Q...\E escape for quoting sub- |
1532 |
|
strings. Characters in between are treated as literals. This |
1533 |
|
is slightly different from Perl in that $ and @ are also |
1534 |
|
handled as literals inside the quotes. In Perl, they cause |
1535 |
|
variable interpolation (but of course PCRE does not have |
1536 |
|
variables). Note the following examples: |
1537 |
|
|
1538 |
|
Pattern PCRE matches Perl matches |
1539 |
|
|
1540 |
|
\Qabc$xyz\E abc$xyz abc followed by the |
1541 |
|
contents of $xyz |
1542 |
|
\Qabc\$xyz\E abc\$xyz abc\$xyz |
1543 |
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
1544 |
|
|
1545 |
6. The Perl \G assertion is not supported as it is not |
In PCRE, the \Q...\E mechanism is not recognized inside a |
1546 |
relevant to single pattern matches. |
character class. |
1547 |
|
|
1548 |
7. Fairly obviously, PCRE does not support the (?{code}) and |
8. Fairly obviously, PCRE does not support the (?{code}) and |
1549 |
(?p{code}) constructions. However, there is some experimen- |
(?p{code}) constructions. However, there is some experimen- |
1550 |
tal support for recursive patterns using the non-Perl item |
tal support for recursive patterns using the non-Perl items |
1551 |
(?R). |
(?R), (?number) and (?P>name). Also, the PCRE "callout" |
1552 |
|
feature allows an external function to be called during pat- |
1553 |
8. There are at the time of writing some oddities in Perl |
tern matching. |
1554 |
5.005_02 concerned with the settings of captured strings |
|
1555 |
when part of a pattern is repeated. For example, matching |
9. There are some differences that are concerned with the |
1556 |
"aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
settings of captured strings when part of a pattern is |
1557 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 |
repeated. For example, matching "aba" against the pattern |
1558 |
unset. However, if the pattern is changed to |
/^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set |
1559 |
/^(aa(b(b))?)+$/ then $2 (and $3) are set. |
to "b". |
|
|
|
|
In Perl 5.004 $2 is set in both cases, and that is also true |
|
|
of PCRE. If in the future Perl changes to a consistent state |
|
|
that is different, PCRE may change to follow. |
|
|
|
|
|
9. Another as yet unresolved discrepancy is that in Perl |
|
|
5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string |
|
|
"a", whereas in PCRE it does not. However, in both Perl and |
|
|
PCRE /^(a)?a/ matched against "a" leaves $1 unset. |
|
1560 |
|
|
1561 |
10. PCRE provides some extensions to the Perl regular |
10. PCRE provides some extensions to the Perl regular |
1562 |
expression facilities: |
expression facilities: |
1563 |
|
|
1564 |
(a) Although lookbehind assertions must match fixed length |
(a) Although lookbehind assertions must match fixed length |
1565 |
strings, each alternative branch of a lookbehind assertion |
strings, each alternative branch of a lookbehind assertion |
1566 |
can match a different length of string. Perl 5.005 requires |
can match a different length of string. Perl requires them |
1567 |
them all to have the same length. |
all to have the same length. |
1568 |
|
|
1569 |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not |
1570 |
set, the $ meta- character matches only at the very end of |
set, the $ meta-character matches only at the very end of |
1571 |
the string. |
the string. |
1572 |
|
|
1573 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
(c) If PCRE_EXTRA is set, a backslash followed by a letter |
1578 |
not greedy, but if followed by a question mark they are. |
not greedy, but if followed by a question mark they are. |
1579 |
|
|
1580 |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
(e) PCRE_ANCHORED can be used to force a pattern to be tried |
1581 |
only at the start of the subject. |
only at the first matching position in the subject string. |
1582 |
|
|
1583 |
|
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and |
1584 |
|
PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl |
1585 |
|
equivalents. |
1586 |
|
|
1587 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options |
(g) The (?R), (?number), and (?P>name) constructs allow for |
1588 |
for pcre_exec() have no Perl equivalents. |
recursive pattern matching (Perl can do this using the |
1589 |
|
(?p{code}) construct, which PCRE cannot support.) |
1590 |
|
|
1591 |
(g) The (?R) construct allows for recursive pattern matching |
(h) PCRE supports named capturing substrings, using the |
1592 |
(Perl 5.6 can do this using the (?p{code}) construct, which |
Python syntax. |
|
PCRE cannot of course support.) |
|
1593 |
|
|
1594 |
|
(i) PCRE supports the possessive quantifier "++" syntax, |
1595 |
|
taken from Sun's Java package. |
1596 |
|
|
1597 |
|
(j) The (R) condition, for testing recursion, is a PCRE |
1598 |
|
extension. |
1599 |
|
|
1600 |
|
(k) The callout facility is PCRE-specific. |
1601 |
|
|
1602 |
|
Last updated: 03 February 2003 |
1603 |
|
Copyright (c) 1997-2003 University of Cambridge. |
1604 |
|
----------------------------------------------------------------------------- |
1605 |
|
|
1606 |
|
NAME |
1607 |
|
PCRE - Perl-compatible regular expressions |
1608 |
|
|
1609 |
|
|
1610 |
|
PCRE REGULAR EXPRESSION DETAILS |
1611 |
|
|
|
REGULAR EXPRESSION DETAILS |
|
1612 |
The syntax and semantics of the regular expressions sup- |
The syntax and semantics of the regular expressions sup- |
1613 |
ported by PCRE are described below. Regular expressions are |
ported by PCRE are described below. Regular expressions are |
1614 |
also described in the Perl documentation and in a number of |
also described in the Perl documentation and in a number of |
1615 |
other books, some of which have copious examples. Jeffrey |
other books, some of which have copious examples. Jeffrey |
1616 |
Friedl's "Mastering Regular Expressions", published by |
Friedl's "Mastering Regular Expressions", published by |
1617 |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
O'Reilly, covers them in great detail. The description here |
1618 |
|
is intended as reference documentation. |
1619 |
|
|
|
The description here is intended as reference documentation. |
|
1620 |
The basic operation of PCRE is on strings of bytes. However, |
The basic operation of PCRE is on strings of bytes. However, |
1621 |
there is the beginnings of some support for UTF-8 character |
there is also support for UTF-8 character strings. To use |
1622 |
strings. To use this support you must configure PCRE to |
this support you must build PCRE to include UTF-8 support, |
1623 |
include it, and then call pcre_compile() with the PCRE_UTF8 |
and then call pcre_compile() with the PCRE_UTF8 option. How |
1624 |
option. How this affects the pattern matching is described |
this affects the pattern matching is mentioned in several |
1625 |
in the final section of this document. |
places below. There is also a summary of UTF-8 features in |
1626 |
|
the section on UTF-8 support in the main pcre page. |
1627 |
|
|
1628 |
A regular expression is a pattern that is matched against a |
A regular expression is a pattern that is matched against a |
1629 |
subject string from left to right. Most characters stand for |
subject string from left to right. Most characters stand for |
1645 |
Outside square brackets, the meta-characters are as follows: |
Outside square brackets, the meta-characters are as follows: |
1646 |
|
|
1647 |
\ general escape character with several uses |
\ general escape character with several uses |
1648 |
^ assert start of subject (or line, in multiline |
^ assert start of string (or line, in multiline mode) |
1649 |
mode) |
$ assert end of string (or line, in multiline mode) |
|
$ assert end of subject (or line, in multiline mode) |
|
1650 |
. match any character except newline (by default) |
. match any character except newline (by default) |
1651 |
[ start character class definition |
[ start character class definition |
1652 |
| start of alternative branch |
| start of alternative branch |
1657 |
also quantifier minimizer |
also quantifier minimizer |
1658 |
* 0 or more quantifier |
* 0 or more quantifier |
1659 |
+ 1 or more quantifier |
+ 1 or more quantifier |
1660 |
|
also "possessive quantifier" |
1661 |
{ start min/max quantifier |
{ start min/max quantifier |
1662 |
|
|
1663 |
Part of a pattern that is in square brackets is called a |
Part of a pattern that is in square brackets is called a |
1667 |
\ general escape character |
\ general escape character |
1668 |
^ negate the class, but only if the first character |
^ negate the class, but only if the first character |
1669 |
- indicates character range |
- indicates character range |
1670 |
|
[ POSIX character class (only if followed by POSIX |
1671 |
|
syntax) |
1672 |
] terminates the character class |
] terminates the character class |
1673 |
|
|
1674 |
The following sections describe the use of each of the |
The following sections describe the use of each of the |
1675 |
meta-characters. |
meta-characters. |
1676 |
|
|
1677 |
|
|
|
|
|
1678 |
BACKSLASH |
BACKSLASH |
1679 |
|
|
1680 |
The backslash character has several uses. Firstly, if it is |
The backslash character has several uses. Firstly, if it is |
1681 |
followed by a non-alphameric character, it takes away any |
followed by a non-alphameric character, it takes away any |
1682 |
special meaning that character may have. This use of |
special meaning that character may have. This use of |
|
|
|
1683 |
backslash as an escape character applies both inside and |
backslash as an escape character applies both inside and |
1684 |
outside character classes. |
outside character classes. |
1685 |
|
|
1686 |
For example, if you want to match a "*" character, you write |
For example, if you want to match a * character, you write |
1687 |
"\*" in the pattern. This applies whether or not the follow- |
\* in the pattern. This escaping action applies whether or |
1688 |
ing character would otherwise be interpreted as a meta- |
not the following character would otherwise be interpreted |
1689 |
character, so it is always safe to precede a non-alphameric |
as a meta-character, so it is always safe to precede a non- |
1690 |
with "\" to specify that it stands for itself. In particu- |
alphameric with backslash to specify that it stands for |
1691 |
lar, if you want to match a backslash, you write "\\". |
itself. In particular, if you want to match a backslash, you |
1692 |
|
write \\. |
1693 |
|
|
1694 |
If a pattern is compiled with the PCRE_EXTENDED option, whi- |
If a pattern is compiled with the PCRE_EXTENDED option, whi- |
1695 |
tespace in the pattern (other than in a character class) and |
tespace in the pattern (other than in a character class) and |
1696 |
characters between a "#" outside a character class and the |
characters between a # outside a character class and the |
1697 |
next newline character are ignored. An escaping backslash |
next newline character are ignored. An escaping backslash |
1698 |
can be used to include a whitespace or "#" character as part |
can be used to include a whitespace or # character as part |
1699 |
of the pattern. |
of the pattern. |
1700 |
|
|
1701 |
|
If you want to remove the special meaning from a sequence of |
1702 |
|
characters, you can do so by putting them between \Q and \E. |
1703 |
|
This is different from Perl in that $ and @ are handled as |
1704 |
|
literals in \Q...\E sequences in PCRE, whereas in Perl, $ |
1705 |
|
and @ cause variable interpolation. Note the following exam- |
1706 |
|
ples: |
1707 |
|
|
1708 |
|
Pattern PCRE matches Perl matches |
1709 |
|
|
1710 |
|
\Qabc$xyz\E abc$xyz abc followed by the |
1711 |
|
|
1712 |
|
contents of $xyz |
1713 |
|
\Qabc\$xyz\E abc\$xyz abc\$xyz |
1714 |
|
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
1715 |
|
|
1716 |
|
The \Q...\E sequence is recognized both inside and outside |
1717 |
|
character classes. |
1718 |
|
|
1719 |
A second use of backslash provides a way of encoding non- |
A second use of backslash provides a way of encoding non- |
1720 |
printing characters in patterns in a visible manner. There |
printing characters in patterns in a visible manner. There |
1721 |
is no restriction on the appearance of non-printing charac- |
is no restriction on the appearance of non-printing charac- |
1724 |
usually easier to use one of the following escape sequences |
usually easier to use one of the following escape sequences |
1725 |
than the binary character it represents: |
than the binary character it represents: |
1726 |
|
|
1727 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
1728 |
\cx "control-x", where x is any character |
\cx "control-x", where x is any character |
1729 |
\e escape (hex 1B) |
\e escape (hex 1B) |
1730 |
\f formfeed (hex 0C) |
\f formfeed (hex 0C) |
1731 |
\n newline (hex 0A) |
\n newline (hex 0A) |
1732 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
1733 |
\t tab (hex 09) |
\t tab (hex 09) |
1734 |
\xhh character with hex code hh |
\ddd character with octal code ddd, or backreference |
1735 |
\ddd character with octal code ddd, or backreference |
\xhh character with hex code hh |
1736 |
|
\x{hhh..} character with hex code hhh... (UTF-8 mode only) |
1737 |
|
|
1738 |
The precise effect of "\cx" is as follows: if "x" is a lower |
The precise effect of \cx is as follows: if x is a lower |
1739 |
case letter, it is converted to upper case. Then bit 6 of |
case letter, it is converted to upper case. Then bit 6 of |
1740 |
the character (hex 40) is inverted. Thus "\cz" becomes hex |
the character (hex 40) is inverted. Thus \cz becomes hex |
1741 |
1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B. |
1A, but \c{ becomes hex 3B, while \c; becomes hex 7B. |
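
The rule for \cx can be written out as a small C function. This is a
sketch of the rule as stated above, not PCRE's own source; the name
control_escape is hypothetical, and the sketch assumes ASCII input.

```c
#include <assert.h>

/* Compute the byte value produced by \cx for a given character x:
 * a lower case letter is first converted to upper case, then bit 6
 * (hex 40) of the character is inverted. */
static int control_escape(int x)
{
    assert(x >= 0 && x <= 0x7F);      /* sketch covers ASCII only */
    if (x >= 'a' && x <= 'z')
        x = x - 'a' + 'A';            /* fold to upper case */
    return x ^ 0x40;                  /* invert bit 6 */
}
```

With this function, control_escape('z') gives hex 1A,
control_escape('{') gives hex 3B, and control_escape(';') gives hex
7B, in agreement with the examples in the text.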
1742 |
|
|
1743 |
After "\x", up to two hexadecimal digits are read (letters |
After \x, from zero to two hexadecimal digits are read |
1744 |
can be in upper or lower case). |
(letters can be in upper or lower case). In UTF-8 mode, any |
1745 |
|
number of hexadecimal digits may appear between \x{ and }, |
1746 |
|
but the value of the character code must be less than 2**31 |
1747 |
|
(that is, the maximum hexadecimal value is 7FFFFFFF). If |
1748 |
|
characters other than hexadecimal digits appear between \x{ |
1749 |
|
and }, or if there is no terminating }, this form of escape |
1750 |
|
is not recognized. Instead, the initial \x will be inter- |
1751 |
|
preted as a basic hexadecimal escape, with no following |
1752 |
|
digits, giving a byte whose value is zero. |
1753 |
|
|
1754 |
|
Characters whose value is less than 256 can be defined by |
1755 |
|
either of the two syntaxes for \x when PCRE is in UTF-8 |
1756 |
|
mode. There is no difference in the way they are handled. |
1757 |
|
For example, \xdc is exactly the same as \x{dc}. |
1758 |
|
|
1759 |
After "\0" up to two further octal digits are read. In both |
After \0 up to two further octal digits are read. In both |
1760 |
cases, if there are fewer than two digits, just those that |
cases, if there are fewer than two digits, just those that |
1761 |
are present are used. Thus the sequence "\0\x\07" specifies |
are present are used. Thus the sequence \0\x\07 specifies |
1762 |
two binary zeros followed by a BEL character. Make sure you |
two binary zeros followed by a BEL character (code value 7). |
1763 |
supply two digits after the initial zero if the character |
Make sure you supply two digits after the initial zero if |
1764 |
that follows is itself an octal digit. |
the character that follows is itself an octal digit. |
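
The digit-counting rules for \x and \0 can be illustrated with a small
decoder. It is a sketch under stated assumptions, not PCRE's parser:
the names hex_value and decode_numeric_escapes are hypothetical, only
the \xhh and \0dd forms are handled (no \x{...}, no back references),
and every other character, including any other backslash sequence, is
copied through unchanged.

```c
#include <assert.h>
#include <stddef.h>

/* Return the value of a hexadecimal digit, or -1 if c is not one. */
static int hex_value(int c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Decode \xhh (zero to two hex digits) and \0dd (up to two further
 * octal digits) in pat, writing at most cap bytes to out.
 * Returns the number of bytes written. */
static size_t decode_numeric_escapes(const char *pat,
                                     unsigned char *out, size_t cap)
{
    size_t n = 0;
    assert(pat != NULL && out != NULL);
    while (*pat != '\0' && n < cap) {
        if (pat[0] == '\\' && pat[1] == 'x') {
            int v = 0, digits = 0;
            pat += 2;
            while (digits < 2 && hex_value((unsigned char)*pat) >= 0) {
                v = v * 16 + hex_value((unsigned char)*pat);
                pat++;
                digits++;
            }
            out[n++] = (unsigned char)v;   /* no digits gives zero */
        } else if (pat[0] == '\\' && pat[1] == '0') {
            int v = 0, digits = 0;
            pat += 2;
            while (digits < 2 && *pat >= '0' && *pat <= '7') {
                v = v * 8 + (*pat - '0');
                pat++;
                digits++;
            }
            out[n++] = (unsigned char)v;
        } else {
            out[n++] = (unsigned char)*pat++;  /* copied literally */
        }
    }
    return n;
}
```

Applied to the pattern text \0\x\07 this yields the three bytes 0, 0
and 7. Applied to \0113 it yields a tab (octal 011) followed by the
character "3", because no more than two digits are read after the
initial zero.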
1765 |
|
|
1766 |
The handling of a backslash followed by a digit other than 0 |
The handling of a backslash followed by a digit other than 0 |
1767 |
is complicated. Outside a character class, PCRE reads it |
is complicated. Outside a character class, PCRE reads it |
1787 |
writing a tab |
writing a tab |
1788 |
\011 is always a tab |
\011 is always a tab |
1789 |
\0113 is a tab followed by the character "3" |
\0113 is a tab followed by the character "3" |
1790 |
\113 is the character with octal code 113 (since there |
\113 might be a back reference, otherwise the |
1791 |
can be no more than 99 back references) |
character with octal code 113 |
1792 |
\377 is a byte consisting entirely of 1 bits |
\377 might be a back reference, otherwise |
1793 |
|
the byte consisting entirely of 1 bits |
1794 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
1795 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
1796 |
|
|
1798 |
duced by a leading zero, because no more than three octal |
duced by a leading zero, because no more than three octal |
1799 |
digits are ever read. |
digits are ever read. |
1800 |
|
|
1801 |
All the sequences that define a single byte value can be |
All the sequences that define a single byte value or a sin- |
1802 |
used both inside and outside character classes. In addition, |
gle UTF-8 character (in UTF-8 mode) can be used both inside |
1803 |
inside a character class, the sequence "\b" is interpreted |
and outside character classes. In addition, inside a charac- |
1804 |
as the backspace character (hex 08). Outside a character |
ter class, the sequence \b is interpreted as the backspace |
1805 |
class it has a different meaning (see below). |
character (hex 08). Outside a character class it has a dif- |
1806 |
|
ferent meaning (see below). |
1807 |
|
|
1808 |
The third use of backslash is for specifying generic charac- |
The third use of backslash is for specifying generic charac- |
1809 |
ter types: |
ter types: |
1813 |
\s any whitespace character |
\s any whitespace character |
1814 |
\S any character that is not a whitespace character |
\S any character that is not a whitespace character |
1815 |
\w any "word" character |
\w any "word" character |
1816 |
\W any "non-word" character |
\W any "non-word" character |
1817 |
|
|
1818 |
Each pair of escape sequences partitions the complete set of |
Each pair of escape sequences partitions the complete set of |
1819 |
characters into two disjoint sets. Any given character |
characters into two disjoint sets. Any given character |
1820 |
matches one, and only one, of each pair. |
matches one, and only one, of each pair. |
1821 |
|
|
1822 |
|
In UTF-8 mode, characters with values greater than 255 never |
1823 |
|
match \d, \s, or \w, and always match \D, \S, and \W. |
1824 |
|
|
1825 |
|
For compatibility with Perl, \s does not match the VT char- |
1826 |
|
acter (code 11). This makes it different from the POSIX |
1827 |
|
"space" class. The \s characters are HT (9), LF (10), FF |
1828 |
|
(12), CR (13), and space (32). |
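
The difference from the C library's isspace() can be made concrete
with a short C sketch. The function names are hypothetical, and the
comparison assumes the default "C" locale, in which isspace() also
accepts VT (code 11); PCRE itself uses its character tables rather
than a hard-coded list like this one.

```c
#include <assert.h>
#include <ctype.h>

/* The set of characters listed above for \s:
 * HT (9), LF (10), FF (12), CR (13), and space (32). */
static int is_backslash_s(int c)
{
    assert(c >= 0 && c <= 255);
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* True when isspace() and the \s set above disagree about c. */
static int differs_from_isspace(int c)
{
    return (isspace(c) != 0) != (is_backslash_s(c) != 0);
}
```

In the "C" locale the only code for which differs_from_isspace()
returns true is 11, the VT character.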
1829 |
|
|
1830 |
A "word" character is any letter or digit or the underscore |
A "word" character is any letter or digit or the underscore |
1831 |
character, that is, any character which can be part of a |
character, that is, any character which can be part of a |
1832 |
Perl "word". The definition of letters and digits is con- |
Perl "word". The definition of letters and digits is con- |
1833 |
trolled by PCRE's character tables, and may vary if locale- |
trolled by PCRE's character tables, and may vary if locale- |
1834 |
specific matching is taking place (see "Locale support" |
specific matching is taking place (see "Locale support" in |
1835 |
above). For example, in the "fr" (French) locale, some char- |
the pcreapi page). For example, in the "fr" (French) locale, |
1836 |
acter codes greater than 128 are used for accented letters, |
some character codes greater than 128 are used for accented |
1837 |
and these are matched by \w. |
letters, and these are matched by \w. |
1838 |
|
|
1839 |
These character type sequences can appear both inside and |
These character type sequences can appear both inside and |
1840 |
outside character classes. They each match one character of |
outside character classes. They each match one character of |
1849 |
for more complicated assertions is described below. The |
for more complicated assertions is described below. The |
1850 |
backslashed assertions are |
backslashed assertions are |
1851 |
|
|
1852 |
\b word boundary |
\b matches at a word boundary |
1853 |
\B not a word boundary |
\B matches when not at a word boundary |
1854 |
\A start of subject (independent of multiline mode) |
\A matches at start of subject |
1855 |
\Z end of subject or newline at end (independent of |
\Z matches at end of subject or before newline at end |
1856 |
multiline mode) |
\z matches at end of subject |
1857 |
\z end of subject (independent of multiline mode) |
\G matches at first matching position in subject |
1858 |
|
|
1859 |
These assertions may not appear in character classes (but |
These assertions may not appear in character classes (but |
1860 |
note that "\b" has a different meaning, namely the backspace |
note that \b has a different meaning, namely the backspace |
1861 |
character, inside a character class). |
character, inside a character class). |
1862 |
|
|
1863 |
A word boundary is a position in the subject string where |
A word boundary is a position in the subject string where |
1865 |
match \w or \W (i.e. one matches \w and the other matches |
match \w or \W (i.e. one matches \w and the other matches |
1866 |
\W), or the start or end of the string if the first or last |
\W), or the start or end of the string if the first or last |
1867 |
character matches \w, respectively. |
character matches \w, respectively. |
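
The definition of a word boundary can be expressed directly in C. This
sketch is not PCRE's implementation: the function names are
hypothetical, and the test for a "word" character uses isalnum() in
the default "C" locale instead of PCRE's character tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* A "word" character is a letter, a digit, or the underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* pos is a position between characters, from 0 to len inclusive.
 * There is a boundary when exactly one of the characters on either
 * side of pos is a word character; the positions before the start
 * and after the end count as non-word. */
static int at_word_boundary(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    int before = pos > 0   && is_word_char((unsigned char)s[pos - 1]);
    int after  = pos < len && is_word_char((unsigned char)s[pos]);
    return before != after;
}
```

For the subject "ab cd" this reports boundaries at positions 0, 2, 3
and 5, and none at positions 1 and 4.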
|
|
|
1868 |
The \A, \Z, and \z assertions differ from the traditional |
The \A, \Z, and \z assertions differ from the traditional |
1869 |
circumflex and dollar (described below) in that they only |
circumflex and dollar (described below) in that they only |
1870 |
ever match at the very start and end of the subject string, |
ever match at the very start and end of the subject string, |
1871 |
whatever options are set. They are not affected by the |
whatever options are set. Thus, they are independent of mul- |
1872 |
PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu- |
tiline mode. |
1873 |
ment of pcre_exec() is non-zero, \A can never match. The |
|
1874 |
|
They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL |
1875 |
|
options. If the startoffset argument of pcre_exec() is non- |
1876 |
|
zero, indicating that matching is to start at a point other |
1877 |
|
than the beginning of the subject, \A can never match. The |
1878 |
difference between \Z and \z is that \Z matches before a |
difference between \Z and \z is that \Z matches before a |
1879 |
newline that is the last character of the string as well as |
newline that is the last character of the string as well as |
1880 |
at the end of the string, whereas \z matches only at the |
at the end of the string, whereas \z matches only at the |
1881 |
end. |
end. |
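
The difference between \Z and \z can be stated as two small C
predicates. The names are hypothetical and the functions only restate
the rule given above; they are not taken from PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* \z: true only at the very end of the subject. */
static int at_end_z(size_t len, size_t pos)
{
    assert(pos <= len);
    return pos == len;
}

/* \Z: true at the very end of the subject, and also immediately
 * before a newline that is the last character of the subject. */
static int at_end_Z(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

For the four-character subject "abc" followed by a newline, at_end_Z()
is true at positions 3 and 4, whereas at_end_z() is true only at
position 4.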
1882 |
|
|
1883 |
|
The \G assertion is true only when the current matching |
1884 |
|
position is at the start point of the match, as specified by |
1885 |
|
the startoffset argument of pcre_exec(). It differs from \A |
1886 |
|
when the value of startoffset is non-zero. By calling |
1887 |
|
pcre_exec() multiple times with appropriate arguments, you |
1888 |
|
can mimic Perl's /g option, and it is in this kind of imple- |
1889 |
|
mentation where \G can be useful. |
1890 |
|
|
1891 |
|
Note, however, that PCRE's interpretation of \G, as the |
1892 |
|
start of the current match, is subtly different from Perl's, |
1893 |
|
which defines it as the end of the previous match. In Perl, |
1894 |
|
these can be different when the previously matched string |
1895 |
|
was empty. Because PCRE does just one match at a time, it |
1896 |
|
cannot reproduce this behaviour. |
1897 |
|
|
1898 |
|
If all the alternatives of a pattern begin with \G, the |
1899 |
|
expression is anchored to the starting match position, and |
1900 |
|
the "anchored" flag is set in the compiled regular expres- |
1901 |
|
sion. |
1902 |
|
|
1903 |
|
|
1904 |
CIRCUMFLEX AND DOLLAR |
CIRCUMFLEX AND DOLLAR |
1905 |
|
|
1906 |
Outside a character class, in the default matching mode, the |
Outside a character class, in the default matching mode, the |
1907 |
circumflex character is an assertion which is true only if |
circumflex character is an assertion which is true only if |
1908 |
the current matching point is at the start of the subject |
the current matching point is at the start of the subject |
1909 |
string. If the startoffset argument of pcre_exec() is non- |
string. If the startoffset argument of pcre_exec() is non- |
1910 |
zero, circumflex can never match. Inside a character class, |
zero, circumflex can never match if the PCRE_MULTILINE |
1911 |
circumflex has an entirely different meaning (see below). |
option is unset. Inside a character class, circumflex has an |
1912 |
|
entirely different meaning (see below). |
1913 |
|
|
1914 |
Circumflex need not be the first character of the pattern if |
Circumflex need not be the first character of the pattern if |
1915 |
a number of alternatives are involved, but it should be the |
a number of alternatives are involved, but it should be the |
1931 |
|
|
1932 |
The meaning of dollar can be changed so that it matches only |
The meaning of dollar can be changed so that it matches only |
1933 |
at the very end of the string, by setting the |
at the very end of the string, by setting the |
1934 |
PCRE_DOLLAR_ENDONLY option at compile or matching time. This |
PCRE_DOLLAR_ENDONLY option at compile time. This does not |
1935 |
does not affect the \Z assertion. |
affect the \Z assertion. |
1936 |
|
|
1937 |
The meanings of the circumflex and dollar characters are |
The meanings of the circumflex and dollar characters are |
1938 |
changed if the PCRE_MULTILINE option is set. When this is |
changed if the PCRE_MULTILINE option is set. When this is |
1939 |
the case, they match immediately after and immediately |
the case, they match immediately after and immediately |
1940 |
before an internal "\n" character, respectively, in addition |
before an internal newline character, respectively, in addi- |
1941 |
to matching at the start and end of the subject string. For |
tion to matching at the start and end of the subject string. |
1942 |
example, the pattern /^abc$/ matches the subject string |
For example, the pattern /^abc$/ matches the subject string |
1943 |
"def\nabc" in multiline mode, but not otherwise. Conse- |
"def\nabc" in multiline mode, but not otherwise. Conse- |
1944 |
quently, patterns that are anchored in single line mode |
quently, patterns that are anchored in single line mode |
1945 |
because all branches start with "^" are not anchored in mul- |
because all branches start with ^ are not anchored in multi- |
1946 |
tiline mode, and a match for circumflex is possible when the |
line mode, and a match for circumflex is possible when the |
1947 |
startoffset argument of pcre_exec() is non-zero. The |
startoffset argument of pcre_exec() is non-zero. The |
1948 |
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is |
PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is |
1949 |
set. |
set. |
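
The /^abc$/ example above can be checked with a short sketch. It uses Python's re module as a stand-in for PCRE (an assumption of this example); re.MULTILINE plays the role of PCRE_MULTILINE, and the second argument of search() on a compiled pattern plays the role of a non-zero startoffset.

```python
import re

subject = "def\nabc"

# Default mode: ^ matches only at the very start of the subject.
default_match = re.search(r"^abc$", subject)

# Multiline mode: ^ also matches immediately after an internal newline.
multiline_match = re.search(r"^abc$", subject, re.MULTILINE)

# Starting the search at offset 4 (just after the newline):
offset_default = re.compile(r"^abc").search(subject, 4)
offset_multiline = re.compile(r"^abc", re.MULTILINE).search(subject, 4)
```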
1954 |
whether PCRE_MULTILINE is set or not. |
whether PCRE_MULTILINE is set or not. |
1955 |
|
|
1956 |
|
|
|
|
|
1957 |
FULL STOP (PERIOD, DOT) |
FULL STOP (PERIOD, DOT) |
1958 |
|
|
1959 |
Outside a character class, a dot in the pattern matches any |
Outside a character class, a dot in the pattern matches any |
1960 |
one character in the subject, including a non-printing char- |
one character in the subject, including a non-printing char- |
1961 |
acter, but not (by default) newline. If the PCRE_DOTALL |
acter, but not (by default) newline. In UTF-8 mode, a dot |
1962 |
option is set, dots match newlines as well. The handling of |
matches any UTF-8 character, which might be more than one |
1963 |
dot is entirely independent of the handling of circumflex |
byte long, except (by default) for newline. If the |
1964 |
and dollar, the only relationship being that they both |
PCRE_DOTALL option is set, dots match newlines as well. The |
1965 |
involve newline characters. Dot has no special meaning in a |
handling of dot is entirely independent of the handling of |
1966 |
character class. |
circumflex and dollar, the only relationship being that they |
1967 |
|
both involve newline characters. Dot has no special meaning |
1968 |
|
in a character class. |
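
A minimal sketch of the dot rules above, again using Python's re module as a stand-in for PCRE (an assumption of this example). re.DOTALL corresponds to PCRE_DOTALL. Python strings are sequences of characters rather than UTF-8 bytes, so the last case is only an analogue of PCRE's UTF-8 mode.

```python
import re

# By default a dot does not match a newline.
dot_default = re.fullmatch(r"a.b", "a\nb")

# With the "dot matches all" option it does.
dot_all = re.fullmatch(r"a.b", "a\nb", re.DOTALL)

# A dot matches non-printing characters such as tab.
dot_tab = re.fullmatch(r"a.b", "a\tb")

# A dot matches one whole character, even one needing several UTF-8 bytes.
dot_wide = re.fullmatch(r"a.b", "a\u0100b")

# Inside a character class the dot is an ordinary character.
dot_in_class = re.fullmatch(r"a[.]b", "axb")
```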
1969 |
|
|
1970 |
|
|
1971 |
|
|
1972 |
|
MATCHING A SINGLE BYTE |
1973 |
|
|
1974 |
|
Outside a character class, the escape sequence \C matches |
1975 |
|
any one byte, both in and out of UTF-8 mode. Unlike a dot, |
1976 |
|
it always matches a newline. The feature is provided in Perl |
1977 |
|
in order to match individual bytes in UTF-8 mode. Because |
1978 |
|
it breaks up UTF-8 characters into individual bytes, what |
1979 |
|
remains in the string may be a malformed UTF-8 string. For |
1980 |
|
this reason it is best avoided. |
1981 |
|
|
1982 |
|
PCRE does not allow \C to appear in lookbehind assertions |
1983 |
|
(see below), because in UTF-8 mode it makes it impossible to |
1984 |
|
calculate the length of the lookbehind. |
1985 |
|
|
1986 |
|
|
1987 |
SQUARE BRACKETS |
SQUARE BRACKETS |
1988 |
|
|
1989 |
An opening square bracket introduces a character class, ter- |
An opening square bracket introduces a character class, ter- |
1990 |
minated by a closing square bracket. A closing square |
minated by a closing square bracket. A closing square |
1991 |
bracket on its own is not special. If a closing square |
bracket on its own is not special. If a closing square |
1993 |
the first data character in the class (after an initial cir- |
the first data character in the class (after an initial cir- |
1994 |
cumflex, if present) or escaped with a backslash. |
cumflex, if present) or escaped with a backslash. |
1995 |
|
|
1996 |
A character class matches a single character in the subject; |
A character class matches a single character in the subject. |
1997 |
the character must be in the set of characters defined by |
In UTF-8 mode, the character may occupy more than one byte. |
1998 |
the class, unless the first character in the class is a cir- |
A matched character must be in the set of characters defined |
1999 |
cumflex, in which case the subject character must not be in |
by the class, unless the first character in the class defin- |
2000 |
the set defined by the class. If a circumflex is actually |
ition is a circumflex, in which case the subject character |
2001 |
required as a member of the class, ensure it is not the |
must not be in the set defined by the class. If a circumflex |
2002 |
first character, or escape it with a backslash. |
is actually required as a member of the class, ensure it is |
2003 |
|
not the first character, or escape it with a backslash. |
2004 |
|
|
2005 |
For example, the character class [aeiou] matches any lower |
For example, the character class [aeiou] matches any lower |
2006 |
case vowel, while [^aeiou] matches any character that is not |
case vowel, while [^aeiou] matches any character that is not |
2011 |
string, and fails if the current pointer is at the end of |
string, and fails if the current pointer is at the end of |
2012 |
the string. |
the string. |
2013 |
|
|
2014 |
|
In UTF-8 mode, characters with values greater than 255 can |
2015 |
|
be included in a class as a literal string of bytes, or by |
2016 |
|
using the \x{ escaping mechanism. |
2017 |
|
|
2018 |
When caseless matching is set, any letters in a class |
When caseless matching is set, any letters in a class |
2019 |
represent both their upper case and lower case versions, so |
represent both their upper case and lower case versions, so |
2020 |
for example, a caseless [aeiou] matches "A" as well as "a", |
for example, a caseless [aeiou] matches "A" as well as "a", |
2021 |
and a caseless [^aeiou] does not match "A", whereas a case- |
and a caseless [^aeiou] does not match "A", whereas a case- |
2022 |
ful version would. |
ful version would. PCRE does not support the concept of case |
2023 |
|
for characters with values greater than 255. |
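
The caseless [aeiou] examples above can be sketched like this, with Python's re module standing in for PCRE (an assumption of this example) and re.IGNORECASE standing in for PCRE_CASELESS.

```python
import re

# A caseless [aeiou] matches "A" as well as "a".
caseless_positive = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)

# A caseless [^aeiou] does not match "A" ...
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)

# ... whereas the caseful version does.
caseful_negated = re.fullmatch(r"[^aeiou]", "A")
```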
2024 |
The newline character is never treated in any special way in |
The newline character is never treated in any special way in |
2025 |
character classes, whatever the setting of the PCRE_DOTALL |
character classes, whatever the setting of the PCRE_DOTALL |
2026 |
or PCRE_MULTILINE options is. A class such as [^a] will |
or PCRE_MULTILINE options is. A class such as [^a] will |
2044 |
separate characters. The octal or hexadecimal representation |
separate characters. The octal or hexadecimal representation |
2045 |
of "]" can also be used to end a range. |
of "]" can also be used to end a range. |
2046 |
|
|
2047 |
Ranges operate in ASCII collating sequence. They can also be |
Ranges operate in the collating sequence of character |
2048 |
used for characters specified numerically, for example |
values. They can also be used for characters specified |
2049 |
[\000-\037]. If a range that includes letters is used when |
numerically, for example [\000-\037]. In UTF-8 mode, ranges |
2050 |
caseless matching is set, it matches the letters in either |
can include characters whose values are greater than 255, |
2051 |
case. For example, [W-c] is equivalent to [][\\^_`wxyzabc], |
for example [\x{100}-\x{2ff}]. |
2052 |
matched caselessly, and if character tables for the "fr" |
|
2053 |
locale are in use, [\xc8-\xcb] matches accented E characters |
If a range that includes letters is used when caseless |
2054 |
in both cases. |
matching is set, it matches the letters in either case. For |
2055 |
|
example, [W-c] is equivalent to [][\\^_`wxyzabc], matched |
2056 |
|
caselessly, and if character tables for the "fr" locale are |
2057 |
|
in use, [\xc8-\xcb] matches accented E characters in both |
2058 |
|
cases. |
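
The [W-c] equivalence can be checked over the ASCII range with a short sketch. Python's re module is used as a stand-in for PCRE (an assumption of this example). The explicit class is written with a doubled backslash so that the backslash character, which lies between W and c in ASCII, is itself a member of the class.

```python
import re

flags = re.IGNORECASE | re.ASCII
range_class = re.compile(r"[W-c]", flags)
explicit_class = re.compile(r"[][\\^_`wxyzabc]", flags)

ascii_chars = [chr(code) for code in range(128)]

# Every ASCII character accepted by each class when matched caselessly.
matched_by_range = {ch for ch in ascii_chars if range_class.fullmatch(ch)}
matched_by_explicit = {ch for ch in ascii_chars if explicit_class.fullmatch(ch)}
```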
2059 |
|
|
2060 |
The character types \d, \D, \s, \S, \w, and \W may also |
The character types \d, \D, \s, \S, \w, and \W may also |
2061 |
appear in a character class, and add the characters that |
appear in a character class, and add the characters that |
2071 |
classes, but it does no harm if they are escaped. |
classes, but it does no harm if they are escaped. |
2072 |
|
|
2073 |
|
|
|
|
|
2074 |
POSIX CHARACTER CLASSES |
POSIX CHARACTER CLASSES |
2075 |
Perl 5.6 (not yet released at the time of writing) is going |
|
2076 |
to support the POSIX notation for character classes, which |
Perl supports the POSIX notation for character classes, |
2077 |
uses names enclosed by [: and :] within the enclosing |
which uses names enclosed by [: and :] within the enclosing |
2078 |
square brackets. PCRE supports this notation. For example, |
square brackets. PCRE also supports this notation. For exam- |
2079 |
|
ple, |
2080 |
|
|
2081 |
[01[:alpha:]%] |
[01[:alpha:]%] |
2082 |
|
|
2086 |
alnum letters and digits |
alnum letters and digits |
2087 |
alpha letters |
alpha letters |
2088 |
ascii character codes 0 - 127 |
ascii character codes 0 - 127 |
2089 |
|
blank space or tab only |
2090 |
cntrl control characters |
cntrl control characters |
2091 |
digit decimal digits (same as \d) |
digit decimal digits (same as \d) |
2092 |
graph printing characters, excluding space |
graph printing characters, excluding space |
2093 |
lower lower case letters |
lower lower case letters |
2094 |
print printing characters, including space |
print printing characters, including space |
2095 |
punct printing characters, excluding letters and digits |
punct printing characters, excluding letters and digits |
2096 |
space white space (same as \s) |
space white space (not quite the same as \s) |
2097 |
upper upper case letters |
upper upper case letters |
2098 |
word "word" characters (same as \w) |
word "word" characters (same as \w) |
2099 |
xdigit hexadecimal digits |
xdigit hexadecimal digits |
2100 |
|
|
2101 |
The names "ascii" and "word" are Perl extensions. Another |
The "space" characters are HT (9), LF (10), VT (11), FF |
2102 |
Perl extension is negation, which is indicated by a ^ char- |
(12), CR (13), and space (32). Notice that this list |
2103 |
acter after the colon. For example, |
includes the VT character (code 11). This makes "space" dif- |
2104 |
|
ferent to \s, which does not include VT (for Perl compati- |
2105 |
|
bility). |
2106 |
|
|
2107 |
|
The name "word" is a Perl extension, and "blank" is a GNU |
2108 |
|
extension from Perl 5.8. Another Perl extension is negation, |
2109 |
|
which is indicated by a ^ character after the colon. For |
2110 |
|
example, |
2111 |
|
|
2112 |
[12[:^digit:]] |
[12[:^digit:]] |
2113 |
|
|
2116 |
"collating element", but these are not supported, and an |
"collating element", but these are not supported, and an |
2117 |
error is given if they are encountered. |
error is given if they are encountered. |
2118 |
|
|
2119 |
|
In UTF-8 mode, characters with values greater than 255 do |
2120 |
|
not match any of the POSIX character classes. |
2121 |
|
|
2122 |
|
|
2123 |
VERTICAL BAR |
VERTICAL BAR |
2124 |
|
|
2125 |
Vertical bar characters are used to separate alternative |
Vertical bar characters are used to separate alternative |
2126 |
patterns. For example, the pattern |
patterns. For example, the pattern |
2127 |
|
|
2137 |
subpattern. |
subpattern. |
2138 |
|
|
2139 |
|
|
|
|
|
2140 |
INTERNAL OPTION SETTING |
INTERNAL OPTION SETTING |
2141 |
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, |
|
2142 |
and PCRE_EXTENDED can be changed from within the pattern by |
The settings of the PCRE_CASELESS, PCRE_MULTILINE, |
2143 |
a sequence of Perl option letters enclosed between "(?" and |
PCRE_DOTALL, and PCRE_EXTENDED options can be changed from |
2144 |
")". The option letters are |
within the pattern by a sequence of Perl option letters |
2145 |
|
enclosed between "(?" and ")". The option letters are |
2146 |
|
|
2147 |
i for PCRE_CASELESS |
i for PCRE_CASELESS |
2148 |
m for PCRE_MULTILINE |
m for PCRE_MULTILINE |
2157 |
If a letter appears both before and after the hyphen, the |
If a letter appears both before and after the hyphen, the |
2158 |
option is unset. |
option is unset. |
2159 |
|
|
2160 |
The scope of these option changes depends on where in the |
When an option change occurs at top level (that is, not |
2161 |
pattern the setting occurs. For settings that are outside |
inside subpattern parentheses), the change applies to the |
2162 |
any subpattern (defined below), the effect is the same as if |
remainder of the pattern that follows. If the change is |
2163 |
the options were set or unset at the start of matching. The |
placed right at the start of a pattern, PCRE extracts it |
2164 |
following patterns all behave in exactly the same way: |
into the global options (and it will therefore show up in |
2165 |
|
data extracted by the pcre_fullinfo() function). |
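
As a rough analogue of an option setting at the start of a pattern being absorbed into the global options, the sketch below uses Python's re module (an assumption of this example; it does not call pcre_fullinfo()). A leading (?i) shows up in the compiled pattern's flags exactly as if the caseless option had been passed at compile time.

```python
import re

inline = re.compile(r"(?i)abc")
explicit = re.compile(r"abc", re.IGNORECASE)

# The leading (?i) is visible in the compiled pattern's global flags.
inline_has_flag = bool(inline.flags & re.IGNORECASE)

# Both forms behave the same way when matching.
inline_match = inline.fullmatch("ABC")
explicit_match = explicit.fullmatch("ABC")
```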
2166 |
(?i)abc |
|
2167 |
a(?i)bc |
An option change within a subpattern affects only that part |
2168 |
ab(?i)c |
of the current pattern that follows it, so |
|
abc(?i) |
|
|
|
|
|
which in turn is the same as compiling the pattern abc with |
|
|
PCRE_CASELESS set. In other words, such "top level" set- |
|
|
tings apply to the whole pattern (unless there are other |
|
|
changes inside subpatterns). If there is more than one set- |
|
|
ting of the same option at top level, the rightmost setting |
|
|
is used. |
|
|
|
|
|
If an option change occurs inside a subpattern, the effect |
|
|
is different. This is a change of behaviour in Perl 5.005. |
|
|
An option change inside a subpattern affects only that part |
|
|
of the subpattern that follows it, so |
|
2169 |
|
|
2170 |
(a(?i)b)c |
(a(?i)b)c |
2171 |
|
|
2192 |
even when it is at top level. It is best put at the start. |
even when it is at top level. It is best put at the start. |
2193 |
|
|
2194 |
|
|
|
|
|
2195 |
SUBPATTERNS |
SUBPATTERNS |
2196 |
|
|
2197 |
Subpatterns are delimited by parentheses (round brackets), |
Subpatterns are delimited by parentheses (round brackets), |
2198 |
which can be nested. Marking part of a pattern as a subpat- |
which can be nested. Marking part of a pattern as a subpat- |
2199 |
tern does two things: |
tern does two things: |
2226 |
The fact that plain parentheses fulfil two functions is not |
The fact that plain parentheses fulfil two functions is not |
2227 |
always helpful. There are often times when a grouping sub- |
always helpful. There are often times when a grouping sub- |
2228 |
pattern is required without a capturing requirement. If an |
pattern is required without a capturing requirement. If an |
2229 |
opening parenthesis is followed by "?:", the subpattern does |
opening parenthesis is followed by a question mark and a |
2230 |
not do any capturing, and is not counted when computing the |
colon, the subpattern does not do any capturing, and is not |
2231 |
number of any subsequent capturing subpatterns. For example, |
counted when computing the number of any subsequent captur- |
2232 |
if the string "the white queen" is matched against the pat- |
ing subpatterns. For example, if the string "the white |
2233 |
tern |
queen" is matched against the pattern |
2234 |
|
|
2235 |
the ((?:red|white) (king|queen)) |
the ((?:red|white) (king|queen)) |
2236 |
|
|
2237 |
the captured substrings are "white queen" and "queen", and |
the captured substrings are "white queen" and "queen", and |
2238 |
are numbered 1 and 2. The maximum number of captured sub- |
are numbered 1 and 2. The maximum number of capturing sub- |
2239 |
strings is 99, and the maximum number of all subpatterns, |
patterns is 65535, and the maximum depth of nesting of all |
2240 |
both capturing and non-capturing, is 200. |
subpatterns, both capturing and non-capturing, is 200. |
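
The "the white queen" example can be reproduced with a short sketch, using Python's re module as a stand-in for PCRE (an assumption of this example). The (?:...) group is not counted, so only two substrings are captured.

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
found = pattern.fullmatch("the white queen")

captured = found.groups()          # captured substrings, numbered from 1
capturing_count = pattern.groups   # number of capturing subpatterns
```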
2241 |
|
|
2242 |
As a convenient shorthand, if any option settings are |
As a convenient shorthand, if any option settings are |
2243 |
required at the start of a non-capturing subpattern, the |
required at the start of a non-capturing subpattern, the |
2254 |
the above patterns match "SUNDAY" as well as "Saturday". |
the above patterns match "SUNDAY" as well as "Saturday". |
2255 |
|
|
2256 |
|
|
2257 |
|
NAMED SUBPATTERNS |
2258 |
|
|
2259 |
|
Identifying capturing parentheses by number is simple, but |
2260 |
|
it can be very hard to keep track of the numbers in compli- |
2261 |
|
cated regular expressions. Furthermore, if an expression is |
2262 |
|
modified, the numbers may change. To help with the diffi- |
2263 |
|
culty, PCRE supports the naming of subpatterns, something |
2264 |
|
that Perl does not provide. The Python syntax (?P<name>...) |
2265 |
|
is used. Names consist of alphanumeric characters and under- |
2266 |
|
scores, and must be unique within a pattern. |
2267 |
|
|
2268 |
|
Named capturing parentheses are still allocated numbers as |
2269 |
|
well as names. The PCRE API provides function calls for |
2270 |
|
extracting the name-to-number translation table from a com- |
2271 |
|
piled pattern. For further details see the pcreapi documen- |
2272 |
|
tation. |
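
Because the (?P<name>...) syntax is borrowed from Python, named subpatterns can be illustrated directly with Python's re module (used here as a stand-in; the PCRE API calls themselves are described in pcreapi). The groupindex attribute is Python's counterpart of the name-to-number table. The pattern and the names year and month are made up for this example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are still numbered; this maps each name to its number.
name_to_number = dict(date.groupindex)

found = date.fullmatch("2003-08")
year_by_name = found.group("year")
year_by_number = found.group(1)
```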
2273 |
|
|
2274 |
|
|
2275 |
REPETITION |
REPETITION |
2276 |
|
|
2277 |
Repetition is specified by quantifiers, which can follow any |
Repetition is specified by quantifiers, which can follow any |
2278 |
of the following items: |
of the following items: |
2279 |
|
|
2280 |
a single character, possibly escaped |
a literal data character |
2281 |
the . metacharacter |
the . metacharacter |
2282 |
|
the \C escape sequence |
2283 |
|
escapes such as \d that match single characters |
2284 |
a character class |
a character class |
2285 |
a back reference (see next section) |
a back reference (see next section) |
2286 |
a parenthesized subpattern (unless it is an assertion - |
a parenthesized subpattern (unless it is an assertion) |
|
see below) |
|
2287 |
|
|
2288 |
The general repetition quantifier specifies a minimum and |
The general repetition quantifier specifies a minimum and |
2289 |
maximum number of permitted matches, by giving the two |
maximum number of permitted matches, by giving the two |
2311 |
one that does not match the syntax of a quantifier, is taken |
one that does not match the syntax of a quantifier, is taken |
2312 |
as a literal character. For example, {,6} is not a quantif- |
as a literal character. For example, {,6} is not a quantif- |
2313 |
ier, but a literal string of four characters. |
ier, but a literal string of four characters. |
2314 |
|
|
2315 |
|
In UTF-8 mode, quantifiers apply to UTF-8 characters rather |
2316 |
|
than to individual bytes. Thus, for example, \x{100}{2} |
2317 |
|
matches two UTF-8 characters, each of which is represented |
2318 |
|
by a two-byte sequence. |
2319 |
|
|
2320 |
The quantifier {0} is permitted, causing the expression to |
The quantifier {0} is permitted, causing the expression to |
2321 |
behave as if the previous item and the quantifier were not |
behave as if the previous item and the quantifier were not |
2322 |
present. |
present. |
2386 |
repeat count that is greater than 1 or with a limited max- |
repeat count that is greater than 1 or with a limited max- |
2387 |
imum, more store is required for the compiled pattern, in |
imum, more store is required for the compiled pattern, in |
2388 |
proportion to the size of the minimum or maximum. |
proportion to the size of the minimum or maximum. |
|
|
|
2389 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL |
2390 |
option (equivalent to Perl's /s) is set, thus allowing the . |
option (equivalent to Perl's /s) is set, thus allowing the . |
2391 |
to match newlines, the pattern is implicitly anchored, |
to match newlines, the pattern is implicitly anchored, |
2392 |
because whatever follows will be tried against every charac- |
because whatever follows will be tried against every charac- |
2393 |
ter position in the subject string, so there is no point in |
ter position in the subject string, so there is no point in |
2394 |
retrying the overall match at any position after the first. |
retrying the overall match at any position after the first. |
2395 |
PCRE treats such a pattern as though it were preceded by \A. |
PCRE normally treats such a pattern as though it were pre- |
2396 |
In cases where it is known that the subject string contains |
ceded by \A. |
2397 |
no newlines, it is worth setting PCRE_DOTALL when the pat- |
|
2398 |
tern begins with .* in order to obtain this optimization, or |
In cases where it is known that the subject string contains |
2399 |
alternatively using ^ to indicate anchoring explicitly. |
no newlines, it is worth setting PCRE_DOTALL in order to |
2400 |
|
obtain this optimization, or alternatively using ^ to indi- |
2401 |
|
cate anchoring explicitly. |
2402 |
|
|
2403 |
|
However, there is one situation where the optimization can- |
2404 |
|
not be used. When .* is inside capturing parentheses that |
2405 |
|
are the subject of a backreference elsewhere in the pattern, |
2406 |
|
a match at the start may fail, and a later one succeed. Con- |
2407 |
|
sider, for example: |
2408 |
|
|
2409 |
|
(.*)abc\1 |
2410 |
|
|
2411 |
|
If the subject is "xyz123abc123" the match point is the |
2412 |
|
fourth character. For this reason, such a pattern is not |
2413 |
|
implicitly anchored. |
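
The back reference case can be checked with a short sketch, using Python's re module as a stand-in for PCRE (an assumption of this example). Offsets in Python are zero-based, so the fourth character is at offset 3.

```python
import re

found = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)

match_start = found.start()   # zero-based offset of the match
captured = found.group(1)     # what (.*) captured
whole = found.group(0)        # the complete matched text
```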
2414 |
|
|
2415 |
When a capturing subpattern is repeated, the value captured |
When a capturing subpattern is repeated, the value captured |
2416 |
is the substring that matched the final iteration. For exam- |
is the substring that matched the final iteration. For exam- |
2430 |
"b". |
"b". |
2431 |
|
|
2432 |
|
|
2433 |
|
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
2434 |
|
|
2435 |
BACK REFERENCES |
With both maximizing and minimizing repetition, failure of |
2436 |
Outside a character class, a backslash followed by a digit |
what follows normally causes the repeated item to be re- |
2437 |
greater than 0 (and possibly further digits) is a back |
evaluated to see if a different number of repeats allows the |
2438 |
|
rest of the pattern to match. Sometimes it is useful to |
2439 |
|
prevent this, either to change the nature of the match, or |
2440 |
|
to cause it to fail earlier than it otherwise might, when the |
2441 |
|
author of the pattern knows there is no point in carrying |
2442 |
|
on. |
2443 |
|
|
2444 |
|
Consider, for example, the pattern \d+foo when applied to |
2445 |
|
the subject line |
2446 |
|
|
2447 |
|
123456bar |
2448 |
|
|
2449 |
|
After matching all 6 digits and then failing to match "foo", |
2450 |
|
the normal action of the matcher is to try again with only 5 |
2451 |
|
digits matching the \d+ item, and then with 4, and so on, |
2452 |
|
before ultimately failing. "Atomic grouping" (a term taken |
2453 |
|
from Jeffrey Friedl's book) provides the means for specify- |
2454 |
|
ing that once a subpattern has matched, it is not to be re- |
2455 |
|
evaluated in this way. |
2456 |
|
|
2457 |
|
If we use atomic grouping for the previous example, the |
2458 |
|
matcher would give up immediately on failing to match "foo" |
2459 |
|
the first time. The notation is a kind of special |
2460 |
|
parenthesis, starting with (?> as in this example: |
2461 |
|
|
2462 |
|
(?>\d+)foo |
2463 |
|
|
2464 |
|
This kind of parenthesis "locks up" the part of the pattern |
2465 |
|
it contains once it has matched, and a failure further into |
2466 |
|
the pattern is prevented from backtracking into it. Back- |
2467 |
|
tracking past it to previous items, however, works as nor- |
2468 |
|
mal. |
2469 |
|
|
2470 |
|
An alternative description is that a subpattern of this type |
2471 |
|
matches the string of characters that an identical stan- |
2472 |
|
dalone pattern would match, if anchored at the current point |
2473 |
|
in the subject string. |
2474 |
|
|
2475 |
|
Atomic grouping subpatterns are not capturing subpatterns. |
2476 |
|
Simple cases such as the above example can be thought of as |
2477 |
|
a maximizing repeat that must swallow everything it can. So, |
2478 |
|
while both \d+ and \d+? are prepared to adjust the number of |
2479 |
|
digits they match in order to make the rest of the pattern |
2480 |
|
match, (?>\d+) can only match an entire sequence of digits. |
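
The difference between an ordinary repeat and an atomic one can be sketched as follows. Python's re module is used as a stand-in for PCRE (an assumption of this example). Since the (?>...) syntax is accepted only by newer Python versions, the sketch swaps in a well-known emulation: a lookahead captures the digits, and a back reference then consumes them, so the engine cannot give any digits back.

```python
import re

subject = "123456"

# Ordinary repeat: \d+ gives back the final "6" so that the literal 6 can match.
backtracking = re.search(r"\d+6", subject)

# Atomic-style repeat (emulated): the digits are swallowed whole, so the
# literal 6 that follows can never match. (?:6) keeps the 6 apart from \1.
atomic_like = re.search(r"(?=(\d+))\1(?:6)", subject)

# The emulated atomic repeat still matches when no digits need giving back.
atomic_then_foo = re.search(r"(?=(\d+))\1foo", "123foo")
```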
2481 |
|
|
2482 |
|
Atomic groups in general can of course contain arbitrarily |
2483 |
|
complicated subpatterns, and can be nested. However, when |
2484 |
|
the subpattern for an atomic group is just a single repeated |
2485 |
|
item, as in the example above, a simpler notation, called a |
2486 |
|
"possessive quantifier" can be used. This consists of an |
2487 |
|
additional + character following a quantifier. Using this |
2488 |
|
notation, the previous example can be rewritten as |
2489 |
|
|
2490 |
|
\d++foo |
2491 |
|
|
2492 |
|
Possessive quantifiers are always greedy; the setting of the |
2493 |
|
PCRE_UNGREEDY option is ignored. They are a convenient nota- |
2494 |
|
tion for the simpler forms of atomic group. However, there |
2495 |
|
is no difference in the meaning or processing of a posses- |
2496 |
|
sive quantifier and the equivalent atomic group. |
2497 |
|
|
2498 |
|
The possessive quantifier syntax is an extension to the Perl |
2499 |
|
syntax. It originates in Sun's Java package. |
2500 |
|
|
2501 |
|
When a pattern contains an unlimited repeat inside a subpat- |
2502 |
|
tern that can itself be repeated an unlimited number of |
2503 |
|
times, the use of an atomic group is the only way to avoid |
2504 |
|
some failing matches taking a very long time indeed. The |
2505 |
|
pattern |
2506 |
|
|
2507 |
|
(\D+|<\d+>)*[!?] |
2508 |
|
|
2509 |
|
matches an unlimited number of substrings that either con- |
2510 |
|
sist of non-digits, or digits enclosed in <>, followed by |
2511 |
|
either ! or ?. When it matches, it runs quickly. However, if |
2512 |
|
it is applied to |
2513 |
|
|
2514 |
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
2515 |
|
|
2516 |
|
it takes a long time before reporting failure. This is |
2517 |
|
because the string can be divided between the two repeats in |
2518 |
|
a large number of ways, and all have to be tried. (The exam- |
2519 |
|
ple used [!?] rather than a single character at the end, |
2520 |
|
because both PCRE and Perl have an optimization that allows |
2521 |
|
for fast failure when a single character is used. They |
2522 |
|
remember the last single character that is required for a |
2523 |
|
match, and fail early if it is not present in the string.) |
2524 |
|
If the pattern is changed to |
2525 |
|
|
2526 |
|
((?>\D+)|<\d+>)*[!?] |
2527 |
|
|
2528 |
|
sequences of non-digits cannot be broken, and failure hap- |
2529 |
|
pens quickly. |
2530 |
|
|
2531 |
|
|
2532 |
reference to a capturing subpattern earlier (i.e. to its |
BACK REFERENCES |
2533 |
|
|
2534 |
|
Outside a character class, a backslash followed by a digit |
2535 |
|
greater than 0 (and possibly further digits) is a back |
2536 |
|
reference to a capturing subpattern earlier (that is, to its |
2537 |
left) in the pattern, provided there have been that many |
left) in the pattern, provided there have been that many |
2538 |
previous capturing left parentheses. |
previous capturing left parentheses. |
2539 |
|
|
2548 |
|
|
2549 |
A back reference matches whatever actually matched the cap- |
A back reference matches whatever actually matched the cap- |
2550 |
turing subpattern in the current subject string, rather than |
turing subpattern in the current subject string, rather than |
2551 |
anything matching the subpattern itself. So the pattern |
anything matching the subpattern itself (see "Subpatterns as |
2552 |
|
subroutines" below for a way of doing that). So the pattern |
2553 |
|
|
2554 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
2555 |
|
|
2564 |
though the original capturing subpattern is matched case- |
though the original capturing subpattern is matched case- |
2565 |
lessly. |
lessly. |
2566 |
|
|
2567 |
|
Back references to named subpatterns use the Python syntax |
2568 |
|
(?P=name). We could rewrite the above example as follows: |
2569 |
|
|
2570 |
|
(?P<p1>(?i)rah)\s+(?P=p1) |
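
A named back reference of this kind can be tried out with Python's re module, used here as a stand-in for PCRE (an assumption of this example). Recent Python versions reject a bare (?i) in the middle of a pattern, so the sketch uses the scoped form (?i:rah) instead; the back reference itself is still matched casefully.

```python
import re

cheer = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

same_lower = cheer.fullmatch("rah rah")
same_upper = cheer.fullmatch("RAH RAH")
mixed_case = cheer.fullmatch("RAH rah")  # back reference must repeat "RAH" exactly
```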
2571 |
|
|
2572 |
There may be more than one back reference to the same sub- |
There may be more than one back reference to the same sub- |
2573 |
pattern. If a subpattern has not actually been used in a |
pattern. If a subpattern has not actually been used in a |
2574 |
particular match, any back references to it always fail. For |
particular match, any back references to it always fail. For |
2577 |
(a|(bc))\2 |
(a|(bc))\2 |
2578 |
|
|
2579 |
always fails if it starts to match "a" rather than "bc". |
always fails if it starts to match "a" rather than "bc". |
2580 |
Because there may be up to 99 back references, all digits |
Because there may be many capturing parentheses in a pat- |
2581 |
following the backslash are taken as part of a potential |
tern, all digits following the backslash are taken as part |
2582 |
back reference number. If the pattern continues with a digit |
of a potential back reference number. If the pattern contin- |
2583 |
character, some delimiter must be used to terminate the back |
ues with a digit character, some delimiter must be used to |
2584 |
reference. If the PCRE_EXTENDED option is set, this can be |
terminate the back reference. If the PCRE_EXTENDED option is |
2585 |
whitespace. Otherwise an empty comment can be used. |
set, this can be whitespace. Otherwise an empty comment can |
2586 |
|
be used. |
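
Both points above, a back reference to an unused subpattern and the use of an empty comment as a delimiter, can be sketched with Python's re module as a stand-in for PCRE (an assumption of this example).

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# When the first alternative matches, group 2 is unset and \2 fails.
via_a = pattern.fullmatch("aa")

# When the second alternative matches, \2 repeats "bc".
via_bc = pattern.fullmatch("bcbc")

# An empty comment (?#) ends the back reference \1 before a literal digit 1.
delimited = re.fullmatch(r"(a)\1(?#)1", "aa1")
```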
2587 |
|
|
2588 |
A back reference that occurs inside the parentheses to which |
A back reference that occurs inside the parentheses to which |
2589 |
it refers fails when the subpattern is first used, so, for |
it refers fails when the subpattern is first used, so, for |
2602 |
example above, or by a quantifier with a minimum of zero. |
example above, or by a quantifier with a minimum of zero. |
2603 |
|
|
2604 |
|
|
|
|
|
2605 |
ASSERTIONS |
ASSERTIONS |
2606 |
|
|
2607 |
An assertion is a test on the characters following or |
An assertion is a test on the characters following or |
2608 |
preceding the current matching point that does not actually |
preceding the current matching point that does not actually |
2609 |
consume any characters. The simple assertions coded as \b, |
consume any characters. The simple assertions coded as \b, |
2610 |
\B, \A, \Z, \z, ^ and $ are described above. More compli- |
\B, \A, \G, \Z, \z, ^ and $ are described above. More com- |
2611 |
cated assertions are coded as subpatterns. There are two |
plicated assertions are coded as subpatterns. There are two |
2612 |
kinds: those that look ahead of the current position in the |
kinds: those that look ahead of the current position in the |
2613 |
subject string, and those that look behind it. |
subject string, and those that look behind it. |
2614 |
|
|
2635 |
when the next three characters are "bar". A lookbehind |
when the next three characters are "bar". A lookbehind |
2636 |
assertion is needed to achieve this effect. |
assertion is needed to achieve this effect. |
2637 |
|
|
2638 |
|
If you want to force a matching failure at some point in a |
2639 |
|
pattern, the most convenient way to do it is with (?!) |
2640 |
|
because an empty string always matches, so an assertion that |
2641 |
|
requires there not to be an empty string must always fail. |
2642 |
|
|
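A short sketch of these assertions, assuming Python's re module, which accepts the same lookahead and lookbehind syntax. It contrasts a negative lookahead placed before "bar" with a negative lookbehind, and shows (?!) forcing a failure. The helper name first_match_start is hypothetical.

```python
import re


def first_match_start(pattern, subject):
    """Return the offset of the first match, or None if there is none."""
    m = re.search(pattern, subject)
    return None if m is None else m.start()


if __name__ == "__main__":
    # The lookahead is tested at the "b" of "bar", where "foo" never follows,
    # so this finds "bar" even when it is preceded by "foo".
    print(first_match_start(r"(?!foo)bar", "foobar"))
    # The lookbehind inspects the three characters before "bar".
    print(first_match_start(r"(?<!foo)bar", "foobar"))
    print(first_match_start(r"(?<!foo)bar", "bazbar"))
    # (?!) can never succeed, so the first alternative always fails.
    print(first_match_start(r"a(?!)|b", "ab"))
```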
2643 |
Lookbehind assertions start with (?<= for positive asser- |
Lookbehind assertions start with (?<= for positive asser- |
2644 |
tions and (?<! for negative assertions. For example, |
tions and (?<! for negative assertions. For example, |
2645 |
|
|
2660 |
causes an error at compile time. Branches that match dif- |
causes an error at compile time. Branches that match dif- |
2661 |
ferent length strings are permitted only at the top level of |
ferent length strings are permitted only at the top level of |
2662 |
a lookbehind assertion. This is an extension compared with |
a lookbehind assertion. This is an extension compared with |
2663 |
Perl 5.005, which requires all branches to match the same |
Perl (at least for 5.8), which requires all branches to |
2664 |
length of string. An assertion such as |
match the same length of string. An assertion such as |
2665 |
|
|
2666 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
2667 |
|
|
2675 |
alternative, to temporarily move the current position back |
alternative, to temporarily move the current position back |
2676 |
by the fixed width and then try to match. If there are |
by the fixed width and then try to match. If there are |
2677 |
insufficient characters before the current position, the |
insufficient characters before the current position, the |
2678 |
match is deemed to fail. Lookbehinds in conjunction with |
match is deemed to fail. |
2679 |
once-only subpatterns can be particularly useful for match- |
|
2680 |
ing at the ends of strings; an example is given at the end |
PCRE does not allow the \C escape (which matches a single |
2681 |
of the section on once-only subpatterns. |
byte in UTF-8 mode) to appear in lookbehind assertions, |
2682 |
|
because it makes it impossible to calculate the length of |
2683 |
|
the lookbehind. |
2684 |
|
|
2685 |
|
Atomic groups can be used in conjunction with lookbehind |
2686 |
|
assertions to specify efficient matching at the end of the |
2687 |
|
subject string. Consider a simple pattern such as |
2688 |
|
|
2689 |
|
abcd$ |
2690 |
|
|
2691 |
|
when applied to a long string that does not match. Because |
2692 |
|
matching proceeds from left to right, PCRE will look for |
2693 |
|
each "a" in the subject and then see if what follows matches |
2694 |
|
the rest of the pattern. If the pattern is specified as |
2695 |
|
|
2696 |
|
^.*abcd$ |
2697 |
|
|
2698 |
|
the initial .* matches the entire string at first, but when |
2699 |
|
this fails (because there is no following "a"), it back- |
2700 |
|
tracks to match all but the last character, then all but the |
2701 |
|
last two characters, and so on. Once again the search for |
2702 |
|
"a" covers the entire string, from right to left, so we are |
2703 |
|
no better off. However, if the pattern is written as |
2704 |
|
|
2705 |
|
^(?>.*)(?<=abcd) |
2706 |
|
|
2707 |
|
or, equivalently, |
2708 |
|
|
2709 |
|
^.*+(?<=abcd) |
2710 |
|
|
2711 |
|
there can be no backtracking for the .* item; it can match |
2712 |
|
only the entire string. The subsequent lookbehind assertion |
2713 |
|
does a single test on the last four characters. If it fails, |
2714 |
|
the match fails immediately. For long strings, this approach |
2715 |
|
makes a significant difference to the processing time. |
2716 |
|
|
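The end-of-string technique can be tried with the sketch below. It assumes Python's re module rather than PCRE: the atomic group syntax is accepted only by Python 3.11 and later, so the code falls back to a lookahead capture plus back reference, a common emulation of an atomic group, on older interpreters. The helper name ends_with_abcd is hypothetical.

```python
import re

try:
    # Atomic group syntax, as in the text.
    ENDS_WITH_ABCD = re.compile(r"^(?>.*)(?<=abcd)")
except re.error:
    # Emulation: the lookahead captures the greedy match once, and the back
    # reference consumes it without offering shorter alternatives.
    ENDS_WITH_ABCD = re.compile(r"^(?=(.*))\1(?<=abcd)")


def ends_with_abcd(line):
    """Return True if a single-line subject ends in "abcd"."""
    return ENDS_WITH_ABCD.match(line) is not None


if __name__ == "__main__":
    print(ends_with_abcd("xxxxabcd"))
    print(ends_with_abcd("abcdxxxx"))
    print(ends_with_abcd("a" * 100000))
```

Either form swallows the line once and then makes a single lookbehind test, which is the behaviour the paragraph above describes.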
2717 |
Several assertions (of any sort) may occur in succession. |
Several assertions (of any sort) may occur in succession. |
2718 |
For example, |
For example, |
2757 |
for positive assertions, because it does not make sense for |
for positive assertions, because it does not make sense for |
2758 |
negative assertions. |
negative assertions. |
2759 |
|
|
|
Assertions count towards the maximum of 200 parenthesized |
|
|
subpatterns. |
|
|
|
|
|
|
|
|
|
|
|
ONCE-ONLY SUBPATTERNS |
|
|
With both maximizing and minimizing repetition, failure of |
|
|
what follows normally causes the repeated item to be re- |
|
|
evaluated to see if a different number of repeats allows the |
|
|
rest of the pattern to match. Sometimes it is useful to |
|
|
prevent this, either to change the nature of the match, or |
|
|
to cause it to fail earlier than it otherwise might, when the |
|
|
author of the pattern knows there is no point in carrying |
|
|
on. |
|
|
|
|
|
Consider, for example, the pattern \d+foo when applied to |
|
|
the subject line |
|
|
|
|
|
123456bar |
|
|
|
|
|
After matching all 6 digits and then failing to match "foo", |
|
|
the normal action of the matcher is to try again with only 5 |
|
|
digits matching the \d+ item, and then with 4, and so on, |
|
|
before ultimately failing. Once-only subpatterns provide the |
|
|
means for specifying that once a portion of the pattern has |
|
|
matched, it is not to be re-evaluated in this way, so the |
|
|
matcher would give up immediately on failing to match "foo" |
|
|
the first time. The notation is another kind of special |
|
|
parenthesis, starting with (?> as in this example: |
|
|
|
|
|
(?>\d+)foo |
|
|
|
|
|
This kind of parenthesis "locks up" the part of the pattern |
|
|
it contains once it has matched, and a failure further into |
|
|
the pattern is prevented from backtracking into it. Back- |
|
|
tracking past it to previous items, however, works as nor- |
|
|
mal. |
|
|
|
|
|
An alternative description is that a subpattern of this type |
|
|
matches the string of characters that an identical stan- |
|
|
dalone pattern would match, if anchored at the current point |
|
|
in the subject string. |
|
|
|
|
|
Once-only subpatterns are not capturing subpatterns. Simple |
|
|
cases such as the above example can be thought of as a max- |
|
|
imizing repeat that must swallow everything it can. So, |
|
|
while both \d+ and \d+? are prepared to adjust the number of |
|
|
digits they match in order to make the rest of the pattern |
|
|
match, (?>\d+) can only match an entire sequence of digits. |
|
|
|
|
|
This construction can of course contain arbitrarily compli- |
|
|
cated subpatterns, and it can be nested. |
|
|
|
|
|
Once-only subpatterns can be used in conjunction with look- |
|
|
behind assertions to specify efficient matching at the end |
|
|
of the subject string. Consider a simple pattern such as |
|
|
|
|
|
abcd$ |
|
|
|
|
|
when applied to a long string which does not match. Because |
|
|
matching proceeds from left to right, PCRE will look for |
|
|
each "a" in the subject and then see if what follows matches |
|
|
the rest of the pattern. If the pattern is specified as |
|
|
|
|
|
^.*abcd$ |
|
|
|
|
|
the initial .* matches the entire string at first, but when |
|
|
this fails (because there is no following "a"), it back- |
|
|
tracks to match all but the last character, then all but the |
|
|
last two characters, and so on. Once again the search for |
|
|
"a" covers the entire string, from right to left, so we are |
|
|
no better off. However, if the pattern is written as |
|
|
|
|
|
^(?>.*)(?<=abcd) |
|
|
|
|
|
there can be no backtracking for the .* item; it can match |
|
|
only the entire string. The subsequent lookbehind assertion |
|
|
does a single test on the last four characters. If it fails, |
|
|
the match fails immediately. For long strings, this approach |
|
|
makes a significant difference to the processing time. |
|
|
|
|
|
When a pattern contains an unlimited repeat inside a subpat- |
|
|
tern that can itself be repeated an unlimited number of |
|
|
times, the use of a once-only subpattern is the only way to |
|
|
avoid some failing matches taking a very long time indeed. |
|
|
The pattern |
|
|
|
|
|
(\D+|<\d+>)*[!?] |
|
|
|
|
|
matches an unlimited number of substrings that either con- |
|
|
sist of non-digits, or digits enclosed in <>, followed by |
|
|
either ! or ?. When it matches, it runs quickly. However, if |
|
|
it is applied to |
|
|
|
|
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
|
|
|
|
|
it takes a long time before reporting failure. This is |
|
|
because the string can be divided between the two repeats in |
|
|
a large number of ways, and all have to be tried. (The exam- |
|
|
ple used [!?] rather than a single character at the end, |
|
|
because both PCRE and Perl have an optimization that allows |
|
|
for fast failure when a single character is used. They |
|
|
remember the last single character that is required for a |
|
|
match, and fail early if it is not present in the string.) |
|
|
If the pattern is changed to |
|
|
|
|
|
((?>\D+)|<\d+>)*[!?] |
|
|
|
|
|
sequences of non-digits cannot be broken, and failure hap- |
|
|
pens quickly. |
|
|
|
|
|
|
|
2760 |
|
|
2761 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
2762 |
|
|
2763 |
It is possible to cause the matching process to obey a sub- |
It is possible to cause the matching process to obey a sub- |
2764 |
pattern conditionally or to choose between two alternative |
pattern conditionally or to choose between two alternative |
2765 |
subpatterns, depending on the result of an assertion, or |
subpatterns, depending on the result of an assertion, or |
2774 |
more than two alternatives in the subpattern, a compile-time |
more than two alternatives in the subpattern, a compile-time |
2775 |
error occurs. |
error occurs. |
2776 |
|
|
2777 |
There are two kinds of condition. If the text between the |
There are three kinds of condition. If the text between the |
2778 |
parentheses consists of a sequence of digits, the condition |
parentheses consists of a sequence of digits, the condition |
2779 |
is satisfied if the capturing subpattern of that number has |
is satisfied if the capturing subpattern of that number has |
2780 |
previously matched. The number must be greater than zero. |
previously matched. The number must be greater than zero. |
2798 |
matches a sequence of non-parentheses, optionally enclosed |
matches a sequence of non-parentheses, optionally enclosed |
2799 |
in parentheses. |
in parentheses. |
2800 |
|
|
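A minimal sketch of a condition that tests a numbered group, assuming Python's re module, which supports the same (?(1)...) form. The pattern is an illustration written to fit the description above (non-parentheses, optionally enclosed in parentheses). The helper name is hypothetical.

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then demands ")"
# only when group 1 took part in the match.
OPTIONAL_PARENS = re.compile(r"(\()?[^()]+(?(1)\))")


def is_optionally_parenthesized(subject):
    """Return True for "abc" or "(abc)", but not "(abc" or "abc)"."""
    return OPTIONAL_PARENS.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ("abc", "(abc)", "(abc", "abc)"):
        print(s, "->", is_optionally_parenthesized(s))
```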
2801 |
If the condition is not a sequence of digits, it must be an |
If the condition is the string (R), it is satisfied if a |
2802 |
assertion. This may be a positive or negative lookahead or |
recursive call to the pattern or subpattern has been made. |
2803 |
lookbehind assertion. Consider this pattern, again contain- |
At "top level", the condition is false. This is a PCRE |
2804 |
ing non-significant white space, and with the two alterna- |
extension. Recursive patterns are described in the next |
2805 |
tives on the second line: |
section. |
2806 |
|
|
2807 |
|
If the condition is not a sequence of digits or (R), it must |
2808 |
|
be an assertion. This may be a positive or negative looka- |
2809 |
|
head or lookbehind assertion. Consider this pattern, again |
2810 |
|
containing non-significant white space, and with the two |
2811 |
|
alternatives on the second line: |
2812 |
|
|
2813 |
(?(?=[^a-z]*[a-z]) |
(?(?=[^a-z]*[a-z]) |
2814 |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
\d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} ) |
2823 |
letters and dd are digits. |
letters and dd are digits. |
2824 |
|
|
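Python's re module, used here as a stand-in for PCRE, does not accept an assertion as a condition. The sketch below therefore emulates the pattern above with two alternatives, each guarded by the positive or negative form of the same lookahead. It is an equivalent rewriting for illustration, not PCRE's own syntax, and the helper name is hypothetical.

```python
import re

DATE_LIKE = re.compile(
    r"""
    (?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}   # a letter lies ahead
      | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2}      # no letter lies ahead
    )
    """,
    re.VERBOSE,
)


def is_date_like(subject):
    """Return True for dd-aaa-dd or dd-dd-dd, with lowercase letters."""
    return DATE_LIKE.fullmatch(subject) is not None


if __name__ == "__main__":
    for s in ("12-abc-34", "12-34-56", "12-34-ab"):
        print(s, "->", is_date_like(s))
```

Because the two guards are mutually exclusive, only one alternative is ever attempted at a given position, which mirrors how the condition selects a branch.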
2825 |
|
|
|
|
|
2826 |
COMMENTS |
COMMENTS |
2827 |
|
|
2828 |
The sequence (?# marks the start of a comment which contin- |
The sequence (?# marks the start of a comment which contin- |
2829 |
ues up to the next closing parenthesis. Nested parentheses |
ues up to the next closing parenthesis. Nested parentheses |
2830 |
are not permitted. The characters that make up a comment |
are not permitted. The characters that make up a comment |
2835 |
ues up to the next newline character in the pattern. |
ues up to the next newline character in the pattern. |
2836 |
|
|
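Both comment forms can be sketched with Python's re module, on the assumption that its re.VERBOSE flag plays the role of PCRE_EXTENDED for this purpose.

```python
import re

# An inline comment: everything from (?# to the next ")" is ignored.
INLINE = re.compile(r"ab(?#this text plays no part in the match)c")

# With the extended option, an unescaped # outside a character class starts
# a comment that runs to the next newline in the pattern.
EXTENDED = re.compile(
    r"""
    ab   # first two letters
    c    # last letter
    """,
    re.VERBOSE,
)

if __name__ == "__main__":
    print(INLINE.fullmatch("abc") is not None)
    print(EXTENDED.fullmatch("abc") is not None)
```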
2837 |
|
|
|
|
|
2838 |
RECURSIVE PATTERNS |
RECURSIVE PATTERNS |
2839 |
|
|
2840 |
Consider the problem of matching a string in parentheses, |
Consider the problem of matching a string in parentheses, |
2841 |
allowing for unlimited nested parentheses. Without the use |
allowing for unlimited nested parentheses. Without the use |
2842 |
of recursion, the best that can be done is to use a pattern |
of recursion, the best that can be done is to use a pattern |
2843 |
that matches up to some fixed depth of nesting. It is not |
that matches up to some fixed depth of nesting. It is not |
2844 |
possible to handle an arbitrary nesting depth. Perl 5.6 has |
possible to handle an arbitrary nesting depth. Perl has pro- |
2845 |
provided an experimental facility that allows regular |
vided an experimental facility that allows regular expres- |
2846 |
expressions to recurse (amongst other things). It does this |
sions to recurse (amongst other things). It does this by |
2847 |
by interpolating Perl code in the expression at run time, |
interpolating Perl code in the expression at run time, and |
2848 |
and the code can refer to the expression itself. A Perl pat- |
the code can refer to the expression itself. A Perl pattern |
2849 |
tern to solve the parentheses problem can be created like |
to solve the parentheses problem can be created like this: |
|
this: |
|
2850 |
|
|
2851 |
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
2852 |
|
|
2853 |
The (?p{...}) item interpolates Perl code at run time, and |
The (?p{...}) item interpolates Perl code at run time, and |
2854 |
in this case refers recursively to the pattern in which it |
in this case refers recursively to the pattern in which it |
2855 |
appears. Obviously, PCRE cannot support the interpolation of |
appears. Obviously, PCRE cannot support the interpolation of |
2856 |
Perl code. Instead, the special item (?R) is provided for |
Perl code. Instead, it supports some special syntax for |
2857 |
the specific case of recursion. This PCRE pattern solves the |
recursion of the entire pattern, and also for individual |
2858 |
parentheses problem (assume the PCRE_EXTENDED option is set |
subpattern recursion. |
2859 |
so that white space is ignored): |
|
2860 |
|
The special item that consists of (? followed by a number |
2861 |
|
greater than zero and a closing parenthesis is a recursive |
2862 |
|
call of the subpattern of the given number, provided that it |
2863 |
|
occurs inside that subpattern. (If not, it is a "subroutine" |
2864 |
|
call, which is described in the next section.) The special |
2865 |
|
item (?R) is a recursive call of the entire regular expres- |
2866 |
|
sion. |
2867 |
|
|
2868 |
|
For example, this PCRE pattern solves the nested parentheses |
2869 |
|
problem (assume the PCRE_EXTENDED option is set so that |
2870 |
|
white space is ignored): |
2871 |
|
|
2872 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
2873 |
|
|
2874 |
First it matches an opening parenthesis. Then it matches any |
First it matches an opening parenthesis. Then it matches any |
2875 |
number of substrings which can either be a sequence of non- |
number of substrings which can either be a sequence of non- |
2876 |
parentheses, or a recursive match of the pattern itself |
parentheses, or a recursive match of the pattern itself |
2877 |
(i.e. a correctly parenthesized substring). Finally there is |
(that is, a correctly parenthesized substring). Finally |
2878 |
a closing parenthesis. |
there is a closing parenthesis. |
2879 |
|
|
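Python's re module has no (?R) item, so the sketch below does not run the PCRE pattern. Instead it spells out, as an ordinary recursive function, the steps that the pattern describes: an opening parenthesis, then any number of runs of non-parentheses or nested matches, then a closing parenthesis. The function name match_parens is hypothetical.

```python
import re

_NON_PARENS = re.compile(r"[^()]+")


def match_parens(subject, pos=0):
    """Return the end offset of a balanced group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None
    pos += 1
    while True:
        run = _NON_PARENS.match(subject, pos)
        if run:                        # corresponds to (?>[^()]+)
            pos = run.end()
            continue
        inner = match_parens(subject, pos)
        if inner is not None:          # corresponds to the recursive call
            pos = inner
            continue
        break
    if pos < len(subject) and subject[pos] == ")":
        return pos + 1
    return None


if __name__ == "__main__":
    print(match_parens("(ab(cd)ef)"))
    print(match_parens("(aaa()"))
```

Each run of non-parentheses is taken whole and never retried at a shorter length, which plays the same role as the atomic group in the pattern.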
2880 |
|
If this were part of a larger pattern, you would not want to |
2881 |
|
recurse the entire pattern, so instead you could use this: |
2882 |
|
|
2883 |
|
( \( ( (?>[^()]+) | (?1) )* \) ) |
2884 |
|
|
2885 |
|
We have put the pattern into parentheses, and caused the |
2886 |
|
recursion to refer to them instead of the whole pattern. In |
2887 |
|
a larger pattern, keeping track of parenthesis numbers can |
2888 |
|
be tricky. It may be more convenient to use named |
2889 |
|
parentheses instead. For this, PCRE uses (?P>name), which is |
2890 |
|
an extension to the Python syntax that PCRE uses for named |
2891 |
|
parentheses (Perl does not provide named parentheses). We |
2892 |
|
could rewrite the above example as follows: |
2893 |
|
|
2894 |
|
(?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) ) |
2895 |
|
|
2896 |
This particular example pattern contains nested unlimited |
This particular example pattern contains nested unlimited |
2897 |
repeats, and so the use of a once-only subpattern for match- |
repeats, and so the use of atomic grouping for matching |
2898 |
ing strings of non-parentheses is important when applying |
strings of non-parentheses is important when applying the |
2899 |
the pattern to strings that do not match. For example, when |
pattern to strings that do not match. For example, when this |
2900 |
it is applied to |
pattern is applied to |
2901 |
|
|
2902 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
2903 |
|
|
2904 |
it yields "no match" quickly. However, if a once-only sub- |
it yields "no match" quickly. However, if atomic grouping is |
2905 |
pattern is not used, the match runs for a very long time |
not used, the match runs for a very long time indeed because |
2906 |
indeed because there are so many different ways the + and * |
there are so many different ways the + and * repeats can |
2907 |
repeats can carve up the subject, and all have to be tested |
carve up the subject, and all have to be tested before |
2908 |
before failure can be reported. |
failure can be reported. |
2909 |
|
At the end of a match, the values set for any capturing sub- |
2910 |
The values set for any capturing subpatterns are those from |
patterns are those from the outermost level of the recursion |
2911 |
the outermost level of the recursion at which the subpattern |
at which the subpattern value is set. If you want to obtain |
2912 |
value is set. If the pattern above is matched against |
intermediate values, a callout function can be used (see |
2913 |
|
below and the pcrecallout documentation). If the pattern |
2914 |
|
above is matched against |
2915 |
|
|
2916 |
(ab(cd)ef) |
(ab(cd)ef) |
2917 |
|
|
2921 |
|
|
2922 |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
2923 |
^ ^ |
^ ^ |
2924 |
^ ^ the string they capture is |
^ ^ |
|
"ab(cd)ef", the contents of the top level parentheses. If |
|
|
there are more than 15 capturing parentheses in a pattern, |
|
|
PCRE has to obtain extra memory to store data during a |
|
|
recursion, which it does by using pcre_malloc, freeing it |
|
|
via pcre_free afterwards. If no memory can be obtained, it |
|
|
saves data for the first 15 capturing parentheses only, as |
|
|
there is no way to give an out-of-memory error from within a |
|
|
recursion. |
|
2925 |
|
|
2926 |
|
the string they capture is "ab(cd)ef", the contents of the |
2927 |
|
top level parentheses. If there are more than 15 capturing |
2928 |
|
parentheses in a pattern, PCRE has to obtain extra memory to |
2929 |
|
store data during a recursion, which it does by using |
2930 |
|
pcre_malloc, freeing it via pcre_free afterwards. If no |
2931 |
|
memory can be obtained, the match fails with the |
2932 |
|
PCRE_ERROR_NOMEMORY error. |
2933 |
|
|
2934 |
|
Do not confuse the (?R) item with the condition (R), which |
2935 |
|
tests for recursion. Consider this pattern, which matches |
2936 |
|
text in angle brackets, allowing for arbitrary nesting. Only |
2937 |
|
digits are allowed in nested brackets (that is, when recurs- |
2938 |
|
ing), whereas any characters are permitted at the outer |
2939 |
|
level. |
2940 |
|
|
2941 |
|
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
2942 |
|
|
2943 |
|
In this pattern, (?(R) is the start of a conditional subpat- |
2944 |
|
tern, with two different alternatives for the recursive and |
2945 |
|
non-recursive cases. The (?R) item is the actual recursive |
2946 |
|
call. |
2947 |
|
|
2948 |
|
|
2949 |
|
SUBPATTERNS AS SUBROUTINES |
2950 |
|
|
2951 |
|
If the syntax for a recursive subpattern reference (either |
2952 |
|
by number or by name) is used outside the parentheses to |
2953 |
|
which it refers, it operates like a subroutine in a program- |
2954 |
|
ming language. An earlier example pointed out that the pat- |
2955 |
|
tern |
2956 |
|
|
2957 |
|
(sens|respons)e and \1ibility |
2958 |
|
|
2959 |
|
matches "sense and sensibility" and "response and responsi- |
2960 |
|
bility", but not "sense and responsibility". If instead the |
2961 |
|
pattern |
2962 |
|
|
2963 |
|
(sens|respons)e and (?1)ibility |
2964 |
|
|
2965 |
|
is used, it does match "sense and responsibility" as well as |
2966 |
|
the other two strings. Such references must, however, follow |
2967 |
|
the subpattern to which they refer. |
2968 |
|
|
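The difference between a back reference and a subroutine call can be sketched in Python. Its re module has no (?1) item, so the subroutine form is emulated here by repeating the subpattern text. That is an assumption made for illustration, and the names STEM, BACKREF and SUBROUTINE_LIKE are hypothetical.

```python
import re

STEM = r"sens|respons"  # the subpattern that (?1) would run again

# A back reference must repeat the text that group 1 actually captured.
BACKREF = re.compile(r"(" + STEM + r")e and \1ibility")

# Emulates (sens|respons)e and (?1)ibility: the subpattern itself is applied
# again, so either alternative may match the second time.
SUBROUTINE_LIKE = re.compile(r"(" + STEM + r")e and (?:" + STEM + r")ibility")

if __name__ == "__main__":
    subject = "sense and responsibility"
    print(BACKREF.fullmatch(subject) is not None)
    print(SUBROUTINE_LIKE.fullmatch(subject) is not None)
```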
2969 |
|
|
2970 |
|
CALLOUTS |
2971 |
|
|
2972 |
|
Perl has a feature whereby using the sequence (?{...}) |
2973 |
|
causes arbitrary Perl code to be obeyed in the middle of |
2974 |
|
matching a regular expression. This makes it possible, |
2975 |
|
amongst other things, to extract different substrings that |
2976 |
|
match the same pair of parentheses when there is a repeti- |
2977 |
|
tion. |
2978 |
|
|
2979 |
|
PCRE provides a similar feature, but of course it cannot |
2980 |
|
obey arbitrary Perl code. The feature is called "callout". |
2981 |
|
The caller of PCRE provides an external function by putting |
2982 |
|
its entry point in the global variable pcre_callout. By |
2983 |
|
default, this variable contains NULL, which disables all |
2984 |
|
calling out. |
2985 |
|
|
2986 |
|
Within a regular expression, (?C) indicates the points at |
2987 |
|
which the external function is to be called. If you want to |
2988 |
|
identify different callout points, you can put a number less |
2989 |
|
than 256 after the letter C. The default value is zero. For |
2990 |
|
example, this pattern has two callout points: |
2991 |
|
|
2992 |
|
(?C1)9abc(?C2)def |
2993 |
|
|
2994 |
|
During matching, when PCRE reaches a callout point (and |
2995 |
|
pcre_callout is set), the external function is called. It is |
2996 |
|
provided with the number of the callout, and, optionally, |
2997 |
|
one item of data originally supplied by the caller of |
2998 |
|
pcre_exec(). The callout function may cause matching to |
2999 |
|
backtrack, or to fail altogether. A complete description of |
3000 |
|
the interface to the callout function is given in the pcre- |
3001 |
|
callout documentation. |
3002 |
|
|
3003 |
|
Last updated: 03 February 2003 |
3004 |
|
Copyright (c) 1997-2003 University of Cambridge. |
3005 |
|
----------------------------------------------------------------------------- |
3006 |
|
|
3007 |
|
NAME |
3008 |
|
PCRE - Perl-compatible regular expressions |
3009 |
|
|
3010 |
|
|
3011 |
PERFORMANCE |
PCRE PERFORMANCE |
|
Certain items that may appear in patterns are more efficient |
|
|
than others. It is more efficient to use a character class |
|
|
like [aeiou] than a set of alternatives such as (a|e|i|o|u). |
|
|
In general, the simplest construction that provides the |
|
|
required behaviour is usually the most efficient. Jeffrey |
|
|
Friedl's book contains a lot of discussion about optimizing |
|
|
regular expressions for efficient performance. |
|
|
|
|
|
When a pattern begins with .* and the PCRE_DOTALL option is |
|
|
set, the pattern is implicitly anchored by PCRE, since it |
|
|
can match only at the start of a subject string. However, if |
|
|
PCRE_DOTALL is not set, PCRE cannot make this optimization, |
|
|
because the . metacharacter does not then match a newline, |
|
|
and if the subject string contains newlines, the pattern may |
|
|
match from the character immediately following one of them |
|
|
instead of from the very start. For example, the pattern |
|
3012 |
|
|
3013 |
(.*) second |
Certain items that may appear in regular expression patterns |
3014 |
|
are more efficient than others. It is more efficient to use |
3015 |
|
a character class like [aeiou] than a set of alternatives |
3016 |
|
such as (a|e|i|o|u). In general, the simplest construction |
3017 |
|
that provides the required behaviour is usually the most |
3018 |
|
efficient. Jeffrey Friedl's book contains a lot of discus- |
3019 |
|
sion about optimizing regular expressions for efficient per- |
3020 |
|
formance. |
3021 |
|
|
3022 |
|
When a pattern begins with .* not in parentheses, or in |
3023 |
|
parentheses that are not the subject of a backreference, and |
3024 |
|
the PCRE_DOTALL option is set, the pattern is implicitly |
3025 |
|
anchored by PCRE, since it can match only at the start of a |
3026 |
|
subject string. However, if PCRE_DOTALL is not set, PCRE |
3027 |
|
cannot make this optimization, because the . metacharacter |
3028 |
|
does not then match a newline, and if the subject string |
3029 |
|
contains newlines, the pattern may match from the character |
3030 |
|
immediately following one of them instead of from the very |
3031 |
|
start. For example, the pattern |
3032 |
|
|
3033 |
|
.*second |
3034 |
|
|
3035 |
matches the subject "first\nand second" (where \n stands for |
matches the subject "first\nand second" (where \n stands for |
3036 |
a newline character) with the first captured substring being |
a newline character), with the match starting at the seventh |
3037 |
"and". In order to do this, PCRE has to retry the match |
character. In order to do this, PCRE has to retry the match |
3038 |
starting after every newline in the subject. |
starting after every newline in the subject. |
3039 |
|
|
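The effect of the dot-matches-newline option on where the match starts can be checked with a small sketch. It assumes Python's re module, whose re.DOTALL flag corresponds to PCRE_DOTALL. Python's search does not necessarily share PCRE's anchoring optimization, so only the match position is shown, not the cost. The helper name match_start is hypothetical.

```python
import re

SUBJECT = "first\nand second"


def match_start(flags=0):
    """Return the offset at which .*second first matches SUBJECT."""
    m = re.search(r".*second", SUBJECT, flags)
    return None if m is None else m.start()


if __name__ == "__main__":
    # Without DOTALL, the dot cannot cross the newline, so the match begins
    # just after it, at offset 6 (the seventh character).
    print(match_start())
    # With DOTALL, the dot matches the newline and the match begins at 0.
    print(match_start(re.DOTALL))
```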
3040 |
If you are using such a pattern with subject strings that do |
If you are using such a pattern with subject strings that do |
3057 |
that the entire match is going to fail, PCRE has in princi- |
that the entire match is going to fail, PCRE has in princi- |
3058 |
ple to try every possible variation, and this can take an |
ple to try every possible variation, and this can take an |
3059 |
extremely long time. |
extremely long time. |
|
|
|
3060 |
An optimization catches some of the more simple cases such |
An optimization catches some of the more simple cases such |
3061 |
as |
as |
3062 |
|
|
3076 |
whereas the latter takes an appreciable time with strings |
whereas the latter takes an appreciable time with strings |
3077 |
longer than about 20 characters. |
longer than about 20 characters. |
3078 |
|
|
3079 |
|
Last updated: 03 February 2003 |
3080 |
|
Copyright (c) 1997-2003 University of Cambridge. |
3081 |
|
----------------------------------------------------------------------------- |
3082 |
|
|
3083 |
|
NAME |
3084 |
|
PCRE - Perl-compatible regular expressions. |
3085 |
|
|
|
UTF-8 SUPPORT |
|
|
Starting at release 3.3, PCRE has some support for character |
|
|
strings encoded in the UTF-8 format. This is incomplete, and |
|
|
is regarded as experimental. In order to use it, you must |
|
|
configure PCRE to include UTF-8 support in the code, and, in |
|
|
addition, you must call pcre_compile() with the PCRE_UTF8 |
|
|
option flag. When you do this, both the pattern and any sub- |
|
|
ject strings that are matched against it are treated as |
|
|
UTF-8 strings instead of just strings of bytes, but only in |
|
|
the cases that are mentioned below. |
|
3086 |
|
|
3087 |
If you compile PCRE with UTF-8 support, but do not use it at |
SYNOPSIS OF POSIX API |
3088 |
run time, the library will be a bit bigger, but the addi- |
#include <pcreposix.h> |
|
tional run time overhead is limited to testing the PCRE_UTF8 |
|
|
flag in several places, so should not be very large. |
|
3089 |
|
|
3090 |
PCRE assumes that the strings it is given contain valid |
int regcomp(regex_t *preg, const char *pattern, |
3091 |
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
int cflags); |
|
you pass invalid UTF-8 strings to PCRE, the results are |
|
|
undefined. |
|
3092 |
|
|
3093 |
Running with PCRE_UTF8 set causes these changes in the way |
int regexec(regex_t *preg, const char *string, |
3094 |
PCRE works: |
size_t nmatch, regmatch_t pmatch[], int eflags); |
3095 |
|
|
3096 |
1. In a pattern, the escape sequence \x{...}, where the |
size_t regerror(int errcode, const regex_t *preg, |
3097 |
contents of the braces is a string of hexadecimal digits, is |
char *errbuf, size_t errbuf_size); |
|
interpreted as a UTF-8 character whose code number is the |
|
|
given hexadecimal number, for example: \x{1234}. This |
|
|
inserts from one to six literal bytes into the pattern, |
|
|
using the UTF-8 encoding. If a non-hexadecimal digit appears |
|
|
between the braces, the item is not recognized. |
|
|
|
|
|
2. The original hexadecimal escape sequence, \xhh, generates |
|
|
a two-byte UTF-8 character if its value is greater than 127. |
|
|
|
|
|
3. Repeat quantifiers are NOT correctly handled if they fol- |
|
|
low a multibyte character. For example, \x{100}* and \xc3+ |
|
|
do not work. If you want to repeat such characters, you must |
|
|
enclose them in non-capturing parentheses, for example |
|
|
(?:\x{100}), at present. |
|
3098 |
|
|
3099 |
4. The dot metacharacter matches one UTF-8 character instead |
void regfree(regex_t *preg); |
|
of a single byte. |
|
3100 |
|
|
|
5. Unlike literal UTF-8 characters, the dot metacharacter |
|
|
followed by a repeat quantifier does operate correctly on |
|
|
UTF-8 characters instead of single bytes. |
|
3101 |
|
|
3102 |
4. Although the \x{...} escape is permitted in a character |
DESCRIPTION |
|
class, characters whose values are greater than 255 cannot |
|
|
be included in a class. |
|
3103 |
|
|
3104 |
5. A class is matched against a UTF-8 character instead of |
This set of functions provides a POSIX-style API to the PCRE |
3105 |
just a single byte, but it can match only characters whose |
regular expression package. See the pcreapi documentation |
3106 |
values are less than 256. Characters with greater values |
for a description of the native API, which contains addi- |
3107 |
always fail to match a class. |
tional functionality. |

The functions described here are just wrapper functions that
ultimately call the PCRE native API. Their prototypes are
defined in the pcreposix.h header file, and on Unix systems
the library itself is called pcreposix.a, so can be accessed
by adding -lpcreposix to the command for linking an
application which uses them. Because the POSIX functions call
the native ones, it is also necessary to add -lpcre.

I have implemented only those option bits that can be
reasonably mapped to PCRE native options. In addition, the
options REG_EXTENDED and REG_NOSUB are defined with the
value zero. They have no effect, but since programs that are
written to the POSIX interface often use them, this makes it
easier to slot in PCRE as a replacement library. Other POSIX
options are not even defined.

When PCRE is called via these functions, it is only the API
that is POSIX-like in style. The syntax and semantics of the
regular expressions themselves are still those of Perl,
subject to the setting of various PCRE options, as described
below. "POSIX-like in style" means that the API approximates
to the POSIX definition; it is not fully POSIX-compatible,
and in multi-byte encoding domains it is probably even less
compatible.

The header for these functions is supplied as pcreposix.h to
avoid any potential clash with other POSIX libraries. It
can, of course, be renamed or aliased as regex.h, which is
the "correct" name. It provides two structure types, regex_t
for compiled internal forms, and regmatch_t for returning
captured substrings. It also defines some constants whose
names start with "REG_"; these are used for setting options
and identifying error codes.

6. Repeated classes work correctly on multiple characters.

7. Classes containing just a single character whose value is
greater than 127 (but less than 256), for example, [\x80] or
[^\x{93}], do not work because these are optimized into
single byte matches. In the first case, of course, the class
brackets are just redundant.

8. Lookbehind assertions move backwards in the subject by a
fixed number of characters instead of a fixed number of
bytes. Simple cases have been tested to work correctly, but
there may be hidden gotchas herein.


COMPILING A PATTERN

The function regcomp() is called to compile a pattern into
an internal form. The pattern is a C string terminated by a
binary zero, and is passed in the argument pattern. The preg
argument is a pointer to a regex_t structure which is used
as a base for storing information about the compiled
expression.

The argument cflags is either zero, or contains one or more
of the bits defined by the following macros:

  REG_ICASE

The PCRE_CASELESS option is set when the expression is
passed for compilation to the native function.

  REG_NEWLINE

The PCRE_MULTILINE option is set when the expression is
passed for compilation to the native function. Note that
this does not mimic the defined POSIX behaviour for
REG_NEWLINE (see the following section).

In the absence of these flags, no options are passed to the
native function. This means that the regex is compiled with
PCRE default semantics. In particular, the way it handles
newline characters in the subject string is the Perl way,
not the POSIX way. Note that setting PCRE_MULTILINE has only
some of the effects specified for REG_NEWLINE. It does not
affect the way newlines are matched by . (they aren't) or by
a negative class such as [^a] (they are).

The yield of regcomp() is zero on success, and non-zero
otherwise. The preg structure is filled in on success, and one
member of the structure is public: re_nsub contains the
number of capturing subpatterns in the regular expression.
Various error codes are defined in the header file.
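
The regcomp() yield and the public re_nsub member described
above can be sketched as follows. This is only an illustration
under stated assumptions: it includes the system's POSIX
<regex.h> so that it builds without PCRE installed; with the
PCRE wrapper you would include pcreposix.h instead and link
with -lpcreposix -lpcre. The helper name count_subpatterns is
hypothetical and not part of either library.

```c
#include <regex.h>   /* assumption: system header; PCRE supplies pcreposix.h */
#include <stddef.h>

/* Hypothetical helper: compile `pattern` with `cflags` and return the
   number of capturing subpatterns, or -1 if regcomp() fails. */
int count_subpatterns(const char *pattern, int cflags)
{
    regex_t preg;
    int nsub;

    /* A non-zero yield from regcomp() means compilation failed. */
    if (regcomp(&preg, pattern, cflags) != 0)
        return -1;

    nsub = (int)preg.re_nsub;   /* the one public member of regex_t */
    regfree(&preg);             /* release memory tied to preg */
    return nsub;
}
```

A call such as count_subpatterns("(cat)|(dog)", REG_EXTENDED |
REG_ICASE) compiles caselessly. REG_EXTENDED is passed because
the system library needs it for this syntax; in pcreposix.h it
is defined as zero and has no effect.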


MATCHING NEWLINE CHARACTERS

This area is not simple, because POSIX and Perl take
different views of things. It is not possible to get PCRE to
obey POSIX semantics, but then PCRE was never intended to be
a POSIX engine. The following table lists the different
possibilities for matching newline characters in PCRE:

                          Default   Change with

  . matches newline          no     PCRE_DOTALL
  newline matches [^a]       yes    not changeable
  $ matches \n at end        yes    PCRE_DOLLARENDONLY
  $ matches \n in middle     no     PCRE_MULTILINE
  ^ matches \n in middle     no     PCRE_MULTILINE

This is the equivalent table for POSIX:

                          Default   Change with

  . matches newline          yes    REG_NEWLINE
  newline matches [^a]       yes    REG_NEWLINE
  $ matches \n at end        no     REG_NEWLINE
  $ matches \n in middle     no     REG_NEWLINE
  ^ matches \n in middle     no     REG_NEWLINE

PCRE's behaviour is the same as Perl's, except that there is
no equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE
and Perl, there is no way to stop newline from matching
[^a].

The default POSIX newline handling can be obtained by
setting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
to make PCRE behave exactly as for the REG_NEWLINE action.
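
The POSIX table above can be checked with a small sketch.
Note the assumption: this exercises the platform's own POSIX
implementation through <regex.h>, not PCRE, so it illustrates
only the POSIX column. The same calls made through
pcreposix.h would follow the PCRE table instead. The helper
name posix_matches is hypothetical.

```c
#include <regex.h>   /* assumption: the system's POSIX regex, not pcreposix.h */
#include <stddef.h>

/* Hypothetical helper: returns 1 if `pattern` matches somewhere in
   `subject`, 0 if it does not, and -1 if the pattern fails to compile.
   `cflags` is 0 or REG_NEWLINE. */
int posix_matches(const char *pattern, const char *subject, int cflags)
{
    regex_t preg;
    int rc;

    if (regcomp(&preg, pattern, cflags | REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&preg, subject, 0, NULL, 0);   /* no substrings wanted */
    regfree(&preg);
    return rc == 0;
}
```

For example, posix_matches("a.b", "a\nb", 0) asks whether dot
matches a newline by default, and passing REG_NEWLINE as the
third argument asks the same question with the option set.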

9. The character types such as \d and \w do not work
correctly with UTF-8 characters. They continue to test a
single byte.

10. Anything not explicitly mentioned here continues to work
in bytes rather than in characters.

The following UTF-8 features of Perl 5.6 are not
implemented:


MATCHING A PATTERN

The function regexec() is called to match a pre-compiled
pattern preg against a given string, which is terminated by
a zero byte, subject to the options in eflags. These can be:

  REG_NOTBOL

The PCRE_NOTBOL option is set when calling the underlying
PCRE matching function.

  REG_NOTEOL

The PCRE_NOTEOL option is set when calling the underlying
PCRE matching function.

The portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument,
which points to an array of nmatch structures of type
regmatch_t, containing the members rm_so and rm_eo. These
contain the offset to the first character of each substring
and the offset to the first character after the end of each
substring, respectively. The 0th element of the vector
relates to the entire portion of string that was matched;
subsequent elements relate to the capturing subpatterns of
the regular expression. Unused entries in the array have
both structure members set to -1.

A successful match yields a zero return; various error codes
are defined in the header file, of which REG_NOMATCH is the
"expected" failure code.
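
The use of pmatch, rm_so and rm_eo described above can be
sketched as follows. As an assumption, the sketch includes
the system's <regex.h> so that it builds without PCRE; the
calls are the same ones that pcreposix.h declares. The helper
group_offsets and the constant MAX_GROUPS are hypothetical
names chosen for the illustration.

```c
#include <regex.h>   /* assumption: system header; PCRE supplies pcreposix.h */
#include <stddef.h>

#define MAX_GROUPS 10   /* illustrative size for the pmatch array */

/* Hypothetical helper: match `pattern` against `subject` and report the
   start and end offsets of entry `group` of the pmatch array.
   Returns regexec()'s yield (0 on a match, REG_NOMATCH if none), or -1
   if `group` is out of range or the pattern does not compile. */
int group_offsets(const char *pattern, const char *subject,
                  size_t group, int *so, int *eo)
{
    regex_t preg;
    regmatch_t pmatch[MAX_GROUPS];
    int rc;

    if (group >= MAX_GROUPS)
        return -1;
    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    rc = regexec(&preg, subject, MAX_GROUPS, pmatch, 0);
    if (rc == 0) {
        *so = (int)pmatch[group].rm_so;   /* offset of first character */
        *eo = (int)pmatch[group].rm_eo;   /* offset just past the end */
    }

    regfree(&preg);
    return rc;
}
```

Entry 0 describes the whole match and entry 1 the first
capturing subpattern; an entry beyond the last subpattern is
reported with both offsets set to -1.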


ERROR MESSAGES

The regerror() function maps a non-zero errorcode from
either regcomp() or regexec() to a printable message. If
preg is not NULL, the error should have arisen from the use
of that structure. A message terminated by a binary zero is
placed in errbuf. The length of the message, including the
zero, is limited to errbuf_size. The yield of the function
is the size of buffer needed to hold the whole message.
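
Because regerror() yields the buffer size needed for the
whole message, a caller can ask for the size first and then
allocate. The following is a minimal sketch of that two-call
pattern, assuming the system's <regex.h> (pcreposix.h declares
the same function); compile_error_message is a hypothetical
helper name.

```c
#include <regex.h>    /* assumption: system header; PCRE supplies pcreposix.h */
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd copy of the error message for a
   pattern that fails to compile. Returns NULL if the pattern compiles
   successfully or if memory cannot be allocated. The caller frees it. */
char *compile_error_message(const char *pattern)
{
    regex_t preg;
    size_t needed;
    char *buf;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);

    if (rc == 0) {
        regfree(&preg);          /* nothing to report */
        return NULL;
    }

    /* First call: a zero-sized buffer just yields the size required,
       including the terminating zero. */
    needed = regerror(rc, &preg, NULL, 0);

    buf = malloc(needed);
    if (buf != NULL)
        regerror(rc, &preg, buf, needed);   /* second call fills the buffer */
    return buf;
}
```

A fixed-size buffer also works; in that case the message is
truncated to errbuf_size and the yield tells you how much
space the full text would have needed.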


STORAGE

Compiling a regular expression causes memory to be allocated
and associated with the preg structure. The function
regfree() frees all such memory, after which preg may no
longer be used as a compiled expression.

1. The escape sequence \C to match a single byte.

2. The use of Unicode tables and properties and escapes \p,
\P, and \X.


AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions


PCRE SAMPLE PROGRAM

A simple, complete demonstration program, to get you started
with using PCRE, is supplied in the file pcredemo.c in the
PCRE distribution.

The program compiles the regular expression that is its
first argument, and matches it against the subject string in
its second argument. No PCRE options are set, and default
character tables are used. If matching succeeds, the program
outputs the portion of the subject that matched, together
with the contents of any captured substrings.

If the -g option is given on the command line, the program
then goes on to check for further matches of the same
regular expression in the same subject string. The logic is a
little bit tricky because of the possibility of matching an
empty string. Comments in the code explain what is going on.

On a Unix system that has PCRE installed in /usr/local, you
can compile the demonstration program using a command like
this:

  gcc -o pcredemo pcredemo.c -I/usr/local/include \
      -L/usr/local/lib -lpcre

Then you can run simple tests like this:

  ./pcredemo 'cat|dog' 'the cat sat on the mat'
  ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

Note that there is a much more comprehensive test program,
called pcretest, which supports many more facilities for
testing regular expressions and the PCRE library. The
pcredemo program is provided as a simple coding example.

On some operating systems (e.g. Solaris) you may get an
error like this when you try to run pcredemo:

  [...]

  -R/usr/local/lib

to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------


SAMPLE PROGRAM

The code below is a simple, complete demonstration program,
to get you started with using PCRE. This code is also
supplied in the file pcredemo.c in the PCRE distribution.

The program compiles the regular expression that is its
first argument, and matches it against the subject string in
its second argument. No options are set, and default
character tables are used. If matching succeeds, the program
outputs the portion of the subject that matched, together with
the contents of any captured substrings.

On a Unix system that has PCRE installed in /usr/local, you
can compile the demonstration program using a command like
this:

  gcc -o pcredemo pcredemo.c -I/usr/local/include \
      -L/usr/local/lib -lpcre

Then you can run simple tests like this:

  ./pcredemo 'cat|dog' 'the cat sat on the mat'

Note that there is a much more comprehensive test program,
called pcretest, which supports many more facilities for
testing regular expressions. The pcredemo program is
provided as a simple coding example.

On some operating systems (e.g. Solaris) you may get an
error like this when you try to run pcredemo:

  [...]

  -R/usr/local/lib

to the compile command to get round this problem. Here's the
code:

  #include <stdio.h>
  #include <string.h>
  #include <pcre.h>

  #define OVECCOUNT 30    /* should be a multiple of 3 */

  int main(int argc, char **argv)
  {
  pcre *re;
  const char *error;
  int erroffset;
  int ovector[OVECCOUNT];
  int rc, i;

  if (argc != 3)
    {
    printf("Two arguments required: a regex and a "
      "subject string\n");
    return 1;
    }

  /* Compile the regular expression in the first argument */

  re = pcre_compile(
    argv[1],     /* the pattern */
    0,           /* default options */
    &error,      /* for error message */
    &erroffset,  /* for error offset */
    NULL);       /* use default character tables */

  /* Compilation failed: print the error message and exit */

  if (re == NULL)
    {
    printf("PCRE compilation failed at offset %d: %s\n",
      erroffset, error);
    return 1;
    }

  /* Compilation succeeded: match the subject in the second
     argument */

  rc = pcre_exec(
    re,          /* the compiled pattern */
    NULL,        /* we didn't study the pattern */
    argv[2],     /* the subject string */
    (int)strlen(argv[2]), /* the length of the subject */
    0,           /* start at offset 0 in the subject */
    0,           /* default options */
    ovector,     /* vector for substring information */
    OVECCOUNT);  /* number of elements in the vector */

  /* Matching failed: handle error cases */

  if (rc < 0)
    {
    switch(rc)
      {
      case PCRE_ERROR_NOMATCH: printf("No match\n"); break;
      /*
      Handle other special cases if you like
      */
      default: printf("Matching error %d\n", rc); break;
      }
    return 1;
    }

  /* Match succeeded */

  printf("Match succeeded\n");

  /* The output vector wasn't big enough */

  if (rc == 0)
    {
    rc = OVECCOUNT/3;
    printf("ovector only has room for %d captured "
      "substrings\n", rc - 1);
    }

  /* Show substrings stored in the output vector */

  for (i = 0; i < rc; i++)
    {
    char *substring_start = argv[2] + ovector[2*i];
    int substring_length = ovector[2*i+1] - ovector[2*i];
    printf("%2d: %.*s\n", i, substring_length,
      substring_start);
    }

  return 0;
  }


AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
New Museums Site,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

Last updated: 15 August 2001
Copyright (c) 1997-2001 University of Cambridge.