Contents of /code/trunk/doc/pcre.txt

Revision 77
Sat Feb 24 21:40:45 2007 UTC by nigel
Load pcre-6.0 into code/trunk.
1 -----------------------------------------------------------------------------
2 This file contains a concatenation of the PCRE man pages, converted to plain
3 text format for ease of searching with a text editor, or for use on systems
4 that do not have a man page processor. The small individual files that give
5 synopses of each function in the library have not been included. There are
6 separate text files for the pcregrep and pcretest commands.
7 -----------------------------------------------------------------------------
8
9
10
11 NAME
12 PCRE - Perl-compatible regular expressions
13
14
15 INTRODUCTION
16
17 The PCRE library is a set of functions that implement regular expres-
18 sion pattern matching using the same syntax and semantics as Perl, with
19 just a few differences. The current implementation of PCRE (release
20 6.x) corresponds approximately with Perl 5.8, including support for
21 UTF-8 encoded strings and Unicode general category properties. However,
22 this support has to be explicitly enabled; it is not the default.
23
24 In addition to the Perl-compatible matching function, PCRE also con-
25 tains an alternative matching function that matches the same compiled
26 patterns in a different way. In certain circumstances, the alternative
27 function has some advantages. For a discussion of the two matching
28 algorithms, see the pcrematching page.
29
30 PCRE is written in C and released as a C library. A number of people
31 have written wrappers and interfaces of various kinds. In particular,
32 Google Inc. have provided a comprehensive C++ wrapper. This is now
33 included as part of the PCRE distribution. The pcrecpp page has details
34 of this interface. Other people's contributions can be found in the
35 Contrib directory at the primary FTP site, which is:
36
37 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
38
39 Details of exactly which Perl regular expression features are and are
40 not supported by PCRE are given in separate documents. See the pcrepat-
41 tern and pcrecompat pages.
42
43 Some features of PCRE can be included, excluded, or changed when the
44 library is built. The pcre_config() function makes it possible for a
45 client to discover which features are available. The features them-
46 selves are described in the pcrebuild page. Documentation about build-
47 ing PCRE for various operating systems can be found in the README file
48 in the source distribution.
49
50 The library contains a number of undocumented internal functions and
51 data tables that are used by more than one of the exported external
52 functions, but which are not intended for use by external callers.
53 Their names all begin with "_pcre_", which hopefully will not provoke
54 any name clashes.
55
56
57 USER DOCUMENTATION
58
59 The user documentation for PCRE comprises a number of different sec-
60 tions. In the "man" format, each of these is a separate "man page". In
61 the HTML format, each is a separate page, linked from the index page.
62 In the plain text format, all the sections are concatenated, for ease
63 of searching. The sections are as follows:
64
65 pcre this document
66 pcreapi details of PCRE's native C API
67 pcrebuild options for building PCRE
68 pcrecallout details of the callout feature
69 pcrecompat discussion of Perl compatibility
70 pcrecpp details of the C++ wrapper
71 pcregrep description of the pcregrep command
72 pcrematching discussion of the two matching algorithms
73 pcrepartial details of the partial matching facility
74 pcrepattern syntax and semantics of supported
75 regular expressions
76 pcreperform discussion of performance issues
77 pcreposix the POSIX-compatible C API
78 pcreprecompile details of saving and re-using precompiled patterns
79 pcresample discussion of the sample program
80 pcretest description of the pcretest testing command
81
82 In addition, in the "man" and HTML formats, there is a short page for
83 each C library function, listing its arguments and results.
84
85
86 LIMITATIONS
87
88 There are some size limitations in PCRE but it is hoped that they will
89 never in practice be relevant.
90
91 The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
92 is compiled with the default internal linkage size of 2. If you want to
93 process regular expressions that are truly enormous, you can compile
94 PCRE with an internal linkage size of 3 or 4 (see the README file in
95 the source distribution and the pcrebuild documentation for details).
96 In these cases the limit is substantially larger. However, the speed
97 of execution will be slower.
98
99 All values in repeating quantifiers must be less than 65536. The maxi-
100 mum number of capturing subpatterns is 65535.
101
102 There is no limit to the number of non-capturing subpatterns, but the
103 maximum depth of nesting of all kinds of parenthesized subpattern,
104 including capturing subpatterns, assertions, and other types of subpat-
105 tern, is 200.
106
107 The maximum length of a subject string is the largest positive number
108 that an integer variable can hold. However, when using the traditional
109 matching function, PCRE uses recursion to handle subpatterns and indef-
110 inite repetition. This means that the available stack space may limit
111 the size of a subject string that can be processed by certain patterns.
112
113
114 UTF-8 AND UNICODE PROPERTY SUPPORT
115
116 From release 3.3, PCRE has had some support for character strings
117 encoded in the UTF-8 format. For release 4.0 this was greatly extended
118 to cover most common requirements, and in release 5.0 additional sup-
119 port for Unicode general category properties was added.
120
121 In order to process UTF-8 strings, you must build PCRE to include UTF-8
122 support in the code, and, in addition, you must call pcre_compile()
123 with the PCRE_UTF8 option flag. When you do this, both the pattern and
124 any subject strings that are matched against it are treated as UTF-8
125 strings instead of just strings of bytes.
126
127 If you compile PCRE with UTF-8 support, but do not use it at run time,
128 the library will be a bit bigger, but the additional run time overhead
129 is limited to testing the PCRE_UTF8 flag in several places, so should
130 not be very large.
131
132 If PCRE is built with Unicode character property support (which implies
133 UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup-
134 ported. The available properties that can be tested are limited to the
135 general category properties such as Lu for an upper case letter or Nd
136 for a decimal number. A full list is given in the pcrepattern documen-
137 tation. The PCRE library is increased in size by about 90K when Unicode
138 property support is included.
139
140 The following comments apply when PCRE is running in UTF-8 mode:
141
142 1. When you set the PCRE_UTF8 flag, the strings passed as patterns and
143 subjects are checked for validity on entry to the relevant functions.
144 If an invalid UTF-8 string is passed, an error return is given. In some
145 situations, you may already know that your strings are valid, and
146 therefore want to skip these checks in order to improve performance. If
147 you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time,
148 PCRE assumes that the pattern or subject it is given (respectively)
149 contains only valid UTF-8 codes. In this case, it does not diagnose an
150 invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
151 PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may
152 crash.
153
154 2. In a pattern, the escape sequence \x{...}, where the contents of the
155 braces is a string of hexadecimal digits, is interpreted as a UTF-8
156 character whose code number is the given hexadecimal number, for exam-
157 ple: \x{1234}. If a non-hexadecimal digit appears between the braces,
158 the item is not recognized. This escape sequence can be used either as
159 a literal, or within a character class.
160
161 3. The original hexadecimal escape sequence, \xhh, matches a two-byte
162 UTF-8 character if the value is greater than 127.
163
164 4. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
165 vidual bytes, for example: \x{100}{3}.
166
167 5. The dot metacharacter matches one UTF-8 character instead of a sin-
168 gle byte.
169
170 6. The escape sequence \C can be used to match a single byte in UTF-8
171 mode, but its use can lead to some strange effects. This facility is
172 not available in the alternative matching function, pcre_dfa_exec().
173
174 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
175 test characters of any code value, but the characters that PCRE recog-
176 nizes as digits, spaces, or word characters remain the same set as
177 before, all with values less than 256. This remains true even when PCRE
178 includes Unicode property support, because to do otherwise would slow
179 down PCRE in many common cases. If you really want to test for a wider
180 sense of, say, "digit", you must use Unicode property tests such as
181 \p{Nd}.
182
183 8. Similarly, characters that match the POSIX named character classes
184 are all low-valued characters.
185
186 9. Case-insensitive matching applies only to characters whose values
187 are less than 128, unless PCRE is built with Unicode property support.
188 Even when Unicode property support is available, PCRE still uses its
189 own character tables when checking the case of low-valued characters,
190 so as not to degrade performance. The Unicode property information is
191 used only for characters with higher values.
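
Comments 2 and 3 above depend on how a code number maps to UTF-8 bytes. The following is a minimal sketch of that mapping, assuming the RFC 3629 ranges (at most four bytes per character). The name utf8_encode is made up for this illustration and is not a PCRE function; PCRE's own internal encoder may accept a different range of values.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Encode one code point as UTF-8, following the RFC 3629 ranges.
 * Writes up to 4 bytes into out and returns the number written, or 0
 * if the value is a surrogate or lies above 0x10FFFF.
 * utf8_encode is a hypothetical helper, not part of the PCRE API.
 */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {                       /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this sketch, the value in \x{1234} becomes the three bytes E1 88 B4, and a value such as 0xE9 (greater than 127) becomes the two bytes C3 A9, which is why \xhh can match a two-byte character in UTF-8 mode.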
192
193
194 AUTHOR
195
196 Philip Hazel
197 University Computing Service,
198 Cambridge CB2 3QG, England.
199
200 Putting an actual email address here seems to have been a spam magnet,
201 so I've taken it away. If you want to email me, use my initial and sur-
202 name, separated by a dot, at the domain ucs.cam.ac.uk.
203
204 Last updated: 07 March 2005
205 Copyright (c) 1997-2005 University of Cambridge.
206 -----------------------------------------------------------------------------
207
208
209
210 NAME
211 PCRE - Perl-compatible regular expressions
212
213
214 PCRE BUILD-TIME OPTIONS
215
216 This document describes the optional features of PCRE that can be
217 selected when the library is compiled. They are all selected, or dese-
218 lected, by providing options to the configure script that is run before
219 the make command. The complete list of options for configure (which
220 includes the standard ones such as the selection of the installation
221 directory) can be obtained by running
222
223 ./configure --help
224
225 The following sections describe certain options whose names begin with
226 --enable or --disable. These settings specify changes to the defaults
227 for the configure command. Because of the way that configure works,
228 --enable and --disable always come in pairs, so the complementary
229 option always exists as well, but as it specifies the default, it is
230 not described.
231
232
233 UTF-8 SUPPORT
234
235 To build PCRE with support for UTF-8 character strings, add
236
237 --enable-utf8
238
239 to the configure command. Of itself, this does not make PCRE treat
240 strings as UTF-8. As well as compiling PCRE with this option, you also
241 have to set the PCRE_UTF8 option when you call the pcre_compile()
242 function.
243
244
245 UNICODE CHARACTER PROPERTY SUPPORT
246
247 UTF-8 support allows PCRE to process character values greater than 255
248 in the strings that it handles. On its own, however, it does not pro-
249 vide any facilities for accessing the properties of such characters. If
250 you want to be able to use the pattern escapes \P, \p, and \X, which
251 refer to Unicode character properties, you must add
252
253 --enable-unicode-properties
254
255 to the configure command. This implies UTF-8 support, even if you have
256 not explicitly requested it.
257
258 Including Unicode property support adds around 90K of tables to the
259 PCRE library, approximately doubling its size. Only the general cate-
260 gory properties such as Lu and Nd are supported. Details are given in
261 the pcrepattern documentation.
262
263
264 CODE VALUE OF NEWLINE
265
266 By default, PCRE treats character 10 (linefeed) as the newline charac-
267 ter. This is the normal newline character on Unix-like systems. You can
268 compile PCRE to use character 13 (carriage return) instead by adding
269
270 --enable-newline-is-cr
271
272 to the configure command. For completeness there is also a
273 --enable-newline-is-lf option, which explicitly specifies linefeed as the
274 newline character.
275
276
277 BUILDING SHARED AND STATIC LIBRARIES
278
279 The PCRE building process uses libtool to build both shared and static
280 Unix libraries by default. You can suppress one of these by adding one
281 of
282
283 --disable-shared
284 --disable-static
285
286 to the configure command, as required.
287
288
289 POSIX MALLOC USAGE
290
291 When PCRE is called through the POSIX interface (see the pcreposix doc-
292 umentation), additional working storage is required for holding the
293 pointers to capturing substrings, because PCRE requires three integers
294 per substring, whereas the POSIX interface provides only two. If the
295 number of expected substrings is small, the wrapper function uses space
296 on the stack, because this is faster than using malloc() for each call.
297 The default threshold above which the stack is no longer used is 10; it
298 can be changed by adding a setting such as
299
300 --with-posix-malloc-threshold=20
301
302 to the configure command.
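
The stack-or-heap choice described above can be sketched as follows. This is an illustration under stated assumptions, not the POSIX wrapper's actual code: SKETCH_MALLOC_THRESHOLD, ovector_user, fill_unset, and with_ovector are all names invented here, and the threshold of 10 simply mirrors the documented default.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Mirrors the documented default; the real value is fixed at build time. */
#define SKETCH_MALLOC_THRESHOLD 10

/* Hypothetical callback type: receives the vector and its length in ints. */
typedef void (*ovector_user)(int *ovector, size_t count);

/* Example callback: mark every slot as unset. */
void fill_unset(int *ovector, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        ovector[i] = -1;
}

/*
 * Provide working storage of three ints per expected substring and pass
 * it to work(). Small requests use a stack array; larger ones use malloc().
 * Returns 1 if the heap was used, 0 if the stack sufficed, -1 on failure.
 */
int with_ovector(size_t nmatch, ovector_user work)
{
    int small_ovector[SKETCH_MALLOC_THRESHOLD * 3];
    int *ovector = small_ovector;
    int used_heap = 0;

    assert(work != NULL);
    if (nmatch > SKETCH_MALLOC_THRESHOLD) {
        ovector = (int *)malloc(sizeof(int) * nmatch * 3);
        if (ovector == NULL)
            return -1;
        used_heap = 1;
    }
    work(ovector, nmatch * 3);
    if (used_heap)
        free(ovector);
    return used_heap;
}
```

In this sketch a request for 10 substrings stays on the stack, while a request for 11 goes to the heap. Raising the threshold trades a larger stack frame for fewer malloc() calls.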
303
304
305 LIMITING PCRE RESOURCE USAGE
306
307 Internally, PCRE has a function called match(), which it calls repeat-
308 edly (possibly recursively) when matching a pattern with the
309 pcre_exec() function. By controlling the maximum number of times this
310 function may be called during a single matching operation, a limit can
311 be placed on the resources used by a single call to pcre_exec(). The
312 limit can be changed at run time, as described in the pcreapi documen-
313 tation. The default is 10 million, but this can be changed by adding a
314 setting such as
315
316 --with-match-limit=500000
317
318 to the configure command. This setting has no effect on the
319 pcre_dfa_exec() matching function.
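
The idea of bounding work by counting calls to a recursive matching function can be shown with a toy matcher. The sketch below is not PCRE's match() function: it is a small backtracking matcher in the style of the well-known Kernighan and Pike example, supporting only literal characters, '.', and 'c*', and matching only at the start of the subject. All names in it (match_here, limited_match, MATCH_LIMIT, and so on) are invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { MATCH_NO = 0, MATCH_YES = 1, MATCH_LIMIT = -1 };

struct match_state {
    unsigned long calls;   /* calls to match_here() so far */
    unsigned long limit;   /* give up once this many calls are exceeded */
};

/* Match pattern p against the start of s; counts every call it makes. */
static int match_here(const char *p, const char *s, struct match_state *st)
{
    if (++st->calls > st->limit)
        return MATCH_LIMIT;
    if (p[0] == '\0')
        return MATCH_YES;
    if (p[1] == '*') {
        /* Greedy repeat: take as many characters as possible, then
           back off one at a time until the rest of the pattern matches. */
        const char *t = s;
        while (*t != '\0' && (p[0] == '.' || *t == p[0]))
            t++;
        for (;;) {
            int rc = match_here(p + 2, t, st);
            if (rc != MATCH_NO)
                return rc;          /* a match, or the limit was hit */
            if (t == s)
                return MATCH_NO;
            t--;
        }
    }
    if (*s != '\0' && (p[0] == '.' || p[0] == *s))
        return match_here(p + 1, s + 1, st);
    return MATCH_NO;
}

/* Anchored match with an upper bound on the number of internal calls. */
int limited_match(const char *pattern, const char *subject,
                  unsigned long limit)
{
    struct match_state st = { 0, 0 };
    assert(pattern != NULL && subject != NULL);
    st.limit = limit;
    return match_here(pattern, subject, &st);
}
```

A pattern such as a*a*a*a*b applied to a run of "a" characters with no "b" forces a great deal of backtracking before failing, so a low limit makes limited_match() return MATCH_LIMIT instead of running to completion. pcre_exec() reports the corresponding condition with its own error code, as described in the pcreapi documentation.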
320
321
322 HANDLING VERY LARGE PATTERNS
323
324 Within a compiled pattern, offset values are used to point from one
325 part to another (for example, from an opening parenthesis to an alter-
326 nation metacharacter). By default, two-byte values are used for these
327 offsets, leading to a maximum size for a compiled pattern of around
328 64K. This is sufficient to handle all but the most gigantic patterns.
329 Nevertheless, some people do want to process enormous patterns, so it
330 is possible to compile PCRE to use three-byte or four-byte offsets by
331 adding a setting such as
332
333 --with-link-size=3
334
335 to the configure command. The value given must be 2, 3, or 4. Using
336 longer offsets slows down the operation of PCRE because it has to load
337 additional bytes when handling them.
338
339 If you build PCRE with an increased link size, test 2 (and test 5 if
340 you are using UTF-8) will fail. Part of the output of these tests is a
341 representation of the compiled pattern, and this changes with the link
342 size.
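
The effect of the link size on both the maximum offset and the per-offset cost can be seen in a small sketch. It assumes offsets are stored most significant byte first; put_offset, get_offset, and max_offset are names invented here and are not PCRE's internal macros.

```c
#include <assert.h>
#include <stddef.h>

/* Store value at p as a link_size-byte offset, most significant byte first. */
void put_offset(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    assert(p != NULL && link_size >= 2 && link_size <= 4);
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Load a link_size-byte offset from p: one byte fetched per loop pass. */
unsigned long get_offset(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    assert(p != NULL && link_size >= 2 && link_size <= 4);
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest value that fits in link_size bytes. */
unsigned long max_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    if (link_size == 4)
        return 0xFFFFFFFFUL;     /* avoids shifting by the full word width */
    return (1UL << (8 * link_size)) - 1;
}
```

With two bytes the largest offset is 65535, which is where the limit of around 64K comes from. Each extra byte widens the range by a factor of 256 at the cost of one more byte to load and combine whenever an offset is read.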
343
344
345 AVOIDING EXCESSIVE STACK USAGE
346
347 When matching with the pcre_exec() function, PCRE implements backtrack-
348 ing by making recursive calls to an internal function called match().
349 In environments where the size of the stack is limited, this can se-
350 verely limit PCRE's operation. (The Unix environment does not usually
351 suffer from this problem.) An alternative approach that uses memory
352 from the heap to remember data, instead of using recursive function
353 calls, has been implemented to work round this problem. If you want to
354 build a version of PCRE that works this way, add
355
356 --disable-stack-for-recursion
357
358 to the configure command. With this configuration, PCRE will use the
359 pcre_stack_malloc and pcre_stack_free variables to call memory manage-
360 ment functions. Separate functions are provided because the usage is
361 very predictable: the block sizes requested are always the same, and
362 the blocks are always freed in reverse order. A calling program might
363 be able to implement optimized functions that perform better than the
364 standard malloc() and free() functions. PCRE runs noticeably more
365 slowly when built in this way. This option affects only the pcre_exec()
366 function; it is not relevant for the pcre_dfa_exec() function.
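
Because the blocks are all the same size and are released in reverse order, a caller could supply very simple replacement functions. The following is a minimal sketch of one possibility: freed blocks are kept in a small last-in, first-out cache and handed back on the next request. The names lifo_malloc, lifo_free, and lifo_cached_blocks, and the cache depth of 64, are assumptions made for this illustration. The sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_DEPTH 64             /* arbitrary cap on cached blocks */

static void *cache[CACHE_DEPTH];   /* freed blocks, most recent last */
static size_t cached = 0;
static size_t block_size = 0;      /* size seen on the first request */

/* Hand back the most recently freed block if there is one. */
void *lifo_malloc(size_t size)
{
    if (block_size == 0)
        block_size = size;
    assert(size == block_size);    /* relies on the same-size property */
    if (cached > 0)
        return cache[--cached];
    return malloc(size);
}

/* Keep the block for reuse instead of returning it to the system. */
void lifo_free(void *block)
{
    if (block == NULL)
        return;
    if (cached < CACHE_DEPTH)
        cache[cached++] = block;
    else
        free(block);
}

/* Number of blocks currently held in the cache. */
size_t lifo_cached_blocks(void)
{
    return cached;
}

/*
 * In a program that includes <pcre.h> and links against a PCRE built
 * with --disable-stack-for-recursion, the functions would be installed
 * before any matching is done:
 *
 *     pcre_stack_malloc = lifo_malloc;
 *     pcre_stack_free   = lifo_free;
 */
```

Whether this is actually faster than the standard malloc() and free() depends on the platform's allocator, so it is worth measuring before adopting anything like it.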
367
368
369 USING EBCDIC CODE
370
371 PCRE assumes by default that it will run in an environment where the
372 character code is ASCII (or Unicode, which is a superset of ASCII).
373 PCRE can, however, be compiled to run in an EBCDIC environment by
374 adding
375
376 --enable-ebcdic
377
378 to the configure command.
379
380 Last updated: 28 February 2005
381 Copyright (c) 1997-2005 University of Cambridge.
382 -----------------------------------------------------------------------------
383
384
385
386 NAME
387 PCRE - Perl-compatible regular expressions
388
389
390 PCRE MATCHING ALGORITHMS
391
392 This document describes the two different algorithms that are available
393 in PCRE for matching a compiled regular expression against a given sub-
394 ject string. The "standard" algorithm is the one provided by the
395 pcre_exec() function. This works in the same way as Perl's matching
396 function, and provides a Perl-compatible matching operation.
397
398 An alternative algorithm is provided by the pcre_dfa_exec() function;
399 this operates in a different way, and is not Perl-compatible. It has
400 advantages and disadvantages compared with the standard algorithm, and
401 these are described below.
402
403 When there is only one possible way in which a given subject string can
404 match a pattern, the two algorithms give the same answer. A difference
405 arises, however, when there are multiple possibilities. For example, if
406 the pattern
407
408 ^<.*>
409
410 is matched against the string
411
412 <something> <something else> <something further>
413
414 there are three possible answers. The standard algorithm finds only one
415 of them, whereas the DFA algorithm finds all three.
416
417
418 REGULAR EXPRESSIONS AS TREES
419
420 The set of strings that are matched by a regular expression can be rep-
421 resented as a tree structure. An unlimited repetition in the pattern
422 makes the tree of infinite size, but it is still a tree. Matching the
423 pattern to a given subject string (from a given starting point) can be
424 thought of as a search of the tree. There are two standard ways to
425 search a tree: depth-first and breadth-first, and these correspond to
426 the two matching algorithms provided by PCRE.
427
428
429 THE STANDARD MATCHING ALGORITHM
430
431 In the terminology of Jeffrey Friedl's book Mastering Regular Expres-
432 sions, the standard algorithm is an "NFA algorithm". It conducts a
433 depth-first search of the pattern tree. That is, it proceeds along a
434 single path through the tree, checking that the subject matches what is
435 required. When there is a mismatch, the algorithm tries any alterna-
436 tives at the current point, and if they all fail, it backs up to the
437 previous branch point in the tree, and tries the next alternative
438 branch at that level. This often involves backing up (moving to the
439 left) in the subject string as well. The order in which repetition
440 branches are tried is controlled by the greedy or ungreedy nature of
441 the quantifier.
442
443 If a leaf node is reached, a matching string has been found, and at
444 that point the algorithm stops. Thus, if there is more than one possi-
445 ble match, this algorithm returns the first one that it finds. Whether
446 this is the shortest, the longest, or some intermediate length depends
447 on the way the greedy and ungreedy repetition quantifiers are specified
448 in the pattern.
449
450 Because it ends up with a single path through the tree, it is rela-
451 tively straightforward for this algorithm to keep track of the sub-
452 strings that are matched by portions of the pattern in parentheses.
453 This provides support for capturing parentheses and back references.
454
455
456 THE DFA MATCHING ALGORITHM
457
458 DFA stands for "deterministic finite automaton", but you do not need to
459 understand the origins of that name. This algorithm conducts a breadth-
460 first search of the tree. Starting from the first matching point in the
461 subject, it scans the subject string from left to right, once, charac-
462 ter by character, and as it does this, it remembers all the paths
463 through the tree that represent valid matches.
464
465 The scan continues until either the end of the subject is reached, or
466 there are no more unterminated paths. At this point, terminated paths
467 represent the different matching possibilities (if there are none, the
468 match has failed). Thus, if there is more than one possible match,
469 this algorithm finds all of them, and in particular, it finds the long-
470 est. In PCRE, there is an option to stop the algorithm after the first
471 match (which is necessarily the shortest) has been found.
472
473 Note that all the matches that are found start at the same point in the
474 subject. If the pattern
475
476 cat(er(pillar)?)?
477
478 is matched against the string "the caterpillar catchment", the result
479 will be the three strings "cat", "cater", and "caterpillar" that start
480 at the fourth character of the subject. The algorithm does not automat-
481 ically move on to find matches that start at later positions.
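
The breadth-first scan can be illustrated with a toy matcher that keeps a set of active pattern positions and advances all of them on each subject character. This is a sketch of the general technique only, not the pcre_dfa_exec() implementation. It handles just literal characters, '.', and 'c*', always starts at the beginning of the subject, and assumes a well-formed pattern. The names add_state and all_matches are invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PATTERN 64

/* Mark position i active, plus every position reachable from it without
   consuming input (that is, by taking zero repetitions of an 'x*' item). */
static void add_state(const char *p, size_t n, unsigned char *set, size_t i)
{
    while (!set[i]) {
        set[i] = 1;
        if (i + 1 < n && p[i + 1] == '*')
            i += 2;
        else
            break;
    }
}

/*
 * Scan subject once from left to right and record the length of every
 * match of pattern that starts at offset 0. Returns the number of
 * lengths stored (at most max_lengths), shortest first.
 */
size_t all_matches(const char *pattern, const char *subject,
                   size_t *lengths, size_t max_lengths)
{
    unsigned char cur[MAX_PATTERN + 1], next[MAX_PATTERN + 1];
    size_t n = strlen(pattern);
    size_t found = 0, pos = 0, i;
    int active;

    assert(n <= MAX_PATTERN);
    memset(cur, 0, sizeof cur);
    add_state(pattern, n, cur, 0);

    for (;;) {
        if (cur[n] && found < max_lengths)
            lengths[found++] = pos;        /* a path has terminated here */
        if (subject[pos] == '\0')
            break;                         /* end of subject */
        memset(next, 0, sizeof next);
        active = 0;
        for (i = 0; i < n; i++) {
            if (!cur[i])
                continue;
            if (pattern[i] != '.' && pattern[i] != subject[pos])
                continue;                  /* this path dies */
            if (i + 1 < n && pattern[i + 1] == '*')
                add_state(pattern, n, next, i);      /* repeat the item */
            else
                add_state(pattern, n, next, i + 1);  /* move past it */
            active = 1;
        }
        pos++;
        if (!active)
            break;                         /* no unterminated paths left */
        memcpy(cur, next, sizeof cur);
    }
    return found;
}
```

Applied to the pattern <.*> and the subject "<something> <something else> <something further>", the sketch records three match lengths, 11, 28, and 48, in a single pass with no backtracking. It reports matches shortest first, which is a choice made for this sketch; the order in which a real implementation reports them may differ.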
482
483 There are a number of features of PCRE regular expressions that are not
484 supported, or that behave differently, in the DFA matching algorithm:
485
486 1. Because the algorithm finds all possible matches, the greedy or
487 ungreedy nature of repetition quantifiers is not relevant. Greedy and
488 ungreedy quantifiers are treated in exactly the same way.
489
490 2. When dealing with multiple paths through the tree simultaneously, it
491 is not straightforward to keep track of captured substrings for the
492 different matching possibilities, and PCRE's implementation of this
493 algorithm does not attempt to do this. This means that no captured sub-
494 strings are available.
495
496 3. Because no substrings are captured, back references within the pat-
497 tern are not supported, and cause errors if encountered.
498
499 4. For the same reason, conditional expressions that use a backrefer-
500 ence as the condition are not supported.
501
502 5. Callouts are supported, but the value of the capture_top field is
503 always 1, and the value of the capture_last field is always -1.
504
505 6. The \C escape sequence, which (in the standard algorithm) matches a
506 single byte, even in UTF-8 mode, is not supported because the DFA algo-
507 rithm moves through the subject string one character at a time, for all
508 active paths through the tree.
509
510
511 ADVANTAGES OF THE DFA ALGORITHM
512
513 Using the DFA matching algorithm provides the following advantages:
514
515 1. All possible matches (at a single point in the subject) are automat-
516 ically found, and in particular, the longest match is found. To find
517 more than one match using the standard algorithm, you have to do kludgy
518 things with callouts.
519
520 2. There is much better support for partial matching. The restrictions
521 on the content of the pattern that apply when using the standard algo-
522 rithm for partial matching do not apply to the DFA algorithm. For non-
523 anchored patterns, the starting position of a partial match is avail-
524 able.
525
526 3. Because the DFA algorithm scans the subject string just once, and
527 never needs to backtrack, it is possible to pass very long subject
528 strings to the matching function in several pieces, checking for par-
529 tial matching each time.
530
531
532 DISADVANTAGES OF THE DFA ALGORITHM
533
534 The DFA algorithm suffers from a number of disadvantages:
535
536 1. It is substantially slower than the standard algorithm. This is
537 partly because it has to search for all possible matches, but is also
538 because it is less susceptible to optimization.
539
540 2. Capturing parentheses and back references are not supported.
541
542 3. The "atomic group" feature of PCRE regular expressions is supported,
543 but does not provide the advantage that it does for the standard algo-
544 rithm.
545
546 Last updated: 28 February 2005
547 Copyright (c) 1997-2005 University of Cambridge.
548 -----------------------------------------------------------------------------
549
550
551
552 NAME
553 PCRE - Perl-compatible regular expressions
554
555
556 PCRE NATIVE API
557
558 #include <pcre.h>
559
560 pcre *pcre_compile(const char *pattern, int options,
561 const char **errptr, int *erroffset,
562 const unsigned char *tableptr);
563
564 pcre *pcre_compile2(const char *pattern, int options,
565 int *errorcodeptr,
566 const char **errptr, int *erroffset,
567 const unsigned char *tableptr);
568
569 pcre_extra *pcre_study(const pcre *code, int options,
570 const char **errptr);
571
572 int pcre_exec(const pcre *code, const pcre_extra *extra,
573 const char *subject, int length, int startoffset,
574 int options, int *ovector, int ovecsize);
575
576 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
577 const char *subject, int length, int startoffset,
578 int options, int *ovector, int ovecsize,
579 int *workspace, int wscount);
580
581 int pcre_copy_named_substring(const pcre *code,
582 const char *subject, int *ovector,
583 int stringcount, const char *stringname,
584 char *buffer, int buffersize);
585
586 int pcre_copy_substring(const char *subject, int *ovector,
587 int stringcount, int stringnumber, char *buffer,
588 int buffersize);
589
590 int pcre_get_named_substring(const pcre *code,
591 const char *subject, int *ovector,
592 int stringcount, const char *stringname,
593 const char **stringptr);
594
595 int pcre_get_stringnumber(const pcre *code,
596 const char *name);
597
598 int pcre_get_substring(const char *subject, int *ovector,
599 int stringcount, int stringnumber,
600 const char **stringptr);
601
602 int pcre_get_substring_list(const char *subject,
603 int *ovector, int stringcount, const char ***listptr);
604
605 void pcre_free_substring(const char *stringptr);
606
607 void pcre_free_substring_list(const char **stringptr);
608
609 const unsigned char *pcre_maketables(void);
610
611 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
612 int what, void *where);
613
614 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
615
616 int pcre_refcount(pcre *code, int adjust);
617
618 int pcre_config(int what, void *where);
619
620 char *pcre_version(void);
621
622 void *(*pcre_malloc)(size_t);
623
624 void (*pcre_free)(void *);
625
626 void *(*pcre_stack_malloc)(size_t);
627
628 void (*pcre_stack_free)(void *);
629
630 int (*pcre_callout)(pcre_callout_block *);
631
632
633 PCRE API OVERVIEW
634
635 PCRE has its own native API, which is described in this document. There
636 is also a set of wrapper functions that correspond to the POSIX regular
637 expression API. These are described in the pcreposix documentation.
638 Both of these APIs define a set of C function calls. A C++ wrapper is
639 distributed with PCRE. It is documented in the pcrecpp page.
640
641 The native API C function prototypes are defined in the header file
642 pcre.h, and on Unix systems the library itself is called libpcre. It
643 can normally be accessed by adding -lpcre to the command for linking an
644 application that uses PCRE. The header file defines the macros
645 PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num-
646 bers for the library. Applications can use these to include support
647 for different releases of PCRE.
648
649 The functions pcre_compile(), pcre_compile2(), pcre_study(), and
650 pcre_exec() are used for compiling and matching regular expressions in
651 a Perl-compatible manner. A sample program that demonstrates the sim-
652 plest way of using them is provided in the file called pcredemo.c in
653 the source distribution. The pcresample documentation describes how to
654 run it.
655
656 A second matching function, pcre_dfa_exec(), which is not Perl-compati-
657 ble, is also provided. This uses a different algorithm for the match-
658 ing. This allows it to find all possible matches (at a given point in
659 the subject), not just one. However, this algorithm does not return
660 captured substrings. A description of the two matching algorithms and
661 their advantages and disadvantages is given in the pcrematching docu-
662 mentation.
663
664 In addition to the main compiling and matching functions, there are
665 convenience functions for extracting captured substrings from a subject
666 string that is matched by pcre_exec(). They are:
667
668 pcre_copy_substring()
669 pcre_copy_named_substring()
670 pcre_get_substring()
671 pcre_get_named_substring()
672 pcre_get_substring_list()
673 pcre_get_stringnumber()
674
675 pcre_free_substring() and pcre_free_substring_list() are also provided,
676 to free the memory used for extracted strings.
677
678 The function pcre_maketables() is used to build a set of character
679 tables in the current locale for passing to pcre_compile(),
680 pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
681 provided for specialist use. Most commonly, no special tables are
682 passed, in which case internal tables that are generated when PCRE is
683 built are used.
684
685 The function pcre_fullinfo() is used to find out information about a
686 compiled pattern; pcre_info() is an obsolete version that returns only
687 some of the available information, but is retained for backwards com-
688 patibility. The function pcre_version() returns a pointer to a string
689 containing the version of PCRE and its date of release.
690
691 The function pcre_refcount() maintains a reference count in a data
692 block containing a compiled pattern. This is provided for the benefit
693 of object-oriented applications.
694
695 The global variables pcre_malloc and pcre_free initially contain the
696 entry points of the standard malloc() and free() functions, respec-
697 tively. PCRE calls the memory management functions via these variables,
698 so a calling program can replace them if it wishes to intercept the
699 calls. This should be done before calling any PCRE functions.
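
As an illustration of intercepting the calls, the sketch below wraps malloc() and free() with a counter of live blocks. The function names are invented for this example; only the signatures are taken from the declarations of pcre_malloc and pcre_free shown in the synopsis above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

static unsigned long live_blocks = 0;   /* blocks obtained but not yet freed */

/* Same signature as the function pcre_malloc points to. */
void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL)
        live_blocks++;
    return block;
}

/* Same signature as the function pcre_free points to. */
void counting_free(void *block)
{
    if (block == NULL)
        return;
    assert(live_blocks > 0);            /* catches an unmatched free */
    live_blocks--;
    free(block);
}

/* Report how many blocks are currently outstanding. */
unsigned long counting_live_blocks(void)
{
    return live_blocks;
}

/*
 * In a program that includes <pcre.h>, the replacements would be
 * installed before the first call to any PCRE function:
 *
 *     pcre_malloc = counting_malloc;
 *     pcre_free   = counting_free;
 */
```

A non-zero count at program exit would suggest that a compiled pattern or other PCRE-allocated block was never released. The counter is a plain static variable, so this sketch is suitable only for single-threaded use.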
700
701 The global variables pcre_stack_malloc and pcre_stack_free are also
702 indirections to memory management functions. These special functions
703 are used only when PCRE is compiled to use the heap for remembering
704 data, instead of recursive function calls, when running the pcre_exec()
705 function. This is a non-standard way of building PCRE, for use in envi-
706 ronments that have limited stacks. Because of the greater use of memory
707 management, it runs more slowly. Separate functions are provided so
708 that special-purpose external code can be used for this case. When
709 used, these functions are always called in a stack-like manner (last
710 obtained, first freed), and always for memory blocks of the same size.
711
712 The global variable pcre_callout initially contains NULL. It can be set
713 by the caller to a "callout" function, which PCRE will then call at
714 specified points during a matching operation. Details are given in the
715 pcrecallout documentation.
716
717
718 MULTITHREADING
719
720 The PCRE functions can be used in multi-threading applications, with
721 the proviso that the memory management functions pointed to by
722 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
723 callout function pointed to by pcre_callout, are shared by all threads.
724
725 The compiled form of a regular expression is not altered during match-
726 ing, so the same compiled pattern can safely be used by several threads
727 at once.
728
729
730 SAVING PRECOMPILED PATTERNS FOR LATER USE
731
732 The compiled form of a regular expression can be saved and re-used at a
733 later time, possibly by a different program, and even on a host other
734 than the one on which it was compiled. Details are given in the
735 pcreprecompile documentation.
736
737
738 CHECKING BUILD-TIME OPTIONS
739
740 int pcre_config(int what, void *where);
741
742 The function pcre_config() makes it possible for a PCRE client to dis-
743 cover which optional features have been compiled into the PCRE library.
744 The pcrebuild documentation has more details about these optional fea-
745 tures.
746
747 The first argument for pcre_config() is an integer, specifying which
748 information is required; the second argument is a pointer to a variable
749 into which the information is placed. The following information is
750 available:
751
752 PCRE_CONFIG_UTF8
753
754 The output is an integer that is set to one if UTF-8 support is avail-
755 able; otherwise it is set to zero.
756
757 PCRE_CONFIG_UNICODE_PROPERTIES
758
759 The output is an integer that is set to one if support for Unicode
760 character properties is available; otherwise it is set to zero.
761
762 PCRE_CONFIG_NEWLINE
763
764 The output is an integer that is set to the value of the code that is
765 used for the newline character. It is either linefeed (10) or carriage
766 return (13), and should normally be the standard character for your
767 operating system.
768
769 PCRE_CONFIG_LINK_SIZE
770
771 The output is an integer that contains the number of bytes used for
772 internal linkage in compiled regular expressions. The value is 2, 3, or
773 4. Larger values allow larger regular expressions to be compiled, at
774 the expense of slower matching. The default value of 2 is sufficient
775 for all but the most massive patterns, since it allows the compiled
776 pattern to be up to 64K in size.
777
778 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
779
780 The output is an integer that contains the threshold above which the
781 POSIX interface uses malloc() for output vectors. Further details are
782 given in the pcreposix documentation.
783
784 PCRE_CONFIG_MATCH_LIMIT
785
786 The output is an integer that gives the default limit for the number of
787 internal matching function calls in a pcre_exec() execution. Further
788 details are given with pcre_exec() below.
789
790 PCRE_CONFIG_STACKRECURSE
791
792 The output is an integer that is set to one if internal recursion when
793 running pcre_exec() is implemented by recursive function calls that use
794 the stack to remember their state. This is the usual way that PCRE is
795 compiled. The output is zero if PCRE was compiled to use blocks of data
796 on the heap instead of recursive function calls. In this case,
797 pcre_stack_malloc and pcre_stack_free are called to manage memory
798 blocks on the heap, thus avoiding the use of the stack.
799
800
801 COMPILING A PATTERN
802
803 pcre *pcre_compile(const char *pattern, int options,
804 const char **errptr, int *erroffset,
805 const unsigned char *tableptr);
806
807 pcre *pcre_compile2(const char *pattern, int options,
808 int *errorcodeptr,
809 const char **errptr, int *erroffset,
810 const unsigned char *tableptr);
811
812 Either of the functions pcre_compile() or pcre_compile2() can be called
813 to compile a pattern into an internal form. The only difference between
814 the two interfaces is that pcre_compile2() has an additional argument,
815 errorcodeptr, via which a numerical error code can be returned.
816
817 The pattern is a C string terminated by a binary zero, and is passed in
818 the pattern argument. A pointer to a single block of memory that is
819 obtained via pcre_malloc is returned. This contains the compiled code
820 and related data. The pcre type is defined for the returned block; this
821 is a typedef for a structure whose contents are not externally defined.
822 It is up to the caller to free the memory when it is no longer
823 required.
824
825 Although the compiled code of a PCRE regex is relocatable, that is, it
826 does not depend on memory location, the complete pcre data block is not
827 fully relocatable, because it may contain a copy of the tableptr argu-
828 ment, which is an address (see below).
829
830 The options argument contains independent bits that affect the compila-
831 tion. It should be zero if no options are required. The available
832 options are described below. Some of them, in particular, those that
833 are compatible with Perl, can also be set and unset from within the
834 pattern (see the detailed description in the pcrepattern documenta-
835 tion). For these options, the contents of the options argument speci-
836 fies their initial settings at the start of compilation and execution.
837 The PCRE_ANCHORED option can be set at the time of matching as well as
838 at compile time.
839
840 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
841 if compilation of a pattern fails, pcre_compile() returns NULL, and
842 sets the variable pointed to by errptr to point to a textual error mes-
843 sage. The offset from the start of the pattern to the character where
844 the error was discovered is placed in the variable pointed to by
845 erroffset, which must not be NULL. If it is, an immediate error is
846 given.
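
One common use of the message and offset is to show the user where compilation stopped. The helper below is a sketch of that idea in plain C; format_compile_error is a hypothetical name, and its arguments stand for the pattern, the string left in *errptr, and the value left in *erroffset after a failed call.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the pattern, a caret under the failing offset, and the message
   into out. Returns what snprintf() returns. */
int format_compile_error(char *out, size_t outsize, const char *pattern,
                         const char *message, int erroffset)
{
    size_t length;

    assert(out != NULL && pattern != NULL && message != NULL);
    length = strlen(pattern);
    if (erroffset < 0)
        erroffset = 0;
    if ((size_t)erroffset > length)
        erroffset = (int)length;        /* keep the caret within reach */

    /* "%*s" with an empty string prints erroffset spaces. */
    return snprintf(out, outsize, "%s\n%*s^\noffset %d: %s\n",
                    pattern, erroffset, "", erroffset, message);
}
```

The caret lines up only for single-line patterns without tabs, and in UTF-8 mode a byte offset may not correspond to a display column.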
847
848 If pcre_compile2() is used instead of pcre_compile(), and the error-
849 codeptr argument is not NULL, a non-zero error code number is returned
850 via this argument in the event of an error. This is in addition to the
851 textual error message. Error codes and messages are listed below.
852
853 If the final argument, tableptr, is NULL, PCRE uses a default set of
854 character tables that are built when PCRE is compiled, using the
855 default C locale. Otherwise, tableptr must be an address that is the
856 result of a call to pcre_maketables(). This value is stored with the
857 compiled pattern, and used again by pcre_exec(), unless another table
858 pointer is passed to it. For more discussion, see the section on locale
859 support below.
860
861 This code fragment shows a typical straightforward call to pcre_com-
862 pile():
863
864 pcre *re;
865 const char *error;
866 int erroffset;
867 re = pcre_compile(
868 "^A.*Z", /* the pattern */
869 0, /* default options */
870 &error, /* for error message */
871 &erroffset, /* for error offset */
872 NULL); /* use default character tables */
873
874 The following names for option bits are defined in the pcre.h header
875 file:
876
877 PCRE_ANCHORED
878
879 If this bit is set, the pattern is forced to be "anchored", that is, it
880 is constrained to match only at the first matching point in the string
881 that is being searched (the "subject string"). This effect can also be
882 achieved by appropriate constructs in the pattern itself, which is the
883 only way to do it in Perl.
884
885 PCRE_AUTO_CALLOUT
886
887 If this bit is set, pcre_compile() automatically inserts callout items,
888 all with number 255, before each pattern item. For discussion of the
889 callout facility, see the pcrecallout documentation.
890
891 PCRE_CASELESS
892
893 If this bit is set, letters in the pattern match both upper and lower
894 case letters. It is equivalent to Perl's /i option, and it can be
895 changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
896 always understands the concept of case for characters whose values are
897 less than 128, so caseless matching is always possible. For characters
898 with higher values, the concept of case is supported if PCRE is com-
899 piled with Unicode property support, but not otherwise. If you want to
900 use caseless matching for characters 128 and above, you must ensure
901 that PCRE is compiled with Unicode property support as well as with
902 UTF-8 support.
903
904 PCRE_DOLLAR_ENDONLY
905
906 If this bit is set, a dollar metacharacter in the pattern matches only
907 at the end of the subject string. Without this option, a dollar also
908 matches immediately before the final character if it is a newline (but
909 not before any other newlines). The PCRE_DOLLAR_ENDONLY option is
910 ignored if PCRE_MULTILINE is set. There is no equivalent to this option
911 in Perl, and no way to set it within a pattern.
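
The rule can be restated as a small predicate. This is only a model of the behaviour described above for the case where PCRE_MULTILINE is not set, not code taken from PCRE; dollar_matches_at is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Can a dollar match at byte position pos of a subject of the given
   length? dollar_endonly is non-zero when PCRE_DOLLAR_ENDONLY is set. */
int dollar_matches_at(const char *subject, size_t length, size_t pos,
                      int dollar_endonly)
{
    assert(subject != NULL && pos <= length);
    if (pos == length)
        return 1;                   /* the very end of the subject */
    if (!dollar_endonly && pos + 1 == length && subject[pos] == '\n')
        return 1;                   /* just before a final newline */
    return 0;
}
```

For the subject "abc\n", position 3 (before the final newline) is accepted only when the option is not set, while position 4 is accepted in both cases.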
912
913 PCRE_DOTALL
914
915 If this bit is set, a dot metacharacter in the pattern matches all char-
916 acters, including newlines. Without it, newlines are excluded. This
917 option is equivalent to Perl's /s option, and it can be changed within
918 a pattern by a (?s) option setting. A negative class such as [^a]
919 always matches a newline character, independent of the setting of this
920 option.
921
922 PCRE_EXTENDED
923
924 If this bit is set, whitespace data characters in the pattern are
925 totally ignored except when escaped or inside a character class. White-
926 space does not include the VT character (code 11). In addition, charac-
927 ters between an unescaped # outside a character class and the next new-
928 line character, inclusive, are also ignored. This is equivalent to
929 Perl's /x option, and it can be changed within a pattern by a (?x)
930 option setting.
931
932 This option makes it possible to include comments inside complicated
933 patterns. Note, however, that this applies only to data characters.
934 Whitespace characters may never appear within special character
935 sequences in a pattern, for example within the sequence (?( which
936 introduces a conditional subpattern.
937
938 PCRE_EXTRA
939
940 This option was invented in order to turn on additional functionality
941 of PCRE that is incompatible with Perl, but it is currently of very
942 little use. When set, any backslash in a pattern that is followed by a
943 letter that has no special meaning causes an error, thus reserving
944 these combinations for future expansion. By default, as in Perl, a
945 backslash followed by a letter with no special meaning is treated as a
946 literal. There are at present no other features controlled by this
947 option. It can also be set by a (?X) option setting within a pattern.
948
949 PCRE_FIRSTLINE
950
951 If this option is set, an unanchored pattern is required to match
952 before or at the first newline character in the subject string, though
953 the matched text may continue over the newline.
954
955 PCRE_MULTILINE
956
957 By default, PCRE treats the subject string as consisting of a single
958 line of characters (even if it actually contains newlines). The "start
959 of line" metacharacter (^) matches only at the start of the string,
960 while the "end of line" metacharacter ($) matches only at the end of
961 the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
962 is set). This is the same as Perl.
963
964 When PCRE_MULTILINE is set, the "start of line" and "end of line"
965 constructs match immediately following or immediately before any new-
966 line in the subject string, respectively, as well as at the very start
967 and end. This is equivalent to Perl's /m option, and it can be changed
968 within a pattern by a (?m) option setting. If there are no "\n" charac-
969 ters in a subject string, or no occurrences of ^ or $ in a pattern,
970 setting PCRE_MULTILINE has no effect.
971
972 PCRE_NO_AUTO_CAPTURE
973
974 If this option is set, it disables the use of numbered capturing paren-
975 theses in the pattern. Any opening parenthesis that is not followed by
976 ? behaves as if it were followed by ?: but named parentheses can still
977 be used for capturing (and they acquire numbers in the usual way).
978 There is no equivalent of this option in Perl.
979
980 PCRE_UNGREEDY
981
982 This option inverts the "greediness" of the quantifiers so that they
983 are not greedy by default, but become greedy if followed by "?". It is
984 not compatible with Perl. It can also be set by a (?U) option setting
985 within the pattern.
986
987 PCRE_UTF8
988
989 This option causes PCRE to regard both the pattern and the subject as
990 strings of UTF-8 characters instead of single-byte character strings.
991 However, it is available only when PCRE is built to include UTF-8 sup-
992 port. If not, the use of this option provokes an error. Details of how
993 this option changes the behaviour of PCRE are given in the section on
994 UTF-8 support in the main pcre page.
995
996 PCRE_NO_UTF8_CHECK
997
998 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
999 automatically checked. If an invalid UTF-8 sequence of bytes is found,
1000 pcre_compile() returns an error. If you already know that your pattern
1001 is valid, and you want to skip this check for performance reasons, you
1002 can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of
1003 passing an invalid UTF-8 string as a pattern is undefined. It may cause
1004 your program to crash. Note that this option can also be passed to
1005 pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check-
1006 ing of subject strings.
1007
1008
1009 COMPILATION ERROR CODES
1010
1011 The following table lists the error codes that may be returned by
1012 pcre_compile2(), along with the error messages that may be returned by
1013 both compiling functions.
1014
1015 0 no error
1016 1 \ at end of pattern
1017 2 \c at end of pattern
1018 3 unrecognized character follows \
1019 4 numbers out of order in {} quantifier
1020 5 number too big in {} quantifier
1021 6 missing terminating ] for character class
1022 7 invalid escape sequence in character class
1023 8 range out of order in character class
1024 9 nothing to repeat
1025 10 operand of unlimited repeat could match the empty string
1026 11 internal error: unexpected repeat
1027 12 unrecognized character after (?
1028 13 POSIX named classes are supported only within a class
1029 14 missing )
1030 15 reference to non-existent subpattern
1031 16 erroffset passed as NULL
1032 17 unknown option bit(s) set
1033 18 missing ) after comment
1034 19 parentheses nested too deeply
1035 20 regular expression too large
1036 21 failed to get memory
1037 22 unmatched parentheses
1038 23 internal error: code overflow
1039 24 unrecognized character after (?<
1040 25 lookbehind assertion is not fixed length
1041 26 malformed number after (?(
1042 27 conditional group contains more than two branches
1043 28 assertion expected after (?(
1044 29 (?R or (?digits must be followed by )
1045 30 unknown POSIX class name
1046 31 POSIX collating elements are not supported
1047 32 this version of PCRE is not compiled with PCRE_UTF8 support
1048 33 spare error
1049 34 character value in \x{...} sequence is too large
1050 35 invalid condition (?(0)
1051 36 \C not allowed in lookbehind assertion
1052 37 PCRE does not support \L, \l, \N, \U, or \u
1053 38 number after (?C is > 255
1054 39 closing ) for (?C expected
1055 40 recursive call could loop indefinitely
1056 41 unrecognized character after (?P
1057 42 syntax error after (?P
1058 43 two named groups have the same name
1059 44 invalid UTF-8 string
1060 45 support for \P, \p, and \X has not been compiled
1061 46 malformed \P or \p sequence
1062 47 unknown property name after \P or \p
1063
1064
1065 STUDYING A PATTERN
1066
1067 pcre_extra *pcre_study(const pcre *code, int options,
1068 const char **errptr);
1069
1070 If a compiled pattern is going to be used several times, it is worth
1071 spending more time analyzing it in order to speed up the time taken for
1072 matching. The function pcre_study() takes a pointer to a compiled pat-
1073 tern as its first argument. If studying the pattern produces additional
1074 information that will help speed up matching, pcre_study() returns a
1075 pointer to a pcre_extra block, in which the study_data field points to
1076 the results of the study.
1077
1078 The returned value from pcre_study() can be passed directly to
1079 pcre_exec(). However, a pcre_extra block also contains other fields
1080 that can be set by the caller before the block is passed; these are
1081 described below in the section on matching a pattern.
1082
1083 If studying the pattern does not produce any additional information,
1084 pcre_study() returns NULL. In that circumstance, if the calling program
1085 wants to pass any of the other fields to pcre_exec(), it must set up
1086 its own pcre_extra block.
1087
1088 The second argument of pcre_study() contains option bits. At present,
1089 no options are defined, and this argument should always be zero.
1090
1091 The third argument for pcre_study() is a pointer for an error message.
1092 If studying succeeds (even if no data is returned), the variable it
1093 points to is set to NULL. Otherwise it points to a textual error mes-
1094 sage. You should therefore test the error pointer for NULL after call-
1095 ing pcre_study(), to be sure that it has run successfully.
1096
1097 This is a typical call to pcre_study():
1098
1099 pcre_extra *pe;
1100 pe = pcre_study(
1101 re, /* result of pcre_compile() */
1102 0, /* no options exist */
1103 &error); /* set to NULL or points to a message */
1104
1105 At present, studying a pattern is useful only for non-anchored patterns
1106 that do not have a single fixed starting character. A bitmap of possi-
1107 ble starting bytes is created.
1108
1109
1110 LOCALE SUPPORT
1111
1112 PCRE handles caseless matching, and determines whether characters are
1113 letters, digits, or whatever, by reference to a set of tables, indexed
1114 by character value. When running in UTF-8 mode, this applies only to
1115 characters with codes less than 128. Higher-valued codes never match
1116 escapes such as \w or \d, but can be tested with \p if PCRE is built
1117 with Unicode character property support.
1118
1119 An internal set of tables is created in the default C locale when PCRE
1120 is built. This is used when the final argument of pcre_compile() is
1121 NULL, and is sufficient for many applications. An alternative set of
1122 tables can, however, be supplied. These may be created in a different
1123 locale from the default. As more and more applications change to using
1124 Unicode, the need for this locale support is expected to die away.
1125
1126 External tables are built by calling the pcre_maketables() function,
1127 which has no arguments, in the relevant locale. The result can then be
1128 passed to pcre_compile() or pcre_exec() as often as necessary. For
1129 example, to build and use tables that are appropriate for the French
1130 locale (where accented characters with values greater than 128 are
1131 treated as letters), the following code could be used:
1132
1133 setlocale(LC_CTYPE, "fr_FR");
1134 tables = pcre_maketables();
1135 re = pcre_compile(..., tables);
1136
1137 When pcre_maketables() runs, the tables are built in memory that is
1138 obtained via pcre_malloc. It is the caller's responsibility to ensure
1139 that the memory containing the tables remains available for as long as
1140 it is needed.
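
To see why the current locale matters at the moment tables are built, the sketch below fills a 256-entry lower-casing table using tolower() under a named locale. It is a simplified stand-in and makes no claim about the layout of the tables that pcre_maketables() produces; build_fold_table is a hypothetical name. Unlike the fragment above, it checks the result of setlocale(), which is NULL when the requested locale is not installed.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Fill table[c] with the lower-case form of c in the named locale.
   Returns 1 on success, 0 if the locale could not be selected. */
int build_fold_table(const char *locale_name, unsigned char table[256])
{
    int c;

    assert(locale_name != NULL && table != NULL);
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;                       /* locale not available */
    for (c = 0; c < 256; c++)
        table[c] = (unsigned char)tolower(c);
    return 1;
}
```

Like the fragment above, this leaves LC_CTYPE set to the new locale; a program that needs the previous locale afterwards has to restore it itself.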
1141
1142 The pointer that is passed to pcre_compile() is saved with the compiled
1143 pattern, and the same tables are used via this pointer by pcre_study()
1144 and normally also by pcre_exec(). Thus, by default, for any single pat-
1145 tern, compilation, studying and matching all happen in the same locale,
1146 but different patterns can be compiled in different locales.
1147
1148 It is possible to pass a table pointer or NULL (indicating the use of
1149 the internal tables) to pcre_exec(). Although not intended for this
1150 purpose, this facility could be used to match a pattern in a different
1151 locale from the one in which it was compiled. Passing table pointers at
1152 run time is discussed below in the section on matching a pattern.
1153
1154
1155 INFORMATION ABOUT A PATTERN
1156
1157 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1158 int what, void *where);
1159
1160 The pcre_fullinfo() function returns information about a compiled pat-
1161 tern. It replaces the obsolete pcre_info() function, which is neverthe-
1162 less retained for backwards compatibility (and is documented below).
1163
1164 The first argument for pcre_fullinfo() is a pointer to the compiled
1165 pattern. The second argument is the result of pcre_study(), or NULL if
1166 the pattern was not studied. The third argument specifies which piece
1167 of information is required, and the fourth argument is a pointer to a
1168 variable to receive the data. The yield of the function is zero for
1169 success, or one of the following negative numbers:
1170
1171 PCRE_ERROR_NULL       the argument code was NULL
1172                       the argument where was NULL
1173 PCRE_ERROR_BADMAGIC   the "magic number" was not found
1174 PCRE_ERROR_BADOPTION  the value of what was invalid
1175
1176 The "magic number" is placed at the start of each compiled pattern as
1177 a simple check against passing an arbitrary memory pointer. Here is a
1178 typical call of pcre_fullinfo(), to obtain the length of the compiled
1179 pattern:
1180
1181 int rc;
1182 size_t length;
1183 rc = pcre_fullinfo(
1184 re, /* result of pcre_compile() */
1185 pe, /* result of pcre_study(), or NULL */
1186 PCRE_INFO_SIZE, /* what is required */
1187 &length); /* where to put the data */
1188
1189 The possible values for the third argument are defined in pcre.h, and
1190 are as follows:
1191
1192 PCRE_INFO_BACKREFMAX
1193
1194 Return the number of the highest back reference in the pattern. The
1195 fourth argument should point to an int variable. Zero is returned if
1196 there are no back references.
1197
1198 PCRE_INFO_CAPTURECOUNT
1199
1200 Return the number of capturing subpatterns in the pattern. The fourth
1201 argument should point to an int variable.
1202
1203 PCRE_INFO_DEFAULT_TABLES
1204
1205 Return a pointer to the internal default character tables within PCRE.
1206 The fourth argument should point to an unsigned char * variable. This
1207 information call is provided for internal use by the pcre_study() func-
1208 tion. External callers can cause PCRE to use its internal tables by
1209 passing a NULL table pointer.
1210
1211 PCRE_INFO_FIRSTBYTE
1212
1213 Return information about the first byte of any matched string, for a
1214 non-anchored pattern. (This option used to be called
1215 PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards
1216 compatibility.)
1217
1218 If there is a fixed first byte, for example, from a pattern such as
1219 (cat|cow|coyote), it is returned in the integer pointed to by where.
1220 Otherwise, if either
1221
1222 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
1223 branch starts with "^", or
1224
1225 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1226 set (if it were set, the pattern would be anchored),
1227
1228 -1 is returned, indicating that the pattern matches only at the start
1229 of a subject string or after any newline within the string. Otherwise
1230 -2 is returned. For anchored patterns, -2 is returned.
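
The three kinds of result can be told apart with a few comparisons. The sketch below classifies the integer that PCRE_INFO_FIRSTBYTE places in the variable pointed to by where; the enum and function names are hypothetical.

```c
#include <assert.h>

typedef enum {
    FIRSTBYTE_FIXED,        /* value is the byte every match starts with */
    FIRSTBYTE_LINE_START,   /* -1: start of subject or after a newline */
    FIRSTBYTE_NONE          /* -2: nothing recorded, or pattern anchored */
} firstbyte_kind;

firstbyte_kind classify_firstbyte(int value)
{
    assert(value >= -2 && value <= 255);    /* the documented range */
    if (value >= 0)
        return FIRSTBYTE_FIXED;
    if (value == -1)
        return FIRSTBYTE_LINE_START;
    return FIRSTBYTE_NONE;
}
```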
1231
1232 PCRE_INFO_FIRSTTABLE
1233
1234 If the pattern was studied, and this resulted in the construction of a
1235 256-bit table indicating a fixed set of bytes for the first byte in any
1236 matching string, a pointer to the table is returned. Otherwise NULL is
1237 returned. The fourth argument should point to an unsigned char * vari-
1238 able.
1239
1240 PCRE_INFO_LASTLITERAL
1241
1242 Return the value of the rightmost literal byte that must exist in any
1243 matched string, other than at its start, if such a byte has been
1244 recorded. The fourth argument should point to an int variable. If there
1245 is no such byte, -1 is returned. For anchored patterns, a last literal
1246 byte is recorded only if it follows something of variable length. For
1247 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1248 /^a\dz\d/ the returned value is -1.
1249
1250 PCRE_INFO_NAMECOUNT
1251 PCRE_INFO_NAMEENTRYSIZE
1252 PCRE_INFO_NAMETABLE
1253
1254 PCRE supports the use of named as well as numbered capturing parenthe-
1255 ses. The names are just an additional way of identifying the parenthe-
1256 ses, which still acquire numbers. A convenience function called
1257 pcre_get_named_substring() is provided for extracting an individual
1258 captured substring by name. It is also possible to extract the data
1259 directly, by first converting the name to a number in order to access
1260 the correct pointers in the output vector (described with pcre_exec()
1261 below). To do the conversion, you need to use the name-to-number map,
1262 which is described by these three values.
1263
1264 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1265 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1266 of each entry; both of these return an int value. The entry size
1267 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
1268 a pointer to the first entry of the table (a pointer to char). The
1269 first two bytes of each entry are the number of the capturing parenthe-
1270 sis, most significant byte first. The rest of the entry is the corre-
1271 sponding name, zero terminated. The names are in alphabetical order.
1272 For example, consider the following pattern (assume PCRE_EXTENDED is
1273 set, so white space - including newlines - is ignored):
1274
1275 (?P<date> (?P<year>(\d\d)?\d\d) -
1276 (?P<month>\d\d) - (?P<day>\d\d) )
1277
1278 There are four named subpatterns, so the table has four entries, and
1279 each entry in the table is eight bytes long. The table is as follows,
1280 with non-printing bytes shown in hexadecimal, and undefined bytes shown
1281 as ??:
1282
1283 00 01 d  a  t  e  00 ??
1284 00 05 d  a  y  00 ?? ??
1285 00 04 m  o  n  t  h  00
1286 00 02 y  e  a  r  00 ??
1287
1288 When writing code to extract data from named subpatterns using the
1289 name-to-number map, remember that the length of each entry is likely to
1290 be different for each compiled pattern.
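
A lookup over the map might be sketched as follows. The function name_to_number and the array example_table are hypothetical; the array reproduces the example map above, with the undefined bytes filled with zero. In a real program the table pointer, the entry count and the entry size would come from three calls to pcre_fullinfo(), and the char pointer it yields would be cast to unsigned char for the byte arithmetic.

```c
#include <assert.h>
#include <string.h>

/* The example map from the text; bytes shown as ?? are zero here. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Return the number of the named parenthesis, or -1 if name is absent. */
int name_to_number(const unsigned char *table, int namecount,
                   int entrysize, const char *name)
{
    int i;

    assert(table != NULL && name != NULL);
    assert(entrysize > 2);      /* two number bytes plus a terminator */
    for (i = 0; i < namecount; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entrysize;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* high byte first */
    }
    return -1;
}
```

A linear scan keeps the sketch short. Because the names are in alphabetical order, a binary search over the entries would also work.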
1291
1292 PCRE_INFO_OPTIONS
1293
1294 Return a copy of the options with which the pattern was compiled. The
1295 fourth argument should point to an unsigned long int variable. These
1296 option bits are those specified in the call to pcre_compile(), modified
1297 by any top-level option settings within the pattern itself.
1298
1299 A pattern is automatically anchored by PCRE if all of its top-level
1300 alternatives begin with one of the following:
1301
1302 ^     unless PCRE_MULTILINE is set
1303 \A    always
1304 \G    always
1305 .*    if PCRE_DOTALL is set and there are no back
1306       references to the subpattern in which .* appears
1307
1308 For such patterns, the PCRE_ANCHORED bit is set in the options returned
1309 by pcre_fullinfo().
1310
1311 PCRE_INFO_SIZE
1312
1313 Return the size of the compiled pattern, that is, the value that was
1314 passed as the argument to pcre_malloc() when PCRE was getting memory in
1315 which to place the compiled data. The fourth argument should point to a
1316 size_t variable.
1317
1318 PCRE_INFO_STUDYSIZE
1319
1320 Return the size of the data block pointed to by the study_data field in
1321 a pcre_extra block. That is, it is the value that was passed to
1322 pcre_malloc() when PCRE was getting memory into which to place the data
1323 created by pcre_study(). The fourth argument should point to a size_t
1324 variable.
1325
1326
1327 OBSOLETE INFO FUNCTION
1328
1329 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1330
1331 The pcre_info() function is now obsolete because its interface is too
1332 restrictive to return all the available data about a compiled pattern.
1333 New programs should use pcre_fullinfo() instead. The yield of
1334 pcre_info() is the number of capturing subpatterns, or one of the fol-
1335 lowing negative numbers:
1336
1337 PCRE_ERROR_NULL the argument code was NULL
1338 PCRE_ERROR_BADMAGIC the "magic number" was not found
1339
1340 If the optptr argument is not NULL, a copy of the options with which
1341 the pattern was compiled is placed in the integer it points to (see
1342 PCRE_INFO_OPTIONS above).
1343
1344 If the pattern is not anchored and the firstcharptr argument is not
1345 NULL, it is used to pass back information about the first character of
1346 any matched string (see PCRE_INFO_FIRSTBYTE above).
1347
1348
1349 REFERENCE COUNTS
1350
1351 int pcre_refcount(pcre *code, int adjust);
1352
1353 The pcre_refcount() function is used to maintain a reference count in
1354 the data block that contains a compiled pattern. It is provided for the
1355 benefit of applications that operate in an object-oriented manner,
1356 where different parts of the application may be using the same compiled
1357 pattern, but you want to free the block when they are all done.
1358
1359 When a pattern is compiled, the reference count field is initialized to
1360 zero. It is changed only by calling this function, whose action is to
1361 add the adjust value (which may be positive or negative) to it. The
1362 yield of the function is the new value. However, the value of the count
1363 is constrained to lie between 0 and 65535, inclusive. If the new value
1364 is outside these limits, it is forced to the appropriate limit value.
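
The arithmetic described above amounts to a saturating addition. The function below is a model of that rule only, written for illustration; it is not PCRE's code, and model_refcount_adjust is a hypothetical name.

```c
#include <assert.h>

/* Model of the documented rule: add adjust to the current count and
   force the result into the range 0 to 65535. */
int model_refcount_adjust(int current, int adjust)
{
    long long updated;

    assert(current >= 0 && current <= 65535);
    updated = (long long)current + adjust;  /* wide enough for any int */
    if (updated < 0)
        updated = 0;
    if (updated > 65535)
        updated = 65535;
    return (int)updated;
}
```

One convention an application might adopt is for each user of a pattern to add 1 when it starts using the pattern and -1 when it finishes, with whichever caller sees a yield of zero freeing the block.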
1365
1366 Except when it is zero, the reference count is not correctly preserved
1367 if a pattern is compiled on one host and then transferred to a host
1368 whose byte-order is different. (This seems a highly unlikely scenario.)
1369
1370
1371 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1372
1373 int pcre_exec(const pcre *code, const pcre_extra *extra,
1374 const char *subject, int length, int startoffset,
1375 int options, int *ovector, int ovecsize);
1376
1377 The function pcre_exec() is called to match a subject string against a
1378 compiled pattern, which is passed in the code argument. If the pattern
1379 has been studied, the result of the study should be passed in the extra
1380 argument. This function is the main matching facility of the library,
1381 and it operates in a Perl-like manner. For specialist use there is also
1382 an alternative matching function, which is described below in the sec-
1383 tion about the pcre_dfa_exec() function.
1384
1385 In most applications, the pattern will have been compiled (and option-
1386 ally studied) in the same process that calls pcre_exec(). However, it
1387 is possible to save compiled patterns and study data, and then use them
1388 later in different processes, possibly even on different hosts. For a
1389 discussion about this, see the pcreprecompile documentation.
1390
1391 Here is an example of a simple call to pcre_exec():
1392
1393 int rc;
1394 int ovector[30];
1395 rc = pcre_exec(
1396 re, /* result of pcre_compile() */
1397 NULL, /* we didn't study the pattern */
1398 "some string", /* the subject string */
1399 11, /* the length of the subject string */
1400 0, /* start at offset 0 in the subject */
1401 0, /* default options */
1402 ovector, /* vector of integers for substring information */
1403 30); /* number of elements (NOT size in bytes) */
1404
1405 Extra data for pcre_exec()
1406
1407 If the extra argument is not NULL, it must point to a pcre_extra data
1408 block. The pcre_study() function returns such a block (when it doesn't
1409 return NULL), but you can also create one for yourself, and pass addi-
1410 tional information in it. The fields in a pcre_extra block are as fol-
1411 lows:
1412
1413 unsigned long int flags;
1414 void *study_data;
1415 unsigned long int match_limit;
1416 void *callout_data;
1417 const unsigned char *tables;
1418
1419 The flags field is a bitmap that specifies which of the other fields
1420 are set. The flag bits are:
1421
1422 PCRE_EXTRA_STUDY_DATA
1423 PCRE_EXTRA_MATCH_LIMIT
1424 PCRE_EXTRA_CALLOUT_DATA
1425 PCRE_EXTRA_TABLES
1426
1427 Other flag bits should be set to zero. The study_data field is set in
1428 the pcre_extra block that is returned by pcre_study(), together with
1429 the appropriate flag bit. You should not set this yourself, but you may
1430 add to the block by setting the other fields and their corresponding
1431 flag bits.
1432
1433 The match_limit field provides a means of preventing PCRE from using up
1434 a vast amount of resources when running patterns that are not going to
1435 match, but which have a very large number of possibilities in their
1436 search trees. The classic example is the use of nested unlimited
1437 repeats.
1438
1439 Internally, PCRE uses a function called match() which it calls repeat-
1440 edly (sometimes recursively). The limit is imposed on the number of
1441 times this function is called during a match, which has the effect of
1442 limiting the amount of recursion and backtracking that can take place.
1443 For patterns that are not anchored, the count starts from zero for each
1444 position in the subject string.
1445
1446 The default limit for the library can be set when PCRE is built; the
1447 built-in default is 10 million, which handles all but the most extreme
1448 cases. You can reduce the default by supplying pcre_exec() with a
1449 pcre_extra block in which match_limit is set to a smaller value, and
1450 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
1451 exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
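
The effect of limiting calls to a recursive matching function can be
illustrated without PCRE itself. The following is only a sketch under
stated assumptions: it is a toy backtracking matcher, not PCRE's match()
function, it supports only literals, "." and "x*", and the names
toy_match(), toy_search(), toy_budget and the TOY_ codes are all
hypothetical. It shows a call counter that restarts from zero for each
starting position, and a distinct return code when the limit is reached.

```c
/* Hypothetical return codes for this toy; PCRE's real codes differ. */
enum { TOY_LIMIT = -1, TOY_NOMATCH = 0, TOY_MATCH = 1 };

typedef struct {
    unsigned long calls;   /* calls to toy_match() made so far */
    unsigned long limit;   /* maximum number of calls allowed */
} toy_budget;

/* Match pat at the start of text. Supports literals, '.', and 'x*'.
   Every call is counted against the budget. */
int toy_match(const char *pat, const char *text, toy_budget *b)
{
    if (++b->calls > b->limit)
        return TOY_LIMIT;
    if (pat[0] == '\0')
        return TOY_MATCH;
    if (pat[1] == '*') {
        const char *t = text;
        for (;;) {
            /* Try the rest of the pattern, then consume one more char. */
            int rc = toy_match(pat + 2, t, b);
            if (rc != TOY_NOMATCH)
                return rc;
            if (*t == '\0' || (pat[0] != '.' && pat[0] != *t))
                return TOY_NOMATCH;
            t++;
        }
    }
    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
        return toy_match(pat + 1, text + 1, b);
    return TOY_NOMATCH;
}

/* Unanchored search: the call count starts from zero at each position. */
int toy_search(const char *pat, const char *text, unsigned long limit)
{
    const char *start = text;
    for (;;) {
        toy_budget b;
        int rc;
        b.calls = 0;
        b.limit = limit;
        rc = toy_match(pat, start, &b);
        if (rc != TOY_NOMATCH)
            return rc;
        if (*start == '\0')
            return TOY_NOMATCH;
        start++;
    }
}
```

With the pattern "a*a*a*a*b" and a subject of twenty "a" characters, a
small limit makes toy_search() give up with TOY_LIMIT, whereas a large
limit lets it run to completion and report TOY_NOMATCH.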
1452
1453 The callout_data field is used in conjunction with the "callout" fea-
1454 ture, which is described in the pcrecallout documentation.
1455
1456 The tables field is used to pass a character tables pointer to
1457 pcre_exec(); this overrides the value that is stored with the compiled
1458 pattern. A non-NULL value is stored with the compiled pattern only if
1459 custom tables were supplied to pcre_compile() via its tableptr argu-
1460 ment. If NULL is passed to pcre_exec() using this mechanism, it forces
1461 PCRE's internal tables to be used. This facility is helpful when re-
1462 using patterns that have been saved after compiling with an external
1463 set of tables, because the external tables might be at a different
1464 address when pcre_exec() is called. See the pcreprecompile documenta-
1465 tion for a discussion of saving compiled patterns for later use.
1466
1467 Option bits for pcre_exec()
1468
1469 The unused bits of the options argument for pcre_exec() must be zero.
1470 The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL,
1471 PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and PCRE_PARTIAL.
1472
1473 PCRE_ANCHORED
1474
1475 The PCRE_ANCHORED option limits pcre_exec() to matching at the first
1476 matching position. If a pattern was compiled with PCRE_ANCHORED, or
1477 turned out to be anchored by virtue of its contents, it cannot be made
1478 unanchored at matching time.
1479
1480 PCRE_NOTBOL
1481
1482 This option specifies that the first character of the subject string is
1483 not the beginning of a line, so the circumflex metacharacter should not
1484 match before it. Setting this without PCRE_MULTILINE (at compile time)
1485 causes circumflex never to match. This option affects only the behav-
1486 iour of the circumflex metacharacter. It does not affect \A.
1487
1488 PCRE_NOTEOL
1489
1490 This option specifies that the end of the subject string is not the end
1491 of a line, so the dollar metacharacter should not match it nor (except
1492 in multiline mode) a newline immediately before it. Setting this with-
1493 out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1494 option affects only the behaviour of the dollar metacharacter. It does
1495 not affect \Z or \z.
1496
1497 PCRE_NOTEMPTY
1498
1499 An empty string is not considered to be a valid match if this option is
1500 set. If there are alternatives in the pattern, they are tried. If all
1501 the alternatives match the empty string, the entire match fails. For
1502 example, if the pattern
1503
1504 a?b?
1505
1506 is applied to a string not beginning with "a" or "b", it matches the
1507 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
1508 match is not valid, so PCRE searches further into the string for occur-
1509 rences of "a" or "b".
1510
1511 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1512 cial case of a pattern match of the empty string within its split()
1513 function, and when using the /g modifier. It is possible to emulate
1514 Perl's behaviour after matching a null string by first trying the match
1515 again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1516 if that fails by advancing the starting offset (see below) and trying
1517 an ordinary match again. There is some code that demonstrates how to do
1518 this in the pcredemo.c sample program.
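
The retry logic described above can be sketched independently of the
library. In this illustration the matcher is passed in as a function
pointer; match_fn, OPT_NOTEMPTY, OPT_ANCHORED, find_all() and
match_a_star() are hypothetical names, not part of PCRE. In a real
program the matcher would be a thin wrapper around pcre_exec() and the
option bits would be PCRE_NOTEMPTY and PCRE_ANCHORED. The toy matcher
implements only the pattern a* so that the sketch can be run on its own.

```c
#define OPT_NOTEMPTY 0x01   /* hypothetical; stands in for PCRE_NOTEMPTY */
#define OPT_ANCHORED 0x02   /* hypothetical; stands in for PCRE_ANCHORED */

/* Returns 1 and sets ovector[0..1] on a match, or -1 if there is none. */
typedef int (*match_fn)(const char *subject, int length, int startoffset,
                        int options, int *ovector);

/* Toy matcher for the pattern a*, honouring the two option bits. */
int match_a_star(const char *s, int length, int startoffset,
                 int options, int *ovector)
{
    int pos;
    for (pos = startoffset; pos <= length; pos++) {
        int end = pos;
        while (end < length && s[end] == 'a')
            end++;
        if (!((options & OPT_NOTEMPTY) && end == pos)) {
            ovector[0] = pos;
            ovector[1] = end;
            return 1;
        }
        if (options & OPT_ANCHORED)
            break;                 /* only the first position is tried */
    }
    return -1;
}

/* Collect matches the way Perl's /g does. Returns the number stored. */
int find_all(match_fn match, const char *subject, int length,
             int *starts, int *ends, int max_matches)
{
    int ovector[2];
    int count = 0;
    int startoffset = 0;
    int options = 0;

    while (startoffset <= length && count < max_matches) {
        if (match(subject, length, startoffset, options, ovector) < 0) {
            if (options == 0)
                break;             /* ordinary attempt failed: finished */
            startoffset++;         /* retry failed: step over one byte */
            options = 0;
            continue;
        }
        starts[count] = ovector[0];
        ends[count] = ovector[1];
        count++;
        startoffset = ovector[1];
        /* After an empty match, first retry at the same offset with
           both options set. */
        options = (ovector[0] == ovector[1])
                    ? (OPT_NOTEMPTY | OPT_ANCHORED) : 0;
    }
    return count;
}
```

Note that this sketch advances by one byte after a failed retry; in
UTF-8 mode the advance would have to be one whole character.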
1519
1520 PCRE_NO_UTF8_CHECK
1521
1522 When PCRE_UTF8 is set at compile time, the validity of the subject as a
1523 UTF-8 string is automatically checked when pcre_exec() is subsequently
1524 called. The value of startoffset is also checked to ensure that it
1525 points to the start of a UTF-8 character. If an invalid UTF-8 sequence
1526 of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1527 startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is
1528 returned.
1529
1530 If you already know that your subject is valid, and you want to skip
1531 these checks for performance reasons, you can set the
1532 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
1533 do this for the second and subsequent calls to pcre_exec() if you are
1534 making repeated calls to find all the matches in a single subject
1535 string. However, you should be sure that the value of startoffset
1536 points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1537 set, the effect of passing an invalid UTF-8 string as a subject, or a
1538 value of startoffset that does not point to the start of a UTF-8 char-
1539 acter, is undefined. Your program may crash.
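
If PCRE_NO_UTF8_CHECK is to be used, a caller may wish to verify the
starting offset itself. A minimal sketch follows, relying on the fact
that UTF-8 continuation bytes have the bit pattern 10xxxxxx. The function
name is hypothetical, and treating the end of the subject as a valid
position is an assumption of this sketch. It checks only the offset, not
the validity of the whole subject.

```c
/* Returns 1 if offset can be the start of a UTF-8 character within
   subject (or is the end of the subject), and 0 otherwise. */
int utf8_offset_is_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;
    /* A continuation byte (10xxxxxx) cannot start a character. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```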
1540
1541 PCRE_PARTIAL
1542
1543 This option turns on the partial matching feature. If the subject
1544 string fails to match the pattern, but at some point during the match-
1545 ing process the end of the subject was reached (that is, the subject
1546 partially matches the pattern and the failure to match occurred only
1547 because there were not enough subject characters), pcre_exec() returns
1548 PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
1549 used, there are restrictions on what may appear in the pattern. These
1550 are discussed in the pcrepartial documentation.
1551
1552 The string to be matched by pcre_exec()
1553
1554 The subject string is passed to pcre_exec() as a pointer in subject, a
1555 length in length, and a starting byte offset in startoffset. In UTF-8
1556 mode, the byte offset must point to the start of a UTF-8 character.
1557 Unlike the pattern string, the subject may contain binary zero bytes.
1558 When the starting offset is zero, the search for a match starts at the
1559 beginning of the subject, and this is by far the most common case.
1560
1561 A non-zero starting offset is useful when searching for another match
1562 in the same subject by calling pcre_exec() again after a previous suc-
1563 cess. Setting startoffset differs from just passing over a shortened
1564 string and setting PCRE_NOTBOL in the case of a pattern that begins
1565 with any kind of lookbehind. For example, consider the pattern
1566
1567 \Biss\B
1568
1569 which finds occurrences of "iss" in the middle of words. (\B matches
1570 only if the current position in the subject is not a word boundary.)
1571 When applied to the string "Mississippi" the first call to pcre_exec()
1572 finds the first occurrence. If pcre_exec() is called again with just
1573 the remainder of the subject, namely "issippi", it does not match,
1574 because \B is always false at the start of the subject, which is deemed
1575 to be a word boundary. However, if pcre_exec() is passed the entire
1576 string again, but with startoffset set to 4, it finds the second occur-
1577 rence of "iss" because it is able to look behind the starting point to
1578 discover that it is preceded by a letter.
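
The difference can be reproduced with a hand-coded stand-in for the
pattern \Biss\B. This is a sketch for illustration only; the function
names are hypothetical and no PCRE call is involved. The point is that
the test for \B at the starting position inspects the character before
it, which exists only when the whole subject is passed.

```c
#include <ctype.h>
#include <string.h>

/* 1 if pos is inside the subject and holds a letter, digit, or '_'. */
int is_word_at(const char *s, int length, int pos)
{
    return pos >= 0 && pos < length &&
           (isalnum((unsigned char)s[pos]) || s[pos] == '_');
}

/* \B holds at pos when the characters on both sides are of the same
   kind. Positions outside the subject count as non-word characters. */
int not_word_boundary(const char *s, int length, int pos)
{
    return is_word_at(s, length, pos - 1) == is_word_at(s, length, pos);
}

/* Stand-in for matching \Biss\B from startoffset onwards.
   Returns the offset of the match, or -1 if there is none. */
int find_inner_iss(const char *subject, int length, int startoffset)
{
    int pos;
    for (pos = startoffset; pos + 3 <= length; pos++) {
        if (memcmp(subject + pos, "iss", 3) == 0 &&
            not_word_boundary(subject, length, pos) &&
            not_word_boundary(subject, length, pos + 3))
            return pos;
    }
    return -1;
}
```

Searching "Mississippi" from offset 0 finds the first occurrence at
offset 1, and searching the same string from offset 4 finds the second
at offset 4. Passing only the shortened string "issippi" finds nothing,
because the start of the subject is then a word boundary.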
1579
1580 If a non-zero starting offset is passed when the pattern is anchored,
1581 one attempt to match at the given offset is made. This can only succeed
1582 if the pattern does not require the match to be at the start of the
1583 subject.
1584
1585 How pcre_exec() returns captured substrings
1586
1587 In general, a pattern matches a certain portion of the subject, and in
1588 addition, further substrings from the subject may be picked out by
1589 parts of the pattern. Following the usage in Jeffrey Friedl's book,
1590 this is called "capturing" in what follows, and the phrase "capturing
1591 subpattern" is used for a fragment of a pattern that picks out a sub-
1592 string. PCRE supports several other kinds of parenthesized subpattern
1593 that do not cause substrings to be captured.
1594
1595 Captured substrings are returned to the caller via a vector of integer
1596 offsets whose address is passed in ovector. The number of elements in
1597 the vector is passed in ovecsize, which must be a non-negative number.
1598 Note: this argument is NOT the size of ovector in bytes.
1599
1600 The first two-thirds of the vector is used to pass back captured sub-
1601 strings, each substring using a pair of integers. The remaining third
1602 of the vector is used as workspace by pcre_exec() while matching cap-
1603 turing subpatterns, and is not available for passing back information.
1604 The length passed in ovecsize should always be a multiple of three. If
1605 it is not, it is rounded down.
1606
1607 When a match is successful, information about captured substrings is
1608 returned in pairs of integers, starting at the beginning of ovector,
1609 and continuing up to two-thirds of its length at the most. The first
1610 element of a pair is set to the offset of the first character in a sub-
1611 string, and the second is set to the offset of the first character
1612 after the end of a substring. The first pair, ovector[0] and ovec-
1613 tor[1], identify the portion of the subject string matched by the
1614 entire pattern. The next pair is used for the first capturing subpat-
1615 tern, and so on. The value returned by pcre_exec() is the number of
1616 pairs that have been set. If there are no capturing subpatterns, the
1617 return value from a successful match is 1, indicating that just the
1618 first pair of offsets has been set.
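
Because each pair holds a start offset and the offset just past the end,
the length of a substring is the difference between the two values. The
helper below is a sketch with a hypothetical name, copy_pair(); it takes
the return value of pcre_exec() as rc and copies one pair out of the
subject as a C string.

```c
#include <string.h>

/* Copy pair n (0 is the whole match) into buf as a C string.
   Returns the length, or -1 if n is not one of the rc pairs that were
   set, the pair is unset, or buf is too small. */
int copy_pair(const char *subject, const int *ovector, int rc, int n,
              char *buf, int bufsize)
{
    int start, len;
    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];
    if (start < 0)
        return -1;
    len = ovector[2 * n + 1] - start;   /* end offset minus start offset */
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For example, if the subject "key=value" is matched against the pattern
(\w+)=(\w+), the return value is 3 and the first six elements of ovector
are 0, 9, 0, 3, 4, 9. Pair 2 then yields "value", of length 5.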
1619
1620 Some convenience functions are provided for extracting the captured
1621 substrings as separate strings. These are described in the following
1622 section.
1623
1624 It is possible for capturing subpattern number n+1 to match some part
1625 of the subject when subpattern n has not been used at all. For example,
1626 if the string "abc" is matched against the pattern (a|(z))(bc), then
1627 subpatterns 1 and 3 are matched, but 2 is not. When this happens, both
1628 offset values corresponding to the unused subpattern are set to -1.
1629
1630 If a capturing subpattern is matched repeatedly, it is the last portion
1631 of the string that it matched that is returned.
1632
1633 If the vector is too small to hold all the captured substring offsets,
1634 it is used as far as possible (up to two-thirds of its length), and the
1635 function returns a value of zero. In particular, if the substring off-
1636 sets are not of interest, pcre_exec() may be called with ovector passed
1637 as NULL and ovecsize as zero. However, if the pattern contains back
1638 references and the ovector is not big enough to remember the related
1639 substrings, PCRE has to get additional memory for use during matching.
1640 Thus it is usually advisable to supply an ovector.
1641
1642 Note that pcre_info() can be used to find out how many capturing sub-
1643 patterns there are in a compiled pattern. The smallest size for ovector
1644 that will allow for n captured substrings, in addition to the offsets
1645 of the substring matched by the whole pattern, is (n+1)*3.
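
The sizing arithmetic can be written down as two small helpers. The
names are hypothetical; the formulas are those given above (only the
first two-thirds of the vector is returned, and a size that is not a
multiple of three is rounded down).

```c
/* Smallest ovecsize that can return capture_count captured substrings
   in addition to the offsets of the whole match. */
int min_ovecsize(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that a vector of ovecsize elements can pass
   back: two-thirds of the elements, two elements per pair. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

The vector of 30 elements used in the earlier example call therefore has
room for the whole match and nine captured substrings.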
1646
1647 Return values from pcre_exec()
1648
1649 If pcre_exec() fails, it returns a negative number. The following are
1650 defined in the header file:
1651
1652 PCRE_ERROR_NOMATCH (-1)
1653
1654 The subject string did not match the pattern.
1655
1656 PCRE_ERROR_NULL (-2)
1657
1658 Either code or subject was passed as NULL, or ovector was NULL and
1659 ovecsize was not zero.
1660
1661 PCRE_ERROR_BADOPTION (-3)
1662
1663 An unrecognized bit was set in the options argument.
1664
1665 PCRE_ERROR_BADMAGIC (-4)
1666
1667 PCRE stores a 4-byte "magic number" at the start of the compiled code,
1668 to catch the case when it is passed a junk pointer and to detect when a
1669 pattern that was compiled in an environment of one endianness is run in
1670 an environment with the other endianness. This is the error that PCRE
1671 gives when the magic number is not present.
1672
1673 PCRE_ERROR_UNKNOWN_NODE (-5)
1674
1675 While running the pattern match, an unknown item was encountered in the
1676 compiled pattern. This error could be caused by a bug in PCRE or by
1677 overwriting of the compiled pattern.
1678
1679 PCRE_ERROR_NOMEMORY (-6)
1680
1681 If a pattern contains back references, but the ovector that is passed
1682 to pcre_exec() is not big enough to remember the referenced substrings,
1683 PCRE gets a block of memory at the start of matching to use for this
1684 purpose. If the call via pcre_malloc() fails, this error is given. The
1685 memory is automatically freed at the end of matching.
1686
1687 PCRE_ERROR_NOSUBSTRING (-7)
1688
1689 This error is used by the pcre_copy_substring(), pcre_get_substring(),
1690 and pcre_get_substring_list() functions (see below). It is never
1691 returned by pcre_exec().
1692
1693 PCRE_ERROR_MATCHLIMIT (-8)
1694
1695 The recursion and backtracking limit, as specified by the match_limit
1696 field in a pcre_extra structure (or defaulted) was reached. See the
1697 description above.
1698
1699 PCRE_ERROR_CALLOUT (-9)
1700
1701 This error is never generated by pcre_exec() itself. It is provided for
1702 use by callout functions that want to yield a distinctive error code.
1703 See the pcrecallout documentation for details.
1704
1705 PCRE_ERROR_BADUTF8 (-10)
1706
1707 A string that contains an invalid UTF-8 byte sequence was passed as a
1708 subject.
1709
1710 PCRE_ERROR_BADUTF8_OFFSET (-11)
1711
1712 The UTF-8 byte sequence that was passed as a subject was valid, but the
1713 value of startoffset did not point to the beginning of a UTF-8 charac-
1714 ter.
1715
1716 PCRE_ERROR_PARTIAL (-12)
1717
1718 The subject string did not match, but it did match partially. See the
1719 pcrepartial documentation for details of partial matching.
1720
1721 PCRE_ERROR_BADPARTIAL (-13)
1722
1723 The PCRE_PARTIAL option was used with a compiled pattern containing
1724 items that are not supported for partial matching. See the pcrepartial
1725 documentation for details of partial matching.
1726
1727 PCRE_ERROR_INTERNAL (-14)
1728
1729 An unexpected internal error has occurred. This error could be caused
1730 by a bug in PCRE or by overwriting of the compiled pattern.
1731
1732 PCRE_ERROR_BADCOUNT (-15)
1733
1734 This error is given if the value of the ovecsize argument is negative.
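
A caller will often want to turn the return value into a message. The
following sketch uses the numeric values listed above so that it stands
alone; the function name and the message texts are hypothetical, and in
a real program the PCRE_ERROR_ names from the header file would be used
in place of the literal numbers.

```c
/* Map a pcre_exec() return value to a short description. */
const char *describe_exec_result(int rc)
{
    if (rc > 0)
        return "match";
    if (rc == 0)
        return "match, but ovector too small for all substrings";
    switch (rc) {
    case  -1: return "no match";                  /* PCRE_ERROR_NOMATCH */
    case  -2: return "NULL argument";             /* PCRE_ERROR_NULL */
    case  -3: return "unrecognized option bit";   /* PCRE_ERROR_BADOPTION */
    case  -4: return "magic number missing";      /* PCRE_ERROR_BADMAGIC */
    case  -5: return "unknown item in pattern";   /* PCRE_ERROR_UNKNOWN_NODE */
    case  -6: return "out of memory";             /* PCRE_ERROR_NOMEMORY */
    case  -7: return "no such substring";         /* PCRE_ERROR_NOSUBSTRING */
    case  -8: return "match limit reached";       /* PCRE_ERROR_MATCHLIMIT */
    case  -9: return "callout error";             /* PCRE_ERROR_CALLOUT */
    case -10: return "invalid UTF-8 in subject";  /* PCRE_ERROR_BADUTF8 */
    case -11: return "bad UTF-8 start offset";    /* PCRE_ERROR_BADUTF8_OFFSET */
    case -12: return "partial match";             /* PCRE_ERROR_PARTIAL */
    case -13: return "pattern unsuitable for partial matching";
                                                  /* PCRE_ERROR_BADPARTIAL */
    case -14: return "internal error";            /* PCRE_ERROR_INTERNAL */
    case -15: return "negative ovecsize";         /* PCRE_ERROR_BADCOUNT */
    default:  return "unknown error";
    }
}
```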
1735
1736
1737 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
1738
1739 int pcre_copy_substring(const char *subject, int *ovector,
1740 int stringcount, int stringnumber, char *buffer,
1741 int buffersize);
1742
1743 int pcre_get_substring(const char *subject, int *ovector,
1744 int stringcount, int stringnumber,
1745 const char **stringptr);
1746
1747 int pcre_get_substring_list(const char *subject,
1748 int *ovector, int stringcount, const char ***listptr);
1749
1750 Captured substrings can be accessed directly by using the offsets
1751 returned by pcre_exec() in ovector. For convenience, the functions
1752 pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
1753 string_list() are provided for extracting captured substrings as new,
1754 separate, zero-terminated strings. These functions identify substrings
1755 by number. The next section describes functions for extracting named
1756 substrings. A substring that contains a binary zero is correctly
1757 extracted and has a further zero added on the end, but the result is
1758 not, of course, a C string.
1759
1760 The first three arguments are the same for all three of these func-
1761 tions: subject is the subject string that has just been successfully
1762 matched, ovector is a pointer to the vector of integer offsets that was
1763 passed to pcre_exec(), and stringcount is the number of substrings that
1764 were captured by the match, including the substring that matched the
1765 entire regular expression. This is the value returned by pcre_exec() if
1766 it is greater than zero. If pcre_exec() returned zero, indicating that
1767 it ran out of space in ovector, the value passed as stringcount should
1768 be the number of elements in the vector divided by three.
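
The rule for choosing stringcount can be expressed as a small helper.
The function name is hypothetical; rc is the value returned by
pcre_exec() and ovecsize is the number of elements that was passed to
it.

```c
/* Value to pass as stringcount to the substring extraction functions. */
int stringcount_from(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;             /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;   /* ovector was filled as far as possible */
    return 0;                  /* the match failed: nothing to extract */
}
```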
1769
1770 The functions pcre_copy_substring() and pcre_get_substring() extract a
1771 single substring, whose number is given as stringnumber. A value of
1772 zero extracts the substring that matched the entire pattern, whereas
1773 higher values extract the captured substrings. For pcre_copy_sub-
1774 string(), the string is placed in buffer, whose length is given by
1775 buffersize, while for pcre_get_substring() a new block of memory is
1776 obtained via pcre_malloc, and its address is returned via stringptr.
1777 The yield of the function is the length of the string, not including
1778 the terminating zero, or one of
1779
1780 PCRE_ERROR_NOMEMORY (-6)
1781
1782 The buffer was too small for pcre_copy_substring(), or the attempt to
1783 get memory failed for pcre_get_substring().
1784
1785 PCRE_ERROR_NOSUBSTRING (-7)
1786
1787 There is no substring whose number is stringnumber.
1788
1789 The pcre_get_substring_list() function extracts all available sub-
1790 strings and builds a list of pointers to them. All this is done in a
1791 single block of memory that is obtained via pcre_malloc. The address of
1792 the memory block is returned via listptr, which is also the start of
1793 the list of string pointers. The end of the list is marked by a NULL
1794 pointer. The yield of the function is zero if all went well, or
1795
1796 PCRE_ERROR_NOMEMORY (-6)
1797
1798 if the attempt to get the memory block failed.
1799
1800 When any of these functions encounters a substring that is unset, which
1801 can happen when capturing subpattern number n+1 matches some part of
1802 the subject, but subpattern n has not been used at all, they return an
1803 empty string. This can be distinguished from a genuine zero-length sub-
1804 string by inspecting the appropriate offset in ovector, which is nega-
1805 tive for unset substrings.
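
A sketch of that test follows. The enumeration and the function name are
hypothetical. Treating a number at or beyond stringcount as unset is an
assumption of this sketch.

```c
enum capture_state { CAPTURE_UNSET, CAPTURE_EMPTY, CAPTURE_NONEMPTY };

/* Classify substring n using the offsets left in ovector. */
enum capture_state classify_capture(const int *ovector, int stringcount,
                                    int n)
{
    if (n < 0 || n >= stringcount || ovector[2 * n] < 0)
        return CAPTURE_UNSET;           /* offsets are -1 when unset */
    if (ovector[2 * n] == ovector[2 * n + 1])
        return CAPTURE_EMPTY;           /* genuine zero-length substring */
    return CAPTURE_NONEMPTY;
}
```

For the string "abc" matched against (a|(z))(bc), the offsets are 0, 3,
0, 1, -1, -1, 1, 3, so substring 2 is classified as unset while
substrings 1 and 3 are non-empty.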
1806
1807 The two convenience functions pcre_free_substring() and pcre_free_sub-
1808 string_list() can be used to free the memory returned by a previous
1809 call of pcre_get_substring() or pcre_get_substring_list(), respec-
1810 tively. They do nothing more than call the function pointed to by
1811 pcre_free, which of course could be called directly from a C program.
1812 However, PCRE is used in some situations where it is linked via a spe-
1813 cial interface to another programming language which cannot use
1814 pcre_free directly; it is for these cases that the functions are pro-
1815 vided.
1816
1817
1818 EXTRACTING CAPTURED SUBSTRINGS BY NAME
1819
1820 int pcre_get_stringnumber(const pcre *code,
1821 const char *name);
1822
1823 int pcre_copy_named_substring(const pcre *code,
1824 const char *subject, int *ovector,
1825 int stringcount, const char *stringname,
1826 char *buffer, int buffersize);
1827
1828 int pcre_get_named_substring(const pcre *code,
1829 const char *subject, int *ovector,
1830 int stringcount, const char *stringname,
1831 const char **stringptr);
1832
1833 To extract a substring by name, you first have to find the associated
1834 number. For example, for this pattern
1835
1836 (a+)b(?<xxx>\d+)...
1837
1838 the number of the subpattern called "xxx" is 2. You can find the number
1839 from the name by calling pcre_get_stringnumber(). The first argument is
1840 the compiled pattern, and the second is the name. The yield of the
1841 function is the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if
1842 there is no subpattern of that name.
1843
1844 Given the number, you can extract the substring directly, or use one of
1845 the functions described in the previous section. For convenience, there
1846 are also two functions that do the whole job.
1847
1848 Most of the arguments of pcre_copy_named_substring() and
1849 pcre_get_named_substring() are the same as those for the similarly
1850 named functions that extract by number. As these are described in the
1851 previous section, they are not re-described here. There are just two
1852 differences:
1853
1854 First, instead of a substring number, a substring name is given. Sec-
1855 ond, there is an extra argument, given at the start, which is a pointer
1856 to the compiled pattern. This is needed in order to gain access to the
1857 name-to-number translation table.
1858
1859 These functions call pcre_get_stringnumber(), and if it succeeds, they
1860 then call pcre_copy_substring() or pcre_get_substring(), as appropri-
1861 ate.
1862
1863
1864 FINDING ALL POSSIBLE MATCHES
1865
1866 The traditional matching function uses a similar algorithm to Perl,
1867 which stops when it finds the first match, starting at a given point in
1868 the subject. If you want to find all possible matches, or the longest
1869 possible match, consider using the alternative matching function (see
1870 below) instead. If you cannot use the alternative function, but still
1871 need to find all possible matches, you can kludge it up by making use
1872 of the callout facility, which is described in the pcrecallout documen-
1873 tation.
1874
1875 What you have to do is to insert a callout right at the end of the pat-
1876 tern. When your callout function is called, extract and save the cur-
1877 rent matched substring. Then return 1, which forces pcre_exec() to
1878 backtrack and try other alternatives. Ultimately, when it runs out of
1879 matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
1880
1881
1882 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
1883
1884 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
1885 const char *subject, int length, int startoffset,
1886 int options, int *ovector, int ovecsize,
1887 int *workspace, int wscount);
1888
1889 The function pcre_dfa_exec() is called to match a subject string
1890 against a compiled pattern, using a "DFA" matching algorithm. This has
1891 different characteristics to the normal algorithm, and is not compati-
1892 ble with Perl. Some of the features of PCRE patterns are not supported.
1893 Nevertheless, there are times when this kind of matching can be useful.
1894 For a discussion of the two matching algorithms, see the pcrematching
1895 documentation.
1896
1897 The arguments for the pcre_dfa_exec() function are the same as for
1898 pcre_exec(), plus two extras. The ovector argument is used in a differ-
1899 ent way, and this is described below. The other common arguments are
1900 used in the same way as for pcre_exec(), so their description is not
1901 repeated here.
1902
1903 The two additional arguments provide workspace for the function. The
1904 workspace vector should contain at least 20 elements. It is used for
1905 keeping track of multiple paths through the pattern tree. More
1906 workspace will be needed for patterns and subjects where there are a
1907 lot of possible matches.
1908
1909 Here is an example of a simple call to pcre_dfa_exec():
1910
1911 int rc;
1912 int ovector[10];
1913 int wspace[20];
1914 rc = pcre_dfa_exec(
1915 re, /* result of pcre_compile() */
1916 NULL, /* we didn't study the pattern */
1917 "some string", /* the subject string */
1918 11, /* the length of the subject string */
1919 0, /* start at offset 0 in the subject */
1920 0, /* default options */
1921 ovector, /* vector of integers for substring information */
1922 10, /* number of elements (NOT size in bytes) */
1923 wspace, /* working space vector */
1924 20); /* number of elements (NOT size in bytes) */
1925
1926 Option bits for pcre_dfa_exec()
1927
1928 The unused bits of the options argument for pcre_dfa_exec() must be
1929 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NOTBOL,
1930 PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL,
1931 PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last three of
1932 these are the same as for pcre_exec(), so their description is not
1933 repeated here.
1934
1935 PCRE_PARTIAL
1936
1937 This has the same general effect as it does for pcre_exec(), but the
1938 details are slightly different. When PCRE_PARTIAL is set for
1939 pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into
1940 PCRE_ERROR_PARTIAL if the end of the subject is reached, there have
1941 been no complete matches, but there is still at least one matching pos-
1942 sibility. The portion of the string that provided the partial match is
1943 set as the first matching string.
1944
1945 PCRE_DFA_SHORTEST
1946
1947 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
1948 stop as soon as it has found one match. Because of the way the DFA
1949 algorithm works, this is necessarily the shortest possible match at the
1950 first possible matching point in the subject string.
1951
1952 PCRE_DFA_RESTART
1953
1954 When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and
1955 returns a partial match, it is possible to call it again, with addi-
1956 tional subject characters, and have it continue with the same match.
1957 The PCRE_DFA_RESTART option requests this action; when it is set, the
1958 workspace and wscount arguments must reference the same vector as before
1959 because data about the match so far is left in them after a partial
1960 match. There is more discussion of this facility in the pcrepartial
1961 documentation.
1962
1963 Successful returns from pcre_dfa_exec()
1964
1965 When pcre_dfa_exec() succeeds, it may have matched more than one sub-
1966 string in the subject. Note, however, that all the matches from one run
1967 of the function start at the same point in the subject. The shorter
1968 matches are all initial substrings of the longer matches. For example,
1969 if the pattern
1970
1971 <.*>
1972
1973 is matched against the string
1974
1975 This is <something> <something else> <something further> no more
1976
1977 the three matched strings are
1978
1979 <something>
1980 <something> <something else>
1981 <something> <something else> <something further>
1982
1983 On success, the yield of the function is a number greater than zero,
1984 which is the number of matched substrings. The substrings themselves
1985 are returned in ovector. Each string uses two elements; the first is
1986 the offset to the start, and the second is the offset to the end. All
1987 the strings have the same start offset. (Space could have been saved by
1988 giving this only once, but it was decided to retain some compatibility
1989 with the way pcre_exec() returns data, even though the meaning of the
1990 strings is different.)
1991
1992 The strings are returned in reverse order of length; that is, the long-
1993 est matching string is given first. If there were too many matches to
1994 fit into ovector, the yield of the function is zero, and the vector is
1995 filled with the longest matches.
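
Reading the results back is a matter of walking the pairs in order. The
helpers below are a sketch with hypothetical names; rc is a positive
return value from pcre_dfa_exec() and ovector is the vector it filled.

```c
#include <stdio.h>

/* Length of the i-th returned match (0 is the longest), or -1 if i is
   not one of the rc matches. */
int dfa_match_length(const int *ovector, int rc, int i)
{
    if (i < 0 || i >= rc)
        return -1;
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Print every returned match, longest first, one per line. */
void print_dfa_matches(const char *subject, const int *ovector, int rc)
{
    int i;
    for (i = 0; i < rc; i++)
        printf("%.*s\n", dfa_match_length(ovector, rc, i),
               subject + ovector[2 * i]);
}
```

For the <.*> example above, all three matches start at offset 8, so the
pairs are 8, 56, then 8, 36, then 8, 19, giving lengths of 48, 28 and
11.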
1996
1997 Error returns from pcre_dfa_exec()
1998
1999 The pcre_dfa_exec() function returns a negative number when it fails.
2000 Many of the errors are the same as for pcre_exec(), and these are
2001 described above. There are in addition the following errors that are
2002 specific to pcre_dfa_exec():
2003
2004 PCRE_ERROR_DFA_UITEM (-16)
2005
2006 This return is given if pcre_dfa_exec() encounters an item in the pat-
2007 tern that it does not support, for instance, the use of \C or a back
2008 reference.
2009
2010 PCRE_ERROR_DFA_UCOND (-17)
2011
2012 This return is given if pcre_dfa_exec() encounters a condition item in
2013 a pattern that uses a back reference for the condition. This is not
2014 supported.
2015
2016 PCRE_ERROR_DFA_UMLIMIT (-18)
2017
2018 This return is given if pcre_dfa_exec() is called with an extra block
2019 that contains a setting of the match_limit field. This is not supported
2020 (it is meaningless).
2021
2022 PCRE_ERROR_DFA_WSSIZE (-19)
2023
2024 This return is given if pcre_dfa_exec() runs out of space in the
2025 workspace vector.
2026
2027 PCRE_ERROR_DFA_RECURSE (-20)
2028
2029 When a recursive subpattern is processed, the matching function calls
2030 itself recursively, using private vectors for ovector and workspace.
2031 This error is given if the output vector is not large enough. This
2032 should be extremely rare, as a vector of size 1000 is used.
2033
2034 Last updated: 16 May 2005
2035 Copyright (c) 1997-2005 University of Cambridge.
2036 -----------------------------------------------------------------------------
2037
2038
2039
2040 NAME
2041 PCRE - Perl-compatible regular expressions
2042
2043
2044 PCRE CALLOUTS
2045
2046 int (*pcre_callout)(pcre_callout_block *);
2047
2048 PCRE provides a feature called "callout", which is a means of temporar-
2049 ily passing control to the caller of PCRE in the middle of pattern
2050 matching. The caller of PCRE provides an external function by putting
2051 its entry point in the global variable pcre_callout. By default, this
2052 variable contains NULL, which disables all calling out.
2053
2054 Within a regular expression, (?C) indicates the points at which the
2055 external function is to be called. Different callout points can be
2056 identified by putting a number less than 256 after the letter C. The
2057 default value is zero. For example, this pattern has two callout
2058 points:
2059
2060 (?C1)eabc(?C2)def
2061
2062 If the PCRE_AUTO_CALLOUT option bit is set when pcre_compile() is
2063 called, PCRE automatically inserts callouts, all with number 255,
2064 before each item in the pattern. For example, if PCRE_AUTO_CALLOUT is
2065 used with the pattern
2066
2067 A(\d{2}|--)
2068
2069 it is processed as if it were
2070
2071 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
2072
2073 Notice that there is a callout before and after each parenthesis and
2074 alternation bar. Automatic callouts can be used for tracking the
2075 progress of pattern matching. The pcretest command has an option that
2076 sets automatic callouts; when it is used, the output indicates how the
2077 pattern is matched. This is useful information when you are trying to
2078 optimize the performance of a particular pattern.
2079
2080
2081 MISSING CALLOUTS
2082
2083 You should be aware that, because of optimizations in the way PCRE
2084 matches patterns, callouts sometimes do not happen. For example, if the
2085 pattern is
2086
2087 ab(?C4)cd
2088
2089 PCRE knows that any matching string must contain the letter "d". If the
2090 subject string is "abyz", the lack of "d" means that matching doesn't
2091 ever start, and the callout is never reached. However, with "abyd",
2092 though the result is still no match, the callout is obeyed.
2093
2094
2095 THE CALLOUT INTERFACE
2096
2097 During matching, when PCRE reaches a callout point, the external func-
2098 tion defined by pcre_callout is called (if it is set). This applies to
2099 both the pcre_exec() and the pcre_dfa_exec() matching functions. The
2100 only argument to the callout function is a pointer to a pcre_callout
2101 block. This structure contains the following fields:
2102
2103 int version;
2104 int callout_number;
2105 int *offset_vector;
2106 const char *subject;
2107 int subject_length;
2108 int start_match;
2109 int current_position;
2110 int capture_top;
2111 int capture_last;
2112 void *callout_data;
2113 int pattern_position;
2114 int next_item_length;
2115
2116 The version field is an integer containing the version number of the
2117 block format. The initial version was 0; the current version is 1. The
2118 version number will change again in future if additional fields are
2119 added, but the intention is never to remove any of the existing fields.
2120
2121 The callout_number field contains the number of the callout, as com-
2122 piled into the pattern (that is, the number after ?C for manual call-
2123 outs, and 255 for automatically generated callouts).
2124
2125 The offset_vector field is a pointer to the vector of offsets that was
2126 passed by the caller to pcre_exec() or pcre_dfa_exec(). When
2127 pcre_exec() is used, the contents can be inspected in order to extract
2128 substrings that have been matched so far, in the same way as for
2129 extracting substrings after a match has completed. For pcre_dfa_exec()
2130 this field is not useful.
2131
2132 The subject and subject_length fields contain copies of the values that
2133 were passed to pcre_exec().
2134
2135 The start_match field contains the offset within the subject at which
2136 the current match attempt started. If the pattern is not anchored, the
2137 callout function may be called several times from the same point in the
2138 pattern for different starting points in the subject.
2139
2140 The current_position field contains the offset within the subject of
2141 the current match pointer.
2142
2143 When the pcre_exec() function is used, the capture_top field contains
2144 one more than the number of the highest numbered captured substring so
2145 far. If no substrings have been captured, the value of capture_top is
2146 one. This is always the case when pcre_dfa_exec() is used, because it
2147 does not support captured substrings.
2148
2149 The capture_last field contains the number of the most recently cap-
2150 tured substring. If no substrings have been captured, its value is -1.
2151 This is always the case when pcre_dfa_exec() is used.
2152
2153 The callout_data field contains a value that is passed to pcre_exec()
2154 or pcre_dfa_exec() specifically so that it can be passed back in
2155 callouts. It is passed in the callout_data field of the pcre_extra data
2156 structure. If no such data was passed, the value of callout_data in a
2157 pcre_callout block is NULL. There is a description of the pcre_extra
2158 structure in the pcreapi documentation.
2159
2160 The pattern_position field is present from version 1 of the pcre_call-
2161 out structure. It contains the offset to the next item to be matched in
2162 the pattern string.
2163
2164 The next_item_length field is present from version 1 of the pcre_call-
2165 out structure. It contains the length of the next item to be matched in
2166 the pattern string. When the callout immediately precedes an alterna-
2167 tion bar, a closing parenthesis, or the end of the pattern, the length
2168 is zero. When the callout precedes an opening parenthesis, the length
2169 is that of the entire subpattern.
2170
2171 The pattern_position and next_item_length fields are intended to help
2172 in distinguishing between different automatic callouts, which all have
2173 the same callout number. However, they are set for all callouts.
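
As a rough illustration of how a callout function might use the offset_vector and capture_top fields under pcre_exec(), the following sketch copies a substring captured so far into a caller-supplied buffer. It assumes the usual layout of the offsets vector, in which group n occupies elements 2n and 2n+1 and unset entries are negative. The name copy_capture_so_far() and its parameters are invented for this example and are not part of PCRE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function.
   Copies capturing group n (n >= 1) into buf as a zero-terminated string.
   Returns the length copied, or -1 if the group is not yet captured,
   is unset, or does not fit into buf. */
int copy_capture_so_far(const char *subject, const int *offset_vector,
                        int capture_top, int n, char *buf, size_t bufsize)
{
  int start, end;
  size_t len;

  assert(subject != NULL && offset_vector != NULL && buf != NULL);

  /* capture_top is one more than the highest captured group so far */
  if (n < 1 || n >= capture_top) return -1;

  start = offset_vector[2 * n];
  end = offset_vector[2 * n + 1];
  if (start < 0 || end < start) return -1;   /* group is unset */

  len = (size_t)(end - start);
  if (len + 1 > bufsize) return -1;          /* no room for the terminator */

  memcpy(buf, subject + start, len);
  buf[len] = '\0';
  return (int)len;
}
```

Inside a real callout function the first three arguments would come from the subject, offset_vector, and capture_top fields of the block.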
2174
2175
2176 RETURN VALUES
2177
2178 The external callout function returns an integer to PCRE. If the value
2179 is zero, matching proceeds as normal. If the value is greater than
2180 zero, matching fails at the current point, but the testing of other
2181 matching possibilities goes ahead, just as if a lookahead assertion had
2182 failed. If the value is less than zero, the match is abandoned, and
2183 pcre_exec() (or pcre_dfa_exec()) returns the negative value.
2184
2185 Negative values should normally be chosen from the set of
2186 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
2187 dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
2188 reserved for use by callout functions; it will never be used by PCRE
2189 itself.
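
The three kinds of return value can be sketched as follows. So that the example compiles without pcre.h, it declares demo_callout_block, a local copy of the field list given above; real code would include pcre.h, use the library's own block type, and store the function's address in pcre_callout. The names demo_state, demo_callout(), and DEMO_ABANDON are assumptions made for this illustration only, as is the policy the function applies.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-in for the callout block, mirroring the documented fields. */
typedef struct demo_callout_block {
  int version;
  int callout_number;
  int *offset_vector;
  const char *subject;
  int subject_length;
  int start_match;
  int current_position;
  int capture_top;
  int capture_last;
  void *callout_data;
  int pattern_position;    /* version 1 and later */
  int next_item_length;    /* version 1 and later */
} demo_callout_block;

/* Placeholder negative value; real code would normally return one of
   the PCRE_ERROR_xxx values, such as PCRE_ERROR_CALLOUT. */
#define DEMO_ABANDON (-999)

/* Hypothetical per-match state, reached through callout_data. */
typedef struct demo_state {
  int calls;
  int max_calls;
} demo_state;

int demo_callout(demo_callout_block *cb)
{
  demo_state *st;

  assert(cb != NULL);
  st = (demo_state *)cb->callout_data;

  /* Less than zero: abandon the whole match once a call budget is spent. */
  if (st != NULL && ++st->calls > st->max_calls)
    return DEMO_ABANDON;

  /* Greater than zero: fail at this point only, so that other matching
     possibilities are still tried. Here, callout 2 rejects any attempt
     that has already consumed more than three characters. */
  if (cb->callout_number == 2 &&
      cb->current_position - cb->start_match > 3)
    return 1;

  return 0;   /* zero: matching proceeds as normal */
}
```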
2190
2191 Last updated: 28 February 2005
2192 Copyright (c) 1997-2005 University of Cambridge.
2193 -----------------------------------------------------------------------------
2194
2195
2196
2197 NAME
2198 PCRE - Perl-compatible regular expressions
2199
2200
2201 DIFFERENCES BETWEEN PCRE AND PERL
2202
2203 This document describes the differences in the ways that PCRE and Perl
2204 handle regular expressions. The differences described here are with
2205 respect to Perl 5.8.
2206
2207 1. PCRE does not have full UTF-8 support. Details of what it does have
2208 are given in the section on UTF-8 support in the main pcre page.
2209
2210 2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2211 permits them, but they do not mean what you might think. For example,
2212 (?!a){3} does not assert that the next three characters are not "a". It
2213 just asserts that the next character is not "a" three times.
2214
2215 3. Capturing subpatterns that occur inside negative lookahead asser-
2216 tions are counted, but their entries in the offsets vector are never
2217 set. Perl sets its numerical variables from any such patterns that are
2218 matched before the assertion fails to match something (thereby succeed-
2219 ing), but only if the negative lookahead assertion contains just one
2220 branch.
2221
2222 4. Though binary zero characters are supported in the subject string,
2223 they are not allowed in a pattern string because it is passed as a nor-
2224 mal C string, terminated by zero. The escape sequence \0 can be used in
2225 the pattern to represent a binary zero.
2226
2227 5. The following Perl escape sequences are not supported: \l, \u, \L,
2228 \U, and \N. In fact these are implemented by Perl's general string-han-
2229 dling and are not part of its pattern matching engine. If any of these
2230 are encountered by PCRE, an error is generated.
2231
2232 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
2233 is built with Unicode character property support. The properties that
2234 can be tested with \p and \P are limited to the general category prop-
2235 erties such as Lu and Nd.
2236
2237 7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2238 ters in between are treated as literals. This is slightly different
2239 from Perl in that $ and @ are also handled as literals inside the
2240 quotes. In Perl, they cause variable interpolation (but of course PCRE
2241 does not have variables). Note the following examples:
2242
2243 Pattern PCRE matches Perl matches
2244
2245 \Qabc$xyz\E abc$xyz abc followed by the
2246 contents of $xyz
2247 \Qabc\$xyz\E abc\$xyz abc\$xyz
2248 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
2249
2250 The \Q...\E sequence is recognized both inside and outside character
2251 classes.
2252
2253 8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
2254 constructions. However, there is support for recursive patterns using
2255 the non-Perl items (?R), (?number), and (?P>name). Also, the PCRE
2256 "callout" feature allows an external function to be called during pat-
2257 tern matching. See the pcrecallout documentation for details.
2258
2259 9. There are some differences that are concerned with the settings of
2260 captured strings when part of a pattern is repeated. For example,
2261 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
2262 unset, but in PCRE it is set to "b".
2263
2264 10. PCRE provides some extensions to the Perl regular expression facil-
2265 ities:
2266
2267 (a) Although lookbehind assertions must match fixed length strings,
2268 each alternative branch of a lookbehind assertion can match a different
2269 length of string. Perl requires them all to have the same length.
2270
2271 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
2272 meta-character matches only at the very end of the string.
2273
2274 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2275 cial meaning is faulted.
2276
2277 (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
2278 fiers is inverted, that is, by default they are not greedy, but if fol-
2279 lowed by a question mark they are.
2280
2281 (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2282 tried only at the first matching position in the subject string.
2283
2284 (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2285 TURE options for pcre_exec() have no Perl equivalents.
2286
2287 (g) The (?R), (?number), and (?P>name) constructs allow for recursive
2288 pattern matching (Perl can do this using the (?p{code}) construct,
2289 which PCRE cannot support.)
2290
2291 (h) PCRE supports named capturing substrings, using the Python syntax.
2292
2293 (i) PCRE supports the possessive quantifier "++" syntax, taken from
2294 Sun's Java package.
2295
2296 (j) The (R) condition, for testing recursion, is a PCRE extension.
2297
2298 (k) The callout facility is PCRE-specific.
2299
2300 (l) The partial matching facility is PCRE-specific.
2301
2302 (m) Patterns compiled by PCRE can be saved and re-used at a later time,
2303 even on different hosts that have the other endianness.
2304
2305 (n) The alternative matching function (pcre_dfa_exec()) matches in a
2306 different way and is not Perl-compatible.
2307
2308 Last updated: 28 February 2005
2309 Copyright (c) 1997-2005 University of Cambridge.
2310 -----------------------------------------------------------------------------
2311
2312
2313
2314 NAME
2315 PCRE - Perl-compatible regular expressions
2316
2317
2318 PCRE REGULAR EXPRESSION DETAILS
2319
2320 The syntax and semantics of the regular expressions supported by PCRE
2321 are described below. Regular expressions are also described in the Perl
2322 documentation and in a number of books, some of which have copious
2323 examples. Jeffrey Friedl's "Mastering Regular Expressions", published
2324 by O'Reilly, covers regular expressions in great detail. This descrip-
2325 tion of PCRE's regular expressions is intended as reference material.
2326
2327 The original operation of PCRE was on strings of one-byte characters.
2328 However, there is now also support for UTF-8 character strings. To use
2329 this, you must build PCRE to include UTF-8 support, and then call
2330 pcre_compile() with the PCRE_UTF8 option. How this affects pattern
2331 matching is mentioned in several places below. There is also a summary
2332 of UTF-8 features in the section on UTF-8 support in the main pcre
2333 page.
2334
2335 The remainder of this document discusses the patterns that are sup-
2336 ported by PCRE when its main matching function, pcre_exec(), is used.
2337 From release 6.0, PCRE offers a second matching function,
2338 pcre_dfa_exec(), which matches using a different algorithm that is not
2339 Perl-compatible. The advantages and disadvantages of the alternative
2340 function, and how it differs from the normal function, are discussed in
2341 the pcrematching page.
2342
2343 A regular expression is a pattern that is matched against a subject
2344 string from left to right. Most characters stand for themselves in a
2345 pattern, and match the corresponding characters in the subject. As a
2346 trivial example, the pattern
2347
2348 The quick brown fox
2349
2350 matches a portion of a subject string that is identical to itself. When
2351 caseless matching is specified (the PCRE_CASELESS option), letters are
2352 matched independently of case. In UTF-8 mode, PCRE always understands
2353 the concept of case for characters whose values are less than 128, so
2354 caseless matching is always possible. For characters with higher val-
2355 ues, the concept of case is supported if PCRE is compiled with Unicode
2356 property support, but not otherwise. If you want to use caseless
2357 matching for characters 128 and above, you must ensure that PCRE is
2358 compiled with Unicode property support as well as with UTF-8 support.
2359
2360 The power of regular expressions comes from the ability to include
2361 alternatives and repetitions in the pattern. These are encoded in the
2362 pattern by the use of metacharacters, which do not stand for themselves
2363 but instead are interpreted in some special way.
2364
2365 There are two different sets of metacharacters: those that are recog-
2366 nized anywhere in the pattern except within square brackets, and those
2367 that are recognized in square brackets. Outside square brackets, the
2368 metacharacters are as follows:
2369
2370 \ general escape character with several uses
2371 ^ assert start of string (or line, in multiline mode)
2372 $ assert end of string (or line, in multiline mode)
2373 . match any character except newline (by default)
2374 [ start character class definition
2375 | start of alternative branch
2376 ( start subpattern
2377 ) end subpattern
2378 ? extends the meaning of (
2379 also 0 or 1 quantifier
2380 also quantifier minimizer
2381 * 0 or more quantifier
2382 + 1 or more quantifier
2383 also "possessive quantifier"
2384 { start min/max quantifier
2385
2386 Part of a pattern that is in square brackets is called a "character
2387 class". In a character class the only metacharacters are:
2388
2389 \ general escape character
2390 ^ negate the class, but only if the first character
2391 - indicates character range
2392 [ POSIX character class (only if followed by POSIX
2393 syntax)
2394 ] terminates the character class
2395
2396 The following sections describe the use of each of the metacharacters.
2397
2398
2399 BACKSLASH
2400
2401 The backslash character has several uses. Firstly, if it is followed by
2402 a non-alphanumeric character, it takes away any special meaning that
2403 character may have. This use of backslash as an escape character
2404 applies both inside and outside character classes.
2405
2406 For example, if you want to match a * character, you write \* in the
2407 pattern. This escaping action applies whether or not the following
2408 character would otherwise be interpreted as a metacharacter, so it is
2409 always safe to precede a non-alphanumeric with backslash to specify
2410 that it stands for itself. In particular, if you want to match a back-
2411 slash, you write \\.
2412
2413 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
2414 the pattern (other than in a character class) and characters between a
2415 # outside a character class and the next newline character are ignored.
2416 An escaping backslash can be used to include a whitespace or # charac-
2417 ter as part of the pattern.
2418
2419 If you want to remove the special meaning from a sequence of charac-
2420 ters, you can do so by putting them between \Q and \E. This is differ-
2421 ent from Perl in that $ and @ are handled as literals in \Q...\E
2422 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
2423 tion. Note the following examples:
2424
2425 Pattern PCRE matches Perl matches
2426
2427 \Qabc$xyz\E abc$xyz abc followed by the
2428 contents of $xyz
2429 \Qabc\$xyz\E abc\$xyz abc\$xyz
2430 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
2431
2432 The \Q...\E sequence is recognized both inside and outside character
2433 classes.
2434
2435 Non-printing characters
2436
2437 A second use of backslash provides a way of encoding non-printing char-
2438 acters in patterns in a visible manner. There is no restriction on the
2439 appearance of non-printing characters, apart from the binary zero that
2440 terminates a pattern, but when a pattern is being prepared by text
2441 editing, it is usually easier to use one of the following escape
2442 sequences than the binary character it represents:
2443
2444 \a alarm, that is, the BEL character (hex 07)
2445 \cx "control-x", where x is any character
2446 \e escape (hex 1B)
2447 \f formfeed (hex 0C)
2448 \n newline (hex 0A)
2449 \r carriage return (hex 0D)
2450 \t tab (hex 09)
2451 \ddd character with octal code ddd, or backreference
2452 \xhh character with hex code hh
2453 \x{hhh..} character with hex code hhh... (UTF-8 mode only)
2454
2455 The precise effect of \cx is as follows: if x is a lower case letter,
2456 it is converted to upper case. Then bit 6 of the character (hex 40) is
2457 inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
2458 becomes hex 7B.
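
The rule for \cx is small enough to write out. The sketch below follows the description above; control_escape() is a name chosen for this example, and islower() stands in for whatever test of "lower case letter" PCRE applies internally.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: returns the character code that \cx stands for. */
int control_escape(unsigned char x)
{
  if (islower(x))
    x = (unsigned char)toupper(x);   /* lower case is folded to upper case */
  return x ^ 0x40;                   /* then bit 6 (hex 40) is inverted */
}
```

With this definition, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, matching the examples above.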
2459
2460 After \x, from zero to two hexadecimal digits are read (letters can be
2461 in upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
2462 its may appear between \x{ and }, but the value of the character code
2463 must be less than 2**31 (that is, the maximum hexadecimal value is
2464 7FFFFFFF). If characters other than hexadecimal digits appear between
2465 \x{ and }, or if there is no terminating }, this form of escape is not
2466 recognized. Instead, the initial \x will be interpreted as a basic
2467 hexadecimal escape, with no following digits, giving a character whose
2468 value is zero.
2469
2470 Characters whose value is less than 256 can be defined by either of the
2471 two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
2472 in the way they are handled. For example, \xdc is exactly the same as
2473 \x{dc}.
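
A minimal sketch of these \x rules is given below. It is written from the description above, not taken from the PCRE source, and the names parse_hex_escape() and hex_digit_value() are invented for it. The argument p is assumed to point just past the "\x" in the pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

static int hex_digit_value(int c)
{
  return isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
}

/* Hypothetical helper. Stores the character code in *value and returns
   the number of pattern characters used after "\x", or -1 if a braced
   value would not be less than 2**31. */
int parse_hex_escape(const char *p, int utf8, unsigned long *value)
{
  unsigned long v = 0;
  int n = 0;

  assert(p != NULL && value != NULL);

  if (utf8 && p[0] == '{') {
    const char *q = p + 1;
    while (isxdigit((unsigned char)*q)) {
      if (v > (0x7FFFFFFFul >> 4)) return -1;    /* would exceed 7FFFFFFF */
      v = v * 16 + (unsigned long)hex_digit_value((unsigned char)*q);
      q++;
    }
    if (*q == '}') {
      *value = v;
      return (int)(q - p) + 1;                   /* '{', digits, '}' */
    }
    v = 0;   /* not recognized: fall back to the basic form */
  }

  /* Basic form: from zero to two hexadecimal digits. */
  while (n < 2 && isxdigit((unsigned char)p[n])) {
    v = v * 16 + (unsigned long)hex_digit_value((unsigned char)p[n]);
    n++;
  }
  *value = v;
  return n;
}
```

Note that a malformed braced form such as "{dg}" falls through to the basic form, which reads no digits and yields the value zero.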
2474
2475 After \0 up to two further octal digits are read. In both cases, if
2476 there are fewer than two digits, just those that are present are used.
2477 Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
2478 character (code value 7). Make sure you supply two digits after the
2479 initial zero if the pattern character that follows is itself an octal
2480 digit.
2481
2482 The handling of a backslash followed by a digit other than 0 is compli-
2483 cated. Outside a character class, PCRE reads it and any following dig-
2484 its as a decimal number. If the number is less than 10, or if there
2485 have been at least that many previous capturing left parentheses in the
2486 expression, the entire sequence is taken as a back reference. A
2487 description of how this works is given later, following the discussion
2488 of parenthesized subpatterns.
2489
2490 Inside a character class, or if the decimal number is greater than 9
2491 and there have not been that many capturing subpatterns, PCRE re-reads
2492 up to three octal digits following the backslash, and generates a sin-
2493 gle byte from the least significant 8 bits of the value. Any subsequent
2494 digits stand for themselves. For example:
2495
2496 \040 is another way of writing a space
2497 \40 is the same, provided there are fewer than 40
2498 previous capturing subpatterns
2499 \7 is always a back reference
2500 \11 might be a back reference, or another way of
2501 writing a tab
2502 \011 is always a tab
2503 \0113 is a tab followed by the character "3"
2504 \113 might be a back reference, otherwise the
2505 character with octal code 113
2506 \377 might be a back reference, otherwise
2507 the byte consisting entirely of 1 bits
2508 \81 is either a back reference, or a binary zero
2509 followed by the two characters "8" and "1"
2510
2511 Note that octal values of 100 or greater must not be introduced by a
2512 leading zero, because no more than three octal digits are ever read.
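
The decision between a back reference and an octal escape, for a backslash followed by a non-zero digit outside a character class, might be sketched like this. The type digit_escape and the function classify_digit_escape() are hypothetical names for this example; p is assumed to point at the first digit after the backslash, and groups_so_far is the number of capturing left parentheses seen earlier in the pattern.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct digit_escape {
  int is_backref;   /* 1 for a back reference, 0 for a single byte */
  int value;        /* group number, or the byte value */
  int consumed;     /* pattern characters used after the backslash */
} digit_escape;

digit_escape classify_digit_escape(const char *p, int groups_so_far)
{
  digit_escape r = {0, 0, 0};
  long number = 0;
  int n = 0;

  assert(p != NULL && p[0] >= '1' && p[0] <= '9');

  /* First reading: all following digits, as a decimal number.
     The cap only guards against overflow on absurdly long digit runs. */
  while (isdigit((unsigned char)p[n])) {
    if (number < 100000) number = number * 10 + (p[n] - '0');
    n++;
  }
  if (number < 10 || number <= groups_so_far) {
    r.is_backref = 1;
    r.value = (int)number;
    r.consumed = n;
    return r;
  }

  /* Second reading: up to three octal digits, least significant 8 bits. */
  n = 0;
  while (n < 3 && p[n] >= '0' && p[n] <= '7') {
    r.value = r.value * 8 + (p[n] - '0');
    n++;
  }
  r.value &= 0xFF;
  r.consumed = n;
  return r;
}
```

For "81" with no capturing groups, the second reading finds no octal digits, so the result is a zero byte with nothing consumed, and the "8" and "1" then stand for themselves.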
2513
2514 All the sequences that define a single byte value or a single UTF-8
2515 character (in UTF-8 mode) can be used both inside and outside character
2516 classes. In addition, inside a character class, the sequence \b is
2517 interpreted as the backspace character (hex 08), and the sequence \X is
2518 interpreted as the character "X". Outside a character class, these
2519 sequences have different meanings (see below).
2520
2521 Generic character types
2522
2523 The third use of backslash is for specifying generic character types.
2524 The following are always recognized:
2525
2526 \d any decimal digit
2527 \D any character that is not a decimal digit
2528 \s any whitespace character
2529 \S any character that is not a whitespace character
2530 \w any "word" character
2531 \W any "non-word" character
2532
2533 Each pair of escape sequences partitions the complete set of characters
2534 into two disjoint sets. Any given character matches one, and only one,
2535 of each pair.
2536
2537 These character type sequences can appear both inside and outside char-
2538 acter classes. They each match one character of the appropriate type.
2539 If the current matching point is at the end of the subject string, all
2540 of them fail, since there is no character to match.
2541
2542 For compatibility with Perl, \s does not match the VT character (code
2543 11). This makes it different from the POSIX "space" class. The \s
2544 characters are HT (9), LF (10), FF (12), CR (13), and space (32).
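
The difference from the C library's isspace() can be seen with a small function that tests exactly the five characters listed above. The name is_perl_space() is made up for this example.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical helper: the set matched by \s as listed above,
   that is HT, LF, FF, CR, and space, but not VT (11). */
int is_perl_space(int c)
{
  return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}
```

In the default "C" locale, isspace(11) is true whereas is_perl_space(11) is false; the two functions agree on the other five characters.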
2545
2546 A "word" character is an underscore or any character less than 256 that
2547 is a letter or digit. The definition of letters and digits is con-
2548 trolled by PCRE's low-valued character tables, and may vary if locale-
2549 specific matching is taking place (see "Locale support" in the pcreapi
2550 page). For example, in the "fr_FR" (French) locale, some character
2551 codes greater than 128 are used for accented letters, and these are
2552 matched by \w.
2553
2554 In UTF-8 mode, characters with values greater than 128 never match \d,
2555 \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2556 code character property support is available.
2557
2558 Unicode character properties
2559
2560 When PCRE is built with Unicode character property support, three addi-
2561 tional escape sequences to match generic character types are available
2562 when UTF-8 mode is selected. They are:
2563
2564 \p{xx} a character with the xx property
2565 \P{xx} a character without the xx property
2566 \X an extended Unicode sequence
2567
2568 The property names represented by xx above are limited to the Unicode
2569 general category properties. Each character has exactly one such prop-
2570 erty, specified by a two-letter abbreviation. For compatibility with
2571 Perl, negation can be specified by including a circumflex between the
2572 opening brace and the property name. For example, \p{^Lu} is the same
2573 as \P{Lu}.
2574
2575 If only one letter is specified with \p or \P, it includes all the
2576 properties that start with that letter. In this case, in the absence of
2577 negation, the curly brackets in the escape sequence are optional; these
2578 two examples have the same effect:
2579
2580 \p{L}
2581 \pL
2582
2583 The following property codes are supported:
2584
2585 C Other
2586 Cc Control
2587 Cf Format
2588 Cn Unassigned
2589 Co Private use
2590 Cs Surrogate
2591
2592 L Letter
2593 Ll Lower case letter
2594 Lm Modifier letter
2595 Lo Other letter
2596 Lt Title case letter
2597 Lu Upper case letter
2598
2599 M Mark
2600 Mc Spacing mark
2601 Me Enclosing mark
2602 Mn Non-spacing mark
2603
2604 N Number
2605 Nd Decimal number
2606 Nl Letter number
2607 No Other number
2608
2609 P Punctuation
2610 Pc Connector punctuation
2611 Pd Dash punctuation
2612 Pe Close punctuation
2613 Pf Final punctuation
2614 Pi Initial punctuation
2615 Po Other punctuation
2616 Ps Open punctuation
2617
2618 S Symbol
2619 Sc Currency symbol
2620 Sk Modifier symbol
2621 Sm Mathematical symbol
2622 So Other symbol
2623
2624 Z Separator
2625 Zl Line separator
2626 Zp Paragraph separator
2627 Zs Space separator
2628
2629 Extended properties such as "Greek" or "InMusicalSymbols" are not sup-
2630 ported by PCRE.
2631
2632 Specifying caseless matching does not affect these escape sequences.
2633 For example, \p{Lu} always matches only upper case letters.
2634
2635 The \X escape matches any number of Unicode characters that form an
2636 extended Unicode sequence. \X is equivalent to
2637
2638 (?>\PM\pM*)
2639
2640 That is, it matches a character without the "mark" property, followed
2641 by zero or more characters with the "mark" property, and treats the
2642 sequence as an atomic group (see below). Characters with the "mark"
2643 property are typically accents that affect the preceding character.
2644
2645 Matching characters by Unicode property is not fast, because PCRE has
2646 to search a structure that contains data for over fifteen thousand
2647 characters. That is why the traditional escape sequences such as \d and
2648 \w do not use Unicode properties in PCRE.
2649
2650 Simple assertions
2651
2652 The fourth use of backslash is for certain simple assertions. An asser-
2653 tion specifies a condition that has to be met at a particular point in
2654 a match, without consuming any characters from the subject string. The
2655 use of subpatterns for more complicated assertions is described below.
2656 The backslashed assertions are:
2657
2658 \b matches at a word boundary
2659 \B matches when not at a word boundary
2660 \A matches at start of subject
2661 \Z matches at end of subject or before newline at end
2662 \z matches at end of subject
2663 \G matches at first matching position in subject
2664
2665 These assertions may not appear in character classes (but note that \b
2666 has a different meaning, namely the backspace character, inside a char-
2667 acter class).
2668
2669 A word boundary is a position in the subject string where the current
2670 character and the previous character do not both match \w or \W (i.e.
2671 one matches \w and the other matches \W), or the start or end of the
2672 string if the first or last character matches \w, respectively.
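
The definition of a word boundary can be expressed as a comparison of two flags. The sketch below assumes the default "C" locale, where isalnum() plus the underscore approximates \w; the function names are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Approximation of \w for the "C" locale. */
int is_word_char(unsigned char c)
{
  return c == '_' || isalnum(c);
}

/* Hypothetical helper: non-zero if position pos (0 <= pos <= len) in the
   subject s is a word boundary. The positions before the start and after
   the end of the subject are treated as non-word characters. */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
  int prev, cur;

  assert(s != NULL && pos <= len);
  prev = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
  cur = pos < len && is_word_char((unsigned char)s[pos]);
  return prev != cur;
}
```

For the subject "ab cd", positions 0, 2, 3, and 5 are word boundaries, and positions 1 and 4 are not.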
2673
2674 The \A, \Z, and \z assertions differ from the traditional circumflex
2675 and dollar (described in the next section) in that they only ever match
2676 at the very start and end of the subject string, whatever options are
2677 set. Thus, they are independent of multiline mode. These three asser-
2678 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
2679 affect only the behaviour of the circumflex and dollar metacharacters.
2680 However, if the startoffset argument of pcre_exec() is non-zero, indi-
2681 cating that matching is to start at a point other than the beginning of
2682 the subject, \A can never match. The difference between \Z and \z is
2683 that \Z matches before a newline that is the last character of the
2684 string as well as at the end of the string, whereas \z matches only at
2685 the end.
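
The difference between \Z and \z comes down to one extra case, as this sketch shows. The function names are hypothetical; s is the subject, len its length, and pos the current matching position.

```c
#include <assert.h>
#include <stddef.h>

/* \z: true only at the very end of the subject. */
int matches_small_z(size_t len, size_t pos)
{
  return pos == len;
}

/* \Z: true at the very end, or just before a newline that is the
   last character of the subject. */
int matches_cap_Z(const char *s, size_t len, size_t pos)
{
  assert(s != NULL && pos <= len);
  return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

For the four-character subject "abc" followed by a newline, \Z is true at offsets 3 and 4, whereas \z is true only at offset 4.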
2686
2687 The \G assertion is true only when the current matching position is at
2688 the start point of the match, as specified by the startoffset argument
2689 of pcre_exec(). It differs from \A when the value of startoffset is
2690 non-zero. By calling pcre_exec() multiple times with appropriate
2691 arguments, you can mimic Perl's /g option, and it is in this kind of
2692 implementation that \G can be useful.
2693
2694 Note, however, that PCRE's interpretation of \G, as the start of the
2695 current match, is subtly different from Perl's, which defines it as the
2696 end of the previous match. In Perl, these can be different when the
2697 previously matched string was empty. Because PCRE does just one match
2698 at a time, it cannot reproduce this behaviour.
2699
2700 If all the alternatives of a pattern begin with \G, the expression is
2701 anchored to the starting match position, and the "anchored" flag is set
2702 in the compiled regular expression.
2703
2704
2705 CIRCUMFLEX AND DOLLAR
2706
2707 Outside a character class, in the default matching mode, the circumflex
2708 character is an assertion that is true only if the current matching
2709 point is at the start of the subject string. If the startoffset argu-
2710 ment of pcre_exec() is non-zero, circumflex can never match if the
2711 PCRE_MULTILINE option is unset. Inside a character class, circumflex
2712 has an entirely different meaning (see below).
2713
2714 Circumflex need not be the first character of the pattern if a number
2715 of alternatives are involved, but it should be the first thing in each
2716 alternative in which it appears if the pattern is ever to match that
2717 branch. If all possible alternatives start with a circumflex, that is,
2718 if the pattern is constrained to match only at the start of the sub-
2719 ject, it is said to be an "anchored" pattern. (There are also other
2720 constructs that can cause a pattern to be anchored.)
2721
2722 A dollar character is an assertion that is true only if the current
2723 matching point is at the end of the subject string, or immediately
2724 before a newline character that is the last character in the string (by
2725 default). Dollar need not be the last character of the pattern if a
2726 number of alternatives are involved, but it should be the last item in
2727 any branch in which it appears. Dollar has no special meaning in a
2728 character class.
2729
2730 The meaning of dollar can be changed so that it matches only at the
2731 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
2732 compile time. This does not affect the \Z assertion.
2733
2734 The meanings of the circumflex and dollar characters are changed if the
2735 PCRE_MULTILINE option is set. When this is the case, they match immedi-
2736 ately after and immediately before an internal newline character,
2737 respectively, in addition to matching at the start and end of the sub-
2738 ject string. For example, the pattern /^abc$/ matches the subject
2739 string "def\nabc" (where \n represents a newline character) in multi-
2740 line mode, but not otherwise. Consequently, patterns that are anchored
2741 in single line mode because all branches start with ^ are not anchored
2742 in multiline mode, and a match for circumflex is possible when the
2743 startoffset argument of pcre_exec() is non-zero. The PCRE_DOL-
2744 LAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
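
The rules for circumflex and dollar in the different modes can be summarized in code. This is a sketch written from the description above, assuming a startoffset of zero. The function names are invented, and the multiline and dollar_endonly parameters are plain flags standing for the PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY options.

```c
#include <assert.h>
#include <stddef.h>

/* Non-zero if circumflex is true at position pos of subject s. */
int circumflex_matches(const char *s, size_t len, size_t pos, int multiline)
{
  assert(s != NULL && pos <= len);
  if (pos == 0) return 1;                 /* start of subject */
  /* In multiline mode, also immediately after an internal newline. */
  return multiline && pos < len && s[pos - 1] == '\n';
}

/* Non-zero if dollar is true at position pos of subject s. */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
  assert(s != NULL && pos <= len);
  if (pos == len) return 1;               /* very end of subject */
  if (multiline)                          /* dollar_endonly is ignored */
    return s[pos] == '\n';
  if (!dollar_endonly)                    /* before a final newline */
    return pos + 1 == len && s[pos] == '\n';
  return 0;
}
```

With the subject "def\nabc", circumflex is true at offset 4 only in multiline mode, which is why /^abc$/ matches there in that mode but not otherwise.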
2745
2746 Note that the sequences \A, \Z, and \z can be used to match the start
2747 and end of the subject in both modes, and if all branches of a pattern
2748 start with \A it is always anchored, whether PCRE_MULTILINE is set or
2749 not.
2750
2751
2752 FULL STOP (PERIOD, DOT)
2753
2754 Outside a character class, a dot in the pattern matches any one charac-
2755 ter in the subject, including a non-printing character, but not (by
2756 default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
2757 which might be more than one byte long, except (by default) newline. If
2758 the PCRE_DOTALL option is set, dots match newlines as well. The han-
2759 dling of dot is entirely independent of the handling of circumflex and
2760 dollar, the only relationship being that they both involve newline
2761 characters. Dot has no special meaning in a character class.
2762
2763
2764 MATCHING A SINGLE BYTE
2765
2766 Outside a character class, the escape sequence \C matches any one byte,
2767 both in and out of UTF-8 mode. Unlike a dot, it can match a newline.
2768 The feature is provided in Perl in order to match individual bytes in
2769 UTF-8 mode. Because it breaks up UTF-8 characters into individual
2770 bytes, what remains in the string may be a malformed UTF-8 string. For
2771 this reason, the \C escape sequence is best avoided.
2772
2773 PCRE does not allow \C to appear in lookbehind assertions (described
2774 below), because in UTF-8 mode this would make it impossible to calcu-
2775 late the length of the lookbehind.
2776
2777
2778 SQUARE BRACKETS AND CHARACTER CLASSES
2779
2780 An opening square bracket introduces a character class, terminated by a
2781 closing square bracket. A closing square bracket on its own is not spe-
2782 cial. If a closing square bracket is required as a member of the class,
2783 it should be the first data character in the class (after an initial
2784 circumflex, if present) or escaped with a backslash.
2785
2786 A character class matches a single character in the subject. In UTF-8
2787 mode, the character may occupy more than one byte. A matched character
2788 must be in the set of characters defined by the class, unless the first
2789 character in the class definition is a circumflex, in which case the
2790 subject character must not be in the set defined by the class. If a
2791 circumflex is actually required as a member of the class, ensure it is
2792 not the first character, or escape it with a backslash.
2793
2794 For example, the character class [aeiou] matches any lower case vowel,
2795 while [^aeiou] matches any character that is not a lower case vowel.
2796 Note that a circumflex is just a convenient notation for specifying the
2797 characters that are in the class by enumerating those that are not. A
2798 class that starts with a circumflex is not an assertion: it still con-
2799 sumes a character from the subject string, and therefore it fails if
2800 the current pointer is at the end of the string.
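
The point that a negated class still consumes a character can be checked with a short sketch. Python's re module is used as a stand-in for PCRE (an assumption; the behaviour shown is common to both engines):

```python
import re

vowels = re.findall(r"[aeiou]", "regular")
non_vowels = re.findall(r"[^aeiou]", "regular")

# The only "r" not followed by a vowel is the last one, and nothing follows
# it, so [^aeiou] has no character to consume and the search fails.
at_end = re.search(r"r[^aeiou]", "regular")

print(vowels, non_vowels, at_end)
```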
2801
2802 In UTF-8 mode, characters with values greater than 255 can be included
2803 in a class as a literal string of bytes, or by using the \x{ escaping
2804 mechanism.
2805
2806 When caseless matching is set, any letters in a class represent both
2807 their upper case and lower case versions, so for example, a caseless
2808 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
2809 match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
2810 understands the concept of case for characters whose values are less
2811 than 128, so caseless matching is always possible. For characters with
2812 higher values, the concept of case is supported if PCRE is compiled
2813 with Unicode property support, but not otherwise. If you want to use
2814 caseless matching for characters 128 and above, you must ensure that
2815 PCRE is compiled with Unicode property support as well as with UTF-8
2816 support.
2817
2818 The newline character is never treated in any special way in character
2819 classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
2820 options is. A class such as [^a] will always match a newline.
2821
2822 The minus (hyphen) character can be used to specify a range of charac-
2823 ters in a character class. For example, [d-m] matches any letter
2824 between d and m, inclusive. If a minus character is required in a
2825 class, it must be escaped with a backslash or appear in a position
2826 where it cannot be interpreted as indicating a range, typically as the
2827 first or last character in the class.
2828
2829 It is not possible to have the literal character "]" as the end charac-
2830 ter of a range. A pattern such as [W-]46] is interpreted as a class of
2831 two characters ("W" and "-") followed by a literal string "46]", so it
2832 would match "W46]" or "-46]". However, if the "]" is escaped with a
2833 backslash it is interpreted as the end of range, so [W-\]46] is inter-
2834 preted as a class containing a range followed by two other characters.
2835 The octal or hexadecimal representation of "]" can also be used to end
2836 a range.
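
The two readings of the closing square bracket can be compared in a short sketch. Python's re module is used as a stand-in for PCRE (an assumption; it parses these two classes in the way described above):

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
two_char_class = re.compile(r"[W-]46]")

# Range from "W" to "]", plus the two further characters "4" and "6".
range_class = re.compile(r"[W-\]46]")

print(bool(two_char_class.fullmatch("-46]")), bool(range_class.fullmatch("Z")))
```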
2837
2838 Ranges operate in the collating sequence of character values. They can
2839 also be used for characters specified numerically, for example
2840 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
2841 are greater than 255, for example [\x{100}-\x{2ff}].
2842
2843 If a range that includes letters is used when caseless matching is set,
2844 it matches the letters in either case. For example, [W-c] is equivalent
2845 to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
2846 character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches
2847 accented E characters in both cases. In UTF-8 mode, PCRE supports the
2848 concept of case for characters with values greater than 127 only when
2849 it is compiled with Unicode property support.
2850
2851 The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
2852 in a character class, and add the characters that they match to the
2853 class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
2854 flex can conveniently be used with the upper case character types to
2855 specify a more restricted set of characters than the matching lower
2856 case type. For example, the class [^\W_] matches any letter or digit,
2857 but not underscore.
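
A brief sketch of character types inside a class. Python's re module is used as a stand-in for PCRE, and re.ASCII is set so that \d and \W cover only ASCII characters (both are assumptions made for this illustration):

```python
import re

# \d adds the digits to the class; only upper case A-F are listed explicitly.
hex_upper = re.compile(r"[\dABCDEF]+", re.ASCII)

# "Not a non-word character and not underscore" leaves letters and digits.
alnum = re.compile(r"[^\W_]+", re.ASCII)
words = alnum.findall("foo_bar42 baz")

print(bool(hex_upper.fullmatch("7F3A")), words)
```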
2858
2859 The only metacharacters that are recognized in character classes are
2860 backslash, hyphen (only where it can be interpreted as specifying a
2861 range), circumflex (only at the start), opening square bracket (only
2862 when it can be interpreted as introducing a POSIX class name - see the
2863 next section), and the terminating closing square bracket. However,
2864 escaping other non-alphanumeric characters does no harm.
2865
2866
2867 POSIX CHARACTER CLASSES
2868
2869 Perl supports the POSIX notation for character classes. This uses names
2870 enclosed by [: and :] within the enclosing square brackets. PCRE also
2871 supports this notation. For example,
2872
2873 [01[:alpha:]%]
2874
2875 matches "0", "1", any alphabetic character, or "%". The supported class
2876 names are
2877
2878 alnum letters and digits
2879 alpha letters
2880 ascii character codes 0 - 127
2881 blank space or tab only
2882 cntrl control characters
2883 digit decimal digits (same as \d)
2884 graph printing characters, excluding space
2885 lower lower case letters
2886 print printing characters, including space
2887 punct printing characters, excluding letters and digits
2888 space white space (not quite the same as \s)
2889 upper upper case letters
2890 word "word" characters (same as \w)
2891 xdigit hexadecimal digits
2892
2893 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
2894 and space (32). Notice that this list includes the VT character (code
2895 11). This makes "space" different to \s, which does not include VT (for
2896 Perl compatibility).
2897
2898 The name "word" is a Perl extension, and "blank" is a GNU extension
2899 from Perl 5.8. Another Perl extension is negation, which is indicated
2900 by a ^ character after the colon. For example,
2901
2902 [12[:^digit:]]
2903
2904 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
2905 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
2906 these are not supported, and an error is given if they are encountered.
2907
2908 In UTF-8 mode, characters with values greater than 127 do not match any
2909 of the POSIX character classes.
2910
2911
2912 VERTICAL BAR
2913
2914 Vertical bar characters are used to separate alternative patterns. For
2915 example, the pattern
2916
2917 gilbert|sullivan
2918
2919 matches either "gilbert" or "sullivan". Any number of alternatives may
2920 appear, and an empty alternative is permitted (matching the empty
2921 string). The matching process tries each alternative in turn, from
2922 left to right, and the first one that succeeds is used. If the alterna-
2923 tives are within a subpattern (defined below), "succeeds" means match-
2924 ing the rest of the main pattern as well as the alternative in the sub-
2925 pattern.
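
The left-to-right rule for alternatives is sketched below. Python's re module is used as a stand-in for PCRE (an assumption; both engines try alternatives in order), and the short patterns other than gilbert|sullivan are made up for the illustration:

```python
import re

names = re.findall(r"gilbert|sullivan", "gilbert and sullivan")

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern: "a"
# cannot be followed by "c" here, so the second alternative is used.
inside_group = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"x(?:y|)", "x")

print(names, first_wins, inside_group, bool(empty_alt))
```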
2926
2927
2928 INTERNAL OPTION SETTING
2929
2930 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
2931 PCRE_EXTENDED options can be changed from within the pattern by a
2932 sequence of Perl option letters enclosed between "(?" and ")". The
2933 option letters are
2934
2935 i for PCRE_CASELESS
2936 m for PCRE_MULTILINE
2937 s for PCRE_DOTALL
2938 x for PCRE_EXTENDED
2939
2940 For example, (?im) sets caseless, multiline matching. It is also possi-
2941 ble to unset these options by preceding the letter with a hyphen, and a
2942 combined setting and unsetting such as (?im-sx), which sets
2943 PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
2944 is also permitted. If a letter appears both before and after the
2945 hyphen, the option is unset.
2946
2947 When an option change occurs at top level (that is, not inside subpat-
2948 tern parentheses), the change applies to the remainder of the pattern
2949 that follows. If the change is placed right at the start of a pattern,
2950 PCRE extracts it into the global options (and it will therefore show up
2951 in data extracted by the pcre_fullinfo() function).
2952
2953 An option change within a subpattern affects only that part of the cur-
2954 rent pattern that follows it, so
2955
2956 (a(?i)b)c
2957
2958 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
2959 used). By this means, options can be made to have different settings
2960 in different parts of the pattern. Any changes made in one alternative
2961 do carry on into subsequent branches within the same subpattern. For
2962 example,
2963
2964 (a(?i)b|c)
2965
2966 matches "ab", "aB", "c", and "C", even though when matching "C" the
2967 first branch is abandoned before the option setting. This is because
2968 the effects of option settings happen at compile time. There would be
2969 some very weird behaviour otherwise.
2970
2971 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
2972 in the same way as the Perl-compatible options by using the characters
2973 U and X respectively. The (?X) flag setting is special in that it must
2974 always occur earlier in the pattern than any of the additional features
2975 it turns on, even when it is at top level. It is best to put it at the
2976 start.
2977
2978
2979 SUBPATTERNS
2980
2981 Subpatterns are delimited by parentheses (round brackets), which can be
2982 nested. Turning part of a pattern into a subpattern does two things:
2983
2984 1. It localizes a set of alternatives. For example, the pattern
2985
2986 cat(aract|erpillar|)
2987
2988 matches one of the words "cat", "cataract", or "caterpillar". Without
2989 the parentheses, it would match "cataract", "erpillar" or the empty
2990 string.
2991
2992 2. It sets up the subpattern as a capturing subpattern. This means
2993 that, when the whole pattern matches, that portion of the subject
2994 string that matched the subpattern is passed back to the caller via the
2995 ovector argument of pcre_exec(). Opening parentheses are counted from
2996 left to right (starting from 1) to obtain numbers for the capturing
2997 subpatterns.
2998
2999 For example, if the string "the red king" is matched against the pat-
3000 tern
3001
3002 the ((red|white) (king|queen))
3003
3004 the captured substrings are "red king", "red", and "king", and are num-
3005 bered 1, 2, and 3, respectively.
3006
3007 The fact that plain parentheses fulfil two functions is not always
3008 helpful. There are often times when a grouping subpattern is required
3009 without a capturing requirement. If an opening parenthesis is followed
3010 by a question mark and a colon, the subpattern does not do any captur-
3011 ing, and is not counted when computing the number of any subsequent
3012 capturing subpatterns. For example, if the string "the white queen" is
3013 matched against the pattern
3014
3015 the ((?:red|white) (king|queen))
3016
3017 the captured substrings are "white queen" and "queen", and are numbered
3018 1 and 2. The maximum number of capturing subpatterns is 65535, and the
3019 maximum depth of nesting of all subpatterns, both capturing and non-
3020 capturing, is 200.
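
The numbering of capturing and non-capturing subpatterns can be sketched as follows. Python's re module is used as a stand-in for PCRE (an assumption); its groups() method returns the captured substrings in numerical order, which is roughly what the ovector provides to a caller of pcre_exec().

```python
import re

# Three capturing subpatterns, numbered by their opening parentheses.
all_capturing = re.search(
    r"the ((red|white) (king|queen))", "the red king"
).groups()

# (?: ... ) groups without capturing and is skipped in the numbering.
one_non_capturing = re.search(
    r"the ((?:red|white) (king|queen))", "the white queen"
).groups()

print(all_capturing, one_non_capturing)
```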
3021
3022 As a convenient shorthand, if any option settings are required at the
3023 start of a non-capturing subpattern, the option letters may appear
3024 between the "?" and the ":". Thus the two patterns
3025
3026 (?i:saturday|sunday)
3027 (?:(?i)saturday|sunday)
3028
3029 match exactly the same set of strings. Because alternative branches are
3030 tried from left to right, and options are not reset until the end of
3031 the subpattern is reached, an option setting in one branch does affect
3032 subsequent branches, so the above patterns match "SUNDAY" as well as
3033 "Saturday".
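
A sketch of the option shorthand, using Python's re module as a stand-in for PCRE. This is an assumption with a caveat: Python accepts the scoped form (?i:...) but recent versions reject a bare (?i) that is not at the start of the pattern, so only the shorthand is shown. The pattern a(?i:b)c is an illustrative variant, not one taken from the text.

```python
import re

# The option applies to both branches inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")

# Caseless matching is confined to the "b"; "a" and "c" stay caseful.
scoped = re.compile(r"a(?i:b)c")
accepted = [s for s in ("abc", "aBc", "Abc", "abC") if scoped.fullmatch(s)]

print(bool(weekend.fullmatch("SUNDAY")), accepted)
```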
3034
3035
3036 NAMED SUBPATTERNS
3037
3038 Identifying capturing parentheses by number is simple, but it can be
3039 very hard to keep track of the numbers in complicated regular expres-
3040 sions. Furthermore, if an expression is modified, the numbers may
3041 change. To help with this difficulty, PCRE supports the naming of sub-
3042 patterns, something that Perl does not provide. The Python syntax
3043 (?P<name>...) is used. Names consist of alphanumeric characters and
3044 underscores, and must be unique within a pattern.
3045
3046 Named capturing parentheses are still allocated numbers as well as
3047 names. The PCRE API provides function calls for extracting the name-to-
3048 number translation table from a compiled pattern. There is also a con-
3049 venience function for extracting a captured substring by name. For fur-
3050 ther details see the pcreapi documentation.
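
Because the syntax is borrowed from Python, the idea can be sketched directly with Python's re module. This shows Python's own facilities, not the PCRE API: groupindex and group("name") are only loosely analogous to the name-to-number table and the convenience function mentioned above, and the date pattern is made up for the illustration.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.search("released 2007-02")

# A named subpattern can be fetched by name or by its allocated number.
by_name = m.group("year")
by_number = m.group(1)

# Name-to-number translation table for the compiled pattern.
name_table = dict(date.groupindex)

print(by_name, by_number, name_table)
```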
3051
3052
3053 REPETITION
3054
3055 Repetition is specified by quantifiers, which can follow any of the
3056 following items:
3057
3058 a literal data character
3059 the . metacharacter
3060 the \C escape sequence
3061 the \X escape sequence (in UTF-8 mode with Unicode properties)
3062 an escape such as \d that matches a single character
3063 a character class
3064 a back reference (see next section)
3065 a parenthesized subpattern (unless it is an assertion)
3066
3067 The general repetition quantifier specifies a minimum and maximum num-
3068 ber of permitted matches, by giving the two numbers in curly brackets
3069 (braces), separated by a comma. The numbers must be less than 65536,
3070 and the first must be less than or equal to the second. For example:
3071
3072 z{2,4}
3073
3074 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
3075 special character. If the second number is omitted, but the comma is
3076 present, there is no upper limit; if the second number and the comma
3077 are both omitted, the quantifier specifies an exact number of required
3078 matches. Thus
3079
3080 [aeiou]{3,}
3081
3082 matches at least 3 successive vowels, but may match many more, while
3083
3084 \d{8}
3085
3086 matches exactly 8 digits. An opening curly bracket that appears in a
3087 position where a quantifier is not allowed, or one that does not match
3088 the syntax of a quantifier, is taken as a literal character. For exam-
3089 ple, {,6} is not a quantifier, but a literal string of four characters.
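
The three quantifier examples can be tried out in a short sketch. Python's re module is used as a stand-in for PCRE (an assumption). The {,6} case is deliberately left out, because Python treats it as a quantifier and so differs from the behaviour described above. The subject strings are made up for the illustration.

```python
import re

# z{2,4}: between two and four repetitions.
z = re.compile(r"z{2,4}")
lengths = [n for n in range(1, 7) if z.fullmatch("z" * n)]

# [aeiou]{3,}: at least three vowels, and greedy, so it takes all five.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group()

# \d{8}: exactly eight digits.
eight = re.fullmatch(r"\d{8}", "20070224")
seven = re.fullmatch(r"\d{8}", "2007022")

print(lengths, vowel_run, bool(eight), bool(seven))
```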
3090
3091 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
3092 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
3093 acters, each of which is represented by a two-byte sequence. Similarly,
3094 when Unicode property support is available, \X{3} matches three Unicode
3095 extended sequences, each of which may be several bytes long (and they
3096 may be of different lengths).
3097
3098 The quantifier {0} is permitted, causing the expression to behave as if
3099 the previous item and the quantifier were not present.
3100
3101 For convenience (and historical compatibility) the three most common
3102 quantifiers have single-character abbreviations:
3103
3104 * is equivalent to {0,}
3105 + is equivalent to {1,}
3106 ? is equivalent to {0,1}
3107
3108 It is possible to construct infinite loops by following a subpattern
3109 that can match no characters with a quantifier that has no upper limit,
3110 for example:
3111
3112 (a?)*
3113
3114 Earlier versions of Perl and PCRE used to give an error at compile time
3115 for such patterns. However, because there are cases where this can be
3116 useful, such patterns are now accepted, but if any repetition of the
3117 subpattern does in fact match no characters, the loop is forcibly bro-
3118 ken.
3119
3120 By default, the quantifiers are "greedy", that is, they match as much
3121 as possible (up to the maximum number of permitted times), without
3122 causing the rest of the pattern to fail. The classic example of where
3123 this gives problems is in trying to match comments in C programs. These
3124 appear between /* and */ and within the comment, individual * and /
3125 characters may appear. An attempt to match C comments by applying the
3126 pattern
3127
3128 /\*.*\*/
3129
3130 to the string
3131
3132 /* first comment */ not comment /* second comment */
3133
3134 fails, because it matches the entire string owing to the greediness of
3135 the .* item.
3136
3137 However, if a quantifier is followed by a question mark, it ceases to
3138 be greedy, and instead matches the minimum number of times possible, so
3139 the pattern
3140
3141 /\*.*?\*/
3142
3143 does the right thing with the C comments. The meaning of the various
3144 quantifiers is not otherwise changed, just the preferred number of
3145 matches. Do not confuse this use of question mark with its use as a
3146 quantifier in its own right. Because it has two uses, it can sometimes
3147 appear doubled, as in
3148
3149 \d??\d
3150
3151 which matches one digit by preference, but can match two if that is the
3152 only way the rest of the pattern matches.
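
The greedy and minimizing forms can be compared on the C comment example. Python's re module is used as a stand-in for PCRE (an assumption; both engines treat ? after a quantifier in this way):

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ and swallows the text in between.
greedy = re.search(r"/\*.*\*/", source).group()

# Minimizing: .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the whole subject must match.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(lazy, one_digit, two_digits)
```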
3153
3154 If the PCRE_UNGREEDY option is set (an option which is not available in
3155 Perl), the quantifiers are not greedy by default, but individual ones
3156 can be made greedy by following them with a question mark. In other
3157 words, it inverts the default behaviour.
3158
3159 When a parenthesized subpattern is quantified with a minimum repeat
3160 count that is greater than 1 or with a limited maximum, more memory is
3161 required for the compiled pattern, in proportion to the size of the
3162 minimum or maximum.
3163
3164 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
3165 alent to Perl's /s) is set, thus allowing the . to match newlines, the
3166 pattern is implicitly anchored, because whatever follows will be tried
3167 against every character position in the subject string, so there is no
3168 point in retrying the overall match at any position after the first.
3169 PCRE normally treats such a pattern as though it were preceded by \A.
3170
3171 In cases where it is known that the subject string contains no new-
3172 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
3173 mization, or alternatively using ^ to indicate anchoring explicitly.
3174
3175 However, there is one situation where the optimization cannot be used.
3176 When .* is inside capturing parentheses that are the subject of a
3177 backreference elsewhere in the pattern, a match at the start may fail,
3178 and a later one succeed. Consider, for example:
3179
3180 (.*)abc\1
3181
3182 If the subject is "xyz123abc123" the match point is the fourth charac-
3183 ter. For this reason, such a pattern is not implicitly anchored.
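
The match point in this example can be confirmed with a short sketch. Python's re module is used as a stand-in for PCRE (an assumption; it says nothing about PCRE's internal optimization, it only shows where the match is found):

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Offsets count from zero, so offset 3 is the fourth character.
match_offset = m.start()
captured = m.group(1)

print(match_offset, captured, m.group())
```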
3184
3185 When a capturing subpattern is repeated, the value captured is the sub-
3186 string that matched the final iteration. For example, after
3187
3188 (tweedle[dume]{3}\s*)+
3189
3190 has matched "tweedledum tweedledee" the value of the captured substring
3191 is "tweedledee". However, if there are nested capturing subpatterns,
3192 the corresponding captured values may have been set in previous itera-
3193 tions. For example, after
3194
3195 /(a|(b))+/
3196
3197 matches "aba" the value of the second captured substring is "b".
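
Both captured-value examples can be reproduced in a short sketch. Python's re module is used as a stand-in for PCRE (an assumption; for these two patterns it retains captured values in the way described above):

```python
import re

# The group is repeated twice; the value kept is from the final iteration.
last_iteration = re.fullmatch(
    r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee"
).group(1)

# The inner group took part only in the second iteration, but its value
# is still set after the third iteration matches "a".
nested = re.fullmatch(r"(a|(b))+", "aba").groups()

print(last_iteration, nested)
```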
3198
3199
3200 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
3201
3202 With both maximizing and minimizing repetition, failure of what follows
3203 normally causes the repeated item to be re-evaluated to see if a dif-
3204 ferent number of repeats allows the rest of the pattern to match. Some-
3205 times it is useful to prevent this, either to change the nature of the
3206 match, or to cause it to fail earlier than it otherwise might, when the
3207 author of the pattern knows there is no point in carrying on.
3208
3209 Consider, for example, the pattern \d+foo when applied to the subject
3210 line
3211
3212 123456bar
3213
3214 After matching all 6 digits and then failing to match "foo", the normal
3215 action of the matcher is to try again with only 5 digits matching the
3216 \d+ item, and then with 4, and so on, before ultimately failing.
3217 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
3218 the means for specifying that once a subpattern has matched, it is not
3219 to be re-evaluated in this way.
3220
3221 If we use atomic grouping for the previous example, the matcher would
3222 give up immediately on failing to match "foo" the first time. The nota-
3223 tion is a kind of special parenthesis, starting with (?> as in this
3224 example:
3225
3226 (?>\d+)foo
3227
3228 This kind of parenthesis "locks up" the part of the pattern it con-
3229 tains once it has matched, and a failure further into the pattern is
3230 prevented from backtracking into it. Backtracking past it to previous
3231 items, however, works as normal.
3232
3233 An alternative description is that a subpattern of this type matches
3234 the string of characters that an identical standalone pattern would
3235 match, if anchored at the current point in the subject string.
3236
3237 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3238 such as the above example can be thought of as a maximizing repeat that
3239 must swallow everything it can. So, while both \d+ and \d+? are pre-
3240 pared to adjust the number of digits they match in order to make the
3241 rest of the pattern match, (?>\d+) can only match an entire sequence of
3242 digits.
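
The effect of an atomic group can be sketched in Python, with one important assumption: older versions of Python's re module have no (?>...) syntax, so the sketch emulates (?>\d+) with the idiom (?=(\d+))\1. The lookahead captures the longest run of digits, cannot be backtracked into, and the back reference then consumes exactly that run. This is an emulation for illustration, not PCRE's implementation, and the pattern \d+\dfoo is made up to show a changed outcome.

```python
import re

ATOMIC_DIGITS = r"(?=(\d+))\1"  # stands in for (?>\d+)

# With all the digits locked up, "foo" either follows them or the match fails.
atomic_ok = re.search(ATOMIC_DIGITS + r"foo", "123456foo")
atomic_fail = re.search(ATOMIC_DIGITS + r"foo", "123456bar")

# Ordinary \d+ gives back one digit so that the following \d can match.
plain = re.search(r"\d+\dfoo", "123foo")

# The atomic form never gives a digit back, so the same subject fails.
atomic_changed = re.search(ATOMIC_DIGITS + r"\dfoo", "123foo")

print(bool(atomic_ok), bool(atomic_fail), bool(plain), bool(atomic_changed))
```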
3243
3244 Atomic groups in general can of course contain arbitrarily complicated
3245 subpatterns, and can be nested. However, when the subpattern for an
3246 atomic group is just a single repeated item, as in the example above, a
3247 simpler notation, called a "possessive quantifier" can be used. This
3248 consists of an additional + character following a quantifier. Using
3249 this notation, the previous example can be rewritten as
3250
3251 \d++foo
3252
3253 Possessive quantifiers are always greedy; the setting of the
3254 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
3255 simpler forms of atomic group. However, there is no difference in the
3256 meaning or processing of a possessive quantifier and the equivalent
3257 atomic group.
3258
3259 The possessive quantifier syntax is an extension to the Perl syntax. It
3260 originates in Sun's Java package.
3261
3262 When a pattern contains an unlimited repeat inside a subpattern that
3263 can itself be repeated an unlimited number of times, the use of an
3264 atomic group is the only way to avoid some failing matches taking a
3265 very long time indeed. The pattern
3266
3267 (\D+|<\d+>)*[!?]
3268
3269 matches an unlimited number of substrings that either consist of non-
3270 digits, or digits enclosed in <>, followed by either ! or ?. When it
3271 matches, it runs quickly. However, if it is applied to
3272
3273 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3274
3275 it takes a long time before reporting failure. This is because the
3276 string can be divided between the internal \D+ repeat and the external
3277 * repeat in a large number of ways, and all have to be tried. (The
3278 example uses [!?] rather than a single character at the end, because
3279 both PCRE and Perl have an optimization that allows for fast failure
3280 when a single character is used. They remember the last single charac-
3281 ter that is required for a match, and fail early if it is not present
3282 in the string.) If the pattern is changed so that it uses an atomic
3283 group, like this:
3284
3285 ((?>\D+)|<\d+>)*[!?]
3286
3287 sequences of non-digits cannot be broken, and failure happens quickly.
3288
3289
3290 BACK REFERENCES
3291
3292 Outside a character class, a backslash followed by a digit greater than
3293 0 (and possibly further digits) is a back reference to a capturing sub-
3294 pattern earlier (that is, to its left) in the pattern, provided there
3295 have been that many previous capturing left parentheses.
3296
3297 However, if the decimal number following the backslash is less than 10,
3298 it is always taken as a back reference, and causes an error only if
3299 there are not that many capturing left parentheses in the entire pat-
3300 tern. In other words, the parentheses that are referenced need not be
3301 to the left of the reference for numbers less than 10. See the subsec-
3302 tion entitled "Non-printing characters" above for further details of
3303 the handling of digits following a backslash.
3304
3305 A back reference matches whatever actually matched the capturing sub-
3306 pattern in the current subject string, rather than anything matching
3307 the subpattern itself (see "Subpatterns as subroutines" below for a way
3308 of doing that). So the pattern
3309
3310 (sens|respons)e and \1ibility
3311
3312 matches "sense and sensibility" and "response and responsibility", but
3313 not "sense and responsibility". If caseful matching is in force at the
3314 time of the back reference, the case of letters is relevant. For exam-
3315 ple,
3316
3317 ((?i)rah)\s+\1
3318
3319 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
3320 original capturing subpattern is matched caselessly.
3321
3322 Back references to named subpatterns use the Python syntax (?P=name).
3323 We could rewrite the above example as follows:
3324
3325 (?P<p1>(?i)rah)\s+(?P=p1)
3326
3327 There may be more than one back reference to the same subpattern. If a
3328 subpattern has not actually been used in a particular match, any back
3329 references to it always fail. For example, the pattern
3330
3331 (a|(bc))\2
3332
3333 always fails if it starts to match "a" rather than "bc". Because there
3334 may be many capturing parentheses in a pattern, all digits following
3335 the backslash are taken as part of a potential back reference number.
3336 If the pattern continues with a digit character, some delimiter must be
3337 used to terminate the back reference. If the PCRE_EXTENDED option is
3338 set, this can be whitespace. Otherwise an empty comment (see "Com-
3339 ments" below) can be used.
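
The back reference examples above can be tried in a short sketch. Python's re module is used as a stand-in for PCRE, with one adaptation that is an assumption of this sketch: recent Python versions reject (?i) in the middle of a pattern, so the caseless subpattern is written with the scoped form (?i:rah).

```python
import re

austen = re.compile(r"(sens|respons)e and \1ibility")

# The subpattern is matched caselessly, but the named back reference lies
# outside the scoped option and is therefore caseful.
rah = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

# Group 2 is not set when the "a" branch is taken, so \2 cannot match.
unset = re.compile(r"(a|(bc))\2")

print(bool(austen.fullmatch("sense and sensibility")),
      bool(rah.fullmatch("RAH rah")),
      bool(unset.fullmatch("bcbc")))
```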
3340
3341 A back reference that occurs inside the parentheses to which it refers
3342 fails when the subpattern is first used, so, for example, (a\1) never
3343 matches. However, such references can be useful inside repeated sub-
3344 patterns. For example, the pattern
3345
3346 (a|b\1)+
3347
3348 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
3349 ation of the subpattern, the back reference matches the character
3350 string corresponding to the previous iteration. In order for this to
3351 work, the pattern must be such that the first iteration does not need
3352 to match the back reference. This can be done using alternation, as in
3353 the example above, or by a quantifier with a minimum of zero.
3354
3355
3356 ASSERTIONS
3357
3358 An assertion is a test on the characters following or preceding the
3359 current matching point that does not actually consume any characters.
3360 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
3361 described above.
3362
3363 More complicated assertions are coded as subpatterns. There are two
3364 kinds: those that look ahead of the current position in the subject
3365 string, and those that look behind it. An assertion subpattern is
3366 matched in the normal way, except that it does not cause the current
3367 matching position to be changed.
3368
3369 Assertion subpatterns are not capturing subpatterns, and may not be
3370 repeated, because it makes no sense to assert the same thing several
3371 times. If any kind of assertion contains capturing subpatterns within
3372 it, these are counted for the purposes of numbering the capturing sub-
3373 patterns in the whole pattern. However, substring capturing is carried
3374 out only for positive assertions, because it does not make sense for
3375 negative assertions.
3376
3377 Lookahead assertions
3378
3379 Lookahead assertions start with (?= for positive assertions and (?! for
3380 negative assertions. For example,
3381
3382 \w+(?=;)
3383
3384 matches a word followed by a semicolon, but does not include the semi-
3385 colon in the match, and
3386
3387 foo(?!bar)
3388
3389 matches any occurrence of "foo" that is not followed by "bar". Note
3390 that the apparently similar pattern
3391
3392 (?!foo)bar
3393
3394 does not find an occurrence of "bar" that is preceded by something
3395 other than "foo"; it finds any occurrence of "bar" whatsoever, because
3396 the assertion (?!foo) is always true when the next three characters are
3397 "bar". A lookbehind assertion is needed to achieve the other effect.
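
The three lookahead patterns can be compared in a short sketch. Python's re module is used as a stand-in for PCRE (an assumption; lookahead assertions behave the same way in both), and the subject strings are made up for the illustration.

```python
import re

# The semicolon is required but is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Only the second "foo" is not followed by "bar".
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" starts, where it is trivially true, so the
# "bar" in "foobar" is still found.
bar_offset = re.search(r"(?!foo)bar", "foobar").start()

print(words, foo_offsets, bar_offset)
```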
3398
3399 If you want to force a matching failure at some point in a pattern, the
3400 most convenient way to do it is with (?!) because an empty string
3401 always matches, so an assertion that requires there not to be an empty
3402 string must always fail.
3403
3404 Lookbehind assertions
3405
3406 Lookbehind assertions start with (?<= for positive assertions and (?<!
3407 for negative assertions. For example,
3408
3409 (?<!foo)bar
3410
3411 does find an occurrence of "bar" that is not preceded by "foo". The
3412 contents of a lookbehind assertion are restricted such that all the
3413 strings it matches must have a fixed length. However, if there are sev-
3414 eral alternatives, they do not all have to have the same fixed length.
3415 Thus
3416
3417 (?<=bullock|donkey)
3418
3419 is permitted, but
3420
3421 (?<!dogs?|cats?)
3422
3423 causes an error at compile time. Branches that match different length
3424 strings are permitted only at the top level of a lookbehind assertion.
3425 This is an extension compared with Perl (at least for 5.8), which
3426 requires all branches to match the same length of string. An assertion
3427 such as
3428
3429 (?<=ab(c|de))
3430
3431 is not permitted, because its single top-level branch can match two
3432 different lengths, but it is acceptable if rewritten to use two top-
3433 level branches:
3434
3435 (?<=abc|abde)
3436
3437 The implementation of lookbehind assertions is, for each alternative,
3438 to temporarily move the current position back by the fixed width and
3439 then try to match. If there are insufficient characters before the cur-
3440 rent position, the match is deemed to fail.
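
A sketch of lookbehind behaviour, including the case of insufficient preceding characters. Python's re module is used as a stand-in for PCRE, with a caveat: Python requires the whole lookbehind to have a single fixed width, so the alternatives of different lengths that PCRE permits are not reproduced here. The subject strings and the pattern (?<=abc)d are made up for the illustration.

```python
import re

# "bar" at offset 3 is preceded by "foo"; the other two are not.
offsets = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar bar")]

# Only one character precedes "d", so the three-character lookbehind fails.
too_short = re.search(r"(?<=abc)d", "cd")

# At the very start nothing can precede "bar", so the negative form succeeds.
at_start = re.search(r"(?<!foo)bar", "bar")

print(offsets, too_short, at_start.start())
```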
3441
3442 PCRE does not allow the \C escape (which matches a single byte in UTF-8
3443 mode) to appear in lookbehind assertions, because it makes it impossi-
3444 ble to calculate the length of the lookbehind. The \X escape, which can
3445 match different numbers of bytes, is also not permitted.
3446
3447 Atomic groups can be used in conjunction with lookbehind assertions to
3448 specify efficient matching at the end of the subject string. Consider a
3449 simple pattern such as
3450
3451 abcd$
3452
3453 when applied to a long string that does not match. Because matching
3454 proceeds from left to right, PCRE will look for each "a" in the subject
3455 and then see if what follows matches the rest of the pattern. If the
3456 pattern is specified as
3457
3458 ^.*abcd$
3459
3460 the initial .* matches the entire string at first, but when this fails
3461 (because there is no following "a"), it backtracks to match all but the
3462 last character, then all but the last two characters, and so on. Once
3463 again the search for "a" covers the entire string, from right to left,
3464 so we are no better off. However, if the pattern is written as
3465
3466 ^(?>.*)(?<=abcd)
3467
3468 or, equivalently, using the possessive quantifier syntax,
3469
3470 ^.*+(?<=abcd)
3471
3472 there can be no backtracking for the .* item; it can match only the
3473 entire string. The subsequent lookbehind assertion does a single test
3474 on the last four characters. If it fails, the match fails immediately.
3475 For long strings, this approach makes a significant difference to the
3476 processing time.
3477
3478 Using multiple assertions
3479
3480 Several assertions (of any sort) may occur in succession. For example,
3481
3482 (?<=\d{3})(?<!999)foo
3483
3484 matches "foo" preceded by three digits that are not "999". Notice that
3485 each of the assertions is applied independently at the same point in
3486 the subject string. First there is a check that the previous three
3487 characters are all digits, and then there is a check that the same
3488 three characters are not "999". This pattern does not match "foo" pre-
3489 ceded by six characters, the first three of which are digits and the
3490 last three of which are not "999". For example, it doesn't match
3491 "123abcfoo". A pattern to do that is
3492
3493 (?<=\d{3}...)(?<!999)foo
3494
3495 This time the first assertion looks at the preceding six characters,
3496 checking that the first three are digits, and then the second assertion
3497 checks that the preceding three characters are not "999".
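
The way two assertions are applied independently at the same point can be imitated by hand. The sketch below models only the first pattern, (?<=\d{3})(?<!999)foo, in plain C without calling PCRE; the helper names are invented for illustration.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* (?<=\d{3}): the three characters before s[pos] exist and are digits. */
static int three_digits_before(const char *s, size_t pos)
{
    return pos >= 3 && isdigit((unsigned char)s[pos - 3])
                    && isdigit((unsigned char)s[pos - 2])
                    && isdigit((unsigned char)s[pos - 1]);
}

/* (?<!999): the three characters before s[pos] are not "999". */
static int not_999_before(const char *s, size_t pos)
{
    return pos < 3 || memcmp(s + pos - 3, "999", 3) != 0;
}

/* Offset of the first "foo" that satisfies both assertions, or -1. */
long find_foo(const char *s)
{
    size_t len = strlen(s);
    for (size_t pos = 0; pos + 3 <= len; pos++) {
        /* Both checks look backwards from the same position. */
        if (three_digits_before(s, pos) && not_999_before(s, pos)
                && memcmp(s + pos, "foo", 3) == 0)
            return (long)pos;
    }
    return -1;
}
```

Because both helpers inspect the same three characters, "123abcfoo" is rejected: the characters before "foo" are "abc", which are not digits.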
3498
3499 Assertions can be nested in any combination. For example,
3500
3501 (?<=(?<!foo)bar)baz
3502
3503 matches an occurrence of "baz" that is preceded by "bar" which in turn
3504 is not preceded by "foo", while
3505
3506 (?<=\d{3}(?!999)...)foo
3507
3508 is another pattern that matches "foo" preceded by three digits and any
3509 three characters that are not "999".
3510
3511
3512 CONDITIONAL SUBPATTERNS
3513
3514 It is possible to cause the matching process to obey a subpattern con-
3515 ditionally or to choose between two alternative subpatterns, depending
3516 on the result of an assertion, or whether a previous capturing subpat-
3517 tern matched or not. The two possible forms of conditional subpattern
3518 are
3519
3520 (?(condition)yes-pattern)
3521 (?(condition)yes-pattern|no-pattern)
3522
3523 If the condition is satisfied, the yes-pattern is used; otherwise the
3524 no-pattern (if present) is used. If there are more than two alterna-
3525 tives in the subpattern, a compile-time error occurs.
3526
3527 There are three kinds of condition. If the text between the parentheses
3528 consists of a sequence of digits, the condition is satisfied if the
3529 capturing subpattern of that number has previously matched. The number
3530 must be greater than zero. Consider the following pattern, which con-
3531 tains non-significant white space to make it more readable (assume the
3532 PCRE_EXTENDED option) and to divide it into three parts for ease of
3533 discussion:
3534
3535 ( \( )? [^()]+ (?(1) \) )
3536
3537 The first part matches an optional opening parenthesis, and if that
3538 character is present, sets it as the first captured substring. The sec-
3539 ond part matches one or more characters that are not parentheses. The
3540 third part is a conditional subpattern that tests whether the first set
3541 of parentheses matched or not. If they did, that is, if the subject started
3542 with an opening parenthesis, the condition is true, and so the yes-pat-
3543 tern is executed and a closing parenthesis is required. Otherwise,
3544 since no-pattern is not present, the subpattern matches nothing. In
3545 other words, this pattern matches a sequence of non-parentheses,
3546 optionally enclosed in parentheses.
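
As an illustration of the logic only, here is a hand-written C function that behaves like a single match attempt of this pattern at the start of a string. It does not use PCRE, and the name match_optional_parens() is invented for this sketch.

```c
#include <stddef.h>
#include <string.h>

/*
 * Models one attempt of  ( \( )? [^()]+ (?(1) \) )  at offset 0.
 * Returns the length of the match, or -1 if the attempt fails.
 */
long match_optional_parens(const char *s)
{
    size_t i = 0;
    int group1_set = 0;

    if (s[i] == '(') {                  /* ( \( )? : optional opening paren */
        group1_set = 1;
        i++;
    }

    size_t run = strcspn(s + i, "()");  /* [^()]+ : one or more non-parens */
    if (run == 0)
        return -1;
    i += run;

    if (group1_set) {                   /* (?(1) \) ) : only if group 1 matched */
        if (s[i] != ')')
            return -1;
        i++;
    }
    return (long)i;
}
```

The variable group1_set plays the part of the condition (1). Note that this models a single attempt: for a subject such as "(abc", an unanchored PCRE match would go on to try later starting positions and find "abc".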
3547
3548 If the condition is the string (R), it is satisfied if a recursive call
3549 to the pattern or subpattern has been made. At "top level", the condi-
3550 tion is false. This is a PCRE extension. Recursive patterns are
3551 described in the next section.
3552
3553 If the condition is not a sequence of digits or (R), it must be an
3554 assertion. This may be a positive or negative lookahead or lookbehind
3555 assertion. Consider this pattern, again containing non-significant
3556 white space, and with the two alternatives on the second line:
3557
3558 (?(?=[^a-z]*[a-z])
3559 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
3560
3561 The condition is a positive lookahead assertion that matches an
3562 optional sequence of non-letters followed by a letter. In other words,
3563 it tests for the presence of at least one letter in the subject. If a
3564 letter is found, the subject is matched against the first alternative;
3565 otherwise it is matched against the second. This pattern matches
3566 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
3567 letters and dd are digits.
3568
3569
3570 COMMENTS
3571
3572 The sequence (?# marks the start of a comment that continues up to the
3573 next closing parenthesis. Nested parentheses are not permitted. The
3574 characters that make up a comment play no part in the pattern matching
3575 at all.
3576
3577 If the PCRE_EXTENDED option is set, an unescaped # character outside a
3578 character class introduces a comment that continues up to the next new-
3579 line character in the pattern.
3580
3581
3582 RECURSIVE PATTERNS
3583
3584 Consider the problem of matching a string in parentheses, allowing for
3585 unlimited nested parentheses. Without the use of recursion, the best
3586 that can be done is to use a pattern that matches up to some fixed
3587 depth of nesting. It is not possible to handle an arbitrary nesting
3588 depth. Perl provides a facility that allows regular expressions to
3589 recurse (amongst other things). It does this by interpolating Perl code
3590 in the expression at run time, and the code can refer to the expression
3591 itself. A Perl pattern to solve the parentheses problem can be created
3592 like this:
3593
3594 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
3595
3596 The (?p{...}) item interpolates Perl code at run time, and in this case
3597 refers recursively to the pattern in which it appears. Obviously, PCRE
3598 cannot support the interpolation of Perl code. Instead, it supports
3599 some special syntax for recursion of the entire pattern, and also for
3600 individual subpattern recursion.
3601
3602 The special item that consists of (? followed by a number greater than
3603 zero and a closing parenthesis is a recursive call of the subpattern of
3604 the given number, provided that it occurs inside that subpattern. (If
3605 not, it is a "subroutine" call, which is described in the next sec-
3606 tion.) The special item (?R) is a recursive call of the entire regular
3607 expression.
3608
3609 For example, this PCRE pattern solves the nested parentheses problem
3610 (assume the PCRE_EXTENDED option is set so that white space is
3611 ignored):
3612
3613 \( ( (?>[^()]+) | (?R) )* \)
3614
3615 First it matches an opening parenthesis. Then it matches any number of
3616 substrings which can either be a sequence of non-parentheses, or a
3617 recursive match of the pattern itself (that is, a correctly parenthe-
3618 sized substring). Finally there is a closing parenthesis.
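
The structure of the recursive pattern maps directly onto a recursive C function. The following is a sketch of that correspondence, not of how PCRE executes the pattern; match_nested() is a name made up for this example.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hand-written counterpart of  \( ( (?>[^()]+) | (?R) )* \)
 * starting at s[pos]. Returns the offset just past the matching ')',
 * or -1 if the parentheses are not balanced.
 */
long match_nested(const char *s, size_t pos)
{
    if (s[pos] != '(')                        /* \( */
        return -1;
    pos++;
    for (;;) {
        pos += strcspn(s + pos, "()");        /* (?>[^()]+): take the whole run */
        if (s[pos] == '(') {                  /* (?R): recurse on a nested group */
            long end = match_nested(s, pos);
            if (end < 0)
                return -1;
            pos = (size_t)end;
        } else if (s[pos] == ')') {           /* \) */
            return (long)(pos + 1);
        } else {
            return -1;                        /* subject ended before the ')' */
        }
    }
}
```

Taking each run of non-parentheses in one step, with no way to give characters back, corresponds to the atomic group in the pattern.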
3619
3620 If this were part of a larger pattern, you would not want to recurse
3621 the entire pattern, so instead you could use this:
3622
3623 ( \( ( (?>[^()]+) | (?1) )* \) )
3624
3625 We have put the pattern into parentheses, and caused the recursion to
3626 refer to them instead of the whole pattern. In a larger pattern, keep-
3627 ing track of parenthesis numbers can be tricky. It may be more conve-
3628 nient to use named parentheses instead. For this, PCRE uses (?P>name),
3629 which is an extension to the Python syntax that PCRE uses for named
3630 parentheses (Perl does not provide named parentheses). We could rewrite
3631 the above example as follows:
3632
3633 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
3634
3635 This particular example pattern contains nested unlimited repeats, and
3636 so the use of atomic grouping for matching strings of non-parentheses
3637 is important when applying the pattern to strings that do not match.
3638 For example, when this pattern is applied to
3639
3640 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
3641
3642 it yields "no match" quickly. However, if atomic grouping is not used,
3643 the match runs for a very long time indeed because there are so many
3644 different ways the + and * repeats can carve up the subject, and all
3645 have to be tested before failure can be reported.
3646
3647 At the end of a match, the values set for any capturing subpatterns are
3648 those from the outermost level of the recursion at which the subpattern
3649 value is set. If you want to obtain intermediate values, a callout
3650 function can be used (see the next section and the pcrecallout documen-
3651 tation). If the pattern above is matched against
3652
3653 (ab(cd)ef)
3654
3655 the value for the capturing parentheses is "ef", which is the last
3656 value taken on at the top level. If additional parentheses are added,
3657 giving
3658
3659 \( ( ( (?>[^()]+) | (?R) )* ) \)
3660    ^                        ^
3661    ^                        ^
3662
3663 the string they capture is "ab(cd)ef", the contents of the top level
3664 parentheses. If there are more than 15 capturing parentheses in a pat-
3665 tern, PCRE has to obtain extra memory to store data during a recursion,
3666 which it does by using pcre_malloc, freeing it via pcre_free after-
3667 wards. If no memory can be obtained, the match fails with the
3668 PCRE_ERROR_NOMEMORY error.
3669
3670 Do not confuse the (?R) item with the condition (R), which tests for
3671 recursion. Consider this pattern, which matches text in angle brack-
3672 ets, allowing for arbitrary nesting. Only digits are allowed in nested
3673 brackets (that is, when recursing), whereas any characters are permit-
3674 ted at the outer level.
3675
3676 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
3677
3678 In this pattern, (?(R) is the start of a conditional subpattern, with
3679 two different alternatives for the recursive and non-recursive cases.
3680 The (?R) item is the actual recursive call.
3681
3682
3683 SUBPATTERNS AS SUBROUTINES
3684
3685 If the syntax for a recursive subpattern reference (either by number or
3686 by name) is used outside the parentheses to which it refers, it oper-
3687 ates like a subroutine in a programming language. An earlier example
3688 pointed out that the pattern
3689
3690 (sens|respons)e and \1ibility
3691
3692 matches "sense and sensibility" and "response and responsibility", but
3693 not "sense and responsibility". If instead the pattern
3694
3695 (sens|respons)e and (?1)ibility
3696
3697 is used, it does match "sense and responsibility" as well as the other
3698 two strings. Such references must, however, follow the subpattern to
3699 which they refer.
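
The difference between a back reference and a subroutine call can be shown with a small hand-written matcher. This sketch assumes the whole string must match and does not use PCRE; the function and parameter names are invented.

```c
#include <stddef.h>
#include <string.h>

static const char *const stems[] = { "sens", "respons" };

/* Length of prefix if s starts with it, otherwise 0. */
static size_t starts_with(const char *s, const char *prefix)
{
    size_t n = strlen(prefix);
    return strncmp(s, prefix, n) == 0 ? n : 0;
}

/*
 * same_text != 0 models  (sens|respons)e and \1ibility   (back reference)
 * same_text == 0 models  (sens|respons)e and (?1)ibility (subroutine call)
 */
int sense_pattern_matches(const char *s, int same_text)
{
    for (size_t i = 0; i < 2; i++) {
        size_t n = starts_with(s, stems[i]);
        if (n == 0)
            continue;
        size_t m = starts_with(s + n, "e and ");
        if (m == 0)
            continue;
        const char *rest = s + n + m;
        for (size_t j = 0; j < 2; j++) {
            if (same_text && j != i)
                continue;             /* \1 must repeat the captured text */
            size_t k = starts_with(rest, stems[j]);
            if (k != 0 && strcmp(rest + k, "ibility") == 0)
                return 1;             /* (?1) re-runs the subpattern instead */
        }
    }
    return 0;
}
```

With same_text set, only the text captured the first time is accepted; without it, either alternative of the subpattern may match the second time.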
3700
3701
3702 CALLOUTS
3703
3704 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
3705 Perl code to be obeyed in the middle of matching a regular expression.
3706 This makes it possible, amongst other things, to extract different sub-
3707 strings that match the same pair of parentheses when there is a repeti-
3708 tion.
3709
3710 PCRE provides a similar feature, but of course it cannot obey arbitrary
3711 Perl code. The feature is called "callout". The caller of PCRE provides
3712 an external function by putting its entry point in the global variable
3713 pcre_callout. By default, this variable contains NULL, which disables
3714 all calling out.
3715
3716 Within a regular expression, (?C) indicates the points at which the
3717 external function is to be called. If you want to identify different
3718 callout points, you can put a number less than 256 after the letter C.
3719 The default value is zero. For example, this pattern has two callout
3720 points:
3721
3722 (?C1)abc(?C2)def
3723
3724 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
3725 automatically installed before each item in the pattern. They are all
3726 numbered 255.
3727
3728 During matching, when PCRE reaches a callout point (and pcre_callout is
3729 set), the external function is called. It is provided with the number
3730 of the callout, the position in the pattern, and, optionally, one item
3731 of data originally supplied by the caller of pcre_exec(). The callout
3732 function may cause matching to proceed, to backtrack, or to fail alto-
3733 gether. A complete description of the interface to the callout function
3734 is given in the pcrecallout documentation.
3735
3736 Last updated: 28 February 2005
3737 Copyright (c) 1997-2005 University of Cambridge.
3738 -----------------------------------------------------------------------------
3739
3740
3741
3742 NAME
3743 PCRE - Perl-compatible regular expressions
3744
3745
3746 PARTIAL MATCHING IN PCRE
3747
3748 In normal use of PCRE, if the subject string that is passed to
3749 pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is too
3750 short to match the entire pattern, PCRE_ERROR_NOMATCH is returned.
3751 There are circumstances where it might be helpful to distinguish this
3752 case from other cases in which there is no match.
3753
3754 Consider, for example, an application where a human is required to type
3755 in data for a field with specific formatting requirements. An example
3756 might be a date in the form ddmmmyy, defined by this pattern:
3757
3758 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
3759
3760 If the application sees the user's keystrokes one by one, and can check
3761 that what has been typed so far is potentially valid, it is able to
3762 raise an error as soon as a mistake is made, possibly beeping and not
3763 reflecting the character that has been typed. This immediate feedback
3764 is likely to be a better user interface than a check that is delayed
3765 until the entire string has been entered.
3766
3767 PCRE supports the concept of partial matching by means of the PCRE_PAR-
3768 TIAL option, which can be set when calling pcre_exec() or
3769 pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
3770 PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
3771 during the matching process the last part of the subject string matched
3772 part of the pattern. Unfortunately, for non-anchored matching, it is
3773 not possible to obtain the position of the start of the partial match.
3774 No captured data is set when PCRE_ERROR_PARTIAL is returned.
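
To make the three possible outcomes concrete, here is a hand-written classifier for the date pattern above. It is a model of the decision an application would make from the return value of pcre_exec() (a match, PCRE_ERROR_PARTIAL, or PCRE_ERROR_NOMATCH); it does not call PCRE, and the enum and function names are invented for this sketch.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum date_status { DATE_NO_MATCH, DATE_PARTIAL, DATE_FULL };

static const char *const months[] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Try the pattern with a day field of exactly day_digits digits. */
static enum date_status try_day_width(const char *s, size_t n,
                                      size_t day_digits)
{
    size_t i = 0;

    for (size_t d = 0; d < day_digits; d++, i++) {
        if (i == n)
            return i > 0 ? DATE_PARTIAL : DATE_NO_MATCH;
        if (!isdigit((unsigned char)s[i]))
            return DATE_NO_MATCH;
    }
    if (i == n)
        return DATE_PARTIAL;            /* input ran out before the month */

    size_t take = (n - i < 3) ? n - i : 3;
    int month_ok = 0;
    for (size_t m = 0; m < 12; m++)
        if (strncmp(s + i, months[m], take) == 0)
            month_ok = 1;               /* typed letters begin some month */
    if (!month_ok)
        return DATE_NO_MATCH;
    i += take;
    if (take < 3)
        return DATE_PARTIAL;            /* input ran out inside the month */

    for (size_t y = 0; y < 2; y++, i++) {
        if (i == n)
            return DATE_PARTIAL;        /* input ran out inside the year */
        if (!isdigit((unsigned char)s[i]))
            return DATE_NO_MATCH;
    }
    return i == n ? DATE_FULL : DATE_NO_MATCH;  /* $ : nothing may follow */
}

/* Classify the text typed so far against ^\d?\d(jan|...|dec)\d\d$ */
enum date_status classify_date(const char *typed)
{
    size_t n = strlen(typed);
    enum date_status one = try_day_width(typed, n, 1);
    enum date_status two = try_day_width(typed, n, 2);
    return one > two ? one : two;       /* no match < partial < full */
}
```

An input routine could call classify_date() after each keystroke and reject the character as soon as the result is DATE_NO_MATCH. In this model an empty input is treated as no match, which is an arbitrary choice.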
3775
3776 When PCRE_PARTIAL is set for pcre_dfa_exec(), the return code
3777 PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of
3778 the subject is reached, there have been no complete matches, but there
3779 is still at least one matching possibility. The portion of the string
3780 that provided the partial match is set as the first matching string.
3781
3782 Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
3783 the last literal byte in a pattern, and abandons matching immediately
3784 if such a byte is not present in the subject string. This optimization
3785 cannot be used for a subject string that might match only partially.
3786
3787
3788 RESTRICTED PATTERNS FOR PCRE_PARTIAL
3789
3790 Because of the way certain internal optimizations are implemented in
3791 the pcre_exec() function, the PCRE_PARTIAL option cannot be used with
3792 all patterns. These restrictions do not apply when pcre_dfa_exec() is
3793 used. For pcre_exec(), repeated single characters such as
3794
3795 a{2,4}
3796
3797 and repeated single metasequences such as
3798
3799 \d+
3800
3801 are not permitted if the maximum number of occurrences is greater than
3802 one. Optional items such as \d? (where the maximum is one) are permit-
3803 ted. Quantifiers with any values are permitted after parentheses, so
3804 the invalid examples above can be coded thus:
3805
3806 (a){2,4}
3807 (\d)+
3808
3809 These constructions run more slowly, but for the kinds of application
3810 that are envisaged for this facility, this is not felt to be a major
3811 restriction.
3812
3813 If PCRE_PARTIAL is set for a pattern that does not conform to the
3814 restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
3815 (-13).
3816
3817
3818 EXAMPLE OF PARTIAL MATCHING USING PCRETEST
3819
3820 If the escape sequence \P is present in a pcretest data line, the
3821 PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
3822 uses the date example quoted above:
3823
3824 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
3825 data> 25jun04\P
3826 0: 25jun04
3827 1: jun
3828 data> 25dec3\P
3829 Partial match
3830 data> 3ju\P
3831 Partial match
3832 data> 3juj\P
3833 No match
3834 data> j\P
3835 No match
3836
3837 The first data string is matched completely, so pcretest shows the
3838 matched substrings. The remaining four strings do not match the com-
3839 plete pattern, but the first two are partial matches. The same test,
3840 using DFA matching (by means of the \D escape sequence), produces the
3841 following output:
3842
3843 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
3844 data> 25jun04\P\D
3845 0: 25jun04
3846 data> 23dec3\P\D
3847 Partial match: 23dec3
3848 data> 3ju\P\D
3849 Partial match: 3ju
3850 data> 3juj\P\D
3851 No match
3852 data> j\P\D
3853 No match
3854
3855 Notice that in this case the portion of the string that was matched is
3856 made available.
3857
3858
3859 MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
3860
3861 When a partial match has been found using pcre_dfa_exec(), it is possi-
3862 ble to continue the match by providing additional subject data and
3863 calling pcre_dfa_exec() again with the PCRE_DFA_RESTART option and the
3864 same working space (where details of the previous partial match are
3865 stored). Here is an example using pcretest, where the \R escape
3866 sequence sets the PCRE_DFA_RESTART option and the \D escape sequence
3867 requests the use of pcre_dfa_exec():
3868
3869 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
3870 data> 23ja\P\D
3871 Partial match: 23ja
3872 data> n05\R\D
3873 0: n05
3874
3875 The first call has "23ja" as the subject, and requests partial match-
3876 ing; the second call has "n05" as the subject for the continued
3877 (restarted) match. Notice that when the match is complete, only the
3878 last part is shown; PCRE does not retain the previously partially-
3879 matched string. It is up to the calling program to do that if it needs
3880 to.
3881
3882 This facility can be used to pass very long subject strings to
3883 pcre_dfa_exec(). However, some care is needed for certain types of pat-
3884 tern.
3885
3886 1. If the pattern contains tests for the beginning or end of a line,
3887 you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
3888 ate, when the subject string for any call does not contain the begin-
3889 ning or end of a line.
3890
3891 2. If the pattern contains backward assertions (including \b or \B),
3892 you need to arrange for some overlap in the subject strings to allow
3893 for this. For example, you could pass the subject in chunks that were
3894 500 bytes long, but in a buffer of 700 bytes, with the starting offset
3895 set to 200 and the previous 200 bytes at the start of the buffer.
3896
3897 3. Matching a subject string that is split into multiple segments does
3898 not always produce exactly the same result as matching over one single
3899 long string. The difference arises when there are multiple matching
3900 possibilities, because a partial match result is given only when there
3901 are no completed matches in a call to pcre_dfa_exec(). This means
3902 that as soon as the shortest match has been found, continuation to a
3903 new subject segment is no longer possible. Consider this pcretest
3904 example:
3905
3906 re> /dog(sbody)?/
3907 data> do\P\D
3908 Partial match: do
3909 data> gsb\R\P\D
3910 0: g
3911 data> dogsbody\D
3912 0: dogsbody
3913 1: dog
3914
3915 The pattern matches the words "dog" or "dogsbody". When the subject is
3916 presented in several parts ("do" and "gsb" being the first two) the
3917 match stops when "dog" has been found, and it is not possible to con-
3918 tinue. On the other hand, if "dogsbody" is presented as a single
3919 string, both matches are found.
3920
3921 Because of this phenomenon, it does not usually make sense to end a
3922 pattern that is going to be matched in this way with a variable repeat.
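
The overlap arrangement suggested in point 2 above might be managed along the following lines. This is a sketch under assumed sizes (500-byte chunks, 200 bytes of overlap); the structure and function names are invented, and the code only prepares the buffer without calling PCRE.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK   500   /* bytes of new subject data per call (assumed) */
#define OVERLAP 200   /* bytes of earlier data kept for backward assertions */

struct segment_buffer {
    char   data[OVERLAP + CHUNK];
    size_t length;        /* number of valid bytes in data */
    size_t start_offset;  /* where the new chunk begins within data */
};

void segment_init(struct segment_buffer *sb)
{
    sb->length = 0;
    sb->start_offset = 0;
}

/*
 * Keep the tail of the previous contents at the front of the buffer and
 * append the next chunk after it. Returns 0 on success, -1 if the chunk
 * is larger than CHUNK bytes.
 */
int segment_load(struct segment_buffer *sb, const char *chunk,
                 size_t chunk_len)
{
    if (chunk_len > CHUNK)
        return -1;
    size_t keep = sb->length < OVERLAP ? sb->length : OVERLAP;
    memmove(sb->data, sb->data + sb->length - keep, keep);
    memcpy(sb->data + keep, chunk, chunk_len);
    sb->start_offset = keep;
    sb->length = keep + chunk_len;
    return 0;
}
```

After each segment_load(), the fields data, length, and start_offset would supply the subject, length, and starting offset arguments of pcre_dfa_exec(), so that backward assertions at the start of the new chunk can see the preceding bytes.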
3923
3924 Last updated: 28 February 2005
3925 Copyright (c) 1997-2005 University of Cambridge.
3926 -----------------------------------------------------------------------------
3927
3928
3929
3930 NAME
3931 PCRE - Perl-compatible regular expressions
3932
3933
3934 SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
3935
3936 If you are running an application that uses a large number of regular
3937 expression patterns, it may be useful to store them in a precompiled
3938 form instead of having to compile them every time the application is
3939 run. If you are not using any private character tables (see the
3940 pcre_maketables() documentation), this is relatively straightforward.
3941 If you are using private tables, it is a little bit more complicated.
3942
3943 If you save compiled patterns to a file, you can copy them to a differ-
3944 ent host and run them there. This works even if the new host has the
3945 opposite endianness to the one on which the patterns were compiled.
3946 There may be a small performance penalty, but it should be insignifi-
3947 cant.
3948
3949
3950 SAVING A COMPILED PATTERN
3951 The value returned by pcre_compile() points to a single block of memory
3952 that holds the compiled pattern and associated data. You can find the
3953 length of this block in bytes by calling pcre_fullinfo() with an argu-
3954 ment of PCRE_INFO_SIZE. You can then save the data in any appropriate
3955 manner. Here is sample code that compiles a pattern and writes it to a
3956 file. It assumes that the variable fd refers to a file that is open for
3957 output:
3958
3959 int erroroffset, rc;
3960 size_t size; /* PCRE_INFO_SIZE fills in a size_t */
3961 const char *error; /* pcre_compile() expects a const char ** */
3962 pcre *re;
3963
3964 re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
3965 if (re == NULL) { ... handle errors ... }
3966 rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
3967 if (rc < 0) { ... handle errors ... }
3968 if (fwrite(re, 1, size, fd) != size) { ... handle errors ... }
3969
3970 In this example, the bytes that comprise the compiled pattern are
3971 copied exactly. Note that this is binary data that may contain any of
3972 the 256 possible byte values. On systems that make a distinction
3973 between binary and non-binary data, be sure that the file is opened for
3974 binary output.
3975
3976 If you want to write more than one pattern to a file, you will have to
3977 devise a way of separating them. For binary data, preceding each pat-
3978 tern with its length is probably the most straightforward approach.
3979 Another possibility is to write out the data in hexadecimal instead of
3980 binary, one pattern to a line.
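
One way the length-prefix approach could look is sketched below. The record layout (a 4-byte big-endian length followed by the raw bytes) and the function names are assumptions made for this example, not a format defined by PCRE. The length is written byte by byte so that the file layout does not depend on the endianness of the writing host.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write one record: a 4-byte big-endian length, then the bytes. */
int write_record(FILE *fp, const void *data, size_t size)
{
    unsigned char hdr[4];

    if (size > 0xFFFFFFFFu)
        return -1;                       /* does not fit the 4-byte prefix */
    hdr[0] = (unsigned char)((size >> 24) & 0xFF);
    hdr[1] = (unsigned char)((size >> 16) & 0xFF);
    hdr[2] = (unsigned char)((size >> 8) & 0xFF);
    hdr[3] = (unsigned char)(size & 0xFF);
    if (fwrite(hdr, 1, 4, fp) != 4)
        return -1;
    if (size > 0 && fwrite(data, 1, size, fp) != size)
        return -1;
    return 0;
}

/*
 * Read the next record into newly allocated memory and store its length
 * in *size_out. Returns NULL at end of file or on error; the caller
 * frees the result.
 */
void *read_record(FILE *fp, size_t *size_out)
{
    unsigned char hdr[4];

    if (fread(hdr, 1, 4, fp) != 4)
        return NULL;
    size_t size = ((size_t)hdr[0] << 24) | ((size_t)hdr[1] << 16)
                | ((size_t)hdr[2] << 8)  |  (size_t)hdr[3];
    void *buf = malloc(size > 0 ? size : 1);
    if (buf == NULL)
        return NULL;
    if (size > 0 && fread(buf, 1, size, fp) != size) {
        free(buf);
        return NULL;
    }
    *size_out = size;
    return buf;
}
```

In use, the data passed to write_record() would be the block returned by pcre_compile() together with the size obtained from PCRE_INFO_SIZE, and the file would be opened in binary mode as noted above.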
3981
3982 Saving compiled patterns in a file is only one possible way of storing
3983 them for later use. They could equally well be saved in a database, or
3984 in the memory of some daemon process that passes them via sockets to
3985 the processes that want them.
3986
3987 If the pattern has been studied, it is also possible to save the study
3988 data in a similar way to the compiled pattern itself. When studying
3989 generates additional information, pcre_study() returns a pointer to a
3990 pcre_extra data block. Its format is defined in the section on matching
3991 a pattern in the pcreapi documentation. The study_data field points to
3992 the binary study data, and this is what you must save (not the
3993 pcre_extra block itself). The length of the study data can be obtained
3994 by calling pcre_fullinfo() with an argument of PCRE_INFO_STUDYSIZE.
3995 Remember to check that pcre_study() did return a non-NULL value before
3996 trying to save the study data.
3997
3998
3999 RE-USING A PRECOMPILED PATTERN
4000
4001 Re-using a precompiled pattern is straightforward. Having reloaded it
4002 into main memory, you pass its pointer to pcre_exec() or
4003 pcre_dfa_exec() in the usual way. This should work even on another
4004 host, and even if that host has the opposite endianness to the one
4005 where the pattern was compiled.
4006
4007 However, if you passed a pointer to custom character tables when the
4008 pattern was compiled (the tableptr argument of pcre_compile()), you
4009 must now pass a similar pointer to pcre_exec() or pcre_dfa_exec(),
4010 because the value saved with the compiled pattern will obviously be
4011 nonsense. A field in a pcre_extra block is used to pass this data, as
4012 described in the section on matching a pattern in the pcreapi documen-
4013 tation.
4014
4015 If you did not provide custom character tables when the pattern was
4016 compiled, the pointer in the compiled pattern is NULL, which causes
4017 pcre_exec() to use PCRE's internal tables. Thus, you do not need to
4018 take any special action at run time in this case.
4019
4020 If you saved study data with the compiled pattern, you need to create
4021 your own pcre_extra data block and set the study_data field to point to
4022 the reloaded study data. You must also set the PCRE_EXTRA_STUDY_DATA
4023 bit in the flags field to indicate that study data is present. Then
4024 pass the pcre_extra block to pcre_exec() or pcre_dfa_exec() in the
4025 usual way.
4026
4027
4028 COMPATIBILITY WITH DIFFERENT PCRE RELEASES
4029
4030 The layout of the control block that is at the start of the data that
4031 makes up a compiled pattern was changed for release 5.0. If you have
4032 any saved patterns that were compiled with previous releases (not a
4033 facility that was previously advertised), you will have to recompile
4034 them for release 5.0. However, from now on, it should be possible to
4035 make changes in a compatible manner.
4036
4037 Last updated: 28 February 2005
4038 Copyright (c) 1997-2005 University of Cambridge.
4039 -----------------------------------------------------------------------------
4040
4041
4042
4043 NAME
4044 PCRE - Perl-compatible regular expressions
4045
4046
4047 PCRE PERFORMANCE
4048
4049 Certain items that may appear in regular expression patterns are more
4050 efficient than others. It is more efficient to use a character class
4051 like [aeiou] than a set of alternatives such as (a|e|i|o|u). In gen-
4052 eral, the simplest construction that provides the required behaviour is
4053 usually the most efficient. Jeffrey Friedl's book contains a lot of
4054 useful general discussion about optimizing regular expressions for
4055 efficient performance. This document contains a few observations about
4056 PCRE.
4057
4058 Using Unicode character properties (the \p, \P, and \X escapes) is
4059 slow, because PCRE has to scan a structure that contains data for over
4060 fifteen thousand characters whenever it needs a character's property.
4061 If you can find an alternative pattern that does not use character
4062 properties, it will probably be faster.
4063
4064 When a pattern begins with .* not in parentheses, or in parentheses
4065 that are not the subject of a backreference, and the PCRE_DOTALL option
4066 is set, the pattern is implicitly anchored by PCRE, since it can match
4067 only at the start of a subject string. However, if PCRE_DOTALL is not
4068 set, PCRE cannot make this optimization, because the . metacharacter
4069 does not then match a newline, and if the subject string contains new-
4070 lines, the pattern may match from the character immediately following
4071 one of them instead of from the very start. For example, the pattern
4072
4073 .*second
4074
4075 matches the subject "first\nand second" (where \n stands for a newline
4076 character), with the match starting at the seventh character. In order
4077 to do this, PCRE has to retry the match starting after every newline in
4078 the subject.
4079
4080 If you are using such a pattern with subject strings that do not con-
4081 tain newlines, the best performance is obtained by setting PCRE_DOTALL,
4082 or starting the pattern with ^.* or ^.*? to indicate explicit anchor-
4083 ing. That saves PCRE from having to scan along the subject looking for
4084 a newline to restart at.
4085
4086 Beware of patterns that contain nested indefinite repeats. These can
4087 take a long time to run when applied to a string that does not match.
4088 Consider the pattern fragment
4089
4090 (a+)*
4091
4092 This can match "aaaa" in 16 different ways, and this number increases
4093 very rapidly as the string gets longer. (The * repeat can match 0, 1,
4094 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
4095 repeats can match different numbers of times.) When the remainder of
4096 the pattern is such that the entire match is going to fail, PCRE has in
4097 principle to try every possible variation, and this can take an
4098 extremely long time.
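
To get a feel for how quickly the number of possibilities grows, the ways in which (a+)* can match at the start of a run of "a" characters can be counted by brute force. The function below is a counting exercise in plain C, not a model of PCRE's matcher; count_ways() is a name invented here.

```c
/*
 * Number of distinct ways (a+)* can match at the start of a run of n
 * "a" characters. The * repeat may stop at once, or the inner a+ may
 * take between 1 and n characters, after which the same choice arises
 * for the characters that are left.
 */
unsigned long count_ways(unsigned n)
{
    unsigned long total = 1;                 /* stop: no further iteration */
    for (unsigned take = 1; take <= n; take++)
        total += count_ways(n - take);       /* a+ consumes `take` characters */
    return total;
}
```

Calling count_ways() with increasing values of n shows the count doubling with every additional character, which is why a failing match against even a moderately long string can take so long.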
4099
4100 An optimization catches some of the simpler cases such as
4101
4102 (a+)*b
4103
4104 where a literal character follows. Before embarking on the standard
4105 matching procedure, PCRE checks that there is a "b" later in the sub-
4106 ject string, and if there is not, it fails the match immediately. How-
4107 ever, when there is no following literal this optimization cannot be
4108 used. You can see the difference by comparing the behaviour of
4109
4110 (a+)*\d
4111
4112 with the pattern above. The pattern ending in "b" fails almost instantly
4113 when applied to a whole line of "a" characters, whereas the one ending in
4114 \d takes an appreciable time with strings longer than about 20 characters.
4115
4116 In many cases, the solution to this kind of performance issue is to use
4117 an atomic group or a possessive quantifier.
4118
4119 Last updated: 28 February 2005
4120 Copyright (c) 1997-2005 University of Cambridge.
4121 -----------------------------------------------------------------------------
4122
4123
4124
4125 NAME
4126 PCRE - Perl-compatible regular expressions.
4127
4128
4129 SYNOPSIS OF POSIX API
4130
4131 #include <pcreposix.h>
4132
4133 int regcomp(regex_t *preg, const char *pattern,
4134 int cflags);
4135
4136 int regexec(regex_t *preg, const char *string,
4137 size_t nmatch, regmatch_t pmatch[], int eflags);
4138
4139 size_t regerror(int errcode, const regex_t *preg,
4140 char *errbuf, size_t errbuf_size);
4141
4142 void regfree(regex_t *preg);
4143
4144
4145 DESCRIPTION
4146
4147 This set of functions provides a POSIX-style API to the PCRE regular
4148 expression package. See the pcreapi documentation for a description of
4149 PCRE's native API, which contains much additional functionality.
4150
4151 The functions described here are just wrapper functions that ultimately
4152 call the PCRE native API. Their prototypes are defined in the
4153 pcreposix.h header file, and on Unix systems the library itself is
4154 called pcreposix.a, so can be accessed by adding -lpcreposix to the
4155 command for linking an application that uses them. Because the POSIX
4156 functions call the native ones, it is also necessary to add -lpcre.
4157
4158 I have implemented only those option bits that can be reasonably mapped
4159 to PCRE native options. In addition, the options REG_EXTENDED and
4160 REG_NOSUB are defined with the value zero. They have no effect, but
4161 since programs that are written to the POSIX interface often use them,
4162 this makes it easier to slot in PCRE as a replacement library. Other
4163 POSIX options are not even defined.
4164
4165 When PCRE is called via these functions, it is only the API that is
4166 POSIX-like in style. The syntax and semantics of the regular expres-
4167 sions themselves are still those of Perl, subject to the setting of
4168 various PCRE options, as described below. "POSIX-like in style" means
4169 that the API approximates to the POSIX definition; it is not fully
4170 POSIX-compatible, and in multi-byte encoding domains it is probably
4171 even less compatible.
4172
4173 The header for these functions is supplied as pcreposix.h to avoid any
4174 potential clash with other POSIX libraries. It can, of course, be
4175 renamed or aliased as regex.h, which is the "correct" name. It provides
4176 two structure types, regex_t for compiled internal forms, and reg-
4177 match_t for returning captured substrings. It also defines some con-
4178 stants whose names start with "REG_"; these are used for setting
4179 options and identifying error codes.
4180
4181
4182 COMPILING A PATTERN
4183
4184 The function regcomp() is called to compile a pattern into an internal
4185 form. The pattern is a C string terminated by a binary zero, and is
4186 passed in the argument pattern. The preg argument is a pointer to a
4187 regex_t structure that is used as a base for storing information about
4188 the compiled expression.
4189
4190 The argument cflags is either zero, or contains one or more of the bits
4191 defined by the following macros:
4192
4193 REG_DOTALL
4194
4195 The PCRE_DOTALL option is set when the expression is passed for compi-
4196 lation to the native function. Note that REG_DOTALL is not part of the
4197 POSIX standard.
4198
4199 REG_ICASE
4200
4201 The PCRE_CASELESS option is set when the expression is passed for com-
4202 pilation to the native function.
4203
4204 REG_NEWLINE
4205
4206 The PCRE_MULTILINE option is set when the expression is passed for com-
4207 pilation to the native function. Note that this does not mimic the
4208 defined POSIX behaviour for REG_NEWLINE (see the following section).
4209
4210 In the absence of these flags, no options are passed to the native
4211 function. This means that the regex is compiled with PCRE default
4212 semantics. In particular, the way it handles newline characters in the
4213 subject string is the Perl way, not the POSIX way. Note that setting
4214 PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
4215 It does not affect the way newlines are matched by . (they aren't) or
4216 by a negative class such as [^a] (they are).
4217
4218 The yield of regcomp() is zero on success, and non-zero otherwise. The
4219 preg structure is filled in on success, and one member of the structure
4220 is public: re_nsub contains the number of capturing subpatterns in the
4221 regular expression. Various error codes are defined in the header file.
4222
4223
4224 MATCHING NEWLINE CHARACTERS
4225
4226 This area is not simple, because POSIX and Perl take different views of
4227 things. It is not possible to get PCRE to obey POSIX semantics, but
4228 then PCRE was never intended to be a POSIX engine. The following table
4229 lists the different possibilities for matching newline characters in
4230 PCRE:
4231
4232                           Default   Change with
4233
4234 . matches newline           no        PCRE_DOTALL
4235 newline matches [^a]        yes       not changeable
4236 $ matches \n at end         yes       PCRE_DOLLAR_ENDONLY
4237 $ matches \n in middle      no        PCRE_MULTILINE
4238 ^ matches \n in middle      no        PCRE_MULTILINE
4239
4240 This is the equivalent table for POSIX:
4241
4242                           Default   Change with
4243
4244 . matches newline           yes       REG_NEWLINE
4245 newline matches [^a]        yes       REG_NEWLINE
4246 $ matches \n at end         no        REG_NEWLINE
4247 $ matches \n in middle      no        REG_NEWLINE
4248 ^ matches \n in middle      no        REG_NEWLINE
4249
4250 PCRE's behaviour is the same as Perl's, except that there is no equiva-
4251 lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
4252 no way to stop newline from matching [^a].
4253
4254 The default POSIX newline handling can be obtained by setting
4255 PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
4256 behave exactly as for the REG_NEWLINE action.
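
The rows of these tables can be probed with a small test helper. The
sketch below is illustrative only: the name matches is an assumption,
and it is written against the system <regex.h>, so as built it exercises
the POSIX table rather than PCRE. Swapping the header for pcreposix.h
(and linking -lpcreposix -lpcre) should instead show the behaviour the
PCRE table describes.

```c
#include <assert.h>
#include <regex.h>   /* swap for <pcreposix.h> to probe PCRE instead */
#include <stddef.h>

/* Hypothetical helper: 1 if `pattern` matches somewhere in `subject`,
   0 if it does not, -1 if the pattern fails to compile. */
int matches(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);

    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0)
        return -1;

    /* No substrings are requested, so nmatch is 0 and pmatch is NULL. */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With a POSIX library, matches("a.b", "a\nb", 0) gives 1 and
matches("a.b", "a\nb", REG_NEWLINE) gives 0, while "^b" against the same
subject matches only when REG_NEWLINE is set.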
4257
4258
4259 MATCHING A PATTERN
4260
4261 The function regexec() is called to match a compiled pattern preg
4262 against a given string, which is terminated by a zero byte, subject to
4263 the options in eflags. These can be:
4264
4265 REG_NOTBOL
4266
4267 The PCRE_NOTBOL option is set when calling the underlying PCRE matching
4268 function.
4269
4270 REG_NOTEOL
4271
4272 The PCRE_NOTEOL option is set when calling the underlying PCRE matching
4273 function.
4274
4275 The portion of the string that was matched, and also any captured sub-
4276 strings, are returned via the pmatch argument, which points to an array
4277 of nmatch structures of type regmatch_t, containing the members rm_so
4278 and rm_eo. These contain the offset to the first character of each sub-
4279 string and the offset to the first character after the end of each sub-
4280 string, respectively. The 0th element of the vector relates to the
4281 entire portion of string that was matched; subsequent elements relate
4282 to the capturing subpatterns of the regular expression. Unused entries
4283 in the array have both structure members set to -1.
4284
4285 A successful match yields a zero return; various error codes are
4286 defined in the header file, of which REG_NOMATCH is the "expected"
4287 failure code.
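
The use of pmatch can be sketched as follows. The helper names
split_key_value and copy_span, and the "key=value" pattern, are
assumptions chosen for illustration. The sketch includes the system
<regex.h>; with PCRE, include pcreposix.h instead and link with
-lpcreposix -lpcre.

```c
#include <assert.h>
#include <regex.h>   /* with PCRE: #include <pcreposix.h> */
#include <string.h>

/* Copy the substring described by `m` into dst, truncating if needed. */
static void copy_span(char *dst, size_t dstsize,
                      const char *subject, regmatch_t m)
{
    /* rm_so is the offset of the first character of the substring;
       rm_eo is the offset of the first character after it. */
    size_t len = (size_t)(m.rm_eo - m.rm_so);

    if (len >= dstsize)
        len = dstsize - 1;
    memcpy(dst, subject + m.rm_so, len);
    dst[len] = '\0';
}

/* Hypothetical helper: find "key=value" in subject and copy both parts.
   Returns 0 on a match, REG_NOMATCH or another error code otherwise. */
int split_key_value(const char *subject,
                    char *key, char *value, size_t bufsize)
{
    regex_t re;
    regmatch_t pm[3];   /* pm[0]: whole match; pm[1], pm[2]: captures */
    int rc;

    assert(subject != NULL && key != NULL && value != NULL && bufsize > 0);

    rc = regcomp(&re, "([a-z]+)=([0-9]+)", REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, subject, 3, pm, 0);
    if (rc == 0) {
        copy_span(key, bufsize, subject, pm[1]);
        copy_span(value, bufsize, subject, pm[2]);
    }

    regfree(&re);
    return rc;
}
```

Called on "set width=42 now", this leaves "width" in key and "42" in
value; a subject with no such pair yields REG_NOMATCH.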
4288
4289
4290 ERROR MESSAGES
4291
4292 The regerror() function maps a non-zero error code from either regcomp()
4293 or regexec() to a printable message. If preg is not NULL, the error
4294 should have arisen from the use of that structure. A message terminated
4295 by a binary zero is placed in errbuf. The length of the message,
4296 including the zero, is limited to errbuf_size. The yield of the func-
4297 tion is the size of buffer needed to hold the whole message.
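
Because the yield is the size needed for the whole message, regerror()
can be called twice: once to learn the size and once to fetch the text.
The sketch below assumes the library follows the POSIX rule that a zero
errbuf_size causes errbuf to be ignored. The name error_text is
hypothetical, and the system <regex.h> is used so that the code builds
without PCRE; with PCRE, include pcreposix.h instead.

```c
#include <assert.h>
#include <regex.h>   /* with PCRE: #include <pcreposix.h> */
#include <stdlib.h>

/* Hypothetical helper: return a malloc'd, zero-terminated copy of the
   message for errcode, or NULL on failure. The caller frees it. */
char *error_text(int errcode, const regex_t *preg)
{
    size_t needed;
    char *buf;

    assert(errcode != 0);

    /* First call: a zero-sized buffer only reports the size required,
       which includes the terminating zero. */
    needed = regerror(errcode, preg, NULL, 0);
    if (needed == 0)
        return NULL;

    buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);   /* second call: the text */
    return buf;
}
```

A typical caller passes the non-zero yield of regcomp() or regexec(),
prints the returned string, and then frees it.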
4298
4299
4300 MEMORY USAGE
4301
4302 Compiling a regular expression causes memory to be allocated and asso-
4303 ciated with the preg structure. The function regfree() frees all such
4304 memory, after which preg may no longer be used as a compiled expres-
4305 sion.
4306
4307
4308 AUTHOR
4309
4310 Philip Hazel
4311 University Computing Service,
4312 Cambridge CB2 3QG, England.
4313
4314 Last updated: 28 February 2005
4315 Copyright (c) 1997-2005 University of Cambridge.
4316 -----------------------------------------------------------------------------
4317
4318
4319
4320 NAME
4321 PCRE - Perl-compatible regular expressions.
4322
4323
4324 SYNOPSIS OF C++ WRAPPER
4325
4326 #include <pcrecpp.h>
4327
4328
4329 DESCRIPTION
4330
4331 The C++ wrapper for PCRE was provided by Google Inc. This brief man
4332 page was constructed from the notes in the pcrecpp.h file, which should
4333 be consulted for further details.
4334
4335
4336 MATCHING INTERFACE
4337
4338 The "FullMatch" operation checks that supplied text matches a supplied
4339 pattern exactly. If pointer arguments are supplied, it copies matched
4340 sub-strings that match sub-patterns into them.
4341
4342 Example: successful match
4343 pcrecpp::RE re("h.*o");
4344 re.FullMatch("hello");
4345
4346 Example: unsuccessful match (requires full match):
4347 pcrecpp::RE re("e");
4348 !re.FullMatch("hello");
4349
4350 Example: creating a temporary RE object:
4351 pcrecpp::RE("h.*o").FullMatch("hello");
4352
4353 You can pass in a "const char*" or a "string" for "text". The examples
4354 below tend to use a const char*. You can, as in the different examples
4355 above, store the RE object explicitly in a variable or use a temporary
4356 RE object. The examples below use one mode or the other arbitrarily.
4357 Either could correctly be used for any of these examples.
4358
4359 You must supply extra pointer arguments to extract matched subpieces.
4360
4361 Example: extracts "ruby" into "s" and 1234 into "i"
4362 int i;
4363 string s;
4364 pcrecpp::RE re("(\\w+):(\\d+)");
4365 re.FullMatch("ruby:1234", &s, &i);
4366
4367 Example: does not try to extract any extra sub-patterns
4368 re.FullMatch("ruby:1234", &s);
4369
4370 Example: does not try to extract into NULL
4371 re.FullMatch("ruby:1234", NULL, &i);
4372
4373 Example: integer overflow causes failure
4374 !re.FullMatch("ruby:1234567891234", NULL, &i);
4375
4376 Example: fails because there aren't enough sub-patterns:
4377 !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
4378
4379 Example: fails because string cannot be stored in integer
4380 !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
4381
4382 The provided pointer arguments can be pointers to any scalar numeric
4383 type, or one of:
4384
4385 string       (matched piece is copied to string)
4386 StringPiece  (StringPiece is mutated to point to matched piece)
4387 T            (where "bool T::ParseFrom(const char*, int)" exists)
4388 NULL         (the corresponding matched sub-pattern is not copied)
4389
4390 The function returns true iff all of the following conditions are sat-
4391 isfied:
4392
4393 a. "text" matches "pattern" exactly;
4394
4395 b. The number of matched sub-patterns is >= number of supplied
4396 pointers;
4397
4398 c. The "i"th argument has a suitable type for holding the
4399 string captured as the "i"th sub-pattern. If you pass in
4400 NULL for the "i"th argument, or pass fewer arguments than
4401 the number of sub-patterns, the "i"th captured sub-pattern
4402 is ignored.
4403
4404 The matching interface supports at most 16 arguments per call. If you
4405 need more, consider using the more general interface
4406 pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
4407
4408
4409 PARTIAL MATCHES
4410
4411 You can use the "PartialMatch" operation when you want the pattern to
4412 match any substring of the text.
4413
4414 Example: simple search for a string:
4415 pcrecpp::RE("ell").PartialMatch("hello");
4416
4417 Example: find first number in a string:
4418 int number;
4419 pcrecpp::RE re("(\\d+)");
4420 re.PartialMatch("x*100 + 20", &number);
4421 assert(number == 100);
4422
4423
4424 UTF-8 AND THE MATCHING INTERFACE
4425
4426 By default, pattern and text are plain text, one byte per character.
4427 The UTF8 flag, passed to the constructor, causes both pattern and
4428 string to be treated as UTF-8 text, still a byte stream but potentially
4429 multiple bytes per character. In practice, the text is likelier to be
4430 UTF-8 than the pattern, but the match returned may depend on the UTF8
4431 flag, so always use it when matching UTF-8 text. For example, "." will
4432 match one byte normally, but with UTF8 set it matches one whole
4433 character, which may occupy several bytes.
4434
4435 Example:
4436 pcrecpp::RE_Options options;
4437 options.set_utf8();
4438 pcrecpp::RE re(utf8_pattern, options);
4439 re.FullMatch(utf8_string);
4440
4441 Example: using the convenience function UTF8():
4442 pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
4443 re.FullMatch(utf8_string);
4444
4445 NOTE: The UTF8 flag is ignored if pcre was not configured with the
4446 --enable-utf8 flag.
4447
4448
4449 SCANNING TEXT INCREMENTALLY
4450
4451 The "Consume" operation may be useful if you want to repeatedly match
4452 regular expressions at the front of a string and skip over them as they
4453 match. This requires use of the "StringPiece" type, which represents a
4454 sub-range of a real string. Like RE, StringPiece is defined in the
4455 pcrecpp namespace.
4456
4457 Example: read lines of the form "var = value" from a string.
4458 string contents = ...; // Fill string somehow
4459 pcrecpp::StringPiece input(contents); // Wrap in a StringPiece
4460
4461 string var;
4462 int value;
4463 pcrecpp::RE re("(\\w+) = (\\d+)\n");
4464 while (re.Consume(&input, &var, &value)) {
4465 ...;
4466 }
4467
4468 Each successful call to "Consume" will set "var/value", and also
4469 advance "input" so it points past the matched text.
4470
4471 The "FindAndConsume" operation is similar to "Consume" but does not
4472 anchor your match at the beginning of the string. For example, you
4473 could extract all words from a string by repeatedly calling
4474
4475 pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
4476
4477
4478 PARSING HEX/OCTAL/C-RADIX NUMBERS
4479
4480 By default, if you pass a pointer to a numeric value, the corresponding
4481 text is interpreted as a base-10 number. You can instead wrap the
4482 pointer with a call to one of the operators Hex(), Octal(), or CRadix()
4483 to interpret the text in another base. The CRadix operator interprets
4484 C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
4485 base-10.
4486
4487 Example:
4488 int a, b, c, d;
4489 pcrecpp::RE re("(.*) (.*) (.*) (.*)");
4490 re.FullMatch("100 40 0100 0x40",
4491 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
4492 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));
4493
4494 will leave 64 in a, b, c, and d.
4495
4496
4497 REPLACING PARTS OF STRINGS
4498
4499 You can replace the first match of "pattern" in "str" with "rewrite".
4500 Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
4501 insert text matching the corresponding parenthesized group from the
4502 pattern. \0 in "rewrite" refers to the entire matching text. For example:
4503
4504 string s = "yabba dabba doo";
4505 pcrecpp::RE("b+").Replace("d", &s);
4506
4507 will leave "s" containing "yada dabba doo". The result is true if the
4508 pattern matches and a replacement occurs, false otherwise.
4509
4510 GlobalReplace is like Replace except that it replaces all occurrences
4511 of the pattern in the string with the rewrite. Replacements are not
4512 subject to re-matching. For example:
4513
4514 string s = "yabba dabba doo";
4515 pcrecpp::RE("b+").GlobalReplace("d", &s);
4516
4517 will leave "s" containing "yada dada doo". It returns the number of
4518 replacements made.
4519
4520 Extract is like Replace, except that if the pattern matches, "rewrite"
4521 is copied into "out" (an additional argument) with substitutions. The
4522 non-matching portions of "text" are ignored. Returns true iff a match
4523 occurred and the extraction happened successfully; if no match occurs,
4524 the string is left unaffected.
4525
4526
4527 AUTHOR
4528
4529 The C++ wrapper was contributed by Google Inc.
4530 Copyright (c) 2005 Google Inc.
4531 -----------------------------------------------------------------------------
4532
4533
4534
4535 NAME
4536 PCRE - Perl-compatible regular expressions
4537
4538
4539 PCRE SAMPLE PROGRAM
4540
4541 A simple, complete demonstration program, to get you started with using
4542 PCRE, is supplied in the file pcredemo.c in the PCRE distribution.
4543
4544 The program compiles the regular expression that is its first argument,
4545 and matches it against the subject string in its second argument. No
4546 PCRE options are set, and default character tables are used. If match-
4547 ing succeeds, the program outputs the portion of the subject that
4548 matched, together with the contents of any captured substrings.
4549
4550 If the -g option is given on the command line, the program then goes on
4551 to check for further matches of the same regular expression in the same
4552 subject string. The logic is a little bit tricky because of the possi-
4553 bility of matching an empty string. Comments in the code explain what
4554 is going on.
4555
4556 If PCRE is installed in the standard include and library directories
4557 for your system, you should be able to compile the demonstration pro-
4558 gram using this command:
4559
4560 gcc -o pcredemo pcredemo.c -lpcre
4561
4562 If PCRE is installed elsewhere, you may need to add additional options
4563 to the command line. For example, on a Unix-like system that has PCRE
4564 installed in /usr/local, you can compile the demonstration program
4565 using a command like this:
4566
4567 gcc -o pcredemo -I/usr/local/include pcredemo.c \
4568 -L/usr/local/lib -lpcre
4569
4570 Once you have compiled the demonstration program, you can run simple
4571 tests like this:
4572
4573 ./pcredemo 'cat|dog' 'the cat sat on the mat'
4574 ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
4575
4576 Note that there is a much more comprehensive test program, called
4577 pcretest, which supports many more facilities for testing regular
4578 expressions and the PCRE library. The pcredemo program is provided as a
4579 simple coding example.
4580
4581 On some operating systems (e.g. Solaris), when PCRE is not installed in
4582 the standard library directory, you may get an error like this when you
4583 try to run pcredemo:
4584
4585 ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
4586 directory
4587
4588 This is caused by the way shared library support works on those sys-
4589 tems. You need to add
4590
4591 -R/usr/local/lib
4592
4593 (for example) to the compile command to get round this problem.
4594
4595 Last updated: 09 September 2004
4596 Copyright (c) 1997-2004 University of Cambridge.
4597 -----------------------------------------------------------------------------
4598
