Contents of /code/trunk/doc/pcre.txt

Revision 208
Mon Aug 6 15:23:29 2007 UTC by ph10
File MIME type: text/plain
File size: 285474 byte(s)
Added a pcresyntax man page; tidied some others.
1 -----------------------------------------------------------------------------
2 This file contains a concatenation of the PCRE man pages, converted to plain
3 text format for ease of searching with a text editor, or for use on systems
4 that do not have a man page processor. The small individual files that give
5 synopses of each function in the library have not been included. There are
6 separate text files for the pcregrep and pcretest commands.
7 -----------------------------------------------------------------------------
8
9
10 PCRE(3) PCRE(3)
11
12
13 NAME
14 PCRE - Perl-compatible regular expressions
15
16
17 INTRODUCTION
18
19 The PCRE library is a set of functions that implement regular expres-
20 sion pattern matching using the same syntax and semantics as Perl, with
21 just a few differences. (Certain features that appeared in Python and
22 PCRE before they appeared in Perl are also available using the Python
23 syntax.)
24
25 The current implementation of PCRE (release 7.x) corresponds approxi-
26 mately with Perl 5.10, including support for UTF-8 encoded strings and
27 Unicode general category properties. However, UTF-8 and Unicode support
28 has to be explicitly enabled; it is not the default. The Unicode tables
29 correspond to Unicode release 5.0.0.
30
31 In addition to the Perl-compatible matching function, PCRE contains an
32 alternative matching function that matches the same compiled patterns
33 in a different way. In certain circumstances, the alternative function
34 has some advantages. For a discussion of the two matching algorithms,
35 see the pcrematching page.
36
37 PCRE is written in C and released as a C library. A number of people
38 have written wrappers and interfaces of various kinds. In particular,
39 Google Inc. have provided a comprehensive C++ wrapper. This is now
40 included as part of the PCRE distribution. The pcrecpp page has details
41 of this interface. Other people's contributions can be found in the
42 Contrib directory at the primary FTP site, which is:
43
44 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
45
46 Details of exactly which Perl regular expression features are and are
47 not supported by PCRE are given in separate documents. See the pcrepat-
48 tern and pcrecompat pages. There is a syntax summary in the pcresyntax
49 page.
50
51 Some features of PCRE can be included, excluded, or changed when the
52 library is built. The pcre_config() function makes it possible for a
53 client to discover which features are available. The features them-
54 selves are described in the pcrebuild page. Documentation about build-
55 ing PCRE for various operating systems can be found in the README file
56 in the source distribution.
57
58 The library contains a number of undocumented internal functions and
59 data tables that are used by more than one of the exported external
60 functions, but which are not intended for use by external callers.
61 Their names all begin with "_pcre_", which hopefully will not provoke
62 any name clashes. In some environments, it is possible to control which
63 external symbols are exported when a shared library is built, and in
64 these cases the undocumented symbols are not exported.
65
66
67 USER DOCUMENTATION
68
69 The user documentation for PCRE comprises a number of different sec-
70 tions. In the "man" format, each of these is a separate "man page". In
71 the HTML format, each is a separate page, linked from the index page.
72 In the plain text format, all the sections are concatenated, for ease
73 of searching. The sections are as follows:
74
75 pcre this document
76 pcre-config show PCRE installation configuration information
77 pcreapi details of PCRE's native C API
78 pcrebuild options for building PCRE
79 pcrecallout details of the callout feature
80 pcrecompat discussion of Perl compatibility
81 pcrecpp details of the C++ wrapper
82 pcregrep description of the pcregrep command
83 pcrematching discussion of the two matching algorithms
84 pcrepartial details of the partial matching facility
85 pcrepattern syntax and semantics of supported
86 regular expressions
87 pcresyntax quick syntax reference
88 pcreperform discussion of performance issues
89 pcreposix the POSIX-compatible C API
90 pcreprecompile details of saving and re-using precompiled patterns
91 pcresample discussion of the sample program
92 pcrestack discussion of stack usage
93 pcretest description of the pcretest testing command
94
95 In addition, in the "man" and HTML formats, there is a short page for
96 each C library function, listing its arguments and results.
97
98
99 LIMITATIONS
100
101 There are some size limitations in PCRE but it is hoped that they will
102 never in practice be relevant.
103
104 The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
105 is compiled with the default internal linkage size of 2. If you want to
106 process regular expressions that are truly enormous, you can compile
107 PCRE with an internal linkage size of 3 or 4 (see the README file in
108 the source distribution and the pcrebuild documentation for details).
109 In these cases the limit is substantially larger. However, the speed
110 of execution is slower.
111
112 All values in repeating quantifiers must be less than 65536.
113
114 There is no limit to the number of parenthesized subpatterns, but there
115 can be no more than 65535 capturing subpatterns.
116
117 The maximum length of name for a named subpattern is 32 characters, and
118 the maximum number of named subpatterns is 10000.
119
120 The maximum length of a subject string is the largest positive number
121 that an integer variable can hold. However, when using the traditional
122 matching function, PCRE uses recursion to handle subpatterns and indef-
123 inite repetition. This means that the available stack space may limit
124 the size of a subject string that can be processed by certain patterns.
125 For a discussion of stack issues, see the pcrestack documentation.
126
127
128 UTF-8 AND UNICODE PROPERTY SUPPORT
129
130 From release 3.3, PCRE has had some support for character strings
131 encoded in the UTF-8 format. For release 4.0 this was greatly extended
132 to cover most common requirements, and in release 5.0 additional sup-
133 port for Unicode general category properties was added.
134
135 In order to process UTF-8 strings, you must build PCRE to include UTF-8
136 support in the code, and, in addition, you must call pcre_compile()
137 with the PCRE_UTF8 option flag. When you do this, both the pattern and
138 any subject strings that are matched against it are treated as UTF-8
139 strings instead of just strings of bytes.
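
Treating a string as UTF-8 presupposes that its bytes form well-formed
sequences. The following is a minimal sketch of such a check in C,
assuming the RFC 3629 rules (at most four bytes per character, no
overlong forms, no surrogates). It is not PCRE's own validation code,
which may accept a different range, and the name utf8_is_valid is
hypothetical.

```c
#include <stddef.h>

/* Returns 1 if the len bytes at s are well-formed UTF-8 under the
   RFC 3629 rules assumed for this sketch, otherwise 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;            /* continuation bytes expected */
        unsigned long cp, min;   /* decoded value, smallest legal value */

        if (c < 0x80) { i++; continue; }                 /* one-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;           /* stray continuation byte or invalid lead */

        if (len - i <= extra) return 0;                  /* truncated sequence */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;     /* not 10xxxxxx */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min || cp > 0x10FFFF) return 0;         /* overlong or too big */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;      /* surrogate */
        i += extra + 1;
    }
    return 1;
}
```

A caller that has already run a check of this kind on its data is in
the situation where skipping a repeated check can make sense.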
140
141 If you compile PCRE with UTF-8 support, but do not use it at run time,
142 the library will be a bit bigger, but the additional run time overhead
143 is limited to testing the PCRE_UTF8 flag occasionally, so should not be
144 very big.
145
146 If PCRE is built with Unicode character property support (which implies
147 UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup-
148 ported. The available properties that can be tested are limited to the
149 general category properties such as Lu for an upper case letter or Nd
150 for a decimal number, the Unicode script names such as Arabic or Han,
151 and the derived properties Any and L&. A full list is given in the
152 pcrepattern documentation. Only the short names for properties are sup-
153 ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
154 ter}, is not supported. Furthermore, in Perl, many properties may
155 optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE
156 does not support this.
157
158 The following comments apply when PCRE is running in UTF-8 mode:
159
160 1. When you set the PCRE_UTF8 flag, the strings passed as patterns and
161 subjects are checked for validity on entry to the relevant functions.
162 If an invalid UTF-8 string is passed, an error return is given. In some
163 situations, you may already know that your strings are valid, and
164 therefore want to skip these checks in order to improve performance. If
165 you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time,
166 PCRE assumes that the pattern or subject it is given (respectively)
167 contains only valid UTF-8 codes. In this case, it does not diagnose an
168 invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
169 PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may
170 crash.
171
172 2. An unbraced hexadecimal escape sequence (such as \xb3) matches a
173 two-byte UTF-8 character if the value is greater than 127.
174
175 3. Octal numbers up to \777 are recognized, and match two-byte UTF-8
176 characters for values greater than \177.
177
178 4. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
179 vidual bytes, for example: \x{100}{3}.
180
181 5. The dot metacharacter matches one UTF-8 character instead of a sin-
182 gle byte.
183
184 6. The escape sequence \C can be used to match a single byte in UTF-8
185 mode, but its use can lead to some strange effects. This facility is
186 not available in the alternative matching function, pcre_dfa_exec().
187
188 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
189 test characters of any code value, but the characters that PCRE recog-
190 nizes as digits, spaces, or word characters remain the same set as
191 before, all with values less than 256. This remains true even when PCRE
192 includes Unicode property support, because to do otherwise would slow
193 down PCRE in many common cases. If you really want to test for a wider
194 sense of, say, "digit", you must use Unicode property tests such as
195 \p{Nd}.
196
197 8. Similarly, characters that match the POSIX named character classes
198 are all low-valued characters.
199
200 9. However, the Perl 5.10 horizontal and vertical whitespace matching
201 escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
202 acters.
203
204 10. Case-insensitive matching applies only to characters whose values
205 are less than 128, unless PCRE is built with Unicode property support.
206 Even when Unicode property support is available, PCRE still uses its
207 own character tables when checking the case of low-valued characters,
208 so as not to degrade performance. The Unicode property information is
209 used only for characters with higher values. Even when Unicode property
210 support is available, PCRE supports case-insensitive matching only when
211 there is a one-to-one mapping between a letter's cases. There are a
212 small number of many-to-one mappings in Unicode; these are not sup-
213 ported by PCRE.
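
To make comments 2 to 4 concrete, here is a small sketch of how a code
point maps to UTF-8 bytes. utf8_encode is a hypothetical helper, not a
PCRE function, and it assumes the modern Unicode range up to U+10FFFF
without checking for surrogates.

```c
/* Encodes code point cp as UTF-8 into out, which must have room for
   four bytes. Returns the number of bytes written, or 0 if cp is
   outside the range assumed by this sketch. */
int utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this helper, 0xb3 encodes as the two bytes C2 B3, octal 777 (511)
as C7 BF, and 0x100 as C4 80. This is why a quantifier such as {3}
after \x{100} has to repeat a two-byte unit rather than a single byte.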
214
215
216 AUTHOR
217
218 Philip Hazel
219 University Computing Service
220 Cambridge CB2 3QH, England.
221
222 Putting an actual email address here seems to have been a spam magnet,
223 so I've taken it away. If you want to email me, use my two initials,
224 followed by the two digits 10, at the domain cam.ac.uk.
225
226
227 REVISION
228
229 Last updated: 06 August 2007
230 Copyright (c) 1997-2007 University of Cambridge.
231 ------------------------------------------------------------------------------
232
233
234 PCREBUILD(3) PCREBUILD(3)
235
236
237 NAME
238 PCRE - Perl-compatible regular expressions
239
240
241 PCRE BUILD-TIME OPTIONS
242
243 This document describes the optional features of PCRE that can be
244 selected when the library is compiled. They are all selected, or dese-
245 lected, by providing options to the configure script that is run before
246 the make command. The complete list of options for configure (which
247 includes the standard ones such as the selection of the installation
248 directory) can be obtained by running
249
250 ./configure --help
251
252 The following sections include descriptions of options whose names
253 begin with --enable or --disable. These settings specify changes to the
254 defaults for the configure command. Because of the way that configure
255 works, --enable and --disable always come in pairs, so the complemen-
256 tary option always exists as well, but as it specifies the default, it
257 is not described.
258
259
260 C++ SUPPORT
261
262 By default, the configure script will search for a C++ compiler and C++
263 header files. If it finds them, it automatically builds the C++ wrapper
264 library for PCRE. You can disable this by adding
265
266 --disable-cpp
267
268 to the configure command.
269
270
271 UTF-8 SUPPORT
272
273 To build PCRE with support for UTF-8 character strings, add
274
275 --enable-utf8
276
277 to the configure command. Of itself, this does not make PCRE treat
278 strings as UTF-8. As well as compiling PCRE with this option, you also
279 have to set the PCRE_UTF8 option when you call the pcre_compile()
280 function.
281
282
283 UNICODE CHARACTER PROPERTY SUPPORT
284
285 UTF-8 support allows PCRE to process character values greater than 255
286 in the strings that it handles. On its own, however, it does not pro-
287 vide any facilities for accessing the properties of such characters. If
288 you want to be able to use the pattern escapes \P, \p, and \X, which
289 refer to Unicode character properties, you must add
290
291 --enable-unicode-properties
292
293 to the configure command. This implies UTF-8 support, even if you have
294 not explicitly requested it.
295
296 Including Unicode property support adds around 30K of tables to the
297 PCRE library. Only the general category properties such as Lu and Nd
298 are supported. Details are given in the pcrepattern documentation.
299
300
301 CODE VALUE OF NEWLINE
302
303 By default, PCRE interprets character 10 (linefeed, LF) as indicating
304 the end of a line. This is the normal newline character on Unix-like
305 systems. You can compile PCRE to use character 13 (carriage return, CR)
306 instead, by adding
307
308 --enable-newline-is-cr
309
310 to the configure command. There is also a --enable-newline-is-lf
311 option, which explicitly specifies linefeed as the newline character.
312
313 Alternatively, you can specify that line endings are to be indicated by
314 the two character sequence CRLF. If you want this, add
315
316 --enable-newline-is-crlf
317
318 to the configure command. There is a fourth option, specified by
319
320 --enable-newline-is-anycrlf
321
322 which causes PCRE to recognize any of the three sequences CR, LF, or
323 CRLF as indicating a line ending. Finally, a fifth option, specified by
324
325 --enable-newline-is-any
326
327 causes PCRE to recognize any Unicode newline sequence.
328
329 Whatever line ending convention is selected when PCRE is built can be
330 overridden when the library functions are called. At build time it is
331 conventional to use the standard for your operating system.
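
The difference between the first four conventions can be sketched as a
function that reports how long the line ending at a given position is.
This is an illustration only: nl_mode and newline_len are hypothetical
names, not part of PCRE, and the "any Unicode newline" convention is
left out because it depends on the character encoding in use.

```c
typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF } nl_mode;

/* Returns the length in bytes of the newline sequence that starts at
   p under the given convention, or 0 if there is none. p must point
   into a NUL-terminated string. */
int newline_len(const char *p, nl_mode mode)
{
    switch (mode) {
    case NL_CR:
        return p[0] == '\r';
    case NL_LF:
        return p[0] == '\n';
    case NL_CRLF:
        return (p[0] == '\r' && p[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
        if (p[0] == '\r')
            return p[1] == '\n' ? 2 : 1;   /* CRLF counts as one ending */
        return p[0] == '\n';
    }
    return 0;
}
```

Note that under the CRLF convention a lone LF or a lone CR is not a
line ending at all, whereas the "anycrlf" convention accepts either.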
332
333
334 BUILDING SHARED AND STATIC LIBRARIES
335
336 The PCRE building process uses libtool to build both shared and static
337 Unix libraries by default. You can suppress one of these by adding one
338 of
339
340 --disable-shared
341 --disable-static
342
343 to the configure command, as required.
344
345
346 POSIX MALLOC USAGE
347
348 When PCRE is called through the POSIX interface (see the pcreposix doc-
349 umentation), additional working storage is required for holding the
350 pointers to capturing substrings, because PCRE requires three integers
351 per substring, whereas the POSIX interface provides only two. If the
352 number of expected substrings is small, the wrapper function uses space
353 on the stack, because this is faster than using malloc() for each call.
354 The default threshold above which the stack is no longer used is 10; it
355 can be changed by adding a setting such as
356
357 --with-posix-malloc-threshold=20
358
359 to the configure command.
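
The stack-or-heap decision described above might look roughly like the
following. This is a sketch under stated assumptions, not the wrapper's
actual source: MALLOC_THRESHOLD and with_ovector are hypothetical
names, and the callback stands in for the real matching call.

```c
#include <stdlib.h>

#define MALLOC_THRESHOLD 10   /* mirrors the documented default */

/* Provides a workspace of three ints per expected substring to the
   callback `use` (which may be NULL). Returns 0 if the workspace came
   from the stack, 1 if it came from the heap, and -1 if the heap
   allocation failed. */
int with_ovector(size_t nmatch, void (*use)(int *ovector, size_t count))
{
    int small[MALLOC_THRESHOLD * 3];   /* cheap: lives on the stack */
    int *ovector = small;
    int on_heap = 0;

    if (nmatch > MALLOC_THRESHOLD) {   /* too big for the stack buffer */
        ovector = malloc(sizeof(int) * nmatch * 3);
        if (ovector == NULL)
            return -1;
        on_heap = 1;
    }

    if (use != NULL)
        use(ovector, nmatch * 3);

    if (on_heap)
        free(ovector);
    return on_heap;
}
```

Raising the threshold trades a larger stack frame on every call for
fewer heap allocations when many substrings are requested.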
360
361
362 HANDLING VERY LARGE PATTERNS
363
364 Within a compiled pattern, offset values are used to point from one
365 part to another (for example, from an opening parenthesis to an alter-
366 nation metacharacter). By default, two-byte values are used for these
367 offsets, leading to a maximum size for a compiled pattern of around
368 64K. This is sufficient to handle all but the most gigantic patterns.
369 Nevertheless, some people do want to process enormous patterns, so it
370 is possible to compile PCRE to use three-byte or four-byte offsets by
371 adding a setting such as
372
373 --with-link-size=3
374
375 to the configure command. The value given must be 2, 3, or 4. Using
376 longer offsets slows down the operation of PCRE because it has to load
377 additional bytes when handling them.
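
The relationship between link size and maximum offset can be
illustrated with a pair of helpers that store and load an offset in a
given number of bytes. The names put_link, get_link, and max_link are
hypothetical, and the most-significant-byte-first layout is an
assumption of this sketch rather than a statement about PCRE's
internal format.

```c
#include <stdint.h>

/* Stores value in link_size bytes at p, most significant byte first. */
void put_link(unsigned char *p, uint32_t value, int link_size)
{
    for (int i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Loads an offset of link_size bytes from p. One load per byte is why
   longer links cost more time. */
uint32_t get_link(const unsigned char *p, int link_size)
{
    uint32_t value = 0;
    for (int i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset representable in link_size bytes (2, 3, or 4). */
uint32_t max_link(int link_size)
{
    if (link_size >= 4)
        return UINT32_MAX;
    return ((uint32_t)1 << (8 * link_size)) - 1;
}
```

With two bytes the largest representable offset is 65535, which is
where the limit of around 64K for a compiled pattern comes from.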
378
379
380 AVOIDING EXCESSIVE STACK USAGE
381
382 When matching with the pcre_exec() function, PCRE implements backtrack-
383 ing by making recursive calls to an internal function called match().
384 In environments where the size of the stack is limited, this can se-
385 verely limit PCRE's operation. (The Unix environment does not usually
386 suffer from this problem, but it may sometimes be necessary to increase
387 the maximum stack size. There is a discussion in the pcrestack docu-
388 mentation.) An alternative approach to recursion that uses memory from
389 the heap to remember data, instead of using recursive function calls,
390 has been implemented to work round the problem of limited stack size.
391 If you want to build a version of PCRE that works this way, add
392
393 --disable-stack-for-recursion
394
395 to the configure command. With this configuration, PCRE will use the
396 pcre_stack_malloc and pcre_stack_free variables to call memory manage-
397 ment functions. By default these point to malloc() and free(), but you
398 can replace the pointers so that your own functions are used.
399
400 Separate functions are provided rather than using pcre_malloc and
401 pcre_free because the usage is very predictable: the block sizes
402 requested are always the same, and the blocks are always freed in
403 reverse order. A calling program might be able to implement optimized
404 functions that perform better than malloc() and free(). PCRE runs
405 noticeably more slowly when built in this way. This option affects only
406 the pcre_exec() function; it is not relevant for the
407 pcre_dfa_exec() function.
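
Because the blocks are all the same size and are freed in reverse
order, a replacement allocator can be as simple as a stack of
fixed-size slots. The following sketch shows one possible shape. The
names my_stack_malloc and my_stack_free, and the values of SLOT_SIZE
and SLOT_COUNT, are hypothetical; the real block size depends on the
build and would have to be measured.

```c
#include <stddef.h>
#include <stdlib.h>

#define SLOT_SIZE  1024   /* assumed upper bound on one block */
#define SLOT_COUNT 64     /* assumed depth served from the pool */

typedef union {
    max_align_t align;               /* forces suitable alignment */
    unsigned char bytes[SLOT_SIZE];
} slot_t;

static slot_t pool[SLOT_COUNT];
static size_t depth = 0;             /* number of pool slots in use */

/* Hands out the next pool slot, falling back to malloc() when the
   request is too large or the pool is exhausted. */
void *my_stack_malloc(size_t size)
{
    if (size <= SLOT_SIZE && depth < SLOT_COUNT)
        return pool[depth++].bytes;
    return malloc(size);
}

/* Relies on blocks being freed in reverse order of allocation: a block
   that is not the top pool slot is assumed to have come from malloc(). */
void my_stack_free(void *block)
{
    if (depth > 0 && block == (void *)pool[depth - 1].bytes)
        depth--;
    else
        free(block);
}
```

Functions with these signatures could then be installed by assigning
them to the pcre_stack_malloc and pcre_stack_free variables before
matching. The sketch is not thread-safe, since the pool and its depth
counter are shared.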
408
409
410 LIMITING PCRE RESOURCE USAGE
411
412 Internally, PCRE has a function called match(), which it calls repeat-
413 edly (sometimes recursively) when matching a pattern with the
414 pcre_exec() function. By controlling the maximum number of times this
415 function may be called during a single matching operation, a limit can
416 be placed on the resources used by a single call to pcre_exec(). The
417 limit can be changed at run time, as described in the pcreapi documen-
418 tation. The default is 10 million, but this can be changed by adding a
419 setting such as
420
421 --with-match-limit=500000
422
423 to the configure command. This setting has no effect on the
424 pcre_dfa_exec() matching function.
425
426 In some environments it is desirable to limit the depth of recursive
427 calls of match() more strictly than the total number of calls, in order
428 to restrict the maximum amount of stack (or heap, if --disable-stack-
429 for-recursion is specified) that is used. A second limit controls this;
430 it defaults to the value that is set for --with-match-limit, which
431 imposes no additional constraints. However, you can set a lower limit
432 by adding, for example,
433
434 --with-match-limit-recursion=10000
435
436 to the configure command. This value can also be overridden at run
437 time.
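
The two limits can be pictured with a toy recursive function that
counts its own calls and tracks its depth. This is only a sketch of
the mechanism: toy_match, match_budget, LIMIT_CALLS, and LIMIT_DEPTH
are hypothetical names, and the function explores a binary tree of
dead ends rather than matching anything.

```c
typedef struct {
    unsigned long calls;         /* calls made so far */
    unsigned long call_limit;    /* compare --with-match-limit */
    unsigned long depth_limit;   /* compare --with-match-limit-recursion */
} match_budget;

#define LIMIT_CALLS (-1)         /* hypothetical error codes */
#define LIMIT_DEPTH (-2)

/* Stand-in for match(): tries two alternatives at each of `levels`
   levels and never succeeds. Returns 0 for "no match", or a negative
   code as soon as one of the limits is exceeded. */
int toy_match(int levels, unsigned long depth, match_budget *b)
{
    if (++b->calls > b->call_limit)
        return LIMIT_CALLS;
    if (depth > b->depth_limit)
        return LIMIT_DEPTH;
    if (levels == 0)
        return 0;                /* dead end: caller backtracks */

    for (int branch = 0; branch < 2; branch++) {
        int rc = toy_match(levels - 1, depth + 1, b);
        if (rc != 0)
            return rc;           /* give up at once on a limit error */
    }
    return 0;
}
```

In this toy, ten levels cost 2047 calls but never go deeper than
eleven frames, which shows why a depth limit can be set far below the
call limit without affecting patterns that merely backtrack a lot.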
438
439
440 CREATING CHARACTER TABLES AT BUILD TIME
441
442 PCRE uses fixed tables for processing characters whose code values are
443 less than 256. By default, PCRE is built with a set of tables that are
444 distributed in the file pcre_chartables.c.dist. These tables are for
445 ASCII codes only. If you add
446
447 --enable-rebuild-chartables
448
449 to the configure command, the distributed tables are no longer used.
450 Instead, a program called dftables is compiled and run. This outputs
451 the source for a new set of tables, created in the default locale of your
452 C runtime system. (This method of replacing the tables does not work if
453 you are cross compiling, because dftables is run on the local host. If
454 you need to create alternative tables when cross compiling, you will
455 have to do so "by hand".)
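
As a rough idea of what building tables from the C runtime involves,
the sketch below fills two 256-entry tables using the <ctype.h>
functions of whatever locale is current. It is not the dftables
source: build_case_tables is a hypothetical name, and the real tables
contain more than case information.

```c
#include <ctype.h>

/* Fills lower[] with the lower-case form of each byte value and
   flip[] with the opposite-case form, as reported by the current
   locale of the C runtime. */
void build_case_tables(unsigned char lower[256], unsigned char flip[256])
{
    for (int i = 0; i < 256; i++) {
        lower[i] = (unsigned char)tolower(i);
        flip[i]  = (unsigned char)(islower(i) ? toupper(i) : tolower(i));
    }
}
```

Because the results depend on the locale in force when the function
runs, the same code can produce different tables on different systems,
which is also why it cannot simply be run on the build host when cross
compiling.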
456
457
458 USING EBCDIC CODE
459
460 PCRE assumes by default that it will run in an environment where the
461 character code is ASCII (or Unicode, which is a superset of ASCII).
462 This is the case for most computer operating systems. PCRE can, how-
463 ever, be compiled to run in an EBCDIC environment by adding
464
465 --enable-ebcdic
466
467 to the configure command. This setting implies --enable-rebuild-charta-
468 bles. You should only use it if you know that you are in an EBCDIC
469 environment (for example, an IBM mainframe operating system).
470
471
472 SEE ALSO
473
474 pcreapi(3), pcre_config(3).
475
476
477 AUTHOR
478
479 Philip Hazel
480 University Computing Service
481 Cambridge CB2 3QH, England.
482
483
484 REVISION
485
486 Last updated: 30 July 2007
487 Copyright (c) 1997-2007 University of Cambridge.
488 ------------------------------------------------------------------------------
489
490
491 PCREMATCHING(3) PCREMATCHING(3)
492
493
494 NAME
495 PCRE - Perl-compatible regular expressions
496
497
498 PCRE MATCHING ALGORITHMS
499
500 This document describes the two different algorithms that are available
501 in PCRE for matching a compiled regular expression against a given sub-
502 ject string. The "standard" algorithm is the one provided by the
503 pcre_exec() function. This works in the same way as Perl's matching
504 function, and provides a Perl-compatible matching operation.
505
506 An alternative algorithm is provided by the pcre_dfa_exec() function;
507 this operates in a different way, and is not Perl-compatible. It has
508 advantages and disadvantages compared with the standard algorithm, and
509 these are described below.
510
511 When there is only one possible way in which a given subject string can
512 match a pattern, the two algorithms give the same answer. A difference
513 arises, however, when there are multiple possibilities. For example, if
514 the pattern
515
516 ^<.*>
517
518 is matched against the string
519
520 <something> <something else> <something further>
521
522 there are three possible answers. The standard algorithm finds only one
523 of them, whereas the alternative algorithm finds all three.
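
The three answers can be enumerated by hand. The sketch below
hard-codes the single pattern ^<.*> rather than using PCRE, and
all_matches is a hypothetical name. It lists the end offset of every
possible match, longest first.

```c
#include <stddef.h>
#include <string.h>

/* Collects the end offsets of every way in which the fixed pattern
   ^<.*> can match at the start of subject, longest first. At most max
   offsets are stored in ends. Returns the number stored. */
int all_matches(const char *subject, int *ends, int max)
{
    int n = 0;
    size_t limit;

    if (subject[0] != '<')
        return 0;                        /* anchored: must start with '<' */

    limit = strcspn(subject, "\n");      /* '.' does not cross a newline */

    for (size_t i = limit; i >= 2 && n < max; i--)
        if (subject[i - 1] == '>')       /* every '>' closes one match */
            ends[n++] = (int)i;
    return n;
}
```

For the subject shown above this yields the end offsets 48, 28, and
11. With a greedy quantifier, the standard algorithm would report only
the first of these.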
524
525
526 REGULAR EXPRESSIONS AS TREES
527
528 The set of strings that are matched by a regular expression can be rep-
529 resented as a tree structure. An unlimited repetition in the pattern
530 makes the tree of infinite size, but it is still a tree. Matching the
531 pattern to a given subject string (from a given starting point) can be
532 thought of as a search of the tree. There are two ways to search a
533 tree: depth-first and breadth-first, and these correspond to the two
534 matching algorithms provided by PCRE.
535
536
537 THE STANDARD MATCHING ALGORITHM
538
539 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
540 sions", the standard algorithm is an "NFA algorithm". It conducts a
541 depth-first search of the pattern tree. That is, it proceeds along a
542 single path through the tree, checking that the subject matches what is
543 required. When there is a mismatch, the algorithm tries any alterna-
544 tives at the current point, and if they all fail, it backs up to the
545 previous branch point in the tree, and tries the next alternative
546 branch at that level. This often involves backing up (moving to the
547 left) in the subject string as well. The order in which repetition
548 branches are tried is controlled by the greedy or ungreedy nature of
549 the quantifier.
550
551 If a leaf node is reached, a matching string has been found, and at
552 that point the algorithm stops. Thus, if there is more than one possi-
553 ble match, this algorithm returns the first one that it finds. Whether
554 this is the shortest, the longest, or some intermediate length depends
555 on the way the greedy and ungreedy repetition quantifiers are specified
556 in the pattern.
557
558 Because it ends up with a single path through the tree, it is rela-
559 tively straightforward for this algorithm to keep track of the sub-
560 strings that are matched by portions of the pattern in parentheses.
561 This provides support for capturing parentheses and back references.
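
A depth-first search with backtracking can be sketched in a few lines
for a tiny pattern language that has only literal characters, the dot,
and a greedy star. This is in the spirit of well-known textbook
matchers and is not PCRE's implementation; match_here and match_star
are hypothetical names.

```c
int match_here(const char *pat, const char *text);

/* Handles "c*" followed by pat. Greedy: takes as many repetitions as
   possible, then backs up in the subject one character at a time
   until the rest of the pattern matches. */
int match_star(int c, const char *pat, const char *text)
{
    const char *t = text;

    while (*t != '\0' && (c == '.' || *t == c))
        t++;

    for (;;) {
        int rest = match_here(pat, t);
        if (rest >= 0)
            return (int)(t - text) + rest;
        if (t == text)
            return -1;           /* all alternatives at this point failed */
        t--;                     /* back up and try the next alternative */
    }
}

/* Tries to match pat at the start of text. Returns the number of
   characters consumed by the first match found, or -1 if there is none. */
int match_here(const char *pat, const char *text)
{
    if (pat[0] == '\0')
        return 0;                /* leaf reached: stop at the first match */
    if (pat[1] == '*')
        return match_star(pat[0], pat + 2, text);
    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text)) {
        int rest = match_here(pat + 1, text + 1);
        return rest < 0 ? -1 : rest + 1;
    }
    return -1;
}
```

Matching "a*ab" against "aaab" shows the backing up: the star first
consumes all three a's, the remainder "ab" then fails against "b", and
the matcher gives one "a" back before succeeding with a match of
length 4.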
562
563
564 THE ALTERNATIVE MATCHING ALGORITHM
565
566 This algorithm conducts a breadth-first search of the tree. Starting
567 from the first matching point in the subject, it scans the subject
568 string from left to right, once, character by character, and as it does
569 this, it remembers all the paths through the tree that represent valid
570 matches. In Friedl's terminology, this is a kind of "DFA algorithm",
571 though it is not implemented as a traditional finite state machine (it
572 keeps multiple states active simultaneously).
573
574 The scan continues until either the end of the subject is reached, or
575 there are no more unterminated paths. At this point, terminated paths
576 represent the different matching possibilities (if there are none, the
577 match has failed). Thus, if there is more than one possible match,
578 this algorithm finds all of them, and in particular, it finds the long-
579 est. In PCRE, there is an option to stop the algorithm after the first
580 match (which is necessarily the shortest) has been found.
581
582 Note that all the matches that are found start at the same point in the
583 subject. If the pattern
584
585 cat(er(pillar)?)?
586
587 is matched against the string "the caterpillar catchment", the result
588 will be the three strings "cat", "cater", and "caterpillar" that start
589 at the fifth character of the subject. The algorithm does not automat-
590 ically move on to find matches that start at later positions.
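
A single left-to-right scan that keeps several paths alive at once can
be sketched by treating the three possible matches as literal
alternatives and advancing them in step. This is a simplification for
illustration, not how pcre_dfa_exec() is implemented, and scan_all is
a hypothetical name.

```c
#define MAX_PATHS 16   /* capacity of this sketch */

/* Scans subject once from offset start, advancing every live
   alternative by one character per step. The length of each
   alternative that completes is stored in lengths (at most max of
   them). Returns the number of matches recorded. */
int scan_all(const char *const *alts, int nalts,
             const char *subject, int start,
             int *lengths, int max)
{
    int alive[MAX_PATHS];
    int found = 0;
    int live;

    if (nalts > MAX_PATHS)
        nalts = MAX_PATHS;
    for (int i = 0; i < nalts; i++)
        alive[i] = 1;
    live = nalts;

    for (int pos = 0; live > 0; pos++) {
        char c = subject[start + pos];

        for (int i = 0; i < nalts; i++) {
            if (!alive[i])
                continue;
            if (alts[i][pos] == '\0') {          /* path terminated: a match */
                if (found < max)
                    lengths[found++] = pos;
                alive[i] = 0;
                live--;
            } else if (c == '\0' || alts[i][pos] != c) {
                alive[i] = 0;                    /* path died on a mismatch */
                live--;
            }
        }
    }
    return found;
}
```

Given the alternatives "cat", "cater", and "caterpillar" and the
subject "the caterpillar catchment" with a start offset of 4, the scan
records lengths 3, 5, and 11 and then stops, without looking at the
later "cat" in "catchment".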
591
592 There are a number of features of PCRE regular expressions that are not
593 supported by the alternative matching algorithm. They are as follows:
594
595 1. Because the algorithm finds all possible matches, the greedy or
596 ungreedy nature of repetition quantifiers is not relevant. Greedy and
597 ungreedy quantifiers are treated in exactly the same way. However, pos-
598 sessive quantifiers can make a difference when what follows could also
599 match what is quantified, for example in a pattern like this:
600
601 ^a++\w!
602
603 This pattern matches "aaab!" but not "aaa!", which would be matched by
604 a non-possessive quantifier. Similarly, if an atomic group is present,
605 it is matched as if it were a standalone pattern at the current point,
606 and the longest match is then "locked in" for the rest of the overall
607 pattern.
608
609 2. When dealing with multiple paths through the tree simultaneously, it
610 is not straightforward to keep track of captured substrings for the
611 different matching possibilities, and PCRE's implementation of this
612 algorithm does not attempt to do this. This means that no captured sub-
613 strings are available.
614
615 3. Because no substrings are captured, back references within the pat-
616 tern are not supported, and cause errors if encountered.
617
618 4. For the same reason, conditional expressions that use a backrefer-
619 ence as the condition or test for a specific group recursion are not
620 supported.
621
622 5. Because many paths through the tree may be active, the \K escape
623 sequence, which resets the start of the match when encountered (but may
624 be on some paths and not on others), is not supported. It causes an
625 error if encountered.
626
627 6. Callouts are supported, but the value of the capture_top field is
628 always 1, and the value of the capture_last field is always -1.
629
630 7. The \C escape sequence, which (in the standard algorithm) matches a
631 single byte, even in UTF-8 mode, is not supported because the alterna-
632 tive algorithm moves through the subject string one character at a
633 time, for all active paths through the tree.
634
635
636 ADVANTAGES OF THE ALTERNATIVE ALGORITHM
637
638 Using the alternative matching algorithm provides the following advan-
639 tages:
640
641 1. All possible matches (at a single point in the subject) are automat-
642 ically found, and in particular, the longest match is found. To find
643 more than one match using the standard algorithm, you have to do kludgy
644 things with callouts.
645
646 2. There is much better support for partial matching. The restrictions
647 on the content of the pattern that apply when using the standard algo-
648 rithm for partial matching do not apply to the alternative algorithm.
649 For non-anchored patterns, the starting position of a partial match is
650 available.
651
652 3. Because the alternative algorithm scans the subject string just
653 once, and never needs to backtrack, it is possible to pass very long
654 subject strings to the matching function in several pieces, checking
655 for partial matching each time.
656
657
658 DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
659
660 The alternative algorithm suffers from a number of disadvantages:
661
662 1. It is substantially slower than the standard algorithm. This is
663 partly because it has to search for all possible matches, but is also
664 because it is less susceptible to optimization.
665
666 2. Capturing parentheses and back references are not supported.
667
668 3. Although atomic groups are supported, their use does not provide the
669 performance advantage that it does for the standard algorithm.
670
671
672 AUTHOR
673
674 Philip Hazel
675 University Computing Service
676 Cambridge CB2 3QH, England.
677
678
679 REVISION
680
681 Last updated: 29 May 2007
682 Copyright (c) 1997-2007 University of Cambridge.
683 ------------------------------------------------------------------------------
684
685
686 PCREAPI(3) PCREAPI(3)
687
688
689 NAME
690 PCRE - Perl-compatible regular expressions
691
692
693 PCRE NATIVE API
694
695 #include <pcre.h>
696
697 pcre *pcre_compile(const char *pattern, int options,
698 const char **errptr, int *erroffset,
699 const unsigned char *tableptr);
700
701 pcre *pcre_compile2(const char *pattern, int options,
702 int *errorcodeptr,
703 const char **errptr, int *erroffset,
704 const unsigned char *tableptr);
705
706 pcre_extra *pcre_study(const pcre *code, int options,
707 const char **errptr);
708
709 int pcre_exec(const pcre *code, const pcre_extra *extra,
710 const char *subject, int length, int startoffset,
711 int options, int *ovector, int ovecsize);
712
713 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
714 const char *subject, int length, int startoffset,
715 int options, int *ovector, int ovecsize,
716 int *workspace, int wscount);
717
718 int pcre_copy_named_substring(const pcre *code,
719 const char *subject, int *ovector,
720 int stringcount, const char *stringname,
721 char *buffer, int buffersize);
722
723 int pcre_copy_substring(const char *subject, int *ovector,
724 int stringcount, int stringnumber, char *buffer,
725 int buffersize);
726
727 int pcre_get_named_substring(const pcre *code,
728 const char *subject, int *ovector,
729 int stringcount, const char *stringname,
730 const char **stringptr);
731
732 int pcre_get_stringnumber(const pcre *code,
733 const char *name);
734
735 int pcre_get_stringtable_entries(const pcre *code,
736 const char *name, char **first, char **last);
737
738 int pcre_get_substring(const char *subject, int *ovector,
739 int stringcount, int stringnumber,
740 const char **stringptr);
741
742 int pcre_get_substring_list(const char *subject,
743 int *ovector, int stringcount, const char ***listptr);
744
745 void pcre_free_substring(const char *stringptr);
746
747 void pcre_free_substring_list(const char **stringptr);
748
749 const unsigned char *pcre_maketables(void);
750
751 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
752 int what, void *where);
753
754 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
755
756 int pcre_refcount(pcre *code, int adjust);
757
758 int pcre_config(int what, void *where);
759
760 char *pcre_version(void);
761
762 void *(*pcre_malloc)(size_t);
763
764 void (*pcre_free)(void *);
765
766 void *(*pcre_stack_malloc)(size_t);
767
768 void (*pcre_stack_free)(void *);
769
770 int (*pcre_callout)(pcre_callout_block *);
771
772
773 PCRE API OVERVIEW
774
775 PCRE has its own native API, which is described in this document. There
776 are also some wrapper functions that correspond to the POSIX regular
777 expression API. These are described in the pcreposix documentation.
778 Both of these APIs define a set of C function calls. A C++ wrapper is
779 distributed with PCRE. It is documented in the pcrecpp page.
780
781 The native API C function prototypes are defined in the header file
782 pcre.h, and on Unix systems the library itself is called libpcre. It
783 can normally be accessed by adding -lpcre to the command for linking an
784 application that uses PCRE. The header file defines the macros
785 PCRE_MAJOR and PCRE_MINOR to contain the major and minor release num-
786 bers for the library. Applications can use these to include support
787 for different releases of PCRE.
788
789 The functions pcre_compile(), pcre_compile2(), pcre_study(), and
790 pcre_exec() are used for compiling and matching regular expressions in
791 a Perl-compatible manner. A sample program that demonstrates the sim-
792 plest way of using them is provided in the file called pcredemo.c in
793 the source distribution. The pcresample documentation describes how to
794 run it.
795
796 A second matching function, pcre_dfa_exec(), which is not Perl-compati-
797 ble, is also provided. This uses a different algorithm for the match-
798 ing. The alternative algorithm finds all possible matches (at a given
799 point in the subject), and scans the subject just once. However, this
800 algorithm does not return captured substrings. A description of the two
801 matching algorithms and their advantages and disadvantages is given in
802 the pcrematching documentation.
803
804 In addition to the main compiling and matching functions, there are
805 convenience functions for extracting captured substrings from a subject
806 string that is matched by pcre_exec(). They are:
807
808 pcre_copy_substring()
809 pcre_copy_named_substring()
810 pcre_get_substring()
811 pcre_get_named_substring()
812 pcre_get_substring_list()
813 pcre_get_stringnumber()
814 pcre_get_stringtable_entries()
815
816 pcre_free_substring() and pcre_free_substring_list() are also provided,
817 to free the memory used for extracted strings.
818
819 The function pcre_maketables() is used to build a set of character
820 tables in the current locale for passing to pcre_compile(),
821 pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is
822 provided for specialist use. Most commonly, no special tables are
823 passed, in which case internal tables that are generated when PCRE is
824 built are used.
825
826 The function pcre_fullinfo() is used to find out information about a
827 compiled pattern; pcre_info() is an obsolete version that returns only
828 some of the available information, but is retained for backwards com-
829 patibility. The function pcre_version() returns a pointer to a string
830 containing the version of PCRE and its date of release.
831
832 The function pcre_refcount() maintains a reference count in a data
833 block containing a compiled pattern. This is provided for the benefit
834 of object-oriented applications.
835
836 The global variables pcre_malloc and pcre_free initially contain the
837 entry points of the standard malloc() and free() functions, respec-
838 tively. PCRE calls the memory management functions via these variables,
839 so a calling program can replace them if it wishes to intercept the
840 calls. This should be done before calling any PCRE functions.
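
As an illustration of intercepting the calls, the following sketch
wraps malloc() and free() with counters. The names counting_malloc,
counting_free, alloc_calls, and free_calls are hypothetical; only the
function signatures are taken from the declarations of pcre_malloc and
pcre_free above. The assignments to the PCRE variables appear in a
comment because they need pcre.h and the library itself.

```c
#include <stdlib.h>

/* Hypothetical counters, not part of PCRE. */
static unsigned long alloc_calls = 0;
static unsigned long free_calls = 0;

/* Same signature as the function that pcre_malloc points to. */
static void *counting_malloc(size_t size)
{
    alloc_calls++;
    return malloc(size);
}

/* Same signature as the function that pcre_free points to. */
static void counting_free(void *block)
{
    if (block != NULL) free_calls++;
    free(block);
}

/* In a program that includes pcre.h, the wrappers would be installed
   before any other PCRE function is called:

       pcre_malloc = counting_malloc;
       pcre_free = counting_free;
*/
```

Comparing the two counters when the program ends is one simple way of
checking that every compiled pattern and table block was released.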
841
842 The global variables pcre_stack_malloc and pcre_stack_free are also
843 indirections to memory management functions. These special functions
844 are used only when PCRE is compiled to use the heap for remembering
845 data, instead of recursive function calls, when running the pcre_exec()
846 function. See the pcrebuild documentation for details of how to do
847 this. It is a non-standard way of building PCRE, for use in environ-
848 ments that have limited stacks. Because of the greater use of memory
849 management, it runs more slowly. Separate functions are provided so
850 that special-purpose external code can be used for this case. When
851 used, these functions are always called in a stack-like manner (last
852 obtained, first freed), and always for memory blocks of the same size.
853 There is a discussion about PCRE's stack usage in the pcrestack docu-
854 mentation.
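
Because the blocks are all the same size and are released in last
obtained, first freed order, a replacement can recycle released blocks
instead of returning them to the system. The following is a minimal
sketch of that idea, not the code that PCRE uses. The names
frame_header, frame_malloc, frame_free, and free_frames are
hypothetical, and the sketch is not thread-safe.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header placed in front of every block. The union forces alignment
   that is suitable for the caller's data, which follows the header. */
typedef union frame_header {
    struct {
        union frame_header *next; /* link while on the free list */
        size_t size;              /* usable size of this block */
    } info;
    long double align_ld;         /* alignment padding only */
    long long align_ll;
    void *align_ptr;
} frame_header;

static frame_header *free_frames = NULL; /* most recently freed first */

/* Same signature as the function that pcre_stack_malloc points to. */
static void *frame_malloc(size_t size)
{
    frame_header *h = free_frames;
    if (h != NULL && h->info.size == size) {
        free_frames = h->info.next;      /* reuse a released block */
    } else {
        h = malloc(sizeof(frame_header) + size);
        if (h == NULL) return NULL;
        h->info.size = size;
    }
    return h + 1;
}

/* Same signature as the function that pcre_stack_free points to.
   The block is kept for reuse; it is not returned to the system. */
static void frame_free(void *block)
{
    frame_header *h;
    if (block == NULL) return;
    h = (frame_header *)block - 1;
    h->info.next = free_frames;
    free_frames = h;
}
```

A program that includes pcre.h would assign frame_malloc to
pcre_stack_malloc and frame_free to pcre_stack_free before matching. A
real replacement would probably also release the cached blocks once
matching is finished.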
855
856 The global variable pcre_callout initially contains NULL. It can be set
857 by the caller to a "callout" function, which PCRE will then call at
858 specified points during a matching operation. Details are given in the
859 pcrecallout documentation.
860
861
862 NEWLINES
863
864 PCRE supports five different conventions for indicating line breaks in
865 strings: a single CR (carriage return) character, a single LF (line-
866 feed) character, the two-character sequence CRLF, any of the three pre-
867 ceding, or any Unicode newline sequence. The Unicode newline sequences
868 are the three just mentioned, plus the single characters VT (vertical
869 tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
870 separator, U+2028), and PS (paragraph separator, U+2029).
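
The five conventions can be made concrete with a small function that
reports how many bytes of newline start at a given offset. This is an
illustrative sketch, not PCRE's own code. The enum names and the
function newline_length are hypothetical (they are not the
PCRE_NEWLINE_xxx option bits), and for the "any Unicode newline" case
the subject is assumed to be UTF-8 encoded.

```c
#include <stddef.h>

/* Hypothetical names for the five conventions. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in bytes of the newline that starts at s[pos],
   or 0 if no newline starts there. */
static size_t newline_length(const unsigned char *s, size_t len,
                             size_t pos, enum nl_convention conv)
{
    size_t left;
    int crlf;

    if (pos >= len) return 0;
    left = len - pos;
    crlf = (left >= 2 && s[pos] == 0x0D && s[pos + 1] == 0x0A);

    switch (conv) {
    case NL_CR:   return s[pos] == 0x0D ? 1 : 0;
    case NL_LF:   return s[pos] == 0x0A ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (s[pos] == 0x0D || s[pos] == 0x0A) ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        if (s[pos] >= 0x0A && s[pos] <= 0x0D) return 1; /* LF VT FF CR */
        if (left >= 2 && s[pos] == 0xC2 && s[pos + 1] == 0x85)
            return 2;                                   /* NEL, U+0085 */
        if (left >= 3 && s[pos] == 0xE2 && s[pos + 1] == 0x80 &&
            (s[pos + 2] == 0xA8 || s[pos + 2] == 0xA9))
            return 3;                                   /* LS or PS */
        return 0;
    }
    return 0;
}
```

For the subject "a\r\nb", offset 1 yields 2 under NL_CRLF, 1 under
NL_CR, and 0 under NL_LF, which shows how the same bytes are seen
differently depending on the convention in force.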
871
872 Each of the first three conventions is used by at least one operating
873 system as its standard newline sequence. When PCRE is built, a default
874 can be specified. If none is given, the default is LF, which is the
875 Unix standard. When PCRE is run, the default can be overridden, either
876 when a pattern is compiled, or when it is matched.
877
878 In the PCRE documentation the word "newline" is used to mean "the char-
879 acter or pair of characters that indicate a line break". The choice of
880 newline convention affects the handling of the dot, circumflex, and
881 dollar metacharacters, the handling of #-comments in /x mode, and, when
882 CRLF is a recognized line ending sequence, the match position advance-
883 ment for a non-anchored pattern. The choice of newline convention does
884 not affect the interpretation of the \n or \r escape sequences.
885
886
887 MULTITHREADING
888
889 The PCRE functions can be used in multi-threading applications, with
890 the proviso that the memory management functions pointed to by
891 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
892 callout function pointed to by pcre_callout, are shared by all threads.
893
894 The compiled form of a regular expression is not altered during match-
895 ing, so the same compiled pattern can safely be used by several threads
896 at once.
897
898
899 SAVING PRECOMPILED PATTERNS FOR LATER USE
900
901 The compiled form of a regular expression can be saved and re-used at a
902 later time, possibly by a different program, and even on a host other
903 than the one on which it was compiled. Details are given in the
904 pcreprecompile documentation. However, compiling a regular expression
905 with one version of PCRE for use with a different version is not guar-
906 anteed to work and may cause crashes.
907
908
909 CHECKING BUILD-TIME OPTIONS
910
911 int pcre_config(int what, void *where);
912
913 The function pcre_config() makes it possible for a PCRE client to dis-
914 cover which optional features have been compiled into the PCRE library.
915 The pcrebuild documentation has more details about these optional fea-
916 tures.
917
918 The first argument for pcre_config() is an integer, specifying which
919 information is required; the second argument is a pointer to a variable
920 into which the information is placed. The following information is
921 available:
922
923 PCRE_CONFIG_UTF8
924
925 The output is an integer that is set to one if UTF-8 support is avail-
926 able; otherwise it is set to zero.
927
928 PCRE_CONFIG_UNICODE_PROPERTIES
929
930 The output is an integer that is set to one if support for Unicode
931 character properties is available; otherwise it is set to zero.
932
933 PCRE_CONFIG_NEWLINE
934
935 The output is an integer whose value specifies the default character
936 sequence that is recognized as meaning "newline". The five values that
937 are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
938 and -1 for ANY. The default should normally be the standard sequence
939 for your operating system.
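
The values listed above can be turned into readable names with a
helper such as the one below. The function newline_config_name is
hypothetical and only restates the list above; it does not call PCRE.
The value for CRLF is the two character codes packed together, that
is, 13 * 256 + 10.

```c
#include <stddef.h>

/* Map the integer reported for PCRE_CONFIG_NEWLINE to a name.
   Returns NULL for a value that is not in the list above. */
static const char *newline_config_name(int value)
{
    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";      /* 13 * 256 + 10 */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return NULL;
    }
}
```

In a program that includes pcre.h, the value would be obtained with a
call of the form pcre_config(PCRE_CONFIG_NEWLINE, &value), where value
is an int, and then passed to the helper.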
940
941 PCRE_CONFIG_LINK_SIZE
942
943 The output is an integer that contains the number of bytes used for
944 internal linkage in compiled regular expressions. The value is 2, 3, or
945 4. Larger values allow larger regular expressions to be compiled, at
946 the expense of slower matching. The default value of 2 is sufficient
947 for all but the most massive patterns, since it allows the compiled
948 pattern to be up to 64K in size.
949
950 PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
951
952 The output is an integer that contains the threshold above which the
953 POSIX interface uses malloc() for output vectors. Further details are
954 given in the pcreposix documentation.
955
956 PCRE_CONFIG_MATCH_LIMIT
957
958 The output is an integer that gives the default limit for the number of
959 internal matching function calls in a pcre_exec() execution. Further
960 details are given with pcre_exec() below.
961
962 PCRE_CONFIG_MATCH_LIMIT_RECURSION
963
964 The output is an integer that gives the default limit for the depth of
965 recursion when calling the internal matching function in a pcre_exec()
966 execution. Further details are given with pcre_exec() below.
967
968 PCRE_CONFIG_STACKRECURSE
969
970 The output is an integer that is set to one if internal recursion when
971 running pcre_exec() is implemented by recursive function calls that use
972 the stack to remember their state. This is the usual way that PCRE is
973 compiled. The output is zero if PCRE was compiled to use blocks of data
974 on the heap instead of recursive function calls. In this case,
975 pcre_stack_malloc and pcre_stack_free are called to manage memory
976 blocks on the heap, thus avoiding the use of the stack.
977
978
979 COMPILING A PATTERN
980
981 pcre *pcre_compile(const char *pattern, int options,
982 const char **errptr, int *erroffset,
983 const unsigned char *tableptr);
984
985 pcre *pcre_compile2(const char *pattern, int options,
986 int *errorcodeptr,
987 const char **errptr, int *erroffset,
988 const unsigned char *tableptr);
989
990 Either of the functions pcre_compile() or pcre_compile2() can be called
991 to compile a pattern into an internal form. The only difference between
992 the two interfaces is that pcre_compile2() has an additional argument,
993 errorcodeptr, via which a numerical error code can be returned.
994
995 The pattern is a C string terminated by a binary zero, and is passed in
996 the pattern argument. A pointer to a single block of memory that is
997 obtained via pcre_malloc is returned. This contains the compiled code
998 and related data. The pcre type is defined for the returned block; this
999 is a typedef for a structure whose contents are not externally defined.
1000 It is up to the caller to free the memory (via pcre_free) when it is no
1001 longer required.
1002
1003 Although the compiled code of a PCRE regex is relocatable, that is, it
1004 does not depend on memory location, the complete pcre data block is not
1005 fully relocatable, because it may contain a copy of the tableptr argu-
1006 ment, which is an address (see below).
1007
1008 The options argument contains various bit settings that affect the com-
1009 pilation. It should be zero if no options are required. The available
1010 options are described below. Some of them, in particular, those that
1011 are compatible with Perl, can also be set and unset from within the
1012 pattern (see the detailed description in the pcrepattern documenta-
1013 tion). For these options, the contents of the options argument speci-
1014 fies their initial settings at the start of compilation and execution.
1015 The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time
1016 of matching as well as at compile time.
1017
1018 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
1019 if compilation of a pattern fails, pcre_compile() returns NULL, and
1020 sets the variable pointed to by errptr to point to a textual error mes-
1021 sage. This is a static string that is part of the library. You must not
1022 try to free it. The offset from the start of the pattern to the charac-
1023 ter where the error was discovered is placed in the variable pointed to
1024 by erroffset, which must not be NULL. If it is, an immediate error is
1025 given.
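
One way of using the message and the offset together is to print the
pattern with a marker under the reported position. The helper below is
a sketch; format_compile_error is a hypothetical name, and the message
and offset would normally come from a failed pcre_compile() call
rather than being supplied by hand. The offset is treated as a count
of bytes, so the marker may be misplaced for patterns that contain
tabs or multibyte characters.

```c
#include <stdio.h>
#include <string.h>

/* Write a three-line diagnostic into buf: the pattern, a caret under
   the error offset, and the message. The return value is that of
   snprintf(). */
static int format_compile_error(char *buf, size_t size,
                                const char *pattern, const char *message,
                                int erroffset)
{
    int len = (int)strlen(pattern);

    /* Keep the caret within, or just past, the pattern text. */
    if (erroffset < 0) erroffset = 0;
    if (erroffset > len) erroffset = len;

    /* "%*s" with an empty string emits erroffset spaces. */
    return snprintf(buf, size, "%s\n%*s^\nerror at offset %d: %s\n",
                    pattern, erroffset, "", erroffset, message);
}
```

The result can be written to stderr with fputs(). Because the message
is a static string owned by the library, the helper only reads it and
never frees it.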
1026
1027 If pcre_compile2() is used instead of pcre_compile(), and the error-
1028 codeptr argument is not NULL, a non-zero error code number is returned
1029 via this argument in the event of an error. This is in addition to the
1030 textual error message. Error codes and messages are listed below.
1031
1032 If the final argument, tableptr, is NULL, PCRE uses a default set of
1033 character tables that are built when PCRE is compiled, using the
1034 default C locale. Otherwise, tableptr must be an address that is the
1035 result of a call to pcre_maketables(). This value is stored with the
1036 compiled pattern, and used again by pcre_exec(), unless another table
1037 pointer is passed to it. For more discussion, see the section on locale
1038 support below.
1039
1040 This code fragment shows a typical straightforward call to pcre_com-
1041 pile():
1042
1043 pcre *re;
1044 const char *error;
1045 int erroffset;
1046 re = pcre_compile(
1047 "^A.*Z", /* the pattern */
1048 0, /* default options */
1049 &error, /* for error message */
1050 &erroffset, /* for error offset */
1051 NULL); /* use default character tables */
1052
1053 The following names for option bits are defined in the pcre.h header
1054 file:
1055
1056 PCRE_ANCHORED
1057
1058 If this bit is set, the pattern is forced to be "anchored", that is, it
1059 is constrained to match only at the first matching point in the string
1060 that is being searched (the "subject string"). This effect can also be
1061 achieved by appropriate constructs in the pattern itself, which is the
1062 only way to do it in Perl.
1063
1064 PCRE_AUTO_CALLOUT
1065
1066 If this bit is set, pcre_compile() automatically inserts callout items,
1067 all with number 255, before each pattern item. For discussion of the
1068 callout facility, see the pcrecallout documentation.
1069
1070 PCRE_CASELESS
1071
1072 If this bit is set, letters in the pattern match both upper and lower
1073 case letters. It is equivalent to Perl's /i option, and it can be
1074 changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
1075 always understands the concept of case for characters whose values are
1076 less than 128, so caseless matching is always possible. For characters
1077 with higher values, the concept of case is supported if PCRE is com-
1078 piled with Unicode property support, but not otherwise. If you want to
1079 use caseless matching for characters 128 and above, you must ensure
1080 that PCRE is compiled with Unicode property support as well as with
1081 UTF-8 support.
1082
1083 PCRE_DOLLAR_ENDONLY
1084
1085 If this bit is set, a dollar metacharacter in the pattern matches only
1086 at the end of the subject string. Without this option, a dollar also
1087 matches immediately before a newline at the end of the string (but not
1088 before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
1089 if PCRE_MULTILINE is set. There is no equivalent to this option in
1090 Perl, and no way to set it within a pattern.
1091
1092 PCRE_DOTALL
1093
1094 If this bit is set, a dot metacharacter in the pattern matches all
1095 characters, including those that indicate newline. Without it, a dot
1096 does not match when the current position is at a newline. This option
1097 is equivalent to Perl's /s option, and it can be changed within a
1098 pattern by a (?s) option setting. A negative class such as [^a] always
1099 matches newline characters, independent of the setting of this option.
1100
1101 PCRE_DUPNAMES
1102
1103 If this bit is set, names used to identify capturing subpatterns need
1104 not be unique. This can be helpful for certain types of pattern when it
1105 is known that only one instance of the named subpattern can ever be
1106 matched. There are more details of named subpatterns below; see also
1107 the pcrepattern documentation.
1108
1109 PCRE_EXTENDED
1110
1111 If this bit is set, whitespace data characters in the pattern are
1112 totally ignored except when escaped or inside a character class. White-
1113 space does not include the VT character (code 11). In addition, charac-
1114 ters between an unescaped # outside a character class and the next new-
1115 line, inclusive, are also ignored. This is equivalent to Perl's /x
1116 option, and it can be changed within a pattern by a (?x) option set-
1117 ting.
1118
1119 This option makes it possible to include comments inside complicated
1120 patterns. Note, however, that this applies only to data characters.
1121 Whitespace characters may never appear within special character
1122 sequences in a pattern, for example within the sequence (?( which
1123 introduces a conditional subpattern.
1124
1125 PCRE_EXTRA
1126
1127 This option was invented in order to turn on additional functionality
1128 of PCRE that is incompatible with Perl, but it is currently of very
1129 little use. When set, any backslash in a pattern that is followed by a
1130 letter that has no special meaning causes an error, thus reserving
1131 these combinations for future expansion. By default, as in Perl, a
1132 backslash followed by a letter with no special meaning is treated as a
1133 literal. (Perl can, however, be persuaded to give a warning for this.)
1134 There are at present no other features controlled by this option. It
1135 can also be set by a (?X) option setting within a pattern.
1136
1137 PCRE_FIRSTLINE
1138
1139 If this option is set, an unanchored pattern is required to match
1140 before or at the first newline in the subject string, though the
1141 matched text may continue over the newline.
1142
1143 PCRE_MULTILINE
1144
1145 By default, PCRE treats the subject string as consisting of a single
1146 line of characters (even if it actually contains newlines). The "start
1147 of line" metacharacter (^) matches only at the start of the string,
1148 while the "end of line" metacharacter ($) matches only at the end of
1149 the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1150 is set). This is the same as Perl.
1151
1152 When PCRE_MULTILINE is set, the "start of line" and "end of line"
1153 constructs match immediately following or immediately before internal
1154 newlines in the subject string, respectively, as well as at the very
1155 start and end. This is equivalent to Perl's /m option, and it can be
1156 changed within a pattern by a (?m) option setting. If there are no new-
1157 lines in a subject string, or no occurrences of ^ or $ in a pattern,
1158 setting PCRE_MULTILINE has no effect.
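
The interaction between PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY for the
dollar metacharacter can be summarized as a small decision function.
This is a sketch of the rules as described above, not PCRE's matching
code. The name dollar_matches is hypothetical, the two flags are plain
ints rather than option bits, and LF is assumed to be the newline
sequence.

```c
#include <stddef.h>

/* Return 1 if a dollar metacharacter would match at offset pos of a
   subject of length len, 0 otherwise. */
static int dollar_matches(const char *subject, size_t len, size_t pos,
                          int multiline, int dollar_endonly)
{
    if (pos > len) return 0;
    if (pos == len) return 1;            /* very end of the subject */
    if (subject[pos] != '\n') return 0;  /* not just before a newline */
    if (multiline) return 1;             /* before any newline */
    if (dollar_endonly) return 0;        /* end of subject only */
    return pos + 1 == len;               /* before a final newline */
}
```

The multiline test comes before the dollar_endonly test, which
reflects the rule that PCRE_DOLLAR_ENDONLY is ignored when
PCRE_MULTILINE is set.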
1159
1160 PCRE_NEWLINE_CR
1161 PCRE_NEWLINE_LF
1162 PCRE_NEWLINE_CRLF
1163 PCRE_NEWLINE_ANYCRLF
1164 PCRE_NEWLINE_ANY
1165
1166 These options override the default newline definition that was chosen
1167 when PCRE was built. Setting the first or the second specifies that a
1168 newline is indicated by a single character (CR or LF, respectively).
1169 Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
1170 two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
1171 that any of the three preceding sequences should be recognized. Setting
1172 PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
1173 recognized. The Unicode newline sequences are the three just mentioned,
1174 plus the single characters VT (vertical tab, U+000B), FF (formfeed,
1175 U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
1176 (paragraph separator, U+2029). The last two are recognized only in
1177 UTF-8 mode.
1178
1179 The newline setting in the options word uses three bits that are
1180 treated as a number, giving eight possibilities. Currently only six are
1181 used (default plus the five values above). This means that if you set
1182 more than one newline option, the combination may or may not be sensi-
1183 ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1184 PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and
1185 cause an error.
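
The three-bit field can be pictured with the sketch below. The NL_BIT
constants are assumed to mirror the PCRE_NEWLINE_xxx definitions in
the pcre.h of this release, and the names NL_BIT_xxx, NL_MASK, and
newline_bits_valid are hypothetical. The values should be checked
against the installed header before being relied on.

```c
/* Assumed copies of the PCRE_NEWLINE_xxx values; verify in pcre.h. */
#define NL_BIT_CR      0x00100000
#define NL_BIT_LF      0x00200000
#define NL_BIT_CRLF    0x00300000
#define NL_BIT_ANY     0x00400000
#define NL_BIT_ANYCRLF 0x00500000
#define NL_MASK        0x00700000   /* the three-bit field */

/* Return 1 if the newline field of options holds 0 (use the default)
   or one of the five defined values, and 0 for an unused number. */
static int newline_bits_valid(int options)
{
    int field = (options & NL_MASK) >> 20;
    return field <= 5;
}
```

Under these assumed values, NL_BIT_CR | NL_BIT_LF equals NL_BIT_CRLF,
which matches the example in the text, whereas NL_BIT_LF | NL_BIT_ANY
gives the unused number 6.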
1186
1187 The only time that a line break is specially recognized when compiling
1188 a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a
1189 character class is encountered. This indicates a comment that lasts
1190 until after the next line break sequence. In other circumstances, line
1191 break sequences are treated as literal data, except that in
1192 PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1193 and are therefore ignored.
1194
1195 The newline option that is set at compile time becomes the default that
1196 is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1197
1198 PCRE_NO_AUTO_CAPTURE
1199
1200 If this option is set, it disables the use of numbered capturing paren-
1201 theses in the pattern. Any opening parenthesis that is not followed by
1202 ? behaves as if it were followed by ?: but named parentheses can still
1203 be used for capturing (and they acquire numbers in the usual way).
1204 There is no equivalent of this option in Perl.
1205
1206 PCRE_UNGREEDY
1207
1208 This option inverts the "greediness" of the quantifiers so that they
1209 are not greedy by default, but become greedy if followed by "?". It is
1210 not compatible with Perl. It can also be set by a (?U) option setting
1211 within the pattern.
1212
1213 PCRE_UTF8
1214
1215 This option causes PCRE to regard both the pattern and the subject as
1216 strings of UTF-8 characters instead of single-byte character strings.
1217 However, it is available only when PCRE is built to include UTF-8 sup-
1218 port. If not, the use of this option provokes an error. Details of how
1219 this option changes the behaviour of PCRE are given in the section on
1220 UTF-8 support in the main pcre page.
1221
1222 PCRE_NO_UTF8_CHECK
1223
1224 When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1225 automatically checked. If an invalid UTF-8 sequence of bytes is found,
1226 pcre_compile() returns an error. If you already know that your pattern
1227 is valid, and you want to skip this check for performance reasons, you
1228 can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of
1229 passing an invalid UTF-8 string as a pattern is undefined. It may cause
1230 your program to crash. Note that this option can also be passed to
1231 pcre_exec() and pcre_dfa_exec(), to suppress the UTF-8 validity check-
1232 ing of subject strings.
1233
1234
1235 COMPILATION ERROR CODES
1236
1237 The following table lists the error codes that may be returned by
1238 pcre_compile2(), along with the error messages that may be returned by
1239 both compiling functions. As PCRE has developed, some error codes have
1240 fallen out of use. To avoid confusion, they have not been re-used.
1241
1242 0 no error
1243 1 \ at end of pattern
1244 2 \c at end of pattern
1245 3 unrecognized character follows \
1246 4 numbers out of order in {} quantifier
1247 5 number too big in {} quantifier
1248 6 missing terminating ] for character class
1249 7 invalid escape sequence in character class
1250 8 range out of order in character class
1251 9 nothing to repeat
1252 10 [this code is not in use]
1253 11 internal error: unexpected repeat
1254 12 unrecognized character after (?
1255 13 POSIX named classes are supported only within a class
1256 14 missing )
1257 15 reference to non-existent subpattern
1258 16 erroffset passed as NULL
1259 17 unknown option bit(s) set
1260 18 missing ) after comment
1261 19 [this code is not in use]
1262 20 regular expression too large
1263 21 failed to get memory
1264 22 unmatched parentheses
1265 23 internal error: code overflow
1266 24 unrecognized character after (?<
1267 25 lookbehind assertion is not fixed length
1268 26 malformed number or name after (?(
1269 27 conditional group contains more than two branches
1270 28 assertion expected after (?(
1271 29 (?R or (?[+-]digits must be followed by )
1272 30 unknown POSIX class name
1273 31 POSIX collating elements are not supported
1274 32 this version of PCRE is not compiled with PCRE_UTF8 support
1275 33 [this code is not in use]
1276 34 character value in \x{...} sequence is too large
1277 35 invalid condition (?(0)
1278 36 \C not allowed in lookbehind assertion
1279 37 PCRE does not support \L, \l, \N, \U, or \u
1280 38 number after (?C is > 255
1281 39 closing ) for (?C expected
1282 40 recursive call could loop indefinitely
1283 41 unrecognized character after (?P
1284 42 syntax error in subpattern name (missing terminator)
1285 43 two named subpatterns have the same name
1286 44 invalid UTF-8 string
1287 45 support for \P, \p, and \X has not been compiled
1288 46 malformed \P or \p sequence
1289 47 unknown property name after \P or \p
1290 48 subpattern name is too long (maximum 32 characters)
1291 49 too many named subpatterns (maximum 10,000)
1292 50 [this code is not in use]
1293 51 octal value is greater than \377 (not in UTF-8 mode)
1294 52 internal error: overran compiling workspace
1295 53 internal error: previously-checked referenced subpattern not
1296 found
1297 54 DEFINE group contains more than one branch
1298 55 repeating a DEFINE group is not allowed
1299 56 inconsistent NEWLINE options
1300 57 \g is not followed by a braced name or an optionally braced
1301 non-zero number
1302 58 (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number
1303
1304
1305 STUDYING A PATTERN
1306
1307 pcre_extra *pcre_study(const pcre *code, int options,
1308 const char **errptr);
1309
1310 If a compiled pattern is going to be used several times, it is worth
1311 spending more time analyzing it in order to speed up the time taken for
1312 matching. The function pcre_study() takes a pointer to a compiled pat-
1313 tern as its first argument. If studying the pattern produces additional
1314 information that will help speed up matching, pcre_study() returns a
1315 pointer to a pcre_extra block, in which the study_data field points to
1316 the results of the study.
1317
1318 The returned value from pcre_study() can be passed directly to
1319 pcre_exec(). However, a pcre_extra block also contains other fields
1320 that can be set by the caller before the block is passed; these are
1321 described below in the section on matching a pattern.
1322
1323 If studying the pattern does not produce any additional information,
1324 pcre_study() returns NULL. In that circumstance, if the calling program
1325 wants to pass any of the other fields to pcre_exec(), it must set up
1326 its own pcre_extra block.
1327
1328 The second argument of pcre_study() contains option bits. At present,
1329 no options are defined, and this argument should always be zero.
1330
1331 The third argument for pcre_study() is a pointer for an error message.
1332 If studying succeeds (even if no data is returned), the variable it
1333 points to is set to NULL. Otherwise it is set to point to a textual
1334 error message. This is a static string that is part of the library. You
1335 must not try to free it. You should test the error pointer for NULL
1336 after calling pcre_study(), to be sure that it has run successfully.
1337
1338 This is a typical call to pcre_study():
1339
1340 pcre_extra *pe;
1341 pe = pcre_study(
1342 re, /* result of pcre_compile() */
1343 0, /* no options exist */
1344 &error); /* set to NULL or points to a message */
1345
1346 At present, studying a pattern is useful only for non-anchored patterns
1347 that do not have a single fixed starting character. A bitmap of possi-
1348 ble starting bytes is created.
1349
1350
1351 LOCALE SUPPORT
1352
1353 PCRE handles caseless matching, and determines whether characters are
1354 letters, digits, or whatever, by reference to a set of tables, indexed
1355 by character value. When running in UTF-8 mode, this applies only to
1356 characters with codes less than 128. Higher-valued codes never match
1357 escapes such as \w or \d, but can be tested with \p if PCRE is built
1358 with Unicode character property support. The use of locales with Uni-
1359 code is discouraged. If you are handling characters with codes greater
1360 than 128, you should either use UTF-8 and Unicode, or use locales, but
1361 not try to mix the two.
1362
1363 PCRE contains an internal set of tables that are used when the final
1364 argument of pcre_compile() is NULL. These are sufficient for many
1365 applications. Normally, the internal tables recognize only ASCII char-
1366 acters. However, when PCRE is built, it is possible to cause the inter-
1367 nal tables to be rebuilt in the default "C" locale of the local system,
1368 which may cause them to be different.
1369
1370 The internal tables can always be overridden by tables supplied by the
1371 application that calls PCRE. These may be created in a different locale
1372 from the default. As more and more applications change to using Uni-
1373 code, the need for this locale support is expected to die away.
1374
1375 External tables are built by calling the pcre_maketables() function,
1376 which has no arguments, in the relevant locale. The result can then be
1377 passed to pcre_compile() or pcre_exec() as often as necessary. For
1378 example, to build and use tables that are appropriate for the French
1379 locale (where accented characters with values greater than 128 are
1380 treated as letters), the following code could be used:
1381
1382 setlocale(LC_CTYPE, "fr_FR");
1383 tables = pcre_maketables();
1384 re = pcre_compile(..., tables);
1385
1386 The locale name "fr_FR" is used on Linux and other Unix-like systems;
1387 if you are using Windows, the name for the French locale is "french".
1388
1389 When pcre_maketables() runs, the tables are built in memory that is
1390 obtained via pcre_malloc. It is the caller's responsibility to ensure
1391 that the memory containing the tables remains available for as long as
1392 it is needed.
1393
1394 The pointer that is passed to pcre_compile() is saved with the compiled
1395 pattern, and the same tables are used via this pointer by pcre_study()
1396 and normally also by pcre_exec(). Thus, by default, for any single pat-
1397 tern, compilation, studying and matching all happen in the same locale,
1398 but different patterns can be compiled in different locales.
1399
1400 It is possible to pass a table pointer or NULL (indicating the use of
1401 the internal tables) to pcre_exec(). Although not intended for this
1402 purpose, this facility could be used to match a pattern in a different
1403 locale from the one in which it was compiled. Passing table pointers at
1404 run time is discussed below in the section on matching a pattern.
1405
1406
1407 INFORMATION ABOUT A PATTERN
1408
1409 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1410 int what, void *where);
1411
1412 The pcre_fullinfo() function returns information about a compiled
1413 pattern. It replaces the obsolete pcre_info() function, which is
1414 nevertheless retained for backwards compatibility (and is documented below).
1415
1416 The first argument for pcre_fullinfo() is a pointer to the compiled
1417 pattern. The second argument is the result of pcre_study(), or NULL if
1418 the pattern was not studied. The third argument specifies which piece
1419 of information is required, and the fourth argument is a pointer to a
1420 variable to receive the data. The yield of the function is zero for
1421 success, or one of the following negative numbers:
1422
1423 PCRE_ERROR_NULL       the argument code was NULL
1424                       the argument where was NULL
1425 PCRE_ERROR_BADMAGIC   the "magic number" was not found
1426 PCRE_ERROR_BADOPTION  the value of what was invalid
1427
1428 The "magic number" is placed at the start of each compiled pattern as
1429 a simple check against passing an arbitrary memory pointer. Here is a
1430 typical call of pcre_fullinfo(), to obtain the length of the compiled
1431 pattern:
1432
1433 int rc;
1434 size_t length;
1435 rc = pcre_fullinfo(
1436 re, /* result of pcre_compile() */
1437 pe, /* result of pcre_study(), or NULL */
1438 PCRE_INFO_SIZE, /* what is required */
1439 &length); /* where to put the data */
1440
1441 The possible values for the third argument are defined in pcre.h, and
1442 are as follows:
1443
1444 PCRE_INFO_BACKREFMAX
1445
1446 Return the number of the highest back reference in the pattern. The
1447 fourth argument should point to an int variable. Zero is returned if
1448 there are no back references.
1449
1450 PCRE_INFO_CAPTURECOUNT
1451
1452 Return the number of capturing subpatterns in the pattern. The fourth
1453 argument should point to an int variable.
1454
1455 PCRE_INFO_DEFAULT_TABLES
1456
1457 Return a pointer to the internal default character tables within PCRE.
1458 The fourth argument should point to an unsigned char * variable. This
1459 information call is provided for internal use by the pcre_study() func-
1460 tion. External callers can cause PCRE to use its internal tables by
1461 passing a NULL table pointer.
1462
1463 PCRE_INFO_FIRSTBYTE
1464
1465 Return information about the first byte of any matched string, for a
1466 non-anchored pattern. The fourth argument should point to an int vari-
1467 able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1468 is still recognized for backwards compatibility.)
1469
1470 If there is a fixed first byte, for example, from a pattern such as
1471 (cat|cow|coyote), its value is returned. Otherwise, if either
1472
1473 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
1474 branch starts with "^", or
1475
1476 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1477 set (if it were set, the pattern would be anchored),
1478
1479 -1 is returned, indicating that the pattern matches only at the start
1480 of a subject string or after any newline within the string. Otherwise
1481 -2 is returned. For anchored patterns, -2 is returned.
1482
1483 PCRE_INFO_FIRSTTABLE
1484
1485 If the pattern was studied, and this resulted in the construction of a
1486 256-bit table indicating a fixed set of bytes for the first byte in any
1487 matching string, a pointer to the table is returned. Otherwise NULL is
1488 returned. The fourth argument should point to an unsigned char * vari-
1489 able.
1490
1491 PCRE_INFO_JCHANGED
1492
1493 Return 1 if the (?J) option setting is used in the pattern, otherwise
1494 0. The fourth argument should point to an int variable. The (?J) inter-
1495 nal option setting changes the local PCRE_DUPNAMES option.
1496
1497 PCRE_INFO_LASTLITERAL
1498
1499 Return the value of the rightmost literal byte that must exist in any
1500 matched string, other than at its start, if such a byte has been
1501 recorded. The fourth argument should point to an int variable. If there
1502 is no such byte, -1 is returned. For anchored patterns, a last literal
1503 byte is recorded only if it follows something of variable length. For
1504 example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1505 /^a\dz\d/ the returned value is -1.
1506
1507 PCRE_INFO_NAMECOUNT
1508 PCRE_INFO_NAMEENTRYSIZE
1509 PCRE_INFO_NAMETABLE
1510
1511 PCRE supports the use of named as well as numbered capturing parenthe-
1512 ses. The names are just an additional way of identifying the parenthe-
1513 ses, which still acquire numbers. Several convenience functions such as
1514 pcre_get_named_substring() are provided for extracting captured sub-
1515 strings by name. It is also possible to extract the data directly, by
1516 first converting the name to a number in order to access the correct
1517 pointers in the output vector (described with pcre_exec() below). To do
1518 the conversion, you need to use the name-to-number map, which is
1519 described by these three values.
1520
1521 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1522 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1523 of each entry; both of these return an int value. The entry size
1524 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
1525 a pointer to the first entry of the table (a pointer to char). The
1526 first two bytes of each entry are the number of the capturing parenthe-
1527 sis, most significant byte first. The rest of the entry is the corre-
1528 sponding name, zero terminated. The names are in alphabetical order.
1529 When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1530 theses numbers. For example, consider the following pattern (assume
1531 PCRE_EXTENDED is set, so white space - including newlines - is
1532 ignored):
1533
1534 (?<date> (?<year>(\d\d)?\d\d) -
1535 (?<month>\d\d) - (?<day>\d\d) )
1536
1537 There are four named subpatterns, so the table has four entries, and
1538 each entry in the table is eight bytes long. The table is as follows,
1539 with non-printing bytes shown in hexadecimal, and undefined bytes shown
1540 as ??:
1541
1542 00 01 d a t e 00 ??
1543 00 05 d a y 00 ?? ??
1544 00 04 m o n t h 00
1545 00 02 y e a r 00 ??
1546
1547 When writing code to extract data from named subpatterns using the
1548 name-to-number map, remember that the length of the entries is likely
1549 to be different for each compiled pattern.
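
The lookup described above can be sketched as a short function. This is
only an illustration of the table layout, not code taken from PCRE: the
name lookup_name_number() is hypothetical, and the sketch assumes that the
caller has already obtained the three values with pcre_fullinfo() and has
cast the table pointer to unsigned char *.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: find a subpattern name in a
   name-to-number map laid out as described above.
     table       pointer obtained via PCRE_INFO_NAMETABLE
     count       value obtained via PCRE_INFO_NAMECOUNT
     entry_size  value obtained via PCRE_INFO_NAMEENTRYSIZE
   Returns the subpattern number, or -1 if the name is not in the map.
   With duplicate names, the first (lowest-numbered) entry is returned. */
int lookup_name_number(const unsigned char *table, int count,
                       int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* The zero-terminated name starts at byte 2 of the entry. */
        if (strcmp((const char *)(entry + 2), name) == 0) {
            /* Bytes 0 and 1 hold the number, most significant byte first. */
            return (entry[0] << 8) | entry[1];
        }
    }
    return -1;
}
```

A linear scan keeps the sketch short. Because the names are stored in
alphabetical order, a binary search over the entries is also possible.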
1550
1551 PCRE_INFO_OKPARTIAL
1552
1553 Return 1 if the pattern can be used for partial matching, otherwise 0.
1554 The fourth argument should point to an int variable. The pcrepartial
1555 documentation lists the restrictions that apply to patterns when par-
1556 tial matching is used.
1557
1558 PCRE_INFO_OPTIONS
1559
1560 Return a copy of the options with which the pattern was compiled. The
1561 fourth argument should point to an unsigned long int variable. These
1562 option bits are those specified in the call to pcre_compile(), modified
1563 by any top-level option settings at the start of the pattern itself. In
1564 other words, they are the options that will be in force when matching
1565 starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
1566 the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
1567 and PCRE_EXTENDED.
1568
1569 A pattern is automatically anchored by PCRE if all of its top-level
1570 alternatives begin with one of the following:
1571
1572 ^   unless PCRE_MULTILINE is set
1573 \A  always
1574 \G  always
1575 .*  if PCRE_DOTALL is set and there are no back
1576     references to the subpattern in which .* appears
1577
1578 For such patterns, the PCRE_ANCHORED bit is set in the options returned
1579 by pcre_fullinfo().
1580
1581 PCRE_INFO_SIZE
1582
1583 Return the size of the compiled pattern, that is, the value that was
1584 passed as the argument to pcre_malloc() when PCRE was getting memory in
1585 which to place the compiled data. The fourth argument should point to a
1586 size_t variable.
1587
1588 PCRE_INFO_STUDYSIZE
1589
1590 Return the size of the data block pointed to by the study_data field in
1591 a pcre_extra block. That is, it is the value that was passed to
1592 pcre_malloc() when PCRE was getting memory into which to place the data
1593 created by pcre_study(). The fourth argument should point to a size_t
1594 variable.
1595
1596
1597 OBSOLETE INFO FUNCTION
1598
1599 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1600
1601 The pcre_info() function is now obsolete because its interface is too
1602 restrictive to return all the available data about a compiled pattern.
1603 New programs should use pcre_fullinfo() instead. The yield of
1604 pcre_info() is the number of capturing subpatterns, or one of the fol-
1605 lowing negative numbers:
1606
1607 PCRE_ERROR_NULL      the argument code was NULL
1608 PCRE_ERROR_BADMAGIC  the "magic number" was not found
1609
1610 If the optptr argument is not NULL, a copy of the options with which
1611 the pattern was compiled is placed in the integer it points to (see
1612 PCRE_INFO_OPTIONS above).
1613
1614 If the pattern is not anchored and the firstcharptr argument is not
1615 NULL, it is used to pass back information about the first character of
1616 any matched string (see PCRE_INFO_FIRSTBYTE above).
1617
1618
1619 REFERENCE COUNTS
1620
1621 int pcre_refcount(pcre *code, int adjust);
1622
1623 The pcre_refcount() function is used to maintain a reference count in
1624 the data block that contains a compiled pattern. It is provided for the
1625 benefit of applications that operate in an object-oriented manner,
1626 where different parts of the application may be using the same compiled
1627 pattern, but you want to free the block when they are all done.
1628
1629 When a pattern is compiled, the reference count field is initialized to
1630 zero. It is changed only by calling this function, whose action is to
1631 add the adjust value (which may be positive or negative) to it. The
1632 yield of the function is the new value. However, the value of the count
1633 is constrained to lie between 0 and 65535, inclusive. If the new value
1634 is outside these limits, it is forced to the appropriate limit value.
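
The arithmetic in the preceding paragraph can be modelled in a few lines.
This is a model of the documented behaviour only, not PCRE's own code; the
function name refcount_after_adjust() is hypothetical, and the current
count is passed in explicitly rather than read from a compiled pattern.

```c
/* Model of the documented rule: add adjust to the current count, then
   force the result into the range 0..65535. Assumes current is already
   within that range. */
int refcount_after_adjust(int current, int adjust)
{
    long long sum = (long long)current + (long long)adjust;
    if (sum < 0) return 0;          /* forced to the lower limit */
    if (sum > 65535) return 65535;  /* forced to the upper limit */
    return (int)sum;
}
```

One consequence is that a count forced to a limit no longer reflects the
true number of users, so an application that may exceed 65535 references
has to keep its own count.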
1635
1636 Except when it is zero, the reference count is not correctly preserved
1637 if a pattern is compiled on one host and then transferred to a host
1638 whose byte-order is different. (This seems a highly unlikely scenario.)
1639
1640
1641 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1642
1643 int pcre_exec(const pcre *code, const pcre_extra *extra,
1644 const char *subject, int length, int startoffset,
1645 int options, int *ovector, int ovecsize);
1646
1647 The function pcre_exec() is called to match a subject string against a
1648 compiled pattern, which is passed in the code argument. If the pattern
1649 has been studied, the result of the study should be passed in the extra
1650 argument. This function is the main matching facility of the library,
1651 and it operates in a Perl-like manner. For specialist use there is also
1652 an alternative matching function, which is described below in the sec-
1653 tion about the pcre_dfa_exec() function.
1654
1655 In most applications, the pattern will have been compiled (and option-
1656 ally studied) in the same process that calls pcre_exec(). However, it
1657 is possible to save compiled patterns and study data, and then use them
1658 later in different processes, possibly even on different hosts. For a
1659 discussion about this, see the pcreprecompile documentation.
1660
1661 Here is an example of a simple call to pcre_exec():
1662
1663 int rc;
1664 int ovector[30];
1665 rc = pcre_exec(
1666 re, /* result of pcre_compile() */
1667 NULL, /* we didn't study the pattern */
1668 "some string", /* the subject string */
1669 11, /* the length of the subject string */
1670 0, /* start at offset 0 in the subject */
1671 0, /* default options */
1672 ovector, /* vector of integers for substring information */
1673 30); /* number of elements (NOT size in bytes) */
1674
1675 Extra data for pcre_exec()
1676
1677 If the extra argument is not NULL, it must point to a pcre_extra data
1678 block. The pcre_study() function returns such a block (when it doesn't
1679 return NULL), but you can also create one for yourself, and pass addi-
1680 tional information in it. The pcre_extra block contains the following
1681 fields (not necessarily in this order):
1682
1683 unsigned long int flags;
1684 void *study_data;
1685 unsigned long int match_limit;
1686 unsigned long int match_limit_recursion;
1687 void *callout_data;
1688 const unsigned char *tables;
1689
1690 The flags field is a bitmap that specifies which of the other fields
1691 are set. The flag bits are:
1692
1693 PCRE_EXTRA_STUDY_DATA
1694 PCRE_EXTRA_MATCH_LIMIT
1695 PCRE_EXTRA_MATCH_LIMIT_RECURSION
1696 PCRE_EXTRA_CALLOUT_DATA
1697 PCRE_EXTRA_TABLES
1698
1699 Other flag bits should be set to zero. The study_data field is set in
1700 the pcre_extra block that is returned by pcre_study(), together with
1701 the appropriate flag bit. You should not set this yourself, but you may
1702 add to the block by setting the other fields and their corresponding
1703 flag bits.
1704
1705 The match_limit field provides a means of preventing PCRE from using up
1706 a vast amount of resources when running patterns that are not going to
1707 match, but which have a very large number of possibilities in their
1708 search trees. The classic example is the use of nested unlimited
1709 repeats.
1710
1711 Internally, PCRE uses a function called match() which it calls repeat-
1712 edly (sometimes recursively). The limit set by match_limit is imposed
1713 on the number of times this function is called during a match, which
1714 has the effect of limiting the amount of backtracking that can take
1715 place. For patterns that are not anchored, the count restarts from zero
1716 for each position in the subject string.
1717
1718 The default value for the limit can be set when PCRE is built; the
1719 default default is 10 million, which handles all but the most extreme
1720 cases. You can override the default by supplying pcre_exec() with a
1721 pcre_extra block in which match_limit is set, and
1722 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
1723 exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1724
1725 The match_limit_recursion field is similar to match_limit, but instead
1726 of limiting the total number of times that match() is called, it limits
1727 the depth of recursion. The recursion depth is a smaller number than
1728 the total number of calls, because not all calls to match() are recur-
1729 sive. This limit is of use only if it is set smaller than match_limit.
1730
1731 Limiting the recursion depth limits the amount of stack that can be
1732 used, or, when PCRE has been compiled to use memory on the heap instead
1733 of the stack, the amount of heap memory that can be used.
1734
1735 The default value for match_limit_recursion can be set when PCRE is
1736 built; the default default is the same value as the default for
1737 match_limit. You can override the default by supplying pcre_exec() with
1738 a pcre_extra block in which match_limit_recursion is set, and
1739 PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
1740 limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1741
1742 The callout_data field is used in conjunction with the "callout"
1743 feature, which is described in the pcrecallout documentation.
1744
1745 The tables field is used to pass a character tables pointer to
1746 pcre_exec(); this overrides the value that is stored with the compiled
1747 pattern. A non-NULL value is stored with the compiled pattern only if
1748 custom tables were supplied to pcre_compile() via its tableptr argu-
1749 ment. If NULL is passed to pcre_exec() using this mechanism, it forces
1750 PCRE's internal tables to be used. This facility is helpful when re-
1751 using patterns that have been saved after compiling with an external
1752 set of tables, because the external tables might be at a different
1753 address when pcre_exec() is called. See the pcreprecompile documenta-
1754 tion for a discussion of saving compiled patterns for later use.
1755
1756 Option bits for pcre_exec()
1757
1758 The unused bits of the options argument for pcre_exec() must be zero.
1759 The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
1760 PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and
1761 PCRE_PARTIAL.
1762
1763 PCRE_ANCHORED
1764
1765 The PCRE_ANCHORED option limits pcre_exec() to matching at the first
1766 matching position. If a pattern was compiled with PCRE_ANCHORED, or
1767 turned out to be anchored by virtue of its contents, it cannot be made
1768 unanchored at matching time.
1769
1770 PCRE_NEWLINE_CR
1771 PCRE_NEWLINE_LF
1772 PCRE_NEWLINE_CRLF
1773 PCRE_NEWLINE_ANYCRLF
1774 PCRE_NEWLINE_ANY
1775
1776 These options override the newline definition that was chosen or
1777 defaulted when the pattern was compiled. For details, see the descrip-
1778 tion of pcre_compile() above. During matching, the newline choice
1779 affects the behaviour of the dot, circumflex, and dollar metacharac-
1780 ters. It may also alter the way the match position is advanced after a
1781 match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF,
1782 PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt
1783 fails when the current position is at a CRLF sequence, the match posi-
1784 tion is advanced by two characters instead of one, in other words, to
1785 after the CRLF.
1786
1787 PCRE_NOTBOL
1788
1789 This option specifies that the first character of the subject string is not
1790 the beginning of a line, so the circumflex metacharacter should not
1791 match before it. Setting this without PCRE_MULTILINE (at compile time)
1792 causes circumflex never to match. This option affects only the behav-
1793 iour of the circumflex metacharacter. It does not affect \A.
1794
1795 PCRE_NOTEOL
1796
1797 This option specifies that the end of the subject string is not the end
1798 of a line, so the dollar metacharacter should not match it nor (except
1799 in multiline mode) a newline immediately before it. Setting this with-
1800 out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1801 option affects only the behaviour of the dollar metacharacter. It does
1802 not affect \Z or \z.
1803
1804 PCRE_NOTEMPTY
1805
1806 An empty string is not considered to be a valid match if this option is
1807 set. If there are alternatives in the pattern, they are tried. If all
1808 the alternatives match the empty string, the entire match fails. For
1809 example, if the pattern
1810
1811 a?b?
1812
1813 is applied to a string not beginning with "a" or "b", it matches the
1814 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
1815 match is not valid, so PCRE searches further into the string for occur-
1816 rences of "a" or "b".
1817
1818 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1819 cial case of a pattern match of the empty string within its split()
1820 function, and when using the /g modifier. It is possible to emulate
1821 Perl's behaviour after matching a null string by first trying the match
1822 again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1823 if that fails by advancing the starting offset (see below) and trying
1824 an ordinary match again. There is some code that demonstrates how to do
1825 this in the pcredemo.c sample program.
1826
1827 PCRE_NO_UTF8_CHECK
1828
1829 When PCRE_UTF8 is set at compile time, the validity of the subject as a
1830 UTF-8 string is automatically checked when pcre_exec() is subsequently
1831 called. The value of startoffset is also checked to ensure that it
1832 points to the start of a UTF-8 character. If an invalid UTF-8 sequence
1833 of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If
1834 startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is
1835 returned.
1836
1837 If you already know that your subject is valid, and you want to skip
1838 these checks for performance reasons, you can set the
1839 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
1840 do this for the second and subsequent calls to pcre_exec() if you are
1841 making repeated calls to find all the matches in a single subject
1842 string. However, you should be sure that the value of startoffset
1843 points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1844 set, the effect of passing an invalid UTF-8 string as a subject, or a
1845 value of startoffset that does not point to the start of a UTF-8 char-
1846 acter, is undefined. Your program may crash.
1847
1848 PCRE_PARTIAL
1849
1850 This option turns on the partial matching feature. If the subject
1851 string fails to match the pattern, but at some point during the match-
1852 ing process the end of the subject was reached (that is, the subject
1853 partially matches the pattern and the failure to match occurred only
1854 because there were not enough subject characters), pcre_exec() returns
1855 PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is
1856 used, there are restrictions on what may appear in the pattern. These
1857 are discussed in the pcrepartial documentation.
1858
1859 The string to be matched by pcre_exec()
1860
1861 The subject string is passed to pcre_exec() as a pointer in subject, a
1862 length in length, and a starting byte offset in startoffset. In UTF-8
1863 mode, the byte offset must point to the start of a UTF-8 character.
1864 Unlike the pattern string, the subject may contain binary zero bytes.
1865 When the starting offset is zero, the search for a match starts at the
1866 beginning of the subject, and this is by far the most common case.
1867
1868 A non-zero starting offset is useful when searching for another match
1869 in the same subject by calling pcre_exec() again after a previous suc-
1870 cess. Setting startoffset differs from just passing over a shortened
1871 string and setting PCRE_NOTBOL in the case of a pattern that begins
1872 with any kind of lookbehind. For example, consider the pattern
1873
1874 \Biss\B
1875
1876 which finds occurrences of "iss" in the middle of words. (\B matches
1877 only if the current position in the subject is not a word boundary.)
1878 When applied to the string "Mississippi" the first call to pcre_exec()
1879 finds the first occurrence. If pcre_exec() is called again with just
1880 the remainder of the subject, namely "issippi", it does not match,
1881 because \B is always false at the start of the subject, which is deemed
1882 to be a word boundary. However, if pcre_exec() is passed the entire
1883 string again, but with startoffset set to 4, it finds the second occur-
1884 rence of "iss" because it is able to look behind the starting point to
1885 discover that it is preceded by a letter.
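
The difference can be reproduced without PCRE by a small hand-written
scanner that implements only this one pattern. It is an illustration of
why the text before the starting offset matters, not a description of how
pcre_exec() works internally. The function name find_inner_iss() and the
word-character test (letters, digits, and underscore in the C locale) are
assumptions made for the sketch.

```c
#include <ctype.h>
#include <string.h>

/* 1 if index i holds a word character; positions outside the subject
   count as non-word, which is what makes the start a word boundary. */
static int is_word_at(const char *subject, int length, int i)
{
    if (i < 0 || i >= length) return 0;
    return isalnum((unsigned char)subject[i]) || subject[i] == '_';
}

/* \B: true when pos is NOT a word boundary, that is, when the
   characters on either side are both word or both non-word. */
static int not_word_boundary(const char *subject, int length, int pos)
{
    return is_word_at(subject, length, pos - 1) ==
           is_word_at(subject, length, pos);
}

/* Hypothetical stand-in for matching \Biss\B: return the offset of the
   first match at or after startoffset (assumed non-negative), or -1.
   Like pcre_exec(), it is given the whole subject and may look behind
   startoffset. */
int find_inner_iss(const char *subject, int length, int startoffset)
{
    int pos;
    for (pos = startoffset; pos + 3 <= length; pos++) {
        if (memcmp(subject + pos, "iss", 3) == 0 &&
            not_word_boundary(subject, length, pos) &&
            not_word_boundary(subject, length, pos + 3))
            return pos;
    }
    return -1;
}
```

Given the whole subject and a non-zero starting offset, the scanner can
see the letter before that offset and accepts a match there. Given only
the remainder of the subject, position 0 counts as a word boundary and
the same text is rejected.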
1886
1887 If a non-zero starting offset is passed when the pattern is anchored,
1888 one attempt to match at the given offset is made. This can only succeed
1889 if the pattern does not require the match to be at the start of the
1890 subject.
1891
1892 How pcre_exec() returns captured substrings
1893
1894 In general, a pattern matches a certain portion of the subject, and in
1895 addition, further substrings from the subject may be picked out by
1896 parts of the pattern. Following the usage in Jeffrey Friedl's book,
1897 this is called "capturing" in what follows, and the phrase "capturing
1898 subpattern" is used for a fragment of a pattern that picks out a sub-
1899 string. PCRE supports several other kinds of parenthesized subpattern
1900 that do not cause substrings to be captured.
1901
1902 Captured substrings are returned to the caller via a vector of integer
1903 offsets whose address is passed in ovector. The number of elements in
1904 the vector is passed in ovecsize, which must be a non-negative number.
1905 Note: this argument is NOT the size of ovector in bytes.
1906
1907 The first two-thirds of the vector is used to pass back captured sub-
1908 strings, each substring using a pair of integers. The remaining third
1909 of the vector is used as workspace by pcre_exec() while matching cap-
1910 turing subpatterns, and is not available for passing back information.
1911 The length passed in ovecsize should always be a multiple of three. If
1912 it is not, it is rounded down.
1913
1914 When a match is successful, information about captured substrings is
1915 returned in pairs of integers, starting at the beginning of ovector,
1916 and continuing up to two-thirds of its length at the most. The first
1917 element of a pair is set to the offset of the first character in a sub-
1918 string, and the second is set to the offset of the first character
1919 after the end of a substring. The first pair, ovector[0] and ovec-
1920 tor[1], identify the portion of the subject string matched by the
1921 entire pattern. The next pair is used for the first capturing subpat-
1922 tern, and so on. The value returned by pcre_exec() is one more than the
1923 highest numbered pair that has been set. For example, if two substrings
1924 have been captured, the returned value is 3. If there are no capturing
1925 subpatterns, the return value from a successful match is 1, indicating
1926 that just the first pair of offsets has been set.
1927
1928 If a capturing subpattern is matched repeatedly, it is the last portion
1929 of the string that it matched that is returned.
1930
1931 If the vector is too small to hold all the captured substring offsets,
1932 it is used as far as possible (up to two-thirds of its length), and the
1933 function returns a value of zero. In particular, if the substring off-
1934 sets are not of interest, pcre_exec() may be called with ovector passed
1935 as NULL and ovecsize as zero. However, if the pattern contains back
1936 references and the ovector is not big enough to remember the related
1937 substrings, PCRE has to get additional memory for use during matching.
1938 Thus it is usually advisable to supply an ovector.
1939
1940 The pcre_fullinfo() function can be used to find out how many capturing
1941 subpatterns there are in a compiled pattern. The smallest size for
1942 ovector that will allow for n captured substrings, in addition to the
1943 offsets of the substring matched by the whole pattern, is (n+1)*3.
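
The sizing rules above reduce to two small calculations, sketched here.
Both function names are hypothetical; neither is provided by PCRE.

```c
/* Smallest ovecsize that can return capture_count captured substrings
   in addition to the overall match: (n+1)*3. */
int ovector_min_size(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that a vector of ovecsize elements can pass
   back. Only the first two-thirds of the vector holds pairs, and a size
   that is not a multiple of three is rounded down. */
int ovector_pair_capacity(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, the 30-element vector used in the pcre_exec() example
earlier in this section has room for the overall match and nine captured
substrings.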
1944
1945 It is possible for capturing subpattern number n+1 to match some part
1946 of the subject when subpattern n has not been used at all. For example,
1947 if the string "abc" is matched against the pattern (a|(z))(bc) the
1948 return from the function is 4, and subpatterns 1 and 3 are matched, but
1949 2 is not. When this happens, both values in the offset pairs corre-
1950 sponding to unused subpatterns are set to -1.
1951
1952 Offset values that correspond to unused subpatterns at the end of the
1953 expression are also set to -1. For example, if the string "abc" is
1954 matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
1955 matched. The return from the function is 2, because the highest used
1956 capturing subpattern number is 1. However, you can refer to the offsets
1957 for the second and third capturing subpatterns if you wish (assuming
1958 the vector is large enough, of course).
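
Code that reads ovector directly therefore has to check for -1 before
using a pair. The following is a minimal sketch of such a check. The
helper copy_capture() is hypothetical and is not one of PCRE's
convenience functions; it assumes rc is a positive return value from
pcre_exec() and, as a design choice, refuses subpattern numbers that are
not below rc.

```c
#include <string.h>

/* Hypothetical helper: copy captured substring number n into buf as a
   zero-terminated string. Returns its length, or -1 if n is out of
   range, the subpattern was not used, or buf is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, int bufsize)
{
    int start, end, len;
    if (n < 0 || n >= rc) return -1;
    start = ovector[2 * n];      /* offset of the first character */
    end = ovector[2 * n + 1];    /* offset just past the last character */
    if (start < 0 || end < start) return -1;   /* unused: both are -1 */
    len = end - start;
    if (len + 1 > bufsize) return -1;          /* no room for the zero */
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For the (a|(z))(bc) example above, such a helper yields "a" for
subpattern 1, reports subpattern 2 as unused, and yields "bc" for
subpattern 3.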
1959
1960 Some convenience functions are provided for extracting the captured
1961 substrings as separate strings. These are described below.
1962
1963 Error return values from pcre_exec()
1964
1965 If pcre_exec() fails, it returns a negative number. The following are
1966 defined in the header file:
1967
1968 PCRE_ERROR_NOMATCH (-1)
1969
1970 The subject string did not match the pattern.
1971
1972 PCRE_ERROR_NULL (-2)
1973
1974 Either code or subject was passed as NULL, or ovector was NULL and
1975 ovecsize was not zero.
1976
1977 PCRE_ERROR_BADOPTION (-3)
1978
1979 An unrecognized bit was set in the options argument.
1980
1981 PCRE_ERROR_BADMAGIC (-4)
1982
1983 PCRE stores a 4-byte "magic number" at the start of the compiled code,
1984 to catch the case when it is passed a junk pointer and to detect when a
1985 pattern that was compiled in an environment of one endianness is run in
1986 an environment with the other endianness. This is the error that PCRE
1987 gives when the magic number is not present.
1988
1989 PCRE_ERROR_UNKNOWN_OPCODE (-5)
1990
1991 While running the pattern match, an unknown item was encountered in the
1992 compiled pattern. This error could be caused by a bug in PCRE or by
1993 overwriting of the compiled pattern.
1994
1995 PCRE_ERROR_NOMEMORY (-6)
1996
1997 If a pattern contains back references, but the ovector that is passed
1998 to pcre_exec() is not big enough to remember the referenced substrings,
1999 PCRE gets a block of memory at the start of matching to use for this
2000 purpose. If the call via pcre_malloc() fails, this error is given. The
2001 memory is automatically freed at the end of matching.
2002
2003 PCRE_ERROR_NOSUBSTRING (-7)
2004
2005 This error is used by the pcre_copy_substring(), pcre_get_substring(),
2006 and pcre_get_substring_list() functions (see below). It is never
2007 returned by pcre_exec().
2008
2009 PCRE_ERROR_MATCHLIMIT (-8)
2010
2011 The backtracking limit, as specified by the match_limit field in a
2012 pcre_extra structure (or defaulted), was reached. See the description
2013 above.
2014
2015 PCRE_ERROR_CALLOUT (-9)
2016
2017 This error is never generated by pcre_exec() itself. It is provided for
2018 use by callout functions that want to yield a distinctive error code.
2019 See the pcrecallout documentation for details.
2020
2021 PCRE_ERROR_BADUTF8 (-10)
2022
2023 A string that contains an invalid UTF-8 byte sequence was passed as a
2024 subject.
2025
2026 PCRE_ERROR_BADUTF8_OFFSET (-11)
2027
2028 The UTF-8 byte sequence that was passed as a subject was valid, but the
2029 value of startoffset did not point to the beginning of a UTF-8 charac-
2030 ter.
2031
2032 PCRE_ERROR_PARTIAL (-12)
2033
2034 The subject string did not match, but it did match partially. See the
2035 pcrepartial documentation for details of partial matching.
2036
2037 PCRE_ERROR_BADPARTIAL (-13)
2038
2039 The PCRE_PARTIAL option was used with a compiled pattern containing
2040 items that are not supported for partial matching. See the pcrepartial
2041 documentation for details of partial matching.
2042
2043 PCRE_ERROR_INTERNAL (-14)
2044
2045 An unexpected internal error has occurred. This error could be caused
2046 by a bug in PCRE or by overwriting of the compiled pattern.
2047
2048 PCRE_ERROR_BADCOUNT (-15)
2049
2050 This error is given if the value of the ovecsize argument is negative.
2051
2052 PCRE_ERROR_RECURSIONLIMIT (-21)
2053
2054 The internal recursion limit, as specified by the match_limit_recursion
2055 field in a pcre_extra structure (or defaulted), was reached. See the
2056 description above.
2057
2058 PCRE_ERROR_BADNEWLINE (-23)
2059
2060 An invalid combination of PCRE_NEWLINE_xxx options was given.
2061
2062 Error numbers -16 to -20 and -22 are not used by pcre_exec().
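
For logging, the codes listed above can be turned into their names with a
simple switch. This is a sketch only: the function exec_error_name() is
hypothetical, and the numeric values are written out so that the example
stands alone. Real code should include pcre.h and use the PCRE_ERROR_xxx
macros rather than literal numbers.

```c
/* Hypothetical helper: map a pcre_exec() return value to the name of
   the corresponding error as listed above. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:
        /* Zero and positive values are not errors; other negative
           values are not listed for pcre_exec(). */
        return rc >= 0 ? "(not an error)" : "(unlisted error code)";
    }
}
```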
2063
2064
2065 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2066
2067 int pcre_copy_substring(const char *subject, int *ovector,
2068 int stringcount, int stringnumber, char *buffer,
2069 int buffersize);
2070
2071 int pcre_get_substring(const char *subject, int *ovector,
2072 int stringcount, int stringnumber,
2073 const char **stringptr);
2074
2075 int pcre_get_substring_list(const char *subject,
2076 int *ovector, int stringcount, const char ***listptr);
2077
2078 Captured substrings can be accessed directly by using the offsets
2079 returned by pcre_exec() in ovector. For convenience, the functions
2080 pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub-
2081 string_list() are provided for extracting captured substrings as new,
2082 separate, zero-terminated strings. These functions identify substrings
2083 by number. The next section describes functions for extracting named
2084 substrings.
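
The direct access mentioned above can be sketched as follows. The
sketch relies on the layout in which substring n occupies the pair
ovector[2*n] (offset of its first character) and ovector[2*n+1] (offset
just past its end). The helper name copy_by_offsets is hypothetical,
and in the usage note the offsets are written out by hand where a real
program would take them from pcre_exec().

```c
#include <assert.h>
#include <string.h>

/* Copy captured substring number n out of subject, using the offset
   pair ovector[2*n], ovector[2*n+1]. Returns the length copied, or -1
   if the substring is unset or does not fit in buf together with its
   terminating zero. */
int copy_by_offsets(const char *subject, const int *ovector, int n,
                    char *buf, int bufsize)
{
  int start, end, len;
  assert(subject != NULL && ovector != NULL && buf != NULL && n >= 0);
  start = ovector[2*n];
  end = ovector[2*n + 1];
  if (start < 0 || end < start) return -1;   /* unset substring */
  len = end - start;
  if (len + 1 > bufsize) return -1;          /* buffer too small */
  memcpy(buf, subject + start, (size_t)len);
  buf[len] = '\0';
  return len;
}
```

For instance, if a pattern such as (\w+)=(\w+) matched all of the
subject "key=value", the vector would begin 0, 9, 0, 3, 4, 9, and
asking for substring 2 would copy "value".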
2085
2086 A substring that contains a binary zero is correctly extracted and has
2087 a further zero added on the end, but the result is not, of course, a C
2088 string. However, you can process such a string by referring to the
2089 length that is returned by pcre_copy_substring() and pcre_get_sub-
2090 string(). Unfortunately, the interface to pcre_get_substring_list() is
2091 not adequate for handling strings containing binary zeros, because the
2092 end of the final string is not independently indicated.
2093
2094 The first three arguments are the same for all three of these func-
2095 tions: subject is the subject string that has just been successfully
2096 matched, ovector is a pointer to the vector of integer offsets that was
2097 passed to pcre_exec(), and stringcount is the number of substrings that
2098 were captured by the match, including the substring that matched the
2099 entire regular expression. This is the value returned by pcre_exec() if
2100 it is greater than zero. If pcre_exec() returned zero, indicating that
2101 it ran out of space in ovector, the value passed as stringcount should
2102 be the number of elements in the vector divided by three.
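
The rule for choosing stringcount can be written as a one-line helper.
This is only a sketch of the rule stated above; the name
effective_stringcount is hypothetical.

```c
#include <assert.h>

/* rc is a non-negative return value from pcre_exec(); ovecsize is the
   number of elements in ovector. A zero rc means the vector was too
   small, so the usable count is one third of its size. */
int effective_stringcount(int rc, int ovecsize)
{
  assert(rc >= 0 && ovecsize >= 0);
  return (rc > 0) ? rc : ovecsize / 3;
}
```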
2103
2104 The functions pcre_copy_substring() and pcre_get_substring() extract a
2105 single substring, whose number is given as stringnumber. A value of
2106 zero extracts the substring that matched the entire pattern, whereas
2107 higher values extract the captured substrings. For pcre_copy_sub-
2108 string(), the string is placed in buffer, whose length is given by
2109 buffersize, while for pcre_get_substring() a new block of memory is
2110 obtained via pcre_malloc, and its address is returned via stringptr.
2111 The yield of the function is the length of the string, not including
2112 the terminating zero, or one of these error codes:
2113
2114 PCRE_ERROR_NOMEMORY (-6)
2115
2116 The buffer was too small for pcre_copy_substring(), or the attempt to
2117 get memory failed for pcre_get_substring().
2118
2119 PCRE_ERROR_NOSUBSTRING (-7)
2120
2121 There is no substring whose number is stringnumber.
2122
2123 The pcre_get_substring_list() function extracts all available sub-
2124 strings and builds a list of pointers to them. All this is done in a
2125 single block of memory that is obtained via pcre_malloc. The address of
2126 the memory block is returned via listptr, which is also the start of
2127 the list of string pointers. The end of the list is marked by a NULL
2128 pointer. The yield of the function is zero if all went well, or the
2129 error code
2130
2131 PCRE_ERROR_NOMEMORY (-6)
2132
2133 if the attempt to get the memory block failed.
2134
2135 When any of these functions encounters a substring that is unset, which
2136 can happen when capturing subpattern number n+1 matches some part of
2137 the subject, but subpattern n has not been used at all, they return an
2138 empty string. This can be distinguished from a genuine zero-length sub-
2139 string by inspecting the appropriate offset in ovector, which is nega-
2140 tive for unset substrings.
2141
2142 The two convenience functions pcre_free_substring() and pcre_free_sub-
2143 string_list() can be used to free the memory returned by a previous
2144 call of pcre_get_substring() or pcre_get_substring_list(), respec-
2145 tively. They do nothing more than call the function pointed to by
2146 pcre_free, which of course could be called directly from a C program.
2147 However, PCRE is used in some situations where it is linked via a spe-
2148 cial interface to another programming language that cannot use
2149 pcre_free directly; it is for these cases that the functions are pro-
2150 vided.
2151
2152
2153 EXTRACTING CAPTURED SUBSTRINGS BY NAME
2154
2155 int pcre_get_stringnumber(const pcre *code,
2156 const char *name);
2157
2158 int pcre_copy_named_substring(const pcre *code,
2159 const char *subject, int *ovector,
2160 int stringcount, const char *stringname,
2161 char *buffer, int buffersize);
2162
2163 int pcre_get_named_substring(const pcre *code,
2164 const char *subject, int *ovector,
2165 int stringcount, const char *stringname,
2166 const char **stringptr);
2167
2168 To extract a substring by name, you first have to find the associated
2169 number. For example, for this pattern
2170
2171 (a+)b(?<xxx>\d+)...
2172
2173 the number of the subpattern called "xxx" is 2. If the name is known to
2174 be unique (PCRE_DUPNAMES was not set), you can find the number from the
2175 name by calling pcre_get_stringnumber(). The first argument is the com-
2176 piled pattern, and the second is the name. The yield of the function is
2177 the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2178 subpattern of that name.
2179
2180 Given the number, you can extract the substring directly, or use one of
2181 the functions described in the previous section. For convenience, there
2182 are also two functions that do the whole job.
2183
2184 Most of the arguments of pcre_copy_named_substring() and
2185 pcre_get_named_substring() are the same as those for the similarly
2186 named functions that extract by number. As these are described in the
2187 previous section, they are not re-described here. There are just two
2188 differences:
2189
2190 First, instead of a substring number, a substring name is given. Sec-
2191 ond, there is an extra argument, given at the start, which is a pointer
2192 to the compiled pattern. This is needed in order to gain access to the
2193 name-to-number translation table.
2194
2195 These functions call pcre_get_stringnumber(), and if it succeeds, they
2196 then call pcre_copy_substring() or pcre_get_substring(), as appropri-
2197 ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the
2198 behaviour may not be what you want (see the next section).
2199
2200
2201 DUPLICATE SUBPATTERN NAMES
2202
2203 int pcre_get_stringtable_entries(const pcre *code,
2204 const char *name, char **first, char **last);
2205
2206 When a pattern is compiled with the PCRE_DUPNAMES option, names for
2207 subpatterns are not required to be unique. Normally, patterns with
2208 duplicate names are such that in any one match, only one of the named
2209 subpatterns participates. An example is shown in the pcrepattern docu-
2210 mentation.
2211
2212 When duplicates are present, pcre_copy_named_substring() and
2213 pcre_get_named_substring() return the first substring corresponding to
2214 the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING
2215 (-7) is returned; no data is returned. The pcre_get_stringnumber()
2216 function returns one of the numbers that are associated with the name,
2217 but it is not defined which it is.
2218
2219 If you want to get full details of all captured substrings for a given
2220 name, you must use the pcre_get_stringtable_entries() function. The
2221 first argument is the compiled pattern, and the second is the name. The
2222 third and fourth are pointers to variables which are updated by the
2223 function. After it has run, they point to the first and last entries in
2224 the name-to-number table for the given name. The function itself
2225 returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
2226 there are none. The format of the table is described above in the sec-
2227 tion entitled Information about a pattern. Given all the relevant
2228 entries for the name, you can extract each of their numbers, and hence
2229 the captured data, if any.
2230
2231
2232 FINDING ALL POSSIBLE MATCHES
2233
2234 The traditional matching function uses a similar algorithm to Perl,
2235 which stops when it finds the first match, starting at a given point in
2236 the subject. If you want to find all possible matches, or the longest
2237 possible match, consider using the alternative matching function (see
2238 below) instead. If you cannot use the alternative function, but still
2239 need to find all possible matches, you can kludge it up by making use
2240 of the callout facility, which is described in the pcrecallout documen-
2241 tation.
2242
2243 What you have to do is to insert a callout right at the end of the pat-
2244 tern. When your callout function is called, extract and save the cur-
2245 rent matched substring. Then return 1, which forces pcre_exec() to
2246 backtrack and try other alternatives. Ultimately, when it runs out of
2247 matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
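
The body of such a callout function could be sketched as below. To keep
the sketch independent of pcre.h, it takes the three values it needs as
plain arguments; a real callout function would receive a pointer to a
pcre_callout block and pass on its subject, start_match, and
current_position fields. The names record_match, saved, and
saved_count, and the fixed limits, are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_SAVED 16   /* arbitrary limits for this sketch */
#define MAX_LEN   64

static char saved[MAX_SAVED][MAX_LEN];
static int saved_count = 0;

/* Save the text matched so far, subject[start_match] up to but not
   including subject[current_position], then return 1 so that the
   matcher treats this point as a failure and backtracks. Matches that
   do not fit the fixed limits are silently skipped. */
int record_match(const char *subject, int start_match,
                 int current_position)
{
  int len = current_position - start_match;
  assert(subject != NULL && start_match >= 0 && len >= 0);
  if (saved_count < MAX_SAVED && len < MAX_LEN) {
    memcpy(saved[saved_count], subject + start_match, (size_t)len);
    saved[saved_count][len] = '\0';
    saved_count++;
  }
  return 1;
}
```

Because every call returns 1, the overall result is always "no match";
the useful output is the list accumulated in saved.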
2248
2249
2250 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
2251
2252 int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
2253 const char *subject, int length, int startoffset,
2254 int options, int *ovector, int ovecsize,
2255 int *workspace, int wscount);
2256
2257 The function pcre_dfa_exec() is called to match a subject string
2258 against a compiled pattern, using a matching algorithm that scans the
2259 subject string just once, and does not backtrack. This has different
2260 characteristics to the normal algorithm, and is not compatible with
2261 Perl. Some of the features of PCRE patterns are not supported. Never-
2262 theless, there are times when this kind of matching can be useful. For
2263 a discussion of the two matching algorithms, see the pcrematching docu-
2264 mentation.
2265
2266 The arguments for the pcre_dfa_exec() function are the same as for
2267 pcre_exec(), plus two extras. The ovector argument is used in a differ-
2268 ent way, and this is described below. The other common arguments are
2269 used in the same way as for pcre_exec(), so their description is not
2270 repeated here.
2271
2272 The two additional arguments provide workspace for the function. The
2273 workspace vector should contain at least 20 elements. It is used for
2274 keeping track of multiple paths through the pattern tree. More
2275 workspace will be needed for patterns and subjects where there are a
2276 lot of potential matches.
2277
2278 Here is an example of a simple call to pcre_dfa_exec():
2279
2280 int rc;
2281 int ovector[10];
2282 int wspace[20];
2283 rc = pcre_dfa_exec(
2284 re, /* result of pcre_compile() */
2285 NULL, /* we didn't study the pattern */
2286 "some string", /* the subject string */
2287 11, /* the length of the subject string */
2288 0, /* start at offset 0 in the subject */
2289 0, /* default options */
2290 ovector, /* vector of integers for substring information */
2291 10, /* number of elements (NOT size in bytes) */
2292 wspace, /* working space vector */
2293 20); /* number of elements (NOT size in bytes) */
2294
2295 Option bits for pcre_dfa_exec()
2296
2297 The unused bits of the options argument for pcre_dfa_exec() must be
2298 zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW-
2299 LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2300 PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2301 three of these are the same as for pcre_exec(), so their description is
2302 not repeated here.
2303
2304 PCRE_PARTIAL
2305
2306 This has the same general effect as it does for pcre_exec(), but the
2307 details are slightly different. When PCRE_PARTIAL is set for
2308 pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into
2309 PCRE_ERROR_PARTIAL if the end of the subject is reached, there have
2310 been no complete matches, but there is still at least one matching pos-
2311 sibility. The portion of the string that provided the partial match is
2312 set as the first matching string.
2313
2314 PCRE_DFA_SHORTEST
2315
2316 Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
2317 stop as soon as it has found one match. Because of the way the alterna-
2318 tive algorithm works, this is necessarily the shortest possible match
2319 at the first possible matching point in the subject string.
2320
2321 PCRE_DFA_RESTART
2322
2323 When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and
2324 returns a partial match, it is possible to call it again, with addi-
2325 tional subject characters, and have it continue with the same match.
2326 The PCRE_DFA_RESTART option requests this action; when it is set, the
2327 workspace and wscount arguments must reference the same vector as before
2328 because data about the match so far is left in them after a partial
2329 match. There is more discussion of this facility in the pcrepartial
2330 documentation.
2331
2332 Successful returns from pcre_dfa_exec()
2333
2334 When pcre_dfa_exec() succeeds, it may have matched more than one sub-
2335 string in the subject. Note, however, that all the matches from one run
2336 of the function start at the same point in the subject. The shorter
2337 matches are all initial substrings of the longer matches. For example,
2338 if the pattern
2339
2340 <.*>
2341
2342 is matched against the string
2343
2344 This is <something> <something else> <something further> no more
2345
2346 the three matched strings are
2347
2348 <something>
2349 <something> <something else>
2350 <something> <something else> <something further>
2351
2352 On success, the yield of the function is a number greater than zero,
2353 which is the number of matched substrings. The substrings themselves
2354 are returned in ovector. Each string uses two elements; the first is
2355 the offset to the start, and the second is the offset to the end. In
2356 fact, all the strings have the same start offset. (Space could have
2357 been saved by giving this only once, but it was decided to retain some
2358 compatibility with the way pcre_exec() returns data, even though the
2359 meaning of the strings is different.)
2360
2361 The strings are returned in reverse order of length; that is, the long-
2362 est matching string is given first. If there were too many matches to
2363 fit into ovector, the yield of the function is zero, and the vector is
2364 filled with the longest matches.
2365
2366 Error returns from pcre_dfa_exec()
2367
2368 The pcre_dfa_exec() function returns a negative number when it fails.
2369 Many of the errors are the same as for pcre_exec(), and these are
2370 described above. There are in addition the following errors that are
2371 specific to pcre_dfa_exec():
2372
2373 PCRE_ERROR_DFA_UITEM (-16)
2374
2375 This return is given if pcre_dfa_exec() encounters an item in the pat-
2376 tern that it does not support, for instance, the use of \C or a back
2377 reference.
2378
2379 PCRE_ERROR_DFA_UCOND (-17)
2380
2381 This return is given if pcre_dfa_exec() encounters a condition item
2382 that uses a back reference for the condition, or a test for recursion
2383 in a specific group. These are not supported.
2384
2385 PCRE_ERROR_DFA_UMLIMIT (-18)
2386
2387 This return is given if pcre_dfa_exec() is called with an extra block
2388 that contains a setting of the match_limit field. This is not supported
2389 (it is meaningless).
2390
2391 PCRE_ERROR_DFA_WSSIZE (-19)
2392
2393 This return is given if pcre_dfa_exec() runs out of space in the
2394 workspace vector.
2395
2396 PCRE_ERROR_DFA_RECURSE (-20)
2397
2398 When a recursive subpattern is processed, the matching function calls
2399 itself recursively, using private vectors for ovector and workspace.
2400 This error is given if the output vector is not large enough. This
2401 should be extremely rare, as a vector of size 1000 is used.
2402
2403
2404 SEE ALSO
2405
2406 pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
2407 pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
2408
2409
2410 AUTHOR
2411
2412 Philip Hazel
2413 University Computing Service
2414 Cambridge CB2 3QH, England.
2415
2416
2417 REVISION
2418
2419 Last updated: 30 July 2007
2420 Copyright (c) 1997-2007 University of Cambridge.
2421 ------------------------------------------------------------------------------
2422
2423
2424 PCRECALLOUT(3) PCRECALLOUT(3)
2425
2426
2427 NAME
2428 PCRE - Perl-compatible regular expressions
2429
2430
2431 PCRE CALLOUTS
2432
2433 int (*pcre_callout)(pcre_callout_block *);
2434
2435 PCRE provides a feature called "callout", which is a means of temporar-
2436 ily passing control to the caller of PCRE in the middle of pattern
2437 matching. The caller of PCRE provides an external function by putting
2438 its entry point in the global variable pcre_callout. By default, this
2439 variable contains NULL, which disables all calling out.
2440
2441 Within a regular expression, (?C) indicates the points at which the
2442 external function is to be called. Different callout points can be
2443 identified by putting a number less than 256 after the letter C. The
2444 default value is zero. For example, this pattern has two callout
2445 points:
2446
2447 (?C1)abc(?C2)def
2448
2449 If the PCRE_AUTO_CALLOUT option bit is set when pcre_compile() is
2450 called, PCRE automatically inserts callouts, all with number 255,
2451 before each item in the pattern. For example, if PCRE_AUTO_CALLOUT is
2452 used with the pattern
2453
2454 A(\d{2}|--)
2455
2456 it is processed as if it were
2457
2458 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
2459
2460 Notice that there is a callout before and after each parenthesis and
2461 alternation bar. Automatic callouts can be used for tracking the
2462 progress of pattern matching. The pcretest command has an option that
2463 sets automatic callouts; when it is used, the output indicates how the
2464 pattern is matched. This is useful information when you are trying to
2465 optimize the performance of a particular pattern.
2466
2467
2468 MISSING CALLOUTS
2469
2470 You should be aware that, because of optimizations in the way PCRE
2471 matches patterns, callouts sometimes do not happen. For example, if the
2472 pattern is
2473
2474 ab(?C4)cd
2475
2476 PCRE knows that any matching string must contain the letter "d". If the
2477 subject string is "abyz", the lack of "d" means that matching doesn't
2478 ever start, and the callout is never reached. However, with "abyd",
2479 though the result is still no match, the callout is obeyed.
2480
2481
2482 THE CALLOUT INTERFACE
2483
2484 During matching, when PCRE reaches a callout point, the external func-
2485 tion defined by pcre_callout is called (if it is set). This applies to
2486 both the pcre_exec() and the pcre_dfa_exec() matching functions. The
2487 only argument to the callout function is a pointer to a pcre_callout
2488 block. This structure contains the following fields:
2489
2490 int version;
2491 int callout_number;
2492 int *offset_vector;
2493 const char *subject;
2494 int subject_length;
2495 int start_match;
2496 int current_position;
2497 int capture_top;
2498 int capture_last;
2499 void *callout_data;
2500 int pattern_position;
2501 int next_item_length;
2502
2503 The version field is an integer containing the version number of the
2504 block format. The initial version was 0; the current version is 1. The
2505 version number will change again in future if additional fields are
2506 added, but the intention is never to remove any of the existing fields.
2507
2508 The callout_number field contains the number of the callout, as com-
2509 piled into the pattern (that is, the number after ?C for manual call-
2510 outs, and 255 for automatically generated callouts).
2511
2512 The offset_vector field is a pointer to the vector of offsets that was
2513 passed by the caller to pcre_exec() or pcre_dfa_exec(). When
2514 pcre_exec() is used, the contents can be inspected in order to extract
2515 substrings that have been matched so far, in the same way as for
2516 extracting substrings after a match has completed. For pcre_dfa_exec()
2517 this field is not useful.
2518
2519 The subject and subject_length fields contain copies of the values that
2520 were passed to pcre_exec() or pcre_dfa_exec().
2521
2522 The start_match field normally contains the offset within the subject
2523 at which the current match attempt started. However, if the escape
2524 sequence \K has been encountered, this value is changed to reflect the
2525 modified starting point. If the pattern is not anchored, the callout
2526 function may be called several times from the same point in the pattern
2527 for different starting points in the subject.
2528
2529 The current_position field contains the offset within the subject of
2530 the current match pointer.
2531
2532 When the pcre_exec() function is used, the capture_top field contains
2533 one more than the number of the highest numbered captured substring so
2534 far. If no substrings have been captured, the value of capture_top is
2535 one. This is always the case when pcre_dfa_exec() is used, because it
2536 does not support captured substrings.
2537
2538 The capture_last field contains the number of the most recently cap-
2539 tured substring. If no substrings have been captured, its value is -1.
2540 This is always the case when pcre_dfa_exec() is used.
2541
2542 The callout_data field contains a value that is passed to pcre_exec()
2543 or pcre_dfa_exec() specifically so that it can be passed back in
2544 callouts. It is passed in the callout_data field of the pcre_extra data
2545 structure. If no such data was passed, the value of callout_data in a
2546 pcre_callout block is NULL. There is a description of the pcre_extra
2547 structure in the pcreapi documentation.
2548
2549 The pattern_position field is present from version 1 of the pcre_call-
2550 out structure. It contains the offset to the next item to be matched in
2551 the pattern string.
2552
2553 The next_item_length field is present from version 1 of the pcre_call-
2554 out structure. It contains the length of the next item to be matched in
2555 the pattern string. When the callout immediately precedes an alterna-
2556 tion bar, a closing parenthesis, or the end of the pattern, the length
2557 is zero. When the callout precedes an opening parenthesis, the length
2558 is that of the entire subpattern.
2559
2560 The pattern_position and next_item_length fields are intended to help
2561 in distinguishing between different automatic callouts, which all have
2562 the same callout number. However, they are set for all callouts.
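
One way to use these two fields is to recover the text of the pattern
item that follows the callout, as sketched below. The callout block
does not carry the pattern string itself, so the sketch assumes the
caller has kept a copy (for instance, reachable through callout_data).
The helper name copy_next_item is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Copy the next_item_length characters of pattern that start at
   pattern_position into buf. Returns the length copied, or -1 if the
   range lies outside the pattern or does not fit in buf together with
   its terminating zero. */
int copy_next_item(const char *pattern, int pattern_position,
                   int next_item_length, char *buf, int bufsize)
{
  assert(pattern != NULL && buf != NULL);
  assert(pattern_position >= 0 && next_item_length >= 0);
  if ((size_t)pattern_position + (size_t)next_item_length
      > strlen(pattern))
    return -1;
  if (next_item_length + 1 > bufsize) return -1;
  memcpy(buf, pattern + pattern_position, (size_t)next_item_length);
  buf[next_item_length] = '\0';
  return next_item_length;
}
```

With the pattern A(\d{2}|--) used earlier, a callout at position 1
with length 10 would yield the whole parenthesized subpattern, and a
callout at the end of the pattern (position 11, length 0) would yield
an empty string.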
2563
2564
2565 RETURN VALUES
2566
2567 The external callout function returns an integer to PCRE. If the value
2568 is zero, matching proceeds as normal. If the value is greater than
2569 zero, matching fails at the current point, but the testing of other
2570 matching possibilities goes ahead, just as if a lookahead assertion had
2571 failed. If the value is less than zero, the match is abandoned, and
2572 pcre_exec() (or pcre_dfa_exec()) returns the negative value.
2573
2574 Negative values should normally be chosen from the set of
2575 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
2576 dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
2577 reserved for use by callout functions; it will never be used by PCRE
2578 itself.
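
The three kinds of return value can be illustrated with a small
decision helper that a callout function might use. Everything here is a
sketch: the name choose_callout_return and its parameters are
hypothetical, and DEMO_ERROR_CALLOUT is a local stand-in whose value
(-9) is an assumption about pcre.h; real code should use the library's
own PCRE_ERROR_CALLOUT.

```c
#include <assert.h>

#define DEMO_ERROR_CALLOUT (-9)  /* assumed value; use pcre.h's name */

/* Decide what a callout function should return:
     negative  abandon the whole match once call_limit is reached;
     1         fail at this point but let other possibilities be tried;
     0         carry on matching as normal. */
int choose_callout_return(int calls_so_far, int call_limit,
                          int reject_here)
{
  assert(calls_so_far >= 0 && call_limit >= 0);
  if (calls_so_far >= call_limit) return DEMO_ERROR_CALLOUT;
  if (reject_here) return 1;
  return 0;
}
```

Capping the number of callouts in this way is one possible use of the
reserved error number; the caller of pcre_exec() then sees the same
negative value as the function's return code.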
2579
2580
2581 AUTHOR
2582
2583 Philip Hazel
2584 University Computing Service
2585 Cambridge CB2 3QH, England.
2586
2587
2588 REVISION
2589
2590 Last updated: 29 May 2007
2591 Copyright (c) 1997-2007 University of Cambridge.
2592 ------------------------------------------------------------------------------
2593
2594
2595 PCRECOMPAT(3) PCRECOMPAT(3)
2596
2597
2598 NAME
2599 PCRE - Perl-compatible regular expressions
2600
2601
2602 DIFFERENCES BETWEEN PCRE AND PERL
2603
2604 This document describes the differences in the ways that PCRE and Perl
2605 handle regular expressions. The differences described here are mainly
2606 with respect to Perl 5.8, though PCRE versions 7.0 and later contain
2607 some features that are expected to be in the forthcoming Perl 5.10.
2608
2609 1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
2610 of what it does have are given in the section on UTF-8 support in the
2611 main pcre page.
2612
2613 2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2614 permits them, but they do not mean what you might think. For example,
2615 (?!a){3} does not assert that the next three characters are not "a". It
2616 just asserts that the next character is not "a" three times.
2617
2618 3. Capturing subpatterns that occur inside negative lookahead asser-
2619 tions are counted, but their entries in the offsets vector are never
2620 set. Perl sets its numerical variables from any such patterns that are
2621 matched before the assertion fails to match something (thereby succeed-
2622 ing), but only if the negative lookahead assertion contains just one
2623 branch.
2624
2625 4. Though binary zero characters are supported in the subject string,
2626 they are not allowed in a pattern string because it is passed as a nor-
2627 mal C string, terminated by zero. The escape sequence \0 can be used in
2628 the pattern to represent a binary zero.
2629
2630 5. The following Perl escape sequences are not supported: \l, \u, \L,
2631 \U, and \N. In fact these are implemented by Perl's general string-han-
2632 dling and are not part of its pattern matching engine. If any of these
2633 are encountered by PCRE, an error is generated.
2634
2635 6. The Perl escape sequences \p, \P, and \X are supported only if PCRE
2636 is built with Unicode character property support. The properties that
2637 can be tested with \p and \P are limited to the general category prop-
2638 erties such as Lu and Nd, script names such as Greek or Han, and the
2639 derived properties Any and L&.
2640
2641 7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2642 ters in between are treated as literals. This is slightly different
2643 from Perl in that $ and @ are also handled as literals inside the
2644 quotes. In Perl, they cause variable interpolation (but of course PCRE
2645 does not have variables). Note the following examples:
2646
2647 Pattern PCRE matches Perl matches
2648
2649 \Qabc$xyz\E abc$xyz abc followed by the
2650 contents of $xyz
2651 \Qabc\$xyz\E abc\$xyz abc\$xyz
2652 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
2653
2654 The \Q...\E sequence is recognized both inside and outside character
2655 classes.
2656
2657 8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2658 constructions. However, there is support for recursive patterns. This
2659 is not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
2660 "callout" feature allows an external function to be called during pat-
2661 tern matching. See the pcrecallout documentation for details.
2662
2663 9. Subpatterns that are called recursively or as "subroutines" are
2664 always treated as atomic groups in PCRE. This is like Python, but
2665 unlike Perl.
2666
2667 10. There are some differences that are concerned with the settings of
2668 captured strings when part of a pattern is repeated. For example,
2669 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
2670 unset, but in PCRE it is set to "b".
2671
2672 11. PCRE provides some extensions to the Perl regular expression facil-
2673 ities. Perl 5.10 will include new features that are not in earlier
2674 versions, some of which (such as named parentheses) have been in PCRE
2675 for some time. This list is with respect to Perl 5.10:
2676
2677 (a) Although lookbehind assertions must match fixed length strings,
2678 each alternative branch of a lookbehind assertion can match a different
2679 length of string. Perl requires them all to have the same length.
2680
2681 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
2682 meta-character matches only at the very end of the string.
2683
2684 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2685 cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
2686 ignored. (Perl can be made to issue a warning.)
2687
2688 (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
2689 fiers is inverted, that is, by default they are not greedy, but if fol-
2690 lowed by a question mark they are.
2691
2692 (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2693 tried only at the first matching position in the subject string.
2694
2695 (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2696 TURE options for pcre_exec() have no Perl equivalents.
2697
2698 (g) The callout facility is PCRE-specific.
2699
2700 (h) The partial matching facility is PCRE-specific.
2701
2702 (i) Patterns compiled by PCRE can be saved and re-used at a later time,
2703 even on different hosts that have the other endianness.
2704
2705 (j) The alternative matching function (pcre_dfa_exec()) matches in a
2706 different way and is not Perl-compatible.
2707
2708
2709 AUTHOR
2710
2711 Philip Hazel
2712 University Computing Service
2713 Cambridge CB2 3QH, England.
2714
2715
2716 REVISION
2717
2718 Last updated: 13 June 2007
2719 Copyright (c) 1997-2007 University of Cambridge.
2720 ------------------------------------------------------------------------------
2721
2722
2723 PCREPATTERN(3) PCREPATTERN(3)
2724
2725
2726 NAME
2727 PCRE - Perl-compatible regular expressions
2728
2729
2730 PCRE REGULAR EXPRESSION DETAILS
2731
2732 The syntax and semantics of the regular expressions that are supported
2733 by PCRE are described in detail below. There is a quick-reference syn-
2734 tax summary in the pcresyntax page. Perl's regular expressions are
2735 described in its own documentation, and regular expressions in general
2736 are covered in a number of books, some of which have copious examples.
2737 Jeffrey Friedl's "Mastering Regular Expressions", published by
2738 O'Reilly, covers regular expressions in great detail. This description
2739 of PCRE's regular expressions is intended as reference material.
2740
2741 The original operation of PCRE was on strings of one-byte characters.
2742 However, there is now also support for UTF-8 character strings. To use
2743 this, you must build PCRE to include UTF-8 support, and then call
2744 pcre_compile() with the PCRE_UTF8 option. How this affects pattern
2745 matching is mentioned in several places below. There is also a summary
2746 of UTF-8 features in the section on UTF-8 support in the main pcre
2747 page.
2748
2749 The remainder of this document discusses the patterns that are sup-
2750 ported by PCRE when its main matching function, pcre_exec(), is used.
2751 From release 6.0, PCRE offers a second matching function,
2752 pcre_dfa_exec(), which matches using a different algorithm that is not
2753 Perl-compatible. Some of the features discussed below are not available
2754 when pcre_dfa_exec() is used. The advantages and disadvantages of the
2755 alternative function, and how it differs from the normal function, are
2756 discussed in the pcrematching page.
2757
2758
2759 CHARACTERS AND METACHARACTERS
2760
2761 A regular expression is a pattern that is matched against a subject
2762 string from left to right. Most characters stand for themselves in a
2763 pattern, and match the corresponding characters in the subject. As a
2764 trivial example, the pattern
2765
2766 The quick brown fox
2767
2768 matches a portion of a subject string that is identical to itself. When
2769 caseless matching is specified (the PCRE_CASELESS option), letters are
2770 matched independently of case. In UTF-8 mode, PCRE always understands
2771 the concept of case for characters whose values are less than 128, so
2772 caseless matching is always possible. For characters with higher val-
2773 ues, the concept of case is supported if PCRE is compiled with Unicode
2774 property support, but not otherwise. If you want to use caseless
2775 matching for characters 128 and above, you must ensure that PCRE is
2776 compiled with Unicode property support as well as with UTF-8 support.
2777
2778 The power of regular expressions comes from the ability to include
2779 alternatives and repetitions in the pattern. These are encoded in the
2780 pattern by the use of metacharacters, which do not stand for themselves
2781 but instead are interpreted in some special way.
2782
2783 There are two different sets of metacharacters: those that are recog-
2784 nized anywhere in the pattern except within square brackets, and those
2785 that are recognized within square brackets. Outside square brackets,
2786 the metacharacters are as follows:
2787
2788 \      general escape character with several uses
2789 ^      assert start of string (or line, in multiline mode)
2790 $      assert end of string (or line, in multiline mode)
2791 .      match any character except newline (by default)
2792 [      start character class definition
2793 |      start of alternative branch
2794 (      start subpattern
2795 )      end subpattern
2796 ?      extends the meaning of (
2797        also 0 or 1 quantifier
2798        also quantifier minimizer
2799 *      0 or more quantifier
2800 +      1 or more quantifier
2801        also "possessive quantifier"
2802 {      start min/max quantifier
2803
2804 Part of a pattern that is in square brackets is called a "character
2805 class". In a character class the only metacharacters are:
2806
2807 \      general escape character
2808 ^      negate the class, but only if the first character
2809 -      indicates character range
2810 [      POSIX character class (only if followed by POSIX
2811        syntax)
2812 ]      terminates the character class
2813
2814 The following sections describe the use of each of the metacharacters.
2815
2816
2817 BACKSLASH
2818
2819 The backslash character has several uses. Firstly, if it is followed by
2820 a non-alphanumeric character, it takes away any special meaning that
2821 character may have. This use of backslash as an escape character
2822 applies both inside and outside character classes.
2823
2824 For example, if you want to match a * character, you write \* in the
2825 pattern. This escaping action applies whether or not the following
2826 character would otherwise be interpreted as a metacharacter, so it is
2827 always safe to precede a non-alphanumeric with backslash to specify
2828 that it stands for itself. In particular, if you want to match a back-
2829 slash, you write \\.
2830
2831 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
2832 the pattern (other than in a character class) and characters between a
2833 # outside a character class and the next newline are ignored. An escap-
2834 ing backslash can be used to include a whitespace or # character as
2835 part of the pattern.
2836
2837 If you want to remove the special meaning from a sequence of charac-
2838 ters, you can do so by putting them between \Q and \E. This is differ-
2839 ent from Perl in that $ and @ are handled as literals in \Q...\E
2840 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
2841 tion. Note the following examples:
2842
2843 Pattern            PCRE matches   Perl matches
2844
2845 \Qabc$xyz\E        abc$xyz        abc followed by the
2846                                   contents of $xyz
2847 \Qabc\$xyz\E       abc\$xyz       abc\$xyz
2848 \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
2849
2850 The \Q...\E sequence is recognized both inside and outside character
2851 classes.
2852
2853 Non-printing characters
2854
2855 A second use of backslash provides a way of encoding non-printing char-
2856 acters in patterns in a visible manner. There is no restriction on the
2857 appearance of non-printing characters, apart from the binary zero that
2858 terminates a pattern, but when a pattern is being prepared by text
2859 editing, it is usually easier to use one of the following escape
2860 sequences than the binary character it represents:
2861
2862 \a alarm, that is, the BEL character (hex 07)
2863 \cx "control-x", where x is any character
2864 \e escape (hex 1B)
2865 \f formfeed (hex 0C)
2866 \n newline (hex 0A)
2867 \r carriage return (hex 0D)
2868 \t tab (hex 09)
2869 \ddd character with octal code ddd, or backreference
2870 \xhh character with hex code hh
2871 \x{hhh..} character with hex code hhh..
2872
2873 The precise effect of \cx is as follows: if x is a lower case letter,
2874 it is converted to upper case. Then bit 6 of the character (hex 40) is
2875 inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
2876 becomes hex 7B.
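
The \cx rule above can be modelled in a few lines of C. This is a minimal sketch, not code from the PCRE source; the function name control_escape_value is hypothetical.

```c
#include <ctype.h>

/* Model of the \cx rule: fold a lower case letter to upper case,
   then invert bit 6 (hex 40) of the character.
   The name is hypothetical; it is not part of the PCRE API. */
int control_escape_value(int x)
{
    if (islower(x))
        x = toupper(x);
    return x ^ 0x40;
}
```

Calling control_escape_value('z') gives hex 1A, while '{' gives hex 3B and ';' gives hex 7B, in line with the examples in the text.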
2877
2878 After \x, from zero to two hexadecimal digits are read (letters can be
2879 in upper or lower case). Any number of hexadecimal digits may appear
2880 between \x{ and }, but the value of the character code must be less
2881 than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is,
2882 the maximum hexadecimal value is 7FFFFFFF). If characters other than
2883 hexadecimal digits appear between \x{ and }, or if there is no termi-
2884 nating }, this form of escape is not recognized. Instead, the initial
2885 \x will be interpreted as a basic hexadecimal escape, with no following
2886 digits, giving a character whose value is zero.
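
The basic \x form described above (zero to two hexadecimal digits) can be sketched as follows. The helper name basic_hex_escape and its interface are assumptions made for illustration; PCRE's own parser is organized differently.

```c
#include <ctype.h>

/* Reads zero to two hexadecimal digits from the text that follows "\x".
   Returns the character value and stores the number of digits used.
   Hypothetical helper, not part of the PCRE API. */
int basic_hex_escape(const char *after_x, int *used)
{
    int value = 0;
    *used = 0;
    while (*used < 2 && isxdigit((unsigned char)after_x[*used])) {
        unsigned char ch = (unsigned char)after_x[*used];
        int digit = isdigit(ch) ? ch - '0' : tolower(ch) - 'a' + 10;
        value = value * 16 + digit;
        (*used)++;
    }
    return value;
}
```

When the text after \x is a malformed brace form such as "{zz}", no digits are read and the result is zero, which is the fallback behaviour the paragraph above describes.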
2887
2888 Characters whose value is less than 256 can be defined by either of the
2889 two syntaxes for \x. There is no difference in the way they are han-
2890 dled. For example, \xdc is exactly the same as \x{dc}.
2891
2892 After \0 up to two further octal digits are read. If there are fewer
2893 than two digits, just those that are present are used. Thus the
2894 sequence \0\x\07 specifies two binary zeros followed by a BEL character
2895 (code value 7). Make sure you supply two digits after the initial zero
2896 if the pattern character that follows is itself an octal digit.
2897
2898 The handling of a backslash followed by a digit other than 0 is compli-
2899 cated. Outside a character class, PCRE reads it and any following dig-
2900 its as a decimal number. If the number is less than 10, or if there
2901 have been at least that many previous capturing left parentheses in the
2902 expression, the entire sequence is taken as a back reference. A
2903 description of how this works is given later, following the discussion
2904 of parenthesized subpatterns.
2905
2906 Inside a character class, or if the decimal number is greater than 9
2907 and there have not been that many capturing subpatterns, PCRE re-reads
2908 up to three octal digits following the backslash, and uses them to gen-
2909 erate a data character. Any subsequent digits stand for themselves. In
2910 non-UTF-8 mode, the value of a character specified in octal must be
2911 less than \400. In UTF-8 mode, values up to \777 are permitted. For
2912 example:
2913
2914 \040   is another way of writing a space
2915 \40    is the same, provided there are fewer than 40
2916        previous capturing subpatterns
2917 \7     is always a back reference
2918 \11    might be a back reference, or another way of
2919        writing a tab
2920 \011   is always a tab
2921 \0113  is a tab followed by the character "3"
2922 \113   might be a back reference, otherwise the
2923        character with octal code 113
2924 \377   might be a back reference, otherwise
2925        the byte consisting entirely of 1 bits
2926 \81    is either a back reference, or a binary zero
2927        followed by the two characters "8" and "1"
2928
2929 Note that octal values of 100 or greater must not be introduced by a
2930 leading zero, because no more than three octal digits are ever read.
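
The decision between a back reference and an octal character can be modelled as below. This is a sketch of the rules as stated in this section, assuming the first digit is 1 to 9 and the run of digits is short; the type digit_escape and the function interpret_digit_escape are hypothetical names, not PCRE internals.

```c
#include <ctype.h>

/* Result of interpreting the digits that follow a backslash. */
typedef struct {
    int is_backref;  /* 1 if the sequence is a back reference */
    int value;       /* group number, or character code */
    int consumed;    /* number of digits used */
} digit_escape;

/* digits:   text after the backslash, starting with a digit 1-9
   groups:   capturing left parentheses seen so far
   in_class: non-zero inside a character class */
digit_escape interpret_digit_escape(const char *digits, int groups,
                                    int in_class)
{
    digit_escape r = {0, 0, 0};

    if (!in_class) {
        /* Read the whole run of digits as a decimal number. */
        int n = 0, len = 0;
        while (isdigit((unsigned char)digits[len])) {
            n = n * 10 + (digits[len] - '0');
            len++;
        }
        if (n < 10 || n <= groups) {
            r.is_backref = 1;
            r.value = n;
            r.consumed = len;
            return r;
        }
    }

    /* Otherwise re-read up to three octal digits as a character code.
       If there are none (as in \81), the value stays zero. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

With no capturing groups, "40" yields character 32 (a space) using two digits, and "81" yields a binary zero using no digits, so that "8" and "1" remain as literal characters. The limit of \400 in non-UTF-8 mode is not checked in this sketch.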
2931
2932 All the sequences that define a single character value can be used both
2933 inside and outside character classes. In addition, inside a character
2934 class, the sequence \b is interpreted as the backspace character (hex
2935 08), and the sequences \R and \X are interpreted as the characters "R"
2936 and "X", respectively. Outside a character class, these sequences have
2937 different meanings (see below).
2938
2939 Absolute and relative back references
2940
2941 The sequence \g followed by an unsigned or a negative number, option-
2942 ally enclosed in braces, is an absolute or relative back reference. A
2943 named back reference can be coded as \g{name}. Back references are dis-
2944 cussed later, following the discussion of parenthesized subpatterns.
2945
2946 Generic character types
2947
2948 Another use of backslash is for specifying generic character types. The
2949 following are always recognized:
2950
2951 \d any decimal digit
2952 \D any character that is not a decimal digit
2953 \h any horizontal whitespace character
2954 \H any character that is not a horizontal whitespace character
2955 \s any whitespace character
2956 \S any character that is not a whitespace character
2957 \v any vertical whitespace character
2958 \V any character that is not a vertical whitespace character
2959 \w any "word" character
2960 \W any "non-word" character
2961
2962 Each pair of escape sequences partitions the complete set of characters
2963 into two disjoint sets. Any given character matches one, and only one,
2964 of each pair.
2965
2966 These character type sequences can appear both inside and outside char-
2967 acter classes. They each match one character of the appropriate type.
2968 If the current matching point is at the end of the subject string, all
2969 of them fail, since there is no character to match.
2970
2971 For compatibility with Perl, \s does not match the VT character (code
2972 11). This makes it different from the POSIX "space" class. The \s
2973 characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
2974 "use locale;" is included in a Perl script, \s may match the VT charac-
2975 ter. In PCRE, it never does.
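
The difference between \s and the POSIX "space" class comes down to one character. The two predicates below are an illustrative sketch with hypothetical names; they assume the default character tables rather than a locale-specific set.

```c
/* \s as documented here: HT, LF, FF, CR and space, but not VT. */
int matches_backslash_s(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* The POSIX "space" class additionally includes VT (code 11). */
int matches_posix_space(int c)
{
    return matches_backslash_s(c) || c == 11;
}
```

Only code 11 separates the two: matches_posix_space(11) is true and matches_backslash_s(11) is false.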
2976
2977 In UTF-8 mode, characters with values greater than 128 never match \d,
2978 \s, or \w, and always match \D, \S, and \W. This is true even when Uni-
2979 code character property support is available. These sequences retain
2980 their original meanings from before UTF-8 support was available, mainly
2981 for efficiency reasons.
2982
2983 The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
2984 the other sequences, these do match certain high-valued codepoints in
2985 UTF-8 mode. The horizontal space characters are:
2986
2987 U+0009 Horizontal tab
2988 U+0020 Space
2989 U+00A0 Non-break space
2990 U+1680 Ogham space mark
2991 U+180E Mongolian vowel separator
2992 U+2000 En quad
2993 U+2001 Em quad
2994 U+2002 En space
2995 U+2003 Em space
2996 U+2004 Three-per-em space
2997 U+2005 Four-per-em space
2998 U+2006 Six-per-em space
2999 U+2007 Figure space
3000 U+2008 Punctuation space
3001 U+2009 Thin space
3002 U+200A Hair space
3003 U+202F Narrow no-break space
3004 U+205F Medium mathematical space
3005 U+3000 Ideographic space
3006
3007 The vertical space characters are:
3008
3009 U+000A Linefeed
3010 U+000B Vertical tab
3011 U+000C Formfeed
3012 U+000D Carriage return
3013 U+0085 Next line
3014 U+2028 Line separator
3015 U+2029 Paragraph separator
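
The two lists above translate directly into code point tests. The following sketch uses hypothetical function names and takes an already decoded code point; decoding UTF-8 is outside its scope.

```c
/* Code points matched by \h, taken from the list above.
   U+2000 to U+200A form a contiguous run and are tested as a range. */
int is_horizontal_space(unsigned long cp)
{
    switch (cp) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680:
    case 0x180E: case 0x202F: case 0x205F: case 0x3000:
        return 1;
    default:
        return cp >= 0x2000 && cp <= 0x200A;
    }
}

/* Code points matched by \v, taken from the list above. */
int is_vertical_space(unsigned long cp)
{
    switch (cp) {
    case 0x000A: case 0x000B: case 0x000C: case 0x000D:
    case 0x0085: case 0x2028: case 0x2029:
        return 1;
    default:
        return 0;
    }
}
```

No code point appears in both lists, so the two predicates are never both true for the same character.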
3016
3017 A "word" character is an underscore or any character less than 256 that
3018 is a letter or digit. The definition of letters and digits is con-
3019 trolled by PCRE's low-valued character tables, and may vary if locale-
3020 specific matching is taking place (see "Locale support" in the pcreapi
3021 page). For example, in a French locale such as "fr_FR" in Unix-like
3022 systems, or "french" in Windows, some character codes greater than 128
3023 are used for accented letters, and these are matched by \w. The use of
3024 locales with Unicode is discouraged.
3025
3026 Newline sequences
3027
3028 Outside a character class, the escape sequence \R matches any Unicode
3029 newline sequence. This is a Perl 5.10 feature. In non-UTF-8 mode \R is
3030 equivalent to the following:
3031
3032 (?>\r\n|\n|\x0b|\f|\r|\x85)
3033
3034 This is an example of an "atomic group", details of which are given
3035 below. This particular group matches either the two-character sequence
3036 CR followed by LF, or one of the single characters LF (linefeed,
3037 U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
3038 return, U+000D), or NEL (next line, U+0085). The two-character sequence
3039 is treated as a single unit that cannot be split.
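
The behaviour of \R in non-UTF-8 mode can be sketched as a function that reports how many bytes a newline sequence occupies at a given position. The name newline_sequence_length is hypothetical, and the sketch covers only the six alternatives of the atomic group shown above.

```c
#include <stddef.h>

/* Returns the number of bytes \R would consume at s[pos] in
   non-UTF-8 mode, or 0 if no newline sequence starts there. */
size_t newline_sequence_length(const unsigned char *s, size_t len,
                               size_t pos)
{
    if (pos >= len)
        return 0;
    /* CRLF is tried first and taken as one unit. */
    if (s[pos] == '\r' && pos + 1 < len && s[pos + 1] == '\n')
        return 2;
    switch (s[pos]) {
    case '\n': case 0x0b: case '\f': case '\r': case 0x85:
        return 1;
    default:
        return 0;
    }
}
```

Because the group is atomic, a CR that is followed by LF always counts as the two-byte unit; the function never reports a length of 1 for the CR of a CRLF pair.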
3040
3041 In UTF-8 mode, two additional characters whose codepoints are greater
3042 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
3043 rator, U+2029). Unicode character property support is not needed for
3044 these characters to be recognized.
3045
3046 Inside a character class, \R matches the letter "R".
3047
3048 Unicode character properties
3049
3050 When PCRE is built with Unicode character property support, three addi-
3051 tional escape sequences that match characters with specific properties
3052 are available. When not in UTF-8 mode, these sequences are of course
3053 limited to testing characters whose codepoints are less than 256, but
3054 they do work in this mode. The extra escape sequences are:
3055
3056 \p{xx} a character with the xx property
3057 \P{xx} a character without the xx property
3058 \X an extended Unicode sequence
3059
3060 The property names represented by xx above are limited to the Unicode
3061 script names, the general category properties, and "Any", which matches
3062 any character (including newline). Other properties such as "InMusical-
3063 Symbols" are not currently supported by PCRE. Note that \P{Any} does
3064 not match any characters, so always causes a match failure.
3065
3066 Sets of Unicode characters are defined as belonging to certain scripts.
3067 A character from one of these sets can be matched using a script name.
3068 For example:
3069
3070 \p{Greek}
3071 \P{Han}
3072
3073 Those that are not part of an identified script are lumped together as
3074 "Common". The current list of scripts is:
3075
3076 Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
3077 Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
3078 Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
3079 Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
3080 gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
3081 Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
3082 Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
3083 Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
3084 Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
3085
3086 Each character has exactly one general category property, specified by
3087 a two-letter abbreviation. For compatibility with Perl, negation can be
3088 specified by including a circumflex between the opening brace and the
3089 property name. For example, \p{^Lu} is the same as \P{Lu}.
3090
3091 If only one letter is specified with \p or \P, it includes all the gen-
3092 eral category properties that start with that letter. In this case, in
3093 the absence of negation, the curly brackets in the escape sequence are
3094 optional; these two examples have the same effect:
3095
3096 \p{L}
3097 \pL
3098
3099 The following general category property codes are supported:
3100
3101 C Other
3102 Cc Control
3103 Cf Format
3104 Cn Unassigned
3105 Co Private use
3106 Cs Surrogate
3107
3108 L Letter
3109 Ll Lower case letter
3110 Lm Modifier letter
3111 Lo Other letter
3112 Lt Title case letter
3113 Lu Upper case letter
3114
3115 M Mark
3116 Mc Spacing mark
3117 Me Enclosing mark
3118 Mn Non-spacing mark
3119
3120 N Number
3121 Nd Decimal number
3122 Nl Letter number
3123 No Other number
3124
3125 P Punctuation
3126 Pc Connector punctuation
3127 Pd Dash punctuation
3128 Pe Close punctuation
3129 Pf Final punctuation
3130 Pi Initial punctuation
3131 Po Other punctuation
3132 Ps Open punctuation
3133
3134 S Symbol
3135 Sc Currency symbol
3136 Sk Modifier symbol
3137 Sm Mathematical symbol
3138 So Other symbol
3139
3140 Z Separator
3141 Zl Line separator
3142 Zp Paragraph separator
3143 Zs Space separator
3144
3145 The special property L& is also supported: it matches a character that
3146 has the Lu, Ll, or Lt property, in other words, a letter that is not
3147 classified as a modifier or "other".
3148
3149 The long synonyms for these properties that Perl supports (such as
3150 \p{Letter}) are not supported by PCRE, nor is it permitted to prefix
3151 any of these properties with "Is".
3152
3153 No character that is in the Unicode table has the Cn (unassigned) prop-
3154 erty. Instead, this property is assumed for any code point that is not
3155 in the Unicode table.
3156
3157 Specifying caseless matching does not affect these escape sequences.
3158 For example, \p{Lu} always matches only upper case letters.
3159
3160 The \X escape matches any number of Unicode characters that form an
3161 extended Unicode sequence. \X is equivalent to
3162
3163 (?>\PM\pM*)
3164
3165 That is, it matches a character without the "mark" property, followed
3166 by zero or more characters with the "mark" property, and treats the
3167 sequence as an atomic group (see below). Characters with the "mark"
3168 property are typically accents that affect the preceding character.
3169 None of them have codepoints less than 256, so in non-UTF-8 mode \X
3170 matches any one character.
3171
3172 Matching characters by Unicode property is not fast, because PCRE has
3173 to search a structure that contains data for over fifteen thousand
3174 characters. That is why the traditional escape sequences such as \d and
3175 \w do not use Unicode properties in PCRE.
3176
3177 Resetting the match start
3178
3179 The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
3180 ously matched characters not to be included in the final matched
3181 sequence. For example, the pattern:
3182
3183 foo\Kbar
3184
3185 matches "foobar", but reports that it has matched "bar". This feature
3186 is similar to a lookbehind assertion (described below). However, in
3187 this case, the part of the subject before the real match does not have
3188 to be of fixed length, as lookbehind assertions do. The use of \K does
3189 not interfere with the setting of captured substrings. For example,
3190 when the pattern
3191
3192 (foo)\Kbar
3193
3194 matches "foobar", the first substring is still set to "foo".
3195
3196 Simple assertions
3197
3198 The final use of backslash is for certain simple assertions. An asser-
3199 tion specifies a condition that has to be met at a particular point in
3200 a match, without consuming any characters from the subject string. The
3201 use of subpatterns for more complicated assertions is described below.
3202 The backslashed assertions are:
3203
3204 \b matches at a word boundary
3205 \B matches when not at a word boundary
3206 \A matches at the start of the subject
3207 \Z matches at the end of the subject
3208 also matches before a newline at the end of the subject
3209 \z matches only at the end of the subject
3210 \G matches at the first matching position in the subject
3211
3212 These assertions may not appear in character classes (but note that \b
3213 has a different meaning, namely the backspace character, inside a char-
3214 acter class).
3215
3216 A word boundary is a position in the subject string where one of the
3217 current character and the previous character matches \w and the other
3218 matches \W. The start or end of the string is also a word boundary if
3219 the first or last character, respectively, matches \w.
3220
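The word boundary rule can be expressed as a comparison of the characters on either side of a position. This is a minimal sketch that assumes the default "C" locale tables, where a "word" character is an ASCII letter, digit or underscore; the function names are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* \w with the default tables: a letter, a digit or an underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* True if position pos (0..len) in s is a word boundary. A missing
   neighbour at the start or end of the string counts as a non-word
   character, so the string edges are boundaries only when the first
   or last character is a word character. */
int at_word_boundary(const char *s, size_t len, size_t pos)
{
    int prev = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
    int next = pos < len && is_word_char((unsigned char)s[pos]);
    return prev != next;
}
```

For the subject "foo bar", positions 0, 3, 4 and 7 are word boundaries and all the others are not.
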
3221 The \A, \Z, and \z assertions differ from the traditional circumflex
3222 and dollar (described in the next section) in that they only ever match
3223 at the very start and end of the subject string, whatever options are
3224 set. Thus, they are independent of multiline mode. These three asser-
3225 tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
3226 affect only the behaviour of the circumflex and dollar metacharacters.
3227 However, if the startoffset argument of pcre_exec() is non-zero, indi-
3228 cating that matching is to start at a point other than the beginning of
3229 the subject, \A can never match. The difference between \Z and \z is
3230 that \Z matches before a newline at the end of the string as well as at
3231 the very end, whereas \z matches only at the end.
3232
3233 The \G assertion is true only when the current matching position is at
3234 the start point of the match, as specified by the startoffset argument
3235 of pcre_exec(). It differs from \A when the value of startoffset is
3236 non-zero. By calling pcre_exec() multiple times with appropriate argu-
3237 ments, you can mimic Perl's /g option, and it is in this kind of imple-
3238 mentation where \G can be useful.
3239
3240 Note, however, that PCRE's interpretation of \G, as the start of the
3241 current match, is subtly different from Perl's, which defines it as the
3242 end of the previous match. In Perl, these can be different when the
3243 previously matched string was empty. Because PCRE does just one match
3244 at a time, it cannot reproduce this behaviour.
3245
3246 If all the alternatives of a pattern begin with \G, the expression is
3247 anchored to the starting match position, and the "anchored" flag is set
3248 in the compiled regular expression.
3249
3250
3251 CIRCUMFLEX AND DOLLAR
3252
3253 Outside a character class, in the default matching mode, the circumflex
3254 character is an assertion that is true only if the current matching
3255 point is at the start of the subject string. If the startoffset argu-
3256 ment of pcre_exec() is non-zero, circumflex can never match if the
3257 PCRE_MULTILINE option is unset. Inside a character class, circumflex
3258 has an entirely different meaning (see below).
3259
3260 Circumflex need not be the first character of the pattern if a number
3261 of alternatives are involved, but it should be the first thing in each
3262 alternative in which it appears if the pattern is ever to match that
3263 branch. If all possible alternatives start with a circumflex, that is,
3264 if the pattern is constrained to match only at the start of the sub-
3265 ject, it is said to be an "anchored" pattern. (There are also other
3266 constructs that can cause a pattern to be anchored.)
3267
3268 A dollar character is an assertion that is true only if the current
3269 matching point is at the end of the subject string, or immediately
3270 before a newline at the end of the string (by default). Dollar need not
3271 be the last character of the pattern if a number of alternatives are
3272 involved, but it should be the last item in any branch in which it
3273 appears. Dollar has no special meaning in a character class.
3274
3275 The meaning of dollar can be changed so that it matches only at the
3276 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
3277 compile time. This does not affect the \Z assertion.
3278
3279 The meanings of the circumflex and dollar characters are changed if the
3280 PCRE_MULTILINE option is set. When this is the case, a circumflex
3281 matches immediately after internal newlines as well as at the start of
3282 the subject string. It does not match after a newline that ends the
3283 string. A dollar matches before any newlines in the string, as well as
3284 at the very end, when PCRE_MULTILINE is set. When newline is specified
3285 as the two-character sequence CRLF, isolated CR and LF characters do
3286 not indicate newlines.
3287
3288 For example, the pattern /^abc$/ matches the subject string "def\nabc"
3289 (where \n represents a newline) in multiline mode, but not otherwise.
3290 Consequently, patterns that are anchored in single line mode because
3291 all branches start with ^ are not anchored in multiline mode, and a
3292 match for circumflex is possible when the startoffset argument of
3293 pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
3294 PCRE_MULTILINE is set.
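
The rules for circumflex and dollar can be summarized as two position tests. The sketch below assumes that newline is the single character LF, and the function and parameter names are hypothetical; the flags stand in for PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY.

```c
#include <stddef.h>

/* True if ^ can match at position pos of the subject s. */
int circumflex_matches(const char *s, size_t len, size_t pos,
                       int multiline)
{
    if (pos == 0)
        return 1;
    /* After an internal newline, but not after one that ends
       the string. */
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* True if $ can match at position pos of the subject s. */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos == len)
        return 1;
    if (s[pos] != '\n')
        return 0;
    if (multiline)
        return 1;   /* before any newline; endonly is ignored */
    return !dollar_endonly && pos + 1 == len;
}
```

For the subject "def\nabc", circumflex matches at offset 4 only in multiline mode, which is why /^abc$/ matches that subject in multiline mode and not otherwise.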
3295
3296 Note that the sequences \A, \Z, and \z can be used to match the start
3297 and end of the subject in both modes, and if all branches of a pattern
3298 start with \A it is always anchored, whether or not PCRE_MULTILINE is
3299 set.
3300
3301
3302 FULL STOP (PERIOD, DOT)
3303
3304 Outside a character class, a dot in the pattern matches any one charac-
3305 ter in the subject string except (by default) a character that signi-
3306 fies the end of a line. In UTF-8 mode, the matched character may be
3307 more than one byte long.
3308
3309 When a line ending is defined as a single character, dot never matches
3310 that character; when the two-character sequence CRLF is used, dot does
3311 not match CR if it is immediately followed by LF, but otherwise it
3312 matches all characters (including isolated CRs and LFs). When any Uni-
3313 code line endings are being recognized, dot does not match CR or LF or
3314 any of the other line ending characters.
3315
3316 The behaviour of dot with regard to newlines can be changed. If the
3317 PCRE_DOTALL option is set, a dot matches any one character, without
3318 exception. If the two-character sequence CRLF is present in the subject
3319 string, it takes two dots to match it.
3320
3321 The handling of dot is entirely independent of the handling of circum-
3322 flex and dollar, the only relationship being that they both involve
3323 newlines. Dot has no special meaning in a character class.
3324
3325
3326 MATCHING A SINGLE BYTE
3327
3328 Outside a character class, the escape sequence \C matches any one byte,
3329 both in and out of UTF-8 mode. Unlike a dot, it always matches any
3330 line-ending characters. The feature is provided in Perl in order to
3331 match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
3332 acters into individual bytes, what remains in the string may be a mal-
3333 formed UTF-8 string. For this reason, the \C escape sequence is best
3334 avoided.
3335
3336 PCRE does not allow \C to appear in lookbehind assertions (described
3337 below), because in UTF-8 mode this would make it impossible to calcu-
3338 late the length of the lookbehind.
3339
3340
3341 SQUARE BRACKETS AND CHARACTER CLASSES
3342
3343 An opening square bracket introduces a character class, terminated by a
3344 closing square bracket. A closing square bracket on its own is not spe-
3345 cial. If a closing square bracket is required as a member of the class,
3346 it should be the first data character in the class (after an initial
3347 circumflex, if present) or escaped with a backslash.
3348
3349 A character class matches a single character in the subject. In UTF-8
3350 mode, the character may occupy more than one byte. A matched character
3351 must be in the set of characters defined by the class, unless the first
3352 character in the class definition is a circumflex, in which case the
3353 subject character must not be in the set defined by the class. If a
3354 circumflex is actually required as a member of the class, ensure it is
3355 not the first character, or escape it with a backslash.
3356
3357 For example, the character class [aeiou] matches any lower case vowel,
3358 while [^aeiou] matches any character that is not a lower case vowel.
3359 Note that a circumflex is just a convenient notation for specifying the
3360 characters that are in the class by enumerating those that are not. A
3361 class that starts with a circumflex is not an assertion: it still con-
3362 sumes a character from the subject string, and therefore it fails if
3363 the current pointer is at the end of the string.
3364
3365 In UTF-8 mode, characters with values greater than 255 can be included
3366 in a class as a literal string of bytes, or by using the \x{ escaping
3367 mechanism.
3368
3369 When caseless matching is set, any letters in a class represent both
3370 their upper case and lower case versions, so for example, a caseless
3371 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
3372 match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
3373 understands the concept of case for characters whose values are less
3374 than 128, so caseless matching is always possible. For characters with
3375 higher values, the concept of case is supported if PCRE is compiled
3376 with Unicode property support, but not otherwise. If you want to use
3377 caseless matching for characters 128 and above, you must ensure that
3378 PCRE is compiled with Unicode property support as well as with UTF-8
3379 support.
3380
3381 Characters that might indicate line breaks are never treated in any
3382 special way when matching character classes, whatever line-ending
3383 sequence is in use, and whatever setting of the PCRE_DOTALL and
3384 PCRE_MULTILINE options is used. A class such as [^a] always matches one
3385 of these characters.
3386
3387 The minus (hyphen) character can be used to specify a range of charac-
3388 ters in a character class. For example, [d-m] matches any letter
3389 between d and m, inclusive. If a minus character is required in a
3390 class, it must be escaped with a backslash or appear in a position
3391 where it cannot be interpreted as indicating a range, typically as the
3392 first or last character in the class.
3393
3394 It is not possible to have the literal character "]" as the end charac-
3395 ter of a range. A pattern such as [W-]46] is interpreted as a class of
3396 two characters ("W" and "-") followed by a literal string "46]", so it
3397 would match "W46]" or "-46]". However, if the "]" is escaped with a
3398 backslash it is interpreted as the end of range, so [W-\]46] is inter-
3399 preted as a class containing a range followed by two other characters.
3400 The octal or hexadecimal representation of "]" can also be used to end
3401 a range.
3402
3403 Ranges operate in the collating sequence of character values. They can
3404 also be used for characters specified numerically, for example
3405 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
3406 are greater than 255, for example [\x{100}-\x{2ff}].
3407
3408 If a range that includes letters is used when caseless matching is set,
3409 it matches the letters in either case. For example, [W-c] is equivalent
3410 to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
3411 character tables for a French locale are in use, [\xc8-\xcb] matches
3412 accented E characters in both cases. In UTF-8 mode, PCRE supports the
3413 concept of case for characters with values greater than 128 only when
3414 it is compiled with Unicode property support.
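
The effect of caseless matching on a range can be modelled by testing a character and both of its case variants against the range. This sketch is limited to characters below 128 in the "C" locale, and the function name is hypothetical.

```c
#include <ctype.h>

/* Caseless test of c against the range lo-hi: the character matches
   if it, or its upper or lower case form, lies inside the range. */
int in_range_caseless(int c, int lo, int hi)
{
    int u = toupper(c);
    int l = tolower(c);
    return (c >= lo && c <= hi) ||
           (u >= lo && u <= hi) ||
           (l >= lo && l <= hi);
}
```

For the range W-c this accepts "w" (through "W") and "A" (through "a") but rejects "d" and "V", which agrees with the equivalent class given in the text.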
3415
3416 The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
3417 in a character class, and add the characters that they match to the
3418 class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
3419 flex can conveniently be used with the upper case character types to
3420 specify a more restricted set of characters than the matching lower
3421 case type. For example, the class [^\W_] matches any letter or digit,
3422 but not underscore.
3423
3424 The only metacharacters that are recognized in character classes are
3425 backslash, hyphen (only where it can be interpreted as specifying a
3426 range), circumflex (only at the start), opening square bracket (only
3427 when it can be interpreted as introducing a POSIX class name - see the
3428 next section), and the terminating closing square bracket. However,
3429 escaping other non-alphanumeric characters does no harm.
3430
3431
3432 POSIX CHARACTER CLASSES
3433
3434 Perl supports the POSIX notation for character classes. This uses names
3435 enclosed by [: and :] within the enclosing square brackets. PCRE also
3436 supports this notation. For example,
3437
3438 [01[:alpha:]%]
3439
3440 matches "0", "1", any alphabetic character, or "%". The supported class
3441 names are
3442
3443 alnum    letters and digits
3444 alpha    letters
3445 ascii    character codes 0 - 127
3446 blank    space or tab only
3447 cntrl    control characters
3448 digit    decimal digits (same as \d)
3449 graph    printing characters, excluding space
3450 lower    lower case letters
3451 print    printing characters, including space
3452 punct    printing characters, excluding letters and digits
3453 space    white space (not quite the same as \s)
3454 upper    upper case letters
3455 word     "word" characters (same as \w)
3456 xdigit   hexadecimal digits
3457
3458 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
3459 and space (32). Notice that this list includes the VT character (code
3460 11). This makes "space" different to \s, which does not include VT (for
3461 Perl compatibility).
3462
3463 The name "word" is a Perl extension, and "blank" is a GNU extension
3464 from Perl 5.8. Another Perl extension is negation, which is indicated
3465 by a ^ character after the colon. For example,
3466
3467 [12[:^digit:]]
3468
3469 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
3470 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
3471 these are not supported, and an error is given if they are encountered.
3472
3473 In UTF-8 mode, characters with values greater than 127 do not match any
3474 of the POSIX character classes.
3475
3476
3477 VERTICAL BAR
3478
3479 Vertical bar characters are used to separate alternative patterns. For
3480 example, the pattern
3481
3482 gilbert|sullivan
3483
3484 matches either "gilbert" or "sullivan". Any number of alternatives may
3485 appear, and an empty alternative is permitted (matching the empty
3486 string). The matching process tries each alternative in turn, from left
3487 to right, and the first one that succeeds is used. If the alternatives
3488 are within a subpattern (defined below), "succeeds" means matching the
3489 rest of the main pattern as well as the alternative in the subpattern.
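
The left-to-right rule can be observed with a short sketch in Python's re module, which is not PCRE but tries alternatives in the same order. The patterns other than gilbert|sullivan are made up for illustration.

```python
import re

first = re.search(r"gilbert|sullivan", "arthur sullivan").group()

# Alternatives are tried left to right and the first that succeeds is
# used, even when a later alternative would give a longer match.
shorter = re.search(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" matches first, but "c" cannot follow it, so "ab" is tried next.
inner = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty = re.fullmatch(r"x(|y)", "x").group(1)

print(first, shorter, inner, repr(empty))  # sullivan a ab ''
```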
3490
3491
3492 INTERNAL OPTION SETTING
3493
3494 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
3495 PCRE_EXTENDED options can be changed from within the pattern by a
3496 sequence of Perl option letters enclosed between "(?" and ")". The
3497 option letters are
3498
3499 i for PCRE_CASELESS
3500 m for PCRE_MULTILINE
3501 s for PCRE_DOTALL
3502 x for PCRE_EXTENDED
3503
3504 For example, (?im) sets caseless, multiline matching. It is also possi-
3505 ble to unset these options by preceding the letter with a hyphen, and a
3506 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
3507 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
3508 is also permitted. If a letter appears both before and after the
3509 hyphen, the option is unset.
3510
3511 When an option change occurs at top level (that is, not inside subpat-
3512 tern parentheses), the change applies to the remainder of the pattern
3513 that follows. If the change is placed right at the start of a pattern,
3514 PCRE extracts it into the global options (and it will therefore show up
3515 in data extracted by the pcre_fullinfo() function).
3516
3517 An option change within a subpattern (see below for a description of
3518 subpatterns) affects only that part of the current pattern that follows
3519 it, so
3520
3521 (a(?i)b)c
3522
3523 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
3524 used). By this means, options can be made to have different settings
3525 in different parts of the pattern. Any changes made in one alternative
3526 do carry on into subsequent branches within the same subpattern. For
3527 example,
3528
3529 (a(?i)b|c)
3530
3531 matches "ab", "aB", "c", and "C", even though when matching "C" the
3532 first branch is abandoned before the option setting. This is because
3533 the effects of option settings happen at compile time. There would be
3534 some very weird behaviour otherwise.
3535
3536 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
3537 can be changed in the same way as the Perl-compatible options by using
3538 the characters J, U and X respectively.
3539
3540
3541 SUBPATTERNS
3542
3543 Subpatterns are delimited by parentheses (round brackets), which can be
3544 nested. Turning part of a pattern into a subpattern does two things:
3545
3546 1. It localizes a set of alternatives. For example, the pattern
3547
3548 cat(aract|erpillar|)
3549
3550 matches one of the words "cat", "cataract", or "caterpillar". Without
3551 the parentheses, it would match "cataract", "erpillar" or an empty
3552 string.
3553
3554 2. It sets up the subpattern as a capturing subpattern. This means
3555 that, when the whole pattern matches, that portion of the subject
3556 string that matched the subpattern is passed back to the caller via the
3557 ovector argument of pcre_exec(). Opening parentheses are counted from
3558 left to right (starting from 1) to obtain numbers for the capturing
3559 subpatterns.
3560
3561 For example, if the string "the red king" is matched against the pat-
3562 tern
3563
3564 the ((red|white) (king|queen))
3565
3566 the captured substrings are "red king", "red", and "king", and are num-
3567 bered 1, 2, and 3, respectively.
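
Both effects of parentheses can be sketched in Python's re module, used here only as a stand-in for PCRE. The match object's span() plays a role loosely comparable to the offsets that pcre_exec() returns in ovector. The extra test word "erpillar" is an assumption added for illustration.

```python
import re

# With the parentheses, the alternation only chooses the ending that
# follows "cat"; "erpillar" on its own is not a complete match.
words = ("cat", "cataract", "caterpillar", "erpillar")
localized = [w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Groups are numbered by the position of their opening parenthesis.
m = re.search(r"the ((red|white) (king|queen))", "the red king")
captured = m.groups()
offsets = m.span(1)  # start and end offsets of group 1 in the subject

print(localized)  # ['cat', 'cataract', 'caterpillar']
print(captured)   # ('red king', 'red', 'king')
print(offsets)    # (4, 12)
```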
3568
3569 The fact that plain parentheses fulfil two functions is not always
3570 helpful. There are often times when a grouping subpattern is required
3571 without a capturing requirement. If an opening parenthesis is followed
3572 by a question mark and a colon, the subpattern does not do any captur-
3573 ing, and is not counted when computing the number of any subsequent
3574 capturing subpatterns. For example, if the string "the white queen" is
3575 matched against the pattern
3576
3577 the ((?:red|white) (king|queen))
3578
3579 the captured substrings are "white queen" and "queen", and are numbered
3580 1 and 2. The maximum number of capturing subpatterns is 65535.
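
A minimal sketch of the renumbering, again in Python's re module as a stand-in for PCRE. The (?: group is skipped when the capturing groups are counted.

```python
import re

pattern = re.compile(r"the ((?:red|white) (king|queen))")
m = pattern.search("the white queen")

captured = m.groups()
group_count = pattern.groups  # number of capturing groups in the pattern

print(captured)     # ('white queen', 'queen')
print(group_count)  # 2
```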
3581
3582 As a convenient shorthand, if any option settings are required at the
3583 start of a non-capturing subpattern, the option letters may appear
3584 between the "?" and the ":". Thus the two patterns
3585
3586 (?i:saturday|sunday)
3587 (?:(?i)saturday|sunday)
3588
3589 match exactly the same set of strings. Because alternative branches are
3590 tried from left to right, and options are not reset until the end of
3591 the subpattern is reached, an option setting in one branch does affect
3592 subsequent branches, so the above patterns match "SUNDAY" as well as
3593 "Saturday".
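
The first of the two forms can be tried in Python's re module (3.6 or later), which accepts the same (?i:...) shorthand. Recent Python versions do not accept an option setting such as (?i) in the middle of a pattern, so the second form is not shown. The pattern (a(?i:b))c is an assumption added here to show an option confined to one part of a pattern.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")
day_results = [bool(days.fullmatch(s)) for s in ("Saturday", "SUNDAY", "monday")]

# The caseless option applies only to "b"; "a" and "c" stay caseful.
scoped = re.compile(r"(a(?i:b))c")
scoped_results = [bool(scoped.fullmatch(s)) for s in ("abc", "aBc", "Abc", "abC")]

print(day_results)     # [True, True, False]
print(scoped_results)  # [True, True, False, False]
```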
3594
3595
3596 DUPLICATE SUBPATTERN NUMBERS
3597
3598 Perl 5.10 introduced a feature whereby each alternative in a subpattern
3599 uses the same numbers for its capturing parentheses. Such a subpattern
3600 starts with (?| and is itself a non-capturing subpattern. For example,
3601 consider this pattern:
3602
3603 (?|(Sat)ur|(Sun))day
3604
3605 Because the two alternatives are inside a (?| group, both sets of cap-
3606 turing parentheses are numbered one. Thus, when the pattern matches,
3607 you can look at captured substring number one, whichever alternative
3608 matched. This construct is useful when you want to capture part, but
3609 not all, of one of a number of alternatives. Inside a (?| group, paren-
3610 theses are numbered as usual, but the number is reset at the start of
3611 each branch. The numbers of any capturing buffers that follow the sub-
3612 pattern start after the highest number used in any branch. The follow-
3613 ing example is taken from the Perl documentation. The numbers under-
3614 neath show in which buffer the captured content will be stored.
3615
3616 # before  ---------------branch-reset----------- after
3617 / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
3618 # 1            2         2  3        2     3     4
3619
3620 A backreference or a recursive call to a numbered subpattern always
3621 refers to the first one in the pattern with the given number.
3622
3623 An alternative approach to using this "branch reset" feature is to use
3624 duplicate named subpatterns, as described in the next section.
3625
3626
3627 NAMED SUBPATTERNS
3628
3629 Identifying capturing parentheses by number is simple, but it can be
3630 very hard to keep track of the numbers in complicated regular expres-
3631 sions. Furthermore, if an expression is modified, the numbers may
3632 change. To help with this difficulty, PCRE supports the naming of sub-
3633 patterns. This feature was not added to Perl until release 5.10. Python
3634 had the feature earlier, and PCRE introduced it at release 4.0, using
3635 the Python syntax. PCRE now supports both the Perl and the Python syn-
3636 tax.
3637
3638 In PCRE, a subpattern can be named in one of three ways: (?<name>...)
3639 or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
3640 to capturing parentheses from other parts of the pattern, such as back-
3641 references, recursion, and conditions, can be made by name as well as
3642 by number.
3643
3644 Names consist of up to 32 alphanumeric characters and underscores.
3645 Named capturing parentheses are still allocated numbers as well as
3646 names, exactly as if the names were not present. The PCRE API provides
3647 function calls for extracting the name-to-number translation table from
3648 a compiled pattern. There is also a convenience function for extracting
3649 a captured substring by name.
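
The relationship between names and numbers can be sketched with the Python syntax (?P<name>...) in Python's re module. This is not the PCRE API: the groupindex attribute is Python's own counterpart of a name-to-number table, and the date pattern is a made-up example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")
m = date.fullmatch("2007-08-06")

# Named groups still receive numbers, in order of their opening parentheses.
name_table = dict(date.groupindex)

by_name = m.group("month")
by_number = m.group(2)

print(name_table)          # {'year': 1, 'month': 2, 'day': 3}
print(by_name, by_number)  # 08 08
```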
3650
3651 By default, a name must be unique within a pattern, but it is possible
3652 to relax this constraint by setting the PCRE_DUPNAMES option at compile
3653 time. This can be useful for patterns where only one instance of the
3654 named parentheses can match. Suppose you want to match the name of a
3655 weekday, either as a 3-letter abbreviation or as the full name, and in
3656 both cases you want to extract the abbreviation. This pattern (ignoring
3657 the line breaks) does the job:
3658
3659 (?<DN>Mon|Fri|Sun)(?:day)?|
3660 (?<DN>Tue)(?:sday)?|
3661 (?<DN>Wed)(?:nesday)?|
3662 (?<DN>Thu)(?:rsday)?|
3663 (?<DN>Sat)(?:urday)?
3664
3665 There are five capturing substrings, but only one is ever set after a
3666 match. (An alternative way of solving this problem is to use a "branch
3667 reset" subpattern, as described in the previous section.)
3668
3669 The convenience function for extracting the data by name returns the
3670 substring for the first (and in this example, the only) subpattern of
3671 that name that matched. This saves searching to find which numbered
3672 subpattern it was. If you make a reference to a non-unique named sub-
3673 pattern from elsewhere in the pattern, the one that corresponds to the
3674 lowest number is used. For further details of the interfaces for han-
3675 dling named subpatterns, see the pcreapi documentation.
3676
3677
3678 REPETITION
3679
3680 Repetition is specified by quantifiers, which can follow any of the
3681 following items:
3682
3683 a literal data character
3684 the dot metacharacter
3685 the \C escape sequence
3686 the \X escape sequence (in UTF-8 mode with Unicode properties)
3687 the \R escape sequence
3688 an escape such as \d that matches a single character
3689 a character class
3690 a back reference (see next section)
3691 a parenthesized subpattern (unless it is an assertion)
3692
3693 The general repetition quantifier specifies a minimum and maximum num-
3694 ber of permitted matches, by giving the two numbers in curly brackets
3695 (braces), separated by a comma. The numbers must be less than 65536,
3696 and the first must be less than or equal to the second. For example:
3697
3698 z{2,4}
3699
3700 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
3701 special character. If the second number is omitted, but the comma is
3702 present, there is no upper limit; if the second number and the comma
3703 are both omitted, the quantifier specifies an exact number of required
3704 matches. Thus
3705
3706 [aeiou]{3,}
3707
3708 matches at least 3 successive vowels, but may match many more, while
3709
3710 \d{8}
3711
3712 matches exactly 8 digits. An opening curly bracket that appears in a
3713 position where a quantifier is not allowed, or one that does not match
3714 the syntax of a quantifier, is taken as a literal character. For exam-
3715 ple, {,6} is not a quantifier, but a literal string of four characters.
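
The three quantifier forms can be exercised with Python's re module, which agrees with PCRE on them. The {,6} case is deliberately left out, because Python treats it as a quantifier rather than as literal text. The helper name full is hypothetical.

```python
import re

def full(pattern, subject):
    """Return True if pattern matches the whole of subject."""
    return re.fullmatch(pattern, subject) is not None

# {2,4}: between two and four repeats.
z_counts = [n for n in range(7) if full(r"z{2,4}", "z" * n)]

# {3,}: at least three repeats, with no upper limit.
vowels = [full(r"[aeiou]{3,}", s) for s in ("ae", "aei", "aeiouaeiou")]

# {8}: exactly eight repeats.
digits = [full(r"\d{8}", s) for s in ("1234567", "12345678", "123456789")]

print(z_counts)  # [2, 3, 4]
print(vowels)    # [False, True, True]
print(digits)    # [False, True, False]
```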
3716
3717 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
3718 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
3719 acters, each of which is represented by a two-byte sequence. Similarly,
3720 when Unicode property support is available, \X{3} matches three Unicode
3721 extended sequences, each of which may be several bytes long (and they
3722 may be of different lengths).
3723
3724 The quantifier {0} is permitted, causing the expression to behave as if
3725 the previous item and the quantifier were not present.
3726
3727 For convenience, the three most common quantifiers have single-charac-
3728 ter abbreviations:
3729
3730 * is equivalent to {0,}
3731 + is equivalent to {1,}
3732 ? is equivalent to {0,1}
3733
3734 It is possible to construct infinite loops by following a subpattern
3735 that can match no characters with a quantifier that has no upper limit,
3736 for example:
3737
3738 (a?)*
3739
3740 Earlier versions of Perl and PCRE used to give an error at compile time
3741 for such patterns. However, because there are cases where this can be
3742 useful, such patterns are now accepted, but if any repetition of the
3743 subpattern does in fact match no characters, the loop is forcibly bro-
3744 ken.
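
A quick way to see that such a pattern terminates is to run it. This sketch uses Python's re module, which also accepts the pattern and stops repeating rather than looping forever. Only the overall match is inspected, because the value left in the capturing group after an empty iteration is engine-specific.

```python
import re

loop = re.compile(r"(a?)*")

# Each call returns instead of looping, even though (a?) can match nothing.
whole = loop.fullmatch("aaa").group(0)
prefix = loop.match("aaab").group(0)
empty_ok = loop.fullmatch("") is not None

print(whole, prefix, empty_ok)  # aaa aaa True
```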
3745
3746 By default, the quantifiers are "greedy", that is, they match as much
3747 as possible (up to the maximum number of permitted times), without
3748 causing the rest of the pattern to fail. The classic example of where
3749 this gives problems is in trying to match comments in C programs. These
3750 appear between /* and */ and within the comment, individual * and /
3751 characters may appear. An attempt to match C comments by applying the
3752 pattern
3753
3754 /\*.*\*/
3755
3756 to the string
3757
3758 /* first comment */ not comment /* second comment */
3759
3760 fails, because it matches the entire string owing to the greediness of
3761 the .* item.
3762
3763 However, if a quantifier is followed by a question mark, it ceases to
3764 be greedy, and instead matches the minimum number of times possible, so
3765 the pattern
3766
3767 /\*.*?\*/
3768
3769 does the right thing with the C comments. The meaning of the various
3770 quantifiers is not otherwise changed, just the preferred number of
3771 matches. Do not confuse this use of question mark with its use as a
3772 quantifier in its own right. Because it has two uses, it can sometimes
3773 appear doubled, as in
3774
3775 \d??\d
3776
3777 which matches one digit by preference, but can match two if that is the
3778 only way the rest of the pattern matches.
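
The difference between the greedy and minimizing forms can be seen by running both patterns on the subject string from the text. The sketch uses Python's re module, which shares this behaviour with PCRE.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first "*/" that lets the match succeed.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the pattern
# (here, the end of the subject) cannot match otherwise.
one = re.match(r"\d??\d", "12").group()
two = re.fullmatch(r"\d??\d", "12").group()

print(greedy == subject)  # True
print(lazy)               # ['/* first comment */', '/* second comment */']
print(one, two)           # 1 12
```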
3779
3780 If the PCRE_UNGREEDY option is set (an option that is not available in
3781 Perl), the quantifiers are not greedy by default, but individual ones
3782 can be made greedy by following them with a question mark. In other
3783 words, it inverts the default behaviour.
3784
3785 When a parenthesized subpattern is quantified with a minimum repeat
3786 count that is greater than 1 or with a limited maximum, more memory is
3787 required for the compiled pattern, in proportion to the size of the
3788 minimum or maximum.
3789
3790 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
3791 alent to Perl's /s) is set, thus allowing the dot to match newlines,
3792 the pattern is implicitly anchored, because whatever follows will be
3793 tried against every character position in the subject string, so there
3794 is no point in retrying the overall match at any position after the
3795 first. PCRE normally treats such a pattern as though it were preceded
3796 by \A.
3797
3798 In cases where it is known that the subject string contains no new-
3799 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
3800 mization, or alternatively using ^ to indicate anchoring explicitly.
3801
3802 However, there is one situation where the optimization cannot be used.
3803 When .* is inside capturing parentheses that are the subject of a
3804 backreference elsewhere in the pattern, a match at the start may fail
3805 where a later one succeeds. Consider, for example:
3806
3807 (.*)abc\1
3808
3809 If the subject is "xyz123abc123" the match point is the fourth charac-
3810 ter. For this reason, such a pattern is not implicitly anchored.
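
The match point can be confirmed with a sketch in Python's re module. It says nothing about PCRE's internal anchoring optimization; it only shows where an unanchored search succeeds.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts starting at offsets 0, 1 and 2 fail because \1 would have to
# repeat text that includes part of "xyz"; offset 3 succeeds.
start = m.start()
captured = m.group(1)

print(start, captured)  # 3 123
```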
3811
3812 When a capturing subpattern is repeated, the value captured is the sub-
3813 string that matched the final iteration. For example, after
3814
3815 (tweedle[dume]{3}\s*)+
3816
3817 has matched "tweedledum tweedledee" the value of the captured substring
3818 is "tweedledee". However, if there are nested capturing subpatterns,
3819 the corresponding captured values may have been set in previous itera-
3820 tions. For example, after
3821
3822 /(a|(b))+/
3823
3824 matches "aba" the value of the second captured substring is "b".
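
Both cases can be reproduced in Python's re module, which keeps captured values across iterations in the same way for these patterns.

```python
import re

# The group holds what its final iteration matched.
m = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last = m.group(1)

# Group 2 was set in the second iteration ("b") and is not cleared when
# the third iteration matches "a" through the other alternative.
n = re.fullmatch(r"(a|(b))+", "aba")
nested = n.groups()

print(last)    # tweedledee
print(nested)  # ('a', 'b')
```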
3825
3826
3827 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
3828
3829 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
3830 repetition, failure of what follows normally causes the repeated item
3831 to be re-evaluated to see if a different number of repeats allows the
3832 rest of the pattern to match. Sometimes it is useful to prevent this,
3833 either to change the nature of the match, or to cause it to fail earlier
3834 than it otherwise might, when the author of the pattern knows there is
3835 no point in carrying on.
3836
3837 Consider, for example, the pattern \d+foo when applied to the subject
3838 line
3839
3840 123456bar
3841
3842 After matching all 6 digits and then failing to match "foo", the normal
3843 action of the matcher is to try again with only 5 digits matching the
3844 \d+ item, and then with 4, and so on, before ultimately failing.
3845 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
3846 the means for specifying that once a subpattern has matched, it is not
3847 to be re-evaluated in this way.
3848
3849 If we use atomic grouping for the previous example, the matcher gives
3850 up immediately on failing to match "foo" the first time. The notation
3851 is a kind of special parenthesis, starting with (?> as in this example:
3852
3853 (?>\d+)foo
3854
3855 This kind of parenthesis "locks up" the part of the pattern it con-
3856 tains once it has matched, and a failure further into the pattern is
3857 prevented from backtracking into it. Backtracking past it to previous
3858 items, however, works as normal.
3859
3860 An alternative description is that a subpattern of this type matches
3861 the string of characters that an identical standalone pattern would
3862 match, if anchored at the current point in the subject string.
3863
3864 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
3865 such as the above example can be thought of as a maximizing repeat that
3866 must swallow everything it can. So, while both \d+ and \d+? are pre-
3867 pared to adjust the number of digits they match in order to make the
3868 rest of the pattern match, (?>\d+) can only match an entire sequence of
3869 digits.
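
The "swallow everything" behaviour can be sketched in Python's re module. Python only gained the (?>...) syntax in version 3.11, so this sketch assumes an older interpreter and swaps in a different technique: a lookahead captures the digits and a back reference then consumes them. It relies on the engine not backtracking into a lookahead once it has matched. The group name d and the subject strings are hypothetical.

```python
import re

# Ordinary greedy repeat: \d+ gives back the final "4", so this matches.
backtracking = re.search(r"\d+4", "1234")

# Emulated atomic group: the lookahead captures the whole run of digits,
# the back reference consumes it, and nothing can be given back.
atomic = re.search(r"(?=(?P<d>\d+))(?P=d)4", "1234")

# The emulation still matches when the required text follows the digits.
atomic_foo = re.search(r"(?=(?P<d>\d+))(?P=d)foo", "123456foo")

print(backtracking.group())  # 1234
print(atomic)                # None
print(atomic_foo.group())    # 123456foo
```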
3870
3871 Atomic groups in general can of course contain arbitrarily complicated
3872 subpatterns, and can be nested. However, when the subpattern for an
3873 atomic group is just a single repeated item, as in the example above, a
3874 simpler notation, called a "possessive quantifier" can be used. This
3875 consists of an additional + character following a quantifier. Using
3876 this notation, the previous example can be rewritten as
3877
3878 \d++foo
3879
3880 Note that a possessive quantifier can be used with an entire group, for
3881 example:
3882
3883 (abc|xyz){2,3}+
3884
3885 Possessive quantifiers are always greedy; the setting of the
3886 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
3887 simpler forms of atomic group. However, there is no difference in the
3888 meaning of a possessive quantifier and the equivalent atomic group,
3889 though there may be a performance difference; possessive quantifiers
3890 should be slightly faster.
3891
3892 The possessive quantifier syntax is an extension to the Perl 5.8 syn-
3893 tax. Jeffrey Friedl originated the idea (and the name) in the first
3894 edition of his book. Mike McCloskey liked it, so implemented it when he
3895 built Sun's Java package, and PCRE copied it from there. It ultimately
3896 found its way into Perl at release 5.10.
3897
3898 PCRE has an optimization that automatically "possessifies" certain sim-
3899 ple pattern constructs. For example, the sequence A+B is treated as
3900 A++B because there is no point in backtracking into a sequence of A's
3901 when B must follow.
3902
3903 When a pattern contains an unlimited repeat inside a subpattern that
3904 can itself be repeated an unlimited number of times, the use of an
3905 atomic group is the only way to avoid some failing matches taking a
3906 very long time indeed. The pattern
3907
3908 (\D+|<\d+>)*[!?]
3909
3910 matches an unlimited number of substrings that either consist of non-
3911 digits, or digits enclosed in <>, followed by either ! or ?. When it
3912 matches, it runs quickly. However, if it is applied to
3913
3914 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
3915
3916 it takes a long time before reporting failure. This is because the
3917 string can be divided between the internal \D+ repeat and the external
3918 * repeat in a large number of ways, and all have to be tried. (The
3919 example uses [!?] rather than a single character at the end, because
3920 both PCRE and Perl have an optimization that allows for fast failure
3921 when a single character is used. They remember the last single charac-
3922 ter that is required for a match, and fail early if it is not present
3923 in the string.) If the pattern is changed so that it uses an atomic
3924 group, like this:
3925
3926 ((?>\D+)|<\d+>)*[!?]
3927
3928 sequences of non-digits cannot be broken, and failure happens quickly.
3929
3930
3931 BACK REFERENCES
3932
3933 Outside a character class, a backslash followed by a digit greater than
3934 0 (and possibly further digits) is a back reference to a capturing sub-
3935 pattern earlier (that is, to its left) in the pattern, provided there
3936 have been that many previous capturing left parentheses.
3937
3938 However, if the decimal number following the backslash is less than 10,
3939 it is always taken as a back reference, and causes an error only if
3940 there are not that many capturing left parentheses in the entire pat-
3941 tern. In other words, the parentheses that are referenced need not be
3942 to the left of the reference for numbers less than 10. A "forward back
3943 reference" of this type can make sense when a repetition is involved
3944 and the subpattern to the right has participated in an earlier itera-
3945 tion.
3946
3947 It is not possible to have a numerical "forward back reference" to a
3948 subpattern whose number is 10 or more using this syntax because a
3949 sequence such as \50 is interpreted as a character defined in octal.
3950 See the subsection entitled "Non-printing characters" above for further
3951 details of the handling of digits following a backslash. There is no
3952 such problem when named parentheses are used. A back reference to any
3953 subpattern is possible using named parentheses (see below).
3954
3955 Another way of avoiding the ambiguity inherent in the use of digits
3956 following a backslash is to use the \g escape sequence, which is a fea-
3957 ture introduced in Perl 5.10. This escape must be followed by an
3958 unsigned number or a negative number, optionally enclosed in braces.
3959 These examples are all identical:
3960
3961 (ring), \1
3962 (ring), \g1
3963 (ring), \g{1}
3964
3965 An unsigned number specifies an absolute reference without the ambigu-
3966 ity that is present in the older syntax. It is also useful when literal
3967 digits follow the reference. A negative number is a relative reference.
3968 Consider this example:
3969
3970 (abc(def)ghi)\g{-1}
3971
3972 The sequence \g{-1} is a reference to the most recently started
3973 capturing subpattern before \g, that is, it is equivalent to \2. Similarly,
3974 \g{-2} would be equivalent to \1. The use of relative references can be
3975 helpful in long patterns, and also in patterns that are created by
3976 joining together fragments that contain references within themselves.
3977
3978 A back reference matches whatever actually matched the capturing sub-
3979 pattern in the current subject string, rather than anything matching
3980 the subpattern itself (see "Subpatterns as subroutines" below for a way
3981 of doing that). So the pattern
3982
3983 (sens|respons)e and \1ibility
3984
3985 matches "sense and sensibility" and "response and responsibility", but
3986 not "sense and responsibility". If caseful matching is in force at the
3987 time of the back reference, the case of letters is relevant. For exam-
3988 ple,
3989
3990 ((?i)rah)\s+\1
3991
3992 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
3993 original capturing subpattern is matched caselessly.
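
Both points can be checked in Python's re module. Because recent Python versions reject (?i) in the middle of a pattern, the second pattern is written with the scoped form (?i:rah), which is an adaptation rather than the pattern shown above.

```python
import re

# \1 must repeat the text that group 1 actually matched.
same_word = re.compile(r"(sens|respons)e and \1ibility")
word_results = [bool(same_word.fullmatch(s)) for s in (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)]

# The group is matched caselessly, but the back reference sits outside
# the scope of the option and so compares case exactly.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]

print(word_results)  # [True, True, False]
print(rah_results)   # [True, True, False]
```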
3994
3995 There are several different ways of writing back references to named
3996 subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
3997 \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
3998 unified back reference syntax, in which \g can be used for both numeric
3999 and named references, is also supported. We could rewrite the above
4000 example in any of the following ways:
4001
4002 (?<p1>(?i)rah)\s+\k<p1>
4003 (?'p1'(?i)rah)\s+\k{p1}
4004 (?P<p1>(?i)rah)\s+(?P=p1)
4005 (?<p1>(?i)rah)\s+\g{p1}
4006
4007 A subpattern that is referenced by name may appear in the pattern
4008 before or after the reference.
4009
4010 There may be more than one back reference to the same subpattern. If a
4011 subpattern has not actually been used in a particular match, any back
4012 references to it always fail. For example, the pattern
4013
4014 (a|(bc))\2
4015
4016 always fails if it starts to match "a" rather than "bc". Because there
4017 may be many capturing parentheses in a pattern, all digits following
4018 the backslash are taken as part of a potential back reference number.
4019 If the pattern continues with a digit character, some delimiter must be
4020 used to terminate the back reference. If the PCRE_EXTENDED option is
4021 set, this can be whitespace. Otherwise an empty comment (see "Com-
4022 ments" below) can be used.
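
A short sketch of both points in Python's re module. The flag re.VERBOSE is used as Python's counterpart of PCRE_EXTENDED, and the pattern (a)\1 0 is a made-up example of a reference followed by a literal digit.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is unset when the first alternative matched, so \2 fails.
unset = pattern.fullmatch("aa")

# Group 2 is set when the second alternative matched.
used = pattern.fullmatch("bcbc")

# With extended syntax, whitespace ends the reference number, so this is
# reference 1 followed by a literal "0" rather than reference 10.
delimited = re.fullmatch(r"(a)\1 0", "aa0", re.VERBOSE)

print(unset)                  # None
print(used.group())           # bcbc
print(delimited is not None)  # True
```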
4023
4024 A back reference that occurs inside the parentheses to which it refers
4025 fails when the subpattern is first used, so, for example, (a\1) never
4026 matches. However, such references can be useful inside repeated sub-
4027 patterns. For example, the pattern
4028
4029 (a|b\1)+
4030
4031 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
4032 ation of the subpattern, the back reference matches the character
4033 string corresponding to the previous iteration. In order for this to
4034 work, the pattern must be such that the first iteration does not need
4035 to match the back reference. This can be done using alternation, as in
4036 the example above, or by a quantifier with a minimum of zero.
4037
4038
4039 ASSERTIONS
4040
4041 An assertion is a test on the characters following or preceding the
4042 current matching point that does not actually consume any characters.
4043 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
4044 described above.
4045
4046 More complicated assertions are coded as subpatterns. There are two
4047 kinds: those that look ahead of the current position in the subject
4048 string, and those that look behind it. An assertion subpattern is
4049 matched in the normal way, except that it does not cause the current
4050 matching position to be changed.
4051
4052 Assertion subpatterns are not capturing subpatterns, and may not be
4053 repeated, because it makes no sense to assert the same thing several
4054 times. If any kind of assertion contains capturing subpatterns within
4055 it, these are counted for the purposes of numbering the capturing sub-
4056 patterns in the whole pattern. However, substring capturing is carried
4057 out only for positive assertions, because it does not make sense for
4058 negative assertions.
4059
4060 Lookahead assertions
4061
4062 Lookahead assertions start with (?= for positive assertions and (?! for
4063 negative assertions. For example,
4064
4065 \w+(?=;)
4066
4067 matches a word followed by a semicolon, but does not include the semi-
4068 colon in the match, and
4069
4070 foo(?!bar)
4071
4072 matches any occurrence of "foo" that is not followed by "bar". Note
4073 that the apparently similar pattern
4074
4075 (?!foo)bar
4076
4077 does not find an occurrence of "bar" that is preceded by something
4078 other than "foo"; it finds any occurrence of "bar" whatsoever, because
4079 the assertion (?!foo) is always true when the next three characters are
4080 "bar". A lookbehind assertion is needed to achieve the other effect.
4081
4082 If you want to force a matching failure at some point in a pattern, the
4083 most convenient way to do it is with (?!) because an empty string
4084 always matches, so an assertion that requires there not to be an empty
4085 string must always fail.
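
The lookahead examples can be run in Python's re module, which uses the same syntax and gives the same results for them. The subject strings are assumptions chosen for illustration.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "return value;").group()

# "foo" at offset 0 is followed by "bar" and is skipped.
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz foo")]

# (?!foo)bar finds every "bar", including the one preceded by "foo".
bar_starts = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar bar")]

# (?!) can never succeed, so the pattern fails at every position.
forced_failure = re.search(r"a(?!)", "aaa")

print(word)            # value
print(foo_starts)      # [7, 14]
print(bar_starts)      # [3, 7]
print(forced_failure)  # None
```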
4086
4087 Lookbehind assertions
4088
4089 Lookbehind assertions start with (?<= for positive assertions and (?<!
4090 for negative assertions. For example,
4091
4092 (?<!foo)bar
4093
4094 does find an occurrence of "bar" that is not preceded by "foo". The
4095 contents of a lookbehind assertion are restricted such that all the
4096 strings it matches must have a fixed length. However, if there are sev-
4097 eral top-level alternatives, they do not all have to have the same
4098 fixed length. Thus
4099
4100 (?<=bullock|donkey)
4101
4102 is permitted, but
4103
4104 (?<!dogs?|cats?)
4105
4106 causes an error at compile time. Branches that match different length
4107 strings are permitted only at the top level of a lookbehind assertion.
4108 This is an extension compared with Perl (at least for 5.8), which
4109 requires all branches to match the same length of string. An assertion
4110 such as
4111
4112 (?<=ab(c|de))
4113
4114 is not permitted, because its single top-level branch can match two
4115 different lengths, but it is acceptable if rewritten to use two top-
4116 level branches:
4117
4118 (?<=abc|abde)
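
The fixed-length rule can be observed in Python's re module, with one caveat: Python is stricter than PCRE and requires every alternative in a lookbehind to have the same length, so (?<=bullock|donkey) is not shown. The subject string is an assumption chosen for illustration.

```python
import re

# Only the "bar" that is not preceded by "foo" is found.
bar_starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bar")]

# A branch that can match different lengths is rejected at compile time.
try:
    re.compile(r"(?<!dogs?|cats?)")
    compiled = True
except re.error:
    compiled = False

print(bar_starts)  # [7]
print(compiled)    # False
```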
4119
4120 In some cases, the Perl 5.10 escape sequence \K (see above) can be used
4121 instead of a lookbehind assertion; this is not restricted to a fixed
4122 length.
4123
4124 The implementation of lookbehind assertions is, for each alternative,
4125 to temporarily move the current position back by the fixed length and
4126 then try to match. If there are insufficient characters before the cur-
4127 rent position, the assertion fails.
4128
4129 PCRE does not allow the \C escape (which matches a single byte in UTF-8
4130 mode) to appear in lookbehind assertions, because it makes it impossi-
4131 ble to calculate the length of the lookbehind. The \X and \R escapes,
4132 which can match different numbers of bytes, are also not permitted.
4133
4134 Possessive quantifiers can be used in conjunction with lookbehind
4135 assertions to specify efficient matching at the end of the subject
4136 string. Consider a simple pattern such as
4137
4138 abcd$
4139
4140 when applied to a long string that does not match. Because matching
4141 proceeds from left to right, PCRE will look for each "a" in the subject
4142 and then see if what follows matches the rest of the pattern. If the
4143 pattern is specified as
4144
4145 ^.*abcd$
4146
4147 the initial .* matches the entire string at first, but when this fails
4148 (because there is no following "a"), it backtracks to match all but the
4149 last character, then all but the last two characters, and so on. Once
4150 again the search for "a" covers the entire string, from right to left,
4151 so we are no better off. However, if the pattern is written as
4152
4153 ^.*+(?<=abcd)
4154
4155 there can be no backtracking for the .*+ item; it can match only the
4156 entire string. The subsequent lookbehind assertion does a single test
4157 on the last four characters. If it fails, the match fails immediately.
4158 For long strings, this approach makes a significant difference to the
4159 processing time.
4160
4161 Using multiple assertions
4162
4163 Several assertions (of any sort) may occur in succession. For example,
4164
4165 (?<=\d{3})(?<!999)foo
4166
4167 matches "foo" preceded by three digits that are not "999". Notice that
4168 each of the assertions is applied independently at the same point in
4169 the subject string. First there is a check that the previous three
4170 characters are all digits, and then there is a check that the same
three characters are not "999". This pattern does not match "foo"
preceded by six characters, the first three of which are digits and the
last three of which are not "999". For example, it doesn't match
"123abcfoo". A pattern to do that is
4175
4176 (?<=\d{3}...)(?<!999)foo
4177
4178 This time the first assertion looks at the preceding six characters,
4179 checking that the first three are digits, and then the second assertion
4180 checks that the preceding three characters are not "999".
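
The difference between the two patterns can be checked with a short script.
The sketch below uses Python's re module, which is a separate engine from
PCRE; it is used here only on the assumption that it accepts these two
patterns with the same meaning. The variable and helper names are
illustrative.

```python
import re

# Both assertions are tested at the same position, just before "foo".
three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")

# Six characters are examined: three digits, then three that are not "999".
digits_then_three = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None
```

The first pattern accepts "123foo" and rejects both "999foo" and
"123abcfoo"; the second accepts "123abcfoo" and rejects "123999foo".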
4181
4182 Assertions can be nested in any combination. For example,
4183
4184 (?<=(?<!foo)bar)baz
4185
4186 matches an occurrence of "baz" that is preceded by "bar" which in turn
4187 is not preceded by "foo", while
4188
4189 (?<=\d{3}(?!999)...)foo
4190
4191 is another pattern that matches "foo" preceded by three digits and any
4192 three characters that are not "999".
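
The two nested patterns can be tried out in the same way. Again this is a
sketch in Python's re module rather than PCRE, on the assumption that nested
assertions behave alike in both engines for these patterns; the variable
names are illustrative.

```python
import re

# "baz" preceded by "bar", which in turn is not preceded by "foo".
baz_after_plain_bar = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters other than "999".
foo_after_digits = re.compile(r"(?<=\d{3}(?!999)...)foo")
```

Here baz_after_plain_bar finds a match in "xxxbarbaz" but not in
"foobarbaz", and foo_after_digits finds a match in "123abcfoo" but not in
"123999foo".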
4193
4194
4195 CONDITIONAL SUBPATTERNS
4196
4197 It is possible to cause the matching process to obey a subpattern con-
4198 ditionally or to choose between two alternative subpatterns, depending
4199 on the result of an assertion, or whether a previous capturing subpat-
4200 tern matched or not. The two possible forms of conditional subpattern
4201 are
4202
4203 (?(condition)yes-pattern)
4204 (?(condition)yes-pattern|no-pattern)
4205
4206 If the condition is satisfied, the yes-pattern is used; otherwise the
4207 no-pattern (if present) is used. If there are more than two alterna-
4208 tives in the subpattern, a compile-time error occurs.
4209
4210 There are four kinds of condition: references to subpatterns, refer-
4211 ences to recursion, a pseudo-condition called DEFINE, and assertions.
4212
4213 Checking for a used subpattern by number
4214
4215 If the text between the parentheses consists of a sequence of digits,
4216 the condition is true if the capturing subpattern of that number has
4217 previously matched. An alternative notation is to precede the digits
4218 with a plus or minus sign. In this case, the subpattern number is rela-
4219 tive rather than absolute. The most recently opened parentheses can be
4220 referenced by (?(-1), the next most recent by (?(-2), and so on. In
4221 looping constructs it can also make sense to refer to subsequent groups
4222 with constructs such as (?(+2).
4223
4224 Consider the following pattern, which contains non-significant white
4225 space to make it more readable (assume the PCRE_EXTENDED option) and to
4226 divide it into three parts for ease of discussion:
4227
4228 ( \( )? [^()]+ (?(1) \) )
4229
4230 The first part matches an optional opening parenthesis, and if that
4231 character is present, sets it as the first captured substring. The sec-
4232 ond part matches one or more characters that are not parentheses. The
4233 third part is a conditional subpattern that tests whether the first set
of parentheses matched or not. If they did, that is, if the subject
started with an opening parenthesis, the condition is true, and so the
yes-pattern is executed and a closing parenthesis is required. Otherwise,
4237 since no-pattern is not present, the subpattern matches nothing. In
4238 other words, this pattern matches a sequence of non-parentheses,
4239 optionally enclosed in parentheses.
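
Python's re module happens to accept the same (?(1)...) syntax, so the
behaviour of this pattern can be sketched there. The engine is not PCRE, and
re.VERBOSE stands in for PCRE_EXTENDED; the variable name is illustrative.

```python
import re

# Verbose mode ignores the white space that separates the three parts.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)
```

Matched against a whole string with optional_parens.fullmatch(), "(abc)" and
"abc" are accepted, while "(abc" and "abc)" are rejected because the closing
parenthesis is required exactly when the opening one was captured.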
4240
4241 If you were embedding this pattern in a larger one, you could use a
4242 relative reference:
4243
4244 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
4245
4246 This makes the fragment independent of the parentheses in the larger
4247 pattern.
4248
4249 Checking for a used subpattern by name
4250
4251 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
4252 used subpattern by name. For compatibility with earlier versions of
4253 PCRE, which had this facility before Perl, the syntax (?(name)...) is
4254 also recognized. However, there is a possible ambiguity with this syn-
4255 tax, because subpattern names may consist entirely of digits. PCRE
4256 looks first for a named subpattern; if it cannot find one and the name
4257 consists entirely of digits, PCRE looks for a subpattern of that num-
4258 ber, which must be greater than zero. Using subpattern names that con-
4259 sist entirely of digits is not recommended.
4260
4261 Rewriting the above example to use a named subpattern gives this:
4262
4263 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
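
The named form can be sketched in Python's re module, which spells the group
(?P<OPEN>...) and the condition (?(OPEN)...). Both spellings appear in this
documentation as forms that PCRE also recognizes, but the engine used below
is Python's, not PCRE, and the variable name is illustrative.

```python
import re

# The condition refers to the group by name instead of by number.
named_parens = re.compile(
    r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE
)
```

After named_parens.fullmatch("(abc)") the group "OPEN" holds "(", whereas
after named_parens.fullmatch("abc") it is unset (None in Python).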
4264
4265
4266 Checking for pattern recursion
4267
4268 If the condition is the string (R), and there is no subpattern with the
4269 name R, the condition is true if a recursive call to the whole pattern
4270 or any subpattern has been made. If digits or a name preceded by amper-
4271 sand follow the letter R, for example:
4272
4273 (?(R3)...) or (?(R&name)...)
4274
4275 the condition is true if the most recent recursion is into the subpat-
4276 tern whose number or name is given. This condition does not check the
4277 entire recursion stack.
4278
4279 At "top level", all these recursion test conditions are false. Recur-
4280 sive patterns are described below.
4281
4282 Defining subpatterns for use by reference only
4283
4284 If the condition is the string (DEFINE), and there is no subpattern
4285 with the name DEFINE, the condition is always false. In this case,
4286 there may be only one alternative in the subpattern. It is always
4287 skipped if control reaches this point in the pattern; the idea of
4288 DEFINE is that it can be used to define "subroutines" that can be ref-
4289 erenced from elsewhere. (The use of "subroutines" is described below.)
4290 For example, a pattern to match an IPv4 address could be written like
4291 this (ignore whitespace and line breaks):
4292
4293 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
4294 \b (?&byte) (\.(?&byte)){3} \b
4295
The first part of the pattern is a DEFINE group inside which another
4297 group named "byte" is defined. This matches an individual component of
4298 an IPv4 address (a number less than 256). When matching takes place,
4299 this part of the pattern is skipped because DEFINE acts like a false
4300 condition.
4301
4302 The rest of the pattern uses references to the named group to match the
4303 four dot-separated components of an IPv4 address, insisting on a word
4304 boundary at each end.
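
Python's re module has neither DEFINE nor subroutine calls, so the sketch
below only approximates the idea: each reference to "byte" is replaced by a
textual copy of the fragment. This is an assumption-laden stand-in, because a
PCRE subroutine call is atomic and a pasted copy is not, so the two are not
interchangeable in general. The names BYTE and IPV4 are illustrative.

```python
import re

# One component of an IPv4 address, as in the "byte" group above.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Each reference to the group becomes a textual copy of the fragment.
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")
```

IPV4.fullmatch() accepts "192.168.0.1" and "255.255.255.255", and rejects
"256.1.1.1" and "1.2.3".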
4305
4306 Assertion conditions
4307
4308 If the condition is not in any of the above formats, it must be an
4309 assertion. This may be a positive or negative lookahead or lookbehind
4310 assertion. Consider this pattern, again containing non-significant
4311 white space, and with the two alternatives on the second line:
4312
4313 (?(?=[^a-z]*[a-z])
4314 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
4315
4316 The condition is a positive lookahead assertion that matches an
4317 optional sequence of non-letters followed by a letter. In other words,
4318 it tests for the presence of at least one letter in the subject. If a
4319 letter is found, the subject is matched against the first alternative;
4320 otherwise it is matched against the second. This pattern matches
4321 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
4322 letters and dd are digits.
4323
4324
4325 COMMENTS
4326
4327 The sequence (?# marks the start of a comment that continues up to the
4328 next closing parenthesis. Nested parentheses are not permitted. The
4329 characters that make up a comment play no part in the pattern matching
4330 at all.
4331
4332 If the PCRE_EXTENDED option is set, an unescaped # character outside a
4333 character class introduces a comment that continues to immediately
4334 after the next newline in the pattern.
4335
4336
4337 RECURSIVE PATTERNS
4338
4339 Consider the problem of matching a string in parentheses, allowing for
4340 unlimited nested parentheses. Without the use of recursion, the best
4341 that can be done is to use a pattern that matches up to some fixed
4342 depth of nesting. It is not possible to handle an arbitrary nesting
4343 depth.
4344
4345 For some time, Perl has provided a facility that allows regular expres-
4346 sions to recurse (amongst other things). It does this by interpolating
4347 Perl code in the expression at run time, and the code can refer to the
4348 expression itself. A Perl pattern using code interpolation to solve the
4349 parentheses problem can be created like this:
4350
4351 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
4352
4353 The (?p{...}) item interpolates Perl code at run time, and in this case
4354 refers recursively to the pattern in which it appears.
4355
4356 Obviously, PCRE cannot support the interpolation of Perl code. Instead,
4357 it supports special syntax for recursion of the entire pattern, and
4358 also for individual subpattern recursion. After its introduction in
4359 PCRE and Python, this kind of recursion was introduced into Perl at
4360 release 5.10.
4361
4362 A special item that consists of (? followed by a number greater than
4363 zero and a closing parenthesis is a recursive call of the subpattern of
4364 the given number, provided that it occurs inside that subpattern. (If
4365 not, it is a "subroutine" call, which is described in the next sec-
4366 tion.) The special item (?R) or (?0) is a recursive call of the entire
4367 regular expression.
4368
4369 In PCRE (like Python, but unlike Perl), a recursive subpattern call is
4370 always treated as an atomic group. That is, once it has matched some of
4371 the subject string, it is never re-entered, even if it contains untried
4372 alternatives and there is a subsequent matching failure.
4373
4374 This PCRE pattern solves the nested parentheses problem (assume the
4375 PCRE_EXTENDED option is set so that white space is ignored):
4376
4377 \( ( (?>[^()]+) | (?R) )* \)
4378
4379 First it matches an opening parenthesis. Then it matches any number of
4380 substrings which can either be a sequence of non-parentheses, or a
4381 recursive match of the pattern itself (that is, a correctly parenthe-
4382 sized substring). Finally there is a closing parenthesis.
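
What the recursive pattern does can be spelled out as an ordinary recursive
function. The sketch below is hand-written Python, not a use of PCRE (and
Python's re module has no recursion syntax); the function name match_parens
is hypothetical. Each branch is labelled with the part of the pattern it
stands for.

```python
def match_parens(subject, pos=0):
    """Return the end offset of a balanced group starting at `pos`, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                 # \(     an opening parenthesis is required
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1          # \)     the closing parenthesis ends the group
        if ch == "(":
            end = match_parens(subject, pos)    # (?R)  recurse on a nested group
            if end is None:
                return None
            pos = end
        else:
            pos += 1                # [^()]+ consume a non-parenthesis character
    return None                     # subject ended before the closing parenthesis
```

For example, match_parens("(ab(cd)ef)") returns 10, the offset just past the
final parenthesis, and match_parens("(a(b)") returns None.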
4383
4384 If this were part of a larger pattern, you would not want to recurse
4385 the entire pattern, so instead you could use this:
4386
4387 ( \( ( (?>[^()]+) | (?1) )* \) )
4388
4389 We have put the pattern into parentheses, and caused the recursion to
4390 refer to them instead of the whole pattern.
4391
4392 In a larger pattern, keeping track of parenthesis numbers can be
tricky. This is made easier by the use of relative references (a Perl
5.10 feature). Instead of (?1) in the pattern above you can write
4395 (?-2) to refer to the second most recently opened parentheses preceding
4396 the recursion. In other words, a negative number counts capturing
4397 parentheses leftwards from the point at which it is encountered.
4398
4399 It is also possible to refer to subsequently opened parentheses, by
4400 writing references such as (?+2). However, these cannot be recursive
4401 because the reference is not inside the parentheses that are refer-
4402 enced. They are always "subroutine" calls, as described in the next
4403 section.
4404
4405 An alternative approach is to use named parentheses instead. The Perl
4406 syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also
4407 supported. We could rewrite the above example as follows:
4408
4409 (?<pn> \( ( (?>[^()]+) | (?&pn) )* \) )
4410
4411 If there is more than one subpattern with the same name, the earliest
4412 one is used.
4413
4414 This particular example pattern that we have been looking at contains
4415 nested unlimited repeats, and so the use of atomic grouping for match-
4416 ing strings of non-parentheses is important when applying the pattern
4417 to strings that do not match. For example, when this pattern is applied
4418 to
4419
4420 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
4421
4422 it yields "no match" quickly. However, if atomic grouping is not used,
4423 the match runs for a very long time indeed because there are so many
4424 different ways the + and * repeats can carve up the subject, and all
4425 have to be tested before failure can be reported.
4426
4427 At the end of a match, the values set for any capturing subpatterns are
4428 those from the outermost level of the recursion at which the subpattern
4429 value is set. If you want to obtain intermediate values, a callout
4430 function can be used (see below and the pcrecallout documentation). If
4431 the pattern above is matched against
4432
4433 (ab(cd)ef)
4434
4435 the value for the capturing parentheses is "ef", which is the last
4436 value taken on at the top level. If additional parentheses are added,
4437 giving
4438
\( ( ( (?>[^()]+) | (?R) )* ) \)
   ^                        ^
   ^                        ^
4442
4443 the string they capture is "ab(cd)ef", the contents of the top level
4444 parentheses. If there are more than 15 capturing parentheses in a pat-
4445 tern, PCRE has to obtain extra memory to store data during a recursion,
4446 which it does by using pcre_malloc, freeing it via pcre_free after-
4447 wards. If no memory can be obtained, the match fails with the
4448 PCRE_ERROR_NOMEMORY error.
4449
4450 Do not confuse the (?R) item with the condition (R), which tests for
4451 recursion. Consider this pattern, which matches text in angle brack-
4452 ets, allowing for arbitrary nesting. Only digits are allowed in nested
4453 brackets (that is, when recursing), whereas any characters are permit-
4454 ted at the outer level.
4455
4456 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
4457
4458 In this pattern, (?(R) is the start of a conditional subpattern, with
4459 two different alternatives for the recursive and non-recursive cases.
4460 The (?R) item is the actual recursive call.
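
The roles of the two items can be made explicit in a hand-written sketch. It
is plain Python rather than PCRE, and the names match_angle and recursing are
hypothetical: the recursing flag plays the part of the (R) condition, and the
nested call plays the part of (?R).

```python
ASCII_DIGITS = "0123456789"


def match_angle(subject, pos=0, recursing=False):
    """Return the end offset of a bracketed group starting at `pos`, or None."""
    if pos >= len(subject) or subject[pos] != "<":
        return None
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ">":
            return pos + 1
        if ch == "<":
            # (?R): the actual recursive call; the callee is "recursing".
            end = match_angle(subject, pos, recursing=True)
            if end is None:
                return None
            pos = end
        elif recursing and ch not in ASCII_DIGITS:
            return None             # (?(R) \d++ ...: only digits when nested
        else:
            pos += 1                # ... | [^<>]*+ : anything at the outer level
    return None
```

match_angle("<abc<123>def>") returns 13, while match_angle("<abc<12x>def>")
returns None because a non-digit appears inside nested brackets.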
4461
4462
4463 SUBPATTERNS AS SUBROUTINES
4464
4465 If the syntax for a recursive subpattern reference (either by number or
4466 by name) is used outside the parentheses to which it refers, it oper-
4467 ates like a subroutine in a programming language. The "called" subpat-
4468 tern may be defined before or after the reference. A numbered reference
4469 can be absolute or relative, as in these examples:
4470
4471 (...(absolute)...)...(?2)...
4472 (...(relative)...)...(?-1)...
4473 (...(?+1)...(relative)...
4474
4475 An earlier example pointed out that the pattern
4476
4477 (sens|respons)e and \1ibility
4478
4479 matches "sense and sensibility" and "response and responsibility", but
4480 not "sense and responsibility". If instead the pattern
4481
4482 (sens|respons)e and (?1)ibility
4483
4484 is used, it does match "sense and responsibility" as well as the other
4485 two strings. Another example is given in the discussion of DEFINE
4486 above.
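
The contrast between a back reference and a subroutine call can be sketched
in Python's re module. That module has no (?1) syntax, so the call is
imitated here by pasting a second copy of the subpattern, which is an
approximation (a real PCRE call is atomic). The names WORD, backref and reuse
are illustrative.

```python
import re

WORD = r"sens|respons"

# \1 must repeat the text that group 1 actually matched.
backref = re.compile(r"(" + WORD + r")e and \1ibility")

# A second copy of the subpattern is free to match a different alternative,
# which is the behaviour the text describes for (?1).
reuse = re.compile(r"(" + WORD + r")e and (?:" + WORD + r")ibility")
```

backref.fullmatch("sense and responsibility") fails, whereas
reuse.fullmatch("sense and responsibility") succeeds.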
4487
4488 Like recursive subpatterns, a "subroutine" call is always treated as an
4489 atomic group. That is, once it has matched some of the subject string,
4490 it is never re-entered, even if it contains untried alternatives and
4491 there is a subsequent matching failure.
4492
4493 When a subpattern is used as a subroutine, processing options such as
4494 case-independence are fixed when the subpattern is defined. They cannot
4495 be changed for different calls. For example, consider this pattern:
4496
4497 (abc)(?i:(?-1))
4498
4499 It matches "abcabc". It does not match "abcABC" because the change of
4500 processing option does not affect the called subpattern.
4501
4502
4503 CALLOUTS
4504
4505 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
4506 Perl code to be obeyed in the middle of matching a regular expression.
4507 This makes it possible, amongst other things, to extract different sub-
4508 strings that match the same pair of parentheses when there is a repeti-
4509 tion.
4510
4511 PCRE provides a similar feature, but of course it cannot obey arbitrary
4512 Perl code. The feature is called "callout". The caller of PCRE provides
4513 an external function by putting its entry point in the global variable
4514 pcre_callout. By default, this variable contains NULL, which disables
4515 all calling out.
4516
4517 Within a regular expression, (?C) indicates the points at which the
4518 external function is to be called. If you want to identify different
4519 callout points, you can put a number less than 256 after the letter C.
4520 The default value is zero. For example, this pattern has two callout
4521 points:
4522
4523 (?C1)abc(?C2)def
4524
4525 If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are
4526 automatically installed before each item in the pattern. They are all
4527 numbered 255.
4528
4529 During matching, when PCRE reaches a callout point (and pcre_callout is
4530 set), the external function is called. It is provided with the number
4531 of the callout, the position in the pattern, and, optionally, one item
4532 of data originally supplied by the caller of pcre_exec(). The callout
4533 function may cause matching to proceed, to backtrack, or to fail alto-
4534 gether. A complete description of the interface to the callout function
4535 is given in the pcrecallout documentation.
4536
4537
4538 SEE ALSO
4539
4540 pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
4541
4542
4543 AUTHOR
4544
4545 Philip Hazel
4546 University Computing Service
4547 Cambridge CB2 3QH, England.
4548
4549
4550 REVISION
4551
4552 Last updated: 06 August 2007
4553 Copyright (c) 1997-2007 University of Cambridge.
4554 ------------------------------------------------------------------------------
4555
4556
4557 PCRESYNTAX(3) PCRESYNTAX(3)
4558
4559
4560 NAME
4561 PCRE - Perl-compatible regular expressions
4562
4563
4564 PCRE REGULAR EXPRESSION SYNTAX SUMMARY
4565
4566 The full syntax and semantics of the regular expressions that are sup-
4567 ported by PCRE are described in the pcrepattern documentation. This
4568 document contains just a quick-reference summary of the syntax.
4569
4570
4571 QUOTING
4572
4573 \x where x is non-alphanumeric is a literal x
4574 \Q...\E treat enclosed characters as literal
4575
4576
4577 CHARACTERS
4578
4579 \a alarm, that is, the BEL character (hex 07)
4580 \cx "control-x", where x is any character
4581 \e escape (hex 1B)
4582 \f formfeed (hex 0C)
4583 \n newline (hex 0A)
4584 \r carriage return (hex 0D)
4585 \t tab (hex 09)
4586 \ddd character with octal code ddd, or backreference
4587 \xhh character with hex code hh
4588 \x{hhh..} character with hex code hhh..
4589
4590
4591 CHARACTER TYPES
4592
4593 . any character except newline;
4594 in dotall mode, any character whatsoever
4595 \C one byte, even in UTF-8 mode (best avoided)
4596 \d a decimal digit
4597 \D a character that is not a decimal digit
4598 \h a horizontal whitespace character
4599 \H a character that is not a horizontal whitespace character
4600 \p{xx} a character with the xx property
4601 \P{xx} a character without the xx property
4602 \R a newline sequence
4603 \s a whitespace character
4604 \S a character that is not a whitespace character
4605 \v a vertical whitespace character
4606 \V a character that is not a vertical whitespace character
4607 \w a "word" character
4608 \W a "non-word" character
4609 \X an extended Unicode sequence
4610
4611 In PCRE, \d, \D, \s, \S, \w, and \W recognize only ASCII characters.
4612
4613
4614 GENERAL CATEGORY PROPERTY CODES FOR \p and \P
4615
4616 C Other
4617 Cc Control
4618 Cf Format
4619 Cn Unassigned
4620 Co Private use
4621 Cs Surrogate
4622
4623 L Letter
4624 Ll Lower case letter
4625 Lm Modifier letter
4626 Lo Other letter
4627 Lt Title case letter
4628 Lu Upper case letter
4629 L& Ll, Lu, or Lt
4630
4631 M Mark
4632 Mc Spacing mark
4633 Me Enclosing mark
4634 Mn Non-spacing mark
4635
4636 N Number
4637 Nd Decimal number
4638 Nl Letter number
4639 No Other number
4640
4641 P Punctuation
4642 Pc Connector punctuation
4643 Pd Dash punctuation
4644 Pe Close punctuation
4645 Pf Final punctuation
4646 Pi Initial punctuation
4647 Po Other punctuation
4648 Ps Open punctuation
4649
4650 S Symbol
4651 Sc Currency symbol
4652 Sk Modifier symbol
4653 Sm Mathematical symbol
4654 So Other symbol
4655
4656 Z Separator
4657 Zl Line separator
4658 Zp Paragraph separator
4659 Zs Space separator
4660
4661
4662 SCRIPT NAMES FOR \p AND \P
4663
4664 Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
4665 Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
4666 Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
4667 Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
4668 gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
4669 Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
4670 Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
4671 Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
4672 Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.
4673
4674
4675 CHARACTER CLASSES
4676
4677 [...] positive character class
4678 [^...] negative character class
4679 [x-y] range (can be used for hex characters)
4680 [[:xxx:]] positive POSIX named set
[[:^xxx:]] negative POSIX named set
4682
4683 alnum alphanumeric
4684 alpha alphabetic
4685 ascii 0-127
4686 blank space or tab
4687 cntrl control character
4688 digit decimal digit
4689 graph printing, excluding space
4690 lower lower case letter
4691 print printing, including space
4692 punct printing, excluding alphanumeric
4693 space whitespace
4694 upper upper case letter
4695 word same as \w
4696 xdigit hexadecimal digit
4697
4698 In PCRE, POSIX character set names recognize only ASCII characters. You
4699 can use \Q...\E inside a character class.
4700
4701
4702 QUANTIFIERS
4703
4704 ? 0 or 1, greedy
4705 ?+ 0 or 1, possessive
4706 ?? 0 or 1, lazy
4707 * 0 or more, greedy
4708 *+ 0 or more, possessive
4709 *? 0 or more, lazy
4710 + 1 or more, greedy
4711 ++ 1 or more, possessive
4712 +? 1 or more, lazy
4713 {n} exactly n
4714 {n,m} at least n, no more than m, greedy
4715 {n,m}+ at least n, no more than m, possessive
4716 {n,m}? at least n, no more than m, lazy
4717 {n,} n or more, greedy
4718 {n,}+ n or more, possessive
4719 {n,}? n or more, lazy
4720
4721
4722 ANCHORS AND SIMPLE ASSERTIONS
4723
4724 \b word boundary
4725 \B not a word boundary
4726 ^ start of subject
4727 also after internal newline in multiline mode
4728 \A start of subject
4729 $ end of subject
4730 also before newline at end of subject
4731 also before internal newline in multiline mode
4732 \Z end of subject
4733 also before newline at end of subject
4734 \z end of subject
4735 \G first matching position in subject
4736
4737
4738 MATCH POINT RESET
4739
4740 \K reset start of match
4741
4742
4743 ALTERNATION
4744
4745 expr|expr|expr...
4746
4747
4748 CAPTURING
4749
4750 (...) capturing group
4751 (?<name>...) named capturing group (Perl)
4752 (?'name'...) named capturing group (Perl)
4753 (?P<name>...) named capturing group (Python)
4754 (?:...) non-capturing group
4755 (?|...) non-capturing group; reset group numbers for
4756 capturing groups in each alternative
4757
4758
4759 ATOMIC GROUPS
4760
4761 (?>...) atomic, non-capturing group
4762
4763
4764 COMMENT
4765
4766 (?#....) comment (not nestable)
4767
4768
4769 OPTION SETTING
4770
4771 (?i) caseless
4772 (?J) allow duplicate names
4773 (?m) multiline
4774 (?s) single line (dotall)
4775 (?U) default ungreedy (lazy)
4776 (?x) extended (ignore white space)
4777 (?-...) unset option(s)
4778
4779
4780 LOOKAHEAD AND LOOKBEHIND ASSERTIONS
4781
4782 (?=...) positive look ahead
4783 (?!...) negative look ahead
4784 (?<=...) positive look behind
4785 (?<!...) negative look behind
4786
4787 Each top-level branch of a look behind must be of a fixed length.
4788
4789
4790 BACKREFERENCES
4791
4792 \n reference by number (can be ambiguous)
4793 \gn reference by number
4794 \g{n} reference by number
4795 \g{-n} relative reference by number
4796 \k<name> reference by name (Perl)
4797 \k'name' reference by name (Perl)
4798 \g{name} reference by name (Perl)
4799 \k{name} reference by name (.NET)
4800 (?P=name) reference by name (Python)
4801
4802
4803 SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
4804
4805 (?R) recurse whole pattern
4806 (?n) call subpattern by absolute number
4807 (?+n) call subpattern by relative number
4808 (?-n) call subpattern by relative number
4809 (?&name) call subpattern by name (Perl)
4810 (?P>name) call subpattern by name (Python)
4811
4812
4813 CONDITIONAL PATTERNS
4814
4815 (?(condition)yes-pattern)
4816 (?(condition)yes-pattern|no-pattern)
4817
4818 (?(n)... absolute reference condition
4819 (?(+n)... relative reference condition
4820 (?(-n)... relative reference condition
4821 (?(<name>)... named reference condition (Perl)
4822 (?('name')... named reference condition (Perl)
4823 (?(name)... named reference condition (PCRE)
4824 (?(R)... overall recursion condition
4825 (?(Rn)... specific group recursion condition
4826 (?(R&name)... specific recursion condition
4827 (?(DEFINE)... define subpattern for reference
4828 (?(assert)... assertion condition
4829
4830
4831 CALLOUTS
4832
4833 (?C) callout
4834 (?Cn) callout with data n
4835
4836
4837 SEE ALSO
4838
4839 pcrepattern(3), pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3).
4840
4841
4842 AUTHOR
4843
4844 Philip Hazel
4845 University Computing Service
4846 Cambridge CB2 3QH, England.
4847
4848
4849 REVISION
4850
4851 Last updated: 06 August 2007
4852 Copyright (c) 1997-2007 University of Cambridge.
4853 ------------------------------------------------------------------------------
4854
4855
4856 PCREPARTIAL(3) PCREPARTIAL(3)
4857
4858
4859 NAME
4860 PCRE - Perl-compatible regular expressions
4861
4862
4863 PARTIAL MATCHING IN PCRE
4864
4865 In normal use of PCRE, if the subject string that is passed to
4866 pcre_exec() or pcre_dfa_exec() matches as far as it goes, but is too
4867 short to match the entire pattern, PCRE_ERROR_NOMATCH is returned.
4868 There are circumstances where it might be helpful to distinguish this
4869 case from other cases in which there is no match.
4870
4871 Consider, for example, an application where a human is required to type
4872 in data for a field with specific formatting requirements. An example
4873 might be a date in the form ddmmmyy, defined by this pattern:
4874
4875 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
4876
4877 If the application sees the user's keystrokes one by one, and can check
4878 that what has been typed so far is potentially valid, it is able to
4879 raise an error as soon as a mistake is made, possibly beeping and not
4880 reflecting the character that has been typed. This immediate feedback
4881 is likely to be a better user interface than a check that is delayed
4882 until the entire string has been entered.
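
The kind of check such an application needs can be sketched by hand for this
one date format. The code below is plain Python and does not use PCRE; the
names classify, MONTHS and all_digits are hypothetical. It reports whether
the text typed so far is a complete date, a possible beginning of one, or
neither. Treating an empty string as a possible beginning is an assumption of
the sketch.

```python
MONTHS = ("jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec")
DIGITS = "0123456789"


def all_digits(text):
    """Return True if every character is an ASCII digit (or text is empty)."""
    return all(ch in DIGITS for ch in text)


def classify(typed):
    """Return "match", "partial" or "no match" for the text typed so far."""
    result = "no match"
    for day_len in (1, 2):          # \d?\d allows one or two day digits
        day = typed[:day_len]
        month = typed[day_len:day_len + 3]
        year = typed[day_len + 3:day_len + 5]
        extra = typed[day_len + 5:]
        plausible = (all_digits(day)
                     and any(name.startswith(month) for name in MONTHS)
                     and all_digits(year)
                     and extra == "")
        if plausible:
            if len(typed) == day_len + 5:
                return "match"      # every field is present and valid
            result = "partial"      # valid so far, but more input is needed
    return result
```

An application could call classify() after each keystroke and reject the
keystroke whenever the result is "no match".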
4883
4884 PCRE supports the concept of partial matching by means of the PCRE_PAR-
4885 TIAL option, which can be set when calling pcre_exec() or
4886 pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code
4887 PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
4888 during the matching process the last part of the subject string matched
4889 part of the pattern. Unfortunately, for non-anchored matching, it is
4890 not possible to obtain the position of the start of the partial match.
4891 No captured data is set when PCRE_ERROR_PARTIAL is returned.
4892
4893 When PCRE_PARTIAL is set for pcre_dfa_exec(), the return code
4894 PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of
4895 the subject is reached, there have been no complete matches, but there
4896 is still at least one matching possibility. The portion of the string
4897 that provided the partial match is set as the first matching string.
4898
4899 Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers
4900 the last literal byte in a pattern, and abandons matching immediately
4901 if such a byte is not present in the subject string. This optimization
4902 cannot be used for a subject string that might match only partially.
4903
4904
4905 RESTRICTED PATTERNS FOR PCRE_PARTIAL
4906
4907 Because of the way certain internal optimizations are implemented in
4908 the pcre_exec() function, the PCRE_PARTIAL option cannot be used with
4909 all patterns. These restrictions do not apply when pcre_dfa_exec() is
4910 used. For pcre_exec(), repeated single characters such as
4911
4912 a{2,4}
4913
4914 and repeated single metasequences such as
4915
4916 \d+
4917
4918 are not permitted if the maximum number of occurrences is greater than
4919 one. Optional items such as \d? (where the maximum is one) are permit-
4920 ted. Quantifiers with any values are permitted after parentheses, so
4921 the invalid examples above can be coded thus:
4922
4923 (a){2,4}
4924 (\d)+
4925
4926 These constructions run more slowly, but for the kinds of application
4927 that are envisaged for this facility, this is not felt to be a major
4928 restriction.
4929
4930 If PCRE_PARTIAL is set for a pattern that does not conform to the
4931 restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL
4932 (-13). You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to
4933 find out if a compiled pattern can be used for partial matching.
4934
4935
4936 EXAMPLE OF PARTIAL MATCHING USING PCRETEST
4937
4938 If the escape sequence \P is present in a pcretest data line, the
4939 PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that
4940 uses the date example quoted above:
4941
4942 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4943 data> 25jun04\P
4944 0: 25jun04
4945 1: jun
4946 data> 25dec3\P
4947 Partial match
4948 data> 3ju\P
4949 Partial match
4950 data> 3juj\P
4951 No match
4952 data> j\P
4953 No match
4954
4955 The first data string is matched completely, so pcretest shows the
4956 matched substrings. The remaining four strings do not match the com-
4957 plete pattern, but the first two are partial matches. The same test,
4958 using pcre_dfa_exec() matching (by means of the \D escape sequence),
4959 produces the following output:
4960
4961 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4962 data> 25jun04\P\D
4963 0: 25jun04
4964 data> 23dec3\P\D
4965 Partial match: 23dec3
4966 data> 3ju\P\D
4967 Partial match: 3ju
4968 data> 3juj\P\D
4969 No match
4970 data> j\P\D
4971 No match
4972
4973 Notice that in this case the portion of the string that was matched is
4974 made available.
4975
4976
4977 MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()
4978
4979 When a partial match has been found using pcre_dfa_exec(), it is possi-
4980 ble to continue the match by providing additional subject data and
4981 calling pcre_dfa_exec() again with the same compiled regular expres-
4982 sion, this time setting the PCRE_DFA_RESTART option. You must also pass
4983 the same working space as before, because this is where details of the
4984 previous partial match are stored. Here is an example using pcretest,
4985 using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and
4986 \D are as above):
4987
4988 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
4989 data> 23ja\P\D
4990 Partial match: 23ja
4991 data> n05\R\D
4992 0: n05
4993
4994 The first call has "23ja" as the subject, and requests partial match-
4995 ing; the second call has "n05" as the subject for the continued
4996 (restarted) match. Notice that when the match is complete, only the
4997 last part is shown; PCRE does not retain the previously partially-
4998 matched string. It is up to the calling program to do that if it needs
4999 to.
5000
5001 You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial
5002 matching over multiple segments. This facility can be used to pass very
5003 long subject strings to pcre_dfa_exec(). However, some care is needed
5004 for certain types of pattern.
5005
5006 1. If the pattern contains tests for the beginning or end of a line,
5007 you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri-
5008 ate, when the subject string for any call does not contain the begin-
5009 ning or end of a line.
5010
5011 2. If the pattern contains backward assertions (including \b or \B),
5012 you need to arrange for some overlap in the subject strings to allow
5013 for this. For example, you could pass the subject in chunks that are
5014 500 bytes long, but in a buffer of 700 bytes, with the starting offset
5015 set to 200 and the previous 200 bytes at the start of the buffer.
5016
5017 3. Matching a subject string that is split into multiple segments does
5018 not always produce exactly the same result as matching over one single
5019 long string. The difference arises when there are multiple matching
5020 possibilities, because a partial match result is given only when there
5021 are no completed matches in a call to pcre_dfa_exec(). This means that
5022 as soon as the shortest match has been found, continuation to a new
5023 subject segment is no longer possible. Consider this pcretest example:
5024
5025 re> /dog(sbody)?/
5026 data> do\P\D
5027 Partial match: do
5028 data> gsb\R\P\D
5029 0: g
5030 data> dogsbody\D
5031 0: dogsbody
5032 1: dog
5033
5034 The pattern matches the words "dog" or "dogsbody". When the subject is
5035 presented in several parts ("do" and "gsb" being the first two) the
5036 match stops when "dog" has been found, and it is not possible to con-
5037 tinue. On the other hand, if "dogsbody" is presented as a single
5038 string, both matches are found.
5039
5040 Because of this phenomenon, it does not usually make sense to end a
5041 pattern that is going to be matched in this way with a variable repeat.
5042
5043 4. Patterns that contain alternatives at the top level which do not all
5044 start with the same pattern item may not work as expected. For example,
5045 consider this pattern:
5046
5047 1234|3789
5048
5049 If the first part of the subject is "ABC123", a partial match of the
5050 first alternative is found at offset 3. There is no partial match for
5051 the second alternative, because such a match does not start at the same
5052 point in the subject string. Attempting to continue with the string
5053 "789" does not yield a match because only those alternatives that match
5054 at one point in the subject are remembered. The problem arises because
5055 the start of the second alternative matches within the first alterna-
5056 tive. There is no problem with anchored patterns or patterns such as:
5057
5058 1234|ABCD
5059
5060 where no string can be a partial match for both alternatives.
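
The overlap arrangement described in item 2 can be sketched in C as follows. This is only an illustration under stated assumptions: the names feed_with_overlap, segment_fn, segment_log and log_segment are hypothetical and are not part of PCRE, and the whole subject is held in memory so that the buffer handling can be shown on its own. The callback stands in for the place where a real program would call pcre_dfa_exec() with the buffer, its length, and the starting offset.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK   500   /* new subject bytes per call, as in item 2 */
#define OVERLAP 200   /* preceding bytes kept so \b and lookbehind can see them */

/* Hypothetical callback type: `start` is the offset of the first new byte,
   that is, the value a caller would pass as the starting offset. */
typedef void (*segment_fn)(const char *buf, size_t len, size_t start, void *ctx);

/* Hypothetical recorder, used only to observe the calls that are made. */
struct segment_log { size_t calls, last_len, last_start; };

static void log_segment(const char *buf, size_t len, size_t start, void *ctx)
{
    struct segment_log *log = ctx;
    (void)buf;
    log->calls++;
    log->last_len = len;
    log->last_start = start;
}

/* Present `data` in CHUNK-sized pieces, each preceded in the buffer by up
   to OVERLAP bytes of the data that came immediately before it. */
static void feed_with_overlap(const char *data, size_t total,
                              segment_fn fn, void *ctx)
{
    char buf[OVERLAP + CHUNK];
    size_t pos = 0;

    while (pos < total) {
        size_t keep = pos < OVERLAP ? pos : OVERLAP;             /* old bytes */
        size_t take = total - pos < CHUNK ? total - pos : CHUNK; /* new bytes */

        memcpy(buf, data + pos - keep, keep + take);
        fn(buf, keep + take, keep, ctx);
        pos += take;
    }
}
```

The first call has no preceding data, so its starting offset is zero; every later call starts at offset 200 in a buffer of at most 700 bytes.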
5061
5062
5063 AUTHOR
5064
5065 Philip Hazel
5066 University Computing Service
5067 Cambridge CB2 3QH, England.
5068
5069
5070 REVISION
5071
5072 Last updated: 04 June 2007
5073 Copyright (c) 1997-2007 University of Cambridge.
5074 ------------------------------------------------------------------------------
5075
5076
5077 PCREPRECOMPILE(3) PCREPRECOMPILE(3)
5078
5079
5080 NAME
5081 PCRE - Perl-compatible regular expressions
5082
5083
5084 SAVING AND RE-USING PRECOMPILED PCRE PATTERNS
5085
5086 If you are running an application that uses a large number of regular
5087 expression patterns, it may be useful to store them in a precompiled
5088 form instead of having to compile them every time the application is
5089 run. If you are not using any private character tables (see the
5090 pcre_maketables() documentation), this is relatively straightforward.
5091 If you are using private tables, it is a little bit more complicated.
5092
5093 If you save compiled patterns to a file, you can copy them to a differ-
5094 ent host and run them there. This works even if the new host has the
5095 opposite endianness to the one on which the patterns were compiled.
5096 There may be a small performance penalty, but it should be insignifi-
5097 cant. However, compiling regular expressions with one version of PCRE
5098 for use with a different version is not guaranteed to work and may
5099 cause crashes.
5100
5101
5102 SAVING A COMPILED PATTERN
5103 The value returned by pcre_compile() points to a single block of memory
5104 that holds the compiled pattern and associated data. You can find the
5105 length of this block in bytes by calling pcre_fullinfo() with an argu-
5106 ment of PCRE_INFO_SIZE. You can then save the data in any appropriate
5107 manner. Here is sample code that compiles a pattern and writes it to a
5108 file. It assumes that the variable fd refers to a file that is open for
5109 output:
5110
int erroroffset, rc;
size_t size;              /* PCRE_INFO_SIZE yields a size_t */
const char *error;
pcre *re;

re = pcre_compile("my pattern", 0, &error, &erroroffset, NULL);
if (re == NULL) { ... handle errors ... }
rc = pcre_fullinfo(re, NULL, PCRE_INFO_SIZE, &size);
if (rc < 0) { ... handle errors ... }
if (fwrite(re, 1, size, fd) != size) { ... handle errors ... }
5121
5122 In this example, the bytes that comprise the compiled pattern are
5123 copied exactly. Note that this is binary data that may contain any of
5124 the 256 possible byte values. On systems that make a distinction
5125 between binary and non-binary data, be sure that the file is opened for
5126 binary output.
5127
5128 If you want to write more than one pattern to a file, you will have to
5129 devise a way of separating them. For binary data, preceding each pat-
5130 tern with its length is probably the most straightforward approach.
5131 Another possibility is to write out the data in hexadecimal instead of
5132 binary, one pattern to a line.
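
The length-prefix approach might look something like the sketch below. It is independent of PCRE: the bytes written could be a compiled pattern or anything else. The function names write_record and read_record are hypothetical, and the choice of a four-byte big-endian length is an assumption made so that the file layout does not depend on the host's byte order.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write one record: a 4-byte big-endian length followed by the data bytes.
   Returns 0 on success, -1 on failure. */
static int write_record(FILE *f, const void *data, size_t size)
{
    unsigned char hdr[4];

    if (size > 0xFFFFFFFFu) return -1;        /* does not fit in 4 bytes */
    hdr[0] = (unsigned char)((size >> 24) & 0xFF);
    hdr[1] = (unsigned char)((size >> 16) & 0xFF);
    hdr[2] = (unsigned char)((size >> 8) & 0xFF);
    hdr[3] = (unsigned char)(size & 0xFF);
    if (fwrite(hdr, 1, 4, f) != 4) return -1;
    if (fwrite(data, 1, size, f) != size) return -1;
    return 0;
}

/* Read one record into a malloc()ed block and store its length in
   *size_out. Returns NULL at end of file or on error. */
static void *read_record(FILE *f, size_t *size_out)
{
    unsigned char hdr[4];
    size_t size;
    void *block;

    if (fread(hdr, 1, 4, f) != 4) return NULL;
    size = ((size_t)hdr[0] << 24) | ((size_t)hdr[1] << 16) |
           ((size_t)hdr[2] << 8) | (size_t)hdr[3];
    block = malloc(size > 0 ? size : 1);
    if (block == NULL) return NULL;
    if (fread(block, 1, size, f) != size) { free(block); return NULL; }
    *size_out = size;
    return block;
}
```

A caller would pass the pointer returned by pcre_compile() and the size obtained from PCRE_INFO_SIZE to write_record(), once per pattern, with the file opened in binary mode.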
5133
5134 Saving compiled patterns in a file is only one possible way of storing
5135 them for later use. They could equally well be saved in a database, or
5136 in the memory of some daemon process that passes them via sockets to
5137 the processes that want them.
5138
5139 If the pattern has been studied, it is also possible to save the study
5140 data in a similar way to the compiled pattern itself. When studying
5141 generates additional information, pcre_study() returns a pointer to a
5142 pcre_extra data block. Its format is defined in the section on matching
5143 a pattern in the pcreapi documentation. The study_data field points to
5144 the binary study data, and this is what you must save (not the
5145 pcre_extra block itself). The length of the study data can be obtained
5146 by calling pcre_fullinfo() with an argument of PCRE_INFO_STUDYSIZE.
5147 Remember to check that pcre_study() did return a non-NULL value before
5148 trying to save the study data.
5149
5150
5151 RE-USING A PRECOMPILED PATTERN
5152
5153 Re-using a precompiled pattern is straightforward. Having reloaded it
5154 into main memory, you pass its pointer to pcre_exec() or
5155 pcre_dfa_exec() in the usual way. This should work even on another
5156 host, and even if that host has the opposite endianness to the one
5157 where the pattern was compiled.
5158
However, if you passed a pointer to custom character tables when the
pattern was compiled (the tableptr argument of pcre_compile()), you
must now pass a similar pointer to pcre_exec() or pcre_dfa_exec(),
because the pointer value saved with the compiled pattern is no longer
valid. A field in a pcre_extra block is used to pass this data, as
described in the section on matching a pattern in the pcreapi
documentation.
5166
5167 If you did not provide custom character tables when the pattern was
5168 compiled, the pointer in the compiled pattern is NULL, which causes
5169 pcre_exec() to use PCRE's internal tables. Thus, you do not need to
5170 take any special action at run time in this case.
5171
5172 If you saved study data with the compiled pattern, you need to create
5173 your own pcre_extra data block and set the study_data field to point to
5174 the reloaded study data. You must also set the PCRE_EXTRA_STUDY_DATA
5175 bit in the flags field to indicate that study data is present. Then
5176 pass the pcre_extra block to pcre_exec() or pcre_dfa_exec() in the
5177 usual way.
5178
5179
5180 COMPATIBILITY WITH DIFFERENT PCRE RELEASES
5181
5182 In general, it is safest to recompile all saved patterns when you
5183 update to a new PCRE release, though not all updates actually require
5184 this. Recompiling is definitely needed for release 7.2.
5185
5186
5187 AUTHOR
5188
5189 Philip Hazel
5190 University Computing Service
5191 Cambridge CB2 3QH, England.
5192
5193
5194 REVISION
5195
5196 Last updated: 13 June 2007
5197 Copyright (c) 1997-2007 University of Cambridge.
5198 ------------------------------------------------------------------------------
5199
5200
5201 PCREPERFORM(3) PCREPERFORM(3)
5202
5203
5204 NAME
5205 PCRE - Perl-compatible regular expressions
5206
5207
5208 PCRE PERFORMANCE
5209
5210 Two aspects of performance are discussed below: memory usage and pro-
5211 cessing time. The way you express your pattern as a regular expression
5212 can affect both of them.
5213
5214
5215 MEMORY USAGE
5216
5217 Patterns are compiled by PCRE into a reasonably efficient byte code, so
5218 that most simple patterns do not use much memory. However, there is one
5219 case where memory usage can be unexpectedly large. When a parenthesized
5220 subpattern has a quantifier with a minimum greater than 1 and/or a lim-
5221 ited maximum, the whole subpattern is repeated in the compiled code.
5222 For example, the pattern
5223
5224 (abc|def){2,4}
5225
5226 is compiled as if it were
5227
5228 (abc|def)(abc|def)((abc|def)(abc|def)?)?
5229
5230 (Technical aside: It is done this way so that backtrack points within
5231 each of the repetitions can be independently maintained.)
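
The shape of this expansion can be reproduced textually with a few lines of C. This is purely an illustration of why the compiled size grows with the repeat counts: PCRE performs the replication on its compiled byte code, not on the pattern text, and the names append and expand_quantifier are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Append src to the string in dst, whose capacity (including the
   terminating zero) is cap. Returns -1 if it does not fit. */
static int append(char *dst, size_t cap, const char *src)
{
    size_t used = strlen(dst), need = strlen(src);
    if (need >= cap - used) return -1;
    memcpy(dst + used, src, need + 1);
    return 0;
}

/* Write the expanded form of group{min,max} into out. */
static int expand_quantifier(const char *group, int min, int max,
                             char *out, size_t cap)
{
    int i, optional = max - min;

    if (cap == 0 || min < 0 || optional < 0) return -1;
    out[0] = '\0';

    /* The mandatory copies come first. */
    for (i = 0; i < min; i++)
        if (append(out, cap, group) < 0) return -1;

    /* The optional copies are nested: (G(G(G?)?)?)? and so on. */
    for (i = 0; i < optional; i++) {
        if (i < optional - 1 && append(out, cap, "(") < 0) return -1;
        if (append(out, cap, group) < 0) return -1;
    }
    if (optional > 0 && append(out, cap, "?") < 0) return -1;
    for (i = 1; i < optional; i++)
        if (append(out, cap, ")?") < 0) return -1;
    return 0;
}
```

Calling expand_quantifier("(abc|def)", 2, 4, ...) reproduces the expanded pattern shown above; the output length grows linearly with the maximum count, and multiplies when such groups are nested.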
5232
5233 For regular expressions whose quantifiers use only small numbers, this
5234 is not usually a problem. However, if the numbers are large, and par-
5235 ticularly if such repetitions are nested, the memory usage can become
5236 an embarrassment. For example, the very simple pattern
5237
5238 ((ab){1,1000}c){1,3}
5239
5240 uses 51K bytes when compiled. When PCRE is compiled with its default
5241 internal pointer size of two bytes, the size limit on a compiled pat-
5242 tern is 64K, and this is reached with the above pattern if the outer
5243 repetition is increased from 3 to 4. PCRE can be compiled to use larger
5244 internal pointers and thus handle larger compiled patterns, but it is
5245 better to try to rewrite your pattern to use less memory if you can.
5246
5247 One way of reducing the memory usage for such patterns is to make use
5248 of PCRE's "subroutine" facility. Re-writing the above pattern as
5249
5250 ((ab)(?2){0,999}c)(?1){0,2}
5251
5252 reduces the memory requirements to 18K, and indeed it remains under 20K
5253 even with the outer repetition increased to 100. However, this pattern
5254 is not exactly equivalent, because the "subroutine" calls are treated
5255 as atomic groups into which there can be no backtracking if there is a
5256 subsequent matching failure. Therefore, PCRE cannot do this kind of
5257 rewriting automatically. Furthermore, there is a noticeable loss of
5258 speed when executing the modified pattern. Nevertheless, if the atomic
5259 grouping is not a problem and the loss of speed is acceptable, this
5260 kind of rewriting will allow you to process patterns that PCRE cannot
5261 otherwise handle.
5262
5263
5264 PROCESSING TIME
5265
5266 Certain items in regular expression patterns are processed more effi-
5267 ciently than others. It is more efficient to use a character class like
5268 [aeiou] than a set of single-character alternatives such as
5269 (a|e|i|o|u). In general, the simplest construction that provides the
5270 required behaviour is usually the most efficient. Jeffrey Friedl's book
5271 contains a lot of useful general discussion about optimizing regular
5272 expressions for efficient performance. This document contains a few
5273 observations about PCRE.
5274
5275 Using Unicode character properties (the \p, \P, and \X escapes) is
5276 slow, because PCRE has to scan a structure that contains data for over
5277 fifteen thousand characters whenever it needs a character's property.
5278 If you can find an alternative pattern that does not use character
5279 properties, it will probably be faster.
5280
5281 When a pattern begins with .* not in parentheses, or in parentheses
5282 that are not the subject of a backreference, and the PCRE_DOTALL option
5283 is set, the pattern is implicitly anchored by PCRE, since it can match
5284 only at the start of a subject string. However, if PCRE_DOTALL is not
5285 set, PCRE cannot make this optimization, because the . metacharacter
5286 does not then match a newline, and if the subject string contains new-
5287 lines, the pattern may match from the character immediately following
5288 one of them instead of from the very start. For example, the pattern
5289
5290 .*second
5291
5292 matches the subject "first\nand second" (where \n stands for a newline
5293 character), with the match starting at the seventh character. In order
5294 to do this, PCRE has to retry the match starting after every newline in
5295 the subject.
5296
5297 If you are using such a pattern with subject strings that do not con-
5298 tain newlines, the best performance is obtained by setting PCRE_DOTALL,
5299 or starting the pattern with ^.* or ^.*? to indicate explicit anchor-
5300 ing. That saves PCRE from having to scan along the subject looking for
5301 a newline to restart at.
5302
5303 Beware of patterns that contain nested indefinite repeats. These can
5304 take a long time to run when applied to a string that does not match.
5305 Consider the pattern fragment
5306
5307 ^(a+)*
5308
5309 This can match "aaaa" in 16 different ways, and this number increases
5310 very rapidly as the string gets longer. (The * repeat can match 0, 1,
5311 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
5312 repeats can match different numbers of times.) When the remainder of
5313 the pattern is such that the entire match is going to fail, PCRE has in
5314 principle to try every possible variation, and this can take an
5315 extremely long time, even for relatively short strings.
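
The count of 16 can be checked with a short recursive function. This sketch does not use PCRE at all; it simply enumerates the choices described above, and the name count_ways is hypothetical. At each point the * repeat may stop, or the next a+ may take any number of the remaining characters.

```c
/* Number of distinct ways ^(a+)* can match at the start of a run of
   n "a" characters. */
static unsigned long count_ways(unsigned int n)
{
    unsigned long ways = 1;            /* the * repeat stops here */
    unsigned int len;

    for (len = 1; len <= n; len++)     /* or the next a+ takes len characters */
        ways += count_ways(n - len);
    return ways;
}
```

The result doubles with each additional character, which is why a failing match against even a short run of "a" characters can take a very long time.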
5316
5317 An optimization catches some of the more simple cases such as
5318
5319 (a+)*b
5320
5321 where a literal character follows. Before embarking on the standard
5322 matching procedure, PCRE checks that there is a "b" later in the sub-
5323 ject string, and if there is not, it fails the match immediately. How-
5324 ever, when there is no following literal this optimization cannot be
5325 used. You can see the difference by comparing the behaviour of
5326
5327 (a+)*\d
5328
with the pattern above. The pattern above, (a+)*b, gives a failure
almost instantly when applied to a whole line of "a" characters,
whereas (a+)*\d takes an appreciable time with strings longer than
about 20 characters.
5332
5333 In many cases, the solution to this kind of performance issue is to use
5334 an atomic group or a possessive quantifier.
5335
5336
5337 AUTHOR
5338
5339 Philip Hazel
5340 University Computing Service
5341 Cambridge CB2 3QH, England.
5342
5343
5344 REVISION
5345
5346 Last updated: 06 March 2007
5347 Copyright (c) 1997-2007 University of Cambridge.
5348 ------------------------------------------------------------------------------
5349
5350
5351 PCREPOSIX(3) PCREPOSIX(3)
5352
5353
5354 NAME
5355 PCRE - Perl-compatible regular expressions.
5356
5357
5358 SYNOPSIS OF POSIX API
5359
5360 #include <pcreposix.h>
5361
5362 int regcomp(regex_t *preg, const char *pattern,
5363 int cflags);
5364
5365 int regexec(regex_t *preg, const char *string,
5366 size_t nmatch, regmatch_t pmatch[], int eflags);
5367
5368 size_t regerror(int errcode, const regex_t *preg,
5369 char *errbuf, size_t errbuf_size);
5370
5371 void regfree(regex_t *preg);
5372
5373
5374 DESCRIPTION
5375
5376 This set of functions provides a POSIX-style API to the PCRE regular
5377 expression package. See the pcreapi documentation for a description of
5378 PCRE's native API, which contains much additional functionality.
5379
5380 The functions described here are just wrapper functions that ultimately
5381 call the PCRE native API. Their prototypes are defined in the
5382 pcreposix.h header file, and on Unix systems the library itself is
5383 called pcreposix.a, so can be accessed by adding -lpcreposix to the
5384 command for linking an application that uses them. Because the POSIX
5385 functions call the native ones, it is also necessary to add -lpcre.
5386
5387 I have implemented only those option bits that can be reasonably mapped
5388 to PCRE native options. In addition, the option REG_EXTENDED is defined
5389 with the value zero. This has no effect, but since programs that are
5390 written to the POSIX interface often use it, this makes it easier to
5391 slot in PCRE as a replacement library. Other POSIX options are not even
5392 defined.
5393
5394 When PCRE is called via these functions, it is only the API that is
5395 POSIX-like in style. The syntax and semantics of the regular expres-
5396 sions themselves are still those of Perl, subject to the setting of
5397 various PCRE options, as described below. "POSIX-like in style" means
5398 that the API approximates to the POSIX definition; it is not fully
5399 POSIX-compatible, and in multi-byte encoding domains it is probably
5400 even less compatible.
5401
5402 The header for these functions is supplied as pcreposix.h to avoid any
5403 potential clash with other POSIX libraries. It can, of course, be
5404 renamed or aliased as regex.h, which is the "correct" name. It provides
5405 two structure types, regex_t for compiled internal forms, and reg-
5406 match_t for returning captured substrings. It also defines some con-
5407 stants whose names start with "REG_"; these are used for setting
5408 options and identifying error codes.
5409
5410
5411 COMPILING A PATTERN
5412
5413 The function regcomp() is called to compile a pattern into an internal
5414 form. The pattern is a C string terminated by a binary zero, and is
5415 passed in the argument pattern. The preg argument is a pointer to a
5416 regex_t structure that is used as a base for storing information about
5417 the compiled regular expression.
5418
5419 The argument cflags is either zero, or contains one or more of the bits
5420 defined by the following macros:
5421
5422 REG_DOTALL
5423
5424 The PCRE_DOTALL option is set when the regular expression is passed for
5425 compilation to the native function. Note that REG_DOTALL is not part of
5426 the POSIX standard.
5427
5428 REG_ICASE
5429
5430 The PCRE_CASELESS option is set when the regular expression is passed
5431 for compilation to the native function.
5432
5433 REG_NEWLINE
5434
5435 The PCRE_MULTILINE option is set when the regular expression is passed
5436 for compilation to the native function. Note that this does not mimic
5437 the defined POSIX behaviour for REG_NEWLINE (see the following sec-
5438 tion).
5439
5440 REG_NOSUB
5441
5442 The PCRE_NO_AUTO_CAPTURE option is set when the regular expression is
5443 passed for compilation to the native function. In addition, when a pat-
5444 tern that is compiled with this flag is passed to regexec() for match-
5445 ing, the nmatch and pmatch arguments are ignored, and no captured
5446 strings are returned.
5447
5448 REG_UTF8
5449
5450 The PCRE_UTF8 option is set when the regular expression is passed for
5451 compilation to the native function. This causes the pattern itself and
5452 all data strings used for matching it to be treated as UTF-8 strings.
5453 Note that REG_UTF8 is not part of the POSIX standard.
5454
In the absence of these flags, no options are passed to the native
function. This means that the regex is compiled with PCRE default
semantics. In particular, the way it handles newline characters in the
subject string is the Perl way, not the POSIX way. Note that setting
PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
It does not affect the way newlines are matched by . (they aren't) or
by a negative class such as [^a] (they are).
5462
5463 The yield of regcomp() is zero on success, and non-zero otherwise. The
5464 preg structure is filled in on success, and one member of the structure
5465 is public: re_nsub contains the number of capturing subpatterns in the
5466 regular expression. Various error codes are defined in the header file.
5467
5468
5469 MATCHING NEWLINE CHARACTERS
5470
5471 This area is not simple, because POSIX and Perl take different views of
5472 things. It is not possible to get PCRE to obey POSIX semantics, but
5473 then PCRE was never intended to be a POSIX engine. The following table
5474 lists the different possibilities for matching newline characters in
5475 PCRE:
5476
                          Default   Change with

. matches newline           no      PCRE_DOTALL
newline matches [^a]        yes     not changeable
$ matches \n at end         yes     PCRE_DOLLAR_ENDONLY
$ matches \n in middle      no      PCRE_MULTILINE
^ matches \n in middle      no      PCRE_MULTILINE
5484
5485 This is the equivalent table for POSIX:
5486
                          Default   Change with

. matches newline           yes     REG_NEWLINE
newline matches [^a]        yes     REG_NEWLINE
$ matches \n at end         no      REG_NEWLINE
$ matches \n in middle      no      REG_NEWLINE
^ matches \n in middle      no      REG_NEWLINE
5494
5495 PCRE's behaviour is the same as Perl's, except that there is no equiva-
5496 lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is
5497 no way to stop newline from matching [^a].
5498
5499 The default POSIX newline handling can be obtained by setting
5500 PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
5501 behave exactly as for the REG_NEWLINE action.
5502
5503
5504 MATCHING A PATTERN
5505
5506 The function regexec() is called to match a compiled pattern preg
5507 against a given string, which is terminated by a zero byte, subject to
5508 the options in eflags. These can be:
5509
5510 REG_NOTBOL
5511
5512 The PCRE_NOTBOL option is set when calling the underlying PCRE matching
5513 function.
5514
5515 REG_NOTEOL
5516
5517 The PCRE_NOTEOL option is set when calling the underlying PCRE matching
5518 function.
5519
5520 If the pattern was compiled with the REG_NOSUB flag, no data about any
5521 matched strings is returned. The nmatch and pmatch arguments of
5522 regexec() are ignored.
5523
Otherwise, the portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument, which points
to an array of nmatch structures of type regmatch_t, containing the
members rm_so and rm_eo. These contain the offset to the first
character of each substring and the offset to the first character after
the end of each substring, respectively. The 0th element of the vector
relates to the entire portion of string that was matched; subsequent
elements relate to the capturing subpatterns of the regular expression.
Unused entries in the array have both structure members set to -1.
5533
5534 A successful match yields a zero return; various error codes are
5535 defined in the header file, of which REG_NOMATCH is the "expected"
5536 failure code.
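
The calling sequence might be sketched as follows. To keep the sketch runnable without PCRE installed, it is written against the system <regex.h>, which is an assumption; with PCRE, the header would be pcreposix.h and the program would be linked with -lpcreposix and -lpcre. The pattern in the usage note uses only syntax that is common to POSIX extended expressions and Perl. The names copy_capture and match_once are hypothetical.

```c
#include <regex.h>    /* assumption: system header; <pcreposix.h> with PCRE */
#include <stdlib.h>
#include <string.h>

/* Return a malloc()ed copy of capture i, or NULL if that entry is unset
   (both members are -1) or memory runs out. */
static char *copy_capture(const char *subject, const regmatch_t *pm, size_t i)
{
    size_t len;
    char *s;

    if (pm[i].rm_so == -1) return NULL;
    len = (size_t)(pm[i].rm_eo - pm[i].rm_so);   /* rm_eo is one past the end */
    s = malloc(len + 1);
    if (s == NULL) return NULL;
    memcpy(s, subject + pm[i].rm_so, len);
    s[len] = '\0';
    return s;
}

/* Compile, match once, and free. Returns zero on a successful match,
   otherwise the error code from regcomp() or regexec(). */
static int match_once(const char *pattern, const char *subject,
                      regmatch_t *pm, size_t nmatch)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0) return rc;
    rc = regexec(&re, subject, nmatch, pm, 0);
    regfree(&re);
    return rc;
}
```

For example, matching "([a-z]+):([0-9]+)" against "ruby:1234" with an array of three regmatch_t entries leaves the whole match in entry 0 and the two captured substrings in entries 1 and 2.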
5537
5538
5539 ERROR MESSAGES
5540
The regerror() function maps a non-zero error code (errcode) from
either regcomp() or regexec() to a printable message. If preg is not
NULL, the error should have arisen from the use of that structure. A
message terminated by a binary zero is placed in errbuf. The length of
the message, including the zero, is limited to errbuf_size. The yield
of the function is the size of buffer needed to hold the whole message.
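
Because the yield is the size needed, a caller can ask for the size first and then allocate a buffer that is certain to be large enough. The sketch below is written against the system <regex.h> so that it can be run without PCRE (an assumption; with PCRE the header would be pcreposix.h). It also assumes that a call with a zero errbuf_size writes nothing and only reports the size. The name error_text is hypothetical.

```c
#include <regex.h>    /* assumption: system header; <pcreposix.h> with PCRE */
#include <stdlib.h>

/* Return a malloc()ed copy of the message for errcode, or NULL if memory
   cannot be obtained. The caller frees the result. */
static char *error_text(int errcode, const regex_t *preg)
{
    size_t need = regerror(errcode, preg, NULL, 0);   /* size only */
    char *msg = malloc(need > 0 ? need : 1);

    if (msg == NULL) return NULL;
    msg[0] = '\0';
    regerror(errcode, preg, msg, need);               /* fill the buffer */
    return msg;
}
```

A fixed-size buffer works equally well when a truncated message is acceptable.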
5547
5548
5549 MEMORY USAGE
5550
5551 Compiling a regular expression causes memory to be allocated and asso-
5552 ciated with the preg structure. The function regfree() frees all such
5553 memory, after which preg may no longer be used as a compiled expres-
5554 sion.
5555
5556
5557 AUTHOR
5558
5559 Philip Hazel
5560 University Computing Service
5561 Cambridge CB2 3QH, England.
5562
5563
5564 REVISION
5565
5566 Last updated: 06 March 2007
5567 Copyright (c) 1997-2007 University of Cambridge.
5568 ------------------------------------------------------------------------------
5569
5570
5571 PCRECPP(3) PCRECPP(3)
5572
5573
5574 NAME
5575 PCRE - Perl-compatible regular expressions.
5576
5577
5578 SYNOPSIS OF C++ WRAPPER
5579
5580 #include <pcrecpp.h>
5581
5582
5583 DESCRIPTION
5584
5585 The C++ wrapper for PCRE was provided by Google Inc. Some additional
5586 functionality was added by Giuseppe Maxia. This brief man page was con-
5587 structed from the notes in the pcrecpp.h file, which should be con-
5588 sulted for further details.
5589
5590
5591 MATCHING INTERFACE
5592
5593 The "FullMatch" operation checks that supplied text matches a supplied
5594 pattern exactly. If pointer arguments are supplied, it copies matched
5595 sub-strings that match sub-patterns into them.
5596
5597 Example: successful match
5598 pcrecpp::RE re("h.*o");
5599 re.FullMatch("hello");
5600
5601 Example: unsuccessful match (requires full match):
5602 pcrecpp::RE re("e");
5603 !re.FullMatch("hello");
5604
5605 Example: creating a temporary RE object:
5606 pcrecpp::RE("h.*o").FullMatch("hello");
5607
5608 You can pass in a "const char*" or a "string" for "text". The examples
5609 below tend to use a const char*. You can, as in the different examples
5610 above, store the RE object explicitly in a variable or use a temporary
5611 RE object. The examples below use one mode or the other arbitrarily.
5612 Either could correctly be used for any of these examples.
5613
5614 You must supply extra pointer arguments to extract matched subpieces.
5615
5616 Example: extracts "ruby" into "s" and 1234 into "i"
5617 int i;
5618 string s;
5619 pcrecpp::RE re("(\\w+):(\\d+)");
5620 re.FullMatch("ruby:1234", &s, &i);
5621
5622 Example: does not try to extract any extra sub-patterns
5623 re.FullMatch("ruby:1234", &s);
5624
5625 Example: does not try to extract into NULL
5626 re.FullMatch("ruby:1234", NULL, &i);
5627
5628 Example: integer overflow causes failure
5629 !re.FullMatch("ruby:1234567891234", NULL, &i);
5630
5631 Example: fails because there aren't enough sub-patterns:
5632 !pcrecpp::RE("\\w+:\\d+").FullMatch("ruby:1234", &s);
5633
5634 Example: fails because string cannot be stored in integer
5635 !pcrecpp::RE("(.*)").FullMatch("ruby", &i);
5636
5637 The provided pointer arguments can be pointers to any scalar numeric
5638 type, or one of:
5639
5640 string (matched piece is copied to string)
5641 StringPiece (StringPiece is mutated to point to matched piece)
5642 T (where "bool T::ParseFrom(const char*, int)" exists)
5643 NULL (the corresponding matched sub-pattern is not copied)
5644
5645 The function returns true iff all of the following conditions are sat-
5646 isfied:
5647
5648 a. "text" matches "pattern" exactly;
5649
5650 b. The number of matched sub-patterns is >= number of supplied
5651 pointers;
5652
5653 c. The "i"th argument has a suitable type for holding the
5654 string captured as the "i"th sub-pattern. If you pass in
5655 NULL for the "i"th argument, or pass fewer arguments than
5656 number of sub-patterns, "i"th captured sub-pattern is
5657 ignored.
5658
5659 CAVEAT: An optional sub-pattern that does not exist in the matched
5660 string is assigned the empty string. Therefore, the following will
5661 return false (because the empty string is not a valid number):
5662
int number;
pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number);
5665
5666 The matching interface supports at most 16 arguments per call. If you
5667 need more, consider using the more general interface
5668 pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch.
5669
5670
5671 QUOTING METACHARACTERS
5672
5673 You can use the "QuoteMeta" operation to insert backslashes before all
5674 potentially meaningful characters in a string. The returned string,
5675 used as a regular expression, will exactly match the original string.
5676
5677 Example:
5678 string quoted = RE::QuoteMeta(unquoted);
5679
Note that it's legal to escape a character even if it has no special
meaning in a regular expression -- so this function does that. (This
also makes it identical to the Perl function of the same name; see
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes
"1\.5\-2\.0\?".
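
The quoting rule can be sketched in plain C as below. This is not the pcrecpp implementation: the name quote_meta is hypothetical, and the sketch simply puts a backslash before every byte that is not an ASCII letter, digit, or underscore. How bytes outside the ASCII range should be treated is an assumption here; the sketch escapes them too.

```c
#include <ctype.h>
#include <stddef.h>

/* Write a quoted copy of src into dst and return its length. The caller
   must supply room for 2 * strlen(src) + 1 bytes. */
static size_t quote_meta(const char *src, char *dst)
{
    size_t n = 0;

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (!(c < 128 && isalnum(c)) && c != '_')
            dst[n++] = '\\';           /* escape anything that might be special */
        dst[n++] = (char)c;
    }
    dst[n] = '\0';
    return n;
}
```

Applied to "1.5-2.0?", this gives the same result as the example above.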
5685
5686
5687 PARTIAL MATCHES
5688
5689 You can use the "PartialMatch" operation when you want the pattern to
5690 match any substring of the text.
5691
5692 Example: simple search for a string:
5693 pcrecpp::RE("ell").PartialMatch("hello");
5694
5695 Example: find first number in a string:
5696 int number;
5697 pcrecpp::RE re("(\\d+)");
5698 re.PartialMatch("x*100 + 20", &number);
5699 assert(number == 100);
5700
5701
5702 UTF-8 AND THE MATCHING INTERFACE
5703
5704 By default, pattern and text are plain text, one byte per character.
5705 The UTF8 flag, passed to the constructor, causes both pattern and
5706 string to be treated as UTF-8 text, still a byte stream but potentially
5707 multiple bytes per character. In practice, the text is likelier to be
5708 UTF-8 than the pattern, but the match returned may depend on the UTF8
5709 flag, so always use it when matching UTF8 text. For example, "." will
5710 match one byte normally but with UTF8 set may match up to three bytes
5711 of a multi-byte character.
5712
5713 Example:
5714 pcrecpp::RE_Options options;
5715 options.set_utf8();
5716 pcrecpp::RE re(utf8_pattern, options);
5717 re.FullMatch(utf8_string);
5718
5719 Example: using the convenience function UTF8():
5720 pcrecpp::RE re(utf8_pattern, pcrecpp::UTF8());
5721 re.FullMatch(utf8_string);
5722
5723 NOTE: The UTF8 flag is ignored if pcre was not configured with the
5724 --enable-utf8 flag.
5725
5726
5727 PASSING MODIFIERS TO THE REGULAR EXPRESSION ENGINE
5728
5729 PCRE defines some modifiers to change the behavior of the regular
5730 expression engine. The C++ wrapper defines an auxiliary class,
5731 RE_Options, as a vehicle to pass such modifiers to a RE class. Cur-
5732 rently, the following modifiers are supported:
5733
modifier               description                 Perl equivalent

PCRE_CASELESS          case insensitive match      /i
PCRE_MULTILINE         multiple lines match        /m
PCRE_DOTALL            dot matches newlines        /s
PCRE_DOLLAR_ENDONLY    $ matches only at end       N/A
PCRE_EXTRA             strict escape parsing       N/A
PCRE_EXTENDED          ignore whitespace           /x
PCRE_UTF8              handles UTF8 chars          built-in
PCRE_UNGREEDY          reverses * and *?           N/A
PCRE_NO_AUTO_CAPTURE   disables capturing parens   N/A (*)
5745
5746 (*) Both Perl and PCRE allow non capturing parentheses by means of the
5747 "?:" modifier within the pattern itself. e.g. (?:ab|cd) does not cap-
5748 ture, while (ab|cd) does.
5749
5750 For a full account on how each modifier works, please check the PCRE
5751 API reference page.
5752
For each modifier, there are two member functions whose names are made
out of the modifier in lowercase, without the "PCRE_" prefix. For
instance, PCRE_CASELESS is handled by
5756
5757 bool caseless()
5758
5759 which returns true if the modifier is set, and
5760
5761 RE_Options & set_caseless(bool)
5762
which sets or unsets the modifier. Moreover, PCRE_EXTRA_MATCH_LIMIT can
be accessed through the set_match_limit() and match_limit() member
functions. Setting match_limit to a non-zero value limits the execution
of PCRE to keep it from doing bad things like blowing the stack or
taking an eternity to return a result. A value of 5000 is good enough
to stop stack blowup in a 2MB thread stack. Setting match_limit to zero
disables match limiting. Alternatively, you can call
set_match_limit_recursion(), which uses PCRE_EXTRA_MATCH_LIMIT_RECURSION
to limit how much PCRE recurses. The match limit restricts the number
of times PCRE's internal matching function is called; the recursion
limit restricts the depth of internal recursion, and therefore the
amount of stack that is used.
5774
Normally, to pass one or more modifiers to a RE class, you declare a
RE_Options object, set the appropriate options, and pass this object to
a RE constructor. Example:

    RE_Options opt;
    opt.set_caseless(true);
    if (RE("HELLO", opt).PartialMatch("hello world")) ...
5782
RE_Options has two constructors. The default constructor takes no
arguments and creates a set of flags that are all off. The second
constructor takes an option_flags argument, a bitwise OR of PCRE option
bits, to ease the transfer of legacy code from C programs. This lets
you do

    RE(pattern,
       RE_Options(PCRE_CASELESS|PCRE_MULTILINE)).PartialMatch(str);

However, new code is better off doing

    RE(pattern,
       RE_Options().set_caseless(true).set_multiline(true))
        .PartialMatch(str);
5796
If you are going to pass one of the most used modifiers, there are some
convenience functions that return a RE_Options object with the
appropriate modifier already set: CASELESS(), UTF8(), MULTILINE(),
DOTALL(), and EXTENDED().
5801
If you need to set several options at once, and you do not want to
declare a RE_Options object and set each option in a separate
statement, you can chain several set_xxxxx() member functions, since
each of them returns a reference to its class object. For example, to
pass PCRE_CASELESS, PCRE_EXTENDED, and PCRE_MULTILINE to a RE with one
statement, you may write:

    RE(" ^ xyz \\s+ .* blah$",
       RE_Options()
         .set_caseless(true)
         .set_extended(true)
         .set_multiline(true)).PartialMatch(sometext);
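
Chaining of this kind works because every setter returns a reference to
the object it was called on. The following is a minimal, self-contained
sketch of that pattern only. ToyOptions, its flag values, and the helper
function are hypothetical names invented for illustration; this is not
the pcrecpp::RE_Options implementation.

```cpp
#include <cassert>

// Hypothetical stand-in for an options class (NOT pcrecpp::RE_Options).
// The flag values are arbitrary and chosen only for this illustration.
class ToyOptions {
 public:
  enum Flag { kCaseless = 1, kMultiline = 2, kExtended = 4 };

  ToyOptions() : flags_(0) {}                         // all flags off
  explicit ToyOptions(int flags) : flags_(flags) {}   // legacy bit mask

  bool caseless() const { return (flags_ & kCaseless) != 0; }
  bool multiline() const { return (flags_ & kMultiline) != 0; }
  bool extended() const { return (flags_ & kExtended) != 0; }
  int all_flags() const { return flags_; }

  // Each setter returns *this, which is what makes chaining possible.
  ToyOptions& set_caseless(bool on) { return set(kCaseless, on); }
  ToyOptions& set_multiline(bool on) { return set(kMultiline, on); }
  ToyOptions& set_extended(bool on) { return set(kExtended, on); }

 private:
  ToyOptions& set(Flag f, bool on) {
    if (on) {
      flags_ |= f;
    } else {
      flags_ &= ~static_cast<int>(f);
    }
    return *this;
  }

  int flags_;
};

// Builds the three-flag combination from the text in a single expression.
ToyOptions caseless_extended_multiline() {
  ToyOptions opt =
      ToyOptions().set_caseless(true).set_extended(true).set_multiline(true);
  assert(opt.caseless() && opt.extended() && opt.multiline());
  return opt;
}
```

Because the temporary object lives until the end of the full expression,
the chained result can be passed directly as a constructor argument.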
5815
5816
5817 SCANNING TEXT INCREMENTALLY
5818
5819 The "Consume" operation may be useful if you want to repeatedly match
5820 regular expressions at the front of a string and skip over them as they
5821 match. This requires use of the "StringPiece" type, which represents a
5822 sub-range of a real string. Like RE, StringPiece is defined in the
5823 pcrecpp namespace.
5824
Example: read lines of the form "var = value" from a string.

    string contents = ...;                 // Fill string somehow
    pcrecpp::StringPiece input(contents);  // Wrap in a StringPiece

    string var;
    int value;
    pcrecpp::RE re("(\\w+) = (\\d+)\n");
    while (re.Consume(&input, &var, &value)) {
      ...;
    }
5835
Each successful call to "Consume" sets "var" and "value", and also
advances "input" so that it points past the matched text.

The "FindAndConsume" operation is similar to "Consume" but does not
anchor your match at the beginning of the string. For example, you
could extract all words from a string by repeatedly calling

    pcrecpp::RE("(\\w+)").FindAndConsume(&input, &word)
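
The difference between an anchored and an unanchored scan can be tried
out without PCRE. The sketch below swaps in the C++ standard library's
std::regex purely to imitate the behaviour described above; it is not
how pcrecpp is implemented, and consume(), all_words() and
leading_assignments() are hypothetical helper names. It assumes the
pattern can never match the empty string.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Tries to match `re` in [pos, end). If `anchored` is true the match must
// begin exactly at `pos` (Consume-like); otherwise it may begin anywhere at
// or after `pos` (FindAndConsume-like). On success `pos` moves past the match.
bool consume(const std::regex& re, bool anchored,
             std::string::const_iterator& pos,
             std::string::const_iterator end, std::smatch& m) {
  const std::regex_constants::match_flag_type flags =
      anchored ? std::regex_constants::match_continuous
               : std::regex_constants::match_default;
  if (!std::regex_search(pos, end, m, re, flags)) return false;
  // An empty match would never advance `pos`; this sketch does not handle it.
  assert(m[0].first != m[0].second);
  pos = m[0].second;
  return true;
}

// Unanchored scan: collects every word, skipping whatever separates them.
std::vector<std::string> all_words(const std::string& text) {
  const std::regex word("(\\w+)");
  std::vector<std::string> out;
  std::smatch m;
  std::string::const_iterator pos = text.begin();
  while (consume(word, false, pos, text.end(), m)) out.push_back(m[1].str());
  return out;
}

// Anchored scan: reads leading "var = value" lines and stops at the first
// position where the pattern does not match.
std::vector<std::pair<std::string, int> > leading_assignments(
    const std::string& text) {
  const std::regex line("(\\w+) = (\\d+)\n");
  std::vector<std::pair<std::string, int> > out;
  std::smatch m;
  std::string::const_iterator pos = text.begin();
  while (consume(line, true, pos, text.end(), m)) {
    // std::stoi throws if the digits do not fit in an int.
    out.push_back(std::make_pair(m[1].str(), std::stoi(m[2].str())));
  }
  return out;
}
```

The anchored loop stops as soon as the front of the remaining input no
longer fits the pattern, whereas the unanchored loop skips over
non-matching text to reach the next match.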
5844
5845
5846 PARSING HEX/OCTAL/C-RADIX NUMBERS
5847
5848 By default, if you pass a pointer to a numeric value, the corresponding
5849 text is interpreted as a base-10 number. You can instead wrap the
5850 pointer with a call to one of the operators Hex(), Octal(), or CRadix()
5851 to interpret the text in another base. The CRadix operator interprets
5852 C-style "0" (base-8) and "0x" (base-16) prefixes, but defaults to
5853 base-10.
5854
Example:

    int a, b, c, d;
    pcrecpp::RE re("(.*) (.*) (.*) (.*)");
    re.FullMatch("100 40 0100 0x40",
                 pcrecpp::Octal(&a), pcrecpp::Hex(&b),
                 pcrecpp::CRadix(&c), pcrecpp::CRadix(&d));

will leave 64 in a, b, c, and d.
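
The prefix rules described for CRadix are the same ones that the
standard C function strtol() applies when it is given base 0. The sketch
below uses strtol() to reproduce the arithmetic of the example; it is an
analogy only, parse_with_radix() is a hypothetical helper, and the
wrapper's own parsing may differ in detail (for instance in how it
treats whitespace or overflow).

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>

// Parses all of `text` as an integer in `base`. Base 8 and base 16 play the
// role of Octal() and Hex(); base 0 plays the role of CRadix(): a leading
// "0x" selects base 16, a leading "0" selects base 8, otherwise base 10.
// Returns false if nothing was parsed, if trailing characters remain, or if
// the value is out of range for long.
bool parse_with_radix(const char* text, int base, long* out) {
  assert(text != NULL && out != NULL);
  assert(base == 0 || (base >= 2 && base <= 36));
  char* end = NULL;
  errno = 0;
  const long value = std::strtol(text, &end, base);
  if (end == text || *end != '\0' || errno == ERANGE) return false;
  *out = value;
  return true;
}
```

With this helper, "100" in base 8, "40" in base 16, and both "0100" and
"0x40" in base 0 all yield 64, matching the example.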
5863
5864
5865 REPLACING PARTS OF STRINGS
5866
You can replace the first match of "pattern" in "str" with "rewrite".
Within "rewrite", backslash-escaped digits (\1 to \9) can be used to
insert the text that matched the corresponding parenthesized group of
the pattern. \0 in "rewrite" refers to the entire matching text. For
example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").Replace("d", &s);

will leave "s" containing "yada dabba doo". The result is true if the
pattern matches and a replacement occurs, false otherwise.
5877
GlobalReplace is like Replace except that it replaces all occurrences
of the pattern in the string with the rewrite. Replacements are not
subject to re-matching. For example:

    string s = "yabba dabba doo";
    pcrecpp::RE("b+").GlobalReplace("d", &s);

will leave "s" containing "yada dada doo". It returns the number of
replacements made.

Extract is like Replace, except that if the pattern matches, "rewrite"
is copied into "out" (an additional argument) with substitutions. The
non-matching portions of the input "text" are ignored. Extract returns
true if and only if a match occurred and the extraction succeeded; if no
match occurs, "out" is left unaffected.
5893
5894
5895 AUTHOR
5896
5897 The C++ wrapper was contributed by Google Inc.
5898 Copyright (c) 2007 Google Inc.
5899
5900
5901 REVISION
5902
5903 Last updated: 06 March 2007
5904 ------------------------------------------------------------------------------
5905
5906
5907 PCRESAMPLE(3) PCRESAMPLE(3)
5908
5909
5910 NAME
5911 PCRE - Perl-compatible regular expressions
5912
5913
5914 PCRE SAMPLE PROGRAM
5915
5916 A simple, complete demonstration program, to get you started with using
5917 PCRE, is supplied in the file pcredemo.c in the PCRE distribution.
5918
5919 The program compiles the regular expression that is its first argument,
5920 and matches it against the subject string in its second argument. No
5921 PCRE options are set, and default character tables are used. If match-
5922 ing succeeds, the program outputs the portion of the subject that
5923 matched, together with the contents of any captured substrings.
5924
5925 If the -g option is given on the command line, the program then goes on
5926 to check for further matches of the same regular expression in the same
5927 subject string. The logic is a little bit tricky because of the possi-
5928 bility of matching an empty string. Comments in the code explain what
5929 is going on.
5930
The demonstration program is automatically built if you use
"./configure;make" to build PCRE. Otherwise, if PCRE is installed in the
standard include and library directories for your system, you should be
able to compile the demonstration program using this command:

    gcc -o pcredemo pcredemo.c -lpcre

If PCRE is installed elsewhere, you may need to add additional options
to the command line. For example, on a Unix-like system that has PCRE
installed in /usr/local, you can compile the demonstration program
using a command like this:

    gcc -o pcredemo -I/usr/local/include pcredemo.c \
        -L/usr/local/lib -lpcre
5945
Once you have compiled the demonstration program, you can run simple
tests like this:

    ./pcredemo 'cat|dog' 'the cat sat on the mat'
    ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
5951
5952 Note that there is a much more comprehensive test program, called
5953 pcretest, which supports many more facilities for testing regular
5954 expressions and the PCRE library. The pcredemo program is provided as a
5955 simple coding example.
5956
On some operating systems (e.g. Solaris), when PCRE is not installed in
the standard library directory, you may get an error like this when you
try to run pcredemo:

    ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or directory

This is caused by the way shared library support works on those
systems. You need to add

    -R/usr/local/lib

(for example) to the compile command to get round this problem.
5970
5971
5972 AUTHOR
5973
5974 Philip Hazel
5975 University Computing Service
5976 Cambridge CB2 3QH, England.
5977
5978
5979 REVISION
5980
5981 Last updated: 13 June 2007
5982 Copyright (c) 1997-2007 University of Cambridge.
5983 ------------------------------------------------------------------------------
5984 PCRESTACK(3) PCRESTACK(3)
5985
5986
5987 NAME
5988 PCRE - Perl-compatible regular expressions
5989
5990
5991 PCRE DISCUSSION OF STACK USAGE
5992
5993 When you call pcre_exec(), it makes use of an internal function called
5994 match(). This calls itself recursively at branch points in the pattern,
5995 in order to remember the state of the match so that it can back up and
5996 try a different alternative if the first one fails. As matching pro-
5997 ceeds deeper and deeper into the tree of possibilities, the recursion
5998 depth increases.
5999
6000 Not all calls of match() increase the recursion depth; for an item such
6001 as a* it may be called several times at the same level, after matching
6002 different numbers of a's. Furthermore, in a number of cases where the
6003 result of the recursive call would immediately be passed back as the
6004 result of the current call (a "tail recursion"), the function is just
6005 restarted instead.
6006
6007 The pcre_dfa_exec() function operates in an entirely different way, and
6008 hardly uses recursion at all. The limit on its complexity is the amount
6009 of workspace it is given. The comments that follow do NOT apply to
6010 pcre_dfa_exec(); they are relevant only for pcre_exec().
6011
6012 You can set limits on the number of times that match() is called, both
6013 in total and recursively. If the limit is exceeded, an error occurs.
6014 For details, see the section on extra data for pcre_exec() in the
6015 pcreapi documentation.
6016
6017 Each time that match() is actually called recursively, it uses memory
6018 from the process stack. For certain kinds of pattern and data, very
6019 large amounts of stack may be needed, despite the recognition of "tail
6020 recursion". You can often reduce the amount of recursion, and there-
6021 fore the amount of stack used, by modifying the pattern that is being
6022 matched. Consider, for example, this pattern:
6023
    ([^<]|<(?!inet))+
6025
6026 It matches from wherever it starts until it encounters "<inet" or the
6027 end of the data, and is the kind of pattern that might be used when
6028 processing an XML file. Each iteration of the outer parentheses matches
6029 either one character that is not "<" or a "<" that is not followed by
6030 "inet". However, each time a parenthesis is processed, a recursion
6031 occurs, so this formulation uses a stack frame for each matched charac-
6032 ter. For a long string, a lot of stack is required. Consider now this
6033 rewritten pattern, which matches exactly the same strings:
6034
    ([^<]++|<(?!inet))+
6036
6037 This uses very much less stack, because runs of characters that do not
6038 contain "<" are "swallowed" in one item inside the parentheses. Recur-
6039 sion happens only when a "<" character that is not followed by "inet"
6040 is encountered (and we assume this is relatively rare). A possessive
6041 quantifier is used to stop any backtracking into the runs of non-"<"
6042 characters, but that is not related to stack usage.
6043
6044 This example shows that one way of avoiding stack problems when match-
6045 ing long subject strings is to write repeated parenthesized subpatterns
6046 to match more than one character whenever possible.
6047
6048 In environments where stack memory is constrained, you might want to
6049 compile PCRE to use heap memory instead of stack for remembering back-
6050 up points. This makes it run a lot more slowly, however. Details of how
6051 to do this are given in the pcrebuild documentation. When built in this
6052 way, instead of using the stack, PCRE obtains and frees memory by call-
6053 ing the functions that are pointed to by the pcre_stack_malloc and
6054 pcre_stack_free variables. By default, these point to malloc() and
6055 free(), but you can replace the pointers to cause PCRE to use your own
6056 functions. Since the block sizes are always the same, and are always
6057 freed in reverse order, it may be possible to implement customized mem-
6058 ory handlers that are more efficient than the standard functions.
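
As one possible illustration of such a handler, the sketch below keeps
released blocks on a free list and hands them out again, so that
malloc() is called only when the list is empty. It relies solely on the
same-size property stated above. Everything in the frame_pool namespace
is a hypothetical name; the code is not thread-safe and is a sketch
under those assumptions, not a tested replacement for the defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

namespace frame_pool {

// A released block is reused to store the link to the next free block.
struct FreeNode { FreeNode* next; };

static FreeNode* g_free_list = NULL;  // released blocks kept for reuse
static std::size_t g_block_size = 0;  // learned from the first request
static std::size_t g_live = 0;        // blocks currently handed out

// Same shape as malloc(): returns a block of `size` bytes or NULL.
void* pool_malloc(std::size_t size) {
  if (g_block_size == 0) g_block_size = size;
  // The sketch relies on every request having the same size, and on a
  // block being large enough to hold the free-list link.
  assert(size == g_block_size && size >= sizeof(FreeNode));
  void* block;
  if (g_free_list != NULL) {
    block = g_free_list;               // reuse the most recently freed block
    g_free_list = g_free_list->next;
  } else {
    block = std::malloc(size);         // fall back to the system allocator
  }
  if (block != NULL) ++g_live;
  return block;
}

// Same shape as free(): the block is cached instead of being released.
void pool_free(void* block) {
  if (block == NULL) return;
  FreeNode* node = static_cast<FreeNode*>(block);
  node->next = g_free_list;
  g_free_list = node;
  --g_live;
}

// Returns all cached blocks to the system allocator.
void pool_release() {
  while (g_free_list != NULL) {
    FreeNode* node = g_free_list;
    g_free_list = node->next;
    std::free(node);
  }
}

std::size_t live_blocks() { return g_live; }

}  // namespace frame_pool
```

Functions of this shape could then be installed by storing their
addresses in the pcre_stack_malloc and pcre_stack_free variables before
matching begins.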
6059
In Unix-like environments, there is not often a problem with the stack
unless very long strings are involved, though the default limit on
stack size varies from system to system. Values from 8MB to 64MB are
common. You can find your default limit by running the command:

    ulimit -s
6066
Unfortunately, the effect of running out of stack is often SIGSEGV,
though sometimes a more explicit error message is given. You can
normally increase the limit on stack size by code such as this:

    #include <sys/resource.h>

    struct rlimit rlim;
    getrlimit(RLIMIT_STACK, &rlim);    /* read soft and hard limits */
    rlim.rlim_cur = 100*1024*1024;     /* request a 100MB soft limit */
    setrlimit(RLIMIT_STACK, &rlim);

This reads the current limits (soft and hard) using getrlimit(), then
attempts to increase the soft limit to 100MB using setrlimit(). You
must do this before calling pcre_exec().
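
A slightly more defensive variant checks the return values, never
lowers an existing limit, and does not ask for more than the hard limit
allows. The following is a sketch for POSIX systems only;
raise_stack_soft_limit() is a hypothetical helper name, not part of
PCRE.

```cpp
#include <cassert>
#include <sys/resource.h>

// Tries to make the soft stack limit at least `wanted_bytes`, capped at the
// hard limit. Returns false only if getrlimit() or setrlimit() fails.
bool raise_stack_soft_limit(rlim_t wanted_bytes) {
  assert(wanted_bytes > 0);
  struct rlimit rl;
  if (getrlimit(RLIMIT_STACK, &rl) != 0) return false;

  // An unprivileged process cannot raise the soft limit above the hard one.
  rlim_t target = wanted_bytes;
  if (rl.rlim_max != RLIM_INFINITY && target > rl.rlim_max) {
    target = rl.rlim_max;
  }

  // Nothing to do if the current soft limit is already large enough.
  if (rl.rlim_cur == RLIM_INFINITY || rl.rlim_cur >= target) return true;

  rl.rlim_cur = target;
  return setrlimit(RLIMIT_STACK, &rl) == 0;
}
```

A caller would invoke this once, early in the program and before the
first call to pcre_exec(), for example with 100*1024*1024 as the
argument.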
6079
6080 PCRE has an internal counter that can be used to limit the depth of
6081 recursion, and thus cause pcre_exec() to give an error code before it
6082 runs out of stack. By default, the limit is very large, and unlikely
6083 ever to operate. It can be changed when PCRE is built, and it can also
6084 be set when pcre_exec() is called. For details of these interfaces, see
6085 the pcrebuild and pcreapi documentation.
6086
As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to 8MB, you
should set the limit at 16000 recursions (8,000,000 / 500). A 64MB
stack, on the other hand, can support around 128000 recursions. The
pcretest test program has a command line option (-S) that can be used to
increase the size of its stack.
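
The rule of thumb is a single division, sketched here so that other
stack sizes can be tried. The 500-byte figure is the rough estimate from
the text, and recursion_limit_for() is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>

// Rough per-recursion stack cost quoted in the text; real usage varies.
const std::size_t kBytesPerRecursion = 500;

// Estimates how many recursions fit into `stack_bytes` of stack.
std::size_t recursion_limit_for(
    std::size_t stack_bytes,
    std::size_t bytes_per_recursion = kBytesPerRecursion) {
  assert(bytes_per_recursion > 0);
  return stack_bytes / bytes_per_recursion;
}
```

For 8,000,000 bytes this gives 16000, and for 64,000,000 bytes it gives
128000. In practice it would be prudent to choose a smaller limit than
the estimate, since the rest of the program also uses the stack.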
6093
6094
6095 AUTHOR
6096
6097 Philip Hazel
6098 University Computing Service
6099 Cambridge CB2 3QH, England.
6100
6101
6102 REVISION
6103
6104 Last updated: 05 June 2007
6105 Copyright (c) 1997-2007 University of Cambridge.
6106 ------------------------------------------------------------------------------