Contents of /code/trunk/doc/pcre.txt

Revision 71
Sat Feb 24 21:40:24 2007 UTC (14 years, 1 month ago) by nigel
File MIME type: text/plain
File size: 144666 byte(s)
Load pcre-4.4 into code/trunk.
1 This file contains a concatenation of the PCRE man pages, converted to plain
2 text format for ease of searching with a text editor, or for use on systems
3 that do not have a man page processor. The small individual files that give
4 synopses of each function in the library have not been included. There are
5 separate text files for the pcregrep and pcretest commands.
6 -----------------------------------------------------------------------------
9 PCRE - Perl-compatible regular expressions
14 The PCRE library is a set of functions that implement regu-
15 lar expression pattern matching using the same syntax and
16 semantics as Perl, with just a few differences. The current
17 implementation of PCRE (release 4.x) corresponds approxi-
18 mately with Perl 5.8, including support for UTF-8 encoded
19 strings. However, this support has to be explicitly
20 enabled; it is not the default.
22 PCRE is written in C and released as a C library. However, a
23 number of people have written wrappers and interfaces of
24 various kinds. A C++ class is included in these contribu-
25 tions, which can be found in the Contrib directory at the
26 primary FTP site, which is:
28 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
30 Details of exactly which Perl regular expression features
31 are and are not supported by PCRE are given in separate
32 documents. See the pcrepattern and pcrecompat pages.
34 Some features of PCRE can be included, excluded, or changed
35 when the library is built. The pcre_config() function makes
36 it possible for a client to discover which features are
37 available. Documentation about building PCRE for various
38 operating systems can be found in the README file in the
39 source distribution.
44 The user documentation for PCRE has been split up into a
45 number of different sections. In the "man" format, each of
46 these is a separate "man page". In the HTML format, each is
47 a separate page, linked from the index page. In the plain
48 text format, all the sections are concatenated, for ease of
49 searching. The sections are as follows:
51 pcre this document
52 pcreapi details of PCRE's native API
53 pcrebuild options for building PCRE
54 pcrecallout details of the callout feature
55 pcrecompat discussion of Perl compatibility
56 pcregrep description of the pcregrep command
57 pcrepattern syntax and semantics of supported
58 regular expressions
59 pcreperform discussion of performance issues
60 pcreposix the POSIX-compatible API
61 pcresample discussion of the sample program
62 pcretest the pcretest testing command
64 In addition, in the "man" and HTML formats, there is a short
65 page for each library function, listing its arguments and
66 results.
71 There are some size limitations in PCRE but it is hoped that
72 they will never in practice be relevant.
74 The maximum length of a compiled pattern is 65539 (sic)
75 bytes if PCRE is compiled with the default internal linkage
76 size of 2. If you want to process regular expressions that
77 are truly enormous, you can compile PCRE with an internal
78 linkage size of 3 or 4 (see the README file in the source
79 distribution and the pcrebuild documentation for details).
80 In these cases the limit is substantially larger. However,
81 the speed of execution will be slower.
83 All values in repeating quantifiers must be less than 65536.
84 The maximum number of capturing subpatterns is 65535.
86 There is no limit to the number of non-capturing subpat-
87 terns, but the maximum depth of nesting of all kinds of
88 parenthesized subpattern, including capturing subpatterns,
89 assertions, and other types of subpattern, is 200.
91 The maximum length of a subject string is the largest posi-
92 tive number that an integer variable can hold. However, PCRE
93 uses recursion to handle subpatterns and indefinite repeti-
94 tion. This means that the available stack space may limit
95 the size of a subject string that can be processed by cer-
96 tain patterns.
101 Starting at release 3.3, PCRE has had some support for char-
102 acter strings encoded in the UTF-8 format. For release 4.0
103 this has been greatly extended to cover most common require-
104 ments.
106 In order to process UTF-8 strings, you must build PCRE to
107 include UTF-8 support in the code, and, in addition, you
108 must call pcre_compile() with the PCRE_UTF8 option flag.
109 When you do this, both the pattern and any subject strings
110 that are matched against it are treated as UTF-8 strings
111 instead of just strings of bytes.
113 If you compile PCRE with UTF-8 support, but do not use it at
114 run time, the library will be a bit bigger, but the addi-
115 tional run time overhead is limited to testing the PCRE_UTF8
116 flag in several places, so should not be very large.
118 The following comments apply when PCRE is running in UTF-8
119 mode:
121 1. When you set the PCRE_UTF8 flag, the strings passed as
122 patterns and subjects are checked for validity on entry to
123 the relevant functions. If an invalid UTF-8 string is
124 passed, an error return is given. In some situations, you
125 may already know that your strings are valid, and therefore
126 want to skip these checks in order to improve performance.
127 If you set the PCRE_NO_UTF8_CHECK flag at compile time or at
128 run time, PCRE assumes that the pattern or subject it is
129 given (respectively) contains only valid UTF-8 codes. In
130 this case, it does not diagnose an invalid UTF-8 string. If
131 you pass an invalid UTF-8 string to PCRE when
132 PCRE_NO_UTF8_CHECK is set, the results are undefined. Your
133 program may crash.
135 2. In a pattern, the escape sequence \x{...}, where the con-
136 tents of the braces is a string of hexadecimal digits, is
137 interpreted as a UTF-8 character whose code number is the
138 given hexadecimal number, for example: \x{1234}. If a non-
139 hexadecimal digit appears between the braces, the item is
140 not recognized. This escape sequence can be used either as
141 a literal, or within a character class.
143 3. The original hexadecimal escape sequence, \xhh, matches a
144 two-byte UTF-8 character if the value is greater than 127.
146 4. Repeat quantifiers apply to complete UTF-8 characters,
147 not to individual bytes, for example: \x{100}{3}.
149 5. The dot metacharacter matches one UTF-8 character instead
150 of a single byte.
152 6. The escape sequence \C can be used to match a single byte
153 in UTF-8 mode, but its use can lead to some strange effects.
155 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W
156 correctly test characters of any code value, but the charac-
157 ters that PCRE recognizes as digits, spaces, or word charac-
158 ters remain the same set as before, all with values less
159 than 256.
161 8. Case-insensitive matching applies only to characters
162 whose values are less than 256. PCRE does not support the
163 notion of "case" for higher-valued characters.
165 9. PCRE does not support the use of Unicode tables and pro-
166 perties or the Perl escapes \p, \P, and \X.
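The byte-level effect of points 2 to 5 can be illustrated without PCRE itself. The following is a minimal sketch, assuming the modern UTF-8 range up to U+10FFFF (PCRE's own checks may accept a different range); the function names utf8_encode and utf8_char_count are hypothetical and are not part of the PCRE API. It shows that \x{1234} corresponds to three bytes, that a value such as 0xE9 (greater than 127) becomes a two-byte character, and that character counts differ from byte counts.

```c
#include <stddef.h>

/* Encode one code point as UTF-8 into out (at least 4 bytes).
   Returns the number of bytes written, or 0 if cp is above U+10FFFF.
   Sketch only: surrogate code points are not rejected. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Count the characters in a valid UTF-8 string of len bytes by
   skipping continuation bytes, which have the form 10xxxxxx. */
size_t utf8_char_count(const unsigned char *s, size_t len)
{
    size_t i, count = 0;
    for (i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For instance, a subject holding U+0100 three times occupies six bytes but counts as three characters, which is the unit that a quantifier such as {3} applies to in UTF-8 mode.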
171 Philip Hazel <ph10@cam.ac.uk>
172 University Computing Service,
173 Cambridge CB2 3QG, England.
174 Phone: +44 1223 334714
176 Last updated: 20 August 2003
177 Copyright (c) 1997-2003 University of Cambridge.
178 -----------------------------------------------------------------------------
180 NAME
181 PCRE - Perl-compatible regular expressions
186 This document describes the optional features of PCRE that
187 can be selected when the library is compiled. They are all
188 selected, or deselected, by providing options to the config-
189 ure script which is run before the make command. The com-
190 plete list of options for configure (which includes the
191 standard ones such as the selection of the installation
192 directory) can be obtained by running
194 ./configure --help
196 The following sections describe certain options whose names
197 begin with --enable or --disable. These settings specify
198 changes to the defaults for the configure command. Because
199 of the way that configure works, --enable and --disable
200 always come in pairs, so the complementary option always
201 exists as well, but as it specifies the default, it is not
202 described.
207 To build PCRE with support for UTF-8 character strings, add
209 --enable-utf8
211 to the configure command. Of itself, this does not make PCRE
212 treat strings as UTF-8. As well as compiling PCRE with this
213 option, you also have to set the PCRE_UTF8 option when
214 you call the pcre_compile() function.
219 By default, PCRE treats character 10 (linefeed) as the new-
220 line character. This is the normal newline character on
221 Unix-like systems. You can compile PCRE to use character 13
222 (carriage return) instead by adding
224 --enable-newline-is-cr
226 to the configure command. For completeness there is also a
227 --enable-newline-is-lf option, which explicitly specifies
228 linefeed as the newline character.
233 The PCRE building process uses libtool to build both shared
234 and static Unix libraries by default. You can suppress one
235 of these by adding one of
237 --disable-shared
238 --disable-static
240 to the configure command, as required.
245 When PCRE is called through the POSIX interface (see the
246 pcreposix documentation), additional working storage is
247 required for holding the pointers to capturing substrings
248 because PCRE requires three integers per substring, whereas
249 the POSIX interface provides only two. If the number of
250 expected substrings is small, the wrapper function uses
251 space on the stack, because this is faster than using mal-
252 loc() for each call. The default threshold above which the
253 stack is no longer used is 10; it can be changed by adding a
254 setting such as
256 --with-posix-malloc-threshold=20
258 to the configure command.
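The stack-versus-malloc() decision described above can be sketched roughly as follows. This is not the wrapper's actual code: MALLOC_THRESHOLD and run_with_workspace are hypothetical names, and the matching step is replaced by a loop that merely touches the workspace.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the configured threshold (default 10). */
#define MALLOC_THRESHOLD 10

/* Obtain workspace of three ints per expected substring.
   Returns 1 if the heap was used, 0 if the stack array sufficed,
   and -1 if malloc() failed. */
int run_with_workspace(size_t nmatch)
{
    int small[MALLOC_THRESHOLD * 3];   /* cheap stack workspace */
    int *ovector;
    int used_heap = 0;
    size_t i;

    if (nmatch > MALLOC_THRESHOLD) {
        ovector = malloc(nmatch * 3 * sizeof(int));
        if (ovector == NULL)
            return -1;
        used_heap = 1;
    } else {
        ovector = small;
    }

    /* A real wrapper would run the match here; this sketch only
       initializes the workspace to show that it is usable. */
    for (i = 0; i < nmatch * 3; i++)
        ovector[i] = -1;

    if (used_heap)
        free(ovector);
    return used_heap;
}
```

With the default threshold, a request for 10 substrings stays on the stack, while a request for 11 goes to the heap.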
263 Internally, PCRE has a function called match() which it
264 calls repeatedly (possibly recursively) when performing a
265 matching operation. By limiting the number of times this
266 function may be called, a limit can be placed on the
267 resources used by a single call to pcre_exec(). The limit
268 can be changed at run time, as described in the pcreapi
269 documentation. The default is 10 million, but this can be
270 changed by adding a setting such as
272 --with-match-limit=500000
274 to the configure command.
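The idea of bounding work by counting calls to an internal matching function can be shown on a toy matcher. The sketch below is not PCRE's match() function. It is a tiny anchored matcher that understands only literal characters, "." and "*", and every name in it is hypothetical. What it has in common with the mechanism described above is the counter that is incremented on each call and compared against a limit.

```c
#define MATCH_LIMIT_EXCEEDED (-1)

struct match_state {
    unsigned long calls;   /* calls made so far */
    unsigned long limit;   /* maximum number of calls allowed */
};

/* Returns 1 for a match, 0 for no match, or MATCH_LIMIT_EXCEEDED. */
static int match_here(struct match_state *st, const char *re,
                      const char *text)
{
    if (++st->calls > st->limit)
        return MATCH_LIMIT_EXCEEDED;
    if (re[0] == '\0')
        return 1;
    if (re[1] == '*') {
        /* Try the rest of the pattern after zero, one, two, ...
           repetitions of re[0]. */
        const char *t = text;
        for (;;) {
            int rc = match_here(st, re + 2, t);
            if (rc != 0)
                return rc;   /* a match, or the limit was hit */
            if (*t == '\0' || (re[0] != '.' && re[0] != *t))
                return 0;
            t++;
        }
    }
    if (*text != '\0' && (re[0] == '.' || re[0] == *text))
        return match_here(st, re + 1, text + 1);
    return 0;
}

/* Match re at the start of text, giving up after limit calls. */
int limited_match(const char *re, const char *text, unsigned long limit)
{
    struct match_state st;
    st.calls = 0;
    st.limit = limit;
    return match_here(&st, re, text);
}
```

In this sketch, matching "a*b" against "aaab" needs six calls, so it succeeds with a generous limit and is abandoned with a limit of 3.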
279 Within a compiled pattern, offset values are used to point
280 from one part to another (for example, from an opening
281 parenthesis to an alternation metacharacter). By default
282 two-byte values are used for these offsets, leading to a
283 maximum size for a compiled pattern of around 64K. This is
284 sufficient to handle all but the most gigantic patterns.
285 Nevertheless, some people do want to process enormous pat-
286 terns, so it is possible to compile PCRE to use three-byte
287 or four-byte offsets by adding a setting such as
289 --with-link-size=3
291 to the configure command. The value given must be 2, 3, or
292 4. Using longer offsets slows down the operation of PCRE
293 because it has to load additional bytes when handling them.
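The trade-off between link size and pattern size comes down to how many bytes each stored offset occupies. The following sketch assumes that offsets are stored most significant byte first, which this document does not state for link offsets; put_link, get_link and max_link are hypothetical names, not PCRE internals.

```c
/* Store value in link_size bytes, most significant byte first. */
void put_link(unsigned char *code, int link_size, unsigned long value)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        code[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Load an offset of link_size bytes; one extra byte is read for
   each increase in link size. */
unsigned long get_link(const unsigned char *code, int link_size)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < link_size; i++)
        value = (value << 8) | code[i];
    return value;
}

/* Largest offset that fits in link_size bytes (2 gives 65535). */
unsigned long max_link(int link_size)
{
    if (link_size >= (int)sizeof(unsigned long))
        return ~0UL;
    return (1UL << (8 * link_size)) - 1;
}
```

A two-byte link therefore tops out at 65535, which is where the "around 64K" figure comes from; the library may impose lower practical limits than the arithmetic maximum for sizes 3 and 4.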
295 If you build PCRE with an increased link size, test 2 (and
296 test 5 if you are using UTF-8) will fail. Part of the output
297 of these tests is a representation of the compiled pattern,
298 and this changes with the link size.
300 Last updated: 21 January 2003
301 Copyright (c) 1997-2003 University of Cambridge.
302 -----------------------------------------------------------------------------
304 NAME
305 PCRE - Perl-compatible regular expressions
310 #include <pcre.h>
312 pcre *pcre_compile(const char *pattern, int options,
313 const char **errptr, int *erroffset,
314 const unsigned char *tableptr);
316 pcre_extra *pcre_study(const pcre *code, int options,
317 const char **errptr);
319 int pcre_exec(const pcre *code, const pcre_extra *extra,
320 const char *subject, int length, int startoffset,
321 int options, int *ovector, int ovecsize);
323 int pcre_copy_named_substring(const pcre *code,
324 const char *subject, int *ovector,
325 int stringcount, const char *stringname,
326 char *buffer, int buffersize);
328 int pcre_copy_substring(const char *subject, int *ovector,
329 int stringcount, int stringnumber, char *buffer,
330 int buffersize);
332 int pcre_get_named_substring(const pcre *code,
333 const char *subject, int *ovector,
334 int stringcount, const char *stringname,
335 const char **stringptr);
337 int pcre_get_stringnumber(const pcre *code,
338 const char *name);
340 int pcre_get_substring(const char *subject, int *ovector,
341 int stringcount, int stringnumber,
342 const char **stringptr);
344 int pcre_get_substring_list(const char *subject,
345 int *ovector, int stringcount, const char ***listptr);
347 void pcre_free_substring(const char *stringptr);
349 void pcre_free_substring_list(const char **stringptr);
351 const unsigned char *pcre_maketables(void);
353 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
354 int what, void *where);
357 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
359 int pcre_config(int what, void *where);
361 char *pcre_version(void);
363 void *(*pcre_malloc)(size_t);
365 void (*pcre_free)(void *);
367 int (*pcre_callout)(pcre_callout_block *);
372 PCRE has its own native API, which is described in this
373 document. There is also a set of wrapper functions that
374 correspond to the POSIX regular expression API. These are
375 described in the pcreposix documentation.
377 The native API function prototypes are defined in the header
378 file pcre.h, and on Unix systems the library itself is
379 called libpcre.a, so can be accessed by adding -lpcre to the
380 command for linking an application which calls it. The
381 header file defines the macros PCRE_MAJOR and PCRE_MINOR to
382 contain the major and minor release numbers for the library.
383 Applications can use these to include support for different
384 releases.
386 The functions pcre_compile(), pcre_study(), and pcre_exec()
387 are used for compiling and matching regular expressions. A
388 sample program that demonstrates the simplest way of using
389 them is given in the file pcredemo.c. The pcresample docu-
390 mentation describes how to run it.
392 There are convenience functions for extracting captured sub-
393 strings from a matched subject string. They are:
395 pcre_copy_substring()
396 pcre_copy_named_substring()
397 pcre_get_substring()
398 pcre_get_named_substring()
399 pcre_get_substring_list()
401 pcre_free_substring() and pcre_free_substring_list() are
402 also provided, to free the memory used for extracted
403 strings.
405 The function pcre_maketables() is used (optionally) to build
406 a set of character tables in the current locale for passing
407 to pcre_compile().
409 The function pcre_fullinfo() is used to find out information
410 about a compiled pattern; pcre_info() is an obsolete version
411 which returns only some of the available information, but is
412 retained for backwards compatibility. The function
413 pcre_version() returns a pointer to a string containing the
414 version of PCRE and its date of release.
416 The global variables pcre_malloc and pcre_free initially
417 contain the entry points of the standard malloc() and free()
418 functions respectively. PCRE calls the memory management
419 functions via these variables, so a calling program can
420 replace them if it wishes to intercept the calls. This
421 should be done before calling any PCRE functions.
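The interception technique can be sketched without linking against the library. In the fragment below, demo_malloc and demo_free are hypothetical stand-ins for the pcre_malloc and pcre_free variables, and the counting functions are illustrative only.

```c
#include <stdlib.h>

/* Hypothetical stand-ins for the library's global pointers, which
   start out pointing at the standard functions. */
void *(*demo_malloc)(size_t) = malloc;
void (*demo_free)(void *) = free;

static unsigned long live_blocks = 0;

/* Replacement allocator that counts blocks still in use. */
static void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

static void counting_free(void *p)
{
    if (p != NULL)
        live_blocks--;
    free(p);
}

/* Install the replacements; do this before any allocation is made
   through the pointers. */
void install_counting_hooks(void)
{
    demo_malloc = counting_malloc;
    demo_free = counting_free;
}

unsigned long live_block_count(void)
{
    return live_blocks;
}
```

With the real library, the same pattern would assign the replacement functions to pcre_malloc and pcre_free before the first PCRE call, so that every block is allocated and freed by a matching pair.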
423 The global variable pcre_callout initially contains NULL. It
424 can be set by the caller to a "callout" function, which PCRE
425 will then call at specified points during a matching opera-
426 tion. Details are given in the pcrecallout documentation.
431 The PCRE functions can be used in multi-threading applica-
432 tions, with the proviso that the memory management functions
433 pointed to by pcre_malloc and pcre_free, and the callout
434 function pointed to by pcre_callout, are shared by all
435 threads.
437 The compiled form of a regular expression is not altered
438 during matching, so the same compiled pattern can safely be
439 used by several threads at once.
444 int pcre_config(int what, void *where);
446 The function pcre_config() makes it possible for a PCRE
447 client to discover which optional features have been com-
448 piled into the PCRE library. The pcrebuild documentation has
449 more details about these optional features.
451 The first argument for pcre_config() is an integer, specify-
452 ing which information is required; the second argument is a
453 pointer to a variable into which the information is placed.
454 The following information is available:
PCRE_CONFIG_UTF8
458 The output is an integer that is set to one if UTF-8 support
459 is available; otherwise it is set to zero.
PCRE_CONFIG_NEWLINE
463 The output is an integer that is set to the value of the
464 code that is used for the newline character. It is either
465 linefeed (10) or carriage return (13), and should normally
466 be the standard character for your operating system.
PCRE_CONFIG_LINK_SIZE
470 The output is an integer that contains the number of bytes
471 used for internal linkage in compiled regular expressions.
472 The value is 2, 3, or 4. Larger values allow larger regular
473 expressions to be compiled, at the expense of slower match-
474 ing. The default value of 2 is sufficient for all but the
475 most massive patterns, since it allows the compiled pattern
476 to be up to 64K in size.
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
480 The output is an integer that contains the threshold above
481 which the POSIX interface uses malloc() for output vectors.
482 Further details are given in the pcreposix documentation.
PCRE_CONFIG_MATCH_LIMIT
486 The output is an integer that gives the default limit for
487 the number of internal matching function calls in a
488 pcre_exec() execution. Further details are given with
489 pcre_exec() below.
494 pcre *pcre_compile(const char *pattern, int options,
495 const char **errptr, int *erroffset,
496 const unsigned char *tableptr);
498 The function pcre_compile() is called to compile a pattern
499 into an internal form. The pattern is a C string terminated
500 by a binary zero, and is passed in the argument pattern. A
501 pointer to a single block of memory that is obtained via
502 pcre_malloc is returned. This contains the compiled code and
503 related data. The pcre type is defined for the returned
504 block; this is a typedef for a structure whose contents are
505 not externally defined. It is up to the caller to free the
506 memory when it is no longer required.
508 Although the compiled code of a PCRE regex is relocatable,
509 that is, it does not depend on memory location, the complete
510 pcre data block is not fully relocatable, because it con-
511 tains a copy of the tableptr argument, which is an address
512 (see below).
513 The options argument contains independent bits that affect
514 the compilation. It should be zero if no options are
515 required. Some of the options, in particular, those that are
516 compatible with Perl, can also be set and unset from within
517 the pattern (see the detailed description of regular expres-
518 sions in the pcrepattern documentation). For these options,
519 the contents of the options argument specify their initial
520 settings at the start of compilation and execution. The
521 PCRE_ANCHORED option can be set at the time of matching as
522 well as at compile time.
524 If errptr is NULL, pcre_compile() returns NULL immediately.
525 Otherwise, if compilation of a pattern fails, pcre_compile()
526 returns NULL, and sets the variable pointed to by errptr to
527 point to a textual error message. The offset from the start
528 of the pattern to the character where the error was
529 discovered is placed in the variable pointed to by
530 erroffset, which must not be NULL. If it is, an immediate
531 error is given.
533 If the final argument, tableptr, is NULL, PCRE uses a
534 default set of character tables which are built when it is
535 compiled, using the default C locale. Otherwise, tableptr
536 must be the result of a call to pcre_maketables(). See the
537 section on locale support below.
539 This code fragment shows a typical straightforward call to
540 pcre_compile():
542 pcre *re;
543 const char *error;
544 int erroffset;
545 re = pcre_compile(
546 "^A.*Z", /* the pattern */
547 0, /* default options */
548 &error, /* for error message */
549 &erroffset, /* for error offset */
550 NULL); /* use default character tables */
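The fragment above does not show what to do when compilation fails. One common approach is to print the message together with the pattern and a marker under the offending character. The helper below is a sketch of that idea; format_compile_error is a hypothetical name, and the caller is assumed to pass the message and offset that pcre_compile() stored via errptr and erroffset.

```c
#include <stdio.h>

/* Write "<message> at offset N", the pattern, and a caret line
   pointing at the offset into buf. Returns what snprintf() returns:
   the length the full text needs, excluding the terminating zero. */
int format_compile_error(char *buf, size_t size, const char *pattern,
                         const char *message, int erroffset)
{
    /* "%*s" with an empty string emits erroffset spaces. */
    return snprintf(buf, size, "%s at offset %d\n%s\n%*s^\n",
                    message, erroffset, pattern, erroffset, "");
}
```

A caller would invoke this only when pcre_compile() has returned NULL, since the error variables are meaningful only in that case.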
552 The following option bits are defined:
PCRE_ANCHORED
556 If this bit is set, the pattern is forced to be "anchored",
557 that is, it is constrained to match only at the first match-
558 ing point in the string which is being searched (the "sub-
559 ject string"). This effect can also be achieved by appropri-
560 ate constructs in the pattern itself, which is the only way
561 to do it in Perl.
PCRE_CASELESS
565 If this bit is set, letters in the pattern match both upper
566 and lower case letters. It is equivalent to Perl's /i
567 option, and it can be changed within a pattern by a (?i)
568 option setting.
PCRE_DOLLAR_ENDONLY
572 If this bit is set, a dollar metacharacter in the pattern
573 matches only at the end of the subject string. Without this
574 option, a dollar also matches immediately before the final
575 character if it is a newline (but not before any other new-
576 lines). The PCRE_DOLLAR_ENDONLY option is ignored if
577 PCRE_MULTILINE is set. There is no equivalent to this option
578 in Perl, and no way to set it within a pattern.
PCRE_DOTALL
582 If this bit is set, a dot metacharacter in the pattern
583 matches all characters, including newlines. Without it, new-
584 lines are excluded. This option is equivalent to Perl's /s
585 option, and it can be changed within a pattern by a (?s)
586 option setting. A negative class such as [^a] always matches
587 a newline character, independent of the setting of this
588 option.
PCRE_EXTENDED
592 If this bit is set, whitespace data characters in the pat-
593 tern are totally ignored except when escaped or inside a
594 character class. Whitespace does not include the VT charac-
595 ter (code 11). In addition, characters between an unescaped
596 # outside a character class and the next newline character,
597 inclusive, are also ignored. This is equivalent to Perl's /x
598 option, and it can be changed within a pattern by a (?x)
599 option setting.
601 This option makes it possible to include comments inside
602 complicated patterns. Note, however, that this applies only
603 to data characters. Whitespace characters may never appear
604 within special character sequences in a pattern, for example
605 within the sequence (?( which introduces a conditional sub-
606 pattern.
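The effect of ignoring whitespace and comments can be approximated by a small preprocessing function. The sketch below is not how PCRE implements the option (it handles these characters during compilation), and strip_extended is a hypothetical name. It treats space, tab, newline, form feed and carriage return as whitespace, leaves VT (code 11) alone, and uses a deliberately simple notion of a character class, so POSIX classes such as [[:alpha:]] and \Q...\E quoting are not understood.

```c
/* Whitespace for this purpose: VT (code 11) is deliberately absent. */
static int is_pattern_space(int c)
{
    return c == ' ' || c == '\t' || c == '\n' || c == '\f' || c == '\r';
}

/* Copy in to out, dropping unescaped whitespace and #-comments that
   lie outside character classes. out needs strlen(in) + 1 bytes. */
void strip_extended(const char *in, char *out)
{
    int in_class = 0;

    while (*in != '\0') {
        if (*in == '\\' && in[1] != '\0') {
            *out++ = *in++;              /* keep the backslash ...   */
            *out++ = *in++;              /* ... and the escaped char */
        } else if (in_class) {
            if (*in == ']')
                in_class = 0;
            *out++ = *in++;
        } else if (*in == '[') {
            in_class = 1;
            *out++ = *in++;
            if (*in == '^')              /* negated class */
                *out++ = *in++;
            if (*in == ']')              /* a leading ] is literal */
                *out++ = *in++;
        } else if (is_pattern_space((unsigned char)*in)) {
            in++;                        /* ignored whitespace */
        } else if (*in == '#') {
            while (*in != '\0' && *in != '\n')
                in++;                    /* skip the comment text */
            if (*in == '\n')
                in++;                    /* the newline goes too */
        } else {
            *out++ = *in++;
        }
    }
    *out = '\0';
}
```

Given the input "a b # comment", a newline, and " c", the function yields "abc", while an escaped space or a space inside [...] is preserved.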
PCRE_EXTRA
610 This option was invented in order to turn on additional
611 functionality of PCRE that is incompatible with Perl, but it
612 is currently of very little use. When set, any backslash in
613 a pattern that is followed by a letter that has no special
614 meaning causes an error, thus reserving these combinations
615 for future expansion. By default, as in Perl, a backslash
616 followed by a letter with no special meaning is treated as a
617 literal. There are at present no other features controlled
618 by this option. It can also be set by a (?X) option setting
619 within a pattern.
PCRE_MULTILINE
623 By default, PCRE treats the subject string as consisting of
624 a single "line" of characters (even if it actually contains
625 several newlines). The "start of line" metacharacter (^)
626 matches only at the start of the string, while the "end of
627 line" metacharacter ($) matches only at the end of the
628 string, or before a terminating newline (unless
629 PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
631 When PCRE_MULTILINE is set, the "start of line" and "end
632 of line" constructs match immediately following or immedi-
633 ately before any newline in the subject string, respec-
634 tively, as well as at the very start and end. This is
635 equivalent to Perl's /m option, and it can be changed within
636 a pattern by a (?m) option setting. If there are no "\n"
637 characters in a subject string, or no occurrences of ^ or $
638 in a pattern, setting PCRE_MULTILINE has no effect.
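The positions at which ^ and $ can match, as described for PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY, can be written down as two small predicates. This is a sketch that follows the wording of this document, with linefeed as the newline character; caret_matches and dollar_matches are hypothetical names, and the library's behaviour in corner cases (for example, immediately after a trailing newline) may differ.

```c
#include <stddef.h>

/* Can ^ match at byte position pos of subject s? */
int caret_matches(const char *s, size_t pos, int multiline)
{
    if (pos == 0)
        return 1;                         /* very start of subject */
    return multiline && s[pos - 1] == '\n';
}

/* Can $ match at byte position pos of subject s (length len)? */
int dollar_matches(const char *s, size_t len, size_t pos,
                   int multiline, int dollar_endonly)
{
    if (pos == len)
        return 1;                         /* very end of subject */
    if (multiline)
        return s[pos] == '\n';            /* before any newline; the
                                             endonly flag is ignored */
    if (dollar_endonly)
        return 0;                         /* only at the very end */
    return pos + 1 == len && s[pos] == '\n';  /* before final newline */
}
```

For the six-byte subject "ab", newline, "cd", newline, the dollar matches at offset 2 only in multiline mode, and matches at offset 5 by default unless the end-only flag is set.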
PCRE_NO_AUTO_CAPTURE
642 If this option is set, it disables the use of numbered cap-
643 turing parentheses in the pattern. Any opening parenthesis
644 that is not followed by ? behaves as if it were followed by
645 ?: but named parentheses can still be used for capturing
646 (and they acquire numbers in the usual way). There is no
647 equivalent of this option in Perl.
PCRE_UNGREEDY
651 This option inverts the "greediness" of the quantifiers so
652 that they are not greedy by default, but become greedy if
653 followed by "?". It is not compatible with Perl. It can also
654 be set by a (?U) option setting within the pattern.
PCRE_UTF8
658 This option causes PCRE to regard both the pattern and the
659 subject as strings of UTF-8 characters instead of single-
660 byte character strings. However, it is available only if
661 PCRE has been built to include UTF-8 support. If not, the
662 use of this option provokes an error. Details of how this
663 option changes the behaviour of PCRE are given in the sec-
664 tion on UTF-8 support in the main pcre page.
PCRE_NO_UTF8_CHECK
668 When PCRE_UTF8 is set, the validity of the pattern as a
669 UTF-8 string is automatically checked. If an invalid UTF-8
670 sequence of bytes is found, pcre_compile() returns an error.
671 If you already know that your pattern is valid, and you want
672 to skip this check for performance reasons, you can set the
673 PCRE_NO_UTF8_CHECK option. When it is set, the effect of
674 passing an invalid UTF-8 string as a pattern is undefined.
675 It may cause your program to crash. Note that there is a
676 similar option for suppressing the checking of subject
677 strings passed to pcre_exec().
683 pcre_extra *pcre_study(const pcre *code, int options,
684 const char **errptr);
686 When a pattern is going to be used several times, it is
687 worth spending more time analyzing it in order to speed up
688 the time taken for matching. The function pcre_study() takes
689 a pointer to a compiled pattern as its first argument. If
690 studying the pattern produces additional information that
691 will help speed up matching, pcre_study() returns a pointer
692 to a pcre_extra block, in which the study_data field points
693 to the results of the study.
695 The value returned from pcre_study() can be passed
696 directly to pcre_exec(). However, the pcre_extra block also
697 contains other fields that can be set by the caller before
698 the block is passed; these are described below. If studying
699 the pattern does not produce any additional information,
700 pcre_study() returns NULL. In that circumstance, if the cal-
701 ling program wants to pass some of the other fields to
702 pcre_exec(), it must set up its own pcre_extra block.
704 The second argument contains option bits. At present, no
705 options are defined for pcre_study(), and this argument
706 should always be zero.
708 The third argument for pcre_study() is a pointer for an
709 error message. If studying succeeds (even if no data is
710 returned), the variable it points to is set to NULL. Other-
711 wise it points to a textual error message. You should there-
712 fore test the error pointer for NULL after calling
713 pcre_study(), to be sure that it has run successfully.
715 This is a typical call to pcre_study():
717 pcre_extra *pe;
718 pe = pcre_study(
719 re, /* result of pcre_compile() */
720 0, /* no options exist */
721 &error); /* set to NULL or points to a message */
723 At present, studying a pattern is useful only for non-
724 anchored patterns that do not have a single fixed starting
725 character. A bitmap of possible starting characters is
726 created.
731 PCRE handles caseless matching, and determines whether char-
732 acters are letters, digits, or whatever, by reference to a
733 set of tables. When running in UTF-8 mode, this applies only
734 to characters with codes less than 256. The library contains
735 a default set of tables that is created in the default C
736 locale when PCRE is compiled. This is used when the final
737 argument of pcre_compile() is NULL, and is sufficient for
738 many applications.
740 An alternative set of tables can, however, be supplied. Such
741 tables are built by calling the pcre_maketables() function,
742 which has no arguments, in the relevant locale. The result
743 can then be passed to pcre_compile() as often as necessary.
744 For example, to build and use tables that are appropriate
745 for the French locale (where accented characters with codes
746 greater than 128 are treated as letters), the following code
747 could be used:
749 setlocale(LC_CTYPE, "fr");
750 tables = pcre_maketables();
751 re = pcre_compile(..., tables);
753 The tables are built in memory that is obtained via
754 pcre_malloc. The pointer that is passed to pcre_compile is
755 saved with the compiled pattern, and the same tables are
756 used via this pointer by pcre_study() and pcre_exec(). Thus,
757 for any single pattern, compilation, studying and matching
758 all happen in the same locale, but different patterns can be
759 compiled in different locales. It is the caller's responsi-
760 bility to ensure that the memory containing the tables
761 remains available for as long as it is needed.
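To give a feel for what locale-dependent tables are, the sketch below builds two 256-entry tables from the standard ctype functions, which consult the current locale. The layout of the tables that pcre_maketables() really produces is not described here, so build_case_tables and its two arrays are purely illustrative.

```c
#include <ctype.h>

/* Fill a lower-casing table and an "is a letter" table for all 256
   byte values, using whatever locale is current at the time. */
void build_case_tables(unsigned char lower[256],
                       unsigned char is_letter[256])
{
    int c;
    for (c = 0; c < 256; c++) {
        lower[c] = (unsigned char)tolower(c);
        is_letter[c] = (unsigned char)(isalpha(c) != 0);
    }
}
```

Because the tables are a snapshot, a later call to setlocale() does not change them, which matches the statement above that a pattern keeps using the tables it was compiled with.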
766 int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
767 int what, void *where);
769 The pcre_fullinfo() function returns information about a
770 compiled pattern. It replaces the obsolete pcre_info() func-
771 tion, which is nevertheless retained for backwards compatibility
772 (and is documented below).
773 The first argument for pcre_fullinfo() is a pointer to the
774 compiled pattern. The second argument is the result of
775 pcre_study(), or NULL if the pattern was not studied. The
776 third argument specifies which piece of information is
777 required, and the fourth argument is a pointer to a variable
778 to receive the data. The yield of the function is zero for
779 success, or one of the following negative numbers:
781 PCRE_ERROR_NULL the argument code was NULL
782 the argument where was NULL
783 PCRE_ERROR_BADMAGIC the "magic number" was not found
784 PCRE_ERROR_BADOPTION the value of what was invalid
786 Here is a typical call of pcre_fullinfo(), to obtain the
787 length of the compiled pattern:
789 int rc;
790 unsigned long int length;
791 rc = pcre_fullinfo(
792 re, /* result of pcre_compile() */
793 pe, /* result of pcre_study(), or NULL */
794 PCRE_INFO_SIZE, /* what is required */
795 &length); /* where to put the data */
797 The possible values for the third argument are defined in
798 pcre.h, and are as follows:
PCRE_INFO_BACKREFMAX
802 Return the number of the highest back reference in the pat-
803 tern. The fourth argument should point to an int variable.
804 Zero is returned if there are no back references.
PCRE_INFO_CAPTURECOUNT
808 Return the number of capturing subpatterns in the pattern.
809 The fourth argument should point to an int variable.
PCRE_INFO_FIRSTBYTE
813 Return information about the first byte of any matched
814 string, for a non-anchored pattern. (This option used to be
815 called PCRE_INFO_FIRSTCHAR; the old name is still recognized
816 for backwards compatibility.)
818 If there is a fixed first byte, e.g. from a pattern such as
819 (cat|cow|coyote), it is returned in the integer pointed to
820 by where. Otherwise, if either
822 (a) the pattern was compiled with the PCRE_MULTILINE option,
823 and every branch starts with "^", or
825 (b) every branch of the pattern starts with ".*" and
826 PCRE_DOTALL is not set (if it were set, the pattern would be
827 anchored),
829 -1 is returned, indicating that the pattern matches only at
830 the start of a subject string or after any newline within
831 the string. Otherwise -2 is returned. For anchored patterns,
832 -2 is returned.
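As an illustration only, the three kinds of result described above could be decoded as follows. This is a sketch: describe_first_byte() is a hypothetical helper, not part of PCRE, and it assumes that value is the int filled in by a successful first-byte query with pcre_fullinfo().

```c
#include <stdio.h>

/* Hypothetical helper (not a PCRE function): describe the int produced by
   a first-byte query, following the three cases given in the text.
   buf is only used when there is a fixed first byte. */
static const char *describe_first_byte(int value, char *buf, size_t buflen)
{
    if (value >= 0) {
        /* A fixed first byte, e.g. 'c' for (cat|cow|coyote). */
        snprintf(buf, buflen, "match must start with byte 0x%02x", value);
        return buf;
    }
    if (value == -1)
        return "match starts at subject start or after a newline";
    return "no first-byte information";   /* -2, including anchored patterns */
}
```

A caller would typically use a non-negative value to skip quickly to candidate starting positions, and ignore the negative values.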
834 PCRE_INFO_FIRSTTABLE
836 If the pattern was studied, and this resulted in the
837 construction of a 256-bit table indicating a fixed set of bytes
838 for the first byte in any matching string, a pointer to the
839 table is returned. Otherwise NULL is returned. The fourth
840 argument should point to an unsigned char * variable.
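A 256-bit table occupies 32 bytes, one bit per possible byte value. The following sketch shows how such a table might be tested. The bit layout used here (byte c is bit c & 7 of table[c / 8]) is an assumption that this document does not state, and both function names are hypothetical.

```c
/* Test whether byte c is marked in a 256-bit (32-byte) table.
   ASSUMPTION: byte c corresponds to bit (c & 7) of table[c / 8]. */
static int first_table_allows(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Mark byte c in a table. Only needed to build test data by hand;
   a real table would come from the study data. */
static void first_table_mark(unsigned char *table, unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

If the returned pointer is NULL there is no table, so a caller must check for that before testing any byte.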
842 PCRE_INFO_LASTLITERAL
844 Return the value of the rightmost literal byte that must
845 exist in any matched string, other than at its start, if
846 such a byte has been recorded. The fourth argument should
847 point to an int variable. If there is no such byte, -1 is
848 returned. For anchored patterns, a last literal byte is
849 recorded only if it follows something of variable length.
850 For example, for the pattern /^a\d+z\d+/ the returned value
851 is "z", but for /^a\dz\d/ the returned value is -1.
853 PCRE_INFO_NAMECOUNT
854 PCRE_INFO_NAMEENTRYSIZE
855 PCRE_INFO_NAMETABLE
857 PCRE supports the use of named as well as numbered capturing
858 parentheses. The names are just an additional way of identi-
859 fying the parentheses, which still acquire a number. A
860 caller that wants to extract data from a named subpattern
861 must convert the name to a number in order to access the
862 correct pointers in the output vector (described with
863 pcre_exec() below). In order to do this, it must first use
864 these three values to obtain the name-to-number mapping
865 table for the pattern.
867 The map consists of a number of fixed-size entries.
868 PCRE_INFO_NAMECOUNT gives the number of entries, and
869 PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both
870 of these return an int value. The entry size depends on the
871 length of the longest name. PCRE_INFO_NAMETABLE returns a
872 pointer to the first entry of the table (a pointer to char).
873 The first two bytes of each entry are the number of the cap-
874 turing parenthesis, most significant byte first. The rest of
875 the entry is the corresponding name, zero terminated. The
876 names are in alphabetical order. For example, consider the
877 following pattern (assume PCRE_EXTENDED is set, so white
878 space - including newlines - is ignored):
880 (?P<date> (?P<year>(\d\d)?\d\d) -
881 (?P<month>\d\d) - (?P<day>\d\d) )
883 There are four named subpatterns, so the table has four
884 entries, and each entry in the table is eight bytes long.
885 The table is as follows, with non-printing bytes shown in
886 hex, and undefined bytes shown as ??:
888 00 01 d a t e 00 ??
889 00 05 d a y 00 ?? ??
890 00 04 m o n t h 00
891 00 02 y e a r 00 ??
893 When writing code to extract data from named subpatterns,
894 remember that the length of each entry may be different for
895 each compiled pattern.
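The table layout described above can be walked with a few lines of C. This is a minimal sketch: lookup_group_number() is a hypothetical helper, and example_table is a hand-built copy of the four-entry table shown above (with the undefined bytes set to zero), standing in for the pointer, count and entry size that pcre_fullinfo() would supply.

```c
#include <string.h>

/* Hypothetical helper: find a name in a table of fixed-size entries.
   Each entry is two bytes of group number (most significant first)
   followed by the zero-terminated name.
   Returns the group number, or -1 if the name is not present. */
static int lookup_group_number(const unsigned char *table, int name_count,
                               int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built copy of the example table: 4 entries of 8 bytes each. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

Because the names are in alphabetical order a binary search would also work; a linear scan keeps the sketch short. The entry size is passed in rather than hard-coded because it differs from pattern to pattern.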
897 PCRE_INFO_OPTIONS
899 Return a copy of the options with which the pattern was
900 compiled. The fourth argument should point to an unsigned long
901 int variable. These option bits are those specified in the
902 call to pcre_compile(), modified by any top-level option
903 settings within the pattern itself.
905 A pattern is automatically anchored by PCRE if all of its
906 top-level alternatives begin with one of the following:
908 ^ unless PCRE_MULTILINE is set
909 \A always
910 \G always
911 .* if PCRE_DOTALL is set and there are no back
912 references to the subpattern in which .* appears
914 For such patterns, the PCRE_ANCHORED bit is set in the
915 options returned by pcre_fullinfo().
917 PCRE_INFO_SIZE
919 Return the size of the compiled pattern, that is, the value
920 that was passed as the argument to pcre_malloc() when PCRE
921 was getting memory in which to place the compiled data. The
922 fourth argument should point to a size_t variable.
924 PCRE_INFO_STUDYSIZE
926 Returns the size of the data block pointed to by the
927 study_data field in a pcre_extra block. That is, it is the
928 value that was passed to pcre_malloc() when PCRE was getting
929 memory into which to place the data created by pcre_study().
930 The fourth argument should point to a size_t variable.
935 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
937 The pcre_info() function is now obsolete because its inter-
938 face is too restrictive to return all the available data
939 about a compiled pattern. New programs should use
940 pcre_fullinfo() instead. The yield of pcre_info() is the
941 number of capturing subpatterns, or one of the following
942 negative numbers:
944 PCRE_ERROR_NULL the argument code was NULL
945 PCRE_ERROR_BADMAGIC the "magic number" was not found
947 If the optptr argument is not NULL, a copy of the options
948 with which the pattern was compiled is placed in the integer
949 it points to (see PCRE_INFO_OPTIONS above).
951 If the pattern is not anchored and the firstcharptr argument
952 is not NULL, it is used to pass back information about the
953 first character of any matched string (see
954 PCRE_INFO_FIRSTBYTE above).
959 int pcre_exec(const pcre *code, const pcre_extra *extra,
960 const char *subject, int length, int startoffset,
961 int options, int *ovector, int ovecsize);
963 The function pcre_exec() is called to match a subject string
964 against a pre-compiled pattern, which is passed in the code
965 argument. If the pattern has been studied, the result of the
966 study should be passed in the extra argument.
968 Here is an example of a simple call to pcre_exec():
970 int rc;
971 int ovector[30];
972 rc = pcre_exec(
973 re, /* result of pcre_compile() */
974 NULL, /* we didn't study the pattern */
975 "some string", /* the subject string */
976 11, /* the length of the subject string */
977 0, /* start at offset 0 in the subject */
978 0, /* default options */
979 ovector, /* vector for substring information */
980 30); /* number of elements in the vector */
982 If the extra argument is not NULL, it must point to a
983 pcre_extra data block. The pcre_study() function returns
984 such a block (when it doesn't return NULL), but you can also
985 create one for yourself, and pass additional information in
986 it. The fields in the block are as follows:
988 unsigned long int flags;
989 void *study_data;
990 unsigned long int match_limit;
991 void *callout_data;
993 The flags field is a bitmap that specifies which of the
994 other fields are set. The flag bits are:
996 PCRE_EXTRA_STUDY_DATA
997 PCRE_EXTRA_MATCH_LIMIT
998 PCRE_EXTRA_CALLOUT_DATA
1000 Other flag bits should be set to zero. The study_data field
1001 is set in the pcre_extra block that is returned by
1002 pcre_study(), together with the appropriate flag bit. You
1003 should not set this yourself, but you can add to the block
1004 by setting the other fields.
1006 The match_limit field provides a means of preventing PCRE
1007 from using up a vast amount of resources when running pat-
1008 terns that are not going to match, but which have a very
1009 large number of possibilities in their search trees. The
1010 classic example is the use of nested unlimited repeats.
1011 Internally, PCRE uses a function called match() which it
1012 calls repeatedly (sometimes recursively). The limit is
1013 imposed on the number of times this function is called dur-
1014 ing a match, which has the effect of limiting the amount of
1015 recursion and backtracking that can take place. For patterns
1016 that are not anchored, the count starts from zero for each
1017 position in the subject string.
1019 The default limit for the library can be set when PCRE is
1020 built; the built-in default is 10 million, which handles all
1021 but the most extreme cases. You can reduce the default by
1022 supplying pcre_exec() with a pcre_extra block in which
1023 match_limit is set to a smaller value, and
1024 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
1025 limit is exceeded, pcre_exec() returns
1026 PCRE_ERROR_MATCHLIMIT.
1028 The callout_data field is used in conjunction with the
1029 "callout" feature, which is described in the pcrecallout
1030 documentation.
1032 The PCRE_ANCHORED option can be passed in the options argu-
1033 ment, whose unused bits must be zero. This limits
1034 pcre_exec() to matching at the first matching position. How-
1035 ever, if a pattern was compiled with PCRE_ANCHORED, or
1036 turned out to be anchored by virtue of its contents, it
1037 cannot be made unanchored at matching time.
1039 When PCRE_UTF8 was set at compile time, the validity of the
1040 subject as a UTF-8 string is automatically checked. If an
1041 invalid UTF-8 sequence of bytes is found, pcre_exec()
1042 returns the error PCRE_ERROR_BADUTF8. If you already know
1043 that your subject is valid, and you want to skip this check
1044 for performance reasons, you can set the PCRE_NO_UTF8_CHECK
1045 option when calling pcre_exec(). When this option is set,
1046 the effect of passing an invalid UTF-8 string as a subject
1047 is undefined. It may cause your program to crash.
1049 There are also three further options that can be set only at
1050 matching time:
1052 PCRE_NOTBOL
1054 The first character of the string is not the beginning of a
1055 line, so the circumflex metacharacter should not match
1056 before it. Setting this without PCRE_MULTILINE (at compile
1057 time) causes circumflex never to match.
1059 PCRE_NOTEOL
1061 The end of the string is not the end of a line, so the dol-
1062 lar metacharacter should not match it nor (except in multi-
1063 line mode) a newline immediately before it. Setting this
1064 without PCRE_MULTILINE (at compile time) causes dollar never
1065 to match.
1067 PCRE_NOTEMPTY
1069 An empty string is not considered to be a valid match if
1070 this option is set. If there are alternatives in the pat-
1071 tern, they are tried. If all the alternatives match the
1072 empty string, the entire match fails. For example, if the
1073 pattern
1075 a?b?
1077 is applied to a string not beginning with "a" or "b", it
1078 matches the empty string at the start of the subject. With
1079 PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
1080 further into the string for occurrences of "a" or "b".
1082 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
1083 make a special case of a pattern match of the empty string
1084 within its split() function, and when using the /g modifier.
1085 It is possible to emulate Perl's behaviour after matching a
1086 null string by first trying the match again at the same
1087 offset with PCRE_NOTEMPTY set, and then if that fails by
1088 advancing the starting offset (see below) and trying an
1089 ordinary match again.
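The retry scheme in the last paragraph can be sketched without reference to the library itself. In the following C fragment, match_fn is a hypothetical stand-in for a call to pcre_exec() (its notempty argument plays the part of PCRE_NOTEMPTY), and match_a_star() is a toy matcher for the pattern a* that exists only to exercise the loop. Neither is part of PCRE.

```c
/* Hypothetical matcher signature standing in for a pcre_exec() call.
   Returns 1 and fills *start and *end on success, 0 on no match. */
typedef int (*match_fn)(const char *subject, int length, int startoffset,
                        int notempty, int *start, int *end);

/* Toy unanchored matcher for the pattern a*, used only to drive the loop. */
static int match_a_star(const char *subject, int length, int startoffset,
                        int notempty, int *start, int *end)
{
    int pos;
    for (pos = startoffset; pos <= length; pos++) {
        int e = pos;
        while (e < length && subject[e] == 'a')
            e++;
        if (e > pos || !notempty) {
            *start = pos;
            *end = e;
            return 1;
        }
    }
    return 0;
}

/* Collect up to max matches as (start, end) pairs in spans[].
   After an empty match, try again at the same offset with "not empty";
   if that fails, advance one position and match normally. */
static int find_all(match_fn match, const char *subject, int length,
                    int *spans, int max)
{
    int count = 0, offset = 0, notempty = 0;
    while (offset <= length && count < max) {
        int s, e;
        if (match(subject, length, offset, notempty, &s, &e)) {
            spans[2 * count] = s;
            spans[2 * count + 1] = e;
            count++;
            notempty = (s == e);   /* empty match: retry here, non-empty only */
            offset = e;
        } else if (notempty) {
            notempty = 0;          /* retry failed: step forward, match normally */
            offset++;
        } else {
            break;                 /* ordinary match failed: no more matches */
        }
    }
    return count;
}
```

For the subject "aXaa", find_all() with the toy matcher records four matches: (0,1), (1,1), (2,4) and (4,4). In UTF-8 mode a real loop would have to advance by one character rather than one byte; the sketch ignores that.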
1091 The subject string is passed to pcre_exec() as a pointer in
1092 subject, a length in length, and a starting offset in star-
1093 toffset. Unlike the pattern string, the subject may contain
1094 binary zero bytes. When the starting offset is zero, the
1095 search for a match starts at the beginning of the subject,
1096 and this is by far the most common case.
1098 If the pattern was compiled with the PCRE_UTF8 option, the
1099 subject must be a valid UTF-8 string. This is checked unless
1100 PCRE_NO_UTF8_CHECK is set, in which case PCRE's behaviour
1101 for an invalid string is not defined.
1103 A non-zero starting offset is useful when searching for
1104 another match in the same subject by calling pcre_exec()
1105 again after a previous success. Setting startoffset differs
1106 from just passing over a shortened string and setting
1107 PCRE_NOTBOL in the case of a pattern that begins with any
1108 kind of lookbehind. For example, consider the pattern
1110 \Biss\B
1112 which finds occurrences of "iss" in the middle of words. (\B
1113 matches only if the current position in the subject is not a
1114 word boundary.) When applied to the string "Mississippi" the
1115 first call to pcre_exec() finds the first occurrence. If
1116 pcre_exec() is called again with just the remainder of the
1117 subject, namely "issippi", it does not match, because \B is
1118 always false at the start of the subject, which is deemed to
1119 be a word boundary. However, if pcre_exec() is passed the
1120 entire string again, but with startoffset set to 4, it finds
1121 the second occurrence of "iss" because it is able to look
1122 behind the starting point to discover that it is preceded by
1123 a letter.
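The difference comes down to whether the byte before the starting point is visible. The following sketch is not PCRE code: is_word_boundary() is a hypothetical function that applies the usual word-boundary test, assuming ASCII letters, digits and underscore as word characters, and treating positions outside the subject as non-word.

```c
#include <ctype.h>

/* ASSUMPTION: a word character is an ASCII letter, digit or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Is position pos (0..length) a word boundary within subject?
   Bytes outside the subject are treated as non-word characters. */
static int is_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

At offset 0 of the shortened string there is nothing before the "i", so the position counts as a boundary and \B fails. At offset 4 of the full string the preceding "s" is visible, so the position is not a boundary and \B can succeed.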
1125 If a non-zero starting offset is passed when the pattern is
1126 anchored, one attempt to match at the given offset is tried.
1127 This can only succeed if the pattern does not require the
1128 match to be at the start of the subject.
1130 In general, a pattern matches a certain portion of the sub-
1131 ject, and in addition, further substrings from the subject
1132 may be picked out by parts of the pattern. Following the
1133 usage in Jeffrey Friedl's book, this is called "capturing"
1134 in what follows, and the phrase "capturing subpattern" is
1135 used for a fragment of a pattern that picks out a substring.
1136 PCRE supports several other kinds of parenthesized subpat-
1137 tern that do not cause substrings to be captured.
1138 Captured substrings are returned to the caller via a vector
1139 of integer offsets whose address is passed in ovector. The
1140 number of elements in the vector is passed in ovecsize. The
1141 first two-thirds of the vector is used to pass back captured
1142 substrings, each substring using a pair of integers. The
1143 remaining third of the vector is used as workspace by
1144 pcre_exec() while matching capturing subpatterns, and is not
1145 available for passing back information. The length passed in
1146 ovecsize should always be a multiple of three. If it is not,
1147 it is rounded down.
1149 When a match has been successful, information about captured
1150 substrings is returned in pairs of integers, starting at the
1151 beginning of ovector, and continuing up to two-thirds of its
1152 length at the most. The first element of a pair is set to
1153 the offset of the first character in a substring, and the
1154 second is set to the offset of the first character after the
1155 end of a substring. The first pair, ovector[0] and ovec-
1156 tor[1], identify the portion of the subject string matched
1157 by the entire pattern. The next pair is used for the first
1158 capturing subpattern, and so on. The value returned by
1159 pcre_exec() is the number of pairs that have been set. If
1160 there are no capturing subpatterns, the return value from a
1161 successful match is 1, indicating that just the first pair
1162 of offsets has been set.
1164 Some convenience functions are provided for extracting the
1165 captured substrings as separate strings. These are described
1166 in the following section.
1168 It is possible for a capturing subpattern number n+1 to
1169 match some part of the subject when subpattern n has not
1170 been used at all. For example, if the string "abc" is
1171 matched against the pattern (a|(z))(bc) subpatterns 1 and 3
1172 are matched, but 2 is not. When this happens, both offset
1173 values corresponding to the unused subpattern are set to -1.
1175 If a capturing subpattern is matched repeatedly, it is the
1176 last portion of the string that it matched that gets
1177 returned.
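The offset pairs, including the -1 markers for unused subpatterns, can be read with a small helper. This is a sketch only: captured_span() is a hypothetical function, and rc is assumed to be the positive value returned by a successful pcre_exec() call.

```c
/* Hypothetical helper: locate captured substring n from the offset pairs.
   rc is the positive return value of the match (the number of pairs set).
   Returns 1 and fills *start and *length if the substring is set;
   returns 0 if n is out of range or the pair is -1 (unused subpattern). */
static int captured_span(const char *subject, const int *ovector, int rc,
                         int n, const char **start, int *length)
{
    if (n < 0 || n >= rc)
        return 0;
    if (ovector[2 * n] < 0 || ovector[2 * n + 1] < 0)
        return 0;
    *start = subject + ovector[2 * n];
    *length = ovector[2 * n + 1] - ovector[2 * n];
    return 1;
}
```

The result is a pointer and a length, not a zero-terminated string, so it would be printed with something like printf("%.*s", length, start). For "abc" matched against (a|(z))(bc), the helper reports substrings 0, 1 and 3 as set and substring 2 as unset.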
1179 If the vector is too small to hold all the captured sub-
1180 strings, it is used as far as possible (up to two-thirds of
1181 its length), and the function returns a value of zero. In
1182 particular, if the substring offsets are not of interest,
1183 pcre_exec() may be called with ovector passed as NULL and
1184 ovecsize as zero. However, if the pattern contains back
1185 references and the ovector isn't big enough to remember the
1186 related substrings, PCRE has to get additional memory for
1187 use during matching. Thus it is usually advisable to supply
1188 an ovector.
1190 Note that pcre_info() can be used to find out how many cap-
1191 turing subpatterns there are in a compiled pattern. The
1192 smallest size for ovector that will allow for n captured
1193 substrings, in addition to the offsets of the substring
1194 matched by the whole pattern, is (n+1)*3.
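The sizing rules are simple arithmetic, shown here as two hypothetical helpers (neither is a PCRE function). The second one follows from the earlier statement that only the first two-thirds of the vector is used for pairs and that ovecsize is rounded down to a multiple of three.

```c
/* Smallest ovecsize that can report n capturing subpatterns
   in addition to the whole match. */
static int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs a vector of ovecsize ints can report:
   round down to a multiple of three, then two ints per pair
   from the first two-thirds, which is ovecsize / 3 pairs. */
static int reportable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, the 30-element vector used in the pcre_exec() example above has room for the whole match plus nine captured substrings.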
1196 If pcre_exec() fails, it returns a negative number. The fol-
1197 lowing are defined in the header file:
1199 PCRE_ERROR_NOMATCH (-1)
1201 The subject string did not match the pattern.
1203 PCRE_ERROR_NULL (-2)
1205 Either code or subject was passed as NULL, or ovector was
1206 NULL and ovecsize was not zero.
1208 PCRE_ERROR_BADOPTION (-3)
1210 An unrecognized bit was set in the options argument.
1212 PCRE_ERROR_BADMAGIC (-4)
1214 PCRE stores a 4-byte "magic number" at the start of the
1215 compiled code, to catch the case when it is passed a junk
1216 pointer. This is the error it gives when the magic number
1217 isn't present.
1219 PCRE_ERROR_UNKNOWN_NODE (-5)
1221 While running the pattern match, an unknown item was
1222 encountered in the compiled pattern. This error could be caused by
1223 a bug in PCRE or by overwriting of the compiled pattern.
1225 PCRE_ERROR_NOMEMORY (-6)
1227 If a pattern contains back references, but the ovector that
1228 is passed to pcre_exec() is not big enough to remember the
1229 referenced substrings, PCRE gets a block of memory at the
1230 start of matching to use for this purpose. If the call via
1231 pcre_malloc() fails, this error is given. The memory is
1232 freed at the end of matching.
1234 PCRE_ERROR_NOSUBSTRING (-7)
1236 This error is used by the pcre_copy_substring(),
1237 pcre_get_substring(), and pcre_get_substring_list()
1238 functions (see below). It is never returned by pcre_exec().
1240 PCRE_ERROR_MATCHLIMIT (-8)
1242 The recursion and backtracking limit, as specified by the
1243 match_limit field in a pcre_extra structure (or defaulted)
1244 was reached. See the description above.
1246 PCRE_ERROR_CALLOUT (-9)
1248 This error is never generated by pcre_exec() itself. It is
1249 provided for use by callout functions that want to yield a
1250 distinctive error code. See the pcrecallout documentation
1251 for details.
1253 PCRE_ERROR_BADUTF8 (-10)
1255 A string that contains an invalid UTF-8 byte sequence was
1256 passed as a subject.
1261 int pcre_copy_substring(const char *subject, int *ovector,
1262 int stringcount, int stringnumber, char *buffer,
1263 int buffersize);
1265 int pcre_get_substring(const char *subject, int *ovector,
1266 int stringcount, int stringnumber,
1267 const char **stringptr);
1269 int pcre_get_substring_list(const char *subject,
1270 int *ovector, int stringcount, const char ***listptr);
1272 Captured substrings can be accessed directly by using the
1273 offsets returned by pcre_exec() in ovector. For convenience,
1274 the functions pcre_copy_substring(), pcre_get_substring(),
1275 and pcre_get_substring_list() are provided for extracting
1276 captured substrings as new, separate, zero-terminated
1277 strings. These functions identify substrings by number. The
1278 next section describes functions for extracting named sub-
1279 strings. A substring that contains a binary zero is
1280 correctly extracted and has a further zero added on the end,
1281 but the result is not, of course, a C string.
1283 The first three arguments are the same for all three of
1284 these functions: subject is the subject string which has
1285 just been successfully matched, ovector is a pointer to the
1286 vector of integer offsets that was passed to pcre_exec(),
1287 and stringcount is the number of substrings that were cap-
1288 tured by the match, including the substring that matched the
1289 entire regular expression. This is the value returned by
1290 pcre_exec if it is greater than zero. If pcre_exec()
1291 returned zero, indicating that it ran out of space in ovec-
1292 tor, the value passed as stringcount should be the size of
1293 the vector divided by three.
1294 The functions pcre_copy_substring() and pcre_get_substring()
1295 extract a single substring, whose number is given as string-
1296 number. A value of zero extracts the substring that matched
1297 the entire pattern, while higher values extract the captured
1298 substrings. For pcre_copy_substring(), the string is placed
1299 in buffer, whose length is given by buffersize, while for
1300 pcre_get_substring() a new block of memory is obtained via
1301 pcre_malloc, and its address is returned via stringptr. The
1302 yield of the function is the length of the string, not
1303 including the terminating zero, or one of
1305 PCRE_ERROR_NOMEMORY (-6)
1307 The buffer was too small for pcre_copy_substring(), or the
1308 attempt to get memory failed for pcre_get_substring().
1310 PCRE_ERROR_NOSUBSTRING (-7)
1312 There is no substring whose number is stringnumber.
1314 The pcre_get_substring_list() function extracts all avail-
1315 able substrings and builds a list of pointers to them. All
1316 this is done in a single block of memory which is obtained
1317 via pcre_malloc. The address of the memory block is returned
1318 via listptr, which is also the start of the list of string
1319 pointers. The end of the list is marked by a NULL pointer.
1320 The yield of the function is zero if all went well, or
1322 PCRE_ERROR_NOMEMORY (-6)
1324 if the attempt to get the memory block failed.
1326 When any of these functions encounter a substring that is
1327 unset, which can happen when capturing subpattern number n+1
1328 matches some part of the subject, but subpattern n has not
1329 been used at all, they return an empty string. This can be
1330 distinguished from a genuine zero-length substring by
1331 inspecting the appropriate offset in ovector, which is nega-
1332 tive for unset substrings.
1334 The two convenience functions pcre_free_substring() and
1335 pcre_free_substring_list() can be used to free the memory
1336 returned by a previous call of pcre_get_substring() or
1337 pcre_get_substring_list(), respectively. They do nothing
1338 more than call the function pointed to by pcre_free, which
1339 of course could be called directly from a C program. How-
1340 ever, PCRE is used in some situations where it is linked via
1341 a special interface to another programming language which
1342 cannot use pcre_free directly; it is for these cases that
1343 the functions are provided.
1348 int pcre_copy_named_substring(const pcre *code,
1349 const char *subject, int *ovector,
1350 int stringcount, const char *stringname,
1351 char *buffer, int buffersize);
1353 int pcre_get_stringnumber(const pcre *code,
1354 const char *name);
1356 int pcre_get_named_substring(const pcre *code,
1357 const char *subject, int *ovector,
1358 int stringcount, const char *stringname,
1359 const char **stringptr);
1361 To extract a substring by name, you first have to find the
1362 associated number. This can be done by calling
1363 pcre_get_stringnumber(). The first argument is the compiled
1364 pattern, and the second is the name. For example, for this
1365 pattern
1367 ab(?P<xxx>\d+)...
1369 the number of the subpattern called "xxx" is 1. Given the
1370 number, you can then extract the substring directly, or use
1371 one of the functions described in the previous section. For
1372 convenience, there are also two functions that do the whole
1373 job.
1375 Most of the arguments of pcre_copy_named_substring() and
1376 pcre_get_named_substring() are the same as those for the
1377 functions that extract by number, and so are not re-
1378 described here. There are just two differences.
1380 First, instead of a substring number, a substring name is
1381 given. Second, there is an extra argument, given at the
1382 start, which is a pointer to the compiled pattern. This is
1383 needed in order to gain access to the name-to-number trans-
1384 lation table.
1386 These functions call pcre_get_stringnumber(), and if it
1387 succeeds, they then call pcre_copy_substring() or
1388 pcre_get_substring(), as appropriate.
1390 Last updated: 20 August 2003
1391 Copyright (c) 1997-2003 University of Cambridge.
1392 -----------------------------------------------------------------------------
1394 NAME
1395 PCRE - Perl-compatible regular expressions
1400 int (*pcre_callout)(pcre_callout_block *);
1402 PCRE provides a feature called "callout", which is a means
1403 of temporarily passing control to the caller of PCRE in the
1404 middle of pattern matching. The caller of PCRE provides an
1405 external function by putting its entry point in the global
1406 variable pcre_callout. By default, this variable contains
1407 NULL, which disables all calling out.
1409 Within a regular expression, (?C) indicates the points at
1410 which the external function is to be called. Different cal-
1411 lout points can be identified by putting a number less than
1412 256 after the letter C. The default value is zero. For
1413 example, this pattern has two callout points:
1415 (?C1)9abc(?C2)def
1417 During matching, when PCRE reaches a callout point (and
1418 pcre_callout is set), the external function is called. Its
1419 only argument is a pointer to a pcre_callout block. This
1420 contains the following variables:
1422 int version;
1423 int callout_number;
1424 int *offset_vector;
1425 const char *subject;
1426 int subject_length;
1427 int start_match;
1428 int current_position;
1429 int capture_top;
1430 int capture_last;
1431 void *callout_data;
1433 The version field is an integer containing the version
1434 number of the block format. The current version is zero. The
1435 version number may change in future if additional fields are
1436 added, but the intention is never to remove any of the
1437 existing fields.
1439 The callout_number field contains the number of the callout,
1440 as compiled into the pattern (that is, the number after ?C).
1442 The offset_vector field is a pointer to the vector of
1443 offsets that was passed by the caller to pcre_exec(). The
1444 contents can be inspected in order to extract substrings
1445 that have been matched so far, in the same way as for
1446 extracting substrings after a match has completed.
1447 The subject and subject_length fields contain copies of the
1448 values that were passed to pcre_exec().
1450 The start_match field contains the offset within the subject
1451 at which the current match attempt started. If the pattern
1452 is not anchored, the callout function may be called several
1453 times for different starting points.
1455 The current_position field contains the offset within the
1456 subject of the current match pointer.
1458 The capture_top field contains one more than the number of
1459 the highest numbered captured substring so far. If no sub-
1460 strings have been captured, the value of capture_top is one.
1462 The capture_last field contains the number of the most
1463 recently captured substring.
1465 The callout_data field contains a value that is passed to
1466 pcre_exec() by the caller specifically so that it can be
1467 passed back in callouts. It is passed in the callout_data
1468 field of the pcre_extra data structure. If no such data was
1469 passed, the value of callout_data in a pcre_callout block is
1470 NULL. There is a description of the pcre_extra structure in
1471 the pcreapi documentation.
1477 The callout function returns an integer. If the value is
1478 zero, matching proceeds as normal. If the value is greater
1479 than zero, matching fails at the current point, but back-
1480 tracking to test other possibilities goes ahead, just as if
1481 a lookahead assertion had failed. If the value is less than
1482 zero, the match is abandoned, and pcre_exec() returns the
1483 value.
1485 Negative values should normally be chosen from the set of
1486 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH
1487 forces a standard "no match" failure. The error number
1488 PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1489 it will never be used by PCRE itself.
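The three kinds of return value can be seen in a small callout function. This is a sketch under stated assumptions, not code from PCRE: callout_block_mirror is a local copy of the fields listed earlier (real code would use the block type from pcre.h), SKETCH_ERROR_CALLOUT is a local stand-in for PCRE_ERROR_CALLOUT, and callout_policy is a hypothetical structure that the caller would pass through the callout_data field.

```c
#include <stddef.h>

/* Local mirror of the callout block fields listed in this document.
   ASSUMPTION: real code would include pcre.h and use its block type. */
typedef struct {
    int version;
    int callout_number;
    int *offset_vector;
    const char *subject;
    int subject_length;
    int start_match;
    int current_position;
    int capture_top;
    int capture_last;
    void *callout_data;
} callout_block_mirror;

#define SKETCH_ERROR_CALLOUT (-9)   /* stand-in for PCRE_ERROR_CALLOUT */

/* Hypothetical per-match policy, reached through callout_data. */
typedef struct {
    int max_calls;   /* abandon the match after this many callouts */
    int calls;       /* callouts seen so far */
    int max_start;   /* reject attempts starting beyond this offset */
} callout_policy;

static int sketch_callout(callout_block_mirror *block)
{
    callout_policy *policy = (callout_policy *)block->callout_data;

    if (policy == NULL)
        return 0;                      /* no data passed: carry on */
    if (++policy->calls > policy->max_calls)
        return SKETCH_ERROR_CALLOUT;   /* negative: abandon the match */
    if (block->start_match > policy->max_start)
        return 1;                      /* positive: fail here, backtrack */
    return 0;                          /* zero: matching proceeds as normal */
}
```

In a real program a function of this shape, taking the library's own block type, would be assigned to the global variable pcre_callout before matching, and the policy pointer would be supplied for each match in a pcre_extra block.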
1491 Last updated: 21 January 2003
1492 Copyright (c) 1997-2003 University of Cambridge.
1493 -----------------------------------------------------------------------------
1495 NAME
1496 PCRE - Perl-compatible regular expressions
1501 This document describes the differences in the ways that
1502 PCRE and Perl handle regular expressions. The differences
1503 described here are with respect to Perl 5.8.
1505 1. PCRE does not allow repeat quantifiers on lookahead
1506 assertions. Perl permits them, but they do not mean what you
1507 might think. For example, (?!a){3} does not assert that the
1508 next three characters are not "a". It just asserts that the
1509 next character is not "a" three times.
1511 2. Capturing subpatterns that occur inside negative looka-
1512 head assertions are counted, but their entries in the
1513 offsets vector are never set. Perl sets its numerical vari-
1514 ables from any such patterns that are matched before the
1515 assertion fails to match something (thereby succeeding), but
1516 only if the negative lookahead assertion contains just one
1517 branch.
1519 3. Though binary zero characters are supported in the sub-
1520 ject string, they are not allowed in a pattern string
1521 because it is passed as a normal C string, terminated by
1522 zero. The escape sequence "\0" can be used in the pattern to
1523 represent a binary zero.
1525 4. The following Perl escape sequences are not supported:
1526 \l, \u, \L, \U, \P, \p, and \X. In fact these are imple-
1527 mented by Perl's general string-handling and are not part of
1528 its pattern matching engine. If any of these are encountered
1529 by PCRE, an error is generated.
1531 5. PCRE does support the \Q...\E escape for quoting sub-
1532 strings. Characters in between are treated as literals. This
1533 is slightly different from Perl in that $ and @ are also
1534 handled as literals inside the quotes. In Perl, they cause
1535 variable interpolation (but of course PCRE does not have
1536 variables). Note the following examples:
1538 Pattern PCRE matches Perl matches
1540 \Qabc$xyz\E abc$xyz abc followed by the
1541 contents of $xyz
1542 \Qabc\$xyz\E abc\$xyz abc\$xyz
1543 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
1545 In PCRE, the \Q...\E mechanism is not recognized inside a
1546 character class.
1548 8. Fairly obviously, PCRE does not support the (?{code}) and
1549 (?p{code}) constructions. However, there is some experimen-
1550 tal support for recursive patterns using the non-Perl items
1551 (?R), (?number) and (?P>name). Also, the PCRE "callout"
1552 feature allows an external function to be called during pat-
1553 tern matching.
1555 9. There are some differences that are concerned with the
1556 settings of captured strings when part of a pattern is
1557 repeated. For example, matching "aba" against the pattern
1558 /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set
1559 to "b".
1561 10. PCRE provides some extensions to the Perl regular
1562 expression facilities:
1564 (a) Although lookbehind assertions must match fixed length
1565 strings, each alternative branch of a lookbehind assertion
1566 can match a different length of string. Perl requires them
1567 all to have the same length.
1569 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1570 set, the $ meta-character matches only at the very end of
1571 the string.
1573 (c) If PCRE_EXTRA is set, a backslash followed by a letter
1574 with no special meaning is faulted.
1576 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
1577 tion quantifiers is inverted, that is, by default they are
1578 not greedy, but if followed by a question mark they are.
1580 (e) PCRE_ANCHORED can be used to force a pattern to be tried
1581 only at the first matching position in the subject string.
1583 (f) PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and
1584 PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl
1585 equivalents.
1587 (g) The (?R), (?number), and (?P>name) constructs allow for
1588 recursive pattern matching (Perl can do this using the
1589 (?p{code}) construct, which PCRE cannot support.)
1591 (h) PCRE supports named capturing substrings, using the
1592 Python syntax.
1594 (i) PCRE supports the possessive quantifier "++" syntax,
1595 taken from Sun's Java package.
1597 (j) The (R) condition, for testing recursion, is a PCRE
1598 extension.
1600 (k) The callout facility is PCRE-specific.
1602 Last updated: 03 February 2003
1603 Copyright (c) 1997-2003 University of Cambridge.
1604 -----------------------------------------------------------------------------
1606 NAME
1607 PCRE - Perl-compatible regular expressions
1612 The syntax and semantics of the regular expressions sup-
1613 ported by PCRE are described below. Regular expressions are
1614 also described in the Perl documentation and in a number of
1615 other books, some of which have copious examples. Jeffrey
1616 Friedl's "Mastering Regular Expressions", published by
1617 O'Reilly, covers them in great detail. The description here
1618 is intended as reference documentation.
1620 The basic operation of PCRE is on strings of bytes. However,
1621 there is also support for UTF-8 character strings. To use
1622 this support you must build PCRE to include UTF-8 support,
1623 and then call pcre_compile() with the PCRE_UTF8 option. How
1624 this affects the pattern matching is mentioned in several
1625 places below. There is also a summary of UTF-8 features in
1626 the section on UTF-8 support in the main pcre page.
1628 A regular expression is a pattern that is matched against a
1629 subject string from left to right. Most characters stand for
1630 themselves in a pattern, and match the corresponding charac-
1631 ters in the subject. As a trivial example, the pattern
1633 The quick brown fox
1635 matches a portion of a subject string that is identical to
1636 itself. The power of regular expressions comes from the
1637 ability to include alternatives and repetitions in the pat-
1638 tern. These are encoded in the pattern by the use of meta-
1639 characters, which do not stand for themselves but instead
1640 are interpreted in some special way.
1642 There are two different sets of meta-characters: those that
1643 are recognized anywhere in the pattern except within square
1644 brackets, and those that are recognized in square brackets.
1645 Outside square brackets, the meta-characters are as follows:
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
1663 Part of a pattern that is in square brackets is called a
1664 "character class". In a character class the only meta-
1665 characters are:
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  [      POSIX character class (only if followed by POSIX
         syntax)
  ]      terminates the character class
1674 The following sections describe the use of each of the
1675 meta-characters.
1680 The backslash character has several uses. Firstly, if it is
1681 followed by a non-alphameric character, it takes away any
1682 special meaning that character may have. This use of
1683 backslash as an escape character applies both inside and
1684 outside character classes.
1686 For example, if you want to match a * character, you write
1687 \* in the pattern. This escaping action applies whether or
1688 not the following character would otherwise be interpreted
1689 as a meta-character, so it is always safe to precede a non-
1690 alphameric with backslash to specify that it stands for
1691 itself. In particular, if you want to match a backslash, you
1692 write \\.
1694 If a pattern is compiled with the PCRE_EXTENDED option, whi-
1695 tespace in the pattern (other than in a character class) and
1696 characters between a # outside a character class and the
1697 next newline character are ignored. An escaping backslash
1698 can be used to include a whitespace or # character as part
1699 of the pattern.
1701 If you want to remove the special meaning from a sequence of
1702 characters, you can do so by putting them between \Q and \E.
1703 This is different from Perl in that $ and @ are handled as
1704 literals in \Q...\E sequences in PCRE, whereas in Perl, $
1705 and @ cause variable interpolation. Note the following exam-
1706 ples:
  Pattern            PCRE matches   Perl matches
  \Qabc$xyz\E        abc$xyz        abc followed by the
                                    contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1716 The \Q...\E sequence is recognized both inside and outside
1717 character classes.
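The PCRE column of the table above can be checked with a small sketch. It is written in Python, whose re module has no \Q...\E, so the hypothetical helper expand_quoted() rewrites each quoted run with re.escape() before compiling. This is a naive illustration only: it does not handle a \Q that is itself preceded by an escaping backslash, and it says nothing about how PCRE implements the feature.

```python
import re


def expand_quoted(pattern):
    """Replace each \\Q...\\E run with an escaped literal (hypothetical helper)."""
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            # No further quoted run: keep the rest of the pattern as it is.
            out.append(pattern[pos:])
            break
        out.append(pattern[pos:start])
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # An unterminated \Q quotes everything up to the end of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)


def pcre_style_match(pattern, subject):
    """True if the expanded pattern matches the whole subject."""
    return re.fullmatch(expand_quoted(pattern), subject) is not None
```

With this helper, pcre_style_match(r"\Qabc$xyz\E", "abc$xyz") is true, because $ is treated as a literal inside the quoted run.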
1719 A second use of backslash provides a way of encoding non-
1720 printing characters in patterns in a visible manner. There
1721 is no restriction on the appearance of non-printing charac-
1722 ters, apart from the binary zero that terminates a pattern,
1723 but when a pattern is being prepared by text editing, it is
1724 usually easier to use one of the following escape sequences
1725 than the binary character it represents:
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh... (UTF-8 mode only)
1738 The precise effect of \cx is as follows: if x is a lower
1739 case letter, it is converted to upper case. Then bit 6 of
1740 the character (hex 40) is inverted. Thus \cz becomes hex
1741 1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
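The \cx rule is simple arithmetic, and can be sketched as follows. The function name control_escape is hypothetical and is not part of PCRE; the code only restates the two steps given above (fold lower case to upper case, then invert the hex 40 bit).

```python
def control_escape(x):
    # Step 1: a lower case letter is converted to upper case.
    code = ord(x)
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40
```

For instance, control_escape("z") gives hex 1A, while "{" and ";" map onto each other's codes because they differ only in the hex 40 bit.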
1743 After \x, from zero to two hexadecimal digits are read
1744 (letters can be in upper or lower case). In UTF-8 mode, any
1745 number of hexadecimal digits may appear between \x{ and },
1746 but the value of the character code must be less than 2**31
1747 (that is, the maximum hexadecimal value is 7FFFFFFF). If
1748 characters other than hexadecimal digits appear between \x{
1749 and }, or if there is no terminating }, this form of escape
1750 is not recognized. Instead, the initial \x will be inter-
1751 preted as a basic hexadecimal escape, with no following
1752 digits, giving a byte whose value is zero.
1754 Characters whose value is less than 256 can be defined by
1755 either of the two syntaxes for \x when PCRE is in UTF-8
1756 mode. There is no difference in the way they are handled.
1757 For example, \xdc is exactly the same as \x{dc}.
1759 After \0 up to two further octal digits are read. In both
1760 cases, if there are fewer than two digits, just those that
1761 are present are used. Thus the sequence \0\x\07 specifies
1762 two binary zeros followed by a BEL character (code value 7).
1763 Make sure you supply two digits after the initial zero if
1764 the character that follows is itself an octal digit.
1766 The handling of a backslash followed by a digit other than 0
1767 is complicated. Outside a character class, PCRE reads it
1768 and any following digits as a decimal number. If the number
1769 is less than 10, or if there have been at least that many
1770 previous capturing left parentheses in the expression, the
1771 entire sequence is taken as a back reference. A description
1772 of how this works is given later, following the discussion
1773 of parenthesized subpatterns.
1775 Inside a character class, or if the decimal number is
1776 greater than 9 and there have not been that many capturing
1777 subpatterns, PCRE re-reads up to three octal digits follow-
1778 ing the backslash, and generates a single byte from the
1779 least significant 8 bits of the value. Any subsequent digits
1780 stand for themselves. For example:
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
         previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
         writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   might be a back reference, otherwise the
         character with octal code 113
  \377   might be a back reference, otherwise
         the byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
         followed by the two characters "8" and "1"
1797 Note that octal values of 100 or greater must not be intro-
1798 duced by a leading zero, because no more than three octal
1799 digits are ever read.
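The decision between a back reference and an octal byte can be restated as a short sketch. The function interpret_backslash_digits and its arguments are hypothetical; they model only the rules described above, with the caller supplying the digits that follow the backslash and the number of capturing left parentheses seen so far. It is not PCRE's own code.

```python
def interpret_backslash_digits(digits, captures_so_far, in_class=False):
    """Return ("backref", n, "") or ("byte", value, literal_rest)."""
    # Outside a class, a sequence not starting with 0 is first read as a
    # decimal number and may be a back reference.
    if not in_class and digits[0] != "0":
        number = int(digits)
        if number < 10 or number <= captures_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits; anything after them
    # stands for itself.
    count = 0
    while count < len(digits) and count < 3 and digits[count] in "01234567":
        count += 1
    # With no octal digits at all (as in \81) the result is a binary zero.
    value = (int(digits[:count], 8) & 0xFF) if count else 0
    return ("byte", value, digits[count:])
```

Called as interpret_backslash_digits("11", 0) this yields a tab (byte 9), whereas with 11 or more previous capturing subpatterns the same digits yield back reference 11.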
1801 All the sequences that define a single byte value or a sin-
1802 gle UTF-8 character (in UTF-8 mode) can be used both inside
1803 and outside character classes. In addition, inside a charac-
1804 ter class, the sequence \b is interpreted as the backspace
1805 character (hex 08). Outside a character class it has a dif-
1806 ferent meaning (see below).
1808 The third use of backslash is for specifying generic charac-
1809 ter types:
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character
1818 Each pair of escape sequences partitions the complete set of
1819 characters into two disjoint sets. Any given character
1820 matches one, and only one, of each pair.
1822 In UTF-8 mode, characters with values greater than 255 never
1823 match \d, \s, or \w, and always match \D, \S, and \W.
For compatibility with Perl, \s does not match the VT char-
acter (code 11). This makes it different from the POSIX
"space" class. The \s characters are HT (9), LF (10), FF
(12), CR (13), and space (32).
1830 A "word" character is any letter or digit or the underscore
1831 character, that is, any character which can be part of a
1832 Perl "word". The definition of letters and digits is con-
1833 trolled by PCRE's character tables, and may vary if locale-
1834 specific matching is taking place (see "Locale support" in
1835 the pcreapi page). For example, in the "fr" (French) locale,
1836 some character codes greater than 128 are used for accented
1837 letters, and these are matched by \w.
1839 These character type sequences can appear both inside and
1840 outside character classes. They each match one character of
1841 the appropriate type. If the current matching point is at
1842 the end of the subject string, all of them fail, since there
1843 is no character to match.
1845 The fourth use of backslash is for certain simple asser-
1846 tions. An assertion specifies a condition that has to be met
1847 at a particular point in a match, without consuming any
1848 characters from the subject string. The use of subpatterns
1849 for more complicated assertions is described below. The
1850 backslashed assertions are
  \b     matches at a word boundary
  \B     matches when not at a word boundary
  \A     matches at start of subject
  \Z     matches at end of subject or before newline at end
  \z     matches at end of subject
  \G     matches at first matching position in subject
1859 These assertions may not appear in character classes (but
1860 note that \b has a different meaning, namely the backspace
1861 character, inside a character class).
1863 A word boundary is a position in the subject string where
1864 the current character and the previous character do not both
1865 match \w or \W (i.e. one matches \w and the other matches
1866 \W), or the start or end of the string if the first or last
1867 character matches \w, respectively.
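The word boundary definition can be written out directly. In the sketch below, is_word_boundary and boundaries are hypothetical names, and WORD_CHARS assumes the default character tables (ASCII letters, digits, and underscore); a locale-specific table would change that set. The comparison with Python's re module in ASCII mode is a cross-check against a different engine, not a statement about PCRE internals.

```python
import re
import string

# Assumed "word" characters for the default (non-locale) tables.
WORD_CHARS = set(string.ascii_letters + string.digits + "_")


def is_word_boundary(subject, pos):
    # The character before pos and the character at pos must differ in
    # "wordness"; a missing character at either end counts as non-word.
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    return before != after


def boundaries(subject):
    """All positions in subject at which \\b would succeed."""
    return [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]


def re_boundaries(subject):
    """The same positions as reported by Python's re in ASCII mode."""
    return [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
```

For the subject "ab cd" both functions report positions 0, 2, 3, and 5.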
1868 The \A, \Z, and \z assertions differ from the traditional
1869 circumflex and dollar (described below) in that they only
1870 ever match at the very start and end of the subject string,
1871 whatever options are set. Thus, they are independent of mul-
1872 tiline mode.
1874 They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL
1875 options. If the startoffset argument of pcre_exec() is non-
1876 zero, indicating that matching is to start at a point other
1877 than the beginning of the subject, \A can never match. The
1878 difference between \Z and \z is that \Z matches before a
1879 newline that is the last character of the string as well as
1880 at the end of the string, whereas \z matches only at the
1881 end.
1883 The \G assertion is true only when the current matching
1884 position is at the start point of the match, as specified by
1885 the startoffset argument of pcre_exec(). It differs from \A
1886 when the value of startoffset is non-zero. By calling
1887 pcre_exec() multiple times with appropriate arguments, you
1888 can mimic Perl's /g option, and it is in this kind of imple-
1889 mentation where \G can be useful.
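The shape of such a repeated-call loop is sketched below in Python rather than C. Python's re module has no \G; as an assumption for illustration, Pattern.match(subject, pos), which tries the pattern only at pos, plays the role that a leading \G with a non-zero startoffset plays in PCRE. The names TOKEN and tokens are hypothetical.

```python
import re

# Each alternative must match at least one character, so the loop below
# always advances.
TOKEN = re.compile(r"\d+|[a-z]+|\s+")


def tokens(subject):
    """Collect consecutive matches, each starting where the last one ended."""
    found = []
    pos = 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)  # tried at pos only, like \G
        if m is None:
            break  # no match at the current start point: stop scanning
        found.append(m.group())
        pos = m.end()  # the next call starts at the end of this match
    return found
```

The loop stops at the first position where no token matches, instead of skipping ahead to a later match as an unanchored search would.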
1891 Note, however, that PCRE's interpretation of \G, as the
1892 start of the current match, is subtly different from Perl's,
1893 which defines it as the end of the previous match. In Perl,
1894 these can be different when the previously matched string
1895 was empty. Because PCRE does just one match at a time, it
1896 cannot reproduce this behaviour.
1898 If all the alternatives of a pattern begin with \G, the
1899 expression is anchored to the starting match position, and
1900 the "anchored" flag is set in the compiled regular expres-
1901 sion.
1906 Outside a character class, in the default matching mode, the
1907 circumflex character is an assertion which is true only if
1908 the current matching point is at the start of the subject
1909 string. If the startoffset argument of pcre_exec() is non-
1910 zero, circumflex can never match if the PCRE_MULTILINE
1911 option is unset. Inside a character class, circumflex has an
1912 entirely different meaning (see below).
1914 Circumflex need not be the first character of the pattern if
1915 a number of alternatives are involved, but it should be the
1916 first thing in each alternative in which it appears if the
1917 pattern is ever to match that branch. If all possible alter-
1918 natives start with a circumflex, that is, if the pattern is
1919 constrained to match only at the start of the subject, it is
1920 said to be an "anchored" pattern. (There are also other con-
1921 structs that can cause a pattern to be anchored.)
1923 A dollar character is an assertion which is true only if the
1924 current matching point is at the end of the subject string,
1925 or immediately before a newline character that is the last
1926 character in the string (by default). Dollar need not be the
1927 last character of the pattern if a number of alternatives
1928 are involved, but it should be the last item in any branch
1929 in which it appears. Dollar has no special meaning in a
1930 character class.
1932 The meaning of dollar can be changed so that it matches only
1933 at the very end of the string, by setting the
1934 PCRE_DOLLAR_ENDONLY option at compile time. This does not
1935 affect the \Z assertion.
1937 The meanings of the circumflex and dollar characters are
1938 changed if the PCRE_MULTILINE option is set. When this is
1939 the case, they match immediately after and immediately
1940 before an internal newline character, respectively, in addi-
1941 tion to matching at the start and end of the subject string.
1942 For example, the pattern /^abc$/ matches the subject string
1943 "def\nabc" in multiline mode, but not otherwise. Conse-
1944 quently, patterns that are anchored in single line mode
1945 because all branches start with ^ are not anchored in multi-
1946 line mode, and a match for circumflex is possible when the
1947 startoffset argument of pcre_exec() is non-zero. The
1948 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1949 set.
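The example above can be tried in any engine with Perl-style multiline semantics. The sketch below uses Python's re module as a stand-in, on the assumption that re.MULTILINE corresponds to PCRE_MULTILINE and that the pos argument of Pattern.search() corresponds to startoffset; PCRE_DOLLAR_ENDONLY has no counterpart in Python and is not shown.

```python
import re

subject = "def\nabc"

# Default mode: ^ is true only at the very start of the subject.
single_line = re.search(r"^abc$", subject)

# Multiline mode: ^ and $ also match after and before an internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, $ also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")

# Starting the search at offset 4: ^ cannot match in default mode,
# but it can in multiline mode because offset 4 follows a newline.
offset_single = re.compile(r"^abc").search(subject, 4)
offset_multi = re.compile(r"^abc", re.MULTILINE).search(subject, 4)
```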
1951 Note that the sequences \A, \Z, and \z can be used to match
1952 the start and end of the subject in both modes, and if all
1953 branches of a pattern start with \A it is always anchored,
1954 whether PCRE_MULTILINE is set or not.
1959 Outside a character class, a dot in the pattern matches any
1960 one character in the subject, including a non-printing char-
1961 acter, but not (by default) newline. In UTF-8 mode, a dot
1962 matches any UTF-8 character, which might be more than one
1963 byte long, except (by default) for newline. If the
1964 PCRE_DOTALL option is set, dots match newlines as well. The
1965 handling of dot is entirely independent of the handling of
1966 circumflex and dollar, the only relationship being that they
1967 both involve newline characters. Dot has no special meaning
1968 in a character class.
1974 Outside a character class, the escape sequence \C matches
1975 any one byte, both in and out of UTF-8 mode. Unlike a dot,
1976 it always matches a newline. The feature is provided in Perl
1977 in order to match individual bytes in UTF-8 mode. Because
1978 it breaks up UTF-8 characters into individual bytes, what
1979 remains in the string may be a malformed UTF-8 string. For
1980 this reason it is best avoided.
1982 PCRE does not allow \C to appear in lookbehind assertions
1983 (see below), because in UTF-8 mode it makes it impossible to
1984 calculate the length of the lookbehind.
1989 An opening square bracket introduces a character class, ter-
1990 minated by a closing square bracket. A closing square
1991 bracket on its own is not special. If a closing square
1992 bracket is required as a member of the class, it should be
1993 the first data character in the class (after an initial cir-
1994 cumflex, if present) or escaped with a backslash.
1996 A character class matches a single character in the subject.
1997 In UTF-8 mode, the character may occupy more than one byte.
1998 A matched character must be in the set of characters defined
1999 by the class, unless the first character in the class defin-
2000 ition is a circumflex, in which case the subject character
2001 must not be in the set defined by the class. If a circumflex
2002 is actually required as a member of the class, ensure it is
2003 not the first character, or escape it with a backslash.
2005 For example, the character class [aeiou] matches any lower
2006 case vowel, while [^aeiou] matches any character that is not
2007 a lower case vowel. Note that a circumflex is just a con-
2008 venient notation for specifying the characters which are in
2009 the class by enumerating those that are not. It is not an
2010 assertion: it still consumes a character from the subject
2011 string, and fails if the current pointer is at the end of
2012 the string.
2014 In UTF-8 mode, characters with values greater than 255 can
2015 be included in a class as a literal string of bytes, or by
2016 using the \x{ escaping mechanism.
2018 When caseless matching is set, any letters in a class
2019 represent both their upper case and lower case versions, so
2020 for example, a caseless [aeiou] matches "A" as well as "a",
2021 and a caseless [^aeiou] does not match "A", whereas a case-
2022 ful version would. PCRE does not support the concept of case
2023 for characters with values greater than 255.
2024 The newline character is never treated in any special way in
2025 character classes, whatever the setting of the PCRE_DOTALL
2026 or PCRE_MULTILINE options is. A class such as [^a] will
2027 always match a newline.
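A few of the class rules above, tried with Python's re module as a stand-in for PCRE. Treating re.IGNORECASE as the equivalent of caseless matching is an assumption of this sketch, though the two agree for the ASCII letters used here.

```python
import re

# A negated class still consumes one character, and that character may
# be a newline whatever the dot or multiline settings are.
newline_hit = re.fullmatch(r"[^a]", "\n")

# Caseless matching applies to the letters listed in a class.
caseless_vowel = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)
caseful_negated = re.fullmatch(r"[^aeiou]", "A")

# A negated class is not an assertion: with no character left to
# consume at the end of the subject, it fails.
at_end = re.compile(r"[^a]").match("b", 1)
```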
2029 The minus (hyphen) character can be used to specify a range
2030 of characters in a character class. For example, [d-m]
2031 matches any letter between d and m, inclusive. If a minus
2032 character is required in a class, it must be escaped with a
2033 backslash or appear in a position where it cannot be inter-
2034 preted as indicating a range, typically as the first or last
2035 character in the class.
2037 It is not possible to have the literal character "]" as the
2038 end character of a range. A pattern such as [W-]46] is
2039 interpreted as a class of two characters ("W" and "-") fol-
2040 lowed by a literal string "46]", so it would match "W46]" or
2041 "-46]". However, if the "]" is escaped with a backslash it
2042 is interpreted as the end of range, so [W-\]46] is inter-
2043 preted as a single class containing a range followed by two
2044 separate characters. The octal or hexadecimal representation
2045 of "]" can also be used to end a range.
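The two readings of the bracket can be compared directly. The sketch uses Python's re module as a stand-in; it parses both patterns in the way described above, but that agreement is an observation about these two patterns only.

```python
import re

# "[W-]" is a class of two characters, followed by the literal "46]".
two_char_class = re.compile(r"[W-]46]")

# With the "]" escaped, the whole thing is one class: the range W to ]
# plus the separate characters "4" and "6".
single_class = re.compile(r"[W-\]46]")


def matches_all(pattern, subject):
    """True if pattern matches the entire subject."""
    return pattern.fullmatch(subject) is not None
```

Here "Z" is accepted by single_class because it lies between "W" and "]" in the collating sequence, while "-" is not.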
2047 Ranges operate in the collating sequence of character
2048 values. They can also be used for characters specified
2049 numerically, for example [\000-\037]. In UTF-8 mode, ranges
2050 can include characters whose values are greater than 255,
2051 for example [\x{100}-\x{2ff}].
If a range that includes letters is used when caseless
matching is set, it matches the letters in either case. For
example, [W-c] is equivalent to [][\\^_`wxyzabc], matched
caselessly, and if character tables for the "fr" locale are
in use, [\xc8-\xcb] matches accented E characters in both
cases.
2060 The character types \d, \D, \s, \S, \w, and \W may also
2061 appear in a character class, and add the characters that
2062 they match to the class. For example, [\dABCDEF] matches any
2063 hexadecimal digit. A circumflex can conveniently be used
2064 with the upper case character types to specify a more res-
2065 tricted set of characters than the matching lower case type.
2066 For example, the class [^\W_] matches any letter or digit,
2067 but not underscore.
2069 All non-alphameric characters other than \, -, ^ (at the
2070 start) and the terminating ] are non-special in character
2071 classes, but it does no harm if they are escaped.
2076 Perl supports the POSIX notation for character classes,
2077 which uses names enclosed by [: and :] within the enclosing
2078 square brackets. PCRE also supports this notation. For exam-
2079 ple,
2081 [01[:alpha:]%]
2083 matches "0", "1", any alphabetic character, or "%". The sup-
2084 ported class names are
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  blank    space or tab only
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (not quite the same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
2101 The "space" characters are HT (9), LF (10), VT (11), FF
2102 (12), CR (13), and space (32). Notice that this list
2103 includes the VT character (code 11). This makes "space" dif-
2104 ferent to \s, which does not include VT (for Perl compati-
2105 bility).
2107 The name "word" is a Perl extension, and "blank" is a GNU
2108 extension from Perl 5.8. Another Perl extension is negation,
2109 which is indicated by a ^ character after the colon. For
2110 example,
2112 [12[:^digit:]]
2114 matches "1", "2", or any non-digit. PCRE (and Perl) also
2115 recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2116 "collating element", but these are not supported, and an
2117 error is given if they are encountered.
2119 In UTF-8 mode, characters with values greater than 255 do
2120 not match any of the POSIX character classes.
2125 Vertical bar characters are used to separate alternative
2126 patterns. For example, the pattern
2128 gilbert|sullivan
2130 matches either "gilbert" or "sullivan". Any number of alter-
2131 natives may appear, and an empty alternative is permitted
2132 (matching the empty string). The matching process tries
2133 each alternative in turn, from left to right, and the first
2134 one that succeeds is used. If the alternatives are within a
2135 subpattern (defined below), "succeeds" means matching the
2136 rest of the main pattern as well as the alternative in the
2137 subpattern.
2142 The settings of the PCRE_CASELESS, PCRE_MULTILINE,
2143 PCRE_DOTALL, and PCRE_EXTENDED options can be changed from
2144 within the pattern by a sequence of Perl option letters
2145 enclosed between "(?" and ")". The option letters are
  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED
2152 For example, (?im) sets caseless, multiline matching. It is
2153 also possible to unset these options by preceding the letter
2154 with a hyphen, and a combined setting and unsetting such as
2155 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
2156 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
2157 If a letter appears both before and after the hyphen, the
2158 option is unset.
2160 When an option change occurs at top level (that is, not
2161 inside subpattern parentheses), the change applies to the
2162 remainder of the pattern that follows. If the change is
2163 placed right at the start of a pattern, PCRE extracts it
2164 into the global options (and it will therefore show up in
2165 data extracted by the pcre_fullinfo() function).
2167 An option change within a subpattern affects only that part
2168 of the current pattern that follows it, so
2170 (a(?i)b)c
2172 matches abc and aBc and no other strings (assuming
2173 PCRE_CASELESS is not used). By this means, options can be
2174 made to have different settings in different parts of the
2175 pattern. Any changes made in one alternative do carry on
2176 into subsequent branches within the same subpattern. For
2177 example,
2179 (a(?i)b|c)
2181 matches "ab", "aB", "c", and "C", even though when matching
2182 "C" the first branch is abandoned before the option setting.
2183 This is because the effects of option settings happen at
2184 compile time. There would be some very weird behaviour oth-
2185 erwise.
2187 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
2188 be changed in the same way as the Perl-compatible options by
2189 using the characters U and X respectively. The (?X) flag
2190 setting is special in that it must always occur earlier in
2191 the pattern than any of the additional features it turns on,
2192 even when it is at top level. It is best put at the start.
2197 Subpatterns are delimited by parentheses (round brackets),
2198 which can be nested. Marking part of a pattern as a subpat-
2199 tern does two things:
2201 1. It localizes a set of alternatives. For example, the pat-
2202 tern
2204 cat(aract|erpillar|)
2206 matches one of the words "cat", "cataract", or "caterpil-
2207 lar". Without the parentheses, it would match "cataract",
2208 "erpillar" or the empty string.
2210 2. It sets up the subpattern as a capturing subpattern (as
2211 defined above). When the whole pattern matches, that por-
2212 tion of the subject string that matched the subpattern is
2213 passed back to the caller via the ovector argument of
2214 pcre_exec(). Opening parentheses are counted from left to
2215 right (starting from 1) to obtain the numbers of the captur-
2216 ing subpatterns.
2218 For example, if the string "the red king" is matched against
2219 the pattern
2221 the ((red|white) (king|queen))
2223 the captured substrings are "red king", "red", and "king",
2224 and are numbered 1, 2, and 3, respectively.
2226 The fact that plain parentheses fulfil two functions is not
2227 always helpful. There are often times when a grouping sub-
2228 pattern is required without a capturing requirement. If an
2229 opening parenthesis is followed by a question mark and a
2230 colon, the subpattern does not do any capturing, and is not
2231 counted when computing the number of any subsequent captur-
2232 ing subpatterns. For example, if the string "the white
2233 queen" is matched against the pattern
2235 the ((?:red|white) (king|queen))
2237 the captured substrings are "white queen" and "queen", and
2238 are numbered 1 and 2. The maximum number of capturing sub-
2239 patterns is 65535, and the maximum depth of nesting of all
2240 subpatterns, both capturing and non-capturing, is 200.
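The numbering of capturing and non-capturing parentheses can be observed with any engine that follows the Perl rules. The sketch below uses Python's re module as a stand-in; Match.groups() returns the substrings that PCRE would report through the ovector, in group-number order.

```python
import re

# Three capturing groups, numbered by their opening parentheses.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

captured_all = capturing.groups()
captured_some = non_capturing.groups()
```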
2242 As a convenient shorthand, if any option settings are
2243 required at the start of a non-capturing subpattern, the
2244 option letters may appear between the "?" and the ":". Thus
2245 the two patterns
2247 (?i:saturday|sunday)
2248 (?:(?i)saturday|sunday)
2250 match exactly the same set of strings. Because alternative
2251 branches are tried from left to right, and options are not
2252 reset until the end of the subpattern is reached, an option
2253 setting in one branch does affect subsequent branches, so
2254 the above patterns match "SUNDAY" as well as "Saturday".
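The first of the two forms can be tried in Python's re module, used here as a stand-in for PCRE. Only the (?i:...) form is shown, because recent Python versions reject an option setting such as (?i) that is not at the very start of the pattern; that restriction belongs to Python, not to PCRE.

```python
import re

# The option letter sits between "?" and ":", so caseless matching
# applies to both alternatives inside the group and to nothing outside.
weekend = re.compile(r"(?i:saturday|sunday) morning")


def is_weekend_morning(text):
    """True if text is a weekend day name (any case) followed by ' morning'."""
    return weekend.fullmatch(text) is not None
```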
2259 Identifying capturing parentheses by number is simple, but
2260 it can be very hard to keep track of the numbers in compli-
2261 cated regular expressions. Furthermore, if an expression is
2262 modified, the numbers may change. To help with the diffi-
2263 culty, PCRE supports the naming of subpatterns, something
2264 that Perl does not provide. The Python syntax (?P<name>...)
2265 is used. Names consist of alphanumeric characters and under-
2266 scores, and must be unique within a pattern.
2268 Named capturing parentheses are still allocated numbers as
2269 well as names. The PCRE API provides function calls for
2270 extracting the name-to-number translation table from a com-
2271 piled pattern. For further details see the pcreapi documen-
2272 tation.
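Since the syntax is taken from Python, the behaviour can be illustrated with Python's own re module. In this sketch the groupindex attribute stands in for the name-to-number table that the PCRE API exposes; the pattern and group names are made up for the example.

```python
import re

# Two named capturing groups; each also receives the next free number.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

m = date.match("2003-02")

# Name-to-number translation table for the compiled pattern.
name_to_number = dict(date.groupindex)
```

A captured substring can then be fetched either by name or by number, for example m.group("year") or m.group(1).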
2277 Repetition is specified by quantifiers, which can follow any
2278 of the following items:
2280 a literal data character
2281 the . metacharacter
2282 the \C escape sequence
2283 escapes such as \d that match single characters
2284 a character class
2285 a back reference (see next section)
2286 a parenthesized subpattern (unless it is an assertion)
2288 The general repetition quantifier specifies a minimum and
2289 maximum number of permitted matches, by giving the two
2290 numbers in curly brackets (braces), separated by a comma.
2291 The numbers must be less than 65536, and the first must be
2292 less than or equal to the second. For example:
2294 z{2,4}
2296 matches "zz", "zzz", or "zzzz". A closing brace on its own
2297 is not a special character. If the second number is omitted,
2298 but the comma is present, there is no upper limit; if the
2299 second number and the comma are both omitted, the quantifier
2300 specifies an exact number of required matches. Thus
2302 [aeiou]{3,}
2304 matches at least 3 successive vowels, but may match many
2305 more, while
2307 \d{8}
2309 matches exactly 8 digits. An opening curly bracket that
2310 appears in a position where a quantifier is not allowed, or
2311 one that does not match the syntax of a quantifier, is taken
2312 as a literal character. For example, {,6} is not a quantif-
2313 ier, but a literal string of four characters.
2315 In UTF-8 mode, quantifiers apply to UTF-8 characters rather
2316 than to individual bytes. Thus, for example, \x{100}{2}
2317 matches two UTF-8 characters, each of which is represented
2318 by a two-byte sequence.
2320 The quantifier {0} is permitted, causing the expression to
2321 behave as if the previous item and the quantifier were not
2322 present.
2324 For convenience (and historical compatibility) the three
2325 most common quantifiers have single-character abbreviations:
  *    is equivalent to {0,}
  +    is equivalent to {1,}
  ?    is equivalent to {0,1}
2331 It is possible to construct infinite loops by following a
2332 subpattern that can match no characters with a quantifier
2333 that has no upper limit, for example:
2335 (a?)*
2337 Earlier versions of Perl and PCRE used to give an error at
2338 compile time for such patterns. However, because there are
2339 cases where this can be useful, such patterns are now
2340 accepted, but if any repetition of the subpattern does in
2341 fact match no characters, the loop is forcibly broken.
2343 By default, the quantifiers are "greedy", that is, they
2344 match as much as possible (up to the maximum number of per-
2345 mitted times), without causing the rest of the pattern to
2346 fail. The classic example of where this gives problems is in
2347 trying to match comments in C programs. These appear between
2348 the sequences /* and */ and within the sequence, individual
2349 * and / characters may appear. An attempt to match C com-
2350 ments by applying the pattern
2352 /\*.*\*/
2354 to the string
  /* first comment */ not comment /* second comment */
2358 fails, because it matches the entire string owing to the
2359 greediness of the .* item.
2361 However, if a quantifier is followed by a question mark, it
2362 ceases to be greedy, and instead matches the minimum number
2363 of times possible, so the pattern
2365 /\*.*?\*/
2367 does the right thing with the C comments. The meaning of the
2368 various quantifiers is not otherwise changed, just the pre-
2369 ferred number of matches. Do not confuse this use of ques-
2370 tion mark with its use as a quantifier in its own right.
2371 Because it has two uses, it can sometimes appear doubled, as
2372 in
2374 \d??\d
2376 which matches one digit by preference, but can match two if
2377 that is the only way the rest of the pattern matches.
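The difference between greedy and minimizing quantifiers can be tried out in any engine that follows the Perl syntax. The sketch below uses Python's re module as a stand-in (an assumption; it is not PCRE itself, and the variable names are illustrative only).

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs on to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first */ it can reach.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the
# match (here, reaching the end of the string) requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(one_digit, two_digits)
```

The greedy form returns the whole subject as one match, while the minimizing form returns the two comments separately.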
2379 If the PCRE_UNGREEDY option is set (an option which is not
2380 available in Perl), the quantifiers are not greedy by
2381 default, but individual ones can be made greedy by following
2382 them with a question mark. In other words, it inverts the
2383 default behaviour.
2385 When a parenthesized subpattern is quantified with a minimum
2386 repeat count that is greater than 1 or with a limited max-
2387 imum, more store is required for the compiled pattern, in
2388 proportion to the size of the minimum or maximum.
2389 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
2390 option (equivalent to Perl's /s) is set, thus allowing the .
2391 to match newlines, the pattern is implicitly anchored,
2392 because whatever follows will be tried against every charac-
2393 ter position in the subject string, so there is no point in
2394 retrying the overall match at any position after the first.
2395 PCRE normally treats such a pattern as though it were pre-
2396 ceded by \A.
2398 In cases where it is known that the subject string contains
2399 no newlines, it is worth setting PCRE_DOTALL in order to
2400 obtain this optimization, or alternatively using ^ to indi-
2401 cate anchoring explicitly.
2403 However, there is one situation where the optimization can-
2404 not be used. When .* is inside capturing parentheses that
2405 are the subject of a backreference elsewhere in the pattern,
2406 a match at the start may fail, and a later one succeed. Con-
2407 sider, for example:
2409 (.*)abc\1
2411 If the subject is "xyz123abc123" the match point is the
2412 fourth character. For this reason, such a pattern is not
2413 implicitly anchored.
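The match point in this example can be checked with a short sketch. It uses Python's re module, which is assumed here to treat this pattern the same way as PCRE; note that Python reports a 0-based offset, so the fourth character is offset 3.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

start = m.start()        # 0-based offset where the match begins
captured = m.group(1)    # what (.*) held when the match succeeded

print(start, captured, m.group())
```

Attempts starting at the first three characters fail because the text after "abc" does not repeat what (.*) captured, so the match has to start later.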
2415 When a capturing subpattern is repeated, the value captured
2416 is the substring that matched the final iteration. For exam-
2417 ple, after
2419 (tweedle[dume]{3}\s*)+
2421 has matched "tweedledum tweedledee" the value of the cap-
2422 tured substring is "tweedledee". However, if there are
2423 nested capturing subpatterns, the corresponding captured
2424 values may have been set in previous iterations. For exam-
2425 ple, after
2427 /(a|(b))+/
2429 matches "aba" the value of the second captured substring is
2430 "b".
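Both captured values described above can be observed with a small sketch. It uses Python's re module, which is assumed here to keep captures across iterations in the same way; other engines may differ.

```python
import re

# The outer group is repeated; only the final iteration's text is kept.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The nested group (b) took part only in the second of three iterations,
# yet its value is still set once the whole match has finished.
m2 = re.match(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)

print(last_iteration, outer, inner)
```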
2435 With both maximizing and minimizing repetition, failure of
2436 what follows normally causes the repeated item to be re-
2437 evaluated to see if a different number of repeats allows the
2438 rest of the pattern to match. Sometimes it is useful to
2439 prevent this, either to change the nature of the match, or
2440 to cause it to fail earlier than it otherwise might, when the
2441 author of the pattern knows there is no point in carrying
2442 on.
2444 Consider, for example, the pattern \d+foo when applied to
2445 the subject line
2447 123456bar
2449 After matching all 6 digits and then failing to match "foo",
2450 the normal action of the matcher is to try again with only 5
2451 digits matching the \d+ item, and then with 4, and so on,
2452 before ultimately failing. "Atomic grouping" (a term taken
2453 from Jeffrey Friedl's book) provides the means for specify-
2454 ing that once a subpattern has matched, it is not to be re-
2455 evaluated in this way.
2457 If we use atomic grouping for the previous example, the
2458 matcher would give up immediately on failing to match "foo"
2459 the first time. The notation is a kind of special
2460 parenthesis, starting with (?> as in this example:
2462 (?>\d+)foo
2464 This kind of parenthesis "locks up" the part of the pattern
2465 it contains once it has matched, and a failure further into
2466 the pattern is prevented from backtracking into it. Back-
2467 tracking past it to previous items, however, works as nor-
2468 mal.
2470 An alternative description is that a subpattern of this type
2471 matches the string of characters that an identical stan-
2472 dalone pattern would match, if anchored at the current point
2473 in the subject string.
2475 Atomic grouping subpatterns are not capturing subpatterns.
2476 Simple cases such as the above example can be thought of as
2477 a maximizing repeat that must swallow everything it can. So,
2478 while both \d+ and \d+? are prepared to adjust the number of
2479 digits they match in order to make the rest of the pattern
2480 match, (?>\d+) can only match an entire sequence of digits.
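The "swallow everything" behaviour can be sketched in Python. The (?>...) syntax is only available in recent Python versions, so this sketch emulates an atomic group with a lookahead plus a back reference, (?=(\d+))\1, which is a swapped-in technique and not the PCRE notation. It relies on the assumption that the engine does not backtrack into a lookahead once it has succeeded.

```python
import re

# Ordinary repeat: \d+ gives back one digit so the final \d can match.
plain = re.compile(r"\d+\d")

# Emulated atomic group: the lookahead captures the whole run of
# digits, \1 then consumes exactly that run, and nothing is given back.
atomic = re.compile(r"(?=(\d+))\1\d")

plain_result = plain.search("1234")
atomic_result = atomic.search("1234")

print(plain_result.group() if plain_result else None)
print(atomic_result.group() if atomic_result else None)
```

The ordinary pattern matches all four digits, while the emulated atomic form finds no match at any starting position, because each attempt consumes the rest of the digits and leaves nothing for the trailing \d.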
2482 Atomic groups in general can of course contain arbitrarily
2483 complicated subpatterns, and can be nested. However, when
2484 the subpattern for an atomic group is just a single repeated
2485 item, as in the example above, a simpler notation, called a
2486 "possessive quantifier" can be used. This consists of an
2487 additional + character following a quantifier. Using this
2488 notation, the previous example can be rewritten as
2490 \d++foo
2492 Possessive quantifiers are always greedy; the setting of the
2493 PCRE_UNGREEDY option is ignored. They are a convenient nota-
2494 tion for the simpler forms of atomic group. However, there
2495 is no difference in the meaning or processing of a posses-
2496 sive quantifier and the equivalent atomic group.
2498 The possessive quantifier syntax is an extension to the Perl
2499 syntax. It originates in Sun's Java package.
2501 When a pattern contains an unlimited repeat inside a subpat-
2502 tern that can itself be repeated an unlimited number of
2503 times, the use of an atomic group is the only way to avoid
2504 some failing matches taking a very long time indeed. The
2505 pattern
2507 (\D+|<\d+>)*[!?]
2509 matches an unlimited number of substrings that either con-
2510 sist of non-digits, or digits enclosed in <>, followed by
2511 either ! or ?. When it matches, it runs quickly. However, if
2512 it is applied to
2514 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2516 it takes a long time before reporting failure. This is
2517 because the string can be divided between the two repeats in
2518 a large number of ways, and all have to be tried. (The exam-
2519 ple used [!?] rather than a single character at the end,
2520 because both PCRE and Perl have an optimization that allows
2521 for fast failure when a single character is used. They
2522 remember the last single character that is required for a
2523 match, and fail early if it is not present in the string.)
2524 If the pattern is changed to
2526 ((?>\D+)|<\d+>)*[!?]
2528 sequences of non-digits cannot be broken, and failure hap-
2529 pens quickly.
2534 Outside a character class, a backslash followed by a digit
2535 greater than 0 (and possibly further digits) is a back
2536 reference to a capturing subpattern earlier (that is, to its
2537 left) in the pattern, provided there have been that many
2538 previous capturing left parentheses.
2540 However, if the decimal number following the backslash is
2541 less than 10, it is always taken as a back reference, and
2542 causes an error only if there are not that many capturing
2543 left parentheses in the entire pattern. In other words, the
2544 parentheses that are referenced need not be to the left of
2545 the reference for numbers less than 10. See the section
2546 entitled "Backslash" above for further details of the han-
2547 dling of digits following a backslash.
2549 A back reference matches whatever actually matched the cap-
2550 turing subpattern in the current subject string, rather than
2551 anything matching the subpattern itself (see "Subpatterns as
2552 subroutines" below for a way of doing that). So the pattern
2554 (sens|respons)e and \1ibility
2556 matches "sense and sensibility" and "response and responsi-
2557 bility", but not "sense and responsibility". If caseful
2558 matching is in force at the time of the back reference, the
2559 case of letters is relevant. For example,
2561 ((?i)rah)\s+\1
2563 matches "rah rah" and "RAH RAH", but not "RAH rah", even
2564 though the original capturing subpattern is matched case-
2565 lessly.
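These two back reference examples can be tried in a short sketch using Python's re module as a stand-in engine (an assumption). Python does not accept (?i) in the middle of a pattern, so the scoped form (?i:rah) is used instead to make only the group caseless.

```python
import re

sense = re.compile(r"(sens|respons)e and \1ibility")
titles = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
sense_results = [bool(sense.fullmatch(t)) for t in titles]

# The group is matched caselessly, but the back reference is compared
# casefully against whatever the group actually matched.
rah = re.compile(r"((?i:rah))\s+\1")
rah_results = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]

print(sense_results)
print(rah_results)
```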
2567 Back references to named subpatterns use the Python syntax
2568 (?P=name). We could rewrite the above example as follows:
2570 (?P<p1>(?i)rah)\s+(?P=p1)
2572 There may be more than one back reference to the same sub-
2573 pattern. If a subpattern has not actually been used in a
2574 particular match, any back references to it always fail. For
2575 example, the pattern
2577 (a|(bc))\2
2579 always fails if it starts to match "a" rather than "bc".
2580 Because there may be many capturing parentheses in a pat-
2581 tern, all digits following the backslash are taken as part
2582 of a potential back reference number. If the pattern contin-
2583 ues with a digit character, some delimiter must be used to
2584 terminate the back reference. If the PCRE_EXTENDED option is
2585 set, this can be whitespace. Otherwise an empty comment can
2586 be used.
2588 A back reference that occurs inside the parentheses to which
2589 it refers fails when the subpattern is first used, so, for
2590 example, (a\1) never matches. However, such references can
2591 be useful inside repeated subpatterns. For example, the pat-
2592 tern
2594 (a|b\1)+
2596 matches any number of "a"s and also "aba", "ababbaa" etc. At
2597 each iteration of the subpattern, the back reference matches
2598 the character string corresponding to the previous itera-
2599 tion. In order for this to work, the pattern must be such
2600 that the first iteration does not need to match the back
2601 reference. This can be done using alternation, as in the
2602 example above, or by a quantifier with a minimum of zero.
2607 An assertion is a test on the characters following or
2608 preceding the current matching point that does not actually
2609 consume any characters. The simple assertions coded as \b,
2610 \B, \A, \G, \Z, \z, ^ and $ are described above. More com-
2611 plicated assertions are coded as subpatterns. There are two
2612 kinds: those that look ahead of the current position in the
2613 subject string, and those that look behind it.
2615 An assertion subpattern is matched in the normal way, except
2616 that it does not cause the current matching position to be
2617 changed. Lookahead assertions start with (?= for positive
2618 assertions and (?! for negative assertions. For example,
2620 \w+(?=;)
2622 matches a word followed by a semicolon, but does not include
2623 the semicolon in the match, and
2625 foo(?!bar)
2627 matches any occurrence of "foo" that is not followed by
2628 "bar". Note that the apparently similar pattern
2630 (?!foo)bar
2632 does not find an occurrence of "bar" that is preceded by
2633 something other than "foo"; it finds any occurrence of "bar"
2634 whatsoever, because the assertion (?!foo) is always true
2635 when the next three characters are "bar". A lookbehind
2636 assertion is needed to achieve this effect.
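The lookahead examples, and the lookbehind form that gives the intended effect, can be compared in a short sketch. Python's re module is used as a stand-in engine (an assumption), and the subject strings are made up for illustration.

```python
import re

# A word followed by a semicolon; the semicolon is not part of the match.
words = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# "foo" not followed by "bar": the first "foo" is skipped.
foo_at = re.search(r"foo(?!bar)", "foobar foobaz").start()

# (?!foo)bar still finds "bar" after "foo", because the assertion is
# tested at the point where "bar" begins, not before it.
lookahead_hit = re.search(r"(?!foo)bar", "foobar")

# The lookbehind form rejects "bar" when "foo" precedes it.
lookbehind_hit = re.search(r"(?<!foo)bar", "foobar")
lookbehind_ok = re.search(r"(?<!foo)bar", "bazbar")

print(words, foo_at, lookahead_hit, lookbehind_hit, lookbehind_ok)
```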
2638 If you want to force a matching failure at some point in a
2639 pattern, the most convenient way to do it is with (?!)
2640 because an empty string always matches, so an assertion that
2641 requires there not to be an empty string must always fail.
2643 Lookbehind assertions start with (?<= for positive asser-
2644 tions and (?<! for negative assertions. For example,
2646 (?<!foo)bar
2648 does find an occurrence of "bar" that is not preceded by
2649 "foo". The contents of a lookbehind assertion are restricted
2650 such that all the strings it matches must have a fixed
2651 length. However, if there are several alternatives, they do
2652 not all have to have the same fixed length. Thus
2654 (?<=bullock|donkey)
2656 is permitted, but
2658 (?<!dogs?|cats?)
2660 causes an error at compile time. Branches that match dif-
2661 ferent length strings are permitted only at the top level of
2662 a lookbehind assertion. This is an extension compared with
2663 Perl (at least for 5.8), which requires all branches to
2664 match the same length of string. An assertion such as
2666 (?<=ab(c|de))
2668 is not permitted, because its single top-level branch can
2669 match two different lengths, but it is acceptable if rewrit-
2670 ten to use two top-level branches:
2672 (?<=abc|abde)
2674 The implementation of lookbehind assertions is, for each
2675 alternative, to temporarily move the current position back
2676 by the fixed width and then try to match. If there are
2677 insufficient characters before the current position, the
2678 match is deemed to fail.
2680 PCRE does not allow the \C escape (which matches a single
2681 byte in UTF-8 mode) to appear in lookbehind assertions,
2682 because it makes it impossible to calculate the length of
2683 the lookbehind.
2685 Atomic groups can be used in conjunction with lookbehind
2686 assertions to specify efficient matching at the end of the
2687 subject string. Consider a simple pattern such as
2689 abcd$
2691 when applied to a long string that does not match. Because
2692 matching proceeds from left to right, PCRE will look for
2693 each "a" in the subject and then see if what follows matches
2694 the rest of the pattern. If the pattern is specified as
2696 ^.*abcd$
2698 the initial .* matches the entire string at first, but when
2699 this fails (because there is no following "a"), it back-
2700 tracks to match all but the last character, then all but the
2701 last two characters, and so on. Once again the search for
2702 "a" covers the entire string, from right to left, so we are
2703 no better off. However, if the pattern is written as
2705 ^(?>.*)(?<=abcd)
2707 or, equivalently,
2709 ^.*+(?<=abcd)
2711 there can be no backtracking for the .* item; it can match
2712 only the entire string. The subsequent lookbehind assertion
2713 does a single test on the last four characters. If it fails,
2714 the match fails immediately. For long strings, this approach
2715 makes a significant difference to the processing time.
2717 Several assertions (of any sort) may occur in succession.
2718 For example,
2720 (?<=\d{3})(?<!999)foo
2722 matches "foo" preceded by three digits that are not "999".
2723 Notice that each of the assertions is applied independently
2724 at the same point in the subject string. First there is a
2725 check that the previous three characters are all digits, and
2726 then there is a check that the same three characters are not
2727 "999". This pattern does not match "foo" preceded by six
2728 characters, the first three of which are digits and the last three
2729 of which are not "999". For example, it doesn't match
2730 "123abcfoo". A pattern to do that is
2732 (?<=\d{3}...)(?<!999)foo
2734 This time the first assertion looks at the preceding six
2735 characters, checking that the first three are digits, and
2736 then the second assertion checks that the preceding three
2737 characters are not "999".
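The difference between the two patterns can be seen by running both against a few subjects. This is a sketch using Python's re module as a stand-in engine (an assumption); the subject strings are illustrative.

```python
import re

three = re.compile(r"(?<=\d{3})(?<!999)foo")
six = re.compile(r"(?<=\d{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]

# Both assertions in each pattern are tested at the same point,
# immediately before "foo".
three_results = [bool(three.search(s)) for s in subjects]
six_results = [bool(six.search(s)) for s in subjects]

print(three_results)
print(six_results)
```

Only the second pattern accepts "123abcfoo", because its first assertion looks back six characters instead of three.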
2739 Assertions can be nested in any combination. For example,
2741 (?<=(?<!foo)bar)baz
2743 matches an occurrence of "baz" that is preceded by "bar"
2744 which in turn is not preceded by "foo", while
2746 (?<=\d{3}(?!999)...)foo
2748 is another pattern which matches "foo" preceded by three
2749 digits and any three characters that are not "999".
2751 Assertion subpatterns are not capturing subpatterns, and may
2752 not be repeated, because it makes no sense to assert the
2753 same thing several times. If any kind of assertion contains
2754 capturing subpatterns within it, these are counted for the
2755 purposes of numbering the capturing subpatterns in the whole
2756 pattern. However, substring capturing is carried out only
2757 for positive assertions, because it does not make sense for
2758 negative assertions.
2763 It is possible to cause the matching process to obey a sub-
2764 pattern conditionally or to choose between two alternative
2765 subpatterns, depending on the result of an assertion, or
2766 whether a previous capturing subpattern matched or not. The
2767 two possible forms of conditional subpattern are
2769 (?(condition)yes-pattern)
2770 (?(condition)yes-pattern|no-pattern)
2772 If the condition is satisfied, the yes-pattern is used; oth-
2773 erwise the no-pattern (if present) is used. If there are
2774 more than two alternatives in the subpattern, a compile-time
2775 error occurs.
2777 There are three kinds of condition. If the text between the
2778 parentheses consists of a sequence of digits, the condition
2779 is satisfied if the capturing subpattern of that number has
2780 previously matched. The number must be greater than zero.
2781 Consider the following pattern, which contains non-
2782 significant white space to make it more readable (assume the
2783 PCRE_EXTENDED option) and to divide it into three parts for
2784 ease of discussion:
2786 ( \( )? [^()]+ (?(1) \) )
2788 The first part matches an optional opening parenthesis, and
2789 if that character is present, sets it as the first captured
2790 substring. The second part matches one or more characters
2791 that are not parentheses. The third part is a conditional
2792 subpattern that tests whether the first set of parentheses
2793 matched or not. If they did, that is, if the subject started
2794 with an opening parenthesis, the condition is true, and so
2795 the yes-pattern is executed and a closing parenthesis is
2796 required. Otherwise, since no-pattern is not present, the
2797 subpattern matches nothing. In other words, this pattern
2798 matches a sequence of non-parentheses, optionally enclosed
2799 in parentheses.
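The conditional pattern above can be tried as written in an engine that supports the (?(1)...) form. The sketch below uses Python's re module with its VERBOSE flag standing in for PCRE_EXTENDED (an assumption about equivalence), and requires the whole subject to match.

```python
import re

pattern = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

subjects = ["abc", "(abc)", "(abc", "abc)"]

# fullmatch() demands that the entire subject is consumed, so an
# unbalanced parenthesis on either side causes a failure.
results = [bool(pattern.fullmatch(s)) for s in subjects]

print(results)
```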
2801 If the condition is the string (R), it is satisfied if a
2802 recursive call to the pattern or subpattern has been made.
2803 At "top level", the condition is false. This is a PCRE
2804 extension. Recursive patterns are described in the next
2805 section.
2807 If the condition is not a sequence of digits or (R), it must
2808 be an assertion. This may be a positive or negative looka-
2809 head or lookbehind assertion. Consider this pattern, again
2810 containing non-significant white space, and with the two
2811 alternatives on the second line:
2813 (?(?=[^a-z]*[a-z])
2814 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2816 The condition is a positive lookahead assertion that matches
2817 an optional sequence of non-letters followed by a letter. In
2818 other words, it tests for the presence of at least one
2819 letter in the subject. If a letter is found, the subject is
2820 matched against the first alternative; otherwise it is
2821 matched against the second. This pattern matches strings in
2822 one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2823 letters and dd are digits.
2828 The sequence (?# marks the start of a comment which contin-
2829 ues up to the next closing parenthesis. Nested parentheses
2830 are not permitted. The characters that make up a comment
2831 play no part in the pattern matching at all.
2833 If the PCRE_EXTENDED option is set, an unescaped # character
2834 outside a character class introduces a comment that contin-
2835 ues up to the next newline character in the pattern.
2840 Consider the problem of matching a string in parentheses,
2841 allowing for unlimited nested parentheses. Without the use
2842 of recursion, the best that can be done is to use a pattern
2843 that matches up to some fixed depth of nesting. It is not
2844 possible to handle an arbitrary nesting depth. Perl has pro-
2845 vided an experimental facility that allows regular expres-
2846 sions to recurse (amongst other things). It does this by
2847 interpolating Perl code in the expression at run time, and
2848 the code can refer to the expression itself. A Perl pattern
2849 to solve the parentheses problem can be created like this:
2851 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2853 The (?p{...}) item interpolates Perl code at run time, and
2854 in this case refers recursively to the pattern in which it
2855 appears. Obviously, PCRE cannot support the interpolation of
2856 Perl code. Instead, it supports some special syntax for
2857 recursion of the entire pattern, and also for individual
2858 subpattern recursion.
2860 The special item that consists of (? followed by a number
2861 greater than zero and a closing parenthesis is a recursive
2862 call of the subpattern of the given number, provided that it
2863 occurs inside that subpattern. (If not, it is a "subroutine"
2864 call, which is described in the next section.) The special
2865 item (?R) is a recursive call of the entire regular expres-
2866 sion.
2868 For example, this PCRE pattern solves the nested parentheses
2869 problem (assume the PCRE_EXTENDED option is set so that
2870 white space is ignored):
2872 \( ( (?>[^()]+) | (?R) )* \)
2874 First it matches an opening parenthesis. Then it matches any
2875 number of substrings which can either be a sequence of non-
2876 parentheses, or a recursive match of the pattern itself
2877 (that is, a correctly parenthesized substring). Finally
2878 there is a closing parenthesis.
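What the recursive pattern accepts can be sketched as an ordinary recursive function. The code below is not a regular expression and not how PCRE works internally; match_parens is a hypothetical helper that mirrors the three parts of the pattern, written in Python because its re module has no recursion syntax.

```python
def match_parens(s, pos=0):
    """Return the offset just past a balanced (...) starting at pos, or None."""
    if pos >= len(s) or s[pos] != "(":
        return None                       # the leading \( did not match
    i = pos + 1
    while i < len(s):
        if s[i] == "(":
            end = match_parens(s, i)      # plays the role of (?R)
            if end is None:
                return None
            i = end
        elif s[i] == ")":
            return i + 1                  # the closing \)
        else:
            # plays the role of (?>[^()]+): take the whole run at once
            while i < len(s) and s[i] not in "()":
                i += 1
    return None                           # ran out of subject before \)


print(match_parens("(ab(cd)ef)"))
print(match_parens("(ab(cd"))
```

Because each run of non-parentheses is consumed in one step and never re-split, a subject with no final closing parenthesis is rejected in a single pass.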
2880 If this were part of a larger pattern, you would not want to
2881 recurse the entire pattern, so instead you could use this:
2883 ( \( ( (?>[^()]+) | (?1) )* \) )
2885 We have put the pattern into parentheses, and caused the
2886 recursion to refer to them instead of the whole pattern. In
2887 a larger pattern, keeping track of parenthesis numbers can
2888 be tricky. It may be more convenient to use named
2889 parentheses instead. For this, PCRE uses (?P>name), which is
2890 an extension to the Python syntax that PCRE uses for named
2891 parentheses (Perl does not provide named parentheses). We
2892 could rewrite the above example as follows:
2894 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2896 This particular example pattern contains nested unlimited
2897 repeats, and so the use of atomic grouping for matching
2898 strings of non-parentheses is important when applying the
2899 pattern to strings that do not match. For example, when this
2900 pattern is applied to
2902 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2904 it yields "no match" quickly. However, if atomic grouping is
2905 not used, the match runs for a very long time indeed because
2906 there are so many different ways the + and * repeats can
2907 carve up the subject, and all have to be tested before
2908 failure can be reported.
2909 At the end of a match, the values set for any capturing sub-
2910 patterns are those from the outermost level of the recursion
2911 at which the subpattern value is set. If you want to obtain
2912 intermediate values, a callout function can be used (see
2913 below and the pcrecallout documentation). If the pattern
2914 above is matched against
2916 (ab(cd)ef)
2918 the value for the capturing parentheses is "ef", which is
2919 the last value taken on at the top level. If additional
2920 parentheses are added, giving
2922 \( ( ( (?>[^()]+) | (?R) )* ) \)
2923    ^                        ^
2924    ^                        ^
2926 the string they capture is "ab(cd)ef", the contents of the
2927 top level parentheses. If there are more than 15 capturing
2928 parentheses in a pattern, PCRE has to obtain extra memory to
2929 store data during a recursion, which it does by using
2930 pcre_malloc, freeing it via pcre_free afterwards. If no
2931 memory can be obtained, the match fails with an out-of-memory error.
2934 Do not confuse the (?R) item with the condition (R), which
2935 tests for recursion. Consider this pattern, which matches
2936 text in angle brackets, allowing for arbitrary nesting. Only
2937 digits are allowed in nested brackets (that is, when recurs-
2938 ing), whereas any characters are permitted at the outer
2939 level.
2941 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2943 In this pattern, (?(R) is the start of a conditional subpat-
2944 tern, with two different alternatives for the recursive and
2945 non-recursive cases. The (?R) item is the actual recursive
2946 call.
2951 If the syntax for a recursive subpattern reference (either
2952 by number or by name) is used outside the parentheses to
2953 which it refers, it operates like a subroutine in a program-
2954 ming language. An earlier example pointed out that the pat-
2955 tern
2957 (sens|respons)e and \1ibility
2959 matches "sense and sensibility" and "response and responsi-
2960 bility", but not "sense and responsibility". If instead the
2961 pattern
2963 (sens|respons)e and (?1)ibility
2965 is used, it does match "sense and responsibility" as well as
2966 the other two strings. Such references must, however, follow
2967 the subpattern to which they refer.
2972 Perl has a feature whereby using the sequence (?{...})
2973 causes arbitrary Perl code to be obeyed in the middle of
2974 matching a regular expression. This makes it possible,
2975 amongst other things, to extract different substrings that
2976 match the same pair of parentheses when there is a repeti-
2977 tion.
2979 PCRE provides a similar feature, but of course it cannot
2980 obey arbitrary Perl code. The feature is called "callout".
2981 The caller of PCRE provides an external function by putting
2982 its entry point in the global variable pcre_callout. By
2983 default, this variable contains NULL, which disables all
2984 calling out.
2986 Within a regular expression, (?C) indicates the points at
2987 which the external function is to be called. If you want to
2988 identify different callout points, you can put a number less
2989 than 256 after the letter C. The default value is zero. For
2990 example, this pattern has two callout points:
2992 (?C1)9abc(?C2)def
2994 During matching, when PCRE reaches a callout point (and
2995 pcre_callout is set), the external function is called. It is
2996 provided with the number of the callout, and, optionally,
2997 one item of data originally supplied by the caller of
2998 pcre_exec(). The callout function may cause matching to
2999 backtrack, or to fail altogether. A complete description of
3000 the interface to the callout function is given in the pcre-
3001 callout documentation.
3003 Last updated: 03 February 2003
3004 Copyright (c) 1997-2003 University of Cambridge.
3005 -----------------------------------------------------------------------------
3007 NAME
3008 PCRE - Perl-compatible regular expressions
3013 Certain items that may appear in regular expression patterns
3014 are more efficient than others. It is more efficient to use
3015 a character class like [aeiou] than a set of alternatives
3016 such as (a|e|i|o|u). In general, the simplest construction
3017 that provides the required behaviour is usually the most
3018 efficient. Jeffrey Friedl's book contains a lot of discus-
3019 sion about optimizing regular expressions for efficient per-
3020 formance.
3022 When a pattern begins with .* not in parentheses, or in
3023 parentheses that are not the subject of a backreference, and
3024 the PCRE_DOTALL option is set, the pattern is implicitly
3025 anchored by PCRE, since it can match only at the start of a
3026 subject string. However, if PCRE_DOTALL is not set, PCRE
3027 cannot make this optimization, because the . metacharacter
3028 does not then match a newline, and if the subject string
3029 contains newlines, the pattern may match from the character
3030 immediately following one of them instead of from the very
3031 start. For example, the pattern
3033 .*second
3035 matches the subject "first\nand second" (where \n stands for
3036 a newline character), with the match starting at the seventh
3037 character. In order to do this, PCRE has to retry the match
3038 starting after every newline in the subject.
3040 If you are using such a pattern with subject strings that do
3041 not contain newlines, the best performance is obtained by
3042 setting PCRE_DOTALL, or starting the pattern with ^.* to
3043 indicate explicit anchoring. That saves PCRE from having to
3044 scan along the subject looking for a newline to restart at.
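The effect of the dot-matches-newline setting on where such a pattern can match is easy to observe. This sketch uses Python's re module, whose DOTALL flag is assumed to correspond to PCRE_DOTALL; offsets are 0-based, so the seventh character is offset 6.

```python
import re

subject = "first\nand second"

# Without DOTALL, . stops at the newline, so the match has to start
# just after it.
default_start = re.search(r".*second", subject).start()

# With DOTALL, .* can run across the newline and the match starts at
# the beginning of the subject.
dotall_start = re.search(r".*second", subject, re.DOTALL).start()

print(default_start, dotall_start)
```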
3046 Beware of patterns that contain nested indefinite repeats.
3047 These can take a long time to run when applied to a string
3048 that does not match. Consider the pattern fragment
3050 (a+)*
3052 This can match "aaaa" in 16 different ways, and this number
3053 increases very rapidly as the string gets longer. (The *
3054 repeat can match 0, 1, 2, 3, or 4 times, and for each of
3055 those cases other than 0, the + repeats can match different
3056 numbers of times.) When the remainder of the pattern is such
3057 that the entire match is going to fail, PCRE has in princi-
3058 ple to try every possible variation, and this can take an
3059 extremely long time.
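The growth in the number of variations can be made concrete by counting them. The sketch below is a plain counting exercise in Python and not a model of PCRE's matcher; count_ways is a hypothetical helper that counts every way (a+)* can match some prefix of a run of n "a" characters, with zero iterations counted as one way.

```python
from functools import lru_cache


def count_ways(n):
    """Count the ways (a+)* can match a prefix of 'a' * n."""

    @lru_cache(maxsize=None)
    def ways(remaining):
        # Either stop repeating here (1 way), or let the next
        # iteration of a+ take k characters and carry on.
        return 1 + sum(ways(remaining - k) for k in range(1, remaining + 1))

    return ways(n)


for n in (4, 10, 20):
    print(n, count_ways(n))
```

Under this way of counting the total doubles with each extra character, which is why a failing match against a long subject can take so long.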
3060 An optimization catches some of the more simple cases such
3061 as
3063 (a+)*b
3065 where a literal character follows. Before embarking on the
3066 standard matching procedure, PCRE checks that there is a "b"
3067 later in the subject string, and if there is not, it fails
3068 the match immediately. However, when there is no following
3069 literal this optimization cannot be used. You can see the
3070 difference by comparing the behaviour of
3072 (a+)*\d
3074 with the pattern above. The pattern ending in "b" fails almost
3075 instantly when applied to a whole line of "a" characters,
3076 whereas the one ending in \d takes an appreciable time with
3077 strings longer than about 20 characters.
3079 Last updated: 03 February 2003
3080 Copyright (c) 1997-2003 University of Cambridge.
3081 -----------------------------------------------------------------------------
3083 NAME
3084 PCRE - Perl-compatible regular expressions.
3088 #include <pcreposix.h>
3090 int regcomp(regex_t *preg, const char *pattern,
3091 int cflags);
3093 int regexec(regex_t *preg, const char *string,
3094 size_t nmatch, regmatch_t pmatch[], int eflags);
3096 size_t regerror(int errcode, const regex_t *preg,
3097 char *errbuf, size_t errbuf_size);
3099 void regfree(regex_t *preg);
3104 This set of functions provides a POSIX-style API to the PCRE
3105 regular expression package. See the pcreapi documentation
3106 for a description of the native API, which contains addi-
3107 tional functionality.
3109 The functions described here are just wrapper functions that
3110 ultimately call the PCRE native API. Their prototypes are
3111 defined in the pcreposix.h header file, and on Unix systems
3112 the library itself is called pcreposix.a, so can be accessed
3113 by adding -lpcreposix to the command for linking an applica-
3114 tion which uses them. Because the POSIX functions call the
3115 native ones, it is also necessary to add -lpcre.
3117 I have implemented only those option bits that can be rea-
3118 sonably mapped to PCRE native options. In addition, the
3119 options REG_EXTENDED and REG_NOSUB are defined with the
3120 value zero. They have no effect, but since programs that are
3121 written to the POSIX interface often use them, this makes it
3122 easier to slot in PCRE as a replacement library. Other POSIX
3123 options are not even defined.
3125 When PCRE is called via these functions, it is only the API
3126 that is POSIX-like in style. The syntax and semantics of the
3127 regular expressions themselves are still those of Perl, sub-
3128 ject to the setting of various PCRE options, as described
3129 below. "POSIX-like in style" means that the API approximates
3130 to the POSIX definition; it is not fully POSIX-compatible,
3131 and in multi-byte encoding domains it is probably even less
3132 compatible.
3134 The header for these functions is supplied as pcreposix.h to
3135 avoid any potential clash with other POSIX libraries. It
3136 can, of course, be renamed or aliased as regex.h, which is
3137 the "correct" name. It provides two structure types, regex_t
3138 for compiled internal forms, and regmatch_t for returning
3139 captured substrings. It also defines some constants whose
3140 names start with "REG_"; these are used for setting options
3141 and identifying error codes.
The function regcomp() is called to compile a pattern into
an internal form. The pattern is a C string terminated by a
binary zero, and is passed in the argument pattern. The preg
argument is a pointer to a regex_t structure which is used
as a base for storing information about the compiled expres-
sion.

The argument cflags is either zero, or contains one or more
of the bits defined by the following macros:

  REG_ICASE

The PCRE_CASELESS option is set when the expression is
passed for compilation to the native function.

  REG_NEWLINE

The PCRE_MULTILINE option is set when the expression is
passed for compilation to the native function. Note that
this does not mimic the defined POSIX behaviour for
REG_NEWLINE (see the following section).

In the absence of these flags, no options are passed to the
native function. This means that the regex is compiled with
PCRE default semantics. In particular, the way it handles
newline characters in the subject string is the Perl way,
not the POSIX way. Note that setting PCRE_MULTILINE has only
some of the effects specified for REG_NEWLINE. It does not
affect the way newlines are matched by . (they aren't) or by
a negative class such as [^a] (they are).

The yield of regcomp() is zero on success, and non-zero oth-
erwise. The preg structure is filled in on success, and one
member of the structure is public: re_nsub contains the
number of capturing subpatterns in the regular expression.
Various error codes are defined in the header file.

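As an illustration, here is a minimal sketch that compiles a pattern
and reads the public re_nsub member. The helper name count_groups()
is hypothetical and is not part of PCRE. The sketch includes the
system <regex.h> so that it builds without PCRE installed; it is
assumed that switching the include to pcreposix.h and linking with
-lpcreposix -lpcre is all that is needed to use the PCRE wrapper.
REG_EXTENDED is added because a plain POSIX library needs it for
this syntax; in pcreposix.h it is defined as zero and has no effect.

```c
#include <regex.h>   /* assumption: use <pcreposix.h> for the PCRE wrapper */
#include <stddef.h>

/* Hypothetical helper: compile `pattern` with the given cflags and
   return the number of capturing subpatterns, or -1 if regcomp()
   reports an error. */
long count_groups(const char *pattern, int cflags)
{
    regex_t re;
    long n;

    /* regcomp() yields zero on success and non-zero otherwise. */
    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0)
        return -1;

    /* re_nsub is the one public member of the structure. */
    n = (long)re.re_nsub;

    /* Release the memory associated with the compiled form. */
    regfree(&re);
    return n;
}
```

A call such as count_groups("(a)(b(c))", REG_ICASE) compiles the
pattern caselessly and reports its three capturing subpatterns.
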
This area is not simple, because POSIX and Perl take dif-
ferent views of things. It is not possible to get PCRE to
obey POSIX semantics, but then PCRE was never intended to be
a POSIX engine. The following table lists the different pos-
sibilities for matching newline characters in PCRE:

                          Default   Change with

  . matches newline       no        PCRE_DOTALL
  newline matches [^a]    yes       not changeable
  $ matches \n at end     yes       PCRE_DOLLARENDONLY
  $ matches \n in middle  no        PCRE_MULTILINE
  ^ matches \n in middle  no        PCRE_MULTILINE

This is the equivalent table for POSIX:

                          Default   Change with

  . matches newline       yes       REG_NEWLINE
  newline matches [^a]    yes       REG_NEWLINE
  $ matches \n at end     no        REG_NEWLINE
  $ matches \n in middle  no        REG_NEWLINE
  ^ matches \n in middle  no        REG_NEWLINE

PCRE's behaviour is the same as Perl's, except that there is
no equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE
and Perl, there is no way to stop newline from matching
[^a].

The default POSIX newline handling can be obtained by set-
ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
to make PCRE behave exactly as for the REG_NEWLINE action.

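One way to see which column of the tables applies to a given build
is to probe the linked library directly. The following is a minimal
sketch, not a definitive test: probe() is a hypothetical helper, and
the sketch includes the system <regex.h>, so as written it exercises
the POSIX table. It is assumed that the same source builds against
pcreposix.h, in which case the PCRE table should apply instead.

```c
#include <regex.h>   /* assumption: use <pcreposix.h> for the PCRE wrapper */
#include <stddef.h>

/* Hypothetical helper: return 1 if `pattern` matches `subject` when
   compiled with `cflags`, 0 if it does not, and -1 if the pattern
   fails to compile. */
int probe(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    /* REG_EXTENDED is zero in pcreposix.h, so it is harmless there. */
    if (regcomp(&re, pattern, cflags | REG_EXTENDED) != 0)
        return -1;

    /* No captured substrings are wanted, so nmatch is zero. */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

For example, probe("a.b", "a\nb", 0) asks whether . matches a
newline by default: the POSIX table predicts a match, and the PCRE
table predicts none. Similarly, probe("^b", "a\nb", REG_NEWLINE)
asks whether ^ matches after a newline in the middle of the subject,
which both tables say it does once the option is set.
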
The function regexec() is called to match a pre-compiled
pattern preg against a given string, which is terminated by
a zero byte, subject to the options in eflags. These can be:

  REG_NOTBOL

The PCRE_NOTBOL option is set when calling the underlying
PCRE matching function.

  REG_NOTEOL

The PCRE_NOTEOL option is set when calling the underlying
PCRE matching function.

The portion of the string that was matched, and also any
captured substrings, are returned via the pmatch argument,
which points to an array of nmatch structures of type
regmatch_t, containing the members rm_so and rm_eo. These
contain the offset to the first character of each substring
and the offset to the first character after the end of each
substring, respectively. The 0th element of the vector
relates to the entire portion of string that was matched;
subsequent elements relate to the capturing subpatterns of
the regular expression. Unused entries in the array have
both structure members set to -1.

A successful match yields a zero return; various error codes
are defined in the header file, of which REG_NOMATCH is the
"expected" failure code.

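The use of the pmatch vector can be sketched as follows. This is an
illustration only: extract_group() and MAX_GROUPS are hypothetical
names, and the sketch includes the system <regex.h> so that it
builds without PCRE installed. It is assumed that the same calls
work unchanged with pcreposix.h.

```c
#include <regex.h>   /* assumption: use <pcreposix.h> for the PCRE wrapper */
#include <string.h>

#define MAX_GROUPS 10   /* hypothetical limit chosen for this sketch */

/* Hypothetical helper: match `pattern` against `subject` and copy
   the text of capture `group` (0 means the whole match) into `out`,
   truncating if necessary. Returns 0 on success, the non-zero code
   from regcomp() or regexec() on failure (REG_NOMATCH being the
   expected one), or -1 for bad arguments or an unused group. */
int extract_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t pm[MAX_GROUPS];
    size_t len;
    int rc;

    if (group >= MAX_GROUPS || out_size == 0)
        return -1;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc != 0)
        return rc;

    rc = regexec(&re, subject, MAX_GROUPS, pm, 0);
    if (rc == 0) {
        if (pm[group].rm_so == -1) {
            /* Unused entries have both members set to -1. */
            rc = -1;
        } else {
            /* rm_so is the offset of the first character, rm_eo the
               offset of the first character after the end. */
            len = (size_t)(pm[group].rm_eo - pm[group].rm_so);
            if (len >= out_size)
                len = out_size - 1;
            memcpy(out, subject + pm[group].rm_so, len);
            out[len] = '\0';
        }
    }

    regfree(&re);
    return rc;
}
```

Because the offsets are relative to the start of the subject, the
substring is recovered by plain pointer arithmetic; no copy is made
by regexec() itself.
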
The regerror() function maps a non-zero errorcode from
either regcomp() or regexec() to a printable message. If
preg is not NULL, the error should have arisen from the use
of that structure. A message terminated by a binary zero is
placed in errbuf. The length of the message, including the
zero, is limited to errbuf_size. The yield of the function
is the size of buffer needed to hold the whole message.

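Because the yield is the size needed for the whole message, a caller
can ask for the size first and then allocate a buffer to fit. The
following is a minimal sketch of that idiom; compile_error_message()
is a hypothetical helper, and the sketch includes the system
<regex.h>. Calling regerror() with a zero errbuf_size is allowed by
POSIX; it is assumed here that the PCRE wrapper accepts it too.

```c
#include <regex.h>   /* assumption: use <pcreposix.h> for the PCRE wrapper */
#include <stdlib.h>

/* Hypothetical helper: return a malloc()ed copy of the error message
   for a pattern that fails to compile. Returns NULL if the pattern
   compiles successfully or if memory cannot be obtained. The caller
   must free() the result. */
char *compile_error_message(const char *pattern)
{
    regex_t re;
    size_t needed;
    char *msg;
    int rc;

    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0) {
        regfree(&re);
        return NULL;
    }

    /* First call: nothing is written, only the size is returned. */
    needed = regerror(rc, &re, NULL, 0);

    /* Second call: the buffer is big enough for the whole message. */
    msg = malloc(needed);
    if (msg != NULL)
        regerror(rc, &re, msg, needed);
    return msg;
}
```

A fixed-size buffer works as well; the message is then cut short at
errbuf_size bytes, including the terminating zero.
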
Compiling a regular expression causes memory to be allocated
and associated with the preg structure. The function reg-
free() frees all such memory, after which preg may no longer
be used as a compiled expression.

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
     PCRE - Perl-compatible regular expressions

A simple, complete demonstration program, to get you started
with using PCRE, is supplied in the file pcredemo.c in the
PCRE distribution.

The program compiles the regular expression that is its
first argument, and matches it against the subject string in
its second argument. No PCRE options are set, and default
character tables are used. If matching succeeds, the program
outputs the portion of the subject that matched, together
with the contents of any captured substrings.

If the -g option is given on the command line, the program
then goes on to check for further matches of the same regu-
lar expression in the same subject string. The logic is a
little bit tricky because of the possibility of matching an
empty string. Comments in the code explain what is going on.

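To give a feel for why empty matches need care, here is a simplified
sketch of a "find all matches" loop. It is not the code from
pcredemo.c, which uses the native PCRE functions and its own method;
this version uses the POSIX-style calls from <regex.h>, and
count_matches() is a hypothetical helper. The key point is that
after an empty match the search position must be moved on by hand,
otherwise the same empty string is found for ever.

```c
#include <regex.h>   /* assumption: use <pcreposix.h> for the PCRE wrapper */
#include <stddef.h>

/* Hypothetical helper: count the successive matches of `pattern` in
   `subject`. Returns -1 if the pattern does not compile. */
int count_matches(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = subject;
    int count = 0;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    for (;;) {
        /* After the first attempt, p is no longer the true start of
           the subject, so ^ must not match there. */
        int eflags = (p == subject) ? 0 : REG_NOTBOL;

        if (regexec(&re, p, 1, m, eflags) != 0)
            break;                  /* no further matches */
        count++;

        if (m[0].rm_eo > m[0].rm_so) {
            /* Non-empty match: carry on just after it. */
            p += m[0].rm_eo;
        } else {
            /* Empty match: stop at the end of the subject, otherwise
               step over one character to avoid looping. */
            p += m[0].rm_so;
            if (*p == '\0')
                break;
            p++;
        }
    }

    regfree(&re);
    return count;
}
```

With the pattern 'cat|dog' and the subject 'the dog sat on the cat',
the loop finds "dog", resumes after it, finds "cat", and then stops
because the remaining subject is empty.
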
On a Unix system that has PCRE installed in /usr/local, you
can compile the demonstration program using a command like
this:

    gcc -o pcredemo pcredemo.c -I/usr/local/include \
        -L/usr/local/lib -lpcre

Then you can run simple tests like this:

    ./pcredemo 'cat|dog' 'the cat sat on the mat'
    ./pcredemo -g 'cat|dog' 'the dog sat on the cat'

Note that there is a much more comprehensive test program,
called pcretest, which supports many more facilities for
testing regular expressions and the PCRE library. The
pcredemo program is provided as a simple coding example.

On some operating systems (e.g. Solaris) you may get an
error like this when you try to run pcredemo:

    ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such
    file or directory

This is caused by the way shared library support works on
those systems. You need to add

    -R/usr/local/lib

to the compile command to get round this problem.

Last updated: 28 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------
