Contents of /code/trunk/doc/pcre.txt

Revision 65
Sat Feb 24 21:40:08 2007 UTC (14 years, 6 months ago) by nigel
File MIME type: text/plain
File size: 142393 byte(s)
Load pcre-4.1 into code/trunk.

This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions

DESCRIPTION

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and
semantics as Perl, with just a few differences. The current
implementation of PCRE (release 4.x) corresponds approximately
with Perl 5.8, including support for UTF-8 encoded
strings. However, this support has to be explicitly
enabled; it is not the default.

PCRE is written in C and released as a C library. However, a
number of people have written wrappers and interfaces of
various kinds. A C++ class is included in these contributions,
which can be found in the Contrib directory at the
primary FTP site, which is:

  ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

Details of exactly which Perl regular expression features
are and are not supported by PCRE are given in separate
documents. See the pcrepattern and pcrecompat pages.

Some features of PCRE can be included, excluded, or changed
when the library is built. The pcre_config() function makes
it possible for a client to discover which features are
available. Documentation about building PCRE for various
operating systems can be found in the README file in the
source distribution.

USER DOCUMENTATION

The user documentation for PCRE has been split up into a
number of different sections. In the "man" format, each of
these is a separate "man page". In the HTML format, each is
a separate page, linked from the index page. In the plain
text format, all the sections are concatenated, for ease of
searching. The sections are as follows:

  pcre          this document
  pcreapi       details of PCRE's native API
  pcrebuild     options for building PCRE
  pcrecallout   details of the callout feature
  pcrecompat    discussion of Perl compatibility
  pcregrep      description of the pcregrep command
  pcrepattern   syntax and semantics of supported
                  regular expressions
  pcreperform   discussion of performance issues
  pcreposix     the POSIX-compatible API
  pcresample    discussion of the sample program
  pcretest      the pcretest testing command

In addition, in the "man" and HTML formats, there is a short
page for each library function, listing its arguments and
results.

LIMITATIONS

There are some size limitations in PCRE but it is hoped that
they will never in practice be relevant.

The maximum length of a compiled pattern is 65539 (sic)
bytes if PCRE is compiled with the default internal linkage
size of 2. If you want to process regular expressions that
are truly enormous, you can compile PCRE with an internal
linkage size of 3 or 4 (see the README file in the source
distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However,
the speed of execution will be slower.

All values in repeating quantifiers must be less than 65536.
The maximum number of capturing subpatterns is 65535.

There is no limit to the number of non-capturing subpatterns,
but the maximum depth of nesting of all kinds of
parenthesized subpattern, including capturing subpatterns,
assertions, and other types of subpattern, is 200.

The maximum length of a subject string is the largest positive
number that an integer variable can hold. However, PCRE
uses recursion to handle subpatterns and indefinite repetition.
This means that the available stack space may limit
the size of a subject string that can be processed by certain
patterns.

UTF-8 SUPPORT

Starting at release 3.3, PCRE has had some support for character
strings encoded in the UTF-8 format. For release 4.0
this has been greatly extended to cover most common requirements.

In order to process UTF-8 strings, you must build PCRE to
include UTF-8 support in the code, and, in addition, you
must call pcre_compile() with the PCRE_UTF8 option flag.
When you do this, both the pattern and any subject strings
that are matched against it are treated as UTF-8 strings
instead of just strings of bytes.

If you compile PCRE with UTF-8 support, but do not use it at
run time, the library will be a bit bigger, but the additional
run time overhead is limited to testing the PCRE_UTF8
flag in several places, so should not be very large.

The following comments apply when PCRE is running in UTF-8
mode:

1. PCRE assumes that the strings it is given contain valid
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If
you pass invalid UTF-8 strings to PCRE, the results are
undefined.

2. In a pattern, the escape sequence \x{...}, where the contents
of the braces is a string of hexadecimal digits, is
interpreted as a UTF-8 character whose code number is the
given hexadecimal number, for example: \x{1234}. If a
non-hexadecimal digit appears between the braces, the item is
not recognized. This escape sequence can be used either as
a literal, or within a character class.

3. The original hexadecimal escape sequence, \xhh, matches a
two-byte UTF-8 character if the value is greater than 127.

4. Repeat quantifiers apply to complete UTF-8 characters,
not to individual bytes, for example: \x{100}{3}.

5. The dot metacharacter matches one UTF-8 character instead
of a single byte.

6. The escape sequence \C can be used to match a single byte
in UTF-8 mode, but its use can lead to some strange effects.

7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W
correctly test characters of any code value, but the characters
that PCRE recognizes as digits, spaces, or word characters
remain the same set as before, all with values less
than 256.

8. Case-insensitive matching applies only to characters
whose values are less than 256. PCRE does not support the
notion of "case" for higher-valued characters.

9. PCRE does not support the use of Unicode tables and properties
or the Perl escapes \p, \P, and \X.
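
The byte sequences behind comments 2 to 4 can be worked out by hand. The following is a minimal sketch, not PCRE code: utf8_encode() is a hypothetical helper that encodes one code number as UTF-8, assuming the modern range up to 0x10FFFF (it does not reject surrogates). It shows that \x{1234} stands for a three-byte character, that a \xhh value above 127 such as \xE9 stands for a two-byte character, and that \x{100}{3} therefore has to consume six bytes of subject.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: encode code point cp as UTF-8 into out[].
 * Returns the number of bytes written, or 0 if cp is out of range. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80UL) {                      /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {                     /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {                   /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp < 0x110000UL) {                  /* 11110xxx + three trailers */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the assumed range */
}
```

For example, utf8_encode(0x1234, buf) writes the bytes E1 88 B4, and utf8_encode(0xE9, buf) writes C3 A9.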

AUTHOR

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

Last updated: 04 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions

PCRE BUILD-TIME OPTIONS

This document describes the optional features of PCRE that
can be selected when the library is compiled. They are all
selected, or deselected, by providing options to the configure
script which is run before the make command. The complete
list of options for configure (which includes the
standard ones such as the selection of the installation
directory) can be obtained by running

  ./configure --help

The following sections describe certain options whose names
begin with --enable or --disable. These settings specify
changes to the defaults for the configure command. Because
of the way that configure works, --enable and --disable
always come in pairs, so the complementary option always
exists as well, but as it specifies the default, it is not
described.

UTF-8 SUPPORT

To build PCRE with support for UTF-8 character strings, add

  --enable-utf8

to the configure command. Of itself, this does not make PCRE
treat strings as UTF-8. As well as compiling PCRE with this
option, you also have to set the PCRE_UTF8 option when
you call the pcre_compile() function.

CODE VALUE OF NEWLINE

By default, PCRE treats character 10 (linefeed) as the newline
character. This is the normal newline character on
Unix-like systems. You can compile PCRE to use character 13
(carriage return) instead by adding

  --enable-newline-is-cr

to the configure command. For completeness there is also a
--enable-newline-is-lf option, which explicitly specifies
linefeed as the newline character.

BUILDING SHARED AND STATIC LIBRARIES

The PCRE building process uses libtool to build both shared
and static Unix libraries by default. You can suppress one
of these by adding one of

  --disable-shared
  --disable-static

to the configure command, as required.

POSIX MALLOC USAGE

When PCRE is called through the POSIX interface (see the
pcreposix documentation), additional working storage is
required for holding the pointers to capturing substrings
because PCRE requires three integers per substring, whereas
the POSIX interface provides only two. If the number of
expected substrings is small, the wrapper function uses
space on the stack, because this is faster than using malloc()
for each call. The default threshold above which the
stack is no longer used is 10; it can be changed by adding a
setting such as

  --with-posix-malloc-threshold=20

to the configure command.
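
The decision described above can be sketched in a few lines. This is an illustration of the stated rule only, not the wrapper's actual code: the names ovector_ints() and uses_malloc() are hypothetical, and the threshold is passed in as a parameter rather than read from the build configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Working storage needed for nmatch substrings: three ints each,
 * as stated in the text (the POSIX interface supplies only two). */
size_t ovector_ints(size_t nmatch)
{
    return nmatch * 3;
}

/* Returns 1 if the storage would come from malloc(), 0 if the stack is
 * used. The stack is used up to and including the threshold. */
int uses_malloc(size_t nmatch, size_t threshold)
{
    assert(threshold > 0);
    return nmatch > threshold;
}
```

With the default threshold of 10, a request for 10 substrings needs 30 ints and stays on the stack, while a request for 11 needs 33 ints and goes to malloc().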

LIMITING PCRE RESOURCE USAGE

Internally, PCRE has a function called match() which it
calls repeatedly (possibly recursively) when performing a
matching operation. By limiting the number of times this
function may be called, a limit can be placed on the
resources used by a single call to pcre_exec(). The limit
can be changed at run time, as described in the pcreapi
documentation. The default is 10 million, but this can be
changed by adding a setting such as

  --with-match-limit=500000

to the configure command.
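
The idea of bounding work by counting calls to a recursive matching function can be shown with a toy matcher. The sketch below is not PCRE's match(); it is a hypothetical glob-style matcher (literal characters plus "*" for any run of characters) whose only purpose is to show how a call counter turns runaway backtracking into a clean "limit exceeded" result. All names in it are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { SK_LIMIT = -1, SK_NOMATCH = 0, SK_MATCH = 1 };

typedef struct {
    unsigned long calls;   /* calls made so far */
    unsigned long limit;   /* maximum number of calls allowed */
} sk_budget;

/* Toy recursive matcher; every invocation is charged to the budget. */
static int sk_match(const char *pat, const char *sub, sk_budget *b)
{
    if (++b->calls > b->limit)
        return SK_LIMIT;
    if (*pat == '\0')
        return *sub == '\0' ? SK_MATCH : SK_NOMATCH;
    if (*pat == '*') {
        /* Try every possible length for the run, shortest first. */
        for (;;) {
            int rc = sk_match(pat + 1, sub, b);
            if (rc != SK_NOMATCH)
                return rc;          /* a match, or the limit was hit */
            if (*sub == '\0')
                return SK_NOMATCH;
            sub++;
        }
    }
    if (*sub == *pat)
        return sk_match(pat + 1, sub + 1, b);
    return SK_NOMATCH;
}

/* Match the whole of sub against pat, giving up after limit calls. */
int sk_limited_match(const char *pat, const char *sub, unsigned long limit)
{
    sk_budget b;
    assert(pat != NULL && sub != NULL);
    b.calls = 0;
    b.limit = limit;
    return sk_match(pat, sub, &b);
}
```

A pattern such as "*a*a*a*a*b" applied to twenty "a" characters backtracks heavily before failing; with a small limit the call returns SK_LIMIT instead of running to completion, which is the same trade-off the match limit offers.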

HANDLING VERY LARGE PATTERNS

Within a compiled pattern, offset values are used to point
from one part to another (for example, from an opening
parenthesis to an alternation metacharacter). By default
two-byte values are used for these offsets, leading to a
maximum size for a compiled pattern of around 64K. This is
sufficient to handle all but the most gigantic patterns.
Nevertheless, some people do want to process enormous patterns,
so it is possible to compile PCRE to use three-byte
or four-byte offsets by adding a setting such as

  --with-link-size=3

to the configure command. The value given must be 2, 3, or
4. Using longer offsets slows down the operation of PCRE
because it has to load additional bytes when handling them.

If you build PCRE with an increased link size, test 2 (and
test 5 if you are using UTF-8) will fail. Part of the output
of these tests is a representation of the compiled pattern,
and this changes with the link size.
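
The "around 64K" figure follows from the width of the offsets. As a rough sketch of the arithmetic only (max_link_offset() is a hypothetical helper, and the exact documented limit for link size 2, 65539 bytes, differs slightly from this bound), the largest offset that fits in a given link size can be computed like this:

```c
#include <assert.h>

/* Largest offset value that fits in link_size bytes (2, 3, or 4). */
unsigned long max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    if (link_size == 4)
        return 0xFFFFFFFFUL;    /* avoid shifting by the full width of a
                                   32-bit unsigned long */
    return (1UL << (8 * link_size)) - 1UL;
}
```

This gives 65535 for two-byte offsets, 16777215 for three-byte offsets, and 4294967295 for four-byte offsets.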

Last updated: 21 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions

SYNOPSIS OF PCRE API

#include <pcre.h>

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

int pcre_copy_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     char *buffer, int buffersize);

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     const char **stringptr);

int pcre_get_stringnumber(const pcre *code,
     const char *name);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

void pcre_free_substring(const char *stringptr);

void pcre_free_substring_list(const char **stringptr);

const unsigned char *pcre_maketables(void);

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

int pcre_config(int what, void *where);

char *pcre_version(void);

void *(*pcre_malloc)(size_t);

void (*pcre_free)(void *);

int (*pcre_callout)(pcre_callout_block *);

PCRE API

PCRE has its own native API, which is described in this
document. There is also a set of wrapper functions that
correspond to the POSIX regular expression API. These are
described in the pcreposix documentation.

The native API function prototypes are defined in the header
file pcre.h, and on Unix systems the library itself is
called libpcre.a, so can be accessed by adding -lpcre to the
command for linking an application which calls it. The
header file defines the macros PCRE_MAJOR and PCRE_MINOR to
contain the major and minor release numbers for the library.
Applications can use these to include support for different
releases.

The functions pcre_compile(), pcre_study(), and pcre_exec()
are used for compiling and matching regular expressions. A
sample program that demonstrates the simplest way of using
them is given in the file pcredemo.c. The pcresample
documentation describes how to run it.

There are convenience functions for extracting captured
substrings from a matched subject string. They are:

  pcre_copy_substring()
  pcre_copy_named_substring()
  pcre_get_substring()
  pcre_get_named_substring()
  pcre_get_substring_list()

pcre_free_substring() and pcre_free_substring_list() are
also provided, to free the memory used for extracted
strings.

The function pcre_maketables() is used (optionally) to build
a set of character tables in the current locale for passing
to pcre_compile().

The function pcre_fullinfo() is used to find out information
about a compiled pattern; pcre_info() is an obsolete version
which returns only some of the available information, but is
retained for backwards compatibility. The function
pcre_version() returns a pointer to a string containing the
version of PCRE and its date of release.

The global variables pcre_malloc and pcre_free initially
contain the entry points of the standard malloc() and free()
functions respectively. PCRE calls the memory management
functions via these variables, so a calling program can
replace them if it wishes to intercept the calls. This
should be done before calling any PCRE functions.

The global variable pcre_callout initially contains NULL. It
can be set by the caller to a "callout" function, which PCRE
will then call at specified points during a matching operation.
Details are given in the pcrecallout documentation.

MULTITHREADING

The PCRE functions can be used in multi-threading applications,
with the proviso that the memory management functions
pointed to by pcre_malloc and pcre_free, and the callout
function pointed to by pcre_callout, are shared by all
threads.

The compiled form of a regular expression is not altered
during matching, so the same compiled pattern can safely be
used by several threads at once.

CHECKING BUILD-TIME OPTIONS

int pcre_config(int what, void *where);

The function pcre_config() makes it possible for a PCRE
client to discover which optional features have been compiled
into the PCRE library. The pcrebuild documentation has
more details about these optional features.

The first argument for pcre_config() is an integer, specifying
which information is required; the second argument is a
pointer to a variable into which the information is placed.
The following information is available:

  PCRE_CONFIG_UTF8

The output is an integer that is set to one if UTF-8 support
is available; otherwise it is set to zero.

  PCRE_CONFIG_NEWLINE

The output is an integer that is set to the value of the
code that is used for the newline character. It is either
linefeed (10) or carriage return (13), and should normally
be the standard character for your operating system.

  PCRE_CONFIG_LINK_SIZE

The output is an integer that contains the number of bytes
used for internal linkage in compiled regular expressions.
The value is 2, 3, or 4. Larger values allow larger regular
expressions to be compiled, at the expense of slower matching.
The default value of 2 is sufficient for all but the
most massive patterns, since it allows the compiled pattern
to be up to 64K in size.

  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

The output is an integer that contains the threshold above
which the POSIX interface uses malloc() for output vectors.
Further details are given in the pcreposix documentation.

  PCRE_CONFIG_MATCH_LIMIT

The output is an integer that gives the default limit for
the number of internal matching function calls in a
pcre_exec() execution. Further details are given with
pcre_exec() below.

COMPILING A PATTERN

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

The function pcre_compile() is called to compile a pattern
into an internal form. The pattern is a C string terminated
by a binary zero, and is passed in the argument pattern. A
pointer to a single block of memory that is obtained via
pcre_malloc is returned. This contains the compiled code and
related data. The pcre type is defined for the returned
block; this is a typedef for a structure whose contents are
not externally defined. It is up to the caller to free the
memory when it is no longer required.

Although the compiled code of a PCRE regex is relocatable,
that is, it does not depend on memory location, the complete
pcre data block is not fully relocatable, because it contains
a copy of the tableptr argument, which is an address
(see below).

The options argument contains independent bits that affect
the compilation. It should be zero if no options are
required. Some of the options, in particular, those that are
compatible with Perl, can also be set and unset from within
the pattern (see the detailed description of regular expressions
in the pcrepattern documentation). For these options,
the contents of the options argument specify their initial
settings at the start of compilation and execution. The
PCRE_ANCHORED option can be set at the time of matching as
well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately.
Otherwise, if compilation of a pattern fails, pcre_compile()
returns NULL, and sets the variable pointed to by errptr to
point to a textual error message. The offset from the start
of the pattern to the character where the error was
discovered is placed in the variable pointed to by
erroffset, which must not be NULL. If it is, an immediate
error is given.

If the final argument, tableptr, is NULL, PCRE uses a
default set of character tables which are built when it is
compiled, using the default C locale. Otherwise, tableptr
must be the result of a call to pcre_maketables(). See the
section on locale support below.

This code fragment shows a typical straightforward call to
pcre_compile():

  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */

The following option bits are defined:

  PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored",
that is, it is constrained to match only at the first matching
point in the string which is being searched (the "subject
string"). This effect can also be achieved by appropriate
constructs in the pattern itself, which is the only way
to do it in Perl.

  PCRE_CASELESS

If this bit is set, letters in the pattern match both upper
and lower case letters. It is equivalent to Perl's /i
option, and it can be changed within a pattern by a (?i)
option setting.

  PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern
matches only at the end of the subject string. Without this
option, a dollar also matches immediately before the final
character if it is a newline (but not before any other
newlines). The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set. There is no equivalent to this option
in Perl, and no way to set it within a pattern.

  PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern
matches all characters, including newlines. Without it,
newlines are excluded. This option is equivalent to Perl's /s
option, and it can be changed within a pattern by a (?s)
option setting. A negative class such as [^a] always matches
a newline character, independent of the setting of this
option.

  PCRE_EXTENDED

If this bit is set, whitespace data characters in the pattern
are totally ignored except when escaped or inside a
character class. Whitespace does not include the VT character
(code 11). In addition, characters between an unescaped
# outside a character class and the next newline character,
inclusive, are also ignored. This is equivalent to Perl's /x
option, and it can be changed within a pattern by a (?x)
option setting.

This option makes it possible to include comments inside
complicated patterns. Note, however, that this applies only
to data characters. Whitespace characters may never appear
within special character sequences in a pattern, for example
within the sequence (?( which introduces a conditional
subpattern.

  PCRE_EXTRA

This option was invented in order to turn on additional
functionality of PCRE that is incompatible with Perl, but it
is currently of very little use. When set, any backslash in
a pattern that is followed by a letter that has no special
meaning causes an error, thus reserving these combinations
for future expansion. By default, as in Perl, a backslash
followed by a letter with no special meaning is treated as a
literal. There are at present no other features controlled
by this option. It can also be set by a (?X) option setting
within a pattern.

  PCRE_MULTILINE

By default, PCRE treats the subject string as consisting of
a single "line" of characters (even if it actually contains
several newlines). The "start of line" metacharacter (^)
matches only at the start of the string, while the "end of
line" metacharacter ($) matches only at the end of the
string, or before a terminating newline (unless
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end
of line" constructs match immediately following or immediately
before any newline in the subject string, respectively,
as well as at the very start and end. This is
equivalent to Perl's /m option, and it can be changed within
a pattern by a (?m) option setting. If there are no "\n"
characters in a subject string, or no occurrences of ^ or $
in a pattern, setting PCRE_MULTILINE has no effect.
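
The interaction between PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY is easy to misread, so here is a small model of the rules as stated above. It is a sketch of the documented behaviour, not PCRE's implementation: dollar_matches_at() is a hypothetical function, the two options are passed as plain flags, and the newline character is assumed to be linefeed (the default).

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if "$" may match at offset pos of the subject s (length len),
 * and 0 otherwise, under the given option settings. */
int dollar_matches_at(const char *s, size_t len, size_t pos,
                      int multiline, int dollar_endonly)
{
    assert(s != NULL && pos <= len);
    if (pos == len)
        return 1;               /* the very end always qualifies */
    if (s[pos] != '\n')
        return 0;               /* otherwise only a newline can follow */
    if (multiline)
        return 1;               /* before any newline; ENDONLY is ignored */
    if (dollar_endonly)
        return 0;               /* end of subject only */
    return pos == len - 1;      /* before a terminating newline only */
}
```

For the subject "a\nb\n", the position before the first newline qualifies only when the multiline flag is set, and the position before the final newline qualifies unless the end-only flag is set without the multiline flag.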

  PCRE_NO_AUTO_CAPTURE

If this option is set, it disables the use of numbered capturing
parentheses in the pattern. Any opening parenthesis
that is not followed by ? behaves as if it were followed by
?: but named parentheses can still be used for capturing
(and they acquire numbers in the usual way). There is no
equivalent of this option in Perl.

  PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so
that they are not greedy by default, but become greedy if
followed by "?". It is not compatible with Perl. It can also
be set by a (?U) option setting within the pattern.

  PCRE_UTF8

This option causes PCRE to regard both the pattern and the
subject as strings of UTF-8 characters instead of single-byte
character strings. However, it is available only if
PCRE has been built to include UTF-8 support. If not, the
use of this option provokes an error. Details of how this
option changes the behaviour of PCRE are given in the section
on UTF-8 support in the main pcre page.

STUDYING A PATTERN

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

When a pattern is going to be used several times, it is
worth spending more time analyzing it in order to speed up
the time taken for matching. The function pcre_study() takes
a pointer to a compiled pattern as its first argument. If
studying the pattern produces additional information that
will help speed up matching, pcre_study() returns a pointer
to a pcre_extra block, in which the study_data field points
to the results of the study.

The returned value from pcre_study() can be passed
directly to pcre_exec(). However, the pcre_extra block also
contains other fields that can be set by the caller before
the block is passed; these are described below. If studying
the pattern does not produce any additional information,
pcre_study() returns NULL. In that circumstance, if the calling
program wants to pass some of the other fields to
pcre_exec(), it must set up its own pcre_extra block.

The second argument contains option bits. At present, no
options are defined for pcre_study(), and this argument
should always be zero.

The third argument for pcre_study() is a pointer for an
error message. If studying succeeds (even if no data is
returned), the variable it points to is set to NULL. Otherwise
it points to a textual error message. You should therefore
test the error pointer for NULL after calling
pcre_study(), to be sure that it has run successfully.

This is a typical call to pcre_study():

  pcre_extra *pe;
  pe = pcre_study(
    re,             /* result of pcre_compile() */
    0,              /* no options exist */
    &error);        /* set to NULL or points to a message */

At present, studying a pattern is useful only for
non-anchored patterns that do not have a single fixed starting
character. A bitmap of possible starting characters is
created.

LOCALE SUPPORT

PCRE handles caseless matching, and determines whether characters
are letters, digits, or whatever, by reference to a
set of tables. When running in UTF-8 mode, this applies only
to characters with codes less than 256. The library contains
a default set of tables that is created in the default C
locale when PCRE is compiled. This is used when the final
argument of pcre_compile() is NULL, and is sufficient for
many applications.

An alternative set of tables can, however, be supplied. Such
tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result
can then be passed to pcre_compile() as often as necessary.
For example, to build and use tables that are appropriate
for the French locale (where accented characters with codes
greater than 128 are treated as letters), the following code
could be used:

  setlocale(LC_CTYPE, "fr");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);

The tables are built in memory that is obtained via
pcre_malloc. The pointer that is passed to pcre_compile is
saved with the compiled pattern, and the same tables are
used via this pointer by pcre_study() and pcre_exec(). Thus,
for any single pattern, compilation, studying and matching
all happen in the same locale, but different patterns can be
compiled in different locales. It is the caller's responsibility
to ensure that the memory containing the tables
remains available for as long as it is needed.
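
To make the role of locale-dependent tables more concrete, here is a simplified model of one thing such a table can capture: a lower-casing map for byte values 0 to 255, built in whatever locale is current when it is called. This is an illustration only. It does not reproduce the layout of the tables that pcre_maketables() returns, and build_lower_table() and same_caseless() are hypothetical names.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Build a 256-entry lower-casing table using the current LC_CTYPE locale.
 * The caller owns the memory and must free() it. Returns NULL on failure. */
unsigned char *build_lower_table(void)
{
    unsigned char *t = malloc(256);
    int i;
    if (t == NULL)
        return NULL;
    for (i = 0; i < 256; i++)
        t[i] = (unsigned char)tolower(i);
    return t;
}

/* Caseless comparison of two bytes by reference to the table. */
int same_caseless(const unsigned char *t, unsigned char a, unsigned char b)
{
    assert(t != NULL);
    return t[a] == t[b];
}
```

Because the table is a snapshot taken at build time, later calls to setlocale() do not affect comparisons made through it, which mirrors the point above that a pattern keeps using the tables it was compiled with.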

INFORMATION ABOUT A PATTERN

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

The pcre_fullinfo() function returns information about a
compiled pattern. It replaces the obsolete pcre_info() function,
which is nevertheless retained for backwards compatibility
(and is documented below).

The first argument for pcre_fullinfo() is a pointer to the
compiled pattern. The second argument is the result of
pcre_study(), or NULL if the pattern was not studied. The
third argument specifies which piece of information is
required, and the fourth argument is a pointer to a variable
to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
                        the argument where was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid

Here is a typical call of pcre_fullinfo(), to obtain the
length of the compiled pattern:

  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    pe,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in
pcre.h, and are as follows:

  PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern.
The fourth argument should point to an int variable.
Zero is returned if there are no back references.

  PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern.
The fourth argument should point to an int variable.

  PCRE_INFO_FIRSTBYTE

Return information about the first byte of any matched
string, for a non-anchored pattern. (This option used to be
called PCRE_INFO_FIRSTCHAR; the old name is still recognized
for backwards compatibility.)

If there is a fixed first byte, e.g. from a pattern such as
(cat|cow|coyote), it is returned in the integer pointed to
by where. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option,
and every branch starts with "^", or

(b) every branch of the pattern starts with ".*" and
PCRE_DOTALL is not set (if it were set, the pattern would be
anchored),

-1 is returned, indicating that the pattern matches only at
the start of a subject string or after any newline within
the string. Otherwise -2 is returned. For anchored patterns,
-2 is returned.

  PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction
of a 256-bit table indicating a fixed set of bytes
for the first byte in any matching string, a pointer to the
table is returned. Otherwise NULL is returned. The fourth
argument should point to an unsigned char * variable.
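
A 256-bit table occupies 32 bytes, one bit per possible byte value. The sketch below shows how such a table can be tested and how it lets a matcher skip positions that cannot start a match. The bit order used here (byte c / 8, bit c % 8, least significant bit first) is an assumption made for the illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Mark byte value c as a possible first byte. */
void start_bits_set(unsigned char table[32], unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Returns 1 if byte value c is marked in the table, else 0. */
int start_bits_test(const unsigned char table[32], unsigned char c)
{
    return (table[c / 8] >> (c % 8)) & 1;
}

/* Returns the offset of the first byte in subject that could start a
 * match, or len if there is none. */
size_t first_candidate(const unsigned char table[32],
                       const char *subject, size_t len)
{
    size_t i;
    assert(table != NULL && subject != NULL);
    for (i = 0; i < len; i++) {
        if (start_bits_test(table, (unsigned char)subject[i]))
            return i;
    }
    return len;
}
```

With a table in which only "c" and "d" are marked, scanning the subject "a big dog" skips straight to offset 6, so the full matching logic never has to run at the earlier positions.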
822 Return the value of the rightmost literal byte that must
823 exist in any matched string, other than at its start, if
824 such a byte has been recorded. The fourth argument should
825 point to an int variable. If there is no such byte, -1 is
826 returned. For anchored patterns, a last literal byte is
827 recorded only if it follows something of variable length.
828 For example, for the pattern /^a\d+z\d+/ the returned value
829 is "z", but for /^a\dz\d/ the returned value is -1.
835 PCRE supports the use of named as well as numbered capturing
836 parentheses. The names are just an additional way of identi-
837 fying the parentheses, which still acquire a number. A
838 caller that wants to extract data from a named subpattern
839 must convert the name to a number in order to access the
840 correct pointers in the output vector (described with
841 pcre_exec() below). In order to do this, it must first use
842 these three values to obtain the name-to-number mapping
843 table for the pattern.
845 The map consists of a number of fixed-size entries.
846 PCRE_INFO_NAMECOUNT gives the number of entries, and
847 PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both
848 of these return an int value. The entry size depends on the
849 length of the longest name. PCRE_INFO_NAMETABLE returns a
850 pointer to the first entry of the table (a pointer to char).
851 The first two bytes of each entry are the number of the cap-
852 turing parenthesis, most significant byte first. The rest of
853 the entry is the corresponding name, zero terminated. The
854 names are in alphabetical order. For example, consider the
855 following pattern (assume PCRE_EXTENDED is set, so white
856 space - including newlines - is ignored):
858 (?P<date> (?P<year>(\d\d)?\d\d) -
859 (?P<month>\d\d) - (?P<day>\d\d) )
There are four named subpatterns, so the table has four
entries, and each entry in the table is eight bytes long.
The table is as follows, with non-printing bytes shown in
hex, and undefined bytes shown as ??:
  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
871 When writing code to extract data from named subpatterns,
872 remember that the length of each entry may be different for
873 each compiled pattern.
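
The table layout described above can be walked with a few lines
of C. This is only a sketch: lookup_group_number() is a
hypothetical helper, and its -1 return for an unknown name is
this example's own convention, not a PCRE error code. The table
pointer, entry count and entry size are assumed to have been
obtained beforehand from the three values described above.

```c
#include <string.h>

/*
 * Look up a capturing-parenthesis number by name in a name-to-number
 * table of `count` entries, each `entry_size` bytes long. In each entry,
 * bytes 0 and 1 hold the number (most significant byte first) and the
 * zero-terminated name starts at byte 2.
 * Returns the number, or -1 if the name is not in the table.
 */
int lookup_group_number(const char *table, int count,
                        int entry_size, const char *name)
{
    int i;
    for (i = 0; i < count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

Because the entry size is passed in rather than hard-coded, the
same function works for any compiled pattern. Since the names
are in alphabetical order, a binary search would also be
possible; the linear scan is kept here for brevity.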
877 Return a copy of the options with which the pattern was com-
878 piled. The fourth argument should point to an unsigned long
879 int variable. These option bits are those specified in the
880 call to pcre_compile(), modified by any top-level option
881 settings within the pattern itself.
883 A pattern is automatically anchored by PCRE if all of its
884 top-level alternatives begin with one of the following:
  ^     unless PCRE_MULTILINE is set
  \A    always
  \G    always
  .*    if PCRE_DOTALL is set and there are no back
          references to the subpattern in which .* appears
892 For such patterns, the PCRE_ANCHORED bit is set in the
893 options returned by pcre_fullinfo().
897 Return the size of the compiled pattern, that is, the value
898 that was passed as the argument to pcre_malloc() when PCRE
899 was getting memory in which to place the compiled data. The
900 fourth argument should point to a size_t variable.
Return the size of the data block pointed to by the
study_data field in a pcre_extra block. That is, it is the
value that was passed to pcre_malloc() when PCRE was getting
memory into which to place the data created by pcre_study().
The fourth argument should point to a size_t variable.
int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
915 The pcre_info() function is now obsolete because its inter-
916 face is too restrictive to return all the available data
917 about a compiled pattern. New programs should use
918 pcre_fullinfo() instead. The yield of pcre_info() is the
919 number of capturing subpatterns, or one of the following
920 negative numbers:
  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
925 If the optptr argument is not NULL, a copy of the options
926 with which the pattern was compiled is placed in the integer
927 it points to (see PCRE_INFO_OPTIONS above).
If the pattern is not anchored and the firstcharptr argument
is not NULL, it is used to pass back information about the
first character of any matched string (see the description of
the first-byte information above).
937 int pcre_exec(const pcre *code, const pcre_extra *extra,
938 const char *subject, int length, int startoffset,
939 int options, int *ovector, int ovecsize);
941 The function pcre_exec() is called to match a subject string
942 against a pre-compiled pattern, which is passed in the code
943 argument. If the pattern has been studied, the result of the
944 study should be passed in the extra argument.
946 Here is an example of a simple call to pcre_exec():
  int rc;
  int ovector[30];
  rc = pcre_exec(
    re,             /* result of pcre_compile() */
    NULL,           /* we didn't study the pattern */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    ovector,        /* vector for substring information */
    30);            /* number of elements in the vector */
960 If the extra argument is not NULL, it must point to a
961 pcre_extra data block. The pcre_study() function returns
962 such a block (when it doesn't return NULL), but you can also
963 create one for yourself, and pass additional information in
964 it. The fields in the block are as follows:
  unsigned long int flags;
  void *study_data;
  unsigned long int match_limit;
  void *callout_data;
971 The flags field is a bitmap that specifies which of the
972 other fields are set. The flag bits are:
978 Other flag bits should be set to zero. The study_data field
979 is set in the pcre_extra block that is returned by
980 pcre_study(), together with the appropriate flag bit. You
981 should not set this yourself, but you can add to the block
982 by setting the other fields.
984 The match_limit field provides a means of preventing PCRE
985 from using up a vast amount of resources when running pat-
986 terns that are not going to match, but which have a very
987 large number of possibilities in their search trees. The
988 classic example is the use of nested unlimited repeats.
989 Internally, PCRE uses a function called match() which it
990 calls repeatedly (sometimes recursively). The limit is
991 imposed on the number of times this function is called dur-
992 ing a match, which has the effect of limiting the amount of
993 recursion and backtracking that can take place. For patterns
994 that are not anchored, the count starts from zero for each
995 position in the subject string.
The default limit for the library can be set when PCRE is
built; if it is not set then, the limit is 10 million, which
handles all but the most extreme cases. You can reduce the
limit by supplying pcre_exec() with a pcre_extra block in
which match_limit is set to a smaller value, and
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
limit is exceeded, pcre_exec() returns the match-limit error
described with the other error returns below.
The callout_data field is used in conjunction with the
"callout" feature, which is described in the pcrecallout
documentation.
The PCRE_ANCHORED option can be passed in the options
argument, whose unused bits must be zero. This limits
pcre_exec() to matching at the first matching position.
However, if a pattern was compiled with PCRE_ANCHORED, or
turned out to be anchored by virtue of its contents, it
cannot be made unanchored at matching time.
1017 There are also three further options that can be set only at
1018 matching time:
1022 The first character of the string is not the beginning of a
1023 line, so the circumflex metacharacter should not match
1024 before it. Setting this without PCRE_MULTILINE (at compile
1025 time) causes circumflex never to match.
1029 The end of the string is not the end of a line, so the dol-
1030 lar metacharacter should not match it nor (except in multi-
1031 line mode) a newline immediately before it. Setting this
1032 without PCRE_MULTILINE (at compile time) causes dollar never
1033 to match.
1037 An empty string is not considered to be a valid match if
1038 this option is set. If there are alternatives in the pat-
1039 tern, they are tried. If all the alternatives match the
1040 empty string, the entire match fails. For example, if the
1041 pattern
1043 a?b?
1045 is applied to a string not beginning with "a" or "b", it
1046 matches the empty string at the start of the subject. With
1047 PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
1048 further into the string for occurrences of "a" or "b".
1050 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
1051 make a special case of a pattern match of the empty string
1052 within its split() function, and when using the /g modifier.
1053 It is possible to emulate Perl's behaviour after matching a
1054 null string by first trying the match again at the same
1055 offset with PCRE_NOTEMPTY set, and then if that fails by
1056 advancing the starting offset (see below) and trying an
1057 ordinary match again.
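
The retry-then-advance procedure just described can be sketched
as a loop. To keep the example self-contained it does not call
PCRE: match_fn is a hypothetical stand-in for a pcre_exec() call
on one compiled pattern, its notempty argument stands in for
passing PCRE_NOTEMPTY, and match_a_opt_b_opt() is a toy matcher
for the pattern a?b? used only to exercise the loop.

```c
/*
 * Hypothetical matcher interface standing in for pcre_exec(): search
 * `subject` (of `length` bytes) from offset `start`. On success return a
 * positive value and set ov[0], ov[1] to the match offsets; on failure
 * return a negative value. A non-zero `notempty` rejects empty matches.
 */
typedef int (*match_fn)(const char *subject, int length, int start,
                        int notempty, int *ov);

/* Toy matcher for the pattern a?b? (illustration only). */
int match_a_opt_b_opt(const char *s, int len, int start,
                      int notempty, int *ov)
{
    int p;
    for (p = start; p <= len; p++) {
        int e = p;
        if (e < len && s[e] == 'a') e++;
        if (e < len && s[e] == 'b') e++;
        if (e > p || !notempty) {
            ov[0] = p;
            ov[1] = e;
            return 1;
        }
    }
    return -1;
}

/*
 * Count successive matches the way the text describes: after an empty
 * match, retry at the same offset with empty matches disallowed; if that
 * fails, advance the offset by one and try an ordinary match again.
 */
int count_matches(match_fn match, const char *subject, int length)
{
    int start = 0, notempty = 0, count = 0;
    int ov[3];
    while (start <= length) {
        int rc = match(subject, length, start, notempty, ov);
        if (rc < 0) {
            if (!notempty)
                break;          /* ordinary match failed: no more matches */
            start++;            /* retry failed: advance and go on */
            notempty = 0;
            continue;
        }
        count++;
        start = ov[1];
        notempty = (ov[0] == ov[1]);   /* was this match empty? */
    }
    return count;
}
```

For the subject "xab" the loop finds the empty string at offset
0, then "ab", then the empty string at the end of the subject.
In a UTF-8 subject, the advance step would need to skip a whole
character rather than one byte.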
1059 The subject string is passed to pcre_exec() as a pointer in
1060 subject, a length in length, and a starting offset in star-
1061 toffset. Unlike the pattern string, the subject may contain
1062 binary zero bytes. When the starting offset is zero, the
1063 search for a match starts at the beginning of the subject,
1064 and this is by far the most common case.
1066 If the pattern was compiled with the PCRE_UTF8 option, the
1067 subject must be a sequence of bytes that is a valid UTF-8
1068 string. If an invalid UTF-8 string is passed, PCRE's
1069 behaviour is not defined.
1071 A non-zero starting offset is useful when searching for
1072 another match in the same subject by calling pcre_exec()
1073 again after a previous success. Setting startoffset differs
1074 from just passing over a shortened string and setting
1075 PCRE_NOTBOL in the case of a pattern that begins with any
1076 kind of lookbehind. For example, consider the pattern
1078 \Biss\B
which finds occurrences of "iss" in the middle of words. (\B
matches only if the current position in the subject is not a
word boundary.) When applied to the string "Mississippi" the
first call to pcre_exec() finds the first occurrence. If
pcre_exec() is called again with just the remainder of the
subject, namely "issippi", it does not match, because \B is
always false at the start of the subject, which is deemed to
be a word boundary. However, if pcre_exec() is passed the
entire string again, but with startoffset set to 4, it finds
the second occurrence of "iss" because it is able to look
behind the starting point to discover that it is preceded by
a letter.
1093 If a non-zero starting offset is passed when the pattern is
1094 anchored, one attempt to match at the given offset is tried.
1095 This can only succeed if the pattern does not require the
1096 match to be at the start of the subject.
1098 In general, a pattern matches a certain portion of the sub-
1099 ject, and in addition, further substrings from the subject
1100 may be picked out by parts of the pattern. Following the
1101 usage in Jeffrey Friedl's book, this is called "capturing"
1102 in what follows, and the phrase "capturing subpattern" is
1103 used for a fragment of a pattern that picks out a substring.
1104 PCRE supports several other kinds of parenthesized subpat-
1105 tern that do not cause substrings to be captured.
1107 Captured substrings are returned to the caller via a vector
1108 of integer offsets whose address is passed in ovector. The
1109 number of elements in the vector is passed in ovecsize. The
1110 first two-thirds of the vector is used to pass back captured
1111 substrings, each substring using a pair of integers. The
1112 remaining third of the vector is used as workspace by
1113 pcre_exec() while matching capturing subpatterns, and is not
1114 available for passing back information. The length passed in
1115 ovecsize should always be a multiple of three. If it is not,
1116 it is rounded down.
1118 When a match has been successful, information about captured
1119 substrings is returned in pairs of integers, starting at the
1120 beginning of ovector, and continuing up to two-thirds of its
1121 length at the most. The first element of a pair is set to
1122 the offset of the first character in a substring, and the
1123 second is set to the offset of the first character after the
1124 end of a substring. The first pair, ovector[0] and ovec-
1125 tor[1], identify the portion of the subject string matched
1126 by the entire pattern. The next pair is used for the first
1127 capturing subpattern, and so on. The value returned by
1128 pcre_exec() is the number of pairs that have been set. If
1129 there are no capturing subpatterns, the return value from a
1130 successful match is 1, indicating that just the first pair
1131 of offsets has been set.
1133 Some convenience functions are provided for extracting the
1134 captured substrings as separate strings. These are described
1135 in the following section.
It is possible for a capturing subpattern number n+1 to
match some part of the subject when subpattern n has not
been used at all. For example, if the string "abc" is
matched against the pattern (a|(z))(bc), subpatterns 1 and 3
are matched, but 2 is not. When this happens, both offset
values corresponding to the unused subpattern are set to -1.
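
As an illustration of reading the offset pairs directly, the
following sketch copies one captured substring out of the
subject and treats a pair of -1 offsets as "unset".
copy_capture() is a hypothetical helper written for this
example; it is not one of the PCRE convenience functions, and
its -1 return value is its own convention.

```c
#include <string.h>

/*
 * Copy capture `n` (0 = the whole match) out of `subject`, using the
 * offset pairs that pcre_exec() left in `ovector`. `rc` is the positive
 * value pcre_exec() returned, i.e. the number of pairs that were set.
 * Returns the substring length, or -1 if `n` is out of range, the
 * capture is unset, or `buf` (of `bufsize` bytes) is too small.
 */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, int bufsize)
{
    int start, len;
    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];              /* first character of substring */
    len = ovector[2 * n + 1] - start;    /* end offset is one past the end */
    if (start < 0 || len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For "abc" matched against (a|(z))(bc), the pairs would be
(0,3), (0,1), (-1,-1) and (1,3), so capture 3 yields "bc" and
capture 2 is reported as unset.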
1144 If a capturing subpattern is matched repeatedly, it is the
1145 last portion of the string that it matched that gets
1146 returned.
1148 If the vector is too small to hold all the captured sub-
1149 strings, it is used as far as possible (up to two-thirds of
1150 its length), and the function returns a value of zero. In
1151 particular, if the substring offsets are not of interest,
1152 pcre_exec() may be called with ovector passed as NULL and
1153 ovecsize as zero. However, if the pattern contains back
1154 references and the ovector isn't big enough to remember the
1155 related substrings, PCRE has to get additional memory for
1156 use during matching. Thus it is usually advisable to supply
1157 an ovector.
1159 Note that pcre_info() can be used to find out how many cap-
1160 turing subpatterns there are in a compiled pattern. The
1161 smallest size for ovector that will allow for n captured
1162 substrings, in addition to the offsets of the substring
1163 matched by the whole pattern, is (n+1)*3.
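
The sizing arithmetic can be written down as two small helpers.
Both function names are hypothetical and serve only to restate
the rules given above.

```c
/*
 * Smallest ovecsize that can report `n` capturing subpatterns plus the
 * whole match: (n + 1) offset pairs, plus one third again as workspace.
 */
int ovecsize_for_captures(int n)
{
    return (n + 1) * 3;
}

/*
 * Number of offset pairs that can be reported for a given ovecsize. The
 * size is rounded down to a multiple of three and two-thirds of it holds
 * pairs of integers, which comes to ovecsize / 3 pairs.
 */
int pairs_reportable(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a pattern with three capturing subpatterns needs
an ovector of at least 12 integers, and the 30-element vector
used in the earlier example can report up to 10 pairs.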
1165 If pcre_exec() fails, it returns a negative number. The fol-
1166 lowing are defined in the header file:
1170 The subject string did not match the pattern.
1174 Either code or subject was passed as NULL, or ovector was
1175 NULL and ovecsize was not zero.
1179 An unrecognized bit was set in the options argument.
1183 PCRE stores a 4-byte "magic number" at the start of the com-
1184 piled code, to catch the case when it is passed a junk
1185 pointer. This is the error it gives when the magic number
1186 isn't present.
1190 While running the pattern match, an unknown item was encoun-
1191 tered in the compiled pattern. This error could be caused by
1192 a bug in PCRE or by overwriting of the compiled pattern.
1196 If a pattern contains back references, but the ovector that
1197 is passed to pcre_exec() is not big enough to remember the
1198 referenced substrings, PCRE gets a block of memory at the
1199 start of matching to use for this purpose. If the call via
1200 pcre_malloc() fails, this error is given. The memory is
1201 freed at the end of matching.
1205 This error is used by the pcre_copy_substring(),
1206 pcre_get_substring(), and pcre_get_substring_list() func-
1207 tions (see below). It is never returned by pcre_exec().
1211 The recursion and backtracking limit, as specified by the
1212 match_limit field in a pcre_extra structure (or defaulted)
1213 was reached. See the description above.
1217 This error is never generated by pcre_exec() itself. It is
1218 provided for use by callout functions that want to yield a
1219 distinctive error code. See the pcrecallout documentation
1220 for details.
1225 int pcre_copy_substring(const char *subject, int *ovector,
1226 int stringcount, int stringnumber, char *buffer,
1227 int buffersize);
1229 int pcre_get_substring(const char *subject, int *ovector,
1230 int stringcount, int stringnumber,
1231 const char **stringptr);
1233 int pcre_get_substring_list(const char *subject,
1234 int *ovector, int stringcount, const char ***listptr);
1236 Captured substrings can be accessed directly by using the
1237 offsets returned by pcre_exec() in ovector. For convenience,
1238 the functions pcre_copy_substring(), pcre_get_substring(),
1239 and pcre_get_substring_list() are provided for extracting
1240 captured substrings as new, separate, zero-terminated
1241 strings. These functions identify substrings by number. The
1242 next section describes functions for extracting named sub-
1243 strings. A substring that contains a binary zero is
1244 correctly extracted and has a further zero added on the end,
1245 but the result is not, of course, a C string.
The first three arguments are the same for all three of
these functions: subject is the subject string which has
just been successfully matched, ovector is a pointer to the
vector of integer offsets that was passed to pcre_exec(),
and stringcount is the number of substrings that were
captured by the match, including the substring that matched
the entire regular expression. This is the value returned by
pcre_exec() if it is greater than zero. If pcre_exec()
returned zero, indicating that it ran out of space in
ovector, the value passed as stringcount should be the size
of the vector divided by three.
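
The rule for choosing stringcount can be captured in a small
helper. This is a sketch; stringcount_from_rc() is a
hypothetical name, and returning -1 for a failed match is this
example's own convention.

```c
/*
 * Derive the stringcount argument from the value `rc` returned by
 * pcre_exec() and the `ovecsize` that was passed to it.
 * Returns -1 when the match failed, since there is nothing to extract.
 */
int stringcount_from_rc(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small: all pairs used */
    return -1;
}
```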
1259 The functions pcre_copy_substring() and pcre_get_substring()
1260 extract a single substring, whose number is given as string-
1261 number. A value of zero extracts the substring that matched
1262 the entire pattern, while higher values extract the captured
1263 substrings. For pcre_copy_substring(), the string is placed
1264 in buffer, whose length is given by buffersize, while for
1265 pcre_get_substring() a new block of memory is obtained via
1266 pcre_malloc, and its address is returned via stringptr. The
1267 yield of the function is the length of the string, not
1268 including the terminating zero, or one of
1272 The buffer was too small for pcre_copy_substring(), or the
1273 attempt to get memory failed for pcre_get_substring().
1277 There is no substring whose number is stringnumber.
1279 The pcre_get_substring_list() function extracts all avail-
1280 able substrings and builds a list of pointers to them. All
1281 this is done in a single block of memory which is obtained
1282 via pcre_malloc. The address of the memory block is returned
1283 via listptr, which is also the start of the list of string
1284 pointers. The end of the list is marked by a NULL pointer.
1285 The yield of the function is zero if all went well, or
1289 if the attempt to get the memory block failed.
1291 When any of these functions encounter a substring that is
1292 unset, which can happen when capturing subpattern number n+1
1293 matches some part of the subject, but subpattern n has not
1294 been used at all, they return an empty string. This can be
1295 distinguished from a genuine zero-length substring by
1296 inspecting the appropriate offset in ovector, which is nega-
1297 tive for unset substrings.
1299 The two convenience functions pcre_free_substring() and
1300 pcre_free_substring_list() can be used to free the memory
1301 returned by a previous call of pcre_get_substring() or
1302 pcre_get_substring_list(), respectively. They do nothing
1303 more than call the function pointed to by pcre_free, which
1304 of course could be called directly from a C program. How-
1305 ever, PCRE is used in some situations where it is linked via
1306 a special interface to another programming language which
1307 cannot use pcre_free directly; it is for these cases that
1308 the functions are provided.
1313 int pcre_copy_named_substring(const pcre *code,
1314 const char *subject, int *ovector,
1315 int stringcount, const char *stringname,
1316 char *buffer, int buffersize);
1318 int pcre_get_stringnumber(const pcre *code,
1319 const char *name);
1321 int pcre_get_named_substring(const pcre *code,
1322 const char *subject, int *ovector,
1323 int stringcount, const char *stringname,
1324 const char **stringptr);
To extract a substring by name, you first have to find the
associated number. This can be done by calling
pcre_get_stringnumber(). The first argument is the compiled
pattern, and the second is the name. For example, for this
pattern
  ab(?P<xxx>\d+)...
1334 the number of the subpattern called "xxx" is 1. Given the
1335 number, you can then extract the substring directly, or use
1336 one of the functions described in the previous section. For
1337 convenience, there are also two functions that do the whole
1338 job.
1340 Most of the arguments of pcre_copy_named_substring() and
1341 pcre_get_named_substring() are the same as those for the
1342 functions that extract by number, and so are not re-
1343 described here. There are just two differences.
1345 First, instead of a substring number, a substring name is
1346 given. Second, there is an extra argument, given at the
1347 start, which is a pointer to the compiled pattern. This is
1348 needed in order to gain access to the name-to-number trans-
1349 lation table.
1351 These functions call pcre_get_stringnumber(), and if it
1352 succeeds, they then call pcre_copy_substring() or
1353 pcre_get_substring(), as appropriate.
1355 Last updated: 03 February 2003
1356 Copyright (c) 1997-2003 University of Cambridge.
1357 -----------------------------------------------------------------------------
1359 NAME
1360 PCRE - Perl-compatible regular expressions
1365 int (*pcre_callout)(pcre_callout_block *);
1367 PCRE provides a feature called "callout", which is a means
1368 of temporarily passing control to the caller of PCRE in the
1369 middle of pattern matching. The caller of PCRE provides an
1370 external function by putting its entry point in the global
1371 variable pcre_callout. By default, this variable contains
1372 NULL, which disables all calling out.
1374 Within a regular expression, (?C) indicates the points at
1375 which the external function is to be called. Different cal-
1376 lout points can be identified by putting a number less than
1377 256 after the letter C. The default value is zero. For
1378 example, this pattern has two callout points:
1380 (?C1)9abc(?C2)def
During matching, when PCRE reaches a callout point (and
pcre_callout is set), the external function is called. Its
only argument is a pointer to a pcre_callout_block structure.
This contains the following variables:
  int version;
  int callout_number;
  int *offset_vector;
  const char *subject;
  int subject_length;
  int start_match;
  int current_position;
  int capture_top;
  int capture_last;
  void *callout_data;
1398 The version field is an integer containing the version
1399 number of the block format. The current version is zero. The
1400 version number may change in future if additional fields are
1401 added, but the intention is never to remove any of the
1402 existing fields.
1404 The callout_number field contains the number of the callout,
1405 as compiled into the pattern (that is, the number after ?C).
1407 The offset_vector field is a pointer to the vector of
1408 offsets that was passed by the caller to pcre_exec(). The
1409 contents can be inspected in order to extract substrings
1410 that have been matched so far, in the same way as for
1411 extracting substrings after a match has completed.
The subject and subject_length fields contain copies of the
values that were passed to pcre_exec().
1415 The start_match field contains the offset within the subject
1416 at which the current match attempt started. If the pattern
1417 is not anchored, the callout function may be called several
1418 times for different starting points.
1420 The current_position field contains the offset within the
1421 subject of the current match pointer.
1423 The capture_top field contains the number of the highest
1424 captured substring so far.
1426 The capture_last field contains the number of the most
1427 recently captured substring.
The callout_data field contains a value that is passed to
pcre_exec() by the caller specifically so that it can be
passed back in callouts. It is passed in the callout_data
field of the pcre_extra data structure. If no such data was
passed, the value of callout_data in a pcre_callout_block is
NULL. There is a description of the pcre_extra structure in
the pcreapi documentation.
1441 The callout function returns an integer. If the value is
1442 zero, matching proceeds as normal. If the value is greater
1443 than zero, matching fails at the current point, but back-
1444 tracking to test other possibilities goes ahead, just as if
1445 a lookahead assertion had failed. If the value is less than
1446 zero, the match is abandoned, and pcre_exec() returns the
1447 value.
1449 Negative values should normally be chosen from the set of
1450 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH
1451 forces a standard "no match" failure. The error number
1452 PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1453 it will never be used by PCRE itself.
1455 Last updated: 21 January 2003
1456 Copyright (c) 1997-2003 University of Cambridge.
1457 -----------------------------------------------------------------------------
1459 NAME
1460 PCRE - Perl-compatible regular expressions
1465 This document describes the differences in the ways that
1466 PCRE and Perl handle regular expressions. The differences
1467 described here are with respect to Perl 5.8.
1469 1. PCRE does not allow repeat quantifiers on lookahead
1470 assertions. Perl permits them, but they do not mean what you
1471 might think. For example, (?!a){3} does not assert that the
1472 next three characters are not "a". It just asserts that the
1473 next character is not "a" three times.
1475 2. Capturing subpatterns that occur inside negative looka-
1476 head assertions are counted, but their entries in the
1477 offsets vector are never set. Perl sets its numerical vari-
1478 ables from any such patterns that are matched before the
1479 assertion fails to match something (thereby succeeding), but
1480 only if the negative lookahead assertion contains just one
1481 branch.
1483 3. Though binary zero characters are supported in the sub-
1484 ject string, they are not allowed in a pattern string
1485 because it is passed as a normal C string, terminated by
1486 zero. The escape sequence "\0" can be used in the pattern to
1487 represent a binary zero.
1489 4. The following Perl escape sequences are not supported:
1490 \l, \u, \L, \U, \P, \p, and \X. In fact these are imple-
1491 mented by Perl's general string-handling and are not part of
1492 its pattern matching engine. If any of these are encountered
1493 by PCRE, an error is generated.
1495 5. PCRE does support the \Q...\E escape for quoting sub-
1496 strings. Characters in between are treated as literals. This
1497 is slightly different from Perl in that $ and @ are also
1498 handled as literals inside the quotes. In Perl, they cause
1499 variable interpolation (but of course PCRE does not have
1500 variables). Note the following examples:
  Pattern            PCRE matches   Perl matches
  \Qabc$xyz\E        abc$xyz        abc followed by the
                                      contents of $xyz
  \Qabc\$xyz\E       abc\$xyz       abc\$xyz
  \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1509 In PCRE, the \Q...\E mechanism is not recognized inside a
1510 character class.
1512 8. Fairly obviously, PCRE does not support the (?{code}) and
1513 (?p{code}) constructions. However, there is some experimen-
1514 tal support for recursive patterns using the non-Perl items
1515 (?R), (?number) and (?P>name). Also, the PCRE "callout"
1516 feature allows an external function to be called during pat-
1517 tern matching.
1519 9. There are some differences that are concerned with the
1520 settings of captured strings when part of a pattern is
1521 repeated. For example, matching "aba" against the pattern
1522 /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set
1523 to "b".
1525 10. PCRE provides some extensions to the Perl regular
1526 expression facilities:
1528 (a) Although lookbehind assertions must match fixed length
1529 strings, each alternative branch of a lookbehind assertion
1530 can match a different length of string. Perl requires them
1531 all to have the same length.
1533 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1534 set, the $ meta-character matches only at the very end of
1535 the string.
1537 (c) If PCRE_EXTRA is set, a backslash followed by a letter
1538 with no special meaning is faulted.
1540 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
1541 tion quantifiers is inverted, that is, by default they are
1542 not greedy, but if followed by a question mark they are.
1544 (e) PCRE_ANCHORED can be used to force a pattern to be tried
1545 only at the first matching position in the subject string.
(f) The options that can be set only at matching time, such
as PCRE_NOTBOL and PCRE_NOTEMPTY, and the
PCRE_NO_AUTO_CAPTURE option have no Perl equivalents.
(g) The (?R), (?number), and (?P>name) constructs allow for
recursive pattern matching (Perl can do this using the
(?p{code}) construct, which PCRE cannot support).
1555 (h) PCRE supports named capturing substrings, using the
1556 Python syntax.
1558 (i) PCRE supports the possessive quantifier "++" syntax,
1559 taken from Sun's Java package.
1561 (j) The (R) condition, for testing recursion, is a PCRE
1562 extension.
1564 (k) The callout facility is PCRE-specific.
1566 Last updated: 03 February 2003
1567 Copyright (c) 1997-2003 University of Cambridge.
1568 -----------------------------------------------------------------------------
1570 NAME
1571 PCRE - Perl-compatible regular expressions
1576 The syntax and semantics of the regular expressions sup-
1577 ported by PCRE are described below. Regular expressions are
1578 also described in the Perl documentation and in a number of
1579 other books, some of which have copious examples. Jeffrey
1580 Friedl's "Mastering Regular Expressions", published by
1581 O'Reilly, covers them in great detail. The description here
1582 is intended as reference documentation.
1584 The basic operation of PCRE is on strings of bytes. However,
1585 there is also support for UTF-8 character strings. To use
1586 this support you must build PCRE to include UTF-8 support,
1587 and then call pcre_compile() with the PCRE_UTF8 option. How
1588 this affects the pattern matching is mentioned in several
1589 places below. There is also a summary of UTF-8 features in
1590 the section on UTF-8 support in the main pcre page.
1592 A regular expression is a pattern that is matched against a
1593 subject string from left to right. Most characters stand for
1594 themselves in a pattern, and match the corresponding charac-
1595 ters in the subject. As a trivial example, the pattern
1597 The quick brown fox
1599 matches a portion of a subject string that is identical to
1600 itself. The power of regular expressions comes from the
1601 ability to include alternatives and repetitions in the pat-
1602 tern. These are encoded in the pattern by the use of meta-
1603 characters, which do not stand for themselves but instead
1604 are interpreted in some special way.
1606 There are two different sets of meta-characters: those that
1607 are recognized anywhere in the pattern except within square
1608 brackets, and those that are recognized in square brackets.
1609 Outside square brackets, the meta-characters are as follows:
  \      general escape character with several uses
  ^      assert start of string (or line, in multiline mode)
  $      assert end of string (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
         also "possessive quantifier"
  {      start min/max quantifier
1627 Part of a pattern that is in square brackets is called a
1628 "character class". In a character class the only meta-
1629 characters are:
1631 \ general escape character
1632 ^ negate the class, but only if the first character
1633 - indicates character range
1634 [ POSIX character class (only if followed by POSIX
1635 syntax)
1636 ] terminates the character class
1638 The following sections describe the use of each of the
1639 meta-characters.
1644 The backslash character has several uses. Firstly, if it is
1645 followed by a non-alphameric character, it takes away any
1646 special meaning that character may have. This use of
1647 backslash as an escape character applies both inside and
1648 outside character classes.
1650 For example, if you want to match a * character, you write
1651 \* in the pattern. This escaping action applies whether or
1652 not the following character would otherwise be interpreted
1653 as a meta-character, so it is always safe to precede a non-
1654 alphameric with backslash to specify that it stands for
1655 itself. In particular, if you want to match a backslash, you
1656 write \\.
1658 If a pattern is compiled with the PCRE_EXTENDED option,
1659 whitespace in the pattern (other than in a character class) and
1660 characters between a # outside a character class and the
1661 next newline character are ignored. An escaping backslash
1662 can be used to include a whitespace or # character as part
1663 of the pattern.
1665 If you want to remove the special meaning from a sequence of
1666 characters, you can do so by putting them between \Q and \E.
1667 This is different from Perl in that $ and @ are handled as
1668 literals in \Q...\E sequences in PCRE, whereas in Perl, $
1669 and @ cause variable interpolation. Note the following exam-
1670 ples:
1672 Pattern            PCRE matches   Perl matches
1674 \Qabc$xyz\E        abc$xyz        abc followed by the
1676                                   contents of $xyz
1677 \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1678 \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1680 The \Q...\E sequence is recognized both inside and outside
1681 character classes.
1683 A second use of backslash provides a way of encoding non-
1684 printing characters in patterns in a visible manner. There
1685 is no restriction on the appearance of non-printing charac-
1686 ters, apart from the binary zero that terminates a pattern,
1687 but when a pattern is being prepared by text editing, it is
1688 usually easier to use one of the following escape sequences
1689 than the binary character it represents:
1691 \a alarm, that is, the BEL character (hex 07)
1692 \cx "control-x", where x is any character
1693 \e escape (hex 1B)
1694 \f formfeed (hex 0C)
1695 \n newline (hex 0A)
1696 \r carriage return (hex 0D)
1697 \t tab (hex 09)
1698 \ddd character with octal code ddd, or backreference
1699 \xhh character with hex code hh
1700 \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1702 The precise effect of \cx is as follows: if x is a lower
1703 case letter, it is converted to upper case. Then bit 6 of
1704 the character (hex 40) is inverted. Thus \cz becomes hex
1705 1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
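
The \cx arithmetic can be sketched in a few lines of Python.
This is only an illustration of the rule for ASCII input, not
PCRE's own code; the function name control_code is invented
for the example.

```python
def control_code(x: str) -> int:
    """Return the code that a control-x escape yields for ASCII x."""
    code = ord(x)
    # Lower case letters are first converted to upper case.
    if ord("a") <= code <= ord("z"):
        code -= 0x20
    # Then bit 6 (hex 40) is inverted.
    return code ^ 0x40


print(hex(control_code("z")))  # 0x1a
print(hex(control_code("{")))  # 0x3b
print(hex(control_code(";")))  # 0x7b
```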
1707 After \x, from zero to two hexadecimal digits are read
1708 (letters can be in upper or lower case). In UTF-8 mode, any
1709 number of hexadecimal digits may appear between \x{ and },
1710 but the value of the character code must be less than 2**31
1711 (that is, the maximum hexadecimal value is 7FFFFFFF). If
1712 characters other than hexadecimal digits appear between \x{
1713 and }, or if there is no terminating }, this form of escape
1714 is not recognized. Instead, the initial \x will be inter-
1715 preted as a basic hexadecimal escape, with no following
1716 digits, giving a byte whose value is zero.
1718 Characters whose value is less than 256 can be defined by
1719 either of the two syntaxes for \x when PCRE is in UTF-8
1720 mode. There is no difference in the way they are handled.
1721 For example, \xdc is exactly the same as \x{dc}.
1723 After \0 up to two further octal digits are read. In both
1724 cases, if there are fewer than two digits, just those that
1725 are present are used. Thus the sequence \0\x\07 specifies
1726 two binary zeros followed by a BEL character (code value 7).
1727 Make sure you supply two digits after the initial zero if
1728 the character that follows is itself an octal digit.
1730 The handling of a backslash followed by a digit other than 0
1731 is complicated. Outside a character class, PCRE reads it
1732 and any following digits as a decimal number. If the number
1733 is less than 10, or if there have been at least that many
1734 previous capturing left parentheses in the expression, the
1735 entire sequence is taken as a back reference. A description
1736 of how this works is given later, following the discussion
1737 of parenthesized subpatterns.
1739 Inside a character class, or if the decimal number is
1740 greater than 9 and there have not been that many capturing
1741 subpatterns, PCRE re-reads up to three octal digits follow-
1742 ing the backslash, and generates a single byte from the
1743 least significant 8 bits of the value. Any subsequent digits
1744 stand for themselves. For example:
1746 \040 is another way of writing a space
1747 \40 is the same, provided there are fewer than 40
1748 previous capturing subpatterns
1749 \7 is always a back reference
1750 \11 might be a back reference, or another way of
1751 writing a tab
1752 \011 is always a tab
1753 \0113 is a tab followed by the character "3"
1754 \113 might be a back reference, otherwise the
1755 character with octal code 113
1756 \377 might be a back reference, otherwise
1757 the byte consisting entirely of 1 bits
1758 \81 is either a back reference, or a binary zero
1759 followed by the two characters "8" and "1"
1761 Note that octal values of 100 or greater must not be intro-
1762 duced by a leading zero, because no more than three octal
1763 digits are ever read.
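
The decision rule for a backslash followed by digits can be
modelled roughly as follows. This is a sketch of the rule as
worded in this section, not PCRE's parser; the function name
interpret_digits and its arguments are hypothetical.

```python
def interpret_digits(digits: str, groups: int, in_class: bool = False):
    """Classify the run of digits that follows a backslash.

    digits   -- the decimal digits after the backslash
    groups   -- capturing left parentheses seen so far
    in_class -- True when inside a character class
    """
    number = int(digits)
    if (not in_class and digits[0] != "0"
            and (number < 10 or number <= groups)):
        return ("backref", number)
    # Otherwise re-read up to three octal digits, keeping the low 8 bits.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = int(octal, 8) & 0xFF if octal else 0
    # Digits that were not consumed stand for themselves.
    return ("byte", value, digits[len(octal):])


print(interpret_digits("7", 0))     # ('backref', 7)
print(interpret_digits("11", 0))    # ('byte', 9, '')
print(interpret_digits("0113", 0))  # ('byte', 9, '3')
print(interpret_digits("81", 0))    # ('byte', 0, '81')
```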
1765 All the sequences that define a single byte value or a sin-
1766 gle UTF-8 character (in UTF-8 mode) can be used both inside
1767 and outside character classes. In addition, inside a charac-
1768 ter class, the sequence \b is interpreted as the backspace
1769 character (hex 08). Outside a character class it has a dif-
1770 ferent meaning (see below).
1772 The third use of backslash is for specifying generic charac-
1773 ter types:
1775 \d any decimal digit
1776 \D any character that is not a decimal digit
1777 \s any whitespace character
1778 \S any character that is not a whitespace character
1779 \w any "word" character
1780 \W any "non-word" character
1782 Each pair of escape sequences partitions the complete set of
1783 characters into two disjoint sets. Any given character
1784 matches one, and only one, of each pair.
1786 In UTF-8 mode, characters with values greater than 255 never
1787 match \d, \s, or \w, and always match \D, \S, and \W.
1789 For compatibility with Perl, \s does not match the VT
1790 character (code 11). This makes it different from the POSIX
1791 "space" class. The \s characters are HT (9), LF (10), FF
1792 (12), CR (13), and space (32).
1794 A "word" character is any letter or digit or the underscore
1795 character, that is, any character which can be part of a
1796 Perl "word". The definition of letters and digits is con-
1797 trolled by PCRE's character tables, and may vary if locale-
1798 specific matching is taking place (see "Locale support" in
1799 the pcreapi page). For example, in the "fr" (French) locale,
1800 some character codes greater than 128 are used for accented
1801 letters, and these are matched by \w.
1803 These character type sequences can appear both inside and
1804 outside character classes. They each match one character of
1805 the appropriate type. If the current matching point is at
1806 the end of the subject string, all of them fail, since there
1807 is no character to match.
1809 The fourth use of backslash is for certain simple asser-
1810 tions. An assertion specifies a condition that has to be met
1811 at a particular point in a match, without consuming any
1812 characters from the subject string. The use of subpatterns
1813 for more complicated assertions is described below. The
1814 backslashed assertions are
1816 \b matches at a word boundary
1817 \B matches when not at a word boundary
1818 \A matches at start of subject
1819 \Z matches at end of subject or before newline at end
1820 \z matches at end of subject
1821 \G matches at first matching position in subject
1823 These assertions may not appear in character classes (but
1824 note that \b has a different meaning, namely the backspace
1825 character, inside a character class).
1827 A word boundary is a position in the subject string where
1828 the current character and the previous character do not both
1829 match \w or \W (i.e. one matches \w and the other matches
1830 \W), or the start or end of the string if the first or last
1831 character matches \w, respectively.
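
As a small illustration of the word boundary rule, the
following uses Python's re module as a stand-in for PCRE (an
assumption made for convenience; for ASCII text its \b follows
the same definition). Only the occurrences of "cat" with a
word/non-word transition on both sides are found.

```python
import re

subject = "cat concat cat's"
# \b needs a \w / \W transition (or a string edge) on each side of "cat".
starts = [m.start() for m in re.finditer(r"\bcat\b", subject)]
print(starts)  # [0, 11]
```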
1832 The \A, \Z, and \z assertions differ from the traditional
1833 circumflex and dollar (described below) in that they only
1834 ever match at the very start and end of the subject string,
1835 whatever options are set. Thus, they are independent of mul-
1836 tiline mode.
1838 They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL
1839 options. If the startoffset argument of pcre_exec() is non-
1840 zero, indicating that matching is to start at a point other
1841 than the beginning of the subject, \A can never match. The
1842 difference between \Z and \z is that \Z matches before a
1843 newline that is the last character of the string as well as
1844 at the end of the string, whereas \z matches only at the
1845 end.
1847 The \G assertion is true only when the current matching
1848 position is at the start point of the match, as specified by
1849 the startoffset argument of pcre_exec(). It differs from \A
1850 when the value of startoffset is non-zero. By calling
1851 pcre_exec() multiple times with appropriate arguments, you
1852 can mimic Perl's /g option, and it is in this kind of imple-
1853 mentation where \G can be useful.
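
The repeated-call technique can be sketched as below. Python's
re module is used as a stand-in: Pattern.match(string, pos)
attempts a match only at pos, which plays the role of a
\G-anchored pattern run with a non-zero startoffset. The token
pattern and the function name tokens are invented for the
example.

```python
import re

token = re.compile(r"\d+|[a-z]+|\s+")


def tokens(subject: str):
    """Split subject into tokens, each starting where the last one ended."""
    out = []
    pos = 0
    while pos < len(subject):
        m = token.match(subject, pos)  # anchored at pos, like \G
        if m is None:
            raise ValueError(f"no token at offset {pos}")
        out.append(m.group())
        pos = m.end()  # the next call starts at the end of this match
    return out


print(tokens("abc 123 de"))  # ['abc', ' ', '123', ' ', 'de']
```

Every alternative in the token pattern consumes at least one
character, so the offset always advances and the loop ends.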
1855 Note, however, that PCRE's interpretation of \G, as the
1856 start of the current match, is subtly different from Perl's,
1857 which defines it as the end of the previous match. In Perl,
1858 these can be different when the previously matched string
1859 was empty. Because PCRE does just one match at a time, it
1860 cannot reproduce this behaviour.
1862 If all the alternatives of a pattern begin with \G, the
1863 expression is anchored to the starting match position, and
1864 the "anchored" flag is set in the compiled regular expres-
1865 sion.
1870 Outside a character class, in the default matching mode, the
1871 circumflex character is an assertion which is true only if
1872 the current matching point is at the start of the subject
1873 string. If the startoffset argument of pcre_exec() is non-
1874 zero, circumflex can never match if the PCRE_MULTILINE
1875 option is unset. Inside a character class, circumflex has an
1876 entirely different meaning (see below).
1878 Circumflex need not be the first character of the pattern if
1879 a number of alternatives are involved, but it should be the
1880 first thing in each alternative in which it appears if the
1881 pattern is ever to match that branch. If all possible alter-
1882 natives start with a circumflex, that is, if the pattern is
1883 constrained to match only at the start of the subject, it is
1884 said to be an "anchored" pattern. (There are also other con-
1885 structs that can cause a pattern to be anchored.)
1887 A dollar character is an assertion which is true only if the
1888 current matching point is at the end of the subject string,
1889 or immediately before a newline character that is the last
1890 character in the string (by default). Dollar need not be the
1891 last character of the pattern if a number of alternatives
1892 are involved, but it should be the last item in any branch
1893 in which it appears. Dollar has no special meaning in a
1894 character class.
1896 The meaning of dollar can be changed so that it matches only
1897 at the very end of the string, by setting the
1898 PCRE_DOLLAR_ENDONLY option at compile time. This does not
1899 affect the \Z assertion.
1901 The meanings of the circumflex and dollar characters are
1902 changed if the PCRE_MULTILINE option is set. When this is
1903 the case, they match immediately after and immediately
1904 before an internal newline character, respectively, in addi-
1905 tion to matching at the start and end of the subject string.
1906 For example, the pattern /^abc$/ matches the subject string
1907 "def\nabc" in multiline mode, but not otherwise. Conse-
1908 quently, patterns that are anchored in single line mode
1909 because all branches start with ^ are not anchored in multi-
1910 line mode, and a match for circumflex is possible when the
1911 startoffset argument of pcre_exec() is non-zero. The
1912 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1913 set.
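
The effect of multiline mode can be checked with a short
script. Python's re module stands in for PCRE here (an
assumption; re.MULTILINE corresponds to PCRE_MULTILINE for
this case).

```python
import re

subject = "def\nabc"
single = re.search(r"^abc$", subject)               # no match
multi = re.search(r"^abc$", subject, re.MULTILINE)  # matches after the newline
print(single, multi.start())  # None 4

# By default, dollar also matches just before a final newline.
before_newline = re.search(r"abc$", "abc\n")
print(before_newline is not None)  # True
```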
1915 Note that the sequences \A, \Z, and \z can be used to match
1916 the start and end of the subject in both modes, and if all
1917 branches of a pattern start with \A it is always anchored,
1918 whether PCRE_MULTILINE is set or not.
1923 Outside a character class, a dot in the pattern matches any
1924 one character in the subject, including a non-printing char-
1925 acter, but not (by default) newline. In UTF-8 mode, a dot
1926 matches any UTF-8 character, which might be more than one
1927 byte long, except (by default) for newline. If the
1928 PCRE_DOTALL option is set, dots match newlines as well. The
1929 handling of dot is entirely independent of the handling of
1930 circumflex and dollar, the only relationship being that they
1931 both involve newline characters. Dot has no special meaning
1932 in a character class.
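
A minimal check of the dot behaviour, again using Python's re
module as a stand-in for PCRE (re.DOTALL plays the part of
PCRE_DOTALL):

```python
import re

plain = re.search(r"a.b", "a\nb")              # dot does not match newline
dotall = re.search(r"a.b", "a\nb", re.DOTALL)  # with DOTALL it does
print(plain, dotall is not None)  # None True
```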
1938 Outside a character class, the escape sequence \C matches
1939 any one byte, both in and out of UTF-8 mode. Unlike a dot,
1940 it always matches a newline. The feature is provided in Perl
1941 in order to match individual bytes in UTF-8 mode. Because
1942 it breaks up UTF-8 characters into individual bytes, what
1943 remains in the string may be a malformed UTF-8 string. For
1944 this reason it is best avoided.
1946 PCRE does not allow \C to appear in lookbehind assertions
1947 (see below), because in UTF-8 mode it makes it impossible to
1948 calculate the length of the lookbehind.
1953 An opening square bracket introduces a character class, ter-
1954 minated by a closing square bracket. A closing square
1955 bracket on its own is not special. If a closing square
1956 bracket is required as a member of the class, it should be
1957 the first data character in the class (after an initial cir-
1958 cumflex, if present) or escaped with a backslash.
1960 A character class matches a single character in the subject.
1961 In UTF-8 mode, the character may occupy more than one byte.
1962 A matched character must be in the set of characters defined
1963 by the class, unless the first character in the class defin-
1964 ition is a circumflex, in which case the subject character
1965 must not be in the set defined by the class. If a circumflex
1966 is actually required as a member of the class, ensure it is
1967 not the first character, or escape it with a backslash.
1969 For example, the character class [aeiou] matches any lower
1970 case vowel, while [^aeiou] matches any character that is not
1971 a lower case vowel. Note that a circumflex is just a con-
1972 venient notation for specifying the characters which are in
1973 the class by enumerating those that are not. It is not an
1974 assertion: it still consumes a character from the subject
1975 string, and fails if the current pointer is at the end of
1976 the string.
1978 In UTF-8 mode, characters with values greater than 255 can
1979 be included in a class as a literal string of bytes, or by
1980 using the \x{ escaping mechanism.
1982 When caseless matching is set, any letters in a class
1983 represent both their upper case and lower case versions, so
1984 for example, a caseless [aeiou] matches "A" as well as "a",
1985 and a caseless [^aeiou] does not match "A", whereas a case-
1986 ful version would. PCRE does not support the concept of case
1987 for characters with values greater than 255.
1988 The newline character is never treated in any special way in
1989 character classes, whatever the setting of the PCRE_DOTALL
1990 or PCRE_MULTILINE options is. A class such as [^a] will
1991 always match a newline.
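
Two properties of negated classes described above can be seen
in a short sketch (Python's re module is assumed as a stand-in
for PCRE): a negated class matches a newline, and it still
needs a character to consume.

```python
import re

newline_hit = re.search(r"[^a]", "\n")  # newline is just another character
at_end = re.search(r"x[^aeiou]", "x")   # nothing left to consume after "x"
print(newline_hit is not None, at_end)  # True None
```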
1993 The minus (hyphen) character can be used to specify a range
1994 of characters in a character class. For example, [d-m]
1995 matches any letter between d and m, inclusive. If a minus
1996 character is required in a class, it must be escaped with a
1997 backslash or appear in a position where it cannot be inter-
1998 preted as indicating a range, typically as the first or last
1999 character in the class.
2001 It is not possible to have the literal character "]" as the
2002 end character of a range. A pattern such as [W-]46] is
2003 interpreted as a class of two characters ("W" and "-") fol-
2004 lowed by a literal string "46]", so it would match "W46]" or
2005 "-46]". However, if the "]" is escaped with a backslash it
2006 is interpreted as the end of range, so [W-\]46] is inter-
2007 preted as a single class containing a range followed by two
2008 separate characters. The octal or hexadecimal representation
2009 of "]" can also be used to end a range.
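
The two readings can be compared directly. The sketch uses
Python's re module, which happens to parse these two classes
in the same way; that equivalence is an assumption of the
example rather than a statement about PCRE internals.

```python
import re

two_char_class = re.compile(r"[W-]46]")  # class "W" or "-", then "46]"
range_class = re.compile(r"[W-\]46]")    # range W to ], plus "4" and "6"

whole = [s for s in ("W46]", "-46]", "X46]") if two_char_class.fullmatch(s)]
single = [c for c in "WXYZ[\\]46-" if range_class.fullmatch(c)]
print(whole)   # ['W46]', '-46]']
print(single)  # ['W', 'X', 'Y', 'Z', '[', '\\', ']', '4', '6']
```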
2011 Ranges operate in the collating sequence of character
2012 values. They can also be used for characters specified
2013 numerically, for example [\000-\037]. In UTF-8 mode, ranges
2014 can include characters whose values are greater than 255,
2015 for example [\x{100}-\x{2ff}].
2017 If a range that includes letters is used when caseless
2018 matching is set, it matches the letters in either case. For
2019 example, [W-c] is equivalent to [][\^_`wxyzabc], matched
2020 caselessly, and if character tables for the "fr" locale are
2021 in use, [\xc8-\xcb] matches accented E characters in both
2022 cases.
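
The [W-c] equivalence can be verified by enumerating the ASCII
characters that a caseless [W-c] accepts. Python's re module
is used as a stand-in for PCRE, and only codes below 128 are
examined.

```python
import re

caseless = re.compile(r"[W-c]", re.IGNORECASE)
matched = {chr(c) for c in range(128) if caseless.fullmatch(chr(c))}

# The range itself, plus the other-case forms of its letters.
expected = set("WXYZ[\\]^_`abc" "wxyz" "ABC")
print(matched == expected)  # True
```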
2024 The character types \d, \D, \s, \S, \w, and \W may also
2025 appear in a character class, and add the characters that
2026 they match to the class. For example, [\dABCDEF] matches any
2027 hexadecimal digit. A circumflex can conveniently be used
2028 with the upper case character types to specify a more res-
2029 tricted set of characters than the matching lower case type.
2030 For example, the class [^\W_] matches any letter or digit,
2031 but not underscore.
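
Both classes mentioned above can be tried out as follows.
Python's re module is assumed as a stand-in, with re.ASCII set
so that \d and \W are limited to ASCII, which is closer to
PCRE's default tables.

```python
import re

hex_upper = re.compile(r"[\dABCDEF]", re.ASCII)
alnum = re.compile(r"[^\W_]", re.ASCII)

hex_hits = [c for c in "7Fg" if hex_upper.fullmatch(c)]
alnum_hits = [c for c in "a9_-" if alnum.fullmatch(c)]
print(hex_hits)    # ['7', 'F']
print(alnum_hits)  # ['a', '9']
```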
2033 All non-alphameric characters other than \, -, ^ (at the
2034 start) and the terminating ] are non-special in character
2035 classes, but it does no harm if they are escaped.
2040 Perl supports the POSIX notation for character classes,
2041 which uses names enclosed by [: and :] within the enclosing
2042 square brackets. PCRE also supports this notation. For exam-
2043 ple,
2045 [01[:alpha:]%]
2047 matches "0", "1", any alphabetic character, or "%". The sup-
2048 ported class names are
2050 alnum letters and digits
2051 alpha letters
2052 ascii character codes 0 - 127
2053 blank space or tab only
2054 cntrl control characters
2055 digit decimal digits (same as \d)
2056 graph printing characters, excluding space
2057 lower lower case letters
2058 print printing characters, including space
2059 punct printing characters, excluding letters and digits
2060 space white space (not quite the same as \s)
2061 upper upper case letters
2062 word "word" characters (same as \w)
2063 xdigit hexadecimal digits
2065 The "space" characters are HT (9), LF (10), VT (11), FF
2066 (12), CR (13), and space (32). Notice that this list
2067 includes the VT character (code 11). This makes "space" dif-
2068 ferent to \s, which does not include VT (for Perl compati-
2069 bility).
2071 The name "word" is a Perl extension, and "blank" is a GNU
2072 extension from Perl 5.8. Another Perl extension is negation,
2073 which is indicated by a ^ character after the colon. For
2074 example,
2076 [12[:^digit:]]
2078 matches "1", "2", or any non-digit. PCRE (and Perl) also
2079 recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2080 "collating element", but these are not supported, and an
2081 error is given if they are encountered.
2083 In UTF-8 mode, characters with values greater than 255 do
2084 not match any of the POSIX character classes.
2089 Vertical bar characters are used to separate alternative
2090 patterns. For example, the pattern
2092 gilbert|sullivan
2094 matches either "gilbert" or "sullivan". Any number of alter-
2095 natives may appear, and an empty alternative is permitted
2096 (matching the empty string). The matching process tries
2097 each alternative in turn, from left to right, and the first
2098 one that succeeds is used. If the alternatives are within a
2099 subpattern (defined below), "succeeds" means matching the
2100 rest of the main pattern as well as the alternative in the
2101 subpattern.
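
The left-to-right rule, and the wider meaning of "succeeds"
inside a subpattern, can be seen in this sketch (Python's re
module is assumed as a stand-in for PCRE):

```python
import re

# Alternatives are tried left to right; the first that succeeds is used.
top_level = re.match(r"a|ab", "ab").group()
print(top_level)  # a

# Inside a subpattern, "succeeds" includes matching what follows, so the
# first alternative is rejected because "c" cannot follow it.
in_group = re.match(r"(a|ab)c", "abc").group(1)
print(in_group)  # ab
```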
2106 The settings of the PCRE_CASELESS, PCRE_MULTILINE,
2107 PCRE_DOTALL, and PCRE_EXTENDED options can be changed from
2108 within the pattern by a sequence of Perl option letters
2109 enclosed between "(?" and ")". The option letters are
2111 i for PCRE_CASELESS
2112 m for PCRE_MULTILINE
2113 s for PCRE_DOTALL
2114 x for PCRE_EXTENDED
2116 For example, (?im) sets caseless, multiline matching. It is
2117 also possible to unset these options by preceding the letter
2118 with a hyphen, and a combined setting and unsetting such as
2119 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
2120 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
2121 If a letter appears both before and after the hyphen, the
2122 option is unset.
2124 When an option change occurs at top level (that is, not
2125 inside subpattern parentheses), the change applies to the
2126 remainder of the pattern that follows. If the change is
2127 placed right at the start of a pattern, PCRE extracts it
2128 into the global options (and it will therefore show up in
2129 data extracted by the pcre_fullinfo() function).
2131 An option change within a subpattern affects only that part
2132 of the current pattern that follows it, so
2134 (a(?i)b)c
2136 matches abc and aBc and no other strings (assuming
2137 PCRE_CASELESS is not used). By this means, options can be
2138 made to have different settings in different parts of the
2139 pattern. Any changes made in one alternative do carry on
2140 into subsequent branches within the same subpattern. For
2141 example,
2143 (a(?i)b|c)
2145 matches "ab", "aB", "c", and "C", even though when matching
2146 "C" the first branch is abandoned before the option setting.
2147 This is because the effects of option settings happen at
2148 compile time. There would be some very weird behaviour oth-
2149 erwise.
2151 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
2152 be changed in the same way as the Perl-compatible options by
2153 using the characters U and X respectively. The (?X) flag
2154 setting is special in that it must always occur earlier in
2155 the pattern than any of the additional features it turns on,
2156 even when it is at top level. It is best put at the start.
2161 Subpatterns are delimited by parentheses (round brackets),
2162 which can be nested. Marking part of a pattern as a subpat-
2163 tern does two things:
2165 1. It localizes a set of alternatives. For example, the pat-
2166 tern
2168 cat(aract|erpillar|)
2170 matches one of the words "cat", "cataract", or "caterpil-
2171 lar". Without the parentheses, it would match "cataract",
2172 "erpillar" or the empty string.
2174 2. It sets up the subpattern as a capturing subpattern (as
2175 defined above). When the whole pattern matches, that por-
2176 tion of the subject string that matched the subpattern is
2177 passed back to the caller via the ovector argument of
2178 pcre_exec(). Opening parentheses are counted from left to
2179 right (starting from 1) to obtain the numbers of the captur-
2180 ing subpatterns.
2182 For example, if the string "the red king" is matched against
2183 the pattern
2185 the ((red|white) (king|queen))
2187 the captured substrings are "red king", "red", and "king",
2188 and are numbered 1, 2, and 3, respectively.
2190 The fact that plain parentheses fulfil two functions is not
2191 always helpful. There are often times when a grouping sub-
2192 pattern is required without a capturing requirement. If an
2193 opening parenthesis is followed by a question mark and a
2194 colon, the subpattern does not do any capturing, and is not
2195 counted when computing the number of any subsequent captur-
2196 ing subpatterns. For example, if the string "the white
2197 queen" is matched against the pattern
2199 the ((?:red|white) (king|queen))
2201 the captured substrings are "white queen" and "queen", and
2202 are numbered 1 and 2. The maximum number of capturing sub-
2203 patterns is 65535, and the maximum depth of nesting of all
2204 subpatterns, both capturing and non-capturing, is 200.
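
The numbering of capturing and non-capturing parentheses in
the two examples above can be confirmed with a short script
(Python's re module is assumed as a stand-in for PCRE; group
numbering follows the same left-to-right rule):

```python
import re

m = re.search(r"the ((red|white) (king|queen))", "the red king")
print(m.groups())   # ('red king', 'red', 'king')

# (?: ... ) groups without capturing and is not counted.
m2 = re.search(r"the ((?:red|white) (king|queen))", "the white queen")
print(m2.groups())  # ('white queen', 'queen')
```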
2206 As a convenient shorthand, if any option settings are
2207 required at the start of a non-capturing subpattern, the
2208 option letters may appear between the "?" and the ":". Thus
2209 the two patterns
2211 (?i:saturday|sunday)
2212 (?:(?i)saturday|sunday)
2214 match exactly the same set of strings. Because alternative
2215 branches are tried from left to right, and options are not
2216 reset until the end of the subpattern is reached, an option
2217 setting in one branch does affect subsequent branches, so
2218 the above patterns match "SUNDAY" as well as "Saturday".
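
A quick check of the shorthand form, using Python's re module
as a stand-in. Note the assumption: Python accepts the scoped
form (?i:...) but rejects a bare (?i) in the middle of a
pattern, so only the first of the two patterns is shown.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")
hits = [s for s in ("Saturday", "SUNDAY", "monday") if days.fullmatch(s)]
print(hits)  # ['Saturday', 'SUNDAY']
```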
2223 Identifying capturing parentheses by number is simple, but
2224 it can be very hard to keep track of the numbers in compli-
2225 cated regular expressions. Furthermore, if an expression is
2226 modified, the numbers may change. To help with the diffi-
2227 culty, PCRE supports the naming of subpatterns, something
2228 that Perl does not provide. The Python syntax (?P<name>...)
2229 is used. Names consist of alphanumeric characters and under-
2230 scores, and must be unique within a pattern.
2232 Named capturing parentheses are still allocated numbers as
2233 well as names. The PCRE API provides function calls for
2234 extracting the name-to-number translation table from a com-
2235 piled pattern. For further details see the pcreapi documen-
2236 tation.
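
Since the syntax is borrowed from Python, Python's own re
module gives a convenient illustration of names coexisting
with numbers. Its groupindex attribute is Python's counterpart
of a name-to-number table; it is not the PCRE API, and the
pattern is invented for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
table = dict(date.groupindex)
print(table)  # {'year': 1, 'month': 2}

m = date.fullmatch("2007-02")
print(m.group("year"), m.group(2))  # 2007 02
```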
2241 Repetition is specified by quantifiers, which can follow any
2242 of the following items:
2244 a literal data character
2245 the . metacharacter
2246 the \C escape sequence
2247 escapes such as \d that match single characters
2248 a character class
2249 a back reference (see next section)
2250 a parenthesized subpattern (unless it is an assertion)
2252 The general repetition quantifier specifies a minimum and
2253 maximum number of permitted matches, by giving the two
2254 numbers in curly brackets (braces), separated by a comma.
2255 The numbers must be less than 65536, and the first must be
2256 less than or equal to the second. For example:
2258 z{2,4}
2260 matches "zz", "zzz", or "zzzz". A closing brace on its own
2261 is not a special character. If the second number is omitted,
2262 but the comma is present, there is no upper limit; if the
2263 second number and the comma are both omitted, the quantifier
2264 specifies an exact number of required matches. Thus
2266 [aeiou]{3,}
2268 matches at least 3 successive vowels, but may match many
2269 more, while
2271 \d{8}
2273 matches exactly 8 digits. An opening curly bracket that
2274 appears in a position where a quantifier is not allowed, or
2275 one that does not match the syntax of a quantifier, is taken
2276 as a literal character. For example, {,6} is not a quantif-
2277 ier, but a literal string of four characters.
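
The three quantifier forms can be exercised as follows, with
Python's re module assumed as a stand-in for PCRE. The {,6}
case is deliberately left out, because Python reads it as a
quantifier with a minimum of zero rather than as literal text.

```python
import re

z_hits = [s for s in ("z", "zz", "zzzz", "zzzzz")
          if re.fullmatch(r"z{2,4}", s)]
print(z_hits)  # ['zz', 'zzzz']

vowels = re.fullmatch(r"[aeiou]{3,}", "aeiou") is not None
eight = re.fullmatch(r"\d{8}", "20070224") is not None
seven = re.fullmatch(r"\d{8}", "2007022") is not None
print(vowels, eight, seven)  # True True False
```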
2279 In UTF-8 mode, quantifiers apply to UTF-8 characters rather
2280 than to individual bytes. Thus, for example, \x{100}{2}
2281 matches two UTF-8 characters, each of which is represented
2282 by a two-byte sequence.
2284 The quantifier {0} is permitted, causing the expression to
2285 behave as if the previous item and the quantifier were not
2286 present.
2288 For convenience (and historical compatibility) the three
2289 most common quantifiers have single-character abbreviations:
2291 * is equivalent to {0,}
2292 + is equivalent to {1,}
2293 ? is equivalent to {0,1}
2295 It is possible to construct infinite loops by following a
2296 subpattern that can match no characters with a quantifier
2297 that has no upper limit, for example:
2299 (a?)*
2301 Earlier versions of Perl and PCRE used to give an error at
2302 compile time for such patterns. However, because there are
2303 cases where this can be useful, such patterns are now
2304 accepted, but if any repetition of the subpattern does in
2305 fact match no characters, the loop is forcibly broken.
2307 By default, the quantifiers are "greedy", that is, they
2308 match as much as possible (up to the maximum number of per-
2309 mitted times), without causing the rest of the pattern to
2310 fail. The classic example of where this gives problems is in
2311 trying to match comments in C programs. These appear between
2312 the sequences /* and */ and within the sequence, individual
2313 * and / characters may appear. An attempt to match C com-
2314 ments by applying the pattern
2316 /\*.*\*/
2318 to the string
2320 /* first comment */ not comment /* second comment */
2322 fails, because it matches the entire string owing to the
2323 greediness of the .* item.
2325 However, if a quantifier is followed by a question mark, it
2326 ceases to be greedy, and instead matches the minimum number
2327 of times possible, so the pattern
2329 /\*.*?\*/
2331 does the right thing with the C comments. The meaning of the
2332 various quantifiers is not otherwise changed, just the pre-
2333 ferred number of matches. Do not confuse this use of ques-
2334 tion mark with its use as a quantifier in its own right.
2335 Because it has two uses, it can sometimes appear doubled, as
2336 in
2338 \d??\d
2340 which matches one digit by preference, but can match two if
2341 that is the only way the rest of the pattern matches.
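
The difference between greedy and minimizing quantifiers can
be seen by running both comment patterns over one string.
Python's re module is assumed as a stand-in for PCRE; the
subject string is the one used in the discussion above.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

greedy = re.search(r"/\*.*\*/", subject).group()
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy == subject)  # True: .* ran on to the last */
print(lazy)  # ['/* first comment */', '/* second comment */']

# A doubled question mark: an optional digit that prefers to be absent.
doubled = re.match(r"\d??\d", "12").group()
print(doubled)  # 1
```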
2343 If the PCRE_UNGREEDY option is set (an option which is not
2344 available in Perl), the quantifiers are not greedy by
2345 default, but individual ones can be made greedy by following
2346 them with a question mark. In other words, it inverts the
2347 default behaviour.
2349 When a parenthesized subpattern is quantified with a minimum
2350 repeat count that is greater than 1 or with a limited max-
2351 imum, more store is required for the compiled pattern, in
2352 proportion to the size of the minimum or maximum.
2353 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
2354 option (equivalent to Perl's /s) is set, thus allowing the .
2355 to match newlines, the pattern is implicitly anchored,
2356 because whatever follows will be tried against every charac-
2357 ter position in the subject string, so there is no point in
2358 retrying the overall match at any position after the first.
2359 PCRE normally treats such a pattern as though it were pre-
2360 ceded by \A.
2362 In cases where it is known that the subject string contains
2363 no newlines, it is worth setting PCRE_DOTALL in order to
2364 obtain this optimization, or alternatively using ^ to indi-
2365 cate anchoring explicitly.
2367 However, there is one situation where the optimization can-
2368 not be used. When .* is inside capturing parentheses that
2369 are the subject of a backreference elsewhere in the pattern,
2370 a match at the start may fail, and a later one succeed. Con-
2371 sider, for example:
2373 (.*)abc\1
2375 If the subject is "xyz123abc123" the match point is the
2376 fourth character. For this reason, such a pattern is not
2377 implicitly anchored.
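
The match position in this example can be checked directly
(Python's re module is assumed as a stand-in for PCRE; offsets
count from zero, so offset 3 is the fourth character):

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")
print(m.start(), m.group(1))  # 3 123
```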
2379 When a capturing subpattern is repeated, the value captured
2380 is the substring that matched the final iteration. For exam-
2381 ple, after
2383 (tweedle[dume]{3}\s*)+
2385 has matched "tweedledum tweedledee" the value of the cap-
2386 tured substring is "tweedledee". However, if there are
2387 nested capturing subpatterns, the corresponding captured
2388 values may have been set in previous iterations. For exam-
2389 ple, after
2391 /(a|(b))+/
2393 matches "aba" the value of the second captured substring is
2394 "b".
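Both examples can be checked with a short sketch. It assumes Python's re module as a stand-in for PCRE; Python happens to keep the value of an inner group from an earlier iteration in the same way, but that is a property of Python, not a guarantee about every regex engine.

```python
import re

# The captured value is the text of the final iteration only.
m1 = re.match(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The inner group was set during the second iteration ("b") and is not
# touched by the third iteration, which takes the "a" branch.
m2 = re.match(r"(a|(b))+", "aba")
outer_value = m2.group(1)
inner_value = m2.group(2)
```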
2399 With both maximizing and minimizing repetition, failure of
2400 what follows normally causes the repeated item to be re-
2401 evaluated to see if a different number of repeats allows the
2402 rest of the pattern to match. Sometimes it is useful to
2403 prevent this, either to change the nature of the match, or
2404 to cause it to fail earlier than it otherwise might, when the
2405 author of the pattern knows there is no point in carrying
2406 on.
2408 Consider, for example, the pattern \d+foo when applied to
2409 the subject line
2411 123456bar
2413 After matching all 6 digits and then failing to match "foo",
2414 the normal action of the matcher is to try again with only 5
2415 digits matching the \d+ item, and then with 4, and so on,
2416 before ultimately failing. "Atomic grouping" (a term taken
2417 from Jeffrey Friedl's book) provides the means for specify-
2418 ing that once a subpattern has matched, it is not to be re-
2419 evaluated in this way.
2421 If we use atomic grouping for the previous example, the
2422 matcher would give up immediately on failing to match "foo"
2423 the first time. The notation is a kind of special
2424 parenthesis, starting with (?> as in this example:
2426 (?>\d+)foo
2428 This kind of parenthesis "locks up" the part of the pattern
2429 it contains once it has matched, and a failure further into
2430 the pattern is prevented from backtracking into it. Back-
2431 tracking past it to previous items, however, works as nor-
2432 mal.
2434 An alternative description is that a subpattern of this type
2435 matches the string of characters that an identical stan-
2436 dalone pattern would match, if anchored at the current point
2437 in the subject string.
2439 Atomic grouping subpatterns are not capturing subpatterns.
2440 Simple cases such as the above example can be thought of as
2441 a maximizing repeat that must swallow everything it can. So,
2442 while both \d+ and \d+? are prepared to adjust the number of
2443 digits they match in order to make the rest of the pattern
2444 match, (?>\d+) can only match an entire sequence of digits.
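The difference between an ordinary repeat and one that cannot give anything back can be sketched as follows. This is not PCRE's atomic group syntax: it is a substitute idiom (a lookahead that captures, followed by a back reference to the capture) run in Python's re module, chosen because it behaves atomically there in any Python version. Treat it as an illustration under that assumption.

```python
import re

subject = "12345"

# Ordinary greedy repeat: \d+ first takes all five digits, then gives
# one back so that the final \d has something to match.
greedy = re.search(r"\d+\d", subject)

# Atomic-like repeat: the lookahead captures the longest run of digits
# and \1 consumes exactly that text. Nothing can be given back, so the
# final \d finds no digit left and the match fails at every offset.
atomic_like = re.search(r"(?=(\d+))\1\d", subject)
```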
2446 Atomic groups in general can of course contain arbitrarily
2447 complicated subpatterns, and can be nested. However, when
2448 the subpattern for an atomic group is just a single repeated
2449 item, as in the example above, a simpler notation, called a
2450 "possessive quantifier" can be used. This consists of an
2451 additional + character following a quantifier. Using this
2452 notation, the previous example can be rewritten as
2454 \d++foo
2456 Possessive quantifiers are always greedy; the setting of the
2457 PCRE_UNGREEDY option is ignored. They are a convenient nota-
2458 tion for the simpler forms of atomic group. However, there
2459 is no difference in the meaning or processing of a posses-
2460 sive quantifier and the equivalent atomic group.
2462 The possessive quantifier syntax is an extension to the Perl
2463 syntax. It originates in Sun's Java package.
2465 When a pattern contains an unlimited repeat inside a subpat-
2466 tern that can itself be repeated an unlimited number of
2467 times, the use of an atomic group is the only way to avoid
2468 some failing matches taking a very long time indeed. The
2469 pattern
2471 (\D+|<\d+>)*[!?]
2473 matches an unlimited number of substrings that either con-
2474 sist of non-digits, or digits enclosed in <>, followed by
2475 either ! or ?. When it matches, it runs quickly. However, if
2476 it is applied to
2478 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2480 it takes a long time before reporting failure. This is
2481 because the string can be divided between the two repeats in
2482 a large number of ways, and all have to be tried. (The exam-
2483 ple used [!?] rather than a single character at the end,
2484 because both PCRE and Perl have an optimization that allows
2485 for fast failure when a single character is used. They
2486 remember the last single character that is required for a
2487 match, and fail early if it is not present in the string.)
2488 If the pattern is changed to
2490 ((?>\D+)|<\d+>)*[!?]
2492 sequences of non-digits cannot be broken, and failure hap-
2493 pens quickly.
2498 Outside a character class, a backslash followed by a digit
2499 greater than 0 (and possibly further digits) is a back
2500 reference to a capturing subpattern earlier (that is, to its
2501 left) in the pattern, provided there have been that many
2502 previous capturing left parentheses.
2504 However, if the decimal number following the backslash is
2505 less than 10, it is always taken as a back reference, and
2506 causes an error only if there are not that many capturing
2507 left parentheses in the entire pattern. In other words, the
2508 parentheses that are referenced need not be to the left of
2509 the reference for numbers less than 10. See the section
2510 entitled "Backslash" above for further details of the han-
2511 dling of digits following a backslash.
2513 A back reference matches whatever actually matched the cap-
2514 turing subpattern in the current subject string, rather than
2515 anything matching the subpattern itself (see "Subpatterns as
2516 subroutines" below for a way of doing that). So the pattern
2518 (sens|respons)e and \1ibility
2520 matches "sense and sensibility" and "response and responsi-
2521 bility", but not "sense and responsibility". If caseful
2522 matching is in force at the time of the back reference, the
2523 case of letters is relevant. For example,
2525 ((?i)rah)\s+\1
2527 matches "rah rah" and "RAH RAH", but not "RAH rah", even
2528 though the original capturing subpattern is matched case-
2529 lessly.
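The first example can be sketched in a few lines. Python's re module is used as a stand-in for PCRE (an assumption; the numbered back reference syntax is the same in both), and the helper name accepts is purely illustrative.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

def accepts(subject):
    """True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

# \1 must repeat the text that group 1 actually matched, so the two
# halves of the phrase have to use the same stem.
same_stem_a = accepts("sense and sensibility")
same_stem_b = accepts("response and responsibility")
mixed_stems = accepts("sense and responsibility")
```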
2531 Back references to named subpatterns use the Python syntax
2532 (?P=name). We could rewrite the above example as follows:
2534 (?P<p1>(?i)rah)\s+(?P=p1)
2536 There may be more than one back reference to the same sub-
2537 pattern. If a subpattern has not actually been used in a
2538 particular match, any back references to it always fail. For
2539 example, the pattern
2541 (a|(bc))\2
2543 always fails if it starts to match "a" rather than "bc".
2544 Because there may be many capturing parentheses in a pat-
2545 tern, all digits following the backslash are taken as part
2546 of a potential back reference number. If the pattern contin-
2547 ues with a digit character, some delimiter must be used to
2548 terminate the back reference. If the PCRE_EXTENDED option is
2549 set, this can be whitespace. Otherwise an empty comment can
2550 be used.
2552 A back reference that occurs inside the parentheses to which
2553 it refers fails when the subpattern is first used, so, for
2554 example, (a\1) never matches. However, such references can
2555 be useful inside repeated subpatterns. For example, the pat-
2556 tern
2558 (a|b\1)+
2560 matches any number of "a"s and also "aba", "ababbaa" etc. At
2561 each iteration of the subpattern, the back reference matches
2562 the character string corresponding to the previous itera-
2563 tion. In order for this to work, the pattern must be such
2564 that the first iteration does not need to match the back
2565 reference. This can be done using alternation, as in the
2566 example above, or by a quantifier with a minimum of zero.
2571 An assertion is a test on the characters following or
2572 preceding the current matching point that does not actually
2573 consume any characters. The simple assertions coded as \b,
2574 \B, \A, \G, \Z, \z, ^ and $ are described above. More com-
2575 plicated assertions are coded as subpatterns. There are two
2576 kinds: those that look ahead of the current position in the
2577 subject string, and those that look behind it.
2579 An assertion subpattern is matched in the normal way, except
2580 that it does not cause the current matching position to be
2581 changed. Lookahead assertions start with (?= for positive
2582 assertions and (?! for negative assertions. For example,
2584 \w+(?=;)
2586 matches a word followed by a semicolon, but does not include
2587 the semicolon in the match, and
2589 foo(?!bar)
2591 matches any occurrence of "foo" that is not followed by
2592 "bar". Note that the apparently similar pattern
2594 (?!foo)bar
2596 does not find an occurrence of "bar" that is preceded by
2597 something other than "foo"; it finds any occurrence of "bar"
2598 whatsoever, because the assertion (?!foo) is always true
2599 when the next three characters are "bar". A lookbehind
2600 assertion is needed to achieve this effect.
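The three lookahead patterns can be compared with a short sketch, assuming Python's re module as a stand-in for PCRE (the lookahead syntax is shared, but this is not the PCRE library). The subject strings are made up for illustration.

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "alpha;").group()

# Only the "foo" that is not followed by "bar" is reported.
foo_offsets = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested at the point where "bar" begins, where the next
# three characters are "bar", so it cannot exclude a preceding "foo".
misleading = re.search(r"(?!foo)bar", "foobar")
```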
2602 If you want to force a matching failure at some point in a
2603 pattern, the most convenient way to do it is with (?!)
2604 because an empty string always matches, so an assertion that
2605 requires there not to be an empty string must always fail.
2607 Lookbehind assertions start with (?<= for positive asser-
2608 tions and (?<! for negative assertions. For example,
2610 (?<!foo)bar
2612 does find an occurrence of "bar" that is not preceded by
2613 "foo". The contents of a lookbehind assertion are restricted
2614 such that all the strings it matches must have a fixed
2615 length. However, if there are several alternatives, they do
2616 not all have to have the same fixed length. Thus
2618 (?<=bullock|donkey)
2620 is permitted, but
2622 (?<!dogs?|cats?)
2624 causes an error at compile time. Branches that match dif-
2625 ferent length strings are permitted only at the top level of
2626 a lookbehind assertion. This is an extension compared with
2627 Perl (at least for 5.8), which requires all branches to
2628 match the same length of string. An assertion such as
2630 (?<=ab(c|de))
2632 is not permitted, because its single top-level branch can
2633 match two different lengths, but it is acceptable if rewrit-
2634 ten to use two top-level branches:
2636 (?<=abc|abde)
2638 The implementation of lookbehind assertions is, for each
2639 alternative, to temporarily move the current position back
2640 by the fixed width and then try to match. If there are
2641 insufficient characters before the current position, the
2642 match is deemed to fail.
2644 PCRE does not allow the \C escape (which matches a single
2645 byte in UTF-8 mode) to appear in lookbehind assertions,
2646 because it makes it impossible to calculate the length of
2647 the lookbehind.
2649 Atomic groups can be used in conjunction with lookbehind
2650 assertions to specify efficient matching at the end of the
2651 subject string. Consider a simple pattern such as
2653 abcd$
2655 when applied to a long string that does not match. Because
2656 matching proceeds from left to right, PCRE will look for
2657 each "a" in the subject and then see if what follows matches
2658 the rest of the pattern. If the pattern is specified as
2660 ^.*abcd$
2662 the initial .* matches the entire string at first, but when
2663 this fails (because there is no following "a"), it back-
2664 tracks to match all but the last character, then all but the
2665 last two characters, and so on. Once again the search for
2666 "a" covers the entire string, from right to left, so we are
2667 no better off. However, if the pattern is written as
2669 ^(?>.*)(?<=abcd)
2671 or, equivalently,
2673 ^.*+(?<=abcd)
2675 there can be no backtracking for the .* item; it can match
2676 only the entire string. The subsequent lookbehind assertion
2677 does a single test on the last four characters. If it fails,
2678 the match fails immediately. For long strings, this approach
2679 makes a significant difference to the processing time.
2681 Several assertions (of any sort) may occur in succession.
2682 For example,
2684 (?<=\d{3})(?<!999)foo
2686 matches "foo" preceded by three digits that are not "999".
2687 Notice that each of the assertions is applied independently
2688 at the same point in the subject string. First there is a
2689 check that the previous three characters are all digits, and
2690 then there is a check that the same three characters are not
2691 "999". This pattern does not match "foo" preceded by six
2692 characters, the first three of which are digits and the last three
2693 of which are not "999". For example, it doesn't match
2694 "123abcfoo". A pattern to do that is
2696 (?<=\d{3}...)(?<!999)foo
2698 This time the first assertion looks at the preceding six
2699 characters, checking that the first three are digits, and
2700 then the second assertion checks that the preceding three
2701 characters are not "999".
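A brief sketch of the two patterns, assuming Python's re module as a stand-in for PCRE (both lookbehinds here have a single fixed width, which Python also accepts). The variable names are illustrative only.

```python
import re

# Both assertions look at the same three characters before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# Here the first assertion looks six characters back and the second
# looks only at the last three of those.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

results = {
    "three_back 123foo": three_back.search("123foo") is not None,
    "three_back 999foo": three_back.search("999foo") is not None,
    "three_back 123abcfoo": three_back.search("123abcfoo") is not None,
    "six_back 123abcfoo": six_back.search("123abcfoo") is not None,
    "six_back 123999foo": six_back.search("123999foo") is not None,
}
```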
2703 Assertions can be nested in any combination. For example,
2705 (?<=(?<!foo)bar)baz
2707 matches an occurrence of "baz" that is preceded by "bar"
2708 which in turn is not preceded by "foo", while
2710 (?<=\d{3}(?!999)...)foo
2712 is another pattern which matches "foo" preceded by three
2713 digits and any three characters that are not "999".
2715 Assertion subpatterns are not capturing subpatterns, and may
2716 not be repeated, because it makes no sense to assert the
2717 same thing several times. If any kind of assertion contains
2718 capturing subpatterns within it, these are counted for the
2719 purposes of numbering the capturing subpatterns in the whole
2720 pattern. However, substring capturing is carried out only
2721 for positive assertions, because it does not make sense for
2722 negative assertions.
2727 It is possible to cause the matching process to obey a sub-
2728 pattern conditionally or to choose between two alternative
2729 subpatterns, depending on the result of an assertion, or
2730 whether a previous capturing subpattern matched or not. The
2731 two possible forms of conditional subpattern are
2733 (?(condition)yes-pattern)
2734 (?(condition)yes-pattern|no-pattern)
2736 If the condition is satisfied, the yes-pattern is used; oth-
2737 erwise the no-pattern (if present) is used. If there are
2738 more than two alternatives in the subpattern, a compile-time
2739 error occurs.
2741 There are three kinds of condition. If the text between the
2742 parentheses consists of a sequence of digits, the condition
2743 is satisfied if the capturing subpattern of that number has
2744 previously matched. The number must be greater than zero.
2745 Consider the following pattern, which contains non-
2746 significant white space to make it more readable (assume the
2747 PCRE_EXTENDED option) and to divide it into three parts for
2748 ease of discussion:
2750 ( \( )? [^()]+ (?(1) \) )
2752 The first part matches an optional opening parenthesis, and
2753 if that character is present, sets it as the first captured
2754 substring. The second part matches one or more characters
2755 that are not parentheses. The third part is a conditional
2756 subpattern that tests whether the first set of parentheses
2757 matched or not. If they did, that is, if the subject started
2758 with an opening parenthesis, the condition is true, and so
2759 the yes-pattern is executed and a closing parenthesis is
2760 required. Otherwise, since no-pattern is not present, the
2761 subpattern matches nothing. In other words, this pattern
2762 matches a sequence of non-parentheses, optionally enclosed
2763 in parentheses.
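The behaviour of this conditional can be sketched with Python's re module, which accepts the same (?(1)...) form. This is a stand-in for PCRE under that assumption, written without the optional white space; the helper name accepts is hypothetical.

```python
import re

# ( \( )? [^()]+ (?(1) \) ) without the non-significant white space.
optional_parens = re.compile(r"(\()?[^()]+(?(1)\))")

def accepts(subject):
    """True if the whole subject matches the pattern."""
    return optional_parens.fullmatch(subject) is not None

bare = accepts("abc")           # no opening parenthesis, none required
wrapped = accepts("(abc)")      # group 1 matched, so \) is required
unclosed = accepts("(abc")      # group 1 matched but \) is missing
stray_close = accepts("abc)")   # group 1 unset, so ")" is left over
```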
2765 If the condition is the string (R), it is satisfied if a
2766 recursive call to the pattern or subpattern has been made.
2767 At "top level", the condition is false. This is a PCRE
2768 extension. Recursive patterns are described in the next
2769 section.
2771 If the condition is not a sequence of digits or (R), it must
2772 be an assertion. This may be a positive or negative looka-
2773 head or lookbehind assertion. Consider this pattern, again
2774 containing non-significant white space, and with the two
2775 alternatives on the second line:
2777 (?(?=[^a-z]*[a-z])
2778 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2780 The condition is a positive lookahead assertion that matches
2781 an optional sequence of non-letters followed by a letter. In
2782 other words, it tests for the presence of at least one
2783 letter in the subject. If a letter is found, the subject is
2784 matched against the first alternative; otherwise it is
2785 matched against the second. This pattern matches strings in
2786 one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2787 letters and dd are digits.
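Python's re module has no assertion conditions, so the sketch below emulates one with a different technique: each alternative is guarded by the assertion or by its negation. It is offered only as an illustration of the yes/no selection, under the assumption that the guarded alternation behaves like the conditional for these subjects.

```python
import re

# Guarded alternation standing in for (?(?=[^a-z]*[a-z]) yes | no ).
date_like = re.compile(
    r"(?:(?=[^a-z]*[a-z])\d{2}-[a-z]{3}-\d{2}"   # used if a letter lies ahead
    r"|(?![^a-z]*[a-z])\d{2}-\d{2}-\d{2})"       # used only if none does
)

with_letters = date_like.match("12-abc-34")
digits_only = date_like.match("12-34-56")

# A letter later in the subject selects the first alternative, which
# then fails; the second alternative is not allowed to take over.
wrong_branch = date_like.match("12-34-56 x")
```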
2792 The sequence (?# marks the start of a comment which contin-
2793 ues up to the next closing parenthesis. Nested parentheses
2794 are not permitted. The characters that make up a comment
2795 play no part in the pattern matching at all.
2797 If the PCRE_EXTENDED option is set, an unescaped # character
2798 outside a character class introduces a comment that contin-
2799 ues up to the next newline character in the pattern.
2804 Consider the problem of matching a string in parentheses,
2805 allowing for unlimited nested parentheses. Without the use
2806 of recursion, the best that can be done is to use a pattern
2807 that matches up to some fixed depth of nesting. It is not
2808 possible to handle an arbitrary nesting depth. Perl has pro-
2809 vided an experimental facility that allows regular expres-
2810 sions to recurse (amongst other things). It does this by
2811 interpolating Perl code in the expression at run time, and
2812 the code can refer to the expression itself. A Perl pattern
2813 to solve the parentheses problem can be created like this:
2815 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2817 The (?p{...}) item interpolates Perl code at run time, and
2818 in this case refers recursively to the pattern in which it
2819 appears. Obviously, PCRE cannot support the interpolation of
2820 Perl code. Instead, it supports some special syntax for
2821 recursion of the entire pattern, and also for individual
2822 subpattern recursion.
2824 The special item that consists of (? followed by a number
2825 greater than zero and a closing parenthesis is a recursive
2826 call of the subpattern of the given number, provided that it
2827 occurs inside that subpattern. (If not, it is a "subroutine"
2828 call, which is described in the next section.) The special
2829 item (?R) is a recursive call of the entire regular expres-
2830 sion.
2832 For example, this PCRE pattern solves the nested parentheses
2833 problem (assume the PCRE_EXTENDED option is set so that
2834 white space is ignored):
2836 \( ( (?>[^()]+) | (?R) )* \)
2838 First it matches an opening parenthesis. Then it matches any
2839 number of substrings which can either be a sequence of non-
2840 parentheses, or a recursive match of the pattern itself
2841 (that is a correctly parenthesized substring). Finally
2842 there is a closing parenthesis.
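Python's re module does not support (?R), so the following sketch spells out the same recursion as an ordinary function instead of a pattern. The function name balanced_end is hypothetical, and the function is anchored at the given start position, which a pattern search would not be.

```python
def balanced_end(subject, start=0):
    """Return the index just past a balanced group that begins at
    `start`, or None if there is no such group."""
    if start >= len(subject) or subject[start] != "(":
        return None                            # the opening \( is required
    pos = start + 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                     # the closing \)
        if ch == "(":
            pos = balanced_end(subject, pos)   # the (?R) alternative
            if pos is None:
                return None
        else:
            pos += 1                           # one character of [^()]+
    return None                                # subject ended too early

nested = balanced_end("(ab(cd)ef)")
unclosed = balanced_end("(a(b)")
```

Because each character is visited once, a failing subject is rejected in time proportional to its length.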
2844 If this were part of a larger pattern, you would not want to
2845 recurse the entire pattern, so instead you could use this:
2847 ( \( ( (?>[^()]+) | (?1) )* \) )
2849 We have put the pattern into parentheses, and caused the
2850 recursion to refer to them instead of the whole pattern. In
2851 a larger pattern, keeping track of parenthesis numbers can
2852 be tricky. It may be more convenient to use named
2853 parentheses instead. For this, PCRE uses (?P>name), which is
2854 an extension to the Python syntax that PCRE uses for named
2855 parentheses (Perl does not provide named parentheses). We
2856 could rewrite the above example as follows:
2858 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2860 This particular example pattern contains nested unlimited
2861 repeats, and so the use of atomic grouping for matching
2862 strings of non-parentheses is important when applying the
2863 pattern to strings that do not match. For example, when this
2864 pattern is applied to
2866 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2868 it yields "no match" quickly. However, if atomic grouping is
2869 not used, the match runs for a very long time indeed because
2870 there are so many different ways the + and * repeats can
2871 carve up the subject, and all have to be tested before
2872 failure can be reported.
2873 At the end of a match, the values set for any capturing sub-
2874 patterns are those from the outermost level of the recursion
2875 at which the subpattern value is set. If you want to obtain
2876 intermediate values, a callout function can be used (see
2877 below and the pcrecallout documentation). If the pattern
2878 above is matched against
2880 (ab(cd)ef)
2882 the value for the capturing parentheses is "ef", which is
2883 the last value taken on at the top level. If additional
2884 parentheses are added, giving
2886 \( ( ( (?>[^()]+) | (?R) )* ) \)
2887    ^                        ^
2888    ^                        ^
2890 the string they capture is "ab(cd)ef", the contents of the
2891 top level parentheses. If there are more than 15 capturing
2892 parentheses in a pattern, PCRE has to obtain extra memory to
2893 store data during a recursion, which it does by using
2894 pcre_malloc, freeing it via pcre_free afterwards. If no
2895 memory can be obtained, the match fails with the
2896 PCRE_ERROR_NOMEMORY error.
2898 Do not confuse the (?R) item with the condition (R), which
2899 tests for recursion. Consider this pattern, which matches
2900 text in angle brackets, allowing for arbitrary nesting. Only
2901 digits are allowed in nested brackets (that is, when recurs-
2902 ing), whereas any characters are permitted at the outer
2903 level.
2905 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2907 In this pattern, (?(R) is the start of a conditional subpat-
2908 tern, with two different alternatives for the recursive and
2909 non-recursive cases. The (?R) item is the actual recursive
2910 call.
2915 If the syntax for a recursive subpattern reference (either
2916 by number or by name) is used outside the parentheses to
2917 which it refers, it operates like a subroutine in a program-
2918 ming language. An earlier example pointed out that the pat-
2919 tern
2921 (sens|respons)e and \1ibility
2923 matches "sense and sensibility" and "response and responsi-
2924 bility", but not "sense and responsibility". If instead the
2925 pattern
2927 (sens|respons)e and (?1)ibility
2929 is used, it does match "sense and responsibility" as well as
2930 the other two strings. Such references must, however, follow
2931 the subpattern to which they refer.
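Python's re module has no (?1) syntax, so this sketch approximates a subroutine call by reusing the text of the subpattern when the pattern string is built. That is a different mechanism from PCRE's, used here only to contrast "match the same text again" with "match the same subpattern again"; the name STEM is hypothetical.

```python
import re

STEM = r"sens|respons"   # the subpattern that is to be reused

# Back reference: the second half must repeat the captured text.
with_backref = re.compile(rf"({STEM})e and \1ibility")

# Subroutine-like reuse: the second half may match the subpattern afresh.
with_reuse = re.compile(rf"({STEM})e and (?:{STEM})ibility")

subject = "sense and responsibility"
backref_result = with_backref.fullmatch(subject)
reuse_result = with_reuse.fullmatch(subject)
```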
2936 Perl has a feature whereby using the sequence (?{...})
2937 causes arbitrary Perl code to be obeyed in the middle of
2938 matching a regular expression. This makes it possible,
2939 amongst other things, to extract different substrings that
2940 match the same pair of parentheses when there is a repeti-
2941 tion.
2943 PCRE provides a similar feature, but of course it cannot
2944 obey arbitrary Perl code. The feature is called "callout".
2945 The caller of PCRE provides an external function by putting
2946 its entry point in the global variable pcre_callout. By
2947 default, this variable contains NULL, which disables all
2948 calling out.
2950 Within a regular expression, (?C) indicates the points at
2951 which the external function is to be called. If you want to
2952 identify different callout points, you can put a number less
2953 than 256 after the letter C. The default value is zero. For
2954 example, this pattern has two callout points:
2956 (?C1)9abc(?C2)def
2958 During matching, when PCRE reaches a callout point (and
2959 pcre_callout is set), the external function is called. It is
2960 provided with the number of the callout, and, optionally,
2961 one item of data originally supplied by the caller of
2962 pcre_exec(). The callout function may cause matching to
2963 backtrack, or to fail altogether. A complete description of
2964 the interface to the callout function is given in the pcre-
2965 callout documentation.
2967 Last updated: 03 February 2003
2968 Copyright (c) 1997-2003 University of Cambridge.
2969 -----------------------------------------------------------------------------
2971 NAME
2972 PCRE - Perl-compatible regular expressions
2977 Certain items that may appear in regular expression patterns
2978 are more efficient than others. It is more efficient to use
2979 a character class like [aeiou] than a set of alternatives
2980 such as (a|e|i|o|u). In general, the simplest construction
2981 that provides the required behaviour is usually the most
2982 efficient. Jeffrey Friedl's book contains a lot of discus-
2983 sion about optimizing regular expressions for efficient per-
2984 formance.
2986 When a pattern begins with .* not in parentheses, or in
2987 parentheses that are not the subject of a backreference, and
2988 the PCRE_DOTALL option is set, the pattern is implicitly
2989 anchored by PCRE, since it can match only at the start of a
2990 subject string. However, if PCRE_DOTALL is not set, PCRE
2991 cannot make this optimization, because the . metacharacter
2992 does not then match a newline, and if the subject string
2993 contains newlines, the pattern may match from the character
2994 immediately following one of them instead of from the very
2995 start. For example, the pattern
2997 .*second
2999 matches the subject "first\nand second" (where \n stands for
3000 a newline character), with the match starting at the seventh
3001 character. In order to do this, PCRE has to retry the match
3002 starting after every newline in the subject.
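The effect on the match position can be seen in a short sketch. It assumes Python's re module as a stand-in for PCRE, with re.DOTALL playing the part of PCRE_DOTALL; offsets are 0-based, so the seventh character is offset 6.

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so no match can begin
# on the first line; it begins just after the newline instead.
default_start = re.search(r".*second", subject).start()

# With DOTALL the dot crosses the newline and the match begins at the
# very start of the subject.
dotall_start = re.search(r".*second", subject, re.DOTALL).start()
```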
3004 If you are using such a pattern with subject strings that do
3005 not contain newlines, the best performance is obtained by
3006 setting PCRE_DOTALL, or starting the pattern with ^.* to
3007 indicate explicit anchoring. That saves PCRE from having to
3008 scan along the subject looking for a newline to restart at.
3010 Beware of patterns that contain nested indefinite repeats.
3011 These can take a long time to run when applied to a string
3012 that does not match. Consider the pattern fragment
3014 (a+)*
3016 This can match "aaaa" in 16 different ways, and this number
3017 increases very rapidly as the string gets longer. (The *
3018 repeat can match 0, 1, 2, 3, or 4 times, and for each of
3019 those cases other than 0, the + repeats can match different
3020 numbers of times.) When the remainder of the pattern is such
3021 that the entire match is going to fail, PCRE has in princi-
3022 ple to try every possible variation, and this can take an
3023 extremely long time.
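The growth can be made concrete with a little arithmetic. The sketch below counts only the ways in which a whole run of n characters can be cut into non-empty pieces, one piece per iteration of the outer repeat. That is a lower bound on the alternatives a matcher might explore, since shorter matches are left out. The function name splits is hypothetical.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def splits(n):
    """Number of ways to cut a run of n characters into non-empty,
    ordered pieces."""
    if n == 0:
        return 1          # nothing left: exactly one way (stop here)
    # Choose the length of the first piece, then split the remainder.
    return sum(splits(n - first) for first in range(1, n + 1))

four = splits(4)
twenty = splits(20)
```

Each extra character doubles the count, so the total for a run of n characters is 2 to the power n-1.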
3024 An optimization catches some of the more simple cases such
3025 as
3027 (a+)*b
3029 where a literal character follows. Before embarking on the
3030 standard matching procedure, PCRE checks that there is a "b"
3031 later in the subject string, and if there is not, it fails
3032 the match immediately. However, when there is no following
3033 literal this optimization cannot be used. You can see the
3034 difference by comparing the behaviour of
3036 (a+)*\d
3038 with the pattern above. The pattern ending in "b" fails almost
3039 instantly when applied to a whole line of "a" characters,
3040 whereas the pattern ending in \d takes an appreciable time with
3041 strings longer than about 20 characters.
3043 Last updated: 03 February 2003
3044 Copyright (c) 1997-2003 University of Cambridge.
3045 -----------------------------------------------------------------------------
3047 NAME
3048 PCRE - Perl-compatible regular expressions.
3052 #include <pcreposix.h>
3054 int regcomp(regex_t *preg, const char *pattern,
3055 int cflags);
3057 int regexec(regex_t *preg, const char *string,
3058 size_t nmatch, regmatch_t pmatch[], int eflags);
3060 size_t regerror(int errcode, const regex_t *preg,
3061 char *errbuf, size_t errbuf_size);
3063 void regfree(regex_t *preg);
3068 This set of functions provides a POSIX-style API to the PCRE
3069 regular expression package. See the pcreapi documentation
3070 for a description of the native API, which contains addi-
3071 tional functionality.
3073 The functions described here are just wrapper functions that
3074 ultimately call the PCRE native API. Their prototypes are
3075 defined in the pcreposix.h header file, and on Unix systems
3076 the library itself is called pcreposix.a, so can be accessed
3077 by adding -lpcreposix to the command for linking an applica-
3078 tion which uses them. Because the POSIX functions call the
3079 native ones, it is also necessary to add -lpcre.
3081 I have implemented only those option bits that can be rea-
3082 sonably mapped to PCRE native options. In addition, the
3083 options REG_EXTENDED and REG_NOSUB are defined with the
3084 value zero. They have no effect, but since programs that are
3085 written to the POSIX interface often use them, this makes it
3086 easier to slot in PCRE as a replacement library. Other POSIX
3087 options are not even defined.
3089 When PCRE is called via these functions, it is only the API
3090 that is POSIX-like in style. The syntax and semantics of the
3091 regular expressions themselves are still those of Perl, sub-
3092 ject to the setting of various PCRE options, as described
3093 below.
3095 The header for these functions is supplied as pcreposix.h to
3096 avoid any potential clash with other POSIX libraries. It
3097 can, of course, be renamed or aliased as regex.h, which is
3098 the "correct" name. It provides two structure types, regex_t
3099 for compiled internal forms, and regmatch_t for returning
3100 captured substrings. It also defines some constants whose
3101 names start with "REG_"; these are used for setting options
3102 and identifying error codes.
3107 The function regcomp() is called to compile a pattern into
3108 an internal form. The pattern is a C string terminated by a
3109 binary zero, and is passed in the argument pattern. The preg
3110 argument is a pointer to a regex_t structure which is used
3111 as a base for storing information about the compiled expres-
3112 sion.
3114 The argument cflags is either zero, or contains one or more
3115 of the bits defined by the following macros:
3117 REG_ICASE
3119 The PCRE_CASELESS option is set when the expression is
3120 passed for compilation to the native function.
3122 REG_NEWLINE
3124 The PCRE_MULTILINE option is set when the expression is
3125 passed for compilation to the native function. Note that
3126 this does not mimic the defined POSIX behaviour for
3127 REG_NEWLINE (see the following section).
3129 In the absence of these flags, no options are passed to the
3130 native function. This means that the regex is compiled with
3131 PCRE default semantics. In particular, the way it handles
3132 newline characters in the subject string is the Perl way,
3133 not the POSIX way. Note that setting PCRE_MULTILINE has only
3134 some of the effects specified for REG_NEWLINE. It does not
3135 affect the way newlines are matched by . (they aren't) or by
3136 a negative class such as [^a] (they are).
3138 The yield of regcomp() is zero on success, and non-zero oth-
3139 erwise. The preg structure is filled in on success, and one
3140 member of the structure is public: re_nsub contains the
3141 number of capturing subpatterns in the regular expression.
3142 Various error codes are defined in the header file.
3147 This area is not simple, because POSIX and Perl take dif-
3148 ferent views of things. It is not possible to get PCRE to
3149 obey POSIX semantics, but then PCRE was never intended to be
3150 a POSIX engine. The following table lists the different pos-
3151 sibilities for matching newline characters in PCRE:
3153                           Default   Change with
3155 . matches newline         no        PCRE_DOTALL
3156 newline matches [^a]      yes       not changeable
3157 $ matches \n at end       yes       PCRE_DOLLARENDONLY
3158 $ matches \n in middle    no        PCRE_MULTILINE
3159 ^ matches \n in middle    no        PCRE_MULTILINE
3161 This is the equivalent table for POSIX:
3163                           Default   Change with
3165 . matches newline         yes       REG_NEWLINE
3166 newline matches [^a]      yes       REG_NEWLINE
3167 $ matches \n at end       no        REG_NEWLINE
3168 $ matches \n in middle    no        REG_NEWLINE
3169 ^ matches \n in middle    no        REG_NEWLINE
3171 PCRE's behaviour is the same as Perl's, except that there is
3172 no equivalent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE
3173 and Perl, there is no way to stop newline from matching
3174 [^a].
3176 The default POSIX newline handling can be obtained by set-
3177 ting PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way
3178 to make PCRE behave exactly as for the REG_NEWLINE action.
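The differences can be checked with a small probe. The sketch below is
an assumption-laden illustration, not part of PCRE: it uses the system
<regex.h>, so it reproduces the POSIX behaviour described above. Built
against pcreposix.h instead, the cases run without REG_NEWLINE should
follow the PCRE defaults described here. The helper name probe is
hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* assumption: swap for <pcreposix.h> to probe PCRE */

/* Hypothetical helper: 1 if `pattern` matches somewhere in `subject`,
   0 if it does not, -1 if the pattern fails to compile. */
static int probe(const char *pattern, const char *subject, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&re, pattern, cflags) != 0)
        return -1;
    /* nmatch is 0, so no offsets are requested and pmatch may be NULL */
    rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

With a POSIX library, probe("a.b", "a\nb", 0) succeeds and the same call
with REG_NEWLINE fails, whereas probe("^b", "a\nb", 0) fails and succeeds
once REG_NEWLINE is given.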
3183 The function regexec() is called to match a pre-compiled
3184 pattern preg against a given string, which is terminated by
3185 a zero byte, subject to the options in eflags. These can be:
3189 REG_NOTBOL: the PCRE_NOTBOL option is set when calling the underlying
3190 PCRE matching function.
3194 REG_NOTEOL: the PCRE_NOTEOL option is set when calling the underlying
3195 PCRE matching function.
3197 The portion of the string that was matched, and also any
3198 captured substrings, are returned via the pmatch argument,
3199 which points to an array of nmatch structures of type
3200 regmatch_t, containing the members rm_so and rm_eo. These
3201 contain the offset to the first character of each substring
3202 and the offset to the first character after the end of each
3203 substring, respectively. The 0th element of the vector
3204 relates to the entire portion of string that was matched;
3205 subsequent elements relate to the capturing subpatterns of
3206 the regular expression. Unused entries in the array have
3207 both structure members set to -1.
3209 A successful match yields a zero return; various error codes
3210 are defined in the header file, of which REG_NOMATCH is the
3211 "expected" failure code.
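The use of pmatch can be sketched as follows. This is only an
illustration under stated assumptions: it includes the system
<regex.h> (pcreposix.h would take its place when using PCRE's wrapper),
the helper name capture and the fixed vector size of 10 are
hypothetical, and the REG_EXTENDED guard exists only because the system
library needs that flag for unescaped parentheses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <regex.h>   /* assumption: use <pcreposix.h> for PCRE's wrapper */

#ifndef REG_EXTENDED
#define REG_EXTENDED 0   /* assumption: wrapper header may not define it */
#endif

/* Hypothetical helper: copy substring number `group` of the first match
   into `out`. Returns its length, or -1 on a compile error, no match,
   an unused entry, or a buffer that is too small. */
static int capture(const char *pattern, const char *subject,
                   size_t group, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pm[10];    /* element 0 is the whole match */
    int len = -1;

    assert(pattern != NULL && subject != NULL && out != NULL);
    if (group >= 10 || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Unused entries have rm_so (and rm_eo) set to -1. */
    if (regexec(&re, subject, 10, pm, 0) == 0 && pm[group].rm_so != -1) {
        size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);
        if (n < outsize) {
            memcpy(out, subject + pm[group].rm_so, n);
            out[n] = '\0';
            len = (int)n;
        }
    }
    regfree(&re);
    return len;
}
```

For the pattern "(c.t) (s[a-z]+)" and the subject "the cat sat on the
mat", group 0 is "cat sat", group 1 is "cat" and group 2 is "sat"; asking
for group 3 returns -1 because that entry is unused.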
3216 The regerror() function maps a non-zero errorcode from
3217 either regcomp() or regexec() to a printable message. If
3218 preg is not NULL, the error should have arisen from the use
3219 of that structure. A message terminated by a binary zero is
3220 placed in errbuf. The length of the message, including the
3221 zero, is limited to errbuf_size. The yield of the function
3222 is the size of buffer needed to hold the whole message.
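Because the yield is the size needed for the whole message, a caller
can ask for the size first and allocate afterwards. The following is a
minimal sketch of that idiom, assuming the system <regex.h> (or
pcreposix.h in its place); the helper name error_text is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <regex.h>   /* assumption: use <pcreposix.h> for PCRE's wrapper */

/* Hypothetical helper: return a malloc'd copy of the message for
   `errcode`, or NULL if memory runs out. The caller frees it. */
static char *error_text(int errcode, const regex_t *preg)
{
    /* With errbuf_size 0 nothing is written; the yield is the size
       needed, including the terminating zero. */
    size_t need = regerror(errcode, preg, NULL, 0);
    char *buf = malloc(need > 0 ? need : 1);

    if (buf == NULL)
        return NULL;
    if (need == 0)
        buf[0] = '\0';                      /* defensive: empty message */
    else
        regerror(errcode, preg, buf, need); /* fill the buffer */
    return buf;
}
```

A typical use is to pass the non-zero value returned by regcomp() or
regexec(), print the text, and then free() it.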
3227 Compiling a regular expression causes memory to be allocated
3228 and associated with the preg structure. The function reg-
3229 free() frees all such memory, after which preg may no longer
3230 be used as a compiled expression.
3235 Philip Hazel <ph10@cam.ac.uk>
3236 University Computing Service,
3237 Cambridge CB2 3QG, England.
3239 Last updated: 03 February 2003
3240 Copyright (c) 1997-2003 University of Cambridge.
3241 -----------------------------------------------------------------------------
3243 NAME
3244 PCRE - Perl-compatible regular expressions
3249 A simple, complete demonstration program, to get you started
3250 with using PCRE, is supplied in the file pcredemo.c in the
3251 PCRE distribution.
3253 The program compiles the regular expression that is its
3254 first argument, and matches it against the subject string in
3255 its second argument. No PCRE options are set, and default
3256 character tables are used. If matching succeeds, the program
3257 outputs the portion of the subject that matched, together
3258 with the contents of any captured substrings.
3260 If the -g option is given on the command line, the program
3261 then goes on to check for further matches of the same regu-
3262 lar expression in the same subject string. The logic is a
3263 little bit tricky because of the possibility of matching an
3264 empty string. Comments in the code explain what is going on.
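The empty-string problem can be seen in a simplified loop. The sketch
below is not the logic of pcredemo.c, which uses the native PCRE
interface. It is a swapped-in approximation using the POSIX-style
functions from the system <regex.h>, and it simply steps one byte past
an empty match. The helper name count_matches is hypothetical, and
single-byte characters are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <regex.h>   /* assumption: POSIX-style API, not native PCRE */

#ifndef REG_EXTENDED
#define REG_EXTENDED 0   /* assumption: wrapper header may not define it */
#endif

/* Hypothetical helper: count successive matches of `pattern` in
   `subject`, or return -1 if the pattern fails to compile. */
static int count_matches(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m;
    const char *p = subject;
    int eflags = 0, count = 0;

    assert(pattern != NULL && subject != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (regexec(&re, p, 1, &m, eflags) == 0) {
        count++;
        if (m.rm_eo > m.rm_so) {
            p += m.rm_eo;            /* resume after a non-empty match */
        } else {
            if (p[m.rm_eo] == '\0')  /* empty match at the very end */
                break;
            p += m.rm_eo + 1;        /* step over one byte to progress */
        }
        eflags = REG_NOTBOL;         /* p is no longer the line start */
    }
    regfree(&re);
    return count;
}
```

Without the special case for empty matches, a pattern such as "b*"
would match the empty string at the same position for ever.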
3266 On a Unix system that has PCRE installed in /usr/local, you
3267 can compile the demonstration program using a command like
3268 this:
3270 gcc -o pcredemo pcredemo.c -I/usr/local/include \
3271 -L/usr/local/lib -lpcre
3273 Then you can run simple tests like this:
3275 ./pcredemo 'cat|dog' 'the cat sat on the mat'
3276 ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3278 Note that there is a much more comprehensive test program,
3279 called pcretest, which supports many more facilities for
3280 testing regular expressions and the PCRE library. The
3281 pcredemo program is provided as a simple coding example.
3283 On some operating systems (e.g. Solaris) you may get an
3284 error like this when you try to run pcredemo:
3286 ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such
3287 file or directory
3289 This is caused by the way shared library support works on
3290 those systems. You need to add
3292 -R/usr/local/lib
3294 to the compile command to get round this problem.
3296 Last updated: 28 January 2003
3297 Copyright (c) 1997-2003 University of Cambridge.
3298 -----------------------------------------------------------------------------
