Contents of /code/trunk/doc/pcre.txt

Revision 63
Sat Feb 24 21:40:03 2007 UTC (14 years, 8 months ago) by nigel
File MIME type: text/plain
File size: 142247 byte(s)
Load pcre-4.0 into code/trunk.
This file contains a concatenation of the PCRE man pages, converted to plain
text format for ease of searching with a text editor, or for use on systems
that do not have a man page processor. The small individual files that give
synopses of each function in the library have not been included. There are
separate text files for the pcregrep and pcretest commands.
-----------------------------------------------------------------------------
PCRE - Perl-compatible regular expressions

The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl, with just a few differences. The current implementation of PCRE
(release 4.x) corresponds approximately with Perl 5.8, including
support for UTF-8 encoded strings. However, this support has to be
explicitly enabled; it is not the default.

PCRE is written in C and released as a C library. However, a number of
people have written wrappers and interfaces of various kinds. A C++
class is included in these contributions, which can be found in the
Contrib directory at the primary FTP site, which is:

  ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre

Details of exactly which Perl regular expression features are and are
not supported by PCRE are given in separate documents. See the
pcrepattern and pcrecompat pages.

Some features of PCRE can be included, excluded, or changed when the
library is built. The pcre_config() function makes it possible for a
client to discover which features are available. Documentation about
building PCRE for various operating systems can be found in the README
file in the source distribution.

The user documentation for PCRE has been split up into a number of
different sections. In the "man" format, each of these is a separate
"man page". In the HTML format, each is a separate page, linked from
the index page. In the plain text format, all the sections are
concatenated, for ease of searching. The sections are as follows:

  pcre          this document
  pcreapi       details of PCRE's native API
  pcrebuild     options for building PCRE
  pcrecallout   details of the callout feature
  pcrecompat    discussion of Perl compatibility
  pcregrep      description of the pcregrep command
  pcrepattern   syntax and semantics of supported
                  regular expressions
  pcreperform   discussion of performance issues
  pcreposix     the POSIX-compatible API
  pcresample    discussion of the sample program
  pcretest      the pcretest testing command

In addition, in the "man" and HTML formats, there is a short page for
each library function, listing its arguments and results.

There are some size limitations in PCRE but it is hoped that they will
never in practice be relevant.

The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want
to process regular expressions that are truly enormous, you can
compile PCRE with an internal linkage size of 3 or 4 (see the README
file in the source distribution and the pcrebuild documentation for
details). In these cases the limit is substantially larger. However,
the speed of execution will be slower.

All values in repeating quantifiers must be less than 65536. The
maximum number of capturing subpatterns is 65535.

There is no limit to the number of non-capturing subpatterns, but the
maximum depth of nesting of all kinds of parenthesized subpattern,
including capturing subpatterns, assertions, and other types of
subpattern, is 200.

The maximum length of a subject string is the largest positive number
that an integer variable can hold. However, PCRE uses recursion to
handle subpatterns and indefinite repetition. This means that the
available stack space may limit the size of a subject string that can
be processed by certain patterns.

Starting at release 3.3, PCRE has had some support for character
strings encoded in the UTF-8 format. For release 4.0 this has been
greatly extended to cover most common requirements.

In order to process UTF-8 strings, you must build PCRE to include
UTF-8 support in the code, and, in addition, you must call
pcre_compile() with the PCRE_UTF8 option flag. When you do this, both
the pattern and any subject strings that are matched against it are
treated as UTF-8 strings instead of just strings of bytes.

If you compile PCRE with UTF-8 support, but do not use it at run time,
the library will be a bit bigger, but the additional run time overhead
is limited to testing the PCRE_UTF8 flag in several places, so should
not be very large.

The following comments apply when PCRE is running in UTF-8 mode:

1. PCRE assumes that the strings it is given contain valid UTF-8
   codes. It does not diagnose invalid UTF-8 strings. If you pass
   invalid UTF-8 strings to PCRE, the results are undefined.

2. In a pattern, the escape sequence \x{...}, where the contents of
   the braces is a string of hexadecimal digits, is interpreted as a
   UTF-8 character whose code number is the given hexadecimal number,
   for example: \x{1234}. If a non-hexadecimal digit appears between
   the braces, the item is not recognized. This escape sequence can be
   used either as a literal, or within a character class.

3. The original hexadecimal escape sequence, \xhh, matches a two-byte
   UTF-8 character if the value is greater than 127.

4. Repeat quantifiers apply to complete UTF-8 characters, not to
   individual bytes, for example: \x{100}{3}.

5. The dot metacharacter matches one UTF-8 character instead of a
   single byte.

6. The escape sequence \C can be used to match a single byte in UTF-8
   mode, but its use can lead to some strange effects.

7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
   test characters of any code value, but the characters that PCRE
   recognizes as digits, spaces, or word characters remain the same
   set as before, all with values less than 256.

8. Case-insensitive matching applies only to characters whose values
   are less than 256. PCRE does not support the notion of "case" for
   higher-valued characters.

9. PCRE does not support the use of Unicode tables and properties or
   the Perl escapes \p, \P, and \X.
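
As an illustration of items 2 to 4, the sketch below encodes a code
point, such as the 0x1234 in \x{1234}, into the bytes that a UTF-8
subject string would contain. It is not taken from the PCRE source:
the function name utf8_encode is hypothetical, and the sketch assumes
the modern four-byte limit (code points up to 0x10FFFF), which may
differ from the range that PCRE 4.x accepts.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode code point cp as UTF-8.
   Writes up to 4 bytes into out and returns the number written,
   or 0 if cp is above 0x10FFFF. Surrogates are not rejected. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this encoding, 0x1234 becomes the three bytes E1 88 B4, and 0xE9
(a value greater than 127, as in item 3) becomes the two bytes C3 A9.
A quantifier such as {3} in item 4 therefore counts whole sequences
like these, not individual bytes.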

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.
Phone: +44 1223 334714

Last updated: 04 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions

This document describes the optional features of PCRE that can be
selected when the library is compiled. They are all selected, or
deselected, by providing options to the configure script which is run
before the make command. The complete list of options for configure
(which includes the standard ones such as the selection of the
installation directory) can be obtained by running

  ./configure --help

The following sections describe certain options whose names begin with
--enable or --disable. These settings specify changes to the defaults
for the configure command. Because of the way that configure works,
--enable and --disable always come in pairs, so the complementary
option always exists as well, but as it specifies the default, it is
not described.

To build PCRE with support for UTF-8 character strings, add

  --enable-utf8

to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile()
function.

By default, PCRE treats character 10 (linefeed) as the newline
character. This is the normal newline character on Unix-like systems.
You can compile PCRE to use character 13 (carriage return) instead by
adding

  --enable-newline-is-cr

to the configure command. For completeness there is also a
--enable-newline-is-lf option, which explicitly specifies linefeed as
the newline character.

The PCRE building process uses libtool to build both shared and static
Unix libraries by default. You can suppress one of these by adding one
of

  --disable-shared
  --disable-static

to the configure command, as required.

When PCRE is called through the POSIX interface (see the pcreposix
documentation), additional working storage is required for holding the
pointers to capturing substrings because PCRE requires three integers
per substring, whereas the POSIX interface provides only two. If the
number of expected substrings is small, the wrapper function uses
space on the stack, because this is faster than using malloc() for
each call. The default threshold above which the stack is no longer
used is 10; it can be changed by adding a setting such as

  --with-posix-malloc-threshold=20

to the configure command.
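
The stack-or-heap choice described above can be sketched as follows.
This is an illustration of the general technique, not the code of the
POSIX wrapper itself: the names SMALL_THRESHOLD and
with_working_vector are hypothetical, and the loop stands in for the
real matching work.

```c
#include <stdlib.h>

#define SMALL_THRESHOLD 10   /* stand-in for the configured threshold */

/* Hypothetical wrapper: needs 3 ints of working storage per substring.
   Returns 1 if the heap was used, 0 if the stack was used,
   or -1 if malloc() failed. */
int with_working_vector(size_t nmatch)
{
    int small[SMALL_THRESHOLD * 3];  /* cheap: lives on the stack */
    int *ovector = small;
    int used_heap = 0;
    size_t i;

    if (nmatch > SMALL_THRESHOLD) {  /* too big for the stack buffer */
        ovector = malloc(nmatch * 3 * sizeof *ovector);
        if (ovector == NULL)
            return -1;
        used_heap = 1;
    }

    for (i = 0; i < nmatch * 3; i++) /* placeholder for the real work */
        ovector[i] = -1;

    if (used_heap)
        free(ovector);
    return used_heap;
}
```

With the threshold at 10, a request for 10 substrings stays on the
stack and a request for 11 goes to the heap. Raising the threshold
trades a larger stack frame for fewer malloc() calls.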

Internally, PCRE has a function called match() which it calls
repeatedly (possibly recursively) when performing a matching
operation. By limiting the number of times this function may be
called, a limit can be placed on the resources used by a single call
to pcre_exec(). The limit can be changed at run time, as described in
the pcreapi documentation. The default is 10 million, but this can be
changed by adding a setting such as

  --with-match-limit=500000

to the configure command.
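
The idea of bounding a recursive matcher by counting its calls can be
sketched with a much simpler matcher than PCRE's. The code below is a
toy wildcard matcher ('*' matches any run of characters, '?' matches
any one character), not PCRE's match() function; all names in it are
hypothetical.

```c
#define MATCH_OK         1
#define MATCH_FAIL       0
#define MATCH_LIMIT_HIT (-1)

/* Toy recursive matcher. Every entry increments *calls; once the
   count exceeds limit, the whole match is abandoned. */
static int match_here(const char *pat, const char *s,
                      unsigned long *calls, unsigned long limit)
{
    if (++*calls > limit)
        return MATCH_LIMIT_HIT;
    if (*pat == '\0')
        return (*s == '\0') ? MATCH_OK : MATCH_FAIL;
    if (*pat == '*') {
        /* Try every possible length for the run, shortest first. */
        for (;;) {
            int rc = match_here(pat + 1, s, calls, limit);
            if (rc != MATCH_FAIL)
                return rc;           /* success, or limit reached */
            if (*s == '\0')
                return MATCH_FAIL;
            s++;
        }
    }
    if (*s != '\0' && (*pat == '?' || *pat == *s))
        return match_here(pat + 1, s + 1, calls, limit);
    return MATCH_FAIL;
}

/* Entry point: one counter per top-level call. */
int limited_match(const char *pat, const char *s, unsigned long limit)
{
    unsigned long calls = 0;
    return match_here(pat, s, &calls, limit);
}
```

Matching "a*b" against "aaab" takes six calls in this sketch, so it
succeeds with a limit of 1000 but returns MATCH_LIMIT_HIT with a limit
of 3. The limit caps the work done by a single top-level call, which
is the property the configure setting above controls for pcre_exec().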

Within a compiled pattern, offset values are used to point from one
part to another (for example, from an opening parenthesis to an
alternation metacharacter). By default two-byte values are used for
these offsets, leading to a maximum size for a compiled pattern of
around 64K. This is sufficient to handle all but the most gigantic
patterns. Nevertheless, some people do want to process enormous
patterns, so it is possible to compile PCRE to use three-byte or
four-byte offsets by adding a setting such as

  --with-link-size=3

to the configure command. The value given must be 2, 3, or 4. Using
longer offsets slows down the operation of PCRE because it has to load
additional bytes when handling them.
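
To make the trade-off concrete, the sketch below stores and loads an
offset in a configurable number of bytes. It is an illustration only:
the function names are hypothetical, and the most-significant-byte-
first layout is an assumption, not something this document states
about the compiled pattern.

```c
/* Store value at p as an offset of link_size bytes (2, 3, or 4),
   most significant byte first (assumed layout). */
void put_link(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Load an offset of link_size bytes: one extra byte fetched, shifted
   and combined for each increase in link size. */
unsigned long get_link(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset that fits in link_size bytes. */
unsigned long max_link(int link_size)
{
    return (link_size >= 4) ? 0xFFFFFFFFUL
                            : (1UL << (8 * link_size)) - 1;
}
```

Two bytes can address offsets up to 65535, which is where the "around
64K" figure comes from; three bytes reach 16777215. Each load with a
larger link size touches more bytes, which is the cost referred to
above.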

If you build PCRE with an increased link size, test 2 (and test 5 if
you are using UTF-8) will fail. Part of the output of these tests is a
representation of the compiled pattern, and this changes with the link
size.

Last updated: 21 January 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions

  #include <pcre.h>

  pcre *pcre_compile(const char *pattern, int options,
       const char **errptr, int *erroffset,
       const unsigned char *tableptr);

  pcre_extra *pcre_study(const pcre *code, int options,
       const char **errptr);

  int pcre_exec(const pcre *code, const pcre_extra *extra,
       const char *subject, int length, int startoffset,
       int options, int *ovector, int ovecsize);

  int pcre_copy_named_substring(const pcre *code,
       const char *subject, int *ovector,
       int stringcount, const char *stringname,
       char *buffer, int buffersize);

  int pcre_copy_substring(const char *subject, int *ovector,
       int stringcount, int stringnumber, char *buffer,
       int buffersize);

  int pcre_get_named_substring(const pcre *code,
       const char *subject, int *ovector,
       int stringcount, const char *stringname,
       const char **stringptr);

  int pcre_get_stringnumber(const pcre *code,
       const char *name);

  int pcre_get_substring(const char *subject, int *ovector,
       int stringcount, int stringnumber,
       const char **stringptr);

  int pcre_get_substring_list(const char *subject,
       int *ovector, int stringcount, const char ***listptr);

  void pcre_free_substring(const char *stringptr);

  void pcre_free_substring_list(const char **stringptr);

  const unsigned char *pcre_maketables(void);

  int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
       int what, void *where);

  int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

  int pcre_config(int what, void *where);

  char *pcre_version(void);

  void *(*pcre_malloc)(size_t);

  void (*pcre_free)(void *);

  int (*pcre_callout)(pcre_callout_block *);

PCRE has its own native API, which is described in this document.
There is also a set of wrapper functions that correspond to the POSIX
regular expression API. These are described in the pcreposix
documentation.

The native API function prototypes are defined in the header file
pcre.h, and on Unix systems the library itself is called libpcre.a, so
can be accessed by adding -lpcre to the command for linking an
application which calls it. The header file defines the macros
PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
numbers for the library. Applications can use these to include support
for different releases.

The functions pcre_compile(), pcre_study(), and pcre_exec() are used
for compiling and matching regular expressions. A sample program that
demonstrates the simplest way of using them is given in the file
pcredemo.c. The pcresample documentation describes how to run it.

There are convenience functions for extracting captured substrings
from a matched subject string. They are:

  pcre_copy_substring()
  pcre_copy_named_substring()
  pcre_get_substring()
  pcre_get_named_substring()
  pcre_get_substring_list()

pcre_free_substring() and pcre_free_substring_list() are also
provided, to free the memory used for extracted strings.

The function pcre_maketables() is used (optionally) to build a set of
character tables in the current locale for passing to pcre_compile().

The function pcre_fullinfo() is used to find out information about a
compiled pattern; pcre_info() is an obsolete version which returns
only some of the available information, but is retained for backwards
compatibility. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release.

The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions
respectively. PCRE calls the memory management functions via these
variables, so a calling program can replace them if it wishes to
intercept the calls. This should be done before calling any PCRE
functions.
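
The interception mechanism can be sketched without the library. In
the code below, demo_malloc and demo_free are stand-ins for
pcre_malloc and pcre_free (same types, same initial values); they and
the counting functions are hypothetical names chosen so that the
sketch compiles without pcre.h.

```c
#include <stdlib.h>

/* Stand-ins for the library's global hooks, initialised to the
   standard functions just as the document describes. */
void *(*demo_malloc)(size_t) = malloc;
void (*demo_free)(void *) = free;

static unsigned long live_blocks = 0;  /* blocks obtained, not yet freed */

static void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

static void counting_free(void *p)
{
    if (p != NULL)
        live_blocks--;
    free(p);
}

/* Install the replacements; do this before the library is first used. */
void install_counting_hooks(void)
{
    demo_malloc = counting_malloc;
    demo_free = counting_free;
}

unsigned long live_block_count(void)
{
    return live_blocks;
}
```

In a real program the two assignments in install_counting_hooks()
would target pcre_malloc and pcre_free instead. Installing the hooks
before any PCRE call keeps every block paired with the matching free
function.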

The global variable pcre_callout initially contains NULL. It can be
set by the caller to a "callout" function, which PCRE will then call
at specified points during a matching operation. Details are given in
the pcrecallout documentation.

The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc and pcre_free, and the callout function pointed to by
pcre_callout, are shared by all threads.

The compiled form of a regular expression is not altered during
matching, so the same compiled pattern can safely be used by several
threads at once.

  int pcre_config(int what, void *where);

The function pcre_config() makes it possible for a PCRE client to
discover which optional features have been compiled into the PCRE
library. The pcrebuild documentation has more details about these
optional features.

The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a
variable into which the information is placed. The following
information is available:

  PCRE_CONFIG_UTF8

The output is an integer that is set to one if UTF-8 support is
available; otherwise it is set to zero.

  PCRE_CONFIG_NEWLINE

The output is an integer that is set to the value of the code that is
used for the newline character. It is either linefeed (10) or carriage
return (13), and should normally be the standard character for your
operating system.

  PCRE_CONFIG_LINK_SIZE

The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3,
or 4. Larger values allow larger regular expressions to be compiled,
at the expense of slower matching. The default value of 2 is
sufficient for all but the most massive patterns, since it allows the
compiled pattern to be up to 64K in size.

  PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.

  PCRE_CONFIG_MATCH_LIMIT

The output is an integer that gives the default limit for the number
of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.

  pcre *pcre_compile(const char *pattern, int options,
       const char **errptr, int *erroffset,
       const unsigned char *tableptr);

The function pcre_compile() is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero,
and is passed in the argument pattern. A pointer to a single block of
memory that is obtained via pcre_malloc is returned. This contains the
compiled code and related data. The pcre type is defined for the
returned block; this is a typedef for a structure whose contents are
not externally defined. It is up to the caller to free the memory when
it is no longer required.

Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is
not fully relocatable, because it contains a copy of the tableptr
argument, which is an address (see below).

The options argument contains independent bits that affect the
compilation. It should be zero if no options are required. Some of the
options, in particular, those that are compatible with Perl, can also
be set and unset from within the pattern (see the detailed description
of regular expressions in the pcrepattern documentation). For these
options, the contents of the options argument specifies their initial
settings at the start of compilation and execution. The PCRE_ANCHORED
option can be set at the time of matching as well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error
message. The offset from the start of the pattern to the character
where the error was discovered is placed in the variable pointed to by
erroffset, which must not be NULL. If it is, an immediate error is
given.

If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables which are built when it is compiled, using the
default C locale. Otherwise, tableptr must be the result of a call to
pcre_maketables(). See the section on locale support below.

This code fragment shows a typical straightforward call to
pcre_compile():

  pcre *re;
  const char *error;
  int erroffset;
  re = pcre_compile(
    "^A.*Z",          /* the pattern */
    0,                /* default options */
    &error,           /* for error message */
    &erroffset,       /* for error offset */
    NULL);            /* use default character tables */

The following option bits are defined:

  PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is,
it is constrained to match only at the first matching point in the
string which is being searched (the "subject string"). This effect can
also be achieved by appropriate constructs in the pattern itself,
which is the only way to do it in Perl.

  PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting.

  PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before the final character if it is a newline (but
not before any other newlines). The PCRE_DOLLAR_ENDONLY option is
ignored if PCRE_MULTILINE is set. There is no equivalent to this
option in Perl, and no way to set it within a pattern.

  PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches all
characters, including newlines. Without it, newlines are excluded.
This option is equivalent to Perl's /s option, and it can be changed
within a pattern by a (?s) option setting. A negative class such as
[^a] always matches a newline character, independent of the setting of
this option.

  PCRE_EXTENDED

If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class.
Whitespace does not include the VT character (code 11). In addition,
characters between an unescaped # outside a character class and the
next newline character, inclusive, are also ignored. This is
equivalent to Perl's /x option, and it can be changed within a pattern
by a (?x) option setting.

This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.

  PCRE_EXTRA

This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a
literal. There are at present no other features controlled by this
option. It can also be set by a (?X) option setting within a pattern.

  PCRE_MULTILINE

By default, PCRE treats the subject string as consisting of a single
"line" of characters (even if it actually contains several newlines).
The "start of line" metacharacter (^) matches only at the start of the
string, while the "end of line" metacharacter ($) matches only at the
end of the string, or before a terminating newline (unless
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before any
newline in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no
"\n" characters in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE_MULTILINE has no effect.

  PCRE_NO_AUTO_CAPTURE

If this option is set, it disables the use of numbered capturing
parentheses in the pattern. Any opening parenthesis that is not
followed by ? behaves as if it were followed by ?: but named
parentheses can still be used for capturing (and they acquire numbers
in the usual way). There is no equivalent of this option in Perl.

  PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.

  PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as
strings of UTF-8 characters instead of single-byte character strings.
However, it is available only if PCRE has been built to include UTF-8
support. If not, the use of this option provokes an error. Details of
how this option changes the behaviour of PCRE are given in the section
on UTF-8 support in the main pcre page.

  pcre_extra *pcre_study(const pcre *code, int options,
       const char **errptr);

When a pattern is going to be used several times, it is worth spending
more time analyzing it in order to speed up the time taken for
matching. The function pcre_study() takes a pointer to a compiled
pattern as its first argument. If studying the pattern produces
additional information that will help speed up matching, pcre_study()
returns a pointer to a pcre_extra block, in which the study_data field
points to the results of the study.

The returned value from pcre_study() can be passed directly to
pcre_exec(). However, the pcre_extra block also contains other fields
that can be set by the caller before the block is passed; these are
described below. If studying the pattern does not produce any
additional information, pcre_study() returns NULL. In that
circumstance, if the calling program wants to pass some of the other
fields to pcre_exec(), it must set up its own pcre_extra block.

The second argument contains option bits. At present, no options are
defined for pcre_study(), and this argument should always be zero.

The third argument for pcre_study() is a pointer for an error message.
If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it points to a textual error
message. You should therefore test the error pointer for NULL after
calling pcre_study(), to be sure that it has run successfully.

This is a typical call to pcre_study():

  pcre_extra *pe;
  pe = pcre_study(
    re,             /* result of pcre_compile() */
    0,              /* no options exist */
    &error);        /* set to NULL or points to a message */

At present, studying a pattern is useful only for non-anchored
patterns that do not have a single fixed starting character. A bitmap
of possible starting characters is created.

PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables. When
running in UTF-8 mode, this applies only to characters with codes less
than 256. The library contains a default set of tables that is created
in the default C locale when PCRE is compiled. This is used when the
final argument of pcre_compile() is NULL, and is sufficient for many
applications.

An alternative set of tables can, however, be supplied. Such tables
are built by calling the pcre_maketables() function, which has no
arguments, in the relevant locale. The result can then be passed to
pcre_compile() as often as necessary. For example, to build and use
tables that are appropriate for the French locale (where accented
characters with codes greater than 128 are treated as letters), the
following code could be used:

  setlocale(LC_CTYPE, "fr");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);

The tables are built in memory that is obtained via pcre_malloc. The
pointer that is passed to pcre_compile is saved with the compiled
pattern, and the same tables are used via this pointer by pcre_study()
and pcre_exec(). Thus, for any single pattern, compilation, studying
and matching all happen in the same locale, but different patterns can
be compiled in different locales. It is the caller's responsibility to
ensure that the memory containing the tables remains available for as
long as it is needed.
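
Why the locale matters at the moment the tables are built can be shown
with a small stand-in. The sketch below fills a 256-entry
classification table from the standard ctype functions, which consult
the current LC_CTYPE locale. It is not the layout that
pcre_maketables() produces; the table format, the CLASS_ flags, and
the function name are all hypothetical.

```c
#include <ctype.h>
#include <locale.h>

#define CLASS_LETTER 0x01
#define CLASS_DIGIT  0x02
#define CLASS_SPACE  0x04

/* Hypothetical table builder: records, for every byte value, how the
   current locale classifies it. The result is a snapshot; changing
   the locale afterwards does not alter the table. */
void build_class_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; c++) {
        table[c] = 0;
        if (isalpha(c)) table[c] |= CLASS_LETTER;
        if (isdigit(c)) table[c] |= CLASS_DIGIT;
        if (isspace(c)) table[c] |= CLASS_SPACE;
    }
}
```

In the C locale, byte 0xE9 is not flagged as a letter; in a locale
whose single-byte character set defines it as an accented letter, the
same call would flag it. Because the table is a snapshot, it has to
stay in memory for as long as any pattern compiled with it is in use.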

  int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
       int what, void *where);

The pcre_fullinfo() function returns information about a compiled
pattern. It replaces the obsolete pcre_info() function, which is
nevertheless retained for backwards compatibility (and is documented
below).

The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL if
the pattern was not studied. The third argument specifies which piece
of information is required, and the fourth argument is a pointer to a
variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
                        the argument where was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid

Here is a typical call of pcre_fullinfo(), to obtain the length of the
compiled pattern:

  int rc;
  unsigned long int length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    pe,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in pcre.h, and
are as follows:

  PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.

  PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.

  PCRE_INFO_FIRSTBYTE

Return information about the first byte of any matched string, for a
non-anchored pattern. (This option used to be called
PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards
compatibility.)

If there is a fixed first byte, e.g. from a pattern such as
(cat|cow|coyote), it is returned in the integer pointed to by where.
Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every
    branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is
    not set (if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.

  PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in
any matching string, a pointer to the table is returned. Otherwise
NULL is returned. The fourth argument should point to an unsigned
char * variable.
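
A 256-bit table occupies 32 bytes, with one bit per possible byte
value. The sketch below shows one way to set and test such bits. The
bit order within each byte (least significant bit first) is an
assumption made for illustration, and the function names are
hypothetical; this document does not specify the table's internal
layout.

```c
/* 256 bits = 32 bytes; bit c is set if byte value c may start a
   match. Bit order within a byte is an assumed convention. */
void bitmap_set(unsigned char table[32], unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c % 8));
}

/* Returns 1 if bit c is set, 0 otherwise. */
int bitmap_test(const unsigned char table[32], unsigned char c)
{
    return (table[c / 8] >> (c % 8)) & 1;
}
```

A matcher holding such a table can skip any subject position whose
byte is not in the set with one index, one shift, and one mask, which
is why a studied pattern can reject non-starting positions cheaply.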

  PCRE_INFO_LASTLITERAL

For a non-anchored pattern, return the value of the rightmost literal
byte which must exist in any matched string, other than at its start.
The fourth argument should point to an int variable. If there is no
such byte, or if the pattern is anchored, -1 is returned. For example,
for the pattern /a\d+z\d+/ the returned value is 'z'.
833 PCRE supports the use of named as well as numbered capturing
834 parentheses. The names are just an additional way of identi-
835 fying the parentheses, which still acquire a number. A
836 caller that wants to extract data from a named subpattern
837 must convert the name to a number in order to access the
838 correct pointers in the output vector (described with
839 pcre_exec() below). In order to do this, it must first use
840 these three values to obtain the name-to-number mapping
841 table for the pattern.
843 The map consists of a number of fixed-size entries.
844 PCRE_INFO_NAMECOUNT gives the number of entries, and
845 PCRE_INFO_NAMEENTRYSIZE gives the size of each entry; both
846 of these return an int value. The entry size depends on the
847 length of the longest name. PCRE_INFO_NAMETABLE returns a
848 pointer to the first entry of the table (a pointer to char).
849 The first two bytes of each entry are the number of the cap-
850 turing parenthesis, most significant byte first. The rest of
851 the entry is the corresponding name, zero terminated. The
852 names are in alphabetical order. For example, consider the
853 following pattern (assume PCRE_EXTENDED is set, so white
854 space - including newlines - is ignored):
856 (?P<date> (?P<year>(\d\d)?\d\d) -
857 (?P<month>\d\d) - (?P<day>\d\d) )
859 There are four named subpatterns, so the table has four
860 entries, and each entry in the table is eight bytes long.
861 The table is as follows, with non-printing bytes shown in
862 hex, and undefined bytes shown as ??:
864 00 01 d a t e 00 ??
865 00 05 d a y 00 ?? ??
866 00 04 m o n t h 00
867 00 02 y e a r 00 ??
869 When writing code to extract data from named subpatterns,
870 remember that the length of each entry may be different for
871 each compiled pattern.
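
The table layout described above can be walked with a few lines of C. This
is a minimal sketch, not part of the library: lookup_name() is a
hypothetical helper, and its name_count and entry_size arguments are
assumed to hold the values obtained for PCRE_INFO_NAMECOUNT and
PCRE_INFO_NAMEENTRYSIZE.

```c
#include <string.h>

/* Hypothetical helper: look up a name in a name table laid out as
   described above. Each entry is entry_size bytes long: two bytes
   holding the parenthesis number (most significant byte first),
   followed by the zero-terminated name.
   Returns the parenthesis number, or -1 if the name is not present. */
int lookup_name(const char *table, int name_count, int entry_size,
                const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry =
            (const unsigned char *)table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

Because the entry size is passed in rather than fixed, the same code works
for any compiled pattern. For the example table, looking up "month" would
yield 4.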
875 Return a copy of the options with which the pattern was com-
876 piled. The fourth argument should point to an unsigned long
877 int variable. These option bits are those specified in the
878 call to pcre_compile(), modified by any top-level option
879 settings within the pattern itself.
881 A pattern is automatically anchored by PCRE if all of its
882 top-level alternatives begin with one of the following:
884 ^ unless PCRE_MULTILINE is set
885 \A always
886 \G always
887 .* if PCRE_DOTALL is set and there are no back
888 references to the subpattern in which .* appears
890 For such patterns, the PCRE_ANCHORED bit is set in the
891 options returned by pcre_fullinfo().
895 Return the size of the compiled pattern, that is, the value
896 that was passed as the argument to pcre_malloc() when PCRE
897 was getting memory in which to place the compiled data. The
898 fourth argument should point to a size_t variable.
902 Returns the size of the data block pointed to by the
903 study_data field in a pcre_extra block. That is, it is the
904 value that was passed to pcre_malloc() when PCRE was getting
905 memory into which to place the data created by pcre_study().
906 The fourth argument should point to a size_t variable.
911 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
913 The pcre_info() function is now obsolete because its inter-
914 face is too restrictive to return all the available data
915 about a compiled pattern. New programs should use
916 pcre_fullinfo() instead. The yield of pcre_info() is the
917 number of capturing subpatterns, or one of the following
918 negative numbers:
920 PCRE_ERROR_NULL the argument code was NULL
921 PCRE_ERROR_BADMAGIC the "magic number" was not found
923 If the optptr argument is not NULL, a copy of the options
924 with which the pattern was compiled is placed in the integer
925 it points to (see PCRE_INFO_OPTIONS above).
927 If the pattern is not anchored and the firstcharptr argument
928 is not NULL, it is used to pass back information about the
929 first character of any matched string (see
935 int pcre_exec(const pcre *code, const pcre_extra *extra,
936 const char *subject, int length, int startoffset,
937 int options, int *ovector, int ovecsize);
939 The function pcre_exec() is called to match a subject string
940 against a pre-compiled pattern, which is passed in the code
941 argument. If the pattern has been studied, the result of the
942 study should be passed in the extra argument.
944 Here is an example of a simple call to pcre_exec():
946 int rc;
947 int ovector[30];
948 rc = pcre_exec(
949 re, /* result of pcre_compile() */
950 NULL, /* we didn't study the pattern */
951 "some string", /* the subject string */
952 11, /* the length of the subject string */
953 0, /* start at offset 0 in the subject */
954 0, /* default options */
955 ovector, /* vector for substring information */
956 30); /* number of elements in the vector */
958 If the extra argument is not NULL, it must point to a
959 pcre_extra data block. The pcre_study() function returns
960 such a block (when it doesn't return NULL), but you can also
961 create one for yourself, and pass additional information in
962 it. The fields in the block are as follows:
964 unsigned long int flags;
965 void *study_data;
966 unsigned long int match_limit;
967 void *callout_data;
969 The flags field is a bitmap that specifies which of the
970 other fields are set. The flag bits are:
976 Other flag bits should be set to zero. The study_data field
977 is set in the pcre_extra block that is returned by
978 pcre_study(), together with the appropriate flag bit. You
979 should not set this yourself, but you can add to the block
980 by setting the other fields.
982 The match_limit field provides a means of preventing PCRE
983 from using up a vast amount of resources when running pat-
984 terns that are not going to match, but which have a very
985 large number of possibilities in their search trees. The
986 classic example is the use of nested unlimited repeats.
987 Internally, PCRE uses a function called match() which it
988 calls repeatedly (sometimes recursively). The limit is
989 imposed on the number of times this function is called dur-
990 ing a match, which has the effect of limiting the amount of
991 recursion and backtracking that can take place. For patterns
992 that are not anchored, the count starts from zero for each
993 position in the subject string.
995 The default limit for the library can be set when PCRE is
996 built; the built-in default is 10 million, which handles all
997 but the most extreme cases. You can reduce the default by
998 supplying pcre_exec() with a pcre_extra block in which
999 match_limit is set to a smaller value, and
1000 PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the
1001 limit is exceeded, pcre_exec() returns
1004 The callout_data field is used in conjunction with the
1005 "callout" feature, which is described in the pcrecallout
1006 documentation.
1008 The PCRE_ANCHORED option can be passed in the options argu-
1009 ment, whose unused bits must be zero. This limits
1010 pcre_exec() to matching at the first matching position. How-
1011 ever, if a pattern was compiled with PCRE_ANCHORED, or
1012 turned out to be anchored by virtue of its contents, it cannot
1013 be made unanchored at matching time.
1015 There are also three further options that can be set only at
1016 matching time:
1020 The first character of the string is not the beginning of a
1021 line, so the circumflex metacharacter should not match
1022 before it. Setting this without PCRE_MULTILINE (at compile
1023 time) causes circumflex never to match.
1027 The end of the string is not the end of a line, so the dol-
1028 lar metacharacter should not match it nor (except in multi-
1029 line mode) a newline immediately before it. Setting this
1030 without PCRE_MULTILINE (at compile time) causes dollar never
1031 to match.
1035 An empty string is not considered to be a valid match if
1036 this option is set. If there are alternatives in the pat-
1037 tern, they are tried. If all the alternatives match the
1038 empty string, the entire match fails. For example, if the
1039 pattern
1041 a?b?
1043 is applied to a string not beginning with "a" or "b", it
1044 matches the empty string at the start of the subject. With
1045 PCRE_NOTEMPTY set, this match is not valid, so PCRE searches
1046 further into the string for occurrences of "a" or "b".
1048 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does
1049 make a special case of a pattern match of the empty string
1050 within its split() function, and when using the /g modifier.
1051 It is possible to emulate Perl's behaviour after matching a
1052 null string by first trying the match again at the same
1053 offset with PCRE_NOTEMPTY set, and then if that fails by
1054 advancing the starting offset (see below) and trying an
1055 ordinary match again.
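
The retry scheme just described can be sketched as a loop. So that the
sketch stands alone, it does not call the library: toy_exec() is a
hypothetical stand-in for pcre_exec() that only knows the pattern a*, and
its notempty argument plays the role of PCRE_NOTEMPTY. Only the control
flow in count_matches() is the point of the example.

```c
/* Stand-in for pcre_exec(), for illustration only. It "matches" the
   pattern a* (zero or more 'a'), searching forward from start. If
   notempty is non-zero, empty matches are rejected, as with
   PCRE_NOTEMPTY. On success it sets ov[0] and ov[1] and returns 1;
   on failure it returns -1. */
int toy_exec(const char *subject, int length, int start, int notempty,
             int *ov)
{
    int pos;
    for (pos = start; pos <= length; pos++) {
        int end = pos;
        while (end < length && subject[end] == 'a')
            end++;
        if (end > pos || !notempty) {
            ov[0] = pos;
            ov[1] = end;
            return 1;
        }
    }
    return -1;
}

/* Count matches in the manner of Perl's /g modifier. */
int count_matches(const char *subject, int length)
{
    int ov[2];
    int count = 0;
    int start = 0;
    int notempty = 0;

    while (start <= length) {
        if (toy_exec(subject, length, start, notempty, ov) < 0) {
            if (!notempty)
                break;       /* an ordinary match failed: finished */
            start++;         /* the retry failed: advance the offset */
            notempty = 0;    /* and try an ordinary match again */
            continue;
        }
        count++;
        start = ov[1];
        /* After an empty match, retry at the same offset with empty
           matches disallowed. */
        notempty = (ov[0] == ov[1]);
    }
    return count;
}
```
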
1057 The subject string is passed to pcre_exec() as a pointer in
1058 subject, a length in length, and a starting offset in star-
1059 toffset. Unlike the pattern string, the subject may contain
1060 binary zero bytes. When the starting offset is zero, the
1061 search for a match starts at the beginning of the subject,
1062 and this is by far the most common case.
1064 If the pattern was compiled with the PCRE_UTF8 option, the
1065 subject must be a sequence of bytes that is a valid UTF-8
1066 string. If an invalid UTF-8 string is passed, PCRE's
1067 behaviour is not defined.
1069 A non-zero starting offset is useful when searching for
1070 another match in the same subject by calling pcre_exec()
1071 again after a previous success. Setting startoffset differs
1072 from just passing over a shortened string and setting
1073 PCRE_NOTBOL in the case of a pattern that begins with any
1074 kind of lookbehind. For example, consider the pattern
1076 \Biss\B
1078 which finds occurrences of "iss" in the middle of words. (\B
1079 matches only if the current position in the subject is not a
1080 word boundary.) When applied to the string "Mississippi" the
1081 first call to pcre_exec() finds the first occurrence. If
1082 pcre_exec() is called again with just the remainder of the
1083 subject, namely "issippi", it does not match, because \B is
1084 always false at the start of the subject, which is deemed to
1085 be a word boundary. However, if pcre_exec() is passed the
1086 entire string again, but with startoffset set to 4, it finds
1087 the second occurrence of "iss" because it is able to look
1088 behind the starting point to discover that it is preceded by
1089 a letter.
1091 If a non-zero starting offset is passed when the pattern is
1092 anchored, one attempt to match at the given offset is tried.
1093 This can only succeed if the pattern does not require the
1094 match to be at the start of the subject.
1096 In general, a pattern matches a certain portion of the sub-
1097 ject, and in addition, further substrings from the subject
1098 may be picked out by parts of the pattern. Following the
1099 usage in Jeffrey Friedl's book, this is called "capturing"
1100 in what follows, and the phrase "capturing subpattern" is
1101 used for a fragment of a pattern that picks out a substring.
1102 PCRE supports several other kinds of parenthesized subpat-
1103 tern that do not cause substrings to be captured.
1105 Captured substrings are returned to the caller via a vector
1106 of integer offsets whose address is passed in ovector. The
1107 number of elements in the vector is passed in ovecsize. The
1108 first two-thirds of the vector is used to pass back captured
1109 substrings, each substring using a pair of integers. The
1110 remaining third of the vector is used as workspace by
1111 pcre_exec() while matching capturing subpatterns, and is not
1112 available for passing back information. The length passed in
1113 ovecsize should always be a multiple of three. If it is not,
1114 it is rounded down.
1116 When a match has been successful, information about captured
1117 substrings is returned in pairs of integers, starting at the
1118 beginning of ovector, and continuing up to two-thirds of its
1119 length at the most. The first element of a pair is set to
1120 the offset of the first character in a substring, and the
1121 second is set to the offset of the first character after the
1122 end of a substring. The first pair, ovector[0] and ovec-
1123 tor[1], identify the portion of the subject string matched
1124 by the entire pattern. The next pair is used for the first
1125 capturing subpattern, and so on. The value returned by
1126 pcre_exec() is the number of pairs that have been set. If
1127 there are no capturing subpatterns, the return value from a
1128 successful match is 1, indicating that just the first pair
1129 of offsets has been set.
1130 Some convenience functions are provided for extracting the
1131 captured substrings as separate strings. These are described
1132 in the following section.
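
To make the pair layout concrete, here is a minimal sketch of extracting
one substring directly from the offsets. copy_capture() is a hypothetical
helper written for this illustration; it is not one of the library's
convenience functions, and its return conventions are its own.

```c
#include <string.h>

/* Hypothetical helper: copy captured substring n into buf as a
   zero-terminated string. rc is the positive value returned by
   pcre_exec(), that is, the number of pairs that were set.
   Returns the substring length, or -1 if n is out of range, the pair
   holds negative offsets, or buf is too small. */
int copy_capture(const char *subject, const int *ovector, int rc, int n,
                 char *buf, int bufsize)
{
    int start, len;

    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];            /* offset of first character */
    len = ovector[2 * n + 1] - start;  /* end offset minus start */
    if (start < 0 || len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```
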
1134 It is possible for a capturing subpattern number n+1 to
1135 match some part of the subject when subpattern n has not
1136 been used at all. For example, if the string "abc" is
1137 matched against the pattern (a|(z))(bc) subpatterns 1 and 3
1138 are matched, but 2 is not. When this happens, both offset
1139 values corresponding to the unused subpattern are set to -1.
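
A caller can test for this case before using a pair. The following is a
small sketch with a hypothetical helper name; it also treats any
subpattern numbered at or beyond the pcre_exec() return value as unset,
which is a choice made for this sketch.

```c
/* Hypothetical helper: returns 1 if subpattern n took part in the
   match, 0 if it is unset. rc is the positive value returned by
   pcre_exec(). */
int capture_is_set(const int *ovector, int rc, int n)
{
    if (n < 0 || n >= rc)
        return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

For "abc" matched against (a|(z))(bc), the offsets would be 0,3 for the
whole match, 0,1 for subpattern 1, -1,-1 for subpattern 2, and 1,3 for
subpattern 3, so only subpattern 2 is reported as unset.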
1141 If a capturing subpattern is matched repeatedly, it is the
1142 last portion of the string that it matched that gets
1143 returned.
1145 If the vector is too small to hold all the captured sub-
1146 strings, it is used as far as possible (up to two-thirds of
1147 its length), and the function returns a value of zero. In
1148 particular, if the substring offsets are not of interest,
1149 pcre_exec() may be called with ovector passed as NULL and
1150 ovecsize as zero. However, if the pattern contains back
1151 references and the ovector isn't big enough to remember the
1152 related substrings, PCRE has to get additional memory for
1153 use during matching. Thus it is usually advisable to supply
1154 an ovector.
1156 Note that pcre_info() can be used to find out how many cap-
1157 turing subpatterns there are in a compiled pattern. The
1158 smallest size for ovector that will allow for n captured
1159 substrings, in addition to the offsets of the substring
1160 matched by the whole pattern, is (n+1)*3.
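
The sizing arithmetic can be written down as two one-line functions. Both
names are hypothetical; the formulas are the ones given in the text.

```c
/* Smallest ovecsize that can return n captured substrings in addition
   to the pair for the whole match: (n+1)*3. */
int ovecsize_for(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given ovecsize.
   Two-thirds of the vector holds pairs of integers, so this is
   one-third of the size, rounded down. */
int max_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For instance, a vector of 30 elements, as in the simple call shown
earlier, has room for the whole match plus nine captured substrings.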
1162 If pcre_exec() fails, it returns a negative number. The fol-
1163 lowing are defined in the header file:
1167 The subject string did not match the pattern.
1171 Either code or subject was passed as NULL, or ovector was
1172 NULL and ovecsize was not zero.
1176 An unrecognized bit was set in the options argument.
1180 PCRE stores a 4-byte "magic number" at the start of the com-
1181 piled code, to catch the case when it is passed a junk
1182 pointer. This is the error it gives when the magic number
1183 isn't present.
1187 While running the pattern match, an unknown item was encoun-
1188 tered in the compiled pattern. This error could be caused by
1189 a bug in PCRE or by overwriting of the compiled pattern.
1193 If a pattern contains back references, but the ovector that
1194 is passed to pcre_exec() is not big enough to remember the
1195 referenced substrings, PCRE gets a block of memory at the
1196 start of matching to use for this purpose. If the call via
1197 pcre_malloc() fails, this error is given. The memory is
1198 freed at the end of matching.
1202 This error is used by the pcre_copy_substring(),
1203 pcre_get_substring(), and pcre_get_substring_list() func-
1204 tions (see below). It is never returned by pcre_exec().
1208 The recursion and backtracking limit, as specified by the
1209 match_limit field in a pcre_extra structure (or defaulted)
1210 was reached. See the description above.
1214 This error is never generated by pcre_exec() itself. It is
1215 provided for use by callout functions that want to yield a
1216 distinctive error code. See the pcrecallout documentation
1217 for details.
1222 int pcre_copy_substring(const char *subject, int *ovector,
1223 int stringcount, int stringnumber, char *buffer,
1224 int buffersize);
1226 int pcre_get_substring(const char *subject, int *ovector,
1227 int stringcount, int stringnumber,
1228 const char **stringptr);
1230 int pcre_get_substring_list(const char *subject,
1231 int *ovector, int stringcount, const char ***listptr);
1234 Captured substrings can be accessed directly by using the
1235 offsets returned by pcre_exec() in ovector. For convenience,
1236 the functions pcre_copy_substring(), pcre_get_substring(),
1237 and pcre_get_substring_list() are provided for extracting
1238 captured substrings as new, separate, zero-terminated
1239 strings. These functions identify substrings by number. The
1240 next section describes functions for extracting named sub-
1241 strings. A substring that contains a binary zero is
1242 correctly extracted and has a further zero added on the end,
1243 but the result is not, of course, a C string.
1245 The first three arguments are the same for all three of
1246 these functions: subject is the subject string which has
1247 just been successfully matched, ovector is a pointer to the
1248 vector of integer offsets that was passed to pcre_exec(),
1249 and stringcount is the number of substrings that were cap-
1250 tured by the match, including the substring that matched the
1251 entire regular expression. This is the value returned by
1252 pcre_exec() if it is greater than zero. If pcre_exec()
1253 returned zero, indicating that it ran out of space in ovec-
1254 tor, the value passed as stringcount should be the size of
1255 the vector divided by three.
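
The rule for choosing stringcount can be sketched as follows. The helper
name is hypothetical, and returning zero for a failed match is a choice
made for this sketch rather than something the text prescribes.

```c
/* Hypothetical helper: derive the stringcount argument from the value
   rc returned by pcre_exec() and the ovecsize that was passed to it. */
int effective_stringcount(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;            /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small: it is full */
    return 0;                 /* negative rc: nothing to extract */
}
```
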
1257 The functions pcre_copy_substring() and pcre_get_substring()
1258 extract a single substring, whose number is given as string-
1259 number. A value of zero extracts the substring that matched
1260 the entire pattern, while higher values extract the captured
1261 substrings. For pcre_copy_substring(), the string is placed
1262 in buffer, whose length is given by buffersize, while for
1263 pcre_get_substring() a new block of memory is obtained via
1264 pcre_malloc, and its address is returned via stringptr. The
1265 yield of the function is the length of the string, not
1266 including the terminating zero, or one of
1270 The buffer was too small for pcre_copy_substring(), or the
1271 attempt to get memory failed for pcre_get_substring().
1275 There is no substring whose number is stringnumber.
1277 The pcre_get_substring_list() function extracts all avail-
1278 able substrings and builds a list of pointers to them. All
1279 this is done in a single block of memory which is obtained
1280 via pcre_malloc. The address of the memory block is returned
1281 via listptr, which is also the start of the list of string
1282 pointers. The end of the list is marked by a NULL pointer.
1283 The yield of the function is zero if all went well, or
1287 if the attempt to get the memory block failed.
1289 When any of these functions encounter a substring that is
1290 unset, which can happen when capturing subpattern number n+1
1291 matches some part of the subject, but subpattern n has not
1292 been used at all, they return an empty string. This can be
1293 distinguished from a genuine zero-length substring by
1294 inspecting the appropriate offset in ovector, which is nega-
1295 tive for unset substrings.
1297 The two convenience functions pcre_free_substring() and
1298 pcre_free_substring_list() can be used to free the memory
1299 returned by a previous call of pcre_get_substring() or
1300 pcre_get_substring_list(), respectively. They do nothing
1301 more than call the function pointed to by pcre_free, which
1302 of course could be called directly from a C program. How-
1303 ever, PCRE is used in some situations where it is linked via
1304 a special interface to another programming language which
1305 cannot use pcre_free directly; it is for these cases that
1306 the functions are provided.
1311 int pcre_copy_named_substring(const pcre *code,
1312 const char *subject, int *ovector,
1313 int stringcount, const char *stringname,
1314 char *buffer, int buffersize);
1316 int pcre_get_stringnumber(const pcre *code,
1317 const char *name);
1319 int pcre_get_named_substring(const pcre *code,
1320 const char *subject, int *ovector,
1321 int stringcount, const char *stringname,
1322 const char **stringptr);
1324 To extract a substring by name, you first have to find the
1325 associated number. This can be done by calling
1326 pcre_get_stringnumber(). The first argument is the compiled
1327 pattern, and the second is the name. For example, for this
1328 pattern
1330 ab(?P<xxx>\d+)...
1332 the number of the subpattern called "xxx" is 1. Given the
1333 number, you can then extract the substring directly, or use
1334 one of the functions described in the previous section. For
1335 convenience, there are also two functions that do the whole
1336 job.
1338 Most of the arguments of pcre_copy_named_substring() and
1339 pcre_get_named_substring() are the same as those for the
1340 functions that extract by number, and so are not re-
1341 described here. There are just two differences.
1343 First, instead of a substring number, a substring name is
1344 given. Second, there is an extra argument, given at the
1345 start, which is a pointer to the compiled pattern. This is
1346 needed in order to gain access to the name-to-number trans-
1347 lation table.
1349 These functions call pcre_get_stringnumber(), and if it
1350 succeeds, they then call pcre_copy_substring() or
1351 pcre_get_substring(), as appropriate.
1353 Last updated: 03 February 2003
1354 Copyright (c) 1997-2003 University of Cambridge.
1355 -----------------------------------------------------------------------------
1357 NAME
1358 PCRE - Perl-compatible regular expressions
1363 int (*pcre_callout)(pcre_callout_block *);
1365 PCRE provides a feature called "callout", which is a means
1366 of temporarily passing control to the caller of PCRE in the
1367 middle of pattern matching. The caller of PCRE provides an
1368 external function by putting its entry point in the global
1369 variable pcre_callout. By default, this variable contains
1370 NULL, which disables all calling out.
1372 Within a regular expression, (?C) indicates the points at
1373 which the external function is to be called. Different cal-
1374 lout points can be identified by putting a number less than
1375 256 after the letter C. The default value is zero. For
1376 example, this pattern has two callout points:
1378 (?C1)9abc(?C2)def
1380 During matching, when PCRE reaches a callout point (and
1381 pcre_callout is set), the external function is called. Its
1382 only argument is a pointer to a pcre_callout block. This
1383 contains the following variables:
1385 int version;
1386 int callout_number;
1387 int *offset_vector;
1388 const char *subject;
1389 int subject_length;
1390 int start_match;
1391 int current_position;
1392 int capture_top;
1393 int capture_last;
1394 void *callout_data;
1396 The version field is an integer containing the version
1397 number of the block format. The current version is zero. The
1398 version number may change in future if additional fields are
1399 added, but the intention is never to remove any of the
1400 existing fields.
1402 The callout_number field contains the number of the callout,
1403 as compiled into the pattern (that is, the number after ?C).
1405 The offset_vector field is a pointer to the vector of
1406 offsets that was passed by the caller to pcre_exec(). The
1407 contents can be inspected in order to extract substrings
1408 that have been matched so far, in the same way as for
1409 extracting substrings after a match has completed.
1410 The subject and subject_length fields contain copies of the
1411 values that were passed to pcre_exec().
1413 The start_match field contains the offset within the subject
1414 at which the current match attempt started. If the pattern
1415 is not anchored, the callout function may be called several
1416 times for different starting points.
1418 The current_position field contains the offset within the
1419 subject of the current match pointer.
1421 The capture_top field contains the number of the highest
1422 captured substring so far.
1424 The capture_last field contains the number of the most
1425 recently captured substring.
1427 The callout_data field contains a value that is passed to
1428 pcre_exec() by the caller specifically so that it can be
1429 passed back in callouts. It is passed in the callout_data
1430 field of the pcre_extra data structure. If no such data was
1431 passed, the value of callout_data in a pcre_callout block is
1432 NULL. There is a description of the pcre_extra structure in
1433 the pcreapi documentation.
1439 The callout function returns an integer. If the value is
1440 zero, matching proceeds as normal. If the value is greater
1441 than zero, matching fails at the current point, but back-
1442 tracking to test other possibilities goes ahead, just as if
1443 a lookahead assertion had failed. If the value is less than
1444 zero, the match is abandoned, and pcre_exec() returns the
1445 value.
1447 Negative values should normally be chosen from the set of
1448 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH
1449 forces a standard "no match" failure. The error number
1450 PCRE_ERROR_CALLOUT is reserved for use by callout functions;
1451 it will never be used by PCRE itself.
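
The three kinds of return value can be illustrated with a small callout
function. This is a sketch under several assumptions. To compile without
the library header, it declares its own structure, sketch_callout_block,
which mirrors the fields listed earlier; real code would use
pcre_callout_block from pcre.h. SKETCH_ABANDON is a placeholder negative
value; real code would normally return one of the PCRE_ERROR_xxx values,
such as PCRE_ERROR_CALLOUT. The policy itself (a call budget passed via
callout_data, and rejecting attempts that start beyond offset 0) is
invented for the example.

```c
#include <stddef.h>

/* Local mirror of the callout block fields, for illustration only. */
typedef struct {
    int version;
    int callout_number;
    int *offset_vector;
    const char *subject;
    int subject_length;
    int start_match;
    int current_position;
    int capture_top;
    int capture_last;
    void *callout_data;
} sketch_callout_block;

/* Placeholder for a negative PCRE_ERROR_xxx value. */
#define SKETCH_ABANDON (-9)

/* Hypothetical callout function. callout_data is assumed to point to
   an int holding the number of callouts still permitted. */
int sketch_callout(sketch_callout_block *block)
{
    int *budget = (int *)block->callout_data;

    if (budget == NULL)
        return 0;               /* no data: matching proceeds as normal */
    if (*budget <= 0)
        return SKETCH_ABANDON;  /* negative: the match is abandoned */
    (*budget)--;
    if (block->start_match > 0)
        return 1;               /* positive: fail here, then backtrack */
    return 0;                   /* zero: matching proceeds as normal */
}
```
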
1453 Last updated: 21 January 2003
1454 Copyright (c) 1997-2003 University of Cambridge.
1455 -----------------------------------------------------------------------------
1457 NAME
1458 PCRE - Perl-compatible regular expressions
1463 This document describes the differences in the ways that
1464 PCRE and Perl handle regular expressions. The differences
1465 described here are with respect to Perl 5.8.
1467 1. PCRE does not allow repeat quantifiers on lookahead
1468 assertions. Perl permits them, but they do not mean what you
1469 might think. For example, (?!a){3} does not assert that the
1470 next three characters are not "a". It just asserts that the
1471 next character is not "a" three times.
1473 2. Capturing subpatterns that occur inside negative looka-
1474 head assertions are counted, but their entries in the
1475 offsets vector are never set. Perl sets its numerical vari-
1476 ables from any such patterns that are matched before the
1477 assertion fails to match something (thereby succeeding), but
1478 only if the negative lookahead assertion contains just one
1479 branch.
1481 3. Though binary zero characters are supported in the sub-
1482 ject string, they are not allowed in a pattern string
1483 because it is passed as a normal C string, terminated by
1484 zero. The escape sequence "\0" can be used in the pattern to
1485 represent a binary zero.
1487 4. The following Perl escape sequences are not supported:
1488 \l, \u, \L, \U, \P, \p, and \X. In fact these are imple-
1489 mented by Perl's general string-handling and are not part of
1490 its pattern matching engine. If any of these are encountered
1491 by PCRE, an error is generated.
1493 5. PCRE does support the \Q...\E escape for quoting sub-
1494 strings. Characters in between are treated as literals. This
1495 is slightly different from Perl in that $ and @ are also
1496 handled as literals inside the quotes. In Perl, they cause
1497 variable interpolation (but of course PCRE does not have
1498 variables). Note the following examples:
1500 Pattern PCRE matches Perl matches
1502 \Qabc$xyz\E abc$xyz abc followed by the
1503 contents of $xyz
1504 \Qabc\$xyz\E abc\$xyz abc\$xyz
1505 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
1507 In PCRE, the \Q...\E mechanism is not recognized inside a
1508 character class.
1510 8. Fairly obviously, PCRE does not support the (?{code}) and
1511 (?p{code}) constructions. However, there is some experimen-
1512 tal support for recursive patterns using the non-Perl items
1513 (?R), (?number) and (?P>name). Also, the PCRE "callout"
1514 feature allows an external function to be called during pat-
1515 tern matching.
1517 9. There are some differences that are concerned with the
1518 settings of captured strings when part of a pattern is
1519 repeated. For example, matching "aba" against the pattern
1520 /^(a(b)?)+$/ in Perl leaves $2 unset, but in PCRE it is set
1521 to "b".
1523 10. PCRE provides some extensions to the Perl regular
1524 expression facilities:
1526 (a) Although lookbehind assertions must match fixed length
1527 strings, each alternative branch of a lookbehind assertion
1528 can match a different length of string. Perl requires them
1529 all to have the same length.
1531 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
1532 set, the $ meta-character matches only at the very end of
1533 the string.
1535 (c) If PCRE_EXTRA is set, a backslash followed by a letter
1536 with no special meaning is faulted.
1538 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
1539 tion quantifiers is inverted, that is, by default they are
1540 not greedy, but if followed by a question mark they are.
1542 (e) PCRE_ANCHORED can be used to force a pattern to be tried
1543 only at the first matching position in the subject string.
1546 PCRE_NO_AUTO_CAPTURE options for pcre_exec() have no Perl
1547 equivalents.
1549 (g) The (?R), (?number), and (?P>name) constructs allow for
1550 recursive pattern matching. (Perl can do this using the
1551 (?p{code}) construct, which PCRE cannot support.)
1553 (h) PCRE supports named capturing substrings, using the
1554 Python syntax.
1556 (i) PCRE supports the possessive quantifier "++" syntax,
1557 taken from Sun's Java package.
1559 (j) The (R) condition, for testing recursion, is a PCRE
1560 extension.
1562 (k) The callout facility is PCRE-specific.
1564 Last updated: 03 February 2003
1565 Copyright (c) 1997-2003 University of Cambridge.
1566 -----------------------------------------------------------------------------
1568 NAME
1569 PCRE - Perl-compatible regular expressions
1574 The syntax and semantics of the regular expressions sup-
1575 ported by PCRE are described below. Regular expressions are
1576 also described in the Perl documentation and in a number of
1577 other books, some of which have copious examples. Jeffrey
1578 Friedl's "Mastering Regular Expressions", published by
1579 O'Reilly, covers them in great detail. The description here
1580 is intended as reference documentation.
1582 The basic operation of PCRE is on strings of bytes. However,
1583 there is also support for UTF-8 character strings. To use
1584 this support you must build PCRE to include UTF-8 support,
1585 and then call pcre_compile() with the PCRE_UTF8 option. How
1586 this affects the pattern matching is mentioned in several
1587 places below. There is also a summary of UTF-8 features in
1588 the section on UTF-8 support in the main pcre page.
1590 A regular expression is a pattern that is matched against a
1591 subject string from left to right. Most characters stand for
1592 themselves in a pattern, and match the corresponding charac-
1593 ters in the subject. As a trivial example, the pattern
1595 The quick brown fox
1597 matches a portion of a subject string that is identical to
1598 itself. The power of regular expressions comes from the
1599 ability to include alternatives and repetitions in the pat-
1600 tern. These are encoded in the pattern by the use of meta-
1601 characters, which do not stand for themselves but instead
1602 are interpreted in some special way.
1604 There are two different sets of meta-characters: those that
1605 are recognized anywhere in the pattern except within square
1606 brackets, and those that are recognized in square brackets.
1607 Outside square brackets, the meta-characters are as follows:
1609 \ general escape character with several uses
1610 ^ assert start of string (or line, in multiline mode)
1611 $ assert end of string (or line, in multiline mode)
1612 . match any character except newline (by default)
1613 [ start character class definition
1614 | start of alternative branch
1615 ( start subpattern
1616 ) end subpattern
1617 ? extends the meaning of (
1618 also 0 or 1 quantifier
1619 also quantifier minimizer
1620 * 0 or more quantifier
1621 + 1 or more quantifier
1622 also "possessive quantifier"
1623 { start min/max quantifier
1625 Part of a pattern that is in square brackets is called a
1626 "character class". In a character class the only meta-
1627 characters are:
1629 \      general escape character
1630 ^      negate the class, but only if the first character
1631 -      indicates character range
1632 [      POSIX character class (only if followed by POSIX
1633        syntax)
1634 ]      terminates the character class
1636 The following sections describe the use of each of the
1637 meta-characters.
1642 The backslash character has several uses. Firstly, if it is
1643 followed by a non-alphameric character, it takes away any
1644 special meaning that character may have. This use of
1645 backslash as an escape character applies both inside and
1646 outside character classes.
1648 For example, if you want to match a * character, you write
1649 \* in the pattern. This escaping action applies whether or
1650 not the following character would otherwise be interpreted
1651 as a meta-character, so it is always safe to precede a non-
1652 alphameric with backslash to specify that it stands for
1653 itself. In particular, if you want to match a backslash, you
1654 write \\.
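The escaping rule above can be sketched in a few lines. The helper name quote_literal is hypothetical and not part of PCRE. The check at the end uses Python's re module, a separate engine that happens to treat a backslash before a non-alphanumeric character the same way; it is an illustration, not a statement about PCRE internals. ASCII input is assumed.

```python
import re

def quote_literal(text: str) -> str:
    """Put a backslash before every character that is not an ASCII letter or digit."""
    return "".join(
        ch if (ch.isascii() and ch.isalnum()) else "\\" + ch
        for ch in text
    )

# The subject contains *, (, +, ) and a backslash, all of which would
# otherwise be meta-characters.
subject = "2*(3+4)\\5"
pattern = quote_literal(subject)
print(pattern)                               # 2\*\(3\+4\)\\5
print(bool(re.fullmatch(pattern, subject)))  # True
```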
1656 If a pattern is compiled with the PCRE_EXTENDED option, whi-
1657 tespace in the pattern (other than in a character class) and
1658 characters between a # outside a character class and the
1659 next newline character are ignored. An escaping backslash
1660 can be used to include a whitespace or # character as part
1661 of the pattern.
1663 If you want to remove the special meaning from a sequence of
1664 characters, you can do so by putting them between \Q and \E.
1665 This is different from Perl in that $ and @ are handled as
1666 literals in \Q...\E sequences in PCRE, whereas in Perl, $
1667 and @ cause variable interpolation. Note the following exam-
1668 ples:
1670 Pattern            PCRE matches   Perl matches
1672 \Qabc$xyz\E        abc$xyz        abc followed by the
1674                                   contents of $xyz
1675 \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1676 \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1678 The \Q...\E sequence is recognized both inside and outside
1679 character classes.
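The "PCRE matches" column of the table can be reproduced with a small preprocessor. This is a sketch only: Python's re module does not understand \Q and \E, so the hypothetical helper expand_quoted rewrites each quoted run into backslash-escaped literals before handing the result to re. Treating an unterminated \Q as running to the end of the pattern is an assumption of this sketch, not something stated above.

```python
import re

def expand_quoted(pattern: str) -> str:
    """Rewrite \\Q...\\E runs as backslash-escaped literals (sketch)."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            if end == -1:              # assumption: unterminated \Q runs to the end
                end = len(pattern)
            literal = pattern[i + 2:end]
            # Inside the quoted run every non-alphanumeric is made literal,
            # including $ and @.
            out.append("".join(c if c.isalnum() else "\\" + c for c in literal))
            i = end + 2
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(pattern[i:i + 2])   # keep an ordinary escape pair intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)

print(expand_quoted(r"\Qabc$xyz\E"))       # abc\$xyz
print(expand_quoted(r"\Qabc\E\$\Qxyz\E"))  # abc\$xyz
```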
1681 A second use of backslash provides a way of encoding non-
1682 printing characters in patterns in a visible manner. There
1683 is no restriction on the appearance of non-printing charac-
1684 ters, apart from the binary zero that terminates a pattern,
1685 but when a pattern is being prepared by text editing, it is
1686 usually easier to use one of the following escape sequences
1687 than the binary character it represents:
1689 \a alarm, that is, the BEL character (hex 07)
1690 \cx "control-x", where x is any character
1691 \e escape (hex 1B)
1692 \f formfeed (hex 0C)
1693 \n newline (hex 0A)
1694 \r carriage return (hex 0D)
1695 \t tab (hex 09)
1696 \ddd character with octal code ddd, or backreference
1697 \xhh character with hex code hh
1698 \x{hhh..} character with hex code hhh... (UTF-8 mode only)
1700 The precise effect of \cx is as follows: if x is a lower
1701 case letter, it is converted to upper case. Then bit 6 of
1702 the character (hex 40) is inverted. Thus \cz becomes hex
1703 1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.
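The \cx rule is plain arithmetic, so it can be checked directly. The function name control_code is made up for this sketch.

```python
def control_code(x: str) -> int:
    """Return the character code that \\cx stands for, following the rule above."""
    code = ord(x)
    if ord("a") <= code <= ord("z"):
        code -= 0x20        # a lower case letter is first converted to upper case
    return code ^ 0x40      # then bit 6 (hex 40) is inverted

for ch in "z{;":
    print(f"\\c{ch} -> hex {control_code(ch):02X}")
# \cz -> hex 1A
# \c{ -> hex 3B
# \c; -> hex 7B
```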
1705 After \x, from zero to two hexadecimal digits are read
1706 (letters can be in upper or lower case). In UTF-8 mode, any
1707 number of hexadecimal digits may appear between \x{ and },
1708 but the value of the character code must be less than 2**31
1709 (that is, the maximum hexadecimal value is 7FFFFFFF). If
1710 characters other than hexadecimal digits appear between \x{
1711 and }, or if there is no terminating }, this form of escape
1712 is not recognized. Instead, the initial \x will be inter-
1713 preted as a basic hexadecimal escape, with no following
1714 digits, giving a byte whose value is zero.
1716 Characters whose value is less than 256 can be defined by
1717 either of the two syntaxes for \x when PCRE is in UTF-8
1718 mode. There is no difference in the way they are handled.
1719 For example, \xdc is exactly the same as \x{dc}.
1721 After \0 up to two further octal digits are read. In both
1722 cases, if there are fewer than two digits, just those that
1723 are present are used. Thus the sequence \0\x\07 specifies
1724 two binary zeros followed by a BEL character (code value 7).
1725 Make sure you supply two digits after the initial zero if
1726 the character that follows is itself an octal digit.
1728 The handling of a backslash followed by a digit other than 0
1729 is complicated. Outside a character class, PCRE reads it
1730 and any following digits as a decimal number. If the number
1731 is less than 10, or if there have been at least that many
1732 previous capturing left parentheses in the expression, the
1733 entire sequence is taken as a back reference. A description
1734 of how this works is given later, following the discussion
1735 of parenthesized subpatterns.
1737 Inside a character class, or if the decimal number is
1738 greater than 9 and there have not been that many capturing
1739 subpatterns, PCRE re-reads up to three octal digits follow-
1740 ing the backslash, and generates a single byte from the
1741 least significant 8 bits of the value. Any subsequent digits
1742 stand for themselves. For example:
1744 \040   is another way of writing a space
1745 \40    is the same, provided there are fewer than 40
1746        previous capturing subpatterns
1747 \7     is always a back reference
1748 \11    might be a back reference, or another way of
1749        writing a tab
1750 \011   is always a tab
1751 \0113  is a tab followed by the character "3"
1752 \113   might be a back reference, otherwise the
1753        character with octal code 113
1754 \377   might be a back reference, otherwise
1755        the byte consisting entirely of 1 bits
1756 \81    is either a back reference, or a binary zero
1757        followed by the two characters "8" and "1"
1759 Note that octal values of 100 or greater must not be intro-
1760 duced by a leading zero, because no more than three octal
1761 digits are ever read.
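The decision procedure for a backslash followed by digits can be written out as a small function. This is a model of the rules as described in the text, not PCRE's own code; the function name, its arguments and the shape of the returned tuple are all invented for the sketch.

```python
def interpret_backslash_digits(digits: str, groups_so_far: int, in_class: bool = False):
    """Classify the run of decimal digits that follows a backslash.

    Returns ("backref", n, "") for a back reference, or
    ("byte", value, rest) where rest holds digits that stand for themselves.
    """
    if not digits or any(d not in "0123456789" for d in digits):
        raise ValueError("expected one or more decimal digits")
    if digits[0] != "0" and not in_class:
        number = int(digits)
        # Less than 10, or at least that many capturing parentheses so far.
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise read at most three octal digits; keep the low 8 bits.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in "01234567":
        n += 1
    value = int(digits[:n], 8) if n else 0
    return ("byte", value & 0xFF, digits[n:])

print(interpret_backslash_digits("0113", 0))  # ('byte', 9, '3')
print(interpret_backslash_digits("11", 11))   # ('backref', 11, '')
print(interpret_backslash_digits("81", 0))    # ('byte', 0, '81')
```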
1763 All the sequences that define a single byte value or a sin-
1764 gle UTF-8 character (in UTF-8 mode) can be used both inside
1765 and outside character classes. In addition, inside a charac-
1766 ter class, the sequence \b is interpreted as the backspace
1767 character (hex 08). Outside a character class it has a dif-
1768 ferent meaning (see below).
1770 The third use of backslash is for specifying generic charac-
1771 ter types:
1773 \d any decimal digit
1774 \D any character that is not a decimal digit
1775 \s any whitespace character
1776 \S any character that is not a whitespace character
1777 \w any "word" character
1778 \W any "non-word" character
1780 Each pair of escape sequences partitions the complete set of
1781 characters into two disjoint sets. Any given character
1782 matches one, and only one, of each pair.
1784 In UTF-8 mode, characters with values greater than 255 never
1785 match \d, \s, or \w, and always match \D, \S, and \W.
1787 For compatibility with Perl, \s does not match the VT
1788 character (code 11). This makes it different from the POSIX
1789 "space" class. The \s characters are HT (9), LF (10), FF
1790 (12), CR (13), and space (32).
1792 A "word" character is any letter or digit or the underscore
1793 character, that is, any character which can be part of a
1794 Perl "word". The definition of letters and digits is con-
1795 trolled by PCRE's character tables, and may vary if locale-
1796 specific matching is taking place (see "Locale support" in
1797 the pcreapi page). For example, in the "fr" (French) locale,
1798 some character codes greater than 128 are used for accented
1799 letters, and these are matched by \w.
1801 These character type sequences can appear both inside and
1802 outside character classes. They each match one character of
1803 the appropriate type. If the current matching point is at
1804 the end of the subject string, all of them fail, since there
1805 is no character to match.
1807 The fourth use of backslash is for certain simple asser-
1808 tions. An assertion specifies a condition that has to be met
1809 at a particular point in a match, without consuming any
1810 characters from the subject string. The use of subpatterns
1811 for more complicated assertions is described below. The
1812 backslashed assertions are
1814 \b matches at a word boundary
1815 \B matches when not at a word boundary
1816 \A matches at start of subject
1817 \Z matches at end of subject or before newline at end
1818 \z matches at end of subject
1819 \G matches at first matching position in subject
1821 These assertions may not appear in character classes (but
1822 note that \b has a different meaning, namely the backspace
1823 character, inside a character class).
1825 A word boundary is a position in the subject string where
1826 the current character and the previous character do not both
1827 match \w or \W (i.e. one matches \w and the other matches
1828 \W), or the start or end of the string if the first or last
1829 character matches \w, respectively.
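The word boundary definition translates directly into code. The sketch below assumes the default character tables, so a "word" character is an ASCII letter, digit or underscore; the function names are hypothetical. For comparison it also lists the positions where Python's re module (a separate engine, used here with its ASCII flag) reports \b.

```python
import re

def is_word_char(ch: str) -> bool:
    # Assumption: default tables, so ASCII letters, digits and underscore only.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def is_word_boundary(subject: str, pos: int) -> bool:
    """True if exactly one of the characters on either side of pos matches \\w."""
    before = pos > 0 and is_word_char(subject[pos - 1])
    after = pos < len(subject) and is_word_char(subject[pos])
    return before != after

def boundaries(subject: str) -> list:
    return [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]

print(boundaries("ab cd"))                                      # [0, 2, 3, 5]
print([m.start() for m in re.finditer(r"\b", "ab cd", re.ASCII)])  # [0, 2, 3, 5]
```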
1830 The \A, \Z, and \z assertions differ from the traditional
1831 circumflex and dollar (described below) in that they only
1832 ever match at the very start and end of the subject string,
1833 whatever options are set. Thus, they are independent of mul-
1834 tiline mode.
1836 They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL
1837 options. If the startoffset argument of pcre_exec() is non-
1838 zero, indicating that matching is to start at a point other
1839 than the beginning of the subject, \A can never match. The
1840 difference between \Z and \z is that \Z matches before a
1841 newline that is the last character of the string as well as
1842 at the end of the string, whereas \z matches only at the
1843 end.
1845 The \G assertion is true only when the current matching
1846 position is at the start point of the match, as specified by
1847 the startoffset argument of pcre_exec(). It differs from \A
1848 when the value of startoffset is non-zero. By calling
1849 pcre_exec() multiple times with appropriate arguments, you
1850 can mimic Perl's /g option, and it is in this kind of imple-
1851 mentation where \G can be useful.
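The loop of repeated calls can be sketched in Python, with the caveat that this is an analogy rather than PCRE code. Python's re module has no \G; instead, Pattern.match(subject, pos) only succeeds if the match begins exactly at pos, which plays the role that a leading \G plus the startoffset argument plays for pcre_exec(). The token pattern and the function name are invented for the example.

```python
import re

# Hypothetical token pattern: a run of digits or a run of lower case letters.
TOKEN = re.compile(r"\d+|[a-z]+")

def leading_tokens(subject: str) -> list:
    """Collect tokens that follow each other with no gap, starting at offset 0."""
    tokens = []
    pos = 0
    while pos < len(subject):
        m = TOKEN.match(subject, pos)   # anchored at pos, like \G with startoffset
        if m is None:
            break                       # something else is in the way: stop
        tokens.append(m.group())
        pos = m.end()                   # the next call starts where this match ended
    return tokens

print(leading_tokens("abc123def!456"))  # ['abc', '123', 'def']
```

Without the anchoring, a global search would skip over the "!" and also return "456"; the anchored loop stops at the first position where no token starts.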
1853 Note, however, that PCRE's interpretation of \G, as the
1854 start of the current match, is subtly different from Perl's,
1855 which defines it as the end of the previous match. In Perl,
1856 these can be different when the previously matched string
1857 was empty. Because PCRE does just one match at a time, it
1858 cannot reproduce this behaviour.
1860 If all the alternatives of a pattern begin with \G, the
1861 expression is anchored to the starting match position, and
1862 the "anchored" flag is set in the compiled regular expres-
1863 sion.
1868 Outside a character class, in the default matching mode, the
1869 circumflex character is an assertion which is true only if
1870 the current matching point is at the start of the subject
1871 string. If the startoffset argument of pcre_exec() is non-
1872 zero, circumflex can never match if the PCRE_MULTILINE
1873 option is unset. Inside a character class, circumflex has an
1874 entirely different meaning (see below).
1876 Circumflex need not be the first character of the pattern if
1877 a number of alternatives are involved, but it should be the
1878 first thing in each alternative in which it appears if the
1879 pattern is ever to match that branch. If all possible alter-
1880 natives start with a circumflex, that is, if the pattern is
1881 constrained to match only at the start of the subject, it is
1882 said to be an "anchored" pattern. (There are also other con-
1883 structs that can cause a pattern to be anchored.)
1885 A dollar character is an assertion which is true only if the
1886 current matching point is at the end of the subject string,
1887 or immediately before a newline character that is the last
1888 character in the string (by default). Dollar need not be the
1889 last character of the pattern if a number of alternatives
1890 are involved, but it should be the last item in any branch
1891 in which it appears. Dollar has no special meaning in a
1892 character class.
1894 The meaning of dollar can be changed so that it matches only
1895 at the very end of the string, by setting the
1896 PCRE_DOLLAR_ENDONLY option at compile time. This does not
1897 affect the \Z assertion.
1899 The meanings of the circumflex and dollar characters are
1900 changed if the PCRE_MULTILINE option is set. When this is
1901 the case, they match immediately after and immediately
1902 before an internal newline character, respectively, in addi-
1903 tion to matching at the start and end of the subject string.
1904 For example, the pattern /^abc$/ matches the subject string
1905 "def\nabc" in multiline mode, but not otherwise. Conse-
1906 quently, patterns that are anchored in single line mode
1907 because all branches start with ^ are not anchored in multi-
1908 line mode, and a match for circumflex is possible when the
1909 startoffset argument of pcre_exec() is non-zero. The
1910 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1911 set.
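The /^abc$/ example can be tried with Python's re module, whose MULTILINE flag corresponds to PCRE_MULTILINE for this purpose. Python's re is a different engine, so this is only a sketch of the behaviour described, under the assumption that the two agree on these anchors.

```python
import re

subject = "def\nabc"

single_line = re.search(r"^abc$", subject)
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

print(single_line)          # None
print(multi_line.start())   # 4

# By default, dollar also matches just before a newline that ends the subject.
print(re.search(r"abc$", "abc\n") is not None)   # True
```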
1913 Note that the sequences \A, \Z, and \z can be used to match
1914 the start and end of the subject in both modes, and if all
1915 branches of a pattern start with \A it is always anchored,
1916 whether PCRE_MULTILINE is set or not.
1921 Outside a character class, a dot in the pattern matches any
1922 one character in the subject, including a non-printing char-
1923 acter, but not (by default) newline. In UTF-8 mode, a dot
1924 matches any UTF-8 character, which might be more than one
1925 byte long, except (by default) for newline. If the
1926 PCRE_DOTALL option is set, dots match newlines as well. The
1927 handling of dot is entirely independent of the handling of
1928 circumflex and dollar, the only relationship being that they
1929 both involve newline characters. Dot has no special meaning
1930 in a character class.
1936 Outside a character class, the escape sequence \C matches
1937 any one byte, both in and out of UTF-8 mode. Unlike a dot,
1938 it always matches a newline. The feature is provided in Perl
1939 in order to match individual bytes in UTF-8 mode. Because
1940 it breaks up UTF-8 characters into individual bytes, what
1941 remains in the string may be a malformed UTF-8 string. For
1942 this reason it is best avoided.
1944 PCRE does not allow \C to appear in lookbehind assertions
1945 (see below), because in UTF-8 mode it makes it impossible to
1946 calculate the length of the lookbehind.
1951 An opening square bracket introduces a character class, ter-
1952 minated by a closing square bracket. A closing square
1953 bracket on its own is not special. If a closing square
1954 bracket is required as a member of the class, it should be
1955 the first data character in the class (after an initial cir-
1956 cumflex, if present) or escaped with a backslash.
1958 A character class matches a single character in the subject.
1959 In UTF-8 mode, the character may occupy more than one byte.
1960 A matched character must be in the set of characters defined
1961 by the class, unless the first character in the class defin-
1962 ition is a circumflex, in which case the subject character
1963 must not be in the set defined by the class. If a circumflex
1964 is actually required as a member of the class, ensure it is
1965 not the first character, or escape it with a backslash.
1967 For example, the character class [aeiou] matches any lower
1968 case vowel, while [^aeiou] matches any character that is not
1969 a lower case vowel. Note that a circumflex is just a con-
1970 venient notation for specifying the characters which are in
1971 the class by enumerating those that are not. It is not an
1972 assertion: it still consumes a character from the subject
1973 string, and fails if the current pointer is at the end of
1974 the string.
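The difference between a negated class and an assertion shows up at the end of the subject. The sketch below uses Python's re module as a stand-in engine; the negative lookahead (?!b) is used only as a contrast, being a true assertion that consumes nothing.

```python
import re

# [^b] must consume a character, so it fails when "a" is the last character.
class_at_end = re.search(r"a[^b]", "xa")
# (?!b) only asserts that "b" does not follow, so it succeeds at the end.
lookahead_at_end = re.search(r"a(?!b)", "xa")

print(class_at_end)              # None
print(lookahead_at_end.group())  # a

# A negated class does consume a newline if one is present.
print(repr(re.search(r"a[^b]", "xa\n").group()))  # 'a\n'
```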
1976 In UTF-8 mode, characters with values greater than 255 can
1977 be included in a class as a literal string of bytes, or by
1978 using the \x{ escaping mechanism.
1980 When caseless matching is set, any letters in a class
1981 represent both their upper case and lower case versions, so
1982 for example, a caseless [aeiou] matches "A" as well as "a",
1983 and a caseless [^aeiou] does not match "A", whereas a case-
1984 ful version would. PCRE does not support the concept of case
1985 for characters with values greater than 255.
1986 The newline character is never treated in any special way in
1987 character classes, whatever the setting of the PCRE_DOTALL
1988 or PCRE_MULTILINE options is. A class such as [^a] will
1989 always match a newline.
1991 The minus (hyphen) character can be used to specify a range
1992 of characters in a character class. For example, [d-m]
1993 matches any letter between d and m, inclusive. If a minus
1994 character is required in a class, it must be escaped with a
1995 backslash or appear in a position where it cannot be inter-
1996 preted as indicating a range, typically as the first or last
1997 character in the class.
1999 It is not possible to have the literal character "]" as the
2000 end character of a range. A pattern such as [W-]46] is
2001 interpreted as a class of two characters ("W" and "-") fol-
2002 lowed by a literal string "46]", so it would match "W46]" or
2003 "-46]". However, if the "]" is escaped with a backslash it
2004 is interpreted as the end of range, so [W-\]46] is inter-
2005 preted as a single class containing a range followed by two
2006 separate characters. The octal or hexadecimal representation
2007 of "]" can also be used to end a range.
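Both readings of the "]" example can be checked with Python's re module, which parses these two classes the same way as described above. It is a separate engine, so treat this as an illustration rather than a test of PCRE itself.

```python
import re

# Class of "W" and "-", followed by the literal string "46]".
two_char_class = re.compile(r"[W-]46]")
# Single class: the range W to ], plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

print(bool(two_char_class.fullmatch("W46]")))  # True
print(bool(two_char_class.fullmatch("-46]")))  # True
print(bool(range_class.fullmatch("Z")))        # True
print(bool(range_class.fullmatch("-")))        # False
```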
2009 Ranges operate in the collating sequence of character
2010 values. They can also be used for characters specified
2011 numerically, for example [\000-\037]. In UTF-8 mode, ranges
2012 can include characters whose values are greater than 255,
2013 for example [\x{100}-\x{2ff}].
2015 If a range that includes letters is used when caseless
2016 matching is set, it matches the letters in either case. For
2017 example, [W-c] is equivalent to [][\\^_`wxyzabc], matched
2018 caselessly, and if character tables for the "fr" locale are
2019 in use, [\xc8-\xcb] matches accented E characters in both
2020 cases.
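The caseless range can be compared with its spelled-out equivalent by enumerating every 7-bit character. The sketch uses Python's re module (a separate engine, with its ASCII and IGNORECASE flags), and writes the explicit class with a doubled backslash so that the class contains a literal backslash, which lies inside the range W to c. The helper name matched_ascii is made up.

```python
import re

def matched_ascii(pattern: str) -> set:
    """Return the set of 7-bit characters that a one-character pattern matches caselessly."""
    compiled = re.compile(pattern, re.IGNORECASE | re.ASCII)
    return {chr(c) for c in range(128) if compiled.fullmatch(chr(c))}

range_set = matched_ascii(r"[W-c]")
explicit_set = matched_ascii(r"[][\\^_`wxyzabc]")

print(range_set == explicit_set)   # True
print("".join(sorted(range_set)))  # ABCWXYZ[\]^_`abcwxyz
```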
2022 The character types \d, \D, \s, \S, \w, and \W may also
2023 appear in a character class, and add the characters that
2024 they match to the class. For example, [\dABCDEF] matches any
2025 hexadecimal digit. A circumflex can conveniently be used
2026 with the upper case character types to specify a more res-
2027 tricted set of characters than the matching lower case type.
2028 For example, the class [^\W_] matches any letter or digit,
2029 but not underscore.
2031 All non-alphameric characters other than \, -, ^ (at the
2032 start) and the terminating ] are non-special in character
2033 classes, but it does no harm if they are escaped.
2038 Perl supports the POSIX notation for character classes,
2039 which uses names enclosed by [: and :] within the enclosing
2040 square brackets. PCRE also supports this notation. For exam-
2041 ple,
2043 [01[:alpha:]%]
2045 matches "0", "1", any alphabetic character, or "%". The sup-
2046 ported class names are
2048 alnum letters and digits
2049 alpha letters
2050 ascii character codes 0 - 127
2051 blank space or tab only
2052 cntrl control characters
2053 digit decimal digits (same as \d)
2054 graph printing characters, excluding space
2055 lower lower case letters
2056 print printing characters, including space
2057 punct printing characters, excluding letters and digits
2058 space white space (not quite the same as \s)
2059 upper upper case letters
2060 word "word" characters (same as \w)
2061 xdigit hexadecimal digits
2063 The "space" characters are HT (9), LF (10), VT (11), FF
2064 (12), CR (13), and space (32). Notice that this list
2065 includes the VT character (code 11). This makes "space" dif-
2066 ferent to \s, which does not include VT (for Perl compati-
2067 bility).
2069 The name "word" is a Perl extension, and "blank" is a GNU
2070 extension from Perl 5.8. Another Perl extension is negation,
2071 which is indicated by a ^ character after the colon. For
2072 example,
2074 [12[:^digit:]]
2076 matches "1", "2", or any non-digit. PCRE (and Perl) also
2077 recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
2078 "collating element", but these are not supported, and an
2079 error is given if they are encountered.
2081 In UTF-8 mode, characters with values greater than 255 do
2082 not match any of the POSIX character classes.
2087 Vertical bar characters are used to separate alternative
2088 patterns. For example, the pattern
2090 gilbert|sullivan
2092 matches either "gilbert" or "sullivan". Any number of alter-
2093 natives may appear, and an empty alternative is permitted
2094 (matching the empty string). The matching process tries
2095 each alternative in turn, from left to right, and the first
2096 one that succeeds is used. If the alternatives are within a
2097 subpattern (defined below), "succeeds" means matching the
2098 rest of the main pattern as well as the alternative in the
2099 subpattern.
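The left-to-right rule, and what "succeeds" means inside a subpattern, can be seen with two short patterns. Python's re module is used as a stand-in; it is not PCRE, but it follows the same ordered-alternation rule.

```python
import re

# At top level the first alternative that matches wins, even if a later
# one would match more.
first_wins = re.match(r"a|ab", "ab").group()
print(first_wins)    # a

# Inside a subpattern, "a" alone cannot be followed by "c", so the matcher
# goes on to the second alternative.
needs_rest = re.match(r"(a|ab)c", "abc").group(1)
print(needs_rest)    # ab
```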
2104 The settings of the PCRE_CASELESS, PCRE_MULTILINE,
2105 PCRE_DOTALL, and PCRE_EXTENDED options can be changed from
2106 within the pattern by a sequence of Perl option letters
2107 enclosed between "(?" and ")". The option letters are
2109 i for PCRE_CASELESS
2110 m for PCRE_MULTILINE
2111 s for PCRE_DOTALL
2112 x for PCRE_EXTENDED
2114 For example, (?im) sets caseless, multiline matching. It is
2115 also possible to unset these options by preceding the letter
2116 with a hyphen, and a combined setting and unsetting such as
2117 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
2118 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
2119 If a letter appears both before and after the hyphen, the
2120 option is unset.
2122 When an option change occurs at top level (that is, not
2123 inside subpattern parentheses), the change applies to the
2124 remainder of the pattern that follows. If the change is
2125 placed right at the start of a pattern, PCRE extracts it
2126 into the global options (and it will therefore show up in
2127 data extracted by the pcre_fullinfo() function).
2129 An option change within a subpattern affects only that part
2130 of the current pattern that follows it, so
2132 (a(?i)b)c
2134 matches abc and aBc and no other strings (assuming
2135 PCRE_CASELESS is not used). By this means, options can be
2136 made to have different settings in different parts of the
2137 pattern. Any changes made in one alternative do carry on
2138 into subsequent branches within the same subpattern. For
2139 example,
2141 (a(?i)b|c)
2143 matches "ab", "aB", "c", and "C", even though when matching
2144 "C" the first branch is abandoned before the option setting.
2145 This is because the effects of option settings happen at
2146 compile time. There would be some very weird behaviour oth-
2147 erwise.
2149 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
2150 be changed in the same way as the Perl-compatible options by
2151 using the characters U and X respectively. The (?X) flag
2152 setting is special in that it must always occur earlier in
2153 the pattern than any of the additional features it turns on,
2154 even when it is at top level. It is best put at the start.
2159 Subpatterns are delimited by parentheses (round brackets),
2160 which can be nested. Marking part of a pattern as a subpat-
2161 tern does two things:
2163 1. It localizes a set of alternatives. For example, the pat-
2164 tern
2166 cat(aract|erpillar|)
2168 matches one of the words "cat", "cataract", or "caterpil-
2169 lar". Without the parentheses, it would match "cataract",
2170 "erpillar" or the empty string.
2172 2. It sets up the subpattern as a capturing subpattern (as
2173 defined above). When the whole pattern matches, that por-
2174 tion of the subject string that matched the subpattern is
2175 passed back to the caller via the ovector argument of
2176 pcre_exec(). Opening parentheses are counted from left to
2177 right (starting from 1) to obtain the numbers of the captur-
2178 ing subpatterns.
2180 For example, if the string "the red king" is matched against
2181 the pattern
2183 the ((red|white) (king|queen))
2185 the captured substrings are "red king", "red", and "king",
2186 and are numbered 1, 2, and 3, respectively.
2188 The fact that plain parentheses fulfil two functions is not
2189 always helpful. There are often times when a grouping sub-
2190 pattern is required without a capturing requirement. If an
2191 opening parenthesis is followed by a question mark and a
2192 colon, the subpattern does not do any capturing, and is not
2193 counted when computing the number of any subsequent captur-
2194 ing subpatterns. For example, if the string "the white
2195 queen" is matched against the pattern
2197 the ((?:red|white) (king|queen))
2199 the captured substrings are "white queen" and "queen", and
2200 are numbered 1 and 2. The maximum number of capturing sub-
2201 patterns is 65535, and the maximum depth of nesting of all
2202 subpatterns, both capturing and non-capturing, is 200.
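The numbering of capturing and non-capturing parentheses in the two examples can be reproduced as follows. Python's re module stands in for PCRE here (it numbers groups the same way); with the C library the same substrings would come back through the ovector argument of pcre_exec().

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
print(capturing.groups())       # ('red king', 'red', 'king')

# (?: ... ) groups without capturing and is skipped when counting.
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")
print(non_capturing.groups())   # ('white queen', 'queen')
```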
2204 As a convenient shorthand, if any option settings are
2205 required at the start of a non-capturing subpattern, the
2206 option letters may appear between the "?" and the ":". Thus
2207 the two patterns
2209 (?i:saturday|sunday)
2210 (?:(?i)saturday|sunday)
2212 match exactly the same set of strings. Because alternative
2213 branches are tried from left to right, and options are not
2214 reset until the end of the subpattern is reached, an option
2215 setting in one branch does affect subsequent branches, so
2216 the above patterns match "SUNDAY" as well as "Saturday".
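A quick check of the shorthand form, using Python's re module as a stand-in engine (it accepts the (?i:...) syntax from Python 3.6 on). Only the first of the two patterns is shown.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")

# The option covers both alternatives inside the group.
print(bool(days.fullmatch("Saturday")))  # True
print(bool(days.fullmatch("SUNDAY")))    # True
print(bool(days.fullmatch("monday")))    # False
```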
2221 Identifying capturing parentheses by number is simple, but
2222 it can be very hard to keep track of the numbers in compli-
2223 cated regular expressions. Furthermore, if an expression is
2224 modified, the numbers may change. To help with the diffi-
2225 culty, PCRE supports the naming of subpatterns, something
2226 that Perl does not provide. The Python syntax (?P<name>...)
2227 is used. Names consist of alphanumeric characters and under-
2228 scores, and must be unique within a pattern.
2230 Named capturing parentheses are still allocated numbers as
2231 well as names. The PCRE API provides function calls for
2232 extracting the name-to-number translation table from a com-
2233 piled pattern. For further details see the pcreapi documen-
2234 tation.
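Since the syntax is borrowed from Python, Python's own re module gives a convenient way to see named groups and the name-to-number table side by side. The pattern and the group names below are invented for the example; with PCRE the table would be obtained through the API calls described in the pcreapi documentation, not through groupindex.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2007-02")

# A named group is still reachable by its number.
print(m.group("year"), m.group(1))    # 2007 2007
print(m.group("month"), m.group(2))   # 02 02

# Name-to-number translation table for the compiled pattern.
print(dict(date.groupindex))          # {'year': 1, 'month': 2}
```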
2239 Repetition is specified by quantifiers, which can follow any
2240 of the following items:
2242 a literal data character
2243 the . metacharacter
2244 the \C escape sequence
2245 escapes such as \d that match single characters
2246 a character class
2247 a back reference (see next section)
2248 a parenthesized subpattern (unless it is an assertion)
2250 The general repetition quantifier specifies a minimum and
2251 maximum number of permitted matches, by giving the two
2252 numbers in curly brackets (braces), separated by a comma.
2253 The numbers must be less than 65536, and the first must be
2254 less than or equal to the second. For example:
2256 z{2,4}
2258 matches "zz", "zzz", or "zzzz". A closing brace on its own
2259 is not a special character. If the second number is omitted,
2260 but the comma is present, there is no upper limit; if the
2261 second number and the comma are both omitted, the quantifier
2262 specifies an exact number of required matches. Thus
2264 [aeiou]{3,}
2266 matches at least 3 successive vowels, but may match many
2267 more, while
2269 \d{8}
2271 matches exactly 8 digits. An opening curly bracket that
2272 appears in a position where a quantifier is not allowed, or
2273 one that does not match the syntax of a quantifier, is taken
2274 as a literal character. For example, {,6} is not a quantif-
2275 ier, but a literal string of four characters.
2277 In UTF-8 mode, quantifiers apply to UTF-8 characters rather
2278 than to individual bytes. Thus, for example, \x{100}{2}
2279 matches two UTF-8 characters, each of which is represented
2280 by a two-byte sequence.
2282 The quantifier {0} is permitted, causing the expression to
2283 behave as if the previous item and the quantifier were not
2284 present.
2286 For convenience (and historical compatibility) the three
2287 most common quantifiers have single-character abbreviations:
2289 * is equivalent to {0,}
2290 + is equivalent to {1,}
2291 ? is equivalent to {0,1}
2293 It is possible to construct infinite loops by following a
2294 subpattern that can match no characters with a quantifier
2295 that has no upper limit, for example:
2297 (a?)*
2299 Earlier versions of Perl and PCRE used to give an error at
2300 compile time for such patterns. However, because there are
2301 cases where this can be useful, such patterns are now
2302 accepted, but if any repetition of the subpattern does in
2303 fact match no characters, the loop is forcibly broken.
2305 By default, the quantifiers are "greedy", that is, they
2306 match as much as possible (up to the maximum number of per-
2307 mitted times), without causing the rest of the pattern to
2308 fail. The classic example of where this gives problems is in
2309 trying to match comments in C programs. These appear between
2310 the sequences /* and */ and within the sequence, individual
2311 * and / characters may appear. An attempt to match C com-
2312 ments by applying the pattern
2314 /\*.*\*/
2316 to the string
2318 /* first comment */ not comment /* second comment */
2320 fails, because it matches the entire string owing to the
2321 greediness of the .* item.
2323 However, if a quantifier is followed by a question mark, it
2324 ceases to be greedy, and instead matches the minimum number
2325 of times possible, so the pattern
2327 /\*.*?\*/
2329 does the right thing with the C comments. The meaning of the
2330 various quantifiers is not otherwise changed, just the pre-
2331 ferred number of matches. Do not confuse this use of ques-
2332 tion mark with its use as a quantifier in its own right.
2333 Because it has two uses, it can sometimes appear doubled, as
2334 in
2336 \d??\d
2338 which matches one digit by preference, but can match two if
2339 that is the only way the rest of the pattern matches.
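The greedy and minimizing forms can be compared on the C comment example. Python's re module is used as a stand-in engine; it treats a question mark after a quantifier in the same way.

```python
import re

source = "/* first comment */ not comment /* second comment */"

# Greedy: .* runs to the last */ in the string.
greedy = re.search(r"/\*.*\*/", source).group()
print(greedy == source)   # True

# Minimizing: .*? stops at the first */ it can.
lazy = re.findall(r"/\*.*?\*/", source)
print(lazy)               # ['/* first comment */', '/* second comment */']

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
print(re.match(r"\d??\d", "12").group())      # 1
print(re.fullmatch(r"\d??\d", "12").group())  # 12
```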
2341 If the PCRE_UNGREEDY option is set (an option which is not
2342 available in Perl), the quantifiers are not greedy by
2343 default, but individual ones can be made greedy by following
2344 them with a question mark. In other words, it inverts the
2345 default behaviour.
2347 When a parenthesized subpattern is quantified with a minimum
2348 repeat count that is greater than 1 or with a limited max-
2349 imum, more store is required for the compiled pattern, in
2350 proportion to the size of the minimum or maximum.
2351 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
2352 option (equivalent to Perl's /s) is set, thus allowing the .
2353 to match newlines, the pattern is implicitly anchored,
2354 because whatever follows will be tried against every charac-
2355 ter position in the subject string, so there is no point in
2356 retrying the overall match at any position after the first.
2357 PCRE normally treats such a pattern as though it were pre-
2358 ceded by \A.
2360 In cases where it is known that the subject string contains
2361 no newlines, it is worth setting PCRE_DOTALL in order to
2362 obtain this optimization, or alternatively using ^ to indi-
2363 cate anchoring explicitly.
2365 However, there is one situation where the optimization can-
2366 not be used. When .* is inside capturing parentheses that
2367 are the subject of a backreference elsewhere in the pattern,
2368 a match at the start may fail, and a later one succeed. Con-
2369 sider, for example:
2371 (.*)abc\1
2373 If the subject is "xyz123abc123" the match point is the
2374 fourth character. For this reason, such a pattern is not
2375 implicitly anchored.
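The back reference example can be run to confirm where the match starts. Python's re module is used as a stand-in engine; offsets are counted from 0, so offset 3 is the fourth character.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# No match is possible at offsets 0 to 2, because whatever (.*) captures
# there cannot be repeated after "abc".
print(m.start())    # 3
print(m.group())    # 123abc123
print(m.group(1))   # 123
```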

When a capturing subpattern is repeated, the value captured is
the substring that matched the final iteration. For example,
after

    (tweedle[dume]{3}\s*)+

has matched "tweedledum tweedledee" the value of the captured
substring is "tweedledee". However, if there are nested
capturing subpatterns, the corresponding captured values may
have been set in previous iterations. For example, after

    /(a|(b))+/

matches "aba" the value of the second captured substring is "b".


ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

With both maximizing and minimizing repetition, failure of what
follows normally causes the repeated item to be re-evaluated to
see if a different number of repeats allows the rest of the
pattern to match. Sometimes it is useful to prevent this, either
to change the nature of the match, or to cause it to fail
earlier than it otherwise might, when the author of the pattern
knows there is no point in carrying on.

Consider, for example, the pattern \d+foo when applied to the
subject line

    123456bar

After matching all 6 digits and then failing to match "foo", the
normal action of the matcher is to try again with only 5 digits
matching the \d+ item, and then with 4, and so on, before
ultimately failing. "Atomic grouping" (a term taken from Jeffrey
Friedl's book) provides the means for specifying that once a
subpattern has matched, it is not to be re-evaluated in this
way.

If we use atomic grouping for the previous example, the matcher
would give up immediately on failing to match "foo" the first
time. The notation is a kind of special parenthesis, starting
with (?> as in this example:

    (?>\d+)foo

This kind of parenthesis "locks up" the part of the pattern it
contains once it has matched, and a failure further into the
pattern is prevented from backtracking into it. Backtracking
past it to previous items, however, works as normal.

An alternative description is that a subpattern of this type
matches the string of characters that an identical standalone
pattern would match, if anchored at the current point in the
subject string.

Atomic grouping subpatterns are not capturing subpatterns.
Simple cases such as the above example can be thought of as a
maximizing repeat that must swallow everything it can. So, while
both \d+ and \d+? are prepared to adjust the number of digits
they match in order to make the rest of the pattern match,
(?>\d+) can only match an entire sequence of digits.
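
The difference in the number of attempts can be made concrete
with a small hand-written matcher. The sketch below is not how
PCRE is implemented; it only mimics the order of attempts
described above for \d+foo and (?>\d+)foo, anchored at the start
of the subject. The function name and its parameters are
hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hand-written stand-in for \d+foo (atomic == 0) or (?>\d+)foo
 * (atomic != 0), tried at the start of subject only.
 * Returns 1 on a match, 0 otherwise. *attempts receives the number
 * of times "foo" was tried after the digits. */
static int match_digits_foo(const char *subject, int atomic, int *attempts)
{
    size_t digits = 0;

    *attempts = 0;
    while (isdigit((unsigned char)subject[digits]))
        digits++;                   /* greedy: take every digit */

    while (digits >= 1) {           /* \d+ needs at least one digit */
        (*attempts)++;
        if (strncmp(subject + digits, "foo", 3) == 0)
            return 1;
        if (atomic)
            break;                  /* (?>...) never gives digits back */
        digits--;                   /* backtrack: give one digit back */
    }
    return 0;
}
```

For the subject "123456bar" the non-atomic form tries "foo" six
times before failing, and the atomic form tries it once.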

Atomic groups in general can of course contain arbitrarily
complicated subpatterns, and can be nested. However, when the
subpattern for an atomic group is just a single repeated item,
as in the example above, a simpler notation, called a
"possessive quantifier" can be used. This consists of an
additional + character following a quantifier. Using this
notation, the previous example can be rewritten as

    \d++foo

Possessive quantifiers are always greedy; the setting of the
PCRE_UNGREEDY option is ignored. They are a convenient notation
for the simpler forms of atomic group. However, there is no
difference in the meaning or processing of a possessive
quantifier and the equivalent atomic group.

The possessive quantifier syntax is an extension to the Perl
syntax. It originates in Sun's Java package.

When a pattern contains an unlimited repeat inside a subpattern
that can itself be repeated an unlimited number of times, the
use of an atomic group is the only way to avoid some failing
matches taking a very long time indeed. The pattern

    (\D+|<\d+>)*[!?]

matches an unlimited number of substrings that either consist of
non-digits, or digits enclosed in <>, followed by either ! or ?.
When it matches, it runs quickly. However, if it is applied to

    aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa

it takes a long time before reporting failure. This is because
the string can be divided between the two repeats in a large
number of ways, and all have to be tried. (The example uses [!?]
rather than a single character at the end, because both PCRE and
Perl have an optimization that allows for fast failure when a
single character is used. They remember the last single
character that is required for a match, and fail early if it is
not present in the string.) If the pattern is changed to

    ((?>\D+)|<\d+>)*[!?]

sequences of non-digits cannot be broken, and failure happens
quickly.


BACK REFERENCES

Outside a character class, a backslash followed by a digit
greater than 0 (and possibly further digits) is a back reference
to a capturing subpattern earlier (that is, to its left) in the
pattern, provided there have been that many previous capturing
left parentheses.

However, if the decimal number following the backslash is less
than 10, it is always taken as a back reference, and causes an
error only if there are not that many capturing left parentheses
in the entire pattern. In other words, the parentheses that are
referenced need not be to the left of the reference for numbers
less than 10. See the section entitled "Backslash" above for
further details of the handling of digits following a backslash.

A back reference matches whatever actually matched the capturing
subpattern in the current subject string, rather than anything
matching the subpattern itself (see "Subpatterns as subroutines"
below for a way of doing that). So the pattern

    (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and
responsibility", but not "sense and responsibility". If caseful
matching is in force at the time of the back reference, the case
of letters is relevant. For example,

    ((?i)rah)\s+\1

matches "rah rah" and "RAH RAH", but not "RAH rah", even though
the original capturing subpattern is matched caselessly.

Back references to named subpatterns use the Python syntax
(?P=name). We could rewrite the above example as follows:

    (?P<p1>(?i)rah)\s+(?P=p1)

There may be more than one back reference to the same
subpattern. If a subpattern has not actually been used in a
particular match, any back references to it always fail. For
example, the pattern

    (a|(bc))\2

always fails if it starts to match "a" rather than "bc". Because
there may be many capturing parentheses in a pattern, all digits
following the backslash are taken as part of a potential back
reference number. If the pattern continues with a digit
character, some delimiter must be used to terminate the back
reference. If the PCRE_EXTENDED option is set, this can be
whitespace. Otherwise an empty comment can be used.

A back reference that occurs inside the parentheses to which it
refers fails when the subpattern is first used, so, for example,
(a\1) never matches. However, such references can be useful
inside repeated subpatterns. For example, the pattern

    (a|b\1)+

matches any number of "a"s and also "aba", "ababbaa" etc. At
each iteration of the subpattern, the back reference matches the
character string corresponding to the previous iteration. In
order for this to work, the pattern must be such that the first
iteration does not need to match the back reference. This can be
done using alternation, as in the example above, or by a
quantifier with a minimum of zero.
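
The way each iteration reuses the text of the previous one can
be followed in a small hand-written matcher. This is only a
sketch of the behaviour described above for (a|b\1)+, applied to
a whole string; it is not PCRE code, and the function names are
hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* One iteration of the group in (a|b\1)+, tried at s[pos].
 * prev and prev_len describe what the group captured on the previous
 * iteration; prev_len is 0 while the group is still unset, which
 * makes the b\1 branch fail. Returns 1 if the rest of s can be
 * consumed by this and further iterations. */
static int iterate(const char *s, size_t pos, size_t prev, size_t prev_len)
{
    /* Branch 1: a literal "a". */
    if (s[pos] == 'a') {
        if (s[pos + 1] == '\0' || iterate(s, pos + 1, pos, 1))
            return 1;
    }
    /* Branch 2: "b" followed by the previous iteration's text. */
    if (prev_len > 0 && s[pos] == 'b' &&
        strncmp(s + pos + 1, s + prev, prev_len) == 0) {
        size_t next = pos + 1 + prev_len;
        if (s[next] == '\0' || iterate(s, next, pos, 1 + prev_len))
            return 1;
    }
    return 0;
}

/* Returns 1 if the whole of s is matched by (a|b\1)+, else 0. */
static int matches_whole(const char *s)
{
    return iterate(s, 0, 0, 0);
}
```

For "ababbaa" the successive iterations capture "a", "ba", "bba"
and "a". A string such as "ba" is rejected because the first
iteration has no previous text to refer to.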


ASSERTIONS

An assertion is a test on the characters following or preceding
the current matching point that does not actually consume any
characters. The simple assertions coded as \b, \B, \A, \G, \Z,
\z, ^ and $ are described above. More complicated assertions are
coded as subpatterns. There are two kinds: those that look ahead
of the current position in the subject string, and those that
look behind it.

An assertion subpattern is matched in the normal way, except
that it does not cause the current matching position to be
changed. Lookahead assertions start with (?= for positive
assertions and (?! for negative assertions. For example,

    \w+(?=;)

matches a word followed by a semicolon, but does not include the
semicolon in the match, and

    foo(?!bar)

matches any occurrence of "foo" that is not followed by "bar".
Note that the apparently similar pattern

    (?!foo)bar

does not find an occurrence of "bar" that is preceded by
something other than "foo"; it finds any occurrence of "bar"
whatsoever, because the assertion (?!foo) is always true when
the next three characters are "bar". A lookbehind assertion is
needed to achieve this effect.

If you want to force a matching failure at some point in a
pattern, the most convenient way to do it is with (?!) because
an empty string always matches, so an assertion that requires
there not to be an empty string must always fail.

Lookbehind assertions start with (?<= for positive assertions
and (?<! for negative assertions. For example,

    (?<!foo)bar

does find an occurrence of "bar" that is not preceded by "foo".
The contents of a lookbehind assertion are restricted such that
all the strings it matches must have a fixed length. However, if
there are several alternatives, they do not all have to have the
same fixed length. Thus

    (?<=bullock|donkey)

is permitted, but

    (?<!dogs?|cats?)

causes an error at compile time. Branches that match different
length strings are permitted only at the top level of a
lookbehind assertion. This is an extension compared with Perl
(at least for 5.8), which requires all branches to match the
same length of string. An assertion such as

    (?<=ab(c|de))

is not permitted, because its single top-level branch can match
two different lengths, but it is acceptable if rewritten to use
two top-level branches:

    (?<=abc|abde)

The implementation of lookbehind assertions is, for each
alternative, to temporarily move the current position back by
the fixed width and then try to match. If there are insufficient
characters before the current position, the match is deemed to
fail.
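
The step-back procedure just described can be sketched in a few
lines of C. This illustrates the idea for (?<=bullock|donkey)
only; it is not taken from the PCRE source, and the function
name is hypothetical. The caller is assumed to pass a pos that
is no greater than the length of the subject.

```c
#include <stddef.h>
#include <string.h>

/* Sketch of (?<=bullock|donkey) tested at offset pos of subject:
 * for each alternative, step back by that alternative's fixed width
 * and compare. Returns 1 if the assertion is true at pos, else 0. */
static int lookbehind_at(const char *subject, size_t pos)
{
    static const char *const alternatives[] = { "bullock", "donkey" };
    size_t i;

    for (i = 0; i < sizeof alternatives / sizeof alternatives[0]; i++) {
        size_t width = strlen(alternatives[i]);
        if (width > pos)
            continue;   /* too few characters before pos: branch fails */
        if (memcmp(subject + pos - width, alternatives[i], width) == 0)
            return 1;
    }
    return 0;
}
```

Because every alternative has a known width, each one costs a
single comparison at a known offset. That is the reason for the
fixed-length restriction.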

PCRE does not allow the \C escape (which matches a single byte
in UTF-8 mode) to appear in lookbehind assertions, because it
makes it impossible to calculate the length of the lookbehind.

Atomic groups can be used in conjunction with lookbehind
assertions to specify efficient matching at the end of the
subject string. Consider a simple pattern such as

    abcd$

when applied to a long string that does not match. Because
matching proceeds from left to right, PCRE will look for each
"a" in the subject and then see if what follows matches the rest
of the pattern. If the pattern is specified as

    ^.*abcd$

the initial .* matches the entire string at first, but when this
fails (because there is no following "a"), it backtracks to
match all but the last character, then all but the last two
characters, and so on. Once again the search for "a" covers the
entire string, from right to left, so we are no better off.
However, if the pattern is written as

    ^(?>.*)(?<=abcd)

or, equivalently,

    ^.*+(?<=abcd)

there can be no backtracking for the .* item; it can match only
the entire string. The subsequent lookbehind assertion does a
single test on the last four characters. If it fails, the match
fails immediately. For long strings, this approach makes a
significant difference to the processing time.

Several assertions (of any sort) may occur in succession. For
example,

    (?<=\d{3})(?<!999)foo

matches "foo" preceded by three digits that are not "999".
Notice that each of the assertions is applied independently at
the same point in the subject string. First there is a check
that the previous three characters are all digits, and then
there is a check that the same three characters are not "999".
This pattern does not match "foo" preceded by six characters,
the first three of which are digits and the last three of which
are not "999". For example, it doesn't match "123abcfoo". A
pattern to do that is

    (?<=\d{3}...)(?<!999)foo

This time the first assertion looks at the preceding six
characters, checking that the first three are digits, and then
the second assertion checks that the preceding three characters
are not "999".
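
To make the point that both assertions in (?<=\d{3})(?<!999)foo
inspect the same three characters, here is a hand-written
equivalent in C. It is a sketch for illustration only, and the
function name is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hand-written equivalent of (?<=\d{3})(?<!999)foo. Returns the
 * offset of the first "foo" that passes both assertions, or -1 if
 * there is none. */
static long find_foo(const char *subject)
{
    const char *p = subject;

    while ((p = strstr(p, "foo")) != NULL) {
        size_t pos = (size_t)(p - subject);
        if (pos >= 3 &&
            isdigit((unsigned char)p[-3]) &&
            isdigit((unsigned char)p[-2]) &&
            isdigit((unsigned char)p[-1]) &&   /* (?<=\d{3}) */
            memcmp(p - 3, "999", 3) != 0)      /* (?<!999)   */
            return (long)pos;
        p++;                                   /* try the next "foo" */
    }
    return -1;
}
```

With this function, "123foo" is found at offset 3, while
"999foo" and "123abcfoo" are both rejected.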

Assertions can be nested in any combination. For example,

    (?<=(?<!foo)bar)baz

matches an occurrence of "baz" that is preceded by "bar" which
in turn is not preceded by "foo", while

    (?<=\d{3}(?!999)...)foo

is another pattern which matches "foo" preceded by three digits
and any three characters that are not "999".

Assertion subpatterns are not capturing subpatterns, and may not
be repeated, because it makes no sense to assert the same thing
several times. If any kind of assertion contains capturing
subpatterns within it, these are counted for the purposes of
numbering the capturing subpatterns in the whole pattern.
However, substring capturing is carried out only for positive
assertions, because it does not make sense for negative
assertions.


CONDITIONAL SUBPATTERNS

It is possible to cause the matching process to obey a
subpattern conditionally or to choose between two alternative
subpatterns, depending on the result of an assertion, or whether
a previous capturing subpattern matched or not. The two possible
forms of conditional subpattern are

    (?(condition)yes-pattern)
    (?(condition)yes-pattern|no-pattern)

If the condition is satisfied, the yes-pattern is used;
otherwise the no-pattern (if present) is used. If there are more
than two alternatives in the subpattern, a compile-time error
occurs.

There are three kinds of condition. If the text between the
parentheses consists of a sequence of digits, the condition is
satisfied if the capturing subpattern of that number has
previously matched. The number must be greater than zero.
Consider the following pattern, which contains non-significant
white space to make it more readable (assume the PCRE_EXTENDED
option) and to divide it into three parts for ease of
discussion:

    ( \( )?    [^()]+    (?(1) \) )

The first part matches an optional opening parenthesis, and if
that character is present, sets it as the first captured
substring. The second part matches one or more characters that
are not parentheses. The third part is a conditional subpattern
that tests whether the first set of parentheses matched or not.
If they did, that is, if the subject started with an opening
parenthesis, the condition is true, and so the yes-pattern is
executed and a closing parenthesis is required. Otherwise, since
no-pattern is not present, the subpattern matches nothing. In
other words, this pattern matches a sequence of non-parentheses,
optionally enclosed in parentheses.
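
The three parts of that pattern map directly onto three steps of
a hand-written matcher. The sketch below applies the pattern to
a whole string, as if it were wrapped in ^ and $. That anchoring
is an assumption made to keep the example short, and the
function name is hypothetical.

```c
#include <string.h>

/* Hand-written equivalent of ( \( )? [^()]+ (?(1) \) ) applied to
 * the whole of s. Returns 1 on a match, else 0. */
static int optionally_parenthesized(const char *s)
{
    int opened = (*s == '(');   /* part 1: did group 1 match? */
    size_t run;

    if (opened)
        s++;
    run = strcspn(s, "()");     /* part 2: [^()]+ */
    if (run == 0)
        return 0;
    s += run;
    if (opened) {               /* part 3: condition (1) is true */
        if (*s != ')')
            return 0;           /* yes-pattern requires ")" */
        s++;
    }                           /* condition false: matches nothing */
    return *s == '\0';
}
```

The strings "abc" and "(abc)" are accepted. "(abc" and "abc)"
are not, because the closing parenthesis is required exactly
when the opening one was present.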

If the condition is the string (R), it is satisfied if a
recursive call to the pattern or subpattern has been made. At
"top level", the condition is false. This is a PCRE extension.
Recursive patterns are described in the next section.

If the condition is not a sequence of digits or (R), it must be
an assertion. This may be a positive or negative lookahead or
lookbehind assertion. Consider this pattern, again containing
non-significant white space, and with the two alternatives on
the second line:

    (?(?=[^a-z]*[a-z])
    \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

The condition is a positive lookahead assertion that matches an
optional sequence of non-letters followed by a letter. In other
words, it tests for the presence of at least one letter in the
subject. If a letter is found, the subject is matched against
the first alternative; otherwise it is matched against the
second. This pattern matches strings in one of the two forms
dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.


COMMENTS

The sequence (?# marks the start of a comment which continues up
to the next closing parenthesis. Nested parentheses are not
permitted. The characters that make up a comment play no part in
the pattern matching at all.

If the PCRE_EXTENDED option is set, an unescaped # character
outside a character class introduces a comment that continues up
to the next newline character in the pattern.


RECURSIVE PATTERNS

Consider the problem of matching a string in parentheses,
allowing for unlimited nested parentheses. Without the use of
recursion, the best that can be done is to use a pattern that
matches up to some fixed depth of nesting. It is not possible to
handle an arbitrary nesting depth. Perl has provided an
experimental facility that allows regular expressions to recurse
(amongst other things). It does this by interpolating Perl code
in the expression at run time, and the code can refer to the
expression itself. A Perl pattern to solve the parentheses
problem can be created like this:

    $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;

The (?p{...}) item interpolates Perl code at run time, and in
this case refers recursively to the pattern in which it appears.
Obviously, PCRE cannot support the interpolation of Perl code.
Instead, it supports some special syntax for recursion of the
entire pattern, and also for individual subpattern recursion.

The special item that consists of (? followed by a number
greater than zero and a closing parenthesis is a recursive call
of the subpattern of the given number, provided that it occurs
inside that subpattern. (If not, it is a "subroutine" call,
which is described in the next section.) The special item (?R)
is a recursive call of the entire regular expression.

For example, this PCRE pattern solves the nested parentheses
problem (assume the PCRE_EXTENDED option is set so that white
space is ignored):

    \( ( (?>[^()]+) | (?R) )* \)

First it matches an opening parenthesis. Then it matches any
number of substrings which can either be a sequence of
non-parentheses, or a recursive match of the pattern itself
(that is, a correctly parenthesized substring). Finally there is
a closing parenthesis.
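
The structure of that pattern is the same as that of a small
recursive function, which may make the role of (?R) easier to
see. The following is a sketch in plain C, not PCRE code, and
the function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Recursive-descent stand-in for \( ( (?>[^()]+) | (?R) )* \).
 * Returns a pointer just past the balanced group that starts at p,
 * or NULL if p does not start one. */
static const char *match_parens(const char *p)
{
    if (*p != '(')
        return NULL;
    p++;
    for (;;) {
        size_t run = strcspn(p, "()");  /* (?>[^()]+): whole run */
        if (run > 0) {
            p += run;
        } else if (*p == '(') {         /* (?R): nested group */
            p = match_parens(p);
            if (p == NULL)
                return NULL;
        } else {
            break;
        }
    }
    return (*p == ')') ? p + 1 : NULL;
}
```

Applied to "(ab(cd)ef)" the function consumes all ten
characters. Applied to "(aaa()" it returns NULL, because the
outer group is never closed.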

If this were part of a larger pattern, you would not want to
recurse the entire pattern, so instead you could use this:

    ( \( ( (?>[^()]+) | (?1) )* \) )

We have put the pattern into parentheses, and caused the
recursion to refer to them instead of the whole pattern. In a
larger pattern, keeping track of parenthesis numbers can be
tricky. It may be more convenient to use named parentheses
instead. For this, PCRE uses (?P>name), which is an extension to
the Python syntax that PCRE uses for named parentheses (Perl
does not provide named parentheses). We could rewrite the above
example as follows:

    (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )

This particular example pattern contains nested unlimited
repeats, and so the use of atomic grouping for matching strings
of non-parentheses is important when applying the pattern to
strings that do not match. For example, when this pattern is
applied to

    (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()

it yields "no match" quickly. However, if atomic grouping is not
used, the match runs for a very long time indeed because there
are so many different ways the + and * repeats can carve up the
subject, and all have to be tested before failure can be
reported.

At the end of a match, the values set for any capturing
subpatterns are those from the outermost level of the recursion
at which the subpattern value is set. If you want to obtain
intermediate values, a callout function can be used (see below
and the pcrecallout documentation). If the first recursive
pattern above, \( ( (?>[^()]+) | (?R) )* \), is matched against

    (ab(cd)ef)

the value for the capturing parentheses is "ef", which is the
last value taken on at the top level. If additional parentheses
are added, giving

    \( ( ( (?>[^()]+) | (?R) )* ) \)
       ^                        ^
       ^                        ^

the string they capture is "ab(cd)ef", the contents of the top
level parentheses. If there are more than 15 capturing
parentheses in a pattern, PCRE has to obtain extra memory to
store data during a recursion, which it does by using
pcre_malloc, freeing it via pcre_free afterwards. If no memory
can be obtained, the match fails with an out-of-memory error.

Do not confuse the (?R) item with the condition (R), which tests
for recursion. Consider this pattern, which matches text in
angle brackets, allowing for arbitrary nesting. Only digits are
allowed in nested brackets (that is, when recursing), whereas
any characters are permitted at the outer level.

    < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >

In this pattern, (?(R) is the start of a conditional subpattern,
with two different alternatives for the recursive and
non-recursive cases. The (?R) item is the actual recursive call.


SUBPATTERNS AS SUBROUTINES

If the syntax for a recursive subpattern reference (either by
number or by name) is used outside the parentheses to which it
refers, it operates like a subroutine in a programming language.
An earlier example pointed out that the pattern

    (sens|respons)e and \1ibility

matches "sense and sensibility" and "response and
responsibility", but not "sense and responsibility". If instead
the pattern

    (sens|respons)e and (?1)ibility

is used, it does match "sense and responsibility" as well as the
other two strings. Such references must, however, follow the
subpattern to which they refer.
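
The distinction between the two forms is that \1 repeats the
text that group 1 matched, whereas (?1) runs the subpattern
again. A hand-written comparison in C, applied to a whole
string, might look like this. It is a sketch only, and the
function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns the length of the alternative of (sens|respons) that s
 * starts with, or 0 if it starts with neither. */
static size_t stem_length(const char *s)
{
    static const char *const stems[] = { "sens", "respons" };
    size_t i;

    for (i = 0; i < 2; i++) {
        size_t n = strlen(stems[i]);
        if (strncmp(s, stems[i], n) == 0)
            return n;
    }
    return 0;
}

/* as_subroutine == 0: (sens|respons)e and \1ibility
 * as_subroutine != 0: (sens|respons)e and (?1)ibility
 * Returns 1 if the whole of s matches, else 0. */
static int matches(const char *s, int as_subroutine)
{
    size_t first = stem_length(s);
    size_t second;
    const char *rest;

    if (first == 0 || strncmp(s + first, "e and ", 6) != 0)
        return 0;
    rest = s + first + 6;
    if (as_subroutine) {
        second = stem_length(rest);     /* (?1): run the subpattern */
        if (second == 0)
            return 0;
    } else {
        if (strncmp(rest, s, first) != 0)
            return 0;                   /* \1: same text again */
        second = first;
    }
    return strcmp(rest + second, "ibility") == 0;
}
```

Only the subroutine form accepts "sense and responsibility".
Both forms accept the other two strings.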


CALLOUTS

Perl has a feature whereby using the sequence (?{...}) causes
arbitrary Perl code to be obeyed in the middle of matching a
regular expression. This makes it possible, amongst other
things, to extract different substrings that match the same pair
of parentheses when there is a repetition.

PCRE provides a similar feature, but of course it cannot obey
arbitrary Perl code. The feature is called "callout". The caller
of PCRE provides an external function by putting its entry point
in the global variable pcre_callout. By default, this variable
contains NULL, which disables all calling out.

Within a regular expression, (?C) indicates the points at which
the external function is to be called. If you want to identify
different callout points, you can put a number less than 256
after the letter C. The default value is zero. For example, this
pattern has two callout points:

    (?C1)9abc(?C2)def

During matching, when PCRE reaches a callout point (and
pcre_callout is set), the external function is called. It is
provided with the number of the callout, and, optionally, one
item of data originally supplied by the caller of pcre_exec().
The callout function may cause matching to backtrack, or to fail
altogether. A complete description of the interface to the
callout function is given in the pcrecallout documentation.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions


PCRE PERFORMANCE

Certain items that may appear in regular expression patterns are
more efficient than others. It is more efficient to use a
character class like [aeiou] than a set of alternatives such as
(a|e|i|o|u). In general, the simplest construction that provides
the required behaviour is usually the most efficient. Jeffrey
Friedl's book contains a lot of discussion about optimizing
regular expressions for efficient performance.

When a pattern begins with .* not in parentheses, or in
parentheses that are not the subject of a backreference, and the
PCRE_DOTALL option is set, the pattern is implicitly anchored by
PCRE, since it can match only at the start of a subject string.
However, if PCRE_DOTALL is not set, PCRE cannot make this
optimization, because the . metacharacter does not then match a
newline, and if the subject string contains newlines, the
pattern may match from the character immediately following one
of them instead of from the very start. For example, the pattern

    .*second

matches the subject "first\nand second" (where \n stands for a
newline character), with the match starting at the seventh
character. In order to do this, PCRE has to retry the match
starting after every newline in the subject.

If you are using such a pattern with subject strings that do not
contain newlines, the best performance is obtained by setting
PCRE_DOTALL, or starting the pattern with ^.* to indicate
explicit anchoring. That saves PCRE from having to scan along
the subject looking for a newline to restart at.

Beware of patterns that contain nested indefinite repeats. These
can take a long time to run when applied to a string that does
not match. Consider the pattern fragment

    (a+)*

This can match "aaaa" in 16 different ways, and this number
increases very rapidly as the string gets longer. (The * repeat
can match 0, 1, 2, 3, or 4 times, and for each of those cases
other than 0, the + repeats can match different numbers of
times. Counting the ways of cutting each prefix of the string,
including the empty one, into non-empty pieces gives
1+1+2+4+8 = 16.) When the remainder of the pattern is such that
the entire match is going to fail, PCRE has in principle to try
every possible variation, and this can take an extremely long
time.
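
The growth in the number of possibilities can be checked with a
short counting routine. The sketch below does no matching. It
counts the ways in which (a+)* can match a prefix of a run of n
"a" characters, with one non-empty piece per iteration of the
group. The function names are hypothetical.

```c
/* Number of ways to cut a run of len "a" characters into non-empty
 * pieces, one piece per iteration of the (a+) group. */
static unsigned long ways_for_length(unsigned len)
{
    unsigned long total = 0;
    unsigned first;

    if (len == 0)
        return 1;                       /* zero iterations */
    for (first = 1; first <= len; first++)
        total += ways_for_length(len - first);  /* first piece + rest */
    return total;
}

/* Number of ways (a+)* can match at the start of a run of n "a"
 * characters; the match may stop after any prefix. */
static unsigned long count_ways(unsigned n)
{
    unsigned long total = 0;
    unsigned len;

    for (len = 0; len <= n; len++)
        total += ways_for_length(len);
    return total;
}
```

The count doubles with each additional character, so a failing
match against a run of a few dozen characters already has an
impractical number of variations to reject.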

An optimization catches some of the more simple cases such as

    (a+)*b

where a literal character follows. Before embarking on the
standard matching procedure, PCRE checks that there is a "b"
later in the subject string, and if there is not, it fails the
match immediately. However, when there is no following literal
this optimization cannot be used. You can see the difference by
comparing the behaviour of

    (a+)*\d

with the pattern above. When applied to a whole line of "a"
characters, (a+)*b gives a failure almost instantly, whereas
(a+)*\d takes an appreciable time with strings longer than about
20 characters.

Last updated: 03 February 2003
Copyright (c) 1997-2003 University of Cambridge.
-----------------------------------------------------------------------------

NAME
PCRE - Perl-compatible regular expressions.


SYNOPSIS OF POSIX API

    #include <pcreposix.h>

    int regcomp(regex_t *preg, const char *pattern,
         int cflags);

    int regexec(regex_t *preg, const char *string,
         size_t nmatch, regmatch_t pmatch[], int eflags);

    size_t regerror(int errcode, const regex_t *preg,
         char *errbuf, size_t errbuf_size);

    void regfree(regex_t *preg);


DESCRIPTION

This set of functions provides a POSIX-style API to the PCRE
regular expression package. See the pcreapi documentation for a
description of the native API, which contains additional
functionality.

The functions described here are just wrapper functions that
ultimately call the PCRE native API. Their prototypes are
defined in the pcreposix.h header file, and on Unix systems the
library itself is called pcreposix.a, so can be accessed by
adding -lpcreposix to the command for linking an application
which uses them. Because the POSIX functions call the native
ones, it is also necessary to add -lpcre.

I have implemented only those option bits that can be reasonably
mapped to PCRE native options. In addition, the options
REG_EXTENDED and REG_NOSUB are defined with the value zero. They
have no effect, but since programs that are written to the POSIX
interface often use them, this makes it easier to slot in PCRE
as a replacement library. Other POSIX options are not even
defined.

When PCRE is called via these functions, it is only the API that
is POSIX-like in style. The syntax and semantics of the regular
expressions themselves are still those of Perl, subject to the
setting of various PCRE options, as described below.

The header for these functions is supplied as pcreposix.h to
avoid any potential clash with other POSIX libraries. It can, of
course, be renamed or aliased as regex.h, which is the "correct"
name. It provides two structure types, regex_t for compiled
internal forms, and regmatch_t for returning captured
substrings. It also defines some constants whose names start
with "REG_"; these are used for setting options and identifying
error codes.


COMPILING A PATTERN

The function regcomp() is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary
zero, and is passed in the argument pattern. The preg argument
is a pointer to a regex_t structure which is used as a base for
storing information about the compiled expression.

The argument cflags is either zero, or contains one or more of
the bits defined by the following macros:

    REG_ICASE

The PCRE_CASELESS option is set when the expression is passed
for compilation to the native function.

    REG_NEWLINE

The PCRE_MULTILINE option is set when the expression is passed
for compilation to the native function. Note that this does not
mimic the defined POSIX behaviour for REG_NEWLINE (see the
following section).

In the absence of these flags, no options are passed to the
native function. This means the regex is compiled with PCRE
default semantics. In particular, the way it handles newline
characters in the subject string is the Perl way, not the POSIX
way. Note that setting PCRE_MULTILINE has only some of the
effects specified for REG_NEWLINE. It does not affect the way
newlines are matched by . (they aren't) or by a negative class
such as [^a] (they are).

The yield of regcomp() is zero on success, and non-zero
otherwise. The preg structure is filled in on success, and one
member of the structure is public: re_nsub contains the number
of capturing subpatterns in the regular expression. Various
error codes are defined in the header file.
3145 This area is not simple, because POSIX and Perl take dif-
3146 ferent views of things. It is not possible to get PCRE to
3147 obey POSIX semantics, but then PCRE was never intended to be
3148 a POSIX engine. The following table lists the different pos-
3149 sibilities for matching newline characters in PCRE:
3151 Default Change with
3153 . matches newline no PCRE_DOTALL
3154 newline matches [^a] yes not changeable
3155 $ matches \n at end yes PCRE_DOLLARENDONLY
3156 $ matches \n in middle no PCRE_MULTILINE
3157 ^ matches \n in middle no PCRE_MULTILINE
3159 This is the equivalent table for POSIX:
3161 Default Change with
3163 . matches newline yes REG_NEWLINE
3164 newline matches [^a] yes REG_NEWLINE
3165 $ matches \n at end no REG_NEWLINE
3166 $ matches \n in middle no REG_NEWLINE
3167 ^ matches \n in middle no REG_NEWLINE
3169 PCRE's behaviour is the same as Perl's, except that there is
3170 no equivalent for PCRE_DOLLARENDONLY in Perl. In both PCRE
3171 and Perl, there is no way to stop newline from matching
3172 [^a].
3174 The default POSIX newline handling can be obtained by set-
3175 ting PCRE_DOTALL and PCRE_DOLLARENDONLY, but there is no way
3176 to make PCRE behave exactly as for the REG_NEWLINE action.
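The POSIX table can be probed with a small test helper. This is a sketch only: it is built against the system's own <regex.h>, not against PCRE's wrapper, so it shows the POSIX column of the comparison. The helper name posix_match is invented for illustration. Under the PCRE wrapper, the first table would be expected to govern the results instead.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if pattern matches somewhere in subject,
   0 if it does not, and -1 if the pattern fails to compile.
   cflags is normally 0 or REG_NEWLINE. */
int posix_match(const char *pattern, const char *subject, int cflags)
{
    regex_t re;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB | cflags) != 0)
        return -1;

    int rc = regexec(&re, subject, 0, NULL, 0);
    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

For example, posix_match("a.b", "a\nb", 0) asks whether . matches a newline by default, and the same call with REG_NEWLINE asks whether the flag changes the answer. Patterns such as "a[^x]b", "^b" and "a$" probe the other rows in the same way.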
3181 The function regexec() is called to match a pre-compiled
3182 pattern preg against a given string, which is terminated by
3183 a zero byte, subject to the options in eflags. These can be:
3187 REG_NOTBOL: The PCRE_NOTBOL option is set when calling the
3188 underlying PCRE matching function.
3192 REG_NOTEOL: The PCRE_NOTEOL option is set when calling the
3193 underlying PCRE matching function.
3195 The portion of the string that was matched, and also any
3196 captured substrings, are returned via the pmatch argument,
3197 which points to an array of nmatch structures of type
3198 regmatch_t, containing the members rm_so and rm_eo. These
3199 contain the offset to the first character of each substring
3200 and the offset to the first character after the end of each
3201 substring, respectively. The 0th element of the vector
3202 relates to the entire portion of string that was matched;
3203 subsequent elements relate to the capturing subpatterns of
3204 the regular expression. Unused entries in the array have
3205 both structure members set to -1.
3207 A successful match yields a zero return; various error codes
3208 are defined in the header file, of which REG_NOMATCH is the
3209 "expected" failure code.
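A minimal sketch of reading the offsets returned through pmatch follows. The helper name match_offsets is invented for illustration, and the code uses the system <regex.h> with REG_EXTENDED so that it builds without PCRE. The assumption is that the same calls are available through pcreposix.h.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compiles pattern, matches it against subject, and
   stores up to nmatch offset pairs in pmatch. Returns zero on success,
   REG_NOMATCH if nothing matched, or another non-zero code on error. */
int match_offsets(const char *pattern, const char *subject,
                  size_t nmatch, regmatch_t pmatch[])
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED);

    if (rc != 0)
        return rc;                      /* compile error code */

    rc = regexec(&re, subject, nmatch, pmatch, 0);
    regfree(&re);
    return rc;
}
```

After a successful call, the matched text for entry i starts at subject + pmatch[i].rm_so and has length pmatch[i].rm_eo - pmatch[i].rm_so. For the pattern "(cat)|(dog)" matched against "the dog sat", entry 1 belongs to the alternative that did not take part, so both of its members are -1.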
3214 The regerror() function maps a non-zero errorcode from
3215 either regcomp() or regexec() to a printable message. If
3216 preg is not NULL, the error should have arisen from the use
3217 of that structure. A message terminated by a binary zero is
3218 placed in errbuf. The length of the message, including the
3219 zero, is limited to errbuf_size. The yield of the function
3220 is the size of buffer needed to hold the whole message.
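Because the yield is the buffer size needed, one common idiom is to call regerror() twice: once with a zero-sized buffer to learn the size, and once to fetch the message. The sketch below assumes that the implementation ignores errbuf when errbuf_size is zero, as POSIX specifies. The helper name error_message is invented for illustration, and the code is built against the system <regex.h>.

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: returns a newly allocated, zero-terminated message
   for errcode, or NULL if memory cannot be obtained. The caller must
   free() the result. */
char *error_message(int errcode, const regex_t *preg)
{
    /* First call: no buffer, just ask how much space is needed. */
    size_t needed = regerror(errcode, preg, NULL, 0);
    if (needed == 0)
        return NULL;

    char *buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);   /* second call: fill it */
    return buf;
}
```

A fixed-size buffer also works. In that case the message is truncated to fit, and comparing the yield with the buffer size shows whether truncation happened.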
3225 Compiling a regular expression causes memory to be allocated
3226 and associated with the preg structure. The function
3227 regfree() frees all such memory, after which preg may no longer
3228 be used as a compiled expression.
3233 Philip Hazel <ph10@cam.ac.uk>
3234 University Computing Service,
3235 Cambridge CB2 3QG, England.
3237 Last updated: 03 February 2003
3238 Copyright (c) 1997-2003 University of Cambridge.
3239 -----------------------------------------------------------------------------
3241 NAME
3242 PCRE - Perl-compatible regular expressions
3247 A simple, complete demonstration program, to get you started
3248 with using PCRE, is supplied in the file pcredemo.c in the
3249 PCRE distribution.
3251 The program compiles the regular expression that is its
3252 first argument, and matches it against the subject string in
3253 its second argument. No PCRE options are set, and default
3254 character tables are used. If matching succeeds, the program
3255 outputs the portion of the subject that matched, together
3256 with the contents of any captured substrings.
3258 If the -g option is given on the command line, the program
3259 then goes on to check for further matches of the same regu-
3260 lar expression in the same subject string. The logic is a
3261 little bit tricky because of the possibility of matching an
3262 empty string. Comments in the code explain what is going on.
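To illustrate why empty matches need care in a repeated search, here is a simplified sketch. It is not the code from pcredemo.c, which uses the native PCRE functions and whose comments describe its actual logic. The sketch uses the system <regex.h> and a simpler rule: after an empty match it steps over one character before searching again. The helper name count_matches is invented for illustration.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts successive matches of pattern in subject.
   Returns -1 if the pattern fails to compile. */
int count_matches(const char *pattern, const char *subject)
{
    regex_t re;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    int count = 0;
    const char *p = subject;

    for (;;) {
        regmatch_t m;
        /* After the first search, p is no longer the start of the line. */
        int eflags = (p == subject) ? 0 : REG_NOTBOL;

        if (regexec(&re, p, 1, &m, eflags) != 0)
            break;                      /* no further match */
        count++;

        if (m.rm_eo > m.rm_so) {
            p += m.rm_eo;               /* resume after a non-empty match */
        } else {
            if (p[m.rm_so] == '\0')
                break;                  /* empty match at end of subject */
            p += m.rm_so + 1;           /* step over one character so the
                                           same empty match is not found
                                           again, which would loop forever */
        }
    }

    regfree(&re);
    return count;
}
```

The stepping rule is the design choice to note. Without it, a pattern that can match the empty string would match at the same position indefinitely. It is also cruder than retrying at the same position for a non-empty match, so it can miss a match that a more careful loop would find.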
3264 On a Unix system that has PCRE installed in /usr/local, you
3265 can compile the demonstration program using a command like
3266 this:
3268 gcc -o pcredemo pcredemo.c -I/usr/local/include \
3269 -L/usr/local/lib -lpcre
3271 Then you can run simple tests like this:
3273 ./pcredemo 'cat|dog' 'the cat sat on the mat'
3274 ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3276 Note that there is a much more comprehensive test program,
3277 called pcretest, which supports many more facilities for
3278 testing regular expressions and the PCRE library. The
3279 pcredemo program is provided as a simple coding example.
3281 On some operating systems (e.g. Solaris) you may get an
3282 error like this when you try to run pcredemo:
3284 ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such
3285 file or directory
3287 This is caused by the way shared library support works on
3288 those systems. You need to add
3290 -R/usr/local/lib
3292 to the compile command to get round this problem.
3294 Last updated: 28 January 2003
3295 Copyright (c) 1997-2003 University of Cambridge.
3296 -----------------------------------------------------------------------------
