Contents of /code/trunk/doc/pcre.txt

Revision 73 - (show annotations)
Sat Feb 24 21:40:30 2007 UTC (14 years, 1 month ago) by nigel
File MIME type: text/plain
File size: 151738 byte(s)
Load pcre-4.5 into code/trunk.
1 This file contains a concatenation of the PCRE man pages, converted to plain
2 text format for ease of searching with a text editor, or for use on systems
3 that do not have a man page processor. The small individual files that give
4 synopses of each function in the library have not been included. There are
5 separate text files for the pcregrep and pcretest commands.
6 -----------------------------------------------------------------------------
8 PCRE(3) PCRE(3)
13 PCRE - Perl-compatible regular expressions
17 The PCRE library is a set of functions that implement regular expres-
18 sion pattern matching using the same syntax and semantics as Perl, with
19 just a few differences. The current implementation of PCRE (release
20 4.x) corresponds approximately with Perl 5.8, including support for
21 UTF-8 encoded strings. However, this support has to be explicitly
22 enabled; it is not the default.
24 PCRE is written in C and released as a C library. However, a number of
25 people have written wrappers and interfaces of various kinds. A C++
26 class is included in these contributions, which can be found in the
27 Contrib directory at the primary FTP site, which is:
29 ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
31 Details of exactly which Perl regular expression features are and are
32 not supported by PCRE are given in separate documents. See the pcrepat-
33 tern and pcrecompat pages.
35 Some features of PCRE can be included, excluded, or changed when the
36 library is built. The pcre_config() function makes it possible for a
37 client to discover which features are available. Documentation about
38 building PCRE for various operating systems can be found in the README
39 file in the source distribution.
44 The user documentation for PCRE has been split up into a number of dif-
45 ferent sections. In the "man" format, each of these is a separate "man
46 page". In the HTML format, each is a separate page, linked from the
47 index page. In the plain text format, all the sections are concate-
48 nated, for ease of searching. The sections are as follows:
    pcre          this document
    pcreapi       details of PCRE's native API
    pcrebuild     options for building PCRE
    pcrecallout   details of the callout feature
    pcrecompat    discussion of Perl compatibility
    pcregrep      description of the pcregrep command
    pcrepattern   syntax and semantics of supported
                    regular expressions
    pcreperform   discussion of performance issues
    pcreposix     the POSIX-compatible API
    pcresample    discussion of the sample program
    pcretest      the pcretest testing command
63 In addition, in the "man" and HTML formats, there is a short page for
64 each library function, listing its arguments and results.
69 There are some size limitations in PCRE but it is hoped that they will
70 never in practice be relevant.
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE
is compiled with the default internal linkage size of 2. If you want to
process regular expressions that are truly enormous, you can compile
PCRE with an internal linkage size of 3 or 4 (see the README file in
the source distribution and the pcrebuild documentation for details).
In these cases the limit is substantially larger. However, the speed
of execution will be slower.
80 All values in repeating quantifiers must be less than 65536. The maxi-
81 mum number of capturing subpatterns is 65535.
83 There is no limit to the number of non-capturing subpatterns, but the
84 maximum depth of nesting of all kinds of parenthesized subpattern,
85 including capturing subpatterns, assertions, and other types of subpat-
86 tern, is 200.
88 The maximum length of a subject string is the largest positive number
89 that an integer variable can hold. However, PCRE uses recursion to han-
90 dle subpatterns and indefinite repetition. This means that the avail-
91 able stack space may limit the size of a subject string that can be
92 processed by certain patterns.
97 Starting at release 3.3, PCRE has had some support for character
98 strings encoded in the UTF-8 format. For release 4.0 this has been
99 greatly extended to cover most common requirements.
In order to process UTF-8 strings, you must build PCRE to include UTF-8
support in the code, and, in addition, you must call pcre_compile()
with the PCRE_UTF8 option flag. When you do this, both the pattern and
any subject strings that are matched against it are treated as UTF-8
strings instead of just strings of bytes.
107 If you compile PCRE with UTF-8 support, but do not use it at run time,
108 the library will be a bit bigger, but the additional run time overhead
109 is limited to testing the PCRE_UTF8 flag in several places, so should
110 not be very large.
112 The following comments apply when PCRE is running in UTF-8 mode:
114 1. When you set the PCRE_UTF8 flag, the strings passed as patterns and
115 subjects are checked for validity on entry to the relevant functions.
116 If an invalid UTF-8 string is passed, an error return is given. In some
117 situations, you may already know that your strings are valid, and
118 therefore want to skip these checks in order to improve performance. If
119 you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time,
120 PCRE assumes that the pattern or subject it is given (respectively)
121 contains only valid UTF-8 codes. In this case, it does not diagnose an
122 invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when
123 PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may
124 crash.
126 2. In a pattern, the escape sequence \x{...}, where the contents of the
127 braces is a string of hexadecimal digits, is interpreted as a UTF-8
128 character whose code number is the given hexadecimal number, for exam-
129 ple: \x{1234}. If a non-hexadecimal digit appears between the braces,
130 the item is not recognized. This escape sequence can be used either as
131 a literal, or within a character class.
133 3. The original hexadecimal escape sequence, \xhh, matches a two-byte
134 UTF-8 character if the value is greater than 127.
136 4. Repeat quantifiers apply to complete UTF-8 characters, not to indi-
137 vidual bytes, for example: \x{100}{3}.
139 5. The dot metacharacter matches one UTF-8 character instead of a
140 single byte.
142 6. The escape sequence \C can be used to match a single byte in UTF-8
143 mode, but its use can lead to some strange effects.
145 7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
146 test characters of any code value, but the characters that PCRE recog-
147 nizes as digits, spaces, or word characters remain the same set as
148 before, all with values less than 256.
150 8. Case-insensitive matching applies only to characters whose values
151 are less than 256. PCRE does not support the notion of "case" for
152 higher-valued characters.
154 9. PCRE does not support the use of Unicode tables and properties or
155 the Perl escapes \p, \P, and \X.
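
Several of the comments above (items 2 to 5) depend on how one character
maps to a sequence of bytes. As a rough illustration only, the following
sketch encodes a code point using the standard UTF-8 rules for values up
to 0x10FFFF. The function name utf8_encode is hypothetical and is not
part of the PCRE API, and the sketch does not reject surrogate values.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encode one code point as UTF-8
   into buf (which must hold at least 4 bytes). Returns the number of
   bytes written, or 0 if the value is above 0x10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {                       /* one byte: plain ASCII */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* two bytes */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        buf[0] = (unsigned char)(0xF0 | (cp >> 18));
        buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* out of range for this sketch */
}
```

Under these rules \x{1234} stands for the three bytes E1 88 B4, and a
value such as \xe9 becomes the two bytes C3 A9, which is why a quantifier
or a dot has to work on whole characters rather than on single bytes.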
160 Philip Hazel <ph10@cam.ac.uk>
161 University Computing Service,
162 Cambridge CB2 3QG, England.
163 Phone: +44 1223 334714
165 Last updated: 20 August 2003
166 Copyright (c) 1997-2003 University of Cambridge.
167 -----------------------------------------------------------------------------
169 PCRE(3) PCRE(3)
173 NAME
174 PCRE - Perl-compatible regular expressions
178 This document describes the optional features of PCRE that can be
179 selected when the library is compiled. They are all selected, or dese-
180 lected, by providing options to the configure script which is run
181 before the make command. The complete list of options for configure
182 (which includes the standard ones such as the selection of the instal-
183 lation directory) can be obtained by running
185 ./configure --help
187 The following sections describe certain options whose names begin with
188 --enable or --disable. These settings specify changes to the defaults
189 for the configure command. Because of the way that configure works,
190 --enable and --disable always come in pairs, so the complementary
191 option always exists as well, but as it specifies the default, it is
192 not described.
To build PCRE with support for UTF-8 character strings, add

    --enable-utf8

to the configure command. Of itself, this does not make PCRE treat
strings as UTF-8. As well as compiling PCRE with this option, you also
have to set the PCRE_UTF8 option when you call the pcre_compile()
function.
209 By default, PCRE treats character 10 (linefeed) as the newline charac-
210 ter. This is the normal newline character on Unix-like systems. You can
211 compile PCRE to use character 13 (carriage return) instead by adding
213 --enable-newline-is-cr
215 to the configure command. For completeness there is also a --enable-
216 newline-is-lf option, which explicitly specifies linefeed as the new-
217 line character.
222 The PCRE building process uses libtool to build both shared and static
223 Unix libraries by default. You can suppress one of these by adding one
224 of
226 --disable-shared
227 --disable-static
229 to the configure command, as required.
234 When PCRE is called through the POSIX interface (see the pcreposix
235 documentation), additional working storage is required for holding the
236 pointers to capturing substrings because PCRE requires three integers
237 per substring, whereas the POSIX interface provides only two. If the
238 number of expected substrings is small, the wrapper function uses space
239 on the stack, because this is faster than using malloc() for each call.
240 The default threshold above which the stack is no longer used is 10; it
241 can be changed by adding a setting such as
243 --with-posix-malloc-threshold=20
245 to the configure command.
250 Internally, PCRE has a function called match() which it calls repeat-
251 edly (possibly recursively) when performing a matching operation. By
252 limiting the number of times this function may be called, a limit can
253 be placed on the resources used by a single call to pcre_exec(). The
254 limit can be changed at run time, as described in the pcreapi documen-
255 tation. The default is 10 million, but this can be changed by adding a
256 setting such as
258 --with-match-limit=500000
260 to the configure command.
265 Within a compiled pattern, offset values are used to point from one
266 part to another (for example, from an opening parenthesis to an alter-
267 nation metacharacter). By default two-byte values are used for these
268 offsets, leading to a maximum size for a compiled pattern of around
269 64K. This is sufficient to handle all but the most gigantic patterns.
270 Nevertheless, some people do want to process enormous patterns, so it
271 is possible to compile PCRE to use three-byte or four-byte offsets by
272 adding a setting such as
274 --with-link-size=3
276 to the configure command. The value given must be 2, 3, or 4. Using
277 longer offsets slows down the operation of PCRE because it has to load
278 additional bytes when handling them.
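
To make the trade-off between pattern size and speed concrete, here is a
minimal sketch of storing an offset in a configurable number of bytes.
The names put_offset, get_offset and max_offset are hypothetical, and the
most-significant-byte-first order is an assumption made for the
illustration; none of this is taken from PCRE's source.

```c
/* Hypothetical helpers, not PCRE internals: store and load an offset
   held in link_size bytes (2, 3, or 4), most significant byte first. */
void put_offset(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);   /* lowest byte goes last */
        value >>= 8;
    }
}

unsigned long get_offset(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < link_size; i++)             /* one load per byte */
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset that fits in link_size bytes. */
unsigned long max_offset(int link_size)
{
    if (link_size >= 4)
        return 0xFFFFFFFFUL;
    return (1UL << (8 * link_size)) - 1;
}
```

With a link size of 2, max_offset() gives 65535, which matches the
figure of around 64K quoted above. Each additional byte of link size
costs one more load every time an offset is read.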
280 If you build PCRE with an increased link size, test 2 (and test 5 if
281 you are using UTF-8) will fail. Part of the output of these tests is a
282 representation of the compiled pattern, and this changes with the link
283 size.
288 PCRE implements backtracking while matching by making recursive calls
289 to an internal function called match(). In environments where the size
290 of the stack is limited, this can severely limit PCRE's operation. (The
291 Unix environment does not usually suffer from this problem.) An alter-
292 native approach that uses memory from the heap to remember data,
293 instead of using recursive function calls, has been implemented to work
294 round this problem. If you want to build a version of PCRE that works
295 this way, add
297 --disable-stack-for-recursion
299 to the configure command. With this configuration, PCRE will use the
300 pcre_stack_malloc and pcre_stack_free variables to call memory
301 management functions. Separate functions are provided because the usage
302 is very predictable: the block sizes requested are always the same, and
303 the blocks are always freed in reverse order. A calling program might
304 be able to implement optimized functions that perform better than the
305 standard malloc() and free() functions. PCRE runs noticeably more
306 slowly when built in this way.
311 PCRE assumes by default that it will run in an environment where the
312 character code is ASCII (or UTF-8, which is a superset of ASCII). PCRE
313 can, however, be compiled to run in an EBCDIC environment by adding
315 --enable-ebcdic
317 to the configure command.
319 Last updated: 09 December 2003
320 Copyright (c) 1997-2003 University of Cambridge.
321 -----------------------------------------------------------------------------
323 PCRE(3) PCRE(3)
327 NAME
328 PCRE - Perl-compatible regular expressions
#include <pcre.h>

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

int pcre_copy_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     char *buffer, int buffersize);

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_named_substring(const pcre *code,
     const char *subject, int *ovector,
     int stringcount, const char *stringname,
     const char **stringptr);

int pcre_get_stringnumber(const pcre *code,
     const char *name);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

void pcre_free_substring(const char *stringptr);

void pcre_free_substring_list(const char **stringptr);

const unsigned char *pcre_maketables(void);

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

int pcre_config(int what, void *where);

char *pcre_version(void);

void *(*pcre_malloc)(size_t);

void (*pcre_free)(void *);

void *(*pcre_stack_malloc)(size_t);

void (*pcre_stack_free)(void *);

int (*pcre_callout)(pcre_callout_block *);
397 PCRE has its own native API, which is described in this document. There
398 is also a set of wrapper functions that correspond to the POSIX regular
399 expression API. These are described in the pcreposix documentation.
401 The native API function prototypes are defined in the header file
402 pcre.h, and on Unix systems the library itself is called libpcre.a, so
403 can be accessed by adding -lpcre to the command for linking an applica-
404 tion which calls it. The header file defines the macros PCRE_MAJOR and
405 PCRE_MINOR to contain the major and minor release numbers for the
406 library. Applications can use these to include support for different
407 releases.
409 The functions pcre_compile(), pcre_study(), and pcre_exec() are used
410 for compiling and matching regular expressions. A sample program that
411 demonstrates the simplest way of using them is given in the file pcre-
412 demo.c. The pcresample documentation describes how to run it.
414 There are convenience functions for extracting captured substrings from
415 a matched subject string. They are:
417 pcre_copy_substring()
418 pcre_copy_named_substring()
419 pcre_get_substring()
420 pcre_get_named_substring()
421 pcre_get_substring_list()
423 pcre_free_substring() and pcre_free_substring_list() are also provided,
424 to free the memory used for extracted strings.
426 The function pcre_maketables() is used (optionally) to build a set of
427 character tables in the current locale for passing to pcre_compile().
429 The function pcre_fullinfo() is used to find out information about a
430 compiled pattern; pcre_info() is an obsolete version which returns only
431 some of the available information, but is retained for backwards com-
432 patibility. The function pcre_version() returns a pointer to a string
433 containing the version of PCRE and its date of release.
435 The global variables pcre_malloc and pcre_free initially contain the
436 entry points of the standard malloc() and free() functions respec-
437 tively. PCRE calls the memory management functions via these variables,
438 so a calling program can replace them if it wishes to intercept the
439 calls. This should be done before calling any PCRE functions.
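
As a sketch of what intercepting the calls might look like, the
following pair of functions wraps malloc() and free() and counts the
blocks currently outstanding. The names counting_malloc, counting_free
and counting_live are hypothetical; only the function signatures are
dictated by the pcre_malloc and pcre_free declarations.

```c
#include <stdlib.h>

static size_t live_blocks = 0;   /* blocks obtained and not yet freed */

/* Same signature as the function pcre_malloc points to. */
void *counting_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Same signature as the function pcre_free points to. */
void counting_free(void *p)
{
    if (p != NULL)
        live_blocks--;
    free(p);
}

/* Number of blocks currently outstanding. */
size_t counting_live(void)
{
    return live_blocks;
}
```

A program that includes pcre.h would install these with
"pcre_malloc = counting_malloc; pcre_free = counting_free;" before its
first call to any PCRE function, and could then check counting_live()
at exit to look for compiled patterns that were never freed.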
441 The global variables pcre_stack_malloc and pcre_stack_free are also
442 indirections to memory management functions. These special functions
443 are used only when PCRE is compiled to use the heap for remembering
444 data, instead of recursive function calls. This is a non-standard way
445 of building PCRE, for use in environments that have limited stacks.
446 Because of the greater use of memory management, it runs more slowly.
447 Separate functions are provided so that special-purpose external code
448 can be used for this case. When used, these functions are always called
449 in a stack-like manner (last obtained, first freed), and always for
450 memory blocks of the same size.
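
The two properties just described (blocks of one size, freed in reverse
order) are what make a special-purpose allocator attractive. The sketch
below keeps freed blocks on a list and hands the most recent one back on
the next request. The names frame_malloc and frame_free are hypothetical,
and the code is valid only under the assumption that every request has
the same size; it also never returns memory to the system.

```c
#include <stdlib.h>

/* A freed block is reused to hold the link to the next spare block. */
typedef struct frame_block {
    struct frame_block *next;
} frame_block;

static frame_block *spare = NULL;   /* released blocks, most recent first */

/* Same signature as the function pcre_stack_malloc points to.
   Assumes all requests are for the same size. */
void *frame_malloc(size_t size)
{
    frame_block *b;
    if (size < sizeof(frame_block))
        size = sizeof(frame_block);  /* leave room for the list link */
    if (spare != NULL) {
        b = spare;                   /* reuse the most recently freed block */
        spare = b->next;
        return b;
    }
    return malloc(size);             /* nothing spare: get a fresh block */
}

/* Same signature as the function pcre_stack_free points to. */
void frame_free(void *p)
{
    frame_block *b = p;
    if (b == NULL)
        return;
    b->next = spare;                 /* keep the block for the next request */
    spare = b;
}
```

These would be installed with "pcre_stack_malloc = frame_malloc;
pcre_stack_free = frame_free;" and have an effect only when PCRE was
built to use the heap instead of recursive function calls.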
452 The global variable pcre_callout initially contains NULL. It can be set
453 by the caller to a "callout" function, which PCRE will then call at
454 specified points during a matching operation. Details are given in the
455 pcrecallout documentation.
460 The PCRE functions can be used in multi-threading applications, with
461 the proviso that the memory management functions pointed to by
462 pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
463 callout function pointed to by pcre_callout, are shared by all threads.
465 The compiled form of a regular expression is not altered during match-
466 ing, so the same compiled pattern can safely be used by several threads
467 at once.
472 int pcre_config(int what, void *where);
474 The function pcre_config() makes it possible for a PCRE client to dis-
475 cover which optional features have been compiled into the PCRE library.
476 The pcrebuild documentation has more details about these optional fea-
477 tures.
479 The first argument for pcre_config() is an integer, specifying which
480 information is required; the second argument is a pointer to a variable
481 into which the information is placed. The following information is
482 available:

PCRE_CONFIG_UTF8

The output is an integer that is set to one if UTF-8 support is
available; otherwise it is set to zero.

PCRE_CONFIG_NEWLINE

The output is an integer that is set to the value of the code that is
used for the newline character. It is either linefeed (10) or carriage
return (13), and should normally be the standard character for your
operating system.

PCRE_CONFIG_LINK_SIZE

The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
4. Larger values allow larger regular expressions to be compiled, at
the expense of slower matching. The default value of 2 is sufficient
for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.

PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.

PCRE_CONFIG_MATCH_LIMIT

The output is an integer that gives the default limit for the number of
internal matching function calls in a pcre_exec() execution. Further
details are given with pcre_exec() below.

PCRE_CONFIG_STACKRECURSE

The output is an integer that is set to one if internal recursion is
implemented by recursive function calls that use the stack to remember
their state. This is the usual way that PCRE is compiled. The output is
zero if PCRE was compiled to use blocks of data on the heap instead of
recursive function calls. In this case, pcre_stack_malloc and
pcre_stack_free are called to manage memory blocks on the heap, thus
avoiding the use of the stack.
530 pcre *pcre_compile(const char *pattern, int options,
531 const char **errptr, int *erroffset,
532 const unsigned char *tableptr);
535 The function pcre_compile() is called to compile a pattern into an
536 internal form. The pattern is a C string terminated by a binary zero,
537 and is passed in the argument pattern. A pointer to a single block of
538 memory that is obtained via pcre_malloc is returned. This contains the
539 compiled code and related data. The pcre type is defined for the
540 returned block; this is a typedef for a structure whose contents are
541 not externally defined. It is up to the caller to free the memory when
542 it is no longer required.
544 Although the compiled code of a PCRE regex is relocatable, that is, it
545 does not depend on memory location, the complete pcre data block is not
546 fully relocatable, because it contains a copy of the tableptr argument,
547 which is an address (see below).
549 The options argument contains independent bits that affect the compila-
550 tion. It should be zero if no options are required. Some of the
551 options, in particular, those that are compatible with Perl, can also
552 be set and unset from within the pattern (see the detailed description
553 of regular expressions in the pcrepattern documentation). For these
554 options, the contents of the options argument specifies their initial
555 settings at the start of compilation and execution. The PCRE_ANCHORED
556 option can be set at the time of matching as well as at compile time.
558 If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
559 if compilation of a pattern fails, pcre_compile() returns NULL, and
560 sets the variable pointed to by errptr to point to a textual error mes-
561 sage. The offset from the start of the pattern to the character where
562 the error was discovered is placed in the variable pointed to by
563 erroffset, which must not be NULL. If it is, an immediate error is
564 given.
566 If the final argument, tableptr, is NULL, PCRE uses a default set of
567 character tables which are built when it is compiled, using the default
568 C locale. Otherwise, tableptr must be the result of a call to
569 pcre_maketables(). See the section on locale support below.
571 This code fragment shows a typical straightforward call to pcre_com-
572 pile():
    pcre *re;
    const char *error;
    int erroffset;
    re = pcre_compile(
      "^A.*Z",          /* the pattern */
      0,                /* default options */
      &error,           /* for error message */
      &erroffset,       /* for error offset */
      NULL);            /* use default character tables */
584 The following option bits are defined:

PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is, it
is constrained to match only at the first matching point in the string
which is being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.

PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting.

PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before the final character if it is a newline (but
not before any other newlines). The PCRE_DOLLAR_ENDONLY option is
ignored if PCRE_MULTILINE is set. There is no equivalent to this option
in Perl, and no way to set it within a pattern.

PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches all
characters, including newlines. Without it, newlines are excluded. This
option is equivalent to Perl's /s option, and it can be changed within
a pattern by a (?s) option setting. A negative class such as [^a]
always matches a newline character, independent of the setting of this
option.

PCRE_EXTENDED

If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class.
Whitespace does not include the VT character (code 11). In addition,
characters between an unescaped # outside a character class and the
next newline character, inclusive, are also ignored. This is equivalent
to Perl's /x option, and it can be changed within a pattern by a (?x)
option setting.

This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.

PCRE_EXTRA

This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a
literal. There are at present no other features controlled by this
option. It can also be set by a (?X) option setting within a pattern.

PCRE_MULTILINE

By default, PCRE treats the subject string as consisting of a single
"line" of characters (even if it actually contains several newlines).
The "start of line" metacharacter (^) matches only at the start of the
string, while the "end of line" metacharacter ($) matches only at the
end of the string, or before a terminating newline (unless
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before any
newline in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no "\n"
characters in a subject string, or no occurrences of ^ or $ in a
pattern, setting PCRE_MULTILINE has no effect.

PCRE_NO_AUTO_CAPTURE

If this option is set, it disables the use of numbered capturing
parentheses in the pattern. Any opening parenthesis that is not
followed by ? behaves as if it were followed by ?: but named
parentheses can still be used for capturing (and they acquire numbers
in the usual way). There is no equivalent of this option in Perl.

PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It is
not compatible with Perl. It can also be set by a (?U) option setting
within the pattern.

PCRE_UTF8

This option causes PCRE to regard both the pattern and the subject as
strings of UTF-8 characters instead of single-byte character strings.
However, it is available only if PCRE has been built to include UTF-8
support. If not, the use of this option provokes an error. Details of
how this option changes the behaviour of PCRE are given in the section
on UTF-8 support in the main pcre page.

PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
automatically checked. If an invalid UTF-8 sequence of bytes is found,
pcre_compile() returns an error. If you already know that your pattern
is valid, and you want to skip this check for performance reasons, you
can set the PCRE_NO_UTF8_CHECK option. When it is set, the effect of
passing an invalid UTF-8 string as a pattern is undefined. It may cause
your program to crash. Note that there is a similar option for
suppressing the checking of subject strings passed to pcre_exec().
701 pcre_extra *pcre_study(const pcre *code, int options,
702 const char **errptr);
When a pattern is going to be used several times, it is worth spending
more time analyzing it in order to speed up the time taken for
matching. The function pcre_study() takes a pointer to a compiled
pattern as its first argument. If studying the pattern produces
additional information that will help speed up matching, pcre_study()
returns a pointer to a pcre_extra block, in which the study_data field
points to the results of the study.
The value returned by pcre_study() can be passed directly to
pcre_exec(). However, the pcre_extra block also contains other fields
that can be set by the caller before the block is passed; these are
described below. If studying the pattern does not produce any
additional information, pcre_study() returns NULL. In that
circumstance, if the calling program wants to pass some of the other
fields to pcre_exec(), it must set up its own pcre_extra block.
720 The second argument contains option bits. At present, no options are
721 defined for pcre_study(), and this argument should always be zero.
723 The third argument for pcre_study() is a pointer for an error message.
724 If studying succeeds (even if no data is returned), the variable it
725 points to is set to NULL. Otherwise it points to a textual error mes-
726 sage. You should therefore test the error pointer for NULL after call-
727 ing pcre_study(), to be sure that it has run successfully.
729 This is a typical call to pcre_study():
    pcre_extra *pe;
    pe = pcre_study(
      re,             /* result of pcre_compile() */
      0,              /* no options exist */
      &error);        /* set to NULL or points to a message */
737 At present, studying a pattern is useful only for non-anchored patterns
738 that do not have a single fixed starting character. A bitmap of possi-
739 ble starting characters is created.
744 PCRE handles caseless matching, and determines whether characters are
745 letters, digits, or whatever, by reference to a set of tables. When
746 running in UTF-8 mode, this applies only to characters with codes less
747 than 256. The library contains a default set of tables that is created
748 in the default C locale when PCRE is compiled. This is used when the
749 final argument of pcre_compile() is NULL, and is sufficient for many
750 applications.
752 An alternative set of tables can, however, be supplied. Such tables are
753 built by calling the pcre_maketables() function, which has no argu-
754 ments, in the relevant locale. The result can then be passed to
755 pcre_compile() as often as necessary. For example, to build and use
756 tables that are appropriate for the French locale (where accented char-
757 acters with codes greater than 128 are treated as letters), the follow-
758 ing code could be used:
    setlocale(LC_CTYPE, "fr");
    tables = pcre_maketables();
    re = pcre_compile(..., tables);
764 The tables are built in memory that is obtained via pcre_malloc. The
765 pointer that is passed to pcre_compile is saved with the compiled pat-
766 tern, and the same tables are used via this pointer by pcre_study() and
767 pcre_exec(). Thus, for any single pattern, compilation, studying and
768 matching all happen in the same locale, but different patterns can be
769 compiled in different locales. It is the caller's responsibility to
770 ensure that the memory containing the tables remains available for as
771 long as it is needed.
  int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
       int what, void *where);

The pcre_fullinfo() function returns information about a compiled
pattern. It replaces the obsolete pcre_info() function, which is
nevertheless retained for backwards compatibility (and is documented
below).
783 The first argument for pcre_fullinfo() is a pointer to the compiled
784 pattern. The second argument is the result of pcre_study(), or NULL if
785 the pattern was not studied. The third argument specifies which piece
786 of information is required, and the fourth argument is a pointer to a
787 variable to receive the data. The yield of the function is zero for
788 success, or one of the following negative numbers:
  PCRE_ERROR_NULL       the argument code was NULL
                        the argument where was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid
795 Here is a typical call of pcre_fullinfo(), to obtain the length of the
796 compiled pattern:
  int rc;
  size_t length;        /* PCRE_INFO_SIZE expects a size_t variable */
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    pe,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */
The possible values for the third argument are defined in pcre.h, and
are as follows:

  PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.

  PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.
  PCRE_INFO_FIRSTBYTE

Return information about the first byte of any matched string, for a
non-anchored pattern. (This option used to be called
PCRE_INFO_FIRSTCHAR; the old name is still recognized for backwards
compatibility.)
827 If there is a fixed first byte, e.g. from a pattern such as
828 (cat|cow|coyote), it is returned in the integer pointed to by where.
829 Otherwise, if either
831 (a) the pattern was compiled with the PCRE_MULTILINE option, and every
832 branch starts with "^", or
834 (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
835 set (if it were set, the pattern would be anchored),
837 -1 is returned, indicating that the pattern matches only at the start
838 of a subject string or after any newline within the string. Otherwise
839 -2 is returned. For anchored patterns, -2 is returned.
  PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an unsigned char *
variable.
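As a rough illustration of how a caller might consult such a 256-bit
(32-byte) table, the sketch below tests a single byte value against it.
The helper name first_byte_possible() is hypothetical, and the bit layout
(bit c % 8 of byte c / 8) is an assumption that this text does not state;
it should be checked against the PCRE source before being relied on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns 1 if byte value c is flagged in a
   256-bit table, 0 otherwise. Assumes (not confirmed by this text)
   that bit (c % 8) of table[c / 8] corresponds to byte value c. */
int first_byte_possible(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] & (1u << (c % 8))) != 0;
}
```

A caller could use a check of this kind to skip subject positions that
cannot start a match, which is the purpose the table serves during
matching.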
  PCRE_INFO_LASTLITERAL

Return the value of the rightmost literal byte that must exist in any
matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
is no such byte, -1 is returned. For anchored patterns, a last literal
byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.
  PCRE_INFO_NAMECOUNT
  PCRE_INFO_NAMEENTRYSIZE
  PCRE_INFO_NAMETABLE

PCRE supports the use of named as well as numbered capturing
parentheses. The names are just an additional way of identifying the
parentheses, which still acquire a number. A caller that wants to
extract data from a named subpattern must convert the name to a number
in order to access the correct pointers in the output vector (described
with pcre_exec() below). In order to do this, it must first use these
three values to obtain the name-to-number mapping table for the pattern.
871 The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
872 gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
873 of each entry; both of these return an int value. The entry size
874 depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
875 a pointer to the first entry of the table (a pointer to char). The
876 first two bytes of each entry are the number of the capturing parenthe-
877 sis, most significant byte first. The rest of the entry is the corre-
878 sponding name, zero terminated. The names are in alphabetical order.
879 For example, consider the following pattern (assume PCRE_EXTENDED is
880 set, so white space - including newlines - is ignored):
  (?P<date> (?P<year>(\d\d)?\d\d) -
  (?P<month>\d\d) - (?P<day>\d\d) )

There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hex, and undefined bytes shown as ??:

  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??
894 When writing code to extract data from named subpatterns, remember that
895 the length of each entry may be different for each compiled pattern.
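The table layout described above can be walked with a few lines of C. The
following is a minimal sketch, not part of the library: the function name
lookup_group_number() and its -1 "not found" result are assumptions made
for illustration. It takes the three values that would be obtained from
PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: look up a subpattern name in a name table laid
   out as described above. Each entry is entry_size bytes: a two-byte
   group number (most significant byte first) followed by the
   zero-terminated name. Returns the group number, or -1 if the name
   is not present (-1 is this sketch's own convention). */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

The sketch uses a linear scan for brevity. Because the names are stored in
alphabetical order, a binary search would also work for patterns with many
names. Note that entry_size is taken from the caller rather than being
hard-coded, since it differs from pattern to pattern.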
  PCRE_INFO_OPTIONS

Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings within the pattern itself.
904 A pattern is automatically anchored by PCRE if all of its top-level
905 alternatives begin with one of the following:
  ^     unless PCRE_MULTILINE is set
  \A    always
  \G    always
  .*    if PCRE_DOTALL is set and there are no back
          references to the subpattern in which .* appears
913 For such patterns, the PCRE_ANCHORED bit is set in the options returned
914 by pcre_fullinfo().
  PCRE_INFO_SIZE

Return the size of the compiled pattern, that is, the value that was
passed as the argument to pcre_malloc() when PCRE was getting memory in
which to place the compiled data. The fourth argument should point to a
size_t variable.

  PCRE_INFO_STUDYSIZE

Return the size of the data block pointed to by the study_data field
in a pcre_extra block. That is, it is the value that was passed to
pcre_malloc() when PCRE was getting memory into which to place the data
created by pcre_study(). The fourth argument should point to a size_t
variable.
934 int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
936 The pcre_info() function is now obsolete because its interface is too
937 restrictive to return all the available data about a compiled pattern.
938 New programs should use pcre_fullinfo() instead. The yield of
939 pcre_info() is the number of capturing subpatterns, or one of the fol-
940 lowing negative numbers:
  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found

If the optptr argument is not NULL, a copy of the options with which
the pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).

If the pattern is not anchored and the firstcharptr argument is not
NULL, it is used to pass back information about the first character of
any matched string (see PCRE_INFO_FIRSTBYTE above).
  int pcre_exec(const pcre *code, const pcre_extra *extra,
       const char *subject, int length, int startoffset,
       int options, int *ovector, int ovecsize);
960 The function pcre_exec() is called to match a subject string against a
961 pre-compiled pattern, which is passed in the code argument. If the pat-
962 tern has been studied, the result of the study should be passed in the
963 extra argument.
965 Here is an example of a simple call to pcre_exec():
  int rc;
  int ovector[30];
  rc = pcre_exec(
    re,             /* result of pcre_compile() */
    NULL,           /* we didn't study the pattern */
    "some string",  /* the subject string */
    11,             /* the length of the subject string */
    0,              /* start at offset 0 in the subject */
    0,              /* default options */
    ovector,        /* vector for substring information */
    30);            /* number of elements in the vector */
979 If the extra argument is not NULL, it must point to a pcre_extra data
980 block. The pcre_study() function returns such a block (when it doesn't
981 return NULL), but you can also create one for yourself, and pass addi-
982 tional information in it. The fields in the block are as follows:
  unsigned long int flags;
  void *study_data;
  unsigned long int match_limit;
  void *callout_data;

The flags field is a bitmap that specifies which of the other fields
are set. The flag bits are:

  PCRE_EXTRA_STUDY_DATA
  PCRE_EXTRA_MATCH_LIMIT
  PCRE_EXTRA_CALLOUT_DATA
996 Other flag bits should be set to zero. The study_data field is set in
997 the pcre_extra block that is returned by pcre_study(), together with
998 the appropriate flag bit. You should not set this yourself, but you can
999 add to the block by setting the other fields.
1001 The match_limit field provides a means of preventing PCRE from using up
1002 a vast amount of resources when running patterns that are not going to
1003 match, but which have a very large number of possibilities in their
1004 search trees. The classic example is the use of nested unlimited
1005 repeats. Internally, PCRE uses a function called match() which it calls
1006 repeatedly (sometimes recursively). The limit is imposed on the number
1007 of times this function is called during a match, which has the effect
1008 of limiting the amount of recursion and backtracking that can take
1009 place. For patterns that are not anchored, the count starts from zero
1010 for each position in the subject string.
The default limit for the library can be set when PCRE is built; unless
it is changed at build time, the default is 10 million, which handles
all but the most extreme cases. You can reduce the default by supplying
pcre_exec() with a pcre_extra block in which match_limit is set to a
smaller value, and PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If
the limit is exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

The callout_data field is used in conjunction with the "callout"
feature, which is described in the pcrecallout documentation.
The PCRE_ANCHORED option can be passed in the options argument, whose
unused bits must be zero. This limits pcre_exec() to matching at the
first matching position. However, if a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its contents,
it cannot be made unanchored at matching time.
1028 When PCRE_UTF8 was set at compile time, the validity of the subject as
1029 a UTF-8 string is automatically checked, and the value of startoffset
1030 is also checked to ensure that it points to the start of a UTF-8 char-
1031 acter. If an invalid UTF-8 sequence of bytes is found, pcre_exec()
1032 returns the error PCRE_ERROR_BADUTF8. If startoffset contains an
1033 invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
1035 If you already know that your subject is valid, and you want to skip
1036 these checks for performance reasons, you can set the
1037 PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
1038 do this for the second and subsequent calls to pcre_exec() if you are
1039 making repeated calls to find all the matches in a single subject
1040 string. However, you should be sure that the value of startoffset
1041 points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
1042 set, the effect of passing an invalid UTF-8 string as a subject, or a
1043 value of startoffset that does not point to the start of a UTF-8 char-
1044 acter, is undefined. Your program may crash.
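Since the check is skipped when PCRE_NO_UTF8_CHECK is set, a caller may
want a cheap test of its own for startoffset. The following is a sketch
under stated assumptions: offset_is_char_start() is a hypothetical
caller-side helper, not a PCRE function, and it only detects an offset
that lands on a UTF-8 continuation byte. It does not validate the subject
as a whole.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical caller-side check: returns 1 if startoffset is a
   plausible start of a UTF-8 character within subject, 0 otherwise.
   It only rejects offsets that land on a continuation byte
   (bit pattern 10xxxxxx); it does not validate the whole string. */
int offset_is_char_start(const char *subject, int length, int startoffset)
{
    assert(subject != NULL && length >= 0);
    if (startoffset < 0 || startoffset > length)
        return 0;
    if (startoffset == length)
        return 1;               /* end of subject is a valid boundary */
    return (((unsigned char)subject[startoffset]) & 0xC0) != 0x80;
}
```

For the six-byte subject "h\xC3\xA9llo", offsets 0, 1 and 3 pass this
test, while offset 2 (the second byte of the two-byte character) does not.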
There are also three further options that can be set only at matching
time:

  PCRE_NOTBOL

The first character of the string is not the beginning of a line, so
the circumflex metacharacter should not match before it. Setting this
without PCRE_MULTILINE (at compile time) causes circumflex never to
match.

  PCRE_NOTEOL

The end of the string is not the end of a line, so the dollar
metacharacter should not match it nor (except in multiline mode) a
newline immediately before it. Setting this without PCRE_MULTILINE (at
compile time) causes dollar never to match.

  PCRE_NOTEMPTY

An empty string is not considered to be a valid match if this option is
set. If there are alternatives in the pattern, they are tried. If all
the alternatives match the empty string, the entire match fails. For
example, if the pattern

  a?b?
1072 is applied to a string not beginning with "a" or "b", it matches the
1073 empty string at the start of the subject. With PCRE_NOTEMPTY set, this
1074 match is not valid, so PCRE searches further into the string for occur-
1075 rences of "a" or "b".
1077 Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1078 cial case of a pattern match of the empty string within its split()
1079 function, and when using the /g modifier. It is possible to emulate
1080 Perl's behaviour after matching a null string by first trying the match
1081 again at the same offset with PCRE_NOTEMPTY set, and then if that fails
1082 by advancing the starting offset (see below) and trying an ordinary
1083 match again.
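The retry-then-advance procedure just described can be sketched as a loop.
To keep the example runnable without the library, the match call is
abstracted behind a function pointer: match_fn, match_a_star() and
count_all_matches() are all hypothetical names introduced here, and
match_a_star() is a toy stand-in that behaves like the pattern a*. In real
code the call would be pcre_exec(), with the notempty flag replaced by
PCRE_NOTEMPTY in the options argument.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical matcher signature standing in for a pcre_exec() call.
   It returns 1 and sets ovector[0..1] on a match, or -1 on no match.
   notempty plays the role of the PCRE_NOTEMPTY option. */
typedef int (*match_fn)(const char *subject, int length, int startoffset,
                        int notempty, int *ovector);

/* Toy stand-in that behaves like the pattern a* so that the loop can
   run without the library. */
int match_a_star(const char *subject, int length, int startoffset,
                 int notempty, int *ovector)
{
    int start = startoffset;
    int end;
    if (notempty) {
        /* An empty match is not acceptable: find the next 'a'. */
        while (start < length && subject[start] != 'a')
            start++;
        if (start >= length)
            return -1;
    }
    end = start;
    while (end < length && subject[end] == 'a')
        end++;
    ovector[0] = start;
    ovector[1] = end;
    return 1;
}

/* Count matches in the manner of Perl's /g modifier. */
int count_all_matches(match_fn match, const char *subject, int length)
{
    int ovector[2];
    int offset = 0;
    int notempty = 0;
    int count = 0;

    assert(match != NULL && subject != NULL && length >= 0);
    while (offset <= length) {
        if (match(subject, length, offset, notempty, ovector) < 0) {
            if (!notempty)
                break;          /* ordinary match failed: finished */
            notempty = 0;       /* retry failed: advance one byte and */
            offset++;           /* try an ordinary match again */
            continue;
        }
        count++;
        offset = ovector[1];
        /* After an empty match, retry at the same offset with the
           not-empty restriction before advancing. */
        notempty = (ovector[0] == ovector[1]);
    }
    return count;
}
```

The sketch advances by a single byte after a failed retry. In UTF-8 mode
the advance would need to be one whole character, so that the next
starting offset still points to the start of a character.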
1085 The subject string is passed to pcre_exec() as a pointer in subject, a
1086 length in length, and a starting byte offset in startoffset. Unlike the
1087 pattern string, the subject may contain binary zero bytes. When the
1088 starting offset is zero, the search for a match starts at the beginning
1089 of the subject, and this is by far the most common case.
1091 If the pattern was compiled with the PCRE_UTF8 option, the subject must
1092 be a sequence of bytes that is a valid UTF-8 string, and the starting
1093 offset must point to the beginning of a UTF-8 character. If an invalid
1094 UTF-8 string or offset is passed, an error (either PCRE_ERROR_BADUTF8
1095 or PCRE_ERROR_BADUTF8_OFFSET) is returned, unless the option
1096 PCRE_NO_UTF8_CHECK is set, in which case PCRE's behaviour is not
1097 defined.
1099 A non-zero starting offset is useful when searching for another match
1100 in the same subject by calling pcre_exec() again after a previous suc-
1101 cess. Setting startoffset differs from just passing over a shortened
1102 string and setting PCRE_NOTBOL in the case of a pattern that begins
1103 with any kind of lookbehind. For example, consider the pattern
  \Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting
point to discover that it is preceded by a letter.
1118 If a non-zero starting offset is passed when the pattern is anchored,
1119 one attempt to match at the given offset is tried. This can only suc-
1120 ceed if the pattern does not require the match to be at the start of
1121 the subject.
1123 In general, a pattern matches a certain portion of the subject, and in
1124 addition, further substrings from the subject may be picked out by
1125 parts of the pattern. Following the usage in Jeffrey Friedl's book,
1126 this is called "capturing" in what follows, and the phrase "capturing
1127 subpattern" is used for a fragment of a pattern that picks out a sub-
1128 string. PCRE supports several other kinds of parenthesized subpattern
1129 that do not cause substrings to be captured.
1131 Captured substrings are returned to the caller via a vector of integer
1132 offsets whose address is passed in ovector. The number of elements in
1133 the vector is passed in ovecsize. The first two-thirds of the vector is
1134 used to pass back captured substrings, each substring using a pair of
1135 integers. The remaining third of the vector is used as workspace by
1136 pcre_exec() while matching capturing subpatterns, and is not available
1137 for passing back information. The length passed in ovecsize should
1138 always be a multiple of three. If it is not, it is rounded down.
When a match has been successful, information about captured substrings
is returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of a pair is set to the offset of the first character in a
substring, and the second is set to the offset of the first character
after the end of a substring. The first pair, ovector[0] and
ovector[1], identifies the portion of the subject string matched by the
entire pattern. The next pair is used for the first capturing
subpattern, and so on. The value returned by pcre_exec() is the number
of pairs that have been set. If there are no capturing subpatterns, the
return value from a successful match is 1, indicating that just the
first pair of offsets has been set.
1153 Some convenience functions are provided for extracting the captured
1154 substrings as separate strings. These are described in the following
1155 section.
It is possible for a capturing subpattern number n+1 to match some
part of the subject when subpattern n has not been used at all. For
example, if the string "abc" is matched against the pattern (a|(z))(bc)
subpatterns 1 and 3 are matched, but 2 is not. When this happens, both
offset values corresponding to the unused subpattern are set to -1.
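Reading the offset pairs directly might look like the sketch below, which
returns the length of a captured substring and treats a pair of -1 offsets
as "unset". The helper name capture_length() and its -1 result are
assumptions made for this illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: length of captured substring n, given the
   ovector filled in by a successful match and the number of pairs
   that were set (rc). Returns -1 if substring n is unset or out of
   range; -1 is this sketch's own convention. */
int capture_length(const int *ovector, int rc, int n)
{
    assert(ovector != NULL);
    if (n < 0 || n >= rc)
        return -1;
    if (ovector[2 * n] < 0 || ovector[2 * n + 1] < 0)
        return -1;              /* both offsets are -1 when unset */
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

For the "abc" example above, the vector would begin 0, 3, 0, 1, -1, -1,
1, 3 with a return value of 4, so substrings 0, 1 and 3 have lengths 3, 1
and 2, and substring 2 is reported as unset.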
1163 If a capturing subpattern is matched repeatedly, it is the last portion
1164 of the string that it matched that gets returned.
1166 If the vector is too small to hold all the captured substrings, it is
1167 used as far as possible (up to two-thirds of its length), and the func-
1168 tion returns a value of zero. In particular, if the substring offsets
1169 are not of interest, pcre_exec() may be called with ovector passed as
1170 NULL and ovecsize as zero. However, if the pattern contains back refer-
1171 ences and the ovector isn't big enough to remember the related sub-
1172 strings, PCRE has to get additional memory for use during matching.
1173 Thus it is usually advisable to supply an ovector.
1175 Note that pcre_info() can be used to find out how many capturing sub-
1176 patterns there are in a compiled pattern. The smallest size for ovector
1177 that will allow for n captured substrings, in addition to the offsets
1178 of the substring matched by the whole pattern, is (n+1)*3.
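The sizing rules for ovector reduce to two small calculations, sketched
here with hypothetical helper names (ovector_size_for() and usable_pairs()
are not library functions).

```c
#include <assert.h>

/* Smallest ovector size (in ints) that allows for n captured
   substrings plus the pair for the whole match. */
int ovector_size_for(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back in a vector of
   ovecsize ints: a size that is not a multiple of three is rounded
   down, and only two-thirds of the vector carries results. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For instance, the 30-element vector used in the pcre_exec() example
earlier has room for the whole match and nine captured substrings.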
If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

  PCRE_ERROR_NOMATCH        (-1)

The subject string did not match the pattern.

  PCRE_ERROR_NULL           (-2)

Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.

  PCRE_ERROR_BADOPTION      (-3)

An unrecognized bit was set in the options argument.

  PCRE_ERROR_BADMAGIC       (-4)

PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer. This is the error
it gives when the magic number isn't present.

  PCRE_ERROR_UNKNOWN_NODE   (-5)

While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.

  PCRE_ERROR_NOMEMORY       (-6)

If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is freed at the end of matching.

  PCRE_ERROR_NOSUBSTRING    (-7)

This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().

  PCRE_ERROR_MATCHLIMIT     (-8)

The recursion and backtracking limit, as specified by the match_limit
field in a pcre_extra structure (or defaulted) was reached. See the
description above.

  PCRE_ERROR_CALLOUT        (-9)

This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.

  PCRE_ERROR_BADUTF8        (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject.

  PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8
character.
  int pcre_copy_substring(const char *subject, int *ovector,
       int stringcount, int stringnumber, char *buffer,
       int buffersize);

  int pcre_get_substring(const char *subject, int *ovector,
       int stringcount, int stringnumber,
       const char **stringptr);

  int pcre_get_substring_list(const char *subject,
       int *ovector, int stringcount, const char ***listptr);
Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. These functions
identify substrings by number. The next section describes functions for
extracting named substrings. A substring that contains a binary zero is
correctly extracted and has a further zero added on the end, but the
result is not, of course, a C string.
1269 The first three arguments are the same for all three of these func-
1270 tions: subject is the subject string which has just been successfully
1271 matched, ovector is a pointer to the vector of integer offsets that was
1272 passed to pcre_exec(), and stringcount is the number of substrings that
1273 were captured by the match, including the substring that matched the
1274 entire regular expression. This is the value returned by pcre_exec if
1275 it is greater than zero. If pcre_exec() returned zero, indicating that
1276 it ran out of space in ovector, the value passed as stringcount should
1277 be the size of the vector divided by three.
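The rule for choosing stringcount can be written down as a small helper.
This is only a sketch: stringcount_for() is a hypothetical name, and
returning 0 for a negative result (no match, or an error) is a convention
chosen here rather than anything the library defines.

```c
#include <assert.h>

/* Hypothetical helper: value to pass as stringcount, given the return
   value of the match call (rc) and the ovector size in ints. Returns
   0 for a negative rc, which is this sketch's own convention. */
int stringcount_for(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0)
        return rc;              /* number of pairs that were set */
    if (rc == 0)
        return ovecsize / 3;    /* ovector was too small */
    return 0;
}
```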
The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length is
given by buffersize, while for pcre_get_substring() a new block of
memory is obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the string, not
including the terminating zero, or one of

  PCRE_ERROR_NOMEMORY       (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

  PCRE_ERROR_NOSUBSTRING    (-7)

There is no substring whose number is stringnumber.
The pcre_get_substring_list() function extracts all available
substrings and builds a list of pointers to them. All this is done in a
single block of memory which is obtained via pcre_malloc. The address
of the memory block is returned via listptr, which is also the start of
the list of string pointers. The end of the list is marked by a NULL
pointer. The yield of the function is zero if all went well, or

  PCRE_ERROR_NOMEMORY       (-6)

if the attempt to get the memory block failed.
1309 When any of these functions encounter a substring that is unset, which
1310 can happen when capturing subpattern number n+1 matches some part of
1311 the subject, but subpattern n has not been used at all, they return an
1312 empty string. This can be distinguished from a genuine zero-length sub-
1313 string by inspecting the appropriate offset in ovector, which is nega-
1314 tive for unset substrings.
1316 The two convenience functions pcre_free_substring() and
1317 pcre_free_substring_list() can be used to free the memory returned by a
1318 previous call of pcre_get_substring() or pcre_get_substring_list(),
1319 respectively. They do nothing more than call the function pointed to by
1320 pcre_free, which of course could be called directly from a C program.
1321 However, PCRE is used in some situations where it is linked via a spe-
1322 cial interface to another programming language which cannot use
1323 pcre_free directly; it is for these cases that the functions are pro-
1324 vided.
  int pcre_copy_named_substring(const pcre *code,
       const char *subject, int *ovector,
       int stringcount, const char *stringname,
       char *buffer, int buffersize);

  int pcre_get_stringnumber(const pcre *code,
       const char *name);

  int pcre_get_named_substring(const pcre *code,
       const char *subject, int *ovector,
       int stringcount, const char *stringname,
       const char **stringptr);
To extract a substring by name, you first have to find the associated
number. This can be done by calling pcre_get_stringnumber(). The first
argument is the compiled pattern, and the second is the name. For
example, for this pattern

  ab(?P<xxx>\d+)...

the number of the subpattern called "xxx" is 1. Given the number, you
can then extract the substring directly, or use one of the functions
described in the previous section. For convenience, there are also two
functions that do the whole job.
1354 Most of the arguments of pcre_copy_named_substring() and
1355 pcre_get_named_substring() are the same as those for the functions that
1356 extract by number, and so are not re-described here. There are just two
1357 differences.
1359 First, instead of a substring number, a substring name is given. Sec-
1360 ond, there is an extra argument, given at the start, which is a pointer
1361 to the compiled pattern. This is needed in order to gain access to the
1362 name-to-number translation table.
1364 These functions call pcre_get_stringnumber(), and if it succeeds, they
1365 then call pcre_copy_substring() or pcre_get_substring(), as appropri-
1366 ate.
1368 Last updated: 09 December 2003
1369 Copyright (c) 1997-2003 University of Cambridge.
1370 -----------------------------------------------------------------------------
1372 PCRE(3) PCRE(3)
1376 NAME
1377 PCRE - Perl-compatible regular expressions
1381 int (*pcre_callout)(pcre_callout_block *);
1383 PCRE provides a feature called "callout", which is a means of temporar-
1384 ily passing control to the caller of PCRE in the middle of pattern
1385 matching. The caller of PCRE provides an external function by putting
1386 its entry point in the global variable pcre_callout. By default, this
1387 variable contains NULL, which disables all calling out.
1389 Within a regular expression, (?C) indicates the points at which the
1390 external function is to be called. Different callout points can be
1391 identified by putting a number less than 256 after the letter C. The
1392 default value is zero. For example, this pattern has two callout
1393 points:
1395 (?C1)abc(?C2)def
1397 During matching, when PCRE reaches a callout point (and pcre_callout is
1398 set), the external function is called. Its only argument is a pointer
1399 to a pcre_callout block. This contains the following variables:
  int          version;
  int          callout_number;
  int         *offset_vector;
  const char  *subject;
  int          subject_length;
  int          start_match;
  int          current_position;
  int          capture_top;
  int          capture_last;
  void        *callout_data;
1412 The version field is an integer containing the version number of the
1413 block format. The current version is zero. The version number may
1414 change in future if additional fields are added, but the intention is
1415 never to remove any of the existing fields.
1417 The callout_number field contains the number of the callout, as com-
1418 piled into the pattern (that is, the number after ?C).
1420 The offset_vector field is a pointer to the vector of offsets that was
1421 passed by the caller to pcre_exec(). The contents can be inspected in
1422 order to extract substrings that have been matched so far, in the same
1423 way as for extracting substrings after a match has completed.
The subject and subject_length fields contain copies of the values that
were passed to pcre_exec().
1428 The start_match field contains the offset within the subject at which
1429 the current match attempt started. If the pattern is not anchored, the
1430 callout function may be called several times for different starting
1431 points.
1433 The current_position field contains the offset within the subject of
1434 the current match pointer.
1436 The capture_top field contains one more than the number of the highest
1437 numbered captured substring so far. If no substrings have been
1438 captured, the value of capture_top is one.
1440 The capture_last field contains the number of the most recently cap-
1441 tured substring.
The callout_data field contains a value that is passed to pcre_exec()
by the caller specifically so that it can be passed back in callouts.
It is passed in the callout_data field of the pcre_extra data
structure. If no such data was passed, the value of callout_data in a
pcre_callout block is NULL. There is a description of the pcre_extra
structure in the pcreapi documentation.
1454 The callout function returns an integer. If the value is zero, matching
1455 proceeds as normal. If the value is greater than zero, matching fails
1456 at the current point, but backtracking to test other possibilities goes
1457 ahead, just as if a lookahead assertion had failed. If the value is
1458 less than zero, the match is abandoned, and pcre_exec() returns the
1459 value.
1461 Negative values should normally be chosen from the set of
1462 PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
1463 dard "no match" failure. The error number PCRE_ERROR_CALLOUT is
1464 reserved for use by callout functions; it will never be used by PCRE
1465 itself.
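The three kinds of return value can be seen in a small callout function.
What follows is a sketch under stated assumptions, written so that it
compiles without pcre.h: sketch_callout_block is a local stand-in whose
field names follow the list given earlier, position_limit_callout() is a
hypothetical callout, and SKETCH_ERROR_CALLOUT is assumed to have the same
value as PCRE_ERROR_CALLOUT. Real code would use the type and macro from
the header and would store the function's address in pcre_callout.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed to equal PCRE_ERROR_CALLOUT; real code should use the macro
   from pcre.h instead of this local definition. */
#define SKETCH_ERROR_CALLOUT (-9)

/* Local stand-in for the callout block, for illustration only. */
typedef struct {
    int version;
    int callout_number;
    int *offset_vector;
    const char *subject;
    int subject_length;
    int start_match;
    int current_position;
    int capture_top;
    int capture_last;
    void *callout_data;
} sketch_callout_block;

/* Hypothetical callout: callout_data points to an int holding the
   highest subject offset the caller is willing to accept.
     0   carry on matching as normal
     1   fail at this point, but allow backtracking
    <0   abandon the match; the value becomes the match result */
int position_limit_callout(sketch_callout_block *block)
{
    const int *limit;
    assert(block != NULL);
    if (block->callout_data == NULL)
        return SKETCH_ERROR_CALLOUT;    /* no limit supplied: give up */
    limit = (const int *)block->callout_data;
    return block->current_position > *limit ? 1 : 0;
}
```

Returning a positive value here only rejects the current path, so a
shorter alternative that stays within the limit can still succeed.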
1467 Last updated: 21 January 2003
1468 Copyright (c) 1997-2003 University of Cambridge.
1469 -----------------------------------------------------------------------------
1471 PCRE(3) PCRE(3)
1475 NAME
1476 PCRE - Perl-compatible regular expressions
1480 This document describes the differences in the ways that PCRE and Perl
1481 handle regular expressions. The differences described here are with
1482 respect to Perl 5.8.
1484 1. PCRE does not have full UTF-8 support. Details of what it does have
1485 are given in the section on UTF-8 support in the main pcre page.
1487 2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
1488 permits them, but they do not mean what you might think. For example,
1489 (?!a){3} does not assert that the next three characters are not "a". It
1490 just asserts that the next character is not "a" three times.
1492 3. Capturing subpatterns that occur inside negative lookahead asser-
1493 tions are counted, but their entries in the offsets vector are never
1494 set. Perl sets its numerical variables from any such patterns that are
1495 matched before the assertion fails to match something (thereby succeed-
1496 ing), but only if the negative lookahead assertion contains just one
1497 branch.
1499 4. Though binary zero characters are supported in the subject string,
1500 they are not allowed in a pattern string because it is passed as a nor-
1501 mal C string, terminated by zero. The escape sequence "\0" can be used
1502 in the pattern to represent a binary zero.
1504 5. The following Perl escape sequences are not supported: \l, \u, \L,
1505 \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general
1506 string-handling and are not part of its pattern matching engine. If any
1507 of these are encountered by PCRE, an error is generated.
1509 6. PCRE does support the \Q...\E escape for quoting substrings. Charac-
1510 ters in between are treated as literals. This is slightly different
1511 from Perl in that $ and @ are also handled as literals inside the
1512 quotes. In Perl, they cause variable interpolation (but of course PCRE
1513 does not have variables). Note the following examples:
1515 Pattern            PCRE matches   Perl matches
1517 \Qabc$xyz\E        abc$xyz        abc followed by the
1518                                   contents of $xyz
1519 \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1520 \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1522 The \Q...\E sequence is recognized both inside and outside character
1523 classes.
1525 7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
1526 constructions. However, there is some experimental support for recur-
1527 sive patterns using the non-Perl items (?R), (?number) and (?P>name).
1528 Also, the PCRE "callout" feature allows an external function to be
1529 called during pattern matching.
1531 8. There are some differences that are concerned with the settings of
1532 captured strings when part of a pattern is repeated. For example,
1533 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
1534 unset, but in PCRE it is set to "b".
1536 9. PCRE provides some extensions to the Perl regular expression
1537 facilities:
1539 (a) Although lookbehind assertions must match fixed length strings,
1540 each alternative branch of a lookbehind assertion can match a different
1541 length of string. Perl requires them all to have the same length.
1543 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $
1544 meta-character matches only at the very end of the string.
1546 (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
1547 cial meaning is faulted.
1549 (d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti-
1550 fiers is inverted, that is, by default they are not greedy, but if fol-
1551 lowed by a question mark they are.
1553 (e) PCRE_ANCHORED can be used to force a pattern to be tried only at
1554 the first matching position in the subject string.
1556 (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
1557 TURE options for pcre_exec() have no Perl equivalents.
1559 (g) The (?R), (?number), and (?P>name) constructs allow for recursive
1560 pattern matching (Perl can do this using the (?p{code}) construct,
1561 which PCRE cannot support).
1563 (h) PCRE supports named capturing substrings, using the Python syntax.
1565 (i) PCRE supports the possessive quantifier "++" syntax, taken from
1566 Sun's Java package.
1568 (j) The (R) condition, for testing recursion, is a PCRE extension.
1570 (k) The callout facility is PCRE-specific.
1572 Last updated: 09 December 2003
1573 Copyright (c) 1997-2003 University of Cambridge.
1574 -----------------------------------------------------------------------------
1576 PCRE(3) PCRE(3)
1580 NAME
1581 PCRE - Perl-compatible regular expressions
1585 The syntax and semantics of the regular expressions supported by PCRE
1586 are described below. Regular expressions are also described in the Perl
1587 documentation and in a number of other books, some of which have copi-
1588 ous examples. Jeffrey Friedl's "Mastering Regular Expressions", pub-
1589 lished by O'Reilly, covers them in great detail. The description here
1590 is intended as reference documentation.
1592 The basic operation of PCRE is on strings of bytes. However, there is
1593 also support for UTF-8 character strings. To use this support you must
1594 build PCRE to include UTF-8 support, and then call pcre_compile() with
1595 the PCRE_UTF8 option. How this affects the pattern matching is men-
1596 tioned in several places below. There is also a summary of UTF-8 fea-
1597 tures in the section on UTF-8 support in the main pcre page.
1599 A regular expression is a pattern that is matched against a subject
1600 string from left to right. Most characters stand for themselves in a
1601 pattern, and match the corresponding characters in the subject. As a
1602 trivial example, the pattern
1604 The quick brown fox
1606 matches a portion of a subject string that is identical to itself. The
1607 power of regular expressions comes from the ability to include alterna-
1608 tives and repetitions in the pattern. These are encoded in the pattern
1609 by the use of meta-characters, which do not stand for themselves but
1610 instead are interpreted in some special way.
1612 There are two different sets of meta-characters: those that are recog-
1613 nized anywhere in the pattern except within square brackets, and those
1614 that are recognized in square brackets. Outside square brackets, the
1615 meta-characters are as follows:
1617 \      general escape character with several uses
1618 ^      assert start of string (or line, in multiline mode)
1619 $      assert end of string (or line, in multiline mode)
1620 .      match any character except newline (by default)
1621 [      start character class definition
1622 |      start of alternative branch
1623 (      start subpattern
1624 )      end subpattern
1625 ?      extends the meaning of (
1626        also 0 or 1 quantifier
1627        also quantifier minimizer
1628 *      0 or more quantifier
1629 +      1 or more quantifier
1630        also "possessive quantifier"
1631 {      start min/max quantifier
1633 Part of a pattern that is in square brackets is called a "character
1634 class". In a character class the only meta-characters are:
1636 \      general escape character
1637 ^      negate the class, but only if the first character
1638 -      indicates character range
1639 [      POSIX character class (only if followed by POSIX
1640        syntax)
1641 ]      terminates the character class
1643 The following sections describe the use of each of the meta-characters.
1648 The backslash character has several uses. Firstly, if it is followed by
1649 a non-alphameric character, it takes away any special meaning that
1650 character may have. This use of backslash as an escape character
1651 applies both inside and outside character classes.
1653 For example, if you want to match a * character, you write \* in the
1654 pattern. This escaping action applies whether or not the following
1655 character would otherwise be interpreted as a meta-character, so it is
1656 always safe to precede a non-alphameric with backslash to specify that
1657 it stands for itself. In particular, if you want to match a backslash,
1658 you write \\.
1660 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in
1661 the pattern (other than in a character class) and characters between a
1662 # outside a character class and the next newline character are ignored.
1663 An escaping backslash can be used to include a whitespace or # charac-
1664 ter as part of the pattern.
1666 If you want to remove the special meaning from a sequence of charac-
1667 ters, you can do so by putting them between \Q and \E. This is differ-
1668 ent from Perl in that $ and @ are handled as literals in \Q...\E
1669 sequences in PCRE, whereas in Perl, $ and @ cause variable interpola-
1670 tion. Note the following examples:
1672 Pattern            PCRE matches   Perl matches
1674 \Qabc$xyz\E        abc$xyz        abc followed by the
1675                                   contents of $xyz
1676 \Qabc\$xyz\E       abc\$xyz       abc\$xyz
1677 \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
1679 The \Q...\E sequence is recognized both inside and outside character
1680 classes.
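The quoting rule shown in the table can be emulated for an engine that lacks \Q...\E by rewriting each quoted span as an escaped literal. The following is a minimal sketch in Python, whose re module is not PCRE and does not recognize \Q...\E itself. The helper name expand_quoted is hypothetical. Treating an unterminated \Q as quoting to the end of the pattern is an assumption, and the sketch does not handle a \Q that is itself preceded by an escaping backslash.

```python
import re

def expand_quoted(pattern):
    # Hypothetical helper: rewrite every \Q...\E span as an escaped literal.
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])      # no further quoted span
            break
        out.append(pattern[pos:start])     # text before \Q is kept as-is
        end = pattern.find(r"\E", start + 2)
        if end == -1:
            # Assumption: an unterminated \Q quotes to the end of the pattern.
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)

# The three rows of the table, checked against the "PCRE matches" column.
rows = [
    (r"\Qabc$xyz\E", "abc$xyz"),
    (r"\Qabc\$xyz\E", r"abc\$xyz"),
    (r"\Qabc\E\$\Qxyz\E", "abc$xyz"),
]
results = [re.fullmatch(expand_quoted(p), s) is not None for p, s in rows]
```

Each entry in results is true when the rewritten pattern matches the literal text that the table gives for PCRE.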
1682 A second use of backslash provides a way of encoding non-printing char-
1683 acters in patterns in a visible manner. There is no restriction on the
1684 appearance of non-printing characters, apart from the binary zero that
1685 terminates a pattern, but when a pattern is being prepared by text
1686 editing, it is usually easier to use one of the following escape
1687 sequences than the binary character it represents:
1689 \a         alarm, that is, the BEL character (hex 07)
1690 \cx        "control-x", where x is any character
1691 \e         escape (hex 1B)
1692 \f         formfeed (hex 0C)
1693 \n         newline (hex 0A)
1694 \r         carriage return (hex 0D)
1695 \t         tab (hex 09)
1696 \ddd       character with octal code ddd, or backreference
1697 \xhh       character with hex code hh
1698 \x{hhh..}  character with hex code hhh... (UTF-8 mode only)
1700 The precise effect of \cx is as follows: if x is a lower case letter,
1701 it is converted to upper case. Then bit 6 of the character (hex 40) is
1702 inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;
1703 becomes hex 7B.
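The \cx rule is simple arithmetic on the character code. A minimal sketch in Python follows; the function name control_char is hypothetical and is not part of the PCRE API.

```python
def control_char(x):
    # Lower case letters are converted to upper case first.
    if "a" <= x <= "z":
        x = x.upper()
    # Then bit 6 of the character code (hex 40) is inverted.
    return ord(x) ^ 0x40

# The three cases mentioned in the text: \cz, \c{ and \c;
codes = {ch: hex(control_char(ch)) for ch in "z{;"}
```

Here codes maps "z" to 0x1a, "{" to 0x3b and ";" to 0x7b, in agreement with the values quoted in the text.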
1705 After \x, from zero to two hexadecimal digits are read (letters can be
1706 in upper or lower case). In UTF-8 mode, any number of hexadecimal dig-
1707 its may appear between \x{ and }, but the value of the character code
1708 must be less than 2**31 (that is, the maximum hexadecimal value is
1709 7FFFFFFF). If characters other than hexadecimal digits appear between
1710 \x{ and }, or if there is no terminating }, this form of escape is not
1711 recognized. Instead, the initial \x will be interpreted as a basic hex-
1712 adecimal escape, with no following digits, giving a byte whose value is
1713 zero.
1715 Characters whose value is less than 256 can be defined by either of the
1716 two syntaxes for \x when PCRE is in UTF-8 mode. There is no difference
1717 in the way they are handled. For example, \xdc is exactly the same as
1718 \x{dc}.
1720 After \0 up to two further octal digits are read. In both cases, if
1721 there are fewer than two digits, just those that are present are used.
1722 Thus the sequence \0\x\07 specifies two binary zeros followed by a BEL
1723 character (code value 7). Make sure you supply two digits after the
1724 initial zero if the character that follows is itself an octal digit.
1726 The handling of a backslash followed by a digit other than 0 is compli-
1727 cated. Outside a character class, PCRE reads it and any following dig-
1728 its as a decimal number. If the number is less than 10, or if there
1729 have been at least that many previous capturing left parentheses in the
1730 expression, the entire sequence is taken as a back reference. A
1731 description of how this works is given later, following the discussion
1732 of parenthesized subpatterns.
1734 Inside a character class, or if the decimal number is greater than 9
1735 and there have not been that many capturing subpatterns, PCRE re-reads
1736 up to three octal digits following the backslash, and generates a sin-
1737 gle byte from the least significant 8 bits of the value. Any subsequent
1738 digits stand for themselves. For example:
1740 \040   is another way of writing a space
1741 \40    is the same, provided there are fewer than 40
1742        previous capturing subpatterns
1743 \7     is always a back reference
1744 \11    might be a back reference, or another way of
1745        writing a tab
1746 \011   is always a tab
1747 \0113  is a tab followed by the character "3"
1748 \113   might be a back reference, otherwise the
1749        character with octal code 113
1750 \377   might be a back reference, otherwise
1751        the byte consisting entirely of 1 bits
1752 \81    is either a back reference, or a binary zero
1753        followed by the two characters "8" and "1"
1755 Note that octal values of 100 or greater must not be introduced by a
1756 leading zero, because no more than three octal digits are ever read.
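The decision procedure for a backslash followed by digits can be written out as a small function. This is a sketch of the rules as stated in the text, not PCRE's actual source. The function name, its arguments and the shape of its return value are all hypothetical.

```python
OCTAL_DIGITS = "01234567"

def interpret_escape_digits(digits, groups_so_far, in_class=False):
    # 'digits' is the run of decimal digits that follows the backslash;
    # 'groups_so_far' counts the capturing left parentheses seen before it.
    if not in_class and digits[0] != "0":
        number = int(digits)
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits and keep the low 8 bits.
    value = 0
    used = 0
    while used < len(digits) and used < 3 and digits[used] in OCTAL_DIGITS:
        value = value * 8 + int(digits[used])
        used += 1
    # Any digits that were not consumed stand for themselves.
    return ("byte", value & 0xFF, digits[used:])
```

For example, interpret_escape_digits("0113", 0) gives a byte with value 9 (tab) followed by the literal "3", and interpret_escape_digits("81", 0) gives a zero byte followed by the literal "81".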
1758 All the sequences that define a single byte value or a single UTF-8
1759 character (in UTF-8 mode) can be used both inside and outside character
1760 classes. In addition, inside a character class, the sequence \b is
1761 interpreted as the backspace character (hex 08). Outside a character
1762 class it has a different meaning (see below).
1764 The third use of backslash is for specifying generic character types:
1766 \d     any decimal digit
1767 \D     any character that is not a decimal digit
1768 \s     any whitespace character
1769 \S     any character that is not a whitespace character
1770 \w     any "word" character
1771 \W     any "non-word" character
1773 Each pair of escape sequences partitions the complete set of characters
1774 into two disjoint sets. Any given character matches one, and only one,
1775 of each pair.
1777 In UTF-8 mode, characters with values greater than 255 never match \d,
1778 \s, or \w, and always match \D, \S, and \W.
1780 For compatibility with Perl, \s does not match the VT character (code
1781 11). This makes it different from the POSIX "space" class. The \s
1782 characters are HT (9), LF (10), FF (12), CR (13), and space (32).
1784 A "word" character is any letter or digit or the underscore character,
1785 that is, any character which can be part of a Perl "word". The defini-
1786 tion of letters and digits is controlled by PCRE's character tables,
1787 and may vary if locale-specific matching is taking place (see "Locale
1788 support" in the pcreapi page). For example, in the "fr" (French)
1789 locale, some character codes greater than 128 are used for accented
1790 letters, and these are matched by \w.
1792 These character type sequences can appear both inside and outside char-
1793 acter classes. They each match one character of the appropriate type.
1794 If the current matching point is at the end of the subject string, all
1795 of them fail, since there is no character to match.
1797 The fourth use of backslash is for certain simple assertions. An asser-
1798 tion specifies a condition that has to be met at a particular point in
1799 a match, without consuming any characters from the subject string. The
1800 use of subpatterns for more complicated assertions is described below.
1801 The backslashed assertions are
1803 \b     matches at a word boundary
1804 \B     matches when not at a word boundary
1805 \A     matches at start of subject
1806 \Z     matches at end of subject or before newline at end
1807 \z     matches at end of subject
1808 \G     matches at first matching position in subject
1810 These assertions may not appear in character classes (but note that \b
1811 has a different meaning, namely the backspace character, inside a char-
1812 acter class).
1814 A word boundary is a position in the subject string where the current
1815 character and the previous character do not both match \w or \W (i.e.
1816 one matches \w and the other matches \W), or the start or end of the
1817 string if the first or last character matches \w, respectively.
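The definition of a word boundary can be checked position by position. The sketch below uses Python rather than PCRE. It assumes ASCII word characters (Python's \w with the re.ASCII flag) in place of PCRE's character tables, and the helper names are hypothetical.

```python
import re

WORD_CHAR = re.compile(r"\w", re.ASCII)

def is_word(ch):
    # None stands for "no character here" (before the start or after the end).
    return ch is not None and WORD_CHAR.fullmatch(ch) is not None

def is_word_boundary(subject, pos):
    previous = subject[pos - 1] if pos > 0 else None
    current = subject[pos] if pos < len(subject) else None
    # A boundary exists when exactly one of the two neighbours is a word character.
    return is_word(previous) != is_word(current)

subject = "hi there!"
by_definition = [p for p in range(len(subject) + 1) if is_word_boundary(subject, p)]
# Python's own \b assertion, used here only as a cross-check.
by_engine = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
```

Both lists contain the positions 0, 2, 3 and 8. There is no boundary at position 9, because the last character "!" is not a word character.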
1819 The \A, \Z, and \z assertions differ from the traditional circumflex
1820 and dollar (described below) in that they only ever match at the very
1821 start and end of the subject string, whatever options are set. Thus,
1822 they are independent of multiline mode.
1824 They are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options. If the
1825 startoffset argument of pcre_exec() is non-zero, indicating that match-
1826 ing is to start at a point other than the beginning of the subject, \A
1827 can never match. The difference between \Z and \z is that \Z matches
1828 before a newline that is the last character of the string as well as at
1829 the end of the string, whereas \z matches only at the end.
1831 The \G assertion is true only when the current matching position is at
1832 the start point of the match, as specified by the startoffset argument
1833 of pcre_exec(). It differs from \A when the value of startoffset is
1834 non-zero. By calling pcre_exec() multiple times with appropriate argu-
1835 ments, you can mimic Perl's /g option, and it is in this kind of imple-
1836 mentation where \G can be useful.
1838 Note, however, that PCRE's interpretation of \G, as the start of the
1839 current match, is subtly different from Perl's, which defines it as the
1840 end of the previous match. In Perl, these can be different when the
1841 previously matched string was empty. Because PCRE does just one match
1842 at a time, it cannot reproduce this behaviour.
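The repeated-call technique can be illustrated without PCRE itself. Python's re module has no \G, but Pattern.match(subject, pos) succeeds only when a match begins exactly at pos, which is comparable to calling pcre_exec() with a non-zero startoffset on a pattern that begins with \G. The sketch below relies on that analogy. The function name is hypothetical, and stopping at an empty match is a simplification, not Perl's rule.

```python
import re

def match_repeatedly(pattern, subject):
    compiled = re.compile(pattern)
    position = 0
    pieces = []
    while position < len(subject):
        # Only a match that starts exactly at 'position' is accepted.
        m = compiled.match(subject, position)
        if m is None or m.end() == position:
            break                      # no match here, or an empty match
        pieces.append(m.group())
        position = m.end()             # the next attempt starts where this one ended
    return pieces

contiguous = match_repeatedly(r"\d+,?", "1,22,333;4")
unanchored = re.findall(r"\d+,?", "1,22,333;4")
```

The contiguous scan stops at the semicolon and yields "1,", "22," and "333", whereas the unanchored search skips over the semicolon and also returns "4".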
1844 If all the alternatives of a pattern begin with \G, the expression is
1845 anchored to the starting match position, and the "anchored" flag is set
1846 in the compiled regular expression.
1851 Outside a character class, in the default matching mode, the circumflex
1852 character is an assertion which is true only if the current matching
1853 point is at the start of the subject string. If the startoffset argu-
1854 ment of pcre_exec() is non-zero, circumflex can never match if the
1855 PCRE_MULTILINE option is unset. Inside a character class, circumflex
1856 has an entirely different meaning (see below).
1858 Circumflex need not be the first character of the pattern if a number
1859 of alternatives are involved, but it should be the first thing in each
1860 alternative in which it appears if the pattern is ever to match that
1861 branch. If all possible alternatives start with a circumflex, that is,
1862 if the pattern is constrained to match only at the start of the sub-
1863 ject, it is said to be an "anchored" pattern. (There are also other
1864 constructs that can cause a pattern to be anchored.)
1866 A dollar character is an assertion which is true only if the current
1867 matching point is at the end of the subject string, or immediately
1868 before a newline character that is the last character in the string (by
1869 default). Dollar need not be the last character of the pattern if a
1870 number of alternatives are involved, but it should be the last item in
1871 any branch in which it appears. Dollar has no special meaning in a
1872 character class.
1874 The meaning of dollar can be changed so that it matches only at the
1875 very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
1876 compile time. This does not affect the \Z assertion.
1878 The meanings of the circumflex and dollar characters are changed if the
1879 PCRE_MULTILINE option is set. When this is the case, they match immedi-
1880 ately after and immediately before an internal newline character,
1881 respectively, in addition to matching at the start and end of the sub-
1882 ject string. For example, the pattern /^abc$/ matches the subject
1883 string "def\nabc" in multiline mode, but not otherwise. Consequently,
1884 patterns that are anchored in single line mode because all branches
1885 start with ^ are not anchored in multiline mode, and a match for cir-
1886 cumflex is possible when the startoffset argument of pcre_exec() is
1887 non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE
1888 is set.
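The effect of multiline mode on circumflex and dollar can be observed with the example from the text. Python's re module is used as a stand-in for PCRE; its default and re.MULTILINE behaviour for ^ and $ agrees with the description above for these inputs. Note that Python spells "very end of the subject only" as \Z, which plays the role of PCRE's \z or of dollar with PCRE_DOLLAR_ENDONLY.

```python
import re

subject = "def\nabc"

# Without multiline mode, ^ can match only at the start of the subject.
single_line = re.search(r"^abc$", subject)
# With multiline mode, ^ also matches immediately after the internal newline.
multi_line = re.search(r"^abc$", subject, re.MULTILINE)

# By default, dollar also matches just before a newline that ends the subject.
before_final_newline = re.search(r"abc$", "abc\n")
# Python's \Z matches only at the very end, so the trailing newline defeats it.
very_end_only = re.search(r"abc\Z", "abc\n")
```

Only multi_line and before_final_newline are successful matches; single_line and very_end_only are None.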
1890 Note that the sequences \A, \Z, and \z can be used to match the start
1891 and end of the subject in both modes, and if all branches of a pattern
1892 start with \A it is always anchored, whether PCRE_MULTILINE is set or
1893 not.
1898 Outside a character class, a dot in the pattern matches any one charac-
1899 ter in the subject, including a non-printing character, but not (by
1900 default) newline. In UTF-8 mode, a dot matches any UTF-8 character,
1901 which might be more than one byte long, except (by default) for new-
1902 line. If the PCRE_DOTALL option is set, dots match newlines as well.
1903 The handling of dot is entirely independent of the handling of circum-
1904 flex and dollar, the only relationship being that they both involve
1905 newline characters. Dot has no special meaning in a character class.
1910 Outside a character class, the escape sequence \C matches any one byte,
1911 both in and out of UTF-8 mode. Unlike a dot, it always matches a new-
1912 line. The feature is provided in Perl in order to match individual
1913 bytes in UTF-8 mode. Because it breaks up UTF-8 characters into indi-
1914 vidual bytes, what remains in the string may be a malformed UTF-8
1915 string. For this reason it is best avoided.
1917 PCRE does not allow \C to appear in lookbehind assertions (see below),
1918 because in UTF-8 mode it makes it impossible to calculate the length of
1919 the lookbehind.
1924 An opening square bracket introduces a character class, terminated by a
1925 closing square bracket. A closing square bracket on its own is not spe-
1926 cial. If a closing square bracket is required as a member of the class,
1927 it should be the first data character in the class (after an initial
1928 circumflex, if present) or escaped with a backslash.
1930 A character class matches a single character in the subject. In UTF-8
1931 mode, the character may occupy more than one byte. A matched character
1932 must be in the set of characters defined by the class, unless the first
1933 character in the class definition is a circumflex, in which case the
1934 subject character must not be in the set defined by the class. If a
1935 circumflex is actually required as a member of the class, ensure it is
1936 not the first character, or escape it with a backslash.
1938 For example, the character class [aeiou] matches any lower case vowel,
1939 while [^aeiou] matches any character that is not a lower case vowel.
1940 Note that a circumflex is just a convenient notation for specifying the
1941 characters which are in the class by enumerating those that are not. It
1942 is not an assertion: it still consumes a character from the subject
1943 string, and fails if the current pointer is at the end of the string.
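The difference between a negated class and an assertion shows up at the end of the subject. The following sketch uses Python's re module as a stand-in for PCRE; the two engines behave alike for these simple patterns.

```python
import re

# A negated class must consume a character, so it fails at the end of the subject.
negated_class = re.search(r"a[^b]", "a")
# A negative lookahead is an assertion: it consumes nothing and succeeds here.
negative_lookahead = re.search(r"a(?!b)", "a")
# When a character is available, the negated class consumes it.
consumed = re.search(r"a[^b]", "ac").group()
```

negated_class is None, negative_lookahead matches "a", and consumed is "ac".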
1945 In UTF-8 mode, characters with values greater than 255 can be included
1946 in a class as a literal string of bytes, or by using the \x{ escaping
1947 mechanism.
1949 When caseless matching is set, any letters in a class represent both
1950 their upper case and lower case versions, so for example, a caseless
1951 [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
1952 match "A", whereas a caseful version would. PCRE does not support the
1953 concept of case for characters with values greater than 255.
1955 The newline character is never treated in any special way in character
1956 classes, whatever the setting of the PCRE_DOTALL or PCRE_MULTILINE
1957 options is. A class such as [^a] will always match a newline.
1959 The minus (hyphen) character can be used to specify a range of charac-
1960 ters in a character class. For example, [d-m] matches any letter
1961 between d and m, inclusive. If a minus character is required in a
1962 class, it must be escaped with a backslash or appear in a position
1963 where it cannot be interpreted as indicating a range, typically as the
1964 first or last character in the class.
1966 It is not possible to have the literal character "]" as the end charac-
1967 ter of a range. A pattern such as [W-]46] is interpreted as a class of
1968 two characters ("W" and "-") followed by a literal string "46]", so it
1969 would match "W46]" or "-46]". However, if the "]" is escaped with a
1970 backslash it is interpreted as the end of range, so [W-\]46] is inter-
1971 preted as a single class containing a range followed by two separate
1972 characters. The octal or hexadecimal representation of "]" can also be
1973 used to end a range.
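The two readings of the closing bracket can be compared directly. Python's re module, used here in place of PCRE, parses both of these patterns in the way described above.

```python
import re

# A class containing "W" and "-", followed by the literal string "46]".
class_then_literal = re.compile(r"[W-]46]")
# A single class: the range W to ], plus the characters "4" and "6".
single_class = re.compile(r"[W-\]46]")

literal_hits = [s for s in ("W46]", "-46]", "X46]") if class_then_literal.fullmatch(s)]
class_hits = [c for c in "VWXYZ[\\]-456" if single_class.fullmatch(c)]
```

literal_hits contains "W46]" and "-46]" only. class_hits contains the characters W, X, Y, Z, [, \ and ], followed by 4 and 6; the hyphen, "V" and "5" are rejected.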
1975 Ranges operate in the collating sequence of character values. They can
1976 also be used for characters specified numerically, for example
1977 [\000-\037]. In UTF-8 mode, ranges can include characters whose values
1978 are greater than 255, for example [\x{100}-\x{2ff}].
1980 If a range that includes letters is used when caseless matching is set,
1981 it matches the letters in either case. For example, [W-c] is equivalent
1982 to [][\\^_`wxyzabc], matched caselessly, and if character tables for the
1983 "fr" locale are in use, [\xc8-\xcb] matches accented E characters in
1984 both cases.
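A caseless range can be probed one character at a time. Python's re.IGNORECASE is used below as a stand-in for PCRE_CASELESS; for ASCII letters the two should agree, although locale-dependent tables such as the "fr" example are not modelled here.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# Every character in the range W..c, plus the other case of each letter in it.
expected_members = "WXYZ[\\]^_`abcwxyzABC"
matched = "".join(ch for ch in expected_members if caseless_range.fullmatch(ch))
rejected = [ch for ch in "VvDd@" if caseless_range.fullmatch(ch) is None]
```

All of expected_members is matched, including the backslash, while "V", "v", "D", "d" and "@" are rejected.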
1986 The character types \d, \D, \s, \S, \w, and \W may also appear in a
1987 character class, and add the characters that they match to the class.
1988 For example, [\dABCDEF] matches any hexadecimal digit. A circumflex can
1989 conveniently be used with the upper case character types to specify a
1990 more restricted set of characters than the matching lower case type.
1991 For example, the class [^\W_] matches any letter or digit, but not
1992 underscore.
1994 All non-alphameric characters other than \, -, ^ (at the start) and the
1995 terminating ] are non-special in character classes, but it does no harm
1996 if they are escaped.
2001 Perl supports the POSIX notation for character classes, which uses
2002 names enclosed by [: and :] within the enclosing square brackets. PCRE
2003 also supports this notation. For example,
2005 [01[:alpha:]%]
2007 matches "0", "1", any alphabetic character, or "%". The supported class
2008 names are
2010 alnum    letters and digits
2011 alpha    letters
2012 ascii    character codes 0 - 127
2013 blank    space or tab only
2014 cntrl    control characters
2015 digit    decimal digits (same as \d)
2016 graph    printing characters, excluding space
2017 lower    lower case letters
2018 print    printing characters, including space
2019 punct    printing characters, excluding letters and digits
2020 space    white space (not quite the same as \s)
2021 upper    upper case letters
2022 word     "word" characters (same as \w)
2023 xdigit   hexadecimal digits
2025 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
2026 and space (32). Notice that this list includes the VT character (code
2027 11). This makes "space" different to \s, which does not include VT (for
2028 Perl compatibility).
2030 The name "word" is a Perl extension, and "blank" is a GNU extension
2031 from Perl 5.8. Another Perl extension is negation, which is indicated
2032 by a ^ character after the colon. For example,
2034 [12[:^digit:]]
2036 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
2037 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
2038 these are not supported, and an error is given if they are encountered.
2040 In UTF-8 mode, characters with values greater than 255 do not match any
2041 of the POSIX character classes.
2046 Vertical bar characters are used to separate alternative patterns. For
2047 example, the pattern
2049 gilbert|sullivan
2051 matches either "gilbert" or "sullivan". Any number of alternatives may
2052 appear, and an empty alternative is permitted (matching the empty
2053 string). The matching process tries each alternative in turn, from
2054 left to right, and the first one that succeeds is used. If the alterna-
2055 tives are within a subpattern (defined below), "succeeds" means match-
2056 ing the rest of the main pattern as well as the alternative in the sub-
2057 pattern.
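The left-to-right rule means that the order of alternatives matters when one is a prefix of another. The sketch below uses Python's re module as a stand-in for PCRE; both try alternatives in order and take the first that lets the whole pattern succeed.

```python
import re

# The first alternative succeeds, so the longer one is never tried.
first_wins = re.match(r"gil|gilbert", "gilbert").group()
# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the first alternative is abandoned when the following $ cannot match.
rest_must_match = re.match(r"(gil|gilbert)$", "gilbert").group(1)
# An empty alternative matches the empty string.
empty_alternative = re.fullmatch(r"gilbert|", "")
```

first_wins is "gil", rest_must_match is "gilbert", and empty_alternative is a successful match of the empty string.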
2062 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
2063 PCRE_EXTENDED options can be changed from within the pattern by a
2064 sequence of Perl option letters enclosed between "(?" and ")". The
2065 option letters are
2067 i  for PCRE_CASELESS
2068 m  for PCRE_MULTILINE
2069 s  for PCRE_DOTALL
2070 x  for PCRE_EXTENDED
2072 For example, (?im) sets caseless, multiline matching. It is also possi-
2073 ble to unset these options by preceding the letter with a hyphen, and a
2074 combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
2075 LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
2076 is also permitted. If a letter appears both before and after the
2077 hyphen, the option is unset.
2079 When an option change occurs at top level (that is, not inside subpat-
2080 tern parentheses), the change applies to the remainder of the pattern
2081 that follows. If the change is placed right at the start of a pattern,
2082 PCRE extracts it into the global options (and it will therefore show up
2083 in data extracted by the pcre_fullinfo() function).
2085 An option change within a subpattern affects only that part of the cur-
2086 rent pattern that follows it, so
2088 (a(?i)b)c
2090 matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
2091 used). By this means, options can be made to have different settings
2092 in different parts of the pattern. Any changes made in one alternative
2093 do carry on into subsequent branches within the same subpattern. For
2094 example,
2096 (a(?i)b|c)
2098 matches "ab", "aB", "c", and "C", even though when matching "C" the
2099 first branch is abandoned before the option setting. This is because
2100 the effects of option settings happen at compile time. There would be
2101 some very weird behaviour otherwise.
2103 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can be changed
2104 in the same way as the Perl-compatible options by using the characters
2105 U and X respectively. The (?X) flag setting is special in that it must
2106 always occur earlier in the pattern than any of the additional features
2107 it turns on, even when it is at top level. It is best put at the start.
2112 Subpatterns are delimited by parentheses (round brackets), which can be
2113 nested. Marking part of a pattern as a subpattern does two things:
2115 1. It localizes a set of alternatives. For example, the pattern
2117 cat(aract|erpillar|)
2119 matches one of the words "cat", "cataract", or "caterpillar". Without
2120 the parentheses, it would match "cataract", "erpillar" or the empty
2121 string.
2123 2. It sets up the subpattern as a capturing subpattern (as defined
2124 above). When the whole pattern matches, that portion of the subject
2125 string that matched the subpattern is passed back to the caller via the
2126 ovector argument of pcre_exec(). Opening parentheses are counted from
2127 left to right (starting from 1) to obtain the numbers of the capturing
2128 subpatterns.
2130 For example, if the string "the red king" is matched against the pat-
2131 tern
2133 the ((red|white) (king|queen))
2135 the captured substrings are "red king", "red", and "king", and are num-
2136 bered 1, 2, and 3, respectively.
2138 The fact that plain parentheses fulfil two functions is not always
2139 helpful. There are often times when a grouping subpattern is required
2140 without a capturing requirement. If an opening parenthesis is followed
2141 by a question mark and a colon, the subpattern does not do any captur-
2142 ing, and is not counted when computing the number of any subsequent
2143 capturing subpatterns. For example, if the string "the white queen" is
2144 matched against the pattern
2146 the ((?:red|white) (king|queen))
2148 the captured substrings are "white queen" and "queen", and are numbered
2149 1 and 2. The maximum number of capturing subpatterns is 65535, and the
2150 maximum depth of nesting of all subpatterns, both capturing and non-
2151 capturing, is 200.
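The numbering of capturing and non-capturing subpatterns can be confirmed with the two examples from the text. Python's re module is used in place of PCRE; it counts opening parentheses from left to right in the same way, and exposes the captured substrings through groups() rather than an ovector.

```python
import re

capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured_all = capturing.groups()

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captured_some = non_capturing.groups()
```

captured_all holds "red king", "red" and "king" as groups 1, 2 and 3. captured_some holds "white queen" and "queen" as groups 1 and 2.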
2153 As a convenient shorthand, if any option settings are required at the
2154 start of a non-capturing subpattern, the option letters may appear
2155 between the "?" and the ":". Thus the two patterns
2157 (?i:saturday|sunday)
2158 (?:(?i)saturday|sunday)
2160 match exactly the same set of strings. Because alternative branches are
2161 tried from left to right, and options are not reset until the end of
2162 the subpattern is reached, an option setting in one branch does affect
2163 subsequent branches, so the above patterns match "SUNDAY" as well as
2164 "Saturday".
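The shorthand form with option letters between "?" and ":" can be tried in Python, whose re module accepts the same (?i:...) syntax. This is a stand-in for PCRE, not PCRE itself. Recent Python versions reject an option setting such as (?i) in the middle of a pattern, so only the shorthand form is exercised here.

```python
import re

days = re.compile(r"(?i:saturday|sunday)")
day_hits = [s for s in ("Saturday", "SUNDAY", "Monday") if days.fullmatch(s)]

# The caseless setting applies only inside the subpattern that carries it.
scoped = re.compile(r"a(?i:b)c")
scoped_hits = [s for s in ("abc", "aBc", "aBC", "ABc") if scoped.fullmatch(s)]
```

day_hits contains "Saturday" and "SUNDAY"; scoped_hits contains "abc" and "aBc" only, because "a" and "c" lie outside the caseless subpattern.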
2169 Identifying capturing parentheses by number is simple, but it can be
2170 very hard to keep track of the numbers in complicated regular expres-
2171 sions. Furthermore, if an expression is modified, the numbers may
2172 change. To help with the difficulty, PCRE supports the naming of sub-
2173 patterns, something that Perl does not provide. The Python syntax
2174 (?P<name>...) is used. Names consist of alphanumeric characters and
2175 underscores, and must be unique within a pattern.
2177 Named capturing parentheses are still allocated numbers as well as
2178 names. The PCRE API provides function calls for extracting the name-to-
2179 number translation table from a compiled pattern. For further details
2180 see the pcreapi documentation.
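Since PCRE borrows this syntax from Python, Python's own re module can illustrate it. The groupindex attribute used below is Python's counterpart of a name-to-number table; it is not the PCRE API, for which the pcreapi documentation remains the reference.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Named groups still receive numbers, counted from the left.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2003-12-09")
by_name = m.group("month")
by_number = m.group(2)
```

name_to_number maps "year" to 1, "month" to 2 and "day" to 3, and the month can be fetched either by name or by number.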
2185 Repetition is specified by quantifiers, which can follow any of the
2186 following items:
2188 a literal data character
2189 the . metacharacter
2190 the \C escape sequence
2191 escapes such as \d that match single characters
2192 a character class
2193 a back reference (see next section)
2194 a parenthesized subpattern (unless it is an assertion)
2196 The general repetition quantifier specifies a minimum and maximum num-
2197 ber of permitted matches, by giving the two numbers in curly brackets
2198 (braces), separated by a comma. The numbers must be less than 65536,
2199 and the first must be less than or equal to the second. For example:
2201 z{2,4}
2203 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
2204 special character. If the second number is omitted, but the comma is
2205 present, there is no upper limit; if the second number and the comma
2206 are both omitted, the quantifier specifies an exact number of required
2207 matches. Thus
2209 [aeiou]{3,}
2211 matches at least 3 successive vowels, but may match many more, while
2213 \d{8}
2215 matches exactly 8 digits. An opening curly bracket that appears in a
2216 position where a quantifier is not allowed, or one that does not match
2217 the syntax of a quantifier, is taken as a literal character. For exam-
2218 ple, {,6} is not a quantifier, but a literal string of four characters.
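The quantifier forms above can be exercised with a short Python sketch. Python's re module is assumed here as a stand-in for PCRE; it agrees on these cases, but it is known to treat {,6} as a quantifier, so that particular example is left out. The helper name matches_fully is hypothetical.

```python
import re


def matches_fully(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None


# {2,4}: between two and four repetitions of the preceding item.
two_to_four = [matches_fully(r"z{2,4}", "z" * n) for n in range(1, 6)]

# {3,}: three or more; the search returns the longest run it can take.
vowel_run = re.search(r"[aeiou]{3,}", "queueing").group(0)

# {8}: exactly eight.
eight_digits = matches_fully(r"\d{8}", "20030203")

# A brace that does not form a quantifier is an ordinary character.
literal_brace = matches_fully(r"a{b}", "a{b}")
```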
2220 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to
2221 individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char-
2222 acters, each of which is represented by a two-byte sequence.
2224 The quantifier {0} is permitted, causing the expression to behave as if
2225 the previous item and the quantifier were not present.
2227 For convenience (and historical compatibility) the three most common
2228 quantifiers have single-character abbreviations:
2230 * is equivalent to {0,}
2231 + is equivalent to {1,}
2232 ? is equivalent to {0,1}
2234 It is possible to construct infinite loops by following a subpattern
2235 that can match no characters with a quantifier that has no upper limit,
2236 for example:
2238 (a?)*
2240 Earlier versions of Perl and PCRE used to give an error at compile time
2241 for such patterns. However, because there are cases where this can be
2242 useful, such patterns are now accepted, but if any repetition of the
2243 subpattern does in fact match no characters, the loop is forcibly bro-
2244 ken.
2246 By default, the quantifiers are "greedy", that is, they match as much
2247 as possible (up to the maximum number of permitted times), without
2248 causing the rest of the pattern to fail. The classic example of where
2249 this gives problems is in trying to match comments in C programs. These
2250 appear between the sequences /* and */ and within the sequence, indi-
2251 vidual * and / characters may appear. An attempt to match C comments by
2252 applying the pattern
2254 /\*.*\*/
2256 to the string
2258 /* first comment */ not comment /* second comment */
2260 fails, because it matches the entire string owing to the greediness of
2261 the .* item.
2263 However, if a quantifier is followed by a question mark, it ceases to
2264 be greedy, and instead matches the minimum number of times possible, so
2265 the pattern
2267 /\*.*?\*/
2269 does the right thing with the C comments. The meaning of the various
2270 quantifiers is not otherwise changed, just the preferred number of
2271 matches. Do not confuse this use of question mark with its use as a
2272 quantifier in its own right. Because it has two uses, it can sometimes
2273 appear doubled, as in
2275 \d??\d
2277 which matches one digit by preference, but can match two if that is the
2278 only way the rest of the pattern matches.
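The difference between greedy and minimizing repeats can be seen in a small Python sketch. Python's re module is used as an assumption-laden stand-in for PCRE; for these patterns it follows the same rules.

```python
import re

text = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject, so one match spans everything.
greedy = re.search(r"/\*.*\*/", text).group(0)

# Minimizing .*? stops at the first */ it can, giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", text)

# \d??\d prefers a single digit ...
one_digit = re.match(r"\d??\d", "12").group(0)

# ... but takes two when the rest of the match (here: reaching the end
# of the subject) cannot succeed otherwise.
two_digits = re.fullmatch(r"\d??\d", "12").group(0)
```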
2280 If the PCRE_UNGREEDY option is set (an option which is not available in
2281 Perl), the quantifiers are not greedy by default, but individual ones
2282 can be made greedy by following them with a question mark. In other
2283 words, it inverts the default behaviour.
2285 When a parenthesized subpattern is quantified with a minimum repeat
2286 count that is greater than 1 or with a limited maximum, more store is
2287 required for the compiled pattern, in proportion to the size of the
2288 minimum or maximum.
2290 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv-
2291 alent to Perl's /s) is set, thus allowing the . to match newlines, the
2292 pattern is implicitly anchored, because whatever follows will be tried
2293 against every character position in the subject string, so there is no
2294 point in retrying the overall match at any position after the first.
2295 PCRE normally treats such a pattern as though it were preceded by \A.
2297 In cases where it is known that the subject string contains no new-
2298 lines, it is worth setting PCRE_DOTALL in order to obtain this opti-
2299 mization, or alternatively using ^ to indicate anchoring explicitly.
2301 However, there is one situation where the optimization cannot be used.
2302 When .* is inside capturing parentheses that are the subject of a
2303 backreference elsewhere in the pattern, a match at the start may fail,
2304 and a later one succeed. Consider, for example:
2306 (.*)abc\1
2308 If the subject is "xyz123abc123" the match point is the fourth charac-
2309 ter. For this reason, such a pattern is not implicitly anchored.
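A short Python sketch (Python's re module, assumed to behave like PCRE for this pattern) shows why the match cannot be pinned to the start of the subject. Note that Python offsets count from zero, so offset 3 is the fourth character.

```python
import re

# Attempts at offsets 0, 1 and 2 fail because whatever (.*) captured
# before "abc" does not reappear after it; the attempt at offset 3 works.
match = re.search(r"(.*)abc\1", "xyz123abc123")

start_offset = match.start()   # zero-based position of the match
captured = match.group(1)      # text held by the capturing parentheses
```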
2311 When a capturing subpattern is repeated, the value captured is the sub-
2312 string that matched the final iteration. For example, after
2314 (tweedle[dume]{3}\s*)+
2316 has matched "tweedledum tweedledee" the value of the captured substring
2317 is "tweedledee". However, if there are nested capturing subpatterns,
2318 the corresponding captured values may have been set in previous itera-
2319 tions. For example, after
2321 /(a|(b))+/
2323 matches "aba" the value of the second captured substring is "b".
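Both capture rules can be checked with a brief Python sketch. Python's re module is assumed here as a stand-in; it retains values from earlier iterations in the same way as described above, although not every regular expression engine does.

```python
import re

# The outer group is overwritten on every iteration, so the last one wins.
tweedle = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")

# Group 2 is set on the second iteration ("b") and is left untouched by
# the third iteration, which matches through the "a" alternative.
nested = re.fullmatch(r"(a|(b))+", "aba")
```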
2328 With both maximizing and minimizing repetition, failure of what follows
2329 normally causes the repeated item to be re-evaluated to see if a dif-
2330 ferent number of repeats allows the rest of the pattern to match. Some-
2331 times it is useful to prevent this, either to change the nature of the
2332 match, or to cause it to fail earlier than it otherwise might, when the
2333 author of the pattern knows there is no point in carrying on.
2335 Consider, for example, the pattern \d+foo when applied to the subject
2336 line
2338 123456bar
2340 After matching all 6 digits and then failing to match "foo", the normal
2341 action of the matcher is to try again with only 5 digits matching the
2342 \d+ item, and then with 4, and so on, before ultimately failing.
2343 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
2344 the means for specifying that once a subpattern has matched, it is not
2345 to be re-evaluated in this way.
2347 If we use atomic grouping for the previous example, the matcher would
2348 give up immediately on failing to match "foo" the first time. The nota-
2349 tion is a kind of special parenthesis, starting with (?> as in this
2350 example:
2352 (?>\d+)foo
2354 This kind of parenthesis "locks up" the part of the pattern it con-
2355 tains once it has matched, and a failure further into the pattern is
2356 prevented from backtracking into it. Backtracking past it to previous
2357 items, however, works as normal.
2359 An alternative description is that a subpattern of this type matches
2360 the string of characters that an identical standalone pattern would
2361 match, if anchored at the current point in the subject string.
2363 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
2364 such as the above example can be thought of as a maximizing repeat that
2365 must swallow everything it can. So, while both \d+ and \d+? are pre-
2366 pared to adjust the number of digits they match in order to make the
2367 rest of the pattern match, (?>\d+) can only match an entire sequence of
2368 digits.
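The "swallow everything" behaviour can be sketched in Python. Python's re module only gained (?>...) in version 3.11, so this sketch swaps in a well-known emulation instead: a lookahead captures the run of digits, and a back reference then consumes exactly that run. Lookaheads are not re-entered on backtracking, which gives the same locking effect. The group name run is an arbitrary choice.

```python
import re

subject = "123456"

# Ordinary repeat: \d+ first takes all six digits, then gives one back
# so that the literal 6 can match.
backtracking = re.match(r"\d+6", subject)

# Emulated atomic group: the digit run is fixed at "123456", nothing is
# given back, and the literal 6 has nothing left to match.
atomic_like = re.match(r"(?=(?P<run>\d+))(?P=run)6", subject)
```

A named group is used rather than \1 because a digit follows the reference, and \16 would be read as a reference to group 16.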
2370 Atomic groups in general can of course contain arbitrarily complicated
2371 subpatterns, and can be nested. However, when the subpattern for an
2372 atomic group is just a single repeated item, as in the example above, a
2373 simpler notation, called a "possessive quantifier" can be used. This
2374 consists of an additional + character following a quantifier. Using
2375 this notation, the previous example can be rewritten as
2377 \d++foo
2379 Possessive quantifiers are always greedy; the setting of the
2380 PCRE_UNGREEDY option is ignored. They are a convenient notation for the
2381 simpler forms of atomic group. However, there is no difference in the
2382 meaning or processing of a possessive quantifier and the equivalent
2383 atomic group.
2385 The possessive quantifier syntax is an extension to the Perl syntax. It
2386 originates in Sun's Java package.
2388 When a pattern contains an unlimited repeat inside a subpattern that
2389 can itself be repeated an unlimited number of times, the use of an
2390 atomic group is the only way to avoid some failing matches taking a
2391 very long time indeed. The pattern
2393 (\D+|<\d+>)*[!?]
2395 matches an unlimited number of substrings that either consist of non-
2396 digits, or digits enclosed in <>, followed by either ! or ?. When it
2397 matches, it runs quickly. However, if it is applied to
2399 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2401 it takes a long time before reporting failure. This is because the
2402 string can be divided between the two repeats in a large number of
2403 ways, and all have to be tried. (The example used [!?] rather than a
2404 single character at the end, because both PCRE and Perl have an opti-
2405 mization that allows for fast failure when a single character is used.
2406 They remember the last single character that is required for a match,
2407 and fail early if it is not present in the string.) If the pattern is
2408 changed to
2410 ((?>\D+)|<\d+>)*[!?]
2412 sequences of non-digits cannot be broken, and failure happens quickly.
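The following Python sketch gives a feel for the difference. It is an illustration under several assumptions: Python's re module stands in for PCRE, the atomic group is emulated with a lookahead plus a named back reference (native support only arrived in Python 3.11), and the subject is kept to 16 characters so that the slow case still finishes promptly. The function name timed_match is made up.

```python
import re
import time

SUBJECT = "a" * 16  # assumed length; each extra character roughly doubles the work


def timed_match(pattern):
    """Return (match_or_None, elapsed_seconds) for pattern against SUBJECT."""
    started = time.perf_counter()
    result = re.match(pattern, SUBJECT)
    return result, time.perf_counter() - started


# Nested unlimited repeats: every way of splitting the run is tried.
plain_result, plain_seconds = timed_match(r"(\D+|<\d+>)*[!?]")

# Emulated atomic group: the run of non-digits cannot be split up.
atomic_result, atomic_seconds = timed_match(
    r"(?:(?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]"
)
```

Both calls report no match; only the elapsed times differ, and the actual figures depend on the machine and the engine version.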
2417 Outside a character class, a backslash followed by a digit greater than
2418 0 (and possibly further digits) is a back reference to a capturing sub-
2419 pattern earlier (that is, to its left) in the pattern, provided there
2420 have been that many previous capturing left parentheses.
2422 However, if the decimal number following the backslash is less than 10,
2423 it is always taken as a back reference, and causes an error only if
2424 there are not that many capturing left parentheses in the entire pat-
2425 tern. In other words, the parentheses that are referenced need not be
2426 to the left of the reference for numbers less than 10. See the section
2427 entitled "Backslash" above for further details of the handling of dig-
2428 its following a backslash.
2430 A back reference matches whatever actually matched the capturing sub-
2431 pattern in the current subject string, rather than anything matching
2432 the subpattern itself (see "Subpatterns as subroutines" below for a way
2433 of doing that). So the pattern
2435 (sens|respons)e and \1ibility
2437 matches "sense and sensibility" and "response and responsibility", but
2438 not "sense and responsibility". If caseful matching is in force at the
2439 time of the back reference, the case of letters is relevant. For exam-
2440 ple,
2442 ((?i)rah)\s+\1
2444 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
2445 original capturing subpattern is matched caselessly.
2447 Back references to named subpatterns use the Python syntax (?P=name).
2448 We could rewrite the above example as follows:
2450 (?P<p1>(?i)rah)\s+(?P=p1)
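Both kinds of back reference can be tried in Python, whose re module is the origin of the (?P=name) syntax. This is a sketch rather than PCRE itself; in particular, recent Python versions do not accept (?i) in mid-pattern, so the scoped form (?i:rah) is assumed to be an acceptable substitute.

```python
import re

# Numbered back reference: \1 must repeat the text that group 1 matched.
by_number = re.compile(r"(sens|respons)e and \1ibility")

# Named back reference: the group is matched caselessly, but the
# reference lies outside the (?i:...) scope and so compares exactly.
by_name = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")
```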
2452 There may be more than one back reference to the same subpattern. If a
2453 subpattern has not actually been used in a particular match, any back
2454 references to it always fail. For example, the pattern
2456 (a|(bc))\2
2458 always fails if it starts to match "a" rather than "bc". Because there
2459 may be many capturing parentheses in a pattern, all digits following
2460 the backslash are taken as part of a potential back reference number.
2461 If the pattern continues with a digit character, some delimiter must be
2462 used to terminate the back reference. If the PCRE_EXTENDED option is
2463 set, this can be whitespace. Otherwise an empty comment can be used.
2465 A back reference that occurs inside the parentheses to which it refers
2466 fails when the subpattern is first used, so, for example, (a\1) never
2467 matches. However, such references can be useful inside repeated sub-
2468 patterns. For example, the pattern
2470 (a|b\1)+
2472 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
2473 ation of the subpattern, the back reference matches the character
2474 string corresponding to the previous iteration. In order for this to
2475 work, the pattern must be such that the first iteration does not need
2476 to match the back reference. This can be done using alternation, as in
2477 the example above, or by a quantifier with a minimum of zero.
2482 An assertion is a test on the characters following or preceding the
2483 current matching point that does not actually consume any characters.
2484 The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are
2485 described above. More complicated assertions are coded as subpatterns.
2486 There are two kinds: those that look ahead of the current position in
2487 the subject string, and those that look behind it.
2489 An assertion subpattern is matched in the normal way, except that it
2490 does not cause the current matching position to be changed. Lookahead
2491 assertions start with (?= for positive assertions and (?! for negative
2492 assertions. For example,
2494 \w+(?=;)
2496 matches a word followed by a semicolon, but does not include the semi-
2497 colon in the match, and
2499 foo(?!bar)
2501 matches any occurrence of "foo" that is not followed by "bar". Note
2502 that the apparently similar pattern
2504 (?!foo)bar
2506 does not find an occurrence of "bar" that is preceded by something
2507 other than "foo"; it finds any occurrence of "bar" whatsoever, because
2508 the assertion (?!foo) is always true when the next three characters are
2509 "bar". A lookbehind assertion is needed to achieve this effect.
2511 If you want to force a matching failure at some point in a pattern, the
2512 most convenient way to do it is with (?!) because an empty string
2513 always matches, so an assertion that requires there not to be an empty
2514 string must always fail.
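The lookahead examples above can be checked with a few lines of Python. The re module is assumed to agree with PCRE on these constructs; offsets are zero-based, and the subject strings are invented for the sketch.

```python
import re

# Positive lookahead: the semicolon is required but not part of the match.
words_before_semicolon = re.findall(r"\w+(?=;)", "alpha; beta gamma;")

# Negative lookahead: only the "foo" that is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar does not look backwards: every "bar" is found, including
# the one that is preceded by "foo".
any_bar = [m.start() for m in re.finditer(r"(?!foo)bar", "foobar xbar")]

# (?!) can never succeed, so the pattern fails wherever it is tried.
forced_failure = re.search(r"a(?!)", "aaa")
```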
2516 Lookbehind assertions start with (?<= for positive assertions and (?<!
2517 for negative assertions. For example,
2519 (?<!foo)bar
2521 does find an occurrence of "bar" that is not preceded by "foo". The
2522 contents of a lookbehind assertion are restricted such that all the
2523 strings it matches must have a fixed length. However, if there are sev-
2524 eral alternatives, they do not all have to have the same fixed length.
2525 Thus
2527 (?<=bullock|donkey)
2529 is permitted, but
2531 (?<!dogs?|cats?)
2533 causes an error at compile time. Branches that match different length
2534 strings are permitted only at the top level of a lookbehind assertion.
2535 This is an extension compared with Perl (at least for 5.8), which
2536 requires all branches to match the same length of string. An assertion
2537 such as
2539 (?<=ab(c|de))
2541 is not permitted, because its single top-level branch can match two
2542 different lengths, but it is acceptable if rewritten to use two top-
2543 level branches:
2545 (?<=abc|abde)
2547 The implementation of lookbehind assertions is, for each alternative,
2548 to temporarily move the current position back by the fixed width and
2549 then try to match. If there are insufficient characters before the cur-
2550 rent position, the match is deemed to fail.
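A Python sketch of lookbehind follows. Python's re module is stricter than PCRE here and generally rejects a lookbehind whose alternatives differ in length, so the sketch gives each alternative its own lookbehind. That is an emulation, but it mirrors the per-alternative procedure just described. The word " cart" is an invented continuation.

```python
import re

# Negative lookbehind: only the "bar" that is not preceded by "foo".
not_after_foo = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar xbar")]

# One fixed-width lookbehind per alternative (7 and 6 characters back).
after_animal = re.compile(r"(?:(?<=bullock)|(?<=donkey)) cart")
```

When fewer characters precede the current position than the lookbehind needs, as in the subject " cart", the assertion simply fails.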
2552 PCRE does not allow the \C escape (which matches a single byte in UTF-8
2553 mode) to appear in lookbehind assertions, because it makes it impossi-
2554 ble to calculate the length of the lookbehind.
2556 Atomic groups can be used in conjunction with lookbehind assertions to
2557 specify efficient matching at the end of the subject string. Consider a
2558 simple pattern such as
2560 abcd$
2562 when applied to a long string that does not match. Because matching
2563 proceeds from left to right, PCRE will look for each "a" in the subject
2564 and then see if what follows matches the rest of the pattern. If the
2565 pattern is specified as
2567 ^.*abcd$
2569 the initial .* matches the entire string at first, but when this fails
2570 (because there is no following "a"), it backtracks to match all but the
2571 last character, then all but the last two characters, and so on. Once
2572 again the search for "a" covers the entire string, from right to left,
2573 so we are no better off. However, if the pattern is written as
2575 ^(?>.*)(?<=abcd)
2577 or, equivalently,
2579 ^.*+(?<=abcd)
2581 there can be no backtracking for the .* item; it can match only the
2582 entire string. The subsequent lookbehind assertion does a single test
2583 on the last four characters. If it fails, the match fails immediately.
2584 For long strings, this approach makes a significant difference to the
2585 processing time.
2587 Several assertions (of any sort) may occur in succession. For example,
2589 (?<=\d{3})(?<!999)foo
2591 matches "foo" preceded by three digits that are not "999". Notice that
2592 each of the assertions is applied independently at the same point in
2593 the subject string. First there is a check that the previous three
2594 characters are all digits, and then there is a check that the same
2595 three characters are not "999". This pattern does not match "foo" pre-
2596 ceded by six characters, the first three of which are digits and the
2597 last three of which are not "999". For example, it doesn't match
2598 "123abcfoo". A pattern to do that is
2600 (?<=\d{3}...)(?<!999)foo
2602 This time the first assertion looks at the preceding six characters,
2603 checking that the first three are digits, and then the second assertion
2604 checks that the preceding three characters are not "999".
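The two patterns can be compared directly in Python, assuming its re module as a stand-in for PCRE. Both lookbehinds have a fixed width, so Python accepts them unchanged.

```python
import re

# Both assertions inspect the same three characters before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion now reaches six characters back; the second still
# inspects only the three characters immediately before "foo".
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")
```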
2606 Assertions can be nested in any combination. For example,
2608 (?<=(?<!foo)bar)baz
2610 matches an occurrence of "baz" that is preceded by "bar" which in turn
2611 is not preceded by "foo", while
2613 (?<=\d{3}(?!999)...)foo
2615 is another pattern which matches "foo" preceded by three digits and any
2616 three characters that are not "999".
2618 Assertion subpatterns are not capturing subpatterns, and may not be
2619 repeated, because it makes no sense to assert the same thing several
2620 times. If any kind of assertion contains capturing subpatterns within
2621 it, these are counted for the purposes of numbering the capturing sub-
2622 patterns in the whole pattern. However, substring capturing is carried
2623 out only for positive assertions, because it does not make sense for
2624 negative assertions.
2629 It is possible to cause the matching process to obey a subpattern con-
2630 ditionally or to choose between two alternative subpatterns, depending
2631 on the result of an assertion, or whether a previous capturing
2632 subpattern matched or not. The two possible forms of conditional sub-
2633 pattern are
2635 (?(condition)yes-pattern)
2636 (?(condition)yes-pattern|no-pattern)
2638 If the condition is satisfied, the yes-pattern is used; otherwise the
2639 no-pattern (if present) is used. If there are more than two alterna-
2640 tives in the subpattern, a compile-time error occurs.
2642 There are three kinds of condition. If the text between the parentheses
2643 consists of a sequence of digits, the condition is satisfied if the
2644 capturing subpattern of that number has previously matched. The number
2645 must be greater than zero. Consider the following pattern, which con-
2646 tains non-significant white space to make it more readable (assume the
2647 PCRE_EXTENDED option) and to divide it into three parts for ease of
2648 discussion:
2650 ( \( )? [^()]+ (?(1) \) )
2652 The first part matches an optional opening parenthesis, and if that
2653 character is present, sets it as the first captured substring. The sec-
2654 ond part matches one or more characters that are not parentheses. The
2655 third part is a conditional subpattern that tests whether the first set
2656 of parentheses matched or not. If they did, that is, if the subject started
2657 with an opening parenthesis, the condition is true, and so the yes-pat-
2658 tern is executed and a closing parenthesis is required. Otherwise,
2659 since no-pattern is not present, the subpattern matches nothing. In
2660 other words, this pattern matches a sequence of non-parentheses,
2661 optionally enclosed in parentheses.
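The same pattern can be tried in Python, whose re module supports numeric conditions of this form. re.VERBOSE is assumed to play the part of PCRE_EXTENDED.

```python
import re

# ( \( )?      optional opening parenthesis, captured as group 1
# [^()]+       one or more characters that are not parentheses
# (?(1) \) )   a closing parenthesis is required only if group 1 matched
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)
```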
2663 If the condition is the string (R), it is satisfied if a recursive call
2664 to the pattern or subpattern has been made. At "top level", the condi-
2665 tion is false. This is a PCRE extension. Recursive patterns are
2666 described in the next section.
2668 If the condition is not a sequence of digits or (R), it must be an
2669 assertion. This may be a positive or negative lookahead or lookbehind
2670 assertion. Consider this pattern, again containing non-significant
2671 white space, and with the two alternatives on the second line:
2673 (?(?=[^a-z]*[a-z])
2674 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
2676 The condition is a positive lookahead assertion that matches an
2677 optional sequence of non-letters followed by a letter. In other words,
2678 it tests for the presence of at least one letter in the subject. If a
2679 letter is found, the subject is matched against the first alternative;
2680 otherwise it is matched against the second. This pattern matches
2681 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
2682 letters and dd are digits.
2687 The sequence (?# marks the start of a comment which continues up to the
2688 next closing parenthesis. Nested parentheses are not permitted. The
2689 characters that make up a comment play no part in the pattern matching
2690 at all.
2692 If the PCRE_EXTENDED option is set, an unescaped # character outside a
2693 character class introduces a comment that continues up to the next new-
2694 line character in the pattern.
2699 Consider the problem of matching a string in parentheses, allowing for
2700 unlimited nested parentheses. Without the use of recursion, the best
2701 that can be done is to use a pattern that matches up to some fixed
2702 depth of nesting. It is not possible to handle an arbitrary nesting
2703 depth. Perl has provided an experimental facility that allows regular
2704 expressions to recurse (amongst other things). It does this by interpo-
2705 lating Perl code in the expression at run time, and the code can refer
2706 to the expression itself. A Perl pattern to solve the parentheses prob-
2707 lem can be created like this:
2709 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
2711 The (?p{...}) item interpolates Perl code at run time, and in this case
2712 refers recursively to the pattern in which it appears. Obviously, PCRE
2713 cannot support the interpolation of Perl code. Instead, it supports
2714 some special syntax for recursion of the entire pattern, and also for
2715 individual subpattern recursion.
2717 The special item that consists of (? followed by a number greater than
2718 zero and a closing parenthesis is a recursive call of the subpattern of
2719 the given number, provided that it occurs inside that subpattern. (If
2720 not, it is a "subroutine" call, which is described in the next sec-
2721 tion.) The special item (?R) is a recursive call of the entire regular
2722 expression.
2724 For example, this PCRE pattern solves the nested parentheses problem
2725 (assume the PCRE_EXTENDED option is set so that white space is
2726 ignored):
2728 \( ( (?>[^()]+) | (?R) )* \)
2730 First it matches an opening parenthesis. Then it matches any number of
2731 substrings which can either be a sequence of non-parentheses, or a
2732 recursive match of the pattern itself (that is, a correctly parenthe-
2733 sized substring). Finally there is a closing parenthesis.
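Python's re module has no (?R) item, so the following sketch does not run the pattern. Instead it is a hand-written function with the same three-part shape, which may help in seeing what the recursive call does. The name match_parens and its return convention are invented for the illustration.

```python
def match_parens(text, pos=0):
    """Return the offset just past a balanced '(...)' at pos, or -1."""
    if pos >= len(text) or text[pos] != "(":
        return -1                      # \( : an opening parenthesis is required
    pos += 1
    while pos < len(text):
        if text[pos] == ")":
            return pos + 1             # \) : the closing parenthesis
        if text[pos] == "(":
            end = match_parens(text, pos)   # plays the part of (?R)
            if end == -1:
                return -1
            pos = end
        else:
            # (?>[^()]+) : take the whole run of non-parentheses at once
            while pos < len(text) and text[pos] not in "()":
                pos += 1
    return -1                          # ran out of text before the final \)
```

Like an anchored match, the function only looks at the position it is given; it does not search along the subject for a later starting point.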
2735 If this were part of a larger pattern, you would not want to recurse
2736 the entire pattern, so instead you could use this:
2738 ( \( ( (?>[^()]+) | (?1) )* \) )
2740 We have put the pattern into parentheses, and caused the recursion to
2741 refer to them instead of the whole pattern. In a larger pattern, keep-
2742 ing track of parenthesis numbers can be tricky. It may be more conve-
2743 nient to use named parentheses instead. For this, PCRE uses (?P>name),
2744 which is an extension to the Python syntax that PCRE uses for named
2745 parentheses (Perl does not provide named parentheses). We could rewrite
2746 the above example as follows:
2748 (?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )
2750 This particular example pattern contains nested unlimited repeats, and
2751 so the use of atomic grouping for matching strings of non-parentheses
2752 is important when applying the pattern to strings that do not match.
2753 For example, when this pattern is applied to
2755 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2757 it yields "no match" quickly. However, if atomic grouping is not used,
2758 the match runs for a very long time indeed because there are so many
2759 different ways the + and * repeats can carve up the subject, and all
2760 have to be tested before failure can be reported.
2762 At the end of a match, the values set for any capturing subpatterns are
2763 those from the outermost level of the recursion at which the subpattern
2764 value is set. If you want to obtain intermediate values, a callout
2765 function can be used (see below and the pcrecallout documentation). If
2766 the pattern above is matched against
2768 (ab(cd)ef)
2770 the value for the capturing parentheses is "ef", which is the last
2771 value taken on at the top level. If additional parentheses are added,
2772 giving
2774 \( ( ( (?>[^()]+) | (?R) )* ) \)
2775    ^                        ^
2776    ^                        ^
2778 the string they capture is "ab(cd)ef", the contents of the top level
2779 parentheses. If there are more than 15 capturing parentheses in a pat-
2780 tern, PCRE has to obtain extra memory to store data during a recursion,
2781 which it does by using pcre_malloc, freeing it via pcre_free after-
2782 wards. If no memory can be obtained, the match fails with the
2783 PCRE_ERROR_NOMEMORY error.
2785 Do not confuse the (?R) item with the condition (R), which tests for
2786 recursion. Consider this pattern, which matches text in angle brack-
2787 ets, allowing for arbitrary nesting. Only digits are allowed in nested
2788 brackets (that is, when recursing), whereas any characters are permit-
2789 ted at the outer level.
2791 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
2793 In this pattern, (?(R) is the start of a conditional subpattern, with
2794 two different alternatives for the recursive and non-recursive cases.
2795 The (?R) item is the actual recursive call.
2800 If the syntax for a recursive subpattern reference (either by number or
2801 by name) is used outside the parentheses to which it refers, it oper-
2802 ates like a subroutine in a programming language. An earlier example
2803 pointed out that the pattern
2805 (sens|respons)e and \1ibility
2807 matches "sense and sensibility" and "response and responsibility", but
2808 not "sense and responsibility". If instead the pattern
2810 (sens|respons)e and (?1)ibility
2812 is used, it does match "sense and responsibility" as well as the other
2813 two strings. Such references must, however, follow the subpattern to
2814 which they refer.
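Python's re module has no (?1) item, so this sketch approximates a subroutine call by pasting the same pattern text in twice. That is an emulation under stated assumptions, not the PCRE mechanism, but it shows the difference in meaning: the pattern is reused, not the text that was matched. The variable name stem is arbitrary.

```python
import re

stem = r"(?:sens|respons)"

# Back reference: the second word must repeat the text matched earlier.
back_reference = re.compile(r"(sens|respons)e and \1ibility")

# Subroutine-like reuse: the alternation is applied afresh each time.
subroutine_like = re.compile(rf"{stem}e and {stem}ibility")
```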
2819 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
2820 Perl code to be obeyed in the middle of matching a regular expression.
2821 This makes it possible, amongst other things, to extract different sub-
2822 strings that match the same pair of parentheses when there is a repeti-
2823 tion.
2825 PCRE provides a similar feature, but of course it cannot obey arbitrary
2826 Perl code. The feature is called "callout". The caller of PCRE provides
2827 an external function by putting its entry point in the global variable
2828 pcre_callout. By default, this variable contains NULL, which disables
2829 all calling out.
2831 Within a regular expression, (?C) indicates the points at which the
2832 external function is to be called. If you want to identify different
2833 callout points, you can put a number less than 256 after the letter C.
2834 The default value is zero. For example, this pattern has two callout
2835 points:
2837 (?C1)abc(?C2)def
2839 During matching, when PCRE reaches a callout point (and pcre_callout is
2840 set), the external function is called. It is provided with the number
2841 of the callout, and, optionally, one item of data originally supplied
2842 by the caller of pcre_exec(). The callout function may cause matching
2843 to backtrack, or to fail altogether. A complete description of the
2844 interface to the callout function is given in the pcrecallout documen-
2845 tation.
2847 Last updated: 03 February 2003
2848 Copyright (c) 1997-2003 University of Cambridge.
2849 -----------------------------------------------------------------------------
2851 PCRE(3) PCRE(3)
2855 NAME
2856 PCRE - Perl-compatible regular expressions
2860 Certain items that may appear in regular expression patterns are more
2861 efficient than others. It is more efficient to use a character class
2862 like [aeiou] than a set of alternatives such as (a|e|i|o|u). In gen-
2863 eral, the simplest construction that provides the required behaviour is
2864 usually the most efficient. Jeffrey Friedl's book contains a lot of
2865 discussion about optimizing regular expressions for efficient perfor-
2866 mance.
2868 When a pattern begins with .* not in parentheses, or in parentheses
2869 that are not the subject of a backreference, and the PCRE_DOTALL option
2870 is set, the pattern is implicitly anchored by PCRE, since it can match
2871 only at the start of a subject string. However, if PCRE_DOTALL is not
2872 set, PCRE cannot make this optimization, because the . metacharacter
2873 does not then match a newline, and if the subject string contains new-
2874 lines, the pattern may match from the character immediately following
2875 one of them instead of from the very start. For example, the pattern
2877 .*second
2879 matches the subject "first\nand second" (where \n stands for a newline
2880 character), with the match starting at the seventh character. In order
2881 to do this, PCRE has to retry the match starting after every newline in
2882 the subject.
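The effect of the newline can be seen in a short Python sketch, with re.DOTALL assumed to correspond to PCRE_DOTALL. Offsets are zero-based, so offset 6 is the seventh character.

```python
import re

subject = "first\nand second"

# Without DOTALL the dot stops at the newline, so the match can only
# begin on the second line.
default = re.search(r".*second", subject)

# With DOTALL the dot crosses the newline and the match starts at once.
dotall = re.search(r".*second", subject, re.DOTALL)
```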
2884 If you are using such a pattern with subject strings that do not con-
2885 tain newlines, the best performance is obtained by setting PCRE_DOTALL,
2886 or starting the pattern with ^.* to indicate explicit anchoring. That
2887 saves PCRE from having to scan along the subject looking for a newline
2888 to restart at.
2890 Beware of patterns that contain nested indefinite repeats. These can
2891 take a long time to run when applied to a string that does not match.
2892 Consider the pattern fragment
2894 (a+)*
2896 This can match "aaaa" in 16 different ways, and this number increases
2897 very rapidly as the string gets longer. (The * repeat can match 0, 1,
2898 2, 3, or 4 times, and for each of those cases other than 0, the +
2899 repeats can match different numbers of times.) When the remainder of
2900 the pattern is such that the entire match is going to fail, PCRE has in
2901 principle to try every possible variation, and this can take an
2902 extremely long time.
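The growth can be made concrete with a small counting sketch in Python. It assumes one particular way of counting: each distinct division of a matched prefix into non-empty runs is one way, and matching nothing at all is one more. Under that convention the count for a string of n "a" characters is 2 to the power n. The function names splits and ways are made up.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def splits(k):
    """Number of ways to cut a run of k 'a's into consecutive non-empty pieces."""
    if k == 0:
        return 1  # the * repeat matching zero times
    # Choose the length of the first a+ piece, then split what remains.
    return sum(splits(k - first) for first in range(1, k + 1))


def ways(n):
    """Total over every prefix length 0..n that (a+)* could settle on."""
    return sum(splits(k) for k in range(n + 1))
```

For example, ways(4) is 16 and ways(20) already exceeds a million, which is why a failing match can take so long.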
2904 An optimization catches some of the more simple cases such as
2906 (a+)*b
2908 where a literal character follows. Before embarking on the standard
2909 matching procedure, PCRE checks that there is a "b" later in the sub-
2910 ject string, and if there is not, it fails the match immediately. How-
2911 ever, when there is no following literal this optimization cannot be
2912 used. You can see the difference by comparing the behaviour of
2914 (a+)*\d
2916 with the pattern above. The pattern ending in "b" fails almost at once
2917 when applied to a whole line of "a" characters, whereas the \d version
2918 takes an appreciable time with strings longer than about 20 characters.
2920 Last updated: 03 February 2003
2921 Copyright (c) 1997-2003 University of Cambridge.
2922 -----------------------------------------------------------------------------
2924 PCRE(3) PCRE(3)
2928 NAME
2929 PCRE - Perl-compatible regular expressions.
2932 #include <pcreposix.h>
2934 int regcomp(regex_t *preg, const char *pattern,
2935 int cflags);
2937 int regexec(regex_t *preg, const char *string,
2938 size_t nmatch, regmatch_t pmatch[], int eflags);
2940 size_t regerror(int errcode, const regex_t *preg,
2941 char *errbuf, size_t errbuf_size);
2943 void regfree(regex_t *preg);
2948 This set of functions provides a POSIX-style API to the PCRE regular
2949 expression package. See the pcreapi documentation for a description of
2950 the native API, which contains additional functionality.
2952 The functions described here are just wrapper functions that ultimately
2953 call the PCRE native API. Their prototypes are defined in the
2954 pcreposix.h header file, and on Unix systems the library itself is
2955 called libpcreposix.a, so it can be accessed by adding -lpcreposix to the
2956 command for linking an application which uses them. Because the POSIX
2957 functions call the native ones, it is also necessary to add -lpcre.
2959 I have implemented only those option bits that can be reasonably mapped
2960 to PCRE native options. In addition, the options REG_EXTENDED and
2961 REG_NOSUB are defined with the value zero. They have no effect, but
2962 since programs that are written to the POSIX interface often use them,
2963 this makes it easier to slot in PCRE as a replacement library. Other
2964 POSIX options are not even defined.
2966 When PCRE is called via these functions, it is only the API that is
2967 POSIX-like in style. The syntax and semantics of the regular expres-
2968 sions themselves are still those of Perl, subject to the setting of
2969 various PCRE options, as described below. "POSIX-like in style" means
2970 that the API approximates to the POSIX definition; it is not fully
2971 POSIX-compatible, and in multi-byte encoding domains it is probably
2972 even less compatible.
2974 The header for these functions is supplied as pcreposix.h to avoid any
2975 potential clash with other POSIX libraries. It can, of course, be
2976 renamed or aliased as regex.h, which is the "correct" name. It provides
2977 two structure types, regex_t for compiled internal forms, and reg-
2978 match_t for returning captured substrings. It also defines some con-
2979 stants whose names start with "REG_"; these are used for setting
2980 options and identifying error codes.
2985 The function regcomp() is called to compile a pattern into an internal
2986 form. The pattern is a C string terminated by a binary zero, and is
2987 passed in the argument pattern. The preg argument is a pointer to a
2988 regex_t structure which is used as a base for storing information about
2989 the compiled expression.
2991 The argument cflags is either zero, or contains one or more of the bits
2992 defined by the following macros:
2994 REG_ICASE
2996 The PCRE_CASELESS option is set when the expression is passed for com-
2997 pilation to the native function.
2999 REG_NEWLINE
3001 The PCRE_MULTILINE option is set when the expression is passed for com-
3002 pilation to the native function. Note that this does not mimic the
3003 defined POSIX behaviour for REG_NEWLINE (see the following section).
3005 In the absence of these flags, no options are passed to the native
3006 function. This means that the regex is compiled with PCRE default
3007 semantics. In particular, the way it handles newline characters in the
3008 subject string is the Perl way, not the POSIX way. Note that setting
3009 PCRE_MULTILINE has only some of the effects specified for REG_NEWLINE.
3010 It does not affect the way newlines are matched by . (they aren't) or
3011 by a negative class such as [^a] (they are).
3013 The yield of regcomp() is zero on success, and non-zero otherwise. The
3014 preg structure is filled in on success, and one member of the structure
3015 is public: re_nsub contains the number of capturing subpatterns in the
3016 regular expression. Various error codes are defined in the header file.
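As a minimal sketch of a regcomp() call, the function below compiles a pattern caselessly and reports the number of capturing subpatterns from re_nsub. The name count_groups and the USE_PCREPOSIX build switch are assumptions made for this example; without the switch the code builds against the system regex.h, which declares the same functions.

```c
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* system POSIX header, same function names */
#endif

/* Compile pattern caselessly and return the number of capturing
 * subpatterns, or -1 if regcomp() reports an error. */
int count_groups(const char *pattern)
{
    regex_t re;
    int nsub;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_ICASE) != 0)
        return -1;              /* non-zero yield: compilation failed */
    nsub = (int)re.re_nsub;     /* the one public member of regex_t */
    regfree(&re);               /* release memory tied to re */
    return nsub;
}
```

REG_EXTENDED is passed because programs written for the POSIX interface normally use it; with pcreposix.h it has the value zero and changes nothing.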
3021 Newline matching is not simple, because POSIX and Perl take different views of
3022 things. It is not possible to get PCRE to obey POSIX semantics, but
3023 then PCRE was never intended to be a POSIX engine. The following table
3024 lists the different possibilities for matching newline characters in
3025 PCRE:
3027                           Default   Change with
3029 . matches newline         no        PCRE_DOTALL
3030 newline matches [^a]      yes       not changeable
3031 $ matches \n at end       yes       PCRE_DOLLAR_ENDONLY
3032 $ matches \n in middle    no        PCRE_MULTILINE
3033 ^ matches \n in middle    no        PCRE_MULTILINE
3035 This is the equivalent table for POSIX:
3037                           Default   Change with
3039 . matches newline         yes       REG_NEWLINE
3040 newline matches [^a]      yes       REG_NEWLINE
3041 $ matches \n at end       no        REG_NEWLINE
3042 $ matches \n in middle    no        REG_NEWLINE
3043 ^ matches \n in middle    no        REG_NEWLINE
3045 PCRE's behaviour is the same as Perl's, except that there is no equiva-
3046 lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is no
3047 way to stop newline from matching [^a].
3049 The default POSIX newline handling can be obtained by setting
3050 PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE
3051 behave exactly as for the REG_NEWLINE action.
3056 The function regexec() is called to match a pre-compiled pattern preg
3057 against a given string, which is terminated by a zero byte, subject to
3058 the options in eflags. These can be:
3060 REG_NOTBOL
3062 The PCRE_NOTBOL option is set when calling the underlying PCRE matching
3063 function.
3065 REG_NOTEOL
3067 The PCRE_NOTEOL option is set when calling the underlying PCRE matching
3068 function.
3070 The portion of the string that was matched, and also any captured sub-
3071 strings, are returned via the pmatch argument, which points to an array
3072 of nmatch structures of type regmatch_t, containing the members rm_so
3073 and rm_eo. These contain the offset to the first character of each sub-
3074 string and the offset to the first character after the end of each sub-
3075 string, respectively. The 0th element of the vector relates to the
3076 entire portion of string that was matched; subsequent elements relate
3077 to the capturing subpatterns of the regular expression. Unused entries
3078 in the array have both structure members set to -1.
3080 A successful match yields a zero return; various error codes are
3081 defined in the header file, of which REG_NOMATCH is the "expected"
3082 failure code.
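The use of the pmatch vector can be illustrated with a small helper that copies one captured substring out of the subject. This is a sketch under stated assumptions: the name extract_group, the fixed vector size of 10 and the USE_PCREPOSIX build switch are all invented for the example.

```c
#include <string.h>
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* system POSIX header, same function names */
#endif

/* Match pattern against subject. On success copy substring number
 * group (0 = the whole match) into out and return its length.
 * Returns -1 on a compile error, no match, or an unset group. */
int extract_group(const char *pattern, const char *subject,
                  size_t group, char *out, size_t outsize)
{
    regex_t re;
    regmatch_t pmatch[10];
    int len = -1;

    if (group >= 10 || outsize == 0)
        return -1;
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, subject, 10, pmatch, 0) == 0 &&
        pmatch[group].rm_so != -1) {            /* -1 marks an unused entry */
        len = (int)(pmatch[group].rm_eo - pmatch[group].rm_so);
        if ((size_t)len >= outsize)
            len = (int)outsize - 1;             /* truncate to fit */
        memcpy(out, subject + pmatch[group].rm_so, (size_t)len);
        out[len] = '\0';
    }
    regfree(&re);
    return len;
}
```

Because rm_eo is the offset of the first character after the substring, the length is simply rm_eo minus rm_so, with no adjustment by one.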
3087 The regerror() function maps a non-zero error code from either regcomp()
3088 or regexec() to a printable message. If preg is not NULL, the error
3089 should have arisen from the use of that structure. A message terminated
3090 by a binary zero is placed in errbuf. The length of the message,
3091 including the zero, is limited to errbuf_size. The yield of the func-
3092 tion is the size of buffer needed to hold the whole message.
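Because the yield is the size of buffer needed, regerror() can be called twice: once to find the size and once to fetch the full message. The sketch below shows one way of doing this; the name error_text, the one-byte probe buffer and the USE_PCREPOSIX build switch are assumptions made for the example.

```c
#include <stdlib.h>
#ifdef USE_PCREPOSIX            /* hypothetical build switch, not part of PCRE */
#include <pcreposix.h>
#else
#include <regex.h>              /* system POSIX header, same function names */
#endif

/* Return a malloc'd copy of the message for errcode, which the caller
 * must free, or NULL if memory cannot be obtained. preg may be NULL. */
char *error_text(int errcode, const regex_t *preg)
{
    char probe[1];
    size_t needed;
    char *buf;

    /* First call: the message is truncated to fit the probe, but the
     * yield is the size needed for all of it, including the zero. */
    needed = regerror(errcode, preg, probe, sizeof probe);
    buf = malloc(needed);
    if (buf != NULL)
        regerror(errcode, preg, buf, needed);   /* second call: full text */
    return buf;
}
```

A fixed buffer of a few hundred bytes is a simpler choice when a truncated message is acceptable.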
3097 Compiling a regular expression causes memory to be allocated and asso-
3098 ciated with the preg structure. The function regfree() frees all such
3099 memory, after which preg may no longer be used as a compiled expres-
3100 sion.
3105 Philip Hazel <ph10@cam.ac.uk>
3106 University Computing Service,
3107 Cambridge CB2 3QG, England.
3109 Last updated: 03 February 2003
3110 Copyright (c) 1997-2003 University of Cambridge.
3111 -----------------------------------------------------------------------------
3113 PCRE(3) PCRE(3)
3117 NAME
3118 PCRE - Perl-compatible regular expressions
3122 A simple, complete demonstration program, to get you started with using
3123 PCRE, is supplied in the file pcredemo.c in the PCRE distribution.
3125 The program compiles the regular expression that is its first argument,
3126 and matches it against the subject string in its second argument. No
3127 PCRE options are set, and default character tables are used. If match-
3128 ing succeeds, the program outputs the portion of the subject that
3129 matched, together with the contents of any captured substrings.
3131 If the -g option is given on the command line, the program then goes on
3132 to check for further matches of the same regular expression in the same
3133 subject string. The logic is a little bit tricky because of the possi-
3134 bility of matching an empty string. Comments in the code explain what
3135 is going on.
3137 On a Unix system that has PCRE installed in /usr/local, you can compile
3138 the demonstration program using a command like this:
3140 gcc -o pcredemo pcredemo.c -I/usr/local/include \
3141 -L/usr/local/lib -lpcre
3143 Then you can run simple tests like this:
3145 ./pcredemo 'cat|dog' 'the cat sat on the mat'
3146 ./pcredemo -g 'cat|dog' 'the dog sat on the cat'
3148 Note that there is a much more comprehensive test program, called
3149 pcretest, which supports many more facilities for testing regular
3150 expressions and the PCRE library. The pcredemo program is provided as a
3151 simple coding example.
3153 On some operating systems (e.g. Solaris) you may get an error like this
3154 when you try to run pcredemo:
3156 ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or
3157 directory
3159 This is caused by the way shared library support works on those sys-
3160 tems. You need to add
3162 -R/usr/local/lib
3164 to the compile command to get round this problem.
3166 Last updated: 28 January 2003
3167 Copyright (c) 1997-2003 University of Cambridge.
3168 -----------------------------------------------------------------------------
