Contents of /code/trunk/doc/pcre.txt

Revision 47
Sat Feb 24 21:39:29 2007 UTC, by nigel
File MIME type: text/plain
File size: 87371 byte(s)
Load pcre-3.2 into code/trunk.
pcre - Perl-compatible regular expressions.

#include <pcre.h>

pcre *pcre_compile(const char *pattern, int options,
     const char **errptr, int *erroffset,
     const unsigned char *tableptr);

pcre_extra *pcre_study(const pcre *code, int options,
     const char **errptr);

int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

int pcre_copy_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber, char *buffer,
     int buffersize);

int pcre_get_substring(const char *subject, int *ovector,
     int stringcount, int stringnumber,
     const char **stringptr);

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

const unsigned char *pcre_maketables(void);

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

char *pcre_version(void);

void *(*pcre_malloc)(size_t);

void (*pcre_free)(void *);
The PCRE library is a set of functions that implement regular
expression pattern matching using the same syntax and semantics as
Perl 5, with just a few differences (see below). The current
implementation corresponds to Perl 5.005, with some additional
features from the Perl development release.

PCRE has its own native API, which is described in this document.
There is also a set of wrapper functions that correspond to the POSIX
regular expression API. These are described in the pcreposix
documentation.

The native API function prototypes are defined in the header file
pcre.h, and on Unix systems the library itself is called libpcre.a,
so it can be accessed by adding -lpcre to the command for linking an
application which calls it. The header file defines the macros
PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
numbers for the library. Applications can use these to include
support for different releases.
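As an illustration of how the release macros might be used, here is a
minimal sketch of a version comparison. The macro name
AT_LEAST_VERSION is hypothetical and is not part of pcre.h; only
PCRE_MAJOR and PCRE_MINOR come from the header.

```c
/* Hypothetical helper, not part of pcre.h: non-zero when the release
   (major, minor) is at least (want_major, want_minor). */
#define AT_LEAST_VERSION(major, minor, want_major, want_minor) \
    ((major) > (want_major) || \
     ((major) == (want_major) && (minor) >= (want_minor)))
```

An application that includes pcre.h could then guard release-specific
code with "#if AT_LEAST_VERSION(PCRE_MAJOR, PCRE_MINOR, 3, 0)", since
the macro expands to an integer constant expression.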
The functions pcre_compile(), pcre_study(), and pcre_exec() are used
for compiling and matching regular expressions, while
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are convenience functions for extracting
captured substrings from a matched subject string. The function
pcre_maketables() is used (optionally) to build a set of character
tables in the current locale for passing to pcre_compile().

The function pcre_fullinfo() is used to find out information about a
compiled pattern; pcre_info() is an obsolete version which returns
only some of the available information, but is retained for backwards
compatibility. The function pcre_version() returns a pointer to a
string containing the version of PCRE and its date of release.

The global variables pcre_malloc and pcre_free initially contain the
entry points of the standard malloc() and free() functions
respectively. PCRE calls the memory management functions via these
variables, so a calling program can replace them if it wishes to
intercept the calls. This should be done before calling any PCRE
functions.
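One possible shape for replacement memory functions is sketched
below. The names counting_malloc, counting_free and live_blocks are
hypothetical. The sketch shows only functions with the same
signatures as the ones pcre_malloc and pcre_free point to, and does
not itself touch the library.

```c
#include <stdlib.h>

/* Hypothetical bookkeeping: number of blocks handed out and not yet freed. */
long live_blocks = 0;

/* Same signature as the function pcre_malloc points to. */
void *counting_malloc(size_t size)
{
    void *block = malloc(size);
    if (block != NULL) live_blocks++;   /* count successful requests only */
    return block;
}

/* Same signature as the function pcre_free points to. */
void counting_free(void *block)
{
    if (block != NULL) live_blocks--;   /* freeing NULL is a no-op */
    free(block);
}
```

A program that includes pcre.h would install these with
"pcre_malloc = counting_malloc; pcre_free = counting_free;" before
its first call to any PCRE function.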
The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc and pcre_free are shared by all threads.

The compiled form of a regular expression is not altered during
matching, so the same compiled pattern can safely be used by several
threads at once.
The function pcre_compile() is called to compile a pattern into an
internal form. The pattern is a C string terminated by a binary zero,
and is passed in the argument pattern. A pointer to a single block of
memory that is obtained via pcre_malloc is returned. This contains
the compiled code and related data. The pcre type is defined for this
for convenience, but in fact pcre is just a typedef for void, since
the contents of the block are not externally defined. It is up to the
caller to free the memory when it is no longer required.

The size of a compiled pattern is roughly proportional to the length
of the pattern string, except that each character class (other than
those containing just a single character, negated or not) requires 33
bytes, and repeat quantifiers with a minimum greater than one or a
bounded maximum cause the relevant portions of the compiled pattern
to be replicated.

The options argument contains independent bits that affect the
compilation. It should be zero if no options are required. Some of
the options, in particular those that are compatible with Perl, can
also be set and unset from within the pattern (see the detailed
description of regular expressions below). For these options, the
contents of the options argument specify their initial settings at
the start of compilation and execution. The PCRE_ANCHORED option can
be set at the time of matching as well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately.
Otherwise, if compilation of a pattern fails, pcre_compile() returns
NULL, and sets the variable pointed to by errptr to point to a
textual error message. The offset from the start of the pattern to
the character where the error was discovered is placed in the
variable pointed to by erroffset, which must not be NULL. If it is,
an immediate error is given.

If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables which are built when it is compiled, using the
default C locale. Otherwise, tableptr must be the result of a call to
pcre_maketables(). See the section on locale support below.

The following option bits are defined in the header file:
  PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is,
it is constrained to match only at the start of the string which is
being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is
the only way to do it in Perl.

  PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option.

  PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches
only at the end of the subject string. Without this option, a dollar
also matches immediately before the final character if it is a
newline (but not before any other newlines). The PCRE_DOLLAR_ENDONLY
option is ignored if PCRE_MULTILINE is set. There is no equivalent to
this option in Perl.

  PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches all
characters, including newlines. Without it, newlines are excluded.
This option is equivalent to Perl's /s option. A negative class such
as [^a] always matches a newline character, independent of the
setting of this option.

  PCRE_EXTENDED

If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class, and
characters between an unescaped # outside a character class and the
next newline character, inclusive, are also ignored. This is
equivalent to Perl's /x option, and makes it possible to include
comments inside complicated patterns. Note, however, that this
applies only to data characters. Whitespace characters may never
appear within special character sequences in a pattern, for example
within the sequence (?( which introduces a conditional subpattern.

  PCRE_EXTRA

This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by
a letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as
a literal. There are at present no other features controlled by this
option. It can also be set by a (?X) option setting within a pattern.

  PCRE_MULTILINE

By default, PCRE treats the subject string as consisting of a single
"line" of characters (even if it actually contains several newlines).
The "start of line" metacharacter (^) matches only at the start of
the string, while the "end of line" metacharacter ($) matches only at
the end of the string, or before a terminating newline (unless
PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before any
newline in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option. If there are
no "\n" characters in a subject string, or no occurrences of ^ or $
in a pattern, setting PCRE_MULTILINE has no effect.

  PCRE_UNGREEDY

This option inverts the "greediness" of the quantifiers so that they
are not greedy by default, but become greedy if followed by "?". It
is not compatible with Perl. It can also be set by a (?U) option
setting within the pattern.
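The interaction between PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY can be
restated as two small predicates. This is a sketch that follows the
wording of the option descriptions above; it is not taken from the
PCRE source and does not claim to reproduce every edge case. The
function names are hypothetical, and the two flags are plain ints
standing in for the option bits.

```c
/* Sketch: may "^" match at offset pos of subject? */
int caret_can_match(const char *subject, int pos, int multiline)
{
    if (pos == 0) return 1;                        /* very start */
    return multiline && subject[pos - 1] == '\n';  /* just after a newline */
}

/* Sketch: may "$" match at offset pos? pos == length is the very end. */
int dollar_can_match(const char *subject, int length, int pos,
                     int multiline, int dollar_endonly)
{
    if (pos == length) return 1;          /* very end of the subject */
    if (subject[pos] != '\n') return 0;   /* otherwise only before a newline */
    if (multiline) return 1;              /* before any newline; endonly ignored */
    if (dollar_endonly) return 0;         /* end of subject only */
    return pos == length - 1;             /* before a terminating newline */
}
```

Testing multiline before dollar_endonly is what makes the second flag
"ignored" when the first is set.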
When a pattern is going to be used several times, it is worth
spending more time analyzing it in order to speed up the time taken
for matching. The function pcre_study() takes a pointer to a compiled
pattern as its first argument, and returns a pointer to a pcre_extra
block (another void typedef) containing additional information about
the pattern; this can be passed to pcre_exec(). If no additional
information is available, NULL is returned.

The second argument contains option bits. At present, no options are
defined for pcre_study(), and this argument should always be zero.

The third argument for pcre_study() is a pointer to an error message.
If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it points to a textual error
message.

At present, studying a pattern is useful only for non-anchored
patterns that do not have a single fixed starting character. A bitmap
of possible starting characters is created.
PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables. The
library contains a default set of tables which is created in the
default C locale when PCRE is compiled. This is used when the final
argument of pcre_compile() is NULL, and is sufficient for many
applications.

An alternative set of tables can, however, be supplied. Such tables
are built by calling the pcre_maketables() function, which has no
arguments, in the relevant locale. The result can then be passed to
pcre_compile() as often as necessary. For example, to build and use
tables that are appropriate for the French locale (where accented
characters with codes greater than 128 are treated as letters), the
following code could be used:

  setlocale(LC_CTYPE, "fr");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);

The tables are built in memory that is obtained via pcre_malloc. The
pointer that is passed to pcre_compile is saved with the compiled
pattern, and the same tables are used via this pointer by
pcre_study() and pcre_exec(). Thus for any single pattern,
compilation, studying and matching all happen in the same locale, but
different patterns can be compiled in different locales. It is the
caller's responsibility to ensure that the memory containing the
tables remains available for as long as it is needed.
The pcre_fullinfo() function returns information about a compiled
pattern. It replaces the obsolete pcre_info() function, which is
nevertheless retained for backwards compatibility (and is documented
below).

The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL
if the pattern was not studied. The third argument specifies which
piece of information is required, while the fourth argument is a
pointer to a variable to receive the data. The yield of the function
is zero for success, or one of the following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
                        the argument where was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid

The possible values for the third argument are defined in pcre.h, and
are as follows:
  PCRE_INFO_OPTIONS

Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(),
modified by any top-level option settings within the pattern itself,
and with the PCRE_ANCHORED bit forcibly set if the form of the
pattern implies that it can match only at the start of a subject
string.

  PCRE_INFO_SIZE

Return the size of the compiled pattern, that is, the value that was
passed as the argument to pcre_malloc() when PCRE was getting memory
in which to place the compiled data. The fourth argument should point
to a size_t variable.

  PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.

  PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.

  PCRE_INFO_FIRSTCHAR

Return information about the first character of any matched string,
for a non-anchored pattern. If there is a fixed first character, e.g.
from a pattern such as (cat|cow|coyote), it is returned in the
integer pointed to by where. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and
every branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is
not set (if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any "\n" within the string. Otherwise -2
is returned. For anchored patterns, -2 is returned.

  PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of
a 256-bit table indicating a fixed set of characters for the first
character in any matching string, a pointer to the table is returned.
Otherwise NULL is returned. The fourth argument should point to an
unsigned char * variable.

  PCRE_INFO_LASTLITERAL

For a non-anchored pattern, return the value of the rightmost literal
character which must exist in any matched string, other than at its
start. The fourth argument should point to an int variable. If there
is no such character, or if the pattern is anchored, -1 is returned.
For example, for the pattern /a\d+z\d+/ the returned value is 'z'.
The pcre_info() function is now obsolete because its interface is too
restrictive to return all the available data about a compiled
pattern. New programs should use pcre_fullinfo() instead. The yield
of pcre_info() is the number of capturing subpatterns, or one of the
following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found

If the optptr argument is not NULL, a copy of the options with which
the pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).

If the pattern is not anchored and the firstcharptr argument is not
NULL, it is used to pass back information about the first character
of any matched string (see PCRE_INFO_FIRSTCHAR above).
The function pcre_exec() is called to match a subject string against
a pre-compiled pattern, which is passed in the code argument. If the
pattern has been studied, the result of the study should be passed in
the extra argument. Otherwise this must be NULL.

The PCRE_ANCHORED option can be passed in the options argument, whose
unused bits must be zero. However, if a pattern was compiled with
PCRE_ANCHORED, or turned out to be anchored by virtue of its
contents, it cannot be made unanchored at matching time.

There are also three further options that can be set only at matching
time:

  PCRE_NOTBOL

The first character of the string is not the beginning of a line, so
the circumflex metacharacter should not match before it. Setting this
without PCRE_MULTILINE (at compile time) causes circumflex never to
match.

  PCRE_NOTEOL

The end of the string is not the end of a line, so the dollar
metacharacter should not match it nor (except in multiline mode) a
newline immediately before it. Setting this without PCRE_MULTILINE
(at compile time) causes dollar never to match.

  PCRE_NOTEMPTY

An empty string is not considered to be a valid match if this option
is set. If there are alternatives in the pattern, they are tried. If
all the alternatives match the empty string, the entire match fails.
For example, if the pattern

  a?b?

is applied to a string not beginning with "a" or "b", it matches the
empty string at the start of the subject. With PCRE_NOTEMPTY set,
this match is not valid, so PCRE searches further into the string for
occurrences of "a" or "b".

Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a
special case of a pattern match of the empty string within its
split() function, and when using the /g modifier. It is possible to
emulate Perl's behaviour after matching a null string by first trying
the match again at the same offset with PCRE_NOTEMPTY set, and then
if that fails by advancing the starting offset (see below) and trying
an ordinary match again.
The subject string is passed as a pointer in subject, a length in
length, and a starting offset in startoffset. Unlike the pattern
string, it may contain binary zero characters. When the starting
offset is zero, the search for a match starts at the beginning of the
subject, and this is by far the most common case.

A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous
success. Setting startoffset differs from just passing over a
shortened string and setting PCRE_NOTBOL in the case of a pattern
that begins with any kind of lookbehind. For example, consider the
pattern

  \Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to
pcre_exec() finds the first occurrence. If pcre_exec() is called
again with just the remainder of the subject, namely "issippi", it
does not match, because \B is always false at the start of the
subject, which is deemed to be a word boundary. However, if
pcre_exec() is passed the entire string again, but with startoffset
set to 4, it finds the second occurrence of "iss" because it is able
to look behind the starting point to discover that it is preceded by
a letter.
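The word-boundary reasoning in this example can be checked with a
small stand-alone predicate. This is only a sketch: it assumes a word
character is a letter, digit or underscore as classified by the C
library, whereas PCRE consults its own character tables, and the
function names are hypothetical.

```c
#include <ctype.h>

/* Assumption: letters, digits and underscore count as word characters. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Non-zero if offset pos in subject[0..length) is a word boundary.
   Anything outside the subject counts as a non-word character, which is
   why the very start of a subject that begins with a letter is always a
   boundary. */
int at_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

At offset 4 of the full subject the function returns 0, so \B can
succeed there. At offset 0 of the shortened string it returns 1,
because nothing precedes the first letter, so \B fails.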
If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is tried. This can only
succeed if the pattern does not require the match to be at the start
of the subject.

In general, a pattern matches a certain portion of the subject, and
in addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a
substring. PCRE supports several other kinds of parenthesized
subpattern that do not cause substrings to be captured.

Captured substrings are returned to the caller via a vector of
integer offsets whose address is passed in ovector. The number of
elements in the vector is passed in ovecsize. The first two-thirds of
the vector is used to pass back captured substrings, each substring
using a pair of integers. The remaining third of the vector is used
as workspace by pcre_exec() while matching capturing subpatterns, and
is not available for passing back information. The length passed in
ovecsize should always be a multiple of three. If it is not, it is
rounded down.

When a match has been successful, information about captured
substrings is returned in pairs of integers, starting at the
beginning of ovector, and continuing up to two-thirds of its length
at the most. The first element of a pair is set to the offset of the
first character in a substring, and the second is set to the offset
of the first character after the end of a substring. The first pair,
ovector[0] and ovector[1], identifies the portion of the subject
string matched by the entire pattern. The next pair is used for the
first capturing subpattern, and so on. The value returned by
pcre_exec() is the number of pairs that have been set. If there are
no capturing subpatterns, the return value from a successful match is
1, indicating that just the first pair of offsets has been set.

Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described in the following
section.

It is possible for a capturing subpattern number n+1 to match some
part of the subject when subpattern n has not been used at all. For
example, if the string "abc" is matched against the pattern
(a|(z))(bc) subpatterns 1 and 3 are matched, but 2 is not. When this
happens, both offset values corresponding to the unused subpattern
are set to -1.
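Reading a pair of offsets out of the vector, including the unset
case, might look like the following sketch. The function name
capture_span is hypothetical; rc stands for the value that
pcre_exec() returned, and the code relies only on the layout
described above.

```c
/* Sketch: look up capture number n in a vector filled by a successful
   match. Returns 1 and stores the start offset and length if the
   capture is set; returns 0 if it is unset or n is out of range. */
int capture_span(const int *ovector, int rc, int n, int *start, int *length)
{
    if (n < 0 || n >= rc) return 0;      /* pair n was never filled in */
    if (ovector[2 * n] < 0) return 0;    /* unset: both offsets are -1 */
    *start = ovector[2 * n];
    *length = ovector[2 * n + 1] - ovector[2 * n];
    return 1;
}
```

For "abc" matched against (a|(z))(bc), the pairs described in the
text would be (0,3), (0,1), (-1,-1) and (1,3) with a return value of
4, so the lookup succeeds for captures 0, 1 and 3 and reports capture
2 as unset.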
If a capturing subpattern is matched repeatedly, it is the last
portion of the string that it matched that gets returned.

If the vector is too small to hold all the captured substrings, it is
used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. In particular, if the substring
offsets are not of interest, pcre_exec() may be called with ovector
passed as NULL and ovecsize as zero. However, if the pattern contains
back references and the ovector isn't big enough to remember the
related substrings, PCRE has to get additional memory for use during
matching. Thus it is usually advisable to supply an ovector.

Note that pcre_info() can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings in addition to the
offsets of the substring matched by the whole pattern is (n+1)*3.
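The sizing arithmetic can be written down directly. Both function
names below are hypothetical helpers for illustration, not part of
the library.

```c
/* Smallest ovecsize with room for n captured substrings plus the overall
   match: (n+1) pairs of offsets, and half as much again as workspace. */
int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs a vector of ovecsize ints can report. The size
   is rounded down to a multiple of three and two-thirds of that holds
   pairs, which is the same as integer division by three. */
int reportable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns this gives a vector of
nine ints, of which the first six carry the three pairs.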
If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

  PCRE_ERROR_NOMATCH

The subject string did not match the pattern.

  PCRE_ERROR_NULL

Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.

  PCRE_ERROR_BADOPTION

An unrecognized bit was set in the options argument.

  PCRE_ERROR_BADMAGIC

PCRE stores a 4-byte "magic number" at the start of the compiled
code, to catch the case when it is passed a junk pointer. This is the
error it gives when the magic number isn't present.

  PCRE_ERROR_UNKNOWN_NODE

While running the pattern match, an unknown item was encountered in
the compiled pattern. This error could be caused by a bug in PCRE or
by overwriting of the compiled pattern.

  PCRE_ERROR_NOMEMORY

If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced
substrings, PCRE gets a block of memory at the start of matching to
use for this purpose. If the call via pcre_malloc() fails, this error
is given. The memory is freed at the end of matching.
Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. A substring
that contains a binary zero is correctly extracted and has a further
zero added on the end, but the result does not, of course, function
as a C string.

The first three arguments are the same for all three functions:
subject is the subject string which has just been successfully
matched, ovector is a pointer to the vector of integer offsets that
was passed to pcre_exec(), and stringcount is the number of
substrings that were captured by the match, including the substring
that matched the entire regular expression. This is the value
returned by pcre_exec() if it is greater than zero. If pcre_exec()
returned zero, indicating that it ran out of space in ovector, the
value passed as stringcount should be the size of the vector divided
by three.
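The rule for choosing stringcount can be captured in a few lines. The
helper name stringcount_for is hypothetical, and the decision to
return zero after a failed match is an assumption of this sketch
(there is nothing to extract in that case).

```c
/* Sketch: derive the stringcount argument from pcre_exec()'s return
   value rc and the ovecsize that was passed to it. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0) return rc;             /* number of pairs that were set */
    if (rc == 0) return ovecsize / 3;  /* vector too small: every pair used */
    return 0;                          /* negative rc: nothing to extract */
}
```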
The functions pcre_copy_substring() and pcre_get_substring() extract
a single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, while
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length
is given by buffersize, while for pcre_get_substring() a new block of
store is obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the string, not
including the terminating zero, or one of

  PCRE_ERROR_NOMEMORY

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

  PCRE_ERROR_NOSUBSTRING

There is no substring whose number is stringnumber.

The pcre_get_substring_list() function extracts all available
substrings and builds a list of pointers to them. All this is done in
a single block of memory which is obtained via pcre_malloc. The
address of the memory block is returned via listptr, which is also
the start of the list of string pointers. The end of the list is
marked by a NULL pointer. The yield of the function is zero if all
went well, or

  PCRE_ERROR_NOMEMORY

if the attempt to get the memory block failed.

When any of these functions encounter a substring that is unset,
which can happen when capturing subpattern number n+1 matches some
part of the subject, but subpattern n has not been used at all, they
return an empty string. This can be distinguished from a genuine
zero-length substring by inspecting the appropriate offset in
ovector, which is negative for unset substrings.
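To make the described behaviour of pcre_copy_substring() concrete,
here is a plain C sketch written from the description above. It is
not the library's source code. The function name copy_capture and the
two SKETCH_ error values are hypothetical stand-ins; the real error
codes are defined in pcre.h.

```c
#include <string.h>

/* Stand-ins for the library's error codes; the real values are in pcre.h. */
enum { SKETCH_NOMEMORY = -1, SKETCH_NOSUBSTRING = -2 };

/* Copy capture number stringnumber into buffer as a zero-terminated
   string. Returns its length, or one of the negative values above. */
int copy_capture(const char *subject, const int *ovector, int stringcount,
                 int stringnumber, char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return SKETCH_NOSUBSTRING;        /* no such substring */

    start = ovector[2 * stringnumber];
    length = ovector[2 * stringnumber + 1] - start;
    if (buffersize < length + 1)
        return SKETCH_NOMEMORY;           /* no room for text plus the zero */

    if (length > 0)                       /* unset pairs (-1,-1) copy nothing */
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

An unset pair of (-1,-1) has a computed length of zero, so the sketch
yields an empty string for it, matching the behaviour described
above.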
There are some size limitations in PCRE but it is hoped that they
will never in practice be relevant. The maximum length of a compiled
pattern is 65539 (sic) bytes. All values in repeating quantifiers
must be less than 65536. The maximum number of capturing subpatterns
is 99. The maximum number of all parenthesized subpatterns, including
capturing subpatterns, assertions, and other types of subpattern, is
200.

The maximum length of a subject string is the largest positive number
that an integer variable can hold. However, PCRE uses recursion to
handle subpatterns and indefinite repetition. This means that the
available stack space may limit the size of a subject string that can
be processed by certain patterns.
The differences described here are with respect to Perl 5.005.

1. By default, a whitespace character is any character that the C
library function isspace() recognizes, though it is possible to
compile PCRE with alternative character type tables. Normally
isspace() matches space, formfeed, newline, carriage return,
horizontal tab, and vertical tab. Perl 5 no longer includes vertical
tab in its set of whitespace characters. The \v escape that was in
the Perl documentation for a long time was never in fact recognized.
However, the character itself was treated as whitespace at least up
to 5.002. In 5.004 and 5.005 it does not match \s.

2. PCRE does not allow repeat quantifiers on lookahead assertions.
Perl permits them, but they do not mean what you might think. For
example, (?!a){3} does not assert that the next three characters are
not "a". It just asserts that the next character is not "a" three
times.

3. Capturing subpatterns that occur inside negative lookahead
assertions are counted, but their entries in the offsets vector are
never set. Perl sets its numerical variables from any such patterns
that are matched before the assertion fails to match something
(thereby succeeding), but only if the negative lookahead assertion
contains just one branch.

4. Though binary zero characters are supported in the subject string,
they are not allowed in a pattern string because it is passed as a
normal C string, terminated by zero. The escape sequence "\0" can be
used in the pattern to represent a binary zero.

5. The following Perl escape sequences are not supported: \l, \u, \L,
\U, \E, \Q. In fact these are implemented by Perl's general
string-handling and are not part of its pattern matching engine.

6. The Perl \G assertion is not supported as it is not relevant to
single pattern matches.

7. Fairly obviously, PCRE does not support the (?{code}) and
(?p{code}) constructions. However, there is some experimental support
for recursive patterns using the non-Perl item (?R).
736 8. There are at the time of writing some oddities in Perl
737 5.005_02 concerned with the settings of captured strings
738 when part of a pattern is repeated. For example, matching
739 "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
740 "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2
741 unset. However, if the pattern is changed to
742 /^(aa(b(b))?)+$/ then $2 (and $3) are set.
744 In Perl 5.004 $2 is set in both cases, and that is also true
745 of PCRE. If in the future Perl changes to a consistent state
746 that is different, PCRE may change to follow.
748 9. Another as yet unresolved discrepancy is that in Perl
749 5.005_02 the pattern /^(a)?(?(1)a|b)+$/ matches the string
750 "a", whereas in PCRE it does not. However, in both Perl and
751 PCRE /^(a)?a/ matched against "a" leaves $1 unset.
753 10. PCRE provides some extensions to the Perl regular
754 expression facilities:
756 (a) Although lookbehind assertions must match fixed length
757 strings, each alternative branch of a lookbehind assertion
758 can match a different length of string. Perl 5.005 requires
759 them all to have the same length.
761 (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not
762 set, the $ meta- character matches only at the very end of
763 the string.
765 (c) If PCRE_EXTRA is set, a backslash followed by a letter
766 with no special meaning is faulted.
768 (d) If PCRE_UNGREEDY is set, the greediness of the repeti-
769 tion quantifiers is inverted, that is, by default they are
770 not greedy, but if followed by a question mark they are.
772 (e) PCRE_ANCHORED can be used to force a pattern to be tried
773 only at the start of the subject.
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options
for pcre_exec() have no Perl equivalents.
778 (g) The (?R) construct allows for recursive pattern matching
779 (Perl 5.6 can do this using the (?p{code}) construct, which
780 PCRE cannot of course support.)
785 The syntax and semantics of the regular expressions sup-
786 ported by PCRE are described below. Regular expressions are
787 also described in the Perl documentation and in a number of
789 other books, some of which have copious examples. Jeffrey
790 Friedl's "Mastering Regular Expressions", published by
791 O'Reilly (ISBN 1-56592-257), covers them in great detail.
792 The description here is intended as reference documentation.
794 A regular expression is a pattern that is matched against a
795 subject string from left to right. Most characters stand for
796 themselves in a pattern, and match the corresponding charac-
797 ters in the subject. As a trivial example, the pattern
799 The quick brown fox
801 matches a portion of a subject string that is identical to
802 itself. The power of regular expressions comes from the
803 ability to include alternatives and repetitions in the pat-
804 tern. These are encoded in the pattern by the use of meta-
805 characters, which do not stand for themselves but instead
806 are interpreted in some special way.
808 There are two different sets of meta-characters: those that
809 are recognized anywhere in the pattern except within square
810 brackets, and those that are recognized in square brackets.
811 Outside square brackets, the meta-characters are as follows:
  \      general escape character with several uses
  ^      assert start of subject (or line, in multiline mode)
  $      assert end of subject (or line, in multiline mode)
  .      match any character except newline (by default)
  [      start character class definition
  |      start of alternative branch
  (      start subpattern
  )      end subpattern
  ?      extends the meaning of (
         also 0 or 1 quantifier
         also quantifier minimizer
  *      0 or more quantifier
  +      1 or more quantifier
  {      start min/max quantifier
829 Part of a pattern that is in square brackets is called a
830 "character class". In a character class the only meta-
831 characters are:
  \      general escape character
  ^      negate the class, but only if the first character
  -      indicates character range
  ]      terminates the character class
838 The following sections describe the use of each of the
839 meta-characters.
844 The backslash character has several uses. Firstly, if it is
845 followed by a non-alphameric character, it takes away any
846 special meaning that character may have. This use of
847 backslash as an escape character applies both inside and
848 outside character classes.
850 For example, if you want to match a "*" character, you write
851 "\*" in the pattern. This applies whether or not the follow-
852 ing character would otherwise be interpreted as a meta-
853 character, so it is always safe to precede a non-alphameric
854 with "\" to specify that it stands for itself. In particu-
855 lar, if you want to match a backslash, you write "\\".
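Because a backslash before any non-alphameric character is
always safe, arbitrary text can be turned into a pattern
that matches itself by escaping every such character. The
following is a minimal sketch in C; quote_literal is a
hypothetical helper written for this illustration, not a
PCRE function, and it relies on the C library's isalnum()
in the default locale.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Copies text into out, placing a backslash before every
   character that is not a letter or digit. Returns the length
   written, or (size_t)-1 if out (outsize bytes) is too small.
   Hypothetical helper, not part of the PCRE API. */
size_t quote_literal(const char *text, char *out, size_t outsize)
{
    size_t n = 0;
    assert(text != NULL && out != NULL);
    for (; *text != '\0'; text++) {
        int alnum = isalnum((unsigned char)*text);
        if (n + (alnum ? 1 : 2) >= outsize)
            return (size_t)-1;      /* no room for char(s) plus NUL */
        if (!alnum)
            out[n++] = '\\';        /* escape the non-alphameric */
        out[n++] = *text;
    }
    if (n >= outsize)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Given the text a*b, the helper writes a\*b, which as a
pattern matches only the three characters a, *, and b.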
857 If a pattern is compiled with the PCRE_EXTENDED option, whi-
858 tespace in the pattern (other than in a character class) and
859 characters between a "#" outside a character class and the
860 next newline character are ignored. An escaping backslash
861 can be used to include a whitespace or "#" character as part
862 of the pattern.
864 A second use of backslash provides a way of encoding non-
865 printing characters in patterns in a visible manner. There
866 is no restriction on the appearance of non-printing charac-
867 ters, apart from the binary zero that terminates a pattern,
868 but when a pattern is being prepared by text editing, it is
869 usually easier to use one of the following escape sequences
870 than the binary character it represents:
  \a     alarm, that is, the BEL character (hex 07)
  \cx    "control-x", where x is any character
  \e     escape (hex 1B)
  \f     formfeed (hex 0C)
  \n     newline (hex 0A)
  \r     carriage return (hex 0D)
  \t     tab (hex 09)
  \xhh   character with hex code hh
  \ddd   character with octal code ddd, or backreference
882 The precise effect of "\cx" is as follows: if "x" is a lower
883 case letter, it is converted to upper case. Then bit 6 of
884 the character (hex 40) is inverted. Thus "\cz" becomes hex
885 1A, but "\c{" becomes hex 3B, while "\c;" becomes hex 7B.
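The "\cx" rule can be written out in a few lines of C. This
is only a sketch of the rule as stated above, assuming ASCII
character codes; control_escape is a hypothetical name, not
a PCRE function.

```c
#include <assert.h>
#include <ctype.h>

/* Models "\cx": a lower case letter is first converted to
   upper case, then bit 6 (hex 40) of the code is inverted.
   Hypothetical helper; assumes ASCII. */
int control_escape(int x)
{
    assert(x >= 0 && x <= 127);
    if (islower(x))
        x = toupper(x);
    return x ^ 0x40;
}
```

With this model, control_escape('z') gives hex 1A,
control_escape('{') gives hex 3B, and control_escape(';')
gives hex 7B, in line with the examples in the text.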
887 After "\x", up to two hexadecimal digits are read (letters
888 can be in upper or lower case).
890 After "\0" up to two further octal digits are read. In both
891 cases, if there are fewer than two digits, just those that
892 are present are used. Thus the sequence "\0\x\07" specifies
893 two binary zeros followed by a BEL character. Make sure you
894 supply two digits after the initial zero if the character
895 that follows is itself an octal digit.
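The digit-reading rules for "\x" and "\0" can be sketched as
follows. The function is a hypothetical model written for
this description, not PCRE's own code: it reads at most two
hexadecimal digits after "\x" and at most two further octal
digits after "\0", and uses however many are present.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* p points at a backslash. If it starts a "\x" or "\0" escape,
   stores the byte value in *out and returns the number of
   pattern characters consumed; otherwise returns 0.
   Hypothetical model, not a PCRE function. */
size_t decode_numeric_escape(const char *p, unsigned char *out)
{
    size_t i = 2;
    int value = 0, digits = 0;
    assert(p != NULL && out != NULL);
    if (p[0] != '\\')
        return 0;
    if (p[1] == 'x') {
        while (digits < 2 && isxdigit((unsigned char)p[i])) {
            int c = tolower((unsigned char)p[i]);
            value = value * 16 + (isdigit(c) ? c - '0' : c - 'a' + 10);
            i++;
            digits++;
        }
    } else if (p[1] == '0') {
        while (digits < 2 && p[i] >= '0' && p[i] <= '7') {
            value = value * 8 + (p[i] - '0');
            i++;
            digits++;
        }
    } else {
        return 0;
    }
    *out = (unsigned char)value;
    return i;
}
```

Applied three times in succession to the pattern text
\0\x\07, the model yields the bytes 0, 0, and 7, consuming
2, 2, and 3 characters respectively.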
897 The handling of a backslash followed by a digit other than 0
898 is complicated. Outside a character class, PCRE reads it
899 and any following digits as a decimal number. If the number
900 is less than 10, or if there have been at least that many
901 previous capturing left parentheses in the expression, the
902 entire sequence is taken as a back reference. A description
903 of how this works is given later, following the discussion
904 of parenthesized subpatterns.
906 Inside a character class, or if the decimal number is
907 greater than 9 and there have not been that many capturing
908 subpatterns, PCRE re-reads up to three octal digits follow-
909 ing the backslash, and generates a single byte from the
910 least significant 8 bits of the value. Any subsequent digits
911 stand for themselves. For example:
  \040   is another way of writing a space
  \40    is the same, provided there are fewer than 40
         previous capturing subpatterns
  \7     is always a back reference
  \11    might be a back reference, or another way of
         writing a tab
  \011   is always a tab
  \0113  is a tab followed by the character "3"
  \113   is the character with octal code 113 (since there
         can be no more than 99 back references)
  \377   is a byte consisting entirely of 1 bits
  \81    is either a back reference, or a binary zero
         followed by the two characters "8" and "1"
927 Note that octal values of 100 or greater must not be intro-
928 duced by a leading zero, because no more than three octal
929 digits are ever read.
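The decision between a back reference and an octal byte can
be summarized in code. The sketch below models the rules
just given; the type and function names are hypothetical and
the logic is a reading of this text, not PCRE's source. The
caller passes the digits that follow the backslash, whether
the escape is inside a character class, and the number of
capturing left parentheses seen so far.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

typedef struct {
    int is_backref;   /* 1: back reference, 0: single byte value    */
    int value;        /* group number, or the byte that is produced */
    size_t consumed;  /* digits used; any that remain are literals  */
} digit_escape;

/* Hypothetical model of how a backslash followed by digits is
   interpreted. */
digit_escape classify_digit_escape(const char *digits, int in_class,
                                   int prior_groups)
{
    digit_escape r = { 0, 0, 0 };
    size_t i;
    assert(digits != NULL && isdigit((unsigned char)digits[0]));

    if (!in_class && digits[0] != '0') {
        int n = 0;
        for (i = 0; isdigit((unsigned char)digits[i]); i++)
            if (n < 1000)             /* cap: avoids int overflow */
                n = n * 10 + (digits[i] - '0');
        if (n < 10 || n <= prior_groups) {
            r.is_backref = 1;
            r.value = n;
            r.consumed = i;
            return r;
        }
    }
    /* Otherwise re-read up to three octal digits as one byte. */
    for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
        r.value = r.value * 8 + (digits[i] - '0');
    r.value &= 0xFF;                  /* least significant 8 bits */
    r.consumed = i;
    return r;
}
```

For the digits "81" with no previous capturing parentheses,
the model reports a byte value of zero with no digits
consumed, so "8" and "1" remain as literal characters.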
931 All the sequences that define a single byte value can be
932 used both inside and outside character classes. In addition,
933 inside a character class, the sequence "\b" is interpreted
934 as the backspace character (hex 08). Outside a character
935 class it has a different meaning (see below).
937 The third use of backslash is for specifying generic charac-
938 ter types:
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \w     any "word" character
  \W     any "non-word" character
947 Each pair of escape sequences partitions the complete set of
948 characters into two disjoint sets. Any given character
949 matches one, and only one, of each pair.
951 A "word" character is any letter or digit or the underscore
952 character, that is, any character which can be part of a
953 Perl "word". The definition of letters and digits is con-
954 trolled by PCRE's character tables, and may vary if locale-
955 specific matching is taking place (see "Locale support"
956 above). For example, in the "fr" (French) locale, some char-
957 acter codes greater than 128 are used for accented letters,
958 and these are matched by \w.
960 These character type sequences can appear both inside and
961 outside character classes. They each match one character of
962 the appropriate type. If the current matching point is at
963 the end of the subject string, all of them fail, since there
964 is no character to match.
966 The fourth use of backslash is for certain simple asser-
967 tions. An assertion specifies a condition that has to be met
968 at a particular point in a match, without consuming any
969 characters from the subject string. The use of subpatterns
970 for more complicated assertions is described below. The
971 backslashed assertions are
  \b     word boundary
  \B     not a word boundary
  \A     start of subject (independent of multiline mode)
  \Z     end of subject or newline at end (independent of
         multiline mode)
  \z     end of subject (independent of multiline mode)
980 These assertions may not appear in character classes (but
981 note that "\b" has a different meaning, namely the backspace
982 character, inside a character class).
984 A word boundary is a position in the subject string where
985 the current character and the previous character do not both
986 match \w or \W (i.e. one matches \w and the other matches
987 \W), or the start or end of the string if the first or last
988 character matches \w, respectively.
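The word boundary definition amounts to comparing the
"wordness" of the characters on either side of a position,
treating the positions beyond either end of the string as
non-word. A small sketch in C follows. The function names
are hypothetical, and isalnum() in the default C locale
stands in for PCRE's character tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Stand-in for \w: letter, digit, or underscore. */
int is_word_char(int c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* Returns 1 if \b would be true at offset pos of a subject of
   length len. Hypothetical model, not a PCRE function. */
int is_word_boundary(const char *subject, size_t len, size_t pos)
{
    int prev, cur;
    assert(subject != NULL && pos <= len);
    prev = pos > 0 && is_word_char(subject[pos - 1]);
    cur = pos < len && is_word_char(subject[pos]);
    return prev != cur;
}
```

For the subject "a cat", the model reports boundaries at
offsets 0, 1, 2, and 5, and none at offset 3, which lies
inside the word "cat".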
990 The \A, \Z, and \z assertions differ from the traditional
991 circumflex and dollar (described below) in that they only
992 ever match at the very start and end of the subject string,
993 whatever options are set. They are not affected by the
994 PCRE_NOTBOL or PCRE_NOTEOL options. If the startoffset argu-
995 ment of pcre_exec() is non-zero, \A can never match. The
996 difference between \Z and \z is that \Z matches before a
997 newline that is the last character of the string as well as
998 at the end of the string, whereas \z matches only at the
999 end.
1004 Outside a character class, in the default matching mode, the
1005 circumflex character is an assertion which is true only if
1006 the current matching point is at the start of the subject
1007 string. If the startoffset argument of pcre_exec() is non-
1008 zero, circumflex can never match. Inside a character class,
1009 circumflex has an entirely different meaning (see below).
1011 Circumflex need not be the first character of the pattern if
1012 a number of alternatives are involved, but it should be the
1013 first thing in each alternative in which it appears if the
1014 pattern is ever to match that branch. If all possible alter-
1015 natives start with a circumflex, that is, if the pattern is
1016 constrained to match only at the start of the subject, it is
1017 said to be an "anchored" pattern. (There are also other con-
1018 structs that can cause a pattern to be anchored.)
1020 A dollar character is an assertion which is true only if the
1021 current matching point is at the end of the subject string,
1022 or immediately before a newline character that is the last
1023 character in the string (by default). Dollar need not be the
1024 last character of the pattern if a number of alternatives
1025 are involved, but it should be the last item in any branch
1026 in which it appears. Dollar has no special meaning in a
1027 character class.
1029 The meaning of dollar can be changed so that it matches only
1030 at the very end of the string, by setting the
1031 PCRE_DOLLAR_ENDONLY option at compile or matching time. This
1032 does not affect the \Z assertion.
1034 The meanings of the circumflex and dollar characters are
1035 changed if the PCRE_MULTILINE option is set. When this is
1036 the case, they match immediately after and immediately
1037 before an internal "\n" character, respectively, in addition
1038 to matching at the start and end of the subject string. For
1039 example, the pattern /^abc$/ matches the subject string
1040 "def\nabc" in multiline mode, but not otherwise. Conse-
1041 quently, patterns that are anchored in single line mode
1042 because all branches start with "^" are not anchored in mul-
1043 tiline mode, and a match for circumflex is possible when the
1044 startoffset argument of pcre_exec() is non-zero. The
1045 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is
1046 set.
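The positions at which circumflex and dollar are true can be
modelled directly from the description above. In this sketch
the option bits are hypothetical stand-ins, not the values
from pcre.h, and positions are counted from the start of the
subject (a non-zero startoffset is not modelled). The sketch
treats every newline alike, so it says nothing reliable
about circumflex after a newline that ends the subject.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option bits for this model only. */
enum { OPT_MULTILINE = 1, OPT_DOLLAR_ENDONLY = 2 };

/* Returns 1 if ^ is true at offset pos. */
int circumflex_matches(const char *subject, size_t pos, int opts)
{
    assert(subject != NULL);
    if (pos == 0)
        return 1;                           /* start of subject */
    return (opts & OPT_MULTILINE) && subject[pos - 1] == '\n';
}

/* Returns 1 if $ is true at offset pos of a subject of length len. */
int dollar_matches(const char *subject, size_t len, size_t pos, int opts)
{
    assert(subject != NULL && pos <= len);
    if (pos == len)
        return 1;                           /* very end of subject */
    if (opts & OPT_MULTILINE)
        return subject[pos] == '\n';        /* before any newline */
    if (opts & OPT_DOLLAR_ENDONLY)
        return 0;                           /* only the very end */
    return pos + 1 == len && subject[pos] == '\n';
}
```

For the subject "def\nabc", circumflex is true at offset 4
only when the multiline bit is set, which is why /^abc$/
matches that subject in multiline mode but not otherwise.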
Note that the sequences \A, \Z, and \z can be used to match
the start and end of the subject in both modes, and if all
branches of a pattern start with \A it is always anchored,
whether PCRE_MULTILINE is set or not.
1056 Outside a character class, a dot in the pattern matches any
1057 one character in the subject, including a non-printing char-
1058 acter, but not (by default) newline. If the PCRE_DOTALL
1059 option is set, dots match newlines as well. The handling of
1060 dot is entirely independent of the handling of circumflex
1061 and dollar, the only relationship being that they both
1062 involve newline characters. Dot has no special meaning in a
1063 character class.
1068 An opening square bracket introduces a character class, ter-
1069 minated by a closing square bracket. A closing square
1070 bracket on its own is not special. If a closing square
1071 bracket is required as a member of the class, it should be
1072 the first data character in the class (after an initial cir-
1073 cumflex, if present) or escaped with a backslash.
1075 A character class matches a single character in the subject;
1076 the character must be in the set of characters defined by
1077 the class, unless the first character in the class is a cir-
1078 cumflex, in which case the subject character must not be in
1079 the set defined by the class. If a circumflex is actually
1080 required as a member of the class, ensure it is not the
1081 first character, or escape it with a backslash.
1083 For example, the character class [aeiou] matches any lower
1084 case vowel, while [^aeiou] matches any character that is not
1085 a lower case vowel. Note that a circumflex is just a con-
1086 venient notation for specifying the characters which are in
1087 the class by enumerating those that are not. It is not an
1088 assertion: it still consumes a character from the subject
1089 string, and fails if the current pointer is at the end of
1090 the string.
1092 When caseless matching is set, any letters in a class
1093 represent both their upper case and lower case versions, so
1094 for example, a caseless [aeiou] matches "A" as well as "a",
1095 and a caseless [^aeiou] does not match "A", whereas a case-
1096 ful version would.
1098 The newline character is never treated in any special way in
1099 character classes, whatever the setting of the PCRE_DOTALL
1100 or PCRE_MULTILINE options is. A class such as [^a] will
1101 always match a newline.
1103 The minus (hyphen) character can be used to specify a range
1104 of characters in a character class. For example, [d-m]
1105 matches any letter between d and m, inclusive. If a minus
1106 character is required in a class, it must be escaped with a
1107 backslash or appear in a position where it cannot be inter-
1108 preted as indicating a range, typically as the first or last
1109 character in the class.
1111 It is not possible to have the literal character "]" as the
1112 end character of a range. A pattern such as [W-]46] is
1113 interpreted as a class of two characters ("W" and "-") fol-
1114 lowed by a literal string "46]", so it would match "W46]" or
1115 "-46]". However, if the "]" is escaped with a backslash it
1116 is interpreted as the end of range, so [W-\]46] is inter-
1117 preted as a single class containing a range followed by two
1118 separate characters. The octal or hexadecimal representation
1119 of "]" can also be used to end a range.
1121 Ranges operate in ASCII collating sequence. They can also be
1122 used for characters specified numerically, for example
1123 [\000-\037]. If a range that includes letters is used when
1124 caseless matching is set, it matches the letters in either
1125 case. For example, [W-c] is equivalent to [][\^_`wxyzabc],
1126 matched caselessly, and if character tables for the "fr"
1127 locale are in use, [\xc8-\xcb] matches accented E characters
1128 in both cases.
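One way to picture a caseless range is that a character
matches if either it or its other-case form lies within the
range. The sketch below assumes ASCII and uses the C
library's case functions in place of PCRE's tables;
range_matches is a hypothetical name.

```c
#include <assert.h>
#include <ctype.h>

/* Returns 1 if character c matches the class range lo-hi.
   When caseless is non-zero, the other case of c is tried
   as well. Hypothetical model; assumes ASCII. */
int range_matches(int c, int lo, int hi, int caseless)
{
    int other;
    assert(c >= 0 && c <= 255 && lo <= hi);
    if (c >= lo && c <= hi)
        return 1;
    if (!caseless)
        return 0;
    other = islower(c) ? toupper(c) : tolower(c);
    return other >= lo && other <= hi;
}
```

For the range W-c, the model accepts "x" only in caseless
mode (through "X"), accepts "C" only in caseless mode
(through "c"), and rejects "d" in both modes.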
1130 The character types \d, \D, \s, \S, \w, and \W may also
1131 appear in a character class, and add the characters that
1132 they match to the class. For example, [\dABCDEF] matches any
1133 hexadecimal digit. A circumflex can conveniently be used
1134 with the upper case character types to specify a more res-
1135 tricted set of characters than the matching lower case type.
1136 For example, the class [^\W_] matches any letter or digit,
1137 but not underscore.
1139 All non-alphameric characters other than \, -, ^ (at the
1140 start) and the terminating ] are non-special in character
1141 classes, but it does no harm if they are escaped.
1146 Perl 5.6 (not yet released at the time of writing) is going
1147 to support the POSIX notation for character classes, which
1148 uses names enclosed by [: and :] within the enclosing
1149 square brackets. PCRE supports this notation. For example,
1151 [01[:alpha:]%]
1153 matches "0", "1", any alphabetic character, or "%". The sup-
1154 ported class names are
  alnum    letters and digits
  alpha    letters
  ascii    character codes 0 - 127
  cntrl    control characters
  digit    decimal digits (same as \d)
  graph    printing characters, excluding space
  lower    lower case letters
  print    printing characters, including space
  punct    printing characters, excluding letters and digits
  space    white space (same as \s)
  upper    upper case letters
  word     "word" characters (same as \w)
  xdigit   hexadecimal digits
1170 The names "ascii" and "word" are Perl extensions. Another
1171 Perl extension is negation, which is indicated by a ^ char-
1172 acter after the colon. For example,
1174 [12[:^digit:]]
matches "1", "2", or any non-digit. PCRE (and Perl) also
recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
"collating element", but these are not supported, and an
error is given if they are encountered.
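The class names correspond closely to the character tests in
the C library's <ctype.h>. The sketch below uses those tests
as an approximation. That mapping is an assumption made for
illustration: PCRE consults its own character tables, and
the C library's idea of white space may not coincide exactly
with \s. Both function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 or 0 for a known class name, -1 for an unknown
   one. A leading '^' in name negates the test. Approximation
   built on <ctype.h>; not a PCRE function. */
int posix_class_matches(const char *name, int c)
{
    int negate, r;
    assert(name != NULL && c >= 0 && c <= 255);
    negate = (name[0] == '^');
    if (negate)
        name++;
    if      (strcmp(name, "alnum")  == 0) r = isalnum(c);
    else if (strcmp(name, "alpha")  == 0) r = isalpha(c);
    else if (strcmp(name, "ascii")  == 0) r = (c <= 127);
    else if (strcmp(name, "cntrl")  == 0) r = iscntrl(c);
    else if (strcmp(name, "digit")  == 0) r = isdigit(c);
    else if (strcmp(name, "graph")  == 0) r = isgraph(c);
    else if (strcmp(name, "lower")  == 0) r = islower(c);
    else if (strcmp(name, "print")  == 0) r = isprint(c);
    else if (strcmp(name, "punct")  == 0) r = ispunct(c);
    else if (strcmp(name, "space")  == 0) r = isspace(c);
    else if (strcmp(name, "upper")  == 0) r = isupper(c);
    else if (strcmp(name, "word")   == 0) r = isalnum(c) || c == '_';
    else if (strcmp(name, "xdigit") == 0) r = isxdigit(c);
    else return -1;
    r = (r != 0);
    return negate ? !r : r;
}

/* Models the class [01[:alpha:]%] from the text. */
int in_example_class(int c)
{
    return c == '0' || c == '1' || c == '%' ||
           posix_class_matches("alpha", c) == 1;
}
```

Under this approximation in_example_class() accepts "0",
"q", and "%" and rejects "2", while the name "^digit"
accepts "x" and rejects "7".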
1184 Vertical bar characters are used to separate alternative
1185 patterns. For example, the pattern
1187 gilbert|sullivan
1189 matches either "gilbert" or "sullivan". Any number of alter-
1190 natives may appear, and an empty alternative is permitted
1191 (matching the empty string). The matching process tries
1192 each alternative in turn, from left to right, and the first
1193 one that succeeds is used. If the alternatives are within a
1194 subpattern (defined below), "succeeds" means matching the
1195 rest of the main pattern as well as the alternative in the
1196 subpattern.
The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL,
and PCRE_EXTENDED can be changed from within the pattern by
1203 a sequence of Perl option letters enclosed between "(?" and
1204 ")". The option letters are
  i  for PCRE_CASELESS
  m  for PCRE_MULTILINE
  s  for PCRE_DOTALL
  x  for PCRE_EXTENDED
1211 For example, (?im) sets caseless, multiline matching. It is
1212 also possible to unset these options by preceding the letter
1213 with a hyphen, and a combined setting and unsetting such as
1214 (?im-sx), which sets PCRE_CASELESS and PCRE_MULTILINE while
1215 unsetting PCRE_DOTALL and PCRE_EXTENDED, is also permitted.
1216 If a letter appears both before and after the hyphen, the
1217 option is unset.
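The effect of a sequence of option letters on a set of
option bits can be sketched as follows. The bit values and
the function name are hypothetical and chosen for this
illustration; they are not the constants defined in pcre.h.
Letters before the hyphen are collected as settings, letters
after it as unsettings, and the unsettings are applied last.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bit values for this model only. */
enum {
    OPT_CASELESS = 1, OPT_MULTILINE = 2,
    OPT_DOTALL = 4,   OPT_EXTENDED = 8
};

/* letters is the text between "(?" and ")", for example
   "im-sx". Returns the new option bits, or -1 if an unknown
   letter is found. */
int apply_option_letters(const char *letters, int options)
{
    int set = 0, unset = 0;
    int *target = &set;
    assert(letters != NULL && options >= 0);
    for (; *letters != '\0'; letters++) {
        switch (*letters) {
        case '-': target = &unset;          break;
        case 'i': *target |= OPT_CASELESS;  break;
        case 'm': *target |= OPT_MULTILINE; break;
        case 's': *target |= OPT_DOTALL;    break;
        case 'x': *target |= OPT_EXTENDED;  break;
        default:  return -1;
        }
    }
    /* Unsetting wins when a letter appears on both sides. */
    return (options | set) & ~unset;
}
```

Starting from OPT_DOTALL | OPT_EXTENDED, the letters "im-sx"
leave only OPT_CASELESS | OPT_MULTILINE set, and the letters
"i-i" leave OPT_CASELESS unset.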
1219 The scope of these option changes depends on where in the
1220 pattern the setting occurs. For settings that are outside
1221 any subpattern (defined below), the effect is the same as if
1222 the options were set or unset at the start of matching. The
1223 following patterns all behave in exactly the same way:
1225 (?i)abc
1226 a(?i)bc
1227 ab(?i)c
1228 abc(?i)
1230 which in turn is the same as compiling the pattern abc with
1231 PCRE_CASELESS set. In other words, such "top level" set-
1232 tings apply to the whole pattern (unless there are other
1233 changes inside subpatterns). If there is more than one set-
1234 ting of the same option at top level, the rightmost setting
1235 is used.
1237 If an option change occurs inside a subpattern, the effect
1238 is different. This is a change of behaviour in Perl 5.005.
1239 An option change inside a subpattern affects only that part
1240 of the subpattern that follows it, so
1242 (a(?i)b)c
1244 matches abc and aBc and no other strings (assuming
1245 PCRE_CASELESS is not used). By this means, options can be
1246 made to have different settings in different parts of the
1247 pattern. Any changes made in one alternative do carry on
1248 into subsequent branches within the same subpattern. For
1249 example,
1251 (a(?i)b|c)
1253 matches "ab", "aB", "c", and "C", even though when matching
1254 "C" the first branch is abandoned before the option setting.
1255 This is because the effects of option settings happen at
1256 compile time. There would be some very weird behaviour oth-
1257 erwise.
1259 The PCRE-specific options PCRE_UNGREEDY and PCRE_EXTRA can
1260 be changed in the same way as the Perl-compatible options by
1261 using the characters U and X respectively. The (?X) flag
1262 setting is special in that it must always occur earlier in
1263 the pattern than any of the additional features it turns on,
1264 even when it is at top level. It is best put at the start.
1269 Subpatterns are delimited by parentheses (round brackets),
1270 which can be nested. Marking part of a pattern as a subpat-
1271 tern does two things:
1273 1. It localizes a set of alternatives. For example, the pat-
1274 tern
1276 cat(aract|erpillar|)
1278 matches one of the words "cat", "cataract", or "caterpil-
1279 lar". Without the parentheses, it would match "cataract",
1280 "erpillar" or the empty string.
1282 2. It sets up the subpattern as a capturing subpattern (as
1283 defined above). When the whole pattern matches, that por-
1284 tion of the subject string that matched the subpattern is
1285 passed back to the caller via the ovector argument of
1286 pcre_exec(). Opening parentheses are counted from left to
1287 right (starting from 1) to obtain the numbers of the captur-
1288 ing subpatterns.
1290 For example, if the string "the red king" is matched against
1291 the pattern
1293 the ((red|white) (king|queen))
1295 the captured substrings are "red king", "red", and "king",
1296 and are numbered 1, 2, and 3.
1298 The fact that plain parentheses fulfil two functions is not
1299 always helpful. There are often times when a grouping sub-
1300 pattern is required without a capturing requirement. If an
1301 opening parenthesis is followed by "?:", the subpattern does
1302 not do any capturing, and is not counted when computing the
1303 number of any subsequent capturing subpatterns. For example,
1304 if the string "the white queen" is matched against the pat-
1305 tern
1307 the ((?:red|white) (king|queen))
1309 the captured substrings are "white queen" and "queen", and
1310 are numbered 1 and 2. The maximum number of captured sub-
1311 strings is 99, and the maximum number of all subpatterns,
1312 both capturing and non-capturing, is 200.
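The numbering rule can be illustrated by a scanner that
counts opening parentheses from left to right, skipping
those followed by "?". This is a simplified sketch with a
hypothetical function name: it honours backslash escapes and
simple character classes, but it does not understand POSIX
names such as [:alpha:] inside a class, and it is not how
PCRE itself compiles a pattern.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of capturing subpatterns in pattern.
   Simplified model, not a PCRE function. */
int count_capturing_groups(const char *pattern)
{
    const char *p;
    int count = 0, in_class = 0;
    assert(pattern != NULL);
    for (p = pattern; *p != '\0'; p++) {
        if (*p == '\\') {               /* escaped character */
            if (p[1] != '\0')
                p++;
            continue;
        }
        if (in_class) {
            if (*p == ']')
                in_class = 0;
            continue;
        }
        if (*p == '[') {
            in_class = 1;
            if (p[1] == '^')            /* negation marker */
                p++;
            if (p[1] == ']')            /* "]" first: literal */
                p++;
            continue;
        }
        if (*p == '(' && p[1] != '?')
            count++;
    }
    return count;
}
```

The scanner reports 3 for the pattern
the ((red|white) (king|queen)) and 2 for
the ((?:red|white) (king|queen)), matching the numbering
given above.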
1314 As a convenient shorthand, if any option settings are
1315 required at the start of a non-capturing subpattern, the
1316 option letters may appear between the "?" and the ":". Thus
1317 the two patterns
1319 (?i:saturday|sunday)
1320 (?:(?i)saturday|sunday)
1322 match exactly the same set of strings. Because alternative
1323 branches are tried from left to right, and options are not
1324 reset until the end of the subpattern is reached, an option
1325 setting in one branch does affect subsequent branches, so
1326 the above patterns match "SUNDAY" as well as "Saturday".
1331 Repetition is specified by quantifiers, which can follow any
1332 of the following items:
1334 a single character, possibly escaped
1335 the . metacharacter
1336 a character class
1337 a back reference (see next section)
1338 a parenthesized subpattern (unless it is an assertion -
1339 see below)
1341 The general repetition quantifier specifies a minimum and
1342 maximum number of permitted matches, by giving the two
1343 numbers in curly brackets (braces), separated by a comma.
1344 The numbers must be less than 65536, and the first must be
1345 less than or equal to the second. For example:
1347 z{2,4}
1349 matches "zz", "zzz", or "zzzz". A closing brace on its own
1350 is not a special character. If the second number is omitted,
1351 but the comma is present, there is no upper limit; if the
1352 second number and the comma are both omitted, the quantifier
1353 specifies an exact number of required matches. Thus
1355 [aeiou]{3,}
1357 matches at least 3 successive vowels, but may match many
1358 more, while
1360 \d{8}
1362 matches exactly 8 digits. An opening curly bracket that
1363 appears in a position where a quantifier is not allowed, or
1364 one that does not match the syntax of a quantifier, is taken
1365 as a literal character. For example, {,6} is not a quantif-
1366 ier, but a literal string of four characters.
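The syntax check for a brace quantifier can be sketched in
C. The function below is a hypothetical model of the rules
stated here, not PCRE's parser: it returns the length of the
quantifier when the text has the form {n}, {n,} or {n,m},
returns 0 when the opening brace must be taken as a literal,
and returns -1 when the form is right but a number is too
large or the numbers are out of order.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Reads a run of digits into *value, saturating so that very
   long runs cannot overflow. Returns the first non-digit. */
static const char *read_number(const char *q, long *value)
{
    *value = 0;
    while (isdigit((unsigned char)*q)) {
        if (*value <= 65535)
            *value = *value * 10 + (*q - '0');
        q++;
    }
    return q;
}

/* p points at "{". On success *min and *max are set, with
   *max == -1 meaning no upper limit. When 0 is returned the
   contents of *min and *max are not meaningful.
   Hypothetical model, not a PCRE function. */
int parse_brace_quantifier(const char *p, long *min, long *max)
{
    const char *q;
    assert(p != NULL && min != NULL && max != NULL);
    if (p[0] != '{' || !isdigit((unsigned char)p[1]))
        return 0;                       /* e.g. {,6} is literal */
    q = read_number(p + 1, min);
    *max = *min;                        /* {n} means exactly n */
    if (*q == ',') {
        q++;
        if (isdigit((unsigned char)*q))
            q = read_number(q, max);
        else
            *max = -1;                  /* {n,} has no upper limit */
    }
    if (*q != '}')
        return 0;
    if (*min > 65535 || *max > 65535 || (*max != -1 && *min > *max))
        return -1;
    return (int)(q - p) + 1;
}
```

For "{2,4}" the model returns 5 with a minimum of 2 and a
maximum of 4; for "{,6}" it returns 0, so the four
characters are literal.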
1368 The quantifier {0} is permitted, causing the expression to
1369 behave as if the previous item and the quantifier were not
1370 present.
1372 For convenience (and historical compatibility) the three
1373 most common quantifiers have single-character abbreviations:
  *    is equivalent to {0,}
  +    is equivalent to {1,}
  ?    is equivalent to {0,1}
1379 It is possible to construct infinite loops by following a
1380 subpattern that can match no characters with a quantifier
1381 that has no upper limit, for example:
1383 (a?)*
1385 Earlier versions of Perl and PCRE used to give an error at
1386 compile time for such patterns. However, because there are
1387 cases where this can be useful, such patterns are now
1388 accepted, but if any repetition of the subpattern does in
1389 fact match no characters, the loop is forcibly broken.
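The loop-breaking rule can be seen in a hand-written
equivalent of (a?)* applied at the start of a string. This
is a sketch of the idea only, with a hypothetical function
name; it is not PCRE's matching code.

```c
#include <assert.h>
#include <stddef.h>

/* Models (a?)* at the start of s and returns the number of
   characters consumed. Hypothetical helper. */
size_t match_optional_a_star(const char *s)
{
    size_t pos = 0;
    assert(s != NULL);
    for (;;) {
        size_t before = pos;
        if (s[pos] == 'a')          /* a? : take an "a" if present */
            pos++;
        if (pos == before)          /* iteration matched nothing:  */
            break;                  /* break the loop forcibly     */
    }
    return pos;
}
```

Without the test for an empty iteration the loop would never
end once the run of "a" characters was used up.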
1391 By default, the quantifiers are "greedy", that is, they
1392 match as much as possible (up to the maximum number of per-
1393 mitted times), without causing the rest of the pattern to
1394 fail. The classic example of where this gives problems is in
1395 trying to match comments in C programs. These appear between
1396 the sequences /* and */ and within the sequence, individual
1397 * and / characters may appear. An attempt to match C com-
1398 ments by applying the pattern
1400 /\*.*\*/
1402 to the string
  /* first comment */ not comment /* second comment */
1406 fails, because it matches the entire string due to the
1407 greediness of the .* item.
1409 However, if a quantifier is followed by a question mark, it
1410 ceases to be greedy, and instead matches the minimum number
1411 of times possible, so the pattern
1413 /\*.*?\*/
1415 does the right thing with the C comments. The meaning of the
1416 various quantifiers is not otherwise changed, just the pre-
1417 ferred number of matches. Do not confuse this use of ques-
1418 tion mark with its use as a quantifier in its own right.
1419 Because it has two uses, it can sometimes appear doubled, as
1420 in
1422 \d??\d
1424 which matches one digit by preference, but can match two if
1425 that is the only way the rest of the pattern matches.
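For the C comment example, the difference between the greedy
and the minimizing quantifier can be reproduced with plain
string searching: the greedy form extends to the last
comment closer in the subject, the minimizing form stops at
the first. The sketch assumes a subject with no newlines
(since dot does not match newline by default), and
match_c_comment is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Finds the first comment opener in s and returns the length
   of the text that the greedy (greedy != 0) or minimizing
   (greedy == 0) pattern would match from there, or 0 if there
   is no match. *start receives the offset of the match.
   Hypothetical helper; assumes s contains no newlines. */
size_t match_c_comment(const char *s, int greedy, size_t *start)
{
    const char *open, *close, *next;
    assert(s != NULL && start != NULL);
    open = strstr(s, "/*");
    if (open == NULL)
        return 0;
    close = strstr(open + 2, "*/");     /* first closer */
    if (close == NULL)
        return 0;
    if (greedy)                         /* extend to the last closer */
        while ((next = strstr(close + 1, "*/")) != NULL)
            close = next;
    *start = (size_t)(open - s);
    return (size_t)(close + 2 - open);
}
```

On a subject holding two comments separated by other text,
the greedy call returns the length of the whole subject,
while the minimizing call returns only the length of the
first comment.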
1427 If the PCRE_UNGREEDY option is set (an option which is not
1428 available in Perl), the quantifiers are not greedy by
1429 default, but individual ones can be made greedy by following
1430 them with a question mark. In other words, it inverts the
1431 default behaviour.
1433 When a parenthesized subpattern is quantified with a minimum
1434 repeat count that is greater than 1 or with a limited max-
1435 imum, more store is required for the compiled pattern, in
1436 proportion to the size of the minimum or maximum.
1438 If a pattern starts with .* or .{0,} and the PCRE_DOTALL
1439 option (equivalent to Perl's /s) is set, thus allowing the .
1440 to match newlines, the pattern is implicitly anchored,
1441 because whatever follows will be tried against every charac-
1442 ter position in the subject string, so there is no point in
1443 retrying the overall match at any position after the first.
1444 PCRE treats such a pattern as though it were preceded by \A.
1445 In cases where it is known that the subject string contains
1446 no newlines, it is worth setting PCRE_DOTALL when the pat-
1447 tern begins with .* in order to obtain this optimization, or
1448 alternatively using ^ to indicate anchoring explicitly.
1450 When a capturing subpattern is repeated, the value captured
1451 is the substring that matched the final iteration. For exam-
1452 ple, after
1454 (tweedle[dume]{3}\s*)+
1456 has matched "tweedledum tweedledee" the value of the cap-
1457 tured substring is "tweedledee". However, if there are
1458 nested capturing subpatterns, the corresponding captured
1459 values may have been set in previous iterations. For exam-
1460 ple, after
1462 /(a|(b))+/
1464 matches "aba" the value of the second captured substring is
1465 "b".
1470 Outside a character class, a backslash followed by a digit
1471 greater than 0 (and possibly further digits) is a back
1472 reference to a capturing subpattern earlier (i.e. to its
1473 left) in the pattern, provided there have been that many
1474 previous capturing left parentheses.
1476 However, if the decimal number following the backslash is
1477 less than 10, it is always taken as a back reference, and
1478 causes an error only if there are not that many capturing
1479 left parentheses in the entire pattern. In other words, the
1480 parentheses that are referenced need not be to the left of
1481 the reference for numbers less than 10. See the section
1482 entitled "Backslash" above for further details of the han-
1483 dling of digits following a backslash.
1485 A back reference matches whatever actually matched the cap-
1486 turing subpattern in the current subject string, rather than
1487 anything matching the subpattern itself. So the pattern
1489 (sens|respons)e and \1ibility
1491 matches "sense and sensibility" and "response and responsi-
1492 bility", but not "sense and responsibility". If caseful
1493 matching is in force at the time of the back reference, the
1494 case of letters is relevant. For example,
1496 ((?i)rah)\s+\1
1498 matches "rah rah" and "RAH RAH", but not "RAH rah", even
1499 though the original capturing subpattern is matched case-
1500 lessly.
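The point that a back reference repeats the captured text,
rather than re-applying the subpattern, can be shown by
writing the matching steps out by hand for the pattern
(sens|respons)e and \1ibility. The sketch below is a
hypothetical, hand-coded matcher that tries the pattern only
at the first character of the subject; it is not a general
engine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if subject, taken from its first character,
   matches (sens|respons)e and \1ibility. Hypothetical
   hand-coded model. */
int matches_sense_pattern(const char *subject)
{
    static const char *const alternatives[] = { "sens", "respons" };
    size_t i;
    assert(subject != NULL);
    for (i = 0; i < 2; i++) {
        size_t caplen = strlen(alternatives[i]);
        const char *captured = subject;     /* text that group 1 takes */
        const char *p = subject;
        if (strncmp(p, alternatives[i], caplen) != 0)
            continue;
        p += caplen;
        if (strncmp(p, "e and ", 6) != 0)
            continue;
        p += 6;
        /* \1: compare against what was captured, not against
           the alternatives of the subpattern. */
        if (strncmp(p, captured, caplen) != 0)
            continue;
        p += caplen;
        if (strncmp(p, "ibility", 7) == 0)
            return 1;
    }
    return 0;
}
```

The helper accepts "sense and sensibility" and "response and
responsibility" and rejects "sense and responsibility".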
1502 There may be more than one back reference to the same sub-
1503 pattern. If a subpattern has not actually been used in a
1504 particular match, any back references to it always fail. For
1505 example, the pattern
1507 (a|(bc))\2
1509 always fails if it starts to match "a" rather than "bc".
1510 Because there may be up to 99 back references, all digits
1511 following the backslash are taken as part of a potential
1512 back reference number. If the pattern continues with a digit
1513 character, some delimiter must be used to terminate the back
1514 reference. If the PCRE_EXTENDED option is set, this can be
1515 whitespace. Otherwise an empty comment can be used.
1517 A back reference that occurs inside the parentheses to which
1518 it refers fails when the subpattern is first used, so, for
1519 example, (a\1) never matches. However, such references can
1520 be useful inside repeated subpatterns. For example, the
1521 pattern
1523 (a|b\1)+
matches any number of "a"s and also "aba", "ababbaa" etc. At
1526 each iteration of the subpattern, the back reference matches
1527 the character string corresponding to the previous itera-
1528 tion. In order for this to work, the pattern must be such
1529 that the first iteration does not need to match the back
1530 reference. This can be done using alternation, as in the
1531 example above, or by a quantifier with a minimum of zero.
1536 An assertion is a test on the characters following or
1537 preceding the current matching point that does not actually
1538 consume any characters. The simple assertions coded as \b,
1539 \B, \A, \Z, \z, ^ and $ are described above. More compli-
1540 cated assertions are coded as subpatterns. There are two
1541 kinds: those that look ahead of the current position in the
1542 subject string, and those that look behind it.
1544 An assertion subpattern is matched in the normal way, except
1545 that it does not cause the current matching position to be
1546 changed. Lookahead assertions start with (?= for positive
1547 assertions and (?! for negative assertions. For example,
1549 \w+(?=;)
1551 matches a word followed by a semicolon, but does not include
1552 the semicolon in the match, and
1554 foo(?!bar)
1556 matches any occurrence of "foo" that is not followed by
1557 "bar". Note that the apparently similar pattern
1559 (?!foo)bar
1561 does not find an occurrence of "bar" that is preceded by
1562 something other than "foo"; it finds any occurrence of "bar"
1563 whatsoever, because the assertion (?!foo) is always true
1564 when the next three characters are "bar". A lookbehind
1565 assertion is needed to achieve this effect.
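
The three lookahead patterns above can be checked with a small
sketch. Python's re module is used here as a stand-in for PCRE
(an assumption of this example); the subject strings are made up
for illustration.

```python
import re

# \w+(?=;) matches a word only when a semicolon follows it.
words = re.findall(r'\w+(?=;)', 'alpha; beta gamma;')

# foo(?!bar) skips the "foo" in "foobar" and finds the one in "foobaz".
not_bar = [m.start() for m in re.finditer(r'foo(?!bar)', 'foobar foobaz')]

# (?!foo)bar still matches the "bar" in "foobar": at that point the
# next three characters are "bar", so (?!foo) is trivially true.
still_bar = re.search(r'(?!foo)bar', 'foobar')

print(words)
print(not_bar)
print(still_bar.start())
```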
1567 Lookbehind assertions start with (?<= for positive asser-
1568 tions and (?<! for negative assertions. For example,
1570 (?<!foo)bar
1572 does find an occurrence of "bar" that is not preceded by
1573 "foo". The contents of a lookbehind assertion are restricted
1574 such that all the strings it matches must have a fixed
1575 length. However, if there are several alternatives, they do
1576 not all have to have the same fixed length. Thus
1578 (?<=bullock|donkey)
1580 is permitted, but
1582 (?<!dogs?|cats?)
1584 causes an error at compile time. Branches that match dif-
1585 ferent length strings are permitted only at the top level of
1586 a lookbehind assertion. This is an extension compared with
1587 Perl 5.005, which requires all branches to match the same
1588 length of string. An assertion such as
1590 (?<=ab(c|de))
1592 is not permitted, because its single top-level branch can
1593 match two different lengths, but it is acceptable if rewrit-
1594 ten to use two top-level branches:
1596 (?<=abc|abde)
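
A short sketch of the lookbehind rules follows, using Python's re
module as a stand-in (an assumption of this example). Python is
stricter than PCRE on one point: it requires the whole lookbehind
to have a single fixed length, so (?<=bullock|donkey) is written
below as two separate lookbehinds, one per top-level branch.

```python
import re

# (?<!foo)bar rejects the "bar" in "foobar" and finds the one in "crowbar".
hit = re.search(r'(?<!foo)bar', 'foobar crowbar')

# Each top-level branch has its own fixed length (7 and 6 characters).
either = re.compile(r'(?:(?<=bullock)|(?<=donkey)) cart')

print(hit.start())
print(bool(either.search('bullock cart')),
      bool(either.search('donkey cart')),
      bool(either.search('horse cart')))
```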
1598 The implementation of lookbehind assertions is, for each
1599 alternative, to temporarily move the current position back
1600 by the fixed width and then try to match. If there are
1601 insufficient characters before the current position, the
1602 match is deemed to fail. Lookbehinds in conjunction with
1603 once-only subpatterns can be particularly useful for match-
1604 ing at the ends of strings; an example is given at the end
1605 of the section on once-only subpatterns.
1607 Several assertions (of any sort) may occur in succession.
1608 For example,
1610 (?<=\d{3})(?<!999)foo
1612 matches "foo" preceded by three digits that are not "999".
1613 Notice that each of the assertions is applied independently
1614 at the same point in the subject string. First there is a
1615 check that the previous three characters are all digits, and
1616 then there is a check that the same three characters are not
1617 "999". This pattern does not match "foo" preceded by six
1618 characters, the first three of which are digits and the last three
1619 of which are not "999". For example, it doesn't match
1620 "123abcfoo". A pattern to do that is
1622 (?<=\d{3}...)(?<!999)foo
1624 This time the first assertion looks at the preceding six
1625 characters, checking that the first three are digits, and
1626 then the second assertion checks that the preceding three
1627 characters are not "999".
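
The difference between the two patterns can be seen in a small
sketch. Python's re module stands in for PCRE here (an assumption
of this example), and the subject strings are illustrative.

```python
import re

# Both assertions look at the same three preceding characters.
three_before = re.compile(r'(?<=\d{3})(?<!999)foo')

# The first assertion looks back six characters, the second only three.
six_before = re.compile(r'(?<=\d{3}...)(?<!999)foo')

for subject in ['123foo', '999foo', '123abcfoo', '123999foo']:
    print(subject,
          bool(three_before.search(subject)),
          bool(six_before.search(subject)))
```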
1629 Assertions can be nested in any combination. For example,
1631 (?<=(?<!foo)bar)baz
1633 matches an occurrence of "baz" that is preceded by "bar"
1634 which in turn is not preceded by "foo", while
1636 (?<=\d{3}(?!999)...)foo
1638 is another pattern which matches "foo" preceded by three
1639 digits and any three characters that are not "999".
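
Nested assertions can be exercised in the same way. This sketch
again assumes Python's re module as a stand-in for PCRE.

```python
import re

# "baz" preceded by "bar", where that "bar" is not preceded by "foo".
bar_not_after_foo = re.compile(r'(?<=(?<!foo)bar)baz')

# "foo" preceded by three digits and then three characters
# that are not "999".
digits_then_other = re.compile(r'(?<=\d{3}(?!999)...)foo')

print(bool(bar_not_after_foo.search('xxxbarbaz')),
      bool(bar_not_after_foo.search('foobarbaz')))
print(bool(digits_then_other.search('123abcfoo')),
      bool(digits_then_other.search('123999foo')))
```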
1641 Assertion subpatterns are not capturing subpatterns, and may
1642 not be repeated, because it makes no sense to assert the
1643 same thing several times. If any kind of assertion contains
1644 capturing subpatterns within it, these are counted for the
1645 purposes of numbering the capturing subpatterns in the whole
1646 pattern. However, substring capturing is carried out only
1647 for positive assertions, because it does not make sense for
1648 negative assertions.
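
A minimal sketch of capturing inside assertions, assuming Python's
re module as a stand-in for PCRE:

```python
import re

# A positive assertion sets its captured substring.
positive = re.match(r'(?=(\d+))\w+', '123abc')

# A negative assertion succeeds only when its contents fail to
# match, so the group inside it stays unset.
negative = re.match(r'(?!(\d+))\w+', 'abc123')

# Groups inside assertions still take part in the numbering.
numbered = re.compile(r'(?=(a))(b)?')

print(positive.group(1))
print(negative.group(1))
print(numbered.groups)
```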
1650 Assertions count towards the maximum of 200 parenthesized
1651 subpatterns.
1656 With both maximizing and minimizing repetition, failure of
1657 what follows normally causes the repeated item to be re-
1658 evaluated to see if a different number of repeats allows the
1659 rest of the pattern to match. Sometimes it is useful to
1660 prevent this, either to change the nature of the match, or
1661 to cause it to fail earlier than it otherwise might, when the
1662 author of the pattern knows there is no point in carrying
1663 on.
1665 Consider, for example, the pattern \d+foo when applied to
1666 the subject line
1668 123456bar
1670 After matching all 6 digits and then failing to match "foo",
1671 the normal action of the matcher is to try again with only 5
1672 digits matching the \d+ item, and then with 4, and so on,
1673 before ultimately failing. Once-only subpatterns provide the
1674 means for specifying that once a portion of the pattern has
1675 matched, it is not to be re-evaluated in this way, so the
1676 matcher would give up immediately on failing to match "foo"
1677 the first time. The notation is another kind of special
1678 parenthesis, starting with (?> as in this example:
1680 (?>\d+)foo
1682 This kind of parenthesis "locks up" the part of the pattern
1683 it contains once it has matched, and a failure further into
1684 the pattern is prevented from backtracking into it. Back-
1685 tracking past it to previous items, however, works as nor-
1686 mal.
1688 An alternative description is that a subpattern of this type
1689 matches the string of characters that an identical stan-
1690 dalone pattern would match, if anchored at the current point
1691 in the subject string.
1693 Once-only subpatterns are not capturing subpatterns. Simple
1694 cases such as the above example can be thought of as a max-
1695 imizing repeat that must swallow everything it can. So,
1696 while both \d+ and \d+? are prepared to adjust the number of
1697 digits they match in order to make the rest of the pattern
1698 match, (?>\d+) can only match an entire sequence of digits.
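
The change in matching behaviour can be shown with a small
sketch. It assumes Python's re module as a stand-in for PCRE.
Python accepts (?>...) directly only from version 3.11, so this
example uses a common emulation instead: a lookahead captures the
longest run of digits, and a back reference then consumes exactly
that text. Because nothing can backtrack into the lookahead, the
digits are locked in the same way.

```python
import re

# Ordinary maximizing repeat: \d+ gives back one digit so "5" can match.
greedy = re.compile(r'\d+5')

# Emulated once-only group, equivalent in effect to (?>\d+)5.
# The non-capturing group keeps \1 apart from the literal "5".
once_only = re.compile(r'(?:(?=(\d+))\1)5')

print(bool(greedy.search('12345')))
print(bool(once_only.search('12345')))
```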
1700 This construction can of course contain arbitrarily compli-
1701 cated subpatterns, and it can be nested.
1703 Once-only subpatterns can be used in conjunction with look-
1704 behind assertions to specify efficient matching at the end
1705 of the subject string. Consider a simple pattern such as
1707 abcd$
1709 when applied to a long string which does not match. Because
1710 matching proceeds from left to right, PCRE will look for
1711 each "a" in the subject and then see if what follows matches
1712 the rest of the pattern. If the pattern is specified as
1714 ^.*abcd$
1716 the initial .* matches the entire string at first, but when
1717 this fails (because there is no following "a"), it back-
1718 tracks to match all but the last character, then all but the
1719 last two characters, and so on. Once again the search for
1720 "a" covers the entire string, from right to left, so we are
1721 no better off. However, if the pattern is written as
1723 ^(?>.*)(?<=abcd)
1725 there can be no backtracking for the .* item; it can match
1726 only the entire string. The subsequent lookbehind assertion
1727 does a single test on the last four characters. If it fails,
1728 the match fails immediately. For long strings, this approach
1729 makes a significant difference to the processing time.
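
A sketch of the end-of-string test, assuming Python's re module
as a stand-in for PCRE. The once-only group is emulated with a
lookahead and a back reference, which is an assumption of this
example and not PCRE syntax.

```python
import re

# Equivalent in effect to ^(?>.*)(?<=abcd): swallow the whole
# line once, then test the last four characters a single time.
ends_abcd = re.compile(r'^(?:(?=(.*))\1)(?<=abcd)')

print(bool(ends_abcd.search('xxxxabcd')))
print(bool(ends_abcd.search('abcdxxxx')))
```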
1731 When a pattern contains an unlimited repeat inside a subpat-
1732 tern that can itself be repeated an unlimited number of
1733 times, the use of a once-only subpattern is the only way to
1734 avoid some failing matches taking a very long time indeed.
1735 The pattern
1737 (\D+|<\d+>)*[!?]
1739 matches an unlimited number of substrings that either con-
1740 sist of non-digits, or digits enclosed in <>, followed by
1741 either ! or ?. When it matches, it runs quickly. However, if
1742 it is applied to
1744 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1746 it takes a long time before reporting failure. This is
1747 because the string can be divided between the two repeats in
1748 a large number of ways, and all have to be tried. (The exam-
1749 ple used [!?] rather than a single character at the end,
1750 because both PCRE and Perl have an optimization that allows
1751 for fast failure when a single character is used. They
1752 remember the last single character that is required for a
1753 match, and fail early if it is not present in the string.)
1754 If the pattern is changed to
1756 ((?>\D+)|<\d+>)*[!?]
1758 sequences of non-digits cannot be broken, and failure hap-
1759 pens quickly.
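
The protected form can be tried on the subject above. This sketch
assumes Python's re module as a stand-in for PCRE and emulates the
once-only group (?>\D+) with a lookahead and a back reference.
The unprotected pattern is deliberately not run here, because its
running time grows very rapidly with the length of the subject.

```python
import re

# Equivalent in effect to ((?>\D+)|<\d+>)*[!?]
protected = re.compile(r'(?:(?=(\D+))\1|<\d+>)*[!?]')

subject = 'a' * 52
result = protected.search(subject)      # fails quickly
terminated = protected.search('<12>!')  # digits in <> followed by "!"

print(result)
print(terminated.group(0))
```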
1764 It is possible to cause the matching process to obey a sub-
1765 pattern conditionally or to choose between two alternative
1766 subpatterns, depending on the result of an assertion, or
1767 whether a previous capturing subpattern matched or not. The
1768 two possible forms of conditional subpattern are
1770 (?(condition)yes-pattern)
1771 (?(condition)yes-pattern|no-pattern)
1773 If the condition is satisfied, the yes-pattern is used; oth-
1774 erwise the no-pattern (if present) is used. If there are
1775 more than two alternatives in the subpattern, a compile-time
1776 error occurs.
1778 There are two kinds of condition. If the text between the
1779 parentheses consists of a sequence of digits, the condition
1780 is satisfied if the capturing subpattern of that number has
1781 previously matched. Consider the following pattern, which
1782 contains non-significant white space to make it more read-
1783 able (assume the PCRE_EXTENDED option) and to divide it into
1784 three parts for ease of discussion:
1786 ( \( )? [^()]+ (?(1) \) )
1788 The first part matches an optional opening parenthesis, and
1789 if that character is present, sets it as the first captured
1790 substring. The second part matches one or more characters
1791 that are not parentheses. The third part is a conditional
1792 subpattern that tests whether the first set of parentheses
1793 matched or not. If they did, that is, if the subject started
1794 with an opening parenthesis, the condition is true, and so
1795 the yes-pattern is executed and a closing parenthesis is
1796 required. Otherwise, since no-pattern is not present, the
1797 subpattern matches nothing. In other words, this pattern
1798 matches a sequence of non-parentheses, optionally enclosed
1799 in parentheses.
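
The pattern can be tried as written. This sketch assumes Python's
re module as a stand-in for PCRE; its VERBOSE flag plays the part
of PCRE_EXTENDED, and fullmatch is used so that the whole subject
must be consumed.

```python
import re

optional_parens = re.compile(r'( \( )? [^()]+ (?(1) \) )', re.VERBOSE)

results = {s: bool(optional_parens.fullmatch(s))
           for s in ['(abc)', 'abc', '(abc', 'abc)']}
print(results)
```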
1801 If the condition is not a sequence of digits, it must be an
1802 assertion. This may be a positive or negative lookahead or
1803 lookbehind assertion. Consider this pattern, again contain-
1804 ing non-significant white space, and with the two alterna-
1805 tives on the second line:
1807 (?(?=[^a-z]*[a-z])
1808 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1810 The condition is a positive lookahead assertion that matches
1811 an optional sequence of non-letters followed by a letter. In
1812 other words, it tests for the presence of at least one
1813 letter in the subject. If a letter is found, the subject is
1814 matched against the first alternative; otherwise it is
1815 matched against the second. This pattern matches strings in
1816 one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
1817 letters and dd are digits.
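
Python's re module, used in this sketch as a stand-in for PCRE,
does not accept an assertion as a condition. The sketch therefore
rewrites the conditional as an alternation in which the first
branch is guarded by the lookahead and the second by its
negation. This rewriting is an assumption of the example; it
selects the same branch as the conditional form above.

```python
import re

date_like = re.compile(
    r'''(?: (?=[^a-z]*[a-z])  \d{2}-[a-z]{3}-\d{2}
          | (?![^a-z]*[a-z])  \d{2}-\d{2}-\d{2} )''',
    re.VERBOSE)

checks = {s: bool(date_like.fullmatch(s))
          for s in ['12-abc-34', '12-34-56', '12-ab-34']}
print(checks)
```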
1822 The sequence (?# marks the start of a comment which contin-
1823 ues up to the next closing parenthesis. Nested parentheses
1824 are not permitted. The characters that make up a comment
1825 play no part in the pattern matching at all.
1827 If the PCRE_EXTENDED option is set, an unescaped # character
1828 outside a character class introduces a comment that contin-
1829 ues up to the next newline character in the pattern.
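
Both kinds of comment are shown in the sketch below, which
assumes Python's re module as a stand-in for PCRE, with its
VERBOSE flag in place of PCRE_EXTENDED.

```python
import re

# An inline comment runs up to the next closing parenthesis.
inline = re.compile(r'ab(?# this text is ignored)c')

# Under the extended option, # starts a comment that runs to the
# end of the line, except inside a character class.
extended = re.compile(r'''
    \d+   # one or more digits
    [#]   # inside a character class, # is an ordinary character
    \w+   # one or more word characters
''', re.VERBOSE)

print(bool(inline.fullmatch('abc')))
print(bool(extended.fullmatch('12#ab')))
```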
1834 Consider the problem of matching a string in parentheses,
1835 allowing for unlimited nested parentheses. Without the use
1836 of recursion, the best that can be done is to use a pattern
1837 that matches up to some fixed depth of nesting. It is not
1838 possible to handle an arbitrary nesting depth. Perl 5.6 has
1839 provided an experimental facility that allows regular
1840 expressions to recurse (amongst other things). It does this
1841 by interpolating Perl code in the expression at run time,
1842 and the code can refer to the expression itself. A Perl pat-
1843 tern to solve the parentheses problem can be created like
1844 this:
1846 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1848 The (?p{...}) item interpolates Perl code at run time, and
1849 in this case refers recursively to the pattern in which it
1850 appears. Obviously, PCRE cannot support the interpolation of
1851 Perl code. Instead, the special item (?R) is provided for
1852 the specific case of recursion. This PCRE pattern solves the
1853 parentheses problem (assume the PCRE_EXTENDED option is set
1854 so that white space is ignored):
1856 \( ( (?>[^()]+) | (?R) )* \)
1858 First it matches an opening parenthesis. Then it matches any
1859 number of substrings which can either be a sequence of non-
1860 parentheses, or a recursive match of the pattern itself
1861 (i.e. a correctly parenthesized substring). Finally there is
1862 a closing parenthesis.
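
Python's standard re module has no (?R) item, so the sketch below
does not run the pattern itself. Instead it is a hand-written
recursive function, with the hypothetical name match_parens, that
follows the three steps just described. It ignores capturing and
only tests for a match at a given starting index.

```python
def match_parens(s, i=0):
    # Return the index just past a correctly parenthesized
    # substring starting at s[i], or -1 if there is none.
    if i >= len(s) or s[i] != "(":
        return -1                      # the opening \( is required
    i += 1
    while i < len(s) and s[i] != ")":
        if s[i] == "(":
            i = match_parens(s, i)     # the (?R) alternative
            if i == -1:
                return -1
        else:
            # the (?>[^()]+) alternative: swallow a whole run
            while i < len(s) and s[i] not in "()":
                i += 1
    return i + 1 if i < len(s) else -1  # the closing \) is required

print(match_parens("(ab(cd)ef)"))
print(match_parens("(aaa()"))
```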
1864 This particular example pattern contains nested unlimited
1865 repeats, and so the use of a once-only subpattern for match-
1866 ing strings of non-parentheses is important when applying
1867 the pattern to strings that do not match. For example, when
1868 it is applied to
1870 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1872 it yields "no match" quickly. However, if a once-only sub-
1873 pattern is not used, the match runs for a very long time
1874 indeed because there are so many different ways the + and *
1875 repeats can carve up the subject, and all have to be tested
1876 before failure can be reported.
1878 The values set for any capturing subpatterns are those from
1879 the outermost level of the recursion at which the subpattern
1880 value is set. If the pattern above is matched against
1882 (ab(cd)ef)
1884 the value for the capturing parentheses is "ef", which is
1885 the last value taken on at the top level. If additional
1886 parentheses are added, giving
1888 \( ( ( (?>[^()]+) | (?R) )* ) \)
1889    ^                        ^
1890    ^                        ^ the string they capture is
1891 "ab(cd)ef", the contents of the top level parentheses. If
1892 there are more than 15 capturing parentheses in a pattern,
1893 PCRE has to obtain extra memory to store data during a
1894 recursion, which it does by using pcre_malloc, freeing it
1895 via pcre_free afterwards. If no memory can be obtained, it
1896 saves data for the first 15 capturing parentheses only, as
1897 there is no way to give an out-of-memory error from within a
1898 recursion.
1903 Certain items that may appear in patterns are more efficient
1904 than others. It is more efficient to use a character class
1905 like [aeiou] than a set of alternatives such as (a|e|i|o|u).
1906 In general, the simplest construction that provides the
1907 required behaviour is usually the most efficient. Jeffrey
1908 Friedl's book contains a lot of discussion about optimizing
1909 regular expressions for efficient performance.
1911 When a pattern begins with .* and the PCRE_DOTALL option is
1912 set, the pattern is implicitly anchored by PCRE, since it
1913 can match only at the start of a subject string. However, if
1914 PCRE_DOTALL is not set, PCRE cannot make this optimization,
1915 because the . metacharacter does not then match a newline,
1916 and if the subject string contains newlines, the pattern may
1917 match from the character immediately following one of them
1918 instead of from the very start. For example, the pattern
1920 (.*) second
1922 matches the subject "first\nand second" (where \n stands for
1923 a newline character) with the first captured substring being
1924 "and". In order to do this, PCRE has to retry the match
1925 starting after every newline in the subject.
1927 If you are using such a pattern with subject strings that do
1928 not contain newlines, the best performance is obtained by
1929 setting PCRE_DOTALL, or starting the pattern with ^.* to
1930 indicate explicit anchoring. That saves PCRE from having to
1931 scan along the subject looking for a newline to restart at.
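
The effect of the dot-matches-newline setting on this pattern can
be seen in a short sketch. It assumes Python's re module as a
stand-in for PCRE, with re.DOTALL in place of PCRE_DOTALL.

```python
import re

subject = 'first\nand second'

# Without DOTALL the match has to start after the newline.
plain = re.search(r'(.*) second', subject)

# With DOTALL the .* can cross the newline, so the match starts
# at the beginning of the subject.
dotall = re.search(r'(.*) second', subject, re.DOTALL)

# Explicit anchoring rules out a start after the newline.
anchored = re.search(r'^(.*) second', subject)

print(repr(plain.group(1)))
print(repr(dotall.group(1)))
print(anchored)
```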
1933 Beware of patterns that contain nested indefinite repeats.
1934 These can take a long time to run when applied to a string
1935 that does not match. Consider the pattern fragment
1937 (a+)*
1939 This can match "aaaa" in 33 different ways, and this number
1940 increases very rapidly as the string gets longer. (The *
1941 repeat can match 0, 1, 2, 3, or 4 times, and for each of
1942 those cases other than 0, the + repeats can match different
1943 numbers of times.) When the remainder of the pattern is such
1944 that the entire match is going to fail, PCRE has in princi-
1945 ple to try every possible variation, and this can take an
1946 extremely long time.
1948 An optimization catches some of the more simple cases such
1949 as
1951 (a+)*b
1953 where a literal character follows. Before embarking on the
1954 standard matching procedure, PCRE checks that there is a "b"
1955 later in the subject string, and if there is not, it fails
1956 the match immediately. However, when there is no following
1957 literal this optimization cannot be used. You can see the
1958 difference by comparing the behaviour of
1960 (a+)*\d
1962 with the pattern above. The pattern above, which ends in "b",
1963 fails almost instantly when applied to a whole line of "a"
1964 characters, whereas the one ending in \d takes an appreciable
1965 time with strings longer than about 20 characters.
1970 Philip Hazel <ph10@cam.ac.uk>
1971 University Computing Service,
1972 New Museums Site,
1973 Cambridge CB2 3QG, England.
1974 Phone: +44 1223 334714
1976 Last updated: 27 January 2000
1977 Copyright (c) 1997-2000 University of Cambridge.
