Contents of /code/trunk/doc/html/pcrepattern.html

Revision 91 - (show annotations)
Sat Feb 24 21:41:34 2007 UTC (14 years ago) by nigel
File MIME type: text/html
File size: 71296 byte(s)
Load pcre-6.7 into code/trunk.
1 <html>
2 <head>
3 <title>pcrepattern specification</title>
4 </head>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcrepattern man page</h1>
7 <p>
8 Return to the <a href="index.html">PCRE index page</a>.
9 </p>
10 <p>
11 This page is part of the PCRE HTML documentation. It was generated automatically
12 from the original man page. If there is any nonsense in it, please consult the
13 man page, in case the conversion went wrong.
14 <br>
15 <ul>
16 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION DETAILS</a>
17 <li><a name="TOC2" href="#SEC2">BACKSLASH</a>
18 <li><a name="TOC3" href="#SEC3">CIRCUMFLEX AND DOLLAR</a>
19 <li><a name="TOC4" href="#SEC4">FULL STOP (PERIOD, DOT)</a>
20 <li><a name="TOC5" href="#SEC5">MATCHING A SINGLE BYTE</a>
21 <li><a name="TOC6" href="#SEC6">SQUARE BRACKETS AND CHARACTER CLASSES</a>
22 <li><a name="TOC7" href="#SEC7">POSIX CHARACTER CLASSES</a>
23 <li><a name="TOC8" href="#SEC8">VERTICAL BAR</a>
24 <li><a name="TOC9" href="#SEC9">INTERNAL OPTION SETTING</a>
25 <li><a name="TOC10" href="#SEC10">SUBPATTERNS</a>
26 <li><a name="TOC11" href="#SEC11">NAMED SUBPATTERNS</a>
27 <li><a name="TOC12" href="#SEC12">REPETITION</a>
28 <li><a name="TOC13" href="#SEC13">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a>
29 <li><a name="TOC14" href="#SEC14">BACK REFERENCES</a>
30 <li><a name="TOC15" href="#SEC15">ASSERTIONS</a>
31 <li><a name="TOC16" href="#SEC16">CONDITIONAL SUBPATTERNS</a>
32 <li><a name="TOC17" href="#SEC17">COMMENTS</a>
33 <li><a name="TOC18" href="#SEC18">RECURSIVE PATTERNS</a>
34 <li><a name="TOC19" href="#SEC19">SUBPATTERNS AS SUBROUTINES</a>
35 <li><a name="TOC20" href="#SEC20">CALLOUTS</a>
36 </ul>
37 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION DETAILS</a><br>
38 <P>
39 The syntax and semantics of the regular expressions supported by PCRE are
40 described below. Regular expressions are also described in the Perl
41 documentation and in a number of books, some of which have copious examples.
42 Jeffrey Friedl's "Mastering Regular Expressions", published by O'Reilly, covers
43 regular expressions in great detail. This description of PCRE's regular
44 expressions is intended as reference material.
45 </P>
46 <P>
47 The original operation of PCRE was on strings of one-byte characters. However,
48 there is now also support for UTF-8 character strings. To use this, you must
49 build PCRE to include UTF-8 support, and then call <b>pcre_compile()</b> with
50 the PCRE_UTF8 option. How this affects pattern matching is mentioned in several
51 places below. There is also a summary of UTF-8 features in the
52 <a href="pcre.html#utf8support">section on UTF-8 support</a>
53 in the main
54 <a href="pcre.html"><b>pcre</b></a>
55 page.
56 </P>
57 <P>
58 The remainder of this document discusses the patterns that are supported by
59 PCRE when its main matching function, <b>pcre_exec()</b>, is used.
60 From release 6.0, PCRE offers a second matching function,
61 <b>pcre_dfa_exec()</b>, which matches using a different algorithm that is not
62 Perl-compatible. The advantages and disadvantages of the alternative function,
63 and how it differs from the normal function, are discussed in the
64 <a href="pcrematching.html"><b>pcrematching</b></a>
65 page.
66 </P>
67 <P>
68 A regular expression is a pattern that is matched against a subject string from
69 left to right. Most characters stand for themselves in a pattern, and match the
70 corresponding characters in the subject. As a trivial example, the pattern
71 <pre>
72 The quick brown fox
73 </pre>
74 matches a portion of a subject string that is identical to itself. When
75 caseless matching is specified (the PCRE_CASELESS option), letters are matched
76 independently of case. In UTF-8 mode, PCRE always understands the concept of
77 case for characters whose values are less than 128, so caseless matching is
78 always possible. For characters with higher values, the concept of case is
79 supported if PCRE is compiled with Unicode property support, but not otherwise.
80 If you want to use caseless matching for characters 128 and above, you must
81 ensure that PCRE is compiled with Unicode property support as well as with
82 UTF-8 support.
83 </P>
84 <P>
85 The power of regular expressions comes from the ability to include alternatives
86 and repetitions in the pattern. These are encoded in the pattern by the use of
87 <i>metacharacters</i>, which do not stand for themselves but instead are
88 interpreted in some special way.
89 </P>
90 <P>
91 There are two different sets of metacharacters: those that are recognized
92 anywhere in the pattern except within square brackets, and those that are
93 recognized in square brackets. Outside square brackets, the metacharacters are
94 as follows:
95 <pre>
96 \ general escape character with several uses
97 ^ assert start of string (or line, in multiline mode)
98 $ assert end of string (or line, in multiline mode)
99 . match any character except newline (by default)
100 [ start character class definition
101 | start of alternative branch
102 ( start subpattern
103 ) end subpattern
104 ? extends the meaning of (
105 also 0 or 1 quantifier
106 also quantifier minimizer
107 * 0 or more quantifier
108 + 1 or more quantifier
109 also "possessive quantifier"
110 { start min/max quantifier
111 </pre>
112 Part of a pattern that is in square brackets is called a "character class". In
113 a character class the only metacharacters are:
114 <pre>
115 \ general escape character
116 ^ negate the class, but only if the first character
117 - indicates character range
118 [ POSIX character class (only if followed by POSIX syntax)
119 ] terminates the character class
120 </pre>
121 The following sections describe the use of each of the metacharacters.
122 </P>
123 <br><a name="SEC2" href="#TOC1">BACKSLASH</a><br>
124 <P>
125 The backslash character has several uses. Firstly, if it is followed by a
126 non-alphanumeric character, it takes away any special meaning that character
127 may have. This use of backslash as an escape character applies both inside and
128 outside character classes.
129 </P>
130 <P>
131 For example, if you want to match a * character, you write \* in the pattern.
132 This escaping action applies whether or not the following character would
133 otherwise be interpreted as a metacharacter, so it is always safe to precede a
134 non-alphanumeric with backslash to specify that it stands for itself. In
135 particular, if you want to match a backslash, you write \\.
136 </P>
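<P>
The escaping rule can be tried out with a short sketch. It uses Python's
<b>re</b> module as a stand-in for PCRE (an assumption of this example, not
part of PCRE); the two are expected to agree on these simple escapes.
</P>

```python
import re

# Python's re module stands in for PCRE here (an assumption of this sketch).
# A backslash before a non-alphanumeric character makes that character literal.
star = re.compile(r"a\*b")        # \* matches a literal asterisk
backslash = re.compile(r"c\\d")   # \\ matches one literal backslash
harmless = re.compile(r"x\-y")    # escaping a non-metacharacter does no harm

matches_star = star.search("2 a*b 3") is not None
matches_backslash = backslash.search("c\\d") is not None
matches_harmless = harmless.search("x-y") is not None
print(matches_star, matches_backslash, matches_harmless)
```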
137 <P>
138 If a pattern is compiled with the PCRE_EXTENDED option, whitespace in the
139 pattern (other than in a character class) and characters between a # outside
140 a character class and the next newline are ignored. An escaping backslash can
141 be used to include a whitespace or # character as part of the pattern.
142 </P>
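<P>
As an illustration only, Python's <b>re.VERBOSE</b> flag behaves much like
PCRE_EXTENDED for the cases described above. The sketch below assumes that
similarity; it does not call PCRE.
</P>

```python
import re

# re.VERBOSE is used as a stand-in for PCRE_EXTENDED (an assumption).
pattern = re.compile(r"""
    \d+     # one or more digits; this comment and the spaces are ignored
    \ \#    # an escaped space, then an escaped '#', both literal
    [ #]    # inside a class, space and '#' are ordinary characters
    x
""", re.VERBOSE)

found = pattern.search("order 42 # x")
print(found.group(0) if found else None)
```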
143 <P>
144 If you want to remove the special meaning from a sequence of characters, you
145 can do so by putting them between \Q and \E. This is different from Perl in
146 that $ and @ are handled as literals in \Q...\E sequences in PCRE, whereas in
147 Perl, $ and @ cause variable interpolation. Note the following examples:
148 <pre>
149   Pattern            PCRE matches   Perl matches
151   \Qabc$xyz\E        abc$xyz        abc followed by the contents of $xyz
152   \Qabc\$xyz\E       abc\$xyz       abc\$xyz
153   \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
154 </pre>
155 The \Q...\E sequence is recognized both inside and outside character classes.
156 <a name="digitsafterbackslash"></a></P>
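<P>
Python's <b>re</b> module has no \Q...\E, so the sketch below expands such
spans by hand to reproduce the "PCRE matches" column. The helper name
<b>expand_quoted</b> is hypothetical, and quoting to the end of the pattern
when \E is missing is an assumption of this sketch.
</P>

```python
import re

def expand_quoted(pattern: str) -> str:
    # Hypothetical helper: rewrite \Q...\E spans as escaped literals.
    # Limitation: it does not recognise an escaped backslash before a Q.
    out = []
    pos = 0
    while True:
        start = pattern.find(r"\Q", pos)
        if start == -1:
            out.append(pattern[pos:])          # no more quoted spans
            break
        out.append(pattern[pos:start])         # text before \Q is kept as-is
        end = pattern.find(r"\E", start + 2)
        if end == -1:                          # assumption: quote to the end
            out.append(re.escape(pattern[start + 2:]))
            break
        out.append(re.escape(pattern[start + 2:end]))
        pos = end + 2
    return "".join(out)

# The three rows of the table, checked against the "PCRE matches" column.
row1 = re.fullmatch(expand_quoted(r"\Qabc$xyz\E"), "abc$xyz")
row2 = re.fullmatch(expand_quoted(r"\Qabc\$xyz\E"), r"abc\$xyz")
row3 = re.fullmatch(expand_quoted(r"\Qabc\E\$\Qxyz\E"), "abc$xyz")
print(bool(row1), bool(row2), bool(row3))
```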
157 <br><b>
158 Non-printing characters
159 </b><br>
160 <P>
161 A second use of backslash provides a way of encoding non-printing characters
162 in patterns in a visible manner. There is no restriction on the appearance of
163 non-printing characters, apart from the binary zero that terminates a pattern,
164 but when a pattern is being prepared by text editing, it is usually easier to
165 use one of the following escape sequences than the binary character it
166 represents:
167 <pre>
168 \a alarm, that is, the BEL character (hex 07)
169 \cx "control-x", where x is any character
170 \e escape (hex 1B)
171 \f formfeed (hex 0C)
172 \n newline (hex 0A)
173 \r carriage return (hex 0D)
174 \t tab (hex 09)
175 \ddd character with octal code ddd, or backreference
176 \xhh character with hex code hh
177 \x{hhh..} character with hex code hhh..
178 </pre>
179 The precise effect of \cx is as follows: if x is a lower case letter, it
180 is converted to upper case. Then bit 6 of the character (hex 40) is inverted.
181 Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; becomes hex
182 7B.
183 </P>
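<P>
The \cx rule is simple arithmetic, and can be checked with a few lines of
Python. The function name <b>control_code</b> is hypothetical; only the
rule itself comes from the text above.
</P>

```python
def control_code(x: str) -> int:
    # Lower case letters are converted to upper case first.
    if "a" <= x <= "z":
        x = x.upper()
    # Then bit 6 (hex 40) of the character code is inverted.
    return ord(x) ^ 0x40

print(hex(control_code("z")), hex(control_code("{")), hex(control_code(";")))
```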
184 <P>
185 After \x, from zero to two hexadecimal digits are read (letters can be in
186 upper or lower case). Any number of hexadecimal digits may appear between \x{
187 and }, but the value of the character code must be less than 256 in non-UTF-8
188 mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value
189 is 7FFFFFFF). If characters other than hexadecimal digits appear between \x{
190 and }, or if there is no terminating }, this form of escape is not recognized.
191 Instead, the initial \x will be interpreted as a basic hexadecimal escape,
192 with no following digits, giving a character whose value is zero.
193 </P>
194 <P>
195 Characters whose value is less than 256 can be defined by either of the two
196 syntaxes for \x. There is no difference in the way they are handled. For
197 example, \xdc is exactly the same as \x{dc}.
198 </P>
199 <P>
200 After \0 up to two further octal digits are read. If there are fewer than two
201 digits, just those that are present are used. Thus the sequence \0\x\07
202 specifies two binary zeros followed by a BEL character (code value 7). Make
203 sure you supply two digits after the initial zero if the pattern character that
204 follows is itself an octal digit.
205 </P>
206 <P>
207 The handling of a backslash followed by a digit other than 0 is complicated.
208 Outside a character class, PCRE reads it and any following digits as a decimal
209 number. If the number is less than 10, or if there have been at least that many
210 previous capturing left parentheses in the expression, the entire sequence is
211 taken as a <i>back reference</i>. A description of how this works is given
212 <a href="#backreferences">later,</a>
213 following the discussion of
214 <a href="#subpattern">parenthesized subpatterns.</a>
215 </P>
216 <P>
217 Inside a character class, or if the decimal number is greater than 9 and there
218 have not been that many capturing subpatterns, PCRE re-reads up to three octal
219 digits following the backslash, and uses them to generate a data character. Any
220 subsequent digits stand for themselves. In non-UTF-8 mode, the value of a
221 character specified in octal must be less than \400. In UTF-8 mode, values up
222 to \777 are permitted. For example:
223 <pre>
224 \040 is another way of writing a space
225 \40 is the same, provided there are fewer than 40 previous capturing subpatterns
226 \7 is always a back reference
227 \11 might be a back reference, or another way of writing a tab
228 \011 is always a tab
229 \0113 is a tab followed by the character "3"
230 \113 might be a back reference, otherwise the character with octal code 113
231 \377 might be a back reference, otherwise the byte consisting entirely of 1 bits
232 \81 is either a back reference, or a binary zero followed by the two characters "8" and "1"
233 </pre>
234 Note that octal values of 100 or greater must not be introduced by a leading
235 zero, because no more than three octal digits are ever read.
236 </P>
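<P>
The decision rule for a backslash followed by digits can be sketched as a
small function. This is a reading of the rules above, not PCRE's actual code:
the name <b>interpret_digits</b> and its return format are hypothetical, and
the upper limits on octal values (\400 and \777) are not enforced.
</P>

```python
OCTAL = "01234567"

def interpret_digits(digits: str, groups: int, in_class: bool = False):
    # digits: the run of decimal digits after the backslash, e.g. "0113"
    # groups: number of capturing left parentheses seen so far
    # Returns ("backref", n) or ("char", code, trailing_literal_digits).
    if digits[0] == "0":
        # \0 followed by up to two further octal digits
        n = 1
        while n < 3 and n < len(digits) and digits[n] in OCTAL:
            n += 1
        return ("char", int(digits[:n], 8), digits[n:])
    if not in_class:
        number = int(digits)
        if number < 10 or number <= groups:
            return ("backref", number)
    # Otherwise re-read up to three octal digits; the rest are literal.
    n = 0
    while n < 3 and n < len(digits) and digits[n] in OCTAL:
        n += 1
    code = int(digits[:n], 8) if n else 0
    return ("char", code, digits[n:])

print(interpret_digits("40", 0))    # a space when there are no groups
print(interpret_digits("11", 11))   # a back reference with 11 groups
print(interpret_digits("81", 0))    # binary zero, then literal "81"
```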
237 <P>
238 All the sequences that define a single character value can be used both inside
239 and outside character classes. In addition, inside a character class, the
240 sequence \b is interpreted as the backspace character (hex 08), and the
241 sequence \X is interpreted as the character "X". Outside a character class,
242 these sequences have different meanings
243 <a href="#uniextseq">(see below).</a>
244 </P>
245 <br><b>
246 Generic character types
247 </b><br>
248 <P>
249 The third use of backslash is for specifying generic character types. The
250 following are always recognized:
251 <pre>
252 \d any decimal digit
253 \D any character that is not a decimal digit
254 \s any whitespace character
255 \S any character that is not a whitespace character
256 \w any "word" character
257 \W any "non-word" character
258 </pre>
259 Each pair of escape sequences partitions the complete set of characters into
260 two disjoint sets. Any given character matches one, and only one, of each pair.
261 </P>
262 <P>
263 These character type sequences can appear both inside and outside character
264 classes. They each match one character of the appropriate type. If the current
265 matching point is at the end of the subject string, all of them fail, since
266 there is no character to match.
267 </P>
268 <P>
269 For compatibility with Perl, \s does not match the VT character (code 11).
270 This makes it different from the POSIX "space" class. The \s characters
271 are HT (9), LF (10), FF (12), CR (13), and space (32). (If "use locale;" is
272 included in a Perl script, \s may match the VT character. In PCRE, it never
273 does.)
274 </P>
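<P>
When porting patterns, note that not every engine shares this definition. In
Python's <b>re</b> module, for instance, \s does match VT, so an explicit
class is needed to reproduce the PCRE set listed above. The sketch below
shows the difference; it uses Python only and does not call PCRE.
</P>

```python
import re

VT = "\x0b"  # the VT character, code 11

# The five characters that PCRE's \s matches: HT, LF, FF, CR and space.
pcre_style_space = re.compile(r"[\t\n\f\r ]")
python_space = re.compile(r"\s")

explicit_matches_vt = pcre_style_space.fullmatch(VT) is not None
python_matches_vt = python_space.fullmatch(VT) is not None
print(explicit_matches_vt, python_matches_vt)
```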
275 <P>
276 A "word" character is an underscore or any character less than 256 that is a
277 letter or digit. The definition of letters and digits is controlled by PCRE's
278 low-valued character tables, and may vary if locale-specific matching is taking
279 place (see
280 <a href="pcreapi.html#localesupport">"Locale support"</a>
281 in the
282 <a href="pcreapi.html"><b>pcreapi</b></a>
283 page). For example, in the "fr_FR" (French) locale, some character codes
284 greater than 128 are used for accented letters, and these are matched by \w.
285 </P>
286 <P>
287 In UTF-8 mode, characters with values greater than 128 never match \d, \s, or
288 \w, and always match \D, \S, and \W. This is true even when Unicode
289 character property support is available. The use of locales with Unicode is
290 discouraged.
291 <a name="uniextseq"></a></P>
292 <br><b>
293 Unicode character properties
294 </b><br>
295 <P>
296 When PCRE is built with Unicode character property support, three additional
297 escape sequences to match character properties are available when UTF-8 mode
298 is selected. They are:
299 <pre>
300 \p{<i>xx</i>} a character with the <i>xx</i> property
301 \P{<i>xx</i>} a character without the <i>xx</i> property
302 \X an extended Unicode sequence
303 </pre>
304 The property names represented by <i>xx</i> above are limited to the Unicode
305 script names, the general category properties, and "Any", which matches any
306 character (including newline). Other properties such as "InMusicalSymbols" are
307 not currently supported by PCRE. Note that \P{Any} does not match any
308 characters, so always causes a match failure.
309 </P>
310 <P>
311 Sets of Unicode characters are defined as belonging to certain scripts. A
312 character from one of these sets can be matched using a script name. For
313 example:
314 <pre>
315 \p{Greek}
316 \P{Han}
317 </pre>
318 Those that are not part of an identified script are lumped together as
319 "Common". The current list of scripts is:
320 </P>
321 <P>
322 Arabic,
323 Armenian,
324 Bengali,
325 Bopomofo,
326 Braille,
327 Buginese,
328 Buhid,
329 Canadian_Aboriginal,
330 Cherokee,
331 Common,
332 Coptic,
333 Cypriot,
334 Cyrillic,
335 Deseret,
336 Devanagari,
337 Ethiopic,
338 Georgian,
339 Glagolitic,
340 Gothic,
341 Greek,
342 Gujarati,
343 Gurmukhi,
344 Han,
345 Hangul,
346 Hanunoo,
347 Hebrew,
348 Hiragana,
349 Inherited,
350 Kannada,
351 Katakana,
352 Kharoshthi,
353 Khmer,
354 Lao,
355 Latin,
356 Limbu,
357 Linear_B,
358 Malayalam,
359 Mongolian,
360 Myanmar,
361 New_Tai_Lue,
362 Ogham,
363 Old_Italic,
364 Old_Persian,
365 Oriya,
366 Osmanya,
367 Runic,
368 Shavian,
369 Sinhala,
370 Syloti_Nagri,
371 Syriac,
372 Tagalog,
373 Tagbanwa,
374 Tai_Le,
375 Tamil,
376 Telugu,
377 Thaana,
378 Thai,
379 Tibetan,
380 Tifinagh,
381 Ugaritic,
382 Yi.
383 </P>
384 <P>
385 Each character has exactly one general category property, specified by a
386 two-letter abbreviation. For compatibility with Perl, negation can be specified
387 by including a circumflex between the opening brace and the property name. For
388 example, \p{^Lu} is the same as \P{Lu}.
389 </P>
390 <P>
391 If only one letter is specified with \p or \P, it includes all the general
392 category properties that start with that letter. In this case, in the absence
393 of negation, the curly brackets in the escape sequence are optional; these two
394 examples have the same effect:
395 <pre>
396 \p{L}
397 \pL
398 </pre>
399 The following general category property codes are supported:
400 <pre>
401 C Other
402 Cc Control
403 Cf Format
404 Cn Unassigned
405 Co Private use
406 Cs Surrogate
408 L Letter
409 Ll Lower case letter
410 Lm Modifier letter
411 Lo Other letter
412 Lt Title case letter
413 Lu Upper case letter
415 M Mark
416 Mc Spacing mark
417 Me Enclosing mark
418 Mn Non-spacing mark
420 N Number
421 Nd Decimal number
422 Nl Letter number
423 No Other number
425 P Punctuation
426 Pc Connector punctuation
427 Pd Dash punctuation
428 Pe Close punctuation
429 Pf Final punctuation
430 Pi Initial punctuation
431 Po Other punctuation
432 Ps Open punctuation
434 S Symbol
435 Sc Currency symbol
436 Sk Modifier symbol
437 Sm Mathematical symbol
438 So Other symbol
440 Z Separator
441 Zl Line separator
442 Zp Paragraph separator
443 Zs Space separator
444 </pre>
445 The special property L& is also supported: it matches a character that has
446 the Lu, Ll, or Lt property, in other words, a letter that is not classified as
447 a modifier or "other".
448 </P>
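<P>
The way one-letter and two-letter property codes relate can be illustrated
with Python's <b>unicodedata</b> module. The helper <b>has_property</b> is
hypothetical, and Python's Unicode tables are assumed to be close enough to
PCRE's for these characters; script names are not handled.
</P>

```python
import unicodedata

def has_property(ch: str, prop: str) -> bool:
    # Every character has exactly one two-letter general category.
    category = unicodedata.category(ch)      # e.g. "Lu", "Nd", "Zs"
    if prop == "L&":
        # The special property: Lu, Ll or Lt only.
        return category in ("Lu", "Ll", "Lt")
    # A one-letter code covers every category that starts with that letter.
    return category.startswith(prop)

print(has_property("A", "Lu"), has_property("a", "L"), has_property("7", "Nd"))
```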
449 <P>
450 The long synonyms for these properties that Perl supports (such as \p{Letter})
451 are not supported by PCRE, nor is it permitted to prefix any of these
452 properties with "Is".
453 </P>
454 <P>
455 No character that is in the Unicode table has the Cn (unassigned) property.
456 Instead, this property is assumed for any code point that is not in the
457 Unicode table.
458 </P>
459 <P>
460 Specifying caseless matching does not affect these escape sequences. For
461 example, \p{Lu} always matches only upper case letters.
462 </P>
463 <P>
464 The \X escape matches any number of Unicode characters that form an extended
465 Unicode sequence. \X is equivalent to
466 <pre>
467 (?&#62;\PM\pM*)
468 </pre>
469 That is, it matches a character without the "mark" property, followed by zero
470 or more characters with the "mark" property, and treats the sequence as an
471 atomic group
472 <a href="#atomicgroup">(see below).</a>
473 Characters with the "mark" property are typically accents that affect the
474 preceding character.
475 </P>
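<P>
A minimal sketch of what \X does at a given position, again using Python's
<b>unicodedata</b> module rather than PCRE. The function names are
hypothetical. Because the group is atomic, there is only one possible end
point for each start point.
</P>

```python
import unicodedata

def is_mark(ch: str) -> bool:
    # Categories Mc, Me and Mn all carry the "mark" property.
    return unicodedata.category(ch).startswith("M")

def match_extended(subject: str, pos: int):
    # Returns the end index of the sequence starting at pos, or None.
    if pos >= len(subject) or is_mark(subject[pos]):
        return None                  # \PM needs one non-mark character
    end = pos + 1
    while end < len(subject) and is_mark(subject[end]):
        end += 1                     # \pM* takes every following mark
    return end

subject = "e\u0301x"                 # "e", a combining acute accent, "x"
print(match_extended(subject, 0), match_extended(subject, 1))
```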
476 <P>
477 Matching characters by Unicode property is not fast, because PCRE has to search
478 a structure that contains data for over fifteen thousand characters. That is
479 why the traditional escape sequences such as \d and \w do not use Unicode
480 properties in PCRE.
481 <a name="smallassertions"></a></P>
482 <br><b>
483 Simple assertions
484 </b><br>
485 <P>
486 The fourth use of backslash is for certain simple assertions. An assertion
487 specifies a condition that has to be met at a particular point in a match,
488 without consuming any characters from the subject string. The use of
489 subpatterns for more complicated assertions is described
490 <a href="#bigassertions">below.</a>
491 The backslashed assertions are:
492 <pre>
493 \b matches at a word boundary
494 \B matches when not at a word boundary
495 \A matches at start of subject
496 \Z matches at end of subject or before newline at end
497 \z matches at end of subject
498 \G matches at first matching position in subject
499 </pre>
500 These assertions may not appear in character classes (but note that \b has a
501 different meaning, namely the backspace character, inside a character class).
502 </P>
503 <P>
504 A word boundary is a position in the subject string where the current character
505 and the previous character do not both match \w or \W (i.e. one matches
506 \w and the other matches \W), or the start or end of the string if the
507 first or last character matches \w, respectively.
508 </P>
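<P>
The word boundary definition translates directly into code. The sketch below
assumes an ASCII-only notion of \w (letters, digits and underscore) and uses
hypothetical function names; it then compares the result with \b in Python's
<b>re</b> module, which is assumed to follow the same rule.
</P>

```python
import re

def is_word(ch: str) -> bool:
    # ASCII-only \w for this sketch: letter, digit or underscore.
    return ch.isascii() and (ch.isalnum() or ch == "_")

def at_word_boundary(subject: str, i: int) -> bool:
    # Outside the string counts as "not a word character".
    before = i > 0 and is_word(subject[i - 1])
    after = i < len(subject) and is_word(subject[i])
    return before != after

subject = "ab, cd_9!"
boundaries = [i for i in range(len(subject) + 1) if at_word_boundary(subject, i)]
from_re = [m.start() for m in re.finditer(r"\b", subject, re.ASCII)]
print(boundaries, from_re)
```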
509 <P>
510 The \A, \Z, and \z assertions differ from the traditional circumflex and
511 dollar (described in the next section) in that they only ever match at the very
512 start and end of the subject string, whatever options are set. Thus, they are
513 independent of multiline mode. These three assertions are not affected by the
514 PCRE_NOTBOL or PCRE_NOTEOL options, which affect only the behaviour of the
515 circumflex and dollar metacharacters. However, if the <i>startoffset</i>
516 argument of <b>pcre_exec()</b> is non-zero, indicating that matching is to start
517 at a point other than the beginning of the subject, \A can never match. The
518 difference between \Z and \z is that \Z matches before a newline at the end
519 of the string as well as at the very end, whereas \z matches only at the end.
520 </P>
521 <P>
522 The \G assertion is true only when the current matching position is at the
523 start point of the match, as specified by the <i>startoffset</i> argument of
524 <b>pcre_exec()</b>. It differs from \A when the value of <i>startoffset</i> is
525 non-zero. By calling <b>pcre_exec()</b> multiple times with appropriate
526 arguments, you can mimic Perl's /g option, and it is in this kind of
527 implementation that \G can be useful.
528 </P>
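<P>
The repeated-call idea can be sketched in Python. Python's <b>re</b> module
has no \G, but <b>Pattern.match(string, pos)</b> succeeds only at the given
offset, which is used here as a rough analogue of \G with a non-zero
<i>startoffset</i>. The function name <b>global_match</b> is hypothetical.
</P>

```python
import re

def global_match(pattern: str, subject: str):
    # Collect successive matches, each anchored where the previous one ended.
    compiled = re.compile(pattern)
    results = []
    offset = 0
    while offset <= len(subject):
        m = compiled.match(subject, offset)   # succeeds only at `offset`
        if m is None:
            break
        results.append(m.group(0))
        # Step past an empty match so the loop always makes progress.
        offset = m.end() if m.end() > offset else offset + 1
    return results

pairs = global_match(r"\d\d", "12345")
# With a non-zero start offset, \A can never match.
never_at_start = re.compile(r"\Aab").search("ab ab", 3)
print(pairs, never_at_start)
```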
529 <P>
530 Note, however, that PCRE's interpretation of \G, as the start of the current
531 match, is subtly different from Perl's, which defines it as the end of the
532 previous match. In Perl, these can be different when the previously matched
533 string was empty. Because PCRE does just one match at a time, it cannot
534 reproduce this behaviour.
535 </P>
536 <P>
537 If all the alternatives of a pattern begin with \G, the expression is anchored
538 to the starting match position, and the "anchored" flag is set in the compiled
539 regular expression.
540 </P>
541 <br><a name="SEC3" href="#TOC1">CIRCUMFLEX AND DOLLAR</a><br>
542 <P>
543 Outside a character class, in the default matching mode, the circumflex
544 character is an assertion that is true only if the current matching point is
545 at the start of the subject string. If the <i>startoffset</i> argument of
546 <b>pcre_exec()</b> is non-zero, circumflex can never match if the PCRE_MULTILINE
547 option is unset. Inside a character class, circumflex has an entirely different
548 meaning
549 <a href="#characterclass">(see below).</a>
550 </P>
551 <P>
552 Circumflex need not be the first character of the pattern if a number of
553 alternatives are involved, but it should be the first thing in each alternative
554 in which it appears if the pattern is ever to match that branch. If all
555 possible alternatives start with a circumflex, that is, if the pattern is
556 constrained to match only at the start of the subject, it is said to be an
557 "anchored" pattern. (There are also other constructs that can cause a pattern
558 to be anchored.)
559 </P>
560 <P>
561 A dollar character is an assertion that is true only if the current matching
562 point is at the end of the subject string, or immediately before a newline
563 at the end of the string (by default). Dollar need not be the last character of
564 the pattern if a number of alternatives are involved, but it should be the last
565 item in any branch in which it appears. Dollar has no special meaning in a
566 character class.
567 </P>
568 <P>
569 The meaning of dollar can be changed so that it matches only at the very end of
570 the string, by setting the PCRE_DOLLAR_ENDONLY option at compile time. This
571 does not affect the \Z assertion.
572 </P>
573 <P>
574 The meanings of the circumflex and dollar characters are changed if the
575 PCRE_MULTILINE option is set. When this is the case, a circumflex matches
576 immediately after internal newlines as well as at the start of the subject
577 string. It does not match after a newline that ends the string. A dollar
578 matches before any newlines in the string, as well as at the very end, when
579 PCRE_MULTILINE is set. When newline is specified as the two-character
580 sequence CRLF, isolated CR and LF characters do not indicate newlines.
581 </P>
582 <P>
583 For example, the pattern /^abc$/ matches the subject string "def\nabc" (where
584 \n represents a newline) in multiline mode, but not otherwise. Consequently,
585 patterns that are anchored in single line mode because all branches start with
586 ^ are not anchored in multiline mode, and a match for circumflex is possible
587 when the <i>startoffset</i> argument of <b>pcre_exec()</b> is non-zero. The
588 PCRE_DOLLAR_ENDONLY option is ignored if PCRE_MULTILINE is set.
589 </P>
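<P>
The /^abc$/ example can be reproduced with Python's <b>re</b> module, whose
<b>re.MULTILINE</b> flag is assumed here to correspond to PCRE_MULTILINE for
this case, with LF as the newline.
</P>

```python
import re

subject = "def\nabc"
single = re.search(r"^abc$", subject)                # default mode
multi = re.search(r"^abc$", subject, re.MULTILINE)   # multiline mode

# By default, dollar also matches just before a newline that ends the string.
before_final_newline = re.search(r"abc$", "abc\n")
print(single, multi.start(), before_final_newline is not None)
```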
590 <P>
591 Note that the sequences \A, \Z, and \z can be used to match the start and
592 end of the subject in both modes, and if all branches of a pattern start with
593 \A it is always anchored, whether or not PCRE_MULTILINE is set.
594 </P>
595 <br><a name="SEC4" href="#TOC1">FULL STOP (PERIOD, DOT)</a><br>
596 <P>
597 Outside a character class, a dot in the pattern matches any one character in
598 the subject string except (by default) a character that signifies the end of a
599 line. In UTF-8 mode, the matched character may be more than one byte long. When
600 a line ending is defined as a single character (CR or LF), dot never matches
601 that character; when the two-character sequence CRLF is used, dot does not
602 match CR if it is immediately followed by LF, but otherwise it matches all
603 characters (including isolated CRs and LFs).
604 </P>
605 <P>
606 The behaviour of dot with regard to newlines can be changed. If the PCRE_DOTALL
607 option is set, a dot matches any one character, without exception. If newline
608 is defined as the two-character sequence CRLF, it takes two dots to match it.
609 </P>
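<P>
A short illustration using Python's <b>re</b> module, whose <b>re.DOTALL</b>
flag plays the role of PCRE_DOTALL. Python always treats LF as the newline,
so the CR and CRLF cases described above are not covered by this sketch.
</P>

```python
import re

default_dot = re.fullmatch(r"a.b", "a\nb")             # dot refuses the newline
dotall_dot = re.fullmatch(r"a.b", "a\nb", re.DOTALL)   # dot matches anything
print(default_dot is None, dotall_dot is not None)
```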
610 <P>
611 The handling of dot is entirely independent of the handling of circumflex and
612 dollar, the only relationship being that they both involve newlines. Dot has no
613 special meaning in a character class.
614 </P>
615 <br><a name="SEC5" href="#TOC1">MATCHING A SINGLE BYTE</a><br>
616 <P>
617 Outside a character class, the escape sequence \C matches any one byte, both
618 in and out of UTF-8 mode. Unlike a dot, it always matches CR and LF. The
619 feature is provided in Perl in order to match individual bytes in UTF-8 mode.
620 Because it breaks up UTF-8 characters into individual bytes, what remains in
621 the string may be a malformed UTF-8 string. For this reason, the \C escape
622 sequence is best avoided.
623 </P>
624 <P>
625 PCRE does not allow \C to appear in lookbehind assertions
626 <a href="#lookbehind">(described below),</a>
627 because in UTF-8 mode this would make it impossible to calculate the length of
628 the lookbehind.
629 <a name="characterclass"></a></P>
630 <br><a name="SEC6" href="#TOC1">SQUARE BRACKETS AND CHARACTER CLASSES</a><br>
631 <P>
632 An opening square bracket introduces a character class, terminated by a closing
633 square bracket. A closing square bracket on its own is not special. If a
634 closing square bracket is required as a member of the class, it should be the
635 first data character in the class (after an initial circumflex, if present) or
636 escaped with a backslash.
637 </P>
638 <P>
639 A character class matches a single character in the subject. In UTF-8 mode, the
640 character may occupy more than one byte. A matched character must be in the set
641 of characters defined by the class, unless the first character in the class
642 definition is a circumflex, in which case the subject character must not be in
643 the set defined by the class. If a circumflex is actually required as a member
644 of the class, ensure it is not the first character, or escape it with a
645 backslash.
646 </P>
647 <P>
648 For example, the character class [aeiou] matches any lower case vowel, while
649 [^aeiou] matches any character that is not a lower case vowel. Note that a
650 circumflex is just a convenient notation for specifying the characters that
651 are in the class by enumerating those that are not. A class that starts with a
652 circumflex is not an assertion: it still consumes a character from the subject
653 string, and therefore it fails if the current pointer is at the end of the
654 string.
655 </P>
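<P>
The difference between a negated class and an assertion shows up at the end
of the subject. The sketch uses Python's <b>re</b> module as a stand-in and
contrasts the class with a negative lookahead assertion.
</P>

```python
import re

# [^u] must consume a character, so it fails when nothing follows the "q".
negated_class = re.search(r"q[^u]", "Iraq")
# A negative lookahead consumes nothing, so it succeeds at the end.
lookahead = re.search(r"q(?!u)", "Iraq")
print(negated_class, lookahead.group(0))
```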
656 <P>
657 In UTF-8 mode, characters with values greater than 255 can be included in a
658 class as a literal string of bytes, or by using the \x{ escaping mechanism.
659 </P>
660 <P>
661 When caseless matching is set, any letters in a class represent both their
662 upper case and lower case versions, so for example, a caseless [aeiou] matches
663 "A" as well as "a", and a caseless [^aeiou] does not match "A", whereas a
664 caseful version would. In UTF-8 mode, PCRE always understands the concept of
665 case for characters whose values are less than 128, so caseless matching is
666 always possible. For characters with higher values, the concept of case is
667 supported if PCRE is compiled with Unicode property support, but not otherwise.
668 If you want to use caseless matching for characters 128 and above, you must
669 ensure that PCRE is compiled with Unicode property support as well as with
670 UTF-8 support.
671 </P>
672 <P>
673 Characters that might indicate line breaks (CR and LF) are never treated in any
674 special way when matching character classes, whatever line-ending sequence is
675 in use, and whatever setting of the PCRE_DOTALL and PCRE_MULTILINE options is
676 used. A class such as [^a] always matches one of these characters.
677 </P>
678 <P>
679 The minus (hyphen) character can be used to specify a range of characters in a
680 character class. For example, [d-m] matches any letter between d and m,
681 inclusive. If a minus character is required in a class, it must be escaped with
682 a backslash or appear in a position where it cannot be interpreted as
683 indicating a range, typically as the first or last character in the class.
684 </P>
685 <P>
686 It is not possible to have the literal character "]" as the end character of a
687 range. A pattern such as [W-]46] is interpreted as a class of two characters
688 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
689 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
690 the end of range, so [W-\]46] is interpreted as a class containing a range
691 followed by two other characters. The octal or hexadecimal representation of
692 "]" can also be used to end a range.
693 </P>
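<P>
Both readings can be checked with Python's <b>re</b> module, which is assumed
here to parse these two classes the same way as PCRE.
</P>

```python
import re

# Unescaped "]" ends the class: {"W", "-"} followed by the literal "46]".
two_char_class = re.compile(r"[W-]46]")
# Escaped "]" ends a range: W through "]", plus "4" and "6".
range_class = re.compile(r"[W-\]46]")

print(bool(two_char_class.fullmatch("W46]")), bool(two_char_class.fullmatch("-46]")))
print(bool(range_class.fullmatch("Z")), bool(range_class.fullmatch("-")))
```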
694 <P>
695 Ranges operate in the collating sequence of character values. They can also be
696 used for characters specified numerically, for example [\000-\037]. In UTF-8
697 mode, ranges can include characters whose values are greater than 255, for
698 example [\x{100}-\x{2ff}].
699 </P>
700 <P>
701 If a range that includes letters is used when caseless matching is set, it
702 matches the letters in either case. For example, [W-c] is equivalent to
703 [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if character
704 tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches accented E
705 characters in both cases. In UTF-8 mode, PCRE supports the concept of case for
706 characters with values greater than 128 only when it is compiled with Unicode
707 property support.
708 </P>
709 <P>
710 The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
711 in a character class, and add the characters that they match to the class. For
712 example, [\dABCDEF] matches any upper case hexadecimal digit. A circumflex can
713 conveniently be used with the upper case character types to specify a more
714 restricted set of characters than the matching lower case type. For example,
715 the class [^\W_] matches any letter or digit, but not underscore.
716 </P>
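<P>
The two example classes can be tried out as follows. This is a sketch using
Python's <b>re</b> module rather than PCRE; the re.ASCII flag is an assumption
made here to keep \d and \W restricted to ASCII, which is closer to PCRE's
default character tables.
</P>

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
letter_or_digit = re.compile(r"[^\W_]", re.ASCII)

# \d adds the decimal digits to the explicitly listed letters.
hex_hits = [c for c in "0-9AFGa" if hex_digit.fullmatch(c)]

# [^\W_] is "not a non-word character and not underscore".
word_hits = [c for c in "aZ5_-" if letter_or_digit.fullmatch(c)]
```

<P>
Note that [\dABCDEF] accepts only upper case hexadecimal letters unless caseless
matching is also set.
</P>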
717 <P>
718 The only metacharacters that are recognized in character classes are backslash,
719 hyphen (only where it can be interpreted as specifying a range), circumflex
720 (only at the start), opening square bracket (only when it can be interpreted as
721 introducing a POSIX class name - see the next section), and the terminating
722 closing square bracket. However, escaping other non-alphanumeric characters
723 does no harm.
724 </P>
725 <br><a name="SEC7" href="#TOC1">POSIX CHARACTER CLASSES</a><br>
726 <P>
727 Perl supports the POSIX notation for character classes. This uses names
728 enclosed by [: and :] within the enclosing square brackets. PCRE also supports
729 this notation. For example,
730 <pre>
731 [01[:alpha:]%]
732 </pre>
733 matches "0", "1", any alphabetic character, or "%". The supported class names
734 are
735 <pre>
736   alnum    letters and digits
737   alpha    letters
738   ascii    character codes 0 - 127
739   blank    space or tab only
740   cntrl    control characters
741   digit    decimal digits (same as \d)
742   graph    printing characters, excluding space
743   lower    lower case letters
744   print    printing characters, including space
745   punct    printing characters, excluding letters and digits
746   space    white space (not quite the same as \s)
747   upper    upper case letters
748   word     "word" characters (same as \w)
749   xdigit   hexadecimal digits
750 </pre>
751 The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
752 space (32). Notice that this list includes the VT character (code 11). This
753 makes "space" different to \s, which does not include VT (for Perl
754 compatibility).
755 </P>
756 <P>
757 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
758 5.8. Another Perl extension is negation, which is indicated by a ^ character
759 after the colon. For example,
760 <pre>
761 [12[:^digit:]]
762 </pre>
763 matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
764 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
765 supported, and an error is given if they are encountered.
766 </P>
767 <P>
768 In UTF-8 mode, characters with values greater than 128 do not match any of
769 the POSIX character classes.
770 </P>
771 <br><a name="SEC8" href="#TOC1">VERTICAL BAR</a><br>
772 <P>
773 Vertical bar characters are used to separate alternative patterns. For example,
774 the pattern
775 <pre>
776 gilbert|sullivan
777 </pre>
778 matches either "gilbert" or "sullivan". Any number of alternatives may appear,
779 and an empty alternative is permitted (matching the empty string). The matching
780 process tries each alternative in turn, from left to right, and the first one
781 that succeeds is used. If the alternatives are within a subpattern
782 <a href="#subpattern">(defined below),</a>
783 "succeeds" means matching the rest of the main pattern as well as the
784 alternative in the subpattern.
785 </P>
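<P>
The left-to-right rule has a visible consequence: an earlier alternative wins
even when a later one would match more text. A minimal sketch, using Python's
<b>re</b> module on the assumption that it orders alternatives as PCRE does:
</P>

```python
import re

first = re.match(r"gilbert|sullivan", "sullivan").group()

# "a" is tried first and succeeds, so the longer "ab" is never used.
leftmost = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" cannot be followed by "c" here, so the second alternative is used.
inside = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_ok = re.fullmatch(r"cat|", "") is not None
```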
786 <br><a name="SEC9" href="#TOC1">INTERNAL OPTION SETTING</a><br>
787 <P>
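788 The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
789 PCRE_EXTENDED options can be changed from within the pattern by a sequence of
790 Perl option letters enclosed between "(?" and ")". The option letters are
791 <pre>
792   i  for PCRE_CASELESS
793   m  for PCRE_MULTILINE
794   s  for PCRE_DOTALL
795   x  for PCRE_EXTENDED
796 </pre>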
797 For example, (?im) sets caseless, multiline matching. It is also possible to
798 unset these options by preceding the letter with a hyphen, and a combined
799 setting and unsetting such as (?im-sx), which sets PCRE_CASELESS and
800 PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED, is also
801 permitted. If a letter appears both before and after the hyphen, the option is
802 unset.
803 </P>
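<P>
A sketch of option letters at the start of a pattern, using Python's <b>re</b>
module. This is only a partial analogue: Python accepts unscoped option letters
only at the very start of a pattern and has no unscoped way of unsetting one, so
a sequence such as (?im-sx) is not shown.
</P>

```python
import re

subject = "ABC\nabc"

# (?im) at the start behaves like passing the two flags explicitly.
inline = re.findall(r"(?im)^abc$", subject)
flagged = re.findall(r"^abc$", subject, re.IGNORECASE | re.MULTILINE)

# Without the options, ^ and $ match only at the ends of the subject,
# and the comparison is caseful.
plain = re.findall(r"^abc$", subject)
```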
804 <P>
805 When an option change occurs at top level (that is, not inside subpattern
806 parentheses), the change applies to the remainder of the pattern that follows.
807 If the change is placed right at the start of a pattern, PCRE extracts it into
808 the global options (and it will therefore show up in data extracted by the
809 <b>pcre_fullinfo()</b> function).
810 </P>
811 <P>
812 An option change within a subpattern affects only that part of the current
813 pattern that follows it, so
814 <pre>
815 (a(?i)b)c
816 </pre>
817 matches "abc" and "aBc" and no other strings (assuming PCRE_CASELESS is not used).
818 By this means, options can be made to have different settings in different
819 parts of the pattern. Any changes made in one alternative do carry on
820 into subsequent branches within the same subpattern. For example,
821 <pre>
822 (a(?i)b|c)
823 </pre>
824 matches "ab", "aB", "c", and "C", even though when matching "C" the first
825 branch is abandoned before the option setting. This is because the effects of
826 option settings happen at compile time. There would be some very weird
827 behaviour otherwise.
828 </P>
829 <P>
830 The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA can be
831 changed in the same way as the Perl-compatible options by using the characters
832 J, U and X respectively.
833 <a name="subpattern"></a></P>
834 <br><a name="SEC10" href="#TOC1">SUBPATTERNS</a><br>
835 <P>
836 Subpatterns are delimited by parentheses (round brackets), which can be nested.
837 Turning part of a pattern into a subpattern does two things:
838 <br>
839 <br>
840 1. It localizes a set of alternatives. For example, the pattern
841 <pre>
842 cat(aract|erpillar|)
843 </pre>
844 matches one of the words "cat", "cataract", or "caterpillar". Without the
845 parentheses, it would match "cataract", "erpillar" or the empty string.
846 <br>
847 <br>
848 2. It sets up the subpattern as a capturing subpattern. This means that, when
849 the whole pattern matches, that portion of the subject string that matched the
850 subpattern is passed back to the caller via the <i>ovector</i> argument of
851 <b>pcre_exec()</b>. Opening parentheses are counted from left to right (starting
852 from 1) to obtain numbers for the capturing subpatterns.
853 </P>
854 <P>
855 For example, if the string "the red king" is matched against the pattern
856 <pre>
857 the ((red|white) (king|queen))
858 </pre>
859 the captured substrings are "red king", "red", and "king", and are numbered 1,
860 2, and 3, respectively.
861 </P>
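<P>
Both functions of parentheses can be seen in a short sketch. It uses Python's
<b>re</b> module, which numbers capturing parentheses in the same way; the
match object stands in for the <i>ovector</i> that <b>pcre_exec()</b> fills in.
</P>

```python
import re

# Function 1: the parentheses localize the alternatives.
localized = [bool(re.fullmatch(r"cat(aract|erpillar|)", w))
             for w in ("cat", "cataract", "caterpillar", "erpillar")]

# Function 2: opening parentheses are numbered from left to right.
m = re.search(r"the ((red|white) (king|queen))", "the red king")
numbered = {n: m.group(n) for n in range(1, m.re.groups + 1)}
```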
862 <P>
863 The fact that plain parentheses fulfil two functions is not always helpful.
864 There are often times when a grouping subpattern is required without a
865 capturing requirement. If an opening parenthesis is followed by a question mark
866 and a colon, the subpattern does not do any capturing, and is not counted when
867 computing the number of any subsequent capturing subpatterns. For example, if
868 the string "the white queen" is matched against the pattern
869 <pre>
870 the ((?:red|white) (king|queen))
871 </pre>
872 the captured substrings are "white queen" and "queen", and are numbered 1 and
873 2. The maximum number of capturing subpatterns is 65535, and the maximum depth
874 of nesting of all subpatterns, both capturing and non-capturing, is 200.
875 </P>
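<P>
The effect of (?: on the numbering can be confirmed with a minimal sketch
(Python's <b>re</b> module is assumed as a stand-in for PCRE):
</P>

```python
import re

m = re.search(r"the ((?:red|white) (king|queen))", "the white queen")

# The (?:red|white) group is not counted, so only two substrings are captured.
groups = m.groups()
group_count = m.re.groups
```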
876 <P>
877 As a convenient shorthand, if any option settings are required at the start of
878 a non-capturing subpattern, the option letters may appear between the "?" and
879 the ":". Thus the two patterns
880 <pre>
881 (?i:saturday|sunday)
882 (?:(?i)saturday|sunday)
883 </pre>
884 match exactly the same set of strings. Because alternative branches are tried
885 from left to right, and options are not reset until the end of the subpattern
886 is reached, an option setting in one branch does affect subsequent branches, so
887 the above patterns match "SUNDAY" as well as "Saturday".
888 </P>
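<P>
A sketch of the shorthand form, using Python's <b>re</b> module. Only the first
of the two patterns is shown, because Python 3.11 and later reject an unscoped
(?i) that is not at the start of the pattern.
</P>

```python
import re

days = re.compile(r"(?i:saturday|sunday)")

# The option applies to both branches of the subpattern.
hits = [bool(days.fullmatch(s))
        for s in ("SUNDAY", "Saturday", "sunday", "monday")]

# The option ends with the subpattern: what follows is matched casefully.
scoped = re.compile(r"(?i:sat)urday")
scoped_hits = [bool(scoped.fullmatch(s)) for s in ("SATurday", "SATURDAY")]
```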
889 <br><a name="SEC11" href="#TOC1">NAMED SUBPATTERNS</a><br>
890 <P>
891 Identifying capturing parentheses by number is simple, but it can be very hard
892 to keep track of the numbers in complicated regular expressions. Furthermore,
893 if an expression is modified, the numbers may change. To help with this
894 difficulty, PCRE supports the naming of subpatterns, something that Perl does
895 not provide. The Python syntax (?P&#60;name&#62;...) is used. References to capturing
896 parentheses from other parts of the pattern, such as
897 <a href="#backreferences">backreferences,</a>
898 <a href="#recursion">recursion,</a>
899 and
900 <a href="#conditions">conditions,</a>
901 can be made by name as well as by number.
902 </P>
903 <P>
904 Names consist of up to 32 alphanumeric characters and underscores. Named
905 capturing parentheses are still allocated numbers as well as names. The PCRE
906 API provides function calls for extracting the name-to-number translation table
907 from a compiled pattern. There is also a convenience function for extracting a
908 captured substring by name.
909 </P>
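<P>
Since the syntax is borrowed from Python, the relationship between names and
numbers can be illustrated with Python's <b>re</b> module. This is only an
analogue of the PCRE API described above: <b>groupindex</b> plays the part of
the name-to-number table, and <b>group()</b> called with a name plays the part
of the convenience function. The pattern and names are invented for the example.
</P>

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named parentheses are still allocated numbers as well as names.
name_table = dict(date.groupindex)

m = date.search("released 2007-02")
by_name = m.group("month")
by_number = m.group(name_table["month"])
```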
910 <P>
911 By default, a name must be unique within a pattern, but it is possible to relax
912 this constraint by setting the PCRE_DUPNAMES option at compile time. This can
913 be useful for patterns where only one instance of the named parentheses can
914 match. Suppose you want to match the name of a weekday, either as a 3-letter
915 abbreviation or as the full name, and in both cases you want to extract the
916 abbreviation. This pattern (ignoring the line breaks) does the job:
917 <pre>
918 (?P&#60;DN&#62;Mon|Fri|Sun)(?:day)?|
919 (?P&#60;DN&#62;Tue)(?:sday)?|
920 (?P&#60;DN&#62;Wed)(?:nesday)?|
921 (?P&#60;DN&#62;Thu)(?:rsday)?|
922 (?P&#60;DN&#62;Sat)(?:urday)?
923 </pre>
924 There are five capturing substrings, but only one is ever set after a match.
925 The convenience function for extracting the data by name returns the substring
926 for the first, and in this example, the only, subpattern of that name that
927 matched. This saves searching to find which numbered subpattern it was. If you
928 make a reference to a non-unique named subpattern from elsewhere in the
929 pattern, the one that corresponds to the lowest number is used. For further
930 details of the interfaces for handling named subpatterns, see the
931 <a href="pcreapi.html"><b>pcreapi</b></a>
932 documentation.
933 </P>
934 <br><a name="SEC12" href="#TOC1">REPETITION</a><br>
935 <P>
936 Repetition is specified by quantifiers, which can follow any of the following
937 items:
938 <pre>
939 a literal data character
940 the . metacharacter
941 the \C escape sequence
942 the \X escape sequence (in UTF-8 mode with Unicode properties)
943 an escape such as \d that matches a single character
944 a character class
945 a back reference (see next section)
946 a parenthesized subpattern (unless it is an assertion)
947 </pre>
948 The general repetition quantifier specifies a minimum and maximum number of
949 permitted matches, by giving the two numbers in curly brackets (braces),
950 separated by a comma. The numbers must be less than 65536, and the first must
951 be less than or equal to the second. For example:
952 <pre>
953 z{2,4}
954 </pre>
955 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
956 character. If the second number is omitted, but the comma is present, there is
957 no upper limit; if the second number and the comma are both omitted, the
958 quantifier specifies an exact number of required matches. Thus
959 <pre>
960 [aeiou]{3,}
961 </pre>
962 matches at least 3 successive vowels, but may match many more, while
963 <pre>
964 \d{8}
965 </pre>
966 matches exactly 8 digits. An opening curly bracket that appears in a position
967 where a quantifier is not allowed, or one that does not match the syntax of a
968 quantifier, is taken as a literal character. For example, {,6} is not a
969 quantifier, but a literal string of four characters.
970 </P>
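<P>
The three forms of the general quantifier can be exercised as follows. This is
a sketch using Python's <b>re</b> module; it agrees with the description above
for these cases, but note that Python, unlike PCRE as described here, does treat
{,6} as a quantifier, so that case is deliberately left out.
</P>

```python
import re

# {2,4}: between two and four repetitions.
z_hits = [s for s in ("z", "zz", "zzzz", "zzzzz") if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit.
vowels = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
eight = bool(re.fullmatch(r"\d{8}", "20070224"))
seven = bool(re.fullmatch(r"\d{8}", "2007022"))

# A brace that does not fit the quantifier syntax is a literal character.
literal = bool(re.fullmatch(r"a{x}", "a{x}"))
```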
971 <P>
972 In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to individual
973 bytes. Thus, for example, \x{100}{2} matches two UTF-8 characters, each of
974 which is represented by a two-byte sequence. Similarly, when Unicode property
975 support is available, \X{3} matches three Unicode extended sequences, each of
976 which may be several bytes long (and they may be of different lengths).
977 </P>
978 <P>
979 The quantifier {0} is permitted, causing the expression to behave as if the
980 previous item and the quantifier were not present.
981 </P>
982 <P>
983 For convenience (and historical compatibility) the three most common
984 quantifiers have single-character abbreviations:
985 <pre>
986 * is equivalent to {0,}
987 + is equivalent to {1,}
988 ? is equivalent to {0,1}
989 </pre>
990 It is possible to construct infinite loops by following a subpattern that can
991 match no characters with a quantifier that has no upper limit, for example:
992 <pre>
993 (a?)*
994 </pre>
995 Earlier versions of Perl and PCRE used to give an error at compile time for
996 such patterns. However, because there are cases where this can be useful, such
997 patterns are now accepted, but if any repetition of the subpattern does in fact
998 match no characters, the loop is forcibly broken.
999 </P>
1000 <P>
1001 By default, the quantifiers are "greedy", that is, they match as much as
1002 possible (up to the maximum number of permitted times), without causing the
1003 rest of the pattern to fail. The classic example of where this gives problems
1004 is in trying to match comments in C programs. These appear between /* and */,
1005 and within the comment individual * and / characters may appear. An attempt to
1006 match C comments by applying the pattern
1007 <pre>
1008 /\*.*\*/
1009 </pre>
1010 to the string
1011 <pre>
1012 /* first comment */ not comment /* second comment */
1013 </pre>
1014 fails, because it matches the entire string owing to the greediness of the .*
1015 item.
1016 </P>
1017 <P>
1018 However, if a quantifier is followed by a question mark, it ceases to be
1019 greedy, and instead matches the minimum number of times possible, so the
1020 pattern
1021 <pre>
1022 /\*.*?\*/
1023 </pre>
1024 does the right thing with the C comments. The meaning of the various
1025 quantifiers is not otherwise changed, just the preferred number of matches.
1026 Do not confuse this use of question mark with its use as a quantifier in its
1027 own right. Because it has two uses, it can sometimes appear doubled, as in
1028 <pre>
1029 \d??\d
1030 </pre>
1031 which matches one digit by preference, but can match two if that is the only
1032 way the rest of the pattern matches.
1033 </P>
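<P>
The difference between the greedy and minimizing forms is easy to observe. A
minimal sketch, assuming Python's <b>re</b> module as a stand-in for PCRE:
</P>

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last */ in the subject.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first */ after each /*.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one = re.match(r"\d??\d", "12").group()
two = re.fullmatch(r"\d??\d", "12").group()
```

<P>
The greedy pattern returns the whole subject as a single match, whereas the
minimizing pattern returns the two comments separately.
</P>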
1034 <P>
1035 If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
1036 the quantifiers are not greedy by default, but individual ones can be made
1037 greedy by following them with a question mark. In other words, it inverts the
1038 default behaviour.
1039 </P>
1040 <P>
1041 When a parenthesized subpattern is quantified with a minimum repeat count that
1042 is greater than 1 or with a limited maximum, more memory is required for the
1043 compiled pattern, in proportion to the size of the minimum or maximum.
1044 </P>
1045 <P>
1046 If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
1047 to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
1048 implicitly anchored, because whatever follows will be tried against every
1049 character position in the subject string, so there is no point in retrying the
1050 overall match at any position after the first. PCRE normally treats such a
1051 pattern as though it were preceded by \A.
1052 </P>
1053 <P>
1054 In cases where it is known that the subject string contains no newlines, it is
1055 worth setting PCRE_DOTALL in order to obtain this optimization, or
1056 alternatively using ^ to indicate anchoring explicitly.
1057 </P>
1058 <P>
1059 However, there is one situation where the optimization cannot be used. When .*
1060 is inside capturing parentheses that are the subject of a backreference
1061 elsewhere in the pattern, a match at the start may fail, and a later one
1062 succeed. Consider, for example:
1063 <pre>
1064 (.*)abc\1
1065 </pre>
1066 If the subject is "xyz123abc123" the match point is the fourth character. For
1067 this reason, such a pattern is not implicitly anchored.
1068 </P>
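<P>
The match point in this example can be confirmed with a short sketch (Python's
<b>re</b> module is assumed to behave like PCRE here; offsets are counted from
zero, so the fourth character has offset 3):
</P>

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123")

# Attempts at offsets 0, 1 and 2 fail because the text after "abc" does not
# repeat what (.*) captured; the attempt at offset 3 succeeds.
start = m.start()
captured = m.group(1)
```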
1069 <P>
1070 When a capturing subpattern is repeated, the value captured is the substring
1071 that matched the final iteration. For example, after
1072 <pre>
1073 (tweedle[dume]{3}\s*)+
1074 </pre>
1075 has matched "tweedledum tweedledee" the value of the captured substring is
1076 "tweedledee". However, if there are nested capturing subpatterns, the
1077 corresponding captured values may have been set in previous iterations. For
1078 example, after
1079 <pre>
1080 /(a|(b))+/
1081 </pre>
1082 matches "aba" the value of the second captured substring is "b".
1083 <a name="atomicgroup"></a></P>
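<P>
Both examples can be reproduced with a minimal sketch. Python's <b>re</b>
module is used on the assumption that it keeps captured values across
iterations in the same way for these patterns.
</P>

```python
import re

m1 = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = m1.group(1)

# The iterations match "a", "b", "a". Group 2 is set in the second
# iteration and is not cleared by the third.
m2 = re.fullmatch(r"(a|(b))+", "aba")
outer, inner = m2.group(1), m2.group(2)
```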
1084 <br><a name="SEC13" href="#TOC1">ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS</a><br>
1085 <P>
1086 With both maximizing and minimizing repetition, failure of what follows
1087 normally causes the repeated item to be re-evaluated to see if a different
1088 number of repeats allows the rest of the pattern to match. Sometimes it is
1089 useful to prevent this, either to change the nature of the match, or to cause
1090 it to fail earlier than it otherwise might, when the author of the pattern knows
1091 there is no point in carrying on.
1092 </P>
1093 <P>
1094 Consider, for example, the pattern \d+foo when applied to the subject line
1095 <pre>
1096 123456bar
1097 </pre>
1098 After matching all 6 digits and then failing to match "foo", the normal
1099 action of the matcher is to try again with only 5 digits matching the \d+
1100 item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
1101 (a term taken from Jeffrey Friedl's book) provides the means for specifying
1102 that once a subpattern has matched, it is not to be re-evaluated in this way.
1103 </P>
1104 <P>
1105 If we use atomic grouping for the previous example, the matcher would give up
1106 immediately on failing to match "foo" the first time. The notation is a kind of
1107 special parenthesis, starting with (?&#62; as in this example:
1108 <pre>
1109 (?&#62;\d+)foo
1110 </pre>
1111 This kind of parenthesis "locks up" the part of the pattern it contains once
1112 it has matched, and a failure further into the pattern is prevented from
1113 backtracking into it. Backtracking past it to previous items, however, works as
1114 normal.
1115 </P>
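<P>
The "locking up" effect can be demonstrated even in engines without the
(?&#62; syntax. The sketch below uses Python's <b>re</b> module and a well-known
emulation, not the atomic group itself: a lookahead captures the longest run of
digits, and a back reference then consumes exactly that run, so the matcher
cannot hand any digits back. (Python 3.11 and later also accept (?&#62;...)
directly.) The group name "d" is an arbitrary choice.
</P>

```python
import re

subject = "12345"

# Ordinary repeat: \d+ first takes all five digits, then gives one back
# so that the final "5" can match.
backtracking = re.search(r"\d+5", subject)

# Emulated atomic group: the digits are taken as a block and never
# re-evaluated, so nothing is left for the final "5".
atomic_like = re.search(r"(?=(?P<d>\d+))(?P=d)5", subject)
```

<P>
The first search succeeds with "12345"; the second fails, which is the change
in "the nature of the match" that atomic grouping can bring about.
</P>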
1116 <P>
1117 An alternative description is that a subpattern of this type matches the string
1118 of characters that an identical standalone pattern would match, if anchored at
1119 the current point in the subject string.
1120 </P>
1121 <P>
1122 Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
1123 the above example can be thought of as a maximizing repeat that must swallow
1124 everything it can. So, while both \d+ and \d+? are prepared to adjust the
1125 number of digits they match in order to make the rest of the pattern match,
1126 (?&#62;\d+) can only match an entire sequence of digits.
1127 </P>
1128 <P>
1129 Atomic groups in general can of course contain arbitrarily complicated
1130 subpatterns, and can be nested. However, when the subpattern for an atomic
1131 group is just a single repeated item, as in the example above, a simpler
1132 notation, called a "possessive quantifier" can be used. This consists of an
1133 additional + character following a quantifier. Using this notation, the
1134 previous example can be rewritten as
1135 <pre>
1136 \d++foo
1137 </pre>
1138 Possessive quantifiers are always greedy; the setting of the PCRE_UNGREEDY
1139 option is ignored. They are a convenient notation for the simpler forms of
1140 atomic group. However, there is no difference in the meaning or processing of a
1141 possessive quantifier and the equivalent atomic group.
1142 </P>
1143 <P>
1144 The possessive quantifier syntax is an extension to the Perl syntax. Jeffrey
1145 Friedl originated the idea (and the name) in the first edition of his book.
1146 Mike McCloskey liked it, so implemented it when he built Sun's Java package,
1147 and PCRE copied it from there.
1148 </P>
1149 <P>
1150 When a pattern contains an unlimited repeat inside a subpattern that can itself
1151 be repeated an unlimited number of times, the use of an atomic group is the
1152 only way to avoid some failing matches taking a very long time indeed. The
1153 pattern
1154 <pre>
1155 (\D+|&#60;\d+&#62;)*[!?]
1156 </pre>
1157 matches an unlimited number of substrings that either consist of non-digits, or
1158 digits enclosed in &#60;&#62;, followed by either ! or ?. When it matches, it runs
1159 quickly. However, if it is applied to
1160 <pre>
1161 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1162 </pre>
1163 it takes a long time before reporting failure. This is because the string can
1164 be divided between the internal \D+ repeat and the external * repeat in a
1165 large number of ways, and all have to be tried. (The example uses [!?] rather
1166 than a single character at the end, because both PCRE and Perl have an
1167 optimization that allows for fast failure when a single character is used. They
1168 remember the last single character that is required for a match, and fail early
1169 if it is not present in the string.) If the pattern is changed so that it uses
1170 an atomic group, like this:
1171 <pre>
1172 ((?&#62;\D+)|&#60;\d+&#62;)*[!?]
1173 </pre>
1174 sequences of non-digits cannot be broken, and failure happens quickly.
1175 <a name="backreferences"></a></P>
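<P>
A sketch of the fast failure, using Python's <b>re</b> module. The atomic group
is emulated with a lookahead followed by a back reference (an assumption made so
that the example also runs on Python versions without the (?&#62; syntax); the
group name "run" is arbitrary. The unprotected pattern is deliberately not run
against this subject, because it would take an impractically long time.
</P>

```python
import re

subject = "a" * 52

# Each run of non-digits is captured inside a lookahead and then consumed
# as a single block, so it cannot be split between the inner and outer repeats.
guarded = re.compile(r"((?=(?P<run>\D+))(?P=run)|<\d+>)*[!?]")

result = guarded.match(subject)
```

<P>
Because \D+ can itself swallow a trailing "!" or "?", the atomic form is not a
drop-in replacement for every subject; the sketch demonstrates only that failure
is reported quickly.
</P>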
1176 <br><a name="SEC14" href="#TOC1">BACK REFERENCES</a><br>
1177 <P>
1178 Outside a character class, a backslash followed by a digit greater than 0 (and
1179 possibly further digits) is a back reference to a capturing subpattern earlier
1180 (that is, to its left) in the pattern, provided there have been that many
1181 previous capturing left parentheses.
1182 </P>
1183 <P>
1184 However, if the decimal number following the backslash is less than 10, it is
1185 always taken as a back reference, and causes an error only if there are not
1186 that many capturing left parentheses in the entire pattern. In other words, the
1187 parentheses that are referenced need not be to the left of the reference for
1188 numbers less than 10. A "forward back reference" of this type can make sense
1189 when a repetition is involved and the subpattern to the right has participated
1190 in an earlier iteration.
1191 </P>
1192 <P>
1193 It is not possible to have a numerical "forward back reference" to a subpattern
1194 whose number is 10 or more. However, a back reference to any subpattern is
1195 possible using named parentheses (see below). See also the subsection entitled
1196 "Non-printing characters"
1197 <a href="#digitsafterbackslash">above</a>
1198 for further details of the handling of digits following a backslash.
1199 </P>
1200 <P>
1201 A back reference matches whatever actually matched the capturing subpattern in
1202 the current subject string, rather than anything matching the subpattern
1203 itself (see
1204 <a href="#subpatternsassubroutines">"Subpatterns as subroutines"</a>
1205 below for a way of doing that). So the pattern
1206 <pre>
1207 (sens|respons)e and \1ibility
1208 </pre>
1209 matches "sense and sensibility" and "response and responsibility", but not
1210 "sense and responsibility". If caseful matching is in force at the time of the
1211 back reference, the case of letters is relevant. For example,
1212 <pre>
1213 ((?i)rah)\s+\1
1214 </pre>
1215 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
1216 capturing subpattern is matched caselessly.
1217 </P>
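<P>
Both examples can be tried out with a short sketch. Python's <b>re</b> module
is assumed as a stand-in for PCRE; since recent Python versions reject (?i) in
the middle of a pattern, the second pattern uses the scoped form (?i:rah)
inside the capturing parentheses instead.
</P>

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")
hits = [bool(pattern.fullmatch(s)) for s in (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)]

# The capture is caseless, but the back reference is compared casefully
# against whatever was actually captured.
rah = re.compile(r"((?i:rah))\s+\1")
rah_hits = [bool(rah.fullmatch(s)) for s in ("rah rah", "RAH RAH", "RAH rah")]
```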
1218 <P>
1219 Back references to named subpatterns use the Python syntax (?P=name). We could
1220 rewrite the above example as follows:
1221 <pre>
1222 (?P&#60;p1&#62;(?i)rah)\s+(?P=p1)
1223 </pre>
1224 A subpattern that is referenced by name may appear in the pattern before or
1225 after the reference.
1226 </P>
1227 <P>
1228 There may be more than one back reference to the same subpattern. If a
1229 subpattern has not actually been used in a particular match, any back
1230 references to it always fail. For example, the pattern
1231 <pre>
1232 (a|(bc))\2
1233 </pre>
1234 always fails if it starts to match "a" rather than "bc". Because there may be
1235 many capturing parentheses in a pattern, all digits following the backslash are
1236 taken as part of a potential back reference number. If the pattern continues
1237 with a digit character, some delimiter must be used to terminate the back
1238 reference. If the PCRE_EXTENDED option is set, this can be whitespace.
1239 Otherwise an empty comment (see
1240 <a href="#comments">"Comments"</a>
1241 below) can be used.
1242 </P>
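<P>
A minimal sketch of a back reference to a subpattern that took no part in the
match (Python's <b>re</b> module is assumed to agree with PCRE on this point):
</P>

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# When the first branch matches "a", group 2 is unset and \2 cannot match.
results = {s: bool(pattern.fullmatch(s)) for s in ("bcbc", "a", "abc")}
```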
1243 <P>
1244 A back reference that occurs inside the parentheses to which it refers fails
1245 when the subpattern is first used, so, for example, (a\1) never matches.
1246 However, such references can be useful inside repeated subpatterns. For
1247 example, the pattern
1248 <pre>
1249 (a|b\1)+
1250 </pre>
1251 matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1252 the subpattern, the back reference matches the character string corresponding
1253 to the previous iteration. In order for this to work, the pattern must be such
1254 that the first iteration does not need to match the back reference. This can be
1255 done using alternation, as in the example above, or by a quantifier with a
1256 minimum of zero.
1257 <a name="bigassertions"></a></P>
1258 <br><a name="SEC15" href="#TOC1">ASSERTIONS</a><br>
1259 <P>
1260 An assertion is a test on the characters following or preceding the current
1261 matching point that does not actually consume any characters. The simple
1262 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
1263 <a href="#smallassertions">above.</a>
1264 </P>
1265 <P>
1266 More complicated assertions are coded as subpatterns. There are two kinds:
1267 those that look ahead of the current position in the subject string, and those
1268 that look behind it. An assertion subpattern is matched in the normal way,
1269 except that it does not cause the current matching position to be changed.
1270 </P>
1271 <P>
1272 Assertion subpatterns are not capturing subpatterns, and may not be repeated,
1273 because it makes no sense to assert the same thing several times. If any kind
1274 of assertion contains capturing subpatterns within it, these are counted for
1275 the purposes of numbering the capturing subpatterns in the whole pattern.
1276 However, substring capturing is carried out only for positive assertions,
1277 because it does not make sense for negative assertions.
1278 </P>
1279 <br><b>
1280 Lookahead assertions
1281 </b><br>
1282 <P>
1283 Lookahead assertions start with (?= for positive assertions and (?! for
1284 negative assertions. For example,
1285 <pre>
1286 \w+(?=;)
1287 </pre>
1288 matches a word followed by a semicolon, but does not include the semicolon in
1289 the match, and
1290 <pre>
1291 foo(?!bar)
1292 </pre>
1293 matches any occurrence of "foo" that is not followed by "bar". Note that the
1294 apparently similar pattern
1295 <pre>
1296 (?!foo)bar
1297 </pre>
1298 does not find an occurrence of "bar" that is preceded by something other than
1299 "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
1300 (?!foo) is always true when the next three characters are "bar". A
1301 lookbehind assertion is needed to achieve the other effect.
1302 </P>
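<P>
The three lookahead examples can be checked as follows. This is a sketch using
Python's <b>re</b> module, which is assumed to treat lookahead assertions as
PCRE does; the subject strings are invented for the purpose.
</P>

```python
import re

# The semicolon is required but is not part of the match.
word = re.search(r"\w+(?=;)", "count; total").group()

# Only the "foo" that is not followed by "bar" is found.
not_followed = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still finds "bar" immediately after "foo": at that point the
# next three characters are "bar", so they cannot be "foo".
still_found = re.search(r"(?!foo)bar", "foobar").start()
```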
1303 <P>
1304 If you want to force a matching failure at some point in a pattern, the most
1305 convenient way to do it is with (?!) because an empty string always matches, so
1306 an assertion that requires there not to be an empty string must always fail.
1307 <a name="lookbehind"></a></P>
1308 <br><b>
1309 Lookbehind assertions
1310 </b><br>
1311 <P>
1312 Lookbehind assertions start with (?&#60;= for positive assertions and (?&#60;! for
1313 negative assertions. For example,
1314 <pre>
1315 (?&#60;!foo)bar
1316 </pre>
1317 does find an occurrence of "bar" that is not preceded by "foo". The contents of
1318 a lookbehind assertion are restricted such that all the strings it matches must
1319 have a fixed length. However, if there are several top-level alternatives, they
1320 do not all have to have the same fixed length. Thus
1321 <pre>
1322 (?&#60;=bullock|donkey)
1323 </pre>
1324 is permitted, but
1325 <pre>
1326 (?&#60;!dogs?|cats?)
1327 </pre>
1328 causes an error at compile time. Branches that match different length strings
1329 are permitted only at the top level of a lookbehind assertion. This is an
1330 extension compared with Perl (at least for 5.8), which requires all branches to
1331 match the same length of string. An assertion such as
1332 <pre>
1333 (?&#60;=ab(c|de))
1334 </pre>
1335 is not permitted, because its single top-level branch can match two different
1336 lengths, but it is acceptable if rewritten to use two top-level branches:
1337 <pre>
1338 (?&#60;=abc|abde)
1339 </pre>
1340 The implementation of lookbehind assertions is, for each alternative, to
1341 temporarily move the current position back by the fixed width and then try to
1342 match. If there are insufficient characters before the current position, the
1343 match is deemed to fail.
1344 </P>
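<P>
A sketch of lookbehind assertions using Python's <b>re</b> module. Python is
stricter than PCRE here: it rejects (?&#60;=bullock|donkey) because the
alternatives differ in length. The second pattern therefore spells out the
per-alternative test described above as two separate assertions; the trailing
" cart" is an invented continuation.
</P>

```python
import re

# "bar" at offset 3 is preceded by "foo"; the other two are not.
hits = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar bar")]

# With fewer than three characters before "bar", the lookbehind cannot
# match, so the negative assertion succeeds.
at_start = re.search(r"(?<!foo)bar", "bar").start()

# One fixed-width lookbehind per alternative.
after_animal = re.compile(r"(?:(?<=bullock)|(?<=donkey)) cart")
animal_hits = [bool(after_animal.search(s))
               for s in ("bullock cart", "donkey cart", "horse cart")]
```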
1345 <P>
1346 PCRE does not allow the \C escape (which matches a single byte in UTF-8 mode)
1347 to appear in lookbehind assertions, because it makes it impossible to calculate
1348 the length of the lookbehind. The \X escape, which can match different numbers
1349 of bytes, is also not permitted.
1350 </P>
1351 <P>
1352 Atomic groups can be used in conjunction with lookbehind assertions to specify
1353 efficient matching at the end of the subject string. Consider a simple pattern
1354 such as
1355 <pre>
1356 abcd$
1357 </pre>
1358 when applied to a long string that does not match. Because matching proceeds
1359 from left to right, PCRE will look for each "a" in the subject and then see if
1360 what follows matches the rest of the pattern. If the pattern is specified as
1361 <pre>
1362 ^.*abcd$
1363 </pre>
1364 the initial .* matches the entire string at first, but when this fails (because
1365 there is no following "a"), it backtracks to match all but the last character,
1366 then all but the last two characters, and so on. Once again the search for "a"
1367 covers the entire string, from right to left, so we are no better off. However,
1368 if the pattern is written as
1369 <pre>
1370 ^(?&#62;.*)(?&#60;=abcd)
1371 </pre>
1372 or, equivalently, using the possessive quantifier syntax,
1373 <pre>
1374 ^.*+(?&#60;=abcd)
1375 </pre>
1376 there can be no backtracking for the .* item; it can match only the entire
1377 string. The subsequent lookbehind assertion does a single test on the last four
1378 characters. If it fails, the match fails immediately. For long strings, this
1379 approach makes a significant difference to the processing time.
1380 </P>
1381 <br><b>
1382 Using multiple assertions
1383 </b><br>
1384 <P>
1385 Several assertions (of any sort) may occur in succession. For example,
1386 <pre>
1387 (?&#60;=\d{3})(?&#60;!999)foo
1388 </pre>
1389 matches "foo" preceded by three digits that are not "999". Notice that each of
1390 the assertions is applied independently at the same point in the subject
1391 string. First there is a check that the previous three characters are all
1392 digits, and then there is a check that the same three characters are not "999".
1393 This pattern does <i>not</i> match "foo" preceded by six characters, the first
1394 three of which are digits and the last three of which are not "999". For example, it
1395 doesn't match "123abcfoo". A pattern to do that is
1396 <pre>
1397 (?&#60;=\d{3}...)(?&#60;!999)foo
1398 </pre>
1399 This time the first assertion looks at the preceding six characters, checking
1400 that the first three are digits, and then the second assertion checks that the
1401 preceding three characters are not "999".
1402 </P>
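<P>
The difference between the two patterns above can be checked with a short
script. This is only a sketch: it uses Python's <b>re</b> module rather than
PCRE itself, on the assumption that Python treats these fixed-length lookbehind
assertions the same way. The helper name and the sample subjects are invented
for the illustration.
</P>

```python
import re

# Both lookbehinds are tested at the same point: immediately before "foo".
same_three = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first lookbehind now spans six characters; the second still spans three.
six_chars = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, matches(same_three, subject), matches(six_chars, subject))
```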
1403 <P>
1404 Assertions can be nested in any combination. For example,
1405 <pre>
1406 (?&#60;=(?&#60;!foo)bar)baz
1407 </pre>
1408 matches an occurrence of "baz" that is preceded by "bar" which in turn is not
1409 preceded by "foo", while
1410 <pre>
1411 (?&#60;=\d{3}(?!999)...)foo
1412 </pre>
1413 is another pattern that matches "foo" preceded by three digits and any three
1414 characters that are not "999".
1415 <a name="conditions"></a></P>
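<P>
The nested forms above can be tried out in the same way. Again this is a sketch
that assumes Python's <b>re</b> module as a stand-in for PCRE; the variable
names and test subjects are made up for the example.
</P>

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
nested_behind = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999".
ahead_in_behind = re.compile(r"(?<=\d{3}(?!999)...)foo")

for subject in ("xbarbaz", "foobarbaz"):
    print(subject, nested_behind.search(subject) is not None)

for subject in ("123abcfoo", "123999foo"):
    print(subject, ahead_in_behind.search(subject) is not None)
```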
1416 <br><a name="SEC16" href="#TOC1">CONDITIONAL SUBPATTERNS</a><br>
1417 <P>
1418 It is possible to cause the matching process to obey a subpattern
1419 conditionally or to choose between two alternative subpatterns, depending on
1420 the result of an assertion, or whether a previous capturing subpattern matched
1421 or not. The two possible forms of conditional subpattern are
1422 <pre>
1423 (?(condition)yes-pattern)
1424 (?(condition)yes-pattern|no-pattern)
1425 </pre>
1426 If the condition is satisfied, the yes-pattern is used; otherwise the
1427 no-pattern (if present) is used. If there are more than two alternatives in the
1428 subpattern, a compile-time error occurs.
1429 </P>
1430 <P>
1431 There are three kinds of condition. If the text between the parentheses
1432 consists of a sequence of digits, or a sequence of alphanumeric characters and
1433 underscores, the condition is satisfied if the capturing subpattern of that
1434 number or name has previously matched. There is a possible ambiguity here,
1435 because subpattern names may consist entirely of digits. PCRE looks first for a
1436 named subpattern; if it cannot find one and the text consists entirely of
1437 digits, it looks for a subpattern of that number, which must be greater than
1438 zero. Using subpattern names that consist entirely of digits is not
1439 recommended.
1440 </P>
1441 <P>
1442 Consider the following pattern, which contains non-significant white space to
1443 make it more readable (assume the PCRE_EXTENDED option) and to divide it into
1444 three parts for ease of discussion:
1445 <pre>
1446 ( \( )? [^()]+ (?(1) \) )
1447 </pre>
1448 The first part matches an optional opening parenthesis, and if that
1449 character is present, sets it as the first captured substring. The second part
1450 matches one or more characters that are not parentheses. The third part is a
1451 conditional subpattern that tests whether the first set of parentheses matched
1452 or not. If they did, that is, if the subject started with an opening parenthesis,
1453 the condition is true, and so the yes-pattern is executed and a closing
1454 parenthesis is required. Otherwise, since no-pattern is not present, the
1455 subpattern matches nothing. In other words, this pattern matches a sequence of
1456 non-parentheses, optionally enclosed in parentheses. Rewriting it to use a
1457 named subpattern gives this:
1458 <pre>
1459 (?P&#60;OPEN&#62; \( )? [^()]+ (?(OPEN) \) )
1460 </pre>
1461 If the condition is the string (R), and there is no subpattern with the name R,
1462 the condition is satisfied if a recursive call to the pattern or subpattern has
1463 been made. At "top level", the condition is false. This is a PCRE extension.
1464 Recursive patterns are described in the next section.
1465 </P>
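<P>
The numbered and named forms of the parenthesis example above can be exercised
as follows. This sketch assumes Python's <b>re</b> module, which accepts the
same conditional syntax for group numbers and names; its VERBOSE flag is used
here in place of PCRE_EXTENDED. The (R) condition is not available in Python
and is not shown. The helper name is hypothetical.
</P>

```python
import re

# re.VERBOSE stands in for PCRE_EXTENDED: white space in the pattern is ignored.
numbered = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)
named = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def whole_match(pattern, subject):
    """Return True if the pattern matches the entire subject."""
    return pattern.fullmatch(subject) is not None


for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, whole_match(numbered, subject), whole_match(named, subject))
```

<P>
Matching against the whole subject makes the effect of the condition visible:
a closing parenthesis is accepted only when the opening one was captured.
</P>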
1466 <P>
1467 If the condition is neither a subpattern number or name nor (R), it must be an assertion.
1468 This may be a positive or negative lookahead or lookbehind assertion. Consider
1469 this pattern, again containing non-significant white space, and with the two
1470 alternatives on the second line:
1471 <pre>
1472 (?(?=[^a-z]*[a-z])
1473 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
1474 </pre>
1475 The condition is a positive lookahead assertion that matches an optional
1476 sequence of non-letters followed by a letter. In other words, it tests for the
1477 presence of at least one letter in the subject. If a letter is found, the
1478 subject is matched against the first alternative; otherwise it is matched
1479 against the second. This pattern matches strings in one of the two forms
1480 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
1481 <a name="comments"></a></P>
1482 <br><a name="SEC17" href="#TOC1">COMMENTS</a><br>
1483 <P>
1484 The sequence (?# marks the start of a comment that continues up to the next
1485 closing parenthesis. Nested parentheses are not permitted. The characters
1486 that make up a comment play no part in the pattern matching at all.
1487 </P>
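<P>
A small illustration of an inline comment, assuming Python's <b>re</b> module,
which accepts the same (?# syntax. The pattern and subjects are invented for
the example.
</P>

```python
import re

# The (?#...) groups are discarded when the pattern is compiled.
with_comments = re.compile(r"\d{4}(?#year)-\d{2}(?#month)")
without_comments = re.compile(r"\d{4}-\d{2}")

for subject in ("2006-06", "2006year-06month"):
    print(
        subject,
        with_comments.fullmatch(subject) is not None,
        without_comments.fullmatch(subject) is not None,
    )
```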
1488 <P>
1489 If the PCRE_EXTENDED option is set, an unescaped # character outside a
1490 character class introduces a comment that continues to immediately after the
1491 next newline in the pattern.
1492 <a name="recursion"></a></P>
1493 <br><a name="SEC18" href="#TOC1">RECURSIVE PATTERNS</a><br>
1494 <P>
1495 Consider the problem of matching a string in parentheses, allowing for
1496 unlimited nested parentheses. Without the use of recursion, the best that can
1497 be done is to use a pattern that matches up to some fixed depth of nesting. It
1498 is not possible to handle an arbitrary nesting depth. Perl provides a facility
1499 that allows regular expressions to recurse (amongst other things). It does this
1500 by interpolating Perl code in the expression at run time, and the code can
1501 refer to the expression itself. A Perl pattern to solve the parentheses problem
1502 can be created like this:
1503 <pre>
1504 $re = qr{\( (?: (?&#62;[^()]+) | (?p{$re}) )* \)}x;
1505 </pre>
1506 The (?p{...}) item interpolates Perl code at run time, and in this case refers
1507 recursively to the pattern in which it appears. Obviously, PCRE cannot support
1508 the interpolation of Perl code. Instead, it supports some special syntax for
1509 recursion of the entire pattern, and also for individual subpattern recursion.
1510 </P>
1511 <P>
1512 The special item that consists of (? followed by a number greater than zero and
1513 a closing parenthesis is a recursive call of the subpattern of the given
1514 number, provided that it occurs inside that subpattern. (If not, it is a
1515 "subroutine" call, which is described in the next section.) The special item
1516 (?R) is a recursive call of the entire regular expression.
1517 </P>
1518 <P>
1519 A recursive subpattern call is always treated as an atomic group. That is, once
1520 it has matched some of the subject string, it is never re-entered, even if
1521 it contains untried alternatives and there is a subsequent matching failure.
1522 </P>
1523 <P>
1524 This PCRE pattern solves the nested parentheses problem (assume the
1525 PCRE_EXTENDED option is set so that white space is ignored):
1526 <pre>
1527 \( ( (?&#62;[^()]+) | (?R) )* \)
1528 </pre>
1529 First it matches an opening parenthesis. Then it matches any number of
1530 substrings which can either be a sequence of non-parentheses, or a recursive
1531 match of the pattern itself (that is, a correctly parenthesized substring).
1532 Finally there is a closing parenthesis.
1533 </P>
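<P>
To make the structure of the recursive pattern above concrete, here is a
hand-written function that accepts the same strings when matching starts at a
given position. It is only a sketch in Python (whose <b>re</b> module has no
(?R) item), not a description of how PCRE implements recursion, and the
function name is hypothetical.
</P>

```python
def match_parens(subject, pos=0):
    """Match one balanced parenthesized group starting at pos.

    Returns the index just past the closing parenthesis, or -1 on failure.
    """
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1  # the leading \(
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            # Plays the role of (?R): match a nested group recursively.
            pos = match_parens(subject, pos)
            if pos == -1:
                return -1
        else:
            # Plays the role of (?>[^()]+): take the whole run, never give it back.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    # The trailing \) must be present.
    return pos + 1 if pos < len(subject) else -1


print(match_parens("(ab(cd)ef)"))
print(match_parens("(ab(cd)ef"))
```

<P>
Because each run of non-parentheses is consumed in a single step, an unbalanced
subject is rejected after one pass over the string.
</P>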
1534 <P>
1535 If this were part of a larger pattern, you would not want to recurse the entire
1536 pattern, so instead you could use this:
1537 <pre>
1538 ( \( ( (?&#62;[^()]+) | (?1) )* \) )
1539 </pre>
1540 We have put the pattern into parentheses, and caused the recursion to refer to
1541 them instead of the whole pattern. In a larger pattern, keeping track of
1542 parenthesis numbers can be tricky. It may be more convenient to use named
1543 parentheses instead. For this, PCRE uses (?P&#62;name), which is an extension to
1544 the Python syntax that PCRE uses for named parentheses (Perl does not provide
1545 named parentheses). We could rewrite the above example as follows:
1546 <pre>
1547 (?P&#60;pn&#62; \( ( (?&#62;[^()]+) | (?P&#62;pn) )* \) )
1548 </pre>
1549 This particular example pattern contains nested unlimited repeats, and so the
1550 use of atomic grouping for matching strings of non-parentheses is important
1551 when applying the pattern to strings that do not match. For example, when this
1552 pattern is applied to
1553 <pre>
1554 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1555 </pre>
1556 it yields "no match" quickly. However, if atomic grouping is not used,
1557 the match runs for a very long time indeed because there are so many different
1558 ways the + and * repeats can carve up the subject, and all have to be tested
1559 before failure can be reported.
1560 </P>
1561 <P>
1562 At the end of a match, the values set for any capturing subpatterns are those
1563 from the outermost level of the recursion at which the subpattern value is set.
1564 If you want to obtain intermediate values, a callout function can be used (see
1565 the next section and the
1566 <a href="pcrecallout.html"><b>pcrecallout</b></a>
1567 documentation). If the pattern above is matched against
1568 <pre>
1569 (ab(cd)ef)
1570 </pre>
1571 the value for the capturing parentheses is "ef", which is the last value taken
1572 on at the top level. If additional parentheses are added, giving
1573 <pre>
1574 \( ( ( (?&#62;[^()]+) | (?R) )* ) \)
1575    ^                        ^
1576    ^                        ^
1577 </pre>
1578 the string they capture is "ab(cd)ef", the contents of the top level
1579 parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
1580 has to obtain extra memory to store data during a recursion, which it does by
1581 using <b>pcre_malloc</b>, freeing it via <b>pcre_free</b> afterwards. If no
1582 memory can be obtained, the match fails with the PCRE_ERROR_NOMEMORY error.
1583 </P>
1584 <P>
1585 Do not confuse the (?R) item with the condition (R), which tests for recursion.
1586 Consider this pattern, which matches text in angle brackets, allowing for
1587 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
1588 recursing), whereas any characters are permitted at the outer level.
1589 <pre>
1590 &#60; (?: (?(R) \d++ | [^&#60;&#62;]*+) | (?R)) * &#62;
1591 </pre>
1592 In this pattern, (?(R) is the start of a conditional subpattern, with two
1593 different alternatives for the recursive and non-recursive cases. The (?R) item
1594 is the actual recursive call.
1595 <a name="subpatternsassubroutines"></a></P>
1596 <br><a name="SEC19" href="#TOC1">SUBPATTERNS AS SUBROUTINES</a><br>
1597 <P>
1598 If the syntax for a recursive subpattern reference (either by number or by
1599 name) is used outside the parentheses to which it refers, it operates like a
1600 subroutine in a programming language. An earlier example pointed out that the
1601 pattern
1602 <pre>
1603 (sens|respons)e and \1ibility
1604 </pre>
1605 matches "sense and sensibility" and "response and responsibility", but not
1606 "sense and responsibility". If instead the pattern
1607 <pre>
1608 (sens|respons)e and (?1)ibility
1609 </pre>
1610 is used, it does match "sense and responsibility" as well as the other two
1611 strings. Such references, if given numerically, must follow the subpattern to
1612 which they refer. However, named references can refer to later subpatterns.
1613 </P>
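<P>
The contrast between a back reference and a "subroutine" call can be imitated
in Python, whose <b>re</b> module supports back references but has no (?1)
item. In this sketch the subroutine call is approximated by repeating the text
of the subpattern, which is an assumption made for illustration: the repeated
copy is not an atomic group, although that makes no difference for this
particular pattern. The variable names are hypothetical.
</P>

```python
import re

alternatives = "sens|respons"

# \1 must match the same text that group 1 captured.
back_reference = re.compile(rf"({alternatives})e and \1ibility")

# Stand-in for (?1): the subpattern itself is applied again at this point.
subroutine_like = re.compile(rf"({alternatives})e and (?:{alternatives})ibility")

subjects = (
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
)
for subject in subjects:
    print(
        subject,
        back_reference.fullmatch(subject) is not None,
        subroutine_like.fullmatch(subject) is not None,
    )
```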
1614 <P>
1615 Like recursive subpatterns, a "subroutine" call is always treated as an atomic
1616 group. That is, once it has matched some of the subject string, it is never
1617 re-entered, even if it contains untried alternatives and there is a subsequent
1618 matching failure.
1619 </P>
1620 <br><a name="SEC20" href="#TOC1">CALLOUTS</a><br>
1621 <P>
1622 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
1623 code to be obeyed in the middle of matching a regular expression. This makes it
1624 possible, amongst other things, to extract different substrings that match the
1625 same pair of parentheses when there is a repetition.
1626 </P>
1627 <P>
1628 PCRE provides a similar feature, but of course it cannot obey arbitrary Perl
1629 code. The feature is called "callout". The caller of PCRE provides an external
1630 function by putting its entry point in the global variable <i>pcre_callout</i>.
1631 By default, this variable contains NULL, which disables all calling out.
1632 </P>
1633 <P>
1634 Within a regular expression, (?C) indicates the points at which the external
1635 function is to be called. If you want to identify different callout points, you
1636 can put a number less than 256 after the letter C. The default value is zero.
1637 For example, this pattern has two callout points:
1638 <pre>
1639 (?C1)\dabc(?C2)def
1640 </pre>
1641 If the PCRE_AUTO_CALLOUT flag is passed to <b>pcre_compile()</b>, callouts are
1642 automatically installed before each item in the pattern. They are all numbered
1643 255.
1644 </P>
1645 <P>
1646 During matching, when PCRE reaches a callout point (and <i>pcre_callout</i> is
1647 set), the external function is called. It is provided with the number of the
1648 callout, the position in the pattern, and, optionally, one item of data
1649 originally supplied by the caller of <b>pcre_exec()</b>. The callout function
1650 may cause matching to proceed, to backtrack, or to fail altogether. A complete
1651 description of the interface to the callout function is given in the
1652 <a href="pcrecallout.html"><b>pcrecallout</b></a>
1653 documentation.
1654 </P>
1655 <P>
1656 Last updated: 06 June 2006
1657 <br>
1658 Copyright &copy; 1997-2006 University of Cambridge.
1659 <p>
1660 Return to the <a href="index.html">PCRE index page</a>.
1661 </p>
