Contents of /code/trunk/doc/html/pcresyntax.html

Revision 1404 - (show annotations)
Tue Nov 19 15:36:57 2013 UTC by ph10
Source tidies for 8.34-RC1.
<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \0dd       character with octal code 0dd
  \ddd       character with octal code ddd, or backreference
  \o{ddd..}  character with octal code ddd..
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</pre>
Note that \0dd is always an octal code, and that \8 and \9 are the literal
characters "8" and "9".
</P>
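<P>
As a rough illustration of the quoting and character escapes above, here is a
minimal sketch using Python's <b>re</b> module. This is an assumption made for
demonstration only: <b>re</b> is not PCRE and lacks several of the escapes
listed (for example \e, \cx, and \o{...}), but the ones used here behave the
same way in both.
</P>

```python
import re

# Assumption: Python's re module is used as a stand-in for PCRE.
# \x41 is hex for "A", \t is a tab, and \012 is octal for newline (hex 0A).
escaped = re.fullmatch(r"\x41\t\012", "A\t\n")

# A backslash before a non-alphanumeric character makes it a literal.
literal = re.fullmatch(r"1\+1\.0", "1+1.0")

# Without the backslashes, "+" is a quantifier and "." matches any character,
# so the same subject no longer matches.
unquoted = re.fullmatch(r"1+1.0", "1+1.0")

print(escaped is not None, literal is not None, unquoted is not None)
```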
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \s and \w may also match characters with code points in the range
128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
is changed to use Unicode properties and they match many more characters.
</P>
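<P>
The difference between ASCII-only and Unicode-property matching for \d can be
sketched with Python's <b>re</b> module. This is an analogy, not PCRE itself:
it assumes that <b>re.ASCII</b> approximates PCRE's default behaviour and that
Python's default for string patterns approximates what PCRE does when PCRE_UCP
is set.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE. re.ASCII approximates
# PCRE's default; Python's default str behaviour approximates PCRE_UCP.
# The subject contains the ARABIC-INDIC digits one, two, three.
subject = "abc \u0661\u0662\u0663 456"

ascii_digits = re.findall(r"\d+", subject, flags=re.ASCII)
unicode_digits = re.findall(r"\d+", subject)

print(ascii_digits)    # only the ASCII run
print(unicode_digits)  # both digit runs
```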
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
</pre>
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
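<P>
The greedy and lazy forms in the table above can be compared with a small
sketch in Python's <b>re</b> module, used here as an assumed stand-in for PCRE.
Possessive quantifiers are left out because <b>re</b> only accepts them from
Python 3.11 onwards.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE for these quantifiers.
subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .+ takes as much as possible, so the match runs to the last ">".
greedy = re.search(r"<.+>", subject).group()

# Lazy: .+? takes as little as possible, so the match stops at the first ">".
lazy = re.search(r"<.+?>", subject).group()

# {n,m} is greedy by default; {n,m}? prefers the minimum count.
bounded_greedy = re.findall(r"\d{2,3}", "1 22 333 4444")
bounded_lazy = re.findall(r"\d{2,3}?", "4444")

print(greedy)
print(lazy)
print(bounded_greedy, bounded_lazy)
```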
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
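<P>
The effect of multiline mode on ^ and $ can be sketched with Python's
<b>re</b> module, used here as an assumed stand-in for PCRE. \Z and \z are
deliberately left out: Python's \Z matches only at the very end of the subject,
which corresponds to PCRE's \z rather than its \Z.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE for ^, $, and \b.
subject = "first line\nsecond line\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)
# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", subject, flags=re.MULTILINE)

# $ matches before the newline at the end of the subject ...
ends_default = re.findall(r"\w+$", subject)
# ... and, in multiline mode, before internal newlines as well.
ends_multiline = re.findall(r"\w+$", subject, flags=re.MULTILINE)

# \b requires a word boundary, so "line" inside "inline" is skipped.
whole_words = re.findall(r"\bline\b", "line inline line")

print(starts_default, starts_multiline)
print(ends_default, ends_multiline)
print(len(whole_words))
```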
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
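<P>
A minimal sketch of numbered, named, and non-capturing groups follows, using
Python's <b>re</b> module as an assumed stand-in for PCRE. Only the
Python-style named group syntax is shown, because <b>re</b> does not accept
the Perl forms or the (?|...) group.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE for these group forms.
# Group 1 is named "year", group 2 is unnamed, and (?:...) does not capture.
pattern = re.compile(r"(?P<year>\d{4})-(\d{2})-(?:\d{2})")
match = pattern.fullmatch("2013-11-19")

captured = match.groups()        # all capturing groups, in order
year = match.group("year")       # access by name
group_count = pattern.groups     # number of capturing groups in the pattern

print(captured, year, group_count)
```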
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \d etc)
</pre>
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
</P>
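<P>
The inline letter options can be tried out with Python's <b>re</b> module,
which is used here as an assumed stand-in for PCRE. It accepts (?i), (?m),
(?s), and (?x) at the start of a pattern, but it has no equivalent of (?J),
(?U), or the (*...) items, so those are not shown.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE for these inline options.
# (?i) makes the match caseless.
caseless = re.fullmatch(r"(?i)pcre", "PCRE")

# (?s) is dotall mode: "." also matches a newline.
dotall = re.fullmatch(r"(?s)a.b", "a\nb")
no_dotall = re.fullmatch(r"a.b", "a\nb")

# (?x) is extended mode: white space in the pattern is ignored and
# "#" starts a comment that runs to the end of the line.
extended = re.fullmatch(r"(?x) \d+ \s* kg   # a weight", "12 kg")

print(caseless is not None, dotall is not None,
      no_dotall is not None, extended is not None)
```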
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
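<P>
The four assertion forms can be sketched with Python's <b>re</b> module, used
here as an assumed stand-in for PCRE. The subject string is made up for the
example, and every look behind below has a fixed length.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE for these assertions.
subject = "cost: $100, weight: 75kg, count: 3, refund: $25"

# Positive look behind: digits that directly follow a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", subject)

# Positive look ahead: digits that are directly followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", subject)

# Negative look behind: whole numbers not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", subject)

# Negative look ahead: digit runs not followed by another digit or "kg".
not_before_kg = re.findall(r"\d+(?!\d|kg)", subject)

print(after_dollar, before_kg, not_after_dollar, not_before_kg)
```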
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
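<P>
Two of the backreference forms, \n and (?P=name), can be sketched with
Python's <b>re</b> module, used here as an assumed stand-in for PCRE. The
\g and \k forms are not accepted inside <b>re</b> patterns, so they are not
shown.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE for these references.
# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was a typo").group()

# (?P=quote) refers back to the named group, so the closing quote
# has to be the same character as the opening one.
subject = "say \"hi\" or 'bye'"
quoted = [m.group() for m in
          re.finditer(r"(?P<quote>['\"]).*?(?P=quote)", subject)]

print(doubled)
print(quoted)
```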
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
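<P>
The absolute reference condition (?(n)...) can be sketched with Python's
<b>re</b> module, used here as an assumed stand-in for PCRE. Only the
group-number form is shown; <b>re</b> does not support the recursion, DEFINE,
or assertion conditions.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE for this condition.
# Group 1 optionally matches "(". The condition (?(1)\)) then requires
# a closing ")" only if group 1 took part in the match.
pattern = re.compile(r"(\()?\d+(?(1)\))")

results = {text: pattern.fullmatch(text) is not None
           for text in ["42", "(42)", "(42", "42)"]}

print(results)
```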
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 12 November 2013
<br>
Copyright &copy; 1997-2013 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
