Contents of /code/trunk/doc/html/pcresyntax.html

Revision 1194
Wed Oct 31 17:42:29 2012 UTC by ph10
File MIME type: text/html
File size: 15320 byte(s)
More documentation updates
<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         form feed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</PRE>
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal white space character
  \H         a character that is not a horizontal white space character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a white space character
  \S         a character that is not a white space character
  \v         a vertical white space character
  \V         a character that is not a vertical white space character
  \w         a "word" character
  \W         a "non-word" character
  \X         a Unicode extended grapheme cluster
</pre>
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
</P>
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
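<P>
The difference between greedy and lazy quantifiers can be illustrated with a
small sketch. It uses Python's <b>re</b> module rather than PCRE itself, on the
assumption that the two engines agree for these simple cases; the subject
strings and variable names are made up for the example. Possessive forms are
left out because older Python versions do not accept them.
</P>

```python
import re

subject = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as it can, so the match runs to the last '>'.
greedy = re.search(r"<.*>", subject).group(0)

# Lazy: .*? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.*?>", subject).group(0)

# Bounded repeats follow the same rule: {2,3} prefers three, {2,3}? prefers two.
bounded_greedy = re.search(r"\d{2,3}", "12345").group(0)
bounded_lazy = re.search(r"\d{2,3}?", "12345").group(0)

print(greedy)
print(lazy)
print(bounded_greedy, bounded_lazy)
```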
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
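<P>
As a rough illustration of how multiline mode changes ^ and $, the sketch below
uses Python's <b>re</b> module as a stand-in for PCRE. This is an assumption
that holds only for the cases shown: Python's \Z, for example, does not behave
like the \Z listed above, so it is deliberately not used. The subject strings
are invented for the example.
</P>

```python
import re

subject = "one\ntwo\n"

# $ matches at the very end of the subject and before a final newline.
end_default = re.search(r"two$", subject) is not None

# Without multiline mode, $ does not match before an internal newline.
internal_default = re.search(r"one$", subject) is not None

# (?m) turns on multiline mode, so $ also matches before internal newlines.
internal_multiline = re.search(r"(?m)one$", subject) is not None

# ^ matches after an internal newline only in multiline mode.
caret_default = re.search(r"^two", subject) is not None
caret_multiline = re.search(r"(?m)^two", subject) is not None

# \b matches only where a word character meets a non-word character or an end.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

print(end_default, internal_default, internal_multiline)
print(caret_default, caret_multiline)
print(whole_words)
```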
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
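<P>
A short sketch of named and non-capturing groups follows. It is written for
Python's <b>re</b> module, so only the Python-style named form from the table
is used; the Perl-style forms and the branch-reset group are not assumed to
work there. The date string and group names are illustrative only.
</P>

```python
import re

# Two named capturing groups, then a non-capturing group for the day.
date = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})", "2012-08-25")

year = date.group("year")    # access by name
month = date.group(2)        # named groups are still numbered
captured = date.groups()     # the (?:...) group contributes nothing here

print(year, month, captured)
```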
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
</PRE>
</P>
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
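<P>
The four assertions can be compared side by side in a small sketch. It uses
Python's <b>re</b> module, assuming it agrees with PCRE for these
single-character, fixed-length cases; the subject string is invented. Each
assertion checks the surrounding text without including it in the match.
</P>

```python
import re

subject = "$40 40kg $100 7"

# Positive look behind: digits that come directly after a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", subject)

# Positive look ahead: digits that are directly followed by "kg".
before_kg = re.findall(r"\d+(?=kg)", subject)

# Negative look ahead: whole numbers not followed by "kg". The \d alternative
# stops the engine from backtracking to a shorter run of digits.
not_before_kg = re.findall(r"\d+(?!\d|kg)", subject)

# Negative look behind: whole numbers not preceded by a dollar sign.
not_after_dollar = re.findall(r"(?<!\$)\b\d+", subject)

print(after_dollar, before_kg, not_before_kg, not_after_dollar)
```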
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
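<P>
As an illustration, the sketch below uses a numbered and a named backreference
to find a doubled word and a pair of matching quotes. It relies on Python's
<b>re</b> module, which accepts only the \n and (?P=name) forms from the table;
the \g and \k forms are not assumed to be available there. The sample strings
and the group name <b>q</b> are made up.
</P>

```python
import re

# \1 must match the same text that group 1 captured: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=q) must match the same quote character that opened the string.
pattern = r"""(?P<q>['"]).*?(?P=q)"""
quoted = [m.group(0) for m in re.finditer(pattern, "say \"hi\" or 'bye'")]

print(doubled)
print(quoted)
```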
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
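<P>
A minimal sketch of an absolute reference condition: a closing parenthesis is
required only if an opening one was captured. It uses Python's <b>re</b>
module, which supports the (?(n)...) form; the recursion, DEFINE and assertion
conditions listed above are not assumed to exist there. The pattern and inputs
are illustrative.
</P>

```python
import re

# Group 1 optionally captures "(". The condition (?(1)\)) then demands ")"
# only when group 1 took part in the match; otherwise it matches nothing.
pattern = re.compile(r"^(\()?\d+(?(1)\))$")

inputs = ["42", "(42)", "(42", "42)"]
results = {text: pattern.search(text) is not None for text in inputs}

for text, matched in results.items():
    print(text, matched)
```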
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 25 August 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
