Revision 903
Sat Jan 21 16:37:17 2012 UTC by ph10
Source file tidies for 8.30-RC1 release; fix Makefile.am bugs for building 
symbolic links to man pages.
<html>
<head>
<title>pcresyntax specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcresyntax man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
<li><a name="TOC2" href="#SEC2">QUOTING</a>
<li><a name="TOC3" href="#SEC3">CHARACTERS</a>
<li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
<li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
<li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
<li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
<li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
<li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
<li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
<li><a name="TOC12" href="#SEC12">ALTERNATION</a>
<li><a name="TOC13" href="#SEC13">CAPTURING</a>
<li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
<li><a name="TOC15" href="#SEC15">COMMENT</a>
<li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
<li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
<li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
<li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
<li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
<li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
<li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
<li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
<li><a name="TOC24" href="#SEC24">CALLOUTS</a>
<li><a name="TOC25" href="#SEC25">SEE ALSO</a>
<li><a name="TOC26" href="#SEC26">AUTHOR</a>
<li><a name="TOC27" href="#SEC27">REVISION</a>
</ul>
<br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
<P>
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
<a href="pcrepattern.html"><b>pcrepattern</b></a>
documentation. This document contains a quick-reference summary of the syntax.
</P>
<br><a name="SEC2" href="#TOC1">QUOTING</a><br>
<P>
<pre>
  \x         where x is non-alphanumeric is a literal x
  \Q...\E    treat enclosed characters as literal
</PRE>
</P>
<br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
<P>
<pre>
  \a         alarm, that is, the BEL character (hex 07)
  \cx        "control-x", where x is any ASCII character
  \e         escape (hex 1B)
  \f         formfeed (hex 0C)
  \n         newline (hex 0A)
  \r         carriage return (hex 0D)
  \t         tab (hex 09)
  \ddd       character with octal code ddd, or backreference
  \xhh       character with hex code hh
  \x{hhh..}  character with hex code hhh..
</PRE>
</P>
<br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
<P>
<pre>
  .          any character except newline;
               in dotall mode, any character whatsoever
  \C         one data unit, even in UTF mode (best avoided)
  \d         a decimal digit
  \D         a character that is not a decimal digit
  \h         a horizontal whitespace character
  \H         a character that is not a horizontal whitespace character
  \N         a character that is not a newline
  \p{<i>xx</i>}     a character with the <i>xx</i> property
  \P{<i>xx</i>}     a character without the <i>xx</i> property
  \R         a newline sequence
  \s         a whitespace character
  \S         a character that is not a whitespace character
  \v         a vertical whitespace character
  \V         a character that is not a vertical whitespace character
  \w         a "word" character
  \W         a "non-word" character
  \X         an extended Unicode sequence
</pre>
In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
</P>
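<P>
The ASCII-only default described above can be illustrated with a short sketch.
It uses Python's re module as a stand-in, which is an assumption of this
example and not PCRE itself: the re.ASCII flag restricts \d to ASCII digits,
roughly as PCRE does by default, while Python's Unicode default behaves roughly
like PCRE with PCRE_UCP set.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
ARABIC_INDIC_THREE = "\u0663"  # a Unicode decimal digit (category Nd), not ASCII

# ASCII-only matching, comparable to PCRE's default behaviour for \d.
ascii_only = re.compile(r"\d", re.ASCII)

# Unicode-aware matching, comparable to PCRE with PCRE_UCP set.
unicode_aware = re.compile(r"\d")

ascii_hit = ascii_only.fullmatch(ARABIC_INDIC_THREE) is not None
unicode_hit = unicode_aware.fullmatch(ARABIC_INDIC_THREE) is not None
print(ascii_hit, unicode_hit)
```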
<br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate

  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt

  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark

  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number

  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation

  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol

  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
</PRE>
</P>
<br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
<P>
<pre>
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, FF, CR
  Xwd        Perl word: property Xan or underscore
</PRE>
</P>
<br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
<P>
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
</P>
<br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
<P>
<pre>
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set

  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \w
  xdigit      hexadecimal digit
</pre>
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\Q...\E inside a character class.
</P>
<br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
<P>
<pre>
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
</PRE>
</P>
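<P>
The difference between greedy and lazy quantifiers is easiest to see on a small
subject string. The sketch below uses Python's re module as a stand-in (an
assumption of this example, not PCRE itself); the greedy and lazy forms shown
behave the same way in both. Possessive forms are omitted because older
versions of Python's re do not accept them.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
subject = "<a><b>"

# Greedy: .+ takes as much as possible, then backs off to the last ">".
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as possible, stopping at the first ">".
lazy = re.match(r"<.+?>", subject).group()

# Bounded repeat: {2,3} matches at least 2 and at most 3 digits, greedily.
bounded = re.match(r"\d{2,3}", "12345").group()

print(greedy, lazy, bounded)
```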
<br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
<P>
<pre>
  \b          word boundary
  \B          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \A          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \Z          end of subject
               also before newline at end of subject
  \z          end of subject
  \G          first matching position in subject
</PRE>
</P>
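<P>
A few of the anchors above can be sketched as follows, again with Python's re
module as a stand-in (an assumption of this example, not PCRE itself). Only ^,
$ and \b are shown; \Z and \z are left out because Python's re treats \Z
differently from PCRE.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
default_starts = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
multiline_starts = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline that is the last character of the subject.
dollar_before_final_newline = re.search(r"two$", subject) is not None

# \b matches at a word boundary, so "cat" inside "concat" is not found.
whole_words = re.findall(r"\bcat\b", "cat concat cat")
```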
<br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
<P>
<pre>
  \K          reset start of match
</PRE>
</P>
<br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
<P>
<pre>
  expr|expr|expr...
</PRE>
</P>
<br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
<P>
<pre>
  (...)           capturing group
  (?&#60;name&#62;...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P&#60;name&#62;...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
</PRE>
</P>
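<P>
As a minimal sketch of the grouping forms above, the following uses the
Python-style named group and a non-capturing group. Python's re module is used
as a stand-in (an assumption of this example, not PCRE itself), and the
key/value subject string is hypothetical.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
# (?P<key>...) is a named capturing group, (?:...) groups without capturing,
# and the final (...) is an ordinary numbered capturing group.
m = re.fullmatch(r"(?P<key>\w+)(?:=|:)(\w+)", "size=42")

key = m.group("key")         # the named group, by name
key_by_number = m.group(1)   # named groups are numbered as well
value = m.group(2)           # the non-capturing group took no number
group_count = len(m.groups())
```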
<br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
<P>
<pre>
  (?&#62;...)         atomic, non-capturing group
</PRE>
</P>
<br><a name="SEC15" href="#TOC1">COMMENT</a><br>
<P>
<pre>
  (?#....)        comment (not nestable)
</PRE>
</P>
<br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
<P>
<pre>
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
</pre>
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
<pre>
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UCP)          set PCRE_UCP (use Unicode properties for \d etc)
</PRE>
</P>
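<P>
The inline options (?i), (?x) and (?s) can be sketched as below. Python's re
module is used as a stand-in (an assumption of this example, not PCRE itself).
The options are placed at the start of each pattern, which is where Python's re
expects them; (?J), (?U) and the (*...) items are PCRE-specific and are not
shown.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
# (?i) caseless and (?x) extended, combined at the start of the pattern.
relaxed = re.compile(r"(?ix) hello \s+ world  # white space and this comment are ignored")
caseless_extended = relaxed.fullmatch("HELLO   World") is not None

# (?s) dotall: "." also matches a newline.
dot_default = re.fullmatch(r"a.b", "a\nb") is not None
dot_all = re.fullmatch(r"(?s)a.b", "a\nb") is not None
```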
<br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
<P>
<pre>
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?&#60;=...)        positive look behind
  (?&#60;!...)        negative look behind
</pre>
Each top-level branch of a look behind must be of a fixed length.
</P>
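<P>
The four assertions, and the fixed-length rule for look behind, can be sketched
as follows. Python's re module is used as a stand-in (an assumption of this
example, not PCRE itself); it happens to enforce a similar fixed-width
restriction on look behind, which the last step relies on. The subject string
is hypothetical.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
subject = "cost $42, qty 17, fee $5"

# Positive look behind: digits preceded by "$"; the "$" is not part of the match.
prices = re.findall(r"(?<=\$)\d+", subject)

# Negative look behind: digit runs not preceded by "$" or by another digit.
non_prices = re.findall(r"(?<![$\d])\d+", subject)

# Positive look ahead: a word followed by " $", without consuming it.
labels = re.findall(r"\w+(?= \$)", subject)

# Negative look ahead: "q" not followed by "u".
q_without_u = re.search(r"q(?!u)", "qty") is not None

# A look behind of variable length (a+) is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True
```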
<br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
<P>
<pre>
  \n              reference by number (can be ambiguous)
  \gn             reference by number
  \g{n}           reference by number
  \g{-n}          relative reference by number
  \k&#60;name&#62;        reference by name (Perl)
  \k'name'        reference by name (Perl)
  \g{name}        reference by name (Perl)
  \k{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
</PRE>
</P>
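<P>
A backreference matches the same text that the referenced group matched, not
the group's pattern. The sketch below shows the numbered form and the
Python-style named form, with Python's re module as a stand-in (an assumption
of this example, not PCRE itself); the \g and \k forms are not accepted by
Python's re and are not shown.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
# \1 refers back to whatever group 1 matched: here, a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_word = doubled.group(1)

# (?P=quote) is the Python-style reference by name: the closing quote
# must be the same character as the opening one.
quoted = re.compile(r"(?P<quote>['\"])(.*?)(?P=quote)")
matched = quoted.search("say 'hi\" there' now")
mismatched = quoted.search("say 'hi\" now")
```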
<br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
<P>
<pre>
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P&#62;name)       call subpattern by name (Python)
  \g&#60;name&#62;        call subpattern by name (Oniguruma)
  \g'name'        call subpattern by name (Oniguruma)
  \g&#60;n&#62;           call subpattern by absolute number (Oniguruma)
  \g'n'           call subpattern by absolute number (Oniguruma)
  \g&#60;+n&#62;          call subpattern by relative number (PCRE extension)
  \g'+n'          call subpattern by relative number (PCRE extension)
  \g&#60;-n&#62;          call subpattern by relative number (PCRE extension)
  \g'-n'          call subpattern by relative number (PCRE extension)
</PRE>
</P>
<br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
<P>
<pre>
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)

  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(&#60;name&#62;)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
</PRE>
</P>
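<P>
The absolute reference condition and the plain named condition can be sketched
as below. Python's re module is used as a stand-in (an assumption of this
example, not PCRE itself); it accepts only the (?(n)...) and (?(name)...)
forms, so the recursion, DEFINE and assertion conditions are not shown.
</P>

```python
import re

# Assumption: Python's re module stands in for PCRE here; it is not PCRE.
# If group 1 (an opening "<") took part in the match, require a closing ">".
bracketed = re.compile(r"(<)?\w+(?(1)>)")

# Yes/no form by name: after a "-" sign, exactly two digits; otherwise three.
signed = re.compile(r"(?P<sign>-)?(?(sign)\d{2}|\d{3})")

bracket_results = [bracketed.fullmatch(s) is not None
                   for s in ("<abc>", "abc", "<abc", "abc>")]
signed_results = [signed.fullmatch(s) is not None
                  for s in ("-12", "123", "-123", "12")]
```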
<br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
<P>
The following act as soon as they are reached:
<pre>
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
</pre>
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
<pre>
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
</PRE>
</P>
<br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16) or (*UCP) option.
<pre>
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
<P>
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
<pre>
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
</PRE>
</P>
<br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
<P>
<pre>
  (?C)      callout
  (?Cn)     callout with data n
</PRE>
</P>
<br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
<P>
<b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
<b>pcrematching</b>(3), <b>pcre</b>(3).
</P>
<br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
<P>
Philip Hazel
<br>
University Computing Service
<br>
Cambridge CB2 3QH, England.
<br>
</P>
<br><a name="SEC27" href="#TOC1">REVISION</a><br>
<P>
Last updated: 10 January 2012
<br>
Copyright &copy; 1997-2012 University of Cambridge.
<br>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
