Contents of /code/trunk/doc/html/pcresyntax.html

Revision 1320 - (show annotations)
Wed May 1 16:39:35 2013 UTC (6 years, 3 months ago) by ph10
File MIME type: text/html
File size: 15632 byte(s)
Source tidies (trails spaces, html updates) for 8.33-RC1.
1 <html>
2 <head>
3 <title>pcresyntax specification</title>
4 </head>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcresyntax man page</h1>
7 <p>
8 Return to the <a href="index.html">PCRE index page</a>.
9 </p>
10 <p>
11 This page is part of the PCRE HTML documentation. It was generated automatically
12 from the original man page. If there is any nonsense in it, please consult the
13 man page, in case the conversion went wrong.
14 <br>
15 <ul>
16 <li><a name="TOC1" href="#SEC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a>
17 <li><a name="TOC2" href="#SEC2">QUOTING</a>
18 <li><a name="TOC3" href="#SEC3">CHARACTERS</a>
19 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21 <li><a name="TOC6" href="#SEC6">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22 <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23 <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24 <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25 <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26 <li><a name="TOC11" href="#SEC11">MATCH POINT RESET</a>
27 <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28 <li><a name="TOC13" href="#SEC13">CAPTURING</a>
29 <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30 <li><a name="TOC15" href="#SEC15">COMMENT</a>
31 <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32 <li><a name="TOC17" href="#SEC17">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
33 <li><a name="TOC18" href="#SEC18">BACKREFERENCES</a>
34 <li><a name="TOC19" href="#SEC19">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
35 <li><a name="TOC20" href="#SEC20">CONDITIONAL PATTERNS</a>
36 <li><a name="TOC21" href="#SEC21">BACKTRACKING CONTROL</a>
37 <li><a name="TOC22" href="#SEC22">NEWLINE CONVENTIONS</a>
38 <li><a name="TOC23" href="#SEC23">WHAT \R MATCHES</a>
39 <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40 <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41 <li><a name="TOC26" href="#SEC26">AUTHOR</a>
42 <li><a name="TOC27" href="#SEC27">REVISION</a>
43 </ul>
44 <br><a name="SEC1" href="#TOC1">PCRE REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
45 <P>
46 The full syntax and semantics of the regular expressions that are supported by
47 PCRE are described in the
48 <a href="pcrepattern.html"><b>pcrepattern</b></a>
49 documentation. This document contains a quick-reference summary of the syntax.
50 </P>
51 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
52 <P>
53 <pre>
54 \x where x is non-alphanumeric is a literal x
55 \Q...\E treat enclosed characters as literal
56 </PRE>
57 </P>
58 <br><a name="SEC3" href="#TOC1">CHARACTERS</a><br>
59 <P>
60 <pre>
61 \a alarm, that is, the BEL character (hex 07)
62 \cx "control-x", where x is any ASCII character
63 \e escape (hex 1B)
64 \f form feed (hex 0C)
65 \n newline (hex 0A)
66 \r carriage return (hex 0D)
67 \t tab (hex 09)
68 \ddd character with octal code ddd, or backreference
69 \xhh character with hex code hh
70 \x{hhh..} character with hex code hhh..
71 </PRE>
72 </P>
73 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
74 <P>
75 <pre>
76 . any character except newline;
77     in dotall mode, any character whatsoever
78 \C one data unit, even in UTF mode (best avoided)
79 \d a decimal digit
80 \D a character that is not a decimal digit
81 \h a horizontal white space character
82 \H a character that is not a horizontal white space character
83 \N a character that is not a newline
84 \p{<i>xx</i>} a character with the <i>xx</i> property
85 \P{<i>xx</i>} a character without the <i>xx</i> property
86 \R a newline sequence
87 \s a white space character
88 \S a character that is not a white space character
89 \v a vertical white space character
90 \V a character that is not a vertical white space character
91 \w a "word" character
92 \W a "non-word" character
93 \X a Unicode extended grapheme cluster
94 </pre>
95 In PCRE, by default, \d, \D, \s, \S, \w, and \W recognize only ASCII
96 characters, even in a UTF mode. However, this can be changed by setting the
97 PCRE_UCP option.
98 </P>
99 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
100 <P>
101 <pre>
102 C Other
103 Cc Control
104 Cf Format
105 Cn Unassigned
106 Co Private use
107 Cs Surrogate
108
109 L Letter
110 Ll Lower case letter
111 Lm Modifier letter
112 Lo Other letter
113 Lt Title case letter
114 Lu Upper case letter
115 L&amp; Ll, Lu, or Lt
116
117 M Mark
118 Mc Spacing mark
119 Me Enclosing mark
120 Mn Non-spacing mark
121
122 N Number
123 Nd Decimal number
124 Nl Letter number
125 No Other number
126
127 P Punctuation
128 Pc Connector punctuation
129 Pd Dash punctuation
130 Pe Close punctuation
131 Pf Final punctuation
132 Pi Initial punctuation
133 Po Other punctuation
134 Ps Open punctuation
135
136 S Symbol
137 Sc Currency symbol
138 Sk Modifier symbol
139 Sm Mathematical symbol
140 So Other symbol
141
142 Z Separator
143 Zl Line separator
144 Zp Paragraph separator
145 Zs Space separator
146 </PRE>
147 </P>
148 <br><a name="SEC6" href="#TOC1">PCRE SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
149 <P>
150 <pre>
151 Xan Alphanumeric: union of properties L and N
152 Xps POSIX space: property Z or tab, NL, VT, FF, CR
153 Xsp Perl space: property Z or tab, NL, FF, CR
154 Xuc Universally-named character: one that can be
155     represented by a Universal Character Name
156 Xwd Perl word: property Xan or underscore
157 </PRE>
158 </P>
159 <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
160 <P>
161 Arabic,
162 Armenian,
163 Avestan,
164 Balinese,
165 Bamum,
166 Batak,
167 Bengali,
168 Bopomofo,
169 Brahmi,
170 Braille,
171 Buginese,
172 Buhid,
173 Canadian_Aboriginal,
174 Carian,
175 Chakma,
176 Cham,
177 Cherokee,
178 Common,
179 Coptic,
180 Cuneiform,
181 Cypriot,
182 Cyrillic,
183 Deseret,
184 Devanagari,
185 Egyptian_Hieroglyphs,
186 Ethiopic,
187 Georgian,
188 Glagolitic,
189 Gothic,
190 Greek,
191 Gujarati,
192 Gurmukhi,
193 Han,
194 Hangul,
195 Hanunoo,
196 Hebrew,
197 Hiragana,
198 Imperial_Aramaic,
199 Inherited,
200 Inscriptional_Pahlavi,
201 Inscriptional_Parthian,
202 Javanese,
203 Kaithi,
204 Kannada,
205 Katakana,
206 Kayah_Li,
207 Kharoshthi,
208 Khmer,
209 Lao,
210 Latin,
211 Lepcha,
212 Limbu,
213 Linear_B,
214 Lisu,
215 Lycian,
216 Lydian,
217 Malayalam,
218 Mandaic,
219 Meetei_Mayek,
220 Meroitic_Cursive,
221 Meroitic_Hieroglyphs,
222 Miao,
223 Mongolian,
224 Myanmar,
225 New_Tai_Lue,
226 Nko,
227 Ogham,
228 Old_Italic,
229 Old_Persian,
230 Old_South_Arabian,
231 Old_Turkic,
232 Ol_Chiki,
233 Oriya,
234 Osmanya,
235 Phags_Pa,
236 Phoenician,
237 Rejang,
238 Runic,
239 Samaritan,
240 Saurashtra,
241 Sharada,
242 Shavian,
243 Sinhala,
244 Sora_Sompeng,
245 Sundanese,
246 Syloti_Nagri,
247 Syriac,
248 Tagalog,
249 Tagbanwa,
250 Tai_Le,
251 Tai_Tham,
252 Tai_Viet,
253 Takri,
254 Tamil,
255 Telugu,
256 Thaana,
257 Thai,
258 Tibetan,
259 Tifinagh,
260 Ugaritic,
261 Vai,
262 Yi.
263 </P>
264 <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
265 <P>
266 <pre>
267 [...] positive character class
268 [^...] negative character class
269 [x-y] range (can be used for hex characters)
270 [[:xxx:]] positive POSIX named set
271 [[:^xxx:]] negative POSIX named set
272
273 alnum alphanumeric
274 alpha alphabetic
275 ascii 0-127
276 blank space or tab
277 cntrl control character
278 digit decimal digit
279 graph printing, excluding space
280 lower lower case letter
281 print printing, including space
282 punct printing, excluding alphanumeric
283 space white space
284 upper upper case letter
285 word same as \w
286 xdigit hexadecimal digit
287 </pre>
288 In PCRE, POSIX character set names recognize only ASCII characters by default,
289 but some of them use Unicode properties if PCRE_UCP is set. You can use
290 \Q...\E inside a character class.
291 </P>
292 <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
293 <P>
294 <pre>
295 ? 0 or 1, greedy
296 ?+ 0 or 1, possessive
297 ?? 0 or 1, lazy
298 * 0 or more, greedy
299 *+ 0 or more, possessive
300 *? 0 or more, lazy
301 + 1 or more, greedy
302 ++ 1 or more, possessive
303 +? 1 or more, lazy
304 {n} exactly n
305 {n,m} at least n, no more than m, greedy
306 {n,m}+ at least n, no more than m, possessive
307 {n,m}? at least n, no more than m, lazy
308 {n,} n or more, greedy
309 {n,}+ n or more, possessive
310 {n,}? n or more, lazy
311 </PRE>
312 </P>
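<P>
The difference between the greedy and lazy forms above can be seen in a minimal
sketch. It uses Python's standard <b>re</b> module rather than PCRE itself; that
choice is an assumption made for convenience, since the two are separate engines
that happen to share this part of the syntax. The possessive forms are left out
because older Python versions do not accept them. All variable names are
illustrative.
</P>

```python
import re

subject = "<a><b>"

# Greedy: + takes as much as it can, then gives back only what is needed.
greedy = re.match(r"<.+>", subject).group(0)    # '<a><b>'

# Lazy: +? takes as little as it can, extending only when forced to.
lazy = re.match(r"<.+?>", subject).group(0)     # '<a>'

# Bounded repeats follow the same rule: {n,m} is greedy, {n,m}? is lazy.
bounded_greedy = re.match(r"\d{2,4}", "123456").group(0)   # '1234'
bounded_lazy = re.match(r"\d{2,4}?", "123456").group(0)    # '12'

print(greedy, lazy, bounded_greedy, bounded_lazy)
```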
313 <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
314 <P>
315 <pre>
316 \b word boundary
317 \B not a word boundary
318 ^ start of subject
319     also after internal newline in multiline mode
320 \A start of subject
321 $ end of subject
322     also before newline at end of subject
323     also before internal newline in multiline mode
324 \Z end of subject
325     also before newline at end of subject
326 \z end of subject
327 \G first matching position in subject
328 </PRE>
329 </P>
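<P>
As a rough illustration of ^, $ and \b from the table above, the following
sketch uses Python's standard <b>re</b> module as a stand-in for PCRE (an
assumption: the two engines agree on the cases shown here). \Z and \z are
deliberately not shown, because Python's \Z matches only at the very end of the
subject and so does not correspond to \Z as listed above. Variable names are
illustrative.
</P>

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)                    # ['one']

# In multiline mode, ^ also matches after an internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)    # ['one', 'two']

# $ matches before a newline that ends the subject...
end_before_final_newline = re.search(r"two$", subject) is not None     # True

# ...but, outside multiline mode, not before an internal newline.
end_before_internal_newline = re.search(r"one$", subject) is not None  # False

# \b requires a word boundary on each side, so "concat" is skipped.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")          # ['cat', 'cat']

print(starts_default, starts_multiline, whole_words)
```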
330 <br><a name="SEC11" href="#TOC1">MATCH POINT RESET</a><br>
331 <P>
332 <pre>
333 \K reset start of match
334 </PRE>
335 </P>
336 <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
337 <P>
338 <pre>
339 expr|expr|expr...
340 </PRE>
341 </P>
342 <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
343 <P>
344 <pre>
345 (...) capturing group
346 (?&#60;name&#62;...) named capturing group (Perl)
347 (?'name'...) named capturing group (Perl)
348 (?P&#60;name&#62;...) named capturing group (Python)
349 (?:...) non-capturing group
350 (?|...) non-capturing group; reset group numbers for
351     capturing groups in each alternative
352 </PRE>
353 </P>
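<P>
A small sketch of how capturing, named and non-capturing groups are numbered.
It uses Python's standard <b>re</b> module as an assumed stand-in for PCRE, so
only the Python-style (?P&#60;name&#62;...) form is shown; the Perl-style named
forms and (?|...) are not exercised here. The pattern and names are
illustrative.
</P>

```python
import re

# Group 1 is named "year", the month is matched but not captured,
# and the day becomes group 2.
match = re.match(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})", "2013-05-01")

year = match.group("year")    # '2013'
day = match.group(2)          # '01'
groups = match.groups()       # ('2013', '01')

print(year, day, groups)
```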
354 <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
355 <P>
356 <pre>
357 (?&#62;...) atomic, non-capturing group
358 </PRE>
359 </P>
360 <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
361 <P>
362 <pre>
363 (?#....) comment (not nestable)
364 </PRE>
365 </P>
366 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
367 <P>
368 <pre>
369 (?i) caseless
370 (?J) allow duplicate names
371 (?m) multiline
372 (?s) single line (dotall)
373 (?U) default ungreedy (lazy)
374 (?x) extended (ignore white space)
375 (?-...) unset option(s)
376 </pre>
377 The following are recognized only at the start of a pattern or after one of the
378 newline-setting options with similar syntax:
379 <pre>
380 (*LIMIT_MATCH=d) set the match limit to d (decimal number)
381 (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
382 (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
383 (*UTF8) set UTF-8 mode: 8-bit library (PCRE_UTF8)
384 (*UTF16) set UTF-16 mode: 16-bit library (PCRE_UTF16)
385 (*UTF32) set UTF-32 mode: 32-bit library (PCRE_UTF32)
386 (*UTF) set appropriate UTF mode for the library in use
387 (*UCP) set PCRE_UCP (use Unicode properties for \d etc)
388 </PRE>
389 </P>
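<P>
The single-letter options can be tried out with a short sketch. Python's
standard <b>re</b> module is used here as an assumed stand-in for PCRE: it
accepts (?i), (?s) and (?x) with the same meaning when they appear at the start
of the pattern, but it does not implement (?J), (?U) or any of the (*...)
settings listed above. Variable names are illustrative.
</P>

```python
import re

# (?i) at the start of the pattern makes the whole match caseless.
caseless = re.findall(r"(?i)pcre", "PCRE pcre Pcre")    # ['PCRE', 'pcre', 'Pcre']

# By default . does not match a newline; (?s) turns on dotall mode.
dot_default = re.match(r"a.b", "a\nb") is not None      # False
dot_dotall = re.match(r"(?s)a.b", "a\nb") is not None   # True

# (?x) ignores unescaped white space inside the pattern.
extended = re.match(r"(?x) \d{4} - \d{2}", "2013-05").group(0)   # '2013-05'

print(caseless, dot_default, dot_dotall, extended)
```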
390 <br><a name="SEC17" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
391 <P>
392 <pre>
393 (?=...) positive look ahead
394 (?!...) negative look ahead
395 (?&#60;=...) positive look behind
396 (?&#60;!...) negative look behind
397 </pre>
398 Each top-level branch of a look behind must be of a fixed length.
399 </P>
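<P>
The four assertions can be illustrated with a minimal sketch, again using
Python's standard <b>re</b> module as an assumed stand-in for PCRE. Python's
look behind has a similar fixed-length restriction, so the examples keep each
look behind to a single character. The sample text and variable names are
illustrative.
</P>

```python
import re

text = "price $42, qty 17, total $714"

# Positive look behind: digits immediately preceded by a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)          # ['42', '714']

# Negative look behind: digit runs not preceded by a dollar sign or a digit.
plain = re.findall(r"(?<![\$\d])\d+", text)        # ['17']

# Positive look ahead: words immediately followed by a comma.
before_comma = re.findall(r"\w+(?=,)", text)       # ['42', '17']

# Negative look ahead: whole words not followed by a comma.
not_before_comma = re.findall(r"\b\w+\b(?!,)", text)
# ['price', 'qty', 'total', '714']

print(dollars, plain, before_comma, not_before_comma)
```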
400 <br><a name="SEC18" href="#TOC1">BACKREFERENCES</a><br>
401 <P>
402 <pre>
403 \n reference by number (can be ambiguous)
404 \gn reference by number
405 \g{n} reference by number
406 \g{-n} relative reference by number
407 \k&#60;name&#62; reference by name (Perl)
408 \k'name' reference by name (Perl)
409 \g{name} reference by name (Perl)
410 \k{name} reference by name (.NET)
411 (?P=name) reference by name (Python)
412 </PRE>
413 </P>
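<P>
A short sketch of a numbered and a named backreference. It uses Python's
standard <b>re</b> module as an assumed stand-in for PCRE, so only the \n and
(?P=name) forms from the table are shown; the \g{...} and \k forms are not
accepted by Python in a pattern. The sample strings are illustrative.
</P>

```python
import re

# \1 must match exactly the text captured by group 1, which finds doubled words.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it is")   # ['is', 'it']

# (?P=q) refers back to the named group q, so the closing quote
# has to be the same character as the opening quote.
text = "say \"it's\" now"
quoted = re.search(r"""(?P<q>['"]).*?(?P=q)""", text).group(0)    # '"it\'s"'

print(doubled, quoted)
```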
414 <br><a name="SEC19" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
415 <P>
416 <pre>
417 (?R) recurse whole pattern
418 (?n) call subpattern by absolute number
419 (?+n) call subpattern by relative number
420 (?-n) call subpattern by relative number
421 (?&amp;name) call subpattern by name (Perl)
422 (?P&#62;name) call subpattern by name (Python)
423 \g&#60;name&#62; call subpattern by name (Oniguruma)
424 \g'name' call subpattern by name (Oniguruma)
425 \g&#60;n&#62; call subpattern by absolute number (Oniguruma)
426 \g'n' call subpattern by absolute number (Oniguruma)
427 \g&#60;+n&#62; call subpattern by relative number (PCRE extension)
428 \g'+n' call subpattern by relative number (PCRE extension)
429 \g&#60;-n&#62; call subpattern by relative number (PCRE extension)
430 \g'-n' call subpattern by relative number (PCRE extension)
431 </PRE>
432 </P>
433 <br><a name="SEC20" href="#TOC1">CONDITIONAL PATTERNS</a><br>
434 <P>
435 <pre>
436 (?(condition)yes-pattern)
437 (?(condition)yes-pattern|no-pattern)
438
439 (?(n)... absolute reference condition
440 (?(+n)... relative reference condition
441 (?(-n)... relative reference condition
442 (?(&#60;name&#62;)... named reference condition (Perl)
443 (?('name')... named reference condition (Perl)
444 (?(name)... named reference condition (PCRE)
445 (?(R)... overall recursion condition
446 (?(Rn)... specific group recursion condition
447 (?(R&amp;name)... specific recursion condition
448 (?(DEFINE)... define subpattern for reference
449 (?(assert)... assertion condition
450 </PRE>
451 </P>
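<P>
The absolute reference condition (?(n)...) can be sketched as follows, using
Python's standard <b>re</b> module as an assumed stand-in for PCRE. Python
supports only the numbered and named reference conditions; the relative,
recursion, DEFINE and assertion conditions listed above are not available
there. The pattern is an illustrative one: a word that may be wrapped in angle
brackets, where the closing bracket is required only if the opening one was
matched.
</P>

```python
import re

# Group 1 captures an optional "<". The condition (?(1)>) demands a ">"
# only when group 1 took part in the match; otherwise it matches nothing.
pattern = re.compile(r"^(<)?\w+(?(1)>)$")

subjects = ["abc", "<abc>", "<abc", "abc>"]
results = [pattern.match(s) is not None for s in subjects]
# [True, True, False, False]

print(dict(zip(subjects, results)))
```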
452 <br><a name="SEC21" href="#TOC1">BACKTRACKING CONTROL</a><br>
453 <P>
454 The following act immediately when they are reached:
455 <pre>
456 (*ACCEPT) force successful match
457 (*FAIL) force backtrack; synonym (*F)
458 (*MARK:NAME) set name to be passed back; synonym (*:NAME)
459 </pre>
460 The following act only when a subsequent match failure causes a backtrack to
461 reach them. They all force a match failure, but they differ in what happens
462 afterwards. Those that advance the start-of-match point do so only if the
463 pattern is not anchored.
464 <pre>
465 (*COMMIT) overall failure, no advance of starting point
466 (*PRUNE) advance to next starting character
467 (*PRUNE:NAME) equivalent to (*MARK:NAME)(*PRUNE)
468 (*SKIP) advance to current matching position
469 (*SKIP:NAME) advance to position corresponding to an earlier
470     (*MARK:NAME); if not found, the (*SKIP) is ignored
471 (*THEN) local failure, backtrack to next alternation
472 (*THEN:NAME) equivalent to (*MARK:NAME)(*THEN)
473 </PRE>
474 </P>
475 <br><a name="SEC22" href="#TOC1">NEWLINE CONVENTIONS</a><br>
476 <P>
477 These are recognized only at the very start of the pattern or after a
478 (*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
479 <pre>
480 (*CR) carriage return only
481 (*LF) linefeed only
482 (*CRLF) carriage return followed by linefeed
483 (*ANYCRLF) all three of the above
484 (*ANY) any Unicode newline sequence
485 </PRE>
486 </P>
487 <br><a name="SEC23" href="#TOC1">WHAT \R MATCHES</a><br>
488 <P>
489 These are recognized only at the very start of the pattern or after a
490 (*...) option that sets the newline convention or a UTF or UCP mode.
491 <pre>
492 (*BSR_ANYCRLF) CR, LF, or CRLF
493 (*BSR_UNICODE) any Unicode newline sequence
494 </PRE>
495 </P>
496 <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
497 <P>
498 <pre>
499 (?C) callout
500 (?Cn) callout with data n
501 </PRE>
502 </P>
503 <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
504 <P>
505 <b>pcrepattern</b>(3), <b>pcreapi</b>(3), <b>pcrecallout</b>(3),
506 <b>pcrematching</b>(3), <b>pcre</b>(3).
507 </P>
508 <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
509 <P>
510 Philip Hazel
511 <br>
512 University Computing Service
513 <br>
514 Cambridge CB2 3QH, England.
515 <br>
516 </P>
517 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
518 <P>
519 Last updated: 26 April 2013
520 <br>
521 Copyright &copy; 1997-2013 University of Cambridge.
522 <br>
523 <p>
524 Return to the <a href="index.html">PCRE index page</a>.
525 </p>
