.\" Revision 1364, Sat Oct 5 15:45:11 2013 UTC, by ph10:
.\" Add VT to the set of characters recognized as white space.
.TH PCRESYNTAX 3 "05 October 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains a quick-reference summary of the syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea          alarm, that is, the BEL character (hex 07)
  \ecx         "control-x", where x is any ASCII character
  \ee          escape (hex 1B)
  \ef          form feed (hex 0C)
  \en          newline (hex 0A)
  \er          carriage return (hex 0D)
  \et          tab (hex 09)
  \eddd        character with octal code ddd, or backreference
  \exhh        character with hex code hh
  \ex{hhh..}  character with hex code hhh..
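.sp
For illustration only, the sketch below exercises a few of these escapes with
Python's re module rather than PCRE itself (an assumption made so that the
example is self-contained). Python accepts \et, \exhh and three-digit octal
codes, but not \ecx or \ex{hhh..}. The variable names are invented for the
example.
.sp
.nf
```python
import re

# Hex escape, three-digit octal escape, and tab, all written as regex escapes.
hex_a = re.fullmatch(r"\x41", "A")
octal_a = re.fullmatch(r"\101", "A")   # octal 101 is decimal 65, the letter A
tab = re.fullmatch(r"a\tb", "a\tb")

# A single digit after a capturing group is read as a backreference,
# which is the ambiguity the \ddd entry warns about.
doubled = re.fullmatch(r"(.)\1", "zz")

print(bool(hex_a), bool(octal_a), bool(tab), bool(doubled))
```
.fi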
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one data unit, even in UTF mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal white space character
  \eH         a character that is not a horizontal white space character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a white space character
  \eS         a character that is not a white space character
  \ev         a vertical white space character
  \eV         a character that is not a vertical white space character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         a Unicode extended grapheme cluster
.sp
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
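.sp
The difference between ASCII-only and Unicode-aware \ed can be sketched with
Python's re module, used here as a stand-in for PCRE (an assumption of this
example, not part of PCRE). Python's default for text patterns is the opposite
of PCRE's: its re.ASCII flag gives the ASCII-only behaviour, and omitting the
flag is roughly comparable to setting PCRE_UCP.
.sp
.nf
```python
import re

arabic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# With re.ASCII, \d matches only 0-9, similar in spirit to PCRE's default.
ascii_only = re.fullmatch(r"\d", arabic_three, re.ASCII)

# Without re.ASCII, \d uses Unicode properties, roughly what PCRE_UCP selects.
unicode_aware = re.fullmatch(r"\d", arabic_three)

print(ascii_only is None, unicode_aware is not None)
```
.fi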
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
.sp
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
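.sp
A minimal sketch of positive classes, negative classes and ranges written with
hex codes follows. It uses Python's re module as a stand-in (an assumption of
this example); that module does not implement the POSIX named sets or \eQ...\eE,
so only the first three forms are shown. The subject string is invented.
.sp
.nf
```python
import re

subject = "ab12cd345"

# A range whose end points are given as hex character codes for "0" and "9".
digits = re.findall(r"[\x30-\x39]+", subject)

# A negative class: runs of characters that are not in 0-9.
non_digits = re.findall(r"[^0-9]+", subject)

print(digits, non_digits)
```
.fi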
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
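.sp
As an illustration of the greedy and lazy forms, the following sketch uses
Python's re module instead of PCRE (an assumption made only so that the example
is self-contained). The subject strings are invented, and the possessive forms
are left out because older Python releases reject them.
.sp
.nf
```python
import re

subject = "<a><b>"

greedy = re.match(r"<.+>", subject).group()    # takes as much as it can
lazy = re.match(r"<.+?>", subject).group()     # takes as little as it can

bounded = re.match(r"a{2,3}", "aaaa").group()        # prefers the upper bound
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()  # prefers the lower bound

print(greedy, lazy, bounded, bounded_lazy)
```
.fi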
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary
  \eB          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \eZ          end of subject
               also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
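.sp
The effect of multiline mode on ^ and $ can be sketched as follows, using
Python's re module as a stand-in for PCRE (an assumption of this example). Note
that Python spells some anchors differently: its \eZ behaves like \ez in this
table, so \eZ and \ez are deliberately not used here.
.sp
.nf
```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches at the end of the subject and before a newline that ends it.
end_before_newline = re.search(r"two$", subject)

# \A matches only at the start of the subject, even in multiline mode.
only_first = re.findall(r"\A\w+", subject, re.MULTILINE)

print(starts_default, starts_multiline, only_first)
```
.fi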
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
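.sp
A short sketch of numbered, named and non-capturing groups follows. It is
written for Python's re module (an assumption of this example), which accepts
only the (?P<name>...) spelling of a named group and has no (?|...) form. The
group names and the subject string are invented.
.sp
.nf
```python
import re

date = re.fullmatch(r"(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})", "2013-10-05")

year = date.group("year")      # a named group, fetched by name
month = date.group(2)          # named groups are numbered as well
group_count = date.re.groups   # the (?:...) group is not counted

print(year, month, group_count)
```
.fi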
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
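.sp
Three of the inline options above can be sketched with Python's re module,
used here as a stand-in for PCRE (an assumption of this example). That module
has no (?J) or (?U), so only (?i), (?s) and (?x) are shown, each placed at the
start of the pattern. The subject strings are invented.
.sp
.nf
```python
import re

# (?i): letters match regardless of case.
caseless = re.fullmatch(r"(?i)pcre", "PCRE")

# (?s): the dot also matches a newline; without it, the same pattern fails.
dotall = re.fullmatch(r"(?s)a.b", "a\nb")
no_dotall = re.fullmatch(r"a.b", "a\nb")

# (?x): white space in the pattern is ignored, so it can be laid out freely.
extended = re.fullmatch(r"(?x) \d+ \s* - \s* \d+ ", "10 - 20")

print(bool(caseless), bool(dotall), bool(no_dotall), bool(extended))
```
.fi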
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*LIMIT_MATCH=d) set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d) set the recursion limit to d (decimal number)
  (*NO_START_OPT) no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)         set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)        set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)        set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)          set appropriate UTF mode for the library in use
  (*UCP)          set PCRE_UCP (use Unicode properties for \ed etc)
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
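.sp
The four assertions and the fixed-length rule can be sketched as follows. The
example uses Python's re module rather than PCRE (an assumption of this
example); Python applies a similar fixed-width rule to look behind and reports
a violation as re.error. The subject strings are invented.
.sp
.nf
```python
import re

subject = "cost $45, discount 30"

before_comma = re.findall(r"\w+(?=,)", subject)           # positive look ahead
foo_not_bar = re.search(r"foo(?!bar)", "foobar foobaz")   # negative look ahead
after_dollar = re.findall(r"(?<=\$)\d+", subject)         # positive look behind
not_after_dollar = re.findall(r"\b(?<!\$)\d+", subject)   # negative look behind

# A look behind whose length is not fixed is rejected when it is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True

print(before_comma, foo_not_bar.start(), after_dollar, not_after_dollar)
```
.fi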
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
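.sp
Two of these forms, \en and (?P=name), are also accepted by Python's re module,
which is used for the sketch below in place of PCRE (an assumption of this
example). The group name q and the subject strings are invented.
.sp
.nf
```python
import re

# \1 must match the same text that group 1 captured: a doubled word.
repeated = re.search(r"\b(\w+) \1\b", "this is is a test")

# (?P=q) must match the same quote character that group q captured.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' or \"bye\"")

print(repeated.group(1), quoted.group())
```
.fi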
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
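.sp
The absolute reference condition (?(n)...) can be sketched with Python's re
module, which supports that form and the (?(name)...) form but none of the
recursion, DEFINE or assertion conditions. Using Python instead of PCRE is an
assumption of this example, and the patterns are invented.
.sp
.nf
```python
import re

# Require a closing ">" only if group 1 matched an opening "<".
yes_only = re.compile(r"(<)?\w+(?(1)>)")
bracketed = yes_only.fullmatch("<tag>")
bare = yes_only.fullmatch("tag")
unbalanced = yes_only.fullmatch("<tag")

# With a no-pattern: ">" if group 1 matched, otherwise ";".
yes_or_no = re.compile(r"(<)?\w+(?(1)>|;)")
semicolon = yes_or_no.fullmatch("tag;")

print(bool(bracketed), bool(bare), bool(unbalanced), bool(semicolon))
```
.fi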
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 05 October 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
