.\" Revision 1369, Tue Oct 8 15:06:46 2013 UTC, by ph10
.\" Update \8 and \9 handling to match most recent Perl.
.TH PCRESYNTAX 3 "08 October 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains a quick-reference summary of the syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any ASCII character
  \ee         escape (hex 1B)
  \ef         form feed (hex 0C)
  \en         newline (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \eddd       character with octal code ddd, or backreference
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.sp
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
characters "8" and "9".
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one data unit, even in UTF mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal white space character
  \eH         a character that is not a horizontal white space character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a white space character
  \eS         a character that is not a white space character
  \ev         a vertical white space character
  \eV         a character that is not a vertical white space character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         a Unicode extended grapheme cluster
.sp
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
.sp
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
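.sp
As a rough illustration of the bracket syntax in this section, the sketch below
uses Python's re module as a stand-in rather than PCRE itself. Python's re does
not interpret POSIX named sets such as [[:xdigit:]], so an explicit range
covering the same ASCII characters is used instead. The names HEX_RUN,
NON_DIGIT, hex_runs and non_digit_runs are invented for this example.
.sp
.nf
```python
import re

# Assumption: Python's re stands in for PCRE; explicit ASCII ranges
# replace the POSIX named sets, which re does not interpret.
HEX_RUN = re.compile("[0-9A-Fa-f]+")   # same ASCII set as xdigit
NON_DIGIT = re.compile("[^0-9]+")      # negative character class


def hex_runs(text):
    """Return every maximal run of hexadecimal digits in text."""
    return HEX_RUN.findall(text)


def non_digit_runs(text):
    """Return every maximal run of characters other than 0-9."""
    return NON_DIGIT.findall(text)
```
.fi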
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
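.sp
The difference between greedy and lazy quantifiers can be sketched as follows.
This uses Python's re module as a stand-in, on the assumption that it agrees
with PCRE for these simple cases. Possessive forms are left out because older
Python versions do not accept them. The helper name matched and the subject
strings are invented for this example.
.sp
.nf
```python
import re

# Assumption: Python's re stands in for PCRE for greedy/lazy behavior.
SUBJECT = "<a><b>"


def matched(pattern, subject=SUBJECT):
    """Return the text matched at the start of subject, or None."""
    m = re.match(pattern, subject)
    return m.group(0) if m else None


greedy = matched("<.+>")     # + takes as much as it can: "<a><b>"
lazy = matched("<.+?>")      # +? takes as little as it can: "<a>"
bounded = matched("[a-z]{2,3}", "abcdef")        # greedy: "abc"
bounded_lazy = matched("[a-z]{2,3}?", "abcdef")  # lazy: "ab"
```
.fi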
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary
  \eB          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \eZ          end of subject
               also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
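.sp
A minimal sketch of named and non-capturing groups follows. It uses Python's re
module as a stand-in for PCRE, so only the Python spelling (?P<name>...) is
shown; the Perl spellings and the (?|...) form are not accepted by re. The
pattern and the variable names are invented for this example.
.sp
.nf
```python
import re

# Assumption: Python's re stands in for PCRE; only the Python-style
# named group spelling is exercised here.
DATE = re.compile("(?P<year>[0-9]{4})-(?P<month>[0-9]{2})")

m = DATE.match("2013-10")
named = m.groupdict()    # groups looked up by name
numbered = m.groups()    # named groups are still numbered 1, 2, ...

# (?:...) groups without capturing, so only (c) receives a number.
n = re.match("(?:ab)+(c)", "ababc")
only_group = n.groups()
```
.fi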
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \ed etc)
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
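.sp
The assertions above can be sketched as follows, using Python's re module as a
stand-in on the assumption that it agrees with PCRE for these simple cases. The
look behind used here is one character long, so it meets the fixed-length rule.
The function names and sample strings are invented for this example.
.sp
.nf
```python
import re

# Assumption: Python's re stands in for PCRE for simple assertions.


def units_before_kg(text):
    """Digits that are followed by " kg" (positive look ahead)."""
    return re.findall("[0-9]+(?= kg)", text)


def tags(text):
    """Lower case words preceded by # (positive look behind)."""
    return re.findall("(?<=#)[a-z]+", text)


def foo_not_bar(text):
    """Offset of the first foo not followed by bar, or None."""
    m = re.search("foo(?!bar)", text)
    return m.start() if m else None
```
.fi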
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
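.sp
As an illustration of a reference by name, the sketch below looks for a run of
letters that is immediately repeated after a single space. It uses Python's re
module as a stand-in, so only the Python spelling (?P=name) is shown. No word
boundary is checked, so the repeat may start inside a longer word. The names
DOUBLED and first_doubled are invented for this example.
.sp
.nf
```python
import re

# Assumption: Python's re stands in for PCRE; (?P=w) must match the
# same text that the group named w captured.
DOUBLED = re.compile("(?P<w>[a-z]+) (?P=w)")


def first_doubled(text):
    """Return the first repeated run of letters, or None."""
    m = DOUBLED.search(text)
    return m.group("w") if m else None
```
.fi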
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
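.sp
An absolute reference condition can be sketched as follows: a closing angle
bracket is required only when group 1 matched an opening one. Python's re
module is used as a stand-in; it accepts the (?(n)...) form, and is assumed to
agree with PCRE for this case. The other condition types listed above are not
exercised. The names BRACKETED and is_balanced are invented for this example.
.sp
.nf
```python
import re

# Assumption: Python's re stands in for PCRE. (?(1)>) demands ">"
# only if capturing group 1, the optional "<", took part in the match.
BRACKETED = re.compile("(<)?[a-z]+(?(1)>)")


def is_balanced(text):
    """True if text is a word, or a word wrapped in < and >."""
    return BRACKETED.fullmatch(text) is not None
```
.fi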
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 08 October 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
