.\" Revision 1370, Wed Oct 9 10:18:26 2013 UTC, by ph10
.\" Add \o{} and tidy up \x{} handling. Minor update to RunTest.
.TH PCRESYNTAX 3 "08 October 2013" "PCRE 8.34"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains a quick-reference summary of the syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any ASCII character
  \ee         escape (hex 1B)
  \ef         form feed (hex 0C)
  \en         newline (hex 0A)
  \er         carriage return (hex 0D)
  \et         tab (hex 09)
  \e0dd       character with octal code 0dd
  \eddd       character with octal code ddd, or backreference
  \eo{ddd..}  character with octal code ddd..
  \exhh       character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.sp
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
characters "8" and "9".
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one data unit, even in UTF mode (best avoided)
  \ed         a decimal digit
  \eD         a character that is not a decimal digit
  \eh         a horizontal white space character
  \eH         a character that is not a horizontal white space character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a white space character
  \eS         a character that is not a white space character
  \ev         a vertical white space character
  \eV         a character that is not a vertical white space character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         a Unicode extended grapheme cluster
.sp
In PCRE, by default, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII
characters, even in a UTF mode. However, this can be changed by setting the
PCRE_UCP option.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
.sp
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       white space
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
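.sp
As a rough illustration of greedy versus lazy matching, the sketch below uses
Python's re module as a stand-in. It is not PCRE; the assumption is only that
these particular quantifiers behave the same way in both. Possessive
quantifiers are left out, and the subject strings are invented for the example.
.sp
.nf
```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then backs off to the last ">".
greedy = re.match("<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, stopping at the first ">".
lazy = re.match("<.+?>", subject).group(0)

# The same distinction applies to counted repeats.
counted_greedy = re.match("a{2,3}", "aaaa").group(0)
counted_lazy = re.match("a{2,3}?", "aaaa").group(0)

print(greedy, lazy, counted_greedy, counted_lazy)
```
.fi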
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary
  \eB          not a word boundary
  ^           start of subject
               also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
               also before newline at end of subject
               also before internal newline in multiline mode
  \eZ          end of subject
               also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
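.sp
The difference between capturing and non-capturing groups can be sketched with
Python's re module, which is not PCRE but accepts the (?P<name>...) and (?:...)
forms listed above. The group names and the subject string are invented for the
example.
.sp
.nf
```python
import re

# Two named capturing groups separated by a non-capturing alternation.
pattern = "(?P<year>[0-9]{4})(?:-|/)(?P<month>[0-9]{2})"
m = re.match(pattern, "2013/10")

# The (?:...) group takes part in matching but is not numbered.
by_number = m.groups()
by_name = m.groupdict()

print(by_number, by_name)
```
.fi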
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
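.sp
The effect of (?i), (?m) and (?s) can be sketched with Python's re module as a
stand-in. It is not PCRE: it has no (?J) or (?U), and it expects these settings
at the very start of the pattern, so the sketch covers only that case. The
subject string is invented for the example.
.sp
.nf
```python
import re

# chr(10) is a newline, built this way so the listing needs no backslashes.
subject = "First line" + chr(10) + "second line"

# (?i) makes the match caseless.
caseless = re.findall("(?i)first", subject)

# (?m) lets ^ match after an internal newline as well as at the start.
line_starts = re.findall("(?m)^[A-Za-z]+", subject)

# Without (?s) the dot stops at the newline; with it, the dot matches it too.
dot_default = re.match(".+", subject).group(0)
dot_all = re.match("(?s).+", subject).group(0)

print(caseless, line_starts, len(dot_default), len(dot_all))
```
.fi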
.sp
The following are recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \ed etc)
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
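.sp
A minimal sketch of a look behind and a negative look ahead, using Python's re
module as a stand-in. It is not PCRE, and its look behind rule is stricter: all
alternatives must have the same length, which is why both branches here are
three characters long. The subject strings are invented for the example.
.sp
.nf
```python
import re

# Look behind: digits that come directly after "USD" or "EUR".
# Neither prefix becomes part of the matched text.
amounts = re.findall("(?<=USD|EUR)[0-9]+", "USD100 EUR250 GBP75")

# Negative look ahead: runs of letters not followed by a letter or "(".
not_calls = re.findall("[a-z]+(?![a-z(])", "foo(bar) baz")

print(amounts, not_calls)
```
.fi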
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
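.sp
A reference by name can be sketched with Python's re module as a stand-in. It
is not PCRE, and only the (?P=name) form from the list above is exercised here.
The group name and the subject strings are invented for the example.
.sp
.nf
```python
import re

# (?P=word) must match the same text that the group "word" captured.
pattern = "(?P<word>[a-z]+) (?P=word)"

doubled = re.search(pattern, "the the cat")
no_repeat = re.search(pattern, "the cat sat")

print(doubled.group(0), doubled.group("word"), no_repeat)
```
.fi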
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE extension)
  \eg'+n'          call subpattern by relative number (PCRE extension)
  \eg<-n>          call subpattern by relative number (PCRE extension)
  \eg'-n'          call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...        absolute reference condition
  (?(+n)...       relative reference condition
  (?(-n)...       relative reference condition
  (?(<name>)...   named reference condition (Perl)
  (?('name')...   named reference condition (Perl)
  (?(name)...     named reference condition (PCRE)
  (?(R)...        overall recursion condition
  (?(Rn)...       specific group recursion condition
  (?(R&name)...   specific recursion condition
  (?(DEFINE)...   define subpattern for reference
  (?(assert)...   assertion condition
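.sp
The absolute reference condition can be sketched with Python's re module as a
stand-in. It is not PCRE and accepts only the (?(n)...) and (?(name)...) forms,
so the other conditions listed above are not shown. The pattern and the subject
strings are invented for the example.
.sp
.nf
```python
import re

# Group 1 captures an optional "<"; (?(1)>) demands ">" only if it was set.
pattern = "(<)?[a-z]+(?(1)>)"

results = {s: re.fullmatch(pattern, s) is not None
           for s in ["<abc>", "abc", "<abc", "abc>"]}

print(results)
```
.fi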
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*PRUNE:NAME)   equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
  (*THEN:NAME)    equivalent to (*MARK:NAME)(*THEN)
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...), (*UTF8), (*UTF16), (*UTF32) or (*UCP) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or a UTF or UCP mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 08 October 2013
Copyright (c) 1997-2013 University of Cambridge.
.fi
