Contents of /code/trunk/doc/pcresyntax.3

Revision 1436
Wed Jan 8 17:29:39 2014 UTC (5 years, 9 months ago) by ph10
File size: 13329 byte(s)
Clarify documentation about documentation, and fix an omission.
.TH PCRESYNTAX 3 "08 January 2014" "PCRE 8.35"
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains a quick-reference summary of the syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex          where x is non-alphanumeric is a literal x
  \eQ...\eE     treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea          alarm, that is, the BEL character (hex 07)
  \ecx         "control-x", where x is any ASCII character
  \ee          escape (hex 1B)
  \ef          form feed (hex 0C)
  \en          newline (hex 0A)
  \er          carriage return (hex 0D)
  \et          tab (hex 09)
  \e0dd        character with octal code 0dd
  \eddd        character with octal code ddd, or backreference
  \eo{ddd..}   character with octal code ddd..
  \exhh        character with hex code hh
  \ex{hhh..}   character with hex code hhh..
.sp
Note that \e0dd is always an octal code, and that \e8 and \e9 are the literal
characters "8" and "9".
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .           any character except newline;
                in dotall mode, any character whatsoever
  \eC          one data unit, even in UTF mode (best avoided)
  \ed          a decimal digit
  \eD          a character that is not a decimal digit
  \eh          a horizontal white space character
  \eH          a character that is not a horizontal white space character
  \eN          a character that is not a newline
  \ep{\fIxx\fP}      a character with the \fIxx\fP property
  \eP{\fIxx\fP}      a character without the \fIxx\fP property
  \eR          a newline sequence
  \es          a white space character
  \eS          a character that is not a white space character
  \ev          a vertical white space character
  \eV          a character that is not a vertical white space character
  \ew          a "word" character
  \eW          a "non-word" character
  \eX          a Unicode extended grapheme cluster
.sp
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \es and \ew may also match characters with code points in the range
128-255. If the PCRE_UCP option is set, the behaviour of these escape sequences
is changed to use Unicode properties and they match many more characters.
.
.
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  C         Other
  Cc        Control
  Cf        Format
  Cn        Unassigned
  Co        Private use
  Cs        Surrogate
.sp
  L         Letter
  Ll        Lower case letter
  Lm        Modifier letter
  Lo        Other letter
  Lt        Title case letter
  Lu        Upper case letter
  L&        Ll, Lu, or Lt
.sp
  M         Mark
  Mc        Spacing mark
  Me        Enclosing mark
  Mn        Non-spacing mark
.sp
  N         Number
  Nd        Decimal number
  Nl        Letter number
  No        Other number
.sp
  P         Punctuation
  Pc        Connector punctuation
  Pd        Dash punctuation
  Pe        Close punctuation
  Pf        Final punctuation
  Pi        Initial punctuation
  Po        Other punctuation
  Ps        Open punctuation
.sp
  S         Symbol
  Sc        Currency symbol
  Sk        Modifier symbol
  Sm        Mathematical symbol
  So        Other symbol
.sp
  Z         Separator
  Zl        Line separator
  Zp        Paragraph separator
  Zs        Space separator
.
.
.SH "PCRE SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
.rs
.sp
  Xan       Alphanumeric: union of properties L and N
  Xps       POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp       Perl space: property Z or tab, NL, VT, FF, CR
  Xuc       Universally-named character: one that can be
              represented by a Universal Character Name
  Xwd       Perl word: property Xan or underscore
.sp
Perl and POSIX space are now the same. Perl added VT to its space character set
at release 5.18 and PCRE changed at release 8.34.
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Batak,
Bengali,
Bopomofo,
Brahmi,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Chakma,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Mandaic,
Meetei_Mayek,
Meroitic_Cursive,
Meroitic_Hieroglyphs,
Miao,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Sharada,
Shavian,
Sinhala,
Sora_Sompeng,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Takri,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]         positive character class
  [^...]        negative character class
  [x-y]         range (can be used for hex characters)
  [[:xxx:]]     positive POSIX named set
  [[:^xxx:]]    negative POSIX named set
.sp
  alnum     alphanumeric
  alpha     alphabetic
  ascii     0-127
  blank     space or tab
  cntrl     control character
  digit     decimal digit
  graph     printing, excluding space
  lower     lower case letter
  print     printing, including space
  punct     printing, excluding alphanumeric
  space     white space
  upper     upper case letter
  word      same as \ew
  xdigit    hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE_UCP is set. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
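.sp
As a rough illustration of greedy versus lazy repetition, the sketch below uses
Python's standard re module rather than PCRE itself. That substitution is an
assumption made only so that the example is runnable; re accepts the same
syntax for the quantifiers shown. The subject strings are invented for the
example, and possessive quantifiers are left out because older versions of re
do not accept them.

```python
import re

# Hypothetical subject string, chosen only for illustration.
subject = "<b>bold</b> and <i>italic</i>"

# Greedy: ".*" takes as much as it can, then gives back until ">" matches,
# so the match runs to the last ">" in the subject.
greedy = re.search(r"<.*>", subject).group(0)

# Lazy: ".*?" takes as little as it can, so the match stops at the first ">".
lazy = re.search(r"<.*?>", subject).group(0)

# Bounded repetition: greedy {2,3} prefers three, lazy {2,3}? prefers two.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```

The greedy and lazy forms accept the same set of strings; they differ only in
which match is tried first.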
.
.
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb      word boundary
  \eB      not a word boundary
  ^       start of subject
            also after internal newline in multiline mode
  \eA      start of subject
  $       end of subject
            also before newline at end of subject
            also before internal newline in multiline mode
  \eZ      end of subject
            also before newline at end of subject
  \ez      end of subject
  \eG      first matching position in subject
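.sp
The effect of multiline mode on ^ and $ can be sketched with Python's standard
re module, used here as a stand-in for PCRE (an assumption; for ^, $ and \eA
the two behave alike). \eZ and \ez are deliberately not shown, because re gives
\eZ the meaning that PCRE gives \ez. The subject string is invented for the
example.

```python
import re

# Hypothetical two-line subject that ends with a newline.
subject = "first line\nsecond line\n"

# Default mode: ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# Multiline mode (?m): ^ also matches after each internal newline.
starts_multiline = re.findall(r"(?m)^\w+", subject)

# \A always means the start of the subject, even in multiline mode.
starts_absolute = re.findall(r"(?m)\A\w+", subject)

# Default mode: $ matches at the end, and before a newline at the end.
ends_default = re.findall(r"line$", subject)

# Multiline mode: $ also matches before each internal newline.
ends_multiline = re.findall(r"(?m)line$", subject)

print(starts_default, starts_multiline, starts_absolute)
print(ends_default, ends_multiline)
```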
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.sp
\eK is honoured in positive assertions, but ignored in negative ones.
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)             capturing group
  (?<name>...)      named capturing group (Perl)
  (?'name'...)      named capturing group (Perl)
  (?P<name>...)     named capturing group (Python)
  (?:...)           non-capturing group
  (?|...)           non-capturing group; reset group numbers for
                      capturing groups in each alternative
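.sp
The difference between capturing, named, and non-capturing groups can be
sketched as follows. The example uses Python's standard re module as a
stand-in for PCRE (an assumption made for runnability), so it uses the
Python-style (?P<name>...) form from the table; the date string is invented.

```python
import re

# (?P<year>...) is a named capturing group, (?:...) groups without
# capturing, and the final (...) is an ordinary numbered group.
pattern = re.compile(r"(?P<year>\d{4})-(?:\d{2})-(\d{2})")

m = pattern.match("2014-01-08")

year_by_name = m.group("year")   # named groups are also numbered
year_by_number = m.group(1)
day = m.group(2)                 # the non-capturing group took no number
all_groups = m.groups()

print(year_by_name, day, all_groups)
```

Because the month is wrapped in a non-capturing group, the day is group 2
rather than group 3.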
.
.
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)           atomic, non-capturing group
.
.
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)          comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)              caseless
  (?J)              allow duplicate names
  (?m)              multiline
  (?s)              single line (dotall)
  (?U)              default ungreedy (lazy)
  (?x)              extended (ignore white space)
  (?-...)           unset option(s)
.sp
The following are recognized only at the very start of a pattern or after one
of the newline or \eR options with similar syntax. More than one of them may
appear.
.sp
  (*LIMIT_MATCH=d)      set the match limit to d (decimal number)
  (*LIMIT_RECURSION=d)  set the recursion limit to d (decimal number)
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE_NO_AUTO_POSSESS)
  (*NO_START_OPT)       no start-match optimization (PCRE_NO_START_OPTIMIZE)
  (*UTF8)               set UTF-8 mode: 8-bit library (PCRE_UTF8)
  (*UTF16)              set UTF-16 mode: 16-bit library (PCRE_UTF16)
  (*UTF32)              set UTF-32 mode: 32-bit library (PCRE_UTF32)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE_UCP (use Unicode properties for \ed etc)
.sp
Note that LIMIT_MATCH and LIMIT_RECURSION can only reduce the value of the
limits set by the caller of pcre_exec(), not increase them.
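.sp
The inline settings (?i), (?s) and (?x) from the first table in this section
can be sketched with Python's standard re module, which is used here as a
stand-in for PCRE (an assumption made for runnability). re does not implement
every option listed above, and it has no counterpart to the (*...) items, so
only these three are shown. The subject strings are invented.

```python
import re

# (?i): caseless matching.
caseless = re.search(r"(?i)pcre", "The PCRE library")

# (?s): dotall, so "." also matches a newline.
without_dotall = re.search(r"a.b", "a\nb")
with_dotall = re.search(r"(?s)a.b", "a\nb")

# (?x): extended, so white space in the pattern is ignored and
# \s* must be written out to match the space in the subject.
extended = re.search(r"(?x) \d+ \s* kg", "12 kg")

print(caseless.group(0), without_dotall, extended.group(0))
```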
.
.
.SH "NEWLINE CONVENTION"
.rs
.sp
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
.sp
  (*CR)         carriage return only
  (*LF)         linefeed only
  (*CRLF)       carriage return followed by linefeed
  (*ANYCRLF)    all three of the above
  (*ANY)        any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
.sp
  (*BSR_ANYCRLF)    CR, LF, or CRLF
  (*BSR_UNICODE)    any Unicode newline sequence
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)       positive look ahead
  (?!...)       negative look ahead
  (?<=...)      positive look behind
  (?<!...)      negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
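.sp
The four assertions can be sketched with Python's standard re module, used
here as a stand-in for PCRE (an assumption made for runnability). Note that re
is stricter than the rule stated above: it requires the whole look behind to
be of fixed width, not just each top-level branch. The subject string is
invented for the example.

```python
import re

# Hypothetical subject string, chosen only for illustration.
subject = "cost: $15, qty: 3, fee: $7"

# Positive look ahead: words that are followed by a colon.
labels = re.findall(r"\w+(?=:)", subject)

# Positive look behind: digits that are preceded by a dollar sign.
prices = re.findall(r"(?<=\$)\d+", subject)

# Negative look behind: whole numbers not preceded by a dollar sign.
plain_numbers = re.findall(r"(?<!\$)\b\d+", subject)

# Negative look ahead: "foo" that is not followed by "bar".
foo_start = re.search(r"foo(?!bar)", "foobar foobaz").start()

# A look behind of variable length is rejected when the pattern is compiled.
try:
    re.compile(r"(?<=a+)b")
    variable_length_rejected = False
except re.error:
    variable_length_rejected = True

print(labels, prices, plain_numbers, foo_start, variable_length_rejected)
```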
.
.
.SH "BACKREFERENCES"
.rs
.sp
  \en            reference by number (can be ambiguous)
  \egn           reference by number
  \eg{n}         reference by number
  \eg{-n}        relative reference by number
  \ek<name>      reference by name (Perl)
  \ek'name'      reference by name (Perl)
  \eg{name}      reference by name (Perl)
  \ek{name}      reference by name (.NET)
  (?P=name)     reference by name (Python)
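.sp
A backreference matches the same text that the referenced group matched, not
the group's pattern. The sketch below shows this with Python's standard re
module, used as a stand-in for PCRE (an assumption made for runnability). Only
the \en and (?P=name) forms are shown, since those are the forms from the
table that re is known to accept. The subject strings are invented.

```python
import re

# \1 must repeat exactly what group 1 matched: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "it was was fine")

# (?P=q) must repeat the quote character captured by the group named q,
# so an opening single quote can only be closed by a single quote.
quoted = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now")

# A mismatched closing quote does not satisfy the backreference.
mismatched = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi\" now")

print(doubled.group(1), quoted.group(0), mismatched)
```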
.
.
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)          recurse whole pattern
  (?n)          call subpattern by absolute number
  (?+n)         call subpattern by relative number
  (?-n)         call subpattern by relative number
  (?&name)      call subpattern by name (Perl)
  (?P>name)     call subpattern by name (Python)
  \eg<name>      call subpattern by name (Oniguruma)
  \eg'name'      call subpattern by name (Oniguruma)
  \eg<n>         call subpattern by absolute number (Oniguruma)
  \eg'n'         call subpattern by absolute number (Oniguruma)
  \eg<+n>        call subpattern by relative number (PCRE extension)
  \eg'+n'        call subpattern by relative number (PCRE extension)
  \eg<-n>        call subpattern by relative number (PCRE extension)
  \eg'-n'        call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...            absolute reference condition
  (?(+n)...           relative reference condition
  (?(-n)...           relative reference condition
  (?(<name>)...       named reference condition (Perl)
  (?('name')...       named reference condition (Perl)
  (?(name)...         named reference condition (PCRE)
  (?(R)...            overall recursion condition
  (?(Rn)...           specific group recursion condition
  (?(R&name)...       specific recursion condition
  (?(DEFINE)...       define subpattern for reference
  (?(assert)...       assertion condition
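.sp
The absolute reference condition (?(n)... can be sketched with Python's
standard re module, used here as a stand-in for PCRE (an assumption made for
runnability; the recursion, DEFINE and assertion conditions listed above are
not shown). The pattern and test strings are invented: a number may be wrapped
in parentheses, but only as a balanced pair.

```python
import re

# Group 1 optionally matches "(". The condition (?(1)\)) then requires
# a closing ")" only if group 1 took part in the match.
pattern = re.compile(r"^(\()?\d+(?(1)\))$")

balanced = pattern.match("(42)")
bare = pattern.match("42")
missing_close = pattern.match("(42")
stray_close = pattern.match("42)")

print(bool(balanced), bool(bare), bool(missing_close), bool(stray_close))
```

With no no-pattern given, the condition matches the empty string when group 1
is unset, which is why the bare number is accepted.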
.
.
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
  (*ACCEPT)         force successful match
  (*FAIL)           force backtrack; synonym (*F)
  (*MARK:NAME)      set name to be passed back; synonym (*:NAME)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)         overall failure, no advance of starting point
  (*PRUNE)          advance to next starting character
  (*PRUNE:NAME)     equivalent to (*MARK:NAME)(*PRUNE)
  (*SKIP)           advance to current matching position
  (*SKIP:NAME)      advance to position corresponding to an earlier
                    (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)           local failure, backtrack to next alternation
  (*THEN:NAME)      equivalent to (*MARK:NAME)(*THEN)
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)      callout
  (?Cn)     callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 08 January 2014
Copyright (c) 1997-2014 University of Cambridge.
.fi
