.TH PCRESYNTAX 3
.SH NAME
PCRE - Perl-compatible regular expressions
.SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
.rs
.sp
The full syntax and semantics of the regular expressions that are supported by
PCRE are described in the
.\" HREF
\fBpcrepattern\fP
.\"
documentation. This document contains just a quick-reference summary of the
syntax.
.
.
.SH "QUOTING"
.rs
.sp
  \ex          where x is non-alphanumeric is a literal x
  \eQ...\eE     treat enclosed characters as literal
.
.
.SH "CHARACTERS"
.rs
.sp
  \ea          alarm, that is, the BEL character (hex 07)
  \ecx         "control-x", where x is any character
  \ee          escape (hex 1B)
  \ef          formfeed (hex 0C)
  \en          newline (hex 0A)
  \er          carriage return (hex 0D)
  \et          tab (hex 09)
  \eddd        character with octal code ddd, or backreference
  \exhh        character with hex code hh
  \ex{hhh..}  character with hex code hhh..
.
.
.SH "CHARACTER TYPES"
.rs
.sp
  .           any character except newline;
                in dotall mode, any character whatsoever
  \eC          one byte, even in UTF-8 mode (best avoided)
  \ed          a decimal digit
  \eD          a character that is not a decimal digit
  \eh          a horizontal whitespace character
  \eH          a character that is not a horizontal whitespace character
  \ep{\fIxx\fP}      a character with the \fIxx\fP property
  \eP{\fIxx\fP}      a character without the \fIxx\fP property
  \eR          a newline sequence
  \es          a whitespace character
  \eS          a character that is not a whitespace character
  \ev          a vertical whitespace character
  \eV          a character that is not a vertical whitespace character
  \ew          a "word" character
  \eW          a "non-word" character
  \eX          an extended Unicode sequence
.sp
In PCRE, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII characters.
.
.
.SH "GENERAL CATEGORY PROPERTY CODES FOR \ep AND \eP"
.rs
.sp
  C          Other
  Cc         Control
  Cf         Format
  Cn         Unassigned
  Co         Private use
  Cs         Surrogate
.sp
  L          Letter
  Ll         Lower case letter
  Lm         Modifier letter
  Lo         Other letter
  Lt         Title case letter
  Lu         Upper case letter
  L&         Ll, Lu, or Lt
.sp
  M          Mark
  Mc         Spacing mark
  Me         Enclosing mark
  Mn         Non-spacing mark
.sp
  N          Number
  Nd         Decimal number
  Nl         Letter number
  No         Other number
.sp
  P          Punctuation
  Pc         Connector punctuation
  Pd         Dash punctuation
  Pe         Close punctuation
  Pf         Final punctuation
  Pi         Initial punctuation
  Po         Other punctuation
  Ps         Open punctuation
.sp
  S          Symbol
  Sc         Currency symbol
  Sk         Modifier symbol
  Sm         Mathematical symbol
  So         Other symbol
.sp
  Z          Separator
  Zl         Line separator
  Zp         Paragraph separator
  Zs         Space separator
.
.
.SH "SCRIPT NAMES FOR \ep AND \eP"
.rs
.sp
Arabic,
Armenian,
Avestan,
Balinese,
Bamum,
Bengali,
Bopomofo,
Braille,
Buginese,
Buhid,
Canadian_Aboriginal,
Carian,
Cham,
Cherokee,
Common,
Coptic,
Cuneiform,
Cypriot,
Cyrillic,
Deseret,
Devanagari,
Egyptian_Hieroglyphs,
Ethiopic,
Georgian,
Glagolitic,
Gothic,
Greek,
Gujarati,
Gurmukhi,
Han,
Hangul,
Hanunoo,
Hebrew,
Hiragana,
Imperial_Aramaic,
Inherited,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Javanese,
Kaithi,
Kannada,
Katakana,
Kayah_Li,
Kharoshthi,
Khmer,
Lao,
Latin,
Lepcha,
Limbu,
Linear_B,
Lisu,
Lycian,
Lydian,
Malayalam,
Meetei_Mayek,
Mongolian,
Myanmar,
New_Tai_Lue,
Nko,
Ogham,
Old_Italic,
Old_Persian,
Old_South_Arabian,
Old_Turkic,
Ol_Chiki,
Oriya,
Osmanya,
Phags_Pa,
Phoenician,
Rejang,
Runic,
Samaritan,
Saurashtra,
Shavian,
Sinhala,
Sundanese,
Syloti_Nagri,
Syriac,
Tagalog,
Tagbanwa,
Tai_Le,
Tai_Tham,
Tai_Viet,
Tamil,
Telugu,
Thaana,
Thai,
Tibetan,
Tifinagh,
Ugaritic,
Vai,
Yi.
.
.
.SH "CHARACTER CLASSES"
.rs
.sp
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  alnum       alphanumeric
  alpha       alphabetic
  ascii       0-127
  blank       space or tab
  cntrl       control character
  digit       decimal digit
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  space       whitespace
  upper       upper case letter
  word        same as \ew
  xdigit      hexadecimal digit
.sp
In PCRE, POSIX character set names recognize only ASCII characters. You can use
\eQ...\eE inside a character class.
.
.
.SH "QUANTIFIERS"
.rs
.sp
  ?           0 or 1, greedy
  ?+          0 or 1, possessive
  ??          0 or 1, lazy
  *           0 or more, greedy
  *+          0 or more, possessive
  *?          0 or more, lazy
  +           1 or more, greedy
  ++          1 or more, possessive
  +?          1 or more, lazy
  {n}         exactly n
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
.
.
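.sp
The difference between greedy and lazy quantifiers can be sketched as follows.
This is only an illustration that assumes Python's \fBre\fP module, which is
not PCRE but accepts the same syntax for these two forms; the subject string is
made up. The possessive forms are left out because not every version of that
module accepts them.
.sp
.nf
```python
import re

subject = "<a><b>"

# Greedy: + takes as much as it can, then gives back until ">" matches.
greedy = re.search("<.+>", subject).group(0)

# Lazy: +? takes as little as it can, so it stops at the first ">".
lazy = re.search("<.+?>", subject).group(0)

print(greedy)  # <a><b>
print(lazy)    # <a>
```
.fi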
.SH "ANCHORS AND SIMPLE ASSERTIONS"
.rs
.sp
  \eb          word boundary (only ASCII letters recognized)
  \eB          not a word boundary
  ^           start of subject
                also after internal newline in multiline mode
  \eA          start of subject
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \eZ          end of subject
                also before newline at end of subject
  \ez          end of subject
  \eG          first matching position in subject
.
.
.SH "MATCH POINT RESET"
.rs
.sp
  \eK          reset start of match
.
.
.SH "ALTERNATION"
.rs
.sp
  expr|expr|expr...
.
.
.SH "CAPTURING"
.rs
.sp
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                    capturing groups in each alternative
.
.
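.sp
As a hedged sketch of the Python-style named group and the non-capturing
group, the following uses Python's \fBre\fP module as a stand-in for PCRE. The
group names and the date string are hypothetical, chosen only for the example.
.sp
.nf
```python
import re

# (?P<name>...) is a named capturing group; (?:...) groups without capturing,
# so the month digits get no group number.
pattern = re.compile("(?P<year>[0-9]{4})-(?:[0-9]{2})-(?P<day>[0-9]{2})")
m = pattern.search("Last updated: 2010-03-01")

numbered = m.groups()     # only the two capturing groups appear
named = m.groupdict()     # the same captures, keyed by name

print(numbered)  # ('2010', '01')
print(named)     # {'year': '2010', 'day': '01'}
```
.fi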
.SH "ATOMIC GROUPS"
.rs
.sp
  (?>...)         atomic, non-capturing group
.
.
.SH "COMMENT"
.rs
.sp
  (?#....)        comment (not nestable)
.
.
.SH "OPTION SETTING"
.rs
.sp
  (?i)            caseless
  (?J)            allow duplicate names
  (?m)            multiline
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended (ignore white space)
  (?-...)         unset option(s)
.sp
The following is recognized only at the start of a pattern or after one of the
newline-setting options with similar syntax:
.sp
  (*UTF8)         set UTF-8 mode
.
.
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
.rs
.sp
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.sp
Each top-level branch of a look behind must be of a fixed length.
.
.
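.sp
The assertions and the fixed-length rule can be illustrated with a small
sketch. It assumes Python's \fBre\fP module rather than PCRE; that module is
stricter than the rule stated above, because it requires the whole look behind
(not just each top-level branch) to have one fixed width. The subject strings
are invented.
.sp
.nf
```python
import re

# Positive look behind: digits preceded by "USD ", which is not consumed.
usd = re.findall("(?<=USD )[0-9]+", "USD 42, EUR 17")

# Negative look ahead: "foo" only where "bar" does not follow.
start = re.search("foo(?!bar)", "foobar foobaz").start()

# A look behind whose length can vary is rejected when compiled.
try:
    re.compile("(?<=a+)b")
    rejected = False
except re.error:
    rejected = True

print(usd)       # ['42']
print(start)     # 7
print(rejected)  # True
```
.fi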
.SH "BACKREFERENCES"
.rs
.sp
  \en            reference by number (can be ambiguous)
  \egn           reference by number
  \eg{n}         reference by number
  \eg{-n}        relative reference by number
  \ek<name>      reference by name (Perl)
  \ek'name'      reference by name (Perl)
  \eg{name}      reference by name (Perl)
  \ek{name}      reference by name (.NET)
  (?P=name)     reference by name (Python)
.
.
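.sp
A backreference matches the text that the referenced group actually captured,
not the group's pattern. The sketch below shows this with the Python-style
named form only, assuming Python's \fBre\fP module as a stand-in for PCRE; the
group name w and the subject string are hypothetical.
.sp
.nf
```python
import re

# (?P=w) must repeat exactly what the group named w matched.
doubled = re.compile("(?P<w>[a-z]+) (?P=w)")
m = doubled.search("it is is fine")

found = m.group(0)
word = m.group("w")

print(found)  # is is
print(word)   # is
```
.fi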
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
.rs
.sp
  (?R)          recurse whole pattern
  (?n)          call subpattern by absolute number
  (?+n)         call subpattern by relative number
  (?-n)         call subpattern by relative number
  (?&name)      call subpattern by name (Perl)
  (?P>name)     call subpattern by name (Python)
  \eg<name>      call subpattern by name (Oniguruma)
  \eg'name'      call subpattern by name (Oniguruma)
  \eg<n>         call subpattern by absolute number (Oniguruma)
  \eg'n'         call subpattern by absolute number (Oniguruma)
  \eg<+n>        call subpattern by relative number (PCRE extension)
  \eg'+n'        call subpattern by relative number (PCRE extension)
  \eg<-n>        call subpattern by relative number (PCRE extension)
  \eg'-n'        call subpattern by relative number (PCRE extension)
.
.
.SH "CONDITIONAL PATTERNS"
.rs
.sp
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)...            absolute reference condition
  (?(+n)...           relative reference condition
  (?(-n)...           relative reference condition
  (?(<name>)...       named reference condition (Perl)
  (?('name')...       named reference condition (Perl)
  (?(name)...         named reference condition (PCRE)
  (?(R)...            overall recursion condition
  (?(Rn)...           specific group recursion condition
  (?(R&name)...       specific recursion condition
  (?(DEFINE)...       define subpattern for reference
  (?(assert)...       assertion condition
.
.
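.sp
An absolute reference condition can be sketched as below. This assumes
Python's \fBre\fP module, which supports only the numbered and named reference
forms of the conditions listed above; the pattern and test strings are invented
for the example.
.sp
.nf
```python
import re

# (<)? optionally captures "<". The condition (?(1)>) then demands a
# closing ">" only when group 1 took part in the match.
tag = re.compile("(<)?[a-z]+(?(1)>)")

subjects = ["abc", "<abc>", "<abc", "abc>"]
results = {s: tag.fullmatch(s) is not None for s in subjects}

print(results)
# {'abc': True, '<abc>': True, '<abc': False, 'abc>': False}
```
.fi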
.SH "BACKTRACKING CONTROL"
.rs
.sp
The following act as soon as they are reached:
.sp
  (*ACCEPT)     force successful match
  (*FAIL)       force backtrack; synonym (*F)
.sp
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)     overall failure, no advance of starting point
  (*PRUNE)      advance to next starting character
  (*SKIP)       advance start to current matching position
  (*THEN)       local failure, backtrack to next alternation
.
.
.SH "NEWLINE CONVENTIONS"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*BSR_...) or (*UTF8) option.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
.
.
.SH "WHAT \eR MATCHES"
.rs
.sp
These are recognized only at the very start of the pattern or after a
(*...) option that sets the newline convention or UTF-8 mode.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.
.
.SH "CALLOUTS"
.rs
.sp
  (?C)          callout
  (?Cn)         callout with data n
.
.
.SH "SEE ALSO"
.rs
.sp
\fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
\fBpcrematching\fP(3), \fBpcre\fP(3).
.
.
.SH AUTHOR
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.SH REVISION
.rs
.sp
.nf
Last updated: 01 March 2010
Copyright (c) 1997-2010 University of Cambridge.
.fi
