
Contents of /code/trunk/doc/pcresyntax.3

Revision 412 - (show annotations)
Sat Apr 11 10:34:37 2009 UTC by ph10
File size: 10745 byte(s)
Add support for (*UTF8).
1 .TH PCRESYNTAX 3
2 .SH NAME
3 PCRE - Perl-compatible regular expressions
4 .SH "PCRE REGULAR EXPRESSION SYNTAX SUMMARY"
5 .rs
6 .sp
7 The full syntax and semantics of the regular expressions that are supported by
8 PCRE are described in the
9 .\" HREF
10 \fBpcrepattern\fP
11 .\"
12 documentation. This document contains just a quick-reference summary of the
13 syntax.
14 .
15 .
16 .SH "QUOTING"
17 .rs
18 .sp
19 \ex where x is non-alphanumeric is a literal x
20 \eQ...\eE treat enclosed characters as literal
21 .
22 .
23 .SH "CHARACTERS"
24 .rs
25 .sp
26 \ea alarm, that is, the BEL character (hex 07)
27 \ecx "control-x", where x is any character
28 \ee escape (hex 1B)
29 \ef formfeed (hex 0C)
30 \en newline (hex 0A)
31 \er carriage return (hex 0D)
32 \et tab (hex 09)
33 \eddd character with octal code ddd, or backreference
34 \exhh character with hex code hh
35 \ex{hhh..} character with hex code hhh..
36 .
37 .
38 .SH "CHARACTER TYPES"
39 .rs
40 .sp
41 . any character except newline;
42 in dotall mode, any character whatsoever
43 \eC one byte, even in UTF-8 mode (best avoided)
44 \ed a decimal digit
45 \eD a character that is not a decimal digit
46 \eh a horizontal whitespace character
47 \eH a character that is not a horizontal whitespace character
48 \ep{\fIxx\fP} a character with the \fIxx\fP property
49 \eP{\fIxx\fP} a character without the \fIxx\fP property
50 \eR a newline sequence
51 \es a whitespace character
52 \eS a character that is not a whitespace character
53 \ev a vertical whitespace character
54 \eV a character that is not a vertical whitespace character
55 \ew a "word" character
56 \eW a "non-word" character
57 \eX an extended Unicode sequence
58 .sp
59 In PCRE, \ed, \eD, \es, \eS, \ew, and \eW recognize only ASCII characters.
60 .
61 .
62 .SH "GENERAL CATEGORY PROPERTY CODES FOR \ep AND \eP"
63 .rs
64 .sp
65 C Other
66 Cc Control
67 Cf Format
68 Cn Unassigned
69 Co Private use
70 Cs Surrogate
71 .sp
72 L Letter
73 Ll Lower case letter
74 Lm Modifier letter
75 Lo Other letter
76 Lt Title case letter
77 Lu Upper case letter
78 L& Ll, Lu, or Lt
79 .sp
80 M Mark
81 Mc Spacing mark
82 Me Enclosing mark
83 Mn Non-spacing mark
84 .sp
85 N Number
86 Nd Decimal number
87 Nl Letter number
88 No Other number
89 .sp
90 P Punctuation
91 Pc Connector punctuation
92 Pd Dash punctuation
93 Pe Close punctuation
94 Pf Final punctuation
95 Pi Initial punctuation
96 Po Other punctuation
97 Ps Open punctuation
98 .sp
99 S Symbol
100 Sc Currency symbol
101 Sk Modifier symbol
102 Sm Mathematical symbol
103 So Other symbol
104 .sp
105 Z Separator
106 Zl Line separator
107 Zp Paragraph separator
108 Zs Space separator
109 .
110 .
111 .SH "SCRIPT NAMES FOR \ep AND \eP"
112 .rs
113 .sp
114 Arabic,
115 Armenian,
116 Balinese,
117 Bengali,
118 Bopomofo,
119 Braille,
120 Buginese,
121 Buhid,
122 Canadian_Aboriginal,
123 Carian,
124 Cham,
125 Cherokee,
126 Common,
127 Coptic,
128 Cuneiform,
129 Cypriot,
130 Cyrillic,
131 Deseret,
132 Devanagari,
133 Ethiopic,
134 Georgian,
135 Glagolitic,
136 Gothic,
137 Greek,
138 Gujarati,
139 Gurmukhi,
140 Han,
141 Hangul,
142 Hanunoo,
143 Hebrew,
144 Hiragana,
145 Inherited,
146 Kannada,
147 Katakana,
148 Kayah_Li,
149 Kharoshthi,
150 Khmer,
151 Lao,
152 Latin,
153 Lepcha,
154 Limbu,
155 Linear_B,
156 Lycian,
157 Lydian,
158 Malayalam,
159 Mongolian,
160 Myanmar,
161 New_Tai_Lue,
162 Nko,
163 Ogham,
164 Old_Italic,
165 Old_Persian,
166 Ol_Chiki,
167 Oriya,
168 Osmanya,
169 Phags_Pa,
170 Phoenician,
171 Rejang,
172 Runic,
173 Saurashtra,
174 Shavian,
175 Sinhala,
176 Sundanese,
177 Syloti_Nagri,
178 Syriac,
179 Tagalog,
180 Tagbanwa,
181 Tai_Le,
182 Tamil,
183 Telugu,
184 Thaana,
185 Thai,
186 Tibetan,
187 Tifinagh,
188 Ugaritic,
189 Vai,
190 Yi.
191 .
192 .
193 .SH "CHARACTER CLASSES"
194 .rs
195 .sp
196 [...] positive character class
197 [^...] negative character class
198 [x-y] range (can be used for hex characters)
199 [[:xxx:]] positive POSIX named set
200 [[:^xxx:]] negative POSIX named set
201 .sp
202 alnum alphanumeric
203 alpha alphabetic
204 ascii 0-127
205 blank space or tab
206 cntrl control character
207 digit decimal digit
208 graph printing, excluding space
209 lower lower case letter
210 print printing, including space
211 punct printing, excluding alphanumeric
212 space whitespace
213 upper upper case letter
214 word same as \ew
215 xdigit hexadecimal digit
216 .sp
217 In PCRE, POSIX character set names recognize only ASCII characters. You can use
218 \eQ...\eE inside a character class.
219 .
220 .
221 .SH "QUANTIFIERS"
222 .rs
223 .sp
224 ? 0 or 1, greedy
225 ?+ 0 or 1, possessive
226 ?? 0 or 1, lazy
227 * 0 or more, greedy
228 *+ 0 or more, possessive
229 *? 0 or more, lazy
230 + 1 or more, greedy
231 ++ 1 or more, possessive
232 +? 1 or more, lazy
233 {n} exactly n
234 {n,m} at least n, no more than m, greedy
235 {n,m}+ at least n, no more than m, possessive
236 {n,m}? at least n, no more than m, lazy
237 {n,} n or more, greedy
238 {n,}+ n or more, possessive
239 {n,}? n or more, lazy
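
As a rough illustration of the greedy and lazy rows above, here is a small sketch using Python's standard `re` module. That module is not PCRE; it is used here only on the assumption that it treats these particular quantifiers the same way. The sample strings and variable names are made up for the example. Possessive quantifiers are left out because `re` accepts them only from Python 3.11 onwards.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back just enough for '>'.
greedy = re.match(r"<.+>", subject).group()

# Lazy: .+? takes as little as it can, so the match stops at the first '>'.
lazy = re.match(r"<.+?>", subject).group()

# Bounded repetition: {2,3} prefers the upper bound, {2,3}? the lower bound.
bounded_greedy = re.match(r"a{2,3}", "aaaa").group()
bounded_lazy = re.match(r"a{2,3}?", "aaaa").group()

print(greedy, lazy, bounded_greedy, bounded_lazy)
```

The same subject gives different match lengths purely because of the quantifier suffix, which is the distinction the table encodes.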
240 .
241 .
242 .SH "ANCHORS AND SIMPLE ASSERTIONS"
243 .rs
244 .sp
245 \eb word boundary (only ASCII letters recognized)
246 \eB not a word boundary
247 ^ start of subject
248 also after internal newline in multiline mode
249 \eA start of subject
250 $ end of subject
251 also before newline at end of subject
252 also before internal newline in multiline mode
253 \eZ end of subject
254 also before newline at end of subject
255 \ez end of subject
256 \eG first matching position in subject
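
The behaviour of ^, $, and \eb listed above can be sketched with Python's standard `re` module (written there as `^`, `$`, and `\b`). This is only an approximation of PCRE: the two engines are assumed to agree on the cases shown, but they are known to differ elsewhere (for example, Python's `\Z` behaves like PCRE's \ez), so \eZ, \ez, and \eG are deliberately not exercised. The sample strings are invented for the example.

```python
import re

subject = "one\ntwo\n"

# Without multiline mode, ^ matches only at the start of the subject.
starts_default = re.findall(r"^\w+", subject)

# In multiline mode, ^ also matches after each internal newline.
starts_multiline = re.findall(r"^\w+", subject, re.MULTILINE)

# $ matches before a newline that ends the subject, not only at the very end.
end_before_final_newline = re.search(r"two$", subject) is not None

# \b requires a word boundary on each side, so "concat" is skipped.
boundary_hits = re.findall(r"\bcat\b", "cat concat cat.")

print(starts_default, starts_multiline, end_before_final_newline, boundary_hits)
```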
257 .
258 .
259 .SH "MATCH POINT RESET"
260 .rs
261 .sp
262 \eK reset start of match
263 .
264 .
265 .SH "ALTERNATION"
266 .rs
267 .sp
268 expr|expr|expr...
269 .
270 .
271 .SH "CAPTURING"
272 .rs
273 .sp
274 (...) capturing group
275 (?<name>...) named capturing group (Perl)
276 (?'name'...) named capturing group (Perl)
277 (?P<name>...) named capturing group (Python)
278 (?:...) non-capturing group
279 (?|...) non-capturing group; reset group numbers for
280 capturing groups in each alternative
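
A minimal sketch of how capturing, named, and non-capturing groups are numbered, using Python's standard `re` module as a stand-in for PCRE. Only the Python-style (?P<name>...) spelling is shown, because that is the form `re` is known to accept; the Perl-style spellings and (?|...) are not demonstrated. The date string and group name are hypothetical.

```python
import re

# Group 1 is named "year"; the separator group is non-capturing, so the
# month becomes group 2 rather than group 3.
match = re.match(r"(?P<year>\d{4})(?:-|/)(\d{2})", "2009/04")

year_by_name = match.group("year")
year_by_number = match.group(1)
month = match.group(2)
all_groups = match.groups()

print(year_by_name, month, all_groups)
```

A named group still receives a number, which is why `group("year")` and `group(1)` return the same text.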
281 .
282 .
283 .SH "ATOMIC GROUPS"
284 .rs
285 .sp
286 (?>...) atomic, non-capturing group
287 .
288 .
289 .
290 .
291 .SH "COMMENT"
292 .rs
293 .sp
294 (?#....) comment (not nestable)
295 .
296 .
297 .SH "OPTION SETTING"
298 .rs
299 .sp
300 (?i) caseless
301 (?J) allow duplicate names
302 (?m) multiline
303 (?s) single line (dotall)
304 (?U) default ungreedy (lazy)
305 (?x) extended (ignore white space)
306 (?-...) unset option(s)
307 .sp
308 The following is recognized only at the start of a pattern or after one of the
309 newline-setting options with similar syntax:
310 .sp
311 (*UTF8) set UTF-8 mode
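
The effect of some of the inline options can be sketched with Python's standard `re` module, on the assumption that (?i), (?s), and (?x) mean the same there as in PCRE when placed at the start of the pattern. (?J), (?U), and (*UTF8) are PCRE-specific and are not shown. The test strings are invented for the example.

```python
import re

# (?i): letters match regardless of case.
caseless = re.search(r"(?i)pcre", "Uses PCRE") is not None

# By default, dot does not match a newline ...
dot_default = re.search(r"a.b", "a\nb") is not None

# ... but with (?s) (dotall) it matches any character whatsoever.
dot_dotall = re.search(r"(?s)a.b", "a\nb") is not None

# (?x): unescaped white space in the pattern is ignored.
extended = re.search(r"(?x) \d+ \s* kg", "17 kg") is not None

print(caseless, dot_default, dot_dotall, extended)
```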
312 .
313 .
314 .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
315 .rs
316 .sp
317 (?=...) positive look ahead
318 (?!...) negative look ahead
319 (?<=...) positive look behind
320 (?<!...) negative look behind
321 .sp
322 Each top-level branch of a look behind must be of a fixed length.
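
The four assertions can be illustrated with a short sketch in Python's standard `re` module, which is assumed to agree with PCRE for these simple fixed-length cases. Note that `re` is stricter than the rule stated above: it requires the whole lookbehind, not just each branch, to have one fixed length. The sample text is made up.

```python
import re

prices = "cost: $42, weight: 17kg, fee: $8"

# Positive lookbehind: digits preceded by '$' (the '$' is not part of the match).
after_dollar = re.findall(r"(?<=\$)\d+", prices)

# Negative lookbehind: digit runs not preceded by '$' or by another digit.
not_after_dollar = re.findall(r"(?<![\$\d])\d+", prices)

# Positive lookahead: digits followed by "kg" (the "kg" is not consumed).
before_kg = re.findall(r"\d+(?=kg)", prices)

# Negative lookahead: digit runs not followed by "kg" or by another digit.
not_before_kg = re.findall(r"\d+(?!kg|\d)", prices)

print(after_dollar, not_after_dollar, before_kg, not_before_kg)
```

The extra `\d` alternative in the negative forms stops the engine from satisfying the assertion by matching only part of a number.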
323 .
324 .
325 .SH "BACKREFERENCES"
326 .rs
327 .sp
328 \en reference by number (can be ambiguous)
329 \egn reference by number
330 \eg{n} reference by number
331 \eg{-n} relative reference by number
332 \ek<name> reference by name (Perl)
333 \ek'name' reference by name (Perl)
334 \eg{name} reference by name (Perl)
335 \ek{name} reference by name (.NET)
336 (?P=name) reference by name (Python)
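
A brief sketch of backreferences by number and by name, using Python's standard `re` module. Only the \en form (written `\1`) and the Python-style (?P=name) form are shown, since those are the spellings `re` is known to support; the \eg and \ek forms are PCRE and Perl syntax and are not exercised here. The input strings and the group name `quote` are hypothetical.

```python
import re

# \1 must match the same text that group 1 matched: a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test")
doubled_word = doubled.group(1)

# (?P=quote) must match the same quote character that opened the string.
quoted = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'hi' now")
quoted_text = quoted.group()

# A mismatched closing quote does not satisfy the backreference.
mismatched = re.search(r"(?P<quote>['\"]).*?(?P=quote)", "say 'hi\" now")

print(doubled_word, quoted_text, mismatched)
```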
337 .
338 .
339 .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
340 .rs
341 .sp
342 (?R) recurse whole pattern
343 (?n) call subpattern by absolute number
344 (?+n) call subpattern by relative number
345 (?-n) call subpattern by relative number
346 (?&name) call subpattern by name (Perl)
347 (?P>name) call subpattern by name (Python)
348 \eg<name> call subpattern by name (Oniguruma)
349 \eg'name' call subpattern by name (Oniguruma)
350 \eg<n> call subpattern by absolute number (Oniguruma)
351 \eg'n' call subpattern by absolute number (Oniguruma)
352 \eg<+n> call subpattern by relative number (PCRE extension)
353 \eg'+n' call subpattern by relative number (PCRE extension)
354 \eg<-n> call subpattern by relative number (PCRE extension)
355 \eg'-n' call subpattern by relative number (PCRE extension)
356 .
357 .
358 .SH "CONDITIONAL PATTERNS"
359 .rs
360 .sp
361 (?(condition)yes-pattern)
362 (?(condition)yes-pattern|no-pattern)
363 .sp
364 (?(n)... absolute reference condition
365 (?(+n)... relative reference condition
366 (?(-n)... relative reference condition
367 (?(<name>)... named reference condition (Perl)
368 (?('name')... named reference condition (Perl)
369 (?(name)... named reference condition (PCRE)
370 (?(R)... overall recursion condition
371 (?(Rn)... specific group recursion condition
372 (?(R&name)... specific recursion condition
373 (?(DEFINE)... define subpattern for reference
374 (?(assert)... assertion condition
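
The absolute reference condition (?(n)... can be sketched with Python's standard `re` module, which is assumed to treat this one form as PCRE does. The relative, recursion, DEFINE, and assertion conditions are not shown. The pattern and inputs are invented: a number that may be wrapped in parentheses, where the closing parenthesis is required only if the opening one was matched.

```python
import re

# Group 1 is an optional '('. The condition (?(1)\)) demands a ')' only
# when group 1 took part in the match; otherwise it matches nothing.
pattern = re.compile(r"^(\()?\d+(?(1)\))$")

results = {
    text: pattern.match(text) is not None
    for text in ["42", "(42)", "(42", "42)"]
}

print(results)
```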
375 .
376 .
377 .SH "BACKTRACKING CONTROL"
378 .rs
379 .sp
380 The following act as soon as they are reached:
381 .sp
382 (*ACCEPT) force successful match
383 (*FAIL) force backtrack; synonym (*F)
384 .sp
385 The following act only when a subsequent match failure causes a backtrack to
386 reach them. They all force a match failure, but they differ in what happens
387 afterwards. Those that advance the start-of-match point do so only if the
388 pattern is not anchored.
389 .sp
390 (*COMMIT) overall failure, no advance of starting point
391 (*PRUNE) advance to next starting character
392 (*SKIP) advance start to current matching position
393 (*THEN) local failure, backtrack to next alternation
394 .
395 .
396 .SH "NEWLINE CONVENTIONS"
397 .rs
398 .sp
399 These are recognized only at the very start of the pattern or after a
400 (*BSR_...) or (*UTF8) option.
401 .sp
402 (*CR) carriage return only
403 (*LF) linefeed only
404 (*CRLF) carriage return followed by linefeed
405 (*ANYCRLF) all three of the above
406 (*ANY) any Unicode newline sequence
407 .
408 .
409 .SH "WHAT \eR MATCHES"
410 .rs
411 .sp
412 These are recognized only at the very start of the pattern or after a
413 (*...) option that sets the newline convention or UTF-8 mode.
414 .sp
415 (*BSR_ANYCRLF) CR, LF, or CRLF
416 (*BSR_UNICODE) any Unicode newline sequence
417 .
418 .
419 .SH "CALLOUTS"
420 .rs
421 .sp
422 (?C) callout
423 (?Cn) callout with data n
424 .
425 .
426 .SH "SEE ALSO"
427 .rs
428 .sp
429 \fBpcrepattern\fP(3), \fBpcreapi\fP(3), \fBpcrecallout\fP(3),
430 \fBpcrematching\fP(3), \fBpcre\fP(3).
431 .
432 .
433 .SH AUTHOR
434 .rs
435 .sp
436 .nf
437 Philip Hazel
438 University Computing Service
439 Cambridge CB2 3QH, England.
440 .fi
441 .
442 .
443 .SH REVISION
444 .rs
445 .sp
446 .nf
447 Last updated: 11 April 2009
448 Copyright (c) 1997-2009 University of Cambridge.
449 .fi
