1 |
.TH PCRE 3 |
.TH PCREPATTERN 3 |
2 |
.SH NAME |
.SH NAME |
3 |
PCRE - Perl-compatible regular expressions |
PCRE - Perl-compatible regular expressions |
4 |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
.SH "PCRE REGULAR EXPRESSION DETAILS" |
148 |
\et tab (hex 09) |
\et tab (hex 09) |
149 |
\eddd character with octal code ddd, or backreference |
\eddd character with octal code ddd, or backreference |
150 |
\exhh character with hex code hh |
\exhh character with hex code hh |
151 |
\ex{hhh..} character with hex code hhh... (UTF-8 mode only) |
\ex{hhh..} character with hex code hhh.. |
152 |
.sp |
.sp |
153 |
The precise effect of \ecx is as follows: if x is a lower case letter, it |
The precise effect of \ecx is as follows: if x is a lower case letter, it |
154 |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
156 |
7B. |
7B. |
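The \ecx rule described above can be sketched in a few lines. This is an illustration in Python, not PCRE's own C code, and the function name control_escape is hypothetical. It assumes x is a single ASCII character.

```python
def control_escape(x):
    """Return the character code that the rule above gives for the character x."""
    code = ord(x)
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        code = ord(x.upper())
    # Step 2: bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


if __name__ == "__main__":
    for ch in "z{;":
        print(ch, hex(control_escape(ch)))
```

Because the second step is an inversion rather than a clearing of the bit, a character below hex 40 such as ";" moves up to hex 7B, while "{" moves down to hex 3B.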
157 |
.P |
.P |
158 |
After \ex, from zero to two hexadecimal digits are read (letters can be in |
After \ex, from zero to two hexadecimal digits are read (letters can be in |
159 |
upper or lower case). In UTF-8 mode, any number of hexadecimal digits may |
upper or lower case). Any number of hexadecimal digits may appear between \ex{ |
160 |
appear between \ex{ and }, but the value of the character code must be less |
and }, but the value of the character code must be less than 256 in non-UTF-8 |
161 |
than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters |
mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value |
162 |
other than hexadecimal digits appear between \ex{ and }, or if there is no |
is 7FFFFFFF). If characters other than hexadecimal digits appear between \ex{ |
163 |
terminating }, this form of escape is not recognized. Instead, the initial |
and }, or if there is no terminating }, this form of escape is not recognized. |
164 |
\ex will be interpreted as a basic hexadecimal escape, with no following |
Instead, the initial \ex will be interpreted as a basic hexadecimal escape, |
165 |
digits, giving a character whose value is zero. |
with no following digits, giving a character whose value is zero. |
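The parsing rules for the two \ex forms, as worded in the revised text, can be sketched as follows. This is a Python illustration rather than PCRE's implementation. The name parse_hex_escape, the error raised for an out-of-range value, and the treatment of empty braces are assumptions, since the text does not specify them.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_hex_escape(text, utf8=False):
    """Parse text that begins with a backslash followed by 'x'.

    Returns (character_value, characters_consumed).
    """
    if not text.startswith("\\x"):
        raise ValueError("text does not start with a hex escape")

    # Braced form: only hex digits between the braces, and a closing brace.
    if text.startswith("{", 2):
        close = text.find("}", 3)
        if close != -1:
            digits = text[3:close]
            if all(ch in HEX_DIGITS for ch in digits):
                # Assumption: empty braces give the value zero.
                value = int(digits, 16) if digits else 0
                limit = 2**31 if utf8 else 256
                if value >= limit:
                    # Assumption: an out-of-range value is an error.
                    raise ValueError("character value too large")
                return value, close + 1
        # Braced form not recognized: fall through to the basic form.

    # Basic form: from zero to two hex digits follow the 'x'.
    end = 2
    while end < len(text) and end < 4 and text[end] in HEX_DIGITS:
        end += 1
    value = int(text[2:end], 16) if end > 2 else 0
    return value, end


if __name__ == "__main__":
    print(parse_hex_escape("\\xdc"))
    print(parse_hex_escape("\\x{dc}"))
    print(parse_hex_escape("\\x{zz}"))
```

In the fallback case only the two characters of the basic escape are consumed, so the "{" and whatever follows it are left to be read as ordinary pattern text.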
166 |
.P |
.P |
167 |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
168 |
syntaxes for \ex when PCRE is in UTF-8 mode. There is no difference in the |
syntaxes for \ex. There is no difference in the way they are handled. For |
169 |
way they are handled. For example, \exdc is exactly the same as \ex{dc}. |
example, \exdc is exactly the same as \ex{dc}. |
170 |
.P |
.P |
171 |
After \e0 up to two further octal digits are read. In both cases, if there |
After \e0 up to two further octal digits are read. In both cases, if there |
172 |
are fewer than two digits, just those that are present are used. Thus the |
are fewer than two digits, just those that are present are used. Thus the |
272 |
.P |
.P |
273 |
In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or |
In UTF-8 mode, characters with values greater than 128 never match \ed, \es, or |
274 |
\ew, and always match \eD, \eS, and \eW. This is true even when Unicode |
\ew, and always match \eD, \eS, and \eW. This is true even when Unicode |
275 |
character property support is available. |
character property support is available. The use of locales with Unicode is |
276 |
|
discouraged. |
277 |
. |
. |
278 |
. |
. |
279 |
.\" HTML <a name="uniextseq"></a> |
.\" HTML <a name="uniextseq"></a> |
281 |
.rs |
.rs |
282 |
.sp |
.sp |
283 |
When PCRE is built with Unicode character property support, three additional |
When PCRE is built with Unicode character property support, three additional |
284 |
escape sequences to match generic character types are available when UTF-8 mode |
escape sequences to match character properties are available when UTF-8 mode |
285 |
is selected. They are: |
is selected. They are: |
286 |
.sp |
.sp |
287 |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
\ep{\fIxx\fP} a character with the \fIxx\fP property |
288 |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
\eP{\fIxx\fP} a character without the \fIxx\fP property |
289 |
\eX an extended Unicode sequence |
\eX an extended Unicode sequence |
290 |
.sp |
.sp |
291 |
The property names represented by \fIxx\fP above are limited to the |
The property names represented by \fIxx\fP above are limited to the Unicode |
292 |
Unicode general category properties. Each character has exactly one such |
script names, the general category properties, and "Any", which matches any |
293 |
property, specified by a two-letter abbreviation. For compatibility with Perl, |
character (including newline). Other properties such as "InMusicalSymbols" are |
294 |
negation can be specified by including a circumflex between the opening brace |
not currently supported by PCRE. Note that \eP{Any} does not match any |
295 |
and the property name. For example, \ep{^Lu} is the same as \eP{Lu}. |
characters, so always causes a match failure. |
296 |
.P |
.P |
297 |
If only one letter is specified with \ep or \eP, it includes all the properties |
Sets of Unicode characters are defined as belonging to certain scripts. A |
298 |
that start with that letter. In this case, in the absence of negation, the |
character from one of these sets can be matched using a script name. For |
299 |
curly brackets in the escape sequence are optional; these two examples have |
example: |
300 |
the same effect: |
.sp |
301 |
|
\ep{Greek} |
302 |
|
\eP{Han} |
303 |
|
.sp |
304 |
|
Those that are not part of an identified script are lumped together as |
305 |
|
"Common". The current list of scripts is: |
306 |
|
.P |
307 |
|
Arabic, |
308 |
|
Armenian, |
309 |
|
Bengali, |
310 |
|
Bopomofo, |
311 |
|
Braille, |
312 |
|
Buginese, |
313 |
|
Buhid, |
314 |
|
Canadian_Aboriginal, |
315 |
|
Cherokee, |
316 |
|
Common, |
317 |
|
Coptic, |
318 |
|
Cypriot, |
319 |
|
Cyrillic, |
320 |
|
Deseret, |
321 |
|
Devanagari, |
322 |
|
Ethiopic, |
323 |
|
Georgian, |
324 |
|
Glagolitic, |
325 |
|
Gothic, |
326 |
|
Greek, |
327 |
|
Gujarati, |
328 |
|
Gurmukhi, |
329 |
|
Han, |
330 |
|
Hangul, |
331 |
|
Hanunoo, |
332 |
|
Hebrew, |
333 |
|
Hiragana, |
334 |
|
Inherited, |
335 |
|
Kannada, |
336 |
|
Katakana, |
337 |
|
Kharoshthi, |
338 |
|
Khmer, |
339 |
|
Lao, |
340 |
|
Latin, |
341 |
|
Limbu, |
342 |
|
Linear_B, |
343 |
|
Malayalam, |
344 |
|
Mongolian, |
345 |
|
Myanmar, |
346 |
|
New_Tai_Lue, |
347 |
|
Ogham, |
348 |
|
Old_Italic, |
349 |
|
Old_Persian, |
350 |
|
Oriya, |
351 |
|
Osmanya, |
352 |
|
Runic, |
353 |
|
Shavian, |
354 |
|
Sinhala, |
355 |
|
Syloti_Nagri, |
356 |
|
Syriac, |
357 |
|
Tagalog, |
358 |
|
Tagbanwa, |
359 |
|
Tai_Le, |
360 |
|
Tamil, |
361 |
|
Telugu, |
362 |
|
Thaana, |
363 |
|
Thai, |
364 |
|
Tibetan, |
365 |
|
Tifinagh, |
366 |
|
Ugaritic, |
367 |
|
Yi. |
368 |
|
.P |
369 |
|
Each character has exactly one general category property, specified by a |
370 |
|
two-letter abbreviation. For compatibility with Perl, negation can be specified |
371 |
|
by including a circumflex between the opening brace and the property name. For |
372 |
|
example, \ep{^Lu} is the same as \eP{Lu}. |
373 |
|
.P |
374 |
|
If only one letter is specified with \ep or \eP, it includes all the general |
375 |
|
category properties that start with that letter. In this case, in the absence |
376 |
|
of negation, the curly brackets in the escape sequence are optional; these two |
377 |
|
examples have the same effect: |
378 |
.sp |
.sp |
379 |
\ep{L} |
\ep{L} |
380 |
\epL |
\epL |
381 |
.sp |
.sp |
382 |
The following property codes are supported: |
The following general category property codes are supported: |
383 |
.sp |
.sp |
384 |
C Other |
C Other |
385 |
Cc Control |
Cc Control |
425 |
Zp Paragraph separator |
Zp Paragraph separator |
426 |
Zs Space separator |
Zs Space separator |
427 |
.sp |
.sp |
428 |
Extended properties such as "Greek" or "InMusicalSymbols" are not supported by |
The special property L& is also supported: it matches a character that has |
429 |
PCRE. |
the Lu, Ll, or Lt property, in other words, a letter that is not classified as |
430 |
|
a modifier or "other". |
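The way one-letter codes, two-letter codes, and L& select general categories can be illustrated with Python's standard unicodedata module. This is only a sketch of the selection rule: the function name has_category is hypothetical, it does not use PCRE, and Python's Unicode tables may differ in version from those built into PCRE.

```python
import unicodedata


def has_category(ch, code):
    """Report whether the character ch has the general category property code."""
    category = unicodedata.category(ch)  # always a two-letter abbreviation
    if code == "L&":
        # The special property: Lu, Ll, or Lt only.
        return category in ("Lu", "Ll", "Lt")
    if len(code) == 1:
        # A single letter covers every category that starts with that letter.
        return category.startswith(code)
    return category == code


if __name__ == "__main__":
    # U+01C5 is a titlecase letter (Lt); U+02B0 is a modifier letter (Lm).
    for ch in ("A", "a", "\u01c5", "\u02b0"):
        print(repr(ch), has_category(ch, "L"), has_category(ch, "L&"))
```

Every character in the loop has the L property, but the modifier letter does not have L&, which is the distinction the text draws.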
431 |
|
.P |
432 |
|
The long synonyms for these properties that Perl supports (such as \ep{Letter}) |
433 |
|
are not supported by PCRE. Nor is it permitted to prefix any of these |
434 |
|
properties with "Is". |
435 |
|
.P |
436 |
|
No character that is in the Unicode table has the Cn (unassigned) property. |
437 |
|
Instead, this property is assumed for any code point that is not in the |
438 |
|
Unicode table. |
439 |
.P |
.P |
440 |
Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
441 |
example, \ep{Lu} always matches only upper case letters. |
example, \ep{Lu} always matches only upper case letters. |
1433 |
"subroutine" call, which is described in the next section.) The special item |
"subroutine" call, which is described in the next section.) The special item |
1434 |
(?R) is a recursive call of the entire regular expression. |
(?R) is a recursive call of the entire regular expression. |
1435 |
.P |
.P |
1436 |
For example, this PCRE pattern solves the nested parentheses problem (assume |
A recursive subpattern call is always treated as an atomic group. That is, once |
1437 |
the PCRE_EXTENDED option is set so that white space is ignored): |
it has matched some of the subject string, it is never re-entered, even if |
1438 |
|
it contains untried alternatives and there is a subsequent matching failure. |
1439 |
|
.P |
1440 |
|
This PCRE pattern solves the nested parentheses problem (assume the |
1441 |
|
PCRE_EXTENDED option is set so that white space is ignored): |
1442 |
.sp |
.sp |
1443 |
\e( ( (?>[^()]+) | (?R) )* \e) |
\e( ( (?>[^()]+) | (?R) )* \e) |
1444 |
.sp |
.sp |
1445 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
1446 |
substrings which can either be a sequence of non-parentheses, or a recursive |
substrings which can either be a sequence of non-parentheses, or a recursive |
1447 |
match of the pattern itself (that is a correctly parenthesized substring). |
match of the pattern itself (that is, a correctly parenthesized substring). |
1448 |
Finally there is a closing parenthesis. |
Finally there is a closing parenthesis. |
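The three steps just described can also be written out as an ordinary recursive function. The sketch below is a hand-written Python equivalent of the pattern, not a use of PCRE, and the name match_nested is hypothetical. It matches only at the given starting position, whereas a regular expression search would also try later positions.

```python
def match_nested(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    # First, an opening parenthesis.
    if pos >= len(subject) or subject[pos] != "(":
        return None
    i = pos + 1
    # Then any number of substrings, each either a run of non-parentheses
    # or a recursive match of the whole construct.
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            # Finally, a closing parenthesis.
            return i + 1
        if ch == "(":
            end = match_nested(subject, i)
            if end is None:
                return None
            i = end
        else:
            i += 1
    return None


if __name__ == "__main__":
    for s in ("(a(b)c)", "((()))x", "(a(b)c"):
        print(repr(s), match_nested(s))
```

The function never needs to back up, because a non-parenthesis character and an opening parenthesis can never be confused with each other. The same property is what lets the pattern wrap the run of non-parentheses in an atomic group.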
1449 |
.P |
.P |
1450 |
If this were part of a larger pattern, you would not want to recurse the entire |
If this were part of a larger pattern, you would not want to recurse the entire |
1528 |
is used, it does match "sense and responsibility" as well as the other two |
is used, it does match "sense and responsibility" as well as the other two |
1529 |
strings. Such references must, however, follow the subpattern to which they |
strings. Such references must, however, follow the subpattern to which they |
1530 |
refer. |
refer. |
1531 |
|
.P |
1532 |
|
Like recursive subpatterns, a "subroutine" call is always treated as an atomic |
1533 |
|
group. That is, once it has matched some of the subject string, it is never |
1534 |
|
re-entered, even if it contains untried alternatives and there is a subsequent |
1535 |
|
matching failure. |
1536 |
. |
. |
1537 |
. |
. |
1538 |
.SH CALLOUTS |
.SH CALLOUTS |
1571 |
documentation. |
documentation. |
1572 |
.P |
.P |
1573 |
.in 0 |
.in 0 |
1574 |
Last updated: 28 February 2005 |
Last updated: 24 January 2006 |
1575 |
.br |
.br |
1576 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |