175 |
\t tab (hex 09) |
\t tab (hex 09) |
176 |
\ddd character with octal code ddd, or backreference |
\ddd character with octal code ddd, or backreference |
177 |
\xhh character with hex code hh |
\xhh character with hex code hh |
178 |
\x{hhh..} character with hex code hhh... (UTF-8 mode only) |
\x{hhh..} character with hex code hhh.. |
179 |
</pre> |
</pre> |
180 |
The precise effect of \cx is as follows: if x is a lower case letter, it |
The precise effect of \cx is as follows: if x is a lower case letter, it |
181 |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
is converted to upper case. Then bit 6 of the character (hex 40) is inverted. |
184 |
</P> |
</P> |
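<P>
As a rough illustration of the \cx rule described above, the following Python
sketch reproduces the arithmetic only. The function name
<i>control_escape</i> is hypothetical and is not part of PCRE, and the sketch
assumes that x is a single ASCII character.
</P>

```python
def control_escape(x):
    """Return the character code that \\cx would stand for (ASCII input assumed)."""
    code = ord(x)
    # A lower case letter is first converted to upper case.
    if ord("a") <= code <= ord("z"):
        code -= 0x20
    # Then bit 6 of the character (hex 40) is inverted.
    return code ^ 0x40


for ch in "z{;":
    print("\\c%s -> hex %02X" % (ch, control_escape(ch)))
```

<P>
Under this rule \cz gives hex 1A, \c{ gives hex 3B, and \c; gives hex 7B.
</P>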
185 |
<P> |
<P> |
186 |
After \x, from zero to two hexadecimal digits are read (letters can be in |
After \x, from zero to two hexadecimal digits are read (letters can be in |
187 |
upper or lower case). In UTF-8 mode, any number of hexadecimal digits may |
upper or lower case). Any number of hexadecimal digits may appear between \x{ |
188 |
appear between \x{ and }, but the value of the character code must be less |
and }, but the value of the character code must be less than 256 in non-UTF-8 |
189 |
than 2**31 (that is, the maximum hexadecimal value is 7FFFFFFF). If characters |
mode, and less than 2**31 in UTF-8 mode (that is, the maximum hexadecimal value |
190 |
other than hexadecimal digits appear between \x{ and }, or if there is no |
is 7FFFFFFF). If characters other than hexadecimal digits appear between \x{ |
191 |
terminating }, this form of escape is not recognized. Instead, the initial |
and }, or if there is no terminating }, this form of escape is not recognized. |
192 |
\x will be interpreted as a basic hexadecimal escape, with no following |
Instead, the initial \x will be interpreted as a basic hexadecimal escape, |
193 |
digits, giving a character whose value is zero. |
with no following digits, giving a character whose value is zero. |
194 |
</P> |
</P> |
195 |
<P> |
<P> |
196 |
Characters whose value is less than 256 can be defined by either of the two |
Characters whose value is less than 256 can be defined by either of the two |
197 |
syntaxes for \x when PCRE is in UTF-8 mode. There is no difference in the |
syntaxes for \x. There is no difference in the way they are handled. For |
198 |
way they are handled. For example, \xdc is exactly the same as \x{dc}. |
example, \xdc is exactly the same as \x{dc}. |
199 |
</P> |
</P> |
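<P>
The \x rules above can be sketched as a small decoder. This is a minimal
Python illustration, not PCRE's own code: the name <i>parse_hex_escape</i> and
its parameters are hypothetical, raising an error for an over-large value is an
assumption (the text only says what the value must be), and an empty \x{} is
treated here as unrecognized because the text does not cover that case.
</P>

```python
HEX_DIGITS = "0123456789abcdefABCDEF"


def parse_hex_escape(pattern, pos, utf8=False):
    """Decode a \\x escape that starts at pattern[pos].

    Returns (character value, index of the first character after the escape).
    """
    assert pattern.startswith("\\x", pos)
    i = pos + 2
    # Try the \x{hhh..} form first.
    if pattern.startswith("{", i):
        j = i + 1
        while j < len(pattern) and pattern[j] in HEX_DIGITS:
            j += 1
        if j > i + 1 and j < len(pattern) and pattern[j] == "}":
            value = int(pattern[i + 1:j], 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                # Assumption: an out-of-range value is reported as an error.
                raise ValueError("character value too large")
            return value, j + 1
        # Non-hex characters or no terminating }: fall through, so that the
        # initial \x is read as a basic escape with no following digits.
        return 0, i
    # Basic form: from zero to two hexadecimal digits.
    j = i
    while j < len(pattern) and j < i + 2 and pattern[j] in HEX_DIGITS:
        j += 1
    value = int(pattern[i:j], 16) if j > i else 0
    return value, j


print(parse_hex_escape(r"\xdc", 0) == parse_hex_escape(r"\x{dc}", 0)[:1] + (4,))
```

<P>
In this sketch \xdc and \x{dc} decode to the same value, and a malformed
sequence such as \x{zz} yields the value zero with only the \x consumed.
</P>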
200 |
<P> |
<P> |
201 |
After \0 up to two further octal digits are read. In both cases, if there |
After \0 up to two further octal digits are read. In both cases, if there |
285 |
<P> |
<P> |
286 |
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
In UTF-8 mode, characters with values greater than 128 never match \d, \s, or |
287 |
\w, and always match \D, \S, and \W. This is true even when Unicode |
\w, and always match \D, \S, and \W. This is true even when Unicode |
288 |
character property support is available. |
character property support is available. The use of locales with Unicode is |
289 |
|
discouraged. |
290 |
<a name="uniextseq"></a></P> |
<a name="uniextseq"></a></P> |
291 |
<br><b> |
<br><b> |
292 |
Unicode character properties |
Unicode character properties |
293 |
</b><br> |
</b><br> |
294 |
<P> |
<P> |
295 |
When PCRE is built with Unicode character property support, three additional |
When PCRE is built with Unicode character property support, three additional |
296 |
escape sequences to match generic character types are available when UTF-8 mode |
escape sequences to match character properties are available when UTF-8 mode |
297 |
is selected. They are: |
is selected. They are: |
298 |
<pre> |
<pre> |
299 |
\p{<i>xx</i>} a character with the <i>xx</i> property |
\p{<i>xx</i>} a character with the <i>xx</i> property |
300 |
\P{<i>xx</i>} a character without the <i>xx</i> property |
\P{<i>xx</i>} a character without the <i>xx</i> property |
301 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
302 |
</pre> |
</pre> |
303 |
The property names represented by <i>xx</i> above are limited to the |
The property names represented by <i>xx</i> above are limited to the Unicode |
304 |
Unicode general category properties. Each character has exactly one such |
script names, the general category properties, and "Any", which matches any |
305 |
property, specified by a two-letter abbreviation. For compatibility with Perl, |
character (including newline). Other properties such as "InMusicalSymbols" are |
306 |
negation can be specified by including a circumflex between the opening brace |
not currently supported by PCRE. Note that \P{Any} does not match any |
307 |
and the property name. For example, \p{^Lu} is the same as \P{Lu}. |
characters, so always causes a match failure. |
308 |
</P> |
</P> |
309 |
<P> |
<P> |
310 |
If only one letter is specified with \p or \P, it includes all the properties |
Sets of Unicode characters are defined as belonging to certain scripts. A |
311 |
that start with that letter. In this case, in the absence of negation, the |
character from one of these sets can be matched using a script name. For |
312 |
curly brackets in the escape sequence are optional; these two examples have |
example: |
313 |
the same effect: |
<pre> |
314 |
|
\p{Greek} |
315 |
|
\P{Han} |
316 |
|
</pre> |
317 |
|
Those that are not part of an identified script are lumped together as |
318 |
|
"Common". The current list of scripts is: |
319 |
|
</P> |
320 |
|
<P> |
321 |
|
Arabic, |
322 |
|
Armenian, |
323 |
|
Bengali, |
324 |
|
Bopomofo, |
325 |
|
Braille, |
326 |
|
Buginese, |
327 |
|
Buhid, |
328 |
|
Canadian_Aboriginal, |
329 |
|
Cherokee, |
330 |
|
Common, |
331 |
|
Coptic, |
332 |
|
Cypriot, |
333 |
|
Cyrillic, |
334 |
|
Deseret, |
335 |
|
Devanagari, |
336 |
|
Ethiopic, |
337 |
|
Georgian, |
338 |
|
Glagolitic, |
339 |
|
Gothic, |
340 |
|
Greek, |
341 |
|
Gujarati, |
342 |
|
Gurmukhi, |
343 |
|
Han, |
344 |
|
Hangul, |
345 |
|
Hanunoo, |
346 |
|
Hebrew, |
347 |
|
Hiragana, |
348 |
|
Inherited, |
349 |
|
Kannada, |
350 |
|
Katakana, |
351 |
|
Kharoshthi, |
352 |
|
Khmer, |
353 |
|
Lao, |
354 |
|
Latin, |
355 |
|
Limbu, |
356 |
|
Linear_B, |
357 |
|
Malayalam, |
358 |
|
Mongolian, |
359 |
|
Myanmar, |
360 |
|
New_Tai_Lue, |
361 |
|
Ogham, |
362 |
|
Old_Italic, |
363 |
|
Old_Persian, |
364 |
|
Oriya, |
365 |
|
Osmanya, |
366 |
|
Runic, |
367 |
|
Shavian, |
368 |
|
Sinhala, |
369 |
|
Syloti_Nagri, |
370 |
|
Syriac, |
371 |
|
Tagalog, |
372 |
|
Tagbanwa, |
373 |
|
Tai_Le, |
374 |
|
Tamil, |
375 |
|
Telugu, |
376 |
|
Thaana, |
377 |
|
Thai, |
378 |
|
Tibetan, |
379 |
|
Tifinagh, |
380 |
|
Ugaritic, |
381 |
|
Yi. |
382 |
|
</P> |
383 |
|
<P> |
384 |
|
Each character has exactly one general category property, specified by a |
385 |
|
two-letter abbreviation. For compatibility with Perl, negation can be specified |
386 |
|
by including a circumflex between the opening brace and the property name. For |
387 |
|
example, \p{^Lu} is the same as \P{Lu}. |
388 |
|
</P> |
389 |
|
<P> |
390 |
|
If only one letter is specified with \p or \P, it includes all the general |
391 |
|
category properties that start with that letter. In this case, in the absence |
392 |
|
of negation, the curly brackets in the escape sequence are optional; these two |
393 |
|
examples have the same effect: |
394 |
<pre> |
<pre> |
395 |
\p{L} |
\p{L} |
396 |
\pL |
\pL |
397 |
</pre> |
</pre> |
398 |
The following property codes are supported: |
The following general category property codes are supported: |
399 |
<pre> |
<pre> |
400 |
C Other |
C Other |
401 |
Cc Control |
Cc Control |
441 |
Zp Paragraph separator |
Zp Paragraph separator |
442 |
Zs Space separator |
Zs Space separator |
443 |
</pre> |
</pre> |
444 |
Extended properties such as "Greek" or "InMusicalSymbols" are not supported by |
The special property L& is also supported: it matches a character that has |
445 |
PCRE. |
the Lu, Ll, or Lt property, in other words, a letter that is not classified as |
446 |
|
a modifier or "other". |
447 |
|
</P> |
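<P>
The general category rules above (two-letter codes, one-letter prefixes,
circumflex negation, and L&) can be mimicked with Python's standard
<i>unicodedata</i> module. This is only a sketch: the function name
<i>has_general_category</i> is hypothetical, it does not call PCRE, and
Python's Unicode tables may correspond to a different Unicode version from the
one built into PCRE.
</P>

```python
import unicodedata


def has_general_category(ch, prop):
    """Test ch against a category code such as 'Lu', 'L', '^Lu' or 'L&'."""
    negated = prop.startswith("^")
    if negated:
        prop = prop[1:]
    category = unicodedata.category(ch)  # two-letter code, for example 'Lu'
    if prop == "L&":
        # L& stands for a letter with the Lu, Ll, or Lt property.
        matched = category in ("Lu", "Ll", "Lt")
    else:
        # A one-letter code covers every category that starts with that letter.
        matched = category.startswith(prop)
    return matched != negated


print(has_general_category("A", "Lu"), has_general_category("A", "^Lu"))
```

<P>
Here has_general_category(ch, "Lu") plays the role of \p{Lu}, and passing
"^Lu" plays the role of \p{^Lu}, that is, \P{Lu}.
</P>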
448 |
|
<P> |
449 |
|
The long synonyms for these properties that Perl supports (such as \p{Letter}) |
450 |
|
are not supported by PCRE. Nor is it permitted to prefix any of these |
451 |
|
properties with "Is". |
452 |
|
</P> |
453 |
|
<P> |
454 |
|
No character that is in the Unicode table has the Cn (unassigned) property. |
455 |
|
Instead, this property is assumed for any code point that is not in the |
456 |
|
Unicode table. |
457 |
</P> |
</P> |
458 |
<P> |
<P> |
459 |
Specifying caseless matching does not affect these escape sequences. For |
Specifying caseless matching does not affect these escape sequences. For |
1452 |
(?R) is a recursive call of the entire regular expression. |
(?R) is a recursive call of the entire regular expression. |
1453 |
</P> |
</P> |
1454 |
<P> |
<P> |
1455 |
For example, this PCRE pattern solves the nested parentheses problem (assume |
A recursive subpattern call is always treated as an atomic group. That is, once |
1456 |
the PCRE_EXTENDED option is set so that white space is ignored): |
it has matched some of the subject string, it is never re-entered, even if |
1457 |
|
it contains untried alternatives and there is a subsequent matching failure. |
1458 |
|
</P> |
1459 |
|
<P> |
1460 |
|
This PCRE pattern solves the nested parentheses problem (assume the |
1461 |
|
PCRE_EXTENDED option is set so that white space is ignored): |
1462 |
<pre> |
<pre> |
1463 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
1464 |
</pre> |
</pre> |
1465 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
1466 |
substrings which can either be a sequence of non-parentheses, or a recursive |
substrings which can either be a sequence of non-parentheses, or a recursive |
1467 |
match of the pattern itself (that is a correctly parenthesized substring). |
match of the pattern itself (that is, a correctly parenthesized substring). |
1468 |
Finally there is a closing parenthesis. |
Finally there is a closing parenthesis. |
1469 |
</P> |
</P> |
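<P>
To make the structure of that pattern concrete, here is a plain Python
function that follows the same three steps by hand. It does not use PCRE or
any regular expression engine. The name <i>match_nested</i> is hypothetical,
and the function only tries to match at the given position rather than
searching along the subject.
</P>

```python
def match_nested(subject, pos=0):
    """Return the index just past a balanced (...) group at pos, or None."""
    # \(  : an opening parenthesis
    if pos >= len(subject) or subject[pos] != "(":
        return None
    i = pos + 1
    # ( ... )*  : any number of substrings
    while i < len(subject):
        ch = subject[i]
        if ch == ")":
            # \)  : the closing parenthesis
            return i + 1
        if ch == "(":
            # (?R)  : a recursive match of the whole pattern
            end = match_nested(subject, i)
            if end is None:
                return None
            i = end
        else:
            # (?>[^()]+)  : a run of non-parentheses, taken all at once
            while i < len(subject) and subject[i] not in "()":
                i += 1
    return None  # ran out of subject before the closing parenthesis


print(match_nested("(ab(cd)ef)"), match_nested("(ab(cd)ef"))
```

<P>
The inner run is consumed in one step and never given back, which mirrors the
atomic group in the pattern.
</P>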
1470 |
<P> |
<P> |
1547 |
strings. Such references must, however, follow the subpattern to which they |
strings. Such references must, however, follow the subpattern to which they |
1548 |
refer. |
refer. |
1549 |
</P> |
</P> |
1550 |
|
<P> |
1551 |
|
Like recursive subpatterns, a "subroutine" call is always treated as an atomic |
1552 |
|
group. That is, once it has matched some of the subject string, it is never |
1553 |
|
re-entered, even if it contains untried alternatives and there is a subsequent |
1554 |
|
matching failure. |
1555 |
|
</P> |
1556 |
<br><a name="SEC20" href="#TOC1">CALLOUTS</a><br> |
<br><a name="SEC20" href="#TOC1">CALLOUTS</a><br> |
1557 |
<P> |
<P> |
1558 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl |
1589 |
documentation. |
documentation. |
1590 |
</P> |
</P> |
1591 |
<P> |
<P> |
1592 |
Last updated: 28 February 2005 |
Last updated: 24 January 2006 |
1593 |
<br> |
<br> |
1594 |
Copyright © 1997-2005 University of Cambridge. |
Copyright © 1997-2006 University of Cambridge. |
1595 |
<p> |
<p> |
1596 |
Return to the <a href="index.html">PCRE index page</a>. |
Return to the <a href="index.html">PCRE index page</a>. |
1597 |
</p> |
</p> |