116 |
Unicode property support |
Unicode property support |
117 |
</b><br> |
</b><br> |
118 |
<P> |
<P> |
119 |
Another special sequence that may appear at the start of a pattern is |
Another special sequence that may appear at the start of a pattern is (*UCP). |
|
<pre> |
|
|
(*UCP) |
|
|
</pre> |
|
120 |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
This has the same effect as setting the PCRE_UCP option: it causes sequences |
121 |
such as \d and \w to use Unicode properties to determine character types, |
such as \d and \w to use Unicode properties to determine character types, |
122 |
instead of recognizing only characters with codes less than 128 via a lookup |
instead of recognizing only characters with codes less than 128 via a lookup |
123 |
table. |
table. |
124 |
</P> |
</P> |
125 |
<br><b> |
<br><b> |
126 |
|
Disabling auto-possessification |
127 |
|
</b><br> |
128 |
|
<P> |
129 |
|
If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting |
130 |
|
the PCRE_NO_AUTO_POSSESS option at compile time. This stops PCRE from making |
131 |
|
quantifiers possessive when what follows cannot match the repeated item. For |
132 |
|
example, by default a+b is treated as a++b. For more details, see the |
133 |
|
<a href="pcreapi.html"><b>pcreapi</b></a> |
134 |
|
documentation. |
135 |
|
</P> |
136 |
|
<br><b> |
137 |
Disabling start-up optimizations |
Disabling start-up optimizations |
138 |
</b><br> |
</b><br> |
139 |
<P> |
<P> |
140 |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
If a pattern starts with (*NO_START_OPT), it has the same effect as setting the |
141 |
PCRE_NO_START_OPTIMIZE option either at compile or matching time. |
PCRE_NO_START_OPTIMIZE option either at compile or matching time. This disables |
142 |
|
several optimizations for quickly reaching "no match" results. For more |
143 |
|
details, see the |
144 |
|
<a href="pcreapi.html"><b>pcreapi</b></a> |
145 |
|
documentation. |
146 |
<a name="newlines"></a></P> |
<a name="newlines"></a></P> |
147 |
<br><b> |
<br><b> |
148 |
Newline conventions |
Newline conventions |
205 |
(*LIMIT_RECURSION=d) |
(*LIMIT_RECURSION=d) |
206 |
</pre> |
</pre> |
207 |
where d is any number of decimal digits. However, the value of the setting must |
where d is any number of decimal digits. However, the value of the setting must |
208 |
be less than the value set by the caller of <b>pcre_exec()</b> for it to have |
be less than the value set (or defaulted) by the caller of <b>pcre_exec()</b> |
209 |
any effect. In other words, the pattern writer can lower the limit set by the |
for it to have any effect. In other words, the pattern writer can lower the |
210 |
programmer, but not raise it. If there is more than one setting of one of these |
limits set by the programmer, but not raise them. If there is more than one |
211 |
limits, the lower value is used. |
setting of one of these limits, the lower value is used. |
212 |
</P> |
</P> |
213 |
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br> |
<br><a name="SEC3" href="#TOC1">EBCDIC CHARACTER CODES</a><br> |
214 |
<P> |
<P> |
295 |
greater than 127) are treated as literals. |
greater than 127) are treated as literals. |
296 |
</P> |
</P> |
297 |
<P> |
<P> |
298 |
If a pattern is compiled with the PCRE_EXTENDED option, white space in the |
If a pattern is compiled with the PCRE_EXTENDED option, most white space in the |
299 |
pattern (other than in a character class) and characters between a # outside |
pattern (other than in a character class), and characters between a # outside a |
300 |
a character class and the next newline are ignored. An escaping backslash can |
character class and the next newline, inclusive, are ignored. An escaping |
301 |
be used to include a white space or # character as part of the pattern. |
backslash can be used to include a white space or # character as part of the |
302 |
|
pattern. |
303 |
</P> |
</P> |
304 |
<P> |
<P> |
305 |
If you want to remove the special meaning from a sequence of characters, you |
If you want to remove the special meaning from a sequence of characters, you |
337 |
\n linefeed (hex 0A) |
\n linefeed (hex 0A) |
338 |
\r carriage return (hex 0D) |
\r carriage return (hex 0D) |
339 |
\t tab (hex 09) |
\t tab (hex 09) |
340 |
|
\0dd character with octal code 0dd |
341 |
\ddd character with octal code ddd, or back reference |
\ddd character with octal code ddd, or back reference |
342 |
|
\o{ddd..} character with octal code ddd.. |
343 |
\xhh character with hex code hh |
\xhh character with hex code hh |
344 |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
\x{hhh..} character with hex code hhh.. (non-JavaScript mode) |
345 |
\uhhhh character with hex code hhhh (JavaScript mode only) |
\uhhhh character with hex code hhhh (JavaScript mode only) |
362 |
characters also generate different values. |
characters also generate different values. |
363 |
</P> |
</P> |
364 |
<P> |
<P> |
|
By default, after \x, from zero to two hexadecimal digits are read (letters |
|
|
can be in upper or lower case). Any number of hexadecimal digits may appear |
|
|
between \x{ and }, but the character code is constrained as follows: |
|
|
<pre> |
|
|
8-bit non-UTF mode less than 0x100 |
|
|
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
|
|
16-bit non-UTF mode less than 0x10000 |
|
|
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
|
|
32-bit non-UTF mode less than 0x80000000 |
|
|
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
|
|
</pre> |
|
|
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
|
|
"surrogate" codepoints), and 0xffef. |
|
|
</P> |
|
|
<P> |
|
|
If characters other than hexadecimal digits appear between \x{ and }, or if |
|
|
there is no terminating }, this form of escape is not recognized. Instead, the |
|
|
initial \x will be interpreted as a basic hexadecimal escape, with no |
|
|
following digits, giving a character whose value is zero. |
|
|
</P> |
|
|
<P> |
|
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is |
|
|
as just described only when it is followed by two hexadecimal digits. |
|
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
|
|
code points greater than 256 is provided by \u, which must be followed by |
|
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
|
|
Character codes specified by \u in JavaScript mode are constrained in the same |
|
|
way as those specified by \x in non-JavaScript mode. |
|
|
</P> |
|
|
<P> |
|
|
Characters whose value is less than 256 can be defined by either of the two |
|
|
syntaxes for \x (or by \u in JavaScript mode). There is no difference in the |
|
|
way they are handled. For example, \xdc is exactly the same as \x{dc} (or |
|
|
\u00dc in JavaScript mode). |
|
|
</P> |
|
|
<P> |
|
365 |
After \0 up to two further octal digits are read. If there are fewer than two |
After \0 up to two further octal digits are read. If there are fewer than two |
366 |
digits, just those that are present are used. Thus the sequence \0\x\07 |
digits, just those that are present are used. Thus the sequence \0\x\07 |
367 |
specifies two binary zeros followed by a BEL character (code value 7). Make |
specifies two binary zeros followed by a BEL character (code value 7). Make |
369 |
follows is itself an octal digit. |
follows is itself an octal digit. |
370 |
</P> |
</P> |
371 |
<P> |
<P> |
372 |
The handling of a backslash followed by a digit other than 0 is complicated. |
The escape \o must be followed by a sequence of octal digits, enclosed in |
373 |
Outside a character class, PCRE reads it and any following digits as a decimal |
braces. An error occurs if this is not the case. This escape is a recent |
374 |
number. If the number is less than 10, or if there have been at least that many |
addition to Perl; it provides a way of specifying character code points as octal |
375 |
|
numbers greater than 0777, and it also allows octal numbers and back references |
376 |
|
to be unambiguously specified. |
377 |
|
</P> |
378 |
|
<P> |
379 |
|
For greater clarity and unambiguity, it is best to avoid following \ by a |
380 |
|
digit greater than zero. Instead, use \o{} or \x{} to specify character |
381 |
|
numbers, and \g{} to specify back references. The following paragraphs |
382 |
|
describe the old, ambiguous syntax. |
383 |
|
</P> |
384 |
|
<P> |
385 |
|
The handling of a backslash followed by a digit other than 0 is complicated, |
386 |
|
and Perl has changed in recent releases, causing PCRE also to change. Outside a |
387 |
|
character class, PCRE reads the digit and any following digits as a decimal |
388 |
|
number. If the number is less than 8, or if there have been at least that many |
389 |
previous capturing left parentheses in the expression, the entire sequence is |
previous capturing left parentheses in the expression, the entire sequence is |
390 |
taken as a <i>back reference</i>. A description of how this works is given |
taken as a <i>back reference</i>. A description of how this works is given |
391 |
<a href="#backreferences">later,</a> |
<a href="#backreferences">later,</a> |
393 |
<a href="#subpattern">parenthesized subpatterns.</a> |
<a href="#subpattern">parenthesized subpatterns.</a> |
394 |
</P> |
</P> |
395 |
<P> |
<P> |
396 |
Inside a character class, or if the decimal number is greater than 9 and there |
Inside a character class, or if the decimal number following \ is greater than |
397 |
have not been that many capturing subpatterns, PCRE re-reads up to three octal |
7 and there have not been that many capturing subpatterns, PCRE handles \8 and |
398 |
digits following the backslash, and uses them to generate a data character. Any |
\9 as the literal characters "8" and "9", and otherwise re-reads up to three |
399 |
subsequent digits stand for themselves. The value of the character is |
octal digits following the backslash, using them to generate a data character. |
400 |
constrained in the same way as characters specified in hexadecimal. |
Any subsequent digits stand for themselves. For example: |
|
For example: |
|
401 |
<pre> |
<pre> |
402 |
\040 is another way of writing an ASCII space |
\040 is another way of writing an ASCII space |
403 |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
\40 is the same, provided there are fewer than 40 previous capturing subpatterns |
407 |
\0113 is a tab followed by the character "3" |
\0113 is a tab followed by the character "3" |
408 |
\113 might be a back reference, otherwise the character with octal code 113 |
\113 might be a back reference, otherwise the character with octal code 113 |
409 |
\377 might be a back reference, otherwise the value 255 (decimal) |
\377 might be a back reference, otherwise the value 255 (decimal) |
410 |
\81 is either a back reference, or a binary zero followed by the two characters "8" and "1" |
\81 is either a back reference, or the two characters "8" and "1" |
411 |
|
</pre> |
412 |
|
Note that octal values of 100 or greater that are specified using this syntax |
413 |
|
must not be introduced by a leading zero, because no more than three octal |
414 |
|
digits are ever read. |
415 |
|
</P> |
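The disambiguation rules for a backslash followed by digits can be sketched in a few lines of Python. This is only an illustration of the rules as worded in the revised text above, not PCRE's actual implementation: the function name `interpret_backslash_digits` and its return shape are hypothetical, and the sketch ignores the per-mode constraints on character values.

```python
OCTAL_DIGITS = "01234567"


def interpret_backslash_digits(digits, captures, in_class=False):
    """Classify the run of decimal digits that follows a backslash.

    digits   -- the digits after the backslash, e.g. "40" for \\40
    captures -- number of capturing left parentheses seen so far
    in_class -- True when the escape is inside a character class

    Returns ("backref", n) or ("chars", text). Hypothetical helper that
    mirrors the documented rules only; value limits are not checked.
    """
    if not digits or any(d not in "0123456789" for d in digits):
        raise ValueError("expected one or more decimal digits")

    def octal_then_literals(max_octal):
        # Read up to max_octal octal digits; the rest stand for themselves.
        i = 0
        while i < max_octal and i < len(digits) and digits[i] in OCTAL_DIGITS:
            i += 1
        return ("chars", chr(int(digits[:i], 8)) + digits[i:])

    if digits[0] == "0":
        # \0 plus up to two further octal digits.
        return octal_then_literals(3)

    if not in_class:
        number = int(digits)
        if number < 8 or number <= captures:
            return ("backref", number)

    if digits[0] in "89":
        # \8 and \9 are literal characters; following digits are literal too.
        return ("chars", digits)

    return octal_then_literals(3)


print(interpret_backslash_digits("40", 0))    # treated as octal 40
print(interpret_backslash_digits("40", 40))   # treated as a back reference
print(interpret_backslash_digits("0113", 0))  # tab followed by "3"
```

Passing the capture count explicitly keeps the sketch independent of any real pattern parser; a caller would count capturing left parentheses before the escape and pass `in_class=True` inside square brackets.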
416 |
|
<P> |
417 |
|
By default, after \x that is not followed by {, from zero to two hexadecimal |
418 |
|
digits are read (letters can be in upper or lower case). Any number of |
419 |
|
hexadecimal digits may appear between \x{ and }. If a character other than |
420 |
|
a hexadecimal digit appears between \x{ and }, or if there is no terminating |
421 |
|
}, an error occurs. |
422 |
|
</P> |
423 |
|
<P> |
424 |
|
If the PCRE_JAVASCRIPT_COMPAT option is set, the interpretation of \x is |
425 |
|
as just described only when it is followed by two hexadecimal digits. |
426 |
|
Otherwise, it matches a literal "x" character. In JavaScript mode, support for |
427 |
|
code points greater than 256 is provided by \u, which must be followed by |
428 |
|
four hexadecimal digits; otherwise it matches a literal "u" character. |
429 |
|
</P> |
430 |
|
<P> |
431 |
|
Characters whose value is less than 256 can be defined by either of the two |
432 |
|
syntaxes for \x (or by \u in JavaScript mode). There is no difference in the |
433 |
|
way they are handled. For example, \xdc is exactly the same as \x{dc} (or |
434 |
|
\u00dc in JavaScript mode). |
435 |
|
</P> |
436 |
|
<br><b> |
437 |
|
Constraints on character values |
438 |
|
</b><br> |
439 |
|
<P> |
440 |
|
Characters that are specified using octal or hexadecimal numbers are |
441 |
|
limited to certain values, as follows: |
442 |
|
<pre> |
443 |
|
8-bit non-UTF mode less than 0x100 |
444 |
|
8-bit UTF-8 mode less than 0x10ffff and a valid codepoint |
445 |
|
16-bit non-UTF mode less than 0x10000 |
446 |
|
16-bit UTF-16 mode less than 0x10ffff and a valid codepoint |
447 |
|
32-bit non-UTF mode less than 0x100000000 |
448 |
|
32-bit UTF-32 mode less than 0x10ffff and a valid codepoint |
449 |
</pre> |
</pre> |
450 |
Note that octal values of 100 or greater must not be introduced by a leading |
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called |
451 |
zero, because no more than three octal digits are ever read. |
"surrogate" codepoints), and 0xffef. |
452 |
</P> |
</P> |
453 |
|
<br><b> |
454 |
|
Escape sequences in character classes |
455 |
|
</b><br> |
456 |
<P> |
<P> |
457 |
All the sequences that define a single character value can be used both inside |
All the sequences that define a single character value can be used both inside |
458 |
and outside character classes. In addition, inside a character class, \b is |
and outside character classes. In addition, inside a character class, \b is |
531 |
there is no character to match. |
there is no character to match. |
532 |
</P> |
</P> |
533 |
<P> |
<P> |
534 |
For compatibility with Perl, \s does not match the VT character (code 11). |
For compatibility with Perl, \s did not formerly match the VT character (code |
535 |
This makes it different from the POSIX "space" class. The \s characters |
11), which made it different from the POSIX "space" class. However, Perl |
536 |
are HT (9), LF (10), FF (12), CR (13), and space (32). If "use locale;" is |
added VT at release 5.18, and PCRE followed suit at release 8.34. The default |
537 |
included in a Perl script, \s may match the VT character. In PCRE, it never |
\s characters are now HT (9), LF (10), VT (11), FF (12), CR (13), and space |
538 |
does. |
(32), which are defined as white space in the "C" locale. This list may vary if |
539 |
|
locale-specific matching is taking place; in particular, in some locales the |
540 |
|
"non-breaking space" character (\xA0) is recognized as white space. |
541 |
</P> |
</P> |
542 |
<P> |
<P> |
543 |
A "word" character is an underscore or any character that is a letter or digit. |
A "word" character is an underscore or any character that is a letter or digit. |
548 |
in the |
in the |
549 |
<a href="pcreapi.html"><b>pcreapi</b></a> |
<a href="pcreapi.html"><b>pcreapi</b></a> |
550 |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
page). For example, in a French locale such as "fr_FR" in Unix-like systems, |
551 |
or "french" in Windows, some character codes greater than 128 are used for |
or "french" in Windows, some character codes greater than 127 are used for |
552 |
accented letters, and these are then matched by \w. The use of locales with |
accented letters, and these are then matched by \w. The use of locales with |
553 |
Unicode is discouraged. |
Unicode is discouraged. |
554 |
</P> |
</P> |
555 |
<P> |
<P> |
556 |
By default, in a UTF mode, characters with values greater than 128 never match |
By default, characters whose code points are greater than 127 never match \d, |
557 |
\d, \s, or \w, and always match \D, \S, and \W. These sequences retain |
\s, or \w, and always match \D, \S, and \W, although this may vary for |
558 |
their original meanings from before UTF support was available, mainly for |
characters in the range 128-255 when locale-specific matching is happening. |
559 |
efficiency reasons. However, if PCRE is compiled with Unicode property support, |
These escape sequences retain their original meanings from before Unicode |
560 |
and the PCRE_UCP option is set, the behaviour is changed so that Unicode |
support was available, mainly for efficiency reasons. If PCRE is compiled with |
561 |
properties are used to determine character types, as follows: |
Unicode property support, and the PCRE_UCP option is set, the behaviour is |
562 |
<pre> |
changed so that Unicode properties are used to determine character types, as |
563 |
\d any character that \p{Nd} matches (decimal digit) |
follows: |
564 |
\s any character that \p{Z} matches, plus HT, LF, FF, CR |
<pre> |
565 |
\w any character that \p{L} or \p{N} matches, plus underscore |
\d any character that matches \p{Nd} (decimal digit) |
566 |
|
\s any character that matches \p{Z} or \h or \v |
567 |
|
\w any character that matches \p{L} or \p{N}, plus underscore |
568 |
</pre> |
</pre> |
569 |
The upper case escapes match the inverse sets of characters. Note that \d |
The upper case escapes match the inverse sets of characters. Note that \d |
570 |
matches only decimal digits, whereas \w matches any Unicode digit, as well as |
matches only decimal digits, whereas \w matches any Unicode digit, as well as |
575 |
<P> |
<P> |
576 |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
The sequences \h, \H, \v, and \V are features that were added to Perl at |
577 |
release 5.10. In contrast to the other sequences, which match only ASCII |
release 5.10. In contrast to the other sequences, which match only ASCII |
578 |
characters by default, these always match certain high-valued codepoints, |
characters by default, these always match certain high-valued code points, |
579 |
whether or not PCRE_UCP is set. The horizontal space characters are: |
whether or not PCRE_UCP is set. The horizontal space characters are: |
580 |
<pre> |
<pre> |
581 |
U+0009 Horizontal tab (HT) |
U+0009 Horizontal tab (HT) |
950 |
<P> |
<P> |
951 |
As well as the standard Unicode properties described above, PCRE supports four |
As well as the standard Unicode properties described above, PCRE supports four |
952 |
more that make it possible to convert traditional escape sequences such as \w |
more that make it possible to convert traditional escape sequences such as \w |
953 |
and \s and POSIX character classes to use Unicode properties. PCRE uses these |
and \s to use Unicode properties. PCRE uses these non-standard, non-Perl |
954 |
non-standard, non-Perl properties internally when PCRE_UCP is set. However, |
properties internally when PCRE_UCP is set. However, they may also be used |
955 |
they may also be used explicitly. These properties are: |
explicitly. These properties are: |
956 |
<pre> |
<pre> |
957 |
Xan Any alphanumeric character |
Xan Any alphanumeric character |
958 |
Xps Any POSIX space character |
Xps Any POSIX space character |
962 |
Xan matches characters that have either the L (letter) or the N (number) |
Xan matches characters that have either the L (letter) or the N (number) |
963 |
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
property. Xps matches the characters tab, linefeed, vertical tab, form feed, or |
964 |
carriage return, and any other character that has the Z (separator) property. |
carriage return, and any other character that has the Z (separator) property. |
965 |
Xsp is the same as Xps, except that vertical tab is excluded. Xwd matches the |
Xsp is the same as Xps; it used to exclude vertical tab, for Perl |
966 |
same characters as Xan, plus underscore. |
compatibility, but Perl changed, and so PCRE followed at release 8.34. Xwd |
967 |
|
matches the same characters as Xan, plus underscore. |
968 |
</P> |
</P> |
969 |
<P> |
<P> |
970 |
There is another non-standard property, Xuc, which matches any character that |
There is another non-standard property, Xuc, which matches any character that |
1256 |
character class. For example, [d-m] matches any letter between d and m, |
character class. For example, [d-m] matches any letter between d and m, |
1257 |
inclusive. If a minus character is required in a class, it must be escaped with |
inclusive. If a minus character is required in a class, it must be escaped with |
1258 |
a backslash or appear in a position where it cannot be interpreted as |
a backslash or appear in a position where it cannot be interpreted as |
1259 |
indicating a range, typically as the first or last character in the class. |
indicating a range, typically as the first or last character in the class, or |
1260 |
|
immediately after a range. For example, [b-d-z] matches letters in the range b |
1261 |
|
to d, a hyphen character, or z. |
1262 |
</P> |
</P> |
1263 |
<P> |
<P> |
1264 |
It is not possible to have the literal character "]" as the end character of a |
It is not possible to have the literal character "]" as the end character of a |
1270 |
"]" can also be used to end a range. |
"]" can also be used to end a range. |
1271 |
</P> |
</P> |
1272 |
<P> |
<P> |
1273 |
|
An error is generated if a POSIX character class (see below) or an escape |
1274 |
|
sequence other than one that defines a single character appears at a point |
1275 |
|
where a range ending character is expected. For example, [z-\xff] is valid, |
1276 |
|
but [A-\d] and [A-[:digit:]] are not. |
1277 |
|
</P> |
1278 |
|
<P> |
1279 |
Ranges operate in the collating sequence of character values. They can also be |
Ranges operate in the collating sequence of character values. They can also be |
1280 |
used for characters specified numerically, for example [\000-\037]. Ranges |
used for characters specified numerically, for example [\000-\037]. Ranges |
1281 |
can include any characters that are valid for the current mode. |
can include any characters that are valid for the current mode. |
1340 |
lower lower case letters |
lower lower case letters |
1341 |
print printing characters, including space |
print printing characters, including space |
1342 |
punct printing characters, excluding letters and digits and space |
punct printing characters, excluding letters and digits and space |
1343 |
space white space (not quite the same as \s) |
space white space (the same as \s from PCRE 8.34) |
1344 |
upper upper case letters |
upper upper case letters |
1345 |
word "word" characters (same as \w) |
word "word" characters (same as \w) |
1346 |
xdigit hexadecimal digits |
xdigit hexadecimal digits |
1347 |
</pre> |
</pre> |
1348 |
The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), and |
The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), |
1349 |
space (32). Notice that this list includes the VT character (code 11). This |
and space (32). If locale-specific matching is taking place, there may be |
1350 |
makes "space" different to \s, which does not include VT (for Perl |
additional space characters. "Space" used to be different to \s, which did not |
1351 |
compatibility). |
include VT, for Perl compatibility. However, Perl changed at release 5.18, and |
1352 |
|
PCRE followed at release 8.34. "Space" and \s now match the same set of |
1353 |
|
characters. |
1354 |
</P> |
</P> |
1355 |
<P> |
<P> |
1356 |
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl |
The name "word" is a Perl extension, and "blank" is a GNU extension from Perl |
1364 |
supported, and an error is given if they are encountered. |
supported, and an error is given if they are encountered. |
1365 |
</P> |
</P> |
1366 |
<P> |
<P> |
1367 |
By default, in UTF modes, characters with values greater than 128 do not match |
By default, characters with values greater than 127 do not match any of the |
1368 |
any of the POSIX character classes. However, if the PCRE_UCP option is passed |
POSIX character classes. However, if the PCRE_UCP option is passed to |
1369 |
to <b>pcre_compile()</b>, some of the classes are changed so that Unicode |
<b>pcre_compile()</b>, some of the classes are changed so that Unicode character |
1370 |
character properties are used. This is achieved by replacing the POSIX classes |
properties are used. This is achieved by replacing certain POSIX classes by |
1371 |
by other sequences, as follows: |
other sequences, as follows: |
1372 |
<pre> |
<pre> |
1373 |
[:alnum:] becomes \p{Xan} |
[:alnum:] becomes \p{Xan} |
1374 |
[:alpha:] becomes \p{L} |
[:alpha:] becomes \p{L} |
1379 |
[:upper:] becomes \p{Lu} |
[:upper:] becomes \p{Lu} |
1380 |
[:word:] becomes \p{Xwd} |
[:word:] becomes \p{Xwd} |
1381 |
</pre> |
</pre> |
1382 |
Negated versions, such as [:^alpha:] use \P instead of \p. The other POSIX |
Negated versions, such as [:^alpha:] use \P instead of \p. Three other POSIX |
1383 |
classes are unchanged, and match only characters with code points less than |
classes are handled specially in UCP mode: |
1384 |
128. |
</P> |
1385 |
|
<P> |
1386 |
|
[:graph:] |
1387 |
|
This matches characters that have glyphs that mark the page when printed. In |
1388 |
|
Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf |
1389 |
|
properties, except for: |
1390 |
|
<pre> |
1391 |
|
U+061C Arabic Letter Mark |
1392 |
|
U+180E Mongolian Vowel Separator |
1393 |
|
U+2066 - U+2069 Various "isolate"s |
1394 |
|
|
1395 |
|
</PRE> |
1396 |
|
</P> |
1397 |
|
<P> |
1398 |
|
[:print:] |
1399 |
|
This matches the same characters as [:graph:] plus space characters that are |
1400 |
|
not controls, that is, characters with the Zs property. |
1401 |
|
</P> |
1402 |
|
<P> |
1403 |
|
[:punct:] |
1404 |
|
This matches all characters that have the Unicode P (punctuation) property, |
1405 |
|
plus those characters whose code points are less than 128 that have the S |
1406 |
|
(Symbol) property. |
1407 |
|
</P> |
1408 |
|
<P> |
1409 |
|
The other POSIX classes are unchanged, and match only characters with code |
1410 |
|
points less than 128. |
1411 |
</P> |
</P> |
1412 |
<br><a name="SEC11" href="#TOC1">VERTICAL BAR</a><br> |
<br><a name="SEC11" href="#TOC1">VERTICAL BAR</a><br> |
1413 |
<P> |
<P> |
1609 |
can be made by name as well as by number. |
can be made by name as well as by number. |
1610 |
</P> |
</P> |
1611 |
<P> |
<P> |
1612 |
Names consist of up to 32 alphanumeric characters and underscores. Named |
Names consist of up to 32 alphanumeric characters and underscores, but must |
1613 |
capturing parentheses are still allocated numbers as well as names, exactly as |
start with a non-digit. Named capturing parentheses are still allocated numbers |
1614 |
if the names were not present. The PCRE API provides function calls for |
as well as names, exactly as if the names were not present. The PCRE API |
1615 |
extracting the name-to-number translation table from a compiled pattern. There |
provides function calls for extracting the name-to-number translation table |
1616 |
is also a convenience function for extracting a captured substring by name. |
from a compiled pattern. There is also a convenience function for extracting a |
1617 |
|
captured substring by name. |
1618 |
</P> |
</P> |
1619 |
<P> |
<P> |
1620 |
By default, a name must be unique within a pattern, but it is possible to relax |
By default, a name must be unique within a pattern, but it is possible to relax |
1643 |
</P> |
</P> |
1644 |
<P> |
<P> |
1645 |
If you make a back reference to a non-unique named subpattern from elsewhere in |
If you make a back reference to a non-unique named subpattern from elsewhere in |
1646 |
the pattern, the one that corresponds to the first occurrence of the name is |
the pattern, the subpatterns to which the name refers are checked in the order |
1647 |
used. In the absence of duplicate numbers (see the previous section) this is |
in which they appear in the overall pattern. The first one that is set is used |
1648 |
the one with the lowest number. If you use a named reference in a condition |
for the reference. For example, this pattern matches both "foofoo" and |
1649 |
|
"barbar" but not "foobar" or "barfoo": |
1650 |
|
<pre> |
1651 |
|
(?:(?<n>foo)|(?<n>bar))\k<n> |
1652 |
|
|
1653 |
|
</PRE> |
1654 |
|
</P> |
1655 |
|
<P> |
1656 |
|
If you make a subroutine call to a non-unique named subpattern, the one that |
1657 |
|
corresponds to the first occurrence of the name is used. In the absence of |
1658 |
|
duplicate numbers (see the previous section) this is the one with the lowest |
1659 |
|
number. |
1660 |
|
</P> |
1661 |
|
<P> |
1662 |
|
If you use a named reference in a condition |
1663 |
test (see the |
test (see the |
1664 |
<a href="#conditions">section about conditions</a> |
<a href="#conditions">section about conditions</a> |
1665 |
below), either to check whether a subpattern has matched, or to check for |
below), either to check whether a subpattern has matched, or to check for |
1674 |
<b>Warning:</b> You cannot use different names to distinguish between two |
<b>Warning:</b> You cannot use different names to distinguish between two |
1675 |
subpatterns with the same number because PCRE uses only the numbers when |
subpatterns with the same number because PCRE uses only the numbers when |
1676 |
matching. For this reason, an error is given at compile time if different names |
matching. For this reason, an error is given at compile time if different names |
1677 |
are given to subpatterns with the same number. However, you can give the same |
are given to subpatterns with the same number. However, you can always give the |
1678 |
name to subpatterns with the same number, even when PCRE_DUPNAMES is not set. |
same name to subpatterns with the same number, even when PCRE_DUPNAMES is not |
1679 |
|
set. |
1680 |
</P> |
</P> |
1681 |
<br><a name="SEC16" href="#TOC1">REPETITION</a><br> |
<br><a name="SEC16" href="#TOC1">REPETITION</a><br> |
1682 |
<P> |
<P> |
2342 |
<P> |
<P> |
2343 |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used |
2344 |
subpattern by name. For compatibility with earlier versions of PCRE, which had |
subpattern by name. For compatibility with earlier versions of PCRE, which had |
2345 |
this facility before Perl, the syntax (?(name)...) is also recognized. However, |
this facility before Perl, the syntax (?(name)...) is also recognized. |
|
there is a possible ambiguity with this syntax, because subpattern names may |
|
|
consist entirely of digits. PCRE looks first for a named subpattern; if it |
|
|
cannot find one and the name consists entirely of digits, PCRE looks for a |
|
|
subpattern of that number, which must be greater than zero. Using subpattern |
|
|
names that consist entirely of digits is not recommended. |
|
2346 |
</P> |
</P> |
2347 |
<P> |
<P> |
2348 |
Rewriting the above example to use a named subpattern gives this: |
Rewriting the above example to use a named subpattern gives this: |
2759 |
called. It is provided with the number of the callout, the position in the |
called. It is provided with the number of the callout, the position in the |
2760 |
pattern, and, optionally, one item of data originally supplied by the caller of |
pattern, and, optionally, one item of data originally supplied by the caller of |
2761 |
the matching function. The callout function may cause matching to proceed, to |
the matching function. The callout function may cause matching to proceed, to |
2762 |
backtrack, or to fail altogether. A complete description of the interface to |
backtrack, or to fail altogether. |
2763 |
the callout function is given in the |
</P> |
2764 |
|
<P> |
2765 |
|
By default, PCRE implements a number of optimizations at compile time and |
2766 |
|
matching time, and one side-effect is that sometimes callouts are skipped. If |
2767 |
|
you need all possible callouts to happen, you need to set options that disable |
2768 |
|
the relevant optimizations. More details, and a complete description of the |
2769 |
|
interface to the callout function, are given in the |
2770 |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
<a href="pcrecallout.html"><b>pcrecallout</b></a> |
2771 |
documentation. |
documentation. |
2772 |
<a name="backtrackcontrol"></a></P> |
<a name="backtrackcontrol"></a></P> |
3117 |
<pre> |
<pre> |
3118 |
...(*COMMIT)(*PRUNE)... |
...(*COMMIT)(*PRUNE)... |
3119 |
</pre> |
</pre> |
3120 |
If there is a matching failure to the right, backtracking onto (*PRUNE) cases |
If there is a matching failure to the right, backtracking onto (*PRUNE) causes |
3121 |
it to be triggered, and its action is taken. There can never be a backtrack |
it to be triggered, and its action is taken. There can never be a backtrack |
3122 |
onto (*COMMIT). |
onto (*COMMIT). |
3123 |
<a name="btrepeat"></a></P> |
<a name="btrepeat"></a></P> |
3200 |
</P> |
</P> |
3201 |
<br><a name="SEC29" href="#TOC1">REVISION</a><br> |
<br><a name="SEC29" href="#TOC1">REVISION</a><br> |
3202 |
<P> |
<P> |
3203 |
Last updated: 26 April 2013 |
Last updated: 12 November 2013 |
3204 |
<br> |
<br> |
3205 |
Copyright © 1997-2013 University of Cambridge. |
Copyright © 1997-2013 University of Cambridge. |
3206 |
<br> |
<br> |