26 |
give better JavaScript compatibility. |
give better JavaScript compatibility. |
27 |
|
|
28 |
The current implementation of PCRE corresponds approximately with Perl |
The current implementation of PCRE corresponds approximately with Perl |
29 |
5.10/5.11, including support for UTF-8 encoded strings and Unicode gen- |
5.12, including support for UTF-8 encoded strings and Unicode general |
30 |
eral category properties. However, UTF-8 and Unicode support has to be |
category properties. However, UTF-8 and Unicode support has to be |
31 |
explicitly enabled; it is not the default. The Unicode tables corre- |
explicitly enabled; it is not the default. The Unicode tables corre- |
32 |
spond to Unicode release 5.2.0. |
spond to Unicode release 5.2.0. |
33 |
|
|
238 |
7. Similarly, characters that match the POSIX named character classes |
7. Similarly, characters that match the POSIX named character classes |
239 |
are all low-valued characters, unless the PCRE_UCP option is set. |
are all low-valued characters, unless the PCRE_UCP option is set. |
240 |
|
|
241 |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
8. However, the horizontal and vertical whitespace matching escapes |
242 |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
(\h, \H, \v, and \V) do match all the appropriate Unicode characters, |
243 |
acters, whether or not PCRE_UCP is set. |
whether or not PCRE_UCP is set. |
244 |
|
|
245 |
9. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
246 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
247 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
248 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
249 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
250 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Furthermore, PCRE supports |
251 |
support is available, PCRE supports case-insensitive matching only when |
case-insensitive matching only when there is a one-to-one mapping |
252 |
there is a one-to-one mapping between a letter's cases. There are a |
between a letter's cases. There are a small number of many-to-one map- |
253 |
small number of many-to-one mappings in Unicode; these are not sup- |
pings in Unicode; these are not supported by PCRE. |
|
ported by PCRE. |
|
254 |
|
|
255 |
|
|
256 |
AUTHOR |
AUTHOR |
259 |
University Computing Service |
University Computing Service |
260 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
261 |
|
|
262 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
263 |
so I've taken it away. If you want to email me, use my two initials, |
so I've taken it away. If you want to email me, use my two initials, |
264 |
followed by the two digits 10, at the domain cam.ac.uk. |
followed by the two digits 10, at the domain cam.ac.uk. |
265 |
|
|
266 |
|
|
267 |
REVISION |
REVISION |
268 |
|
|
269 |
Last updated: 22 October 2010 |
Last updated: 13 November 2010 |
270 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
271 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
272 |
|
|
696 |
represent the different matching possibilities (if there are none, the |
represent the different matching possibilities (if there are none, the |
697 |
match has failed). Thus, if there is more than one possible match, |
match has failed). Thus, if there is more than one possible match, |
698 |
this algorithm finds all of them, and in particular, it finds the long- |
this algorithm finds all of them, and in particular, it finds the long- |
699 |
est. There is an option to stop the algorithm after the first match |
est. The matches are returned in decreasing order of length. There is |
700 |
(which is necessarily the shortest) is found. |
an option to stop the algorithm after the first match (which is neces- |
701 |
|
sarily the shortest) is found. |
702 |
|
|
703 |
Note that all the matches that are found start at the same point in the |
Note that all the matches that are found start at the same point in the |
704 |
subject. If the pattern |
subject. If the pattern |
705 |
|
|
706 |
cat(er(pillar)?) |
cat(er(pillar)?)? |
707 |
|
|
708 |
is matched against the string "the caterpillar catchment", the result |
is matched against the string "the caterpillar catchment", the result |
709 |
will be the three strings "cat", "cater", and "caterpillar" that start |
will be the three strings "caterpillar", "cater", and "cat" that start |
710 |
at the fourth character of the subject. The algorithm does not automat- |
at the fifth character of the subject. The algorithm does not automati- |
711 |
ically move on to find matches that start at later positions. |
cally move on to find matches that start at later positions. |
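
The result described above can be sketched without PCRE itself. The following is only an illustration, hand-coded for the single pattern cat(er(pillar)?)? and not a use of pcre_dfa_exec(); the helper name cat_matches_at is hypothetical. It reports the lengths of every match that starts at one given offset, longest first, and never looks at later start positions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, hand-coded for the pattern cat(er(pillar)?)? only.
   It does not call PCRE. It stores the lengths of all matches that start
   at subject[start], in decreasing order of length, and returns how many
   were found (0 means no match at that position). */
static int cat_matches_at(const char *subject, size_t start, size_t lengths[3])
{
    /* The three strings the pattern can match, longest first. */
    static const char *const candidates[3] = { "caterpillar", "cater", "cat" };
    size_t remaining;
    int count = 0;

    assert(subject != NULL && lengths != NULL);
    assert(start <= strlen(subject));

    remaining = strlen(subject) - start;
    for (int i = 0; i < 3; i++) {
        size_t len = strlen(candidates[i]);
        if (len <= remaining &&
            strncmp(subject + start, candidates[i], len) == 0)
            lengths[count++] = len;
    }
    return count;
}
```

For the subject "the caterpillar catchment", the fifth character is at offset 4, so a call with start set to 4 yields the three lengths 11, 5 and 3. The "cat" of "catchment" at offset 16 is found only if the caller asks for that position explicitly.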
712 |
|
|
713 |
There are a number of features of PCRE regular expressions that are not |
There are a number of features of PCRE regular expressions that are not |
714 |
supported by the alternative matching algorithm. They are as follows: |
supported by the alternative matching algorithm. They are as follows: |
715 |
|
|
716 |
1. Because the algorithm finds all possible matches, the greedy or |
1. Because the algorithm finds all possible matches, the greedy or |
717 |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
718 |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
719 |
sessive quantifiers can make a difference when what follows could also |
sessive quantifiers can make a difference when what follows could also |
720 |
match what is quantified, for example in a pattern like this: |
match what is quantified, for example in a pattern like this: |
721 |
|
|
722 |
^a++\w! |
^a++\w! |
723 |
|
|
724 |
This pattern matches "aaab!" but not "aaa!", which would be matched by |
This pattern matches "aaab!" but not "aaa!", which would be matched by |
725 |
a non-possessive quantifier. Similarly, if an atomic group is present, |
a non-possessive quantifier. Similarly, if an atomic group is present, |
726 |
it is matched as if it were a standalone pattern at the current point, |
it is matched as if it were a standalone pattern at the current point, |
727 |
and the longest match is then "locked in" for the rest of the overall |
and the longest match is then "locked in" for the rest of the overall |
728 |
pattern. |
pattern. |
729 |
|
|
730 |
2. When dealing with multiple paths through the tree simultaneously, it |
2. When dealing with multiple paths through the tree simultaneously, it |
731 |
is not straightforward to keep track of captured substrings for the |
is not straightforward to keep track of captured substrings for the |
732 |
different matching possibilities, and PCRE's implementation of this |
different matching possibilities, and PCRE's implementation of this |
733 |
algorithm does not attempt to do this. This means that no captured sub- |
algorithm does not attempt to do this. This means that no captured sub- |
734 |
strings are available. |
strings are available. |
735 |
|
|
736 |
3. Because no substrings are captured, back references within the pat- |
3. Because no substrings are captured, back references within the pat- |
737 |
tern are not supported, and cause errors if encountered. |
tern are not supported, and cause errors if encountered. |
738 |
|
|
739 |
4. For the same reason, conditional expressions that use a backrefer- |
4. For the same reason, conditional expressions that use a backrefer- |
740 |
ence as the condition or test for a specific group recursion are not |
ence as the condition or test for a specific group recursion are not |
741 |
supported. |
supported. |
742 |
|
|
743 |
5. Because many paths through the tree may be active, the \K escape |
5. Because many paths through the tree may be active, the \K escape |
744 |
sequence, which resets the start of the match when encountered (but may |
sequence, which resets the start of the match when encountered (but may |
745 |
be on some paths and not on others), is not supported. It causes an |
be on some paths and not on others), is not supported. It causes an |
746 |
error if encountered. |
error if encountered. |
747 |
|
|
748 |
6. Callouts are supported, but the value of the capture_top field is |
6. Callouts are supported, but the value of the capture_top field is |
749 |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
750 |
|
|
751 |
7. The \C escape sequence, which (in the standard algorithm) matches a |
7. The \C escape sequence, which (in the standard algorithm) matches a |
752 |
single byte, even in UTF-8 mode, is not supported because the alterna- |
single byte, even in UTF-8 mode, is not supported because the alterna- |
753 |
tive algorithm moves through the subject string one character at a |
tive algorithm moves through the subject string one character at a |
754 |
time, for all active paths through the tree. |
time, for all active paths through the tree. |
755 |
|
|
756 |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
757 |
are not supported. (*FAIL) is supported, and behaves like a failing |
are not supported. (*FAIL) is supported, and behaves like a failing |
758 |
negative assertion. |
negative assertion. |
759 |
|
|
760 |
|
|
761 |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
762 |
|
|
763 |
Using the alternative matching algorithm provides the following advan- |
Using the alternative matching algorithm provides the following advan- |
764 |
tages: |
tages: |
765 |
|
|
766 |
1. All possible matches (at a single point in the subject) are automat- |
1. All possible matches (at a single point in the subject) are automat- |
767 |
ically found, and in particular, the longest match is found. To find |
ically found, and in particular, the longest match is found. To find |
768 |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
769 |
things with callouts. |
things with callouts. |
770 |
|
|
771 |
2. Because the alternative algorithm scans the subject string just |
2. Because the alternative algorithm scans the subject string just |
772 |
once, and never needs to backtrack, it is possible to pass very long |
once, and never needs to backtrack, it is possible to pass very long |
773 |
subject strings to the matching function in several pieces, checking |
subject strings to the matching function in several pieces, checking |
774 |
for partial matching each time. It is possible to do multi-segment |
for partial matching each time. Although it is possible to do multi- |
775 |
matching using pcre_exec() (by retaining partially matched substrings), |
segment matching using the standard algorithm (pcre_exec()), by retain- |
776 |
but it is more complicated. The pcrepartial documentation gives details |
ing partially matched substrings, it is more complicated. The pcrepar- |
777 |
of partial matching and discusses multi-segment matching. |
tial documentation gives details of partial matching and discusses |
778 |
|
multi-segment matching. |
779 |
|
|
780 |
|
|
781 |
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM |
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM |
801 |
|
|
802 |
REVISION |
REVISION |
803 |
|
|
804 |
Last updated: 22 October 2010 |
Last updated: 17 November 2010 |
805 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
806 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
807 |
|
|
1172 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
1173 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
1174 |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
1175 |
try to free it. The byte offset from the start of the pattern to the |
try to free it. The offset from the start of the pattern to the byte |
1176 |
character that was being processed when the error was discovered is |
that was being processed when the error was discovered is placed in the |
1177 |
placed in the variable pointed to by erroffset, which must not be NULL. |
variable pointed to by erroffset, which must not be NULL. If it is, an |
1178 |
If it is, an immediate error is given. Some errors are not detected |
immediate error is given. Some errors are not detected until checks are |
1179 |
until checks are carried out when the whole pattern has been scanned; |
carried out when the whole pattern has been scanned; in this case the |
1180 |
in this case the offset is set to the end of the pattern. |
offset is set to the end of the pattern. |
1181 |
|
|
1182 |
|
Note that the offset is in bytes, not characters, even in UTF-8 mode. |
1183 |
|
It may point into the middle of a UTF-8 character (for example, when |
1184 |
|
PCRE_ERROR_BADUTF8 is returned for an invalid UTF-8 string). |
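
Because the offset counts bytes, a program that wants to report a character position in UTF-8 mode has to convert it. The following is a minimal sketch of one way to do that; utf8_char_index is a hypothetical helper, not part of the PCRE API, and it assumes that the bytes before the offset are well-formed UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API. Returns the zero-based
   index of the UTF-8 character that contains the byte at byte_offset.
   If byte_offset is at (or beyond) the end of the pattern, the number of
   characters in the pattern is returned. */
static size_t utf8_char_index(const char *pattern, size_t byte_offset)
{
    size_t starts = 0;   /* characters that begin before byte_offset */
    size_t i;

    assert(pattern != NULL);

    for (i = 0; i < byte_offset && pattern[i] != '\0'; i++) {
        /* Continuation bytes look like 10xxxxxx; any other byte
           begins a new character. */
        if (((unsigned char)pattern[i] & 0xC0) != 0x80)
            starts++;
    }

    /* An offset that lands on a continuation byte points into the middle
       of a character whose first byte has already been counted. */
    if (i == byte_offset && starts > 0 &&
        ((unsigned char)pattern[i] & 0xC0) == 0x80)
        return starts - 1;

    return starts;
}
```

With the four-byte pattern "a\xC3\xA9(", byte offsets 1 and 2 both map to character index 1, because they are the two bytes of the same character.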
1185 |
|
|
1186 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
1187 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
1259 |
|
|
1260 |
PCRE_DOTALL |
PCRE_DOTALL |
1261 |
|
|
1262 |
If this bit is set, a dot metacharater in the pattern matches all char- |
If this bit is set, a dot metacharacter in the pattern matches a char- |
1263 |
acters, including those that indicate newline. Without it, a dot does |
acter of any value, including one that indicates a newline. However, it |
1264 |
not match when the current position is at a newline. This option is |
only ever matches one character, even if newlines are coded as CRLF. |
1265 |
equivalent to Perl's /s option, and it can be changed within a pattern |
Without this option, a dot does not match when the current position is |
1266 |
by a (?s) option setting. A negative class such as [^a] always matches |
at a newline. This option is equivalent to Perl's /s option, and it can |
1267 |
newline characters, independent of the setting of this option. |
be changed within a pattern by a (?s) option setting. A negative class |
1268 |
|
such as [^a] always matches newline characters, independent of the set- |
1269 |
|
ting of this option. |
1270 |
|
|
1271 |
PCRE_DUPNAMES |
PCRE_DUPNAMES |
1272 |
|
|
1286 |
option, and it can be changed within a pattern by a (?x) option set- |
option, and it can be changed within a pattern by a (?x) option set- |
1287 |
ting. |
ting. |
1288 |
|
|
1289 |
This option makes it possible to include comments inside complicated |
Which characters are interpreted as newlines is controlled by the |
1290 |
patterns. Note, however, that this applies only to data characters. |
options passed to pcre_compile() or by a special sequence at the start |
1291 |
Whitespace characters may never appear within special character |
of the pattern, as described in the section entitled "Newline conven- |
1292 |
sequences in a pattern, for example within the sequence (?( which |
tions" in the pcrepattern documentation. Note that the end of this type |
1293 |
introduces a conditional subpattern. |
of comment is a literal newline sequence in the pattern; escape |
1294 |
|
sequences that happen to represent a newline do not count. |
1295 |
|
|
1296 |
|
This option makes it possible to include comments inside complicated |
1297 |
|
patterns. Note, however, that this applies only to data characters. |
1298 |
|
Whitespace characters may never appear within special character |
1299 |
|
sequences in a pattern, for example within the sequence (?( that intro- |
1300 |
|
duces a conditional subpattern. |
1301 |
|
|
1302 |
PCRE_EXTRA |
PCRE_EXTRA |
1303 |
|
|
1304 |
This option was invented in order to turn on additional functionality |
This option was invented in order to turn on additional functionality |
1305 |
of PCRE that is incompatible with Perl, but it is currently of very |
of PCRE that is incompatible with Perl, but it is currently of very |
1306 |
little use. When set, any backslash in a pattern that is followed by a |
little use. When set, any backslash in a pattern that is followed by a |
1307 |
letter that has no special meaning causes an error, thus reserving |
letter that has no special meaning causes an error, thus reserving |
1308 |
these combinations for future expansion. By default, as in Perl, a |
these combinations for future expansion. By default, as in Perl, a |
1309 |
backslash followed by a letter with no special meaning is treated as a |
backslash followed by a letter with no special meaning is treated as a |
1310 |
literal. (Perl can, however, be persuaded to give an error for this, by |
literal. (Perl can, however, be persuaded to give an error for this, by |
1311 |
running it with the -w option.) There are at present no other features |
running it with the -w option.) There are at present no other features |
1312 |
controlled by this option. It can also be set by a (?X) option setting |
controlled by this option. It can also be set by a (?X) option setting |
1313 |
within a pattern. |
within a pattern. |
1314 |
|
|
1315 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
1316 |
|
|
1317 |
If this option is set, an unanchored pattern is required to match |
If this option is set, an unanchored pattern is required to match |
1318 |
before or at the first newline in the subject string, though the |
before or at the first newline in the subject string, though the |
1319 |
matched text may continue over the newline. |
matched text may continue over the newline. |
1320 |
|
|
1321 |
PCRE_JAVASCRIPT_COMPAT |
PCRE_JAVASCRIPT_COMPAT |
1322 |
|
|
1323 |
If this option is set, PCRE's behaviour is changed in some ways so that |
If this option is set, PCRE's behaviour is changed in some ways so that |
1324 |
it is compatible with JavaScript rather than Perl. The changes are as |
it is compatible with JavaScript rather than Perl. The changes are as |
1325 |
follows: |
follows: |
1326 |
|
|
1327 |
(1) A lone closing square bracket in a pattern causes a compile-time |
(1) A lone closing square bracket in a pattern causes a compile-time |
1328 |
error, because this is illegal in JavaScript (by default it is treated |
error, because this is illegal in JavaScript (by default it is treated |
1329 |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
1330 |
option is set. |
option is set. |
1331 |
|
|
1332 |
(2) At run time, a back reference to an unset subpattern group matches |
(2) At run time, a back reference to an unset subpattern group matches |
1333 |
an empty string (by default this causes the current matching alterna- |
an empty string (by default this causes the current matching alterna- |
1334 |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
1335 |
set (assuming it can find an "a" in the subject), whereas it fails by |
set (assuming it can find an "a" in the subject), whereas it fails by |
1336 |
default, for Perl compatibility. |
default, for Perl compatibility. |
1337 |
|
|
1338 |
PCRE_MULTILINE |
PCRE_MULTILINE |
1339 |
|
|
1340 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
1341 |
line of characters (even if it actually contains newlines). The "start |
line of characters (even if it actually contains newlines). The "start |
1342 |
of line" metacharacter (^) matches only at the start of the string, |
of line" metacharacter (^) matches only at the start of the string, |
1343 |
while the "end of line" metacharacter ($) matches only at the end of |
while the "end of line" metacharacter ($) matches only at the end of |
1344 |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
1345 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
1346 |
|
|
1347 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
1348 |
constructs match immediately following or immediately before internal |
constructs match immediately following or immediately before internal |
1349 |
newlines in the subject string, respectively, as well as at the very |
newlines in the subject string, respectively, as well as at the very |
1350 |
start and end. This is equivalent to Perl's /m option, and it can be |
start and end. This is equivalent to Perl's /m option, and it can be |
1351 |
changed within a pattern by a (?m) option setting. If there are no new- |
changed within a pattern by a (?m) option setting. If there are no new- |
1352 |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
1353 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
1354 |
|
|
1355 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1358 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
1359 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1360 |
|
|
1361 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
1362 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
1363 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
1364 |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
1365 |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
1366 |
that any of the three preceding sequences should be recognized. Setting |
that any of the three preceding sequences should be recognized. Setting |
1367 |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
1368 |
recognized. The Unicode newline sequences are the three just mentioned, |
recognized. The Unicode newline sequences are the three just mentioned, |
1369 |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
1370 |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
1371 |
(paragraph separator, U+2029). The last two are recognized only in |
(paragraph separator, U+2029). The last two are recognized only in |
1372 |
UTF-8 mode. |
UTF-8 mode. |
1373 |
|
|
1374 |
The newline setting in the options word uses three bits that are |
The newline setting in the options word uses three bits that are |
1375 |
treated as a number, giving eight possibilities. Currently only six are |
treated as a number, giving eight possibilities. Currently only six are |
1376 |
used (default plus the five values above). This means that if you set |
used (default plus the five values above). This means that if you set |
1377 |
more than one newline option, the combination may or may not be sensi- |
more than one newline option, the combination may or may not be sensi- |
1378 |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
1379 |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
1380 |
cause an error. |
cause an error. |
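
The three-bit field can be checked before the options are passed to pcre_compile(). The sketch below copies the option values from the PCRE 8.x pcre.h header so that it stands alone; treat those numbers as an assumption and prefer the header itself in real code. NEWLINE_BITS_MASK and newline_bits_are_valid are hypothetical names used only for this illustration.

```c
#include <assert.h>

/* Assumption: values as defined in the PCRE 8.x pcre.h, repeated here
   only so that the sketch compiles without the header. */
#ifndef PCRE_NEWLINE_CR
#define PCRE_NEWLINE_CR      0x00100000
#define PCRE_NEWLINE_LF      0x00200000
#define PCRE_NEWLINE_CRLF    0x00300000
#define PCRE_NEWLINE_ANY     0x00400000
#define PCRE_NEWLINE_ANYCRLF 0x00500000
#endif

/* Hypothetical name for the three bits that hold the newline setting. */
#define NEWLINE_BITS_MASK 0x00700000

/* Returns 1 if the newline bits in options hold one of the six values in
   use (0 for the build-time default, or one of the five options above),
   and 0 if they hold an unused number. */
static int newline_bits_are_valid(int options)
{
    /* The combination mentioned in the text: CR with LF gives CRLF. */
    assert((PCRE_NEWLINE_CR | PCRE_NEWLINE_LF) == PCRE_NEWLINE_CRLF);

    switch (options & NEWLINE_BITS_MASK) {
    case 0:
    case PCRE_NEWLINE_CR:
    case PCRE_NEWLINE_LF:
    case PCRE_NEWLINE_CRLF:
    case PCRE_NEWLINE_ANY:
    case PCRE_NEWLINE_ANYCRLF:
        return 1;
    default:
        return 0;
    }
}
```

Under these values, PCRE_NEWLINE_LF combined with PCRE_NEWLINE_ANY gives an unused number, while PCRE_NEWLINE_CR combined with PCRE_NEWLINE_ANY happens to equal PCRE_NEWLINE_ANYCRLF, which is valid but probably not what the caller meant.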
1381 |
|
|
1382 |
The only time that a line break is specially recognized when compiling |
The only time that a line break in a pattern is specially recognized |
1383 |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
when compiling is when PCRE_EXTENDED is set. CR and LF are whitespace |
1384 |
character class is encountered. This indicates a comment that lasts |
characters, and so are ignored in this mode. Also, an unescaped # out- |
1385 |
until after the next line break sequence. In other circumstances, line |
side a character class indicates a comment that lasts until after the |
1386 |
break sequences are treated as literal data, except that in |
next line break sequence. In other circumstances, line break sequences |
1387 |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
in patterns are treated as literal data. |
|
and are therefore ignored. |
|
1388 |
|
|
1389 |
The newline option that is set at compile time becomes the default that |
The newline option that is set at compile time becomes the default that |
1390 |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
1399 |
|
|
1400 |
PCRE_UCP |
PCRE_UCP |
1401 |
|
|
1402 |
This option changes the way PCRE processes \b, \d, \s, \w, and some of |
This option changes the way PCRE processes \B, \b, \D, \d, \S, \s, \W, |
1403 |
the POSIX character classes. By default, only ASCII characters are rec- |
\w, and some of the POSIX character classes. By default, only ASCII |
1404 |
ognized, but if PCRE_UCP is set, Unicode properties are used instead to |
characters are recognized, but if PCRE_UCP is set, Unicode properties |
1405 |
classify characters. More details are given in the section on generic |
are used instead to classify characters. More details are given in the |
1406 |
character types in the pcrepattern page. If you set PCRE_UCP, matching |
section on generic character types in the pcrepattern page. If you set |
1407 |
one of the items it affects takes much longer. The option is available |
PCRE_UCP, matching one of the items it affects takes much longer. The |
1408 |
only if PCRE has been compiled with Unicode property support. |
option is available only if PCRE has been compiled with Unicode prop- |
1409 |
|
erty support. |
1410 |
|
|
1411 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
1412 |
|
|
1413 |
This option inverts the "greediness" of the quantifiers so that they |
This option inverts the "greediness" of the quantifiers so that they |
1414 |
are not greedy by default, but become greedy if followed by "?". It is |
are not greedy by default, but become greedy if followed by "?". It is |
1415 |
not compatible with Perl. It can also be set by a (?U) option setting |
not compatible with Perl. It can also be set by a (?U) option setting |
1416 |
within the pattern. |
within the pattern. |
1417 |
|
|
1418 |
PCRE_UTF8 |
PCRE_UTF8 |
1419 |
|
|
1420 |
This option causes PCRE to regard both the pattern and the subject as |
This option causes PCRE to regard both the pattern and the subject as |
1421 |
strings of UTF-8 characters instead of single-byte character strings. |
strings of UTF-8 characters instead of single-byte character strings. |
1422 |
However, it is available only when PCRE is built to include UTF-8 sup- |
However, it is available only when PCRE is built to include UTF-8 sup- |
1423 |
port. If not, the use of this option provokes an error. Details of how |
port. If not, the use of this option provokes an error. Details of how |
1424 |
this option changes the behaviour of PCRE are given in the section on |
this option changes the behaviour of PCRE are given in the section on |
1425 |
UTF-8 support in the main pcre page. |
UTF-8 support in the main pcre page. |
1426 |
|
|
1427 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1428 |
|
|
1429 |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
1430 |
automatically checked. There is a discussion about the validity of |
automatically checked. There is a discussion about the validity of |
1431 |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
1432 |
bytes is found, pcre_compile() returns an error. If you already know |
bytes is found, pcre_compile() returns an error. If you already know |
1433 |
that your pattern is valid, and you want to skip this check for perfor- |
that your pattern is valid, and you want to skip this check for perfor- |
1434 |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
1435 |
set, the effect of passing an invalid UTF-8 string as a pattern is |
set, the effect of passing an invalid UTF-8 string as a pattern is |
1436 |
undefined. It may cause your program to crash. Note that this option |
undefined. It may cause your program to crash. Note that this option |
1437 |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
1438 |
UTF-8 validity checking of subject strings. |
UTF-8 validity checking of subject strings. |
1439 |
|
|
1440 |
|
|
1441 |
COMPILATION ERROR CODES |
COMPILATION ERROR CODES |
1442 |
|
|
1443 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
1444 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
1445 |
both compiling functions. As PCRE has developed, some error codes have |
both compiling functions. As PCRE has developed, some error codes have |
1446 |
fallen out of use. To avoid confusion, they have not been re-used. |
fallen out of use. To avoid confusion, they have not been re-used. |
1447 |
|
|
1448 |
0 no error |
0 no error |
1517 |
66 (*MARK) must have an argument |
66 (*MARK) must have an argument |
1518 |
67 this version of PCRE is not compiled with PCRE_UCP support |
67 this version of PCRE is not compiled with PCRE_UCP support |
1519 |
|
|
1520 |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
1521 |
values may be used if the limits were changed when PCRE was built. |
values may be used if the limits were changed when PCRE was built. |
1522 |
|
|
1523 |
|
|
1526 |
pcre_extra *pcre_study(const pcre *code, int options, |
pcre_extra *pcre_study(const pcre *code, int options, |
1527 |
const char **errptr); |
const char **errptr); |
1528 |
|
|
1529 |
If a compiled pattern is going to be used several times, it is worth |
If a compiled pattern is going to be used several times, it is worth |
1530 |
spending more time analyzing it in order to speed up the time taken for |
spending more time analyzing it in order to speed up the time taken for |
1531 |
matching. The function pcre_study() takes a pointer to a compiled pat- |
matching. The function pcre_study() takes a pointer to a compiled pat- |
1532 |
tern as its first argument. If studying the pattern produces additional |
tern as its first argument. If studying the pattern produces additional |
1533 |
information that will help speed up matching, pcre_study() returns a |
information that will help speed up matching, pcre_study() returns a |
1534 |
pointer to a pcre_extra block, in which the study_data field points to |
pointer to a pcre_extra block, in which the study_data field points to |
1535 |
the results of the study. |
the results of the study. |
1536 |
|
|
1537 |
The returned value from pcre_study() can be passed directly to |
The returned value from pcre_study() can be passed directly to |
1538 |
pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con- |
pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block also con- |
1539 |
tains other fields that can be set by the caller before the block is |
tains other fields that can be set by the caller before the block is |
1540 |
passed; these are described below in the section on matching a pattern. |
passed; these are described below in the section on matching a pattern. |
1541 |
|
|
If studying the pattern does not produce any useful information,
pcre_study() returns NULL. In that circumstance, if the calling program
wants to pass any of the other fields to pcre_exec() or
pcre_dfa_exec(), it must set up its own pcre_extra block.

The second argument of pcre_study() contains option bits. At present,
no options are defined, and this argument should always be zero.

The third argument for pcre_study() is a pointer for an error message.
If studying succeeds (even if no data is returned), the variable it
points to is set to NULL. Otherwise it is set to point to a textual
error message. This is a static string that is part of the library. You
must not try to free it. You should test the error pointer for NULL
after calling pcre_study(), to be sure that it has run successfully.

This is a typical call to pcre_study():

Studying a pattern does two things: first, a lower bound for the length
of subject string that is needed to match the pattern is computed. This
does not mean that there are any strings of that length that match, but
it does guarantee that no shorter strings match. The value is used by
pcre_exec() and pcre_dfa_exec() to avoid wasting time by trying to
match strings that are shorter than the lower bound. You can find out
the value in a calling program via the pcre_fullinfo() function.

Studying a pattern is also useful for non-anchored patterns that do not
have a single fixed starting character. A bitmap of possible starting
bytes is created. This speeds up finding a position in the subject at
which to start matching.

The two optimizations just described can be disabled by setting the
PCRE_NO_START_OPTIMIZE option when calling pcre_exec() or
pcre_dfa_exec(). You might want to do this if your pattern contains
callouts or (*MARK), and you want to make use of these facilities in
cases where matching fails. See the discussion of
PCRE_NO_START_OPTIMIZE below.


LOCALE SUPPORT

PCRE handles caseless matching, and determines whether characters are
letters, digits, or whatever, by reference to a set of tables, indexed
by character value. When running in UTF-8 mode, this applies only to
characters with codes less than 128. By default, higher-valued codes
never match escapes such as \w or \d, but they can be tested with \p if
PCRE is built with Unicode character property support. Alternatively,
the PCRE_UCP option can be set at compile time; this causes \w and
friends to use Unicode property support instead of built-in tables. The
use of locales with Unicode is discouraged. If you are handling charac-
ters with codes greater than 128, you should either use UTF-8 and Uni-
code, or use locales, but not try to mix the two.

PCRE contains an internal set of tables that are used when the final
argument of pcre_compile() is NULL. These are sufficient for many
applications. Normally, the internal tables recognize only ASCII char-
acters. However, when PCRE is built, it is possible to cause the inter-
nal tables to be rebuilt in the default "C" locale of the local system,
which may cause them to be different.

The internal tables can always be overridden by tables supplied by the
application that calls PCRE. These may be created in a different locale
from the default. As more and more applications change to using Uni-
code, the need for this locale support is expected to die away.

External tables are built by calling the pcre_maketables() function,
which has no arguments, in the relevant locale. The result can then be
passed to pcre_compile() or pcre_exec() as often as necessary. For
example, to build and use tables that are appropriate for the French
locale (where accented characters with values greater than 128 are
treated as letters), the following code could be used:

  setlocale(LC_CTYPE, "fr_FR");
  tables = pcre_maketables();
  re = pcre_compile(..., tables);

The locale name "fr_FR" is used on Linux and other Unix-like systems;
if you are using Windows, the name for the French locale is "french".

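The idea of tables that are "indexed by character value" and built in
the current locale can be illustrated with standard C alone. The sketch
below is not PCRE's table layout and does not call pcre_maketables();
the demo_tables type and demo_build_tables() are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

/* Hypothetical stand-in for a set of locale-dependent tables. */
typedef struct {
    unsigned char lower[256];      /* lower-case equivalent of each value */
    unsigned char is_letter[256];  /* 1 if the value is a letter, else 0 */
} demo_tables;

/* Fill the tables from whatever LC_CTYPE locale is currently in force. */
static void demo_build_tables(demo_tables *t)
{
    int c;
    for (c = 0; c < 256; c++) {
        t->lower[c] = (unsigned char)tolower(c);
        t->is_letter[c] = (unsigned char)(isalpha(c) != 0);
    }
}
```

Calling setlocale(LC_CTYPE, "fr_FR") before demo_build_tables() would
classify accented characters according to that locale, provided the
locale is installed; setlocale() returns NULL when it is not, and the
tables then reflect the locale that was already in force.
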
When pcre_maketables() runs, the tables are built in memory that is
obtained via pcre_malloc. It is the caller's responsibility to ensure
that the memory containing the tables remains available for as long as
it is needed.

The pointer that is passed to pcre_compile() is saved with the compiled
pattern, and the same tables are used via this pointer by pcre_study()
and normally also by pcre_exec(). Thus, by default, for any single pat-
tern, compilation, studying and matching all happen in the same locale,
but different patterns can be compiled in different locales.

It is possible to pass a table pointer or NULL (indicating the use of
the internal tables) to pcre_exec(). Although not intended for this
purpose, this facility could be used to match a pattern in a different
locale from the one in which it was compiled. Passing table pointers at
run time is discussed below in the section on matching a pattern.

int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
     int what, void *where);

The pcre_fullinfo() function returns information about a compiled pat-
tern. It replaces the obsolete pcre_info() function, which is neverthe-
less retained for backwards compatibility (and is documented below).

The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL if
the pattern was not studied. The third argument specifies which piece
of information is required, and the fourth argument is a pointer to a
variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid

The "magic number" is placed at the start of each compiled pattern as
a simple check against passing an arbitrary memory pointer. Here is a
typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:

  int rc;
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in pcre.h, and
are as follows:

  PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.

  PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.

  PCRE_INFO_DEFAULT_TABLES

Return a pointer to the internal default character tables within PCRE.
The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func-
tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.

  PCRE_INFO_FIRSTBYTE

Return information about the first byte of any matched string, for a
non-anchored pattern. The fourth argument should point to an int vari-
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
is still recognized for backwards compatibility.)

If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.

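The three kinds of result described above can be told apart with a
small helper. This is a minimal sketch, not part of PCRE: the enum and
the function classify_first_byte() are hypothetical names, and the
argument is assumed to be the int that pcre_fullinfo() stored for
PCRE_INFO_FIRSTBYTE.

```c
#include <assert.h>

/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
typedef enum {
    FIRST_FIXED_BYTE,     /* value >= 0: every match starts with this byte */
    FIRST_AT_LINE_START,  /* -1: match only at subject start or after a newline */
    FIRST_UNKNOWN         /* -2: no first-byte information, or anchored pattern */
} first_byte_kind;

/* Map the raw integer onto one of the three documented cases. */
static first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_FIXED_BYTE;
    if (value == -1)
        return FIRST_AT_LINE_START;
    return FIRST_UNKNOWN;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be the
code of "c", so the helper would report FIRST_FIXED_BYTE.
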
  PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an unsigned char * vari-
able.

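A 256-bit table occupies 32 bytes, one bit per possible byte value. The
sketch below shows how such a bitmap could be consulted to skip subject
positions that cannot start a match. It assumes that byte value c is
represented by bit (c & 7) of table[c / 8]; the text above does not
state the bit order, so treat that layout as an assumption. All
function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Assumed layout: byte value c is bit (c & 7) of table[c / 8]. */
static int first_table_has(const unsigned char *table, unsigned char c)
{
    return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Helper for this sketch only: mark c as a possible first byte. */
static void first_table_set(unsigned char *table, unsigned char c)
{
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}

/* Return the offset of the first subject byte that is in the table,
   or -1 if no position in the subject can start a match. */
static int first_candidate(const unsigned char *table,
                           const char *subject, int length)
{
    int i;
    for (i = 0; i < length; i++) {
        if (first_table_has(table, (unsigned char)subject[i]))
            return i;
    }
    return -1;
}
```
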
  PCRE_INFO_HASCRORLF

Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The fourth argument should point to an int
variable. An explicit match is either a literal CR or LF character, or
\r or \n.

  PCRE_INFO_JCHANGED

Return 1 if the (?J) or (?-J) option setting is used in the pattern,
otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

  PCRE_INFO_LASTLITERAL

Return the value of the rightmost literal byte that must exist in any
matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
is no such byte, -1 is returned. For anchored patterns, a last literal
byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.

  PCRE_INFO_MINLENGTH

If the pattern was studied and a minimum length for matching subject
strings was computed, its value is returned. Otherwise the returned
value is -1. The value is a number of characters, not bytes (this may
be relevant in UTF-8 mode). The fourth argument should point to an int
variable. A non-negative value is a lower bound to the length of any
matching string. There may not be any strings of that length that do
actually match, but every string that does match is at least that long.

  PCRE_INFO_NAMECOUNT
  PCRE_INFO_NAMEENTRYSIZE
  PCRE_INFO_NAMETABLE

PCRE supports the use of named as well as numbered capturing parenthe-
ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as
pcre_get_named_substring() are provided for extracting captured sub-
strings by name. It is also possible to extract the data directly, by
first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do
the conversion, you need to use the name-to-number map, which is
described by these three values.

The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
of each entry; both of these return an int value. The entry size
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing parenthe-
sis, most significant byte first. The rest of the entry is the corre-
sponding name, zero terminated.

The names are in alphabetical order. Duplicate names may appear if (?|
is used to create multiple groups with the same number, as described in
the section on duplicate subpattern numbers in the pcrepattern page.
Duplicate names for subpatterns with different numbers are permitted
only if PCRE_DUPNAMES is set. In all cases of duplicate names, they
appear in the table in the order in which they were found in the pat-
tern. In the absence of (?| this is the order of increasing number;
when (?| is used this is not necessarily the case because later subpat-
terns may have lower numbers.

As a simple example of the name/number table, consider the following
pattern (assume PCRE_EXTENDED is set, so white space - including new-
lines - is ignored):

  (?<date> (?<year>(\d\d)?\d\d) -
  (?<month>\d\d) - (?<day>\d\d) )

There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:

  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??

When writing code to extract data from named subpatterns using the
name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern.

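The entry layout described above (two bytes of group number, most
significant first, followed by the zero-terminated name) can be walked
with a few lines of C. This is a minimal sketch: name_to_number() and
sample_table are hypothetical names, and the sample data is built by
hand for the date pattern rather than obtained from pcre_fullinfo().
The group numbers follow the order of the opening parentheses, and the
padding bytes are zeroed here although real tables leave them
undefined.

```c
#include <assert.h>
#include <string.h>

/* Look up a group name in a name table. Each entry is entry_size bytes
   long. Returns the group number, or -1 if the name is not present. */
static int name_to_number(const unsigned char *table, int name_count,
                          int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + i * entry_size;
        /* The name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Hand-built sample: four entries of eight bytes, in alphabetical order. */
static const unsigned char sample_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

A linear scan keeps the sketch short. Because the names are sorted, a
binary search would also work; when duplicate names are present this
scan returns the first entry it finds.
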
  PCRE_INFO_OKPARTIAL

Return 1 if the pattern can be used for partial matching with
pcre_exec(), otherwise 0. The fourth argument should point to an int
variable. From release 8.00, this always returns 1, because the
restrictions that previously applied to partial matching have been
lifted. The pcrepartial documentation gives details of partial match-
ing.

  PCRE_INFO_OPTIONS

Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings at the start of the pattern itself. In
other words, they are the options that will be in force when matching
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
and PCRE_EXTENDED.

A pattern is automatically anchored by PCRE if all of its top-level
alternatives begin with one of the following:

  ^     unless PCRE_MULTILINE is set

  PCRE_INFO_SIZE

Return the size of the compiled pattern, that is, the value that was
passed as the argument to pcre_malloc() when PCRE was getting memory in
which to place the compiled data. The fourth argument should point to a
size_t variable.

  PCRE_INFO_STUDYSIZE

Return the size of the data block pointed to by the study_data field in
a pcre_extra block. That is, it is the value that was passed to
pcre_malloc() when PCRE was getting memory into which to place the data
created by pcre_study(). If pcre_extra is NULL, or there is no study
data, zero is returned. The fourth argument should point to a size_t
variable.


int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

The pcre_info() function is now obsolete because its interface is too
restrictive to return all the available data about a compiled pattern.
New programs should use pcre_fullinfo() instead. The yield of
pcre_info() is the number of capturing subpatterns, or one of the fol-
lowing negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found

If the optptr argument is not NULL, a copy of the options with which
the pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).

If the pattern is not anchored and the firstcharptr argument is not
NULL, it is used to pass back information about the first character of
any matched string (see PCRE_INFO_FIRSTBYTE above).


1888 |
int pcre_refcount(pcre *code, int adjust); |
int pcre_refcount(pcre *code, int adjust); |
1889 |
|
|
1890 |
The pcre_refcount() function is used to maintain a reference count in |
The pcre_refcount() function is used to maintain a reference count in |
1891 |
the data block that contains a compiled pattern. It is provided for the |
the data block that contains a compiled pattern. It is provided for the |
1892 |
benefit of applications that operate in an object-oriented manner, |
benefit of applications that operate in an object-oriented manner, |
1893 |
where different parts of the application may be using the same compiled |
where different parts of the application may be using the same compiled |
1894 |
pattern, but you want to free the block when they are all done. |
pattern, but you want to free the block when they are all done. |
1895 |
|
|
1896 |
When a pattern is compiled, the reference count field is initialized to |
When a pattern is compiled, the reference count field is initialized to |
1897 |
zero. It is changed only by calling this function, whose action is to |
zero. It is changed only by calling this function, whose action is to |
1898 |
add the adjust value (which may be positive or negative) to it. The |
add the adjust value (which may be positive or negative) to it. The |
1899 |
yield of the function is the new value. However, the value of the count |
yield of the function is the new value. However, the value of the count |
1900 |
is constrained to lie between 0 and 65535, inclusive. If the new value |
is constrained to lie between 0 and 65535, inclusive. If the new value |
1901 |
is outside these limits, it is forced to the appropriate limit value. |
is outside these limits, it is forced to the appropriate limit value. |
1902 |
|
|
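The clamping rule described above can be illustrated without PCRE itself. The following is a minimal sketch, not the library's implementation: toy_refcount_adjust() is a hypothetical helper that models only the documented arithmetic (add the adjust value, then force the result into the range 0 to 65535).

```c
#include <assert.h>

/* Hypothetical model of the documented pcre_refcount() arithmetic.
 * It is not PCRE code; it only mirrors the saturating behaviour. */
static int toy_refcount_adjust(int current, int adjust)
{
    long long value;

    /* The stored count is documented to stay within 0..65535. */
    assert(current >= 0 && current <= 65535);

    value = (long long)current + adjust;   /* wide type avoids overflow */
    if (value < 0)
        value = 0;                         /* forced to the lower limit */
    if (value > 65535)
        value = 65535;                     /* forced to the upper limit */
    return (int)value;
}
```

Under this model, decrementing a count of zero leaves it at zero, so a caller cannot use a negative result to detect an unbalanced release.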
1903 |
Except when it is zero, the reference count is not correctly preserved |
Except when it is zero, the reference count is not correctly preserved |
1904 |
if a pattern is compiled on one host and then transferred to a host |
if a pattern is compiled on one host and then transferred to a host |
1905 |
whose byte-order is different. (This seems a highly unlikely scenario.) |
whose byte-order is different. (This seems a highly unlikely scenario.) |
1906 |
|
|
1907 |
|
|
1911 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
1912 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
1913 |
|
|
1914 |
The function pcre_exec() is called to match a subject string against a |
The function pcre_exec() is called to match a subject string against a |
1915 |
compiled pattern, which is passed in the code argument. If the pattern |
compiled pattern, which is passed in the code argument. If the pattern |
1916 |
was studied, the result of the study should be passed in the extra |
was studied, the result of the study should be passed in the extra |
1917 |
argument. This function is the main matching facility of the library, |
argument. This function is the main matching facility of the library, |
1918 |
and it operates in a Perl-like manner. For specialist use there is also |
and it operates in a Perl-like manner. For specialist use there is also |
1919 |
an alternative matching function, which is described below in the sec- |
an alternative matching function, which is described below in the sec- |
1920 |
tion about the pcre_dfa_exec() function. |
tion about the pcre_dfa_exec() function. |
1921 |
|
|
1922 |
In most applications, the pattern will have been compiled (and option- |
In most applications, the pattern will have been compiled (and option- |
1923 |
ally studied) in the same process that calls pcre_exec(). However, it |
ally studied) in the same process that calls pcre_exec(). However, it |
1924 |
is possible to save compiled patterns and study data, and then use them |
is possible to save compiled patterns and study data, and then use them |
1925 |
later in different processes, possibly even on different hosts. For a |
later in different processes, possibly even on different hosts. For a |
1926 |
discussion about this, see the pcreprecompile documentation. |
discussion about this, see the pcreprecompile documentation. |
1927 |
|
|
1928 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_exec(): |
1941 |
|
|
1942 |
Extra data for pcre_exec() |
Extra data for pcre_exec() |
1943 |
|
|
1944 |
If the extra argument is not NULL, it must point to a pcre_extra data |
If the extra argument is not NULL, it must point to a pcre_extra data |
1945 |
block. The pcre_study() function returns such a block (when it doesn't |
block. The pcre_study() function returns such a block (when it doesn't |
1946 |
return NULL), but you can also create one for yourself, and pass addi- |
return NULL), but you can also create one for yourself, and pass addi- |
1947 |
tional information in it. The pcre_extra block contains the following |
tional information in it. The pcre_extra block contains the following |
1948 |
fields (not necessarily in this order): |
fields (not necessarily in this order): |
1949 |
|
|
1950 |
unsigned long int flags; |
unsigned long int flags; |
1955 |
const unsigned char *tables; |
const unsigned char *tables; |
1956 |
unsigned char **mark; |
unsigned char **mark; |
1957 |
|
|
1958 |
The flags field is a bitmap that specifies which of the other fields |
The flags field is a bitmap that specifies which of the other fields |
1959 |
are set. The flag bits are: |
are set. The flag bits are: |
1960 |
|
|
1961 |
PCRE_EXTRA_STUDY_DATA |
PCRE_EXTRA_STUDY_DATA |
1965 |
PCRE_EXTRA_TABLES |
PCRE_EXTRA_TABLES |
1966 |
PCRE_EXTRA_MARK |
PCRE_EXTRA_MARK |
1967 |
|
|
1968 |
Other flag bits should be set to zero. The study_data field is set in |
Other flag bits should be set to zero. The study_data field is set in |
1969 |
the pcre_extra block that is returned by pcre_study(), together with |
the pcre_extra block that is returned by pcre_study(), together with |
1970 |
the appropriate flag bit. You should not set this yourself, but you may |
the appropriate flag bit. You should not set this yourself, but you may |
1971 |
add to the block by setting the other fields and their corresponding |
add to the block by setting the other fields and their corresponding |
1972 |
flag bits. |
flag bits. |
1973 |
|
|
1974 |
The match_limit field provides a means of preventing PCRE from using up |
The match_limit field provides a means of preventing PCRE from using up |
1975 |
a vast amount of resources when running patterns that are not going to |
a vast amount of resources when running patterns that are not going to |
1976 |
match, but which have a very large number of possibilities in their |
match, but which have a very large number of possibilities in their |
1977 |
search trees. The classic example is a pattern that uses nested unlim- |
search trees. The classic example is a pattern that uses nested unlim- |
1978 |
ited repeats. |
ited repeats. |
1979 |
|
|
1980 |
Internally, PCRE uses a function called match() which it calls repeat- |
Internally, PCRE uses a function called match() which it calls repeat- |
1981 |
edly (sometimes recursively). The limit set by match_limit is imposed |
edly (sometimes recursively). The limit set by match_limit is imposed |
1982 |
on the number of times this function is called during a match, which |
on the number of times this function is called during a match, which |
1983 |
has the effect of limiting the amount of backtracking that can take |
has the effect of limiting the amount of backtracking that can take |
1984 |
place. For patterns that are not anchored, the count restarts from zero |
place. For patterns that are not anchored, the count restarts from zero |
1985 |
for each position in the subject string. |
for each position in the subject string. |
1986 |
|
|
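To make the effect of a call limit concrete, here is a small sketch of a toy backtracking matcher with a per-attempt call budget. It is not PCRE's match() function: the names toy_match, toy_search, toy_budget and the TOY_* codes are invented for illustration, and the toy pattern syntax supports only literals, '.', and a postfix '*'. As in the description above, the count restarts from zero at each starting position of an unanchored search.

```c
#include <assert.h>
#include <stddef.h>

/* Toy result codes (hypothetical; not PCRE's values). */
enum { TOY_NOMATCH = 0, TOY_MATCH = 1, TOY_LIMIT = -1 };

typedef struct {
    unsigned long calls;   /* calls made during the current attempt */
    unsigned long limit;   /* maximum calls allowed per attempt */
} toy_budget;

/* Match pat (literals, '.', postfix '*') at the start of text. */
static int toy_match(const char *pat, const char *text, toy_budget *b)
{
    assert(pat != NULL && text != NULL && b != NULL);
    if (++b->calls > b->limit)
        return TOY_LIMIT;                 /* budget exhausted */
    if (pat[0] == '\0')
        return TOY_MATCH;
    if (pat[1] == '*') {
        /* Try zero or more repetitions of pat[0], shortest first. */
        const char *t = text;
        for (;;) {
            int rc = toy_match(pat + 2, t, b);
            if (rc != TOY_NOMATCH)
                return rc;                /* match, or limit reached */
            if (*t == '\0' || (pat[0] != '.' && pat[0] != *t))
                return TOY_NOMATCH;
            t++;
        }
    }
    if (*text != '\0' && (pat[0] == '.' || pat[0] == *text))
        return toy_match(pat + 1, text + 1, b);
    return TOY_NOMATCH;
}

/* Unanchored search: the call count restarts at each start position. */
static int toy_search(const char *pat, const char *text, unsigned long limit)
{
    const char *start = text;
    for (;;) {
        toy_budget b = { 0, limit };
        int rc = toy_match(pat, start, &b);
        if (rc != TOY_NOMATCH)
            return rc;
        if (*start == '\0')
            return TOY_NOMATCH;
        start++;
    }
}
```

With several adjacent unlimited repeats and a subject that cannot match, the number of calls grows quickly with the subject length, so a small limit turns a long-running failure into an early, distinguishable error result.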
1987 |
The default value for the limit can be set when PCRE is built; the |
The default value for the limit can be set when PCRE is built; the |
1988 |
default build-time value is 10 million, which handles all but the most extreme |
default build-time value is 10 million, which handles all but the most extreme |
1989 |
cases. You can override the default by supplying pcre_exec() with a |
cases. You can override the default by supplying pcre_exec() with a |
1990 |
pcre_extra block in which match_limit is set, and |
pcre_extra block in which match_limit is set, and |
1991 |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
1992 |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
1993 |
|
|
1994 |
The match_limit_recursion field is similar to match_limit, but instead |
The match_limit_recursion field is similar to match_limit, but instead |
1995 |
of limiting the total number of times that match() is called, it limits |
of limiting the total number of times that match() is called, it limits |
1996 |
the depth of recursion. The recursion depth is a smaller number than |
the depth of recursion. The recursion depth is a smaller number than |
1997 |
the total number of calls, because not all calls to match() are recur- |
the total number of calls, because not all calls to match() are recur- |
1998 |
sive. This limit is of use only if it is set smaller than match_limit. |
sive. This limit is of use only if it is set smaller than match_limit. |
1999 |
|
|
2000 |
Limiting the recursion depth limits the amount of stack that can be |
Limiting the recursion depth limits the amount of stack that can be |
2001 |
used, or, when PCRE has been compiled to use memory on the heap instead |
used, or, when PCRE has been compiled to use memory on the heap instead |
2002 |
of the stack, the amount of heap memory that can be used. |
of the stack, the amount of heap memory that can be used. |
2003 |
|
|
2004 |
The default value for match_limit_recursion can be set when PCRE is |
The default value for match_limit_recursion can be set when PCRE is |
2005 |
built; the default build-time value is the same as the default for |
built; the default build-time value is the same as the default for |
2006 |
match_limit. You can override the default by supplying pcre_exec() with |
match_limit. You can override the default by supplying pcre_exec() with |
2007 |
a pcre_extra block in which match_limit_recursion is set, and |
a pcre_extra block in which match_limit_recursion is set, and |
2008 |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
2009 |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
2010 |
|
|
2011 |
The callout_data field is used in conjunction with the "callout" fea- |
The callout_data field is used in conjunction with the "callout" fea- |
2012 |
ture, and is described in the pcrecallout documentation. |
ture, and is described in the pcrecallout documentation. |
2013 |
|
|
2014 |
The tables field is used to pass a character tables pointer to |
The tables field is used to pass a character tables pointer to |
2015 |
pcre_exec(); this overrides the value that is stored with the compiled |
pcre_exec(); this overrides the value that is stored with the compiled |
2016 |
pattern. A non-NULL value is stored with the compiled pattern only if |
pattern. A non-NULL value is stored with the compiled pattern only if |
2017 |
custom tables were supplied to pcre_compile() via its tableptr argu- |
custom tables were supplied to pcre_compile() via its tableptr argu- |
2018 |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
2019 |
PCRE's internal tables to be used. This facility is helpful when re- |
PCRE's internal tables to be used. This facility is helpful when re- |
2020 |
using patterns that have been saved after compiling with an external |
using patterns that have been saved after compiling with an external |
2021 |
set of tables, because the external tables might be at a different |
set of tables, because the external tables might be at a different |
2022 |
address when pcre_exec() is called. See the pcreprecompile documenta- |
address when pcre_exec() is called. See the pcreprecompile documenta- |
2023 |
tion for a discussion of saving compiled patterns for later use. |
tion for a discussion of saving compiled patterns for later use. |
2024 |
|
|
2025 |
If PCRE_EXTRA_MARK is set in the flags field, the mark field must be |
If PCRE_EXTRA_MARK is set in the flags field, the mark field must be |
2026 |
set to point to a char * variable. If the pattern contains any back- |
set to point to a char * variable. If the pattern contains any back- |
2027 |
tracking control verbs such as (*MARK:NAME), and the execution ends up |
tracking control verbs such as (*MARK:NAME), and the execution ends up |
2028 |
with a name to pass back, a pointer to the name string (zero termi- |
with a name to pass back, a pointer to the name string (zero termi- |
2029 |
nated) is placed in the variable pointed to by the mark field. The |
nated) is placed in the variable pointed to by the mark field. The |
2030 |
names are within the compiled pattern; if you wish to retain such a |
names are within the compiled pattern; if you wish to retain such a |
2031 |
name you must copy it before freeing the memory of a compiled pattern. |
name you must copy it before freeing the memory of a compiled pattern. |
2032 |
If there is no name to pass back, the variable pointed to by the mark |
If there is no name to pass back, the variable pointed to by the mark |
2033 |
field is set to NULL. For details of the backtracking control verbs, see |
field is set to NULL. For details of the backtracking control verbs, see |
2034 |
the section entitled "Backtracking control" in the pcrepattern documen- |
the section entitled "Backtracking control" in the pcrepattern documen- |
2035 |
tation. |
tation. |
2036 |
|
|
2037 |
Option bits for pcre_exec() |
Option bits for pcre_exec() |
2038 |
|
|
2039 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
2040 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
2041 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART, |
2042 |
PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and |
PCRE_NO_START_OPTIMIZE, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and |
2043 |
PCRE_PARTIAL_HARD. |
PCRE_PARTIAL_HARD. |
2044 |
|
|
2045 |
PCRE_ANCHORED |
PCRE_ANCHORED |
2046 |
|
|
2047 |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
2048 |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
2049 |
turned out to be anchored by virtue of its contents, it cannot be made |
turned out to be anchored by virtue of its contents, it cannot be made |
2050 |
unanchored at matching time. |
unanchored at matching time. |
2051 |
|
|
2052 |
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
2053 |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
2054 |
|
|
2055 |
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
2056 |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
2057 |
or to match any Unicode newline sequence. These options override the |
or to match any Unicode newline sequence. These options override the |
2058 |
choice that was made or defaulted when the pattern was compiled. |
choice that was made or defaulted when the pattern was compiled. |
2059 |
|
|
2060 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
2063 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
2064 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
2065 |
|
|
2066 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
2067 |
defaulted when the pattern was compiled. For details, see the descrip- |
defaulted when the pattern was compiled. For details, see the descrip- |
2068 |
tion of pcre_compile() above. During matching, the newline choice |
tion of pcre_compile() above. During matching, the newline choice |
2069 |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
2070 |
ters. It may also alter the way the match position is advanced after a |
ters. It may also alter the way the match position is advanced after a |
2071 |
match failure for an unanchored pattern. |
match failure for an unanchored pattern. |
2072 |
|
|
2073 |
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is |
2074 |
set, and a match attempt for an unanchored pattern fails when the cur- |
set, and a match attempt for an unanchored pattern fails when the cur- |
2075 |
rent position is at a CRLF sequence, and the pattern contains no |
rent position is at a CRLF sequence, and the pattern contains no |
2076 |
explicit matches for CR or LF characters, the match position is |
explicit matches for CR or LF characters, the match position is |
2077 |
advanced by two characters instead of one, in other words, to after the |
advanced by two characters instead of one, in other words, to after the |
2078 |
CRLF. |
CRLF. |
2079 |
|
|
2080 |
The above rule is a compromise that makes the most common cases work as |
The above rule is a compromise that makes the most common cases work as |
2081 |
expected. For example, if the pattern is .+A (and the PCRE_DOTALL |
expected. For example, if the pattern is .+A (and the PCRE_DOTALL |
2082 |
option is not set), it does not match the string "\r\nA" because, after |
option is not set), it does not match the string "\r\nA" because, after |
2083 |
failing at the start, it skips both the CR and the LF before retrying. |
failing at the start, it skips both the CR and the LF before retrying. |
2084 |
However, the pattern [\r\n]A does match that string, because it con- |
However, the pattern [\r\n]A does match that string, because it con- |
2085 |
tains an explicit CR or LF reference, and so advances only by one char- |
tains an explicit CR or LF reference, and so advances only by one char- |
2086 |
acter after the first failure. |
acter after the first failure. |
2087 |
|
|
2088 |
An explicit match for CR or LF is either a literal appearance of one of |
An explicit match for CR or LF is either a literal appearance of one of |
2089 |
those characters, or one of the \r or \n escape sequences. Implicit |
those characters, or one of the \r or \n escape sequences. Implicit |
2090 |
matches such as [^X] do not count, nor does \s (which includes CR and |
matches such as [^X] do not count, nor does \s (which includes CR and |
2091 |
LF in the characters that it matches). |
LF in the characters that it matches). |
2092 |
|
|
2093 |
Notwithstanding the above, anomalous effects may still occur when CRLF |
Notwithstanding the above, anomalous effects may still occur when CRLF |
2094 |
is a valid newline sequence and explicit \r or \n escapes appear in the |
is a valid newline sequence and explicit \r or \n escapes appear in the |
2095 |
pattern. |
pattern. |
2096 |
|
|
2097 |
PCRE_NOTBOL |
PCRE_NOTBOL |
2098 |
|
|
2099 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
2100 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
2101 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
2102 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
2103 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
2104 |
|
|
2105 |
PCRE_NOTEOL |
PCRE_NOTEOL |
2106 |
|
|
2107 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
2108 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
2109 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
2110 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
2111 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
2112 |
not affect \Z or \z. |
not affect \Z or \z. |
2113 |
|
|
2114 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
2115 |
|
|
2116 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
2117 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
2118 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
2119 |
example, if the pattern |
example, if the pattern |
2120 |
|
|
2121 |
a?b? |
a?b? |
2122 |
|
|
2123 |
is applied to a string not beginning with "a" or "b", it matches an |
is applied to a string not beginning with "a" or "b", it matches an |
2124 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
2125 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
2126 |
rences of "a" or "b". |
rences of "a" or "b". |
2127 |
|
|
2128 |
PCRE_NOTEMPTY_ATSTART |
PCRE_NOTEMPTY_ATSTART |
2129 |
|
|
2130 |
This is like PCRE_NOTEMPTY, except that an empty string match that is |
This is like PCRE_NOTEMPTY, except that an empty string match that is |
2131 |
not at the start of the subject is permitted. If the pattern is |
not at the start of the subject is permitted. If the pattern is |
2132 |
anchored, such a match can occur only if the pattern contains \K. |
anchored, such a match can occur only if the pattern contains \K. |
2133 |
|
|
2134 |
Perl has no direct equivalent of PCRE_NOTEMPTY or |
Perl has no direct equivalent of PCRE_NOTEMPTY or |
2135 |
PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern |
PCRE_NOTEMPTY_ATSTART, but it does make a special case of a pattern |
2136 |
match of the empty string within its split() function, and when using |
match of the empty string within its split() function, and when using |
2137 |
the /g modifier. It is possible to emulate Perl's behaviour after |
the /g modifier. It is possible to emulate Perl's behaviour after |
2138 |
matching a null string by first trying the match again at the same off- |
matching a null string by first trying the match again at the same off- |
2139 |
set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that |
set with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED, and then if that |
2140 |
fails, by advancing the starting offset (see below) and trying an ordi- |
fails, by advancing the starting offset (see below) and trying an ordi- |
2141 |
nary match again. There is some code that demonstrates how to do this |
nary match again. There is some code that demonstrates how to do this |
2142 |
in the pcredemo sample program. In the most general case, you have to |
in the pcredemo sample program. In the most general case, you have to |
2143 |
check to see if the newline convention recognizes CRLF as a newline, |
check to see if the newline convention recognizes CRLF as a newline, |
2144 |
and if so, and the current character is CR followed by LF, advance the |
and if so, and the current character is CR followed by LF, advance the |
2145 |
starting offset by two characters instead of one. |
starting offset by two characters instead of one. |
2146 |
|
|
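The offset-advancing step described above can be sketched as a small helper. This is an illustration only, not code from pcredemo: next_offset_after_empty() and its crlf_is_newline parameter are hypothetical names, the caller is assumed to have already determined whether the newline convention recognizes CRLF, and characters are treated as single bytes (UTF-8 mode would need additional handling).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: after an empty match at `offset`, and after the
 * anchored, not-empty retry at the same offset has failed, return the
 * offset at which to try an ordinary match next.
 */
static size_t next_offset_after_empty(const char *subject, size_t length,
                                      size_t offset, int crlf_is_newline)
{
    assert(subject != NULL);
    assert(offset < length);   /* at end of subject there is nothing to retry */

    /* Step over a CRLF pair as one unit when CRLF counts as a newline. */
    if (crlf_is_newline &&
        offset + 1 < length &&
        subject[offset] == '\r' &&
        subject[offset + 1] == '\n')
        return offset + 2;

    return offset + 1;         /* otherwise advance by one character */
}
```

A caller would stop the loop when the empty match is at the very end of the subject, rather than calling the helper, because no further position remains to be tried.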
2147 |
PCRE_NO_START_OPTIMIZE |
PCRE_NO_START_OPTIMIZE |
2148 |
|
|
2149 |
There are a number of optimizations that pcre_exec() uses at the start |
There are a number of optimizations that pcre_exec() uses at the start |
2150 |
of a match, in order to speed up the process. For example, if it is |
of a match, in order to speed up the process. For example, if it is |
2151 |
known that an unanchored match must start with a specific character, it |
known that an unanchored match must start with a specific character, it |
2152 |
searches the subject for that character, and fails immediately if it |
searches the subject for that character, and fails immediately if it |
2153 |
cannot find it, without actually running the main matching function. |
cannot find it, without actually running the main matching function. |
2154 |
This means that a special item such as (*COMMIT) at the start of a pat- |
This means that a special item such as (*COMMIT) at the start of a pat- |
2155 |
tern is not considered until after a suitable starting point for the |
tern is not considered until after a suitable starting point for the |
2156 |
match has been found. When callouts or (*MARK) items are in use, these |
match has been found. When callouts or (*MARK) items are in use, these |
2157 |
"start-up" optimizations can cause them to be skipped if the pattern is |
"start-up" optimizations can cause them to be skipped if the pattern is |
2158 |
never actually used. The start-up optimizations are in effect a pre- |
never actually used. The start-up optimizations are in effect a pre- |
2159 |
scan of the subject that takes place before the pattern is run. |
scan of the subject that takes place before the pattern is run. |
2160 |
|
|
2161 |
The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, |
The PCRE_NO_START_OPTIMIZE option disables the start-up optimizations, |
2162 |
possibly causing performance to suffer, but ensuring that in cases |
possibly causing performance to suffer, but ensuring that in cases |
2163 |
where the result is "no match", the callouts do occur, and that items |
where the result is "no match", the callouts do occur, and that items |
2164 |
such as (*COMMIT) and (*MARK) are considered at every possible starting |
such as (*COMMIT) and (*MARK) are considered at every possible starting |
2165 |
position in the subject string. Setting PCRE_NO_START_OPTIMIZE can |
position in the subject string. Setting PCRE_NO_START_OPTIMIZE can |
2166 |
change the outcome of a matching operation. Consider the pattern |
change the outcome of a matching operation. Consider the pattern |
2167 |
|
|
2168 |
(*COMMIT)ABC |
(*COMMIT)ABC |
2169 |
|
|
2170 |
When this is compiled, PCRE records the fact that a match must start |
When this is compiled, PCRE records the fact that a match must start |
2171 |
with the character "A". Suppose the subject string is "DEFABC". The |
with the character "A". Suppose the subject string is "DEFABC". The |
2172 |
start-up optimization scans along the subject, finds "A" and runs the |
start-up optimization scans along the subject, finds "A" and runs the |
2173 |
first match attempt from there. The (*COMMIT) item means that the pat- |
first match attempt from there. The (*COMMIT) item means that the pat- |
2174 |
tern must match the current starting position, which in this case, it |
tern must match the current starting position, which in this case, it |
2175 |
does. However, if the same match is run with PCRE_NO_START_OPTIMIZE |
does. However, if the same match is run with PCRE_NO_START_OPTIMIZE |
2176 |
set, the initial scan along the subject string does not happen. The |
set, the initial scan along the subject string does not happen. The |
2177 |
first match attempt is run starting from "D" and when this fails, |
first match attempt is run starting from "D" and when this fails, |
2178 |
(*COMMIT) prevents any further matches being tried, so the overall |
(*COMMIT) prevents any further matches being tried, so the overall |
2179 |
result is "no match". If the pattern is studied, more start-up opti- |
result is "no match". If the pattern is studied, more start-up opti- |
2180 |
mizations may be used. For example, a minimum length for the subject |
mizations may be used. For example, a minimum length for the subject |
2181 |
may be recorded. Consider the pattern |
may be recorded. Consider the pattern |
2182 |
|
|
2183 |
(*MARK:A)(X|Y) |
(*MARK:A)(X|Y) |
2184 |
|
|
2185 |
The minimum length for a match is one character. If the subject is |
The minimum length for a match is one character. If the subject is |
2186 |
"ABC", there will be attempts to match "ABC", "BC", "C", and then |
"ABC", there will be attempts to match "ABC", "BC", "C", and then |
2187 |
finally an empty string. If the pattern is studied, the final attempt |
finally an empty string. If the pattern is studied, the final attempt |
2188 |
does not take place, because PCRE knows that the subject is too short, |
does not take place, because PCRE knows that the subject is too short, |
2189 |
and so the (*MARK) is never encountered. In this case, studying the |
and so the (*MARK) is never encountered. In this case, studying the |
2190 |
pattern does not affect the overall match result, which is still "no |
pattern does not affect the overall match result, which is still "no |
2191 |
match", but it does affect the auxiliary information that is returned. |
match", but it does affect the auxiliary information that is returned. |
2192 |
|
|
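The difference in outcome for the (*COMMIT)ABC example can be modelled in a few lines. This is a toy model rather than PCRE code: toy_commit_search() is a hypothetical function that handles only a literal string preceded by an implied (*COMMIT), and its start_optimize argument stands in for the absence of PCRE_NO_START_OPTIMIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Toy model of the pattern (*COMMIT) followed by a literal such as "ABC".
 * Returns the match offset, or -1 for "no match". Not PCRE code.
 */
static long toy_commit_search(const char *literal, const char *subject,
                              int start_optimize)
{
    size_t lit_len, start = 0;

    assert(literal != NULL && literal[0] != '\0' && subject != NULL);
    lit_len = strlen(literal);

    if (start_optimize) {
        /* Pre-scan: skip to the first occurrence of the required
         * first character before running any match attempt. */
        const char *p = strchr(subject, literal[0]);
        if (p == NULL)
            return -1;
        start = (size_t)(p - subject);
    }

    /* Single attempt: the commit forbids retrying at later positions. */
    if (strncmp(subject + start, literal, lit_len) == 0)
        return (long)start;
    return -1;
}
```

For the subject "DEFABC", the model returns 3 when the pre-scan is enabled and -1 when it is disabled, mirroring the two outcomes described above.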
2193 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
2194 |
|
|
2195 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
2196 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
2197 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
2198 |
points to the start of a UTF-8 character. There is a discussion about |
points to the start of a UTF-8 character. There is a discussion about |
2199 |
the validity of UTF-8 strings in the section on UTF-8 support in the |
the validity of UTF-8 strings in the section on UTF-8 support in the |
2200 |
main pcre page. If an invalid UTF-8 sequence of bytes is found, |
main pcre page. If an invalid UTF-8 sequence of bytes is found, |
2201 |
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con- |
pcre_exec() returns the error PCRE_ERROR_BADUTF8 or, if PCRE_PAR- |
2202 |
tains a value that does not point to the start of a UTF-8 character (or |
TIAL_HARD is set and the problem is a truncated UTF-8 character at the |
2203 |
to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned. |
end of the subject, PCRE_ERROR_SHORTUTF8. If startoffset contains a |
2204 |
|
value that does not point to the start of a UTF-8 character (or to the |
2205 |
If you already know that your subject is valid, and you want to skip |
end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned. |
2206 |
these checks for performance reasons, you can set the |
|
2207 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
If you already know that your subject is valid, and you want to skip |
2208 |
do this for the second and subsequent calls to pcre_exec() if you are |
these checks for performance reasons, you can set the |
2209 |
making repeated calls to find all the matches in a single subject |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
2210 |
string. However, you should be sure that the value of startoffset |
do this for the second and subsequent calls to pcre_exec() if you are |
2211 |
points to the start of a UTF-8 character (or the end of the subject). |
making repeated calls to find all the matches in a single subject |
2212 |
When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 |
string. However, you should be sure that the value of startoffset |
2213 |
string as a subject or an invalid value of startoffset is undefined. |
points to the start of a UTF-8 character (or the end of the subject). |
2214 |
|
When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid UTF-8 |
2215 |
|
string as a subject or an invalid value of startoffset is undefined. |
2216 |
Your program may crash. |
Your program may crash. |
2217 |
|
|
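Before setting PCRE_NO_UTF8_CHECK, a caller may want to confirm for itself that startoffset is safe to pass. The following is a minimal sketch in C and is not part of the PCRE API: the helper name is_utf8_char_start is hypothetical, and it inspects only the single byte at the offset, so it does not validate the subject as a whole.

```c
/* Hypothetical helper (not a PCRE function). Returns 1 if offset is the end
   of the subject, or if the byte at offset is not a UTF-8 continuation byte
   (bit pattern 10xxxxxx). Returns 0 otherwise, including for offsets that
   are negative or greater than length. */
static int is_utf8_char_start(const unsigned char *subject, int length,
                              int offset)
{
    if (offset < 0 || offset > length)
        return 0;                       /* out of range */
    if (offset == length)
        return 1;                       /* end of subject is allowed */
    return (subject[offset] & 0xC0) != 0x80;
}
```

For the four-byte subject "a", 0xC3, 0xA9, "b", offsets 0, 1, 3 and 4 are accepted and offset 2 (the second byte of the two-byte character) is rejected.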
2218 |
PCRE_PARTIAL_HARD |
PCRE_PARTIAL_HARD |
2219 |
PCRE_PARTIAL_SOFT |
PCRE_PARTIAL_SOFT |
2220 |
|
|
2221 |
These options turn on the partial matching feature. For backwards com- |
These options turn on the partial matching feature. For backwards com- |
2222 |
patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial |
patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial |
2223 |
match occurs if the end of the subject string is reached successfully, |
match occurs if the end of the subject string is reached successfully, |
2224 |
but there are not enough subject characters to complete the match. If |
but there are not enough subject characters to complete the match. If |
2225 |
this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, |
this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set, |
2226 |
matching continues by testing any remaining alternatives. Only if no |
matching continues by testing any remaining alternatives. Only if no |
2227 |
complete match can be found is PCRE_ERROR_PARTIAL returned instead of |
complete match can be found is PCRE_ERROR_PARTIAL returned instead of |
2228 |
PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the |
PCRE_ERROR_NOMATCH. In other words, PCRE_PARTIAL_SOFT says that the |
2229 |
caller is prepared to handle a partial match, but only if no complete |
caller is prepared to handle a partial match, but only if no complete |
2230 |
match can be found. |
match can be found. |
2231 |
|
|
2232 |
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this |
If PCRE_PARTIAL_HARD is set, it overrides PCRE_PARTIAL_SOFT. In this |
2233 |
case, if a partial match is found, pcre_exec() immediately returns |
case, if a partial match is found, pcre_exec() immediately returns |
2234 |
PCRE_ERROR_PARTIAL, without considering any other alternatives. In |
PCRE_ERROR_PARTIAL, without considering any other alternatives. In |
2235 |
other words, when PCRE_PARTIAL_HARD is set, a partial match is consid- |
other words, when PCRE_PARTIAL_HARD is set, a partial match is consid- |
2236 |
ered to be more important than an alternative complete match. |
ered to be more important than an alternative complete match. |
2237 |
|
|
2238 |
In both cases, the portion of the string that was inspected when the |
In both cases, the portion of the string that was inspected when the |
2239 |
partial match was found is set as the first matching string. There is a |
partial match was found is set as the first matching string. There is a |
2240 |
more detailed discussion of partial and multi-segment matching, with |
more detailed discussion of partial and multi-segment matching, with |
2241 |
examples, in the pcrepartial documentation. |
examples, in the pcrepartial documentation. |
2242 |
|
|
2243 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
2244 |
|
|
2245 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
2246 |
length (in bytes) in length, and a starting byte offset in startoffset. |
length (in bytes) in length, and a starting byte offset in startoffset. |
2247 |
If this is negative or greater than the length of the subject, |
If this is negative or greater than the length of the subject, |
2248 |
pcre_exec() returns PCRE_ERROR_BADOFFSET. |
pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting offset is |
2249 |
|
zero, the search for a match starts at the beginning of the subject, |
2250 |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
and this is by far the most common case. In UTF-8 mode, the byte offset |
2251 |
acter (or the end of the subject). Unlike the pattern string, the sub- |
must point to the start of a UTF-8 character (or the end of the sub- |
2252 |
ject may contain binary zero bytes. When the starting offset is zero, |
ject). Unlike the pattern string, the subject may contain binary zero |
2253 |
the search for a match starts at the beginning of the subject, and this |
bytes. |
|
is by far the most common case. |
|
2254 |
|
|
2255 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
2256 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
2354 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
2355 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
2356 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
2357 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1, and the offsets for the second |
2358 |
for the second and third capturing subpatterns if you wish (assuming |
and third capturing subpatterns (assuming the vector is large enough, |
2359 |
the vector is large enough, of course). |
of course) are set to -1. |
2360 |
|
|
2361 |
|
Note: Elements of ovector that do not correspond to capturing parenthe- |
2362 |
|
ses in the pattern are never changed. That is, if a pattern contains n |
2363 |
|
capturing parentheses, no more than ovector[0] to ovector[2n+1] are set |
2364 |
|
by pcre_exec(). The other elements retain whatever values they previ- |
2365 |
|
ously had. |
2366 |
|
|
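The convention that unset capturing subpatterns have both offsets equal to -1 can be wrapped in a small accessor. This is only a sketch: capture_is_set and capture_length are hypothetical names, not PCRE functions, and ovecpairs is assumed to be the number of offset pairs that the caller knows pcre_exec() may have written.

```c
/* Hypothetical helpers (not PCRE functions) for reading an ovector that has
   been filled in as described above: pair n is ovector[2n], ovector[2n+1],
   and an unset subpattern has both values equal to -1. */
static int capture_is_set(const int *ovector, int ovecpairs, int n)
{
    if (n < 0 || n >= ovecpairs)
        return 0;                       /* no such pair in the vector */
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Length in bytes of capture n, or -1 if it is unset or out of range. */
static int capture_length(const int *ovector, int ovecpairs, int n)
{
    if (!capture_is_set(ovector, ovecpairs, n))
        return -1;
    return ovector[2 * n + 1] - ovector[2 * n];
}
```

With the offsets that the "abc" example implies, that is {0, 3, 0, 3, -1, -1, -1, -1}, pairs 0 and 1 have length 3 and pairs 2 and 3 are reported as unset.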
2367 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
2368 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
2432 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
2433 |
|
|
2434 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
2435 |
subject. |
subject. However, if PCRE_PARTIAL_HARD is set and the problem is a |
2436 |
|
truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORT- |
2437 |
|
UTF8 is used instead. |
2438 |
|
|
2439 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
2440 |
|
|
2441 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
2442 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
2443 |
ter. |
ter or the end of the subject. |
2444 |
|
|
2445 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
2446 |
|
|
2478 |
The value of startoffset was negative or greater than the length of the |
The value of startoffset was negative or greater than the length of the |
2479 |
subject, that is, the value in length. |
subject, that is, the value in length. |
2480 |
|
|
2481 |
|
PCRE_ERROR_SHORTUTF8 (-25) |
2482 |
|
|
2483 |
|
The subject string ended with an incomplete (truncated) UTF-8 charac- |
2484 |
|
ter, and the PCRE_PARTIAL_HARD option was set. Without this option, |
2485 |
|
PCRE_ERROR_BADUTF8 is returned in this situation. |
2486 |
|
|
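A caller that feeds a subject to pcre_exec() in segments may want to know in advance whether a segment ends in the middle of a character. The sketch below is one way to detect that. It is an assumption-laden illustration, not PCRE's internal check: the function names are hypothetical, the lead-byte ranges follow RFC 3629, and only the final character is examined.

```c
#include <stddef.h>

/* Expected total length of a UTF-8 character, judged from its lead byte
   (ranges as in RFC 3629), or 0 if the byte cannot start a character. */
static int utf8_expected_len(unsigned char lead)
{
    if (lead < 0x80) return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}

/* Hypothetical helper: number of bytes missing from the last character of
   s[0..len-1], or 0 if the last character is complete or the tail is not
   recognizable as a truncated character. */
static int utf8_missing_tail_bytes(const unsigned char *s, size_t len)
{
    size_t back;
    for (back = 1; back <= 4 && back <= len; back++) {
        unsigned char c = s[len - back];
        if ((c & 0xC0) != 0x80) {       /* first non-continuation byte */
            int need = utf8_expected_len(c);
            return need > (int)back ? need - (int)back : 0;
        }
    }
    return 0;
}
```

A non-zero result corresponds to the situation in which the text above says PCRE_ERROR_SHORTUTF8 is returned when PCRE_PARTIAL_HARD is set.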
2487 |
Error numbers -16 to -20 and -22 are not used by pcre_exec(). |
Error numbers -16 to -20 and -22 are not used by pcre_exec(). |
2488 |
|
|
2489 |
|
|
2862 |
|
|
2863 |
REVISION |
REVISION |
2864 |
|
|
2865 |
Last updated: 06 November 2010 |
Last updated: 13 November 2010 |
2866 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
2867 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2868 |
|
|
3562 |
affects \b, and \B because they are defined in terms of \w and \W. |
affects \b, and \B because they are defined in terms of \w and \W. |
3563 |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
Matching these sequences is noticeably slower when PCRE_UCP is set. |
3564 |
|
|
3565 |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
The sequences \h, \H, \v, and \V are features that were added to Perl |
3566 |
the other sequences, which match only ASCII characters by default, |
at release 5.10. In contrast to the other sequences, which match only |
3567 |
these always match certain high-valued codepoints in UTF-8 mode, |
ASCII characters by default, these always match certain high-valued |
3568 |
whether or not PCRE_UCP is set. The horizontal space characters are: |
codepoints in UTF-8 mode, whether or not PCRE_UCP is set. The horizon- |
3569 |
|
tal space characters are: |
3570 |
|
|
3571 |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
3572 |
U+0020 Space |
U+0020 Space |
3600 |
|
|
3601 |
Newline sequences |
Newline sequences |
3602 |
|
|
3603 |
Outside a character class, by default, the escape sequence \R matches |
Outside a character class, by default, the escape sequence \R matches |
3604 |
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 |
any Unicode newline sequence. In non-UTF-8 mode \R is equivalent to the |
3605 |
mode \R is equivalent to the following: |
following: |
3606 |
|
|
3607 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
3608 |
|
|
3609 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
3610 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
3611 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
3612 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
3613 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
3614 |
is treated as a single unit that cannot be split. |
is treated as a single unit that cannot be split. |
3615 |
|
|
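The non-UTF-8 behaviour of \R described above can be written out by hand. The following C sketch is illustrative only: match_bsr is a hypothetical name, and the function covers just the six alternatives listed for non-UTF-8 mode, trying CRLF first so that the pair is consumed as one unit.

```c
#include <stddef.h>

/* Hypothetical helper: length of the newline sequence that starts at
   s[pos], following the non-UTF-8 definition of \R given above, or 0 if
   there is none. CRLF is tested first so that it is never split. */
static size_t match_bsr(const unsigned char *s, size_t len, size_t pos)
{
    if (pos >= len)
        return 0;
    if (s[pos] == '\r' && pos + 1 < len && s[pos + 1] == '\n')
        return 2;                       /* CR followed by LF */
    switch (s[pos]) {
    case '\n':                          /* LF,  U+000A */
    case 0x0b:                          /* VT,  U+000B */
    case '\f':                          /* FF,  U+000C */
    case '\r':                          /* CR,  U+000D */
    case 0x85:                          /* NEL, U+0085 */
        return 1;
    default:
        return 0;
    }
}
```

A CR that is not followed by LF still matches as a single character, which mirrors the order of the alternatives in the atomic group.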
3616 |
In UTF-8 mode, two additional characters whose codepoints are greater |
In UTF-8 mode, two additional characters whose codepoints are greater |
3617 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
3618 |
rator, U+2029). Unicode character property support is not needed for |
rator, U+2029). Unicode character property support is not needed for |
3619 |
these characters to be recognized. |
these characters to be recognized. |
3620 |
|
|
3621 |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
3622 |
the complete set of Unicode line endings) by setting the option |
the complete set of Unicode line endings) by setting the option |
3623 |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
3624 |
(BSR is an abbreviation for "backslash R".) This can be made the default |
(BSR is an abbreviation for "backslash R".) This can be made the default |
3625 |
when PCRE is built; if this is the case, the other behaviour can be |
when PCRE is built; if this is the case, the other behaviour can be |
3626 |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
3627 |
specify these settings by starting a pattern string with one of the |
specify these settings by starting a pattern string with one of the |
3628 |
following sequences: |
following sequences: |
3629 |
|
|
3630 |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
3631 |
(*BSR_UNICODE) any Unicode newline sequence |
(*BSR_UNICODE) any Unicode newline sequence |
3632 |
|
|
3633 |
These override the default and the options given to pcre_compile() or |
These override the default and the options given to pcre_compile() or |
3634 |
pcre_compile2(), but they can be overridden by options given to |
pcre_compile2(), but they can be overridden by options given to |
3635 |
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which |
pcre_exec() or pcre_dfa_exec(). Note that these special settings, which |
3636 |
are not Perl-compatible, are recognized only at the very start of a |
are not Perl-compatible, are recognized only at the very start of a |
3637 |
pattern, and that they must be in upper case. If more than one of them |
pattern, and that they must be in upper case. If more than one of them |
3638 |
is present, the last one is used. They can be combined with a change of |
is present, the last one is used. They can be combined with a change of |
3639 |
newline convention; for example, a pattern can start with: |
newline convention; for example, a pattern can start with: |
3640 |
|
|
3641 |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
3642 |
|
|
3643 |
They can also be combined with the (*UTF8) or (*UCP) special sequences. |
They can also be combined with the (*UTF8) or (*UCP) special sequences. |
3644 |
Inside a character class, \R is treated as an unrecognized escape |
Inside a character class, \R is treated as an unrecognized escape |
3645 |
sequence, and so matches the letter "R" by default, but causes an error |
sequence, and so matches the letter "R" by default, but causes an error |
3646 |
if PCRE_EXTRA is set. |
if PCRE_EXTRA is set. |
3647 |
|
|
3648 |
Unicode character properties |
Unicode character properties |
3649 |
|
|
3650 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
3651 |
tional escape sequences that match characters with specific properties |
tional escape sequences that match characters with specific properties |
3652 |
are available. When not in UTF-8 mode, these sequences are of course |
are available. When not in UTF-8 mode, these sequences are of course |
3653 |
limited to testing characters whose codepoints are less than 256, but |
limited to testing characters whose codepoints are less than 256, but |
3654 |
they do work in this mode. The extra escape sequences are: |
they do work in this mode. The extra escape sequences are: |
3655 |
|
|
3656 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
3657 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
3658 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
3659 |
|
|
3660 |
The property names represented by xx above are limited to the Unicode |
The property names represented by xx above are limited to the Unicode |
3661 |
script names, the general category properties, "Any", which matches any |
script names, the general category properties, "Any", which matches any |
3662 |
character (including newline), and some special PCRE properties |
character (including newline), and some special PCRE properties |
3663 |
(described in the next section). Other Perl properties such as "InMu- |
(described in the next section). Other Perl properties such as "InMu- |
3664 |
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} |
sicalSymbols" are not currently supported by PCRE. Note that \P{Any} |
3665 |
does not match any characters, so always causes a match failure. |
does not match any characters, so always causes a match failure. |
3666 |
|
|
3667 |
Sets of Unicode characters are defined as belonging to certain scripts. |
Sets of Unicode characters are defined as belonging to certain scripts. |
3668 |
A character from one of these sets can be matched using a script name. |
A character from one of these sets can be matched using a script name. |
3669 |
For example: |
For example: |
3670 |
|
|
3671 |
\p{Greek} |
\p{Greek} |
3672 |
\P{Han} |
\P{Han} |
3673 |
|
|
3674 |
Those that are not part of an identified script are lumped together as |
Those that are not part of an identified script are lumped together as |
3675 |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
3676 |
|
|
3677 |
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, |
Arabic, Armenian, Avestan, Balinese, Bamum, Bengali, Bopomofo, Braille, |
3678 |
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, |
Buginese, Buhid, Canadian_Aboriginal, Carian, Cham, Cherokee, Common, |
3679 |
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp- |
Coptic, Cuneiform, Cypriot, Cyrillic, Deseret, Devanagari, Egyp- |
3680 |
tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, |
tian_Hieroglyphs, Ethiopic, Georgian, Glagolitic, Gothic, Greek, |
3681 |
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe- |
Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Impe- |
3682 |
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, |
rial_Aramaic, Inherited, Inscriptional_Pahlavi, Inscriptional_Parthian, |
3683 |
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, |
Javanese, Kaithi, Kannada, Katakana, Kayah_Li, Kharoshthi, Khmer, Lao, |
3684 |
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, |
Latin, Lepcha, Limbu, Linear_B, Lisu, Lycian, Lydian, Malayalam, |
3685 |
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, |
Meetei_Mayek, Mongolian, Myanmar, New_Tai_Lue, Nko, Ogham, Old_Italic, |
3686 |
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, |
Old_Persian, Old_South_Arabian, Old_Turkic, Ol_Chiki, Oriya, Osmanya, |
3687 |
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, |
Phags_Pa, Phoenician, Rejang, Runic, Samaritan, Saurashtra, Shavian, |
3688 |
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, |
Sinhala, Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, |
3689 |
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
Tai_Tham, Tai_Viet, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
3690 |
Ugaritic, Vai, Yi. |
Ugaritic, Vai, Yi. |
3691 |
|
|
3692 |
Each character has exactly one Unicode general category property, spec- |
Each character has exactly one Unicode general category property, spec- |
3693 |
ified by a two-letter abbreviation. For compatibility with Perl, nega- |
ified by a two-letter abbreviation. For compatibility with Perl, nega- |
3694 |
tion can be specified by including a circumflex between the opening |
tion can be specified by including a circumflex between the opening |
3695 |
brace and the property name. For example, \p{^Lu} is the same as |
brace and the property name. For example, \p{^Lu} is the same as |
3696 |
\P{Lu}. |
\P{Lu}. |
3697 |
|
|
3698 |
If only one letter is specified with \p or \P, it includes all the gen- |
If only one letter is specified with \p or \P, it includes all the gen- |
3699 |
eral category properties that start with that letter. In this case, in |
eral category properties that start with that letter. In this case, in |
3700 |
the absence of negation, the curly brackets in the escape sequence are |
the absence of negation, the curly brackets in the escape sequence are |
3701 |
optional; these two examples have the same effect: |
optional; these two examples have the same effect: |
3702 |
|
|
3703 |
\p{L} |
\p{L} |
3749 |
Zp Paragraph separator |
Zp Paragraph separator |
3750 |
Zs Space separator |
Zs Space separator |
3751 |
|
|
3752 |
The special property L& is also supported: it matches a character that |
The special property L& is also supported: it matches a character that |
3753 |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
3754 |
classified as a modifier or "other". |
classified as a modifier or "other". |
3755 |
|
|
3756 |
The Cs (Surrogate) property applies only to characters in the range |
The Cs (Surrogate) property applies only to characters in the range |
3757 |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
3758 |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
3759 |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
3760 |
the pcreapi page). Perl does not support the Cs property. |
the pcreapi page). Perl does not support the Cs property. |
3761 |
|
|
3762 |
The long synonyms for property names that Perl supports (such as |
The long synonyms for property names that Perl supports (such as |
3763 |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
3764 |
any of these properties with "Is". |
any of these properties with "Is". |
3765 |
|
|
3766 |
No character that is in the Unicode table has the Cn (unassigned) prop- |
No character that is in the Unicode table has the Cn (unassigned) prop- |
3767 |
erty. Instead, this property is assumed for any code point that is not |
erty. Instead, this property is assumed for any code point that is not |
3768 |
in the Unicode table. |
in the Unicode table. |
3769 |
|
|
3770 |
Specifying caseless matching does not affect these escape sequences. |
Specifying caseless matching does not affect these escape sequences. |
3771 |
For example, \p{Lu} always matches only upper case letters. |
For example, \p{Lu} always matches only upper case letters. |
3772 |
|
|
3773 |
The \X escape matches any number of Unicode characters that form an |
The \X escape matches any number of Unicode characters that form an |
3774 |
extended Unicode sequence. \X is equivalent to |
extended Unicode sequence. \X is equivalent to |
3775 |
|
|
3776 |
(?>\PM\pM*) |
(?>\PM\pM*) |
3777 |
|
|
3778 |
That is, it matches a character without the "mark" property, followed |
That is, it matches a character without the "mark" property, followed |
3779 |
by zero or more characters with the "mark" property, and treats the |
by zero or more characters with the "mark" property, and treats the |
3780 |
sequence as an atomic group (see below). Characters with the "mark" |
sequence as an atomic group (see below). Characters with the "mark" |
3781 |
property are typically accents that affect the preceding character. |
property are typically accents that affect the preceding character. |
3782 |
None of them have codepoints less than 256, so in non-UTF-8 mode \X |
None of them have codepoints less than 256, so in non-UTF-8 mode \X |
3783 |
matches any one character. |
matches any one character. |
3784 |
|
|
3785 |
Matching characters by Unicode property is not fast, because PCRE has |
Matching characters by Unicode property is not fast, because PCRE has |
3786 |
to search a structure that contains data for over fifteen thousand |
to search a structure that contains data for over fifteen thousand |
3787 |
characters. That is why the traditional escape sequences such as \d and |
characters. That is why the traditional escape sequences such as \d and |
3788 |
\w do not use Unicode properties in PCRE by default, though you can |
\w do not use Unicode properties in PCRE by default, though you can |
3789 |
make them do so by setting the PCRE_UCP option for pcre_compile() or by |
make them do so by setting the PCRE_UCP option for pcre_compile() or by |
3790 |
starting the pattern with (*UCP). |
starting the pattern with (*UCP). |
3791 |
|
|
3792 |
PCRE's additional properties |
PCRE's additional properties |
3793 |
|
|
3794 |
As well as the standard Unicode properties described in the previous |
As well as the standard Unicode properties described in the previous |
3795 |
section, PCRE supports four more that make it possible to convert tra- |
section, PCRE supports four more that make it possible to convert tra- |
3796 |
ditional escape sequences such as \w and \s and POSIX character classes |
ditional escape sequences such as \w and \s and POSIX character classes |
3797 |
to use Unicode properties. PCRE uses these non-standard, non-Perl prop- |
to use Unicode properties. PCRE uses these non-standard, non-Perl prop- |
3798 |
erties internally when PCRE_UCP is set. They are: |
erties internally when PCRE_UCP is set. They are: |
3802 |
Xsp Any Perl space character |
Xsp Any Perl space character |
3803 |
Xwd Any Perl "word" character |
Xwd Any Perl "word" character |
3804 |
|
|
3805 |
Xan matches characters that have either the L (letter) or the N (num- |
Xan matches characters that have either the L (letter) or the N (num- |
3806 |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
ber) property. Xps matches the characters tab, linefeed, vertical tab, |
3807 |
formfeed, or carriage return, and any other character that has the Z |
formfeed, or carriage return, and any other character that has the Z |
3808 |
(separator) property. Xsp is the same as Xps, except that vertical tab |
(separator) property. Xsp is the same as Xps, except that vertical tab |
3809 |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
is excluded. Xwd matches the same characters as Xan, plus underscore. |
3810 |
|
|
3811 |
Resetting the match start |
Resetting the match start |
3812 |
|
|
3813 |
The escape sequence \K, which is a Perl 5.10 feature, causes any previ- |
The escape sequence \K causes any previously matched characters not to |
3814 |
ously matched characters not to be included in the final matched |
be included in the final matched sequence. For example, the pattern: |
|
sequence. For example, the pattern: |
|
3815 |
|
|
3816 |
foo\Kbar |
foo\Kbar |
3817 |
|
|
3967 |
flex and dollar, the only relationship being that they both involve |
flex and dollar, the only relationship being that they both involve |
3968 |
newlines. Dot has no special meaning in a character class. |
newlines. Dot has no special meaning in a character class. |
3969 |
|
|
3970 |
The escape sequence \N always behaves as a dot does when PCRE_DOTALL is |
The escape sequence \N behaves like a dot, except that it is not |
3971 |
not set. In other words, it matches any one character except one that |
affected by the PCRE_DOTALL option. In other words, it matches any |
3972 |
signifies the end of a line. |
character except one that signifies the end of a line. |
3973 |
|
|
3974 |
|
|
3975 |
MATCHING A SINGLE BYTE |
MATCHING A SINGLE BYTE |
3978 |
both in and out of UTF-8 mode. Unlike a dot, it always matches any |
both in and out of UTF-8 mode. Unlike a dot, it always matches any |
3979 |
line-ending characters. The feature is provided in Perl in order to |
line-ending characters. The feature is provided in Perl in order to |
3980 |
match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char- |
match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char- |
3981 |
acters into individual bytes, what remains in the string may be a mal- |
acters into individual bytes, the rest of the string may start with a |
3982 |
formed UTF-8 string. For this reason, the \C escape sequence is best |
malformed UTF-8 character. For this reason, the \C escape sequence is |
3983 |
avoided. |
best avoided. |
3984 |
|
|
3985 |
PCRE does not allow \C to appear in lookbehind assertions (described |
PCRE does not allow \C to appear in lookbehind assertions (described |
3986 |
below), because in UTF-8 mode this would make it impossible to calcu- |
below), because in UTF-8 mode this would make it impossible to calcu- |
4186 |
fore show up in data extracted by the pcre_fullinfo() function). |
fore show up in data extracted by the pcre_fullinfo() function). |
4187 |
|
|
4188 |
An option change within a subpattern (see below for a description of |
An option change within a subpattern (see below for a description of |
4189 |
subpatterns) affects only that part of the current pattern that follows |
subpatterns) affects only that part of the subpattern that follows it, |
4190 |
it, so |
so |
4191 |
|
|
4192 |
(a(?i)b)c |
(a(?i)b)c |
4193 |
|
|
4223 |
|
|
4224 |
cat(aract|erpillar|) |
cat(aract|erpillar|) |
4225 |
|
|
4226 |
matches one of the words "cat", "cataract", or "caterpillar". Without |
matches "cataract", "caterpillar", or "cat". Without the parentheses, |
4227 |
the parentheses, it would match "cataract", "erpillar" or an empty |
it would match "cataract", "erpillar" or an empty string. |
|
string. |
|
4228 |
|
|
4229 |
2. It sets up the subpattern as a capturing subpattern. This means |
2. It sets up the subpattern as a capturing subpattern. This means |
4230 |
that, when the whole pattern matches, that portion of the subject |
that, when the whole pattern matches, that portion of the subject |
4231 |
string that matched the subpattern is passed back to the caller via the |
string that matched the subpattern is passed back to the caller via the |
4232 |
ovector argument of pcre_exec(). Opening parentheses are counted from |
ovector argument of pcre_exec(). Opening parentheses are counted from |
4233 |
left to right (starting from 1) to obtain numbers for the capturing |
left to right (starting from 1) to obtain numbers for the capturing |
4234 |
subpatterns. |
subpatterns. For example, if the string "the red king" is matched |
4235 |
|
against the pattern |
|
For example, if the string "the red king" is matched against the pat- |
|
|
tern |
|
4236 |
|
|
4237 |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
4238 |
|
|
4239 |
the captured substrings are "red king", "red", and "king", and are num- |
the captured substrings are "red king", "red", and "king", and are num- |
4240 |
bered 1, 2, and 3, respectively. |
bered 1, 2, and 3, respectively. |
4241 |
|
|
4242 |
The fact that plain parentheses fulfil two functions is not always |
The fact that plain parentheses fulfil two functions is not always |
4243 |
helpful. There are often times when a grouping subpattern is required |
helpful. There are often times when a grouping subpattern is required |
4244 |
without a capturing requirement. If an opening parenthesis is followed |
without a capturing requirement. If an opening parenthesis is followed |
4245 |
by a question mark and a colon, the subpattern does not do any captur- |
by a question mark and a colon, the subpattern does not do any captur- |
4246 |
ing, and is not counted when computing the number of any subsequent |
ing, and is not counted when computing the number of any subsequent |
4247 |
capturing subpatterns. For example, if the string "the white queen" is |
capturing subpatterns. For example, if the string "the white queen" is |
4248 |
matched against the pattern |
matched against the pattern |
4249 |
|
|
4250 |
the ((?:red|white) (king|queen)) |
the ((?:red|white) (king|queen)) |
4252 |
the captured substrings are "white queen" and "queen", and are numbered |
the captured substrings are "white queen" and "queen", and are numbered |
4253 |
1 and 2. The maximum number of capturing subpatterns is 65535. |
1 and 2. The maximum number of capturing subpatterns is 65535. |
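
The numbering rules for capturing and non-capturing parentheses can be
checked with a short script. The sketch below uses Python's standard re
module as a stand-in for PCRE; that substitution is an assumption made
for convenience, and the variable names are illustrative only. The two
engines number capturing parentheses in the same way for these patterns.

```python
import re

# Python's re module is used here only as a stand-in for PCRE.
# Both engines number groups by counting capturing opening parentheses.
capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# Three capturing groups: numbers 1, 2, and 3.
print(capturing.groups())       # ('red king', 'red', 'king')

# (?:...) groups without capturing, so only numbers 1 and 2 exist.
print(non_capturing.groups())   # ('white queen', 'queen')
```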
4254 |
|
|
4255 |
As a convenient shorthand, if any option settings are required at the |
As a convenient shorthand, if any option settings are required at the |
4256 |
start of a non-capturing subpattern, the option letters may appear |
start of a non-capturing subpattern, the option letters may appear |
4257 |
between the "?" and the ":". Thus the two patterns |
between the "?" and the ":". Thus the two patterns |
4258 |
|
|
4259 |
(?i:saturday|sunday) |
(?i:saturday|sunday) |
4260 |
(?:(?i)saturday|sunday) |
(?:(?i)saturday|sunday) |
4261 |
|
|
4262 |
match exactly the same set of strings. Because alternative branches are |
match exactly the same set of strings. Because alternative branches are |
4263 |
tried from left to right, and options are not reset until the end of |
tried from left to right, and options are not reset until the end of |
4264 |
the subpattern is reached, an option setting in one branch does affect |
the subpattern is reached, an option setting in one branch does affect |
4265 |
subsequent branches, so the above patterns match "SUNDAY" as well as |
subsequent branches, so the above patterns match "SUNDAY" as well as |
4266 |
"Saturday". |
"Saturday". |
4267 |
|
|
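
The option-letter shorthand can be tried out as follows. This is a
sketch using Python's re module rather than PCRE (an assumption; Python
accepts the (?i:...) spelling from version 3.6 onwards). Only the first
of the two spellings is exercised, because Python's handling of an
option setting in the middle of a pattern differs from PCRE's.

```python
import re

# Stand-in for PCRE: the option letter "i" sits between "?" and ":".
pattern = re.compile(r"(?i:saturday|sunday)")

# The caseless option applies to both alternatives of the group.
results = {
    word: pattern.fullmatch(word) is not None
    for word in ("Saturday", "SUNDAY", "monday")
}
print(results)
```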
4268 |
|
|
4269 |
DUPLICATE SUBPATTERN NUMBERS |
DUPLICATE SUBPATTERN NUMBERS |
4270 |
|
|
4271 |
Perl 5.10 introduced a feature whereby each alternative in a subpattern |
Perl 5.10 introduced a feature whereby each alternative in a subpattern |
4272 |
uses the same numbers for its capturing parentheses. Such a subpattern |
uses the same numbers for its capturing parentheses. Such a subpattern |
4273 |
starts with (?| and is itself a non-capturing subpattern. For example, |
starts with (?| and is itself a non-capturing subpattern. For example, |
4274 |
consider this pattern: |
consider this pattern: |
4275 |
|
|
4276 |
(?|(Sat)ur|(Sun))day |
(?|(Sat)ur|(Sun))day |
4277 |
|
|
4278 |
Because the two alternatives are inside a (?| group, both sets of cap- |
Because the two alternatives are inside a (?| group, both sets of cap- |
4279 |
turing parentheses are numbered one. Thus, when the pattern matches, |
turing parentheses are numbered one. Thus, when the pattern matches, |
4280 |
you can look at captured substring number one, whichever alternative |
you can look at captured substring number one, whichever alternative |
4281 |
matched. This construct is useful when you want to capture part, but |
matched. This construct is useful when you want to capture part, but |
4282 |
not all, of one of a number of alternatives. Inside a (?| group, paren- |
not all, of one of a number of alternatives. Inside a (?| group, paren- |
4283 |
theses are numbered as usual, but the number is reset at the start of |
theses are numbered as usual, but the number is reset at the start of |
4284 |
each branch. The numbers of any capturing buffers that follow the sub- |
each branch. The numbers of any capturing parentheses that follow the |
4285 |
pattern start after the highest number used in any branch. The follow- |
subpattern start after the highest number used in any branch. The fol- |
4286 |
ing example is taken from the Perl documentation. The numbers under- |
lowing example is taken from the Perl documentation. The numbers under- |
4287 |
neath show in which buffer the captured content will be stored. |
neath show in which buffer the captured content will be stored. |
4288 |
|
|
4289 |
# before ---------------branch-reset----------- after |
# before ---------------branch-reset----------- after |
4290 |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
/ ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x |
4291 |
# 1 2 2 3 2 3 4 |
# 1 2 2 3 2 3 4 |
4292 |
|
|
4293 |
A back reference to a numbered subpattern uses the most recent value |
A back reference to a numbered subpattern uses the most recent value |
4294 |
that is set for that number by any subpattern. The following pattern |
that is set for that number by any subpattern. The following pattern |
4295 |
matches "abcabc" or "defdef": |
matches "abcabc" or "defdef": |
4296 |
|
|
4297 |
/(?|(abc)|(def))\1/ |
/(?|(abc)|(def))\1/ |
4298 |
|
|
4299 |
In contrast, a recursive or "subroutine" call to a numbered subpattern |
In contrast, a recursive or "subroutine" call to a numbered subpattern |
4300 |
always refers to the first one in the pattern with the given number. |
always refers to the first one in the pattern with the given number. |
4301 |
The following pattern matches "abcabc" or "defabc": |
The following pattern matches "abcabc" or "defabc": |
4302 |
|
|
4303 |
/(?|(abc)|(def))(?1)/ |
/(?|(abc)|(def))(?1)/ |
4304 |
|
|
4305 |
If a condition test for a subpattern's having matched refers to a non- |
If a condition test for a subpattern's having matched refers to a non- |
4306 |
unique number, the test is true if any of the subpatterns of that num- |
unique number, the test is true if any of the subpatterns of that num- |
4307 |
ber have matched. |
ber have matched. |
4308 |
|
|
4309 |
An alternative approach to using this "branch reset" feature is to use |
An alternative approach to using this "branch reset" feature is to use |
4310 |
duplicate named subpatterns, as described in the next section. |
duplicate named subpatterns, as described in the next section. |
4311 |
|
|
4312 |
|
|
4313 |
NAMED SUBPATTERNS |
NAMED SUBPATTERNS |
4314 |
|
|
4315 |
Identifying capturing parentheses by number is simple, but it can be |
Identifying capturing parentheses by number is simple, but it can be |
4316 |
very hard to keep track of the numbers in complicated regular expres- |
very hard to keep track of the numbers in complicated regular expres- |
4317 |
sions. Furthermore, if an expression is modified, the numbers may |
sions. Furthermore, if an expression is modified, the numbers may |
4318 |
change. To help with this difficulty, PCRE supports the naming of sub- |
change. To help with this difficulty, PCRE supports the naming of sub- |
4319 |
patterns. This feature was not added to Perl until release 5.10. Python |
patterns. This feature was not added to Perl until release 5.10. Python |
4320 |
had the feature earlier, and PCRE introduced it at release 4.0, using |
had the feature earlier, and PCRE introduced it at release 4.0, using |
4321 |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
4322 |
tax. Perl allows identically numbered subpatterns to have different |
tax. Perl allows identically numbered subpatterns to have different |
4323 |
names, but PCRE does not. |
names, but PCRE does not. |
4324 |
|
|
4325 |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
4326 |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
4327 |
to capturing parentheses from other parts of the pattern, such as back |
to capturing parentheses from other parts of the pattern, such as back |
4328 |
references, recursion, and conditions, can be made by name as well as |
references, recursion, and conditions, can be made by name as well as |
4329 |
by number. |
by number. |
4330 |
|
|
4331 |
Names consist of up to 32 alphanumeric characters and underscores. |
Names consist of up to 32 alphanumeric characters and underscores. |
4332 |
Named capturing parentheses are still allocated numbers as well as |
Named capturing parentheses are still allocated numbers as well as |
4333 |
names, exactly as if the names were not present. The PCRE API provides |
names, exactly as if the names were not present. The PCRE API provides |
4334 |
function calls for extracting the name-to-number translation table from |
function calls for extracting the name-to-number translation table from |
4335 |
a compiled pattern. There is also a convenience function for extracting |
a compiled pattern. There is also a convenience function for extracting |
4336 |
a captured substring by name. |
a captured substring by name. |
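
The relationship between names and numbers can be illustrated with the
Python syntax (?P<name>...), which PCRE also accepts. The sketch below
uses Python's re module, not the PCRE API; its groupindex attribute
plays a role comparable to the name-to-number table described above.
The pattern and the group names "year" and "month" are hypothetical.

```python
import re

# Stand-in for PCRE: named groups written in the Python syntax.
pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
match = pattern.match("2010-05")

# Named groups still receive numbers, exactly as if they were unnamed.
name_to_number = dict(pattern.groupindex)

# The same substring can be extracted by name or by number.
by_name = match.group("year")
by_number = match.group(1)
print(name_to_number, by_name, by_number)
```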
4337 |
|
|
4338 |
By default, a name must be unique within a pattern, but it is possible |
By default, a name must be unique within a pattern, but it is possible |
4339 |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
4340 |
time. (Duplicate names are also always permitted for subpatterns with |
time. (Duplicate names are also always permitted for subpatterns with |
4341 |
the same number, set up as described in the previous section.) Dupli- |
the same number, set up as described in the previous section.) Dupli- |
4342 |
cate names can be useful for patterns where only one instance of the |
cate names can be useful for patterns where only one instance of the |
4343 |
named parentheses can match. Suppose you want to match the name of a |
named parentheses can match. Suppose you want to match the name of a |
4344 |
weekday, either as a 3-letter abbreviation or as the full name, and in |
weekday, either as a 3-letter abbreviation or as the full name, and in |
4345 |
both cases you want to extract the abbreviation. This pattern (ignoring |
both cases you want to extract the abbreviation. This pattern (ignoring |
4346 |
the line breaks) does the job: |
the line breaks) does the job: |
4347 |
|
|
4351 |
(?<DN>Thu)(?:rsday)?| |
(?<DN>Thu)(?:rsday)?| |
4352 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
4353 |
|
|
4354 |
There are five capturing substrings, but only one is ever set after a |
There are five capturing substrings, but only one is ever set after a |
4355 |
match. (An alternative way of solving this problem is to use a "branch |
match. (An alternative way of solving this problem is to use a "branch |
4356 |
reset" subpattern, as described in the previous section.) |
reset" subpattern, as described in the previous section.) |
4357 |
|
|
4358 |
The convenience function for extracting the data by name returns the |
The convenience function for extracting the data by name returns the |
4359 |
substring for the first (and in this example, the only) subpattern of |
substring for the first (and in this example, the only) subpattern of |
4360 |
that name that matched. This saves searching to find which numbered |
that name that matched. This saves searching to find which numbered |
4361 |
subpattern it was. |
subpattern it was. |
4362 |
|
|
4363 |
If you make a back reference to a non-unique named subpattern from |
If you make a back reference to a non-unique named subpattern from |
4364 |
elsewhere in the pattern, the one that corresponds to the first occur- |
elsewhere in the pattern, the one that corresponds to the first occur- |
4365 |
rence of the name is used. In the absence of duplicate numbers (see the |
rence of the name is used. In the absence of duplicate numbers (see the |
4366 |
previous section) this is the one with the lowest number. If you use a |
previous section) this is the one with the lowest number. If you use a |
4367 |
named reference in a condition test (see the section about conditions |
named reference in a condition test (see the section about conditions |
4368 |
below), either to check whether a subpattern has matched, or to check |
below), either to check whether a subpattern has matched, or to check |
4369 |
for recursion, all subpatterns with the same name are tested. If the |
for recursion, all subpatterns with the same name are tested. If the |
4370 |
condition is true for any one of them, the overall condition is true. |
condition is true for any one of them, the overall condition is true. |
4371 |
This is the same behaviour as testing by number. For further details of |
This is the same behaviour as testing by number. For further details of |
4372 |
the interfaces for handling named subpatterns, see the pcreapi documen- |
the interfaces for handling named subpatterns, see the pcreapi documen- |
4373 |
tation. |
tation. |
4374 |
|
|
4375 |
Warning: You cannot use different names to distinguish between two sub- |
Warning: You cannot use different names to distinguish between two sub- |
4376 |
patterns with the same number because PCRE uses only the numbers when |
patterns with the same number because PCRE uses only the numbers when |
4377 |
matching. For this reason, an error is given at compile time if differ- |
matching. For this reason, an error is given at compile time if differ- |
4378 |
ent names are given to subpatterns with the same number. However, you |
ent names are given to subpatterns with the same number. However, you |
4379 |
can give the same name to subpatterns with the same number, even when |
can give the same name to subpatterns with the same number, even when |
4380 |
PCRE_DUPNAMES is not set. |
PCRE_DUPNAMES is not set. |
4381 |
|
|
4382 |
|
|
4383 |
REPETITION |
REPETITION |
4384 |
|
|
4385 |
Repetition is specified by quantifiers, which can follow any of the |
Repetition is specified by quantifiers, which can follow any of the |
4386 |
following items: |
following items: |
4387 |
|
|
4388 |
a literal data character |
a literal data character |
4390 |
the \C escape sequence |
the \C escape sequence |
4391 |
the \X escape sequence (in UTF-8 mode with Unicode properties) |
the \X escape sequence (in UTF-8 mode with Unicode properties) |
4392 |
the \R escape sequence |
the \R escape sequence |
4393 |
an escape such as \d that matches a single character |
an escape such as \d or \pL that matches a single character |
4394 |
a character class |
a character class |
4395 |
a back reference (see next section) |
a back reference (see next section) |
4396 |
a parenthesized subpattern (unless it is an assertion) |
a parenthesized subpattern (unless it is an assertion) |
4397 |
a recursive or "subroutine" call to a subpattern |
a recursive or "subroutine" call to a subpattern |
4398 |
|
|
4399 |
The general repetition quantifier specifies a minimum and maximum num- |
The general repetition quantifier specifies a minimum and maximum num- |
4400 |
ber of permitted matches, by giving the two numbers in curly brackets |
ber of permitted matches, by giving the two numbers in curly brackets |
4401 |
(braces), separated by a comma. The numbers must be less than 65536, |
(braces), separated by a comma. The numbers must be less than 65536, |
4402 |
and the first must be less than or equal to the second. For example: |
and the first must be less than or equal to the second. For example: |
4403 |
|
|
4404 |
z{2,4} |
z{2,4} |
4405 |
|
|
4406 |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
4407 |
special character. If the second number is omitted, but the comma is |
special character. If the second number is omitted, but the comma is |
4408 |
present, there is no upper limit; if the second number and the comma |
present, there is no upper limit; if the second number and the comma |
4409 |
are both omitted, the quantifier specifies an exact number of required |
are both omitted, the quantifier specifies an exact number of required |
4410 |
matches. Thus |
matches. Thus |
4411 |
|
|
4412 |
[aeiou]{3,} |
[aeiou]{3,} |
4415 |
|
|
4416 |
\d{8} |
\d{8} |
4417 |
|
|
4418 |
matches exactly 8 digits. An opening curly bracket that appears in a |
matches exactly 8 digits. An opening curly bracket that appears in a |
4419 |
position where a quantifier is not allowed, or one that does not match |
position where a quantifier is not allowed, or one that does not match |
4420 |
the syntax of a quantifier, is taken as a literal character. For exam- |
the syntax of a quantifier, is taken as a literal character. For exam- |
4421 |
ple, {,6} is not a quantifier, but a literal string of four characters. |
ple, {,6} is not a quantifier, but a literal string of four characters. |
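
The three forms of the general quantifier can be exercised as shown
below. The sketch uses Python's re module as a stand-in for PCRE (an
assumption), and the subject strings are illustrative. Note that the
{,6} case is deliberately left out: Python's re module treats a missing
first number as zero, which differs from the PCRE behaviour described
above.

```python
import re

# {min,max}: between two and four occurrences of "z".
candidates = ["z", "zz", "zzz", "zzzz", "zzzzz"]
z_matches = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# {min,}: at least three vowels, with no upper limit.
many_vowels = re.fullmatch(r"[aeiou]{3,}", "aeiouaeiou") is not None
two_vowels = re.fullmatch(r"[aeiou]{3,}", "ae") is not None

# {n}: exactly eight digits.
eight_digits = re.fullmatch(r"\d{8}", "20100501") is not None
seven_digits = re.fullmatch(r"\d{8}", "2010050") is not None

print(z_matches, many_vowels, two_vowels, eight_digits, seven_digits)
```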
4422 |
|
|
4423 |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
4424 |
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- |
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- |
4425 |
acters, each of which is represented by a two-byte sequence. Similarly, |
acters, each of which is represented by a two-byte sequence. Similarly, |
4426 |
when Unicode property support is available, \X{3} matches three Unicode |
when Unicode property support is available, \X{3} matches three Unicode |
4427 |
extended sequences, each of which may be several bytes long (and they |
extended sequences, each of which may be several bytes long (and they |
4428 |
may be of different lengths). |
may be of different lengths). |
4429 |
|
|
4430 |
The quantifier {0} is permitted, causing the expression to behave as if |
The quantifier {0} is permitted, causing the expression to behave as if |
4431 |
the previous item and the quantifier were not present. This may be use- |
the previous item and the quantifier were not present. This may be use- |
4432 |
ful for subpatterns that are referenced as subroutines from elsewhere |
ful for subpatterns that are referenced as subroutines from elsewhere |
4433 |
in the pattern. Items other than subpatterns that have a {0} quantifier |
in the pattern (but see also the section entitled "Defining subpatterns |
4434 |
are omitted from the compiled pattern. |
for use by reference only" below). Items other than subpatterns that |
4435 |
|
have a {0} quantifier are omitted from the compiled pattern. |
4436 |
|
|
4437 |
For convenience, the three most common quantifiers have single-charac- |
For convenience, the three most common quantifiers have single-charac- |
4438 |
ter abbreviations: |
ter abbreviations: |
4663 |
subpattern is possible using named parentheses (see below). |
subpattern is possible using named parentheses (see below). |
4664 |
|
|
4665 |
Another way of avoiding the ambiguity inherent in the use of digits |
Another way of avoiding the ambiguity inherent in the use of digits |
4666 |
following a backslash is to use the \g escape sequence, which is a fea- |
following a backslash is to use the \g escape sequence. This escape |
4667 |
ture introduced in Perl 5.10. This escape must be followed by an |
must be followed by an unsigned number or a negative number, optionally |
4668 |
unsigned number or a negative number, optionally enclosed in braces. |
enclosed in braces. These examples are all identical: |
|
These examples are all identical: |
|
4669 |
|
|
4670 |
(ring), \1 |
(ring), \1 |
4671 |
(ring), \g1 |
(ring), \g1 |
4672 |
(ring), \g{1} |
(ring), \g{1} |
4673 |
|
|
4674 |
An unsigned number specifies an absolute reference without the ambigu- |
An unsigned number specifies an absolute reference without the ambigu- |
4675 |
ity that is present in the older syntax. It is also useful when literal |
ity that is present in the older syntax. It is also useful when literal |
4676 |
digits follow the reference. A negative number is a relative reference. |
digits follow the reference. A negative number is a relative reference. |
4677 |
Consider this example: |
Consider this example: |
4679 |
(abc(def)ghi)\g{-1} |
(abc(def)ghi)\g{-1} |
4680 |
|
|
4681 |
The sequence \g{-1} is a reference to the most recently started captur- |
The sequence \g{-1} is a reference to the most recently started captur- |
4682 |
ing subpattern before \g, that is, it is equivalent to \2. Similarly, |
ing subpattern before \g, that is, it is equivalent to \2. Similarly, |
4683 |
\g{-2} would be equivalent to \1. The use of relative references can be |
\g{-2} would be equivalent to \1. The use of relative references can be |
4684 |
helpful in long patterns, and also in patterns that are created by |
helpful in long patterns, and also in patterns that are created by |
4685 |
joining together fragments that contain references within themselves. |
joining together fragments that contain references within themselves. |
4686 |
|
|
4687 |
A back reference matches whatever actually matched the capturing sub- |
A back reference matches whatever actually matched the capturing sub- |
4688 |
pattern in the current subject string, rather than anything matching |
pattern in the current subject string, rather than anything matching |
4689 |
the subpattern itself (see "Subpatterns as subroutines" below for a way |
the subpattern itself (see "Subpatterns as subroutines" below for a way |
4690 |
of doing that). So the pattern |
of doing that). So the pattern |
4691 |
|
|
4692 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
4693 |
|
|
4694 |
matches "sense and sensibility" and "response and responsibility", but |
matches "sense and sensibility" and "response and responsibility", but |
4695 |
not "sense and responsibility". If caseful matching is in force at the |
not "sense and responsibility". If caseful matching is in force at the |
4696 |
time of the back reference, the case of letters is relevant. For exam- |
time of the back reference, the case of letters is relevant. For exam- |
4697 |
ple, |
ple, |
4698 |
|
|
4699 |
((?i)rah)\s+\1 |
((?i)rah)\s+\1 |
4700 |
|
|
4701 |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
4702 |
original capturing subpattern is matched caselessly. |
original capturing subpattern is matched caselessly. |
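
The point that a back reference matches the text actually captured,
and not the subpattern itself, can be demonstrated with the first
example. The sketch uses Python's re module as a stand-in for PCRE (an
assumption); the caseless example is not reproduced because Python
handles mid-pattern option settings differently.

```python
import re

# \1 must repeat whatever group 1 captured in this particular match.
pattern = re.compile(r"(sens|respons)e and \1ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]
matched = [s for s in subjects if pattern.fullmatch(s)]
print(matched)
```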
4703 |
|
|
4704 |
There are several different ways of writing back references to named |
There are several different ways of writing back references to named |
4705 |
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
4706 |
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
4707 |
unified back reference syntax, in which \g can be used for both numeric |
unified back reference syntax, in which \g can be used for both numeric |
4708 |
and named references, is also supported. We could rewrite the above |
and named references, is also supported. We could rewrite the above |
4709 |
example in any of the following ways: |
example in any of the following ways: |
4710 |
|
|
4711 |
(?<p1>(?i)rah)\s+\k<p1> |
(?<p1>(?i)rah)\s+\k<p1> |
4713 |
(?P<p1>(?i)rah)\s+(?P=p1) |
(?P<p1>(?i)rah)\s+(?P=p1) |
4714 |
(?<p1>(?i)rah)\s+\g{p1} |
(?<p1>(?i)rah)\s+\g{p1} |
4715 |
|
|
4716 |
A subpattern that is referenced by name may appear in the pattern |
A subpattern that is referenced by name may appear in the pattern |
4717 |
before or after the reference. |
before or after the reference. |
4718 |
|
|
4719 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
4720 |
subpattern has not actually been used in a particular match, any back |
subpattern has not actually been used in a particular match, any back |
4721 |
references to it always fail by default. For example, the pattern |
references to it always fail by default. For example, the pattern |
4722 |
|
|
4723 |
(a|(bc))\2 |
(a|(bc))\2 |
4724 |
|
|
4725 |
always fails if it starts to match "a" rather than "bc". However, if |
always fails if it starts to match "a" rather than "bc". However, if |
4726 |
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer- |
the PCRE_JAVASCRIPT_COMPAT option is set at compile time, a back refer- |
4727 |
ence to an unset value matches an empty string. |
ence to an unset value matches an empty string. |
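
The default treatment of a back reference to an unset subpattern can be
observed in the following sketch. It uses Python's re module as a
stand-in for PCRE (an assumption); Python's behaviour here corresponds
to the default described above, not to the PCRE_JAVASCRIPT_COMPAT
behaviour.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 is unset when the first alternative matches "a",
# so the back reference \2 fails and there is no match.
unset_case = pattern.fullmatch("a")

# When "bc" is matched, group 2 is set and \2 must repeat it.
set_case = pattern.fullmatch("bcbc")

print(unset_case is None, set_case.group(2))
```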
4728 |
|
|
4729 |
Because there may be many capturing parentheses in a pattern, all dig- |
Because there may be many capturing parentheses in a pattern, all dig- |
4730 |
its following a backslash are taken as part of a potential back refer- |
its following a backslash are taken as part of a potential back refer- |
4731 |
ence number. If the pattern continues with a digit character, some |
ence number. If the pattern continues with a digit character, some |
4732 |
delimiter must be used to terminate the back reference. If the |
delimiter must be used to terminate the back reference. If the |
4733 |
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
PCRE_EXTENDED option is set, this can be whitespace. Otherwise, the \g{ |
4734 |
syntax or an empty comment (see "Comments" below) can be used. |
syntax or an empty comment (see "Comments" below) can be used. |
4735 |
|
|
4736 |
Recursive back references |
Recursive back references |
4737 |
|
|
4738 |
A back reference that occurs inside the parentheses to which it refers |
A back reference that occurs inside the parentheses to which it refers |
4739 |
fails when the subpattern is first used, so, for example, (a\1) never |
fails when the subpattern is first used, so, for example, (a\1) never |
4740 |
matches. However, such references can be useful inside repeated sub- |
matches. However, such references can be useful inside repeated sub- |
4741 |
patterns. For example, the pattern |
patterns. For example, the pattern |
4742 |
|
|
4743 |
(a|b\1)+ |
(a|b\1)+ |
4744 |
|
|
4745 |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
4746 |
ation of the subpattern, the back reference matches the character |
ation of the subpattern, the back reference matches the character |
4747 |
string corresponding to the previous iteration. In order for this to |
string corresponding to the previous iteration. In order for this to |
4748 |
work, the pattern must be such that the first iteration does not need |
work, the pattern must be such that the first iteration does not need |
4749 |
to match the back reference. This can be done using alternation, as in |
to match the back reference. This can be done using alternation, as in |
4750 |
the example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of zero. |
4751 |
|
|
4752 |
Back references of this type cause the group that they reference to be |
Back references of this type cause the group that they reference to be |
4753 |
treated as an atomic group. Once the whole group has been matched, a |
treated as an atomic group. Once the whole group has been matched, a |
4754 |
subsequent matching failure cannot cause backtracking into the middle |
subsequent matching failure cannot cause backtracking into the middle |
4755 |
of the group. |
of the group. |
4756 |
|
|
4757 |
|
|
4758 |
ASSERTIONS |
ASSERTIONS |
4759 |
|
|
4760 |
An assertion is a test on the characters following or preceding the |
An assertion is a test on the characters following or preceding the |
4761 |
current matching point that does not actually consume any characters. |
current matching point that does not actually consume any characters. |
4762 |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
4763 |
described above. |
described above. |
4764 |
|
|
4765 |
More complicated assertions are coded as subpatterns. There are two |
More complicated assertions are coded as subpatterns. There are two |
4766 |
kinds: those that look ahead of the current position in the subject |
kinds: those that look ahead of the current position in the subject |
4767 |
string, and those that look behind it. An assertion subpattern is |
string, and those that look behind it. An assertion subpattern is |
4768 |
matched in the normal way, except that it does not cause the current |
matched in the normal way, except that it does not cause the current |
4769 |
matching position to be changed. |
matching position to be changed. |
4770 |
|
|
4771 |
Assertion subpatterns are not capturing subpatterns, and may not be |
Assertion subpatterns are not capturing subpatterns, and may not be |
4772 |
repeated, because it makes no sense to assert the same thing several |
repeated, because it makes no sense to assert the same thing several |
4773 |
times. If any kind of assertion contains capturing subpatterns within |
times. If any kind of assertion contains capturing subpatterns within |
4774 |
it, these are counted for the purposes of numbering the capturing sub- |
it, these are counted for the purposes of numbering the capturing sub- |
4775 |
patterns in the whole pattern. However, substring capturing is carried |
patterns in the whole pattern. However, substring capturing is carried |
4776 |
out only for positive assertions, because it does not make sense for |
out only for positive assertions, because it does not make sense for |
4777 |
negative assertions. |
negative assertions. |
4778 |
|
|
4779 |
Lookahead assertions |
Lookahead assertions |
4783 |
|
|
4784 |
\w+(?=;) |
\w+(?=;) |
4785 |
|
|
4786 |
matches a word followed by a semicolon, but does not include the semi- |
matches a word followed by a semicolon, but does not include the semi- |
4787 |
colon in the match, and |
colon in the match, and |
4788 |
|
|
4789 |
foo(?!bar) |
foo(?!bar) |
4790 |
|
|
4791 |
matches any occurrence of "foo" that is not followed by "bar". Note |
matches any occurrence of "foo" that is not followed by "bar". Note |
4792 |
that the apparently similar pattern |
that the apparently similar pattern |
4793 |
|
|
4794 |
(?!foo)bar |
(?!foo)bar |
4795 |
|
|
4796 |
does not find an occurrence of "bar" that is preceded by something |
does not find an occurrence of "bar" that is preceded by something |
4797 |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
4798 |
the assertion (?!foo) is always true when the next three characters are |
the assertion (?!foo) is always true when the next three characters are |
4799 |
"bar". A lookbehind assertion is needed to achieve the other effect. |
"bar". A lookbehind assertion is needed to achieve the other effect. |
4800 |
|
|
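The three lookahead patterns above can be tried out with a short sketch. It uses Python's standard re module as a stand-in for PCRE, on the assumption that both treat these simple lookaheads the same way. The subject strings and variable names are invented for illustration.

```python
import re

# NOTE: Python's re module is used here as a stand-in for PCRE (an assumption);
# the subject strings below are made up for the example.

# \w+(?=;) matches a word followed by a semicolon, without consuming the semicolon.
word = re.search(r"\w+(?=;)", "int count;").group(0)

# foo(?!bar) matches only those occurrences of "foo" not followed by "bar".
foo_starts = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo)bar still matches "bar" directly after "foo": at the point where the
# assertion is tested, the next three characters are "bar", not "foo".
bar_start = re.search(r"(?!foo)bar", "foobar").start()

print(word, foo_starts, bar_start)
```

The last result is the point of the note in the text: the negative lookahead is tested at the start of "bar", so it never sees the preceding "foo".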
4801 |
If you want to force a matching failure at some point in a pattern, the |
If you want to force a matching failure at some point in a pattern, the |
4802 |
most convenient way to do it is with (?!) because an empty string |
most convenient way to do it is with (?!) because an empty string |
4803 |
always matches, so an assertion that requires there not to be an empty |
always matches, so an assertion that requires there not to be an empty |
4804 |
string must always fail. The Perl 5.10 backtracking control verb |
string must always fail. The backtracking control verb (*FAIL) or (*F) |
4805 |
(*FAIL) or (*F) is essentially a synonym for (?!). |
is essentially a synonym for (?!). |
4806 |
|
|
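A minimal sketch of forcing a failure with (?!) follows. It again uses Python's re module in place of PCRE (an assumption). Python's re has no (*FAIL) or (*F) verb, so only the (?!) form is shown.

```python
import re

# NOTE: Python's re is a stand-in for PCRE here (an assumption).

# An empty negative lookahead can never succeed, so this pattern never matches.
always_fail = re.compile(r"(?!)")
no_match = always_fail.search("any subject at all")

# Placed inside one alternative, (?!) forces that branch to fail, so the
# matcher has to fall back on the other branch.
forced = re.compile(r"a(?!)|b")
fallback = forced.search("ab").group(0)

print(no_match, fallback)
```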
4807 |
Lookbehind assertions |
Lookbehind assertions |
4808 |
|
|
4809 |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
4810 |
for negative assertions. For example, |
for negative assertions. For example, |
4811 |
|
|
4812 |
(?<!foo)bar |
(?<!foo)bar |
4813 |
|
|
4814 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
4815 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
4816 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
4817 |
eral top-level alternatives, they do not all have to have the same |
eral top-level alternatives, they do not all have to have the same |
4818 |
fixed length. Thus |
fixed length. Thus |
4819 |
|
|
4820 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
4823 |
|
|
4824 |
(?<!dogs?|cats?) |
(?<!dogs?|cats?) |
4825 |
|
|
4826 |
causes an error at compile time. Branches that match different length |
causes an error at compile time. Branches that match different length |
4827 |
strings are permitted only at the top level of a lookbehind assertion. |
strings are permitted only at the top level of a lookbehind assertion. |
4828 |
This is an extension compared with Perl (5.8 and 5.10), which requires |
This is an extension compared with Perl, which requires all branches to |
4829 |
all branches to match the same length of string. An assertion such as |
match the same length of string. An assertion such as |
4830 |
|
|
4831 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
4832 |
|
|
4833 |
is not permitted, because its single top-level branch can match two |
is not permitted, because its single top-level branch can match two |
4834 |
different lengths, but it is acceptable to PCRE if rewritten to use two |
different lengths, but it is acceptable to PCRE if rewritten to use two |
4835 |
top-level branches: |
top-level branches: |
4836 |
|
|
4837 |
(?<=abc|abde) |
(?<=abc|abde) |
4838 |
|
|
4839 |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
In some cases, the escape sequence \K (see above) can be used instead |
4840 |
instead of a lookbehind assertion to get round the fixed-length |
of a lookbehind assertion to get round the fixed-length restriction. |
|
restriction. |
|
4841 |
|
|
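The negative lookbehind example can be sketched in the same way, with Python's re module standing in for PCRE (an assumption). Python's re is stricter than PCRE as described above: it requires the whole lookbehind to have one fixed width, so it does not accept (?<=abc|abde). The sketch therefore writes the two branches as two separate lookbehinds. That rewriting is a workaround for the stand-in engine, not something PCRE requires. The trailing "x" and the subject strings are invented for illustration.

```python
import re

# NOTE: Python's re is a stand-in for PCRE here (an assumption).

# (?<!foo)bar finds "bar" only where it is not preceded by "foo".
bar_starts = [m.start() for m in re.finditer(r"(?<!foo)bar", "foobar bazbar")]

# Each lookbehind below has a single fixed length (3 and 4 characters), which
# mirrors the two top-level branches of (?<=abc|abde) in the text.
after_either = re.compile(r"(?:(?<=abc)|(?<=abde))x")
results = [bool(after_either.search(s)) for s in ("abcx", "abdex", "abx")]

print(bar_starts, results)
```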
4842 |
The implementation of lookbehind assertions is, for each alternative, |
The implementation of lookbehind assertions is, for each alternative, |
4843 |
to temporarily move the current position back by the fixed length and |
to temporarily move the current position back by the fixed length and |
5073 |
ters are interpreted as newlines is controlled by the options passed to |
ters are interpreted as newlines is controlled by the options passed to |
5074 |
pcre_compile() or by a special sequence at the start of the pattern, as |
pcre_compile() or by a special sequence at the start of the pattern, as |
5075 |
described in the section entitled "Newline conventions" above. Note |
described in the section entitled "Newline conventions" above. Note |
5076 |
that end of this type of comment is a literal newline sequence in the |
that the end of this type of comment is a literal newline sequence in |
5077 |
pattern; escape sequences that happen to represent a newline do not |
the pattern; escape sequences that happen to represent a newline do not |
5078 |
count. For example, consider this pattern when PCRE_EXTENDED is set, |
count. For example, consider this pattern when PCRE_EXTENDED is set, |
5079 |
and the default newline convention is in force: |
and the default newline convention is in force: |
5080 |
|
|
5081 |
abc #comment \n still comment |
abc #comment \n still comment |
5139 |
refer to them instead of the whole pattern. |
refer to them instead of the whole pattern. |
5140 |
|
|
5141 |
In a larger pattern, keeping track of parenthesis numbers can be |
In a larger pattern, keeping track of parenthesis numbers can be |
5142 |
tricky. This is made easier by the use of relative references (a Perl |
tricky. This is made easier by the use of relative references. Instead |
5143 |
5.10 feature). Instead of (?1) in the pattern above you can write |
of (?1) in the pattern above you can write (?-2) to refer to the second |
5144 |
(?-2) to refer to the second most recently opened parentheses preceding |
most recently opened parentheses preceding the recursion. In other |
5145 |
the recursion. In other words, a negative number counts capturing |
words, a negative number counts capturing parentheses leftwards from |
5146 |
parentheses leftwards from the point at which it is encountered. |
the point at which it is encountered. |
5147 |
|
|
5148 |
It is also possible to refer to subsequently opened parentheses, by |
It is also possible to refer to subsequently opened parentheses, by |
5149 |
writing references such as (?+2). However, these cannot be recursive |
writing references such as (?+2). However, these cannot be recursive |
5649 |
|
|
5650 |
REVISION |
REVISION |
5651 |
|
|
5652 |
Last updated: 31 October 2010 |
Last updated: 17 November 2010 |
5653 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
5654 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5655 |
|
|
6142 |
or $ are encountered at the end of the subject, the result is |
or $ are encountered at the end of the subject, the result is |
6143 |
PCRE_ERROR_PARTIAL. |
PCRE_ERROR_PARTIAL. |
6144 |
|
|
6145 |
|
Setting PCRE_PARTIAL_HARD also affects the way pcre_exec() checks UTF-8 |
6146 |
|
subject strings for validity. Normally, an invalid UTF-8 sequence |
6147 |
|
causes the error PCRE_ERROR_BADUTF8. However, in the special case of a |
6148 |
|
truncated UTF-8 character at the end of the subject, PCRE_ERROR_SHORT- |
6149 |
|
UTF8 is returned when PCRE_PARTIAL_HARD is set. |
6150 |
|
|
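The following sketch illustrates the distinction between an invalid UTF-8 sequence and a character that is merely truncated at the end of the subject. It does not call PCRE. The helper classify_utf8 and its return strings are hypothetical, and the check is made with Python's incremental UTF-8 decoder, which may not agree with PCRE's own validity rules in every detail.

```python
import codecs

def classify_utf8(subject: bytes) -> str:
    """Hypothetical helper: return "ok", "truncated" or "invalid"."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False lets an incomplete trailing character be buffered
        # instead of being reported as an error.
        decoder.decode(subject, final=False)
    except UnicodeDecodeError:
        return "invalid"      # loosely analogous to PCRE_ERROR_BADUTF8
    pending, _flags = decoder.getstate()
    # Buffered bytes mean the subject ended part-way through a character,
    # loosely analogous to PCRE_ERROR_SHORTUTF8.
    return "truncated" if pending else "ok"

print(classify_utf8(b"caf\xc3\xa9"))  # complete two-byte character
print(classify_utf8(b"caf\xc3"))      # lead byte with no continuation byte
print(classify_utf8(b"caf\xff"))      # byte that can never appear in UTF-8
```

A caller doing multi-segment matching could treat the "truncated" case as a signal to wait for more data, much as it would for a partial match.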
6151 |
Comparing hard and soft partial matching |
Comparing hard and soft partial matching |
6152 |
|
|
6153 |
The difference between the two partial matching options can be illus- |
The difference between the two partial matching options can be illus- |
6392 |
data> gsb\R\P\P\D |
data> gsb\R\P\P\D |
6393 |
Partial match: gsb |
Partial match: gsb |
6394 |
|
|
|
|
|
6395 |
4. Patterns that contain alternatives at the top level which do not all |
4. Patterns that contain alternatives at the top level which do not all |
6396 |
start with the same pattern item may not work as expected when |
start with the same pattern item may not work as expected when |
6397 |
PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider |
PCRE_DFA_RESTART is used with pcre_dfa_exec(). For example, consider |
6438 |
|
|
6439 |
REVISION |
REVISION |
6440 |
|
|
6441 |
Last updated: 22 October 2010 |
Last updated: 07 November 2010 |
6442 |
Copyright (c) 1997-2010 University of Cambridge. |
Copyright (c) 1997-2010 University of Cambridge. |
6443 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6444 |
|
|