18 |
|
|
19 |
The PCRE library is a set of functions that implement regular expres- |
The PCRE library is a set of functions that implement regular expres- |
20 |
sion pattern matching using the same syntax and semantics as Perl, with |
sion pattern matching using the same syntax and semantics as Perl, with |
21 |
just a few differences. (Certain features that appeared in Python and |
just a few differences. Certain features that appeared in Python and |
22 |
PCRE before they appeared in Perl are also available using the Python |
PCRE before they appeared in Perl are also available using the Python |
23 |
syntax.) |
syntax. There is also some support for certain .NET and Oniguruma syn- |
24 |
|
tax items, and there is an option for requesting some minor changes |
25 |
|
that give better JavaScript compatibility. |
26 |
|
|
27 |
The current implementation of PCRE (release 7.x) corresponds approxi- |
The current implementation of PCRE (release 7.x) corresponds approxi- |
28 |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
258 |
|
|
259 |
REVISION |
REVISION |
260 |
|
|
261 |
Last updated: 09 August 2007 |
Last updated: 12 April 2008 |
262 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
263 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
264 |
|
|
265 |
|
|
545 |
Note that libreadline is GPL-licenced, so if you distribute a binary of |
Note that libreadline is GPL-licenced, so if you distribute a binary of |
546 |
pcretest linked in this way, there may be licensing issues. |
pcretest linked in this way, there may be licensing issues. |
547 |
|
|
548 |
|
Setting this option causes the -lreadline option to be added to the |
549 |
|
pcretest build. In many operating environments with a system-installed |
550 |
|
libreadline this is sufficient. However, in some environments (e.g. if |
551 |
|
an unmodified distribution version of readline is in use), some extra |
552 |
|
configuration may be necessary. The INSTALL file for libreadline says |
553 |
|
this: |
554 |
|
|
555 |
|
"Readline uses the termcap functions, but does not link with the |
556 |
|
termcap or curses library itself, allowing applications which link |
557 |
|
with readline the to choose an appropriate library." |
558 |
|
|
559 |
|
If your environment has not been set up so that an appropriate library |
560 |
|
is automatically included, you may need to add something like |
561 |
|
|
562 |
|
LIBS="-lncurses" |
563 |
|
|
564 |
|
immediately before the configure command. |
565 |
|
|
566 |
|
|
567 |
SEE ALSO |
SEE ALSO |
568 |
|
|
578 |
|
|
579 |
REVISION |
REVISION |
580 |
|
|
581 |
Last updated: 18 December 2007 |
Last updated: 13 April 2008 |
582 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
583 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
584 |
|
|
585 |
|
|
727 |
tive algorithm moves through the subject string one character at a |
tive algorithm moves through the subject string one character at a |
728 |
time, for all active paths through the tree. |
time, for all active paths through the tree. |
729 |
|
|
730 |
8. None of the backtracking control verbs such as (*PRUNE) are sup- |
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) |
731 |
ported. |
are not supported. (*FAIL) is supported, and behaves like a failing |
732 |
|
negative assertion. |
733 |
|
|
734 |
|
|
735 |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
736 |
|
|
737 |
Using the alternative matching algorithm provides the following advan- |
Using the alternative matching algorithm provides the following advan- |
738 |
tages: |
tages: |
739 |
|
|
740 |
1. All possible matches (at a single point in the subject) are automat- |
1. All possible matches (at a single point in the subject) are automat- |
741 |
ically found, and in particular, the longest match is found. To find |
ically found, and in particular, the longest match is found. To find |
742 |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
743 |
things with callouts. |
things with callouts. |
744 |
|
|
745 |
2. There is much better support for partial matching. The restrictions |
2. There is much better support for partial matching. The restrictions |
746 |
on the content of the pattern that apply when using the standard algo- |
on the content of the pattern that apply when using the standard algo- |
747 |
rithm for partial matching do not apply to the alternative algorithm. |
rithm for partial matching do not apply to the alternative algorithm. |
748 |
For non-anchored patterns, the starting position of a partial match is |
For non-anchored patterns, the starting position of a partial match is |
749 |
available. |
available. |
750 |
|
|
751 |
3. Because the alternative algorithm scans the subject string just |
3. Because the alternative algorithm scans the subject string just |
752 |
once, and never needs to backtrack, it is possible to pass very long |
once, and never needs to backtrack, it is possible to pass very long |
753 |
subject strings to the matching function in several pieces, checking |
subject strings to the matching function in several pieces, checking |
754 |
for partial matching each time. |
for partial matching each time. |
755 |
|
|
756 |
|
|
758 |
|
|
759 |
The alternative algorithm suffers from a number of disadvantages: |
The alternative algorithm suffers from a number of disadvantages: |
760 |
|
|
761 |
1. It is substantially slower than the standard algorithm. This is |
1. It is substantially slower than the standard algorithm. This is |
762 |
partly because it has to search for all possible matches, but is also |
partly because it has to search for all possible matches, but is also |
763 |
because it is less susceptible to optimization. |
because it is less susceptible to optimization. |
764 |
|
|
765 |
2. Capturing parentheses and back references are not supported. |
2. Capturing parentheses and back references are not supported. |
777 |
|
|
778 |
REVISION |
REVISION |
779 |
|
|
780 |
Last updated: 08 August 2007 |
Last updated: 19 April 2008 |
781 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
782 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
783 |
|
|
784 |
|
|
1265 |
before or at the first newline in the subject string, though the |
before or at the first newline in the subject string, though the |
1266 |
matched text may continue over the newline. |
matched text may continue over the newline. |
1267 |
|
|
1268 |
|
PCRE_JAVASCRIPT_COMPAT |
1269 |
|
|
1270 |
|
If this option is set, PCRE's behaviour is changed in some ways so that |
1271 |
|
it is compatible with JavaScript rather than Perl. The changes are as |
1272 |
|
follows: |
1273 |
|
|
1274 |
|
(1) A lone closing square bracket in a pattern causes a compile-time |
1275 |
|
error, because this is illegal in JavaScript (by default it is treated |
1276 |
|
as a data character). Thus, the pattern AB]CD becomes illegal when this |
1277 |
|
option is set. |
1278 |
|
|
1279 |
|
(2) At run time, a back reference to an unset subpattern group matches |
1280 |
|
an empty string (by default this causes the current matching alterna- |
1281 |
|
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
1282 |
|
set (assuming it can find an "a" in the subject), whereas it fails by |
1283 |
|
default, for Perl compatibility. |
1284 |
|
|
1285 |
PCRE_MULTILINE |
PCRE_MULTILINE |
1286 |
|
|
1287 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
1288 |
line of characters (even if it actually contains newlines). The "start |
line of characters (even if it actually contains newlines). The "start |
1289 |
of line" metacharacter (^) matches only at the start of the string, |
of line" metacharacter (^) matches only at the start of the string, |
1290 |
while the "end of line" metacharacter ($) matches only at the end of |
while the "end of line" metacharacter ($) matches only at the end of |
1291 |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
1292 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
1293 |
|
|
1294 |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
1295 |
constructs match immediately following or immediately before internal |
constructs match immediately following or immediately before internal |
1296 |
newlines in the subject string, respectively, as well as at the very |
newlines in the subject string, respectively, as well as at the very |
1297 |
start and end. This is equivalent to Perl's /m option, and it can be |
start and end. This is equivalent to Perl's /m option, and it can be |
1298 |
changed within a pattern by a (?m) option setting. If there are no new- |
changed within a pattern by a (?m) option setting. If there are no new- |
1299 |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
1300 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
1301 |
|
|
1302 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1305 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
1306 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1307 |
|
|
1308 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
1309 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
1310 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
1311 |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
1312 |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
1313 |
that any of the three preceding sequences should be recognized. Setting |
that any of the three preceding sequences should be recognized. Setting |
1314 |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
1315 |
recognized. The Unicode newline sequences are the three just mentioned, |
recognized. The Unicode newline sequences are the three just mentioned, |
1316 |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
1317 |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
1318 |
(paragraph separator, U+2029). The last two are recognized only in |
(paragraph separator, U+2029). The last two are recognized only in |
1319 |
UTF-8 mode. |
UTF-8 mode. |
1320 |
|
|
1321 |
The newline setting in the options word uses three bits that are |
The newline setting in the options word uses three bits that are |
1322 |
treated as a number, giving eight possibilities. Currently only six are |
treated as a number, giving eight possibilities. Currently only six are |
1323 |
used (default plus the five values above). This means that if you set |
used (default plus the five values above). This means that if you set |
1324 |
more than one newline option, the combination may or may not be sensi- |
more than one newline option, the combination may or may not be sensi- |
1325 |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
1326 |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
1327 |
cause an error. |
cause an error. |
1328 |
|
|
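The bit arithmetic described above can be sketched in a few lines of C. The NL_ constants are local stand-ins, not library names; their values are believed to match the PCRE_NEWLINE_xxx definitions in pcre.h for the 7.x releases, but that is an assumption, and pcre.h remains the authority. The helper functions are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Local stand-ins for the PCRE_NEWLINE_xxx bits (assumed values; the
   authoritative definitions are in pcre.h). */
#define NL_CR      0x00100000
#define NL_LF      0x00200000
#define NL_CRLF    0x00300000
#define NL_ANY     0x00400000
#define NL_ANYCRLF 0x00500000
#define NL_MASK    0x00700000   /* the three newline bits */

/* Extract the newline setting as a number in the range 0..7. */
static int newline_code(unsigned long options)
{
  return (int)((options & NL_MASK) >> 20);
}

/* Return a description of the setting, or NULL for an unused number. */
static const char *newline_name(unsigned long options)
{
  static const char *names[8] = {
    "default", "CR", "LF", "CRLF", "ANY", "ANYCRLF", NULL, NULL
  };
  return names[newline_code(options)];
}

/* Show that CR combined with LF gives the CRLF number, while LF combined
   with ANY gives a number that is not in use. */
static void newline_demo(void)
{
  assert(newline_code(NL_CR | NL_LF) == newline_code(NL_CRLF));
  assert(newline_name(NL_LF | NL_ANY) == NULL);  /* 2 | 4 = 6, unused */
  printf("CR|LF -> %s\n", newline_name(NL_CR | NL_LF));
}
```

Because the three bits are read as a single number, ORing two options together does not mean "either of these"; it selects whichever setting the resulting number happens to denote, if any.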
1329 |
The only time that a line break is specially recognized when compiling |
The only time that a line break is specially recognized when compiling |
1330 |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
1331 |
character class is encountered. This indicates a comment that lasts |
character class is encountered. This indicates a comment that lasts |
1332 |
until after the next line break sequence. In other circumstances, line |
until after the next line break sequence. In other circumstances, line |
1333 |
break sequences are treated as literal data, except that in |
break sequences are treated as literal data, except that in |
1334 |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
1335 |
and are therefore ignored. |
and are therefore ignored. |
1336 |
|
|
1337 |
The newline option that is set at compile time becomes the default that |
The newline option that is set at compile time becomes the default that |
1338 |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
1339 |
|
|
1340 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
1341 |
|
|
1342 |
If this option is set, it disables the use of numbered capturing paren- |
If this option is set, it disables the use of numbered capturing paren- |
1343 |
theses in the pattern. Any opening parenthesis that is not followed by |
theses in the pattern. Any opening parenthesis that is not followed by |
1344 |
? behaves as if it were followed by ?: but named parentheses can still |
? behaves as if it were followed by ?: but named parentheses can still |
1345 |
be used for capturing (and they acquire numbers in the usual way). |
be used for capturing (and they acquire numbers in the usual way). |
1346 |
There is no equivalent of this option in Perl. |
There is no equivalent of this option in Perl. |
1347 |
|
|
1348 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
1349 |
|
|
1350 |
This option inverts the "greediness" of the quantifiers so that they |
This option inverts the "greediness" of the quantifiers so that they |
1351 |
are not greedy by default, but become greedy if followed by "?". It is |
are not greedy by default, but become greedy if followed by "?". It is |
1352 |
not compatible with Perl. It can also be set by a (?U) option setting |
not compatible with Perl. It can also be set by a (?U) option setting |
1353 |
within the pattern. |
within the pattern. |
1354 |
|
|
1355 |
PCRE_UTF8 |
PCRE_UTF8 |
1356 |
|
|
1357 |
This option causes PCRE to regard both the pattern and the subject as |
This option causes PCRE to regard both the pattern and the subject as |
1358 |
strings of UTF-8 characters instead of single-byte character strings. |
strings of UTF-8 characters instead of single-byte character strings. |
1359 |
However, it is available only when PCRE is built to include UTF-8 sup- |
However, it is available only when PCRE is built to include UTF-8 sup- |
1360 |
port. If not, the use of this option provokes an error. Details of how |
port. If not, the use of this option provokes an error. Details of how |
1361 |
this option changes the behaviour of PCRE are given in the section on |
this option changes the behaviour of PCRE are given in the section on |
1362 |
UTF-8 support in the main pcre page. |
UTF-8 support in the main pcre page. |
1363 |
|
|
1364 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1365 |
|
|
1366 |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
1367 |
automatically checked. There is a discussion about the validity of |
automatically checked. There is a discussion about the validity of |
1368 |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
1369 |
bytes is found, pcre_compile() returns an error. If you already know |
bytes is found, pcre_compile() returns an error. If you already know |
1370 |
that your pattern is valid, and you want to skip this check for perfor- |
that your pattern is valid, and you want to skip this check for perfor- |
1371 |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
1372 |
set, the effect of passing an invalid UTF-8 string as a pattern is |
set, the effect of passing an invalid UTF-8 string as a pattern is |
1373 |
undefined. It may cause your program to crash. Note that this option |
undefined. It may cause your program to crash. Note that this option |
1374 |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
1375 |
UTF-8 validity checking of subject strings. |
UTF-8 validity checking of subject strings. |
1376 |
|
|
1377 |
|
|
1378 |
COMPILATION ERROR CODES |
COMPILATION ERROR CODES |
1379 |
|
|
1380 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
1381 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
1382 |
both compiling functions. As PCRE has developed, some error codes have |
both compiling functions. As PCRE has developed, some error codes have |
1383 |
fallen out of use. To avoid confusion, they have not been re-used. |
fallen out of use. To avoid confusion, they have not been re-used. |
1384 |
|
|
1385 |
0 no error |
0 no error |
1435 |
50 [this code is not in use] |
50 [this code is not in use] |
1436 |
51 octal value is greater than \377 (not in UTF-8 mode) |
51 octal value is greater than \377 (not in UTF-8 mode) |
1437 |
52 internal error: overran compiling workspace |
52 internal error: overran compiling workspace |
1438 |
53 internal error: previously-checked referenced subpattern not |
53 internal error: previously-checked referenced subpattern not |
1439 |
found |
found |
1440 |
54 DEFINE group contains more than one branch |
54 DEFINE group contains more than one branch |
1441 |
55 repeating a DEFINE group is not allowed |
55 repeating a DEFINE group is not allowed |
1442 |
56 inconsistent NEWLINE options |
56 inconsistent NEWLINE options |
1443 |
57 \g is not followed by a braced name or an optionally braced |
57 \g is not followed by a braced, angle-bracketed, or quoted |
1444 |
non-zero number |
name/number or by a plain number |
1445 |
58 (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number |
58 a numbered reference must not be zero |
1446 |
59 (*VERB) with an argument is not supported |
59 (*VERB) with an argument is not supported |
1447 |
60 (*VERB) not recognized |
60 (*VERB) not recognized |
1448 |
61 number is too big |
61 number is too big |
1449 |
62 subpattern name expected |
62 subpattern name expected |
1450 |
63 digit expected after (?+ |
63 digit expected after (?+ |
1451 |
|
64 ] is an invalid data character in JavaScript compatibility mode |
1452 |
|
|
1453 |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
1454 |
values may be used if the limits were changed when PCRE was built. |
values may be used if the limits were changed when PCRE was built. |
1455 |
|
|
1456 |
|
|
1459 |
pcre_extra *pcre_study(const pcre *code, int options, |
pcre_extra *pcre_study(const pcre *code, int options, |
1460 |
const char **errptr); |
const char **errptr); |
1461 |
|
|
1462 |
If a compiled pattern is going to be used several times, it is worth |
If a compiled pattern is going to be used several times, it is worth |
1463 |
spending more time analyzing it in order to speed up the time taken for |
spending more time analyzing it in order to speed up the time taken for |
1464 |
matching. The function pcre_study() takes a pointer to a compiled pat- |
matching. The function pcre_study() takes a pointer to a compiled pat- |
1465 |
tern as its first argument. If studying the pattern produces additional |
tern as its first argument. If studying the pattern produces additional |
1466 |
information that will help speed up matching, pcre_study() returns a |
information that will help speed up matching, pcre_study() returns a |
1467 |
pointer to a pcre_extra block, in which the study_data field points to |
pointer to a pcre_extra block, in which the study_data field points to |
1468 |
the results of the study. |
the results of the study. |
1469 |
|
|
1470 |
The returned value from pcre_study() can be passed directly to |
The returned value from pcre_study() can be passed directly to |
1471 |
pcre_exec(). However, a pcre_extra block also contains other fields |
pcre_exec(). However, a pcre_extra block also contains other fields |
1472 |
that can be set by the caller before the block is passed; these are |
that can be set by the caller before the block is passed; these are |
1473 |
described below in the section on matching a pattern. |
described below in the section on matching a pattern. |
1474 |
|
|
1475 |
If studying the pattern does not produce any additional information |
If studying the pattern does not produce any additional information |
1476 |
pcre_study() returns NULL. In that circumstance, if the calling program |
pcre_study() returns NULL. In that circumstance, if the calling program |
1477 |
wants to pass any of the other fields to pcre_exec(), it must set up |
wants to pass any of the other fields to pcre_exec(), it must set up |
1478 |
its own pcre_extra block. |
its own pcre_extra block. |
1479 |
|
|
1480 |
The second argument of pcre_study() contains option bits. At present, |
The second argument of pcre_study() contains option bits. At present, |
1481 |
no options are defined, and this argument should always be zero. |
no options are defined, and this argument should always be zero. |
1482 |
|
|
1483 |
The third argument for pcre_study() is a pointer for an error message. |
The third argument for pcre_study() is a pointer for an error message. |
1484 |
If studying succeeds (even if no data is returned), the variable it |
If studying succeeds (even if no data is returned), the variable it |
1485 |
points to is set to NULL. Otherwise it is set to point to a textual |
points to is set to NULL. Otherwise it is set to point to a textual |
1486 |
error message. This is a static string that is part of the library. You |
error message. This is a static string that is part of the library. You |
1487 |
must not try to free it. You should test the error pointer for NULL |
must not try to free it. You should test the error pointer for NULL |
1488 |
after calling pcre_study(), to be sure that it has run successfully. |
after calling pcre_study(), to be sure that it has run successfully. |
1489 |
|
|
1490 |
This is a typical call to pcre_study(): |
This is a typical call to pcre_study(): |
1496 |
&error); /* set to NULL or points to a message */ |
&error); /* set to NULL or points to a message */ |
1497 |
|
|
1498 |
At present, studying a pattern is useful only for non-anchored patterns |
At present, studying a pattern is useful only for non-anchored patterns |
1499 |
that do not have a single fixed starting character. A bitmap of possi- |
that do not have a single fixed starting character. A bitmap of possi- |
1500 |
ble starting bytes is created. |
ble starting bytes is created. |
1501 |
|
|
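The idea of a bitmap of possible starting bytes can be illustrated with a small self-contained sketch. This is not PCRE's internal data structure or code; the start_map type, its helper functions, and the example pattern are all hypothetical. It only shows why such a map helps: the matcher can skip quickly over subject bytes that cannot begin a match.

```c
#include <assert.h>
#include <string.h>

/* A 256-bit map, one bit per possible byte value (illustrative only). */
typedef struct { unsigned char bits[32]; } start_map;

static void map_clear(start_map *m) { memset(m->bits, 0, sizeof m->bits); }

static void map_set(start_map *m, unsigned char c)
{
  m->bits[c / 8] |= (unsigned char)(1u << (c % 8));
}

static int map_test(const start_map *m, unsigned char c)
{
  return (m->bits[c / 8] >> (c % 8)) & 1;
}

/* Return the first offset in subject at which a match could start, or
   length if no byte in the subject is a possible starting byte. */
static size_t first_candidate(const start_map *m, const char *subject,
                              size_t length)
{
  size_t i;
  for (i = 0; i < length; i++)
    if (map_test(m, (unsigned char)subject[i])) break;
  return i;
}

static void bitmap_demo(void)
{
  /* Hypothetical pattern (cat|dog|\d+): it is not anchored and has no
     single fixed first character, but it can start only with 'c', 'd',
     or a digit. */
  start_map m;
  const char *subject = "my 2 dogs";
  int ch;
  map_clear(&m);
  map_set(&m, 'c');
  map_set(&m, 'd');
  for (ch = '0'; ch <= '9'; ch++) map_set(&m, (unsigned char)ch);
  /* Offsets 0..2 ("my ") are skipped; the digit at offset 3 qualifies. */
  assert(first_candidate(&m, subject, strlen(subject)) == 3);
}
```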
1502 |
|
|
1503 |
LOCALE SUPPORT |
LOCALE SUPPORT |
1504 |
|
|
1505 |
PCRE handles caseless matching, and determines whether characters are |
PCRE handles caseless matching, and determines whether characters are |
1506 |
letters, digits, or whatever, by reference to a set of tables, indexed |
letters, digits, or whatever, by reference to a set of tables, indexed |
1507 |
by character value. When running in UTF-8 mode, this applies only to |
by character value. When running in UTF-8 mode, this applies only to |
1508 |
characters with codes less than 128. Higher-valued codes never match |
characters with codes less than 128. Higher-valued codes never match |
1509 |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
1510 |
with Unicode character property support. The use of locales with Uni- |
with Unicode character property support. The use of locales with Uni- |
1511 |
code is discouraged. If you are handling characters with codes greater |
code is discouraged. If you are handling characters with codes greater |
1512 |
than 128, you should either use UTF-8 and Unicode, or use locales, but |
than 128, you should either use UTF-8 and Unicode, or use locales, but |
1513 |
not try to mix the two. |
not try to mix the two. |
1514 |
|
|
1515 |
PCRE contains an internal set of tables that are used when the final |
PCRE contains an internal set of tables that are used when the final |
1516 |
argument of pcre_compile() is NULL. These are sufficient for many |
argument of pcre_compile() is NULL. These are sufficient for many |
1517 |
applications. Normally, the internal tables recognize only ASCII char- |
applications. Normally, the internal tables recognize only ASCII char- |
1518 |
acters. However, when PCRE is built, it is possible to cause the inter- |
acters. However, when PCRE is built, it is possible to cause the inter- |
1519 |
nal tables to be rebuilt in the default "C" locale of the local system, |
nal tables to be rebuilt in the default "C" locale of the local system, |
1520 |
which may cause them to be different. |
which may cause them to be different. |
1521 |
|
|
1522 |
The internal tables can always be overridden by tables supplied by the |
The internal tables can always be overridden by tables supplied by the |
1523 |
application that calls PCRE. These may be created in a different locale |
application that calls PCRE. These may be created in a different locale |
1524 |
from the default. As more and more applications change to using Uni- |
from the default. As more and more applications change to using Uni- |
1525 |
code, the need for this locale support is expected to die away. |
code, the need for this locale support is expected to die away. |
1526 |
|
|
1527 |
External tables are built by calling the pcre_maketables() function, |
External tables are built by calling the pcre_maketables() function, |
1528 |
which has no arguments, in the relevant locale. The result can then be |
which has no arguments, in the relevant locale. The result can then be |
1529 |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
1530 |
example, to build and use tables that are appropriate for the French |
example, to build and use tables that are appropriate for the French |
1531 |
locale (where accented characters with values greater than 128 are |
locale (where accented characters with values greater than 128 are |
1532 |
treated as letters), the following code could be used: |
treated as letters), the following code could be used: |
1533 |
|
|
1534 |
setlocale(LC_CTYPE, "fr_FR"); |
setlocale(LC_CTYPE, "fr_FR"); |
1535 |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
1536 |
re = pcre_compile(..., tables); |
re = pcre_compile(..., tables); |
1537 |
|
|
1538 |
The locale name "fr_FR" is used on Linux and other Unix-like systems; |
The locale name "fr_FR" is used on Linux and other Unix-like systems; |
1539 |
if you are using Windows, the name for the French locale is "french". |
if you are using Windows, the name for the French locale is "french". |
1540 |
|
|
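To make the notion of "tables indexed by character value" concrete, here is a minimal sketch of how a locale-dependent table could be built with the standard C classification functions. It is not the layout that pcre_maketables() produces; the CT_ flags and the function names are hypothetical. The "C" locale is used so that the sketch behaves the same everywhere; substituting a locale such as "fr_FR" would be expected to mark additional byte values as letters, if that locale is installed.

```c
#include <assert.h>
#include <ctype.h>
#include <locale.h>

#define CT_LETTER 0x01   /* hypothetical flag: byte is a letter */
#define CT_DIGIT  0x02   /* hypothetical flag: byte is a digit */

/* Fill a 256-entry table of character-type flags using whatever
   LC_CTYPE locale is currently in force. */
static void build_ctype_table(unsigned char table[256])
{
  int c;
  for (c = 0; c < 256; c++)
    {
    table[c] = 0;
    if (isalpha(c)) table[c] |= CT_LETTER;
    if (isdigit(c)) table[c] |= CT_DIGIT;
    }
}

static void table_demo(void)
{
  unsigned char table[256];
  setlocale(LC_CTYPE, "C");
  build_ctype_table(table);
  assert(table['a'] & CT_LETTER);
  assert(table['7'] & CT_DIGIT);
  /* 0xE9 is e-acute in Latin-1; the "C" locale does not treat it as a
     letter, which is why locale-specific tables are needed. */
  assert(table[0xE9] == 0);
}
```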
1541 |
When pcre_maketables() runs, the tables are built in memory that is |
When pcre_maketables() runs, the tables are built in memory that is |
1542 |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
1543 |
that the memory containing the tables remains available for as long as |
that the memory containing the tables remains available for as long as |
1544 |
it is needed. |
it is needed. |
1545 |
|
|
1546 |
The pointer that is passed to pcre_compile() is saved with the compiled |
The pointer that is passed to pcre_compile() is saved with the compiled |
1547 |
pattern, and the same tables are used via this pointer by pcre_study() |
pattern, and the same tables are used via this pointer by pcre_study() |
1548 |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
1549 |
tern, compilation, studying and matching all happen in the same locale, |
tern, compilation, studying and matching all happen in the same locale, |
1550 |
but different patterns can be compiled in different locales. |
but different patterns can be compiled in different locales. |
1551 |
|
|
1552 |
It is possible to pass a table pointer or NULL (indicating the use of |
It is possible to pass a table pointer or NULL (indicating the use of |
1553 |
the internal tables) to pcre_exec(). Although not intended for this |
the internal tables) to pcre_exec(). Although not intended for this |
1554 |
purpose, this facility could be used to match a pattern in a different |
purpose, this facility could be used to match a pattern in a different |
1555 |
locale from the one in which it was compiled. Passing table pointers at |
locale from the one in which it was compiled. Passing table pointers at |
1556 |
run time is discussed below in the section on matching a pattern. |
run time is discussed below in the section on matching a pattern. |
1557 |
|
|
1561 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
1562 |
int what, void *where); |
int what, void *where); |
1563 |
|
|
1564 |
The pcre_fullinfo() function returns information about a compiled |
The pcre_fullinfo() function returns information about a compiled |
1565 |
pattern. It replaces the obsolete pcre_info() function, which is |
pattern. It replaces the obsolete pcre_info() function, which is |
1566 |
nevertheless retained for backwards compatibility (and is documented below). |
nevertheless retained for backwards compatibility (and is documented below). |
1567 |
|
|
The first argument for pcre_fullinfo() is a pointer to the compiled
pattern. The second argument is the result of pcre_study(), or NULL if
the pattern was not studied. The third argument specifies which piece
of information is required, and the fourth argument is a pointer to a
variable to receive the data. The yield of the function is zero for
success, or one of the following negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found
  PCRE_ERROR_BADOPTION  the value of what was invalid

The "magic number" is placed at the start of each compiled pattern as
a simple check against passing an arbitrary memory pointer. Here is a
typical call of pcre_fullinfo(), to obtain the length of the compiled
pattern:

  int rc;
  size_t length;
  rc = pcre_fullinfo(
    re,               /* result of pcre_compile() */
    pe,               /* result of pcre_study(), or NULL */
    PCRE_INFO_SIZE,   /* what is required */
    &length);         /* where to put the data */

The possible values for the third argument are defined in pcre.h, and
are as follows:

  PCRE_INFO_BACKREFMAX

Return the number of the highest back reference in the pattern. The
fourth argument should point to an int variable. Zero is returned if
there are no back references.

  PCRE_INFO_CAPTURECOUNT

Return the number of capturing subpatterns in the pattern. The fourth
argument should point to an int variable.

  PCRE_INFO_DEFAULT_TABLES

Return a pointer to the internal default character tables within PCRE.
The fourth argument should point to an unsigned char * variable. This
information call is provided for internal use by the pcre_study() func-
tion. External callers can cause PCRE to use its internal tables by
passing a NULL table pointer.

  PCRE_INFO_FIRSTBYTE

Return information about the first byte of any matched string, for a
non-anchored pattern. The fourth argument should point to an int vari-
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
is still recognized for backwards compatibility.)

If there is a fixed first byte, for example, from a pattern such as
(cat|cow|coyote), its value is returned. Otherwise, if either

(a) the pattern was compiled with the PCRE_MULTILINE option, and every
branch starts with "^", or

(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
set (if it were set, the pattern would be anchored),

-1 is returned, indicating that the pattern matches only at the start
of a subject string or after any newline within the string. Otherwise
-2 is returned. For anchored patterns, -2 is returned.

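The three kinds of value that PCRE_INFO_FIRSTBYTE can yield may be told
apart with a small helper. The following is only a sketch: the enum and
the function classify_first_byte() are invented for illustration and
are not part of PCRE. Only the meaning of the values (a byte value, -1,
or -2) is taken from the description above.

```c
/* Hypothetical helper, not part of the PCRE API: classifies the int
   that PCRE_INFO_FIRSTBYTE stores through the fourth argument. */
enum first_byte_kind {
  FIRST_BYTE_FIXED,        /* value >= 0: a fixed first byte          */
  FIRST_BYTE_LINE_START,   /* -1: start of subject or after a newline */
  FIRST_BYTE_NONE          /* -2: neither case applies, or anchored   */
};

enum first_byte_kind classify_first_byte(int value)
{
  if (value >= 0) return FIRST_BYTE_FIXED;
  if (value == -1) return FIRST_BYTE_LINE_START;
  return FIRST_BYTE_NONE;
}
```

A caller would pass the int filled in by pcre_fullinfo() and, for
FIRST_BYTE_FIXED, could use the value itself as the byte to look for.
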
  PCRE_INFO_FIRSTTABLE

If the pattern was studied, and this resulted in the construction of a
256-bit table indicating a fixed set of bytes for the first byte in any
matching string, a pointer to the table is returned. Otherwise NULL is
returned. The fourth argument should point to an unsigned char * vari-
able.

  PCRE_INFO_HASCRORLF

Return 1 if the pattern contains any explicit matches for CR or LF
characters, otherwise 0. The fourth argument should point to an int
variable. An explicit match is either a literal CR or LF character, or
\r or \n.

  PCRE_INFO_JCHANGED

Return 1 if the (?J) or (?-J) option setting is used in the pattern,
otherwise 0. The fourth argument should point to an int variable. (?J)
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.

  PCRE_INFO_LASTLITERAL

Return the value of the rightmost literal byte that must exist in any
matched string, other than at its start, if such a byte has been
recorded. The fourth argument should point to an int variable. If there
is no such byte, -1 is returned. For anchored patterns, a last literal
byte is recorded only if it follows something of variable length. For
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
/^a\dz\d/ the returned value is -1.

  PCRE_INFO_NAMECOUNT
  PCRE_INFO_NAMEENTRYSIZE
  PCRE_INFO_NAMETABLE

PCRE supports the use of named as well as numbered capturing parenthe-
ses. The names are just an additional way of identifying the parenthe-
ses, which still acquire numbers. Several convenience functions such as
pcre_get_named_substring() are provided for extracting captured sub-
strings by name. It is also possible to extract the data directly, by
first converting the name to a number in order to access the correct
pointers in the output vector (described with pcre_exec() below). To do
the conversion, you need to use the name-to-number map, which is
described by these three values.

The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
of each entry; both of these return an int value. The entry size
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns
a pointer to the first entry of the table (a pointer to char). The
first two bytes of each entry are the number of the capturing parenthe-
sis, most significant byte first. The rest of the entry is the corre-
sponding name, zero terminated. The names are in alphabetical order.
When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
theses numbers. For example, consider the following pattern (assume
PCRE_EXTENDED is set, so white space - including newlines - is
ignored):

  (?<date> (?<year>(\d\d)?\d\d) -
  (?<month>\d\d) - (?<day>\d\d) )

There are four named subpatterns, so the table has four entries, and
each entry in the table is eight bytes long. The table is as follows,
with non-printing bytes shown in hexadecimal, and undefined bytes shown
as ??:

  00 01 d  a  t  e  00 ??
  00 05 d  a  y  00 ?? ??
  00 04 m  o  n  t  h  00
  00 02 y  e  a  r  00 ??

When writing code to extract data from named subpatterns using the
name-to-number map, remember that the length of the entries is likely
to be different for each compiled pattern.

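A direct lookup in the name-to-number map might be sketched as follows.
The function find_group_number() is hypothetical and not part of PCRE;
it assumes only the entry layout described above (two bytes of group
number, most significant byte first, then the zero-terminated name) and
takes the table as unsigned char so that the number bytes are not
sign-extended. The entry size is passed in rather than hard-wired,
because it differs from pattern to pattern.

```c
#include <string.h>

/* Hypothetical helper, not part of the PCRE API. Scans a name table of
   name_count entries, each entry_size bytes long, and returns the
   number of the group called name, or -1 if there is no such name. */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
  int i;
  for (i = 0; i < name_count; i++)
    {
    const unsigned char *entry = table + i * entry_size;
    /* The name starts after the two number bytes. */
    if (strcmp((const char *)(entry + 2), name) == 0)
      return (entry[0] << 8) | entry[1];
    }
  return -1;
}
```

The three values would come from pcre_fullinfo() calls with
PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE.
A linear scan is used for simplicity; because the names are in
alphabetical order, a binary search would also be possible. With
duplicate names, this sketch returns the first matching entry only.
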
  PCRE_INFO_OKPARTIAL

Return 1 if the pattern can be used for partial matching, otherwise 0.
The fourth argument should point to an int variable. The pcrepartial
documentation lists the restrictions that apply to patterns when par-
tial matching is used.

  PCRE_INFO_OPTIONS

Return a copy of the options with which the pattern was compiled. The
fourth argument should point to an unsigned long int variable. These
option bits are those specified in the call to pcre_compile(), modified
by any top-level option settings at the start of the pattern itself. In
other words, they are the options that will be in force when matching
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
and PCRE_EXTENDED.

A pattern is automatically anchored by PCRE if all of its top-level
alternatives begin with one of the following:

  ^     unless PCRE_MULTILINE is set

  PCRE_INFO_SIZE

Return the size of the compiled pattern, that is, the value that was
passed as the argument to pcre_malloc() when PCRE was getting memory in
which to place the compiled data. The fourth argument should point to a
size_t variable.

  PCRE_INFO_STUDYSIZE

Return the size of the data block pointed to by the study_data field in
a pcre_extra block. That is, it is the value that was passed to
pcre_malloc() when PCRE was getting memory into which to place the data
created by pcre_study(). The fourth argument should point to a size_t
variable.


int pcre_info(const pcre *code, int *optptr, int *firstcharptr);

The pcre_info() function is now obsolete because its interface is too
restrictive to return all the available data about a compiled pattern.
New programs should use pcre_fullinfo() instead. The yield of
pcre_info() is the number of capturing subpatterns, or one of the fol-
lowing negative numbers:

  PCRE_ERROR_NULL       the argument code was NULL
  PCRE_ERROR_BADMAGIC   the "magic number" was not found

If the optptr argument is not NULL, a copy of the options with which
the pattern was compiled is placed in the integer it points to (see
PCRE_INFO_OPTIONS above).

If the pattern is not anchored and the firstcharptr argument is not
NULL, it is used to pass back information about the first character of
any matched string (see PCRE_INFO_FIRSTBYTE above).


int pcre_refcount(pcre *code, int adjust);

The pcre_refcount() function is used to maintain a reference count in
the data block that contains a compiled pattern. It is provided for the
benefit of applications that operate in an object-oriented manner,
where different parts of the application may be using the same compiled
pattern, but you want to free the block when they are all done.

When a pattern is compiled, the reference count field is initialized to
zero. It is changed only by calling this function, whose action is to
add the adjust value (which may be positive or negative) to it. The
yield of the function is the new value. However, the value of the count
is constrained to lie between 0 and 65535, inclusive. If the new value
is outside these limits, it is forced to the appropriate limit value.

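The arithmetic just described can be modelled in a few lines. This is a
sketch of the documented behaviour only, not PCRE's source code; the
function name adjusted_refcount() is invented, and the current count is
assumed to be already within 0 to 65535.

```c
/* Model of the documented behaviour, not PCRE's implementation: add
   adjust to the current count and force the result into 0..65535.
   The comparisons are arranged so that the sum is never computed when
   it would fall outside the permitted range. */
int adjusted_refcount(int current, int adjust)
{
  if (adjust > 65535 - current) return 65535;  /* would exceed the top */
  if (adjust < -current) return 0;             /* would drop below 0   */
  return current + adjust;
}
```

One consequence of the clamping is that a count which has been forced
to a limit no longer records how many adjustments were made, so an
application that relies on the count should keep it well inside the
range.
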
Except when it is zero, the reference count is not correctly preserved
if a pattern is compiled on one host and then transferred to a host
whose byte-order is different. (This seems a highly unlikely scenario.)


int pcre_exec(const pcre *code, const pcre_extra *extra,
     const char *subject, int length, int startoffset,
     int options, int *ovector, int ovecsize);

The function pcre_exec() is called to match a subject string against a
compiled pattern, which is passed in the code argument. If the pattern
has been studied, the result of the study should be passed in the extra
argument. This function is the main matching facility of the library,
and it operates in a Perl-like manner. For specialist use there is also
an alternative matching function, which is described below in the sec-
tion about the pcre_dfa_exec() function.

In most applications, the pattern will have been compiled (and option-
ally studied) in the same process that calls pcre_exec(). However, it
is possible to save compiled patterns and study data, and then use them
later in different processes, possibly even on different hosts. For a
discussion about this, see the pcreprecompile documentation.

Here is an example of a simple call to pcre_exec():

Extra data for pcre_exec()

If the extra argument is not NULL, it must point to a pcre_extra data
block. The pcre_study() function returns such a block (when it doesn't
return NULL), but you can also create one for yourself, and pass addi-
tional information in it. The pcre_extra block contains the following
fields (not necessarily in this order):

  unsigned long int flags;
  void *study_data;
  unsigned long int match_limit;
  unsigned long int match_limit_recursion;
  void *callout_data;
  const unsigned char *tables;

The flags field is a bitmap that specifies which of the other fields
are set. The flag bits are:

  PCRE_EXTRA_STUDY_DATA
  PCRE_EXTRA_MATCH_LIMIT
  PCRE_EXTRA_MATCH_LIMIT_RECURSION
  PCRE_EXTRA_CALLOUT_DATA
  PCRE_EXTRA_TABLES

Other flag bits should be set to zero. The study_data field is set in
the pcre_extra block that is returned by pcre_study(), together with
the appropriate flag bit. You should not set this yourself, but you may
add to the block by setting the other fields and their corresponding
flag bits.

The match_limit field provides a means of preventing PCRE from using up
a vast amount of resources when running patterns that are not going to
match, but which have a very large number of possibilities in their
search trees. The classic example is the use of nested unlimited
repeats.

Internally, PCRE uses a function called match() which it calls repeat-
edly (sometimes recursively). The limit set by match_limit is imposed
on the number of times this function is called during a match, which
has the effect of limiting the amount of backtracking that can take
place. For patterns that are not anchored, the count restarts from zero
for each position in the subject string.

The default value for the limit can be set when PCRE is built; the
built-in default is 10 million, which handles all but the most extreme
cases. You can override the default by supplying pcre_exec() with a
pcre_extra block in which match_limit is set, and
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.

The match_limit_recursion field is similar to match_limit, but instead
of limiting the total number of times that match() is called, it limits
the depth of recursion. The recursion depth is a smaller number than
the total number of calls, because not all calls to match() are recur-
sive. This limit is of use only if it is set smaller than match_limit.

Limiting the recursion depth limits the amount of stack that can be
used, or, when PCRE has been compiled to use memory on the heap instead
of the stack, the amount of heap memory that can be used.

The default value for match_limit_recursion can be set when PCRE is
built; the built-in default is the same value as the default for
match_limit. You can override the default by supplying pcre_exec() with
a pcre_extra block in which match_limit_recursion is set, and
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.

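The difference between a limit on the total number of calls and a limit
on recursion depth can be seen in a toy backtracking matcher. The code
below is an illustration only and is not PCRE's match() function: the
pattern language (literal bytes plus "*" for any run of bytes), the
toy_limits structure, and the error values are all invented here. In
this toy every call is recursive, which is not true of PCRE, so the
depth can never exceed the pattern length.

```c
/* A toy backtracking matcher, NOT PCRE's internal match() function.
   All names and values here are invented for illustration. */
enum {
  TOY_NOMATCH = 0,
  TOY_MATCH = 1,
  TOY_ERROR_CALL_LIMIT = -1,   /* role of PCRE_ERROR_MATCHLIMIT     */
  TOY_ERROR_DEPTH_LIMIT = -2   /* role of PCRE_ERROR_RECURSIONLIMIT */
};

typedef struct {
  unsigned long calls;         /* calls made so far; start at 0     */
  unsigned long call_limit;    /* analogue of match_limit           */
  unsigned long depth_limit;   /* analogue of match_limit_recursion */
} toy_limits;

int toy_match(const char *pattern, const char *subject,
              toy_limits *limits, unsigned long depth)
{
  if (++limits->calls > limits->call_limit) return TOY_ERROR_CALL_LIMIT;
  if (depth > limits->depth_limit) return TOY_ERROR_DEPTH_LIMIT;

  if (*pattern == '\0')
    return (*subject == '\0') ? TOY_MATCH : TOY_NOMATCH;

  if (*pattern == '*')
    {
    /* Try every length for the star, shortest first. Each failed
       attempt is one step of backtracking. */
    for (;;)
      {
      int rc = toy_match(pattern + 1, subject, limits, depth + 1);
      if (rc != TOY_NOMATCH) return rc;   /* a match, or a limit hit */
      if (*subject == '\0') return TOY_NOMATCH;
      subject++;
      }
    }

  if (*subject != '\0' && *pattern == *subject)
    return toy_match(pattern + 1, subject + 1, limits, depth + 1);
  return TOY_NOMATCH;
}
```

With several stars and a subject that cannot match, the number of calls
grows very quickly while the depth stays small, which is why the two
limits are separate: the call limit bounds the time spent, and the
depth limit bounds the memory used by the recursion.
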
The callout_data field is used in conjunction with the "callout" fea-
ture, which is described in the pcrecallout documentation.

The tables field is used to pass a character tables pointer to
pcre_exec(); this overrides the value that is stored with the compiled
pattern. A non-NULL value is stored with the compiled pattern only if
custom tables were supplied to pcre_compile() via its tableptr argu-
ment. If NULL is passed to pcre_exec() using this mechanism, it forces
PCRE's internal tables to be used. This facility is helpful when re-
using patterns that have been saved after compiling with an external
set of tables, because the external tables might be at a different
address when pcre_exec() is called. See the pcreprecompile documenta-
tion for a discussion of saving compiled patterns for later use.

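The rule for choosing the character tables can be written out as a
small decision function. This is a sketch of the rule as stated above,
using stand-in definitions: toy_extra, TOY_EXTRA_TABLES and
effective_tables() are invented for illustration and are not the PCRE
declarations, and the bit value chosen for the flag is arbitrary.

```c
#include <stddef.h>

/* Stand-ins for illustration only; these are not the PCRE definitions. */
#define TOY_EXTRA_TABLES 0x0008ul   /* arbitrary bit value */

typedef struct {
  unsigned long flags;
  const unsigned char *tables;
} toy_extra;

/* Returns the tables a match would use under the rule above.
   stored_tables is the pointer saved with the compiled pattern (NULL
   unless custom tables were given at compile time), and
   internal_tables stands for PCRE's built-in set. */
const unsigned char *effective_tables(const unsigned char *stored_tables,
                                      const toy_extra *extra,
                                      const unsigned char *internal_tables)
{
  const unsigned char *chosen = stored_tables;
  if (extra != NULL && (extra->flags & TOY_EXTRA_TABLES) != 0)
    chosen = extra->tables;          /* overrides, even when NULL */
  return (chosen != NULL) ? chosen : internal_tables;
}
```

The point to notice is that the tables field has an effect only when
its flag bit is set; a NULL pointer with the flag set is a request for
the internal tables, not an absence of information.
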
Option bits for pcre_exec()

The unused bits of the options argument for pcre_exec() must be zero.
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and
PCRE_PARTIAL.

  PCRE_ANCHORED

The PCRE_ANCHORED option limits pcre_exec() to matching at the first
matching position. If a pattern was compiled with PCRE_ANCHORED, or
turned out to be anchored by virtue of its contents, it cannot be made
unanchored at matching time.

  PCRE_BSR_ANYCRLF
  PCRE_BSR_UNICODE

These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. These options override the
choice that was made or defaulted when the pattern was compiled.

  PCRE_NEWLINE_CR
  PCRE_NEWLINE_LF
  PCRE_NEWLINE_CRLF
  PCRE_NEWLINE_ANYCRLF
  PCRE_NEWLINE_ANY

1943 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
1944 |
defaulted when the pattern was compiled. For details, see the descrip- |
defaulted when the pattern was compiled. For details, see the descrip- |
1945 |
tion of pcre_compile() above. During matching, the newline choice |
tion of pcre_compile() above. During matching, the newline choice |
1946 |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
1947 |
ters. It may also alter the way the match position is advanced after a |
ters. It may also alter the way the match position is advanced after a |
1948 |
match failure for an unanchored pattern. |
match failure for an unanchored pattern. |
1949 |
|
|
When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is
set, and a match attempt for an unanchored pattern fails when the
current position is at a CRLF sequence, and the pattern contains no
explicit matches for CR or LF characters, the match position is
advanced by two characters instead of one, in other words, to after
the CRLF.

The above rule is a compromise that makes the most common cases work as
expected. For example, if the pattern is .+A (and the PCRE_DOTALL
option is not set), it does not match the string "\r\nA" because, after
failing at the start, it skips both the CR and the LF before retrying.
However, the pattern [\r\n]A does match that string, because it
contains an explicit CR or LF reference, and so advances only by one
character after the first failure.

An explicit match for CR or LF is either a literal appearance of one of
those characters, or one of the \r or \n escape sequences. Implicit
matches such as [^X] do not count, nor does \s (which includes CR and
LF in the characters that it matches).

Notwithstanding the above, anomalous effects may still occur when CRLF
is a valid newline sequence and explicit \r or \n escapes appear in the
pattern.

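
The advance rule for CRLF newlines can be pictured with a small sketch.
This is not PCRE's internal code: next_start() and its two flag
parameters are hypothetical names, and the caller is assumed to know
already whether CRLF is a valid newline and whether the pattern contains
an explicit CR or LF.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the PCRE API): returns the position
 * at which to retry after a failed match attempt at `pos`.
 *
 * crlf_is_newline:   non-zero when the newline setting is CRLF,
 *                    ANYCRLF, or ANY
 * has_explicit_crlf: non-zero when the pattern contains a literal CR
 *                    or LF, or a \r or \n escape
 */
static size_t next_start(const char *subject, size_t length, size_t pos,
                         int crlf_is_newline, int has_explicit_crlf)
{
    if (crlf_is_newline && !has_explicit_crlf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n') {
        return pos + 2;   /* skip the whole CRLF sequence */
    }
    return pos + 1;       /* ordinary single-character advance */
}
```

With the subject "\r\nA", a pattern such as .+A (no explicit CR or LF)
would be retried at offset 2, whereas [\r\n]A would be retried at
offset 1.
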
PCRE_NOTBOL

This option specifies that the first character of the subject string
is not the beginning of a line, so the circumflex metacharacter should
not match before it. Setting this without PCRE_MULTILINE (at compile
time) causes circumflex never to match. This option affects only the
behaviour of the circumflex metacharacter. It does not affect \A.

PCRE_NOTEOL

This option specifies that the end of the subject string is not the end
of a line, so the dollar metacharacter should not match it nor (except
in multiline mode) a newline immediately before it. Setting this
without PCRE_MULTILINE (at compile time) causes dollar never to match.
This option affects only the behaviour of the dollar metacharacter. It
does not affect \Z or \z.

PCRE_NOTEMPTY

An empty string is not considered to be a valid match if this option is
set. If there are alternatives in the pattern, they are tried. If all
the alternatives match the empty string, the entire match fails. For
example, if the pattern

  a?b?

is applied to a string not beginning with "a" or "b", it matches the
empty string at the start of the subject. With PCRE_NOTEMPTY set, this
match is not valid, so PCRE searches further into the string for
occurrences of "a" or "b".

Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a
special case of a pattern match of the empty string within its split()
function, and when using the /g modifier. It is possible to emulate
Perl's behaviour after matching a null string by first trying the match
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and
then, if that fails, by advancing the starting offset (see below) and
trying an ordinary match again. There is some code that demonstrates
how to do this in the pcredemo.c sample program.

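
The retry rule for empty matches can be sketched as a loop. To keep the
sketch self-contained it does not call PCRE at all: toy_exec() is a
hypothetical stand-in for pcre_exec() that understands only the pattern
a?b?, and OPT_NOTEMPTY and OPT_ANCHORED are hypothetical stand-ins for
PCRE_NOTEMPTY and PCRE_ANCHORED. The loop advances by one byte after a
failed retry; in UTF-8 mode a real program would need to advance by one
whole character.

```c
/* Hypothetical option bits standing in for PCRE_NOTEMPTY and
   PCRE_ANCHORED. */
#define OPT_NOTEMPTY 0x1
#define OPT_ANCHORED 0x2

/* Toy stand-in for pcre_exec() that handles only the pattern a?b?.
   On success it stores the match offsets in ov[0] and ov[1] and
   returns 1; otherwise it returns -1 (no match). */
static int toy_exec(const char *s, int len, int start, int opts, int *ov)
{
    for (int pos = start; pos <= len; pos++) {
        int end = pos;
        if (end < len && s[end] == 'a') end++;
        if (end < len && s[end] == 'b') end++;
        if (end > pos || !(opts & OPT_NOTEMPTY)) {
            ov[0] = pos;
            ov[1] = end;
            return 1;
        }
        if (opts & OPT_ANCHORED) break;   /* only one position allowed */
    }
    return -1;
}

/* Counts all matches in the subject, following the retry rule from the
   text above. */
static int count_matches(const char *s, int len)
{
    int ov[2];
    int count = 0;
    int start = 0;
    int opts = 0;

    while (start <= len) {
        int rc = toy_exec(s, len, start, opts, ov);
        if (rc < 0) {
            if (opts == 0) break;   /* ordinary match failed: finished */
            start++;                /* retry failed: advance by one */
            opts = 0;
            continue;
        }
        count++;
        start = ov[1];
        /* After an empty match, retry at the same offset, insisting on
           a non-empty match anchored at that offset. */
        opts = (ov[0] == ov[1]) ? (OPT_NOTEMPTY | OPT_ANCHORED) : 0;
    }
    return count;
}
```

For the subject "xab" the loop finds an empty match at offset 0, then
"ab", then an empty match at the end of the subject.
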
PCRE_NO_UTF8_CHECK

When PCRE_UTF8 is set at compile time, the validity of the subject as a
UTF-8 string is automatically checked when pcre_exec() is subsequently
called. The value of startoffset is also checked to ensure that it
points to the start of a UTF-8 character. There is a discussion about
the validity of UTF-8 strings in the section on UTF-8 support in the
main pcre page. If an invalid UTF-8 sequence of bytes is found,
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset
contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.

If you already know that your subject is valid, and you want to skip
these checks for performance reasons, you can set the
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to
do this for the second and subsequent calls to pcre_exec() if you are
making repeated calls to find all the matches in a single subject
string. However, you should be sure that the value of startoffset
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
set, the effect of passing an invalid UTF-8 string as a subject, or a
value of startoffset that does not point to the start of a UTF-8
character, is undefined. Your program may crash.

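
A caller that sets PCRE_NO_UTF8_CHECK may wish to check startoffset
itself. The following is a minimal sketch, assuming the subject is
already known to be valid UTF-8. The function at_utf8_char_start() is a
hypothetical helper, not part of the PCRE API; it relies only on the
fact that UTF-8 continuation bytes have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if `offset` is at the start of a
   UTF-8 character (or exactly at the end of the subject), and 0
   otherwise. The subject is assumed to be valid UTF-8. */
static int at_utf8_char_start(const char *subject, size_t length,
                              size_t offset)
{
    if (offset > length) return 0;
    if (offset == length) return 1;
    /* Continuation bytes look like 10xxxxxx. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```
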
PCRE_PARTIAL

This option turns on the partial matching feature. If the subject
string fails to match the pattern, but at some point during the
matching process the end of the subject was reached (that is, the
subject partially matches the pattern and the failure to match occurred
only because there were not enough subject characters), pcre_exec()
returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When
PCRE_PARTIAL is used, there are restrictions on what may appear in the
pattern. These are discussed in the pcrepartial documentation.

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a
length in length, and a starting byte offset in startoffset. In UTF-8
mode, the byte offset must point to the start of a UTF-8 character.
Unlike the pattern string, the subject may contain binary zero bytes.
When the starting offset is zero, the search for a match starts at the
beginning of the subject, and this is by far the most common case.

A non-zero starting offset is useful when searching for another match
in the same subject by calling pcre_exec() again after a previous
success. Setting startoffset differs from just passing over a shortened
string and setting PCRE_NOTBOL in the case of a pattern that begins
with any kind of lookbehind. For example, consider the pattern

  \Biss\B

which finds occurrences of "iss" in the middle of words. (\B matches
only if the current position in the subject is not a word boundary.)
When applied to the string "Mississippi" the first call to pcre_exec()
finds the first occurrence. If pcre_exec() is called again with just
the remainder of the subject, namely "issippi", it does not match,
because \B is always false at the start of the subject, which is deemed
to be a word boundary. However, if pcre_exec() is passed the entire
string again, but with startoffset set to 4, it finds the second
occurrence of "iss" because it is able to look behind the starting
point to discover that it is preceded by a letter.

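
The difference between the two ways of calling pcre_exec() comes down
to what lies before the starting point. The sketch below does not use
PCRE; is_word_boundary() is a hypothetical helper that applies the usual
definition of a word boundary (a word character on exactly one side),
treating positions outside the subject as non-word, with word characters
taken to be ASCII letters, digits, and underscore.

```c
#include <ctype.h>
#include <stddef.h>

/* A word character: letter, digit, or underscore. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Hypothetical helper: returns 1 if position `pos` in the subject is a
   word boundary, and 0 if it is not. Positions before the start and
   after the end of the subject count as non-word. */
static int is_word_boundary(const char *subject, size_t length, size_t pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before != after;
}
```

Offset 4 of the full 11-character subject is not a word boundary,
because the preceding "s" is visible. Offset 0 of the shortened subject
is a boundary, so \B cannot match there.
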
If a non-zero starting offset is passed when the pattern is anchored,
one attempt to match at the given offset is made. This can only succeed
if the pattern does not require the match to be at the start of the
subject.

How pcre_exec() returns captured substrings

In general, a pattern matches a certain portion of the subject, and in
addition, further substrings from the subject may be picked out by
parts of the pattern. Following the usage in Jeffrey Friedl's book,
this is called "capturing" in what follows, and the phrase "capturing
subpattern" is used for a fragment of a pattern that picks out a
substring. PCRE supports several other kinds of parenthesized
subpattern that do not cause substrings to be captured.

Captured substrings are returned to the caller via a vector of integer
offsets whose address is passed in ovector. The number of elements in
the vector is passed in ovecsize, which must be a non-negative number.
Note: this argument is NOT the size of ovector in bytes.

The first two-thirds of the vector is used to pass back captured
substrings, each substring using a pair of integers. The remaining
third of the vector is used as workspace by pcre_exec() while matching
capturing subpatterns, and is not available for passing back
information. The length passed in ovecsize should always be a multiple
of three. If it is not, it is rounded down.

When a match is successful, information about captured substrings is
returned in pairs of integers, starting at the beginning of ovector,
and continuing up to two-thirds of its length at the most. The first
element of a pair is set to the offset of the first character in a
substring, and the second is set to the offset of the first character
after the end of a substring. The first pair, ovector[0] and
ovector[1], identifies the portion of the subject string matched by the
entire pattern. The next pair is used for the first capturing
subpattern, and so on. The value returned by pcre_exec() is one more
than the highest numbered pair that has been set. For example, if two
substrings have been captured, the returned value is 3. If there are no
capturing subpatterns, the return value from a successful match is 1,
indicating that just the first pair of offsets has been set.

If a capturing subpattern is matched repeatedly, it is the last portion
of the string that it matched that is returned.

If the vector is too small to hold all the captured substring offsets,
it is used as far as possible (up to two-thirds of its length), and the
function returns a value of zero. In particular, if the substring
offsets are not of interest, pcre_exec() may be called with ovector
passed as NULL and ovecsize as zero. However, if the pattern contains
back references and the ovector is not big enough to remember the
related substrings, PCRE has to get additional memory for use during
matching. Thus it is usually advisable to supply an ovector.

The pcre_info() function can be used to find out how many capturing
subpatterns there are in a compiled pattern. The smallest size for
ovector that will allow for n captured substrings, in addition to the
offsets of the substring matched by the whole pattern, is (n+1)*3.

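
The sizing arithmetic can be written down directly. Both functions
below are hypothetical helpers, not part of the PCRE API; they merely
restate the (n+1)*3 formula and the two-thirds rule described above.

```c
/* Hypothetical helper: smallest ovecsize for a pattern that has
   `capture_count` capturing subpatterns. */
static int min_ovecsize(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Hypothetical helper: number of offset pairs that can be passed back
   for a given ovecsize. The size is rounded down to a multiple of
   three, and two-thirds of that is used for pairs of integers. */
static int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For example, a pattern with three capturing subpatterns needs an
ovector of at least 12 integers, which has room for four pairs.
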
It is possible for capturing subpattern number n+1 to match some part
of the subject when subpattern n has not been used at all. For example,
if the string "abc" is matched against the pattern (a|(z))(bc) the
return from the function is 4, and subpatterns 1 and 3 are matched, but
2 is not. When this happens, both values in the offset pairs
corresponding to unused subpatterns are set to -1.

Offset values that correspond to unused subpatterns at the end of the
expression are also set to -1. For example, if the string "abc" is
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
matched. The return from the function is 2, because the highest used
capturing subpattern number is 1. However, you can refer to the offsets
for the second and third capturing subpatterns if you wish (assuming
the vector is large enough, of course).

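
Reading the offset pairs by hand can be sketched as follows. The
function copy_group() is a hypothetical helper written for this
illustration; it is not the library's pcre_copy_substring(), and its
error convention (returning -1) is an assumption of the sketch.

```c
#include <string.h>

/* Hypothetical helper: copies captured substring number `n` into buf
   as a zero-terminated string and returns its length. Returns -1 if
   n is out of range for the return value rc, if the subpattern is
   unset (offsets of -1), or if the buffer is too small. */
static int copy_group(const char *subject, const int *ovector, int rc,
                      int n, char *buf, int bufsize)
{
    if (n < 0 || n >= rc) return -1;
    int start = ovector[2 * n];        /* first character of substring */
    int end = ovector[2 * n + 1];      /* first character after it */
    if (start < 0 || end < start) return -1;   /* unset subpattern */
    int len = end - start;
    if (len + 1 > bufsize) return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For "abc" matched against (a|(z))(bc), the offsets described above are
{0,3, 0,1, -1,-1, 1,3} with a return value of 4, so subpatterns 1 and 3
yield "a" and "bc" while subpattern 2 is reported as unset.
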
Some convenience functions are provided for extracting the captured
substrings as separate strings. These are described below.

Error return values from pcre_exec()

If pcre_exec() fails, it returns a negative number. The following are
defined in the header file:

PCRE_ERROR_NOMATCH (-1)

PCRE_ERROR_NULL (-2)

Either code or subject was passed as NULL, or ovector was NULL and
ovecsize was not zero.

PCRE_ERROR_BADOPTION (-3)

PCRE_ERROR_BADMAGIC (-4)

PCRE stores a 4-byte "magic number" at the start of the compiled code,
to catch the case when it is passed a junk pointer and to detect when a
pattern that was compiled in an environment of one endianness is run in
an environment with the other endianness. This is the error that PCRE
gives when the magic number is not present.

PCRE_ERROR_UNKNOWN_OPCODE (-5)

While running the pattern match, an unknown item was encountered in the
compiled pattern. This error could be caused by a bug in PCRE or by
overwriting of the compiled pattern.

PCRE_ERROR_NOMEMORY (-6)

If a pattern contains back references, but the ovector that is passed
to pcre_exec() is not big enough to remember the referenced substrings,
PCRE gets a block of memory at the start of matching to use for this
purpose. If the call via pcre_malloc() fails, this error is given. The
memory is automatically freed at the end of matching.

PCRE_ERROR_NOSUBSTRING (-7)

This error is used by the pcre_copy_substring(), pcre_get_substring(),
and pcre_get_substring_list() functions (see below). It is never
returned by pcre_exec().

PCRE_ERROR_MATCHLIMIT (-8)

The backtracking limit, as specified by the match_limit field in a
pcre_extra structure (or defaulted), was reached. See the description
above.

PCRE_ERROR_CALLOUT (-9)

This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.

PCRE_ERROR_BADUTF8 (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject.

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8
character.

PCRE_ERROR_PARTIAL (-12)

The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.

PCRE_ERROR_BADPARTIAL (-13)

The PCRE_PARTIAL option was used with a compiled pattern containing
items that are not supported for partial matching. See the pcrepartial
documentation for details of partial matching.

PCRE_ERROR_INTERNAL (-14)

An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.

PCRE_ERROR_BADCOUNT (-15)

This error is given if the value of the ovecsize argument is negative.

PCRE_ERROR_RECURSIONLIMIT (-21)

The internal recursion limit, as specified by the match_limit_recursion
field in a pcre_extra structure (or defaulted), was reached. See the
description above.

PCRE_ERROR_BADNEWLINE (-23)

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

2266 |
Captured substrings can be accessed directly by using the offsets |
Captured substrings can be accessed directly by using the offsets |
2267 |
returned by pcre_exec() in ovector. For convenience, the functions |
returned by pcre_exec() in ovector. For convenience, the functions |
2268 |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
2269 |
string_list() are provided for extracting captured substrings as new, |
string_list() are provided for extracting captured substrings as new, |
2270 |
separate, zero-terminated strings. These functions identify substrings |
separate, zero-terminated strings. These functions identify substrings |
2271 |
by number. The next section describes functions for extracting named |
by number. The next section describes functions for extracting named |
2272 |
substrings. |
substrings. |
2273 |
|
|
2274 |
A substring that contains a binary zero is correctly extracted and has |
A substring that contains a binary zero is correctly extracted and has |
2275 |
a further zero added on the end, but the result is not, of course, a C |
a further zero added on the end, but the result is not, of course, a C |
2276 |
string. However, you can process such a string by referring to the |
string. However, you can process such a string by referring to the |
2277 |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
2278 |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
2279 |
not adequate for handling strings containing binary zeros, because the |
not adequate for handling strings containing binary zeros, because the |
2280 |
end of the final string is not independently indicated. |
end of the final string is not independently indicated. |
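As a minimal sketch of reading a substring directly through ovector while keeping track of its length, so that embedded binary zeros survive, the hypothetical helper copy_by_offsets() below copies substring n into a caller-supplied buffer. It is not part of PCRE, and it returns -1 on failure rather than the PCRE error codes.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): copy captured substring n
   out of subject using the offsets in ovector. Returns the substring
   length, or -1 if the substring is unset or does not fit in buffer
   together with its trailing zero. */
int copy_by_offsets(const char *subject, const int *ovector, int n,
                    char *buffer, int buffersize)
{
    assert(subject != NULL && ovector != NULL && buffer != NULL);
    int start = ovector[2 * n];
    int length = ovector[2 * n + 1] - start;
    if (start < 0 || length + 1 > buffersize) return -1;
    memcpy(buffer, subject + start, (size_t)length); /* keeps embedded zeros */
    buffer[length] = '\0';
    return length;
}
```

Because the length is returned separately, the caller can process the copied bytes with memcpy() or memcmp() instead of relying on strlen(), which would stop at the first binary zero.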
2281 |
|
|
2282 |
The first three arguments are the same for all three of these func- |
The first three arguments are the same for all three of these func- |
2283 |
tions: subject is the subject string that has just been successfully |
tions: subject is the subject string that has just been successfully |
2284 |
matched, ovector is a pointer to the vector of integer offsets that was |
matched, ovector is a pointer to the vector of integer offsets that was |
2285 |
passed to pcre_exec(), and stringcount is the number of substrings that |
passed to pcre_exec(), and stringcount is the number of substrings that |
2286 |
were captured by the match, including the substring that matched the |
were captured by the match, including the substring that matched the |
2287 |
entire regular expression. This is the value returned by pcre_exec() if |
entire regular expression. This is the value returned by pcre_exec() if |
2288 |
it is greater than zero. If pcre_exec() returned zero, indicating that |
it is greater than zero. If pcre_exec() returned zero, indicating that |
2289 |
it ran out of space in ovector, the value passed as stringcount should |
it ran out of space in ovector, the value passed as stringcount should |
2290 |
be the number of elements in the vector divided by three. |
be the number of elements in the vector divided by three. |
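The rule for choosing stringcount can be written as a small helper. This is only a sketch: substring_count() is a hypothetical name, rc stands for the value returned by pcre_exec(), and ovecsize for the number of elements in ovector.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): derive the stringcount
   argument from the pcre_exec() return value rc and the number of
   elements in ovector. Returns -1 if rc is negative. */
int substring_count(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0) return rc;            /* value returned by pcre_exec() */
    if (rc == 0) return ovecsize / 3; /* ovector ran out of space */
    return -1;                        /* negative rc: no successful match */
}
```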
2291 |
|
|
2292 |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
2293 |
single substring, whose number is given as stringnumber. A value of |
single substring, whose number is given as stringnumber. A value of |
2294 |
zero extracts the substring that matched the entire pattern, whereas |
zero extracts the substring that matched the entire pattern, whereas |
2295 |
higher values extract the captured substrings. For pcre_copy_sub- |
higher values extract the captured substrings. For pcre_copy_sub- |
2296 |
string(), the string is placed in buffer, whose length is given by |
string(), the string is placed in buffer, whose length is given by |
2297 |
buffersize, while for pcre_get_substring() a new block of memory is |
buffersize, while for pcre_get_substring() a new block of memory is |
2298 |
obtained via pcre_malloc, and its address is returned via stringptr. |
obtained via pcre_malloc, and its address is returned via stringptr. |
2299 |
The yield of the function is the length of the string, not including |
The yield of the function is the length of the string, not including |
2300 |
the terminating zero, or one of these error codes: |
the terminating zero, or one of these error codes: |
2301 |
|
|
2302 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2303 |
|
|
2304 |
The buffer was too small for pcre_copy_substring(), or the attempt to |
The buffer was too small for pcre_copy_substring(), or the attempt to |
2305 |
get memory failed for pcre_get_substring(). |
get memory failed for pcre_get_substring(). |
2306 |
|
|
2307 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
2308 |
|
|
2309 |
There is no substring whose number is stringnumber. |
There is no substring whose number is stringnumber. |
2310 |
|
|
2311 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
2312 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
2313 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
2314 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
2315 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
2316 |
pointer. The yield of the function is zero if all went well, or the |
pointer. The yield of the function is zero if all went well, or the |
2317 |
error code |
error code |
2318 |
|
|
2319 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2320 |
|
|
2321 |
if the attempt to get the memory block failed. |
if the attempt to get the memory block failed. |
2322 |
|
|
2323 |
When any of these functions encounters a substring that is unset, which |
When any of these functions encounters a substring that is unset, which |
2324 |
can happen when capturing subpattern number n+1 matches some part of |
can happen when capturing subpattern number n+1 matches some part of |
2325 |
the subject, but subpattern n has not been used at all, they return an |
the subject, but subpattern n has not been used at all, they return an |
2326 |
empty string. This can be distinguished from a genuine zero-length sub- |
empty string. This can be distinguished from a genuine zero-length sub- |
2327 |
string by inspecting the appropriate offset in ovector, which is nega- |
string by inspecting the appropriate offset in ovector, which is nega- |
2328 |
tive for unset substrings. |
tive for unset substrings. |
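The check on ovector can be sketched as follows. The enum and the function classify_capture() are hypothetical names introduced for illustration; the only fact taken from the text is that the offset of an unset substring is negative.

```c
#include <assert.h>
#include <stddef.h>

enum capture_state { CAPTURE_UNSET, CAPTURE_EMPTY, CAPTURE_NONEMPTY };

/* Hypothetical helper (not a PCRE function): classify captured
   substring n from the offsets that pcre_exec() left in ovector. */
enum capture_state classify_capture(const int *ovector, int n)
{
    assert(ovector != NULL && n >= 0);
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    if (start < 0) return CAPTURE_UNSET;  /* negative offset: never set */
    return (end == start) ? CAPTURE_EMPTY : CAPTURE_NONEMPTY;
}
```

A caller would run this test before treating an empty string returned by the extraction functions as a real zero-length match.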
2329 |
|
|
2330 |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
2331 |
string_list() can be used to free the memory returned by a previous |
string_list() can be used to free the memory returned by a previous |
2332 |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
2333 |
tively. They do nothing more than call the function pointed to by |
tively. They do nothing more than call the function pointed to by |
2334 |
pcre_free, which of course could be called directly from a C program. |
pcre_free, which of course could be called directly from a C program. |
2335 |
However, PCRE is used in some situations where it is linked via a spe- |
However, PCRE is used in some situations where it is linked via a spe- |
2336 |
cial interface to another programming language that cannot use |
cial interface to another programming language that cannot use |
2337 |
pcre_free directly; it is for these cases that the functions are pro- |
pcre_free directly; it is for these cases that the functions are pro- |
2338 |
vided. |
vided. |
2339 |
|
|
2340 |
|
|
2353 |
int stringcount, const char *stringname, |
int stringcount, const char *stringname, |
2354 |
const char **stringptr); |
const char **stringptr); |
2355 |
|
|
2356 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
2357 |
number. For example, for this pattern |
number. For example, for this pattern |
2358 |
|
|
2359 |
(a+)b(?<xxx>\d+)... |
(a+)b(?<xxx>\d+)... |
2362 |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
2363 |
name by calling pcre_get_stringnumber(). The first argument is the com- |
name by calling pcre_get_stringnumber(). The first argument is the com- |
2364 |
piled pattern, and the second is the name. The yield of the function is |
piled pattern, and the second is the name. The yield of the function is |
2365 |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
2366 |
subpattern of that name. |
subpattern of that name. |
2367 |
|
|
2368 |
Given the number, you can extract the substring directly, or use one of |
Given the number, you can extract the substring directly, or use one of |
2369 |
the functions described in the previous section. For convenience, there |
the functions described in the previous section. For convenience, there |
2370 |
are also two functions that do the whole job. |
are also two functions that do the whole job. |
2371 |
|
|
2372 |
Most of the arguments of pcre_copy_named_substring() and |
Most of the arguments of pcre_copy_named_substring() and |
2373 |
pcre_get_named_substring() are the same as those for the similarly |
pcre_get_named_substring() are the same as those for the similarly |
2374 |
named functions that extract by number. As these are described in the |
named functions that extract by number. As these are described in the |
2375 |
previous section, they are not re-described here. There are just two |
previous section, they are not re-described here. There are just two |
2376 |
differences: |
differences: |
2377 |
|
|
2378 |
First, instead of a substring number, a substring name is given. Sec- |
First, instead of a substring number, a substring name is given. Sec- |
2379 |
ond, there is an extra argument, given at the start, which is a pointer |
ond, there is an extra argument, given at the start, which is a pointer |
2380 |
to the compiled pattern. This is needed in order to gain access to the |
to the compiled pattern. This is needed in order to gain access to the |
2381 |
name-to-number translation table. |
name-to-number translation table. |
2382 |
|
|
2383 |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
2384 |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
2385 |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
2386 |
behaviour may not be what you want (see the next section). |
behaviour may not be what you want (see the next section). |
2387 |
|
|
2388 |
|
|
2391 |
int pcre_get_stringtable_entries(const pcre *code, |
int pcre_get_stringtable_entries(const pcre *code, |
2392 |
const char *name, char **first, char **last); |
const char *name, char **first, char **last); |
2393 |
|
|
2394 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
2395 |
subpatterns are not required to be unique. Normally, patterns with |
subpatterns are not required to be unique. Normally, patterns with |
2396 |
duplicate names are such that in any one match, only one of the named |
duplicate names are such that in any one match, only one of the named |
2397 |
subpatterns participates. An example is shown in the pcrepattern docu- |
subpatterns participates. An example is shown in the pcrepattern docu- |
2398 |
mentation. |
mentation. |
2399 |
|
|
2400 |
When duplicates are present, pcre_copy_named_substring() and |
When duplicates are present, pcre_copy_named_substring() and |
2401 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
2402 |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
2403 |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
2404 |
function returns one of the numbers that are associated with the name, |
function returns one of the numbers that are associated with the name, |
2405 |
but it is not defined which it is. |
but it is not defined which it is. |
2406 |
|
|
2407 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
2408 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
2409 |
first argument is the compiled pattern, and the second is the name. The |
first argument is the compiled pattern, and the second is the name. The |
2410 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
2411 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
2412 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
2413 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
2414 |
there are none. The format of the table is described above in the sec- |
there are none. The format of the table is described above in the sec- |
2415 |
tion entitled Information about a pattern. Given all the relevant |
tion entitled Information about a pattern. Given all the relevant |
2416 |
entries for the name, you can extract each of their numbers, and hence |
entries for the name, you can extract each of their numbers, and hence |
2417 |
the captured data, if any. |
the captured data, if any. |
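Walking the entries between first and last might look like the sketch below. The function entry_numbers() is hypothetical. It also assumes the name table layout for the 8-bit library, in which the first two bytes of each entry hold the subpattern number, most significant byte first, and the zero-terminated name follows.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): visit every name-table
   entry from first to last inclusive, as set by
   pcre_get_stringtable_entries(), and store the subpattern numbers in
   numbers[]. entrysize is the value returned by that function.
   Returns how many numbers were stored (at most max). */
int entry_numbers(const char *first, const char *last, int entrysize,
                  int *numbers, int max)
{
    int count = 0;
    assert(entrysize > 2);
    for (const char *entry = first; entry <= last && count < max;
         entry += entrysize) {
        /* Assumed layout: bytes 0-1 are the number, high byte first;
           the name starts at entry + 2. */
        numbers[count++] = ((unsigned char)entry[0] << 8)
                         | (unsigned char)entry[1];
    }
    return count;
}
```

Each number obtained this way can then be used to index ovector or passed to pcre_get_substring() to fetch the captured data for that particular subpattern.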
2418 |
|
|
2419 |
|
|
2420 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
2421 |
|
|
2422 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
2423 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
2424 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
2425 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
2426 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
2427 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
2428 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
2429 |
tation. |
tation. |
2430 |
|
|
2431 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
2432 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
2433 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
2434 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
2435 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
2436 |
|
|
2437 |
|
|
2442 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
2443 |
int *workspace, int wscount); |
int *workspace, int wscount); |
2444 |
|
|
2445 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
2446 |
against a compiled pattern, using a matching algorithm that scans the |
against a compiled pattern, using a matching algorithm that scans the |
2447 |
subject string just once, and does not backtrack. This has different |
subject string just once, and does not backtrack. This has different |
2448 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
2449 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
2450 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
2451 |
a discussion of the two matching algorithms, see the pcrematching docu- |
a discussion of the two matching algorithms, see the pcrematching docu- |
2452 |
mentation. |
mentation. |
2453 |
|
|
2454 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
2455 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
2456 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
2457 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
2458 |
repeated here. |
repeated here. |
2459 |
|
|
2460 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
2461 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
2462 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
2463 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
2464 |
lot of potential matches. |
lot of potential matches. |
2465 |
|
|
2466 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
2482 |
|
|
2483 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
2484 |
|
|
2485 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
2486 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2487 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
2488 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
2489 |
three of these are the same as for pcre_exec(), so their description is |
three of these are the same as for pcre_exec(), so their description is |
2490 |
not repeated here. |
not repeated here. |
2491 |
|
|
2492 |
PCRE_PARTIAL |
PCRE_PARTIAL |
2493 |
|
|
2494 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
2495 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
2496 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
2497 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
2498 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
2499 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
2500 |
set as the first matching string. |
set as the first matching string. |
2501 |
|
|
2502 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
2503 |
|
|
2504 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
2505 |
stop as soon as it has found one match. Because of the way the alterna- |
stop as soon as it has found one match. Because of the way the alterna- |
2506 |
tive algorithm works, this is necessarily the shortest possible match |
tive algorithm works, this is necessarily the shortest possible match |
2507 |
at the first possible matching point in the subject string. |
at the first possible matching point in the subject string. |
2508 |
|
|
2509 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
2510 |
|
|
2511 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
2512 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
2513 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
2514 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
2515 |
workspace and wscount arguments must reference the same vector as before |
workspace and wscount arguments must reference the same vector as before |
2516 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
2517 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
2518 |
documentation. |
documentation. |
2519 |
|
|
2520 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
2521 |
|
|
2522 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
2523 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
2524 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
2525 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
2526 |
if the pattern |
if the pattern |
2527 |
|
|
2528 |
<.*> |
<.*> |
2537 |
<something> <something else> |
<something> <something else> |
2538 |
<something> <something else> <something further> |
<something> <something else> <something further> |
2539 |
|
|
2540 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
2541 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
2542 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
2543 |
the offset to the start, and the second is the offset to the end. In |
the offset to the start, and the second is the offset to the end. In |
2544 |
fact, all the strings have the same start offset. (Space could have |
fact, all the strings have the same start offset. (Space could have |
2545 |
been saved by giving this only once, but it was decided to retain some |
been saved by giving this only once, but it was decided to retain some |
2546 |
compatibility with the way pcre_exec() returns data, even though the |
compatibility with the way pcre_exec() returns data, even though the |
2547 |
meaning of the strings is different.) |
meaning of the strings is different.) |
2548 |
|
|
2549 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
2550 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
2551 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
2552 |
filled with the longest matches. |
filled with the longest matches. |
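The layout of the results can be illustrated with two small helpers. Both names are hypothetical. The treatment of a zero return assumes that every pair of slots in ovector then holds a match, as the text above describes.

```c
#include <assert.h>

/* Hypothetical helper (not a PCRE function): number of (start, end)
   pairs that hold matches after pcre_dfa_exec() returned rc, given
   ovecsize elements in ovector. */
int dfa_pair_count(int rc, int ovecsize)
{
    assert(ovecsize >= 0);
    if (rc > 0) return rc;            /* number of matched substrings */
    if (rc == 0) return ovecsize / 2; /* assumed: vector is full */
    return 0;                         /* negative rc: no complete match */
}

/* Length of the i-th match; index 0 is the longest. */
int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

Since all matches share the same start offset, iterating i from 0 to dfa_pair_count() - 1 visits the matches from longest to shortest.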
2553 |
|
|
2554 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
2555 |
|
|
2556 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
2557 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
2558 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
2559 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
2560 |
|
|
2561 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
2562 |
|
|
2563 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
2564 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
2565 |
reference. |
reference. |
2566 |
|
|
2567 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
2568 |
|
|
2569 |
This return is given if pcre_dfa_exec() encounters a condition item |
This return is given if pcre_dfa_exec() encounters a condition item |
2570 |
that uses a back reference for the condition, or a test for recursion |
that uses a back reference for the condition, or a test for recursion |
2571 |
in a specific group. These are not supported. |
in a specific group. These are not supported. |
2572 |
|
|
2573 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
2574 |
|
|
2575 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
2576 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
2577 |
(it is meaningless). |
(it is meaningless). |
2578 |
|
|
2579 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
2580 |
|
|
2581 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
2582 |
workspace vector. |
workspace vector. |
2583 |
|
|
2584 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
2585 |
|
|
2586 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
2587 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
2588 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
2589 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
2590 |
|
|
2591 |
|
|
2592 |
SEE ALSO |
SEE ALSO |
2593 |
|
|
2594 |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
2595 |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
2596 |
|
|
2597 |
|
|
2598 |
AUTHOR |
AUTHOR |
2604 |
|
|
2605 |
REVISION |
REVISION |
2606 |
|
|
2607 |
Last updated: 23 January 2008 |
Last updated: 12 April 2008 |
2608 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
2609 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2610 |
|
|
2932 |
|
|
2933 |
The syntax and semantics of the regular expressions that are supported |
The syntax and semantics of the regular expressions that are supported |
2934 |
by PCRE are described in detail below. There is a quick-reference syn- |
by PCRE are described in detail below. There is a quick-reference syn- |
2935 |
tax summary in the pcresyntax page. Perl's regular expressions are |
tax summary in the pcresyntax page. PCRE tries to match Perl syntax and |
2936 |
described in its own documentation, and regular expressions in general |
semantics as closely as it can. PCRE also supports some alternative |
2937 |
are covered in a number of books, some of which have copious examples. |
regular expression syntax (which does not conflict with the Perl syn- |
2938 |
Jeffrey Friedl's "Mastering Regular Expressions", published by |
tax) in order to provide some compatibility with regular expressions in |
2939 |
O'Reilly, covers regular expressions in great detail. This description |
Python, .NET, and Oniguruma. |
2940 |
of PCRE's regular expressions is intended as reference material. |
|
2941 |
|
Perl's regular expressions are described in its own documentation, and |
2942 |
|
regular expressions in general are covered in a number of books, some |
2943 |
|
of which have copious examples. Jeffrey Friedl's "Mastering Regular |
2944 |
|
Expressions", published by O'Reilly, covers regular expressions in |
2945 |
|
great detail. This description of PCRE's regular expressions is |
2946 |
|
intended as reference material. |
2947 |
|
|
2948 |
The original operation of PCRE was on strings of one-byte characters. |
The original operation of PCRE was on strings of one-byte characters. |
2949 |
However, there is now also support for UTF-8 character strings. To use |
However, there is now also support for UTF-8 character strings. To use |
3190 |
named back reference can be coded as \g{name}. Back references are dis- |
named back reference can be coded as \g{name}. Back references are dis- |
3191 |
cussed later, following the discussion of parenthesized subpatterns. |
cussed later, following the discussion of parenthesized subpatterns. |
3192 |
|
|
3193 |
|
Absolute and relative subroutine calls |
3194 |
|
|
3195 |
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
3196 |
|
name or a number enclosed either in angle brackets or single quotes, is |
3197 |
|
an alternative syntax for referencing a subpattern as a "subroutine". |
3198 |
|
Details are discussed later. Note that \g{...} (Perl syntax) and |
3199 |
|
\g<...> (Oniguruma syntax) are not synonymous. The former is a back |
3200 |
|
reference; the latter is a subroutine call. |
3201 |
|
|
3202 |
Generic character types |
Generic character types |
3203 |
|
|
3204 |
Another use of backslash is for specifying generic character types. The |
Another use of backslash is for specifying generic character types. The |
3216 |
\W any "non-word" character |
\W any "non-word" character |
3217 |
|
|
3218 |
Each pair of escape sequences partitions the complete set of characters |
Each pair of escape sequences partitions the complete set of characters |
3219 |
into two disjoint sets. Any given character matches one, and only one, |
into two disjoint sets. Any given character matches one, and only one, |
3220 |
of each pair. |
of each pair. |
3221 |
|
|
3222 |
These character type sequences can appear both inside and outside char- |
These character type sequences can appear both inside and outside char- |
3223 |
acter classes. They each match one character of the appropriate type. |
acter classes. They each match one character of the appropriate type. |
3224 |
If the current matching point is at the end of the subject string, all |
If the current matching point is at the end of the subject string, all |
3225 |
of them fail, since there is no character to match. |
of them fail, since there is no character to match. |
3226 |
|
|
3227 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
3228 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
3229 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If |
3230 |
"use locale;" is included in a Perl script, \s may match the VT charac- |
"use locale;" is included in a Perl script, \s may match the VT charac- |
3231 |
ter. In PCRE, it never does. |
ter. In PCRE, it never does. |
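
As a rough illustration, the five characters that \s matches in PCRE can be written out as an explicit class. The sketch below uses Python's re module purely as a stand-in (an assumption made for this example; it is not PCRE, and Python's own \s does match VT). The names PCRE_SPACE and is_pcre_space are hypothetical.

```python
import re

# Explicit class for the five characters the text lists for PCRE's \s:
# HT (9), LF (10), FF (12), CR (13), and space (32). VT (11) is left out.
PCRE_SPACE = re.compile(r"[\t\n\f\r ]")

def is_pcre_space(ch):
    """Return True if ch is one of the five characters listed above."""
    return PCRE_SPACE.fullmatch(ch) is not None

vt = "\x0b"
print(is_pcre_space(vt))                    # VT is not in the listed set
print(re.fullmatch(r"\s", vt) is not None)  # Python's own \s accepts VT
```

The explicit class is what a caller would need if the PCRE definition of \s had to be reproduced in an engine whose \s includes VT.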
3232 |
|
|
3233 |
In UTF-8 mode, characters with values greater than 128 never match \d, |
In UTF-8 mode, characters with values greater than 128 never match \d, |
3234 |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
3235 |
code character property support is available. These sequences retain |
code character property support is available. These sequences retain |
3236 |
their original meanings from before UTF-8 support was available, mainly |
their original meanings from before UTF-8 support was available, mainly |
3237 |
for efficiency reasons. |
for efficiency reasons. |
3238 |
|
|
3239 |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to |
3240 |
the other sequences, these do match certain high-valued codepoints in |
the other sequences, these do match certain high-valued codepoints in |
3241 |
UTF-8 mode. The horizontal space characters are: |
UTF-8 mode. The horizontal space characters are: |
3242 |
|
|
3243 |
U+0009 Horizontal tab |
U+0009 Horizontal tab |
3271 |
U+2029 Paragraph separator |
U+2029 Paragraph separator |
3272 |
|
|
3273 |
A "word" character is an underscore or any character less than 256 that |
A "word" character is an underscore or any character less than 256 that |
3274 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
3275 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
3276 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
3277 |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
3278 |
systems, or "french" in Windows, some character codes greater than 128 |
systems, or "french" in Windows, some character codes greater than 128 |
3279 |
are used for accented letters, and these are matched by \w. The use of |
are used for accented letters, and these are matched by \w. The use of |
3280 |
locales with Unicode is discouraged. |
locales with Unicode is discouraged. |
3281 |
|
|
3282 |
Newline sequences |
Newline sequences |
3283 |
|
|
3284 |
Outside a character class, by default, the escape sequence \R matches |
Outside a character class, by default, the escape sequence \R matches |
3285 |
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 |
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8 |
3286 |
mode \R is equivalent to the following: |
mode \R is equivalent to the following: |
3287 |
|
|
3288 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
3289 |
|
|
3290 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
3291 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
3292 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
3293 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
3294 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
3295 |
is treated as a single unit that cannot be split. |
is treated as a single unit that cannot be split. |
3296 |
|
|
3297 |
In UTF-8 mode, two additional characters whose codepoints are greater |
In UTF-8 mode, two additional characters whose codepoints are greater |
3298 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
3299 |
rator, U+2029). Unicode character property support is not needed for |
rator, U+2029). Unicode character property support is not needed for |
3300 |
these characters to be recognized. |
these characters to be recognized. |
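
The set of sequences described above can be approximated in another engine with a plain alternation. The following is a minimal sketch in Python's re module, which is an assumption of this example and not PCRE. It uses ordinary alternation rather than an atomic group, so it only approximates \R. The names ANY_NEWLINE and split_lines are hypothetical.

```python
import re

# Alternatives listed in the text, plus LS and PS for the UTF-8 case.
# CRLF comes first so that it is consumed as one unit. Plain alternation
# is used instead of an atomic group, so unlike the real \R a later
# backtrack could in principle split the CRLF pair.
ANY_NEWLINE = re.compile(r"\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029")

def split_lines(text):
    """Split text at every newline sequence recognized by ANY_NEWLINE."""
    return ANY_NEWLINE.split(text)

print(split_lines("a\r\nb\nc\u2028d"))
print(len(ANY_NEWLINE.findall("\r\n")))  # CRLF counts as a single match
```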
3301 |
|
|
3302 |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
It is possible to restrict \R to match only CR, LF, or CRLF (instead of |
3303 |
the complete set of Unicode line endings) by setting the option |
the complete set of Unicode line endings) by setting the option |
3304 |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched. |
3305 |
(BSR is an abbreviation for "backslash R".) This can be made the default |
(BSR is an abbreviation for "backslash R".) This can be made the default |
3306 |
when PCRE is built; if this is the case, the other behaviour can be |
when PCRE is built; if this is the case, the other behaviour can be |
3307 |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
requested via the PCRE_BSR_UNICODE option. It is also possible to |
3308 |
specify these settings by starting a pattern string with one of the |
specify these settings by starting a pattern string with one of the |
3309 |
following sequences: |
following sequences: |
3310 |
|
|
3311 |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
(*BSR_ANYCRLF) CR, LF, or CRLF only |
3314 |
These override the default and the options given to pcre_compile(), but |
These override the default and the options given to pcre_compile(), but |
3315 |
they can be overridden by options given to pcre_exec(). Note that these |
they can be overridden by options given to pcre_exec(). Note that these |
3316 |
special settings, which are not Perl-compatible, are recognized only at |
special settings, which are not Perl-compatible, are recognized only at |
3317 |
the very start of a pattern, and that they must be in upper case. If |
the very start of a pattern, and that they must be in upper case. If |
3318 |
more than one of them is present, the last one is used. They can be |
more than one of them is present, the last one is used. They can be |
3319 |
combined with a change of newline convention, for example, a pattern |
combined with a change of newline convention, for example, a pattern |
3320 |
can start with: |
can start with: |
3321 |
|
|
3322 |
(*ANY)(*BSR_ANYCRLF) |
(*ANY)(*BSR_ANYCRLF) |
3326 |
Unicode character properties |
Unicode character properties |
3327 |
|
|
3328 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
3329 |
tional escape sequences that match characters with specific properties |
tional escape sequences that match characters with specific properties |
3330 |
are available. When not in UTF-8 mode, these sequences are of course |
are available. When not in UTF-8 mode, these sequences are of course |
3331 |
limited to testing characters whose codepoints are less than 256, but |
limited to testing characters whose codepoints are less than 256, but |
3332 |
they do work in this mode. The extra escape sequences are: |
they do work in this mode. The extra escape sequences are: |
3333 |
|
|
3334 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
3335 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
3336 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
3337 |
|
|
3338 |
The property names represented by xx above are limited to the Unicode |
The property names represented by xx above are limited to the Unicode |
3339 |
script names, the general category properties, and "Any", which matches |
script names, the general category properties, and "Any", which matches |
3340 |
any character (including newline). Other properties such as "InMusical- |
any character (including newline). Other properties such as "InMusical- |
3341 |
Symbols" are not currently supported by PCRE. Note that \P{Any} does |
Symbols" are not currently supported by PCRE. Note that \P{Any} does |
3342 |
not match any characters, so always causes a match failure. |
not match any characters, so always causes a match failure. |
3343 |
|
|
3344 |
Sets of Unicode characters are defined as belonging to certain scripts. |
Sets of Unicode characters are defined as belonging to certain scripts. |
3345 |
A character from one of these sets can be matched using a script name. |
A character from one of these sets can be matched using a script name. |
3346 |
For example: |
For example: |
3347 |
|
|
3348 |
\p{Greek} |
\p{Greek} |
3349 |
\P{Han} |
\P{Han} |
3350 |
|
|
3351 |
Those that are not part of an identified script are lumped together as |
Those that are not part of an identified script are lumped together as |
3352 |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
3353 |
|
|
3354 |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
3355 |
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
3356 |
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
3357 |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
3358 |
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, |
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, |
3359 |
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
3360 |
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
3361 |
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
3362 |
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. |
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. |
3363 |
|
|
3364 |
Each character has exactly one general category property, specified by |
Each character has exactly one general category property, specified by |
3365 |
a two-letter abbreviation. For compatibility with Perl, negation can be |
a two-letter abbreviation. For compatibility with Perl, negation can be |
3366 |
specified by including a circumflex between the opening brace and the |
specified by including a circumflex between the opening brace and the |
3367 |
property name. For example, \p{^Lu} is the same as \P{Lu}. |
property name. For example, \p{^Lu} is the same as \P{Lu}. |
3368 |
|
|
3369 |
If only one letter is specified with \p or \P, it includes all the gen- |
If only one letter is specified with \p or \P, it includes all the gen- |
3370 |
eral category properties that start with that letter. In this case, in |
eral category properties that start with that letter. In this case, in |
3371 |
the absence of negation, the curly brackets in the escape sequence are |
the absence of negation, the curly brackets in the escape sequence are |
3372 |
optional; these two examples have the same effect: |
optional; these two examples have the same effect: |
3373 |
|
|
3374 |
\p{L} |
\p{L} |
3420 |
Zp Paragraph separator |
Zp Paragraph separator |
3421 |
Zs Space separator |
Zs Space separator |
3422 |
|
|
3423 |
The special property L& is also supported: it matches a character that |
The special property L& is also supported: it matches a character that |
3424 |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
3425 |
classified as a modifier or "other". |
classified as a modifier or "other". |
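
The general category rules above (two-letter names, one-letter prefixes, "Any", and L&) can be modelled outside PCRE. The sketch below uses Python's unicodedata module as an assumed source of category data; it does not call PCRE, it does not cover script names such as Greek, and the function name has_property is hypothetical.

```python
import unicodedata

def has_property(ch, prop):
    """Model \\p{prop} for general categories, "Any", and "L&" only."""
    category = unicodedata.category(ch)   # two-letter code such as "Lu"
    if prop == "Any":
        return True
    if prop == "L&":
        # Lu, Ll, or Lt: a letter that is not a modifier or "other".
        return category in ("Lu", "Ll", "Lt")
    # A one-letter name covers every category starting with that letter;
    # a two-letter name must match the category exactly.
    return category.startswith(prop)

def lacks_property(ch, prop):
    """Model \\P{prop}, which is the same as \\p{^prop}."""
    return not has_property(ch, prop)

print(has_property("A", "Lu"), has_property("a", "L"))
```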
3426 |
|
|
3427 |
The Cs (Surrogate) property applies only to characters in the range |
The Cs (Surrogate) property applies only to characters in the range |
3428 |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see |
3429 |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check- |
3430 |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in |
3431 |
the pcreapi page). |
the pcreapi page). |
3432 |
|
|
3433 |
The long synonyms for these properties that Perl supports (such as |
The long synonyms for these properties that Perl supports (such as |
3434 |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
3435 |
any of these properties with "Is". |
any of these properties with "Is". |
3436 |
|
|
3437 |
No character that is in the Unicode table has the Cn (unassigned) prop- |
No character that is in the Unicode table has the Cn (unassigned) prop- |
3438 |
erty. Instead, this property is assumed for any code point that is not |
erty. Instead, this property is assumed for any code point that is not |
3439 |
in the Unicode table. |
in the Unicode table. |
3440 |
|
|
3441 |
Specifying caseless matching does not affect these escape sequences. |
Specifying caseless matching does not affect these escape sequences. |
3442 |
For example, \p{Lu} always matches only upper case letters. |
For example, \p{Lu} always matches only upper case letters. |
3443 |
|
|
3444 |
The \X escape matches any number of Unicode characters that form an |
The \X escape matches any number of Unicode characters that form an |
3445 |
extended Unicode sequence. \X is equivalent to |
extended Unicode sequence. \X is equivalent to |
3446 |
|
|
3447 |
(?>\PM\pM*) |
(?>\PM\pM*) |
3448 |
|
|
3449 |
That is, it matches a character without the "mark" property, followed |
That is, it matches a character without the "mark" property, followed |
3450 |
by zero or more characters with the "mark" property, and treats the |
by zero or more characters with the "mark" property, and treats the |
3451 |
sequence as an atomic group (see below). Characters with the "mark" |
sequence as an atomic group (see below). Characters with the "mark" |
3452 |
property are typically accents that affect the preceding character. |
property are typically accents that affect the preceding character. |
3453 |
None of them have codepoints less than 256, so in non-UTF-8 mode \X |
None of them have codepoints less than 256, so in non-UTF-8 mode \X |
3454 |
matches any one character. |
matches any one character. |
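
The behaviour of (?>\PM\pM*) can be traced by hand with a small matcher. This is only a sketch under stated assumptions: it takes category data from Python's unicodedata module rather than from PCRE's tables, and the names is_mark and match_extended are hypothetical.

```python
import unicodedata

def is_mark(ch):
    """True if ch has a general category starting with M (the mark property)."""
    return unicodedata.category(ch).startswith("M")

def match_extended(subject, pos):
    """Return the end offset of one \\PM\\pM* sequence at pos, or None."""
    # \PM: there must be a character here, and it must not be a mark.
    if pos >= len(subject) or is_mark(subject[pos]):
        return None
    end = pos + 1
    # \pM*: take every following mark. The group is atomic, so the
    # marks are never given back once taken.
    while end < len(subject) and is_mark(subject[end]):
        end += 1
    return end

subject = "e\u0301x"   # "e", a combining acute accent, then "x"
print(match_extended(subject, 0))
```

Starting at offset 1, where the combining accent sits, the matcher returns None, which mirrors the requirement that the sequence begins with a non-mark character.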
3455 |
|
|
3456 |
Matching characters by Unicode property is not fast, because PCRE has |
Matching characters by Unicode property is not fast, because PCRE has |
3457 |
to search a structure that contains data for over fifteen thousand |
to search a structure that contains data for over fifteen thousand |
3458 |
characters. That is why the traditional escape sequences such as \d and |
characters. That is why the traditional escape sequences such as \d and |
3459 |
\w do not use Unicode properties in PCRE. |
\w do not use Unicode properties in PCRE. |
3460 |
|
|
3461 |
Resetting the match start |
Resetting the match start |
3462 |
|
|
3463 |
The escape sequence \K, which is a Perl 5.10 feature, causes any previ- |
The escape sequence \K, which is a Perl 5.10 feature, causes any previ- |
3464 |
ously matched characters not to be included in the final matched |
ously matched characters not to be included in the final matched |
3465 |
sequence. For example, the pattern: |
sequence. For example, the pattern: |
3466 |
|
|
3467 |
foo\Kbar |
foo\Kbar |
3468 |
|
|
3469 |
matches "foobar", but reports that it has matched "bar". This feature |
matches "foobar", but reports that it has matched "bar". This feature |
3470 |
is similar to a lookbehind assertion (described below). However, in |
is similar to a lookbehind assertion (described below). However, in |
3471 |
this case, the part of the subject before the real match does not have |
this case, the part of the subject before the real match does not have |
3472 |
to be of fixed length, as lookbehind assertions do. The use of \K does |
to be of fixed length, as lookbehind assertions do. The use of \K does |
3473 |
not interfere with the setting of captured substrings. For example, |
not interfere with the setting of captured substrings. For example, |
3474 |
when the pattern |
when the pattern |
3475 |
|
|
3476 |
(foo)\Kbar |
(foo)\Kbar |
3479 |
|
|
3480 |
Simple assertions |
Simple assertions |
3481 |
|
|
3482 |
The final use of backslash is for certain simple assertions. An asser- |
The final use of backslash is for certain simple assertions. An asser- |
3483 |
tion specifies a condition that has to be met at a particular point in |
tion specifies a condition that has to be met at a particular point in |
3484 |
a match, without consuming any characters from the subject string. The |
a match, without consuming any characters from the subject string. The |
3485 |
use of subpatterns for more complicated assertions is described below. |
use of subpatterns for more complicated assertions is described below. |
3486 |
The backslashed assertions are: |
The backslashed assertions are: |
3487 |
|
|
3488 |
\b matches at a word boundary |
\b matches at a word boundary |
3493 |
\z matches only at the end of the subject |
\z matches only at the end of the subject |
3494 |
\G matches at the first matching position in the subject |
\G matches at the first matching position in the subject |
3495 |
|
|
3496 |
These assertions may not appear in character classes (but note that \b |
These assertions may not appear in character classes (but note that \b |
3497 |
has a different meaning, namely the backspace character, inside a char- |
has a different meaning, namely the backspace character, inside a char- |
3498 |
acter class). |
acter class). |
3499 |
|
|
3500 |
A word boundary is a position in the subject string where the current |
A word boundary is a position in the subject string where the current |
3501 |
character and the previous character do not both match \w or \W (i.e. |
character and the previous character do not both match \w or \W (i.e. |
3502 |
one matches \w and the other matches \W), or the start or end of the |
one matches \w and the other matches \W), or the start or end of the |
3503 |
string if the first or last character matches \w, respectively. |
string if the first or last character matches \w, respectively. |
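
The word boundary definition above can be written directly as a test on the characters either side of a position. The sketch below is in Python, with ASCII \w from Python's re module standing in for PCRE's \w (an assumption; PCRE's tables may differ by locale). The names is_word and is_boundary are hypothetical.

```python
import re

def is_word(ch):
    """True if ch matches \\w (ASCII letters, digits, and underscore)."""
    return re.fullmatch(r"\w", ch, re.ASCII) is not None

def is_boundary(subject, pos):
    """True if pos is a word boundary under the definition in the text."""
    # Outside the string counts as "not a word character", which covers
    # the start-of-string and end-of-string cases.
    before = pos > 0 and is_word(subject[pos - 1])
    after = pos < len(subject) and is_word(subject[pos])
    return before != after

subject = "ab cd"
print([i for i in range(len(subject) + 1) if is_boundary(subject, i)])
```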
3504 |
|
|
3505 |
The \A, \Z, and \z assertions differ from the traditional circumflex |
The \A, \Z, and \z assertions differ from the traditional circumflex |
3506 |
and dollar (described in the next section) in that they only ever match |
and dollar (described in the next section) in that they only ever match |
3507 |
at the very start and end of the subject string, whatever options are |
at the very start and end of the subject string, whatever options are |
3508 |
set. Thus, they are independent of multiline mode. These three asser- |
set. Thus, they are independent of multiline mode. These three asser- |
3509 |
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which |
3510 |
affect only the behaviour of the circumflex and dollar metacharacters. |
affect only the behaviour of the circumflex and dollar metacharacters. |
3511 |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
However, if the startoffset argument of pcre_exec() is non-zero, indi- |
3512 |
cating that matching is to start at a point other than the beginning of |
cating that matching is to start at a point other than the beginning of |
3513 |
the subject, \A can never match. The difference between \Z and \z is |
the subject, \A can never match. The difference between \Z and \z is |
3514 |
that \Z matches before a newline at the end of the string as well as at |
that \Z matches before a newline at the end of the string as well as at |
3515 |
the very end, whereas \z matches only at the end. |
the very end, whereas \z matches only at the end. |
3516 |
|
|
3517 |
The \G assertion is true only when the current matching position is at |
The \G assertion is true only when the current matching position is at |
3518 |
the start point of the match, as specified by the startoffset argument |
the start point of the match, as specified by the startoffset argument |
3519 |
of pcre_exec(). It differs from \A when the value of startoffset is |
of pcre_exec(). It differs from \A when the value of startoffset is |
3520 |
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
non-zero. By calling pcre_exec() multiple times with appropriate argu- |
3521 |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
ments, you can mimic Perl's /g option, and it is in this kind of imple- |
3522 |
mentation where \G can be useful. |
mentation where \G can be useful. |
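
The repeated-call loop described above might look like the following sketch. It is written in Python, whose compiled-pattern match(subject, pos) method requires the match to begin at pos and so plays the role of a pattern starting with \G together with a startoffset; this pairing is an analogy assumed for the example, not PCRE's API. The function name scan_from is hypothetical.

```python
import re

def scan_from(pattern, subject, start=0):
    """Collect successive matches, each beginning where the last ended."""
    compiled = re.compile(pattern)
    pos = start
    found = []
    while pos <= len(subject):
        # The match must begin exactly at pos, as with a leading \G.
        m = compiled.match(subject, pos)
        if m is None:
            break
        found.append(m.group())
        # Step past an empty match so that the loop always advances.
        pos = m.end() if m.end() > pos else pos + 1
    return found

print(scan_from(r"\d", "123abc45"))
# With a non-zero start offset, \A cannot match, but the scan still can.
print(re.compile(r"\Aabc").match("xxabc", 2) is None)
```

The scan stops at the first position where no match begins, rather than searching forward, which is the practical difference between an anchored loop of this kind and an unanchored global search.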
3523 |
|
|
3524 |
Note, however, that PCRE's interpretation of \G, as the start of the |
Note, however, that PCRE's interpretation of \G, as the start of the |
3525 |
current match, is subtly different from Perl's, which defines it as the |
current match, is subtly different from Perl's, which defines it as the |
3526 |
end of the previous match. In Perl, these can be different when the |
end of the previous match. In Perl, these can be different when the |
3527 |
previously matched string was empty. Because PCRE does just one match |
previously matched string was empty. Because PCRE does just one match |
3528 |
at a time, it cannot reproduce this behaviour. |
at a time, it cannot reproduce this behaviour. |
3529 |
|
|
3530 |
If all the alternatives of a pattern begin with \G, the expression is |
If all the alternatives of a pattern begin with \G, the expression is |
3531 |
anchored to the starting match position, and the "anchored" flag is set |
anchored to the starting match position, and the "anchored" flag is set |
3532 |
in the compiled regular expression. |
in the compiled regular expression. |
3533 |
|
|
3535 |
CIRCUMFLEX AND DOLLAR |
CIRCUMFLEX AND DOLLAR |
3536 |
|
|
3537 |
Outside a character class, in the default matching mode, the circumflex |
Outside a character class, in the default matching mode, the circumflex |
3538 |
character is an assertion that is true only if the current matching |
character is an assertion that is true only if the current matching |
3539 |
point is at the start of the subject string. If the startoffset argu- |
point is at the start of the subject string. If the startoffset argu- |
3540 |
ment of pcre_exec() is non-zero, circumflex can never match if the |
ment of pcre_exec() is non-zero, circumflex can never match if the |
3541 |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
PCRE_MULTILINE option is unset. Inside a character class, circumflex |
3542 |
has an entirely different meaning (see below). |
has an entirely different meaning (see below). |
3543 |
|
|
3544 |
Circumflex need not be the first character of the pattern if a number |
Circumflex need not be the first character of the pattern if a number |
3545 |
of alternatives are involved, but it should be the first thing in each |
of alternatives are involved, but it should be the first thing in each |
3546 |
alternative in which it appears if the pattern is ever to match that |
alternative in which it appears if the pattern is ever to match that |
3547 |
branch. If all possible alternatives start with a circumflex, that is, |
branch. If all possible alternatives start with a circumflex, that is, |
3548 |
if the pattern is constrained to match only at the start of the sub- |
if the pattern is constrained to match only at the start of the sub- |
3549 |
ject, it is said to be an "anchored" pattern. (There are also other |
ject, it is said to be an "anchored" pattern. (There are also other |
3550 |
constructs that can cause a pattern to be anchored.) |
constructs that can cause a pattern to be anchored.) |
3551 |
|
|
3552 |
A dollar character is an assertion that is true only if the current |
A dollar character is an assertion that is true only if the current |
3553 |
matching point is at the end of the subject string, or immediately |
matching point is at the end of the subject string, or immediately |
3554 |
before a newline at the end of the string (by default). Dollar need not |
before a newline at the end of the string (by default). Dollar need not |
3555 |
be the last character of the pattern if a number of alternatives are |
be the last character of the pattern if a number of alternatives are |
3556 |
involved, but it should be the last item in any branch in which it |
involved, but it should be the last item in any branch in which it |
3557 |
appears. Dollar has no special meaning in a character class. |
appears. Dollar has no special meaning in a character class. |
3558 |
|
|
3559 |
The meaning of dollar can be changed so that it matches only at the |
The meaning of dollar can be changed so that it matches only at the |
3560 |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at |
3561 |
compile time. This does not affect the \Z assertion. |
compile time. This does not affect the \Z assertion. |
3562 |
|
|
3563 |
The meanings of the circumflex and dollar characters are changed if the |
The meanings of the circumflex and dollar characters are changed if the |
3564 |
PCRE_MULTILINE option is set. When this is the case, a circumflex |
PCRE_MULTILINE option is set. When this is the case, a circumflex |
3565 |
matches immediately after internal newlines as well as at the start of |
matches immediately after internal newlines as well as at the start of |
3566 |
the subject string. It does not match after a newline that ends the |
the subject string. It does not match after a newline that ends the |
3567 |
string. A dollar matches before any newlines in the string, as well as |
string. A dollar matches before any newlines in the string, as well as |
3568 |
at the very end, when PCRE_MULTILINE is set. When newline is specified |
at the very end, when PCRE_MULTILINE is set. When newline is specified |
3569 |
as the two-character sequence CRLF, isolated CR and LF characters do |
as the two-character sequence CRLF, isolated CR and LF characters do |
3570 |
not indicate newlines. |
not indicate newlines. |
3571 |
|
|
3572 |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
For example, the pattern /^abc$/ matches the subject string "def\nabc" |
3573 |
(where \n represents a newline) in multiline mode, but not otherwise. |
(where \n represents a newline) in multiline mode, but not otherwise. |
3574 |
Consequently, patterns that are anchored in single line mode because |
Consequently, patterns that are anchored in single line mode because |
3575 |
all branches start with ^ are not anchored in multiline mode, and a |
all branches start with ^ are not anchored in multiline mode, and a |
3576 |
match for circumflex is possible when the startoffset argument of |
match for circumflex is possible when the startoffset argument of |
3577 |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if |
3578 |
PCRE_MULTILINE is set. |
PCRE_MULTILINE is set. |
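
The /^abc$/ example can be reproduced in any engine with a comparable multiline switch. The sketch below uses Python's re module, where re.MULTILINE is assumed to play the role of PCRE_MULTILINE for this particular subject; it does not exercise PCRE itself.

```python
import re

subject = "def\nabc"

single = re.search(r"^abc$", subject)               # default mode
multi = re.search(r"^abc$", subject, re.MULTILINE)  # multiline mode

print(single is None)   # no match: "abc" is not at the subject start
print(multi.start())    # offset at which the multiline match begins

# In the default mode, dollar also matches just before a final newline.
print(re.search(r"abc$", "abc\n") is not None)
```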
3579 |
|
|
3580 |
Note that the sequences \A, \Z, and \z can be used to match the start |
Note that the sequences \A, \Z, and \z can be used to match the start |
3581 |
and end of the subject in both modes, and if all branches of a pattern |
and end of the subject in both modes, and if all branches of a pattern |
3582 |
start with \A it is always anchored, whether or not PCRE_MULTILINE is |
start with \A it is always anchored, whether or not PCRE_MULTILINE is |
3583 |
set. |
set. |
3584 |
|
|
3585 |
|
|
3586 |
FULL STOP (PERIOD, DOT) |
FULL STOP (PERIOD, DOT) |
3587 |
|
|
3588 |
Outside a character class, a dot in the pattern matches any one charac- |
Outside a character class, a dot in the pattern matches any one charac- |
3589 |
ter in the subject string except (by default) a character that signi- |
ter in the subject string except (by default) a character that signi- |
3590 |
fies the end of a line. In UTF-8 mode, the matched character may be |
fies the end of a line. In UTF-8 mode, the matched character may be |
3591 |
more than one byte long. |
more than one byte long. |
3592 |
|
|
3593 |
When a line ending is defined as a single character, dot never matches |
When a line ending is defined as a single character, dot never matches |
3594 |
that character; when the two-character sequence CRLF is used, dot does |
that character; when the two-character sequence CRLF is used, dot does |
3595 |
not match CR if it is immediately followed by LF, but otherwise it |
not match CR if it is immediately followed by LF, but otherwise it |
3596 |
matches all characters (including isolated CRs and LFs). When any Uni- |
matches all characters (including isolated CRs and LFs). When any Uni- |
3597 |
code line endings are being recognized, dot does not match CR or LF or |
code line endings are being recognized, dot does not match CR or LF or |
3598 |
any of the other line ending characters. |
any of the other line ending characters. |
3599 |
|
|
3600 |
The behaviour of dot with regard to newlines can be changed. If the |
The behaviour of dot with regard to newlines can be changed. If the |
3601 |
PCRE_DOTALL option is set, a dot matches any one character, without |
PCRE_DOTALL option is set, a dot matches any one character, without |
3602 |
exception. If the two-character sequence CRLF is present in the subject |
exception. If the two-character sequence CRLF is present in the subject |
3603 |
string, it takes two dots to match it. |
string, it takes two dots to match it. |
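
The effect of a "dot matches all" option, including the point about CRLF needing two dots, can be seen in the sketch below. It uses Python's re module with re.DOTALL as an assumed analogue of PCRE_DOTALL. Python's default dot excludes only LF, whereas PCRE's default depends on the newline convention in force.

```python
import re

# Default mode: dot does not match the line-ending character.
default_lf = re.fullmatch(r".", "\n")

# DOTALL: dot matches any one character, without exception.
dotall_lf = re.fullmatch(r".", "\n", re.DOTALL)

# CRLF is two characters, so one dot is not enough even with DOTALL.
one_dot_crlf = re.fullmatch(r".", "\r\n", re.DOTALL)
two_dots_crlf = re.fullmatch(r"..", "\r\n", re.DOTALL)

print(default_lf is None, dotall_lf is not None)
print(one_dot_crlf is None, two_dots_crlf is not None)
```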

The handling of dot is entirely independent of the handling of circum-
flex and dollar, the only relationship being that they both involve
newlines. Dot has no special meaning in a character class.


MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches any
line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char-
acters into individual bytes, what remains in the string may be a mal-
formed UTF-8 string. For this reason, the \C escape sequence is best
avoided.

PCRE does not allow \C to appear in lookbehind assertions (described
below), because in UTF-8 mode this would make it impossible to calcu-
late the length of the lookbehind.


An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not spe-
cial. If a closing square bracket is required as a member of the class,
it should be the first data character in the class (after an initial
circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion: it still con-
sumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.
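
The point that a negated class still consumes a character can be
checked with a short sketch. It uses Python's standard re module as a
stand-in for PCRE (an assumption: the two engines agree on this
behaviour); the pattern q[^u] and the sample words are illustrative.

```python
import re

non_vowel = re.compile(r"[^aeiou]")

matches_consonant = non_vowel.fullmatch("x") is not None
rejects_vowel = non_vowel.fullmatch("e") is None

# "q[^u]" needs some character after the "q"; at end of string it fails.
at_end = re.search(r"q[^u]", "Iraq")
mid_string = re.search(r"q[^u]", "Iraqi")
```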

In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.

Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.
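
A brief sketch of the contrast between a class and dot, again using
Python's re module as a stand-in (an assumption: re.MULTILINE plays the
role of PCRE_MULTILINE here, and re recognizes only LF as a newline for
dot). The names are hypothetical.

```python
import re

# The multiline flag has no effect on what a class can match.
negated = re.compile(r"[^a]", re.MULTILINE)

class_matches_line_breaks = all(
    negated.fullmatch(ch) is not None for ch in ("\n", "\r")
)

# By contrast, an unmodified dot does not match LF.
dot_matches_lf = re.fullmatch(r".", "\n") is not None
```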

The minus (hyphen) character can be used to specify a range of charac-
ters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end charac-
ter of a range. A pattern such as [W-]46] is interpreted as a class of
two characters ("W" and "-") followed by a literal string "46]", so it
would match "W46]" or "-46]". However, if the "]" is escaped with a
backslash it is interpreted as the end of range, so [W-\]46] is inter-
preted as a class containing a range followed by two other characters.
The octal or hexadecimal representation of "]" can also be used to end
a range.
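
The three spellings above can be compared side by side. The sketch
below uses Python's re module as a stand-in for PCRE (an assumption:
re parses these particular classes the same way); the sample strings
are illustrative.

```python
import re

# "]" straight after the hyphen closes the class: [W-] then literal "46]".
unescaped = re.compile(r"[W-]46]")
# An escaped "]" ends the range W..], followed by "4" and "6".
escaped = re.compile(r"[W-\]46]")
# Hexadecimal form of "]" (0x5d) as the end of the range.
hex_form = re.compile(r"[W-\x5d46]")

unescaped_hits = [s for s in ("W46]", "-46]", "X46]") if unescaped.fullmatch(s)]
escaped_hits = [c for c in "WZ]46-5" if escaped.fullmatch(c)]
hex_hits = [c for c in "WZ]46-5" if hex_form.fullmatch(c)]
```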

Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values of 128 or greater only when
it is compiled with Unicode property support.
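
The [W-c] example can be tried out as follows, using Python's re module
with re.IGNORECASE as a stand-in for PCRE with PCRE_CASELESS (an
assumption: both engines expand an ASCII range to both cases). The
candidate characters are illustrative.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

# The range W..c covers W X Y Z [ \ ] ^ _ ` a b c; caseless matching
# adds the other case of each letter.
candidates = "wWaC_]dV"
hits = [ch for ch in candidates if caseless_range.fullmatch(ch)]
```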

The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A circum-
flex can conveniently be used with the upper case character types to
specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.
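
Both classes can be exercised with a short sketch. Python's re module
stands in for PCRE here, with re.ASCII set so that \d and \W are
limited to ASCII (an assumption chosen to mirror non-UTF-8 matching);
the sample characters are illustrative.

```python
import re

hex_digit = re.compile(r"[\dABCDEF]", re.ASCII)
alnum = re.compile(r"[^\W_]", re.ASCII)

# Lower case "a" is not in the class unless caseless matching is set.
hex_hits = [c for c in "09AFGa" if hex_digit.fullmatch(c)]
alnum_hits = [c for c in "aZ7_-" if alnum.fullmatch(c)]
```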

The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

  [01[:alpha:]%]

  word     "word" characters (same as \w)
  xdigit   hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.


VERTICAL BAR

Vertical bar characters are used to separate alternative patterns. For
example, the pattern

  gilbert|sullivan

matches either "gilbert" or "sullivan". Any number of alternatives may
appear, and an empty alternative is permitted (matching the empty
string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.
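
The left-to-right rule, and what "succeeds" means inside a subpattern,
can be seen in a small sketch. It uses Python's re module as a stand-in
for PCRE (an assumption: both are backtracking engines that try
alternatives in order); the cat/caterpillar patterns are illustrative.

```python
import re

# At top level the first alternative that matches is used, even though
# a later one would match more of the subject.
first_wins = re.match(r"cat|caterpillar", "caterpillar").group()

# Inside a group, "cat" is tried first but the trailing "s" then fails,
# so the engine falls back to the second alternative.
inside_group = re.fullmatch(r"(cat|caterpillar)s", "caterpillars")

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|", "")
```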


INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are

  i  for PCRE_CASELESS

For example, (?im) sets caseless, multiline matching. It is also possi-
ble to unset these options by preceding the letter with a hyphen, and a
combined setting and unsetting such as (?im-sx), which sets PCRE_CASE-
LESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and PCRE_EXTENDED,
is also permitted. If a letter appears both before and after the
hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When an option change occurs at top level (that is, not inside subpat-
tern parentheses), the change applies to the remainder of the pattern
that follows. If the change is placed right at the start of a pattern,
PCRE extracts it into the global options (and it will therefore show up
in data extracted by the pcre_fullinfo() function).
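
A loose analogue of a leading option setting can be seen in Python's re
module, used here as a stand-in and not as the PCRE API. The assumption
is that the flags attribute of a compiled Python pattern plays a role
comparable to the option data returned by pcre_fullinfo(); the pattern
and subject are illustrative.

```python
import re

# (?im) at the very start sets caseless, multiline matching.
pattern = re.compile(r"(?im)^b$")

# The inline letters are folded into the compiled pattern's flags.
has_caseless = bool(pattern.flags & re.IGNORECASE)
has_multiline = bool(pattern.flags & re.MULTILINE)

found = pattern.search("a\nB\nc")
```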

An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows
it, so

  (a(?i)b)c

matches abc and aBc and no other strings (assuming PCRE_CASELESS is not
used). By this means, options can be made to have different settings
in different parts of the pattern. Any changes made in one alternative
do carry on into subsequent branches within the same subpattern. For
example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the
application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences to override
what the application has set or what has been defaulted. Details are
given in the section entitled "Newline sequences" above.


  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or an empty
string.

2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns.

For example, if the string "the red king" is matched against the pat-
tern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are num-
bered 1, 2, and 3, respectively.

The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any captur-
ing, and is not counted when computing the number of any subsequent
capturing subpatterns. For example, if the string "the white queen" is
matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.
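
The two numbering examples above can be reproduced with a short sketch.
It uses Python's re module as a stand-in for PCRE (an assumption: both
number capturing parentheses by the position of the opening
parenthesis); a match object's groups() takes the place of the ovector
returned by pcre_exec().

```python
import re

# Plain parentheses: three capturing subpatterns, numbered 1 to 3.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captured_all = capturing.groups()

# (?: ... ) groups without capturing and is skipped when numbering.
mixed = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captured_mixed = mixed.groups()
```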

As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".
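
The first of the two forms can be tried in Python's re module, used
here as a stand-in for PCRE. This is a sketch under an assumption:
Python accepts the (?i:...) shorthand, but recent versions reject an
unscoped (?i) in mid-pattern, so the second form is not shown. The
trailing " night" is added only to show where the option stops
applying.

```python
import re

weekend = re.compile(r"(?i:saturday|sunday) night")

# The option covers every branch inside the group.
both_branches = [
    s for s in ("Saturday night", "SUNDAY night") if weekend.fullmatch(s)
]

# Outside the group the pattern is caseful again.
outside_scope = weekend.fullmatch("sunday NIGHT")
```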


DUPLICATE SUBPATTERN NUMBERS

Perl 5.10 introduced a feature whereby each alternative in a subpattern
uses the same numbers for its capturing parentheses. Such a subpattern
starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:

  (?|(Sat)ur|(Sun))day

Because the two alternatives are inside a (?| group, both sets of cap-
turing parentheses are numbered one. Thus, when the pattern matches,
you can look at captured substring number one, whichever alternative
matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group, paren-
theses are numbered as usual, but the number is reset at the start of
each branch. The numbers of any capturing buffers that follow the sub-
pattern start after the highest number used in any branch. The follow-
ing example is taken from the Perl documentation. The numbers under-
neath show in which buffer the captured content will be stored.

  # before  ---------------branch-reset----------- after
  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1            2         2  3        2     3     4

A backreference or a recursive call to a numbered subpattern always
refers to the first one in the pattern with the given number.

An alternative approach to using this "branch reset" feature is to use
duplicate named subpatterns, as described in the next section.


NAMED SUBPATTERNS

Identifying capturing parentheses by number is simple, but it can be
very hard to keep track of the numbers in complicated regular expres-
sions. Furthermore, if an expression is modified, the numbers may
change. To help with this difficulty, PCRE supports the naming of sub-
patterns. This feature was not added to Perl until release 5.10. Python
had the feature earlier, and PCRE introduced it at release 4.0, using
the Python syntax. PCRE now supports both the Perl and the Python syn-
tax.

In PCRE, a subpattern can be named in one of three ways: (?<name>...)
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References
to capturing parentheses from other parts of the pattern, such as back-
references, recursion, and conditions, can be made by name as well as
by number.

Names consist of up to 32 alphanumeric characters and underscores.
Named capturing parentheses are still allocated numbers as well as
names, exactly as if the names were not present. The PCRE API provides
function calls for extracting the name-to-number translation table from
a compiled pattern. There is also a convenience function for extracting
a captured substring by name.
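
The relationship between names and numbers can be sketched with the
Python syntax, (?P<name>...), in Python's own re module. This is a
stand-in and not the PCRE API: the assumption is that groupindex is
comparable to the name-to-number table, and group("name") to the
convenience function, mentioned above. The year/month pattern is
illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups are still numbered as if the names were not present.
name_to_number = dict(date.groupindex)

m = date.fullmatch("2008-04")
by_name = m.group("year")
by_number = m.group(1)
```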

By default, a name must be unique within a pattern, but it is possible
to relax this constraint by setting the PCRE_DUPNAMES option at compile
time. This can be useful for patterns where only one instance of the
named parentheses can match. Suppose you want to match the name of a
weekday, either as a 3-letter abbreviation or as the full name, and in
both cases you want to extract the abbreviation. This pattern (ignoring
the line breaks) does the job:

  (?<DN>Thu)(?:rsday)?|
  (?<DN>Sat)(?:urday)?

There are five capturing substrings, but only one is ever set after a
match. (An alternative way of solving this problem is to use a "branch
reset" subpattern, as described in the previous section.)

The convenience function for extracting the data by name returns the
substring for the first (and in this example, the only) subpattern of
that name that matched. This saves searching to find which numbered
subpattern it was. If you make a reference to a non-unique named sub-
pattern from elsewhere in the pattern, the one that corresponds to the
lowest number is used. For further details of the interfaces for han-
dling named subpatterns, see the pcreapi documentation.
3966 |
|
|
3967 |
|
|
3968 |
REPETITION |
REPETITION |
3969 |
|
|
3970 |
Repetition is specified by quantifiers, which can follow any of the |
Repetition is specified by quantifiers, which can follow any of the |
3971 |
following items: |
following items: |
3972 |
|
|
3973 |
a literal data character |
a literal data character |
3980 |
a back reference (see next section) |
a back reference (see next section) |
3981 |
a parenthesized subpattern (unless it is an assertion) |
a parenthesized subpattern (unless it is an assertion) |
3982 |
|
|
3983 |
The general repetition quantifier specifies a minimum and maximum num- |
The general repetition quantifier specifies a minimum and maximum num- |
3984 |
ber of permitted matches, by giving the two numbers in curly brackets |
ber of permitted matches, by giving the two numbers in curly brackets |
3985 |
(braces), separated by a comma. The numbers must be less than 65536, |
(braces), separated by a comma. The numbers must be less than 65536, |
3986 |
and the first must be less than or equal to the second. For example: |
and the first must be less than or equal to the second. For example: |
3987 |
|
|
3988 |
z{2,4} |
z{2,4} |
3989 |
|
|
3990 |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
3991 |
special character. If the second number is omitted, but the comma is |
special character. If the second number is omitted, but the comma is |
3992 |
present, there is no upper limit; if the second number and the comma |
present, there is no upper limit; if the second number and the comma |
3993 |
are both omitted, the quantifier specifies an exact number of required |
are both omitted, the quantifier specifies an exact number of required |
3994 |
matches. Thus |
matches. Thus |
3995 |
|
|
3996 |
[aeiou]{3,} |
[aeiou]{3,} |
3999 |
|
|
4000 |
\d{8} |
\d{8} |
4001 |
|
|
4002 |
matches exactly 8 digits. An opening curly bracket that appears in a |
matches exactly 8 digits. An opening curly bracket that appears in a |
4003 |
position where a quantifier is not allowed, or one that does not match |
position where a quantifier is not allowed, or one that does not match |
4004 |
the syntax of a quantifier, is taken as a literal character. For exam- |
the syntax of a quantifier, is taken as a literal character. For exam- |
4005 |
ple, {,6} is not a quantifier, but a literal string of four characters. |
ple, {,6} is not a quantifier, but a literal string of four characters. |
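
The three quantifier forms described above can be tried out with a short script. This is only a sketch: it uses Python's re module as a stand-in, which is not PCRE but accepts the same syntax for these particular patterns. The helper name full() and the test subjects are made up for illustration. Note that Python's re treats {,6} as a quantifier meaning {0,6}, so the literal-string behaviour described above cannot be demonstrated this way.

```python
import re

def full(pattern, subject):
    """Return True if the whole subject matches the pattern."""
    return re.fullmatch(pattern, subject) is not None

# {min,max}: between 2 and 4 repetitions of "z"
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]

# {min,}: at least 3 vowels, with no upper limit
open_ended = [s for s in ("ae", "aei", "aeiouaeiou") if full(r"[aeiou]{3,}", s)]

# {n}: exactly 8 digits
exact = [s for s in ("1234567", "12345678", "123456789") if full(r"\d{8}", s)]

print(bounded)
print(open_ended)
print(exact)
```

Each list keeps only the subjects that match in full, so subjects that are too short or too long are filtered out.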
4006 |
|
|
4007 |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
4008 |
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- |
individual bytes. Thus, for example, \x{100}{2} matches two UTF-8 char- |
4009 |
acters, each of which is represented by a two-byte sequence. Similarly, |
acters, each of which is represented by a two-byte sequence. Similarly, |
4010 |
when Unicode property support is available, \X{3} matches three Unicode |
when Unicode property support is available, \X{3} matches three Unicode |
4011 |
extended sequences, each of which may be several bytes long (and they |
extended sequences, each of which may be several bytes long (and they |
4012 |
may be of different lengths). |
may be of different lengths). |
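
The difference between counting characters and counting bytes can be illustrated as follows. This is a rough sketch using Python's re module rather than PCRE: Python strings are sequences of code points, which plays the role of PCRE's UTF-8 mode here, and the character is written as \u0100 because Python's re does not accept the \x{100} notation. The variable names are illustrative only.

```python
import re

# Two copies of U+0100; each one is encoded in UTF-8 as the two bytes C4 80.
subject = "\u0100\u0100"

# The quantifier {2} applies to the whole character, not to a single byte.
match = re.fullmatch("\u0100{2}", subject)

matched_chars = len(match.group(0))
matched_bytes = len(match.group(0).encode("utf-8"))

print(matched_chars, matched_bytes)
```

The quantifier counts two characters even though the matched text occupies four bytes once it is encoded as UTF-8.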
4013 |
|
|
4014 |
The quantifier {0} is permitted, causing the expression to behave as if |
The quantifier {0} is permitted, causing the expression to behave as if |
4015 |
the previous item and the quantifier were not present. |
the previous item and the quantifier were not present. This may be use- |
4016 |
|
ful for subpatterns that are referenced as subroutines from elsewhere |
4017 |
|
in the pattern. Items other than subpatterns that have a {0} quantifier |
4018 |
|
are omitted from the compiled pattern. |
4019 |
|
|
4020 |
For convenience, the three most common quantifiers have single-charac- |
For convenience, the three most common quantifiers have single-charac- |
4021 |
ter abbreviations: |
ter abbreviations: |
4793 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
4794 |
|
|
4795 |
|
|
4796 |
|
ONIGURUMA SUBROUTINE SYNTAX |
4797 |
|
|
4798 |
|
For compatibility with Oniguruma, the non-Perl syntax \g followed by a |
4799 |
|
name or a number enclosed either in angle brackets or single quotes, is |
4800 |
|
an alternative syntax for referencing a subpattern as a subroutine, |
4801 |
|
possibly recursively. Here are two of the examples used above, rewrit- |
4802 |
|
ten using this syntax: |
4803 |
|
|
4804 |
|
(?<pn> \( ( (?>[^()]+) | \g<pn> )* \) ) |
4805 |
|
(sens|respons)e and \g'1'ibility |
4806 |
|
|
4807 |
|
PCRE supports an extension to Oniguruma: if a number is preceded by a |
4808 |
|
plus or a minus sign it is taken as a relative reference. For example: |
4809 |
|
|
4810 |
|
(abc)(?i:\g<-1>) |
4811 |
|
|
4812 |
|
Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not |
4813 |
|
synonymous. The former is a back reference; the latter is a subroutine |
4814 |
|
call. |
4815 |
|
|
4816 |
|
|
4817 |
CALLOUTS |
CALLOUTS |
4818 |
|
|
4819 |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
Perl has a feature whereby using the sequence (?{...}) causes arbitrary |
4820 |
Perl code to be obeyed in the middle of matching a regular expression. |
Perl code to be obeyed in the middle of matching a regular expression. |
4821 |
This makes it possible, amongst other things, to extract different sub- |
This makes it possible, amongst other things, to extract different sub- |
4822 |
strings that match the same pair of parentheses when there is a repeti- |
strings that match the same pair of parentheses when there is a repeti- |
4823 |
tion. |
tion. |
4824 |
|
|
4825 |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
PCRE provides a similar feature, but of course it cannot obey arbitrary |
4826 |
Perl code. The feature is called "callout". The caller of PCRE provides |
Perl code. The feature is called "callout". The caller of PCRE provides |
4827 |
an external function by putting its entry point in the global variable |
an external function by putting its entry point in the global variable |
4828 |
pcre_callout. By default, this variable contains NULL, which disables |
pcre_callout. By default, this variable contains NULL, which disables |
4829 |
all calling out. |
all calling out. |
4830 |
|
|
4831 |
Within a regular expression, (?C) indicates the points at which the |
Within a regular expression, (?C) indicates the points at which the |
4832 |
external function is to be called. If you want to identify different |
external function is to be called. If you want to identify different |
4833 |
callout points, you can put a number less than 256 after the letter C. |
callout points, you can put a number less than 256 after the letter C. |
4834 |
The default value is zero. For example, this pattern has two callout |
The default value is zero. For example, this pattern has two callout |
4835 |
points: |
points: |
4836 |
|
|
4837 |
(?C1)abc(?C2)def |
(?C1)abc(?C2)def |
4838 |
|
|
4839 |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
If the PCRE_AUTO_CALLOUT flag is passed to pcre_compile(), callouts are |
4840 |
automatically installed before each item in the pattern. They are all |
automatically installed before each item in the pattern. They are all |
4841 |
numbered 255. |
numbered 255. |
4842 |
|
|
4843 |
During matching, when PCRE reaches a callout point (and pcre_callout is |
During matching, when PCRE reaches a callout point (and pcre_callout is |
4844 |
set), the external function is called. It is provided with the number |
set), the external function is called. It is provided with the number |
4845 |
of the callout, the position in the pattern, and, optionally, one item |
of the callout, the position in the pattern, and, optionally, one item |
4846 |
of data originally supplied by the caller of pcre_exec(). The callout |
of data originally supplied by the caller of pcre_exec(). The callout |
4847 |
function may cause matching to proceed, to backtrack, or to fail alto- |
function may cause matching to proceed, to backtrack, or to fail alto- |
4848 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
4849 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
4850 |
|
|
4851 |
|
|
4852 |
BACKTRACKING CONTROL |
BACKTRACKING CONTROL |
4853 |
|
|
4854 |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
Perl 5.10 introduced a number of "Special Backtracking Control Verbs", |
4855 |
which are described in the Perl documentation as "experimental and sub- |
which are described in the Perl documentation as "experimental and sub- |
4856 |
ject to change or removal in a future version of Perl". It goes on to |
ject to change or removal in a future version of Perl". It goes on to |
4857 |
say: "Their usage in production code should be noted to avoid problems |
say: "Their usage in production code should be noted to avoid problems |
4858 |
during upgrades." The same remarks apply to the PCRE features described |
during upgrades." The same remarks apply to the PCRE features described |
4859 |
in this section. |
in this section. |
4860 |
|
|
4861 |
Since these verbs are specifically related to backtracking, they can be |
Since these verbs are specifically related to backtracking, most of |
4862 |
used only when the pattern is to be matched using pcre_exec(), which |
them can be used only when the pattern is to be matched using |
4863 |
uses a backtracking algorithm. They cause an error if encountered by |
pcre_exec(), which uses a backtracking algorithm. With the exception of |
4864 |
pcre_dfa_exec(). |
(*FAIL), which behaves like a failing negative assertion, they cause an |
4865 |
|
error if encountered by pcre_dfa_exec(). |
4866 |
|
|
4867 |
The new verbs make use of what was previously invalid syntax: an open- |
The new verbs make use of what was previously invalid syntax: an open- |
4868 |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
ing parenthesis followed by an asterisk. In Perl, they are generally of |
4981 |
|
|
4982 |
REVISION |
REVISION |
4983 |
|
|
4984 |
Last updated: 17 September 2007 |
Last updated: 19 April 2008 |
4985 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
4986 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4987 |
|
|
4988 |
|
|
5240 |
(?-n) call subpattern by relative number |
(?-n) call subpattern by relative number |
5241 |
(?&name) call subpattern by name (Perl) |
(?&name) call subpattern by name (Perl) |
5242 |
(?P>name) call subpattern by name (Python) |
(?P>name) call subpattern by name (Python) |
5243 |
|
\g<name> call subpattern by name (Oniguruma) |
5244 |
|
\g'name' call subpattern by name (Oniguruma) |
5245 |
|
\g<n> call subpattern by absolute number (Oniguruma) |
5246 |
|
\g'n' call subpattern by absolute number (Oniguruma) |
5247 |
|
\g<+n> call subpattern by relative number (PCRE extension) |
5248 |
|
\g'+n' call subpattern by relative number (PCRE extension) |
5249 |
|
\g<-n> call subpattern by relative number (PCRE extension) |
5250 |
|
\g'-n' call subpattern by relative number (PCRE extension) |
5251 |
|
|
5252 |
|
|
5253 |
CONDITIONAL PATTERNS |
CONDITIONAL PATTERNS |
5327 |
|
|
5328 |
REVISION |
REVISION |
5329 |
|
|
5330 |
Last updated: 14 November 2007 |
Last updated: 09 April 2008 |
5331 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
5332 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5333 |
|
|
5334 |
|
|
5983 |
MATCHING A PATTERN |
MATCHING A PATTERN |
5984 |
|
|
5985 |
The function regexec() is called to match a compiled pattern preg |
The function regexec() is called to match a compiled pattern preg |
5986 |
against a given string, which is terminated by a zero byte, subject to |
against a given string, which is by default terminated by a zero byte |
5987 |
the options in eflags. These can be: |
(but see REG_STARTEND below), subject to the options in eflags. These |
5988 |
|
can be: |
5989 |
|
|
5990 |
REG_NOTBOL |
REG_NOTBOL |
5991 |
|
|
5997 |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
5998 |
function. |
function. |
5999 |
|
|
6000 |
|
REG_STARTEND |
6001 |
|
|
6002 |
|
The string is considered to start at string + pmatch[0].rm_so and to |
6003 |
|
have a terminating NUL located at string + pmatch[0].rm_eo (there need |
6004 |
|
not actually be a NUL at that location), regardless of the value of |
6005 |
|
nmatch. This is a BSD extension, compatible with but not specified by |
6006 |
|
IEEE Standard 1003.2 (POSIX.2), and should be used with caution in |
6007 |
|
software intended to be portable to other systems. Note that a non-zero |
6008 |
|
rm_so does not imply REG_NOTBOL; REG_STARTEND affects only the location |
6009 |
|
of the string, not how it is matched. |
6010 |
|
|
6011 |
If the pattern was compiled with the REG_NOSUB flag, no data about any |
If the pattern was compiled with the REG_NOSUB flag, no data about any |
6012 |
matched strings is returned. The nmatch and pmatch arguments of |
matched strings is returned. The nmatch and pmatch arguments of |
6013 |
regexec() are ignored. |
regexec() are ignored. |
6054 |
|
|
6055 |
REVISION |
REVISION |
6056 |
|
|
6057 |
Last updated: 06 March 2007 |
Last updated: 05 April 2008 |
6058 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
6059 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6060 |
|
|
6061 |
|
|