610 |
ence as the condition or test for a specific group recursion are not |
ence as the condition or test for a specific group recursion are not |
611 |
supported. |
supported. |
612 |
|
|
613 |
5. Callouts are supported, but the value of the capture_top field is |
5. Because many paths through the tree may be active, the \K escape |
614 |
|
sequence, which resets the start of the match when encountered (but may |
615 |
|
be on some paths and not on others), is not supported. It causes an |
616 |
|
error if encountered. |
617 |
|
|
618 |
|
6. Callouts are supported, but the value of the capture_top field is |
619 |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
620 |
|
|
621 |
6. The \C escape sequence, which (in the standard algorithm) matches a |
7. The \C escape sequence, which (in the standard algorithm) matches a |
622 |
single byte, even in UTF-8 mode, is not supported because the alterna- |
single byte, even in UTF-8 mode, is not supported because the alterna- |
623 |
tive algorithm moves through the subject string one character at a |
tive algorithm moves through the subject string one character at a |
624 |
time, for all active paths through the tree. |
time, for all active paths through the tree. |
625 |
|
|
626 |
|
|
627 |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
628 |
|
|
629 |
Using the alternative matching algorithm provides the following advan- |
Using the alternative matching algorithm provides the following advan- |
630 |
tages: |
tages: |
631 |
|
|
632 |
1. All possible matches (at a single point in the subject) are automat- |
1. All possible matches (at a single point in the subject) are automat- |
633 |
ically found, and in particular, the longest match is found. To find |
ically found, and in particular, the longest match is found. To find |
634 |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
635 |
things with callouts. |
things with callouts. |
636 |
|
|
637 |
2. There is much better support for partial matching. The restrictions |
2. There is much better support for partial matching. The restrictions |
638 |
on the content of the pattern that apply when using the standard algo- |
on the content of the pattern that apply when using the standard algo- |
639 |
rithm for partial matching do not apply to the alternative algorithm. |
rithm for partial matching do not apply to the alternative algorithm. |
640 |
For non-anchored patterns, the starting position of a partial match is |
For non-anchored patterns, the starting position of a partial match is |
641 |
available. |
available. |
642 |
|
|
643 |
3. Because the alternative algorithm scans the subject string just |
3. Because the alternative algorithm scans the subject string just |
644 |
once, and never needs to backtrack, it is possible to pass very long |
once, and never needs to backtrack, it is possible to pass very long |
645 |
subject strings to the matching function in several pieces, checking |
subject strings to the matching function in several pieces, checking |
646 |
for partial matching each time. |
for partial matching each time. |
647 |
|
|
648 |
|
|
650 |
|
|
651 |
The alternative algorithm suffers from a number of disadvantages: |
The alternative algorithm suffers from a number of disadvantages: |
652 |
|
|
653 |
1. It is substantially slower than the standard algorithm. This is |
1. It is substantially slower than the standard algorithm. This is |
654 |
partly because it has to search for all possible matches, but is also |
partly because it has to search for all possible matches, but is also |
655 |
because it is less susceptible to optimization. |
because it is less susceptible to optimization. |
656 |
|
|
657 |
2. Capturing parentheses and back references are not supported. |
2. Capturing parentheses and back references are not supported. |
669 |
|
|
670 |
REVISION |
REVISION |
671 |
|
|
672 |
Last updated: 06 March 2007 |
Last updated: 29 May 2007 |
673 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
674 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
675 |
|
|
1476 |
returned. The fourth argument should point to an unsigned char * vari- |
returned. The fourth argument should point to an unsigned char * vari- |
1477 |
able. |
able. |
1478 |
|
|
1479 |
|
PCRE_INFO_JCHANGED |
1480 |
|
|
1481 |
|
Return 1 if the (?J) option setting is used in the pattern, otherwise |
1482 |
|
0. The fourth argument should point to an int variable. The (?J) inter- |
1483 |
|
nal option setting changes the local PCRE_DUPNAMES value. |
1484 |
|
|
1485 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
1486 |
|
|
1487 |
Return the value of the rightmost literal byte that must exist in any |
Return the value of the rightmost literal byte that must exist in any |
1536 |
name-to-number map, remember that the length of the entries is likely |
name-to-number map, remember that the length of the entries is likely |
1537 |
to be different for each compiled pattern. |
to be different for each compiled pattern. |
1538 |
|
|
1539 |
|
PCRE_INFO_OKPARTIAL |
1540 |
|
|
1541 |
|
Return 1 if the pattern can be used for partial matching, otherwise 0. |
1542 |
|
The fourth argument should point to an int variable. The pcrepartial |
1543 |
|
documentation lists the restrictions that apply to patterns when par- |
1544 |
|
tial matching is used. |
1545 |
|
|
1546 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
1547 |
|
|
1548 |
Return a copy of the options with which the pattern was compiled. The |
Return a copy of the options with which the pattern was compiled. The |
1549 |
fourth argument should point to an unsigned long int variable. These |
fourth argument should point to an unsigned long int variable. These |
1550 |
option bits are those specified in the call to pcre_compile(), modified |
option bits are those specified in the call to pcre_compile(), modified |
1551 |
by any top-level option settings within the pattern itself. |
by any top-level option settings within the pattern itself. |
1552 |
|
|
1553 |
A pattern is automatically anchored by PCRE if all of its top-level |
A pattern is automatically anchored by PCRE if all of its top-level |
1554 |
alternatives begin with one of the following: |
alternatives begin with one of the following: |
1555 |
|
|
1556 |
^ unless PCRE_MULTILINE is set |
^ unless PCRE_MULTILINE is set |
1564 |
|
|
1565 |
PCRE_INFO_SIZE |
PCRE_INFO_SIZE |
1566 |
|
|
1567 |
Return the size of the compiled pattern, that is, the value that was |
Return the size of the compiled pattern, that is, the value that was |
1568 |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
passed as the argument to pcre_malloc() when PCRE was getting memory in |
1569 |
which to place the compiled data. The fourth argument should point to a |
which to place the compiled data. The fourth argument should point to a |
1570 |
size_t variable. |
size_t variable. |
1572 |
PCRE_INFO_STUDYSIZE |
PCRE_INFO_STUDYSIZE |
1573 |
|
|
1574 |
Return the size of the data block pointed to by the study_data field in |
Return the size of the data block pointed to by the study_data field in |
1575 |
a pcre_extra block. That is, it is the value that was passed to |
a pcre_extra block. That is, it is the value that was passed to |
1576 |
pcre_malloc() when PCRE was getting memory into which to place the data |
pcre_malloc() when PCRE was getting memory into which to place the data |
1577 |
created by pcre_study(). The fourth argument should point to a size_t |
created by pcre_study(). The fourth argument should point to a size_t |
1578 |
variable. |
variable. |
1579 |
|
|
1580 |
|
|
1582 |
|
|
1583 |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
int pcre_info(const pcre *code, int *optptr, int *firstcharptr); |
1584 |
|
|
1585 |
The pcre_info() function is now obsolete because its interface is too |
The pcre_info() function is now obsolete because its interface is too |
1586 |
restrictive to return all the available data about a compiled pattern. |
restrictive to return all the available data about a compiled pattern. |
1587 |
New programs should use pcre_fullinfo() instead. The yield of |
New programs should use pcre_fullinfo() instead. The yield of |
1588 |
pcre_info() is the number of capturing subpatterns, or one of the fol- |
pcre_info() is the number of capturing subpatterns, or one of the fol- |
1589 |
lowing negative numbers: |
lowing negative numbers: |
1590 |
|
|
1591 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
1592 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
1593 |
|
|
1594 |
If the optptr argument is not NULL, a copy of the options with which |
If the optptr argument is not NULL, a copy of the options with which |
1595 |
the pattern was compiled is placed in the integer it points to (see |
the pattern was compiled is placed in the integer it points to (see |
1596 |
PCRE_INFO_OPTIONS above). |
PCRE_INFO_OPTIONS above). |
1597 |
|
|
1598 |
If the pattern is not anchored and the firstcharptr argument is not |
If the pattern is not anchored and the firstcharptr argument is not |
1599 |
NULL, it is used to pass back information about the first character of |
NULL, it is used to pass back information about the first character of |
1600 |
any matched string (see PCRE_INFO_FIRSTBYTE above). |
any matched string (see PCRE_INFO_FIRSTBYTE above). |
1601 |
|
|
1602 |
|
|
1604 |
|
|
1605 |
int pcre_refcount(pcre *code, int adjust); |
int pcre_refcount(pcre *code, int adjust); |
1606 |
|
|
1607 |
The pcre_refcount() function is used to maintain a reference count in |
The pcre_refcount() function is used to maintain a reference count in |
1608 |
the data block that contains a compiled pattern. It is provided for the |
the data block that contains a compiled pattern. It is provided for the |
1609 |
benefit of applications that operate in an object-oriented manner, |
benefit of applications that operate in an object-oriented manner, |
1610 |
where different parts of the application may be using the same compiled |
where different parts of the application may be using the same compiled |
1611 |
pattern, but you want to free the block when they are all done. |
pattern, but you want to free the block when they are all done. |
1612 |
|
|
1613 |
When a pattern is compiled, the reference count field is initialized to |
When a pattern is compiled, the reference count field is initialized to |
1614 |
zero. It is changed only by calling this function, whose action is to |
zero. It is changed only by calling this function, whose action is to |
1615 |
add the adjust value (which may be positive or negative) to it. The |
add the adjust value (which may be positive or negative) to it. The |
1616 |
yield of the function is the new value. However, the value of the count |
yield of the function is the new value. However, the value of the count |
1617 |
is constrained to lie between 0 and 65535, inclusive. If the new value |
is constrained to lie between 0 and 65535, inclusive. If the new value |
1618 |
is outside these limits, it is forced to the appropriate limit value. |
is outside these limits, it is forced to the appropriate limit value. |
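The clamping rule described above can be illustrated with a small stand-alone model. This is only a sketch of the documented behaviour, not PCRE's own code: the function name model_refcount and the macro MODEL_REFCOUNT_MAX are hypothetical, and the count is kept in a plain int rather than inside a compiled pattern.

```c
/* Hypothetical model of the documented pcre_refcount() rule.
   This is NOT PCRE source code: it only mirrors the description,
   in which the adjusted count is forced into the range 0..65535. */
#define MODEL_REFCOUNT_MAX 65535

static int model_refcount(int *count, int adjust)
{
    /* Use a wider type so that a large adjust value cannot overflow. */
    long long updated = (long long)*count + adjust;

    if (updated < 0)
        updated = 0;                    /* forced to the lower limit */
    if (updated > MODEL_REFCOUNT_MAX)
        updated = MODEL_REFCOUNT_MAX;   /* forced to the upper limit */

    *count = (int)updated;
    return *count;                      /* the yield is the new value */
}
```

Starting from zero, adding 2 yields 2, then adding -5 yields 0 rather than -3, and adding 70000 yields 65535.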
1619 |
|
|
1620 |
Except when it is zero, the reference count is not correctly preserved |
Except when it is zero, the reference count is not correctly preserved |
1621 |
if a pattern is compiled on one host and then transferred to a host |
if a pattern is compiled on one host and then transferred to a host |
1622 |
whose byte-order is different. (This seems a highly unlikely scenario.) |
whose byte-order is different. (This seems a highly unlikely scenario.) |
1623 |
|
|
1624 |
|
|
1628 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
1629 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
1630 |
|
|
1631 |
The function pcre_exec() is called to match a subject string against a |
The function pcre_exec() is called to match a subject string against a |
1632 |
compiled pattern, which is passed in the code argument. If the pattern |
compiled pattern, which is passed in the code argument. If the pattern |
1633 |
has been studied, the result of the study should be passed in the extra |
has been studied, the result of the study should be passed in the extra |
1634 |
argument. This function is the main matching facility of the library, |
argument. This function is the main matching facility of the library, |
1635 |
and it operates in a Perl-like manner. For specialist use there is also |
and it operates in a Perl-like manner. For specialist use there is also |
1636 |
an alternative matching function, which is described below in the sec- |
an alternative matching function, which is described below in the sec- |
1637 |
tion about the pcre_dfa_exec() function. |
tion about the pcre_dfa_exec() function. |
1638 |
|
|
1639 |
In most applications, the pattern will have been compiled (and option- |
In most applications, the pattern will have been compiled (and option- |
1640 |
ally studied) in the same process that calls pcre_exec(). However, it |
ally studied) in the same process that calls pcre_exec(). However, it |
1641 |
is possible to save compiled patterns and study data, and then use them |
is possible to save compiled patterns and study data, and then use them |
1642 |
later in different processes, possibly even on different hosts. For a |
later in different processes, possibly even on different hosts. For a |
1643 |
discussion about this, see the pcreprecompile documentation. |
discussion about this, see the pcreprecompile documentation. |
1644 |
|
|
1645 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_exec(): |
1658 |
|
|
1659 |
Extra data for pcre_exec() |
Extra data for pcre_exec() |
1660 |
|
|
1661 |
If the extra argument is not NULL, it must point to a pcre_extra data |
If the extra argument is not NULL, it must point to a pcre_extra data |
1662 |
block. The pcre_study() function returns such a block (when it doesn't |
block. The pcre_study() function returns such a block (when it doesn't |
1663 |
return NULL), but you can also create one for yourself, and pass addi- |
return NULL), but you can also create one for yourself, and pass addi- |
1664 |
tional information in it. The pcre_extra block contains the following |
tional information in it. The pcre_extra block contains the following |
1665 |
fields (not necessarily in this order): |
fields (not necessarily in this order): |
1666 |
|
|
1667 |
unsigned long int flags; |
unsigned long int flags; |
1671 |
void *callout_data; |
void *callout_data; |
1672 |
const unsigned char *tables; |
const unsigned char *tables; |
1673 |
|
|
1674 |
The flags field is a bitmap that specifies which of the other fields |
The flags field is a bitmap that specifies which of the other fields |
1675 |
are set. The flag bits are: |
are set. The flag bits are: |
1676 |
|
|
1677 |
PCRE_EXTRA_STUDY_DATA |
PCRE_EXTRA_STUDY_DATA |
1680 |
PCRE_EXTRA_CALLOUT_DATA |
PCRE_EXTRA_CALLOUT_DATA |
1681 |
PCRE_EXTRA_TABLES |
PCRE_EXTRA_TABLES |
1682 |
|
|
1683 |
Other flag bits should be set to zero. The study_data field is set in |
Other flag bits should be set to zero. The study_data field is set in |
1684 |
the pcre_extra block that is returned by pcre_study(), together with |
the pcre_extra block that is returned by pcre_study(), together with |
1685 |
the appropriate flag bit. You should not set this yourself, but you may |
the appropriate flag bit. You should not set this yourself, but you may |
1686 |
add to the block by setting the other fields and their corresponding |
add to the block by setting the other fields and their corresponding |
1687 |
flag bits. |
flag bits. |
1688 |
|
|
1689 |
The match_limit field provides a means of preventing PCRE from using up |
The match_limit field provides a means of preventing PCRE from using up |
1690 |
a vast amount of resources when running patterns that are not going to |
a vast amount of resources when running patterns that are not going to |
1691 |
match, but which have a very large number of possibilities in their |
match, but which have a very large number of possibilities in their |
1692 |
search trees. The classic example is the use of nested unlimited |
search trees. The classic example is the use of nested unlimited |
1693 |
repeats. |
repeats. |
1694 |
|
|
1695 |
Internally, PCRE uses a function called match() which it calls repeat- |
Internally, PCRE uses a function called match() which it calls repeat- |
1696 |
edly (sometimes recursively). The limit set by match_limit is imposed |
edly (sometimes recursively). The limit set by match_limit is imposed |
1697 |
on the number of times this function is called during a match, which |
on the number of times this function is called during a match, which |
1698 |
has the effect of limiting the amount of backtracking that can take |
has the effect of limiting the amount of backtracking that can take |
1699 |
place. For patterns that are not anchored, the count restarts from zero |
place. For patterns that are not anchored, the count restarts from zero |
1700 |
for each position in the subject string. |
for each position in the subject string. |
1701 |
|
|
1702 |
The default value for the limit can be set when PCRE is built; the |
The default value for the limit can be set when PCRE is built; the |
1703 |
default default is 10 million, which handles all but the most extreme |
default default is 10 million, which handles all but the most extreme |
1704 |
cases. You can override the default by supplying pcre_exec() with a |
cases. You can override the default by supplying pcre_exec() with a |
1705 |
pcre_extra block in which match_limit is set, and |
pcre_extra block in which match_limit is set, and |
1706 |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
1707 |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
1708 |
|
|
1709 |
The match_limit_recursion field is similar to match_limit, but instead |
The match_limit_recursion field is similar to match_limit, but instead |
1710 |
of limiting the total number of times that match() is called, it limits |
of limiting the total number of times that match() is called, it limits |
1711 |
the depth of recursion. The recursion depth is a smaller number than |
the depth of recursion. The recursion depth is a smaller number than |
1712 |
the total number of calls, because not all calls to match() are recur- |
the total number of calls, because not all calls to match() are recur- |
1713 |
sive. This limit is of use only if it is set smaller than match_limit. |
sive. This limit is of use only if it is set smaller than match_limit. |
1714 |
|
|
1715 |
Limiting the recursion depth limits the amount of stack that can be |
Limiting the recursion depth limits the amount of stack that can be |
1716 |
used, or, when PCRE has been compiled to use memory on the heap instead |
used, or, when PCRE has been compiled to use memory on the heap instead |
1717 |
of the stack, the amount of heap memory that can be used. |
of the stack, the amount of heap memory that can be used. |
1718 |
|
|
1719 |
The default value for match_limit_recursion can be set when PCRE is |
The default value for match_limit_recursion can be set when PCRE is |
1720 |
built; the default default is the same value as the default for |
built; the default default is the same value as the default for |
1721 |
match_limit. You can override the default by supplying pcre_exec() with |
match_limit. You can override the default by supplying pcre_exec() with |
1722 |
a pcre_extra block in which match_limit_recursion is set, and |
a pcre_extra block in which match_limit_recursion is set, and |
1723 |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in the flags field. If the |
1724 |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT. |
1725 |
|
|
1726 |
The pcre_callout field is used in conjunction with the "callout" fea- |
The pcre_callout field is used in conjunction with the "callout" fea- |
1727 |
ture, which is described in the pcrecallout documentation. |
ture, which is described in the pcrecallout documentation. |
1728 |
|
|
1729 |
The tables field is used to pass a character tables pointer to |
The tables field is used to pass a character tables pointer to |
1730 |
pcre_exec(); this overrides the value that is stored with the compiled |
pcre_exec(); this overrides the value that is stored with the compiled |
1731 |
pattern. A non-NULL value is stored with the compiled pattern only if |
pattern. A non-NULL value is stored with the compiled pattern only if |
1732 |
custom tables were supplied to pcre_compile() via its tableptr argu- |
custom tables were supplied to pcre_compile() via its tableptr argu- |
1733 |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
ment. If NULL is passed to pcre_exec() using this mechanism, it forces |
1734 |
PCRE's internal tables to be used. This facility is helpful when re- |
PCRE's internal tables to be used. This facility is helpful when re- |
1735 |
using patterns that have been saved after compiling with an external |
using patterns that have been saved after compiling with an external |
1736 |
set of tables, because the external tables might be at a different |
set of tables, because the external tables might be at a different |
1737 |
address when pcre_exec() is called. See the pcreprecompile documenta- |
address when pcre_exec() is called. See the pcreprecompile documenta- |
1738 |
tion for a discussion of saving compiled patterns for later use. |
tion for a discussion of saving compiled patterns for later use. |
1739 |
|
|
1740 |
Option bits for pcre_exec() |
Option bits for pcre_exec() |
1741 |
|
|
1742 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
1743 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
1744 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and |
1745 |
PCRE_PARTIAL. |
PCRE_PARTIAL. |
1746 |
|
|
1747 |
PCRE_ANCHORED |
PCRE_ANCHORED |
1748 |
|
|
1749 |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
The PCRE_ANCHORED option limits pcre_exec() to matching at the first |
1750 |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
matching position. If a pattern was compiled with PCRE_ANCHORED, or |
1751 |
turned out to be anchored by virtue of its contents, it cannot be made |
turned out to be anchored by virtue of its contents, it cannot be made |
1752 |
unanchored at matching time. |
unanchored at matching time. |
1753 |
|
|
1754 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1757 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
1758 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1759 |
|
|
1760 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
1761 |
defaulted when the pattern was compiled. For details, see the descrip- |
defaulted when the pattern was compiled. For details, see the descrip- |
1762 |
tion of pcre_compile() above. During matching, the newline choice |
tion of pcre_compile() above. During matching, the newline choice |
1763 |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
1764 |
ters. It may also alter the way the match position is advanced after a |
ters. It may also alter the way the match position is advanced after a |
1765 |
match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF, |
match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF, |
1766 |
PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt |
PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt |
1767 |
fails when the current position is at a CRLF sequence, the match posi- |
fails when the current position is at a CRLF sequence, the match posi- |
1768 |
tion is advanced by two characters instead of one, in other words, to |
tion is advanced by two characters instead of one, in other words, to |
1769 |
after the CRLF. |
after the CRLF. |
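The advance rule for a failed attempt at a CRLF sequence can be sketched as a stand-alone helper. This is an illustrative model only, not PCRE's implementation: the name next_start_offset and the crlf_newline flag are hypothetical, and the sketch assumes one byte per character (in UTF-8 mode an advance of one character may cover several bytes).

```c
/* Hypothetical model of how the match position moves on after a failed
   attempt at offset pos in an unanchored search. NOT PCRE source code.
   crlf_newline is non-zero when the newline convention is CRLF,
   ANYCRLF, or ANY. Assumes single-byte characters. */
static int next_start_offset(const char *subject, int length,
                             int pos, int crlf_newline)
{
    if (crlf_newline &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;     /* skip the whole CRLF pair */

    return pos + 1;         /* ordinary advance by one character */
}
```

For the subject "a\r\nb", a failure at offset 1 moves the next attempt to offset 3 when crlf_newline is set, and to offset 2 otherwise.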
1770 |
|
|
1771 |
PCRE_NOTBOL |
PCRE_NOTBOL |
1772 |
|
|
1773 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
1774 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
1775 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
1776 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
1777 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
1778 |
|
|
1779 |
PCRE_NOTEOL |
PCRE_NOTEOL |
1780 |
|
|
1781 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
1782 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
1783 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
1784 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
1785 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
1786 |
not affect \Z or \z. |
not affect \Z or \z. |
1787 |
|
|
1788 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
1789 |
|
|
1790 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
1791 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
1792 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
1793 |
example, if the pattern |
example, if the pattern |
1794 |
|
|
1795 |
a?b? |
a?b? |
1796 |
|
|
1797 |
is applied to a string not beginning with "a" or "b", it matches the |
is applied to a string not beginning with "a" or "b", it matches the |
1798 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
1799 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
1800 |
rences of "a" or "b". |
rences of "a" or "b". |
1801 |
|
|
1802 |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
1803 |
cial case of a pattern match of the empty string within its split() |
cial case of a pattern match of the empty string within its split() |
1804 |
function, and when using the /g modifier. It is possible to emulate |
function, and when using the /g modifier. It is possible to emulate |
1805 |
Perl's behaviour after matching a null string by first trying the match |
Perl's behaviour after matching a null string by first trying the match |
1806 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
1807 |
if that fails by advancing the starting offset (see below) and trying |
if that fails by advancing the starting offset (see below) and trying |
1808 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
1809 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
1810 |
|
|
1811 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1812 |
|
|
1813 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
1814 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
1815 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
1816 |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
1817 |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
1818 |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
1819 |
returned. |
returned. |
1820 |
|
|
1821 |
If you already know that your subject is valid, and you want to skip |
If you already know that your subject is valid, and you want to skip |
1822 |
these checks for performance reasons, you can set the |
these checks for performance reasons, you can set the |
1823 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
1824 |
do this for the second and subsequent calls to pcre_exec() if you are |
do this for the second and subsequent calls to pcre_exec() if you are |
1825 |
making repeated calls to find all the matches in a single subject |
making repeated calls to find all the matches in a single subject |
1826 |
string. However, you should be sure that the value of startoffset |
string. However, you should be sure that the value of startoffset |
1827 |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
1828 |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
1829 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
1830 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
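Because PCRE_NO_UTF8_CHECK removes the safety net, an application may want to verify startoffset itself before calling pcre_exec(). The following is a minimal sketch of such a check, not part of the PCRE API: the name is_utf8_char_start is hypothetical, it assumes the subject is already known to be valid UTF-8, and it treats an offset equal to the subject length as acceptable.

```c
/* Hypothetical helper, NOT part of PCRE. Returns 1 if offset can be the
   start of a UTF-8 character in subject, otherwise 0. It tests only
   that the byte at offset is not a continuation byte (10xxxxxx); it
   does not validate the subject as a whole. */
static int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length)
        return 0;               /* outside the subject */
    if (offset == length)
        return 1;               /* end of subject: no byte to inspect */

    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

In the four-byte subject "a\xC3\xA9z" (an "a", a two-byte character, then a "z"), offsets 0, 1, 3 and 4 are accepted, while offset 2 is rejected because it points into the middle of the two-byte character.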
1831 |
|
|
1832 |
PCRE_PARTIAL |
PCRE_PARTIAL |
1833 |
|
|
1834 |
This option turns on the partial matching feature. If the subject |
This option turns on the partial matching feature. If the subject |
1835 |
string fails to match the pattern, but at some point during the match- |
string fails to match the pattern, but at some point during the match- |
1836 |
ing process the end of the subject was reached (that is, the subject |
ing process the end of the subject was reached (that is, the subject |
1837 |
partially matches the pattern and the failure to match occurred only |
partially matches the pattern and the failure to match occurred only |
1838 |
because there were not enough subject characters), pcre_exec() returns |
because there were not enough subject characters), pcre_exec() returns |
1839 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
1840 |
used, there are restrictions on what may appear in the pattern. These |
used, there are restrictions on what may appear in the pattern. These |
1841 |
are discussed in the pcrepartial documentation. |
are discussed in the pcrepartial documentation. |
1842 |
|
|
1843 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
1844 |
|
|
1845 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
1846 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
1847 |
mode, the byte offset must point to the start of a UTF-8 character. |
mode, the byte offset must point to the start of a UTF-8 character. |
1848 |
Unlike the pattern string, the subject may contain binary zero bytes. |
Unlike the pattern string, the subject may contain binary zero bytes. |
1849 |
When the starting offset is zero, the search for a match starts at the |
When the starting offset is zero, the search for a match starts at the |
1850 |
beginning of the subject, and this is by far the most common case. |
beginning of the subject, and this is by far the most common case. |
1851 |
|
|
1852 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
1853 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
1854 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
1855 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
1856 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
1857 |
|
|
1858 |
\Biss\B |
\Biss\B |
1859 |
|
|
1860 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
1861 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
1862 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
1863 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
1864 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
1865 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
1866 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
1867 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
1868 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
1869 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
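
The difference can be illustrated without calling PCRE at all. The sketch below models the \B test with a simplified notion of a word character (letter, digit or underscore). Both function names are hypothetical, and PCRE itself uses its own character tables rather than isalnum(). The subject is the example word, spelled "Mississippi" here.

```c
#include <ctype.h>

/* Simplified model of a "word" character; illustration only. */
static int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position `pos` in subject[0..length) is NOT a word
 * boundary, which is the condition under which \B succeeds. Anything
 * outside the subject is treated as a non-word character. */
static int not_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}
```

With the whole subject and pos set to 4, the character before the position is visible, so not_word_boundary("Mississippi", 11, 4) yields 1. With only the remainder, not_word_boundary("issippi", 7, 0) yields 0, because nothing precedes the first character.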
1870 |
|
|
1871 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
1872 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
1873 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
1874 |
subject. |
subject. |
1875 |
|
|
1876 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
1877 |
|
|
1878 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
1879 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
1880 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
1881 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
1882 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
1883 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
1884 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
1885 |
|
|
1886 |
Captured substrings are returned to the caller via a vector of integer |
Captured substrings are returned to the caller via a vector of integer |
1887 |
offsets whose address is passed in ovector. The number of elements in |
offsets whose address is passed in ovector. The number of elements in |
1888 |
the vector is passed in ovecsize, which must be a non-negative number. |
the vector is passed in ovecsize, which must be a non-negative number. |
1889 |
Note: this argument is NOT the size of ovector in bytes. |
Note: this argument is NOT the size of ovector in bytes. |
1890 |
|
|
1891 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
1892 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
1893 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
1894 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
1895 |
The length passed in ovecsize should always be a multiple of three. If |
The length passed in ovecsize should always be a multiple of three. If |
1896 |
it is not, it is rounded down. |
it is not, it is rounded down. |
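
The arithmetic above can be sketched as follows. ELEMENT_COUNT and max_offset_pairs() are hypothetical names, not part of PCRE, and the function assumes a non-negative ovecsize.

```c
/* Number of elements in a fixed-size array. This count, not
 * sizeof(array), is the value to pass as ovecsize. */
#define ELEMENT_COUNT(array) ((int)(sizeof(array) / sizeof((array)[0])))

/* Number of offset pairs that can be passed back for a given ovecsize.
 * The size is rounded down to a multiple of three, and only the first
 * two-thirds of the vector holds pairs of two integers each. */
static int max_offset_pairs(int ovecsize)
{
    int usable = ovecsize - (ovecsize % 3);   /* round down */
    return (usable * 2 / 3) / 2;
}
```

For a vector declared as int ovector[30], ELEMENT_COUNT(ovector) is 30 and max_offset_pairs(30) is 10: one pair for the whole match and nine for capturing subpatterns.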
1897 |
|
|
1898 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
1899 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
1900 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
1901 |
element of a pair is set to the offset of the first character in a sub- |
element of a pair is set to the offset of the first character in a sub- |
1902 |
string, and the second is set to the offset of the first character |
string, and the second is set to the offset of the first character |
1903 |
after the end of a substring. The first pair, ovector[0] and ovec- |
after the end of a substring. The first pair, ovector[0] and ovec- |
1904 |
tor[1], identify the portion of the subject string matched by the |
tor[1], identify the portion of the subject string matched by the |
1905 |
entire pattern. The next pair is used for the first capturing subpat- |
entire pattern. The next pair is used for the first capturing subpat- |
1906 |
tern, and so on. The value returned by pcre_exec() is one more than the |
tern, and so on. The value returned by pcre_exec() is one more than the |
1907 |
highest numbered pair that has been set. For example, if two substrings |
highest numbered pair that has been set. For example, if two substrings |
1908 |
have been captured, the returned value is 3. If there are no capturing |
have been captured, the returned value is 3. If there are no capturing |
1909 |
subpatterns, the return value from a successful match is 1, indicating |
subpatterns, the return value from a successful match is 1, indicating |
1910 |
that just the first pair of offsets has been set. |
that just the first pair of offsets has been set. |
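
As an illustration of the pair layout, the following sketch copies one captured substring out of the subject using the raw offsets. copy_capture() is a hypothetical helper written for this example, not a PCRE function. It assumes rc is a positive value returned by a successful match.

```c
#include <string.h>

/* Hypothetical helper: copies captured substring number n (0 is the
 * whole match) into buffer as a zero-terminated string.
 * Returns the substring length, or -1 if pair n was not set or the
 * buffer is too small. */
static int copy_capture(const char *subject, const int *ovector, int rc,
                        int n, char *buffer, int buffersize)
{
    int start, len;

    if (n < 0 || n >= rc)
        return -1;                    /* pair n lies beyond those set */
    start = ovector[2 * n];           /* offset of first character */
    if (start < 0)
        return -1;                    /* subpattern did not participate */
    len = ovector[2 * n + 1] - start; /* second offset is one past the end */
    if (len + 1 > buffersize)
        return -1;                    /* no room for the terminating zero */
    memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

For a subject "abc" with offsets {0, 3, 0, 1, 1, 3} and a return value of 3, substring 0 is "abc", substring 1 is "a", and substring 2 is "bc".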
1911 |
|
|
1912 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
1913 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
1914 |
|
|
1915 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
1916 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
1917 |
function returns a value of zero. In particular, if the substring off- |
function returns a value of zero. In particular, if the substring off- |
1918 |
sets are not of interest, pcre_exec() may be called with ovector passed |
sets are not of interest, pcre_exec() may be called with ovector passed |
1919 |
as NULL and ovecsize as zero. However, if the pattern contains back |
as NULL and ovecsize as zero. However, if the pattern contains back |
1920 |
references and the ovector is not big enough to remember the related |
references and the ovector is not big enough to remember the related |
1921 |
substrings, PCRE has to get additional memory for use during matching. |
substrings, PCRE has to get additional memory for use during matching. |
1922 |
Thus it is usually advisable to supply an ovector. |
Thus it is usually advisable to supply an ovector. |
1923 |
|
|
1924 |
The pcre_info() function can be used to find out how many capturing |
The pcre_info() function can be used to find out how many capturing |
1925 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
1926 |
ovector that will allow for n captured substrings, in addition to the |
ovector that will allow for n captured substrings, in addition to the |
1927 |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
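
The formula can be written as a one-line function. min_ovecsize() is a hypothetical name, and capture_count stands for the number of capturing subpatterns, however it was obtained.

```c
/* Smallest ovecsize that has room for the whole-match pair plus
 * capture_count captured substrings, including the workspace third. */
static int min_ovecsize(int capture_count)
{
    return (capture_count + 1) * 3;
}
```

A pattern with two capturing subpatterns therefore needs at least nine elements.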
1928 |
|
|
1929 |
It is possible for capturing subpattern number n+1 to match some part |
It is possible for capturing subpattern number n+1 to match some part |
1930 |
of the subject when subpattern n has not been used at all. For example, |
of the subject when subpattern n has not been used at all. For example, |
1931 |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
1932 |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
1933 |
2 is not. When this happens, both values in the offset pairs corre- |
2 is not. When this happens, both values in the offset pairs corre- |
1934 |
sponding to unused subpatterns are set to -1. |
sponding to unused subpatterns are set to -1. |
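
The example can be checked against a hand-written offset vector. In the sketch below the array holds the offsets that the text describes for "abc" matched against (a|(z))(bc). It is typed in by hand rather than produced by a real call, and all of the names are hypothetical.

```c
/* Offsets described in the text for "abc" against (a|(z))(bc).
 * The last four elements stand in for the workspace third. */
static const int example_ovector[12] = {
    0, 3,      /* whole match: "abc" */
    0, 1,      /* subpattern 1: "a"  */
    -1, -1,    /* subpattern 2: unset */
    1, 3,      /* subpattern 3: "bc" */
    0, 0, 0, 0
};
static const int example_rc = 4;   /* highest pair set is 3, plus one */

/* Returns 1 if capturing subpattern n took part in the match. */
static int capture_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Counts the capturing subpatterns (excluding pair 0) that are set. */
static int count_set_captures(const int *ovector, int rc)
{
    int n, count = 0;
    for (n = 1; n < rc; n++)
        if (capture_is_set(ovector, n))
            count++;
    return count;
}
```

Here capture_is_set() is true for subpatterns 1 and 3 and false for 2, so count_set_captures(example_ovector, example_rc) is 2 even though the return value is 4.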
1935 |
|
|
1936 |
Offset values that correspond to unused subpatterns at the end of the |
Offset values that correspond to unused subpatterns at the end of the |
1937 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
1938 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
1939 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
1940 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1. However, you can refer to the offsets |
1941 |
for the second and third capturing subpatterns if you wish (assuming |
for the second and third capturing subpatterns if you wish (assuming |
1942 |
the vector is large enough, of course). |
the vector is large enough, of course). |
1943 |
|
|
1944 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
1945 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
1946 |
|
|
1947 |
Error return values from pcre_exec() |
Error return values from pcre_exec() |
1948 |
|
|
1949 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
1950 |
defined in the header file: |
defined in the header file: |
1951 |
|
|
1952 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
1955 |
|
|
1956 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
1957 |
|
|
1958 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
1959 |
ovecsize was not zero. |
ovecsize was not zero. |
1960 |
|
|
1961 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
1964 |
|
|
1965 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
1966 |
|
|
1967 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
1968 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
1969 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
1970 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
1971 |
gives when the magic number is not present. |
gives when the magic number is not present. |
1972 |
|
|
1973 |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
1974 |
|
|
1975 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
1976 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
1977 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
1978 |
|
|
1979 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
1980 |
|
|
1981 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
1982 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
1983 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
1984 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
1985 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
1986 |
|
|
1987 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
1988 |
|
|
1989 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
1990 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
1991 |
returned by pcre_exec(). |
returned by pcre_exec(). |
1992 |
|
|
1993 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
1994 |
|
|
1995 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
1996 |
pcre_extra structure (or defaulted) was reached. See the description |
pcre_extra structure (or defaulted) was reached. See the description |
1997 |
above. |
above. |
1998 |
|
|
1999 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
2000 |
|
|
2001 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
2002 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
2003 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
2004 |
|
|
2005 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
2006 |
|
|
2007 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
2008 |
subject. |
subject. |
2009 |
|
|
2010 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
2011 |
|
|
2012 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
2013 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
2014 |
ter. |
ter. |
2015 |
|
|
2016 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
2017 |
|
|
2018 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
2019 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
2020 |
|
|
2021 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
2022 |
|
|
2023 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
The PCRE_PARTIAL option was used with a compiled pattern containing |
2024 |
items that are not supported for partial matching. See the pcrepartial |
items that are not supported for partial matching. See the pcrepartial |
2025 |
documentation for details of partial matching. |
documentation for details of partial matching. |
2026 |
|
|
2027 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
2028 |
|
|
2029 |
An unexpected internal error has occurred. This error could be caused |
An unexpected internal error has occurred. This error could be caused |
2030 |
by a bug in PCRE or by overwriting of the compiled pattern. |
by a bug in PCRE or by overwriting of the compiled pattern. |
2031 |
|
|
2032 |
PCRE_ERROR_BADCOUNT (-15) |
PCRE_ERROR_BADCOUNT (-15) |
2033 |
|
|
2034 |
This error is given if the value of the ovecsize argument is negative. |
This error is given if the value of the ovecsize argument is negative. |
2035 |
|
|
2036 |
PCRE_ERROR_RECURSIONLIMIT (-21) |
PCRE_ERROR_RECURSIONLIMIT (-21) |
2037 |
|
|
2038 |
The internal recursion limit, as specified by the match_limit_recursion |
The internal recursion limit, as specified by the match_limit_recursion |
2039 |
field in a pcre_extra structure (or defaulted) was reached. See the |
field in a pcre_extra structure (or defaulted) was reached. See the |
2040 |
description above. |
description above. |
2041 |
|
|
2042 |
PCRE_ERROR_NULLWSLIMIT (-22) |
PCRE_ERROR_NULLWSLIMIT (-22) |
2043 |
|
|
2044 |
When a group that can match an empty substring is repeated with an |
When a group that can match an empty substring is repeated with an |
2045 |
unbounded upper limit, the subject position at the start of the group |
unbounded upper limit, the subject position at the start of the group |
2046 |
must be remembered, so that a test for an empty string can be made when |
must be remembered, so that a test for an empty string can be made when |
2047 |
the end of the group is reached. Some workspace is required for this; |
the end of the group is reached. Some workspace is required for this; |
2048 |
if it runs out, this error is given. |
if it runs out, this error is given. |
2049 |
|
|
2050 |
PCRE_ERROR_BADNEWLINE (-23) |
PCRE_ERROR_BADNEWLINE (-23) |
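
A caller usually needs to separate only a few of these cases. The sketch below is one possible way to sort a return value from pcre_exec() into broad outcomes. The enum and the function name are hypothetical. The two constants repeat the values given in this section and are guarded so that the code compiles with or without the real header.

```c
/* Values as listed in this section; the guards avoid a clash if the
 * real header has already defined them. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL (-12)
#endif

typedef enum {
    EXEC_MATCHED,              /* rc > 0: rc pairs were set */
    EXEC_MATCHED_FULL_VECTOR,  /* rc == 0: matched, ovector too small */
    EXEC_NO_MATCH,
    EXEC_PARTIAL,
    EXEC_FAILED                /* any other negative value */
} exec_outcome;

static exec_outcome classify_exec_result(int rc)
{
    if (rc > 0)
        return EXEC_MATCHED;
    if (rc == 0)
        return EXEC_MATCHED_FULL_VECTOR;
    switch (rc) {
    case PCRE_ERROR_NOMATCH:
        return EXEC_NO_MATCH;
    case PCRE_ERROR_PARTIAL:
        return EXEC_PARTIAL;
    default:
        return EXEC_FAILED;
    }
}
```

Treating every other negative value as a failure keeps the caller simple. A program that needs finer distinctions can add further cases to the switch.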
2067 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
2068 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
2069 |
|
|
2070 |
Captured substrings can be accessed directly by using the offsets |
Captured substrings can be accessed directly by using the offsets |
2071 |
returned by pcre_exec() in ovector. For convenience, the functions |
returned by pcre_exec() in ovector. For convenience, the functions |
2072 |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
2073 |
string_list() are provided for extracting captured substrings as new, |
string_list() are provided for extracting captured substrings as new, |
2074 |
separate, zero-terminated strings. These functions identify substrings |
separate, zero-terminated strings. These functions identify substrings |
2075 |
by number. The next section describes functions for extracting named |
by number. The next section describes functions for extracting named |
2076 |
substrings. |
substrings. |
2077 |
|
|
2078 |
A substring that contains a binary zero is correctly extracted and has |
A substring that contains a binary zero is correctly extracted and has |
2079 |
a further zero added on the end, but the result is not, of course, a C |
a further zero added on the end, but the result is not, of course, a C |
2080 |
string. However, you can process such a string by referring to the |
string. However, you can process such a string by referring to the |
2081 |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
2082 |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
2083 |
not adequate for handling strings containing binary zeros, because the |
not adequate for handling strings containing binary zeros, because the |
2084 |
end of the final string is not independently indicated. |
end of the final string is not independently indicated. |
2085 |
|
|
2086 |
The first three arguments are the same for all three of these func- |
The first three arguments are the same for all three of these func- |
2087 |
tions: subject is the subject string that has just been successfully |
tions: subject is the subject string that has just been successfully |
2088 |
matched, ovector is a pointer to the vector of integer offsets that was |
matched, ovector is a pointer to the vector of integer offsets that was |
2089 |
passed to pcre_exec(), and stringcount is the number of substrings that |
passed to pcre_exec(), and stringcount is the number of substrings that |
2090 |
were captured by the match, including the substring that matched the |
were captured by the match, including the substring that matched the |
2091 |
entire regular expression. This is the value returned by pcre_exec() if |
entire regular expression. This is the value returned by pcre_exec() if |
2092 |
it is greater than zero. If pcre_exec() returned zero, indicating that |
it is greater than zero. If pcre_exec() returned zero, indicating that |
2093 |
it ran out of space in ovector, the value passed as stringcount should |
it ran out of space in ovector, the value passed as stringcount should |
2094 |
be the number of elements in the vector divided by three. |
be the number of elements in the vector divided by three. |
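
The rule for choosing stringcount can be sketched as a small helper. stringcount_for() is a hypothetical name, and the function assumes that rc is the non-negative value returned by a successful match.

```c
/* Value to pass as stringcount to the substring extraction functions.
 * rc is the non-negative result of a successful match; ovecsize is the
 * number of elements in the offset vector. */
static int stringcount_for(int rc, int ovecsize)
{
    return rc > 0 ? rc : ovecsize / 3;
}
```

With a 30-element vector, a return value of 3 gives a stringcount of 3, whereas a return value of zero gives 10.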
2095 |
|
|
2096 |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
2097 |
single substring, whose number is given as stringnumber. A value of |
single substring, whose number is given as stringnumber. A value of |
2098 |
zero extracts the substring that matched the entire pattern, whereas |
zero extracts the substring that matched the entire pattern, whereas |
2099 |
higher values extract the captured substrings. For pcre_copy_sub- |
higher values extract the captured substrings. For pcre_copy_sub- |
2100 |
string(), the string is placed in buffer, whose length is given by |
string(), the string is placed in buffer, whose length is given by |
2101 |
buffersize, while for pcre_get_substring() a new block of memory is |
buffersize, while for pcre_get_substring() a new block of memory is |
2102 |
obtained via pcre_malloc, and its address is returned via stringptr. |
obtained via pcre_malloc, and its address is returned via stringptr. |
2103 |
The yield of the function is the length of the string, not including |
The yield of the function is the length of the string, not including |
2104 |
the terminating zero, or one of these error codes: |
the terminating zero, or one of these error codes: |
2105 |
|
|
2106 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2107 |
|
|
2108 |
The buffer was too small for pcre_copy_substring(), or the attempt to |
The buffer was too small for pcre_copy_substring(), or the attempt to |
2109 |
get memory failed for pcre_get_substring(). |
get memory failed for pcre_get_substring(). |
2110 |
|
|
2111 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
2112 |
|
|
2113 |
There is no substring whose number is stringnumber. |
There is no substring whose number is stringnumber. |
2114 |
|
|
2115 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
2116 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
2117 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
2118 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
2119 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
2120 |
pointer. The yield of the function is zero if all went well, or the |
pointer. The yield of the function is zero if all went well, or the |
2121 |
error code |
error code |
2122 |
|
|
2123 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2124 |
|
|
2125 |
if the attempt to get the memory block failed. |
if the attempt to get the memory block failed. |
2126 |
|
|
2127 |
When any of these functions encounters a substring that is unset, which |
When any of these functions encounters a substring that is unset, which |
2128 |
can happen when capturing subpattern number n+1 matches some part of |
can happen when capturing subpattern number n+1 matches some part of |
2129 |
the subject, but subpattern n has not been used at all, they return an |
the subject, but subpattern n has not been used at all, they return an |
2130 |
empty string. This can be distinguished from a genuine zero-length sub- |
empty string. This can be distinguished from a genuine zero-length sub- |
2131 |
string by inspecting the appropriate offset in ovector, which is nega- |
string by inspecting the appropriate offset in ovector, which is nega- |
2132 |
tive for unset substrings. |
tive for unset substrings. |
2133 |
|
|
2134 |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
The two convenience functions pcre_free_substring() and pcre_free_sub- |
2135 |
string_list() can be used to free the memory returned by a previous |
string_list() can be used to free the memory returned by a previous |
2136 |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
call of pcre_get_substring() or pcre_get_substring_list(), respec- |
2137 |
tively. They do nothing more than call the function pointed to by |
tively. They do nothing more than call the function pointed to by |
2138 |
pcre_free, which of course could be called directly from a C program. |
pcre_free, which of course could be called directly from a C program. |
2139 |
However, PCRE is used in some situations where it is linked via a spe- |
However, PCRE is used in some situations where it is linked via a spe- |
2140 |
cial interface to another programming language that cannot use |
cial interface to another programming language that cannot use |
2141 |
pcre_free directly; it is for these cases that the functions are pro- |
pcre_free directly; it is for these cases that the functions are pro- |
2142 |
vided. |
vided. |
2143 |
|
|
2144 |
|
|
2157 |
int stringcount, const char *stringname, |
int stringcount, const char *stringname, |
2158 |
const char **stringptr); |
const char **stringptr); |
2159 |
|
|
2160 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
2161 |
number. For example, for this pattern |
number. For example, for this pattern |
2162 |
|
|
2163 |
(a+)b(?<xxx>\d+)... |
(a+)b(?<xxx>\d+)... |
2166 |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
2167 |
name by calling pcre_get_stringnumber(). The first argument is the com- |
name by calling pcre_get_stringnumber(). The first argument is the com- |
2168 |
piled pattern, and the second is the name. The yield of the function is |
piled pattern, and the second is the name. The yield of the function is |
2169 |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no |
2170 |
subpattern of that name. |
subpattern of that name. |
2171 |
|
|
2172 |
Given the number, you can extract the substring directly, or use one of |
Given the number, you can extract the substring directly, or use one of |
2173 |
the functions described in the previous section. For convenience, there |
the functions described in the previous section. For convenience, there |
2174 |
are also two functions that do the whole job. |
are also two functions that do the whole job. |
2175 |
|
|
2176 |
Most of the arguments of pcre_copy_named_substring() and |
Most of the arguments of pcre_copy_named_substring() and |
2177 |
pcre_get_named_substring() are the same as those for the similarly |
pcre_get_named_substring() are the same as those for the similarly |
2178 |
named functions that extract by number. As these are described in the |
named functions that extract by number. As these are described in the |
2179 |
previous section, they are not re-described here. There are just two |
previous section, they are not re-described here. There are just two |
2180 |
differences: |
differences: |
2181 |
|
|
2182 |
First, instead of a substring number, a substring name is given. Sec- |
First, instead of a substring number, a substring name is given. Sec- |
2183 |
ond, there is an extra argument, given at the start, which is a pointer |
ond, there is an extra argument, given at the start, which is a pointer |
2184 |
to the compiled pattern. This is needed in order to gain access to the |
to the compiled pattern. This is needed in order to gain access to the |
2185 |
name-to-number translation table. |
name-to-number translation table. |
2186 |
|
|
2187 |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
These functions call pcre_get_stringnumber(), and if it succeeds, they |
2188 |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
then call pcre_copy_substring() or pcre_get_substring(), as appropri- |
2189 |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
2190 |
behaviour may not be what you want (see the next section). |
behaviour may not be what you want (see the next section). |
2191 |
|
|
2192 |
|
|
2195 |
int pcre_get_stringtable_entries(const pcre *code, |
int pcre_get_stringtable_entries(const pcre *code, |
2196 |
const char *name, char **first, char **last); |
const char *name, char **first, char **last); |
2197 |
|
|
2198 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
2199 |
subpatterns are not required to be unique. Normally, patterns with |
subpatterns are not required to be unique. Normally, patterns with |
2200 |
duplicate names are such that in any one match, only one of the named |
duplicate names are such that in any one match, only one of the named |
2201 |
subpatterns participates. An example is shown in the pcrepattern docu- |
subpatterns participates. An example is shown in the pcrepattern docu- |
2202 |
mentation. When duplicates are present, pcre_copy_named_substring() and |
mentation. When duplicates are present, pcre_copy_named_substring() and |
2203 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
2204 |
the given name that is set. If none are set, an empty string is |
the given name that is set. If none are set, an empty string is |
2205 |
returned. The pcre_get_stringnumber() function returns one of the num- |
returned. The pcre_get_stringnumber() function returns one of the num- |
2206 |
bers that are associated with the name, but it is not defined which it |
bers that are associated with the name, but it is not defined which it |
2207 |
is. |
is. |
2208 |
|
|
2209 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
2210 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
2211 |
first argument is the compiled pattern, and the second is the name. The |
first argument is the compiled pattern, and the second is the name. The |
2212 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
2213 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
2214 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
2215 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
2216 |
there are none. The format of the table is described above in the sec- |
there are none. The format of the table is described above in the sec- |
2217 |
tion entitled Information about a pattern. Given all the relevant |
tion entitled Information about a pattern. Given all the relevant |
2218 |
entries for the name, you can extract each of their numbers, and hence |
entries for the name, you can extract each of their numbers, and hence |
2219 |
the captured data, if any. |
the captured data, if any. |
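
As a rough illustration of the last step, the sketch below walks the entries between first and last and extracts each group number. It relies on the table format described in the section on information about a pattern (two bytes of group number, most significant byte first, followed by the zero-terminated name). The helper name group_numbers_for_name() is hypothetical and is not part of PCRE; the pointers are taken as unsigned char so the byte arithmetic is safe, which means the char pointers set by pcre_get_stringtable_entries() would need a cast.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): collect the group numbers
 * held in the name-table entries from 'first' to 'last' inclusive.
 * 'entry_size' is the per-entry length, i.e. the value returned by
 * pcre_get_stringtable_entries(). At most 'max_numbers' values are
 * stored; the return value is the total number of entries seen. */
static size_t group_numbers_for_name(const unsigned char *first,
                                     const unsigned char *last,
                                     int entry_size,
                                     int *numbers, size_t max_numbers)
{
    size_t count = 0;
    const unsigned char *entry;

    assert(entry_size > 2 && first <= last);
    for (entry = first; entry <= last; entry += entry_size) {
        /* The first two bytes hold the number, high byte first. */
        if (count < max_numbers)
            numbers[count] = (entry[0] << 8) | entry[1];
        count++;
    }
    return count;
}
```

For each number n obtained this way, the captured data (if any) is described by ovector[2*n] and ovector[2*n+1] after a successful pcre_exec() call; both are -1 when that group did not participate in the match.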
2220 |
|
|
2221 |
|
|
2222 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
2223 |
|
|
2224 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
2225 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
2226 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
2227 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
2228 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
2229 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
2230 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
2231 |
tation. |
tation. |
2232 |
|
|
2233 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
2234 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
2235 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
2236 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
2237 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
2238 |
|
|
2239 |
|
|
2244 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
2245 |
int *workspace, int wscount); |
int *workspace, int wscount); |
2246 |
|
|
2247 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
2248 |
against a compiled pattern, using a matching algorithm that scans the |
against a compiled pattern, using a matching algorithm that scans the |
2249 |
subject string just once, and does not backtrack. This has different |
subject string just once, and does not backtrack. This has different |
2250 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
2251 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
2252 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
2253 |
a discussion of the two matching algorithms, see the pcrematching docu- |
a discussion of the two matching algorithms, see the pcrematching docu- |
2254 |
mentation. |
mentation. |
2255 |
|
|
2256 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
2257 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
2258 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
2259 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
2260 |
repeated here. |
repeated here. |
2261 |
|
|
2262 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
2263 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
2264 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
2265 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
2266 |
lot of potential matches. |
lot of potential matches. |
2267 |
|
|
2268 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
2284 |
|
|
2285 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
2286 |
|
|
2287 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
2288 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2289 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
2290 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
2291 |
three of these are the same as for pcre_exec(), so their description is |
three of these are the same as for pcre_exec(), so their description is |
2292 |
not repeated here. |
not repeated here. |
2293 |
|
|
2294 |
PCRE_PARTIAL |
PCRE_PARTIAL |
2295 |
|
|
2296 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
2297 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
2298 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
2299 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
2300 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
2301 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
2302 |
set as the first matching string. |
set as the first matching string. |
2303 |
|
|
2304 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
2305 |
|
|
2306 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
2307 |
stop as soon as it has found one match. Because of the way the alterna- |
stop as soon as it has found one match. Because of the way the alterna- |
2308 |
tive algorithm works, this is necessarily the shortest possible match |
tive algorithm works, this is necessarily the shortest possible match |
2309 |
at the first possible matching point in the subject string. |
at the first possible matching point in the subject string. |
2310 |
|
|
2311 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
2312 |
|
|
2313 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
2314 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
2315 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
2316 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
2317 |
workspace and wscount arguments must reference the same vector as before |
workspace and wscount arguments must reference the same vector as before |
2318 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
2319 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
2320 |
documentation. |
documentation. |
2321 |
|
|
2322 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
2323 |
|
|
2324 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
2325 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
2326 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
2327 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
2328 |
if the pattern |
if the pattern |
2329 |
|
|
2330 |
<.*> |
<.*> |
2339 |
<something> <something else> |
<something> <something else> |
2340 |
<something> <something else> <something further> |
<something> <something else> <something further> |
2341 |
|
|
2342 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
2343 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
2344 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
2345 |
the offset to the start, and the second is the offset to the end. In |
the offset to the start, and the second is the offset to the end. In |
2346 |
fact, all the strings have the same start offset. (Space could have |
fact, all the strings have the same start offset. (Space could have |
2347 |
been saved by giving this only once, but it was decided to retain some |
been saved by giving this only once, but it was decided to retain some |
2348 |
compatibility with the way pcre_exec() returns data, even though the |
compatibility with the way pcre_exec() returns data, even though the |
2349 |
meaning of the strings is different.) |
meaning of the strings is different.) |
2350 |
|
|
2351 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
2352 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
2353 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
2354 |
filled with the longest matches. |
filled with the longest matches. |
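
The layout described above can be read back with a loop such as the following sketch. The function name print_dfa_matches() is made up for this illustration. It assumes that a zero yield means every pair in ovector (that is, ovecsize/2 of them) has been filled; this is an interpretation of the text above rather than something checked against the library source.

```c
#include <stdio.h>

/* Hypothetical helper: print the substrings reported by pcre_dfa_exec().
 * 'rc' is the yield of pcre_dfa_exec(), 'ovector' and 'ovecsize' are the
 * values that were passed to it. Returns the number of strings printed. */
static int print_dfa_matches(const char *subject, int rc,
                             const int *ovector, int ovecsize)
{
    int count, i;

    if (rc < 0)
        return 0;                     /* no match, or an error */

    /* Assumption: a zero yield means ovector was completely filled. */
    count = (rc == 0) ? ovecsize / 2 : rc;

    for (i = 0; i < count; i++) {
        int start = ovector[2 * i];   /* same for every pair */
        int end = ovector[2 * i + 1];
        printf("%d: %.*s\n", i, end - start, subject + start);
    }
    return count;
}
```

Because the longest match comes first, a caller that only wants the longest match can look at ovector[0] and ovector[1] and ignore the rest.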
2355 |
|
|
2356 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
2357 |
|
|
2358 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
2359 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
2360 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
2361 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
2362 |
|
|
2363 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
2364 |
|
|
2365 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
2366 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
2367 |
reference. |
reference. |
2368 |
|
|
2369 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
2370 |
|
|
2371 |
This return is given if pcre_dfa_exec() encounters a condition item |
This return is given if pcre_dfa_exec() encounters a condition item |
2372 |
that uses a back reference for the condition, or a test for recursion |
that uses a back reference for the condition, or a test for recursion |
2373 |
in a specific group. These are not supported. |
in a specific group. These are not supported. |
2374 |
|
|
2375 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
2376 |
|
|
2377 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
2378 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
2379 |
(it is meaningless). |
(it is meaningless). |
2380 |
|
|
2381 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
2382 |
|
|
2383 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
2384 |
workspace vector. |
workspace vector. |
2385 |
|
|
2386 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
2387 |
|
|
2388 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
2389 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
2390 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
2391 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
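
For diagnostic output, the codes listed above could be turned into text along the following lines. This is only a sketch: the enum values are copied from this section, the DFA_ names are local stand-ins so that the fragment compiles without pcre.h, and dfa_error_text() is a hypothetical helper. Real code would include pcre.h and use the PCRE_ERROR_DFA_xxx names directly.

```c
#include <stddef.h>

/* Local stand-ins for the PCRE_ERROR_DFA_xxx values listed above. */
enum {
    DFA_UITEM   = -16,
    DFA_UCOND   = -17,
    DFA_UMLIMIT = -18,
    DFA_WSSIZE  = -19,
    DFA_RECURSE = -20
};

/* Hypothetical helper: describe an error that is specific to
 * pcre_dfa_exec(). Returns NULL for any other code, so the caller
 * can fall back to the handling it uses for pcre_exec() errors. */
static const char *dfa_error_text(int rc)
{
    switch (rc) {
    case DFA_UITEM:   return "pattern item not supported by pcre_dfa_exec()";
    case DFA_UCOND:   return "unsupported condition (back reference or recursion test)";
    case DFA_UMLIMIT: return "match_limit was set in the extra block";
    case DFA_WSSIZE:  return "workspace vector too small";
    case DFA_RECURSE: return "internal output vector too small during recursion";
    default:          return NULL;
    }
}
```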
2392 |
|
|
2393 |
|
|
2394 |
SEE ALSO |
SEE ALSO |
2395 |
|
|
2396 |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
2397 |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
2398 |
|
|
2399 |
|
|
2400 |
AUTHOR |
AUTHOR |
2406 |
|
|
2407 |
REVISION |
REVISION |
2408 |
|
|
2409 |
Last updated: 24 April 2007 |
Last updated: 04 June 2007 |
2410 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
2411 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2412 |
|
|
2509 |
The subject and subject_length fields contain copies of the values that |
The subject and subject_length fields contain copies of the values that |
2510 |
were passed to pcre_exec(). |
were passed to pcre_exec(). |
2511 |
|
|
2512 |
The start_match field contains the offset within the subject at which |
The start_match field normally contains the offset within the subject |
2513 |
the current match attempt started. If the pattern is not anchored, the |
at which the current match attempt started. However, if the escape |
2514 |
callout function may be called several times from the same point in the |
sequence \K has been encountered, this value is changed to reflect the |
2515 |
pattern for different starting points in the subject. |
modified starting point. If the pattern is not anchored, the callout |
2516 |
|
function may be called several times from the same point in the pattern |
2517 |
|
for different starting points in the subject. |
2518 |
|
|
2519 |
The current_position field contains the offset within the subject of |
The current_position field contains the offset within the subject of |
2520 |
the current match pointer. |
the current match pointer. |
2577 |
|
|
2578 |
REVISION |
REVISION |
2579 |
|
|
2580 |
Last updated: 06 March 2007 |
Last updated: 29 May 2007 |
2581 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
2582 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2583 |
|
|
2738 |
ported by PCRE when its main matching function, pcre_exec(), is used. |
ported by PCRE when its main matching function, pcre_exec(), is used. |
2739 |
From release 6.0, PCRE offers a second matching function, |
From release 6.0, PCRE offers a second matching function, |
2740 |
pcre_dfa_exec(), which matches using a different algorithm that is not |
pcre_dfa_exec(), which matches using a different algorithm that is not |
2741 |
Perl-compatible. The advantages and disadvantages of the alternative |
Perl-compatible. Some of the features discussed below are not available |
2742 |
function, and how it differs from the normal function, are discussed in |
when pcre_dfa_exec() is used. The advantages and disadvantages of the |
2743 |
the pcrematching page. |
alternative function, and how it differs from the normal function, are |
2744 |
|
discussed in the pcrematching page. |
2745 |
|
|
2746 |
|
|
2747 |
CHARACTERS AND METACHARACTERS |
CHARACTERS AND METACHARACTERS |
2748 |
|
|
2749 |
A regular expression is a pattern that is matched against a subject |
A regular expression is a pattern that is matched against a subject |
2750 |
string from left to right. Most characters stand for themselves in a |
string from left to right. Most characters stand for themselves in a |
2751 |
pattern, and match the corresponding characters in the subject. As a |
pattern, and match the corresponding characters in the subject. As a |
2752 |
trivial example, the pattern |
trivial example, the pattern |
2753 |
|
|
2754 |
The quick brown fox |
The quick brown fox |
2755 |
|
|
2756 |
matches a portion of a subject string that is identical to itself. When |
matches a portion of a subject string that is identical to itself. When |
2757 |
caseless matching is specified (the PCRE_CASELESS option), letters are |
caseless matching is specified (the PCRE_CASELESS option), letters are |
2758 |
matched independently of case. In UTF-8 mode, PCRE always understands |
matched independently of case. In UTF-8 mode, PCRE always understands |
2759 |
the concept of case for characters whose values are less than 128, so |
the concept of case for characters whose values are less than 128, so |
2760 |
caseless matching is always possible. For characters with higher val- |
caseless matching is always possible. For characters with higher val- |
2761 |
ues, the concept of case is supported if PCRE is compiled with Unicode |
ues, the concept of case is supported if PCRE is compiled with Unicode |
2762 |
property support, but not otherwise. If you want to use caseless |
property support, but not otherwise. If you want to use caseless |
2763 |
matching for characters 128 and above, you must ensure that PCRE is |
matching for characters 128 and above, you must ensure that PCRE is |
2764 |
compiled with Unicode property support as well as with UTF-8 support. |
compiled with Unicode property support as well as with UTF-8 support. |
2765 |
|
|
2766 |
The power of regular expressions comes from the ability to include |
The power of regular expressions comes from the ability to include |
2767 |
alternatives and repetitions in the pattern. These are encoded in the |
alternatives and repetitions in the pattern. These are encoded in the |
2768 |
pattern by the use of metacharacters, which do not stand for themselves |
pattern by the use of metacharacters, which do not stand for themselves |
2769 |
but instead are interpreted in some special way. |
but instead are interpreted in some special way. |
2770 |
|
|
2771 |
There are two different sets of metacharacters: those that are recog- |
There are two different sets of metacharacters: those that are recog- |
2772 |
nized anywhere in the pattern except within square brackets, and those |
nized anywhere in the pattern except within square brackets, and those |
2773 |
that are recognized within square brackets. Outside square brackets, |
that are recognized within square brackets. Outside square brackets, |
2774 |
the metacharacters are as follows: |
the metacharacters are as follows: |
2775 |
|
|
2776 |
\ general escape character with several uses |
\ general escape character with several uses |
2789 |
also "possessive quantifier" |
also "possessive quantifier" |
2790 |
{ start min/max quantifier |
{ start min/max quantifier |
2791 |
|
|
2792 |
Part of a pattern that is in square brackets is called a "character |
Part of a pattern that is in square brackets is called a "character |
2793 |
class". In a character class the only metacharacters are: |
class". In a character class the only metacharacters are: |
2794 |
|
|
2795 |
\ general escape character |
\ general escape character |
2799 |
syntax) |
syntax) |
2800 |
] terminates the character class |
] terminates the character class |
2801 |
|
|
2802 |
The following sections describe the use of each of the metacharacters. |
The following sections describe the use of each of the metacharacters. |
2803 |
|
|
2804 |
|
|
2805 |
BACKSLASH |
BACKSLASH |
2806 |
|
|
2807 |
The backslash character has several uses. Firstly, if it is followed by |
The backslash character has several uses. Firstly, if it is followed by |
2808 |
a non-alphanumeric character, it takes away any special meaning that |
a non-alphanumeric character, it takes away any special meaning that |
2809 |
character may have. This use of backslash as an escape character |
character may have. This use of backslash as an escape character |
2810 |
applies both inside and outside character classes. |
applies both inside and outside character classes. |
2811 |
|
|
2812 |
For example, if you want to match a * character, you write \* in the |
For example, if you want to match a * character, you write \* in the |
2813 |
pattern. This escaping action applies whether or not the following |
pattern. This escaping action applies whether or not the following |
2814 |
character would otherwise be interpreted as a metacharacter, so it is |
character would otherwise be interpreted as a metacharacter, so it is |
2815 |
always safe to precede a non-alphanumeric with backslash to specify |
always safe to precede a non-alphanumeric with backslash to specify |
2816 |
that it stands for itself. In particular, if you want to match a back- |
that it stands for itself. In particular, if you want to match a back- |
2817 |
slash, you write \\. |
slash, you write \\. |
2818 |
|
|
2819 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
2820 |
the pattern (other than in a character class) and characters between a |
the pattern (other than in a character class) and characters between a |
2821 |
# outside a character class and the next newline are ignored. An escap- |
# outside a character class and the next newline are ignored. An escap- |
2822 |
ing backslash can be used to include a whitespace or # character as |
ing backslash can be used to include a whitespace or # character as |
2823 |
part of the pattern. |
part of the pattern. |
2824 |
|
|
2825 |
If you want to remove the special meaning from a sequence of charac- |
If you want to remove the special meaning from a sequence of charac- |
2826 |
ters, you can do so by putting them between \Q and \E. This is differ- |
ters, you can do so by putting them between \Q and \E. This is differ- |
2827 |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
2828 |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
2829 |
tion. Note the following examples: |
tion. Note the following examples: |
2830 |
|
|
2831 |
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
2835 |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
2836 |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
2837 |
|
|
2838 |
The \Q...\E sequence is recognized both inside and outside character |
The \Q...\E sequence is recognized both inside and outside character |
2839 |
classes. |
classes. |
2840 |
|
|
2841 |
Non-printing characters |
Non-printing characters |
2842 |
|
|
2843 |
A second use of backslash provides a way of encoding non-printing char- |
A second use of backslash provides a way of encoding non-printing char- |
2844 |
acters in patterns in a visible manner. There is no restriction on the |
acters in patterns in a visible manner. There is no restriction on the |
2845 |
appearance of non-printing characters, apart from the binary zero that |
appearance of non-printing characters, apart from the binary zero that |
2846 |
terminates a pattern, but when a pattern is being prepared by text |
terminates a pattern, but when a pattern is being prepared by text |
2847 |
editing, it is usually easier to use one of the following escape |
editing, it is usually easier to use one of the following escape |
2848 |
sequences than the binary character it represents: |
sequences than the binary character it represents: |
2849 |
|
|
2850 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
2858 |
\xhh character with hex code hh |
\xhh character with hex code hh |
2859 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
2860 |
|
|
2861 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
2862 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
2863 |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
2864 |
becomes hex 7B. |
becomes hex 7B. |
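
The rule for \cx can be written out in a few lines of C. The function name control_escape_value() is invented for this sketch, and an ASCII character set is assumed; it is not how PCRE itself is coded.

```c
#include <ctype.h>

/* Hypothetical helper: the character code that \cx stands for.
 * A lower case letter is first converted to upper case, then
 * bit 6 (hex 40) is inverted. Assumes ASCII. */
static int control_escape_value(int x)
{
    if (islower(x))
        x = toupper(x);
    return x ^ 0x40;
}
```

With this function, 'z' gives hex 1A, '{' gives hex 3B, and ';' gives hex 7B, in agreement with the examples above.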
2865 |
|
|
2866 |
After \x, from zero to two hexadecimal digits are read (letters can be |
After \x, from zero to two hexadecimal digits are read (letters can be |
2867 |
in upper or lower case). Any number of hexadecimal digits may appear |
in upper or lower case). Any number of hexadecimal digits may appear |
2868 |
between \x{ and }, but the value of the character code must be less |
between \x{ and }, but the value of the character code must be less |
2869 |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is, |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode (that is, |
2870 |
the maximum hexadecimal value is 7FFFFFFF). If characters other than |
the maximum hexadecimal value is 7FFFFFFF). If characters other than |
2871 |
hexadecimal digits appear between \x{ and }, or if there is no termi- |
hexadecimal digits appear between \x{ and }, or if there is no termi- |
2872 |
nating }, this form of escape is not recognized. Instead, the initial |
nating }, this form of escape is not recognized. Instead, the initial |
2873 |
\x will be interpreted as a basic hexadecimal escape, with no following |
\x will be interpreted as a basic hexadecimal escape, with no following |
2874 |
digits, giving a character whose value is zero. |
digits, giving a character whose value is zero. |
2875 |
|
|
2876 |
Characters whose value is less than 256 can be defined by either of the |
Characters whose value is less than 256 can be defined by either of the |
2877 |
two syntaxes for \x. There is no difference in the way they are han- |
two syntaxes for \x. There is no difference in the way they are han- |
2878 |
dled. For example, \xdc is exactly the same as \x{dc}. |
dled. For example, \xdc is exactly the same as \x{dc}. |
2879 |
|
|
2880 |
After \0 up to two further octal digits are read. If there are fewer |
After \0 up to two further octal digits are read. If there are fewer |
2881 |
than two digits, just those that are present are used. Thus the |
than two digits, just those that are present are used. Thus the |
2882 |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
2883 |
(code value 7). Make sure you supply two digits after the initial zero |
(code value 7). Make sure you supply two digits after the initial zero |
2884 |
if the pattern character that follows is itself an octal digit. |
if the pattern character that follows is itself an octal digit. |
2885 |
|
|
2886 |
The handling of a backslash followed by a digit other than 0 is compli- |
The handling of a backslash followed by a digit other than 0 is compli- |
2887 |
cated. Outside a character class, PCRE reads it and any following dig- |
cated. Outside a character class, PCRE reads it and any following dig- |
2888 |
its as a decimal number. If the number is less than 10, or if there |
its as a decimal number. If the number is less than 10, or if there |
2889 |
have been at least that many previous capturing left parentheses in the |
have been at least that many previous capturing left parentheses in the |
2890 |
expression, the entire sequence is taken as a back reference. A |
expression, the entire sequence is taken as a back reference. A |
2891 |
description of how this works is given later, following the discussion |
description of how this works is given later, following the discussion |
2892 |
of parenthesized subpatterns. |
of parenthesized subpatterns. |
2893 |
|
|
2894 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
2895 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
2896 |
up to three octal digits following the backslash, and uses them to gen- |
up to three octal digits following the backslash, and uses them to gen- |
2897 |
erate a data character. Any subsequent digits stand for themselves. In |
erate a data character. Any subsequent digits stand for themselves. In |
2898 |
non-UTF-8 mode, the value of a character specified in octal must be |
non-UTF-8 mode, the value of a character specified in octal must be |
2899 |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
2900 |
example: |
example: |
2901 |
|
|
2902 |
\040 is another way of writing a space |
\040 is another way of writing a space |
2914 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
2915 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
2916 |
|
|
2917 |
Note that octal values of 100 or greater must not be introduced by a |
Note that octal values of 100 or greater must not be introduced by a |
2918 |
leading zero, because no more than three octal digits are ever read. |
leading zero, because no more than three octal digits are ever read. |
2919 |
|
|
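The rule for a backslash followed by a non-zero digit can be sketched as a small decision function. This is an illustrative model of the rule as described above, not PCRE's actual code: the function name, its parameters, and the returned tuples are all assumptions made for the example, and the upper limits on octal values (\400 and \777) are not checked.

```python
def interpret_digit_escape(digits, groups_so_far, in_class=False):
    """Model how a backslash followed by `digits` is interpreted.

    `digits` is the run of decimal digits after the backslash (first digit
    is not 0); `groups_so_far` is the number of capturing left parentheses
    seen earlier in the pattern. Both names are hypothetical.
    Returns (kind, value, literal_tail).
    """
    if not in_class:
        number = int(digits)
        # Back reference if the number is below 10, or if that many
        # capturing groups have already been opened.
        if number < 10 or number <= groups_so_far:
            return ("backref", number, "")

    # Otherwise re-read up to three octal digits as a character code.
    octal = ""
    for ch in digits[:3]:
        if ch not in "01234567":
            break
        octal += ch
    value = int(octal, 8) if octal else 0
    # Any digits left over stand for themselves.
    return ("char", value, digits[len(octal):])
```

With no previous capturing groups, interpret_digit_escape("81", 0) yields a binary zero followed by the literal characters "81", which is the second reading of \81 given in the list above.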
2920 |
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
2921 |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
2922 |
class, the sequence \b is interpreted as the backspace character (hex |
class, the sequence \b is interpreted as the backspace character (hex |
2923 |
08), and the sequences \R and \X are interpreted as the characters "R" |
08), and the sequences \R and \X are interpreted as the characters "R" |
2924 |
and "X", respectively. Outside a character class, these sequences have |
and "X", respectively. Outside a character class, these sequences have |
2925 |
different meanings (see below). |
different meanings (see below). |
2926 |
|
|
2927 |
Absolute and relative back references |
Absolute and relative back references |
2928 |
|
|
2929 |
The sequence \g followed by a positive or negative number, optionally |
The sequence \g followed by a positive or negative number, optionally |
2930 |
enclosed in braces, is an absolute or relative back reference. Back |
enclosed in braces, is an absolute or relative back reference. A named |
2931 |
references are discussed later, following the discussion of parenthe- |
back reference can be coded as \g{name}. Back references are discussed |
2932 |
sized subpatterns. |
later, following the discussion of parenthesized subpatterns. |
2933 |
|
|
2934 |
Generic character types |
Generic character types |
2935 |
|
|
2944 |
\W any "non-word" character |
\W any "non-word" character |
2945 |
|
|
2946 |
Each pair of escape sequences partitions the complete set of characters |
Each pair of escape sequences partitions the complete set of characters |
2947 |
into two disjoint sets. Any given character matches one, and only one, |
into two disjoint sets. Any given character matches one, and only one, |
2948 |
of each pair. |
of each pair. |
2949 |
|
|
2950 |
These character type sequences can appear both inside and outside char- |
These character type sequences can appear both inside and outside char- |
2951 |
acter classes. They each match one character of the appropriate type. |
acter classes. They each match one character of the appropriate type. |
2952 |
If the current matching point is at the end of the subject string, all |
If the current matching point is at the end of the subject string, all |
2953 |
of them fail, since there is no character to match. |
of them fail, since there is no character to match. |
2954 |
|
|
2955 |
For compatibility with Perl, \s does not match the VT character (code |
For compatibility with Perl, \s does not match the VT character (code |
2956 |
11). This makes it different from the POSIX "space" class. The \s |
11). This makes it different from the POSIX "space" class. The \s |
2957 |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). (If |
characters are HT (9), LF (10), FF (12), CR (13), and space (32). (If |
2958 |
"use locale;" is included in a Perl script, \s may match the VT charac- |
"use locale;" is included in a Perl script, \s may match the VT charac- |
2959 |
ter. In PCRE, it never does.) |
ter. In PCRE, it never does.) |
2960 |
|
|
2961 |
A "word" character is an underscore or any character less than 256 that |
A "word" character is an underscore or any character less than 256 that |
2962 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
2963 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
2964 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
2965 |
page). For example, in a French locale such as "fr_FR" in Unix-like |
page). For example, in a French locale such as "fr_FR" in Unix-like |
2966 |
systems, or "french" in Windows, some character codes greater than 128 |
systems, or "french" in Windows, some character codes greater than 128 |
2967 |
are used for accented letters, and these are matched by \w. |
are used for accented letters, and these are matched by \w. |
2968 |
|
|
2969 |
In UTF-8 mode, characters with values greater than 128 never match \d, |
In UTF-8 mode, characters with values greater than 128 never match \d, |
2970 |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
2971 |
code character property support is available. The use of locales with |
code character property support is available. The use of locales with |
2972 |
Unicode is discouraged. |
Unicode is discouraged. |
2973 |
|
|
2974 |
Newline sequences |
Newline sequences |
2975 |
|
|
2976 |
Outside a character class, the escape sequence \R matches any Unicode |
Outside a character class, the escape sequence \R matches any Unicode |
2977 |
newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is |
newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is |
2978 |
equivalent to the following: |
equivalent to the following: |
2979 |
|
|
2980 |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
(?>\r\n|\n|\x0b|\f|\r|\x85) |
2981 |
|
|
2982 |
This is an example of an "atomic group", details of which are given |
This is an example of an "atomic group", details of which are given |
2983 |
below. This particular group matches either the two-character sequence |
below. This particular group matches either the two-character sequence |
2984 |
CR followed by LF, or one of the single characters LF (linefeed, |
CR followed by LF, or one of the single characters LF (linefeed, |
2985 |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
2986 |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
2987 |
is treated as a single unit that cannot be split. |
is treated as a single unit that cannot be split. |
2988 |
|
|
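As a rough illustration of the non-UTF-8 definition of \R, the following sketch uses Python's re module as a stand-in, since it has no \R escape of its own. It is an assumption-laden approximation: the atomic group is dropped, and the two-character CR LF sequence is kept whole only because it is listed first in the alternation. The names NEWLINE and split_lines are hypothetical.

```python
import re

# Same alternatives as the \R expansion above, without the atomic group.
# "\r\n" comes first so that CR LF is consumed as one unit.
NEWLINE = re.compile(r"\r\n|[\n\x0b\f\r\x85]")


def split_lines(text):
    """Split text at each newline sequence that \\R would match."""
    return NEWLINE.split(text)
```

For example, split_lines("a\r\nb") returns two items rather than three, because CR LF is treated as a single separator.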
2989 |
In UTF-8 mode, two additional characters whose codepoints are greater |
In UTF-8 mode, two additional characters whose codepoints are greater |
2990 |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
2991 |
rator, U+2029). Unicode character property support is not needed for |
rator, U+2029). Unicode character property support is not needed for |
2992 |
these characters to be recognized. |
these characters to be recognized. |
2993 |
|
|
2994 |
Inside a character class, \R matches the letter "R". |
Inside a character class, \R matches the letter "R". |
2996 |
Unicode character properties |
Unicode character properties |
2997 |
|
|
2998 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
2999 |
tional escape sequences to match character properties are available |
tional escape sequences to match character properties are available |
3000 |
when UTF-8 mode is selected. They are: |
when UTF-8 mode is selected. They are: |
3001 |
|
|
3002 |
\p{xx} a character with the xx property |
\p{xx} a character with the xx property |
3003 |
\P{xx} a character without the xx property |
\P{xx} a character without the xx property |
3004 |
\X an extended Unicode sequence |
\X an extended Unicode sequence |
3005 |
|
|
3006 |
The property names represented by xx above are limited to the Unicode |
The property names represented by xx above are limited to the Unicode |
3007 |
script names, the general category properties, and "Any", which matches |
script names, the general category properties, and "Any", which matches |
3008 |
any character (including newline). Other properties such as "InMusical- |
any character (including newline). Other properties such as "InMusical- |
3009 |
Symbols" are not currently supported by PCRE. Note that \P{Any} does |
Symbols" are not currently supported by PCRE. Note that \P{Any} does |
3010 |
not match any characters, so always causes a match failure. |
not match any characters, so always causes a match failure. |
3011 |
|
|
3012 |
Sets of Unicode characters are defined as belonging to certain scripts. |
Sets of Unicode characters are defined as belonging to certain scripts. |
3013 |
A character from one of these sets can be matched using a script name. |
A character from one of these sets can be matched using a script name. |
3014 |
For example: |
For example: |
3015 |
|
|
3016 |
\p{Greek} |
\p{Greek} |
3017 |
\P{Han} |
\P{Han} |
3018 |
|
|
3019 |
Those that are not part of an identified script are lumped together as |
Those that are not part of an identified script are lumped together as |
3020 |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
3021 |
|
|
3022 |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
3023 |
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
3024 |
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
3025 |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
3026 |
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, |
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, |
3027 |
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
3028 |
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
3029 |
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
3030 |
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. |
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. |
3031 |
|
|
3032 |
Each character has exactly one general category property, specified by |
Each character has exactly one general category property, specified by |
3033 |
a two-letter abbreviation. For compatibility with Perl, negation can be |
a two-letter abbreviation. For compatibility with Perl, negation can be |
3034 |
specified by including a circumflex between the opening brace and the |
specified by including a circumflex between the opening brace and the |
3035 |
property name. For example, \p{^Lu} is the same as \P{Lu}. |
property name. For example, \p{^Lu} is the same as \P{Lu}. |
3036 |
|
|
3037 |
If only one letter is specified with \p or \P, it includes all the gen- |
If only one letter is specified with \p or \P, it includes all the gen- |
3038 |
eral category properties that start with that letter. In this case, in |
eral category properties that start with that letter. In this case, in |
3039 |
the absence of negation, the curly brackets in the escape sequence are |
the absence of negation, the curly brackets in the escape sequence are |
3040 |
optional; these two examples have the same effect: |
optional; these two examples have the same effect: |
3041 |
|
|
3042 |
\p{L} |
\p{L} |
3088 |
Zp Paragraph separator |
Zp Paragraph separator |
3089 |
Zs Space separator |
Zs Space separator |
3090 |
|
|
3091 |
The special property L& is also supported: it matches a character that |
The special property L& is also supported: it matches a character that |
3092 |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
has the Lu, Ll, or Lt property, in other words, a letter that is not |
3093 |
classified as a modifier or "other". |
classified as a modifier or "other". |
3094 |
|
|
3095 |
The long synonyms for these properties that Perl supports (such as |
The long synonyms for these properties that Perl supports (such as |
3096 |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix |
3097 |
any of these properties with "Is". |
any of these properties with "Is". |
3098 |
|
|
3099 |
No character that is in the Unicode table has the Cn (unassigned) prop- |
No character that is in the Unicode table has the Cn (unassigned) prop- |
3100 |
erty. Instead, this property is assumed for any code point that is not |
erty. Instead, this property is assumed for any code point that is not |
3101 |
in the Unicode table. |
in the Unicode table. |
3102 |
|
|
3103 |
Specifying caseless matching does not affect these escape sequences. |
Specifying caseless matching does not affect these escape sequences. |
3104 |
For example, \p{Lu} always matches only upper case letters. |
For example, \p{Lu} always matches only upper case letters. |
3105 |
|
|
3106 |
The \X escape matches any number of Unicode characters that form an |
The \X escape matches any number of Unicode characters that form an |
3107 |
extended Unicode sequence. \X is equivalent to |
extended Unicode sequence. \X is equivalent to |
3108 |
|
|
3109 |
(?>\PM\pM*) |
(?>\PM\pM*) |
3110 |
|
|
3111 |
That is, it matches a character without the "mark" property, followed |
That is, it matches a character without the "mark" property, followed |
3112 |
by zero or more characters with the "mark" property, and treats the |
by zero or more characters with the "mark" property, and treats the |
3113 |
sequence as an atomic group (see below). Characters with the "mark" |
sequence as an atomic group (see below). Characters with the "mark" |
3114 |
property are typically accents that affect the preceding character. |
property are typically accents that affect the preceding character. |
3115 |
|
|
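The behaviour of \X can be modelled outside PCRE with Python's unicodedata module, which exposes the general category of each character. This is only a sketch of the (?>\PM\pM*) definition above; the helper names is_mark and match_extended_sequence are assumptions for the example, and the Unicode data version is whatever the Python interpreter ships with.

```python
import unicodedata


def is_mark(ch):
    # General categories Mn, Mc and Me all begin with "M".
    return unicodedata.category(ch).startswith("M")


def match_extended_sequence(text, pos=0):
    """Return the end index of an extended sequence starting at pos.

    Mirrors "one non-mark character, then zero or more mark characters".
    Returns None when there is no character at pos or it is a mark.
    """
    if pos >= len(text) or is_mark(text[pos]):
        return None
    end = pos + 1
    while end < len(text) and is_mark(text[end]):
        end += 1
    return end
```

Applied to "e" followed by U+0301 (combining acute accent), the function consumes both code points as one unit.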
3116 |
Matching characters by Unicode property is not fast, because PCRE has |
Matching characters by Unicode property is not fast, because PCRE has |
3117 |
to search a structure that contains data for over fifteen thousand |
to search a structure that contains data for over fifteen thousand |
3118 |
characters. That is why the traditional escape sequences such as \d and |
characters. That is why the traditional escape sequences such as \d and |
3119 |
\w do not use Unicode properties in PCRE. |
\w do not use Unicode properties in PCRE. |
3120 |
|
|
3121 |
|
Resetting the match start |
3122 |
|
|
3123 |
|
The escape sequence \K, which is a Perl 5.10 feature, causes any previ- |
3124 |
|
ously matched characters not to be included in the final matched |
3125 |
|
sequence. For example, the pattern: |
3126 |
|
|
3127 |
|
foo\Kbar |
3128 |
|
|
3129 |
|
matches "foobar", but reports that it has matched "bar". This feature |
3130 |
|
is similar to a lookbehind assertion (described below). However, in |
3131 |
|
this case, the part of the subject before the real match does not have |
3132 |
|
to be of fixed length, as lookbehind assertions do. The use of \K does |
3133 |
|
not interfere with the setting of captured substrings. For example, |
3134 |
|
when the pattern |
3135 |
|
|
3136 |
|
(foo)\Kbar |
3137 |
|
|
3138 |
|
matches "foobar", the first substring is still set to "foo". |
3139 |
|
|
3140 |
Simple assertions |
Simple assertions |
3141 |
|
|
3142 |
The final use of backslash is for certain simple assertions. An asser- |
The final use of backslash is for certain simple assertions. An asser- |
3898 |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
3899 |
original capturing subpattern is matched caselessly. |
original capturing subpattern is matched caselessly. |
3900 |
|
|
3901 |
Back references to named subpatterns use the Perl syntax \k<name> or |
There are several different ways of writing back references to named |
3902 |
\k'name' or the Python syntax (?P=name). We could rewrite the above |
subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or |
3903 |
example in either of the following ways: |
\k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's |
3904 |
|
unified back reference syntax, in which \g can be used for both numeric |
3905 |
|
and named references, is also supported. We could rewrite the above |
3906 |
|
example in any of the following ways: |
3907 |
|
|
3908 |
(?<p1>(?i)rah)\s+\k<p1> |
(?<p1>(?i)rah)\s+\k<p1> |
3909 |
|
(?'p1'(?i)rah)\s+\k{p1} |
3910 |
(?P<p1>(?i)rah)\s+(?P=p1) |
(?P<p1>(?i)rah)\s+(?P=p1) |
3911 |
|
(?<p1>(?i)rah)\s+\g{p1} |
3912 |
|
|
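The Python-style variant of this example can be tried with Python's own re module, which is not PCRE but accepts the (?P<name>...) and (?P=name) syntax. In this sketch the inline (?i) is written in its scoped form (?i:...), because recent Python versions reject a global flag in mid-pattern; the names PATTERN and repeats_exactly are assumptions for the example.

```python
import re

# The group is matched caselessly, but the back reference is not,
# so it must repeat the captured text exactly.
PATTERN = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")


def repeats_exactly(subject):
    """True if subject is "rah", whitespace, then the same text again."""
    return PATTERN.fullmatch(subject) is not None
```

Under these assumptions "rah rah" and "RAH RAH" are accepted and "RAH rah" is rejected, as described for the PCRE pattern.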
3913 |
A subpattern that is referenced by name may appear in the pattern |
A subpattern that is referenced by name may appear in the pattern |
3914 |
before or after the reference. |
before or after the reference. |
3915 |
|
|
3916 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
3917 |
subpattern has not actually been used in a particular match, any back |
subpattern has not actually been used in a particular match, any back |
3918 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
3919 |
|
|
3920 |
(a|(bc))\2 |
(a|(bc))\2 |
3921 |
|
|
3922 |
always fails if it starts to match "a" rather than "bc". Because there |
always fails if it starts to match "a" rather than "bc". Because there |
3923 |
may be many capturing parentheses in a pattern, all digits following |
may be many capturing parentheses in a pattern, all digits following |
3924 |
the backslash are taken as part of a potential back reference number. |
the backslash are taken as part of a potential back reference number. |
3925 |
If the pattern continues with a digit character, some delimiter must be |
If the pattern continues with a digit character, some delimiter must be |
3926 |
used to terminate the back reference. If the PCRE_EXTENDED option is |
used to terminate the back reference. If the PCRE_EXTENDED option is |
3927 |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
3928 |
ments" below) can be used. |
ments" below) can be used. |
3929 |
|
|
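The failure of a back reference to an unused subpattern can be observed with a short test. Python's re module is used here as a stand-in for PCRE and behaves the same way for this particular pattern; the names PATTERN and matches_at_start are assumptions.

```python
import re

PATTERN = re.compile(r"(a|(bc))\2")


def matches_at_start(subject):
    """True if the pattern matches at the start of subject."""
    return PATTERN.match(subject) is not None
```

When the first alternative matches "a", group 2 is never set, so \2 cannot match and the attempt fails.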
3930 |
A back reference that occurs inside the parentheses to which it refers |
A back reference that occurs inside the parentheses to which it refers |
3931 |
fails when the subpattern is first used, so, for example, (a\1) never |
fails when the subpattern is first used, so, for example, (a\1) never |
3932 |
matches. However, such references can be useful inside repeated sub- |
matches. However, such references can be useful inside repeated sub- |
3933 |
patterns. For example, the pattern |
patterns. For example, the pattern |
3934 |
|
|
3935 |
(a|b\1)+ |
(a|b\1)+ |
3936 |
|
|
3937 |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
3938 |
ation of the subpattern, the back reference matches the character |
ation of the subpattern, the back reference matches the character |
3939 |
string corresponding to the previous iteration. In order for this to |
string corresponding to the previous iteration. In order for this to |
3940 |
work, the pattern must be such that the first iteration does not need |
work, the pattern must be such that the first iteration does not need |
3941 |
to match the back reference. This can be done using alternation, as in |
to match the back reference. This can be done using alternation, as in |
3942 |
the example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of zero. |
3943 |
|
|
3944 |
|
|
3945 |
ASSERTIONS |
ASSERTIONS |
3946 |
|
|
3947 |
An assertion is a test on the characters following or preceding the |
An assertion is a test on the characters following or preceding the |
3948 |
current matching point that does not actually consume any characters. |
current matching point that does not actually consume any characters. |
3949 |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
3950 |
described above. |
described above. |
3951 |
|
|
3952 |
More complicated assertions are coded as subpatterns. There are two |
More complicated assertions are coded as subpatterns. There are two |
3953 |
kinds: those that look ahead of the current position in the subject |
kinds: those that look ahead of the current position in the subject |
3954 |
string, and those that look behind it. An assertion subpattern is |
string, and those that look behind it. An assertion subpattern is |
3955 |
matched in the normal way, except that it does not cause the current |
matched in the normal way, except that it does not cause the current |
3956 |
matching position to be changed. |
matching position to be changed. |
3957 |
|
|
3958 |
Assertion subpatterns are not capturing subpatterns, and may not be |
Assertion subpatterns are not capturing subpatterns, and may not be |
3959 |
repeated, because it makes no sense to assert the same thing several |
repeated, because it makes no sense to assert the same thing several |
3960 |
times. If any kind of assertion contains capturing subpatterns within |
times. If any kind of assertion contains capturing subpatterns within |
3961 |
it, these are counted for the purposes of numbering the capturing sub- |
it, these are counted for the purposes of numbering the capturing sub- |
3962 |
patterns in the whole pattern. However, substring capturing is carried |
patterns in the whole pattern. However, substring capturing is carried |
3963 |
out only for positive assertions, because it does not make sense for |
out only for positive assertions, because it does not make sense for |
3964 |
negative assertions. |
negative assertions. |
3965 |
|
|
3966 |
Lookahead assertions |
Lookahead assertions |
3970 |
|
|
3971 |
\w+(?=;) |
\w+(?=;) |
3972 |
|
|
3973 |
matches a word followed by a semicolon, but does not include the semi- |
matches a word followed by a semicolon, but does not include the semi- |
3974 |
colon in the match, and |
colon in the match, and |
3975 |
|
|
3976 |
foo(?!bar) |
foo(?!bar) |
3977 |
|
|
3978 |
matches any occurrence of "foo" that is not followed by "bar". Note |
matches any occurrence of "foo" that is not followed by "bar". Note |
3979 |
that the apparently similar pattern |
that the apparently similar pattern |
3980 |
|
|
3981 |
(?!foo)bar |
(?!foo)bar |
3982 |
|
|
3983 |
does not find an occurrence of "bar" that is preceded by something |
does not find an occurrence of "bar" that is preceded by something |
3984 |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
3985 |
the assertion (?!foo) is always true when the next three characters are |
the assertion (?!foo) is always true when the next three characters are |
3986 |
"bar". A lookbehind assertion is needed to achieve the other effect. |
"bar". A lookbehind assertion is needed to achieve the other effect. |
3987 |
|
|
3988 |
If you want to force a matching failure at some point in a pattern, the |
If you want to force a matching failure at some point in a pattern, the |
3989 |
most convenient way to do it is with (?!) because an empty string |
most convenient way to do it is with (?!) because an empty string |
3990 |
always matches, so an assertion that requires there not to be an empty |
always matches, so an assertion that requires there not to be an empty |
3991 |
string must always fail. |
string must always fail. |
3992 |
|
|
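The three lookahead patterns discussed above can be compared side by side. The sketch below uses Python's re module, whose lookahead syntax matches PCRE's for these simple cases; the function names are assumptions for the example.

```python
import re


def words_before_semicolon(text):
    """Words followed by a semicolon, without the semicolon itself."""
    return re.findall(r"\w+(?=;)", text)


def foo_not_followed_by_bar(text):
    """Start offsets of each "foo" that is not followed by "bar"."""
    return [m.start() for m in re.finditer(r"foo(?!bar)", text)]


def bar_after_negative_lookahead(text):
    """Start offsets matched by (?!foo)bar, which finds every "bar"."""
    return [m.start() for m in re.finditer(r"(?!foo)bar", text)]
```

The last function finds "bar" even in "foobar", which shows why a lookbehind is needed to exclude a preceding "foo".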
3993 |
Lookbehind assertions |
Lookbehind assertions |
3994 |
|
|
3995 |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
3996 |
for negative assertions. For example, |
for negative assertions. For example, |
3997 |
|
|
3998 |
(?<!foo)bar |
(?<!foo)bar |
3999 |
|
|
4000 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
4001 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
4002 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
4003 |
eral top-level alternatives, they do not all have to have the same |
eral top-level alternatives, they do not all have to have the same |
4004 |
fixed length. Thus |
fixed length. Thus |
4005 |
|
|
4006 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
4009 |
|
|
4010 |
(?<!dogs?|cats?) |
(?<!dogs?|cats?) |
4011 |
|
|
4012 |
causes an error at compile time. Branches that match different length |
causes an error at compile time. Branches that match different length |
4013 |
strings are permitted only at the top level of a lookbehind assertion. |
strings are permitted only at the top level of a lookbehind assertion. |
4014 |
This is an extension compared with Perl (at least for 5.8), which |
This is an extension compared with Perl (at least for 5.8), which |
4015 |
requires all branches to match the same length of string. An assertion |
requires all branches to match the same length of string. An assertion |
4016 |
such as |
such as |
4017 |
|
|
4018 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
4019 |
|
|
4020 |
is not permitted, because its single top-level branch can match two |
is not permitted, because its single top-level branch can match two |
4021 |
different lengths, but it is acceptable if rewritten to use two top- |
different lengths, but it is acceptable if rewritten to use two top- |
4022 |
level branches: |
level branches: |
4023 |
|
|
4024 |
(?<=abc|abde) |
(?<=abc|abde) |
4025 |
|
|
4026 |
The implementation of lookbehind assertions is, for each alternative, |
In some cases, the Perl 5.10 escape sequence \K (see above) can be used |
4027 |
to temporarily move the current position back by the fixed length and |
instead of a lookbehind assertion; this is not restricted to a fixed- |
4028 |
|
length. |
4029 |
|
|
4030 |
|
The implementation of lookbehind assertions is, for each alternative, |
4031 |
|
to temporarily move the current position back by the fixed length and |
4032 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
4033 |
rent position, the assertion fails. |
rent position, the assertion fails. |
4034 |
|
|
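The effect of having too few characters before the current position can be seen in a small experiment. Python's re module is used as a stand-in; unlike PCRE, it requires every alternative in a lookbehind to have the same length, so only single-branch assertions are shown. The names NOT_AFTER_FOO and AFTER_FOO are assumptions.

```python
import re

NOT_AFTER_FOO = re.compile(r"(?<!foo)bar")  # negative lookbehind
AFTER_FOO = re.compile(r"(?<=foo)bar")      # positive lookbehind


def first_offset(pattern, subject):
    """Start offset of the first match, or None if there is none."""
    m = pattern.search(subject)
    return m.start() if m else None
```

On the subject "bar" there are no three characters to step back over, so the positive assertion fails and the negative one succeeds.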
4035 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
4036 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
mode) to appear in lookbehind assertions, because it makes it impossi- |
4037 |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
4038 |
which can match different numbers of bytes, are also not permitted. |
which can match different numbers of bytes, are also not permitted. |
4039 |
|
|
4040 |
Possessive quantifiers can be used in conjunction with lookbehind |
Possessive quantifiers can be used in conjunction with lookbehind |
4041 |
assertions to specify efficient matching at the end of the subject |
assertions to specify efficient matching at the end of the subject |
4042 |
string. Consider a simple pattern such as |
string. Consider a simple pattern such as |
4043 |
|
|
4044 |
abcd$ |
abcd$ |
4045 |
|
|
4046 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
4047 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
4048 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
4049 |
pattern is specified as |
pattern is specified as |
4050 |
|
|
4051 |
^.*abcd$ |
^.*abcd$ |
4052 |
|
|
4053 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
4054 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
4055 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
4056 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
4057 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
4058 |
|
|
4059 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
4060 |
|
|
4061 |
there can be no backtracking for the .*+ item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
4062 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
4063 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
4064 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
4065 |
processing time. |
processing time. |
4066 |
|
|
4067 |
Using multiple assertions |
Using multiple assertions |
4070 |
|
|
4071 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
4072 |
|
|
4073 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
4074 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
4075 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
4076 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
4077 |
three characters are not "999". This pattern does not match "foo" pre- |
three characters are not "999". This pattern does not match "foo" pre- |
4078 |
ceded by six characters, the first of which are digits and the last |
ceded by six characters, the first of which are digits and the last |
4079 |
three of which are not "999". For example, it doesn't match "123abc- |
three of which are not "999". For example, it doesn't match "123abc- |
4080 |
foo". A pattern to do that is |
foo". A pattern to do that is |
4081 |
|
|
4082 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
4083 |
|
|
4084 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
4085 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
4086 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
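
The difference between the two patterns above can be checked with a short script. This is a minimal sketch that uses Python's re module rather than PCRE itself, on the assumption that fixed-length lookbehind assertions behave the same way in both for these patterns; the names adjacent, six_back and matches are illustrative only.

```python
import re

# Both lookbehinds are tested at the same point, immediately before "foo".
adjacent = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first lookbehind reaches back six characters (three digits, then any
# three characters); the second still inspects only the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def matches(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


print(matches(adjacent, "123foo"))     # True: three digits, not "999"
print(matches(adjacent, "999foo"))     # False: rejected by (?<!999)
print(matches(adjacent, "123abcfoo"))  # False: "abc" are not digits
print(matches(six_back, "123abcfoo"))  # True: digits, then "abc"
print(matches(six_back, "123999foo"))  # False: last three are "999"
```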
4087 |
|
|
4089 |
|
|
4090 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
4091 |
|
|
4092 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
4093 |
is not preceded by "foo", while |
is not preceded by "foo", while |
4094 |
|
|
4095 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
4096 |
|
|
4097 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
4098 |
three characters that are not "999". |
three characters that are not "999". |
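
The two nested forms above can be tried out in the same way. Again this is a sketch in Python's re module, not PCRE, and it assumes that assertions nested inside a fixed-length lookbehind are handled equivalently for these simple cases; the variable names are illustrative.

```python
import re

# "baz" preceded by "bar", where that "bar" is not itself preceded by "foo".
bar_not_after_foo = re.compile(r"(?<=(?<!foo)bar)baz")

# "foo" preceded by three digits and then three characters that are not "999".
# The negative lookahead sits inside the lookbehind, after the digits.
digits_then_not_999 = re.compile(r"(?<=\d{3}(?!999)...)foo")

print(bar_not_after_foo.search("xbarbaz") is not None)      # True
print(bar_not_after_foo.search("foobarbaz") is not None)    # False
print(digits_then_not_999.search("123abcfoo") is not None)  # True
print(digits_then_not_999.search("123999foo") is not None)  # False
```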
4099 |
|
|
4100 |
|
|
4101 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
4102 |
|
|
4103 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
4104 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
4105 |
on the result of an assertion, or whether a previous capturing subpat- |
on the result of an assertion, or whether a previous capturing subpat- |
4106 |
tern matched or not. The two possible forms of conditional subpattern |
tern matched or not. The two possible forms of conditional subpattern |
4107 |
are |
are |
4108 |
|
|
4109 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
4110 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
4111 |
|
|
4112 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
4113 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
4114 |
tives in the subpattern, a compile-time error occurs. |
tives in the subpattern, a compile-time error occurs. |
4115 |
|
|
4116 |
There are four kinds of condition: references to subpatterns, refer- |
There are four kinds of condition: references to subpatterns, refer- |
4117 |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
4118 |
|
|
4119 |
Checking for a used subpattern by number |
Checking for a used subpattern by number |
4120 |
|
|
4121 |
If the text between the parentheses consists of a sequence of digits, |
If the text between the parentheses consists of a sequence of digits, |
4122 |
the condition is true if the capturing subpattern of that number has |
the condition is true if the capturing subpattern of that number has |
4123 |
previously matched. |
previously matched. An alternative notation is to precede the digits |
4124 |
|
with a plus or minus sign. In this case, the subpattern number is rela- |
4125 |
|
tive rather than absolute. The most recently opened parentheses can be |
4126 |
|
referenced by (?(-1), the next most recent by (?(-2), and so on. In |
4127 |
|
looping constructs it can also make sense to refer to subsequent groups |
4128 |
|
with constructs such as (?(+2). |
4129 |
|
|
4130 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
4131 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
4144 |
other words, this pattern matches a sequence of non-parentheses, |
other words, this pattern matches a sequence of non-parentheses, |
4145 |
optionally enclosed in parentheses. |
optionally enclosed in parentheses. |
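
The behaviour just described can be sketched with a numbered condition. The example below uses Python's re module with re.VERBOSE standing in for PCRE_EXTENDED. This is an assumption made for illustration, and Python accepts only the absolute form (?(1)...), not the relative form with a plus or minus sign. The name optional_parens is hypothetical.

```python
import re

optional_parens = re.compile(
    r"""
    ( \( )?      # group 1: an optional opening parenthesis
    [^()]+       # one or more characters that are not parentheses
    (?(1) \) )   # if group 1 was set, a closing parenthesis is required
    """,
    re.VERBOSE,
)

for subject in ["(abc)", "abc", "(abc", "abc)"]:
    # fullmatch() is used so that unbalanced input is rejected outright
    # instead of matching a shorter substring.
    print(subject, optional_parens.fullmatch(subject) is not None)
```

Only the first two subjects are accepted: the closing parenthesis is required exactly when the opening one was matched.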
4146 |
|
|
4147 |
|
If you were embedding this pattern in a larger one, you could use a |
4148 |
|
relative reference: |
4149 |
|
|
4150 |
|
...other stuff... ( \( )? [^()]+ (?(-1) \) ) ... |
4151 |
|
|
4152 |
|
This makes the fragment independent of the parentheses in the larger |
4153 |
|
pattern. |
4154 |
|
|
4155 |
Checking for a used subpattern by name |
Checking for a used subpattern by name |
4156 |
|
|
4157 |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
4293 |
( \( ( (?>[^()]+) | (?1) )* \) ) |
( \( ( (?>[^()]+) | (?1) )* \) ) |
4294 |
|
|
4295 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
4296 |
refer to them instead of the whole pattern. In a larger pattern, keep- |
refer to them instead of the whole pattern. |
4297 |
ing track of parenthesis numbers can be tricky. It may be more conve- |
|
4298 |
nient to use named parentheses instead. The Perl syntax for this is |
In a larger pattern, keeping track of parenthesis numbers can be |
4299 |
(?&name); PCRE's earlier syntax (?P>name) is also supported. We could |
tricky. This is made easier by the use of relative references. (A Perl |
4300 |
rewrite the above example as follows: |
5.10 feature.) Instead of (?1) in the pattern above you can write |
4301 |
|
(?-2) to refer to the second most recently opened parentheses preceding |
4302 |
|
the recursion. In other words, a negative number counts capturing |
4303 |
|
parentheses leftwards from the point at which it is encountered. |
4304 |
|
|
4305 |
|
It is also possible to refer to subsequently opened parentheses, by |
4306 |
|
writing references such as (?+2). However, these cannot be recursive |
4307 |
|
because the reference is not inside the parentheses that are refer- |
4308 |
|
enced. They are always "subroutine" calls, as described in the next |
4309 |
|
section. |
4310 |
|
|
4311 |
|
An alternative approach is to use named parentheses instead. The Perl |
4312 |
|
syntax for this is (?&name); PCRE's earlier syntax (?P>name) is also |
4313 |
|
supported. We could rewrite the above example as follows: |
4314 |
|
|
4315 |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
4316 |
|
|
4317 |
If there is more than one subpattern with the same name, the earliest |
If there is more than one subpattern with the same name, the earliest |
4318 |
one is used. This particular example pattern contains nested unlimited |
one is used. |
4319 |
repeats, and so the use of atomic grouping for matching strings of non- |
|
4320 |
parentheses is important when applying the pattern to strings that do |
This particular example pattern that we have been looking at contains |
4321 |
not match. For example, when this pattern is applied to |
nested unlimited repeats, and so the use of atomic grouping for match- |
4322 |
|
ing strings of non-parentheses is important when applying the pattern |
4323 |
|
to strings that do not match. For example, when this pattern is applied |
4324 |
|
to |
4325 |
|
|
4326 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
4327 |
|
|
4371 |
If the syntax for a recursive subpattern reference (either by number or |
If the syntax for a recursive subpattern reference (either by number or |
4372 |
by name) is used outside the parentheses to which it refers, it oper- |
by name) is used outside the parentheses to which it refers, it oper- |
4373 |
ates like a subroutine in a programming language. The "called" subpat- |
ates like a subroutine in a programming language. The "called" subpat- |
4374 |
tern may be defined before or after the reference. An earlier example |
tern may be defined before or after the reference. A numbered reference |
4375 |
pointed out that the pattern |
can be absolute or relative, as in these examples: |
4376 |
|
|
4377 |
|
(...(absolute)...)...(?2)... |
4378 |
|
(...(relative)...)...(?-1)... |
4379 |
|
(...(?+1)...(relative)... |
4380 |
|
|
4381 |
|
An earlier example pointed out that the pattern |
4382 |
|
|
4383 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
4384 |
|
|
4400 |
case-independence are fixed when the subpattern is defined. They cannot |
case-independence are fixed when the subpattern is defined. They cannot |
4401 |
be changed for different calls. For example, consider this pattern: |
be changed for different calls. For example, consider this pattern: |
4402 |
|
|
4403 |
(abc)(?i:(?1)) |
(abc)(?i:(?-1)) |
4404 |
|
|
4405 |
It matches "abcabc". It does not match "abcABC" because the change of |
It matches "abcabc". It does not match "abcABC" because the change of |
4406 |
processing option does not affect the called subpattern. |
processing option does not affect the called subpattern. |
4455 |
|
|
4456 |
REVISION |
REVISION |
4457 |
|
|
4458 |
Last updated: 06 March 2007 |
Last updated: 29 May 2007 |
4459 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
4460 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4461 |
|
|
4536 |
|
|
4537 |
If PCRE_PARTIAL is set for a pattern that does not conform to the |
If PCRE_PARTIAL is set for a pattern that does not conform to the |
4538 |
restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL |
restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL |
4539 |
(-13). |
(-13). You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to |
4540 |
|
find out if a compiled pattern can be used for partial matching. |
4541 |
|
|
4542 |
|
|
4543 |
EXAMPLE OF PARTIAL MATCHING USING PCRETEST |
EXAMPLE OF PARTIAL MATCHING USING PCRETEST |
4544 |
|
|
4545 |
If the escape sequence \P is present in a pcretest data line, the |
If the escape sequence \P is present in a pcretest data line, the |
4546 |
PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that |
PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that |
4547 |
uses the date example quoted above: |
uses the date example quoted above: |
4548 |
|
|
4559 |
data> j\P |
data> j\P |
4560 |
No match |
No match |
4561 |
|
|
4562 |
The first data string is matched completely, so pcretest shows the |
The first data string is matched completely, so pcretest shows the |
4563 |
matched substrings. The remaining four strings do not match the com- |
matched substrings. The remaining four strings do not match the com- |
4564 |
plete pattern, but the first two are partial matches. The same test, |
plete pattern, but the first two are partial matches. The same test, |
4565 |
using pcre_dfa_exec() matching (by means of the \D escape sequence), |
using pcre_dfa_exec() matching (by means of the \D escape sequence), |
4566 |
produces the following output: |
produces the following output: |
4567 |
|
|
4568 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
4577 |
data> j\P\D |
data> j\P\D |
4578 |
No match |
No match |
4579 |
|
|
4580 |
Notice that in this case the portion of the string that was matched is |
Notice that in this case the portion of the string that was matched is |
4581 |
made available. |
made available. |
4582 |
|
|
4583 |
|
|
4584 |
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() |
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() |
4585 |
|
|
4586 |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
4587 |
ble to continue the match by providing additional subject data and |
ble to continue the match by providing additional subject data and |
4588 |
calling pcre_dfa_exec() again with the same compiled regular expres- |
calling pcre_dfa_exec() again with the same compiled regular expres- |
4589 |
sion, this time setting the PCRE_DFA_RESTART option. You must also pass |
sion, this time setting the PCRE_DFA_RESTART option. You must also pass |
4590 |
the same working space as before, because this is where details of the |
the same working space as before, because this is where details of the |
4591 |
previous partial match are stored. Here is an example using pcretest, |
previous partial match are stored. Here is an example using pcretest, |
4592 |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and |
4593 |
\D are as above): |
\D are as above): |
4594 |
|
|
4598 |
data> n05\R\D |
data> n05\R\D |
4599 |
0: n05 |
0: n05 |
4600 |
|
|
4601 |
The first call has "23ja" as the subject, and requests partial match- |
The first call has "23ja" as the subject, and requests partial match- |
4602 |
ing; the second call has "n05" as the subject for the continued |
ing; the second call has "n05" as the subject for the continued |
4603 |
(restarted) match. Notice that when the match is complete, only the |
(restarted) match. Notice that when the match is complete, only the |
4604 |
last part is shown; PCRE does not retain the previously partially- |
last part is shown; PCRE does not retain the previously partially- |
4605 |
matched string. It is up to the calling program to do that if it needs |
matched string. It is up to the calling program to do that if it needs |
4606 |
to. |
to. |
4607 |
|
|
4608 |
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial |
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial |
4609 |
matching over multiple segments. This facility can be used to pass very |
matching over multiple segments. This facility can be used to pass very |
4610 |
long subject strings to pcre_dfa_exec(). However, some care is needed |
long subject strings to pcre_dfa_exec(). However, some care is needed |
4611 |
for certain types of pattern. |
for certain types of pattern. |
4612 |
|
|
4613 |
1. If the pattern contains tests for the beginning or end of a line, |
1. If the pattern contains tests for the beginning or end of a line, |
4614 |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
4615 |
ate, when the subject string for any call does not contain the begin- |
ate, when the subject string for any call does not contain the begin- |
4616 |
ning or end of a line. |
ning or end of a line. |
4617 |
|
|
4618 |
2. If the pattern contains backward assertions (including \b or \B), |
2. If the pattern contains backward assertions (including \b or \B), |
4619 |
you need to arrange for some overlap in the subject strings to allow |
you need to arrange for some overlap in the subject strings to allow |
4620 |
for this. For example, you could pass the subject in chunks that are |
for this. For example, you could pass the subject in chunks that are |
4621 |
500 bytes long, but in a buffer of 700 bytes, with the starting offset |
500 bytes long, but in a buffer of 700 bytes, with the starting offset |
4622 |
set to 200 and the previous 200 bytes at the start of the buffer. |
set to 200 and the previous 200 bytes at the start of the buffer. |
4623 |
|
|
4624 |
3. Matching a subject string that is split into multiple segments does |
3. Matching a subject string that is split into multiple segments does |
4625 |
not always produce exactly the same result as matching over one single |
not always produce exactly the same result as matching over one single |
4626 |
long string. The difference arises when there are multiple matching |
long string. The difference arises when there are multiple matching |
4627 |
possibilities, because a partial match result is given only when there |
possibilities, because a partial match result is given only when there |
4628 |
are no completed matches in a call to pcre_dfa_exec(). This means that |
are no completed matches in a call to pcre_dfa_exec(). This means that |
4629 |
as soon as the shortest match has been found, continuation to a new |
as soon as the shortest match has been found, continuation to a new |
4630 |
subject segment is no longer possible. Consider this pcretest example: |
subject segment is no longer possible. Consider this pcretest example: |
4631 |
|
|
4632 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
4638 |
0: dogsbody |
0: dogsbody |
4639 |
1: dog |
1: dog |
4640 |
|
|
4641 |
The pattern matches the words "dog" or "dogsbody". When the subject is |
The pattern matches the words "dog" or "dogsbody". When the subject is |
4642 |
presented in several parts ("do" and "gsb" being the first two) the |
presented in several parts ("do" and "gsb" being the first two) the |
4643 |
match stops when "dog" has been found, and it is not possible to con- |
match stops when "dog" has been found, and it is not possible to con- |
4644 |
tinue. On the other hand, if "dogsbody" is presented as a single |
tinue. On the other hand, if "dogsbody" is presented as a single |
4645 |
string, both matches are found. |
string, both matches are found. |
4646 |
|
|
4647 |
Because of this phenomenon, it does not usually make sense to end a |
Because of this phenomenon, it does not usually make sense to end a |
4648 |
pattern that is going to be matched in this way with a variable repeat. |
pattern that is going to be matched in this way with a variable repeat. |
4649 |
|
|
4650 |
4. Patterns that contain alternatives at the top level which do not all |
4. Patterns that contain alternatives at the top level which do not all |
4653 |
|
|
4654 |
1234|3789 |
1234|3789 |
4655 |
|
|
4656 |
If the first part of the subject is "ABC123", a partial match of the |
If the first part of the subject is "ABC123", a partial match of the |
4657 |
first alternative is found at offset 3. There is no partial match for |
first alternative is found at offset 3. There is no partial match for |
4658 |
the second alternative, because such a match does not start at the same |
the second alternative, because such a match does not start at the same |
4659 |
point in the subject string. Attempting to continue with the string |
point in the subject string. Attempting to continue with the string |
4660 |
"789" does not yield a match because only those alternatives that match |
"789" does not yield a match because only those alternatives that match |
4661 |
at one point in the subject are remembered. The problem arises because |
at one point in the subject are remembered. The problem arises because |
4662 |
the start of the second alternative matches within the first alterna- |
the start of the second alternative matches within the first alterna- |
4663 |
tive. There is no problem with anchored patterns or patterns such as: |
tive. There is no problem with anchored patterns or patterns such as: |
4664 |
|
|
4665 |
1234|ABCD |
1234|ABCD |
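
The overlap arrangement described in item 2 above (500-byte chunks in a 700-byte buffer, with a starting offset of 200) can be sketched as follows. This is only a buffer-slicing illustration in Python, not PCRE code. The helper overlapped_chunks is hypothetical, and the actual call to pcre_dfa_exec() with each buffer and starting offset is left out.

```python
def overlapped_chunks(subject, chunk_size=500, overlap=200):
    """Yield (buffer, start_offset) pairs for segment-by-segment matching.

    Each buffer holds up to `overlap` bytes of already-seen data followed
    by up to `chunk_size` new bytes. `start_offset` is where the new data
    begins, so backward assertions can look at the preceding bytes.
    """
    for pos in range(0, len(subject), chunk_size):
        lookback = min(overlap, pos)  # the first chunk has nothing before it
        yield subject[pos - lookback:pos + chunk_size], lookback


subject = bytes(range(256)) * 5  # 1280 bytes of sample data
chunks = list(overlapped_chunks(subject))

for buffer, start_offset in chunks:
    print(len(buffer), start_offset)
```

With 1280 bytes of input this yields three buffers of 500, 700 and 480 bytes, with starting offsets 0, 200 and 200.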
4676 |
|
|
4677 |
REVISION |
REVISION |
4678 |
|
|
4679 |
Last updated: 06 March 2007 |
Last updated: 04 June 2007 |
4680 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
4681 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4682 |
|
|