312 |
|
|
313 |
to the configure command. There is a fourth option, specified by |
to the configure command. There is a fourth option, specified by |
314 |
|
|
315 |
|
--enable-newline-is-anycrlf |
316 |
|
|
317 |
|
which causes PCRE to recognize any of the three sequences CR, LF, or |
318 |
|
CRLF as indicating a line ending. Finally, a fifth option, specified by |
319 |
|
|
320 |
--enable-newline-is-any |
--enable-newline-is-any |
321 |
|
|
322 |
which causes PCRE to recognize any Unicode newline sequence. |
causes PCRE to recognize any Unicode newline sequence. |
323 |
|
|
324 |
Whatever line ending convention is selected when PCRE is built can be |
Whatever line ending convention is selected when PCRE is built can be |
325 |
overridden when the library functions are called. At build time it is |
overridden when the library functions are called. At build time it is |
473 |
|
|
474 |
REVISION |
REVISION |
475 |
|
|
476 |
Last updated: 20 March 2007 |
Last updated: 16 April 2007 |
477 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
478 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
479 |
|
|
846 |
|
|
847 |
NEWLINES |
NEWLINES |
848 |
|
|
849 |
PCRE supports four different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
850 |
strings: a single CR (carriage return) character, a single LF (line- |
strings: a single CR (carriage return) character, a single LF (line- |
851 |
feed) character, the two-character sequence CRLF, or any Unicode new- |
feed) character, the two-character sequence CRLF, any of the three pre- |
852 |
line sequence. The Unicode newline sequences are the three just men- |
ceding, or any Unicode newline sequence. The Unicode newline sequences |
853 |
tioned, plus the single characters VT (vertical tab, U+000B), FF (form- |
are the three just mentioned, plus the single characters VT (vertical |
854 |
feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), |
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
855 |
and PS (paragraph separator, U+2029). |
separator, U+2028), and PS (paragraph separator, U+2029). |
856 |
|
|
857 |
Each of the first three conventions is used by at least one operating |
Each of the first three conventions is used by at least one operating |
858 |
system as its standard newline sequence. When PCRE is built, a default |
system as its standard newline sequence. When PCRE is built, a default |
917 |
|
|
918 |
The output is an integer whose value specifies the default character |
The output is an integer whose value specifies the default character |
919 |
sequence that is recognized as meaning "newline". The four values that |
sequence that is recognized as meaning "newline". The five values that |
920 |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY. |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
921 |
The default should normally be the standard sequence for your operating |
and -1 for ANY. The default should normally be the standard sequence |
922 |
system. |
for your operating system. |
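
As a rough illustration of the values listed above, the following C sketch maps the integer to a readable name. The function name newline_config_name is hypothetical and not part of PCRE; the numbers are the ones quoted in the text. Arithmetically, 3338 equals (13 << 8) | 10, that is, the CR code followed by the LF code. In real use the integer would come from PCRE's configuration query rather than being passed in by hand.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): map the default-newline integer
   described above to a readable name. Values are those quoted in the text. */
const char *newline_config_name(int value)
{
    /* Sanity check on the arithmetic behind the CRLF value. */
    assert(((13 << 8) | 10) == 3338);

    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";     /* CR in the high byte, LF in the low byte */
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}
```

For instance, newline_config_name(3338) yields "CRLF", and any value outside the five listed ones yields "unknown".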
923 |
|
|
924 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
925 |
|
|
1143 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1144 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
1145 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
1146 |
|
PCRE_NEWLINE_ANYCRLF |
1147 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1148 |
|
|
1149 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
1150 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
1151 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
1152 |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
1153 |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANY specifies that |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
1154 |
any Unicode newline sequence should be recognized. The Unicode newline |
that any of the three preceding sequences should be recognized. Setting |
1155 |
sequences are the three just mentioned, plus the single characters VT |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
1156 |
(vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), |
recognized. The Unicode newline sequences are the three just mentioned, |
1157 |
LS (line separator, U+2028), and PS (paragraph separator, U+2029). The |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
1158 |
last two are recognized only in UTF-8 mode. |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
1159 |
|
(paragraph separator, U+2029). The last two are recognized only in |
1160 |
|
UTF-8 mode. |
1161 |
|
|
1162 |
The newline setting in the options word uses three bits that are |
The newline setting in the options word uses three bits that are |
1163 |
treated as a number, giving eight possibilities. Currently only five |
treated as a number, giving eight possibilities. Currently only six are |
1164 |
are used (default plus the four values above). This means that if you |
used (default plus the five values above). This means that if you set |
1165 |
set more than one newline option, the combination may or may not be |
more than one newline option, the combination may or may not be sensi- |
1166 |
sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equiva- |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
1167 |
lent to PCRE_NEWLINE_CRLF, but other combinations yield unused numbers |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
1168 |
and cause an error. |
cause an error. |
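
The three-bit encoding described above can be sketched as follows. The SK_ constants are assumptions: they mirror the values believed to be defined in pcre.h for this release and are repeated only so that the sketch compiles on its own. Real code should include pcre.h and use the PCRE_NEWLINE_ names. The helper functions are hypothetical.

```c
#include <assert.h>

/* Assumed bit values, mirroring pcre.h; repeated here for illustration only. */
#define SK_NEWLINE_CR       0x00100000UL
#define SK_NEWLINE_LF       0x00200000UL
#define SK_NEWLINE_CRLF     0x00300000UL
#define SK_NEWLINE_ANY      0x00400000UL
#define SK_NEWLINE_ANYCRLF  0x00500000UL
#define SK_NEWLINE_BITS     0x00700000UL

/* Extract the three newline bits as a number from 0 to 7. */
int newline_code(unsigned long options)
{
    int code = (int)((options & SK_NEWLINE_BITS) >> 20);
    assert(code >= 0 && code <= 7);
    return code;
}

/* Codes 0 (default) to 5 are in use under these assumed values; 6 and 7 are not. */
int newline_code_is_used(unsigned long options)
{
    return newline_code(options) <= 5;
}
```

Under these assumed values, SK_NEWLINE_CR | SK_NEWLINE_LF gives code 3, the same as SK_NEWLINE_CRLF, while SK_NEWLINE_LF | SK_NEWLINE_ANY gives the unused code 6.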
1169 |
|
|
1170 |
The only time that a line break is specially recognized when compiling |
The only time that a line break is specially recognized when compiling |
1171 |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
1733 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1734 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
1735 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
1736 |
|
PCRE_NEWLINE_ANYCRLF |
1737 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1738 |
|
|
1739 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
1741 |
tion of pcre_compile() above. During matching, the newline choice |
tion of pcre_compile() above. During matching, the newline choice |
1742 |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
1743 |
ters. It may also alter the way the match position is advanced after a |
ters. It may also alter the way the match position is advanced after a |
1744 |
match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF or |
match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF, |
1745 |
PCRE_NEWLINE_ANY is set, and a match attempt fails when the current |
PCRE_NEWLINE_ANYCRLF, or PCRE_NEWLINE_ANY is set, and a match attempt |
1746 |
position is at a CRLF sequence, the match position is advanced by two |
fails when the current position is at a CRLF sequence, the match posi- |
1747 |
characters instead of one, in other words, to after the CRLF. |
tion is advanced by two characters instead of one, in other words, to |
1748 |
|
after the CRLF. |
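
The advance rule described above can be pictured with a small C sketch. This is not PCRE's internal code; next_start_offset and its crlf_is_newline flag are hypothetical names chosen for illustration. The flag stands for any of the three conventions in which CRLF counts as a newline.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: after a failed match attempt at byte offset pos,
   return the next offset to try. When CRLF counts as a newline and pos sits
   on a CR LF pair, both bytes are skipped; otherwise one byte is skipped. */
size_t next_start_offset(const char *subject, size_t length,
                         size_t pos, int crlf_is_newline)
{
    assert(subject != NULL && pos < length);

    if (crlf_is_newline &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;

    return pos + 1;
}
```

With the subject "a\r\nb" and pos at the CR (offset 1), the function returns 3 when crlf_is_newline is set and 2 when it is not.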
1749 |
|
|
1750 |
PCRE_NOTBOL |
PCRE_NOTBOL |
1751 |
|
|
1752 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
1753 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
1754 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
1755 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
1756 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
1757 |
|
|
1758 |
PCRE_NOTEOL |
PCRE_NOTEOL |
1759 |
|
|
1760 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
1761 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
1762 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
1763 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
1764 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
1765 |
not affect \Z or \z. |
not affect \Z or \z. |
1766 |
|
|
1767 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
1768 |
|
|
1769 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
1770 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
1771 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
1772 |
example, if the pattern |
example, if the pattern |
1773 |
|
|
1774 |
a?b? |
a?b? |
1775 |
|
|
1776 |
is applied to a string not beginning with "a" or "b", it matches the |
is applied to a string not beginning with "a" or "b", it matches the |
1777 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
1778 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
1779 |
rences of "a" or "b". |
rences of "a" or "b". |
1780 |
|
|
1781 |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
1782 |
cial case of a pattern match of the empty string within its split() |
cial case of a pattern match of the empty string within its split() |
1783 |
function, and when using the /g modifier. It is possible to emulate |
function, and when using the /g modifier. It is possible to emulate |
1784 |
Perl's behaviour after matching a null string by first trying the match |
Perl's behaviour after matching a null string by first trying the match |
1785 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
1786 |
if that fails by advancing the starting offset (see below) and trying |
if that fails by advancing the starting offset (see below) and trying |
1787 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
1788 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
1789 |
|
|
1790 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1791 |
|
|
1792 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
1793 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
1794 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
1795 |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
1796 |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
1797 |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
1798 |
returned. |
returned. |
1799 |
|
|
1800 |
If you already know that your subject is valid, and you want to skip |
If you already know that your subject is valid, and you want to skip |
1801 |
these checks for performance reasons, you can set the |
these checks for performance reasons, you can set the |
1802 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
1803 |
do this for the second and subsequent calls to pcre_exec() if you are |
do this for the second and subsequent calls to pcre_exec() if you are |
1804 |
making repeated calls to find all the matches in a single subject |
making repeated calls to find all the matches in a single subject |
1805 |
string. However, you should be sure that the value of startoffset |
string. However, you should be sure that the value of startoffset |
1806 |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
1807 |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
1808 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
1809 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
1810 |
|
|
1811 |
PCRE_PARTIAL |
PCRE_PARTIAL |
1812 |
|
|
1813 |
This option turns on the partial matching feature. If the subject |
This option turns on the partial matching feature. If the subject |
1814 |
string fails to match the pattern, but at some point during the match- |
string fails to match the pattern, but at some point during the match- |
1815 |
ing process the end of the subject was reached (that is, the subject |
ing process the end of the subject was reached (that is, the subject |
1816 |
partially matches the pattern and the failure to match occurred only |
partially matches the pattern and the failure to match occurred only |
1817 |
because there were not enough subject characters), pcre_exec() returns |
because there were not enough subject characters), pcre_exec() returns |
1818 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
1819 |
used, there are restrictions on what may appear in the pattern. These |
used, there are restrictions on what may appear in the pattern. These |
1820 |
are discussed in the pcrepartial documentation. |
are discussed in the pcrepartial documentation. |
1821 |
|
|
1822 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
1823 |
|
|
1824 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
1825 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
1826 |
mode, the byte offset must point to the start of a UTF-8 character. |
mode, the byte offset must point to the start of a UTF-8 character. |
1827 |
Unlike the pattern string, the subject may contain binary zero bytes. |
Unlike the pattern string, the subject may contain binary zero bytes. |
1828 |
When the starting offset is zero, the search for a match starts at the |
When the starting offset is zero, the search for a match starts at the |
1829 |
beginning of the subject, and this is by far the most common case. |
beginning of the subject, and this is by far the most common case. |
1830 |
|
|
1831 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
1832 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
1833 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
1834 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
1835 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
1836 |
|
|
1837 |
\Biss\B |
\Biss\B |
1838 |
|
|
1839 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
1840 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
1841 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
1842 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
1843 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
1844 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
1845 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
1846 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
1847 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
1848 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
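
The difference between passing a shortened string and passing a starting offset can be reproduced without PCRE. The sketch below is a hand-written stand-in for the pattern \Biss\B, assuming ASCII word characters (letters, digits, underscore). All function names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Word character test; positions outside the subject count as non-word. */
static int is_word(const char *s, int len, int i)
{
    return i >= 0 && i < len && (isalnum((unsigned char)s[i]) || s[i] == '_');
}

/* \B holds at pos when the characters on both sides are of the same kind. */
static int not_word_boundary(const char *s, int len, int pos)
{
    return is_word(s, len, pos - 1) == is_word(s, len, pos);
}

/* Stand-in for \Biss\B: return the first match offset at or after start,
   or -1. The check at pos may look behind start, as startoffset allows. */
int find_inner_iss(const char *s, int start)
{
    int len = (int)strlen(s);
    int pos;

    assert(start >= 0 && start <= len);
    for (pos = start; pos + 3 <= len; pos++) {
        if (not_word_boundary(s, len, pos) &&
            strncmp(s + pos, "iss", 3) == 0 &&
            not_word_boundary(s, len, pos + 3))
            return pos;
    }
    return -1;
}
```

Calling find_inner_iss("Mississippi", 4) finds a match at offset 4, because the character before the start point is visible. Calling find_inner_iss("issippi", 0) finds nothing, because offset 0 is treated as a word boundary.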
1849 |
|
|
1850 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
1851 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
1852 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
1853 |
subject. |
subject. |
1854 |
|
|
1855 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
1856 |
|
|
1857 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
1858 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
1859 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
1860 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
1861 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
1862 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
1863 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
1864 |
|
|
1865 |
Captured substrings are returned to the caller via a vector of integer |
Captured substrings are returned to the caller via a vector of integer |
1866 |
offsets whose address is passed in ovector. The number of elements in |
offsets whose address is passed in ovector. The number of elements in |
1867 |
the vector is passed in ovecsize, which must be a non-negative number. |
the vector is passed in ovecsize, which must be a non-negative number. |
1868 |
Note: this argument is NOT the size of ovector in bytes. |
Note: this argument is NOT the size of ovector in bytes. |
1869 |
|
|
1870 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
1871 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
1872 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
1873 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
1874 |
The length passed in ovecsize should always be a multiple of three. If |
The length passed in ovecsize should always be a multiple of three. If |
1875 |
it is not, it is rounded down. |
it is not, it is rounded down. |
1876 |
|
|
1877 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
1878 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
1879 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
1880 |
element of a pair is set to the offset of the first character in a sub- |
element of a pair is set to the offset of the first character in a sub- |
1881 |
string, and the second is set to the offset of the first character |
string, and the second is set to the offset of the first character |
1882 |
after the end of a substring. The first pair, ovector[0] and ovec- |
after the end of a substring. The first pair, ovector[0] and ovec- |
1883 |
tor[1], identify the portion of the subject string matched by the |
tor[1], identify the portion of the subject string matched by the |
1884 |
entire pattern. The next pair is used for the first capturing subpat- |
entire pattern. The next pair is used for the first capturing subpat- |
1885 |
tern, and so on. The value returned by pcre_exec() is one more than the |
tern, and so on. The value returned by pcre_exec() is one more than the |
1886 |
highest numbered pair that has been set. For example, if two substrings |
highest numbered pair that has been set. For example, if two substrings |
1887 |
have been captured, the returned value is 3. If there are no capturing |
have been captured, the returned value is 3. If there are no capturing |
1888 |
subpatterns, the return value from a successful match is 1, indicating |
subpatterns, the return value from a successful match is 1, indicating |
1889 |
that just the first pair of offsets has been set. |
that just the first pair of offsets has been set. |
1890 |
|
|
1891 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
1892 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
1893 |
|
|
1894 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
1895 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
1896 |
function returns a value of zero. In particular, if the substring off- |
function returns a value of zero. In particular, if the substring off- |
1897 |
sets are not of interest, pcre_exec() may be called with ovector passed |
sets are not of interest, pcre_exec() may be called with ovector passed |
1898 |
as NULL and ovecsize as zero. However, if the pattern contains back |
as NULL and ovecsize as zero. However, if the pattern contains back |
1899 |
references and the ovector is not big enough to remember the related |
references and the ovector is not big enough to remember the related |
1900 |
substrings, PCRE has to get additional memory for use during matching. |
substrings, PCRE has to get additional memory for use during matching. |
1901 |
Thus it is usually advisable to supply an ovector. |
Thus it is usually advisable to supply an ovector. |
1902 |
|
|
1903 |
The pcre_info() function can be used to find out how many capturing |
The pcre_info() function can be used to find out how many capturing |
1904 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
1905 |
ovector that will allow for n captured substrings, in addition to the |
ovector that will allow for n captured substrings, in addition to the |
1906 |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
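
The sizing rules above reduce to two small calculations, sketched here in C. The function names min_ovecsize and usable_pairs are hypothetical.

```c
#include <assert.h>

/* Smallest ovecsize that can report the whole match plus n captures. */
int min_ovecsize(int n_captures)
{
    assert(n_captures >= 0);
    return (n_captures + 1) * 3;
}

/* Number of offset pairs that fit in a vector of ovecsize elements:
   the size is rounded down to a multiple of three, and only the first
   two-thirds of the vector carries results, two integers per pair. */
int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

A pattern with two capturing subpatterns therefore needs at least 9 elements, and a 10-element vector holds no more pairs than a 9-element one.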
1907 |
|
|
1908 |
It is possible for capturing subpattern number n+1 to match some part |
It is possible for capturing subpattern number n+1 to match some part |
1909 |
of the subject when subpattern n has not been used at all. For example, |
of the subject when subpattern n has not been used at all. For example, |
1910 |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
1911 |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
1912 |
2 is not. When this happens, both values in the offset pairs corre- |
2 is not. When this happens, both values in the offset pairs corre- |
1913 |
sponding to unused subpatterns are set to -1. |
sponding to unused subpatterns are set to -1. |
1914 |
|
|
1915 |
Offset values that correspond to unused subpatterns at the end of the |
Offset values that correspond to unused subpatterns at the end of the |
1916 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
1917 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
1918 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
1919 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1. However, you can refer to the offsets |
1920 |
for the second and third capturing subpatterns if you wish (assuming |
for the second and third capturing subpatterns if you wish (assuming |
1921 |
the vector is large enough, of course). |
the vector is large enough, of course). |
1922 |
|
|
1923 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
1924 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
1925 |
|
|
1926 |
Error return values from pcre_exec() |
Error return values from pcre_exec() |
1927 |
|
|
1928 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
1929 |
defined in the header file: |
defined in the header file: |
1930 |
|
|
1931 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
1934 |
|
|
1935 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
1936 |
|
|
1937 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
1938 |
ovecsize was not zero. |
ovecsize was not zero. |
1939 |
|
|
1940 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
1943 |
|
|
1944 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
1945 |
|
|
1946 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
1947 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
1948 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
1949 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
1950 |
gives when the magic number is not present. |
gives when the magic number is not present. |
1951 |
|
|
1952 |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
1953 |
|
|
1954 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
1955 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
1956 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
1957 |
|
|
1958 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
1959 |
|
|
1960 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
1961 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
1962 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
1963 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
1964 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
1965 |
|
|
1966 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
1967 |
|
|
1968 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
1969 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
1970 |
returned by pcre_exec(). |
returned by pcre_exec(). |
1971 |
|
|
1972 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
1973 |
|
|
1974 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
1975 |
pcre_extra structure (or defaulted) was reached. See the description |
pcre_extra structure (or defaulted) was reached. See the description |
1976 |
above. |
above. |
1977 |
|
|
PCRE_ERROR_CALLOUT (-9)

This error is never generated by pcre_exec() itself. It is provided for
use by callout functions that want to yield a distinctive error code.
See the pcrecallout documentation for details.

PCRE_ERROR_BADUTF8 (-10)

A string that contains an invalid UTF-8 byte sequence was passed as a
subject.

PCRE_ERROR_BADUTF8_OFFSET (-11)

The UTF-8 byte sequence that was passed as a subject was valid, but the
value of startoffset did not point to the beginning of a UTF-8
character.

PCRE_ERROR_PARTIAL (-12)

The subject string did not match, but it did match partially. See the
pcrepartial documentation for details of partial matching.

PCRE_ERROR_BADPARTIAL (-13)

The PCRE_PARTIAL option was used with a compiled pattern containing
items that are not supported for partial matching. See the pcrepartial
documentation for details of partial matching.

PCRE_ERROR_INTERNAL (-14)

An unexpected internal error has occurred. This error could be caused
by a bug in PCRE or by overwriting of the compiled pattern.

PCRE_ERROR_BADCOUNT (-15)

This error is given if the value of the ovecsize argument is negative.

PCRE_ERROR_RECURSIONLIMIT (-21)

The internal recursion limit, as specified by the match_limit_recursion
field in a pcre_extra structure (or defaulted) was reached. See the
description above.

PCRE_ERROR_NULLWSLIMIT (-22)

When a group that can match an empty substring is repeated with an
unbounded upper limit, the subject position at the start of the group
must be remembered, so that a test for an empty string can be made when
the end of the group is reached. Some workspace is required for this;
if it runs out, this error is given.

PCRE_ERROR_BADNEWLINE (-23)

int pcre_get_substring_list(const char *subject,
     int *ovector, int stringcount, const char ***listptr);

Captured substrings can be accessed directly by using the offsets
returned by pcre_exec() in ovector. For convenience, the functions
pcre_copy_substring(), pcre_get_substring(), and
pcre_get_substring_list() are provided for extracting captured
substrings as new, separate, zero-terminated strings. These functions
identify substrings by number. The next section describes functions for
extracting named substrings.

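As an illustration of direct access, the following sketch copies one
captured substring out of the subject using the offset pair that
pcre_exec() stores for it (start offset in ovector[2*n], end offset in
ovector[2*n+1]). The helper name copy_by_offsets() and its return
convention are hypothetical and are not part of PCRE; the sketch does
not call the library itself.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE function.
   Copies captured substring number n out of subject, using the offset
   pair ovector[2*n] (start) and ovector[2*n+1] (end, exclusive).
   Returns the substring length, or -1 if buf is too small. */
int copy_by_offsets(const char *subject, const int *ovector,
                    int n, char *buf, size_t bufsize)
{
    int start = ovector[2 * n];
    int end = ovector[2 * n + 1];
    size_t len;

    /* The caller is expected to pass a substring that is set. */
    assert(start >= 0 && end >= start);

    len = (size_t)(end - start);
    if (len + 1 > bufsize)
        return -1;                 /* no room for the terminating zero */

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```

Because the length is returned separately, a caller can still handle a
substring that contains a binary zero.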
A substring that contains a binary zero is correctly extracted and has
a further zero added on the end, but the result is not, of course, a C
string. However, you can process such a string by referring to the
length that is returned by pcre_copy_substring() and
pcre_get_substring(). Unfortunately, the interface to
pcre_get_substring_list() is not adequate for handling strings
containing binary zeros, because the end of the final string is not
independently indicated.

The first three arguments are the same for all three of these
functions: subject is the subject string that has just been
successfully matched, ovector is a pointer to the vector of integer
offsets that was passed to pcre_exec(), and stringcount is the number
of substrings that were captured by the match, including the substring
that matched the entire regular expression. This is the value returned
by pcre_exec() if it is greater than zero. If pcre_exec() returned
zero, indicating that it ran out of space in ovector, the value passed
as stringcount should be the number of elements in the vector divided
by three.

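The rule for choosing stringcount can be written down as a small
helper. This is only a sketch: the name stringcount_from_rc() and the
use of -1 to signal "pcre_exec() failed, nothing to extract" are
assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical helper, not a PCRE function.
   rc is the value returned by pcre_exec(); ovecsize is the number of
   ints in the ovector that was passed to it. Returns the value to pass
   as stringcount, or -1 if rc was a negative error code. */
int stringcount_from_rc(int rc, int ovecsize)
{
    assert(ovecsize >= 0);

    if (rc > 0)
        return rc;            /* normal case: number of captured strings */
    if (rc == 0)
        return ovecsize / 3;  /* ovector was too small to hold them all */
    return -1;                /* no match or error */
}
```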
The functions pcre_copy_substring() and pcre_get_substring() extract a
single substring, whose number is given as stringnumber. A value of
zero extracts the substring that matched the entire pattern, whereas
higher values extract the captured substrings. For
pcre_copy_substring(), the string is placed in buffer, whose length is
given by buffersize, while for pcre_get_substring() a new block of
memory is obtained via pcre_malloc, and its address is returned via
stringptr. The yield of the function is the length of the string, not
including the terminating zero, or one of these error codes:

PCRE_ERROR_NOMEMORY (-6)

The buffer was too small for pcre_copy_substring(), or the attempt to
get memory failed for pcre_get_substring().

PCRE_ERROR_NOSUBSTRING (-7)

There is no substring whose number is stringnumber.

The pcre_get_substring_list() function extracts all available
substrings and builds a list of pointers to them. All this is done in a
single block of memory that is obtained via pcre_malloc. The address of
the memory block is returned via listptr, which is also the start of
the list of string pointers. The end of the list is marked by a NULL
pointer. The yield of the function is zero if all went well, or the
error code

PCRE_ERROR_NOMEMORY (-6)

if the attempt to get the memory block failed.

When any of these functions encounters a substring that is unset, which
can happen when capturing subpattern number n+1 matches some part of
the subject, but subpattern n has not been used at all, it returns an
empty string. This can be distinguished from a genuine zero-length
substring by inspecting the appropriate offset in ovector, which is
negative for unset substrings.

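A minimal sketch of that check follows. The enum and the function
classify_capture() are hypothetical names introduced for the example;
only the rule that an unset substring has a negative offset comes from
the text above.

```c
#include <assert.h>

/* Hypothetical names, not part of PCRE. */
enum capture_state { CAPTURE_UNSET, CAPTURE_EMPTY, CAPTURE_NONEMPTY };

/* Classify captured substring n by looking at its offset pair. */
enum capture_state classify_capture(const int *ovector,
                                    int stringcount, int n)
{
    assert(n >= 0 && n < stringcount);

    if (ovector[2 * n] < 0)
        return CAPTURE_UNSET;       /* negative offset: group not used */
    if (ovector[2 * n] == ovector[2 * n + 1])
        return CAPTURE_EMPTY;       /* set, but zero characters long */
    return CAPTURE_NONEMPTY;
}
```

The extraction functions return an empty string in both of the first
two cases, so a caller that needs to tell them apart has to make this
test on ovector itself.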
The two convenience functions pcre_free_substring() and
pcre_free_substring_list() can be used to free the memory returned by a
previous call of pcre_get_substring() or pcre_get_substring_list(),
respectively. They do nothing more than call the function pointed to by
pcre_free, which of course could be called directly from a C program.
However, PCRE is used in some situations where it is linked via a
special interface to another programming language that cannot use
pcre_free directly; it is for these cases that the functions are
provided.

     int stringcount, const char *stringname,
     const char **stringptr);

To extract a substring by name, you first have to find the associated
number. For example, for this pattern

(a+)b(?<xxx>\d+)...

be unique (PCRE_DUPNAMES was not set), you can find the number from the
name by calling pcre_get_stringnumber(). The first argument is the
compiled pattern, and the second is the name. The yield of the function
is the subpattern number, or PCRE_ERROR_NOSUBSTRING (-7) if there is no
subpattern of that name.

Given the number, you can extract the substring directly, or use one of
the functions described in the previous section. For convenience, there
are also two functions that do the whole job.

Most of the arguments of pcre_copy_named_substring() and
pcre_get_named_substring() are the same as those for the similarly
named functions that extract by number. As these are described in the
previous section, they are not re-described here. There are just two
differences:

First, instead of a substring number, a substring name is given.
Second, there is an extra argument, given at the start, which is a
pointer to the compiled pattern. This is needed in order to gain access
to the name-to-number translation table.

These functions call pcre_get_stringnumber(), and if it succeeds, they
then call pcre_copy_substring() or pcre_get_substring(), as
appropriate. NOTE: If PCRE_DUPNAMES is set and there are duplicate
names, the behaviour may not be what you want (see the next section).

int pcre_get_stringtable_entries(const pcre *code,
     const char *name, char **first, char **last);

When a pattern is compiled with the PCRE_DUPNAMES option, names for
subpatterns are not required to be unique. Normally, patterns with
duplicate names are such that in any one match, only one of the named
subpatterns participates. An example is shown in the pcrepattern
documentation. When duplicates are present, pcre_copy_named_substring()
and pcre_get_named_substring() return the first substring corresponding
to the given name that is set. If none are set, an empty string is
returned. The pcre_get_stringnumber() function returns one of the
numbers that are associated with the name, but it is not defined which
it is.

If you want to get full details of all captured substrings for a given
name, you must use the pcre_get_stringtable_entries() function. The
first argument is the compiled pattern, and the second is the name. The
third and fourth are pointers to variables which are updated by the
function. After it has run, they point to the first and last entries in
the name-to-number table for the given name. The function itself
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
there are none. The format of the table is described above in the
section entitled Information about a pattern. Given all the relevant
entries for the name, you can extract each of their numbers, and hence
the captured data, if any.

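The following sketch shows one way to walk the entries between first
and last and recover the group numbers. It assumes the table layout
given in the section on information about a pattern: each entry starts
with a two-byte group number, most significant byte first, followed by
the zero-terminated name. The function entry_numbers() is a
hypothetical helper and works on any memory laid out that way; it does
not call PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Decodes the group numbers stored in the name-table entries from
   first to last inclusive. entrysize is the length of one entry, as
   returned by pcre_get_stringtable_entries(). Writes at most max
   numbers to out and returns how many entries were visited. */
int entry_numbers(const char *first, const char *last, int entrysize,
                  int *out, int max)
{
    int count = 0;
    const char *p;

    assert(entrysize > 2 && first <= last);

    for (p = first; p <= last; p += entrysize) {
        const unsigned char *u = (const unsigned char *)p;
        if (count < max)
            out[count] = (u[0] << 8) | u[1];   /* big-endian number */
        count++;
    }
    return count;
}
```

Each number obtained this way can then be used with ovector, or with
the extract-by-number functions, to see which of the duplicates
actually captured something.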

FINDING ALL POSSIBLE MATCHES

The traditional matching function uses a similar algorithm to Perl,
which stops when it finds the first match, starting at a given point in
the subject. If you want to find all possible matches, or the longest
possible match, consider using the alternative matching function (see
below) instead. If you cannot use the alternative function, but still
need to find all possible matches, you can kludge it up by making use
of the callout facility, which is described in the pcrecallout
documentation.

What you have to do is to insert a callout right at the end of the
pattern. When your callout function is called, extract and save the
current matched substring. Then return 1, which forces pcre_exec() to
backtrack and try other alternatives. Ultimately, when it runs out of
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.

     int options, int *ovector, int ovecsize,
     int *workspace, int wscount);

The function pcre_dfa_exec() is called to match a subject string
against a compiled pattern, using a matching algorithm that scans the
subject string just once, and does not backtrack. This has different
characteristics to the normal algorithm, and is not compatible with
Perl. Some of the features of PCRE patterns are not supported.
Nevertheless, there are times when this kind of matching can be useful.
For a discussion of the two matching algorithms, see the pcrematching
documentation.

The arguments for the pcre_dfa_exec() function are the same as for
pcre_exec(), plus two extras. The ovector argument is used in a
different way, and this is described below. The other common arguments
are used in the same way as for pcre_exec(), so their description is
not repeated here.

The two additional arguments provide workspace for the function. The
workspace vector should contain at least 20 elements. It is used for
keeping track of multiple paths through the pattern tree. More
workspace will be needed for patterns and subjects where there are a
lot of potential matches.

Here is an example of a simple call to pcre_dfa_exec():

Option bits for pcre_dfa_exec()

The unused bits of the options argument for pcre_dfa_exec() must be
zero. The only bits that may be set are PCRE_ANCHORED,
PCRE_NEWLINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY,
PCRE_NO_UTF8_CHECK, PCRE_PARTIAL, PCRE_DFA_SHORTEST, and
PCRE_DFA_RESTART. All but the last three of these are the same as for
pcre_exec(), so their description is not repeated here.

PCRE_PARTIAL

This has the same general effect as it does for pcre_exec(), but the
details are slightly different. When PCRE_PARTIAL is set for
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have
been no complete matches, but there is still at least one matching
possibility. The portion of the string that provided the partial match
is set as the first matching string.

PCRE_DFA_SHORTEST

Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to
stop as soon as it has found one match. Because of the way the
alternative algorithm works, this is necessarily the shortest possible
match at the first possible matching point in the subject string.

PCRE_DFA_RESTART

When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and
returns a partial match, it is possible to call it again, with
additional subject characters, and have it continue with the same
match. The PCRE_DFA_RESTART option requests this action; when it is
set, the workspace and wscount arguments must reference the same vector
as before because data about the match so far is left in it after a
partial match. There is more discussion of this facility in the
pcrepartial documentation.

Successful returns from pcre_dfa_exec()

When pcre_dfa_exec() succeeds, it may have matched more than one
substring in the subject. Note, however, that all the matches from one
run of the function start at the same point in the subject. The shorter
matches are all initial substrings of the longer matches. For example,
if the pattern

<.*>

<something> <something else>
<something> <something else> <something further>

On success, the yield of the function is a number greater than zero,
which is the number of matched substrings. The substrings themselves
are returned in ovector. Each string uses two elements; the first is
the offset to the start, and the second is the offset to the end. In
fact, all the strings have the same start offset. (Space could have
been saved by giving this only once, but it was decided to retain some
compatibility with the way pcre_exec() returns data, even though the
meaning of the strings is different.)

The strings are returned in reverse order of length; that is, the
longest matching string is given first. If there were too many matches
to fit into ovector, the yield of the function is zero, and the vector
is filled with the longest matches.

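A caller might read the result vector along the following lines. This
is a sketch only: dfa_match_length() is a hypothetical helper, and it
works purely on the pair layout described above without calling
pcre_dfa_exec().

```c
#include <assert.h>

/* Hypothetical helper, not a PCRE function.
   Returns the length of match number i in a pcre_dfa_exec() result,
   where count is the positive value the function returned and match 0
   is the longest. Each match occupies the pair ovector[2*i] (start)
   and ovector[2*i+1] (end). */
int dfa_match_length(const int *ovector, int count, int i)
{
    assert(i >= 0 && i < count);
    return ovector[2 * i + 1] - ovector[2 * i];
}
```

For instance, with the pattern <.*> and the subject "<a> <bc> <d>" one
would expect three pairs, all starting at offset 0 and ending at
offsets 12, 8 and 3 in that order. These values were worked out by hand
for the illustration, not taken from a run of the library. Match i can
then be printed with printf("%.*s\n", length, subject + ovector[2*i]).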
Error returns from pcre_dfa_exec()

The pcre_dfa_exec() function returns a negative number when it fails.
Many of the errors are the same as for pcre_exec(), and these are
described above. There are in addition the following errors that are
specific to pcre_dfa_exec():

PCRE_ERROR_DFA_UITEM (-16)

This return is given if pcre_dfa_exec() encounters an item in the
pattern that it does not support, for instance, the use of \C or a back
reference.

PCRE_ERROR_DFA_UCOND (-17)

This return is given if pcre_dfa_exec() encounters a condition item
that uses a back reference for the condition, or a test for recursion
in a specific group. These are not supported.

PCRE_ERROR_DFA_UMLIMIT (-18)

This return is given if pcre_dfa_exec() is called with an extra block
that contains a setting of the match_limit field. This is not supported
(it is meaningless).

PCRE_ERROR_DFA_WSSIZE (-19)

This return is given if pcre_dfa_exec() runs out of space in the
workspace vector.

PCRE_ERROR_DFA_RECURSE (-20)

When a recursive subpattern is processed, the matching function calls
itself recursively, using private vectors for ovector and workspace.
This error is given if the output vector is not large enough. This
should be extremely rare, as a vector of size 1000 is used.

SEE ALSO |
SEE ALSO |
2374 |
|
|
2375 |
pcrebuild(3), pcrecallout(3), pcrecpp(3)(3), pcrematching(3), pcrepar- |
pcrebuild(3), pcrecallout(3), pcrecpp(3)(3), pcrematching(3), pcrepar- |
2376 |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
2377 |
|
|
2378 |
|
|
2379 |
AUTHOR |
AUTHOR |

REVISION

2388 |
Last updated: 06 March 2007 |
Last updated: 16 April 2007 |
Copyright (c) 1997-2007 University of Cambridge.
------------------------------------------------------------------------------
