430 |
<P> |
<P> |
431 |
Return information about the first character of any matched string, for a |
Return information about the first character of any matched string, for a |
432 |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
433 |
such as (cat|cow|coyote), then it is returned in the integer pointed to by |
such as (cat|cow|coyote), it is returned in the integer pointed to by |
434 |
<I>where</I>. Otherwise, if either |
<I>where</I>. Otherwise, if either |
435 |
</P> |
</P> |
436 |
<P> |
<P> |
442 |
(if it were set, the pattern would be anchored), |
(if it were set, the pattern would be anchored), |
443 |
</P> |
</P> |
444 |
<P> |
<P> |
445 |
then -1 is returned, indicating that the pattern matches only at the |
-1 is returned, indicating that the pattern matches only at the start of a |
446 |
start of a subject string or after any "\n" within the string. Otherwise -2 is |
subject string or after any "\n" within the string. Otherwise -2 is returned. |
447 |
returned. For anchored patterns, -2 is returned. |
For anchored patterns, -2 is returned. |
448 |
</P> |
</P> |
449 |
<P> |
<P> |
450 |
<PRE> |
<PRE> |
734 |
were captured by the match, including the substring that matched the entire |
were captured by the match, including the substring that matched the entire |
735 |
regular expression. This is the value returned by <B>pcre_exec</B> if it |
regular expression. This is the value returned by <B>pcre_exec</B> if it |
736 |
is greater than zero. If <B>pcre_exec()</B> returned zero, indicating that it |
is greater than zero. If <B>pcre_exec()</B> returned zero, indicating that it |
737 |
ran out of space in <I>ovector</I>, then the value passed as |
ran out of space in <I>ovector</I>, the value passed as <I>stringcount</I> should |
738 |
<I>stringcount</I> should be the size of the vector divided by three. |
be the size of the vector divided by three. |
739 |
</P> |
</P> |
740 |
<P> |
<P> |
741 |
The functions <B>pcre_copy_substring()</B> and <B>pcre_get_substring()</B> |
The functions <B>pcre_copy_substring()</B> and <B>pcre_get_substring()</B> |
857 |
with the settings of captured strings when part of a pattern is repeated. For |
with the settings of captured strings when part of a pattern is repeated. For |
858 |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
859 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
860 |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. |
861 |
</P> |
</P> |
862 |
<P> |
<P> |
863 |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
1186 |
<P> |
<P> |
1187 |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
1188 |
the subject, including a non-printing character, but not (by default) newline. |
the subject, including a non-printing character, but not (by default) newline. |
1189 |
If the PCRE_DOTALL option is set, then dots match newlines as well. The |
If the PCRE_DOTALL option is set, dots match newlines as well. The handling of |
1190 |
handling of dot is entirely independent of the handling of circumflex and |
dot is entirely independent of the handling of circumflex and dollar, the only |
1191 |
dollar, the only relationship being that they both involve newline characters. |
relationship being that they both involve newline characters. Dot has no |
1192 |
Dot has no special meaning in a character class. |
special meaning in a character class. |
1193 |
</P> |
</P> |
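As a rough illustration of the dot behaviour described above, the sketch below uses Python's `re` module rather than PCRE itself. `re.DOTALL` is assumed here to play the role of PCRE_DOTALL, and the subject strings are made up for the example.

```python
import re

subject = "a\nb"

# By default the dot does not match a newline, so "a.b" finds nothing here.
default_match = re.search(r"a.b", subject)

# With DOTALL set, the dot matches the newline as well.
dotall_match = re.search(r"a.b", subject, re.DOTALL)

# Inside a character class the dot is an ordinary character.
class_match = re.search(r"a[.]b", "a.b")
class_miss = re.search(r"a[.]b", "axb")

print(default_match is None, dotall_match is not None)
print(class_match is not None, class_miss is None)
```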
1194 |
<LI><A NAME="SEC17" HREF="#TOC1">SQUARE BRACKETS</A> |
<LI><A NAME="SEC17" HREF="#TOC1">SQUARE BRACKETS</A> |
1195 |
<P> |
<P> |
1580 |
item. |
item. |
1581 |
</P> |
</P> |
1582 |
<P> |
<P> |
1583 |
However, if a quantifier is followed by a question mark, then it ceases to be |
However, if a quantifier is followed by a question mark, it ceases to be |
1584 |
greedy, and instead matches the minimum number of times possible, so the |
greedy, and instead matches the minimum number of times possible, so the |
1585 |
pattern |
pattern |
1586 |
</P> |
</P> |
1605 |
way the rest of the pattern matches. |
way the rest of the pattern matches. |
1606 |
</P> |
</P> |
1607 |
<P> |
<P> |
1608 |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl) |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl), |
1609 |
then the quantifiers are not greedy by default, but individual ones can be made |
the quantifiers are not greedy by default, but individual ones can be made |
1610 |
greedy by following them with a question mark. In other words, it inverts the |
greedy by following them with a question mark. In other words, it inverts the |
1611 |
default behaviour. |
default behaviour. |
1612 |
</P> |
</P> |
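The difference between greedy and non-greedy quantifiers can be sketched with Python's `re` module. This is an illustration only: Python has no counterpart to PCRE_UNGREEDY, so the example shows just the per-quantifier question mark, and the subject string is invented for the purpose.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* takes as much as possible, so the match runs to the last "*/".
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Non-greedy: .*? takes as little as possible, so the match stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group(0)

print(greedy)
print(lazy)
```

With PCRE_UNGREEDY set, the roles of the two spellings would be swapped, as the paragraph above describes.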
1617 |
</P> |
</P> |
1618 |
<P> |
<P> |
1619 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
1620 |
to Perl's /s) is set, thus allowing the . to match newlines, then the pattern |
to Perl's /s) is set, thus allowing the . to match newlines, the pattern is |
1621 |
is implicitly anchored, because whatever follows will be tried against every |
implicitly anchored, because whatever follows will be tried against every |
1622 |
character position in the subject string, so there is no point in retrying the |
character position in the subject string, so there is no point in retrying the |
1623 |
overall match at any position after the first. PCRE treats such a pattern as |
overall match at any position after the first. PCRE treats such a pattern as |
1624 |
though it were preceded by \A. In cases where it is known that the subject |
though it were preceded by \A. In cases where it is known that the subject |
1677 |
<P> |
<P> |
1678 |
matches "sense and sensibility" and "response and responsibility", but not |
matches "sense and sensibility" and "response and responsibility", but not |
1679 |
"sense and responsibility". If caseful matching is in force at the time of the |
"sense and responsibility". If caseful matching is in force at the time of the |
1680 |
back reference, then the case of letters is relevant. For example, |
back reference, the case of letters is relevant. For example, |
1681 |
</P> |
</P> |
1682 |
<P> |
<P> |
1683 |
<PRE> |
<PRE> |
1690 |
</P> |
</P> |
1691 |
<P> |
<P> |
1692 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
1693 |
subpattern has not actually been used in a particular match, then any back |
subpattern has not actually been used in a particular match, any back |
1694 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
1695 |
</P> |
</P> |
1696 |
<P> |
<P> |
1702 |
always fails if it starts to match "a" rather than "bc". Because there may be |
always fails if it starts to match "a" rather than "bc". Because there may be |
1703 |
up to 99 back references, all digits following the backslash are taken |
up to 99 back references, all digits following the backslash are taken |
1704 |
as part of a potential back reference number. If the pattern continues with a |
as part of a potential back reference number. If the pattern continues with a |
1705 |
digit character, then some delimiter must be used to terminate the back |
digit character, some delimiter must be used to terminate the back reference. |
1706 |
reference. If the PCRE_EXTENDED option is set, this can be whitespace. |
If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty |
1707 |
Otherwise an empty comment can be used. |
comment can be used. |
1708 |
</P> |
</P> |
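The back reference rules above can be tried out with a short Python sketch. It uses Python's `re` module, not PCRE, and the three patterns are assumptions chosen to fit the strings quoted in the text. Python happens to agree with the behaviour described here for unset groups and for empty comments.

```python
import re

# A back reference repeats the text the group matched, not the group's pattern.
pattern = re.compile(r"(sens|respons)e and \1ibility")
same = pattern.fullmatch("sense and sensibility")
mixed = pattern.fullmatch("sense and responsibility")

# When the first alternative matches "a", group 2 takes no part in the match,
# so the reference to it fails.
unset = re.compile(r"(a|(bc))\2")
unset_a = unset.fullmatch("aa")
unset_bc = unset.fullmatch("bcbc")

# An empty comment terminates the back reference, so the following digit
# is treated as a literal "1" rather than as part of the reference number.
delimited = re.compile(r"(a)\1(?#)1")
delim = delimited.fullmatch("aa1")

print(same is not None, mixed is None)
print(unset_a is None, unset_bc is not None)
print(delim is not None)
```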
1709 |
<P> |
<P> |
1710 |
A back reference that occurs inside the parentheses to which it refers fails |
A back reference that occurs inside the parentheses to which it refers fails |
1836 |
matches "foo" preceded by three digits that are not "999". Notice that each of |
matches "foo" preceded by three digits that are not "999". Notice that each of |
1837 |
the assertions is applied independently at the same point in the subject |
the assertions is applied independently at the same point in the subject |
1838 |
string. First there is a check that the previous three characters are all |
string. First there is a check that the previous three characters are all |
1839 |
digits, then there is a check that the same three characters are not "999". |
digits, and then there is a check that the same three characters are not "999". |
1840 |
This pattern does <I>not</I> match "foo" preceded by six characters, the first three |
This pattern does <I>not</I> match "foo" preceded by six characters, the first three |
1841 |
of which are digits and the last three of which are not "999". For example, it |
of which are digits and the last three of which are not "999". For example, it |
1842 |
doesn't match "123abcfoo". A pattern to do that is |
doesn't match "123abcfoo". A pattern to do that is |
1957 |
</PRE> |
</PRE> |
1958 |
</P> |
</P> |
1959 |
<P> |
<P> |
1960 |
then the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails (because |
1961 |
(because there is no following "a"), it backtracks to match all but the last |
there is no following "a"), it backtracks to match all but the last character, |
1962 |
character, then all but the last two characters, and so on. Once again the |
then all but the last two characters, and so on. Once again the search for "a" |
1963 |
search for "a" covers the entire string, from right to left, so we are no |
covers the entire string, from right to left, so we are no better off. However, |
1964 |
better off. However, if the pattern is written as |
if the pattern is written as |
1965 |
</P> |
</P> |
1966 |
<P> |
<P> |
1967 |
<PRE> |
<PRE> |
1969 |
</PRE> |
</PRE> |
1970 |
</P> |
</P> |
1971 |
<P> |
<P> |
1972 |
then there can be no backtracking for the .* item; it can match only the entire |
there can be no backtracking for the .* item; it can match only the entire |
1973 |
string. The subsequent lookbehind assertion does a single test on the last four |
string. The subsequent lookbehind assertion does a single test on the last four |
1974 |
characters. If it fails, the match fails immediately. For long strings, this |
characters. If it fails, the match fails immediately. For long strings, this |
1975 |
approach makes a significant difference to the processing time. |
approach makes a significant difference to the processing time. |
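The following Python sketch illustrates the idea rather than PCRE's own behaviour. The literal "abcd" and the test strings are made up for illustration. The non-backtracking .* is emulated with a lookahead plus back reference, `(?=(.*))\1`, because once-only groups are not available in every version of Python's `re`. The sketch shows only that the two forms accept the same strings; it makes no timing claims.

```python
import re

# Plain form: .* may give characters back while the engine looks for "abcd".
backtracking = re.compile(r"^.*abcd$")

# Emulated once-only form: the lookahead captures the whole string, \1 consumes
# it without giving anything back, and the lookbehind tests the last four characters.
once_only = re.compile(r"^(?=(.*))\1(?<=abcd)")

subjects = ["xxxxabcd", "abcdxxxx", "abcd", "xabcdx"]
results = [(s, bool(backtracking.search(s)), bool(once_only.search(s))) for s in subjects]

for s, plain, atomic in results:
    print(s, plain, atomic)
```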
2032 |
</P> |
</P> |
2033 |
<P> |
<P> |
2034 |
There are two kinds of condition. If the text between the parentheses consists |
There are two kinds of condition. If the text between the parentheses consists |
2035 |
of a sequence of digits, then the condition is satisfied if the capturing |
of a sequence of digits, the condition is satisfied if the capturing subpattern |
2036 |
subpattern of that number has previously matched. Consider the following |
of that number has previously matched. Consider the following pattern, which |
2037 |
pattern, which contains non-significant white space to make it more readable |
contains non-significant white space to make it more readable (assume the |
2038 |
(assume the PCRE_EXTENDED option) and to divide it into three parts for ease |
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: |
|
of discussion: |
|
2039 |
</P> |
</P> |
2040 |
<P> |
<P> |
2041 |
<PRE> |
<PRE> |
2156 |
^ ^ |
^ ^ |
2157 |
^ ^ |
^ ^ |
2158 |
</PRE> |
</PRE> |
2159 |
then the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
2160 |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
2161 |
has to obtain extra memory to store data during a recursion, which it does by |
has to obtain extra memory to store data during a recursion, which it does by |
2162 |
using <B>pcre_malloc</B>, freeing it via <B>pcre_free</B> afterwards. If no |
using <B>pcre_malloc</B>, freeing it via <B>pcre_free</B> afterwards. If no |