323 |
|
|
324 |
Return information about the first character of any matched string, for a |
Return information about the first character of any matched string, for a |
325 |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
326 |
such as (cat|cow|coyote), then it is returned in the integer pointed to by |
such as (cat|cow|coyote), it is returned in the integer pointed to by |
327 |
\fIwhere\fR. Otherwise, if either |
\fIwhere\fR. Otherwise, if either |
328 |
|
|
329 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
332 |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
333 |
(if it were set, the pattern would be anchored), |
(if it were set, the pattern would be anchored), |
334 |
|
|
335 |
then -1 is returned, indicating that the pattern matches only at the |
-1 is returned, indicating that the pattern matches only at the start of a |
336 |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
subject string or after any "\\n" within the string. Otherwise -2 is returned. |
337 |
returned. For anchored patterns, -2 is returned. |
For anchored patterns, -2 is returned. |
338 |
|
|
339 |
PCRE_INFO_FIRSTTABLE |
PCRE_INFO_FIRSTTABLE |
340 |
|
|
550 |
were captured by the match, including the substring that matched the entire |
were captured by the match, including the substring that matched the entire |
551 |
regular expression. This is the value returned by \fBpcre_exec\fR if it |
regular expression. This is the value returned by \fBpcre_exec\fR if it |
552 |
is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it |
is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it |
553 |
ran out of space in \fIovector\fR, then the value passed as |
ran out of space in \fIovector\fR, the value passed as \fIstringcount\fR should |
554 |
\fIstringcount\fR should be the size of the vector divided by three. |
be the size of the vector divided by three. |
555 |
|
|
556 |
The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR |
The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR |
557 |
extract a single substring, whose number is given as \fIstringnumber\fR. A |
extract a single substring, whose number is given as \fIstringnumber\fR. A |
650 |
with the settings of captured strings when part of a pattern is repeated. For |
with the settings of captured strings when part of a pattern is repeated. For |
651 |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value |
652 |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
"b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if |
653 |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set. |
the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set. |
654 |
|
|
655 |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the |
656 |
future Perl changes to a consistent state that is different, PCRE may change to |
future Perl changes to a consistent state that is different, PCRE may change to |
920 |
.SH FULL STOP (PERIOD, DOT) |
.SH FULL STOP (PERIOD, DOT) |
921 |
Outside a character class, a dot in the pattern matches any one character in |
Outside a character class, a dot in the pattern matches any one character in |
922 |
the subject, including a non-printing character, but not (by default) newline. |
the subject, including a non-printing character, but not (by default) newline. |
923 |
If the PCRE_DOTALL option is set, then dots match newlines as well. The |
If the PCRE_DOTALL option is set, dots match newlines as well. The handling of |
924 |
handling of dot is entirely independent of the handling of circumflex and |
dot is entirely independent of the handling of circumflex and dollar, the only |
925 |
dollar, the only relationship being that they both involve newline characters. |
relationship being that they both involve newline characters. Dot has no |
926 |
Dot has no special meaning in a character class. |
special meaning in a character class. |
927 |
|
|
928 |
|
|
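The dot rules above can be checked with a small sketch. It uses Python's `re` module rather than the PCRE C API, on the assumption that `re.DOTALL` behaves like PCRE_DOTALL for this purpose; the helper name `dot_matches` is made up for illustration.

```python
import re

def dot_matches(subject: str, dotall: bool = False) -> bool:
    """Return True if the whole subject matches the pattern a.b."""
    flags = re.DOTALL if dotall else 0
    return re.fullmatch(r"a.b", subject, flags) is not None

# A non-printing character such as a tab is matched by dot.
tab_ok = dot_matches("a\tb")
# By default a newline is not matched by dot.
newline_default = dot_matches("a\nb")
# With the DOTALL flag set, dot matches newline as well.
newline_dotall = dot_matches("a\nb", dotall=True)
# Inside a character class the dot is an ordinary character.
literal_only = re.fullmatch(r"a[.]b", "axb") is None
```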
929 |
.SH SQUARE BRACKETS |
.SH SQUARE BRACKETS |
1213 |
fails, because it matches the entire string due to the greediness of the .* |
fails, because it matches the entire string due to the greediness of the .* |
1214 |
item. |
item. |
1215 |
|
|
1216 |
However, if a quantifier is followed by a question mark, then it ceases to be |
However, if a quantifier is followed by a question mark, it ceases to be |
1217 |
greedy, and instead matches the minimum number of times possible, so the |
greedy, and instead matches the minimum number of times possible, so the |
1218 |
pattern |
pattern |
1219 |
|
|
1229 |
which matches one digit by preference, but can match two if that is the only |
which matches one digit by preference, but can match two if that is the only |
1230 |
way the rest of the pattern matches. |
way the rest of the pattern matches. |
1231 |
|
|
1232 |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl) |
If the PCRE_UNGREEDY option is set (an option which is not available in Perl), |
1233 |
then the quantifiers are not greedy by default, but individual ones can be made |
the quantifiers are not greedy by default, but individual ones can be made |
1234 |
greedy by following them with a question mark. In other words, it inverts the |
greedy by following them with a question mark. In other words, it inverts the |
1235 |
default behaviour. |
default behaviour. |
1236 |
|
|
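A minimal sketch of the greedy and non-greedy behaviour described above, again using Python's `re` module as a stand-in for PCRE. The pattern `\d??\d` is assumed here because the excerpt does not show the pattern it refers to, and `first_match` is a hypothetical helper.

```python
import re

def first_match(pattern: str, subject: str):
    """Return the text of the first match, or None if there is none."""
    m = re.search(pattern, subject)
    return m.group(0) if m else None

# Greedy optional digit: takes two digits when it can.
greedy = first_match(r"\d?\d", "12")
# Non-greedy optional digit: prefers to match a single digit.
lazy = first_match(r"\d??\d", "12")
# The non-greedy item still matches a digit when that is the only
# way for the rest of the pattern (here the end anchor) to succeed.
forced = first_match(r"\d??\d$", "12")
```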
1239 |
compiled pattern, in proportion to the size of the minimum or maximum. |
compiled pattern, in proportion to the size of the minimum or maximum. |
1240 |
|
|
1241 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
1242 |
to Perl's /s) is set, thus allowing the . to match newlines, then the pattern |
to Perl's /s) is set, thus allowing the . to match newlines, the pattern is |
1243 |
is implicitly anchored, because whatever follows will be tried against every |
implicitly anchored, because whatever follows will be tried against every |
1244 |
character position in the subject string, so there is no point in retrying the |
character position in the subject string, so there is no point in retrying the |
1245 |
overall match at any position after the first. PCRE treats such a pattern as |
overall match at any position after the first. PCRE treats such a pattern as |
1246 |
though it were preceded by \\A. In cases where it is known that the subject |
though it were preceded by \\A. In cases where it is known that the subject |
1284 |
|
|
1285 |
matches "sense and sensibility" and "response and responsibility", but not |
matches "sense and sensibility" and "response and responsibility", but not |
1286 |
"sense and responsibility". If caseful matching is in force at the time of the |
"sense and responsibility". If caseful matching is in force at the time of the |
1287 |
back reference, then the case of letters is relevant. For example, |
back reference, the case of letters is relevant. For example, |
1288 |
|
|
1289 |
((?i)rah)\\s+\\1 |
((?i)rah)\\s+\\1 |
1290 |
|
|
1292 |
capturing subpattern is matched caselessly. |
capturing subpattern is matched caselessly. |
1293 |
|
|
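The back reference behaviour above can be sketched in Python's `re` module. Two assumptions are made: the first pattern is a guess at the one the excerpt omits, and the scoped form `(?i:rah)` stands in for `(?i)rah` inside the group, since Python does not accept an unscoped inline option in the middle of a pattern. The helper `whole` is hypothetical.

```python
import re

# Assumed pattern for the "sense and sensibility" example.
pair = re.compile(r"(sens|respons)e and \1ibility")

# The group is matched caselessly, but the back reference is caseful,
# so it must repeat exactly the text that the group captured.
rah = re.compile(r"((?i:rah))\s+\1")

def whole(pattern: re.Pattern, subject: str) -> bool:
    """Return True if the pattern matches the entire subject."""
    return pattern.fullmatch(subject) is not None

same_word = whole(pair, "sense and sensibility")
mixed_word = whole(pair, "sense and responsibility")
same_case = whole(rah, "RAH RAH")
mixed_case = whole(rah, "RAH rah")
```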
1294 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
1295 |
subpattern has not actually been used in a particular match, then any back |
subpattern has not actually been used in a particular match, any back |
1296 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
1297 |
|
|
1298 |
(a|(bc))\\2 |
(a|(bc))\\2 |
1300 |
always fails if it starts to match "a" rather than "bc". Because there may be |
always fails if it starts to match "a" rather than "bc". Because there may be |
1301 |
up to 99 back references, all digits following the backslash are taken |
up to 99 back references, all digits following the backslash are taken |
1302 |
as part of a potential back reference number. If the pattern continues with a |
as part of a potential back reference number. If the pattern continues with a |
1303 |
digit character, then some delimiter must be used to terminate the back |
digit character, some delimiter must be used to terminate the back reference. |
1304 |
reference. If the PCRE_EXTENDED option is set, this can be whitespace. |
If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty |
1305 |
Otherwise an empty comment can be used. |
comment can be used. |
1306 |
|
|
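As a rough illustration of the two points above, the following Python `re` sketch shows a back reference to an unused subpattern failing, and an empty comment used as a delimiter before a literal digit. Python is only a stand-in for PCRE here, and the variable names are invented for the example.

```python
import re

# Group 2 is set only when the "bc" alternative is taken.
unused = re.compile(r"(a|(bc))\2")

# Group 2 never took part when "a" matched, so \2 fails.
after_a = unused.fullmatch("a")
# When "bc" matched, \2 must repeat it.
after_bc = unused.fullmatch("bcbc")

# An empty comment ends the back reference number, so the
# following 0 is a literal digit rather than part of the number.
delimited = re.compile(r"(a)\1(?#)0")
with_digit = delimited.fullmatch("aa0")
```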
1307 |
A back reference that occurs inside the parentheses to which it refers fails |
A back reference that occurs inside the parentheses to which it refers fails |
1308 |
when the subpattern is first used, so, for example, (a\\1) never matches. |
when the subpattern is first used, so, for example, (a\\1) never matches. |
1390 |
matches "foo" preceded by three digits that are not "999". Notice that each of |
matches "foo" preceded by three digits that are not "999". Notice that each of |
1391 |
the assertions is applied independently at the same point in the subject |
the assertions is applied independently at the same point in the subject |
1392 |
string. First there is a check that the previous three characters are all |
string. First there is a check that the previous three characters are all |
1393 |
digits, then there is a check that the same three characters are not "999". |
digits, and then there is a check that the same three characters are not "999". |
1394 |
This pattern does \fInot\fR match "foo" preceded by six characters, the first |
This pattern does \fInot\fR match "foo" preceded by six characters, the first |
1395 |
of which are digits and the last three of which are not "999". For example, it |
of which are digits and the last three of which are not "999". For example, it |
1396 |
doesn't match "123abcfoo". A pattern to do that is |
doesn't match "123abcfoo". A pattern to do that is |
1475 |
|
|
1476 |
^.*abcd$ |
^.*abcd$ |
1477 |
|
|
1478 |
then the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails (because |
1479 |
(because there is no following "a"), it backtracks to match all but the last |
there is no following "a"), it backtracks to match all but the last character, |
1480 |
character, then all but the last two characters, and so on. Once again the |
then all but the last two characters, and so on. Once again the search for "a" |
1481 |
search for "a" covers the entire string, from right to left, so we are no |
covers the entire string, from right to left, so we are no better off. However, |
1482 |
better off. However, if the pattern is written as |
if the pattern is written as |
1483 |
|
|
1484 |
^(?>.*)(?<=abcd) |
^(?>.*)(?<=abcd) |
1485 |
|
|
1486 |
then there can be no backtracking for the .* item; it can match only the entire |
there can be no backtracking for the .* item; it can match only the entire |
1487 |
string. The subsequent lookbehind assertion does a single test on the last four |
string. The subsequent lookbehind assertion does a single test on the last four |
1488 |
characters. If it fails, the match fails immediately. For long strings, this |
characters. If it fails, the match fails immediately. For long strings, this |
1489 |
approach makes a significant difference to the processing time. |
approach makes a significant difference to the processing time. |
1528 |
subpattern, a compile-time error occurs. |
subpattern, a compile-time error occurs. |
1529 |
|
|
1530 |
There are two kinds of condition. If the text between the parentheses consists |
There are two kinds of condition. If the text between the parentheses consists |
1531 |
of a sequence of digits, then the condition is satisfied if the capturing |
of a sequence of digits, the condition is satisfied if the capturing subpattern |
1532 |
subpattern of that number has previously matched. Consider the following |
of that number has previously matched. Consider the following pattern, which |
1533 |
pattern, which contains non-significant white space to make it more readable |
contains non-significant white space to make it more readable (assume the |
1534 |
(assume the PCRE_EXTENDED option) and to divide it into three parts for ease |
PCRE_EXTENDED option) and to divide it into three parts for ease of discussion: |
of discussion: |
1535 |
|
|
1536 |
( \\( )? [^()]+ (?(1) \\) ) |
( \\( )? [^()]+ (?(1) \\) ) |
1537 |
|
|
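The conditional pattern above can be tried out with Python's `re` module, which accepts the same `(?(1)...)` syntax. This is a sketch under the assumption that `re.VERBOSE` plays the role of PCRE_EXTENDED; the helper `is_balanced` is hypothetical.

```python
import re

# Part 1: an optional opening parenthesis, captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: a closing parenthesis, required only if group 1 matched.
conditional = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

def is_balanced(subject: str) -> bool:
    """Return True if the whole subject is text with optional matched parentheses."""
    return conditional.fullmatch(subject) is not None

bare = is_balanced("abc")
wrapped = is_balanced("(abc)")
missing_close = is_balanced("(abc")
missing_open = is_balanced("abc)")
```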
1621 |
\\( ( ( (?>[^()]+) | (?R) )* ) \\) |
\\( ( ( (?>[^()]+) | (?R) )* ) \\) |
1622 |
^ ^ |
^ ^ |
1623 |
^ ^ |
^ ^ |
1624 |
then the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
1625 |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
1626 |
has to obtain extra memory to store data during a recursion, which it does by |
has to obtain extra memory to store data during a recursion, which it does by |
1627 |
using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no |
using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no |