264 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
265 |
|
|
266 |
If the \fIoptptr\fR argument is not NULL, a copy of the options with which the |
If the \fIoptptr\fR argument is not NULL, a copy of the options with which the |
267 |
pattern was compiled is placed in the integer it points to. |
pattern was compiled is placed in the integer it points to. These option bits |
268 |
|
are those specified in the call to \fBpcre_compile()\fR, modified by any |
269 |
|
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
270 |
|
bit set if the form of the pattern implies that it can match only at the start |
271 |
|
of a subject string. |
272 |
|
|
273 |
|
If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, |
274 |
|
it is used to pass back information about the first character of any matched |
275 |
|
string. If there is a fixed first character, e.g. from a pattern such as |
276 |
|
(cat|cow|coyote), then it is returned in the integer pointed to by |
277 |
|
\fIfirstcharptr\fR. Otherwise, if either |
278 |
|
|
279 |
If the \fIfirstcharptr\fR argument is not NULL, is is used to pass back |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
280 |
information about the first character of any matched string. If there is a |
starts with "^", or |
281 |
fixed first character, e.g. from a pattern such as (cat|cow|coyote), then it is |
|
282 |
returned in the integer pointed to by \fIfirstcharptr\fR. Otherwise, if the |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set |
283 |
pattern was compiled with the PCRE_MULTILINE option, and every branch started |
(if it were set, the pattern would be anchored), |
284 |
with "^", then -1 is returned, indicating that the pattern will match at the |
|
285 |
|
then -1 is returned, indicating that the pattern matches only at the |
286 |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
287 |
returned. |
returned. |
288 |
|
|
1061 |
is greater than 1 or with a limited maximum, more store is required for the |
is greater than 1 or with a limited maximum, more store is required for the |
1062 |
compiled pattern, in proportion to the size of the minimum or maximum. |
compiled pattern, in proportion to the size of the minimum or maximum. |
1063 |
|
|
1064 |
If a pattern starts with .* then it is implicitly anchored, since whatever |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent |
1065 |
follows will be tried against every character position in the subject string. |
to Perl's /s) is set, thus allowing the . to match newlines, then the pattern |
1066 |
PCRE treats this as though it were preceded by \\A. |
is implicitly anchored, because whatever follows will be tried against every |
1067 |
|
character position in the subject string, so there is no point in retrying the |
1068 |
|
overall match at any position after the first. PCRE treats such a pattern as |
1069 |
|
though it were preceded by \\A. In cases where it is known that the subject |
1070 |
|
string contains no newlines, it is worth setting PCRE_DOTALL when the pattern |
1071 |
|
begins with .* in order to obtain this optimization, or alternatively using ^ |
1072 |
|
to indicate anchoring explicitly. |
1073 |
|
|
1074 |
When a capturing subpattern is repeated, the value captured is the substring |
When a capturing subpattern is repeated, the value captured is the substring |
1075 |
that matched the final iteration. For example, after |
that matched the final iteration. For example, after |
1279 |
then see if what follows matches the rest of the pattern. If the pattern is |
then see if what follows matches the rest of the pattern. If the pattern is |
1280 |
specified as |
specified as |
1281 |
|
|
1282 |
.*abcd$ |
^.*abcd$ |
1283 |
|
|
1284 |
then the initial .* matches the entire string at first, but when this fails, it |
then the initial .* matches the entire string at first, but when this fails, it |
1285 |
backtracks to match all but the last character, then all but the last two |
backtracks to match all but the last character, then all but the last two |
1287 |
from right to left, so we are no better off. However, if the pattern is written |
from right to left, so we are no better off. However, if the pattern is written |
1288 |
as |
as |
1289 |
|
|
1290 |
(?>.*)(?<=abcd) |
^(?>.*)(?<=abcd) |
1291 |
|
|
1292 |
then there can be no backtracking for the .* item; it can match only the entire |
then there can be no backtracking for the .* item; it can match only the entire |
1293 |
string. The subsequent lookbehind assertion does a single test on the last four |
string. The subsequent lookbehind assertion does a single test on the last four |
1361 |
contains a lot of discussion about optimizing regular expressions for efficient |
contains a lot of discussion about optimizing regular expressions for efficient |
1362 |
performance. |
performance. |
1363 |
|
|
1364 |
|
When a pattern begins with .* and the PCRE_DOTALL option is set, the pattern is |
1365 |
|
implicitly anchored by PCRE, since it can match only at the start of a subject |
1366 |
|
string. However, if PCRE_DOTALL is not set, PCRE cannot make this optimization, |
1367 |
|
because the . metacharacter does not then match a newline, and if the subject |
1368 |
|
string contains newlines, the pattern may match from the character immediately |
1369 |
|
following one of them instead of from the very start. For example, the pattern |
1370 |
|
|
1371 |
|
(.*) second |
1372 |
|
|
1373 |
|
matches the subject "first\\nand second" (where \\n stands for a newline |
1374 |
|
character) with the first captured substring being "and". In order to do this, |
1375 |
|
PCRE has to retry the match starting after every newline in the subject. |
1376 |
|
|
1377 |
|
If you are using such a pattern with subject strings that do not contain |
1378 |
|
newlines, the best performance is obtained by setting PCRE_DOTALL, or starting |
1379 |
|
the pattern with ^.* to indicate explicit anchoring. That saves PCRE from |
1380 |
|
having to scan along the subject looking for a newline to restart at. |
1381 |
|
|
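The behaviour described in the added paragraph above can be tried out with a minimal sketch. It assumes Python's standard re module as a stand-in, because its "." also refuses to match a newline unless a DOTALL flag is given; it does not call PCRE itself, and the variable names are illustrative only.

```python
import re

# Subject from the text above; "\n" is a real newline character.
subject = "first\nand second"

# Without DOTALL, "." stops at a newline, so the match may begin just after it.
m_plain = re.search(r"(.*) second", subject)

# With DOTALL, "." also matches the newline, so the match begins at offset 0.
m_dotall = re.search(r"(.*) second", subject, re.DOTALL)

# An explicit "^" (multiline mode not set) pins the attempt to offset 0,
# where it fails for this subject because "." cannot cross the newline.
m_caret = re.search(r"^(.*) second", subject)

print(repr(m_plain.group(1)), m_plain.start())
print(repr(m_dotall.group(1)), m_dotall.start())
print(m_caret)
```

In the unanchored case the match starts at offset 6, just after the newline, which is why the matcher has to retry after each newline. Setting DOTALL or writing an explicit ^ changes what the pattern matches when the subject does contain newlines, so the advice applies only to subjects known to be free of them.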
1382 |
.SH AUTHOR |
.SH AUTHOR |
1383 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |