47 |
.B const unsigned char *pcre_maketables(void); |
.B const unsigned char *pcre_maketables(void); |
48 |
.PP |
.PP |
49 |
.br |
.br |
50 |
|
.B int pcre_fullinfo(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR," |
51 |
|
.ti +5n |
52 |
|
.B int \fIwhat\fR, void *\fIwhere\fR); |
53 |
|
.PP |
54 |
|
.br |
55 |
.B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int |
.B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int |
56 |
.B *\fIfirstcharptr\fR); |
.B *\fIfirstcharptr\fR); |
57 |
.PP |
.PP |
69 |
.SH DESCRIPTION |
.SH DESCRIPTION |
70 |
The PCRE library is a set of functions that implement regular expression |
The PCRE library is a set of functions that implement regular expression |
71 |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
72 |
differences (see below). The current implementation corresponds to Perl 5.005. |
differences (see below). The current implementation corresponds to Perl 5.005, |
73 |
|
with some additional features from the Perl development release. |
74 |
|
|
75 |
PCRE has its own native API, which is described in this document. There is also |
PCRE has its own native API, which is described in this document. There is also |
76 |
a set of wrapper functions that correspond to the POSIX API. These are |
a set of wrapper functions that correspond to the POSIX regular expression API. |
77 |
described in the \fBpcreposix\fR documentation. |
These are described in the \fBpcreposix\fR documentation. |
78 |
|
|
79 |
The native API function prototypes are defined in the header file \fBpcre.h\fR, |
The native API function prototypes are defined in the header file \fBpcre.h\fR, |
80 |
and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be |
and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be |
81 |
accessed by adding \fB-lpcre\fR to the command for linking an application which |
accessed by adding \fB-lpcre\fR to the command for linking an application which |
82 |
calls it. |
calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to |
83 |
|
contain the major and minor release numbers for the library. Applications can |
84 |
|
use these to include support for different releases. |
85 |
|
|
86 |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR |
87 |
are used for compiling and matching regular expressions, while |
are used for compiling and matching regular expressions, while |
91 |
\fBpcre_maketables()\fR is used (optionally) to build a set of character tables |
\fBpcre_maketables()\fR is used (optionally) to build a set of character tables |
92 |
in the current locale for passing to \fBpcre_compile()\fR. |
in the current locale for passing to \fBpcre_compile()\fR. |
93 |
|
|
94 |
The function \fBpcre_info()\fR is used to find out information about a compiled |
The function \fBpcre_fullinfo()\fR is used to find out information about a |
95 |
pattern, while the function \fBpcre_version()\fR returns a pointer to a string |
compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only |
96 |
containing the version of PCRE and its date of release. |
some of the available information, but is retained for backwards compatibility. |
97 |
|
The function \fBpcre_version()\fR returns a pointer to a string containing the |
98 |
|
version of PCRE and its date of release. |
99 |
|
|
100 |
The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain |
The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain |
101 |
the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions |
the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions |
192 |
|
|
193 |
PCRE_EXTRA |
PCRE_EXTRA |
194 |
|
|
195 |
This option turns on additional functionality of PCRE that is incompatible with |
This option was invented in order to turn on additional functionality of PCRE |
196 |
Perl. Any backslash in a pattern that is followed by a letter that has no |
that is incompatible with Perl, but it is currently of very little use. When |
197 |
|
set, any backslash in a pattern that is followed by a letter that has no |
198 |
special meaning causes an error, thus reserving these combinations for future |
special meaning causes an error, thus reserving these combinations for future |
199 |
expansion. By default, as in Perl, a backslash followed by a letter with no |
expansion. By default, as in Perl, a backslash followed by a letter with no |
200 |
special meaning is treated as a literal. There are at present no other features |
special meaning is treated as a literal. There are at present no other features |
201 |
controlled by this option. |
controlled by this option. It can also be set by a (?X) option setting within a |
202 |
|
pattern. |
203 |
|
|
204 |
PCRE_MULTILINE |
PCRE_MULTILINE |
205 |
|
|
273 |
|
|
274 |
|
|
275 |
.SH INFORMATION ABOUT A PATTERN |
.SH INFORMATION ABOUT A PATTERN |
276 |
The \fBpcre_info()\fR function returns information about a compiled pattern. |
The \fBpcre_fullinfo()\fR function returns information about a compiled |
277 |
Its yield is the number of capturing subpatterns, or one of the following |
pattern. It replaces the obsolete \fBpcre_info()\fR function, which is |
278 |
negative numbers: |
nevertheless retained for backwards compatibility (and is documented below). |
279 |
|
|
280 |
|
The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled |
281 |
|
pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if |
282 |
|
the pattern was not studied. The third argument specifies which piece of |
283 |
|
information is required, while the fourth argument is a pointer to a variable |
284 |
|
to receive the data. The yield of the function is zero for success, or one of |
285 |
|
the following negative numbers: |
286 |
|
|
287 |
PCRE_ERROR_NULL the argument \fIcode\fR was NULL |
PCRE_ERROR_NULL the argument \fIcode\fR was NULL |
288 |
|
the argument \fIwhere\fR was NULL |
289 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
290 |
|
PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid |
291 |
|
|
292 |
If the \fIoptptr\fR argument is not NULL, a copy of the options with which the |
The possible values for the third argument are defined in \fBpcre.h\fR, and are |
293 |
pattern was compiled is placed in the integer it points to. These option bits |
as follows: |
294 |
|
|
295 |
|
PCRE_INFO_OPTIONS |
296 |
|
|
297 |
|
Return a copy of the options with which the pattern was compiled. The fourth |
298 |
|
argument should point to an \fBunsigned long int\fR variable. These option bits |
299 |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
are those specified in the call to \fBpcre_compile()\fR, modified by any |
300 |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
top-level option settings within the pattern itself, and with the PCRE_ANCHORED |
301 |
bit set if the form of the pattern implies that it can match only at the start |
bit forcibly set if the form of the pattern implies that it can match only at |
302 |
of a subject string. |
the start of a subject string. |
303 |
|
|
304 |
If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, |
PCRE_INFO_SIZE |
305 |
it is used to pass back information about the first character of any matched |
|
306 |
string. If there is a fixed first character, e.g. from a pattern such as |
Return the size of the compiled pattern, that is, the value that was passed as |
307 |
(cat|cow|coyote), then it is returned in the integer pointed to by |
the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to |
308 |
\fIfirstcharptr\fR. Otherwise, if either |
place the compiled data. The fourth argument should point to a \fBsize_t\fR |
309 |
|
variable. |
310 |
|
|
311 |
|
PCRE_INFO_CAPTURECOUNT |
312 |
|
|
313 |
|
Return the number of capturing subpatterns in the pattern. The fourth argument |
314 |
|
should point to an \fBint\fR variable. |
315 |
|
|
316 |
|
PCRE_INFO_BACKREFMAX |
317 |
|
|
318 |
|
Return the number of the highest back reference in the pattern. The fourth |
319 |
|
argument should point to an \fBint\fR variable. Zero is returned if there are |
320 |
|
no back references. |
321 |
|
|
322 |
|
PCRE_INFO_FIRSTCHAR |
323 |
|
|
324 |
|
Return information about the first character of any matched string, for a |
325 |
|
non-anchored pattern. If there is a fixed first character, e.g. from a pattern |
326 |
|
such as (cat|cow|coyote), then it is returned in the integer pointed to by |
327 |
|
\fIwhere\fR. Otherwise, if either |
328 |
|
|
329 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every branch |
330 |
starts with "^", or |
starts with "^", or |
334 |
|
|
335 |
then -1 is returned, indicating that the pattern matches only at the |
then -1 is returned, indicating that the pattern matches only at the |
336 |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
start of a subject string or after any "\\n" within the string. Otherwise -2 is |
337 |
returned. |
returned. For anchored patterns, -2 is returned. |
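
The three kinds of result described for PCRE_INFO_FIRSTCHAR (a character code, -1, or -2) can be told apart with a small helper. The following is only a sketch: \fBdescribe_firstchar()\fR is a hypothetical name that is not part of PCRE, and the wording of the messages is an assumption made for illustration.

```c
#include <stdio.h>

/* Hypothetical helper (not part of PCRE): describe the integer that
   PCRE_INFO_FIRSTCHAR stores through the "where" pointer. */
static const char *describe_firstchar(int value, char *buf, size_t buflen)
{
    if (value == -1)
        return "matches only at the start of the subject or after a newline";
    if (value == -2)
        return "no first-character information, or the pattern is anchored";
    if (value >= 0) {
        /* A fixed first character, e.g. 'c' for (cat|cow|coyote). */
        snprintf(buf, buflen,
                 "every match starts with character code %d", value);
        return buf;
    }
    return "unexpected value";
}
```

A caller would typically obtain \fIvalue\fR from a call such as pcre_fullinfo(code, NULL, PCRE_INFO_FIRSTCHAR, &value) and check that the call itself returned zero before interpreting the result.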
338 |
|
|
339 |
|
PCRE_INFO_FIRSTTABLE |
340 |
|
|
341 |
|
If the pattern was studied, and this resulted in the construction of a 256-bit |
342 |
|
table indicating a fixed set of characters for the first character in any |
343 |
|
matching string, a pointer to the table is returned. Otherwise NULL is |
344 |
|
returned. The fourth argument should point to an \fBunsigned char *\fR |
345 |
|
variable. |
346 |
|
|
347 |
|
PCRE_INFO_LASTLITERAL |
348 |
|
|
349 |
|
For a non-anchored pattern, return the value of the rightmost literal character |
350 |
|
which must exist in any matched string, other than at its start. The fourth |
351 |
|
argument should point to an \fBint\fR variable. If there is no such character, |
352 |
|
or if the pattern is anchored, -1 is returned. For example, for the pattern |
353 |
|
/a\\d+z\\d+/ the returned value is 'z'. |
354 |
|
|
355 |
|
The \fBpcre_info()\fR function is now obsolete because its interface is too |
356 |
|
restrictive to return all the available data about a compiled pattern. New |
357 |
|
programs should use \fBpcre_fullinfo()\fR instead. The yield of |
358 |
|
\fBpcre_info()\fR is the number of capturing subpatterns, or one of the |
359 |
|
following negative numbers: |
360 |
|
|
361 |
|
PCRE_ERROR_NULL the argument \fIcode\fR was NULL |
362 |
|
PCRE_ERROR_BADMAGIC the "magic number" was not found |
363 |
|
|
364 |
|
If the \fIoptptr\fR argument is not NULL, a copy of the options with which the |
365 |
|
pattern was compiled is placed in the integer it points to (see |
366 |
|
PCRE_INFO_OPTIONS above). |
367 |
|
|
368 |
|
If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, |
369 |
|
it is used to pass back information about the first character of any matched |
370 |
|
string (see PCRE_INFO_FIRSTCHAR above). |
371 |
|
|
372 |
|
|
373 |
.SH MATCHING A PATTERN |
.SH MATCHING A PATTERN |
642 |
6. The Perl \\G assertion is not supported as it is not relevant to single |
6. The Perl \\G assertion is not supported as it is not relevant to single |
643 |
pattern matches. |
pattern matches. |
644 |
|
|
645 |
7. Fairly obviously, PCRE does not support the (?{code}) construction. |
7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) |
646 |
|
constructions. However, there is some experimental support for recursive |
647 |
|
patterns using the non-Perl item (?R). |
648 |
|
|
649 |
8. There are at the time of writing some oddities in Perl 5.005_02 concerned |
8. There are at the time of writing some oddities in Perl 5.005_02 concerned |
650 |
with the settings of captured strings when part of a pattern is repeated. For |
with the settings of captured strings when part of a pattern is repeated. For |
682 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for |
683 |
\fBpcre_exec()\fR have no Perl equivalents. |
\fBpcre_exec()\fR have no Perl equivalents. |
684 |
|
|
685 |
|
(g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do |
686 |
|
this using the (?p{code}) construct, which PCRE cannot of course support.) |
687 |
|
|
688 |
|
|
689 |
.SH REGULAR EXPRESSION DETAILS |
.SH REGULAR EXPRESSION DETAILS |
690 |
The syntax and semantics of the regular expressions supported by PCRE are |
The syntax and semantics of the regular expressions supported by PCRE are |
691 |
described below. Regular expressions are also described in the Perl |
described below. Regular expressions are also described in the Perl |
692 |
documentation and in a number of other books, some of which have copious |
documentation and in a number of other books, some of which have copious |
693 |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
694 |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description |
O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description |
695 |
here is intended as reference documentation. |
here is intended as reference documentation. |
696 |
|
|
697 |
A regular expression is a pattern that is matched against a subject string from |
A regular expression is a pattern that is matched against a subject string from |
989 |
are escaped. |
are escaped. |
990 |
|
|
991 |
|
|
992 |
|
.SH POSIX CHARACTER CLASSES |
993 |
|
Perl 5.6 (not yet released at the time of writing) is going to support the |
994 |
|
POSIX notation for character classes, which uses names enclosed by [: and :] |
995 |
|
within the enclosing square brackets. PCRE supports this notation. For example, |
996 |
|
|
997 |
|
[01[:alpha:]%] |
998 |
|
|
999 |
|
matches "0", "1", any alphabetic character, or "%". The supported class names |
1000 |
|
are |
1001 |
|
|
1002 |
|
alnum letters and digits |
1003 |
|
alpha letters |
1004 |
|
ascii character codes 0 - 127 |
1005 |
|
cntrl control characters |
1006 |
|
digit decimal digits (same as \\d) |
1007 |
|
graph printing characters, excluding space |
1008 |
|
lower lower case letters |
1009 |
|
print printing characters, including space |
1010 |
|
punct printing characters, excluding letters and digits |
1011 |
|
space white space (same as \\s) |
1012 |
|
upper upper case letters |
1013 |
|
word "word" characters (same as \\w) |
1014 |
|
xdigit hexadecimal digits |
1015 |
|
|
1016 |
|
The names "ascii" and "word" are Perl extensions. Another Perl extension is |
1017 |
|
negation, which is indicated by a ^ character after the colon. For example, |
1018 |
|
|
1019 |
|
[12[:^digit:]] |
1020 |
|
|
1021 |
|
matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX |
1022 |
|
syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not |
1023 |
|
supported, and an error is given if they are encountered. |
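
As a rough guide to what each class name covers, the table above can be approximated with the standard <ctype.h> predicates. This is a sketch under stated assumptions: \fBin_posix_class()\fR is a hypothetical helper, not a PCRE function, and PCRE consults its own character tables rather than calling <ctype.h> during matching, so results may differ in detail (for instance in which characters count as white space).

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: test one character against a POSIX class name
   from the list above. Returns 1 or 0, or -1 for an unknown name.
   A leading '^' negates the class, as in [:^digit:]. */
static int in_posix_class(const char *name, unsigned char c)
{
    int negate = (name[0] == '^');
    int r;

    if (negate)
        name++;

    if      (strcmp(name, "alnum")  == 0) r = isalnum(c);
    else if (strcmp(name, "alpha")  == 0) r = isalpha(c);
    else if (strcmp(name, "ascii")  == 0) r = (c <= 127);
    else if (strcmp(name, "cntrl")  == 0) r = iscntrl(c);
    else if (strcmp(name, "digit")  == 0) r = isdigit(c);
    else if (strcmp(name, "graph")  == 0) r = isgraph(c);
    else if (strcmp(name, "lower")  == 0) r = islower(c);
    else if (strcmp(name, "print")  == 0) r = isprint(c);
    else if (strcmp(name, "punct")  == 0) r = ispunct(c);
    else if (strcmp(name, "space")  == 0) r = isspace(c);
    else if (strcmp(name, "upper")  == 0) r = isupper(c);
    else if (strcmp(name, "word")   == 0) r = (isalnum(c) || c == '_');
    else if (strcmp(name, "xdigit") == 0) r = isxdigit(c);
    else return -1;                 /* not one of the supported names */

    r = (r != 0);                   /* ctype results are merely non-zero */
    return negate ? !r : r;
}
```

With this helper, the class [12[:^digit:]] corresponds to the test c == '1' || c == '2' || in_posix_class("^digit", c) == 1.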
1024 |
|
|
1025 |
|
|
1026 |
.SH VERTICAL BAR |
.SH VERTICAL BAR |
1027 |
Vertical bar characters are used to separate alternative patterns. For example, |
Vertical bar characters are used to separate alternative patterns. For example, |
1028 |
the pattern |
the pattern |
1469 |
|
|
1470 |
abcd$ |
abcd$ |
1471 |
|
|
1472 |
when applied to a long string which does not match it. Because matching |
when applied to a long string which does not match. Because matching proceeds |
1473 |
proceeds from left to right, PCRE will look for each "a" in the subject and |
from left to right, PCRE will look for each "a" in the subject and then see if |
1474 |
then see if what follows matches the rest of the pattern. If the pattern is |
what follows matches the rest of the pattern. If the pattern is specified as |
|
specified as |
|
1475 |
|
|
1476 |
^.*abcd$ |
^.*abcd$ |
1477 |
|
|
1478 |
then the initial .* matches the entire string at first, but when this fails, it |
then the initial .* matches the entire string at first, but when this fails |
1479 |
backtracks to match all but the last character, then all but the last two |
(because there is no following "a"), it backtracks to match all but the last |
1480 |
characters, and so on. Once again the search for "a" covers the entire string, |
character, then all but the last two characters, and so on. Once again the |
1481 |
from right to left, so we are no better off. However, if the pattern is written |
search for "a" covers the entire string, from right to left, so we are no |
1482 |
as |
better off. However, if the pattern is written as |
1483 |
|
|
1484 |
^(?>.*)(?<=abcd) |
^(?>.*)(?<=abcd) |
1485 |
|
|
1488 |
characters. If it fails, the match fails immediately. For long strings, this |
characters. If it fails, the match fails immediately. For long strings, this |
1489 |
approach makes a significant difference to the processing time. |
approach makes a significant difference to the processing time. |
1490 |
|
|
1491 |
|
When a pattern contains an unlimited repeat inside a subpattern that can itself |
1492 |
|
be repeated an unlimited number of times, the use of a once-only subpattern is |
1493 |
|
the only way to avoid some failing matches taking a very long time indeed. |
1494 |
|
The pattern |
1495 |
|
|
1496 |
|
(\\D+|<\\d+>)*[!?] |
1497 |
|
|
1498 |
|
matches an unlimited number of substrings that either consist of non-digits, or |
1499 |
|
digits enclosed in <>, followed by either ! or ?. When it matches, it runs |
1500 |
|
quickly. However, if it is applied to |
1501 |
|
|
1502 |
|
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
1503 |
|
|
1504 |
|
it takes a long time before reporting failure. This is because the string can |
1505 |
|
be divided between the two repeats in a large number of ways, and all have to |
1506 |
|
be tried. (The example used [!?] rather than a single character at the end, |
1507 |
|
because both PCRE and Perl have an optimization that allows for fast failure |
1508 |
|
when a single character is used. They remember the last single character that |
1509 |
|
is required for a match, and fail early if it is not present in the string.) |
1510 |
|
If the pattern is changed to |
1511 |
|
|
1512 |
|
((?>\\D+)|<\\d+>)*[!?] |
1513 |
|
|
1514 |
|
sequences of non-digits cannot be broken, and failure happens quickly. |
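
To give a feel for "a large number of ways", the sketch below counts how many ways a run of n non-digit characters can be cut into consecutive non-empty pieces, which is how the inner \\D+ and the outer * can share the run between them. The function name \fBcount_splits()\fR is hypothetical, and the count is only an indication of the search space, not a measurement of what PCRE actually does.

```c
/* Count the ways a run of n non-digits can be split into consecutive
   non-empty pieces. Each of the n-1 gaps between characters is either
   a cut or not, so the result is 2^(n-1) for n >= 1; the recursion
   confirms this by brute force and is only practical for small n. */
static unsigned long long count_splits(unsigned n)
{
    unsigned long long total = 0;
    unsigned first;

    if (n == 0)
        return 1;                       /* nothing left: one way */

    /* "first" is the length consumed by this iteration of \D+ */
    for (first = 1; first <= n; first++)
        total += count_splits(n - first);

    return total;
}
```

For a run of 50 characters the formula gives 2^49, roughly 5.6e14 divisions. With the once-only form (?>\\D+) the whole run is taken as one piece and never subdivided, which is why failure is then reported quickly.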
1515 |
|
|
1516 |
|
|
1517 |
.SH CONDITIONAL SUBPATTERNS |
.SH CONDITIONAL SUBPATTERNS |
1518 |
It is possible to cause the matching process to obey a subpattern |
It is possible to cause the matching process to obey a subpattern |
1572 |
character in the pattern. |
character in the pattern. |
1573 |
|
|
1574 |
|
|
1575 |
|
.SH RECURSIVE PATTERNS |
1576 |
|
Consider the problem of matching a string in parentheses, allowing for |
1577 |
|
unlimited nested parentheses. Without the use of recursion, the best that can |
1578 |
|
be done is to use a pattern that matches up to some fixed depth of nesting. It |
1579 |
|
is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an |
1580 |
|
experimental facility that allows regular expressions to recurse (amongst other |
1581 |
|
things). It does this by interpolating Perl code in the expression at run time, |
1582 |
|
and the code can refer to the expression itself. A Perl pattern to solve the |
1583 |
|
parentheses problem can be created like this: |
1584 |
|
|
1585 |
|
$re = qr{\\( (?: (?>[^()]+) | (?p{$re}) )* \\)}x; |
1586 |
|
|
1587 |
|
The (?p{...}) item interpolates Perl code at run time, and in this case refers |
1588 |
|
recursively to the pattern in which it appears. Obviously, PCRE cannot support |
1589 |
|
the interpolation of Perl code. Instead, the special item (?R) is provided for |
1590 |
|
the specific case of recursion. This PCRE pattern solves the parentheses |
1591 |
|
problem (assume the PCRE_EXTENDED option is set so that white space is |
1592 |
|
ignored): |
1593 |
|
|
1594 |
|
\\( ( (?>[^()]+) | (?R) )* \\) |
1595 |
|
|
1596 |
|
First it matches an opening parenthesis. Then it matches any number of |
1597 |
|
substrings which can either be a sequence of non-parentheses, or a recursive |
1598 |
|
match of the pattern itself (i.e. a correctly parenthesized substring). Finally |
1599 |
|
there is a closing parenthesis. |
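
The behaviour of this pattern can be pictured as an ordinary recursive function. The following is a hand-written sketch, not PCRE code: \fBmatch_parens()\fR is a hypothetical name, and unlike an unanchored pattern it attempts a match only at the start of the string it is given.

```c
#include <stddef.h>

/* Hand-coded equivalent of  \( ( (?>[^()]+) | (?R) )* \)
   Returns the number of characters matched at the start of s,
   or 0 if there is no match there. */
static size_t match_parens(const char *s)
{
    size_t i = 0;

    if (s[i] != '(')                    /* \(  opening parenthesis */
        return 0;
    i++;

    for (;;) {                          /* ( ... | ... )*  */
        if (s[i] == '(') {
            /* (?R): a correctly parenthesized substring */
            size_t inner = match_parens(s + i);
            if (inner == 0)
                return 0;
            i += inner;
        } else if (s[i] != ')' && s[i] != '\0') {
            /* (?>[^()]+): take the whole run and never give any back */
            while (s[i] != '\0' && s[i] != '(' && s[i] != ')')
                i++;
        } else {
            break;
        }
    }

    if (s[i] != ')')                    /* \)  closing parenthesis */
        return 0;
    return i + 1;
}
```

Applied to "(ab(cd)ef)" the function consumes all ten characters. Applied to a string whose first parenthesis is never closed, it fails after a single left-to-right pass, which mirrors the effect of the once-only subpattern.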
1600 |
|
|
1601 |
|
This particular example pattern contains nested unlimited repeats, and so the |
1602 |
|
use of a once-only subpattern for matching strings of non-parentheses is |
1603 |
|
important when applying the pattern to strings that do not match. For example, |
1604 |
|
when it is applied to |
1605 |
|
|
1606 |
|
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
1607 |
|
|
1608 |
|
it yields "no match" quickly. However, if a once-only subpattern is not used, |
1609 |
|
the match runs for a very long time indeed because there are so many different |
1610 |
|
ways the + and * repeats can carve up the subject, and all have to be tested |
1611 |
|
before failure can be reported. |
1612 |
|
|
1613 |
|
The values set for any capturing subpatterns are those from the outermost level |
1614 |
|
of the recursion at which the subpattern value is set. If the pattern above is |
1615 |
|
matched against |
1616 |
|
|
1617 |
|
(ab(cd)ef) |
1618 |
|
|
1619 |
|
the value for the capturing parentheses is "ef", which is the last value taken |
1620 |
|
on at the top level. If additional parentheses are added, giving |
1621 |
|
|
1622 |
|
\\( ( ( (?>[^()]+) | (?R) )* ) \\) |
1623 |
|
    ^                        ^ |
1624 |
|
    ^                        ^ |
1625 |
|
then the string they capture is "ab(cd)ef", the contents of the top level |
1626 |
|
parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE |
1627 |
|
has to obtain extra memory to store data during a recursion, which it does by |
1628 |
|
using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no |
1629 |
|
memory can be obtained, it saves data for the first 15 capturing parentheses |
1630 |
|
only, as there is no way to give an out-of-memory error from within a |
1631 |
|
recursion. |
1632 |
|
|
1633 |
|
|
1634 |
.SH PERFORMANCE |
.SH PERFORMANCE |
1635 |
Certain items that may appear in patterns are more efficient than others. It is |
Certain items that may appear in patterns are more efficient than others. It is |
1636 |
more efficient to use a character class like [aeiou] than a set of alternatives |
more efficient to use a character class like [aeiou] than a set of alternatives |
1697 |
.br |
.br |
1698 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
1699 |
|
|
1700 |
Last updated: 29 July 1999 |
Last updated: 27 January 2000 |
1701 |
.br |
.br |
1702 |
Copyright (c) 1997-1999 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |