28 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
29 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
30 |
|
|
31 |
|
void pcre_free_substring(const char *stringptr); |
32 |
|
|
33 |
|
void pcre_free_substring_list(const char **stringptr); |
34 |
|
|
35 |
const unsigned char *pcre_maketables(void); |
const unsigned char *pcre_maketables(void); |
36 |
|
|
37 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
52 |
The PCRE library is a set of functions that implement regu- |
The PCRE library is a set of functions that implement regu- |
53 |
lar expression pattern matching using the same syntax and |
lar expression pattern matching using the same syntax and |
54 |
semantics as Perl 5, with just a few differences (see |
semantics as Perl 5, with just a few differences (see |
55 |
|
|
56 |
below). The current implementation corresponds to Perl |
below). The current implementation corresponds to Perl |
57 |
5.005, with some additional features from the Perl develop- |
5.005, with some additional features from later versions. |
58 |
ment release. |
This includes some experimental, incomplete support for |
59 |
|
UTF-8 encoded strings. Details of exactly what is and what |
60 |
|
is not supported are given below. |
61 |
|
|
62 |
PCRE has its own native API, which is described in this |
PCRE has its own native API, which is described in this |
63 |
document. There is also a set of wrapper functions that |
document. There is also a set of wrapper functions that |
74 |
releases. |
releases. |
75 |
|
|
76 |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
The functions pcre_compile(), pcre_study(), and pcre_exec() |
77 |
are used for compiling and matching regular expressions, |
are used for compiling and matching regular expressions. |
78 |
while pcre_copy_substring(), pcre_get_substring(), and |
|
79 |
pcre_get_substring_list() are convenience functions for |
The functions pcre_copy_substring(), pcre_get_substring(), |
80 |
|
and pcre_get_substring_list() are convenience functions for |
81 |
extracting captured substrings from a matched subject |
extracting captured substrings from a matched subject |
82 |
string. The function pcre_maketables() is used (optionally) |
string; pcre_free_substring() and pcre_free_substring_list() |
83 |
to build a set of character tables in the current locale for |
are also provided, to free the memory used for extracted |
84 |
passing to pcre_compile(). |
strings. |
85 |
|
|
86 |
|
The function pcre_maketables() is used (optionally) to build |
87 |
|
a set of character tables in the current locale for passing |
88 |
|
to pcre_compile(). |
89 |
|
|
90 |
The function pcre_fullinfo() is used to find out information |
The function pcre_fullinfo() is used to find out information |
91 |
about a compiled pattern; pcre_info() is an obsolete version |
about a compiled pattern; pcre_info() is an obsolete version |
104 |
|
|
105 |
|
|
106 |
MULTI-THREADING |
MULTI-THREADING |
107 |
The PCRE functions can be used in multi-threading applica- |
The PCRE functions can be used in multi-threading |
108 |
tions, with the proviso that the memory management functions |
|
109 |
pointed to by pcre_malloc and pcre_free are shared by all |
|
110 |
threads. |
|
111 |
|
|
112 |
|
|
113 |
|
SunOS 5.8 Last change: 2 |
114 |
|
|
115 |
|
|
116 |
|
|
117 |
|
applications, with the proviso that the memory management |
118 |
|
functions pointed to by pcre_malloc and pcre_free are shared |
119 |
|
by all threads. |
120 |
|
|
121 |
The compiled form of a regular expression is not altered |
The compiled form of a regular expression is not altered |
122 |
during matching, so the same compiled pattern can safely be |
during matching, so the same compiled pattern can safely be |
124 |
|
|
125 |
|
|
126 |
|
|
127 |
COMPILING A PATTERN |
COMPILING A PATTERN |
128 |
The function pcre_compile() is called to compile a pattern |
The function pcre_compile() is called to compile a pattern |
129 |
into an internal form. The pattern is a C string terminated |
into an internal form. The pattern is a C string terminated |
255 |
followed by "?". It is not compatible with Perl. It can also |
followed by "?". It is not compatible with Perl. It can also |
256 |
be set by a (?U) option setting within the pattern. |
be set by a (?U) option setting within the pattern. |
257 |
|
|
258 |
|
PCRE_UTF8 |
259 |
|
|
260 |
|
This option causes PCRE to regard both the pattern and the |
261 |
|
subject as strings of UTF-8 characters instead of just byte |
262 |
|
strings. However, it is available only if PCRE has been |
263 |
|
built to include UTF-8 support. If not, the use of this |
264 |
|
option provokes an error. Support for UTF-8 is new, experi- |
265 |
|
mental, and incomplete. Details of exactly what it entails |
266 |
|
are given below. |
267 |
|
|
268 |
|
|
269 |
|
|
270 |
STUDYING A PATTERN |
STUDYING A PATTERN |
271 |
When a pattern is going to be used several times, it is |
When a pattern is going to be used several times, it is |
272 |
worth spending more time analyzing it in order to speed up |
worth spending more time analyzing it in order to speed up |
273 |
the time taken for matching. The function pcre_study() takes |
the time taken for matching. The function pcre_study() takes |
274 |
|
|
275 |
a pointer to a compiled pattern as its first argument, and |
a pointer to a compiled pattern as its first argument, and |
276 |
returns a pointer to a pcre_extra block (another void |
returns a pointer to a pcre_extra block (another void |
277 |
typedef) containing additional information about the pat- |
typedef) containing additional information about the pat- |
375 |
|
|
376 |
PCRE_INFO_BACKREFMAX |
PCRE_INFO_BACKREFMAX |
377 |
|
|
378 |
Return the number of the highest back reference in the pat- |
Return the number of the highest back reference in the |
379 |
tern. The fourth argument should point to an int variable. |
pattern. The fourth argument should point to an int vari- |
380 |
Zero is returned if there are no back references. |
able. Zero is returned if there are no back references. |
381 |
|
|
382 |
PCRE_INFO_FIRSTCHAR |
PCRE_INFO_FIRSTCHAR |
383 |
|
|
636 |
|
|
637 |
EXTRACTING CAPTURED SUBSTRINGS |
EXTRACTING CAPTURED SUBSTRINGS |
638 |
Captured substrings can be accessed directly by using the |
Captured substrings can be accessed directly by using the |
639 |
|
|
640 |
|
|
641 |
|
|
642 |
|
|
643 |
|
|
645 |
|
|
646 |
|
|
647 |
|
|
648 |
offsets returned by pcre_exec() in ovector. For convenience, |
offsets returned by pcre_exec() in ovector. For convenience, |
649 |
the functions pcre_copy_substring(), pcre_get_substring(), |
the functions pcre_copy_substring(), pcre_get_substring(), |
650 |
and pcre_get_substring_list() are provided for extracting |
and pcre_get_substring_list() are provided for extracting |
671 |
the entire pattern, while higher values extract the captured |
the entire pattern, while higher values extract the captured |
672 |
substrings. For pcre_copy_substring(), the string is placed |
substrings. For pcre_copy_substring(), the string is placed |
673 |
in buffer, whose length is given by buffersize, while for |
in buffer, whose length is given by buffersize, while for |
674 |
pcre_get_substring() a new block of store is obtained via |
pcre_get_substring() a new block of memory is obtained via |
675 |
pcre_malloc, and its address is returned via stringptr. The |
pcre_malloc, and its address is returned via stringptr. The |
676 |
yield of the function is the length of the string, not |
yield of the function is the length of the string, not |
677 |
including the terminating zero, or one of |
including the terminating zero, or one of |
705 |
inspecting the appropriate offset in ovector, which is nega- |
inspecting the appropriate offset in ovector, which is nega- |
706 |
tive for unset substrings. |
tive for unset substrings. |
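As a hedged illustration of the direct-access route described above, the sketch below copies captured substring n out of a subject by reading the offset pair ovector[2*n] and ovector[2*n+1], and treats a negative offset as an unset substring. It does not call PCRE at all. The helper name copy_capture and its return codes are invented for this example and are not the values returned by pcre_copy_substring().

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE: copy capture n into buf.
 * ovector holds (start, end) byte-offset pairs, as filled in by
 * pcre_exec(); pair 0 is the whole match.
 * Returns the substring length, -1 if the capture is unset,
 * or -2 if buf cannot hold the substring plus a terminating zero. */
static int copy_capture(const char *subject, const int *ovector,
                        int n, char *buf, int bufsize)
{
    int start, end, len;

    assert(subject != NULL && ovector != NULL && buf != NULL);
    assert(n >= 0 && bufsize > 0);

    start = ovector[2 * n];
    end = ovector[2 * n + 1];

    if (start < 0 || end < 0)
        return -1;              /* negative offset: substring unset */

    len = end - start;
    if (len + 1 > bufsize)
        return -2;              /* no room for string and its zero */

    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For instance, with the subject "id 42-abc" and an ovector assumed to contain { 3, 9, 3, 5, -1, -1, 6, 9 }, capture 1 yields "42", capture 2 is reported as unset, and capture 3 yields "abc".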
707 |
|
|
708 |
|
The two convenience functions pcre_free_substring() and |
709 |
|
pcre_free_substring_list() can be used to free the memory |
710 |
|
returned by a previous call of pcre_get_substring() or |
711 |
|
pcre_get_substring_list(), respectively. They do nothing |
712 |
|
more than call the function pointed to by pcre_free, which |
713 |
|
of course could be called directly from a C program. How- |
714 |
|
ever, PCRE is used in some situations where it is linked via |
715 |
|
a special interface to another programming language which |
716 |
|
cannot use pcre_free directly; it is for these cases that |
717 |
|
the functions are provided. |
718 |
|
|
719 |
|
|
720 |
|
|
783 |
(?p{code}) constructions. However, there is some experimen- |
(?p{code}) constructions. However, there is some experimen- |
784 |
tal support for recursive patterns using the non-Perl item |
tal support for recursive patterns using the non-Perl item |
785 |
(?R). |
(?R). |
786 |
|
|
787 |
8. There are at the time of writing some oddities in Perl |
8. There are at the time of writing some oddities in Perl |
788 |
5.005_02 concerned with the settings of captured strings |
5.005_02 concerned with the settings of captured strings |
789 |
when part of a pattern is repeated. For example, matching |
when part of a pattern is repeated. For example, matching |
836 |
The syntax and semantics of the regular expressions sup- |
The syntax and semantics of the regular expressions sup- |
837 |
ported by PCRE are described below. Regular expressions are |
ported by PCRE are described below. Regular expressions are |
838 |
also described in the Perl documentation and in a number of |
also described in the Perl documentation and in a number of |
839 |
other books, some of which have copious examples. Jeffrey |
other books, some of which have copious examples. Jeffrey |
840 |
Friedl's "Mastering Regular Expressions", published by |
Friedl's "Mastering Regular Expressions", published by |
841 |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
842 |
|
|
843 |
The description here is intended as reference documentation. |
The description here is intended as reference documentation. |
844 |
|
The basic operation of PCRE is on strings of bytes. However, |
845 |
|
there are the beginnings of some support for UTF-8 character |
846 |
|
strings. To use this support you must configure PCRE to |
847 |
|
include it, and then call pcre_compile() with the PCRE_UTF8 |
848 |
|
option. How this affects the pattern matching is described |
849 |
|
in the final section of this document. |
850 |
|
|
851 |
A regular expression is a pattern that is matched against a |
A regular expression is a pattern that is matched against a |
852 |
subject string from left to right. Most characters stand for |
subject string from left to right. Most characters stand for |
1061 |
Outside a character class, in the default matching mode, the |
Outside a character class, in the default matching mode, the |
1062 |
circumflex character is an assertion which is true only if |
circumflex character is an assertion which is true only if |
1063 |
the current matching point is at the start of the subject |
the current matching point is at the start of the subject |
1064 |
|
|
1065 |
string. If the startoffset argument of pcre_exec() is non- |
string. If the startoffset argument of pcre_exec() is non- |
1066 |
zero, circumflex can never match. Inside a character class, |
zero, circumflex can never match. Inside a character class, |
1067 |
circumflex has an entirely different meaning (see below). |
circumflex has an entirely different meaning (see below). |
1114 |
Outside a character class, a dot in the pattern matches any |
Outside a character class, a dot in the pattern matches any |
1115 |
one character in the subject, including a non-printing char- |
one character in the subject, including a non-printing char- |
1116 |
acter, but not (by default) newline. If the PCRE_DOTALL |
acter, but not (by default) newline. If the PCRE_DOTALL |
1117 |
|
|
1118 |
option is set, dots match newlines as well. The handling of |
option is set, dots match newlines as well. The handling of |
1119 |
dot is entirely independent of the handling of circumflex |
dot is entirely independent of the handling of circumflex |
1120 |
and dollar, the only relationship being that they both |
and dollar, the only relationship being that they both |
1576 |
A back reference that occurs inside the parentheses to which |
A back reference that occurs inside the parentheses to which |
1577 |
it refers fails when the subpattern is first used, so, for |
it refers fails when the subpattern is first used, so, for |
1578 |
example, (a\1) never matches. However, such references can |
example, (a\1) never matches. However, such references can |
1579 |
be useful inside repeated subpatterns. For example, the |
be useful inside repeated subpatterns. For example, the pat- |
1580 |
pattern |
tern |
1581 |
|
|
1582 |
(a|b\1)+ |
(a|b\1)+ |
1583 |
|
|
1584 |
matches any number of "a"s and also "aba", "ababaa" etc. At |
matches any number of "a"s and also "aba", "ababbaa" etc. At |
1585 |
each iteration of the subpattern, the back reference matches |
each iteration of the subpattern, the back reference matches |
1586 |
the character string corresponding to the previous itera- |
the character string corresponding to the previous |
1587 |
tion. In order for this to work, the pattern must be such |
iteration. In order for this to work, the pattern must be |
1588 |
that the first iteration does not need to match the back |
such that the first iteration does not need to match the |
1589 |
reference. This can be done using alternation, as in the |
back reference. This can be done using alternation, as in |
1590 |
example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of |
1591 |
|
zero. |
1592 |
|
|
1593 |
|
|
1594 |
|
|
1741 |
|
|
1742 |
This kind of parenthesis "locks up" the part of the pattern |
This kind of parenthesis "locks up" the part of the pattern |
1743 |
it contains once it has matched, and a failure further into |
it contains once it has matched, and a failure further into |
1744 |
the pattern is prevented from backtracking into it. Back- |
the pattern is prevented from backtracking into it. |
1745 |
tracking past it to previous items, however, works as nor- |
Backtracking past it to previous items, however, works as |
1746 |
mal. |
normal. |
1747 |
|
|
1748 |
An alternative description is that a subpattern of this type |
An alternative description is that a subpattern of this type |
1749 |
matches the string of characters that an identical stan- |
matches the string of characters that an identical stan- |
2001 |
repeat can match 0, 1, 2, 3, or 4 times, and for each of |
repeat can match 0, 1, 2, 3, or 4 times, and for each of |
2002 |
those cases other than 0, the + repeats can match different |
those cases other than 0, the + repeats can match different |
2003 |
numbers of times.) When the remainder of the pattern is such |
numbers of times.) When the remainder of the pattern is such |
2004 |
that the entire match is going to fail, PCRE has in princi- |
that the entire match is going to fail, PCRE has in |
2005 |
ple to try every possible variation, and this can take an |
principle to try every possible variation, and this can take |
2006 |
extremely long time. |
an extremely long time. |
2007 |
|
|
2008 |
An optimization catches some of the more simple cases such |
An optimization catches some of the more simple cases such |
2009 |
as |
as |
2026 |
|
|
2027 |
|
|
2028 |
|
|
2029 |
|
UTF-8 SUPPORT |
2030 |
|
Starting at release 3.3, PCRE has some support for character |
2031 |
|
strings encoded in the UTF-8 format. This is incomplete, and |
2032 |
|
is regarded as experimental. In order to use it, you must |
2033 |
|
configure PCRE to include UTF-8 support in the code, and, in |
2034 |
|
addition, you must call pcre_compile() with the PCRE_UTF8 |
2035 |
|
option flag. When you do this, both the pattern and any sub- |
2036 |
|
ject strings that are matched against it are treated as |
2037 |
|
UTF-8 strings instead of just strings of bytes, but only in |
2038 |
|
the cases that are mentioned below. |
2039 |
|
|
2040 |
|
If you compile PCRE with UTF-8 support, but do not use it at |
2041 |
|
run time, the library will be a bit bigger, but the addi- |
2042 |
|
tional run time overhead is limited to testing the PCRE_UTF8 |
2043 |
|
flag in several places, so should not be very large. |
2044 |
|
|
2045 |
|
PCRE assumes that the strings it is given contain valid |
2046 |
|
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
2047 |
|
you pass invalid UTF-8 strings to PCRE, the results are |
2048 |
|
undefined. |
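Because PCRE does not diagnose invalid input, a caller may want to check a string before passing it in with PCRE_UTF8 set. The following is a minimal sketch of such a check and is not part of PCRE. The function name utf8_well_formed is invented here. It tests only the byte structure (a lead byte followed by the right number of 10xxxxxx continuation bytes, allowing sequences of one to six bytes); it does not reject overlong encodings or any other values that stricter definitions of UTF-8 exclude.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical caller-side check, not a PCRE function.
 * Returns 1 if s[0..len-1] is structurally well-formed UTF-8
 * (1- to 6-byte sequences), 0 otherwise. */
static int utf8_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);

    while (i < len) {
        unsigned char c = s[i++];
        int extra;              /* continuation bytes expected */

        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else if ((c & 0xFC) == 0xF8) extra = 4;
        else if ((c & 0xFE) == 0xFC) extra = 5;
        else return 0;          /* stray continuation, 0xFE or 0xFF */

        while (extra-- > 0) {
            /* Each following byte must look like 10xxxxxx. */
            if (i >= len || (s[i] & 0xC0) != 0x80)
                return 0;
            i++;
        }
    }
    return 1;
}
```

A string that fails this check should not be handed to a pattern compiled with PCRE_UTF8; a string that passes it is still only structurally plausible.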
2049 |
|
|
2050 |
|
Running with PCRE_UTF8 set causes these changes in the way |
2051 |
|
PCRE works: |
2052 |
|
|
2053 |
|
1. In a pattern, the escape sequence \x{...}, where the con- |
2054 |
|
tents of the braces is a string of hexadecimal digits, is |
2055 |
|
interpreted as a UTF-8 character whose code number is the |
2056 |
|
given hexadecimal number, for example: \x{1234}. This |
2057 |
|
inserts from one to six literal bytes into the pattern, |
2058 |
|
using the UTF-8 encoding. If a non-hexadecimal digit appears |
2059 |
|
between the braces, the item is not recognized. |
2060 |
|
|
2061 |
|
2. The original hexadecimal escape sequence, \xhh, generates |
2062 |
|
a two-byte UTF-8 character if its value is greater than 127. |
2063 |
|
|
2064 |
|
3. Repeat quantifiers are NOT correctly handled if they fol- |
2065 |
|
low a multibyte character. For example, \x{100}* and \xc3+ |
2066 |
|
do not work. If you want to repeat such characters, you must |
2067 |
|
enclose them in non-capturing parentheses, for example |
2068 |
|
(?:\x{100}), at present. |
2069 |
|
|
2070 |
|
4. The dot metacharacter matches one UTF-8 character instead |
2071 |
|
of a single byte. |
2072 |
|
|
2073 |
|
5. Unlike literal UTF-8 characters, the dot metacharacter |
2074 |
|
followed by a repeat quantifier does operate correctly on |
2075 |
|
UTF-8 characters instead of single bytes. |
2076 |
|
|
2077 |
|
6. Although the \x{...} escape is permitted in a character |
2078 |
|
class, characters whose values are greater than 255 cannot |
2079 |
|
be included in a class. |
2080 |
|
|
2081 |
|
7. A class is matched against a UTF-8 character instead of |
2082 |
|
just a single byte, but it can match only characters whose |
2083 |
|
values are less than 256. Characters with greater values |
2084 |
|
always fail to match a class. |
2085 |
|
|
2086 |
|
8. Repeated classes work correctly on multiple characters. |
2087 |
|
|
2088 |
|
9. Classes containing just a single character whose value is |
2089 |
|
greater than 127 (but less than 256), for example, [\x80] or |
2090 |
|
[^\x{93}], do not work because these are optimized into sin- |
2091 |
|
gle byte matches. In the first case, of course, the class |
2092 |
|
brackets are just redundant. |
2093 |
|
|
2094 |
|
10. Lookbehind assertions move backwards in the subject by a |
2095 |
|
fixed number of characters instead of a fixed number of |
2096 |
|
bytes. Simple cases have been tested to work correctly, but |
2097 |
|
there may be hidden gotchas herein. |
2098 |
|
|
2099 |
|
11. The character types such as \d and \w do not work |
2100 |
|
correctly with UTF-8 characters. They continue to test a |
2101 |
|
single byte. |
2102 |
|
|
2103 |
|
12. Anything not explicitly mentioned here continues to work |
2104 |
|
in bytes rather than in characters. |
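To make the byte counts in the \x{...} and \xhh items of the list above concrete, here is a sketch of the standard UTF-8 encoding rule they rely on. It is an illustration only, not PCRE's internal code. The function name utf8_encode is invented for this example, and the six-byte upper limit follows the "one to six literal bytes" wording above.

```c
#include <assert.h>

/* Hypothetical helper, not a PCRE function.
 * Encode code point cp as UTF-8 into out, which must have room
 * for 6 bytes. Returns the number of bytes written (1 to 6),
 * or 0 if cp is larger than 0x7FFFFFFF. */
static int utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest code point that fits in 1, 2, ... 6 bytes. */
    static const unsigned long limit[] =
        { 0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL };
    /* Marker bits of the lead byte for each sequence length. */
    static const unsigned char lead[] =
        { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    int n, i;

    assert(out != 0);

    /* n becomes the number of continuation bytes needed. */
    for (n = 0; n < 6 && cp > limit[n]; n++)
        ;
    if (n == 6)
        return 0;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

With this rule, the code point 0x1234 from \x{1234} becomes the three bytes E1 88 B4, and the value 0xC3 from \xc3 becomes the two bytes C3 83, consistent with the statement that \xhh values above 127 generate a two-byte character.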
2105 |
|
|
2106 |
|
The following UTF-8 features of Perl 5.6 are not imple- |
2107 |
|
mented: |
2108 |
|
|
2109 |
|
1. The escape sequence \C to match a single byte. |
2110 |
|
|
2111 |
|
2. The use of Unicode tables and properties and escapes \p, |
2112 |
|
\P, and \X. |
2113 |
|
|
2114 |
|
|
2115 |
|
|
2116 |
AUTHOR |
AUTHOR |
2117 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
2118 |
University Computing Service, |
University Computing Service, |
2120 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
2121 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
2122 |
|
|
2123 |
Last updated: 27 January 2000 |
Last updated: 28 August 2000, |
2124 |
|
the 250th anniversary of the death of J.S. Bach. |
2125 |
Copyright (c) 1997-2000 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |