2 |
This file contains a concatenation of the PCRE man pages, converted to plain |
This file contains a concatenation of the PCRE man pages, converted to plain |
3 |
text format for ease of searching with a text editor, or for use on systems |
text format for ease of searching with a text editor, or for use on systems |
4 |
that do not have a man page processor. The small individual files that give |
that do not have a man page processor. The small individual files that give |
5 |
synopses of each function in the library have not been included. There are |
synopses of each function in the library have not been included. Neither has |
6 |
separate text files for the pcregrep and pcretest commands. |
the pcredemo program. There are separate text files for the pcregrep and |
7 |
|
pcretest commands. |
8 |
----------------------------------------------------------------------------- |
----------------------------------------------------------------------------- |
9 |
|
|
10 |
|
|
25 |
tax items, and there is an option for requesting some minor changes |
tax items, and there is an option for requesting some minor changes |
26 |
that give better JavaScript compatibility. |
that give better JavaScript compatibility. |
27 |
|
|
28 |
The current implementation of PCRE (release 7.x) corresponds approxi- |
The current implementation of PCRE (release 8.xx) corresponds approxi- |
29 |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
30 |
Unicode general category properties. However, UTF-8 and Unicode support |
Unicode general category properties. However, UTF-8 and Unicode support |
31 |
has to be explicitly enabled; it is not the default. The Unicode tables |
has to be explicitly enabled; it is not the default. The Unicode tables |
72 |
The user documentation for PCRE comprises a number of different sec- |
The user documentation for PCRE comprises a number of different sec- |
73 |
tions. In the "man" format, each of these is a separate "man page". In |
tions. In the "man" format, each of these is a separate "man page". In |
74 |
the HTML format, each is a separate page, linked from the index page. |
the HTML format, each is a separate page, linked from the index page. |
75 |
In the plain text format, all the sections are concatenated, for ease |
In the plain text format, all the sections, except the pcredemo sec- |
76 |
of searching. The sections are as follows: |
tion, are concatenated, for ease of searching. The sections are as fol- |
77 |
|
lows: |
78 |
|
|
79 |
pcre this document |
pcre this document |
80 |
pcre-config show PCRE installation configuration information |
pcre-config show PCRE installation configuration information |
83 |
pcrecallout details of the callout feature |
pcrecallout details of the callout feature |
84 |
pcrecompat discussion of Perl compatibility |
pcrecompat discussion of Perl compatibility |
85 |
pcrecpp details of the C++ wrapper |
pcrecpp details of the C++ wrapper |
86 |
|
pcredemo a demonstration C program that uses PCRE |
87 |
pcregrep description of the pcregrep command |
pcregrep description of the pcregrep command |
88 |
pcrematching discussion of the two matching algorithms |
pcrematching discussion of the two matching algorithms |
89 |
pcrepartial details of the partial matching facility |
pcrepartial details of the partial matching facility |
93 |
pcreperform discussion of performance issues |
pcreperform discussion of performance issues |
94 |
pcreposix the POSIX-compatible C API |
pcreposix the POSIX-compatible C API |
95 |
pcreprecompile details of saving and re-using precompiled patterns |
pcreprecompile details of saving and re-using precompiled patterns |
96 |
pcresample discussion of the sample program |
pcresample discussion of the pcredemo program |
97 |
pcrestack discussion of stack usage |
pcrestack discussion of stack usage |
98 |
pcretest description of the pcretest testing command |
pcretest description of the pcretest testing command |
99 |
|
|
100 |
In addition, in the "man" and HTML formats, there is a short page for |
In addition, in the "man" and HTML formats, there is a short page for |
101 |
each C library function, listing its arguments and results. |
each C library function, listing its arguments and results. |
102 |
|
|
103 |
|
|
104 |
LIMITATIONS |
LIMITATIONS |
105 |
|
|
106 |
There are some size limitations in PCRE but it is hoped that they will |
There are some size limitations in PCRE but it is hoped that they will |
107 |
never in practice be relevant. |
never in practice be relevant. |
108 |
|
|
109 |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
110 |
is compiled with the default internal linkage size of 2. If you want to |
is compiled with the default internal linkage size of 2. If you want to |
111 |
process regular expressions that are truly enormous, you can compile |
process regular expressions that are truly enormous, you can compile |
112 |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
113 |
the source distribution and the pcrebuild documentation for details). |
the source distribution and the pcrebuild documentation for details). |
114 |
In these cases the limit is substantially larger. However, the speed |
In these cases the limit is substantially larger. However, the speed |
115 |
of execution is slower. |
of execution is slower. |
116 |
|
|
117 |
All values in repeating quantifiers must be less than 65536. |
All values in repeating quantifiers must be less than 65536. |
122 |
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
123 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
124 |
|
|
125 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
126 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
127 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
128 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
129 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
130 |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
131 |
|
|
132 |
|
|
133 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
134 |
|
|
135 |
From release 3.3, PCRE has had some support for character strings |
From release 3.3, PCRE has had some support for character strings |
136 |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
137 |
to cover most common requirements, and in release 5.0 additional sup- |
to cover most common requirements, and in release 5.0 additional sup- |
138 |
port for Unicode general category properties was added. |
port for Unicode general category properties was added. |
139 |
|
|
140 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
141 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
142 |
with the PCRE_UTF8 option flag, or the pattern must start with the |
with the PCRE_UTF8 option flag, or the pattern must start with the |
143 |
sequence (*UTF8). When either of these is the case, both the pattern |
sequence (*UTF8). When either of these is the case, both the pattern |
144 |
and any subject strings that are matched against it are treated as |
and any subject strings that are matched against it are treated as |
145 |
UTF-8 strings instead of just strings of bytes. |
UTF-8 strings instead of just strings of bytes. |
146 |
|
|
147 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
148 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
149 |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
150 |
very big. |
very big. |
151 |
|
|
152 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
153 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
154 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
155 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
156 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
157 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
158 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
159 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
160 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
161 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
162 |
does not support this. |
does not support this. |
163 |
|
|
164 |
Validity of UTF-8 strings |
Validity of UTF-8 strings |
165 |
|
|
166 |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
167 |
subjects are (by default) checked for validity on entry to the relevant |
subjects are (by default) checked for validity on entry to the relevant |
168 |
functions. From release 7.3 of PCRE, the check is according to the rules |
functions. From release 7.3 of PCRE, the check is according to the rules |
169 |
of RFC 3629, which are themselves derived from the Unicode specifica- |
of RFC 3629, which are themselves derived from the Unicode specifica- |
170 |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
171 |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
172 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
173 |
to U+DFFF. |
to U+DFFF. |
174 |
|
|
175 |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
176 |
which the Unicode Standard says this: "The Low Surrogate Area does not |
which the Unicode Standard says this: "The Low Surrogate Area does not |
177 |
contain any character assignments, consequently no character code |
contain any character assignments, consequently no character code |
178 |
charts or namelists are provided for this area. Surrogates are reserved |
charts or namelists are provided for this area. Surrogates are reserved |
179 |
for use with UTF-16 and then must be used in pairs." The code points |
for use with UTF-16 and then must be used in pairs." The code points |
180 |
that are encoded by UTF-16 pairs are available as independent code |
that are encoded by UTF-16 pairs are available as independent code |
181 |
points in the UTF-8 encoding. (In other words, the whole surrogate |
points in the UTF-8 encoding. (In other words, the whole surrogate |
182 |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
183 |
|
|
184 |
If an invalid UTF-8 string is passed to PCRE, an error return |
If an invalid UTF-8 string is passed to PCRE, an error return |
185 |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
186 |
that your strings are valid, and therefore want to skip these checks in |
that your strings are valid, and therefore want to skip these checks in |
187 |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
188 |
compile time or at run time, PCRE assumes that the pattern or subject |
compile time or at run time, PCRE assumes that the pattern or subject |
189 |
it is given (respectively) contains only valid UTF-8 codes. In this |
it is given (respectively) contains only valid UTF-8 codes. In this |
190 |
case, it does not diagnose an invalid UTF-8 string. |
case, it does not diagnose an invalid UTF-8 string. |
191 |
|
|
192 |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
193 |
what happens depends on why the string is invalid. If the string con- |
what happens depends on why the string is invalid. If the string con- |
194 |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
195 |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
196 |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
197 |
strings according to the more liberal rules of RFC 2279. However, if |
strings according to the more liberal rules of RFC 2279. However, if |
198 |
the string does not even conform to RFC 2279, the result is undefined. |
the string does not even conform to RFC 2279, the result is undefined. |
199 |
Your program may crash. |
Your program may crash. |
200 |
|
|
201 |
If you want to process strings of values in the full range 0 to |
If you want to process strings of values in the full range 0 to |
202 |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
203 |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
204 |
this situation, you will have to apply your own validity check. |
this situation, you will have to apply your own validity check. |
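
As a rough illustration of the kind of check an application might apply itself before setting PCRE_NO_UTF8_CHECK, the sketch below tests only the structural rules of the old RFC 2279 scheme: a lead byte that announces a sequence of 1 to 6 bytes, followed by the right number of 10xxxxxx continuation bytes. The function names are hypothetical and are not part of PCRE. The sketch assumes that structural well-formedness is all that is wanted; it does not reject overlong encodings and does not restrict the value range.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the sequence length (1..6) announced by a lead byte under the
   RFC 2279 scheme, or 0 if the byte cannot start a sequence. */
size_t rfc2279_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;   /* 0xxxxxxx */
    if (lead < 0xC0) return 0;   /* 10xxxxxx is a continuation byte */
    if (lead < 0xE0) return 2;   /* 110xxxxx */
    if (lead < 0xF0) return 3;   /* 1110xxxx */
    if (lead < 0xF8) return 4;   /* 11110xxx */
    if (lead < 0xFC) return 5;   /* 111110xx */
    if (lead < 0xFE) return 6;   /* 1111110x */
    return 0;                    /* 0xFE and 0xFF never appear */
}

/* Returns 1 if s[0..len) is structurally well formed, 0 otherwise.
   Overlong forms are NOT rejected (an assumption of this sketch). */
int rfc2279_is_well_formed(const unsigned char *s, size_t len)
{
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        size_t n = rfc2279_seq_len(s[i]);
        size_t k;

        if (n == 0 || n > len - i)
            return 0;            /* bad lead byte or truncated sequence */
        for (k = 1; k < n; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;        /* continuation byte is not 10xxxxxx */
        }
        i += n;
    }
    return 1;
}
```

A caller would run rfc2279_is_well_formed() over each pattern and subject and pass PCRE_NO_UTF8_CHECK only when it returns 1.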
205 |
|
|
206 |
General comments about UTF-8 mode |
General comments about UTF-8 mode |
207 |
|
|
208 |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
209 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
210 |
|
|
211 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
212 |
characters for values greater than \177. |
characters for values greater than \177. |
213 |
|
|
214 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
215 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
216 |
|
|
217 |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
218 |
gle byte. |
gle byte. |
219 |
|
|
220 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
221 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
222 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
223 |
|
|
224 |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
225 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
226 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
227 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
228 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
229 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
230 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
231 |
\p{Nd}. Note that this also applies to \b, because it is defined in |
\p{Nd}. Note that this also applies to \b, because it is defined in |
232 |
terms of \w and \W. |
terms of \w and \W. |
233 |
|
|
234 |
7. Similarly, characters that match the POSIX named character classes |
7. Similarly, characters that match the POSIX named character classes |
235 |
are all low-valued characters. |
are all low-valued characters. |
236 |
|
|
237 |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
238 |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
239 |
acters. |
acters. |
240 |
|
|
241 |
9. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
242 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
243 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
244 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
245 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
246 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
247 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
248 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
249 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
250 |
ported by PCRE. |
ported by PCRE. |
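
Items 1 and 2 above follow from how UTF-8 encodes values: anything from 128 up to 0x7FF (which covers \xb3 and every octal escape up to \777) needs two bytes. The sketch below is a plain UTF-8 encoder written for illustration only; utf8_encode is a hypothetical helper, not a PCRE function, and it assumes the stricter RFC 3629 range.

```c
#include <assert.h>
#include <stddef.h>

/* Writes the UTF-8 encoding of cp into out (room for 4 bytes is needed)
   and returns the number of bytes written. Returns 0 if cp is above
   U+10FFFF or lies in the surrogate range U+D800..U+DFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return 0;                /* surrogates are not encodable */
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                    /* beyond the RFC 3629 range */
}
```

For example, utf8_encode(0xB3, buf) stores the two bytes 0xC2 0xB3, and utf8_encode(0777, buf) stores 0xC7 0xBF, which are the byte sequences that \xb3 and \777 have to match in a UTF-8 subject.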
251 |
|
|
252 |
|
|
256 |
University Computing Service |
University Computing Service |
257 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
258 |
|
|
259 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
260 |
so I've taken it away. If you want to email me, use my two initials, |
so I've taken it away. If you want to email me, use my two initials, |
261 |
followed by the two digits 10, at the domain cam.ac.uk. |
followed by the two digits 10, at the domain cam.ac.uk. |
262 |
|
|
263 |
|
|
264 |
REVISION |
REVISION |
265 |
|
|
266 |
Last updated: 11 April 2009 |
Last updated: 01 September 2009 |
267 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
268 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
269 |
|
|
270 |
|
|
271 |
PCREBUILD(3) PCREBUILD(3) |
PCREBUILD(3) PCREBUILD(3) |
272 |
|
|
273 |
|
|
593 |
Last updated: 17 March 2009 |
Last updated: 17 March 2009 |
594 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
595 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
596 |
|
|
597 |
|
|
598 |
PCREMATCHING(3) PCREMATCHING(3) |
PCREMATCHING(3) PCREMATCHING(3) |
599 |
|
|
600 |
|
|
754 |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
755 |
things with callouts. |
things with callouts. |
756 |
|
|
757 |
2. There is much better support for partial matching. The restrictions |
2. Because the alternative algorithm scans the subject string just |
|
on the content of the pattern that apply when using the standard algo- |
|
|
rithm for partial matching do not apply to the alternative algorithm. |
|
|
For non-anchored patterns, the starting position of a partial match is |
|
|
available. |
|
|
|
|
|
3. Because the alternative algorithm scans the subject string just |
|
758 |
once, and never needs to backtrack, it is possible to pass very long |
once, and never needs to backtrack, it is possible to pass very long |
759 |
subject strings to the matching function in several pieces, checking |
subject strings to the matching function in several pieces, checking |
760 |
for partial matching each time. |
for partial matching each time. |
783 |
|
|
784 |
REVISION |
REVISION |
785 |
|
|
786 |
Last updated: 19 April 2008 |
Last updated: 25 August 2009 |
787 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
788 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
789 |
|
|
790 |
|
|
791 |
PCREAPI(3) PCREAPI(3) |
PCREAPI(3) PCREAPI(3) |
792 |
|
|
793 |
|
|
895 |
pcre_exec() are used for compiling and matching regular expressions in |
pcre_exec() are used for compiling and matching regular expressions in |
896 |
a Perl-compatible manner. A sample program that demonstrates the sim- |
a Perl-compatible manner. A sample program that demonstrates the sim- |
897 |
plest way of using them is provided in the file called pcredemo.c in |
plest way of using them is provided in the file called pcredemo.c in |
898 |
the source distribution. The pcresample documentation describes how to |
the PCRE source distribution. A listing of this program is given in the |
899 |
compile and run it. |
pcredemo documentation, and the pcresample documentation describes how |
900 |
|
to compile and run it. |
901 |
|
|
902 |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
903 |
ble, is also provided. This uses a different algorithm for the match- |
ble, is also provided. This uses a different algorithm for the match- |
904 |
ing. The alternative algorithm finds all possible matches (at a given |
ing. The alternative algorithm finds all possible matches (at a given |
905 |
point in the subject), and scans the subject just once. However, this |
point in the subject), and scans the subject just once. However, this |
906 |
algorithm does not return captured substrings. A description of the two |
algorithm does not return captured substrings. A description of the two |
907 |
matching algorithms and their advantages and disadvantages is given in |
matching algorithms and their advantages and disadvantages is given in |
908 |
the pcrematching documentation. |
the pcrematching documentation. |
909 |
|
|
910 |
In addition to the main compiling and matching functions, there are |
In addition to the main compiling and matching functions, there are |
911 |
convenience functions for extracting captured substrings from a subject |
convenience functions for extracting captured substrings from a subject |
912 |
string that is matched by pcre_exec(). They are: |
string that is matched by pcre_exec(). They are: |
913 |
|
|
922 |
pcre_free_substring() and pcre_free_substring_list() are also provided, |
pcre_free_substring() and pcre_free_substring_list() are also provided, |
923 |
to free the memory used for extracted strings. |
to free the memory used for extracted strings. |
924 |
|
|
925 |
The function pcre_maketables() is used to build a set of character |
The function pcre_maketables() is used to build a set of character |
926 |
tables in the current locale for passing to pcre_compile(), |
tables in the current locale for passing to pcre_compile(), |
927 |
pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is |
pcre_exec(), or pcre_dfa_exec(). This is an optional facility that is |
928 |
provided for specialist use. Most commonly, no special tables are |
provided for specialist use. Most commonly, no special tables are |
929 |
passed, in which case internal tables that are generated when PCRE is |
passed, in which case internal tables that are generated when PCRE is |
930 |
built are used. |
built are used. |
931 |
|
|
932 |
The function pcre_fullinfo() is used to find out information about a |
The function pcre_fullinfo() is used to find out information about a |
933 |
compiled pattern; pcre_info() is an obsolete version that returns only |
compiled pattern; pcre_info() is an obsolete version that returns only |
934 |
some of the available information, but is retained for backwards com- |
some of the available information, but is retained for backwards com- |
935 |
patibility. The function pcre_version() returns a pointer to a string |
patibility. The function pcre_version() returns a pointer to a string |
936 |
containing the version of PCRE and its date of release. |
containing the version of PCRE and its date of release. |
937 |
|
|
938 |
The function pcre_refcount() maintains a reference count in a data |
The function pcre_refcount() maintains a reference count in a data |
939 |
block containing a compiled pattern. This is provided for the benefit |
block containing a compiled pattern. This is provided for the benefit |
940 |
of object-oriented applications. |
of object-oriented applications. |
941 |
|
|
942 |
The global variables pcre_malloc and pcre_free initially contain the |
The global variables pcre_malloc and pcre_free initially contain the |
943 |
entry points of the standard malloc() and free() functions, respec- |
entry points of the standard malloc() and free() functions, respec- |
944 |
tively. PCRE calls the memory management functions via these variables, |
tively. PCRE calls the memory management functions via these variables, |
945 |
so a calling program can replace them if it wishes to intercept the |
so a calling program can replace them if it wishes to intercept the |
946 |
calls. This should be done before calling any PCRE functions. |
calls. This should be done before calling any PCRE functions. |
947 |
|
|
948 |
The global variables pcre_stack_malloc and pcre_stack_free are also |
The global variables pcre_stack_malloc and pcre_stack_free are also |
949 |
indirections to memory management functions. These special functions |
indirections to memory management functions. These special functions |
950 |
are used only when PCRE is compiled to use the heap for remembering |
are used only when PCRE is compiled to use the heap for remembering |
951 |
data, instead of recursive function calls, when running the pcre_exec() |
data, instead of recursive function calls, when running the pcre_exec() |
952 |
function. See the pcrebuild documentation for details of how to do |
function. See the pcrebuild documentation for details of how to do |
953 |
this. It is a non-standard way of building PCRE, for use in environ- |
this. It is a non-standard way of building PCRE, for use in environ- |
954 |
ments that have limited stacks. Because of the greater use of memory |
ments that have limited stacks. Because of the greater use of memory |
955 |
management, it runs more slowly. Separate functions are provided so |
management, it runs more slowly. Separate functions are provided so |
956 |
that special-purpose external code can be used for this case. When |
that special-purpose external code can be used for this case. When |
957 |
used, these functions are always called in a stack-like manner (last |
used, these functions are always called in a stack-like manner (last |
958 |
obtained, first freed), and always for memory blocks of the same size. |
obtained, first freed), and always for memory blocks of the same size. |
959 |
There is a discussion about PCRE's stack usage in the pcrestack docu- |
There is a discussion about PCRE's stack usage in the pcrestack docu- |
960 |
mentation. |
mentation. |
961 |
|
|
962 |
The global variable pcre_callout initially contains NULL. It can be set |
The global variable pcre_callout initially contains NULL. It can be set |
963 |
by the caller to a "callout" function, which PCRE will then call at |
by the caller to a "callout" function, which PCRE will then call at |
964 |
specified points during a matching operation. Details are given in the |
specified points during a matching operation. Details are given in the |
965 |
pcrecallout documentation. |
pcrecallout documentation. |
966 |
|
|
967 |
|
|
968 |
NEWLINES |
NEWLINES |
969 |
|
|
970 |
PCRE supports five different conventions for indicating line breaks in |
PCRE supports five different conventions for indicating line breaks in |
971 |
strings: a single CR (carriage return) character, a single LF (line- |
strings: a single CR (carriage return) character, a single LF (line- |
972 |
feed) character, the two-character sequence CRLF, any of the three pre- |
feed) character, the two-character sequence CRLF, any of the three pre- |
973 |
ceding, or any Unicode newline sequence. The Unicode newline sequences |
ceding, or any Unicode newline sequence. The Unicode newline sequences |
974 |
are the three just mentioned, plus the single characters VT (vertical |
are the three just mentioned, plus the single characters VT (vertical |
975 |
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line |
976 |
separator, U+2028), and PS (paragraph separator, U+2029). |
separator, U+2028), and PS (paragraph separator, U+2029). |
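
The five conventions can be made concrete with a small recognizer. The sketch below is an illustration only and is not how PCRE implements the feature; the enum and function names are hypothetical. It assumes an ASCII-compatible, UTF-8 encoded subject, so that NEL, LS and PS appear as the byte sequences C2 85, E2 80 A8 and E2 80 A9.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY } nl_convention;

/* Returns the length in bytes of the newline that starts at s[pos] under
   the given convention, or 0 if no newline starts there. */
size_t newline_length_at(const unsigned char *s, size_t len, size_t pos,
                         nl_convention conv)
{
    unsigned char c;
    int crlf;

    assert(s != NULL || len == 0);
    if (pos >= len)
        return 0;
    c = s[pos];
    crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

    switch (conv) {
    case NL_CR:
        return c == 0x0D ? 1 : 0;
    case NL_LF:
        return c == 0x0A ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:             /* any of the three preceding */
        if (crlf) return 2;
        return (c == 0x0D || c == 0x0A) ? 1 : 0;
    case NL_ANY:                 /* any Unicode newline sequence */
        if (crlf) return 2;
        if (c == 0x0D || c == 0x0A || c == 0x0B || c == 0x0C)
            return 1;            /* CR, LF, VT, FF */
        if (c == 0xC2 && pos + 1 < len && s[pos + 1] == 0x85)
            return 2;            /* NEL, U+0085 */
        if (c == 0xE2 && pos + 2 < len && s[pos + 1] == 0x80 &&
            (s[pos + 2] == 0xA8 || s[pos + 2] == 0xA9))
            return 3;            /* LS U+2028 or PS U+2029 */
        return 0;
    }
    return 0;
}
```

Note that under NL_CRLF a lone CR or a lone LF is not a newline at all, whereas under NL_ANYCRLF and NL_ANY a CR followed by LF is treated as one two-byte newline rather than two separate ones.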
977 |
|
|
Each of the first three conventions is used by at least one operating
system as its standard newline sequence. When PCRE is built, a default
can be specified. If none is specified, the default is LF, which is the
Unix standard. When PCRE is run, the default can be overridden, either
when a pattern is compiled, or when it is matched.

At compile time, the newline convention can be specified by the options
argument of pcre_compile(), or it can be specified by special text at
the start of the pattern itself; this overrides any other settings. See
the pcrepattern page for details of the special character sequences.

In the PCRE documentation the word "newline" is used to mean "the char-
acter or pair of characters that indicate a line break". The choice of
newline convention affects the handling of the dot, circumflex, and
dollar metacharacters, the handling of #-comments in /x mode, and, when
CRLF is a recognized line ending sequence, the match position advance-
ment for a non-anchored pattern. There is more detail about this in the
section on pcre_exec() options below.

The choice of newline convention does not affect the interpretation of
the \n or \r escape sequences, nor does it affect what \R matches,
which is controlled in a similar way, but by separate options.


MULTITHREADING

The PCRE functions can be used in multi-threading applications, with
the proviso that the memory management functions pointed to by
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
callout function pointed to by pcre_callout, are shared by all threads.

The compiled form of a regular expression is not altered during match-
ing, so the same compiled pattern can safely be used by several threads
at once.

SAVING PRECOMPILED PATTERNS FOR LATER USE

The compiled form of a regular expression can be saved and re-used at a
later time, possibly by a different program, and even on a host other
than the one on which it was compiled. Details are given in the
pcreprecompile documentation. However, compiling a regular expression
with one version of PCRE for use with a different version is not guar-
anteed to work and may cause crashes.


int pcre_config(int what, void *where);

The function pcre_config() makes it possible for a PCRE client to dis-
cover which optional features have been compiled into the PCRE library.
The pcrebuild documentation has more details about these optional fea-
tures.

The first argument for pcre_config() is an integer, specifying which
information is required; the second argument is a pointer to a variable
into which the information is placed. The following information is
available:

PCRE_CONFIG_UTF8

The output is an integer that is set to one if UTF-8 support is avail-
able; otherwise it is set to zero.

PCRE_CONFIG_UNICODE_PROPERTIES

The output is an integer that is set to one if support for Unicode
character properties is available; otherwise it is set to zero.

PCRE_CONFIG_NEWLINE

The output is an integer whose value specifies the default character
sequence that is recognized as meaning "newline". The five values that
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
and -1 for ANY. Though they are derived from ASCII, the same values
are returned in EBCDIC environments. The default should normally corre-
spond to the standard sequence for your operating system.

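The five values can be turned into readable names with a small lookup.
This is a sketch only: newline_name() is a hypothetical helper, not part
of PCRE. It also makes visible that 3338 equals 13 * 256 + 10, that is,
the ASCII codes for CR and LF packed into one integer.

```c
#include <stddef.h>

/* Map a value reported for PCRE_CONFIG_NEWLINE to a short name.
   newline_name() is a hypothetical helper, not a PCRE function. */
static const char *newline_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";      /* 3338 */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return NULL;        /* not a documented value */
    }
}
```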
PCRE_CONFIG_BSR

The output is an integer whose value indicates what character sequences
the \R escape sequence matches by default. A value of 0 means that \R
matches any Unicode line ending sequence; a value of 1 means that \R
matches only CR, LF, or CRLF. The default can be overridden when a pat-
tern is compiled or matched.

PCRE_CONFIG_LINK_SIZE

The output is an integer that contains the number of bytes used for
internal linkage in compiled regular expressions. The value is 2, 3, or
4. Larger values allow larger regular expressions to be compiled, at
the expense of slower matching. The default value of 2 is sufficient
for all but the most massive patterns, since it allows the compiled
pattern to be up to 64K in size.

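The 64K figure is what a two-byte offset can address. The sketch below
works out the same arithmetic for each link size, on the assumption that
the limit is simply the number of distinct values that fit in that many
bytes; link_capacity() is a hypothetical helper, and the limits that
apply in practice for sizes 3 and 4 may be lower for other reasons.

```c
/* Number of distinct offsets representable in link_size bytes.
   link_capacity() is a hypothetical helper for illustration. */
static unsigned long long link_capacity(int link_size)
{
    return 1ULL << (8 * link_size);
}
```

With a link size of 2 this gives 65536, which is the 64K quoted above.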
PCRE_CONFIG_POSIX_MALLOC_THRESHOLD

The output is an integer that contains the threshold above which the
POSIX interface uses malloc() for output vectors. Further details are
given in the pcreposix documentation.

PCRE_CONFIG_MATCH_LIMIT

The output is a long integer that gives the default limit for the num-
ber of internal matching function calls in a pcre_exec() execution.
Further details are given with pcre_exec() below.

PCRE_CONFIG_MATCH_LIMIT_RECURSION

The output is a long integer that gives the default limit for the depth
of recursion when calling the internal matching function in a
pcre_exec() execution. Further details are given with pcre_exec()
below.

PCRE_CONFIG_STACKRECURSE

The output is an integer that is set to one if internal recursion when
running pcre_exec() is implemented by recursive function calls that use
the stack to remember their state. This is the usual way that PCRE is
compiled. The output is zero if PCRE was compiled to use blocks of data
on the heap instead of recursive function calls. In this case,
pcre_stack_malloc and pcre_stack_free are called to manage memory
blocks on the heap, thus avoiding the use of the stack.

Either of the functions pcre_compile() or pcre_compile2() can be called
to compile a pattern into an internal form. The only difference between
the two interfaces is that pcre_compile2() has an additional argument,
errorcodeptr, via which a numerical error code can be returned.

The pattern is a C string terminated by a binary zero, and is passed in
the pattern argument. A pointer to a single block of memory that is
obtained via pcre_malloc is returned. This contains the compiled code
and related data. The pcre type is defined for the returned block; this
is a typedef for a structure whose contents are not externally defined.
It is up to the caller to free the memory (via pcre_free) when it is no
longer required.

Although the compiled code of a PCRE regex is relocatable, that is, it
does not depend on memory location, the complete pcre data block is not
fully relocatable, because it may contain a copy of the tableptr argu-
ment, which is an address (see below).

The options argument contains various bit settings that affect the com-
pilation. It should be zero if no options are required. The available
options are described below. Some of them (in particular, those that
are compatible with Perl, but also some others) can also be set and
unset from within the pattern (see the detailed description in the
pcrepattern documentation). For those options that can be different in
different parts of the pattern, the contents of the options argument
specify their initial settings at the start of compilation and execu-
tion. The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the
time of matching as well as at compile time.

If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise,
if compilation of a pattern fails, pcre_compile() returns NULL, and
sets the variable pointed to by errptr to point to a textual error mes-
sage. This is a static string that is part of the library. You must not
try to free it. The offset from the start of the pattern to the charac-
ter where the error was discovered is placed in the variable pointed to
by erroffset, which must not be NULL. If it is, an immediate error is
given.

If pcre_compile2() is used instead of pcre_compile(), and the error-
codeptr argument is not NULL, a non-zero error code number is returned
via this argument in the event of an error. This is in addition to the
textual error message. Error codes and messages are listed below.

If the final argument, tableptr, is NULL, PCRE uses a default set of
character tables that are built when PCRE is compiled, using the
default C locale. Otherwise, tableptr must be an address that is the
result of a call to pcre_maketables(). This value is stored with the
compiled pattern, and used again by pcre_exec(), unless another table
pointer is passed to it. For more discussion, see the section on locale
support below.

This code fragment shows a typical straightforward call to pcre_com-
pile():

pcre *re;
&erroffset, /* for error offset */
NULL); /* use default character tables */

The following names for option bits are defined in the pcre.h header
file:

PCRE_ANCHORED

If this bit is set, the pattern is forced to be "anchored", that is, it
is constrained to match only at the first matching point in the string
that is being searched (the "subject string"). This effect can also be
achieved by appropriate constructs in the pattern itself, which is the
only way to do it in Perl.

PCRE_AUTO_CALLOUT

If this bit is set, pcre_compile() automatically inserts callout items,
all with number 255, before each pattern item. For discussion of the
callout facility, see the pcrecallout documentation.

PCRE_BSR_ANYCRLF
PCRE_BSR_UNICODE

These options (which are mutually exclusive) control what the \R escape
sequence matches. The choice is either to match only CR, LF, or CRLF,
or to match any Unicode newline sequence. The default is specified when
PCRE is built. It can be overridden from within the pattern, or by set-
ting an option when a compiled pattern is matched.

PCRE_CASELESS

If this bit is set, letters in the pattern match both upper and lower
case letters. It is equivalent to Perl's /i option, and it can be
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE
always understands the concept of case for characters whose values are
less than 128, so caseless matching is always possible. For characters
with higher values, the concept of case is supported if PCRE is com-
piled with Unicode property support, but not otherwise. If you want to
use caseless matching for characters 128 and above, you must ensure
that PCRE is compiled with Unicode property support as well as with
UTF-8 support.

PCRE_DOLLAR_ENDONLY

If this bit is set, a dollar metacharacter in the pattern matches only
at the end of the subject string. Without this option, a dollar also
matches immediately before a newline at the end of the string (but not
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored
if PCRE_MULTILINE is set. There is no equivalent to this option in
Perl, and no way to set it within a pattern.

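The rule for where a dollar can match when PCRE_MULTILINE is not set can
be written as a small predicate. This is a sketch of the rule as
described, not PCRE's code: dollar_matches() is a hypothetical name, and
LF is assumed to be the newline convention.

```c
#include <stddef.h>

/* Does "$" match at offset pos when PCRE_MULTILINE is not set?
   Assumes LF is the newline convention. */
static int dollar_matches(const char *subject, size_t len, size_t pos,
                          int dollar_endonly)
{
    if (pos == len) return 1;          /* very end of the subject */
    if (dollar_endonly) return 0;      /* nothing else qualifies */
    /* immediately before a newline that ends the subject */
    return pos + 1 == len && subject[pos] == '\n';
}
```

For the subject "abc\n", offset 3 qualifies only when the option is not
set, while offset 4 qualifies in both cases.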
PCRE_DOTALL

If this bit is set, a dot metacharacter in the pattern matches all char-
acters, including those that indicate newline. Without it, a dot does
not match when the current position is at a newline. This option is
equivalent to Perl's /s option, and it can be changed within a pattern
by a (?s) option setting. A negative class such as [^a] always matches
newline characters, independent of the setting of this option.

PCRE_DUPNAMES

If this bit is set, names used to identify capturing subpatterns need
not be unique. This can be helpful for certain types of pattern when it
is known that only one instance of the named subpattern can ever be
matched. There are more details of named subpatterns below; see also
the pcrepattern documentation.

PCRE_EXTENDED

If this bit is set, whitespace data characters in the pattern are
totally ignored except when escaped or inside a character class. White-
space does not include the VT character (code 11). In addition, charac-
ters between an unescaped # outside a character class and the next new-
line, inclusive, are also ignored. This is equivalent to Perl's /x
option, and it can be changed within a pattern by a (?x) option set-
ting.

This option makes it possible to include comments inside complicated
patterns. Note, however, that this applies only to data characters.
Whitespace characters may never appear within special character
sequences in a pattern, for example within the sequence (?( which
introduces a conditional subpattern.

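To make the effect concrete, the sketch below removes from a pattern the
characters that PCRE_EXTENDED is described as ignoring. It is a
simplified illustration, not PCRE's parser: strip_extended() is a
hypothetical name, LF is assumed to be the newline convention, and the
code does not handle \Q...\E quoting or a literal ] placed first in a
character class.

```c
#include <stddef.h>

/* Copy pattern to out (which must hold strlen(pattern) + 1 bytes),
   dropping unescaped whitespace and #-comments outside classes. */
static void strip_extended(const char *pattern, char *out)
{
    int in_class = 0;
    size_t i = 0, o = 0;

    while (pattern[i] != '\0') {
        char c = pattern[i];

        if (c == '\\' && pattern[i + 1] != '\0') {
            out[o++] = pattern[i++];      /* keep the backslash ... */
            out[o++] = pattern[i++];      /* ... and the escaped character */
            continue;
        }
        if (in_class) {
            if (c == ']') in_class = 0;
            out[o++] = c;                 /* everything in a class is kept */
            i++;
        } else if (c == '[') {
            in_class = 1;
            out[o++] = c;
            i++;
        } else if (c == ' ' || c == '\t' || c == '\n' ||
                   c == '\f' || c == '\r') {
            i++;                          /* VT (code 11) is not skipped */
        } else if (c == '#') {
            while (pattern[i] != '\0' && pattern[i] != '\n') i++;
            if (pattern[i] == '\n') i++;  /* the newline is ignored too */
        } else {
            out[o++] = c;
            i++;
        }
    }
    out[o] = '\0';
}
```

Given "a b # note\n c" the function produces "abc", while "[ #]\ x" is
returned unchanged because the space and # are inside a class and the
second space is escaped.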
PCRE_EXTRA

This option was invented in order to turn on additional functionality
of PCRE that is incompatible with Perl, but it is currently of very
little use. When set, any backslash in a pattern that is followed by a
letter that has no special meaning causes an error, thus reserving
these combinations for future expansion. By default, as in Perl, a
backslash followed by a letter with no special meaning is treated as a
literal. (Perl can, however, be persuaded to give a warning for this.)
There are at present no other features controlled by this option. It
can also be set by a (?X) option setting within a pattern.

PCRE_FIRSTLINE

If this option is set, an unanchored pattern is required to match
before or at the first newline in the subject string, though the
matched text may continue over the newline.

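One reading of this rule is that only the starting offset of the match
is constrained. The sketch below expresses that reading;
firstline_allows() is a hypothetical name, and LF is assumed to be the
newline convention.

```c
#include <stddef.h>

/* Is a match starting at match_start acceptable under PCRE_FIRSTLINE?
   Assumes LF is the newline convention. */
static int firstline_allows(const char *subject, size_t len,
                            size_t match_start)
{
    size_t first_nl = 0;

    /* find the first newline, or the end of the subject if there is none */
    while (first_nl < len && subject[first_nl] != '\n') first_nl++;
    return match_start <= first_nl;   /* before or at the first newline */
}
```

For the subject "ab\ncd", starting offsets 0 to 2 are accepted and
offset 3 is not, whatever the length of the matched text.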
PCRE_JAVASCRIPT_COMPAT

If this option is set, PCRE's behaviour is changed in some ways so that
it is compatible with JavaScript rather than Perl. The changes are as
follows:

(1) A lone closing square bracket in a pattern causes a compile-time
error, because this is illegal in JavaScript (by default it is treated
as a data character). Thus, the pattern AB]CD becomes illegal when this
option is set.

(2) At run time, a back reference to an unset subpattern group matches
an empty string (by default this causes the current matching alterna-
tive to fail). A pattern such as (\1)(a) succeeds when this option is
set (assuming it can find an "a" in the subject), whereas it fails by
default, for Perl compatibility.

PCRE_MULTILINE

By default, PCRE treats the subject string as consisting of a single
line of characters (even if it actually contains newlines). The "start
of line" metacharacter (^) matches only at the start of the string,
while the "end of line" metacharacter ($) matches only at the end of
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
is set). This is the same as Perl.

When PCRE_MULTILINE is set, the "start of line" and "end of line"
constructs match immediately following or immediately before internal
newlines in the subject string, respectively, as well as at the very
start and end. This is equivalent to Perl's /m option, and it can be
changed within a pattern by a (?m) option setting. If there are no new-
lines in a subject string, or no occurrences of ^ or $ in a pattern,
setting PCRE_MULTILINE has no effect.

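The positions at which a circumflex can match, with and without this
option, can be sketched as a predicate. caret_matches() is a
hypothetical name, LF is assumed to be the newline convention, and a
newline is treated as "internal" when at least one character follows it.

```c
#include <stddef.h>

/* Does "^" match at offset pos? Assumes LF is the newline convention. */
static int caret_matches(const char *subject, size_t len, size_t pos,
                         int multiline)
{
    if (pos == 0) return 1;            /* very start of the subject */
    if (!multiline) return 0;          /* single-line mode: start only */
    /* immediately after a newline that is not the last character */
    return pos < len && subject[pos - 1] == '\n';
}
```

For the subject "a\nb\n", offset 2 qualifies only in multiline mode, and
offset 4 (after the final newline) does not qualify in either mode.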
PCRE_NEWLINE_CR
PCRE_NEWLINE_ANYCRLF
PCRE_NEWLINE_ANY

These options override the default newline definition that was chosen
when PCRE was built. Setting the first or the second specifies that a
newline is indicated by a single character (CR or LF, respectively).
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies
that any of the three preceding sequences should be recognized. Setting
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be
recognized. The Unicode newline sequences are the three just mentioned,
plus the single characters VT (vertical tab, U+000B), FF (formfeed,
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS
(paragraph separator, U+2029). The last two are recognized only in
UTF-8 mode.

1331 |
The newline setting in the options word uses three bits that are |
The newline setting in the options word uses three bits that are |
1332 |
treated as a number, giving eight possibilities. Currently only six are |
treated as a number, giving eight possibilities. Currently only six are |
1333 |
used (default plus the five values above). This means that if you set |
used (default plus the five values above). This means that if you set |
1334 |
more than one newline option, the combination may or may not be sensi- |
more than one newline option, the combination may or may not be sensi- |
1335 |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
1336 |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
1337 |
cause an error. |
cause an error. |
1338 |
|
|
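The three-bit newline field can be illustrated as follows. This is only
a sketch: the NL_xxx constants and the bit position are illustrative
assumptions chosen to mirror the description above, and the real
PCRE_NEWLINE_xxx values should always be taken from pcre.h. The
function newline_field_is_defined() is likewise hypothetical.

```c
/* Illustrative constants only. They model a three-bit field holding the
   numbers 1 to 5; the real PCRE_NEWLINE_xxx values live in pcre.h. */
enum {
    NL_SHIFT   = 20,
    NL_CR      = 1 << NL_SHIFT,
    NL_LF      = 2 << NL_SHIFT,
    NL_CRLF    = 3 << NL_SHIFT,
    NL_ANY     = 4 << NL_SHIFT,
    NL_ANYCRLF = 5 << NL_SHIFT,
    NL_MASK    = 7 << NL_SHIFT
};

/* Hypothetical helper: returns 1 if the newline field of options holds
   0 (the build-time default) or one of the five defined numbers, and 0
   if it holds one of the two unused numbers. */
int newline_field_is_defined(int options)
{
    int field = (options & NL_MASK) >> NL_SHIFT;
    return field <= 5;
}
```

In this model, NL_CR | NL_LF yields the same number as NL_CRLF, whereas
NL_LF | NL_ANY yields an unused number. Note also that a combination can
land on a defined value that was not intended, so it is safer to set
exactly one newline option.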
1339 |
The only time that a line break is specially recognized when compiling |
The only time that a line break is specially recognized when compiling |
1340 |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
1341 |
character class is encountered. This indicates a comment that lasts |
character class is encountered. This indicates a comment that lasts |
1342 |
until after the next line break sequence. In other circumstances, line |
until after the next line break sequence. In other circumstances, line |
1343 |
break sequences are treated as literal data, except that in |
break sequences are treated as literal data, except that in |
1344 |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
1345 |
and are therefore ignored. |
and are therefore ignored. |
1346 |
|
|
1350 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
1351 |
|
|
1352 |
If this option is set, it disables the use of numbered capturing paren- |
If this option is set, it disables the use of numbered capturing paren- |
1353 |
theses in the pattern. Any opening parenthesis that is not followed by |
theses in the pattern. Any opening parenthesis that is not followed by |
1354 |
? behaves as if it were followed by ?: but named parentheses can still |
? behaves as if it were followed by ?: but named parentheses can still |
1355 |
be used for capturing (and they acquire numbers in the usual way). |
be used for capturing (and they acquire numbers in the usual way). |
1356 |
There is no equivalent of this option in Perl. |
There is no equivalent of this option in Perl. |
1357 |
|
|
1358 |
PCRE_UNGREEDY |
PCRE_UNGREEDY |
1359 |
|
|
1360 |
This option inverts the "greediness" of the quantifiers so that they |
This option inverts the "greediness" of the quantifiers so that they |
1361 |
are not greedy by default, but become greedy if followed by "?". It is |
are not greedy by default, but become greedy if followed by "?". It is |
1362 |
not compatible with Perl. It can also be set by a (?U) option setting |
not compatible with Perl. It can also be set by a (?U) option setting |
1363 |
within the pattern. |
within the pattern. |
1364 |
|
|
1365 |
PCRE_UTF8 |
PCRE_UTF8 |
1366 |
|
|
1367 |
This option causes PCRE to regard both the pattern and the subject as |
This option causes PCRE to regard both the pattern and the subject as |
1368 |
strings of UTF-8 characters instead of single-byte character strings. |
strings of UTF-8 characters instead of single-byte character strings. |
1369 |
However, it is available only when PCRE is built to include UTF-8 sup- |
However, it is available only when PCRE is built to include UTF-8 sup- |
1370 |
port. If not, the use of this option provokes an error. Details of how |
port. If not, the use of this option provokes an error. Details of how |
1371 |
this option changes the behaviour of PCRE are given in the section on |
this option changes the behaviour of PCRE are given in the section on |
1372 |
UTF-8 support in the main pcre page. |
UTF-8 support in the main pcre page. |
1373 |
|
|
1374 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1375 |
|
|
1376 |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is |
1377 |
automatically checked. There is a discussion about the validity of |
automatically checked. There is a discussion about the validity of |
1378 |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
UTF-8 strings in the main pcre page. If an invalid UTF-8 sequence of |
1379 |
bytes is found, pcre_compile() returns an error. If you already know |
bytes is found, pcre_compile() returns an error. If you already know |
1380 |
that your pattern is valid, and you want to skip this check for perfor- |
that your pattern is valid, and you want to skip this check for perfor- |
1381 |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
mance reasons, you can set the PCRE_NO_UTF8_CHECK option. When it is |
1382 |
set, the effect of passing an invalid UTF-8 string as a pattern is |
set, the effect of passing an invalid UTF-8 string as a pattern is |
1383 |
undefined. It may cause your program to crash. Note that this option |
undefined. It may cause your program to crash. Note that this option |
1384 |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
can also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the |
1385 |
UTF-8 validity checking of subject strings. |
UTF-8 validity checking of subject strings. |
1386 |
|
|
1387 |
|
|
1388 |
COMPILATION ERROR CODES |
COMPILATION ERROR CODES |
1389 |
|
|
1390 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
1391 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
1392 |
both compiling functions. As PCRE has developed, some error codes have |
both compiling functions. As PCRE has developed, some error codes have |
1393 |
fallen out of use. To avoid confusion, they have not been re-used. |
fallen out of use. To avoid confusion, they have not been re-used. |
1394 |
|
|
1395 |
0 no error |
0 no error |
1445 |
50 [this code is not in use] |
50 [this code is not in use] |
1446 |
51 octal value is greater than \377 (not in UTF-8 mode) |
51 octal value is greater than \377 (not in UTF-8 mode) |
1447 |
52 internal error: overran compiling workspace |
52 internal error: overran compiling workspace |
1448 |
53 internal error: previously-checked referenced subpattern not |
53 internal error: previously-checked referenced subpattern not |
1449 |
found |
found |
1450 |
54 DEFINE group contains more than one branch |
54 DEFINE group contains more than one branch |
1451 |
55 repeating a DEFINE group is not allowed |
55 repeating a DEFINE group is not allowed |
1460 |
63 digit expected after (?+ |
63 digit expected after (?+ |
1461 |
64 ] is an invalid data character in JavaScript compatibility mode |
64 ] is an invalid data character in JavaScript compatibility mode |
1462 |
|
|
1463 |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
The numbers 32 and 10000 in errors 48 and 49 are defaults; different |
1464 |
values may be used if the limits were changed when PCRE was built. |
values may be used if the limits were changed when PCRE was built. |
1465 |
|
|
1466 |
|
|
1469 |
pcre_extra *pcre_study(const pcre *code, int options, |
pcre_extra *pcre_study(const pcre *code, int options, |
1470 |
const char **errptr); |
const char **errptr); |
1471 |
|
|
1472 |
If a compiled pattern is going to be used several times, it is worth |
If a compiled pattern is going to be used several times, it is worth |
1473 |
spending more time analyzing it in order to reduce the time taken for |
spending more time analyzing it in order to reduce the time taken for |
1474 |
matching. The function pcre_study() takes a pointer to a compiled pat- |
matching. The function pcre_study() takes a pointer to a compiled pat- |
1475 |
tern as its first argument. If studying the pattern produces additional |
tern as its first argument. If studying the pattern produces additional |
1476 |
information that will help speed up matching, pcre_study() returns a |
information that will help speed up matching, pcre_study() returns a |
1477 |
pointer to a pcre_extra block, in which the study_data field points to |
pointer to a pcre_extra block, in which the study_data field points to |
1478 |
the results of the study. |
the results of the study. |
1479 |
|
|
1480 |
The returned value from pcre_study() can be passed directly to |
The returned value from pcre_study() can be passed directly to |
1481 |
pcre_exec(). However, a pcre_extra block also contains other fields |
pcre_exec(). However, a pcre_extra block also contains other fields |
1482 |
that can be set by the caller before the block is passed; these are |
that can be set by the caller before the block is passed; these are |
1483 |
described below in the section on matching a pattern. |
described below in the section on matching a pattern. |
1484 |
|
|
1485 |
If studying the pattern does not produce any additional information |
If studying the pattern does not produce any additional information |
1486 |
pcre_study() returns NULL. In that circumstance, if the calling program |
pcre_study() returns NULL. In that circumstance, if the calling program |
1487 |
wants to pass any of the other fields to pcre_exec(), it must set up |
wants to pass any of the other fields to pcre_exec(), it must set up |
1488 |
its own pcre_extra block. |
its own pcre_extra block. |
1489 |
|
|
1490 |
The second argument of pcre_study() contains option bits. At present, |
The second argument of pcre_study() contains option bits. At present, |
1491 |
no options are defined, and this argument should always be zero. |
no options are defined, and this argument should always be zero. |
1492 |
|
|
1493 |
The third argument for pcre_study() is a pointer for an error message. |
The third argument for pcre_study() is a pointer for an error message. |
1494 |
If studying succeeds (even if no data is returned), the variable it |
If studying succeeds (even if no data is returned), the variable it |
1495 |
points to is set to NULL. Otherwise it is set to point to a textual |
points to is set to NULL. Otherwise it is set to point to a textual |
1496 |
error message. This is a static string that is part of the library. You |
error message. This is a static string that is part of the library. You |
1497 |
must not try to free it. You should test the error pointer for NULL |
must not try to free it. You should test the error pointer for NULL |
1498 |
after calling pcre_study(), to be sure that it has run successfully. |
after calling pcre_study(), to be sure that it has run successfully. |
1499 |
|
|
1500 |
This is a typical call to pcre_study(): |
This is a typical call to pcre_study(): |
1506 |
&error); /* set to NULL or points to a message */ |
&error); /* set to NULL or points to a message */ |
1507 |
|
|
1508 |
At present, studying a pattern is useful only for non-anchored patterns |
At present, studying a pattern is useful only for non-anchored patterns |
1509 |
that do not have a single fixed starting character. A bitmap of possi- |
that do not have a single fixed starting character. A bitmap of possi- |
1510 |
ble starting bytes is created. |
ble starting bytes is created. |
1511 |
|
|
1512 |
|
|
1513 |
LOCALE SUPPORT |
LOCALE SUPPORT |
1514 |
|
|
1515 |
PCRE handles caseless matching, and determines whether characters are |
PCRE handles caseless matching, and determines whether characters are |
1516 |
letters, digits, or whatever, by reference to a set of tables, indexed |
letters, digits, or whatever, by reference to a set of tables, indexed |
1517 |
by character value. When running in UTF-8 mode, this applies only to |
by character value. When running in UTF-8 mode, this applies only to |
1518 |
characters with codes less than 128. Higher-valued codes never match |
characters with codes less than 128. Higher-valued codes never match |
1519 |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
1520 |
with Unicode character property support. The use of locales with Uni- |
with Unicode character property support. The use of locales with Uni- |
1521 |
code is discouraged. If you are handling characters with codes greater |
code is discouraged. If you are handling characters with codes greater |
1522 |
than 128, you should either use UTF-8 and Unicode, or use locales, but |
than 128, you should either use UTF-8 and Unicode, or use locales, but |
1523 |
not try to mix the two. |
not try to mix the two. |
1524 |
|
|
1525 |
PCRE contains an internal set of tables that are used when the final |
PCRE contains an internal set of tables that are used when the final |
1526 |
argument of pcre_compile() is NULL. These are sufficient for many |
argument of pcre_compile() is NULL. These are sufficient for many |
1527 |
applications. Normally, the internal tables recognize only ASCII char- |
applications. Normally, the internal tables recognize only ASCII char- |
1528 |
acters. However, when PCRE is built, it is possible to cause the inter- |
acters. However, when PCRE is built, it is possible to cause the inter- |
1529 |
nal tables to be rebuilt in the default "C" locale of the local system, |
nal tables to be rebuilt in the default "C" locale of the local system, |
1530 |
which may cause them to be different. |
which may cause them to be different. |
1531 |
|
|
1532 |
The internal tables can always be overridden by tables supplied by the |
The internal tables can always be overridden by tables supplied by the |
1533 |
application that calls PCRE. These may be created in a different locale |
application that calls PCRE. These may be created in a different locale |
1534 |
from the default. As more and more applications change to using Uni- |
from the default. As more and more applications change to using Uni- |
1535 |
code, the need for this locale support is expected to die away. |
code, the need for this locale support is expected to die away. |
1536 |
|
|
1537 |
External tables are built by calling the pcre_maketables() function, |
External tables are built by calling the pcre_maketables() function, |
1538 |
which has no arguments, in the relevant locale. The result can then be |
which has no arguments, in the relevant locale. The result can then be |
1539 |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
1540 |
example, to build and use tables that are appropriate for the French |
example, to build and use tables that are appropriate for the French |
1541 |
locale (where accented characters with values greater than 128 are |
locale (where accented characters with values greater than 128 are |
1542 |
treated as letters), the following code could be used: |
treated as letters), the following code could be used: |
1543 |
|
|
1544 |
setlocale(LC_CTYPE, "fr_FR"); |
setlocale(LC_CTYPE, "fr_FR"); |
1545 |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
1546 |
re = pcre_compile(..., tables); |
re = pcre_compile(..., tables); |
1547 |
|
|
1548 |
The locale name "fr_FR" is used on Linux and other Unix-like systems; |
The locale name "fr_FR" is used on Linux and other Unix-like systems; |
1549 |
if you are using Windows, the name for the French locale is "french". |
if you are using Windows, the name for the French locale is "french". |
1550 |
|
|
1551 |
When pcre_maketables() runs, the tables are built in memory that is |
When pcre_maketables() runs, the tables are built in memory that is |
1552 |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
1553 |
that the memory containing the tables remains available for as long as |
that the memory containing the tables remains available for as long as |
1554 |
it is needed. |
it is needed. |
1555 |
|
|
1556 |
The pointer that is passed to pcre_compile() is saved with the compiled |
The pointer that is passed to pcre_compile() is saved with the compiled |
1557 |
pattern, and the same tables are used via this pointer by pcre_study() |
pattern, and the same tables are used via this pointer by pcre_study() |
1558 |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
and normally also by pcre_exec(). Thus, by default, for any single pat- |
1559 |
tern, compilation, studying and matching all happen in the same locale, |
tern, compilation, studying and matching all happen in the same locale, |
1560 |
but different patterns can be compiled in different locales. |
but different patterns can be compiled in different locales. |
1561 |
|
|
1562 |
It is possible to pass a table pointer or NULL (indicating the use of |
It is possible to pass a table pointer or NULL (indicating the use of |
1563 |
the internal tables) to pcre_exec(). Although not intended for this |
the internal tables) to pcre_exec(). Although not intended for this |
1564 |
purpose, this facility could be used to match a pattern in a different |
purpose, this facility could be used to match a pattern in a different |
1565 |
locale from the one in which it was compiled. Passing table pointers at |
locale from the one in which it was compiled. Passing table pointers at |
1566 |
run time is discussed below in the section on matching a pattern. |
run time is discussed below in the section on matching a pattern. |
1567 |
|
|
1571 |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
int pcre_fullinfo(const pcre *code, const pcre_extra *extra, |
1572 |
int what, void *where); |
int what, void *where); |
1573 |
|
|
1574 |
The pcre_fullinfo() function returns information about a compiled pat- |
The pcre_fullinfo() function returns information about a compiled pat- |
1575 |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
tern. It replaces the obsolete pcre_info() function, which is neverthe- |
1576 |
less retained for backwards compatibility (and is documented below). |
less retained for backwards compatibility (and is documented below). |
1577 |
|
|
1578 |
The first argument for pcre_fullinfo() is a pointer to the compiled |
The first argument for pcre_fullinfo() is a pointer to the compiled |
1579 |
pattern. The second argument is the result of pcre_study(), or NULL if |
pattern. The second argument is the result of pcre_study(), or NULL if |
1580 |
the pattern was not studied. The third argument specifies which piece |
the pattern was not studied. The third argument specifies which piece |
1581 |
of information is required, and the fourth argument is a pointer to a |
of information is required, and the fourth argument is a pointer to a |
1582 |
variable to receive the data. The yield of the function is zero for |
variable to receive the data. The yield of the function is zero for |
1583 |
success, or one of the following negative numbers: |
success, or one of the following negative numbers: |
1584 |
|
|
1585 |
PCRE_ERROR_NULL the argument code was NULL |
PCRE_ERROR_NULL the argument code was NULL |
1587 |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
PCRE_ERROR_BADMAGIC the "magic number" was not found |
1588 |
PCRE_ERROR_BADOPTION the value of what was invalid |
PCRE_ERROR_BADOPTION the value of what was invalid |
1589 |
|
|
1590 |
The "magic number" is placed at the start of each compiled pattern as |
The "magic number" is placed at the start of each compiled pattern as |
1591 |
a simple check against passing an arbitrary memory pointer. Here is a |
a simple check against passing an arbitrary memory pointer. Here is a |
1592 |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
typical call of pcre_fullinfo(), to obtain the length of the compiled |
1593 |
pattern: |
pattern: |
1594 |
|
|
1595 |
int rc; |
int rc; |
1600 |
PCRE_INFO_SIZE, /* what is required */ |
PCRE_INFO_SIZE, /* what is required */ |
1601 |
&length); /* where to put the data */ |
&length); /* where to put the data */ |
1602 |
|
|
1603 |
The possible values for the third argument are defined in pcre.h, and |
The possible values for the third argument are defined in pcre.h, and |
1604 |
are as follows: |
are as follows: |
1605 |
|
|
1606 |
PCRE_INFO_BACKREFMAX |
PCRE_INFO_BACKREFMAX |
1607 |
|
|
1608 |
Return the number of the highest back reference in the pattern. The |
Return the number of the highest back reference in the pattern. The |
1609 |
fourth argument should point to an int variable. Zero is returned if |
fourth argument should point to an int variable. Zero is returned if |
1610 |
there are no back references. |
there are no back references. |
1611 |
|
|
1612 |
PCRE_INFO_CAPTURECOUNT |
PCRE_INFO_CAPTURECOUNT |
1613 |
|
|
1614 |
Return the number of capturing subpatterns in the pattern. The fourth |
Return the number of capturing subpatterns in the pattern. The fourth |
1615 |
argument should point to an int variable. |
argument should point to an int variable. |
1616 |
|
|
1617 |
PCRE_INFO_DEFAULT_TABLES |
PCRE_INFO_DEFAULT_TABLES |
1618 |
|
|
1619 |
Return a pointer to the internal default character tables within PCRE. |
Return a pointer to the internal default character tables within PCRE. |
1620 |
The fourth argument should point to an unsigned char * variable. This |
The fourth argument should point to an unsigned char * variable. This |
1621 |
information call is provided for internal use by the pcre_study() func- |
information call is provided for internal use by the pcre_study() func- |
1622 |
tion. External callers can cause PCRE to use its internal tables by |
tion. External callers can cause PCRE to use its internal tables by |
1623 |
passing a NULL table pointer. |
passing a NULL table pointer. |
1624 |
|
|
1625 |
PCRE_INFO_FIRSTBYTE |
PCRE_INFO_FIRSTBYTE |
1626 |
|
|
1627 |
Return information about the first byte of any matched string, for a |
Return information about the first byte of any matched string, for a |
1628 |
non-anchored pattern. The fourth argument should point to an int vari- |
non-anchored pattern. The fourth argument should point to an int vari- |
1629 |
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name |
able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old name |
1630 |
is still recognized for backwards compatibility.) |
is still recognized for backwards compatibility.) |
1631 |
|
|
1632 |
If there is a fixed first byte, for example, from a pattern such as |
If there is a fixed first byte, for example, from a pattern such as |
1633 |
(cat|cow|coyote), its value is returned. Otherwise, if either |
(cat|cow|coyote), its value is returned. Otherwise, if either |
1634 |
|
|
1635 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
1636 |
branch starts with "^", or |
branch starts with "^", or |
1637 |
|
|
1638 |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
(b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not |
1639 |
set (if it were set, the pattern would be anchored), |
set (if it were set, the pattern would be anchored), |
1640 |
|
|
1641 |
-1 is returned, indicating that the pattern matches only at the start |
-1 is returned, indicating that the pattern matches only at the start |
1642 |
of a subject string or after any newline within the string. Otherwise |
of a subject string or after any newline within the string. Otherwise |
1643 |
-2 is returned. For anchored patterns, -2 is returned. |
-2 is returned. For anchored patterns, -2 is returned. |
1644 |
|
|
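The three kinds of value described for PCRE_INFO_FIRSTBYTE can be told
apart with a small classifier such as the sketch below. The enum and
the function classify_first_byte() are hypothetical names introduced
for illustration. The argument is assumed to be the int that
pcre_fullinfo() stored through its fourth argument.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FB_FIXED,       /* value is the fixed first byte itself           */
    FB_LINE_START,  /* -1: match only at start or after a newline     */
    FB_NONE,        /* -2: no fixed first byte, or pattern anchored   */
    FB_UNEXPECTED   /* any other negative value (not documented)      */
};

/* Maps the documented return values onto the enum above. */
enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)  return FB_FIXED;
    if (value == -1) return FB_LINE_START;
    if (value == -2) return FB_NONE;
    return FB_UNEXPECTED;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be the
code of "c", which this sketch reports as FB_FIXED.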
1645 |
PCRE_INFO_FIRSTTABLE |
PCRE_INFO_FIRSTTABLE |
1646 |
|
|
1647 |
If the pattern was studied, and this resulted in the construction of a |
If the pattern was studied, and this resulted in the construction of a |
1648 |
256-bit table indicating a fixed set of bytes for the first byte in any |
256-bit table indicating a fixed set of bytes for the first byte in any |
1649 |
matching string, a pointer to the table is returned. Otherwise NULL is |
matching string, a pointer to the table is returned. Otherwise NULL is |
1650 |
returned. The fourth argument should point to an unsigned char * vari- |
returned. The fourth argument should point to an unsigned char * vari- |
1651 |
able. |
able. |
1652 |
|
|
1653 |
PCRE_INFO_HASCRORLF |
PCRE_INFO_HASCRORLF |
1654 |
|
|
1655 |
Return 1 if the pattern contains any explicit matches for CR or LF |
Return 1 if the pattern contains any explicit matches for CR or LF |
1656 |
characters, otherwise 0. The fourth argument should point to an int |
characters, otherwise 0. The fourth argument should point to an int |
1657 |
variable. An explicit match is either a literal CR or LF character, or |
variable. An explicit match is either a literal CR or LF character, or |
1658 |
\r or \n. |
\r or \n. |
1659 |
|
|
1660 |
PCRE_INFO_JCHANGED |
PCRE_INFO_JCHANGED |
1661 |
|
|
1662 |
Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
Return 1 if the (?J) or (?-J) option setting is used in the pattern, |
1663 |
otherwise 0. The fourth argument should point to an int variable. (?J) |
otherwise 0. The fourth argument should point to an int variable. (?J) |
1664 |
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. |
and (?-J) set and unset the local PCRE_DUPNAMES option, respectively. |
1665 |
|
|
1666 |
PCRE_INFO_LASTLITERAL |
PCRE_INFO_LASTLITERAL |
1667 |
|
|
1668 |
Return the value of the rightmost literal byte that must exist in any |
Return the value of the rightmost literal byte that must exist in any |
1669 |
matched string, other than at its start, if such a byte has been |
matched string, other than at its start, if such a byte has been |
1670 |
recorded. The fourth argument should point to an int variable. If there |
recorded. The fourth argument should point to an int variable. If there |
1671 |
is no such byte, -1 is returned. For anchored patterns, a last literal |
is no such byte, -1 is returned. For anchored patterns, a last literal |
1672 |
byte is recorded only if it follows something of variable length. For |
byte is recorded only if it follows something of variable length. For |
1673 |
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
example, for the pattern /^a\d+z\d+/ the returned value is "z", but for |
1674 |
/^a\dz\d/ the returned value is -1. |
/^a\dz\d/ the returned value is -1. |
1675 |
|
|
1677 |
PCRE_INFO_NAMEENTRYSIZE |
PCRE_INFO_NAMEENTRYSIZE |
1678 |
PCRE_INFO_NAMETABLE |
PCRE_INFO_NAMETABLE |
1679 |
|
|
1680 |
PCRE supports the use of named as well as numbered capturing parenthe- |
PCRE supports the use of named as well as numbered capturing parenthe- |
1681 |
ses. The names are just an additional way of identifying the parenthe- |
ses. The names are just an additional way of identifying the parenthe- |
1682 |
ses, which still acquire numbers. Several convenience functions such as |
ses, which still acquire numbers. Several convenience functions such as |
1683 |
pcre_get_named_substring() are provided for extracting captured sub- |
pcre_get_named_substring() are provided for extracting captured sub- |
1684 |
strings by name. It is also possible to extract the data directly, by |
strings by name. It is also possible to extract the data directly, by |
1685 |
first converting the name to a number in order to access the correct |
first converting the name to a number in order to access the correct |
1686 |
pointers in the output vector (described with pcre_exec() below). To do |
pointers in the output vector (described with pcre_exec() below). To do |
1687 |
the conversion, you need to use the name-to-number map, which is |
the conversion, you need to use the name-to-number map, which is |
1688 |
described by these three values. |
described by these three values. |
1689 |
|
|
1690 |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT |
1691 |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size |
1692 |
of each entry; both of these return an int value. The entry size |
of each entry; both of these return an int value. The entry size |
1693 |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
depends on the length of the longest name. PCRE_INFO_NAMETABLE returns |
1694 |
a pointer to the first entry of the table (a pointer to char). The |
a pointer to the first entry of the table (a pointer to char). The |
1695 |
first two bytes of each entry are the number of the capturing parenthe- |
first two bytes of each entry are the number of the capturing parenthe- |
1696 |
sis, most significant byte first. The rest of the entry is the corre- |
sis, most significant byte first. The rest of the entry is the corre- |
1697 |
sponding name, zero terminated. The names are in alphabetical order. |
sponding name, zero terminated. The names are in alphabetical order. |
1698 |
When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
When PCRE_DUPNAMES is set, duplicate names are in order of their paren- |
1699 |
theses numbers. For example, consider the following pattern (assume |
theses numbers. For example, consider the following pattern (assume |
1700 |
PCRE_EXTENDED is set, so white space - including newlines - is |
PCRE_EXTENDED is set, so white space - including newlines - is |
1701 |
ignored): |
ignored): |
1702 |
|
|
1703 |
(?<date> (?<year>(\d\d)?\d\d) - |
(?<date> (?<year>(\d\d)?\d\d) - |
1704 |
(?<month>\d\d) - (?<day>\d\d) ) |
(?<month>\d\d) - (?<day>\d\d) ) |
1705 |
|
|
1706 |
There are four named subpatterns, so the table has four entries, and |
There are four named subpatterns, so the table has four entries, and |
1707 |
each entry in the table is eight bytes long. The table is as follows, |
each entry in the table is eight bytes long. The table is as follows, |
1708 |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
with non-printing bytes shown in hexadecimal, and undefined bytes shown |
1709 |
as ??: |
as ??: |
1710 |
|
|
1713 |
00 04 m o n t h 00 |
00 04 m o n t h 00 |
1714 |
00 02 y e a r 00 ?? |
00 02 y e a r 00 ?? |
1715 |
|
|
1716 |
When writing code to extract data from named subpatterns using the |
When writing code to extract data from named subpatterns using the |
1717 |
name-to-number map, remember that the length of the entries is likely |
name-to-number map, remember that the length of the entries is likely |
1718 |
to be different for each compiled pattern. |
to be different for each compiled pattern. |
1719 |
|
|
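As a rough illustration of the entry layout described above, the following C sketch walks a name table and returns the group number stored for a given name. The function name find_group_number and the array sample_table are hypothetical; the three parameters stand in for the values that PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE would report. The sample data holds only the two entries visible in the table above, and it assumes nothing about the undefined byte beyond its position.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the capturing-group number stored for `name`, or -1 if the
 * name is not in the table. `table`, `name_count` and `entry_size` stand
 * for the values reported by PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT
 * and PCRE_INFO_NAMEENTRYSIZE (hypothetical helper, not part of PCRE). */
static int find_group_number(const unsigned char *table, int name_count,
                             int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0 and 1: group number, most significant byte first. */
        int number = (entry[0] << 8) | entry[1];
        /* Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return number;
    }
    return -1;
}

/* The "month" and "year" entries, eight bytes each. 0xFF is an
 * arbitrary filler for the byte shown as ?? (its value is undefined). */
static const unsigned char sample_table[] = {
    0x00, 0x04, 'm', 'o', 'n', 't', 'h', 0x00,
    0x00, 0x02, 'y', 'e', 'a', 'r', 0x00, 0xFF
};
```

A call such as find_group_number(sample_table, 2, 8, "month") reads the entry size from its argument rather than hard-coding it, which matters because the size varies between compiled patterns. Since the names are stored in alphabetical order, a binary search would also work; the linear scan is used here only to keep the sketch short.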
1720 |
PCRE_INFO_OKPARTIAL |
PCRE_INFO_OKPARTIAL |
1721 |
|
|
1722 |
Return 1 if the pattern can be used for partial matching, otherwise 0. |
Return 1 if the pattern can be used for partial matching, otherwise 0. |
1723 |
The fourth argument should point to an int variable. The pcrepartial |
The fourth argument should point to an int variable. From release 8.00, |
1724 |
documentation lists the restrictions that apply to patterns when par- |
this always returns 1, because the restrictions that previously applied |
1725 |
tial matching is used. |
to partial matching have been lifted. The pcrepartial documentation |
1726 |
|
gives details of partial matching. |
1727 |
|
|
1728 |
PCRE_INFO_OPTIONS |
PCRE_INFO_OPTIONS |
1729 |
|
|
1928 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
1929 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
1930 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
1931 |
PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_SOFT, and PCRE_PARTIAL_HARD. |
1932 |
|
|
1933 |
PCRE_ANCHORED |
PCRE_ANCHORED |
1934 |
|
|
2020 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
2021 |
if that fails by advancing the starting offset (see below) and trying |
if that fails by advancing the starting offset (see below) and trying |
2022 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
2023 |
this in the pcredemo.c sample program. |
this in the pcredemo sample program. |
2024 |
|
|
2025 |
PCRE_NO_START_OPTIMIZE |
PCRE_NO_START_OPTIMIZE |
2026 |
|
|
2055 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
2056 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
2057 |
|
|
2058 |
PCRE_PARTIAL |
PCRE_PARTIAL_HARD |
2059 |
|
PCRE_PARTIAL_SOFT |
2060 |
|
|
2061 |
This option turns on the partial matching feature. If the subject |
These options turn on the partial matching feature. For backwards com- |
2062 |
string fails to match the pattern, but at some point during the match- |
patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial |
2063 |
ing process the end of the subject was reached (that is, the subject |
match occurs if the end of the subject string is reached successfully, |
2064 |
partially matches the pattern and the failure to match occurred only |
but there are not enough subject characters to complete the match. If |
2065 |
because there were not enough subject characters), pcre_exec() returns |
this happens when PCRE_PARTIAL_HARD is set, pcre_exec() immediately |
2066 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
returns PCRE_ERROR_PARTIAL. Otherwise, if PCRE_PARTIAL_SOFT is set, |
2067 |
used, there are restrictions on what may appear in the pattern. These |
matching continues by testing any other alternatives. Only if they all |
2068 |
are discussed in the pcrepartial documentation. |
fail is PCRE_ERROR_PARTIAL returned (instead of PCRE_ERROR_NOMATCH). |
2069 |
|
The portion of the string that provided the partial match is set as the |
2070 |
|
first matching string. There is a more detailed discussion in the |
2071 |
|
pcrepartial documentation. |
2072 |
|
|
2073 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
2074 |
|
|
2075 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
2076 |
length (in bytes) in length, and a starting byte offset in startoffset. |
length (in bytes) in length, and a starting byte offset in startoffset. |
2077 |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
2078 |
acter. Unlike the pattern string, the subject may contain binary zero |
acter. Unlike the pattern string, the subject may contain binary zero |
2079 |
bytes. When the starting offset is zero, the search for a match starts |
bytes. When the starting offset is zero, the search for a match starts |
2080 |
at the beginning of the subject, and this is by far the most common |
at the beginning of the subject, and this is by far the most common |
2081 |
case. |
case. |
2082 |
|
|
2083 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
2084 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
2085 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
2086 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
2087 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
2088 |
|
|
2089 |
\Biss\B |
\Biss\B |
2090 |
|
|
2091 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
2092 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
2093 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
2094 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
2095 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
2096 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
2097 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
2098 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
2099 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
2100 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
2101 |
|
|
2102 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
2103 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
2104 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
2105 |
subject. |
subject. |
2106 |
|
|
2107 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
2108 |
|
|
2109 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
2110 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
2111 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
2112 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
2113 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
2114 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
2115 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
2116 |
|
|
2117 |
Captured substrings are returned to the caller via a vector of integers |
Captured substrings are returned to the caller via a vector of integers |
2118 |
whose address is passed in ovector. The number of elements in the vec- |
whose address is passed in ovector. The number of elements in the vec- |
2119 |
tor is passed in ovecsize, which must be a non-negative number. Note: |
tor is passed in ovecsize, which must be a non-negative number. Note: |
2120 |
this argument is NOT the size of ovector in bytes. |
this argument is NOT the size of ovector in bytes. |
2121 |
|
|
2122 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
2123 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
2124 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
2125 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
2126 |
The number passed in ovecsize should always be a multiple of three. If |
The number passed in ovecsize should always be a multiple of three. If |
2127 |
it is not, it is rounded down. |
it is not, it is rounded down. |
2128 |
|
|
2129 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
2130 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
2131 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
2132 |
element of each pair is set to the byte offset of the first character |
element of each pair is set to the byte offset of the first character |
2133 |
in a substring, and the second is set to the byte offset of the first |
in a substring, and the second is set to the byte offset of the first |
2134 |
character after the end of a substring. Note: these values are always |
character after the end of a substring. Note: these values are always |
2135 |
byte offsets, even in UTF-8 mode. They are not character counts. |
byte offsets, even in UTF-8 mode. They are not character counts. |
2136 |
|
|
2137 |
The first pair of integers, ovector[0] and ovector[1], identify the |
The first pair of integers, ovector[0] and ovector[1], identify the |
2138 |
portion of the subject string matched by the entire pattern. The next |
portion of the subject string matched by the entire pattern. The next |
2139 |
pair is used for the first capturing subpattern, and so on. The value |
pair is used for the first capturing subpattern, and so on. The value |
2140 |
returned by pcre_exec() is one more than the highest numbered pair that |
returned by pcre_exec() is one more than the highest numbered pair that |
2141 |
has been set. For example, if two substrings have been captured, the |
has been set. For example, if two substrings have been captured, the |
2142 |
returned value is 3. If there are no capturing subpatterns, the return |
returned value is 3. If there are no capturing subpatterns, the return |
2143 |
value from a successful match is 1, indicating that just the first pair |
value from a successful match is 1, indicating that just the first pair |
2144 |
of offsets has been set. |
of offsets has been set. |
2145 |
|
|
2146 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
2147 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
2148 |
|
|
2149 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
2150 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
2151 |
function returns a value of zero. If the substring offsets are not of |
function returns a value of zero. If the substring offsets are not of |
2152 |
interest, pcre_exec() may be called with ovector passed as NULL and |
interest, pcre_exec() may be called with ovector passed as NULL and |
2153 |
ovecsize as zero. However, if the pattern contains back references and |
ovecsize as zero. However, if the pattern contains back references and |
2154 |
the ovector is not big enough to remember the related substrings, PCRE |
the ovector is not big enough to remember the related substrings, PCRE |
2155 |
has to get additional memory for use during matching. Thus it is usu- |
has to get additional memory for use during matching. Thus it is usu- |
2156 |
ally advisable to supply an ovector. |
ally advisable to supply an ovector. |
2157 |
|
|
2158 |
The pcre_info() function can be used to find out how many capturing |
The pcre_info() function can be used to find out how many capturing |
2159 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
2160 |
ovector that will allow for n captured substrings, in addition to the |
ovector that will allow for n captured substrings, in addition to the |
2161 |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
2162 |
|
|
2163 |
It is possible for capturing subpattern number n+1 to match some part |
It is possible for capturing subpattern number n+1 to match some part |
2164 |
of the subject when subpattern n has not been used at all. For example, |
of the subject when subpattern n has not been used at all. For example, |
2165 |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
2166 |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
2167 |
2 is not. When this happens, both values in the offset pairs corre- |
2 is not. When this happens, both values in the offset pairs corre- |
2168 |
sponding to unused subpatterns are set to -1. |
sponding to unused subpatterns are set to -1. |
2169 |
|
|
2170 |
Offset values that correspond to unused subpatterns at the end of the |
Offset values that correspond to unused subpatterns at the end of the |
2171 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
2172 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
2173 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
2174 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1. However, you can refer to the offsets |
2175 |
for the second and third capturing subpatterns if you wish (assuming |
for the second and third capturing subpatterns if you wish (assuming |
2176 |
the vector is large enough, of course). |
the vector is large enough, of course). |
2177 |
|
|
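The offset-pair conventions above can be sketched in plain C without calling the library. The helpers min_ovecsize and copy_capture and the array sample_ovector are hypothetical names introduced for illustration; the library's own convenience functions, such as pcre_copy_substring(), are the supported way to do this. The sample data reproduces the (a|(z))(bc) example matched against "abc", where the return value is 4 and subpattern 2 is unset.

```c
#include <assert.h>
#include <string.h>

/* Smallest ovecsize that leaves room for n captured substrings in
 * addition to the offsets of the whole match. */
static int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Copy capture `group` into buf as a zero-terminated string.
 * `pairs` is the number of offset pairs that ovector can hold
 * (ovecsize / 3). Returns the substring length in bytes, or -1 if the
 * group is out of range, unset, or too long for buf. */
static int copy_capture(const char *subject, const int *ovector, int pairs,
                        int group, char *buf, int bufsize)
{
    int start, end, len;
    assert(subject != NULL && ovector != NULL && buf != NULL);
    if (group < 0 || group >= pairs)
        return -1;
    start = ovector[2 * group];
    end = ovector[2 * group + 1];
    if (start < 0 || end < start)   /* -1, -1 marks an unused subpattern */
        return -1;
    len = end - start;              /* byte offsets, not character counts */
    if (len >= bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}

/* Offsets for "abc" matched against (a|(z))(bc): whole match, group 1,
 * unset group 2, group 3. The last third is workspace (zeroed here). */
static const int sample_ovector[12] = {
    0, 3,   0, 1,   -1, -1,   1, 3,   0, 0, 0, 0
};
```

Because the subject may contain binary zero bytes, the returned length is the reliable measure of the substring; the zero terminator is only a convenience for text subjects.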
2178 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
2179 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
2180 |
|
|
2181 |
Error return values from pcre_exec() |
Error return values from pcre_exec() |
2182 |
|
|
2183 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
2184 |
defined in the header file: |
defined in the header file: |
2185 |
|
|
2186 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
2189 |
|
|
2190 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
2191 |
|
|
2192 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
2193 |
ovecsize was not zero. |
ovecsize was not zero. |
2194 |
|
|
2195 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
2198 |
|
|
2199 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
2200 |
|
|
2201 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
2202 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
2203 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
2204 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
2205 |
gives when the magic number is not present. |
gives when the magic number is not present. |
2206 |
|
|
2207 |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
2208 |
|
|
2209 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
2210 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
2211 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
2212 |
|
|
2213 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2214 |
|
|
2215 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
2216 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
2217 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
2218 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
2219 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
2220 |
|
|
2221 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
2222 |
|
|
2223 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
2224 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
2225 |
returned by pcre_exec(). |
returned by pcre_exec(). |
2226 |
|
|
2227 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
2228 |
|
|
2229 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
2230 |
pcre_extra structure (or defaulted) was reached. See the description |
pcre_extra structure (or defaulted) was reached. See the description |
2231 |
above. |
above. |
2232 |
|
|
2233 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
2234 |
|
|
2235 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
2236 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
2237 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
2238 |
|
|
2239 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
2240 |
|
|
2241 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
2242 |
subject. |
subject. |
2243 |
|
|
2244 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
2245 |
|
|
2246 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
2247 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
2248 |
ter. |
ter. |
2249 |
|
|
2250 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
2251 |
|
|
2252 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
2253 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
2254 |
|
|
2255 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
2256 |
|
|
2257 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
This code is no longer in use. It was formerly returned when the |
2258 |
items that are not supported for partial matching. See the pcrepartial |
PCRE_PARTIAL option was used with a compiled pattern containing items |
2259 |
documentation for details of partial matching. |
that were not supported for partial matching. From release 8.00 |
2260 |
|
onwards, there are no restrictions on partial matching. |
2261 |
|
|
2262 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
2263 |
|
|
2521 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
2522 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2523 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
2524 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL_HARD, PCRE_PARTIAL_SOFT, PCRE_DFA_SHORTEST, and |
2525 |
three of these are the same as for pcre_exec(), so their description is |
PCRE_DFA_RESTART. All but the last four of these are exactly the same |
2526 |
not repeated here. |
as for pcre_exec(), so their description is not repeated here. |
2527 |
|
|
2528 |
PCRE_PARTIAL |
PCRE_PARTIAL_HARD |
2529 |
|
PCRE_PARTIAL_SOFT |
2530 |
This has the same general effect as it does for pcre_exec(), but the |
|
2531 |
details are slightly different. When PCRE_PARTIAL is set for |
These have the same general effect as they do for pcre_exec(), but the |
2532 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
details are slightly different. When PCRE_PARTIAL_HARD is set for |
2533 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of the sub- |
2534 |
been no complete matches, but there is still at least one matching pos- |
ject is reached and there is still at least one matching possibility |
2535 |
sibility. The portion of the string that provided the partial match is |
that requires additional characters. This happens even if some complete |
2536 |
set as the first matching string. |
matches have also been found. When PCRE_PARTIAL_SOFT is set, the return |
2537 |
|
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end |
2538 |
|
of the subject is reached, there have been no complete matches, but |
2539 |
|
there is still at least one matching possibility. The portion of the |
2540 |
|
string that provided the longest partial match is set as the first |
2541 |
|
matching string in both cases. |
2542 |
|
|
2543 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
2544 |
|
|
2549 |
|
|
2550 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
2551 |
|
|
2552 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() returns a partial match, it is possible to call it |
2553 |
returns a partial match, it is possible to call it again, with addi- |
again, with additional subject characters, and have it continue with |
2554 |
tional subject characters, and have it continue with the same match. |
the same match. The PCRE_DFA_RESTART option requests this action; when |
2555 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
it is set, the workspace and wscount options must reference the same |
2556 |
workspace and wscount options must reference the same vector as before |
vector as before because data about the match so far is left in them |
2557 |
because data about the match so far is left in them after a partial |
after a partial match. There is more discussion of this facility in the |
2558 |
match. There is more discussion of this facility in the pcrepartial |
pcrepartial documentation. |
|
documentation. |
|
2559 |
|
|
2560 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
2561 |
|
|
2562 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
2563 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
2564 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
2565 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
2566 |
if the pattern |
if the pattern |
2567 |
|
|
2568 |
<.*> |
<.*> |
2577 |
<something> <something else> |
<something> <something else> |
2578 |
<something> <something else> <something further> |
<something> <something else> <something further> |
2579 |
|
|
2580 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
2581 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
2582 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
2583 |
the offset to the start, and the second is the offset to the end. In |
the offset to the start, and the second is the offset to the end. In |
2584 |
fact, all the strings have the same start offset. (Space could have |
fact, all the strings have the same start offset. (Space could have |
2585 |
been saved by giving this only once, but it was decided to retain some |
been saved by giving this only once, but it was decided to retain some |
2586 |
compatibility with the way pcre_exec() returns data, even though the |
compatibility with the way pcre_exec() returns data, even though the |
2587 |
meaning of the strings is different.) |
meaning of the strings is different.) |
2588 |
|
|
2589 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
2590 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
2591 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
2592 |
filled with the longest matches. |
filled with the longest matches. |
2593 |
|
|
2594 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
2595 |
|
|
2596 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
2597 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
2598 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
2599 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
2600 |
|
|
2601 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
2602 |
|
|
2603 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
2604 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
2605 |
reference. |
reference. |
2606 |
|
|
2607 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
2608 |
|
|
2609 |
This return is given if pcre_dfa_exec() encounters a condition item |
This return is given if pcre_dfa_exec() encounters a condition item |
2610 |
that uses a back reference for the condition, or a test for recursion |
that uses a back reference for the condition, or a test for recursion |
2611 |
in a specific group. These are not supported. |
in a specific group. These are not supported. |
2612 |
|
|
2613 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
2614 |
|
|
2615 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
2616 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
2617 |
(it is meaningless). |
(it is meaningless). |
2618 |
|
|
2619 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
2620 |
|
|
2621 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
2622 |
workspace vector. |
workspace vector. |
2623 |
|
|
2624 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
2625 |
|
|
2626 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
2627 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
2628 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
2629 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
2630 |
|
|
2631 |
|
|
2632 |
SEE ALSO |
SEE ALSO |
2633 |
|
|
2634 |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), |
2635 |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
2636 |
|
|
2637 |
|
|
2644 |
|
|
2645 |
REVISION |
REVISION |
2646 |
|
|
2647 |
Last updated: 11 April 2009 |
Last updated: 01 September 2009 |
2648 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
2649 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2650 |
|
|
2651 |
|
|
2652 |
PCRECALLOUT(3) PCRECALLOUT(3) |
PCRECALLOUT(3) PCRECALLOUT(3) |
2653 |
|
|
2654 |
|
|
2823 |
Last updated: 15 March 2009 |
Last updated: 15 March 2009 |
2824 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
2825 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2826 |
|
|
2827 |
|
|
2828 |
PCRECOMPAT(3) PCRECOMPAT(3) |
PCRECOMPAT(3) PCRECOMPAT(3) |
2829 |
|
|
2830 |
|
|
2837 |
This document describes the differences in the ways that PCRE and Perl |
This document describes the differences in the ways that PCRE and Perl |
2838 |
handle regular expressions. The differences described here are mainly |
handle regular expressions. The differences described here are mainly |
2839 |
with respect to Perl 5.8, though PCRE versions 7.0 and later contain |
with respect to Perl 5.8, though PCRE versions 7.0 and later contain |
2840 |
some features that are expected to be in the forthcoming Perl 5.10. |
some features that are in Perl 5.10. |
2841 |
|
|
2842 |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
2843 |
of what it does have are given in the section on UTF-8 support in the |
of what it does have are given in the section on UTF-8 support in the |
2961 |
|
|
2962 |
REVISION |
REVISION |
2963 |
|
|
2964 |
Last updated: 11 September 2007 |
Last updated: 25 August 2009 |
2965 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
2966 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2967 |
|
|
2968 |
|
|
2969 |
PCREPATTERN(3) PCREPATTERN(3) |
PCREPATTERN(3) PCREPATTERN(3) |
2970 |
|
|
2971 |
|
|
5042 |
Last updated: 11 April 2009 |
Last updated: 11 April 2009 |
5043 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5044 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5045 |
|
|
5046 |
|
|
5047 |
PCRESYNTAX(3) PCRESYNTAX(3) |
PCRESYNTAX(3) PCRESYNTAX(3) |
5048 |
|
|
5049 |
|
|
5395 |
Last updated: 11 April 2009 |
Last updated: 11 April 2009 |
5396 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5397 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5398 |
|
|
5399 |
|
|
5400 |
PCREPARTIAL(3) PCREPARTIAL(3) |
PCREPARTIAL(3) PCREPARTIAL(3) |
5401 |
|
|
5402 |
|
|
5420 |
|
|
5421 |
If the application sees the user's keystrokes one by one, and can check |
If the application sees the user's keystrokes one by one, and can check |
5422 |
that what has been typed so far is potentially valid, it is able to |
that what has been typed so far is potentially valid, it is able to |
5423 |
raise an error as soon as a mistake is made, possibly beeping and not |
raise an error as soon as a mistake is made, by beeping and not |
5424 |
reflecting the character that has been typed. This immediate feedback |
reflecting the character that has been typed, for example. This immedi- |
5425 |
is likely to be a better user interface than a check that is delayed |
ate feedback is likely to be a better user interface than a check that |
5426 |
until the entire string has been entered. |
is delayed until the entire string has been entered. Partial matching |
5427 |
|
can also sometimes be useful when the subject string is very long and |
5428 |
PCRE supports the concept of partial matching by means of the PCRE_PAR- |
is not all available at once. |
5429 |
TIAL option, which can be set when calling pcre_exec() or |
|
5430 |
pcre_dfa_exec(). When this flag is set for pcre_exec(), the return code |
PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and |
5431 |
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time |
PCRE_PARTIAL_HARD options, which can be set when calling pcre_exec() or |
5432 |
during the matching process the last part of the subject string matched |
pcre_dfa_exec(). For backwards compatibility, PCRE_PARTIAL is a synonym |
5433 |
part of the pattern. Unfortunately, for non-anchored matching, it is |
for PCRE_PARTIAL_SOFT. The essential difference between the two options |
5434 |
not possible to obtain the position of the start of the partial match. |
is whether or not a partial match is preferred to an alternative com- |
5435 |
No captured data is set when PCRE_ERROR_PARTIAL is returned. |
plete match, though the details differ between the two matching func- |
5436 |
|
tions. If both options are set, PCRE_PARTIAL_HARD takes precedence. |
5437 |
When PCRE_PARTIAL is set for pcre_dfa_exec(), the return code |
|
5438 |
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of |
Setting a partial matching option disables one of PCRE's optimizations. |
5439 |
the subject is reached, there have been no complete matches, but there |
PCRE remembers the last literal byte in a pattern, and abandons match- |
5440 |
is still at least one matching possibility. The portion of the string |
ing immediately if such a byte is not present in the subject string. |
5441 |
that provided the partial match is set as the first matching string. |
This optimization cannot be used for a subject string that might match |
5442 |
|
only partially. |
5443 |
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers |
|
5444 |
the last literal byte in a pattern, and abandons matching immediately |
|
5445 |
if such a byte is not present in the subject string. This optimization |
PARTIAL MATCHING USING pcre_exec() |
5446 |
cannot be used for a subject string that might match only partially. |
|
5447 |
|
A partial match occurs during a call to pcre_exec() whenever the end of |
5448 |
|
the subject string is reached successfully, but matching cannot con- |
5449 |
RESTRICTED PATTERNS FOR PCRE_PARTIAL |
tinue because more characters are needed. However, at least one charac- |
5450 |
|
ter must have been matched. (In other words, a partial match can never |
5451 |
Because of the way certain internal optimizations are implemented in |
be an empty string.) |
5452 |
the pcre_exec() function, the PCRE_PARTIAL option cannot be used with |
|
5453 |
all patterns. These restrictions do not apply when pcre_dfa_exec() is |
If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but |
5454 |
used. For pcre_exec(), repeated single characters such as |
matching continues as normal, and other alternatives in the pattern are |
5455 |
|
tried. If no complete match can be found, pcre_exec() returns |
5456 |
a{2,4} |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH, and if there are at |
5457 |
|
least two slots in the offsets vector, they are filled in with the off- |
5458 |
and repeated single metasequences such as |
sets of the longest string that partially matched. Consider this pat- |
5459 |
|
tern: |
5460 |
\d+ |
|
5461 |
|
/123\w+X|dogY/ |
5462 |
are not permitted if the maximum number of occurrences is greater than |
|
5463 |
one. Optional items such as \d? (where the maximum is one) are permit- |
If this is matched against the subject string "abc123dog", both alter- |
5464 |
ted. Quantifiers with any values are permitted after parentheses, so |
natives fail to match, but the end of the subject is reached during |
5465 |
the invalid examples above can be coded thus: |
matching, so PCRE_ERROR_PARTIAL is returned instead of |
5466 |
|
PCRE_ERROR_NOMATCH. The offsets are set to 3 and 9, identifying |
5467 |
(a){2,4} |
"123dog" as the longest partial match that was found. (In this example, |
5468 |
(\d)+ |
there are two partial matches, because "dog" on its own partially |
5469 |
|
matches the second alternative.) |
5470 |
These constructions run more slowly, but for the kinds of application |
|
5471 |
that are envisaged for this facility, this is not felt to be a major |
If PCRE_PARTIAL_HARD is set for pcre_exec(), it returns PCRE_ERROR_PAR- |
5472 |
restriction. |
TIAL as soon as a partial match is found, without continuing to search |
5473 |
|
for possible complete matches. The difference between the two options |
5474 |
If PCRE_PARTIAL is set for a pattern that does not conform to the |
can be illustrated by a pattern such as: |
5475 |
restrictions, pcre_exec() returns the error code PCRE_ERROR_BADPARTIAL |
|
5476 |
(-13). You can use the PCRE_INFO_OKPARTIAL call to pcre_fullinfo() to |
/dog(sbody)?/ |
5477 |
find out if a compiled pattern can be used for partial matching. |
|
5478 |
|
This matches either "dog" or "dogsbody", greedily (that is, it prefers |
5479 |
|
the longer string if possible). If it is matched against the string |
5480 |
|
"dog" with PCRE_PARTIAL_SOFT, it yields a complete match for "dog". |
5481 |
|
However, if PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. |
5482 |
|
On the other hand, if the pattern is made ungreedy the result is dif- |
5483 |
|
ferent: |
5484 |
|
|
5485 |
|
/dog(sbody)??/ |
5486 |
|
|
5487 |
|
In this case the result is always a complete match because pcre_exec() |
5488 |
|
finds that first, and it never continues after finding a match. It |
5489 |
|
might be easier to follow this explanation by thinking of the two pat- |
5490 |
|
terns like this: |
5491 |
|
|
5492 |
|
/dog(sbody)?/ is the same as /dogsbody|dog/ |
5493 |
|
/dog(sbody)??/ is the same as /dog|dogsbody/ |
5494 |
|
|
5495 |
|
The second pattern will never match "dogsbody" when pcre_exec() is |
5496 |
|
used, because it will always find the shorter match first. |
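The difference between the two options can be modelled with a toy matcher. The following C sketch is not PCRE code and makes no calls to the library. It assumes a pattern that is just an ordered list of literal alternatives, anchored at the start of the subject, and the names toy_match, TOY_SOFT, TOY_HARD and the result values are invented for illustration. It tries the alternatives in order and stops at the first complete match, as a backtracking matcher would.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: these are not PCRE return codes or options. */
typedef enum { TOY_NOMATCH, TOY_PARTIAL, TOY_MATCH } toy_result;
typedef enum { TOY_SOFT, TOY_HARD } toy_mode;

/* Try each literal alternative in order against the start of subject. */
toy_result toy_match(const char *const *alts, size_t n_alts,
                     const char *subject, toy_mode mode)
{
    assert(alts != NULL && subject != NULL);
    size_t slen = strlen(subject);
    int saw_partial = 0;

    for (size_t i = 0; i < n_alts; i++) {
        size_t alen = strlen(alts[i]);

        /* The whole alternative is present: a complete match ends the search. */
        if (alen <= slen && strncmp(alts[i], subject, alen) == 0)
            return TOY_MATCH;

        /* The subject ran out after at least one character matched. */
        if (slen > 0 && slen < alen && strncmp(alts[i], subject, slen) == 0) {
            if (mode == TOY_HARD)
                return TOY_PARTIAL;   /* hard: report the partial match at once */
            saw_partial = 1;          /* soft: remember it and keep trying */
        }
    }
    return saw_partial ? TOY_PARTIAL : TOY_NOMATCH;
}
```

With the alternatives ordered "dogsbody", "dog" (the greedy case) and the subject "dog", the sketch yields TOY_MATCH in soft mode and TOY_PARTIAL in hard mode. With the order reversed (the ungreedy case) it yields TOY_MATCH in both modes, in line with the description above.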
5497 |
|
|
5498 |
|
|
5499 |
|
PARTIAL MATCHING USING pcre_dfa_exec() |
5500 |
|
|
5501 |
|
The pcre_dfa_exec() function moves along the subject string character |
5502 |
|
by character, without backtracking, searching for all possible matches |
5503 |
|
simultaneously. If the end of the subject is reached before the end of |
5504 |
|
the pattern, there is the possibility of a partial match, again pro- |
5505 |
|
vided that at least one character has matched. |
5506 |
|
|
5507 |
|
When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if |
5508 |
|
there have been no complete matches. Otherwise, the complete matches |
5509 |
|
are returned. However, if PCRE_PARTIAL_HARD is set, a partial match |
5510 |
|
takes precedence over any complete matches. The portion of the string |
5511 |
|
that provided the longest partial match is set as the first matching |
5512 |
|
string, provided there are at least two slots in the offsets vector. |
5513 |
|
|
5514 |
|
Because pcre_dfa_exec() always searches for all possible matches, and |
5515 |
|
there is no difference between greedy and ungreedy repetition, its be- |
5516 |
|
haviour is different from pcre_exec when PCRE_PARTIAL_HARD is set. Con- |
5517 |
|
sider the string "dog" matched against the ungreedy pattern shown |
5518 |
|
above: |
5519 |
|
|
5520 |
|
/dog(sbody)??/ |
5521 |
|
|
5522 |
|
Whereas pcre_exec() stops as soon as it finds the complete match for |
5523 |
|
"dog", pcre_dfa_exec() also finds the partial match for "dogsbody", and |
5524 |
|
so returns that when PCRE_PARTIAL_HARD is set. |
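The contrast with pcre_exec() can be illustrated in the same simplified way. The C sketch below is a toy model, not the DFA algorithm and not PCRE code: it assumes the pattern is a list of literal alternatives anchored at the start of the subject, and match_all and the ALL_ values are hypothetical names. Unlike a backtracking matcher, it examines every alternative before deciding what to return, so the order of the alternatives makes no difference.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names: these are not PCRE return codes. */
typedef enum { ALL_NOMATCH, ALL_PARTIAL, ALL_MATCH } all_result;

/* Examine every literal alternative against the start of subject.
   hard != 0 models PCRE_PARTIAL_HARD, hard == 0 models PCRE_PARTIAL_SOFT. */
all_result match_all(const char *const *alts, size_t n_alts,
                     const char *subject, int hard)
{
    assert(alts != NULL && subject != NULL);
    size_t slen = strlen(subject);
    int complete = 0, partial = 0;

    for (size_t i = 0; i < n_alts; i++) {
        size_t alen = strlen(alts[i]);
        if (alen <= slen && strncmp(alts[i], subject, alen) == 0)
            complete = 1;   /* whole alternative found */
        else if (slen > 0 && slen < alen &&
                 strncmp(alts[i], subject, slen) == 0)
            partial = 1;    /* subject ended part-way through */
    }

    if (hard && partial)
        return ALL_PARTIAL; /* hard: a partial match takes precedence */
    if (complete)
        return ALL_MATCH;   /* soft: complete matches win */
    return partial ? ALL_PARTIAL : ALL_NOMATCH;
}
```

For the alternatives "dog", "dogsbody" and the subject "dog", this model gives ALL_PARTIAL in hard mode even though a complete match exists, which is the behaviour described for pcre_dfa_exec() above.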
5525 |
|
|
5526 |
|
|
5527 |
|
PARTIAL MATCHING AND WORD BOUNDARIES |
5528 |
|
|
5529 |
|
If a pattern ends with one of the sequences \b or \B, which test for word |
5530 |
|
boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter- |
5531 |
|
intuitive results. Consider this pattern: |
5532 |
|
|
5533 |
|
/\bcat\b/ |
5534 |
|
|
5535 |
|
This matches "cat", provided there is a word boundary at either end. If |
5536 |
|
the subject string is "the cat", the comparison of the final "t" with a |
5537 |
|
following character cannot take place, so a partial match is found. |
5538 |
|
However, pcre_exec() carries on with normal matching, which matches \b |
5539 |
|
at the end of the subject when the last character is a letter, thus |
5540 |
|
finding a complete match. The result, therefore, is not PCRE_ERROR_PAR- |
5541 |
|
TIAL. The same thing happens with pcre_dfa_exec(), because it also |
5542 |
|
finds the complete match. |
5543 |
|
|
5544 |
|
Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, |
5545 |
|
because then the partial match takes precedence. |
5546 |
|
|
5547 |
|
|
5548 |
|
FORMERLY RESTRICTED PATTERNS |
5549 |
|
|
5550 |
|
For releases of PCRE prior to 8.00, because of the way certain internal |
5551 |
|
optimizations were implemented in the pcre_exec() function, the |
5552 |
|
PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be |
5553 |
|
used with all patterns. From release 8.00 onwards, the restrictions no |
5554 |
|
longer apply, and partial matching with pcre_exec() can be requested |
5555 |
|
for any pattern. |
5556 |
|
|
5557 |
|
Items that were formerly restricted were repeated single characters and |
5558 |
|
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did |
5559 |
|
not conform to the restrictions, pcre_exec() returned the error code |
5560 |
|
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The |
5561 |
|
PCRE_INFO_OKPARTIAL call to pcre_fullinfo(), which reports whether a |
5562 |
|
compiled pattern can be used for partial matching, now always returns 1. |
5563 |
|
|
5564 |
|
|
5565 |
EXAMPLE OF PARTIAL MATCHING USING PCRETEST |
EXAMPLE OF PARTIAL MATCHING USING PCRETEST |
5566 |
|
|
5567 |
If the escape sequence \P is present in a pcretest data line, the |
If the escape sequence \P is present in a pcretest data line, the |
5568 |
PCRE_PARTIAL flag is used for the match. Here is a run of pcretest that |
PCRE_PARTIAL_SOFT option is used for the match. Here is a run of |
5569 |
uses the date example quoted above: |
pcretest that uses the date example quoted above: |
5570 |
|
|
5571 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
5572 |
data> 25jun04\P |
data> 25jun04\P |
5573 |
0: 25jun04 |
0: 25jun04 |
5574 |
1: jun |
1: jun |
5575 |
data> 25dec3\P |
data> 25dec3\P |
5576 |
Partial match |
Partial match: 25dec3 |
5577 |
data> 3ju\P |
data> 3ju\P |
5578 |
Partial match |
Partial match: 3ju |
5579 |
data> 3juj\P |
data> 3juj\P |
5580 |
No match |
No match |
5581 |
data> j\P |
data> j\P |
5583 |
|
|
5584 |
The first data string is matched completely, so pcretest shows the |
The first data string is matched completely, so pcretest shows the |
5585 |
matched substrings. The remaining four strings do not match the com- |
matched substrings. The remaining four strings do not match the com- |
5586 |
plete pattern, but the first two are partial matches. The same test, |
plete pattern, but the first two are partial matches. Similar output is |
5587 |
using pcre_dfa_exec() matching (by means of the \D escape sequence), |
obtained when pcre_dfa_exec() is used. |
|
produces the following output: |
|
5588 |
|
|
5589 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
If the escape sequence \P is present more than once in a pcretest data |
5590 |
data> 25jun04\P\D |
line, the PCRE_PARTIAL_HARD option is set for the match. |
|
0: 25jun04 |
|
|
data> 23dec3\P\D |
|
|
Partial match: 23dec3 |
|
|
data> 3ju\P\D |
|
|
Partial match: 3ju |
|
|
data> 3juj\P\D |
|
|
No match |
|
|
data> j\P\D |
|
|
No match |
|
|
|
|
|
Notice that in this case the portion of the string that was matched is |
|
|
made available. |
|
5591 |
|
|
5592 |
|
|
5593 |
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() |
MULTI-SEGMENT MATCHING WITH pcre_dfa_exec() |
5594 |
|
|
5595 |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
5596 |
ble to continue the match by providing additional subject data and |
ble to continue the match by providing additional subject data and |
5597 |
calling pcre_dfa_exec() again with the same compiled regular expres- |
calling pcre_dfa_exec() again with the same compiled regular expres- |
5598 |
sion, this time setting the PCRE_DFA_RESTART option. You must also pass |
sion, this time setting the PCRE_DFA_RESTART option. You must pass the |
5599 |
the same working space as before, because this is where details of the |
same working space as before, because this is where details of the pre- |
5600 |
previous partial match are stored. Here is an example using pcretest, |
vious partial match are stored. Here is an example using pcretest, |
5601 |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\D |
5602 |
\D are as above): |
specifies the use of pcre_dfa_exec()): |
5603 |
|
|
5604 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
5605 |
data> 23ja\P\D |
data> 23ja\P\D |
5607 |
data> n05\R\D |
data> n05\R\D |
5608 |
0: n05 |
0: n05 |
5609 |
|
|
5610 |
The first call has "23ja" as the subject, and requests partial match- |
The first call has "23ja" as the subject, and requests partial match- |
5611 |
ing; the second call has "n05" as the subject for the continued |
ing; the second call has "n05" as the subject for the continued |
5612 |
(restarted) match. Notice that when the match is complete, only the |
(restarted) match. Notice that when the match is complete, only the |
5613 |
last part is shown; PCRE does not retain the previously partially- |
last part is shown; PCRE does not retain the previously partially- |
5614 |
matched string. It is up to the calling program to do that if it needs |
matched string. It is up to the calling program to do that if it needs |
5615 |
to. |
to. |
5616 |
|
|
5617 |
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial |
You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with |
5618 |
matching over multiple segments. This facility can be used to pass very |
PCRE_DFA_RESTART to continue partial matching over multiple segments. |
5619 |
long subject strings to pcre_dfa_exec(). However, some care is needed |
This facility can be used to pass very long subject strings to |
5620 |
for certain types of pattern. |
pcre_dfa_exec(). |
5621 |
|
|
5622 |
1. If the pattern contains tests for the beginning or end of a line, |
|
5623 |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
MULTI-SEGMENT MATCHING WITH pcre_exec() |
5624 |
ate, when the subject string for any call does not contain the begin- |
|
5625 |
|
From release 8.00, pcre_exec() can also be used to do multi-segment |
5626 |
|
matching. Unlike pcre_dfa_exec(), it is not possible to restart the |
5627 |
|
previous match with a new segment of data. Instead, new data must be |
5628 |
|
added to the previous subject string, and the entire match re-run, |
5629 |
|
starting from the point where the partial match occurred. Earlier data |
5630 |
|
can be discarded. Consider an unanchored pattern that matches dates: |
5631 |
|
|
5632 |
|
re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/ |
5633 |
|
data> The date is 23ja\P |
5634 |
|
Partial match: 23ja |
5635 |
|
|
5636 |
|
At this stage, an application could discard the text preceding "23ja", |
5637 |
|
add on text from the next segment, and call pcre_exec() again. Unlike |
5638 |
|
pcre_dfa_exec(), the entire matching string must always be available, |
5639 |
|
and the complete matching process occurs for each call, so more memory |
5640 |
|
and more processing time are needed. |
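The buffer handling for this approach might look like the following C sketch. It covers only the step of discarding earlier data and appending the next segment; the call to pcre_exec() itself is left out. The function name carry_over_partial is hypothetical, and the sketch assumes that the caller has obtained the start offset of the partial match (for example, from the first slot of the offsets vector) and that the subject is held as a NUL-terminated string.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Keep only the partially matched tail of buf (from partial_start onwards)
   and append the next segment, ready for the match to be re-run.
   cap is the size of buf in bytes, including room for the terminating NUL.
   next must not point into buf.
   Returns 0 on success, or -1 (leaving buf unchanged) if the offset is out
   of range or the result would not fit. */
int carry_over_partial(char *buf, size_t cap, size_t partial_start,
                       const char *next)
{
    assert(buf != NULL && next != NULL);
    size_t len = strlen(buf);
    if (partial_start > len)
        return -1;

    size_t keep = len - partial_start;   /* bytes of the partial match */
    size_t add = strlen(next);
    if (keep + add + 1 > cap)
        return -1;

    memmove(buf, buf + partial_start, keep);   /* discard earlier data */
    memcpy(buf + keep, next, add + 1);         /* append segment and NUL */
    return 0;
}
```

Starting with the buffer "The date is 23ja", a partial match offset of 12 and a next segment of "n05", the buffer becomes "23jan05", which can then be passed to the matching function as the new subject.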
5641 |
|
|
5642 |
|
|
5643 |
|
ISSUES WITH MULTI-SEGMENT MATCHING |
5644 |
|
|
5645 |
|
Certain types of pattern may give problems with multi-segment matching, |
5646 |
|
whichever matching function is used. |
5647 |
|
|
5648 |
|
1. If the pattern contains tests for the beginning or end of a line, |
5649 |
|
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
5650 |
|
ate, when the subject string for any call does not contain the begin- |
5651 |
ning or end of a line. |
ning or end of a line. |
5652 |
|
|
5653 |
2. If the pattern contains backward assertions (including \b or \B), |
2. If the pattern contains backward assertions (including \b or \B), |
5654 |
you need to arrange for some overlap in the subject strings to allow |
you need to arrange for some overlap in the subject strings to allow |
5655 |
for this. For example, you could pass the subject in chunks that are |
for them to be correctly tested at the start of each substring. For |
5656 |
500 bytes long, but in a buffer of 700 bytes, with the starting offset |
example, using pcre_dfa_exec(), you could pass the subject in chunks |
5657 |
set to 200 and the previous 200 bytes at the start of the buffer. |
that are 500 bytes long, but in a buffer of 700 bytes, with the start- |
5658 |
|
ing offset set to 200 and the previous 200 bytes at the start of the |
5659 |
3. Matching a subject string that is split into multiple segments does |
buffer. |
5660 |
not always produce exactly the same result as matching over one single |
|
5661 |
long string. The difference arises when there are multiple matching |
3. Matching a subject string that is split into multiple segments may |
5662 |
possibilities, because a partial match result is given only when there |
not always produce exactly the same result as matching over one single |
5663 |
are no completed matches in a call to pcre_dfa_exec(). This means that |
long string, especially when PCRE_PARTIAL_SOFT is used. The section |
5664 |
as soon as the shortest match has been found, continuation to a new |
"Partial Matching and Word Boundaries" above describes an issue that |
5665 |
subject segment is no longer possible. Consider this pcretest example: |
arises if the pattern ends with \b or \B. Another kind of difference |
5666 |
|
may occur when there are multiple matching possibilities, because a |
5667 |
|
partial match result is given only when there are no completed matches. |
5668 |
|
This means that as soon as the shortest match has been found, continua- |
5669 |
|
tion to a new subject segment is no longer possible. Consider again |
5670 |
|
this pcretest example: |
5671 |
|
|
5672 |
re> /dog(sbody)?/ |
re> /dog(sbody)?/ |
5673 |
|
data> dogsb\P |
5674 |
|
0: dog |
5675 |
data> do\P\D |
data> do\P\D |
5676 |
Partial match: do |
Partial match: do |
5677 |
data> gsb\R\P\D |
data> gsb\R\P\D |
5680 |
0: dogsbody |
0: dogsbody |
5681 |
1: dog |
1: dog |
5682 |
|
|
5683 |
The pattern matches the words "dog" or "dogsbody". When the subject is |
The first data line passes the string "dogsb" to pcre_exec(), setting |
5684 |
presented in several parts ("do" and "gsb" being the first two) the |
the PCRE_PARTIAL_SOFT option. Although the string is a partial match |
5685 |
match stops when "dog" has been found, and it is not possible to con- |
for "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the |
5686 |
tinue. On the other hand, if "dogsbody" is presented as a single |
shorter string "dog" is a complete match. Similarly, when the subject |
5687 |
string, both matches are found. |
is presented to pcre_dfa_exec() in several parts ("do" and "gsb" being |
5688 |
|
the first two) the match stops when "dog" has been found, and it is not |
5689 |
|
possible to continue. On the other hand, if "dogsbody" is presented as |
5690 |
|
a single string, pcre_dfa_exec() finds both matches. |
5691 |
|
|
5692 |
|
Because of these problems, it is probably best to use PCRE_PARTIAL_HARD |
5693 |
|
when matching multi-segment data. The example above then behaves dif- |
5694 |
|
ferently: |
5695 |
|
|
5696 |
|
re> /dog(sbody)?/ |
5697 |
|
data> dogsb\P\P |
5698 |
|
Partial match: dogsb |
5699 |
|
data> do\P\D |
5700 |
|
Partial match: do |
5701 |
|
data> gsb\R\P\P\D |
5702 |
|
Partial match: gsb |
5703 |
|
|
|
Because of this phenomenon, it does not usually make sense to end a |
|
|
pattern that is going to be matched in this way with a variable repeat. |
|
5704 |
|
|
5705 |
4. Patterns that contain alternatives at the top level which do not all |
4. Patterns that contain alternatives at the top level which do not all |
5706 |
start with the same pattern item may not work as expected. For example, |
start with the same pattern item may not work as expected when |
5707 |
consider this pattern: |
pcre_dfa_exec() is used. For example, consider this pattern: |
5708 |
|
|
5709 |
1234|3789 |
1234|3789 |
5710 |
|
|
5712 |
first alternative is found at offset 3. There is no partial match for |
first alternative is found at offset 3. There is no partial match for |
5713 |
the second alternative, because such a match does not start at the same |
the second alternative, because such a match does not start at the same |
5714 |
point in the subject string. Attempting to continue with the string |
point in the subject string. Attempting to continue with the string |
5715 |
"789" does not yield a match because only those alternatives that match |
"7890" does not yield a match because only those alternatives that |
5716 |
at one point in the subject are remembered. The problem arises because |
match at one point in the subject are remembered. The problem arises |
5717 |
the start of the second alternative matches within the first alterna- |
because the start of the second alternative matches within the first |
5718 |
tive. There is no problem with anchored patterns or patterns such as: |
alternative. There is no problem with anchored patterns or patterns |
5719 |
|
such as: |
5720 |
|
|
5721 |
1234|ABCD |
1234|ABCD |
5722 |
|
|
5723 |
where no string can be a partial match for both alternatives. |
where no string can be a partial match for both alternatives. This is |
5724 |
|
not a problem if pcre_exec() is used, because the entire match has to |
5725 |
|
be rerun each time: |
5726 |
|
|
5727 |
|
re> /1234|3789/ |
5728 |
|
data> ABC123\P |
5729 |
|
Partial match: 123 |
5730 |
|
data> 1237890 |
5731 |
|
0: 3789 |
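The overlapping buffer arrangement suggested in item 2 above can be sketched in C as follows. This is one possible layout under stated assumptions, not code from PCRE: the window type, window_load and the WIN_ constants are hypothetical names, and the sizes are simply the 500, 200 and 700 bytes used in the example. The start field is the value an application would pass as the starting offset of the matching call, so that backward assertions at the start of the new chunk can see the preceding bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define WIN_CHUNK   500   /* bytes of new subject data per call */
#define WIN_OVERLAP 200   /* bytes of earlier data kept as context */

typedef struct {
    char buf[WIN_OVERLAP + WIN_CHUNK];  /* the 700-byte buffer */
    size_t len;                         /* valid bytes in buf */
    size_t start;                       /* where matching should start */
} window;

/* Load the next chunk (at most WIN_CHUNK bytes), keeping up to WIN_OVERLAP
   bytes from the end of the previous contents at the front of the buffer.
   The window must be zero-initialised before the first call.
   Returns 0 on success, or -1 if the chunk is too large. */
int window_load(window *w, const char *chunk, size_t n)
{
    assert(w != NULL && chunk != NULL);
    if (n > WIN_CHUNK)
        return -1;

    size_t keep = w->len < WIN_OVERLAP ? w->len : WIN_OVERLAP;
    memmove(w->buf, w->buf + (w->len - keep), keep);  /* retained context */
    memcpy(w->buf + keep, chunk, n);                  /* new data */
    w->len = keep + n;
    w->start = keep;
    return 0;
}
```

On the first call there is no earlier data, so start is 0. On each later call with full chunks, start is 200 and the buffer holds 700 bytes, matching the figures in item 2.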
5732 |
|
|
5733 |
|
|
5734 |
AUTHOR |
AUTHOR |
5740 |
|
|
5741 |
REVISION |
REVISION |
5742 |
|
|
5743 |
Last updated: 04 June 2007 |
Last updated: 31 August 2009 |
5744 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5745 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5746 |
|
|
5747 |
|
|
5748 |
PCREPRECOMPILE(3) PCREPRECOMPILE(3) |
PCREPRECOMPILE(3) PCREPRECOMPILE(3) |
5749 |
|
|
5750 |
|
|
5867 |
Last updated: 13 June 2007 |
Last updated: 13 June 2007 |
5868 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
5869 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5870 |
|
|
5871 |
|
|
5872 |
PCREPERFORM(3) PCREPERFORM(3) |
PCREPERFORM(3) PCREPERFORM(3) |
5873 |
|
|
5874 |
|
|
6017 |
Last updated: 06 March 2007 |
Last updated: 06 March 2007 |
6018 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
6019 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6020 |
|
|
6021 |
|
|
6022 |
PCREPOSIX(3) PCREPOSIX(3) |
PCREPOSIX(3) PCREPOSIX(3) |
6023 |
|
|
6024 |
|
|
6136 |
is public: re_nsub contains the number of capturing subpatterns in the |
is public: re_nsub contains the number of capturing subpatterns in the |
6137 |
regular expression. Various error codes are defined in the header file. |
regular expression. Various error codes are defined in the header file. |
6138 |
|
|
6139 |
|
NOTE: If the yield of regcomp() is non-zero, you must not attempt to |
6140 |
|
use the contents of the preg structure. If, for example, you pass it to |
6141 |
|
regexec(), the result is undefined and your program is likely to crash. |
6142 |
|
|
6143 |
|
|
6144 |
MATCHING NEWLINE CHARACTERS |
MATCHING NEWLINE CHARACTERS |
6145 |
|
|
6257 |
|
|
6258 |
REVISION |
REVISION |
6259 |
|
|
6260 |
Last updated: 11 March 2009 |
Last updated: 15 August 2009 |
6261 |
Copyright (c) 1997-2009 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
6262 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6263 |
|
|
6264 |
|
|
6265 |
PCRECPP(3) PCRECPP(3) |
PCRECPP(3) PCRECPP(3) |
6266 |
|
|
6267 |
|
|
6601 |
|
|
6602 |
Last updated: 17 March 2009 |
Last updated: 17 March 2009 |
6603 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6604 |
|
|
6605 |
|
|
6606 |
PCRESAMPLE(3) PCRESAMPLE(3) |
PCRESAMPLE(3) PCRESAMPLE(3) |
6607 |
|
|
6608 |
|
|
6613 |
PCRE SAMPLE PROGRAM |
PCRE SAMPLE PROGRAM |
6614 |
|
|
6615 |
A simple, complete demonstration program, to get you started with using |
A simple, complete demonstration program, to get you started with using |
6616 |
PCRE, is supplied in the file pcredemo.c in the PCRE distribution. |
PCRE, is supplied in the file pcredemo.c in the PCRE distribution. A |
6617 |
|
listing of this program is given in the pcredemo documentation. If you |
6618 |
|
do not have a copy of the PCRE distribution, you can save this listing |
6619 |
|
to re-create pcredemo.c. |
6620 |
|
|
6621 |
The program compiles the regular expression that is its first argument, |
The program compiles the regular expression that is its first argument, |
6622 |
and matches it against the subject string in its second argument. No |
and matches it against the subject string in its second argument. No |
6623 |
PCRE options are set, and default character tables are used. If match- |
PCRE options are set, and default character tables are used. If match- |
6624 |
ing succeeds, the program outputs the portion of the subject that |
ing succeeds, the program outputs the portion of the subject that |
6625 |
matched, together with the contents of any captured substrings. |
matched, together with the contents of any captured substrings. |
6626 |
|
|
6627 |
If the -g option is given on the command line, the program then goes on |
If the -g option is given on the command line, the program then goes on |
6628 |
to check for further matches of the same regular expression in the same |
to check for further matches of the same regular expression in the same |
6629 |
subject string. The logic is a little bit tricky because of the possi- |
subject string. The logic is a little bit tricky because of the possi- |
6630 |
bility of matching an empty string. Comments in the code explain what |
bility of matching an empty string. Comments in the code explain what |
6631 |
is going on. |
is going on. |
6632 |
|
|
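The difficulty with empty matches is that a loop which restarts the search where the previous match ended never advances if that match was empty. A minimal sketch of one way to handle this is shown below, again using POSIX <regex.h> and not the PCRE API. It is a simplified stand-in and not the logic in pcredemo.c, whose comments describe the approach actually taken there. The function name find_all_matches is hypothetical.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Print every successive match of `pattern` (POSIX extended syntax) in
   `subject`. Returns the number of matches found, or -1 if the pattern
   fails to compile. */
int find_all_matches(const char *pattern, const char *subject)
{
    regex_t re;
    regmatch_t m[1];
    size_t offset = 0;
    size_t len;
    int count = 0;

    assert(pattern != NULL && subject != NULL);
    len = strlen(subject);

    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    while (offset <= len) {
        /* Past the first search, the start of the remaining text is not
           the start of the line, so '^' must not match there. */
        int eflags = (offset > 0) ? REG_NOTBOL : 0;

        if (regexec(&re, subject + offset, 1, m, eflags) != 0)
            break;                          /* no further matches */

        size_t start = offset + (size_t)m[0].rm_so;
        size_t end   = offset + (size_t)m[0].rm_eo;

        printf("match at %zu: \"%.*s\"\n",
               start, (int)(end - start), subject + start);
        count++;

        /* A non-empty match resumes where it ended. An empty match would
           be found again at the same place, so step one character on. */
        offset = (end > start) ? end : end + 1;
    }

    regfree(&re);
    return count;
}
```

With the pattern 'cat|dog' and the subject 'the dog sat on the cat' this reports two matches. With a pattern that can match nothing, such as 'a*', the one-character step is what lets the loop terminate.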
6633 |
If PCRE is installed in the standard include and library directories |
If PCRE is installed in the standard include and library directories |
6634 |
for your system, you should be able to compile the demonstration pro- |
for your system, you should be able to compile the demonstration pro- |
6635 |
gram using this command: |
gram using this command: |
6636 |
|
|
6637 |
gcc -o pcredemo pcredemo.c -lpcre |
gcc -o pcredemo pcredemo.c -lpcre |
6638 |
|
|
6639 |
If PCRE is installed elsewhere, you may need to add additional options |
If PCRE is installed elsewhere, you may need to add additional options |
6640 |
to the command line. For example, on a Unix-like system that has PCRE |
to the command line. For example, on a Unix-like system that has PCRE |
6641 |
installed in /usr/local, you can compile the demonstration program |
installed in /usr/local, you can compile the demonstration program |
6642 |
using a command like this: |
using a command like this: |
6643 |
|
|
6644 |
gcc -o pcredemo -I/usr/local/include pcredemo.c \ |
gcc -o pcredemo -I/usr/local/include pcredemo.c \ |
6645 |
-L/usr/local/lib -lpcre |
-L/usr/local/lib -lpcre |
6646 |
|
|
6647 |
Once you have compiled the demonstration program, you can run simple |
Once you have compiled the demonstration program, you can run simple |
6648 |
tests like this: |
tests like this: |
6649 |
|
|
6650 |
./pcredemo 'cat|dog' 'the cat sat on the mat' |
./pcredemo 'cat|dog' 'the cat sat on the mat' |
6651 |
./pcredemo -g 'cat|dog' 'the dog sat on the cat' |
./pcredemo -g 'cat|dog' 'the dog sat on the cat' |
6652 |
|
|
6653 |
Note that there is a much more comprehensive test program, called |
Note that there is a much more comprehensive test program, called |
6654 |
pcretest, which supports many more facilities for testing regular |
pcretest, which supports many more facilities for testing regular |
6655 |
expressions and the PCRE library. The pcredemo program is provided as a |
expressions and the PCRE library. The pcredemo program is provided as a |
6656 |
simple coding example. |
simple coding example. |
6657 |
|
|
6658 |
On some operating systems (e.g. Solaris), when PCRE is not installed in |
When you try to run pcredemo when PCRE is not installed in the standard |
6659 |
the standard library directory, you may get an error like this when you |
library directory, you may get an error like this on some operating |
6660 |
try to run pcredemo: |
systems (e.g. Solaris): |
6661 |
|
|
6662 |
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or |
ld.so.1: a.out: fatal: libpcre.so.0: open failed: No such file or |
6663 |
directory |
directory |
6664 |
|
|
6665 |
This is caused by the way shared library support works on those sys- |
This is caused by the way shared library support works on those sys- |
6666 |
tems. You need to add |
tems. You need to add |
6667 |
|
|
6668 |
-R/usr/local/lib |
-R/usr/local/lib |
6679 |
|
|
6680 |
REVISION |
REVISION |
6681 |
|
|
6682 |
Last updated: 23 January 2008 |
Last updated: 01 September 2009 |
6683 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
6684 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6685 |
PCRESTACK(3) PCRESTACK(3) |
PCRESTACK(3) PCRESTACK(3) |
6686 |
|
|
6818 |
Last updated: 09 July 2008 |
Last updated: 09 July 2008 |
6819 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2008 University of Cambridge. |
6820 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6821 |
|
|
6822 |
|
|