94 |
pcrestack discussion of stack usage |
pcrestack discussion of stack usage |
95 |
pcretest description of the pcretest testing command |
pcretest description of the pcretest testing command |
96 |
|
|
97 |
In addition, in the "man" and HTML formats, there is a short page for |
In addition, in the "man" and HTML formats, there is a short page for |
98 |
each C library function, listing its arguments and results. |
each C library function, listing its arguments and results. |
99 |
|
|
100 |
|
|
101 |
LIMITATIONS |
LIMITATIONS |
102 |
|
|
103 |
There are some size limitations in PCRE but it is hoped that they will |
There are some size limitations in PCRE but it is hoped that they will |
104 |
never in practice be relevant. |
never in practice be relevant. |
105 |
|
|
106 |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
107 |
is compiled with the default internal linkage size of 2. If you want to |
is compiled with the default internal linkage size of 2. If you want to |
108 |
process regular expressions that are truly enormous, you can compile |
process regular expressions that are truly enormous, you can compile |
109 |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
110 |
the source distribution and the pcrebuild documentation for details). |
the source distribution and the pcrebuild documentation for details). |
111 |
In these cases the limit is substantially larger. However, the speed |
In these cases the limit is substantially larger. However, the speed |
112 |
of execution is slower. |
of execution is slower. |
113 |
|
|
114 |
All values in repeating quantifiers must be less than 65536. |
All values in repeating quantifiers must be less than 65536. |
119 |
The maximum length of name for a named subpattern is 32 characters, and |
The maximum length of name for a named subpattern is 32 characters, and |
120 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
121 |
|
|
122 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
123 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
124 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
125 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
126 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
127 |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
128 |
|
|
129 |
|
|
130 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
131 |
|
|
132 |
From release 3.3, PCRE has had some support for character strings |
From release 3.3, PCRE has had some support for character strings |
133 |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
134 |
to cover most common requirements, and in release 5.0 additional sup- |
to cover most common requirements, and in release 5.0 additional sup- |
135 |
port for Unicode general category properties was added. |
port for Unicode general category properties was added. |
136 |
|
|
137 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
138 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
139 |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
140 |
any subject strings that are matched against it are treated as UTF-8 |
any subject strings that are matched against it are treated as UTF-8 |
141 |
strings instead of just strings of bytes. |
strings instead of just strings of bytes. |
142 |
|
|
143 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
144 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
145 |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
146 |
very big. |
very big. |
147 |
|
|
148 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
149 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
150 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
151 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
152 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
153 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
154 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
155 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
156 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
157 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
158 |
does not support this. |
does not support this. |
159 |
|
|
160 |
Validity of UTF-8 strings |
Validity of UTF-8 strings |
161 |
|
|
162 |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
When you set the PCRE_UTF8 flag, the strings passed as patterns and |
163 |
subjects are (by default) checked for validity on entry to the relevant |
subjects are (by default) checked for validity on entry to the relevant |
164 |
functions. From release 7.3 of PCRE, the check is according to the rules |
functions. From release 7.3 of PCRE, the check is according to the rules |
165 |
of RFC 3629, which are themselves derived from the Unicode specifica- |
of RFC 3629, which are themselves derived from the Unicode specifica- |
166 |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
tion. Earlier releases of PCRE followed the rules of RFC 2279, which |
167 |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
allows the full range of 31-bit values (0 to 0x7FFFFFFF). The current |
168 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
check allows only values in the range U+0 to U+10FFFF, excluding U+D800 |
169 |
to U+DFFF. |
to U+DFFF. |
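
The range rule above can be sketched as a standalone validity check. This is
only an illustration of the RFC 3629 constraints as described here, not PCRE's
own checking code; the function name utf8_is_valid is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if s[0..len-1] is valid
   UTF-8 under the RFC 3629 rules (U+0 to U+10FFFF, no surrogates, no
   overlong forms), otherwise 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const unsigned long min_cp[4] = { 0x0, 0x80, 0x800, 0x10000 };
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t extra;   /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80)                { cp = c;        extra = 0; }
        else if ((c & 0xE0) == 0xC0) { cp = c & 0x1F; extra = 1; }
        else if ((c & 0xF0) == 0xE0) { cp = c & 0x0F; extra = 2; }
        else if ((c & 0xF8) == 0xF0) { cp = c & 0x07; extra = 3; }
        else return 0;  /* stray continuation byte, or a 5/6-byte lead
                           byte that only RFC 2279 allowed */

        if (extra > len - i - 1) return 0;          /* truncated sequence */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = s[i + k];
            if ((cc & 0xC0) != 0x80) return 0;      /* not a continuation */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min_cp[extra]) return 0;           /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                /* beyond U+10FFFF */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0; /* surrogate range */

        i += extra + 1;
    }
    return 1;
}
```

A caller that has already run a check of this kind on its input is in the
situation where skipping PCRE's own check may be reasonable.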
170 |
|
|
171 |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
The excluded code points are the "Low Surrogate Area" of Unicode, of |
172 |
which the Unicode Standard says this: "The Low Surrogate Area does not |
which the Unicode Standard says this: "The Low Surrogate Area does not |
173 |
contain any character assignments, consequently no character code |
contain any character assignments, consequently no character code |
174 |
charts or namelists are provided for this area. Surrogates are reserved |
charts or namelists are provided for this area. Surrogates are reserved |
175 |
for use with UTF-16 and then must be used in pairs." The code points |
for use with UTF-16 and then must be used in pairs." The code points |
176 |
that are encoded by UTF-16 pairs are available as independent code |
that are encoded by UTF-16 pairs are available as independent code |
177 |
points in the UTF-8 encoding. (In other words, the whole surrogate |
points in the UTF-8 encoding. (In other words, the whole surrogate |
178 |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
179 |
|
|
180 |
If an invalid UTF-8 string is passed to PCRE, an error return |
If an invalid UTF-8 string is passed to PCRE, an error return |
181 |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
(PCRE_ERROR_BADUTF8) is given. In some situations, you may already know |
182 |
that your strings are valid, and therefore want to skip these checks in |
that your strings are valid, and therefore want to skip these checks in |
183 |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at |
184 |
compile time or at run time, PCRE assumes that the pattern or subject |
compile time or at run time, PCRE assumes that the pattern or subject |
185 |
it is given (respectively) contains only valid UTF-8 codes. In this |
it is given (respectively) contains only valid UTF-8 codes. In this |
186 |
case, it does not diagnose an invalid UTF-8 string. |
case, it does not diagnose an invalid UTF-8 string. |
187 |
|
|
188 |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, |
189 |
what happens depends on why the string is invalid. If the string con- |
what happens depends on why the string is invalid. If the string con- |
190 |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a |
191 |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
string of characters in the range 0 to 0x7FFFFFFF. In other words, |
192 |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
apart from the initial validity test, PCRE (when in UTF-8 mode) handles |
193 |
strings according to the more liberal rules of RFC 2279. However, if |
strings according to the more liberal rules of RFC 2279. However, if |
194 |
the string does not even conform to RFC 2279, the result is undefined. |
the string does not even conform to RFC 2279, the result is undefined. |
195 |
Your program may crash. |
Your program may crash. |
196 |
|
|
197 |
If you want to process strings of values in the full range 0 to |
If you want to process strings of values in the full range 0 to |
198 |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you can |
199 |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in |
200 |
this situation, you will have to apply your own validity check. |
this situation, you will have to apply your own validity check. |
201 |
|
|
202 |
General comments about UTF-8 mode |
General comments about UTF-8 mode |
203 |
|
|
204 |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
1. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
205 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
206 |
|
|
207 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
2. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
208 |
characters for values greater than \177. |
characters for values greater than \177. |
209 |
|
|
210 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
3. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
211 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
212 |
|
|
213 |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
4. The dot metacharacter matches one UTF-8 character instead of a sin- |
214 |
gle byte. |
gle byte. |
215 |
|
|
216 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
5. The escape sequence \C can be used to match a single byte in UTF-8 |
217 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
218 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
219 |
|
|
220 |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
221 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
222 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
223 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
224 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
225 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
226 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
227 |
\p{Nd}. |
\p{Nd}. |
228 |
|
|
229 |
7. Similarly, characters that match the POSIX named character classes |
7. Similarly, characters that match the POSIX named character classes |
230 |
are all low-valued characters. |
are all low-valued characters. |
231 |
|
|
232 |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
8. However, the Perl 5.10 horizontal and vertical whitespace matching |
233 |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
234 |
acters. |
acters. |
235 |
|
|
236 |
9. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
237 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
238 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
239 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
240 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
241 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
242 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
243 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
244 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
245 |
ported by PCRE. |
ported by PCRE. |
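
Points 1 and 2 in the list above follow from how UTF-8 encodes values: anything
from 128 to 0x7FF needs two bytes. The sketch below shows the standard encoding
arithmetic; utf8_encode is a hypothetical helper written for this illustration
and is not a PCRE function.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: encodes the code point cp as UTF-8
   into out (which must have room for 4 bytes). Returns the number of bytes
   written, or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    if (cp < 0x80) {                        /* one byte: 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* two bytes: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

With this encoding, the value 0xb3 becomes the two bytes 0xC2 0xB3, and the
largest octal escape, \777 (decimal 511), becomes 0xC7 0xBF, so both fall in
the two-byte range that the list describes.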
246 |
|
|
247 |
|
|
251 |
University Computing Service |
University Computing Service |
252 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
253 |
|
|
254 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
255 |
so I've taken it away. If you want to email me, use my two initials, |
so I've taken it away. If you want to email me, use my two initials, |
256 |
followed by the two digits 10, at the domain cam.ac.uk. |
followed by the two digits 10, at the domain cam.ac.uk. |
257 |
|
|
258 |
|
|
307 |
|
|
308 |
UTF-8 SUPPORT |
UTF-8 SUPPORT |
309 |
|
|
310 |
To build PCRE with support for UTF-8 character strings, add |
To build PCRE with support for UTF-8 Unicode character strings, add |
311 |
|
|
312 |
--enable-utf8 |
--enable-utf8 |
313 |
|
|
316 |
have to set the PCRE_UTF8 option when you call the pcre_compile() |
have to set the PCRE_UTF8 option when you call the pcre_compile() |
317 |
function. |
function. |
318 |
|
|
319 |
|
If you set --enable-utf8 when compiling in an EBCDIC environment, PCRE |
320 |
|
expects its input to be either ASCII or UTF-8 (depending on the runtime |
321 |
|
option). It is not possible to support both EBCDIC and UTF-8 codes in |
322 |
|
the same version of the library. Consequently, --enable-utf8 and |
323 |
|
--enable-ebcdic are mutually exclusive. |
324 |
|
|
325 |
|
|
326 |
UNICODE CHARACTER PROPERTY SUPPORT |
UNICODE CHARACTER PROPERTY SUPPORT |
327 |
|
|
343 |
|
|
344 |
CODE VALUE OF NEWLINE |
CODE VALUE OF NEWLINE |
345 |
|
|
346 |
By default, PCRE interprets character 10 (linefeed, LF) as indicating |
By default, PCRE interprets the linefeed (LF) character as indicating |
347 |
the end of a line. This is the normal newline character on Unix-like |
the end of a line. This is the normal newline character on Unix-like |
348 |
systems. You can compile PCRE to use character 13 (carriage return, CR) |
systems. You can compile PCRE to use carriage return (CR) instead, by |
349 |
instead, by adding |
adding |
350 |
|
|
351 |
--enable-newline-is-cr |
--enable-newline-is-cr |
352 |
|
|
369 |
|
|
370 |
causes PCRE to recognize any Unicode newline sequence. |
causes PCRE to recognize any Unicode newline sequence. |
371 |
|
|
372 |
Whatever line ending convention is selected when PCRE is built can be |
Whatever line ending convention is selected when PCRE is built can be |
373 |
overridden when the library functions are called. At build time it is |
overridden when the library functions are called. At build time it is |
374 |
conventional to use the standard for your operating system. |
conventional to use the standard for your operating system. |
375 |
|
|
376 |
|
|
377 |
WHAT \R MATCHES |
WHAT \R MATCHES |
378 |
|
|
379 |
By default, the sequence \R in a pattern matches any Unicode newline |
By default, the sequence \R in a pattern matches any Unicode newline |
380 |
sequence, whatever has been selected as the line ending sequence. If |
sequence, whatever has been selected as the line ending sequence. If |
381 |
you specify |
you specify |
382 |
|
|
383 |
--enable-bsr-anycrlf |
--enable-bsr-anycrlf |
384 |
|
|
385 |
the default is changed so that \R matches only CR, LF, or CRLF. What- |
the default is changed so that \R matches only CR, LF, or CRLF. What- |
386 |
ever is selected when PCRE is built can be overridden when the library |
ever is selected when PCRE is built can be overridden when the library |
387 |
functions are called. |
functions are called. |
388 |
|
|
389 |
|
|
390 |
BUILDING SHARED AND STATIC LIBRARIES |
BUILDING SHARED AND STATIC LIBRARIES |
391 |
|
|
392 |
The PCRE building process uses libtool to build both shared and static |
The PCRE building process uses libtool to build both shared and static |
393 |
Unix libraries by default. You can suppress one of these by adding one |
Unix libraries by default. You can suppress one of these by adding one |
394 |
of |
of |
395 |
|
|
396 |
--disable-shared |
--disable-shared |
402 |
POSIX MALLOC USAGE |
POSIX MALLOC USAGE |
403 |
|
|
404 |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
405 |
umentation), additional working storage is required for holding the |
umentation), additional working storage is required for holding the |
406 |
pointers to capturing substrings, because PCRE requires three integers |
pointers to capturing substrings, because PCRE requires three integers |
407 |
per substring, whereas the POSIX interface provides only two. If the |
per substring, whereas the POSIX interface provides only two. If the |
408 |
number of expected substrings is small, the wrapper function uses space |
number of expected substrings is small, the wrapper function uses space |
409 |
on the stack, because this is faster than using malloc() for each call. |
on the stack, because this is faster than using malloc() for each call. |
410 |
The default threshold above which the stack is no longer used is 10; it |
The default threshold above which the stack is no longer used is 10; it |
417 |
|
|
418 |
HANDLING VERY LARGE PATTERNS |
HANDLING VERY LARGE PATTERNS |
419 |
|
|
420 |
Within a compiled pattern, offset values are used to point from one |
Within a compiled pattern, offset values are used to point from one |
421 |
part to another (for example, from an opening parenthesis to an alter- |
part to another (for example, from an opening parenthesis to an alter- |
422 |
nation metacharacter). By default, two-byte values are used for these |
nation metacharacter). By default, two-byte values are used for these |
423 |
offsets, leading to a maximum size for a compiled pattern of around |
offsets, leading to a maximum size for a compiled pattern of around |
424 |
64K. This is sufficient to handle all but the most gigantic patterns. |
64K. This is sufficient to handle all but the most gigantic patterns. |
425 |
Nevertheless, some people do want to process enormous patterns, so it |
Nevertheless, some people do want to process enormous patterns, so it |
426 |
is possible to compile PCRE to use three-byte or four-byte offsets by |
is possible to compile PCRE to use three-byte or four-byte offsets by |
427 |
adding a setting such as |
adding a setting such as |
428 |
|
|
429 |
--with-link-size=3 |
--with-link-size=3 |
430 |
|
|
431 |
to the configure command. The value given must be 2, 3, or 4. Using |
to the configure command. The value given must be 2, 3, or 4. Using |
432 |
longer offsets slows down the operation of PCRE because it has to load |
longer offsets slows down the operation of PCRE because it has to load |
433 |
additional bytes when handling them. |
additional bytes when handling them. |
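
To make the trade-off concrete, here is a minimal sketch of storing and loading
an offset in a configurable number of bytes. It assumes a big-endian layout and
uses hypothetical function names; it is not taken from PCRE's internal macros,
and the limits PCRE actually enforces may differ slightly from the raw
representable maximum computed here.

```c
/* Largest value representable in link_size bytes (2, 3 or 4).
   The 4-byte case is handled separately to avoid shifting by the full
   width of a 32-bit unsigned long. */
unsigned long max_offset(int link_size)
{
    return link_size >= 4 ? 0xFFFFFFFFUL
                          : (1UL << (8 * link_size)) - 1;
}

/* Store value in link_size bytes at p, most significant byte first.
   Bits that do not fit are silently discarded. */
void put_offset(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xFF);
        value >>= 8;
    }
}

/* Load a link_size-byte offset from p. One extra byte is read for each
   step up in link size, which is where the additional cost comes from. */
unsigned long get_offset(const unsigned char *p, int link_size)
{
    unsigned long v = 0;
    int i;
    for (i = 0; i < link_size; i++)
        v = (v << 8) | p[i];
    return v;
}
```

An offset of 70000 does not survive a round trip through two bytes but does
through three, which is the situation that calls for a larger link size.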
434 |
|
|
435 |
|
|
436 |
AVOIDING EXCESSIVE STACK USAGE |
AVOIDING EXCESSIVE STACK USAGE |
437 |
|
|
438 |
When matching with the pcre_exec() function, PCRE implements backtrack- |
When matching with the pcre_exec() function, PCRE implements backtrack- |
439 |
ing by making recursive calls to an internal function called match(). |
ing by making recursive calls to an internal function called match(). |
440 |
In environments where the size of the stack is limited, this can se- |
In environments where the size of the stack is limited, this can se- |
441 |
verely limit PCRE's operation. (The Unix environment does not usually |
verely limit PCRE's operation. (The Unix environment does not usually |
442 |
suffer from this problem, but it may sometimes be necessary to increase |
suffer from this problem, but it may sometimes be necessary to increase |
443 |
the maximum stack size. There is a discussion in the pcrestack docu- |
the maximum stack size. There is a discussion in the pcrestack docu- |
444 |
mentation.) An alternative approach to recursion that uses memory from |
mentation.) An alternative approach to recursion that uses memory from |
445 |
the heap to remember data, instead of using recursive function calls, |
the heap to remember data, instead of using recursive function calls, |
446 |
has been implemented to work round the problem of limited stack size. |
has been implemented to work round the problem of limited stack size. |
447 |
If you want to build a version of PCRE that works this way, add |
If you want to build a version of PCRE that works this way, add |
448 |
|
|
449 |
--disable-stack-for-recursion |
--disable-stack-for-recursion |
450 |
|
|
451 |
to the configure command. With this configuration, PCRE will use the |
to the configure command. With this configuration, PCRE will use the |
452 |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
453 |
ment functions. By default these point to malloc() and free(), but you |
ment functions. By default these point to malloc() and free(), but you |
454 |
can replace the pointers so that your own functions are used. |
can replace the pointers so that your own functions are used. |
455 |
|
|
456 |
Separate functions are provided rather than using pcre_malloc and |
Separate functions are provided rather than using pcre_malloc and |
457 |
pcre_free because the usage is very predictable: the block sizes |
pcre_free because the usage is very predictable: the block sizes |
458 |
requested are always the same, and the blocks are always freed in |
requested are always the same, and the blocks are always freed in |
459 |
reverse order. A calling program might be able to implement optimized |
reverse order. A calling program might be able to implement optimized |
460 |
functions that perform better than malloc() and free(). PCRE runs |
functions that perform better than malloc() and free(). PCRE runs |
461 |
noticeably more slowly when built in this way. This option affects only |
noticeably more slowly when built in this way. This option affects only |
462 |
the pcre_exec() function; it is not relevant for the |
the pcre_exec() function; it is not relevant for the |
463 |
pcre_dfa_exec() function. |
pcre_dfa_exec() function. |
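
Because the blocks are all the same size and are freed in reverse order, a
replacement allocator can simply keep freed blocks on a small stack and hand
them back out. The following is a sketch under those stated assumptions only:
the names my_stack_malloc, my_stack_free, and POOL_BLOCKS are hypothetical, the
code is not thread-safe, and it relies on every request being the same size.

```c
#include <stddef.h>
#include <stdlib.h>

#define POOL_BLOCKS 64               /* hypothetical cache capacity */

static void  *pool_cache[POOL_BLOCKS];   /* freed blocks kept for reuse */
static int    pool_count = 0;            /* number of cached blocks */
static size_t pool_block_size = 0;       /* size seen on the first request */

/* Returns a cached block when one is available, otherwise calls malloc().
   ASSUMPTION: every request has the same size, as the text above says. */
void *my_stack_malloc(size_t size)
{
    if (pool_block_size == 0)
        pool_block_size = size;
    if (pool_count > 0 && size == pool_block_size)
        return pool_cache[--pool_count];
    return malloc(size);
}

/* Keeps the block for reuse if there is room, otherwise releases it. */
void my_stack_free(void *block)
{
    if (block == NULL)
        return;
    if (pool_count < POOL_BLOCKS)
        pool_cache[pool_count++] = block;
    else
        free(block);
}

/* In a program linked with PCRE, the pointers named in the text would be
   redirected before matching, presumably along these lines:
       pcre_stack_malloc = my_stack_malloc;
       pcre_stack_free   = my_stack_free;
*/
```

The cache trades a bounded amount of retained memory for fewer calls into the
general-purpose allocator during deep backtracking.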
464 |
|
|
465 |
|
|
466 |
LIMITING PCRE RESOURCE USAGE |
LIMITING PCRE RESOURCE USAGE |
467 |
|
|
468 |
Internally, PCRE has a function called match(), which it calls repeat- |
Internally, PCRE has a function called match(), which it calls repeat- |
469 |
edly (sometimes recursively) when matching a pattern with the |
edly (sometimes recursively) when matching a pattern with the |
470 |
pcre_exec() function. By controlling the maximum number of times this |
pcre_exec() function. By controlling the maximum number of times this |
471 |
function may be called during a single matching operation, a limit can |
function may be called during a single matching operation, a limit can |
472 |
be placed on the resources used by a single call to pcre_exec(). The |
be placed on the resources used by a single call to pcre_exec(). The |
473 |
limit can be changed at run time, as described in the pcreapi documen- |
limit can be changed at run time, as described in the pcreapi documen- |
474 |
tation. The default is 10 million, but this can be changed by adding a |
tation. The default is 10 million, but this can be changed by adding a |
475 |
setting such as |
setting such as |
476 |
|
|
477 |
--with-match-limit=500000 |
--with-match-limit=500000 |
478 |
|
|
479 |
to the configure command. This setting has no effect on the |
to the configure command. This setting has no effect on the |
480 |
pcre_dfa_exec() matching function. |
pcre_dfa_exec() matching function. |
481 |
|
|
482 |
In some environments it is desirable to limit the depth of recursive |
In some environments it is desirable to limit the depth of recursive |
483 |
calls of match() more strictly than the total number of calls, in order |
calls of match() more strictly than the total number of calls, in order |
484 |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
485 |
for-recursion is specified) that is used. A second limit controls this; |
for-recursion is specified) that is used. A second limit controls this; |
486 |
it defaults to the value that is set for --with-match-limit, which |
it defaults to the value that is set for --with-match-limit, which |
487 |
imposes no additional constraints. However, you can set a lower limit |
imposes no additional constraints. However, you can set a lower limit |
488 |
by adding, for example, |
by adding, for example, |
489 |
|
|
490 |
--with-match-limit-recursion=10000 |
--with-match-limit-recursion=10000 |
491 |
|
|
492 |
to the configure command. This value can also be overridden at run |
to the configure command. This value can also be overridden at run |
493 |
time. |
time. |
494 |
|
|
495 |
|
|
496 |
CREATING CHARACTER TABLES AT BUILD TIME |
CREATING CHARACTER TABLES AT BUILD TIME |
497 |
|
|
498 |
PCRE uses fixed tables for processing characters whose code values are |
PCRE uses fixed tables for processing characters whose code values are |
499 |
less than 256. By default, PCRE is built with a set of tables that are |
less than 256. By default, PCRE is built with a set of tables that are |
500 |
distributed in the file pcre_chartables.c.dist. These tables are for |
distributed in the file pcre_chartables.c.dist. These tables are for |
501 |
ASCII codes only. If you add |
ASCII codes only. If you add |
502 |
|
|
503 |
--enable-rebuild-chartables |
--enable-rebuild-chartables |
504 |
|
|
505 |
to the configure command, the distributed tables are no longer used. |
to the configure command, the distributed tables are no longer used. |
506 |
Instead, a program called dftables is compiled and run. This outputs |
Instead, a program called dftables is compiled and run. This outputs |
507 |
the source for a new set of tables, created in the default locale of your |
the source for a new set of tables, created in the default locale of your |
508 |
C runtime system. (This method of replacing the tables does not work if |
C runtime system. (This method of replacing the tables does not work if |
509 |
you are cross compiling, because dftables is run on the local host. If |
you are cross compiling, because dftables is run on the local host. If |
510 |
you need to create alternative tables when cross compiling, you will |
you need to create alternative tables when cross compiling, you will |
511 |
have to do so "by hand".) |
have to do so "by hand".) |
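
The idea of deriving tables from the C runtime's locale can be sketched as
follows. This is an illustration only and is not the dftables source: the
function name and the CT_ bit names are hypothetical, and the real tables hold
more information than these three flags.

```c
#include <ctype.h>

/* Hypothetical flag bits for this sketch. */
#define CT_SPACE 0x01
#define CT_DIGIT 0x02
#define CT_WORD  0x04

/* Fills table[] with one flag byte per character code below 256, using
   whatever locale is current in the C runtime when it is called. */
void build_ctype_table(unsigned char table[256])
{
    int c;
    for (c = 0; c < 256; c++) {
        unsigned char bits = 0;
        if (isspace(c)) bits |= CT_SPACE;
        if (isdigit(c)) bits |= CT_DIGIT;
        if (isalnum(c) || c == '_') bits |= CT_WORD;
        table[c] = bits;
    }
}
```

Since the classification functions are evaluated on the machine that runs the
generator, the result reflects that machine's locale, which is why the approach
does not carry over to cross compilation.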
512 |
|
|
513 |
|
|
514 |
USING EBCDIC CODE |
USING EBCDIC CODE |
515 |
|
|
516 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
517 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
518 |
This is the case for most computer operating systems. PCRE can, how- |
This is the case for most computer operating systems. PCRE can, how- |
519 |
ever, be compiled to run in an EBCDIC environment by adding |
ever, be compiled to run in an EBCDIC environment by adding |
520 |
|
|
521 |
--enable-ebcdic |
--enable-ebcdic |
522 |
|
|
523 |
to the configure command. This setting implies --enable-rebuild-charta- |
to the configure command. This setting implies --enable-rebuild-charta- |
524 |
bles. You should only use it if you know that you are in an EBCDIC |
bles. You should only use it if you know that you are in an EBCDIC |
525 |
environment (for example, an IBM mainframe operating system). |
environment (for example, an IBM mainframe operating system). The |
526 |
|
--enable-ebcdic option is incompatible with --enable-utf8. |
527 |
|
|
528 |
|
|
529 |
PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT |
PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT |
585 |
|
|
586 |
REVISION |
REVISION |
587 |
|
|
588 |
Last updated: 13 April 2008 |
Last updated: 17 March 2009 |
589 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
590 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
591 |
|
|
592 |
|
|
1006 |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
1007 |
callout function pointed to by pcre_callout, are shared by all threads. |
callout function pointed to by pcre_callout, are shared by all threads. |
1008 |
|
|
1009 |
The compiled form of a regular expression is not altered during match- |
The compiled form of a regular expression is not altered during match- |
1010 |
ing, so the same compiled pattern can safely be used by several threads |
ing, so the same compiled pattern can safely be used by several threads |
1011 |
at once. |
at once. |
1012 |
|
|
1014 |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
1015 |
|
|
1016 |
The compiled form of a regular expression can be saved and re-used at a |
The compiled form of a regular expression can be saved and re-used at a |
1017 |
later time, possibly by a different program, and even on a host other |
later time, possibly by a different program, and even on a host other |
1018 |
than the one on which it was compiled. Details are given in the |
than the one on which it was compiled. Details are given in the |
1019 |
pcreprecompile documentation. However, compiling a regular expression |
pcreprecompile documentation. However, compiling a regular expression |
1020 |
with one version of PCRE for use with a different version is not guar- |
with one version of PCRE for use with a different version is not guar- |
1021 |
anteed to work and may cause crashes. |
anteed to work and may cause crashes. |
1022 |
|
|
1023 |
|
|
1025 |
|
|
1026 |
int pcre_config(int what, void *where); |
int pcre_config(int what, void *where); |
1027 |
|
|
1028 |
The function pcre_config() makes it possible for a PCRE client to dis- |
The function pcre_config() makes it possible for a PCRE client to dis- |
1029 |
cover which optional features have been compiled into the PCRE library. |
cover which optional features have been compiled into the PCRE library. |
1030 |
The pcrebuild documentation has more details about these optional fea- |
The pcrebuild documentation has more details about these optional fea- |
1031 |
tures. |
tures. |
1032 |
|
|
1033 |
The first argument for pcre_config() is an integer, specifying which |
The first argument for pcre_config() is an integer, specifying which |
1034 |
information is required; the second argument is a pointer to a variable |
information is required; the second argument is a pointer to a variable |
1035 |
into which the information is placed. The following information is |
into which the information is placed. The following information is |
1036 |
available: |
available: |
1037 |
|
|
1038 |
PCRE_CONFIG_UTF8 |
PCRE_CONFIG_UTF8 |
1039 |
|
|
1040 |
The output is an integer that is set to one if UTF-8 support is avail- |
The output is an integer that is set to one if UTF-8 support is avail- |
1041 |
able; otherwise it is set to zero. |
able; otherwise it is set to zero. |
1042 |
|
|
1043 |
PCRE_CONFIG_UNICODE_PROPERTIES |
PCRE_CONFIG_UNICODE_PROPERTIES |
1044 |
|
|
1045 |
The output is an integer that is set to one if support for Unicode |
The output is an integer that is set to one if support for Unicode |
1046 |
character properties is available; otherwise it is set to zero. |
character properties is available; otherwise it is set to zero. |
1047 |
|
|
1048 |
PCRE_CONFIG_NEWLINE |
PCRE_CONFIG_NEWLINE |
1049 |
|
|
1050 |
The output is an integer whose value specifies the default character |
The output is an integer whose value specifies the default character |
1051 |
sequence that is recognized as meaning "newline". The five values that |
sequence that is recognized as meaning "newline". The five values that |
1052 |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF, |
1053 |
and -1 for ANY. The default should normally be the standard sequence |
and -1 for ANY. Though they are derived from ASCII, the same values |
1054 |
for your operating system. |
are returned in EBCDIC environments. The default should normally corre- |
1055 |
|
spond to the standard sequence for your operating system. |
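
The numeric values listed above can be decoded with a small helper. This
is only a sketch: newline_name() is a hypothetical function, not part of
PCRE, and the reading of 3338 as (13 << 8) | 10 is inferred from the
arithmetic (13 * 256 + 10 = 3338) rather than stated in the text.

```c
#include <stddef.h>

/* Hypothetical helper: map a value reported for PCRE_CONFIG_NEWLINE to a
   short name. Returns NULL for a value not listed in the text above. */
const char *newline_name(int value)
{
    const int cr = 13, lf = 10;   /* ASCII codes quoted in the text */

    if (value == lf) return "LF";
    if (value == cr) return "CR";
    if (value == ((cr << 8) | lf)) return "CRLF";   /* 13*256 + 10 = 3338 */
    if (value == -2) return "ANYCRLF";
    if (value == -1) return "ANY";
    return NULL;
}
```

A caller would pass in the integer that pcre_config() stored for
PCRE_CONFIG_NEWLINE and treat a NULL result as an unknown setting.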
1056 |
|
|
1057 |
PCRE_CONFIG_BSR |
PCRE_CONFIG_BSR |
1058 |
|
|
1079 |
|
|
1080 |
PCRE_CONFIG_MATCH_LIMIT |
PCRE_CONFIG_MATCH_LIMIT |
1081 |
|
|
1082 |
The output is an integer that gives the default limit for the number of |
The output is a long integer that gives the default limit for the num- |
1083 |
internal matching function calls in a pcre_exec() execution. Further |
ber of internal matching function calls in a pcre_exec() execution. |
1084 |
details are given with pcre_exec() below. |
Further details are given with pcre_exec() below. |
1085 |
|
|
1086 |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
PCRE_CONFIG_MATCH_LIMIT_RECURSION |
1087 |
|
|
1088 |
The output is an integer that gives the default limit for the depth of |
The output is a long integer that gives the default limit for the depth |
1089 |
recursion when calling the internal matching function in a pcre_exec() |
of recursion when calling the internal matching function in a |
1090 |
execution. Further details are given with pcre_exec() below. |
pcre_exec() execution. Further details are given with pcre_exec() |
1091 |
|
below. |
1092 |
|
|
1093 |
PCRE_CONFIG_STACKRECURSE |
PCRE_CONFIG_STACKRECURSE |
1094 |
|
|
1095 |
The output is an integer that is set to one if internal recursion when |
The output is an integer that is set to one if internal recursion when |
1096 |
running pcre_exec() is implemented by recursive function calls that use |
running pcre_exec() is implemented by recursive function calls that use |
1097 |
the stack to remember their state. This is the usual way that PCRE is |
the stack to remember their state. This is the usual way that PCRE is |
1098 |
compiled. The output is zero if PCRE was compiled to use blocks of data |
compiled. The output is zero if PCRE was compiled to use blocks of data |
1099 |
on the heap instead of recursive function calls. In this case, |
on the heap instead of recursive function calls. In this case, |
1100 |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
pcre_stack_malloc and pcre_stack_free are called to manage memory |
1101 |
blocks on the heap, thus avoiding the use of the stack. |
blocks on the heap, thus avoiding the use of the stack. |
1102 |
|
|
1103 |
|
|
1114 |
|
|
1115 |
Either of the functions pcre_compile() or pcre_compile2() can be called |
Either of the functions pcre_compile() or pcre_compile2() can be called |
1116 |
to compile a pattern into an internal form. The only difference between |
to compile a pattern into an internal form. The only difference between |
1117 |
the two interfaces is that pcre_compile2() has an additional argument, |
the two interfaces is that pcre_compile2() has an additional argument, |
1118 |
errorcodeptr, via which a numerical error code can be returned. |
errorcodeptr, via which a numerical error code can be returned. |
1119 |
|
|
1120 |
The pattern is a C string terminated by a binary zero, and is passed in |
The pattern is a C string terminated by a binary zero, and is passed in |
1121 |
the pattern argument. A pointer to a single block of memory that is |
the pattern argument. A pointer to a single block of memory that is |
1122 |
obtained via pcre_malloc is returned. This contains the compiled code |
obtained via pcre_malloc is returned. This contains the compiled code |
1123 |
and related data. The pcre type is defined for the returned block; this |
and related data. The pcre type is defined for the returned block; this |
1124 |
is a typedef for a structure whose contents are not externally defined. |
is a typedef for a structure whose contents are not externally defined. |
1125 |
It is up to the caller to free the memory (via pcre_free) when it is no |
It is up to the caller to free the memory (via pcre_free) when it is no |
1126 |
longer required. |
longer required. |
1127 |
|
|
1128 |
Although the compiled code of a PCRE regex is relocatable, that is, it |
Although the compiled code of a PCRE regex is relocatable, that is, it |
1129 |
does not depend on memory location, the complete pcre data block is not |
does not depend on memory location, the complete pcre data block is not |
1130 |
fully relocatable, because it may contain a copy of the tableptr argu- |
fully relocatable, because it may contain a copy of the tableptr argu- |
1131 |
ment, which is an address (see below). |
ment, which is an address (see below). |
1132 |
|
|
1133 |
The options argument contains various bit settings that affect the com- |
The options argument contains various bit settings that affect the com- |
1134 |
pilation. It should be zero if no options are required. The available |
pilation. It should be zero if no options are required. The available |
1135 |
options are described below. Some of them, in particular, those that |
options are described below. Some of them, in particular, those that |
1136 |
are compatible with Perl, can also be set and unset from within the |
are compatible with Perl, can also be set and unset from within the |
1137 |
pattern (see the detailed description in the pcrepattern documenta- |
pattern (see the detailed description in the pcrepattern documenta- |
1138 |
tion). For these options, the contents of the options argument specify |
tion). For these options, the contents of the options argument specify |
1139 |
their initial settings at the start of compilation and execution. |
their initial settings at the start of compilation and execution. |
1140 |
The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time |
The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at the time |
1141 |
of matching as well as at compile time. |
of matching as well as at compile time. |
1142 |
|
|
1143 |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
If errptr is NULL, pcre_compile() returns NULL immediately. Otherwise, |
1144 |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
if compilation of a pattern fails, pcre_compile() returns NULL, and |
1145 |
sets the variable pointed to by errptr to point to a textual error mes- |
sets the variable pointed to by errptr to point to a textual error mes- |
1146 |
sage. This is a static string that is part of the library. You must not |
sage. This is a static string that is part of the library. You must not |
1147 |
try to free it. The offset from the start of the pattern to the charac- |
try to free it. The offset from the start of the pattern to the charac- |
1148 |
ter where the error was discovered is placed in the variable pointed to |
ter where the error was discovered is placed in the variable pointed to |
1149 |
by erroffset, which must not be NULL. If it is, an immediate error is |
by erroffset, which must not be NULL. If it is, an immediate error is |
1150 |
given. |
given. |
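
One way a caller might present the offset returned via erroffset is to
print a marker under the offending character. The sketch below is an
illustration only: error_marker() is a hypothetical helper, not a PCRE
function, and it assumes the pattern is shown in a fixed-width font with
one column per byte.

```c
#include <stddef.h>

/* Hypothetical helper: fill buf with erroffset spaces followed by '^'
   and a terminating zero, so that printing buf under the pattern points
   at the reported character. Returns 0 on success, or -1 if buf is NULL,
   erroffset is negative, or buf is too small. */
int error_marker(char *buf, size_t bufsize, int erroffset)
{
    size_t i, n;

    if (buf == NULL || erroffset < 0) return -1;
    n = (size_t)erroffset;
    if (bufsize < 2 || n > bufsize - 2) return -1;   /* need n + 2 bytes */

    for (i = 0; i < n; i++) buf[i] = ' ';
    buf[n] = '^';
    buf[n + 1] = '\0';
    return 0;
}
```

After a failed compilation, the caller would print the pattern on one
line, the marker on the next, and then the message that errptr points to.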
1151 |
|
|
1152 |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
If pcre_compile2() is used instead of pcre_compile(), and the error- |
1153 |
codeptr argument is not NULL, a non-zero error code number is returned |
codeptr argument is not NULL, a non-zero error code number is returned |
1154 |
via this argument in the event of an error. This is in addition to the |
via this argument in the event of an error. This is in addition to the |
1155 |
textual error message. Error codes and messages are listed below. |
textual error message. Error codes and messages are listed below. |
1156 |
|
|
1157 |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
If the final argument, tableptr, is NULL, PCRE uses a default set of |
1158 |
character tables that are built when PCRE is compiled, using the |
character tables that are built when PCRE is compiled, using the |
1159 |
default C locale. Otherwise, tableptr must be an address that is the |
default C locale. Otherwise, tableptr must be an address that is the |
1160 |
result of a call to pcre_maketables(). This value is stored with the |
result of a call to pcre_maketables(). This value is stored with the |
1161 |
compiled pattern, and used again by pcre_exec(), unless another table |
compiled pattern, and used again by pcre_exec(), unless another table |
1162 |
pointer is passed to it. For more discussion, see the section on locale |
pointer is passed to it. For more discussion, see the section on locale |
1163 |
support below. |
support below. |
1164 |
|
|
1165 |
This code fragment shows a typical straightforward call to pcre_com- |
This code fragment shows a typical straightforward call to pcre_com- |
1166 |
pile(): |
pile(): |
1167 |
|
|
1168 |
pcre *re; |
pcre *re; |
1175 |
&erroffset, /* for error offset */ |
&erroffset, /* for error offset */ |
1176 |
NULL); /* use default character tables */ |
NULL); /* use default character tables */ |
1177 |
|
|
1178 |
The following names for option bits are defined in the pcre.h header |
The following names for option bits are defined in the pcre.h header |
1179 |
file: |
file: |
1180 |
|
|
1181 |
PCRE_ANCHORED |
PCRE_ANCHORED |
1182 |
|
|
1183 |
If this bit is set, the pattern is forced to be "anchored", that is, it |
If this bit is set, the pattern is forced to be "anchored", that is, it |
1184 |
is constrained to match only at the first matching point in the string |
is constrained to match only at the first matching point in the string |
1185 |
that is being searched (the "subject string"). This effect can also be |
that is being searched (the "subject string"). This effect can also be |
1186 |
achieved by appropriate constructs in the pattern itself, which is the |
achieved by appropriate constructs in the pattern itself, which is the |
1187 |
only way to do it in Perl. |
only way to do it in Perl. |
1188 |
|
|
1189 |
PCRE_AUTO_CALLOUT |
PCRE_AUTO_CALLOUT |
1190 |
|
|
1191 |
If this bit is set, pcre_compile() automatically inserts callout items, |
If this bit is set, pcre_compile() automatically inserts callout items, |
1192 |
all with number 255, before each pattern item. For discussion of the |
all with number 255, before each pattern item. For discussion of the |
1193 |
callout facility, see the pcrecallout documentation. |
callout facility, see the pcrecallout documentation. |
1194 |
|
|
1195 |
PCRE_BSR_ANYCRLF |
PCRE_BSR_ANYCRLF |
1196 |
PCRE_BSR_UNICODE |
PCRE_BSR_UNICODE |
1197 |
|
|
1198 |
These options (which are mutually exclusive) control what the \R escape |
These options (which are mutually exclusive) control what the \R escape |
1199 |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
sequence matches. The choice is either to match only CR, LF, or CRLF, |
1200 |
or to match any Unicode newline sequence. The default is specified when |
or to match any Unicode newline sequence. The default is specified when |
1201 |
PCRE is built. It can be overridden from within the pattern, or by set- |
PCRE is built. It can be overridden from within the pattern, or by set- |
1202 |
ting an option when a compiled pattern is matched. |
ting an option when a compiled pattern is matched. |
1203 |
|
|
1204 |
PCRE_CASELESS |
PCRE_CASELESS |
1205 |
|
|
1206 |
If this bit is set, letters in the pattern match both upper and lower |
If this bit is set, letters in the pattern match both upper and lower |
1207 |
case letters. It is equivalent to Perl's /i option, and it can be |
case letters. It is equivalent to Perl's /i option, and it can be |
1208 |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
changed within a pattern by a (?i) option setting. In UTF-8 mode, PCRE |
1209 |
always understands the concept of case for characters whose values are |
always understands the concept of case for characters whose values are |
1210 |
less than 128, so caseless matching is always possible. For characters |
less than 128, so caseless matching is always possible. For characters |
1211 |
with higher values, the concept of case is supported if PCRE is com- |
with higher values, the concept of case is supported if PCRE is com- |
1212 |
piled with Unicode property support, but not otherwise. If you want to |
piled with Unicode property support, but not otherwise. If you want to |
1213 |
use caseless matching for characters 128 and above, you must ensure |
use caseless matching for characters 128 and above, you must ensure |
1214 |
that PCRE is compiled with Unicode property support as well as with |
that PCRE is compiled with Unicode property support as well as with |
1215 |
UTF-8 support. |
UTF-8 support. |
1216 |
|
|
1217 |
PCRE_DOLLAR_ENDONLY |
PCRE_DOLLAR_ENDONLY |
1218 |
|
|
1219 |
If this bit is set, a dollar metacharacter in the pattern matches only |
If this bit is set, a dollar metacharacter in the pattern matches only |
1220 |
at the end of the subject string. Without this option, a dollar also |
at the end of the subject string. Without this option, a dollar also |
1221 |
matches immediately before a newline at the end of the string (but not |
matches immediately before a newline at the end of the string (but not |
1222 |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
before any other newlines). The PCRE_DOLLAR_ENDONLY option is ignored |
1223 |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
if PCRE_MULTILINE is set. There is no equivalent to this option in |
1224 |
Perl, and no way to set it within a pattern. |
Perl, and no way to set it within a pattern. |
1225 |
|
|
1226 |
PCRE_DOTALL |
PCRE_DOTALL |
1227 |
|
|
1228 |
If this bit is set, a dot metacharater in the pattern matches all char- |
If this bit is set, a dot metacharacter in the pattern matches all char- |
If this bit is set, a dot metacharacter in the pattern matches all char- |
acters, including those that indicate newline. Without it, a dot does |
acters, including those that indicate newline. Without it, a dot does |
1230 |
not match when the current position is at a newline. This option is |
not match when the current position is at a newline. This option is |
1231 |
equivalent to Perl's /s option, and it can be changed within a pattern |
equivalent to Perl's /s option, and it can be changed within a pattern |
1232 |
by a (?s) option setting. A negative class such as [^a] always matches |
by a (?s) option setting. A negative class such as [^a] always matches |
1233 |
newline characters, independent of the setting of this option. |
newline characters, independent of the setting of this option. |
1234 |
|
|
1235 |
PCRE_DUPNAMES |
PCRE_DUPNAMES |
1236 |
|
|
1237 |
If this bit is set, names used to identify capturing subpatterns need |
If this bit is set, names used to identify capturing subpatterns need |
1238 |
not be unique. This can be helpful for certain types of pattern when it |
not be unique. This can be helpful for certain types of pattern when it |
1239 |
is known that only one instance of the named subpattern can ever be |
is known that only one instance of the named subpattern can ever be |
1240 |
matched. There are more details of named subpatterns below; see also |
matched. There are more details of named subpatterns below; see also |
1241 |
the pcrepattern documentation. |
the pcrepattern documentation. |
1242 |
|
|
1243 |
PCRE_EXTENDED |
PCRE_EXTENDED |
1244 |
|
|
1245 |
If this bit is set, whitespace data characters in the pattern are |
If this bit is set, whitespace data characters in the pattern are |
1246 |
totally ignored except when escaped or inside a character class. White- |
totally ignored except when escaped or inside a character class. White- |
1247 |
space does not include the VT character (code 11). In addition, charac- |
space does not include the VT character (code 11). In addition, charac- |
1248 |
ters between an unescaped # outside a character class and the next new- |
ters between an unescaped # outside a character class and the next new- |
1249 |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
line, inclusive, are also ignored. This is equivalent to Perl's /x |
1250 |
option, and it can be changed within a pattern by a (?x) option set- |
option, and it can be changed within a pattern by a (?x) option set- |
1251 |
ting. |
ting. |
1252 |
|
|
1253 |
This option makes it possible to include comments inside complicated |
This option makes it possible to include comments inside complicated |
1254 |
patterns. Note, however, that this applies only to data characters. |
patterns. Note, however, that this applies only to data characters. |
1255 |
Whitespace characters may never appear within special character |
Whitespace characters may never appear within special character |
1256 |
sequences in a pattern, for example within the sequence (?( which |
sequences in a pattern, for example within the sequence (?( which |
1257 |
introduces a conditional subpattern. |
introduces a conditional subpattern. |
1258 |
|
|
1259 |
PCRE_EXTRA |
PCRE_EXTRA |
1260 |
|
|
1261 |
This option was invented in order to turn on additional functionality |
This option was invented in order to turn on additional functionality |
1262 |
of PCRE that is incompatible with Perl, but it is currently of very |
of PCRE that is incompatible with Perl, but it is currently of very |
1263 |
little use. When set, any backslash in a pattern that is followed by a |
little use. When set, any backslash in a pattern that is followed by a |
1264 |
letter that has no special meaning causes an error, thus reserving |
letter that has no special meaning causes an error, thus reserving |
1265 |
these combinations for future expansion. By default, as in Perl, a |
these combinations for future expansion. By default, as in Perl, a |
1266 |
backslash followed by a letter with no special meaning is treated as a |
backslash followed by a letter with no special meaning is treated as a |
1267 |
literal. (Perl can, however, be persuaded to give a warning for this.) |
literal. (Perl can, however, be persuaded to give a warning for this.) |
1268 |
There are at present no other features controlled by this option. It |
There are at present no other features controlled by this option. It |
1269 |
can also be set by a (?X) option setting within a pattern. |
can also be set by a (?X) option setting within a pattern. |
1270 |
|
|
1271 |
PCRE_FIRSTLINE |
PCRE_FIRSTLINE |
1272 |
|
|
1273 |
If this option is set, an unanchored pattern is required to match |
If this option is set, an unanchored pattern is required to match |
1274 |
before or at the first newline in the subject string, though the |
before or at the first newline in the subject string, though the |
1275 |
matched text may continue over the newline. |
matched text may continue over the newline. |
1276 |
|
|
1277 |
PCRE_JAVASCRIPT_COMPAT |
PCRE_JAVASCRIPT_COMPAT |
1278 |
|
|
1279 |
If this option is set, PCRE's behaviour is changed in some ways so that |
If this option is set, PCRE's behaviour is changed in some ways so that |
1280 |
it is compatible with JavaScript rather than Perl. The changes are as |
it is compatible with JavaScript rather than Perl. The changes are as |
1281 |
follows: |
follows: |
1282 |
|
|
1283 |
(1) A lone closing square bracket in a pattern causes a compile-time |
(1) A lone closing square bracket in a pattern causes a compile-time |
1284 |
error, because this is illegal in JavaScript (by default it is treated |
error, because this is illegal in JavaScript (by default it is treated |
1285 |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
as a data character). Thus, the pattern AB]CD becomes illegal when this |
1286 |
option is set. |
option is set. |
1287 |
|
|
1288 |
(2) At run time, a back reference to an unset subpattern group matches |
(2) At run time, a back reference to an unset subpattern group matches |
1289 |
an empty string (by default this causes the current matching alterna- |
an empty string (by default this causes the current matching alterna- |
1290 |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
tive to fail). A pattern such as (\1)(a) succeeds when this option is |
1291 |
set (assuming it can find an "a" in the subject), whereas it fails by |
set (assuming it can find an "a" in the subject), whereas it fails by |
1292 |
default, for Perl compatibility. |
default, for Perl compatibility. |
1293 |
|
|
1294 |
PCRE_MULTILINE |
PCRE_MULTILINE |
1295 |
|
|
1296 |
By default, PCRE treats the subject string as consisting of a single |
By default, PCRE treats the subject string as consisting of a single |
1297 |
line of characters (even if it actually contains newlines). The "start |
line of characters (even if it actually contains newlines). The "start |
1298 |
of line" metacharacter (^) matches only at the start of the string, |
of line" metacharacter (^) matches only at the start of the string, |
1299 |
while the "end of line" metacharacter ($) matches only at the end of |
while the "end of line" metacharacter ($) matches only at the end of |
1300 |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY |
1301 |
is set). This is the same as Perl. |
is set). This is the same as Perl. |
1302 |
|
|
1303 |
When PCRE_MULTILINE it is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
When PCRE_MULTILINE is set, the "start of line" and "end of line" |
constructs match immediately following or immediately before internal |
constructs match immediately following or immediately before internal |
1305 |
newlines in the subject string, respectively, as well as at the very |
newlines in the subject string, respectively, as well as at the very |
1306 |
start and end. This is equivalent to Perl's /m option, and it can be |
start and end. This is equivalent to Perl's /m option, and it can be |
1307 |
changed within a pattern by a (?m) option setting. If there are no new- |
changed within a pattern by a (?m) option setting. If there are no new- |
1308 |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
lines in a subject string, or no occurrences of ^ or $ in a pattern, |
1309 |
setting PCRE_MULTILINE has no effect. |
setting PCRE_MULTILINE has no effect. |
1310 |
|
|
1311 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1314 |
PCRE_NEWLINE_ANYCRLF |
PCRE_NEWLINE_ANYCRLF |
1315 |
PCRE_NEWLINE_ANY |
PCRE_NEWLINE_ANY |
1316 |
|
|
1317 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
1318 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
1319 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
1320 |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
1321 |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANYCRLF specifies |
1322 |
that any of the three preceding sequences should be recognized. Setting |
that any of the three preceding sequences should be recognized. Setting |
1323 |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should be |
1324 |
recognized. The Unicode newline sequences are the three just mentioned, |
recognized. The Unicode newline sequences are the three just mentioned, |
1325 |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
plus the single characters VT (vertical tab, U+000B), FF (formfeed, |
1326 |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
U+000C), NEL (next line, U+0085), LS (line separator, U+2028), and PS |
1327 |
(paragraph separator, U+2029). The last two are recognized only in |
(paragraph separator, U+2029). The last two are recognized only in |
1328 |
UTF-8 mode. |
UTF-8 mode. |
1329 |
|
|
1330 |
The newline setting in the options word uses three bits that are |
The newline setting in the options word uses three bits that are |
1331 |
treated as a number, giving eight possibilities. Currently only six are |
treated as a number, giving eight possibilities. Currently only six are |
1332 |
used (default plus the five values above). This means that if you set |
used (default plus the five values above). This means that if you set |
1333 |
more than one newline option, the combination may or may not be sensi- |
more than one newline option, the combination may or may not be sensi- |
1334 |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to |
1335 |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers and |
1336 |
cause an error. |
cause an error. |
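
The "three bits treated as a number" behaviour can be sketched as follows.
The EX_ names, the bit position, and the numbering of the five settings
are assumptions made for this illustration; real code should use the
PCRE_NEWLINE_xxx names from pcre.h and not rely on their values.

```c
/* Illustrative stand-ins for the PCRE_NEWLINE_xxx option bits. The shift
   and the numbering are assumptions for this sketch only. */
#define EX_NEWLINE_SHIFT    20
#define EX_NEWLINE_CR       (1 << EX_NEWLINE_SHIFT)
#define EX_NEWLINE_LF       (2 << EX_NEWLINE_SHIFT)
#define EX_NEWLINE_CRLF     (3 << EX_NEWLINE_SHIFT)
#define EX_NEWLINE_ANY      (4 << EX_NEWLINE_SHIFT)
#define EX_NEWLINE_ANYCRLF  (5 << EX_NEWLINE_SHIFT)
#define EX_NEWLINE_MASK     (7 << EX_NEWLINE_SHIFT)

/* Extract the three-bit newline field as a number from 0 to 7. */
int ex_newline_field(int options)
{
    return (options & EX_NEWLINE_MASK) >> EX_NEWLINE_SHIFT;
}

/* Return 1 if the field is one of the six used numbers (0 means "use the
   default", 1 to 5 are the settings above), otherwise 0. */
int ex_newline_is_used(int options)
{
    return ex_newline_field(options) <= 5;
}
```

Under these assumed values, OR-ing the CR and LF bits gives the same
number as the CRLF setting, whereas OR-ing LF with ANY gives 6, which is
one of the two unused numbers.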
1337 |
|
|
1338 |
The only time that a line break is specially recognized when compiling |
The only time that a line break is specially recognized when compiling |
1339 |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
1340 |
character class is encountered. This indicates a comment that lasts |
character class is encountered. This indicates a comment that lasts |
1341 |
until after the next line break sequence. In other circumstances, line |
until after the next line break sequence. In other circumstances, line |
1342 |
break sequences are treated as literal data, except that in |
break sequences are treated as literal data, except that in |
1343 |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
1344 |
and are therefore ignored. |
and are therefore ignored. |
1345 |
|
|
1346 |
The newline option that is set at compile time becomes the default that |
The newline option that is set at compile time becomes the default that |
1347 |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
1348 |
|
|
1349 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
1350 |
|
|
1812 |
const char *subject, int length, int startoffset, |
const char *subject, int length, int startoffset, |
1813 |
int options, int *ovector, int ovecsize); |
int options, int *ovector, int ovecsize); |
1814 |
|
|
1815 |
The function pcre_exec() is called to match a subject string against a |
The function pcre_exec() is called to match a subject string against a |
1816 |
compiled pattern, which is passed in the code argument. If the pattern |
compiled pattern, which is passed in the code argument. If the pattern |
1817 |
has been studied, the result of the study should be passed in the extra |
has been studied, the result of the study should be passed in the extra |
1818 |
argument. This function is the main matching facility of the library, |
argument. This function is the main matching facility of the library, |
1819 |
and it operates in a Perl-like manner. For specialist use there is also |
and it operates in a Perl-like manner. For specialist use there is also |
1820 |
an alternative matching function, which is described below in the sec- |
an alternative matching function, which is described below in the sec- |
1821 |
tion about the pcre_dfa_exec() function. |
tion about the pcre_dfa_exec() function. |
1822 |
|
|
1823 |
In most applications, the pattern will have been compiled (and option- |
In most applications, the pattern will have been compiled (and option- |
1824 |
ally studied) in the same process that calls pcre_exec(). However, it |
ally studied) in the same process that calls pcre_exec(). However, it |
1825 |
is possible to save compiled patterns and study data, and then use them |
is possible to save compiled patterns and study data, and then use them |
1826 |
later in different processes, possibly even on different hosts. For a |
later in different processes, possibly even on different hosts. For a |
1827 |
discussion about this, see the pcreprecompile documentation. |
discussion about this, see the pcreprecompile documentation. |
1828 |
|
|
1829 |
Here is an example of a simple call to pcre_exec(): |
Here is an example of a simple call to pcre_exec(): |
1842 |
|
|
1843 |
Extra data for pcre_exec() |
Extra data for pcre_exec() |
1844 |
|
|
1845 |
If the extra argument is not NULL, it must point to a pcre_extra data |
If the extra argument is not NULL, it must point to a pcre_extra data |
1846 |
block. The pcre_study() function returns such a block (when it doesn't |
block. The pcre_study() function returns such a block (when it doesn't |
1847 |
return NULL), but you can also create one for yourself, and pass addi- |
return NULL), but you can also create one for yourself, and pass addi- |
1848 |
tional information in it. The pcre_extra block contains the following |
tional information in it. The pcre_extra block contains the following |
1849 |
fields (not necessarily in this order): |
fields (not necessarily in this order): |
1850 |
|
|
1851 |
unsigned long int flags; |
unsigned long int flags; |
1855 |
void *callout_data; |
void *callout_data; |
1856 |
const unsigned char *tables; |
const unsigned char *tables; |
1857 |
|
|
1858 |
The flags field is a bitmap that specifies which of the other fields |
The flags field is a bitmap that specifies which of the other fields |
1859 |
are set. The flag bits are: |
are set. The flag bits are: |
1860 |
|
|
1861 |
PCRE_EXTRA_STUDY_DATA |
PCRE_EXTRA_STUDY_DATA |
1864 |
PCRE_EXTRA_CALLOUT_DATA |
PCRE_EXTRA_CALLOUT_DATA |
1865 |
PCRE_EXTRA_TABLES |
PCRE_EXTRA_TABLES |
1866 |
|
|
1867 |
Other flag bits should be set to zero. The study_data field is set in |
Other flag bits should be set to zero. The study_data field is set in |
1868 |
the pcre_extra block that is returned by pcre_study(), together with |
the pcre_extra block that is returned by pcre_study(), together with |
1869 |
the appropriate flag bit. You should not set this yourself, but you may |
the appropriate flag bit. You should not set this yourself, but you may |
1870 |
add to the block by setting the other fields and their corresponding |
add to the block by setting the other fields and their corresponding |
1871 |
flag bits. |
flag bits. |
1872 |
|
|
1873 |
The match_limit field provides a means of preventing PCRE from using up |
The match_limit field provides a means of preventing PCRE from using up |
1874 |
a vast amount of resources when running patterns that are not going to |
a vast amount of resources when running patterns that are not going to |
1875 |
match, but which have a very large number of possibilities in their |
match, but which have a very large number of possibilities in their |
1876 |
search trees. The classic example is the use of nested unlimited |
search trees. The classic example is the use of nested unlimited |
1877 |
repeats. |
repeats. |
1878 |
|
|
1879 |
Internally, PCRE uses a function called match() which it calls repeat- |
Internally, PCRE uses a function called match() which it calls repeat- |
1880 |
edly (sometimes recursively). The limit set by match_limit is imposed |
edly (sometimes recursively). The limit set by match_limit is imposed |
1881 |
on the number of times this function is called during a match, which |
on the number of times this function is called during a match, which |
1882 |
has the effect of limiting the amount of backtracking that can take |
has the effect of limiting the amount of backtracking that can take |
1883 |
place. For patterns that are not anchored, the count restarts from zero |
place. For patterns that are not anchored, the count restarts from zero |
1884 |
for each position in the subject string. |
for each position in the subject string. |
1885 |
|
|
1886 |
The default value for the limit can be set when PCRE is built; the |
The default value for the limit can be set when PCRE is built; the |
1887 |
default default is 10 million, which handles all but the most extreme |
default default is 10 million, which handles all but the most extreme |
1888 |
cases. You can override the default by supplying pcre_exec() with a |
cases. You can override the default by supplying pcre_exec() with a |
1889 |
pcre_extra block in which match_limit is set, and |
pcre_extra block in which match_limit is set, and |
1890 |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
PCRE_EXTRA_MATCH_LIMIT is set in the flags field. If the limit is |
1891 |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT. |
1892 |
|
|
1893 |
The match_limit_recursion field is similar to match_limit, but instead |
The match_limit_recursion field is similar to match_limit, but instead |
1894 |
of limiting the total number of times that match() is called, it limits |
of limiting the total number of times that match() is called, it limits |
1895 |
the depth of recursion. The recursion depth is a smaller number than |
the depth of recursion. The recursion depth is a smaller number than |
1896 |
the total number of calls, because not all calls to match() are recur- |
the total number of calls, because not all calls to match() are recur- |
1897 |
sive. This limit is of use only if it is set smaller than match_limit. |
sive. This limit is of use only if it is set smaller than match_limit. |
1898 |
|
|
1899 |
Limiting the recursion depth limits the amount of stack that can be |
Limiting the recursion depth limits the amount of stack that can be |
1925 |
|
|
1926 |
The unused bits of the options argument for pcre_exec() must be zero. |
The unused bits of the options argument for pcre_exec() must be zero. |
1927 |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
The only bits that may be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx, |
1928 |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK and |
PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_START_OPTIMIZE, |
1929 |
PCRE_PARTIAL. |
PCRE_NO_UTF8_CHECK and PCRE_PARTIAL. |
1930 |
|
|
1931 |
PCRE_ANCHORED |
PCRE_ANCHORED |
1932 |
|
|
2020 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
2021 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
2022 |
|
|
2023 |
|
PCRE_NO_START_OPTIMIZE |
2024 |
|
|
2025 |
|
There are a number of optimizations that pcre_exec() uses at the start |
2026 |
|
of a match, in order to speed up the process. For example, if it is |
2027 |
|
known that a match must start with a specific character, it searches |
2028 |
|
the subject for that character, and fails immediately if it cannot find |
2029 |
|
it, without actually running the main matching function. When callouts |
2030 |
|
are in use, these optimizations can cause them to be skipped. This |
2031 |
|
option disables the "start-up" optimizations, causing performance to |
2032 |
|
suffer, but ensuring that the callouts do occur. |
2033 |
|
|
2034 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
2035 |
|
|
2036 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
2037 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
2038 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
2039 |
points to the start of a UTF-8 character. There is a discussion about |
points to the start of a UTF-8 character. There is a discussion about |
2040 |
the validity of UTF-8 strings in the section on UTF-8 support in the |
the validity of UTF-8 strings in the section on UTF-8 support in the |
2041 |
main pcre page. If an invalid UTF-8 sequence of bytes is found, |
main pcre page. If an invalid UTF-8 sequence of bytes is found, |
2042 |
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con- |
pcre_exec() returns the error PCRE_ERROR_BADUTF8. If startoffset con- |
2043 |
tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned. |
tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned. |
2044 |
|
|
2045 |
If you already know that your subject is valid, and you want to skip |
If you already know that your subject is valid, and you want to skip |
2046 |
these checks for performance reasons, you can set the |
these checks for performance reasons, you can set the |
2047 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
2048 |
do this for the second and subsequent calls to pcre_exec() if you are |
do this for the second and subsequent calls to pcre_exec() if you are |
2049 |
making repeated calls to find all the matches in a single subject |
making repeated calls to find all the matches in a single subject |
2050 |
string. However, you should be sure that the value of startoffset |
string. However, you should be sure that the value of startoffset |
2051 |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
2052 |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
2053 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
2054 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
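Before setting PCRE_NO_UTF8_CHECK, a caller may want a cheap test that startoffset lands on a character boundary. The following is a minimal sketch of such a test. The helper name is_utf8_char_start is hypothetical and is not part of the PCRE API. It relies only on the UTF-8 rule that continuation bytes have the bit pattern 10xxxxxx, it does not validate the subject as a whole, and treating an offset equal to the subject length as acceptable is an assumption of this sketch.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Returns 1 if `offset` can be the start of a UTF-8 character in
   `subject` (which is `length` bytes long), and 0 otherwise.
   Assumption: an offset equal to `length` (the very end of the
   subject) is accepted. */
static int is_utf8_char_start(const char *subject, int length, int offset)
{
    if (subject == NULL || offset < 0 || offset > length)
        return 0;
    if (offset == length)
        return 1;
    /* UTF-8 continuation bytes have the form 10xxxxxx. */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

For the four-byte subject "a\xC3\xA9z" (the letter a, a two-byte character, the letter z), offsets 0, 1 and 3 are accepted and offset 2 is rejected because it falls inside the two-byte character.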
2055 |
|
|
2056 |
PCRE_PARTIAL |
PCRE_PARTIAL |
2057 |
|
|
2058 |
This option turns on the partial matching feature. If the subject |
This option turns on the partial matching feature. If the subject |
2059 |
string fails to match the pattern, but at some point during the match- |
string fails to match the pattern, but at some point during the match- |
2060 |
ing process the end of the subject was reached (that is, the subject |
ing process the end of the subject was reached (that is, the subject |
2061 |
partially matches the pattern and the failure to match occurred only |
partially matches the pattern and the failure to match occurred only |
2062 |
because there were not enough subject characters), pcre_exec() returns |
because there were not enough subject characters), pcre_exec() returns |
2063 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
2064 |
used, there are restrictions on what may appear in the pattern. These |
used, there are restrictions on what may appear in the pattern. These |
2065 |
are discussed in the pcrepartial documentation. |
are discussed in the pcrepartial documentation. |
2066 |
|
|
2067 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
2068 |
|
|
2069 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
2070 |
length (in bytes) in length, and a starting byte offset in startoffset. |
length (in bytes) in length, and a starting byte offset in startoffset. |
2071 |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
In UTF-8 mode, the byte offset must point to the start of a UTF-8 char- |
2072 |
acter. Unlike the pattern string, the subject may contain binary zero |
acter. Unlike the pattern string, the subject may contain binary zero |
2073 |
bytes. When the starting offset is zero, the search for a match starts |
bytes. When the starting offset is zero, the search for a match starts |
2074 |
at the beginning of the subject, and this is by far the most common |
at the beginning of the subject, and this is by far the most common |
2075 |
case. |
case. |
2076 |
|
|
2077 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
2078 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
2079 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
2080 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
2081 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
2082 |
|
|
2083 |
\Biss\B |
\Biss\B |
2084 |
|
|
2085 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
2086 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
2087 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
2088 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
2089 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
2090 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
2091 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
2092 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
2093 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
2094 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
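The difference between a shortened subject and a non-zero startoffset can be reproduced without PCRE. The sketch below is a hand-written stand-in for the pattern \Biss\B, not a call into the library: the function names are hypothetical, and the word-character test (letters, digits and underscore in the C locale) is an assumption about what \B considers a word character.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Word character as assumed for \B: letter, digit or underscore. */
static int is_word_char(int c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/* True if position pos in s[0..length) is NOT a word boundary.
   The positions before the start and after the end of the subject
   count as non-word characters. */
static int not_word_boundary(const char *s, int length, int pos)
{
    int before = pos > 0 && is_word_char(s[pos - 1]);
    int after = pos < length && is_word_char(s[pos]);
    return before == after;
}

/* Toy stand-in for matching \Biss\B. Returns the byte offset of the
   first match at or after startoffset, or -1 if there is none. The
   look-behind at startoffset can see the bytes that precede it. */
static int find_inner_iss(const char *s, int length, int startoffset)
{
    int i;
    assert(startoffset >= 0 && startoffset <= length);
    for (i = startoffset; i + 3 <= length; i++) {
        if (memcmp(s + i, "iss", 3) == 0 &&
            not_word_boundary(s, length, i) &&
            not_word_boundary(s, length, i + 3))
            return i;
    }
    return -1;
}
```

With these definitions, find_inner_iss("Mississippi", 11, 0) returns 1, find_inner_iss("issippi", 7, 0) returns -1 because the shortened subject starts at a word boundary, and find_inner_iss("Mississippi", 11, 4) returns 4 because the preceding letter is still visible.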
2095 |
|
|
2096 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
2097 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
2098 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
2099 |
subject. |
subject. |
2100 |
|
|
2101 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
2102 |
|
|
2103 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
2104 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
2105 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
2106 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
2107 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
2108 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
2109 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
2110 |
|
|
2111 |
Captured substrings are returned to the caller via a vector of integers |
Captured substrings are returned to the caller via a vector of integers |
2112 |
whose address is passed in ovector. The number of elements in the vec- |
whose address is passed in ovector. The number of elements in the vec- |
2113 |
tor is passed in ovecsize, which must be a non-negative number. Note: |
tor is passed in ovecsize, which must be a non-negative number. Note: |
2114 |
this argument is NOT the size of ovector in bytes. |
this argument is NOT the size of ovector in bytes. |
2115 |
|
|
2116 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
2117 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
2118 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
2119 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
2120 |
The number passed in ovecsize should always be a multiple of three. If |
The number passed in ovecsize should always be a multiple of three. If |
2121 |
it is not, it is rounded down. |
it is not, it is rounded down. |
2122 |
|
|
2123 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
2124 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
2125 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
2126 |
element of each pair is set to the byte offset of the first character |
element of each pair is set to the byte offset of the first character |
2127 |
in a substring, and the second is set to the byte offset of the first |
in a substring, and the second is set to the byte offset of the first |
2128 |
character after the end of a substring. Note: these values are always |
character after the end of a substring. Note: these values are always |
2129 |
byte offsets, even in UTF-8 mode. They are not character counts. |
byte offsets, even in UTF-8 mode. They are not character counts. |
2130 |
|
|
2131 |
The first pair of integers, ovector[0] and ovector[1], identify the |
The first pair of integers, ovector[0] and ovector[1], identify the |
2132 |
portion of the subject string matched by the entire pattern. The next |
portion of the subject string matched by the entire pattern. The next |
2133 |
pair is used for the first capturing subpattern, and so on. The value |
pair is used for the first capturing subpattern, and so on. The value |
2134 |
returned by pcre_exec() is one more than the highest numbered pair that |
returned by pcre_exec() is one more than the highest numbered pair that |
2135 |
has been set. For example, if two substrings have been captured, the |
has been set. For example, if two substrings have been captured, the |
2136 |
returned value is 3. If there are no capturing subpatterns, the return |
returned value is 3. If there are no capturing subpatterns, the return |
2137 |
value from a successful match is 1, indicating that just the first pair |
value from a successful match is 1, indicating that just the first pair |
2138 |
of offsets has been set. |
of offsets has been set. |
2139 |
|
|
2140 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
2141 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
2142 |
|
|
2143 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
2144 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
2145 |
function returns a value of zero. If the substring offsets are not of |
function returns a value of zero. If the substring offsets are not of |
2146 |
interest, pcre_exec() may be called with ovector passed as NULL and |
interest, pcre_exec() may be called with ovector passed as NULL and |
2147 |
ovecsize as zero. However, if the pattern contains back references and |
ovecsize as zero. However, if the pattern contains back references and |
2148 |
the ovector is not big enough to remember the related substrings, PCRE |
the ovector is not big enough to remember the related substrings, PCRE |
2149 |
has to get additional memory for use during matching. Thus it is usu- |
has to get additional memory for use during matching. Thus it is usu- |
2150 |
ally advisable to supply an ovector. |
ally advisable to supply an ovector. |
2151 |
|
|
2152 |
The pcre_info() function can be used to find out how many capturing |
The pcre_info() function can be used to find out how many capturing |
2153 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
2154 |
ovector that will allow for n captured substrings, in addition to the |
ovector that will allow for n captured substrings, in addition to the |
2155 |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
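The sizing rules for ovector can be written down as two small helpers. This is only a sketch of the arithmetic described above; the function names are hypothetical and are not part of the PCRE API.

```c
#include <assert.h>

/* Hypothetical helpers, not part of the PCRE API. */

/* Smallest ovecsize that leaves room for the whole-match pair plus
   n capturing subpatterns: (n+1)*3 integers. */
static int ovecsize_for_captures(int n)
{
    assert(n >= 0);
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given
   ovecsize. Only the first two-thirds of the vector holds pairs, and
   a size that is not a multiple of three is rounded down. */
static int usable_pairs(int ovecsize)
{
    assert(ovecsize >= 0);
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns, ovecsize_for_captures(2) gives 9, and a vector of 10 integers still yields only usable_pairs(10) == 3 pairs.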
2156 |
|
|
2157 |
It is possible for capturing subpattern number n+1 to match some part |
It is possible for capturing subpattern number n+1 to match some part |
2158 |
of the subject when subpattern n has not been used at all. For example, |
of the subject when subpattern n has not been used at all. For example, |
2159 |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
2160 |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
2161 |
2 is not. When this happens, both values in the offset pairs corre- |
2 is not. When this happens, both values in the offset pairs corre- |
2162 |
sponding to unused subpatterns are set to -1. |
sponding to unused subpatterns are set to -1. |
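A caller that walks the offset pairs therefore has to test for -1 before using a pair. The following sketch shows one way to do that. The helper print_captures is hypothetical and is not part of the PCRE API; it assumes that rc is a positive value returned by pcre_exec() and that ovector holds the pairs filled in by that call.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper, not part of the PCRE API.
   Prints each of the first rc offset pairs and returns how many of
   them are set. A pair whose values are -1 belongs to a subpattern
   that took no part in the match. */
static int print_captures(const char *subject, const int *ovector, int rc)
{
    int i, set = 0;
    assert(rc > 0);
    for (i = 0; i < rc; i++) {
        int start = ovector[2 * i];     /* offset of first byte       */
        int end = ovector[2 * i + 1];   /* offset just past last byte */
        if (start < 0) {
            printf("%d: <unset>\n", i);
            continue;
        }
        printf("%d: \"%.*s\"\n", i, end - start, subject + start);
        set++;
    }
    return set;
}
```

For the example above, where "abc" is matched against (a|(z))(bc), the vector begins 0,3, 0,1, -1,-1, 1,3 and the return value is 4. Passing these to print_captures() prints "abc", "a", an unset marker for subpattern 2, and "bc", and returns 3.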
2163 |
|
|
2164 |
Offset values that correspond to unused subpatterns at the end of the |
Offset values that correspond to unused subpatterns at the end of the |
2165 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
2166 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
2167 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
2168 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1. However, you can refer to the offsets |
2169 |
for the second and third capturing subpatterns if you wish (assuming |
for the second and third capturing subpatterns if you wish (assuming |
2170 |
the vector is large enough, of course). |
the vector is large enough, of course). |
2171 |
|
|
2172 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
2173 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
2174 |
|
|
2175 |
Error return values from pcre_exec() |
Error return values from pcre_exec() |
2176 |
|
|
2177 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
2178 |
defined in the header file: |
defined in the header file: |
2179 |
|
|
2180 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
2183 |
|
|
2184 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
2185 |
|
|
2186 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
2187 |
ovecsize was not zero. |
ovecsize was not zero. |
2188 |
|
|
2189 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
2192 |
|
|
2193 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
2194 |
|
|
2195 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
2196 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
2197 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
2198 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
2199 |
gives when the magic number is not present. |
gives when the magic number is not present. |
2200 |
|
|
2201 |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
2202 |
|
|
2203 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
2204 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
2205 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
2206 |
|
|
2207 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2208 |
|
|
2209 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
2210 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
2211 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
2212 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
2213 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
2214 |
|
|
2215 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
2216 |
|
|
2217 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
2218 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
2219 |
returned by pcre_exec(). |
returned by pcre_exec(). |
2220 |
|
|
2221 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
2222 |
|
|
2223 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
2224 |
pcre_extra structure (or defaulted) was reached. See the description |
pcre_extra structure (or defaulted) was reached. See the description |
2225 |
above. |
above. |
2226 |
|
|
2227 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
2228 |
|
|
2229 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
2230 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
2231 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
2232 |
|
|
2233 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
2234 |
|
|
2235 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
2236 |
subject. |
subject. |
2237 |
|
|
2238 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
2239 |
|
|
2240 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
2241 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
2242 |
ter. |
ter. |
2243 |
|
|
2244 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
2245 |
|
|
2246 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
2247 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
2248 |
|
|
2249 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
2250 |
|
|
2251 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
The PCRE_PARTIAL option was used with a compiled pattern containing |
2252 |
items that are not supported for partial matching. See the pcrepartial |
items that are not supported for partial matching. See the pcrepartial |
2253 |
documentation for details of partial matching. |
documentation for details of partial matching. |
2254 |
|
|
2255 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
2256 |
|
|
2257 |
An unexpected internal error has occurred. This error could be caused |
An unexpected internal error has occurred. This error could be caused |
2258 |
by a bug in PCRE or by overwriting of the compiled pattern. |
by a bug in PCRE or by overwriting of the compiled pattern. |
2259 |
|
|
2260 |
PCRE_ERROR_BADCOUNT (-15) |
PCRE_ERROR_BADCOUNT (-15) |
2261 |
|
|
2262 |
This error is given if the value of the ovecsize argument is negative. |
This error is given if the value of the ovecsize argument is negative. |
2263 |
|
|
2264 |
PCRE_ERROR_RECURSIONLIMIT (-21) |
PCRE_ERROR_RECURSIONLIMIT (-21) |
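For logging, the negative return values named in this section can be turned back into their symbolic names. The sketch below uses the numeric values exactly as listed here rather than including the header, so that it stands alone. The function describe_exec_error is hypothetical and is not part of the PCRE API, and codes not named in this section yield NULL.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the PCRE API.
   Maps a negative pcre_exec() return value to the name given in this
   section, or NULL if the value is not one of those listed. */
static const char *describe_exec_error(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    default:  return NULL;
    }
}
```

A caller would normally treat PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL as ordinary outcomes and report the remaining codes as failures.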
2265 |
|
|
2409 |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
ate. NOTE: If PCRE_DUPNAMES is set and there are duplicate names, the |
2410 |
behaviour may not be what you want (see the next section). |
behaviour may not be what you want (see the next section). |
2411 |
|
|
2412 |
|
Warning: If the pattern uses the "(?|" feature to set up multiple sub- |
2413 |
|
patterns with the same number, you cannot use names to distinguish |
2414 |
|
them, because names are not included in the compiled code. The matching |
2415 |
|
process uses only numbers. |
2416 |
|
|
2417 |
|
|
2418 |
DUPLICATE SUBPATTERN NAMES |
DUPLICATE SUBPATTERN NAMES |
2419 |
|
|
2420 |
int pcre_get_stringtable_entries(const pcre *code, |
int pcre_get_stringtable_entries(const pcre *code, |
2421 |
const char *name, char **first, char **last); |
const char *name, char **first, char **last); |
2422 |
|
|
2423 |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
When a pattern is compiled with the PCRE_DUPNAMES option, names for |
2424 |
subpatterns are not required to be unique. Normally, patterns with |
subpatterns are not required to be unique. Normally, patterns with |
2425 |
duplicate names are such that in any one match, only one of the named |
duplicate names are such that in any one match, only one of the named |
2426 |
subpatterns participates. An example is shown in the pcrepattern docu- |
subpatterns participates. An example is shown in the pcrepattern docu- |
2427 |
mentation. |
mentation. |
2428 |
|
|
2429 |
When duplicates are present, pcre_copy_named_substring() and |
When duplicates are present, pcre_copy_named_substring() and |
2430 |
pcre_get_named_substring() return the first substring corresponding to |
pcre_get_named_substring() return the first substring corresponding to |
2431 |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
the given name that is set. If none are set, PCRE_ERROR_NOSUBSTRING |
2432 |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
(-7) is returned; no data is returned. The pcre_get_stringnumber() |
2433 |
function returns one of the numbers that are associated with the name, |
function returns one of the numbers that are associated with the name, |
2434 |
but it is not defined which it is. |
but it is not defined which it is. |
2435 |
|
|
2436 |
If you want to get full details of all captured substrings for a given |
If you want to get full details of all captured substrings for a given |
2437 |
name, you must use the pcre_get_stringtable_entries() function. The |
name, you must use the pcre_get_stringtable_entries() function. The |
2438 |
first argument is the compiled pattern, and the second is the name. The |
first argument is the compiled pattern, and the second is the name. The |
2439 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
2440 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
2441 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
2442 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
2443 |
there are none. The format of the table is described above in the sec- |
there are none. The format of the table is described above in the sec- |
2444 |
tion entitled Information about a pattern. Given all the relevant |
tion entitled Information about a pattern. Given all the relevant |
2445 |
entries for the name, you can extract each of their numbers, and hence |
entries for the name, you can extract each of their numbers, and hence |
2446 |
the captured data, if any. |
the captured data, if any. |
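
The extraction step can be sketched as follows. This is only an illustration, not PCRE code: collect_group_numbers() is a hypothetical helper name, and it assumes the name table layout in which each entry holds the group number in its first two bytes, most significant byte first, followed by the zero-terminated name. The first, last, and entry_size values are assumed to come from a successful call to pcre_get_stringtable_entries().

```c
/* Hypothetical helper, not part of the PCRE API. Walks the name table
   entries from first to last (inclusive), entry_size bytes apart, and
   stores each group number in out[]. Returns the number of entries seen;
   at most max of them are stored. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_size, int *out, int max)
{
    int count = 0;
    const char *entry;

    for (entry = first; entry <= last; entry += entry_size) {
        /* Assumed layout: two-byte group number, high byte first. */
        const unsigned char *p = (const unsigned char *)entry;
        int number = (p[0] << 8) | p[1];

        if (count < max)
            out[count] = number;
        count++;
    }
    return count;
}
```

For each number n collected this way, the captured data (if any) is described by the pair ovector[2*n] and ovector[2*n+1] that pcre_exec() filled in.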
2447 |
|
|
2448 |
|
|
2449 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
2450 |
|
|
2451 |
The traditional matching function uses a similar algorithm to Perl, |
The traditional matching function uses a similar algorithm to Perl, |
2452 |
which stops when it finds the first match, starting at a given point in |
which stops when it finds the first match, starting at a given point in |
2453 |
the subject. If you want to find all possible matches, or the longest |
the subject. If you want to find all possible matches, or the longest |
2454 |
possible match, consider using the alternative matching function (see |
possible match, consider using the alternative matching function (see |
2455 |
below) instead. If you cannot use the alternative function, but still |
below) instead. If you cannot use the alternative function, but still |
2456 |
need to find all possible matches, you can kludge it up by making use |
need to find all possible matches, you can kludge it up by making use |
2457 |
of the callout facility, which is described in the pcrecallout documen- |
of the callout facility, which is described in the pcrecallout documen- |
2458 |
tation. |
tation. |
2459 |
|
|
2460 |
What you have to do is to insert a callout right at the end of the pat- |
What you have to do is to insert a callout right at the end of the pat- |
2461 |
tern. When your callout function is called, extract and save the cur- |
tern. When your callout function is called, extract and save the cur- |
2462 |
rent matched substring. Then return 1, which forces pcre_exec() to |
rent matched substring. Then return 1, which forces pcre_exec() to |
2463 |
backtrack and try other alternatives. Ultimately, when it runs out of |
backtrack and try other alternatives. Ultimately, when it runs out of |
2464 |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
matches, pcre_exec() will yield PCRE_ERROR_NOMATCH. |
2465 |
|
|
2466 |
|
|
2471 |
int options, int *ovector, int ovecsize, |
int options, int *ovector, int ovecsize, |
2472 |
int *workspace, int wscount); |
int *workspace, int wscount); |
2473 |
|
|
2474 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
2475 |
against a compiled pattern, using a matching algorithm that scans the |
against a compiled pattern, using a matching algorithm that scans the |
2476 |
subject string just once, and does not backtrack. This has different |
subject string just once, and does not backtrack. This has different |
2477 |
characteristics to the normal algorithm, and is not compatible with |
characteristics to the normal algorithm, and is not compatible with |
2478 |
Perl. Some of the features of PCRE patterns are not supported. Never- |
Perl. Some of the features of PCRE patterns are not supported. Never- |
2479 |
theless, there are times when this kind of matching can be useful. For |
theless, there are times when this kind of matching can be useful. For |
2480 |
a discussion of the two matching algorithms, see the pcrematching docu- |
a discussion of the two matching algorithms, see the pcrematching docu- |
2481 |
mentation. |
mentation. |
2482 |
|
|
2483 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
2484 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
2485 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
2486 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
2487 |
repeated here. |
repeated here. |
2488 |
|
|
2489 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
2490 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
2491 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
2492 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
2493 |
lot of potential matches. |
lot of potential matches. |
2494 |
|
|
2495 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
2511 |
|
|
2512 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
2513 |
|
|
2514 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
2515 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2516 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
2517 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
2518 |
three of these are the same as for pcre_exec(), so their description is |
three of these are the same as for pcre_exec(), so their description is |
2519 |
not repeated here. |
not repeated here. |
2520 |
|
|
2521 |
PCRE_PARTIAL |
PCRE_PARTIAL |
2522 |
|
|
2523 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
2524 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
2525 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
2526 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
2527 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
2528 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
2529 |
set as the first matching string. |
set as the first matching string. |
2530 |
|
|
2531 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
2532 |
|
|
2533 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
2534 |
stop as soon as it has found one match. Because of the way the alterna- |
stop as soon as it has found one match. Because of the way the alterna- |
2535 |
tive algorithm works, this is necessarily the shortest possible match |
tive algorithm works, this is necessarily the shortest possible match |
2536 |
at the first possible matching point in the subject string. |
at the first possible matching point in the subject string. |
2537 |
|
|
2538 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
2539 |
|
|
2540 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
2541 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
2542 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
2543 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
2544 |
workspace and wscount arguments must reference the same vector as before |
workspace and wscount arguments must reference the same vector as before |
2545 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
2546 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
2547 |
documentation. |
documentation. |
2548 |
|
|
2549 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
2550 |
|
|
2551 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
2552 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
2553 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
2554 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
2555 |
if the pattern |
if the pattern |
2556 |
|
|
2557 |
<.*> |
<.*> |
2566 |
<something> <something else> |
<something> <something else> |
2567 |
<something> <something else> <something further> |
<something> <something else> <something further> |
2568 |
|
|
2569 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
2570 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
2571 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
2572 |
the offset to the start, and the second is the offset to the end. In |
the offset to the start, and the second is the offset to the end. In |
2573 |
fact, all the strings have the same start offset. (Space could have |
fact, all the strings have the same start offset. (Space could have |
2574 |
been saved by giving this only once, but it was decided to retain some |
been saved by giving this only once, but it was decided to retain some |
2575 |
compatibility with the way pcre_exec() returns data, even though the |
compatibility with the way pcre_exec() returns data, even though the |
2576 |
meaning of the strings is different.) |
meaning of the strings is different.) |
2577 |
|
|
2578 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
2579 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
2580 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
2581 |
filled with the longest matches. |
filled with the longest matches. |
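
A small sketch of reading the results may help. It does not call PCRE: copy_dfa_match() is a hypothetical helper, and the ovector it reads is assumed to have been filled in by a successful call to pcre_dfa_exec() in the way described above (pairs of start and end offsets, longest match first).

```c
#include <string.h>

/* Hypothetical helper, not part of the PCRE API. Copies the i-th match
   reported in ovector out of subject into buf, adds a terminating zero,
   and returns the match length. Returns -1 if the match does not fit
   in bufsize bytes. */
int copy_dfa_match(const char *subject, const int *ovector, int i,
                   char *buf, int bufsize)
{
    int start = ovector[2 * i];      /* offset to the start of the match */
    int end = ovector[2 * i + 1];    /* offset to the end of the match */
    int length = end - start;

    if (length < 0 || length >= bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)length);
    buf[length] = '\0';
    return length;
}
```

As an illustration with a made-up subject, "x <a> <b> <c> y", an ovector of {2, 13, 2, 9, 2, 5} with a yield of 3 would describe the strings "<a> <b> <c>", "<a> <b>", and "<a>", in that order.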
2582 |
|
|
2583 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
2584 |
|
|
2585 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
2586 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
2587 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
2588 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
2589 |
|
|
2590 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
2591 |
|
|
2592 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
2593 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
2594 |
reference. |
reference. |
2595 |
|
|
2596 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
2597 |
|
|
2598 |
This return is given if pcre_dfa_exec() encounters a condition item |
This return is given if pcre_dfa_exec() encounters a condition item |
2599 |
that uses a back reference for the condition, or a test for recursion |
that uses a back reference for the condition, or a test for recursion |
2600 |
in a specific group. These are not supported. |
in a specific group. These are not supported. |
2601 |
|
|
2602 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
2603 |
|
|
2604 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
2605 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
2606 |
(it is meaningless). |
(it is meaningless). |
2607 |
|
|
2608 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
2609 |
|
|
2610 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
2611 |
workspace vector. |
workspace vector. |
2612 |
|
|
2613 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
2614 |
|
|
2615 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
2616 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
2617 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
2618 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
2619 |
|
|
2620 |
|
|
2621 |
SEE ALSO |
SEE ALSO |
2622 |
|
|
2623 |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
2624 |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
2625 |
|
|
2626 |
|
|
2627 |
AUTHOR |
AUTHOR |
2633 |
|
|
2634 |
REVISION |
REVISION |
2635 |
|
|
2636 |
Last updated: 24 August 2008 |
Last updated: 17 March 2009 |
2637 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
2638 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2639 |
|
|
2640 |
|
|
2685 |
MISSING CALLOUTS |
MISSING CALLOUTS |
2686 |
|
|
2687 |
You should be aware that, because of optimizations in the way PCRE |
You should be aware that, because of optimizations in the way PCRE |
2688 |
matches patterns, callouts sometimes do not happen. For example, if the |
matches patterns by default, callouts sometimes do not happen. For |
2689 |
pattern is |
example, if the pattern is |
2690 |
|
|
2691 |
ab(?C4)cd |
ab(?C4)cd |
2692 |
|
|
2695 |
ever start, and the callout is never reached. However, with "abyd", |
ever start, and the callout is never reached. However, with "abyd", |
2696 |
though the result is still no match, the callout is obeyed. |
though the result is still no match, the callout is obeyed. |
2697 |
|
|
2698 |
|
You can disable these optimizations by passing the PCRE_NO_START_OPTI- |
2699 |
|
MIZE option to pcre_exec() or pcre_dfa_exec(). This slows down the |
2700 |
|
matching process, but does ensure that callouts such as the example |
2701 |
|
above are obeyed. |
2702 |
|
|
2703 |
|
|
2704 |
THE CALLOUT INTERFACE |
THE CALLOUT INTERFACE |
2705 |
|
|
2706 |
During matching, when PCRE reaches a callout point, the external func- |
During matching, when PCRE reaches a callout point, the external func- |
2707 |
tion defined by pcre_callout is called (if it is set). This applies to |
tion defined by pcre_callout is called (if it is set). This applies to |
2708 |
both the pcre_exec() and the pcre_dfa_exec() matching functions. The |
both the pcre_exec() and the pcre_dfa_exec() matching functions. The |
2709 |
only argument to the callout function is a pointer to a pcre_callout |
only argument to the callout function is a pointer to a pcre_callout |
2710 |
block. This structure contains the following fields: |
block. This structure contains the following fields: |
2711 |
|
|
2712 |
int version; |
int version; |
2722 |
int pattern_position; |
int pattern_position; |
2723 |
int next_item_length; |
int next_item_length; |
2724 |
|
|
2725 |
The version field is an integer containing the version number of the |
The version field is an integer containing the version number of the |
2726 |
block format. The initial version was 0; the current version is 1. The |
block format. The initial version was 0; the current version is 1. The |
2727 |
version number will change again in future if additional fields are |
version number will change again in future if additional fields are |
2728 |
added, but the intention is never to remove any of the existing fields. |
added, but the intention is never to remove any of the existing fields. |
2729 |
|
|
2730 |
The callout_number field contains the number of the callout, as com- |
The callout_number field contains the number of the callout, as com- |
2809 |
|
|
2810 |
REVISION |
REVISION |
2811 |
|
|
2812 |
Last updated: 29 May 2007 |
Last updated: 15 March 2009 |
2813 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
2814 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2815 |
|
|
2816 |
|
|
3089 |
syntax) |
syntax) |
3090 |
] terminates the character class |
] terminates the character class |
3091 |
|
|
3092 |
The following sections describe the use of each of the metacharacters. |
The following sections describe the use of each of the metacharacters. |
3093 |
|
|
3094 |
|
|
3095 |
BACKSLASH |
BACKSLASH |
3096 |
|
|
3097 |
The backslash character has several uses. Firstly, if it is followed by |
The backslash character has several uses. Firstly, if it is followed by |
3098 |
a non-alphanumeric character, it takes away any special meaning that |
a non-alphanumeric character, it takes away any special meaning that |
3099 |
character may have. This use of backslash as an escape character |
character may have. This use of backslash as an escape character |
3100 |
applies both inside and outside character classes. |
applies both inside and outside character classes. |
3101 |
|
|
3102 |
For example, if you want to match a * character, you write \* in the |
For example, if you want to match a * character, you write \* in the |
3103 |
pattern. This escaping action applies whether or not the following |
pattern. This escaping action applies whether or not the following |
3104 |
character would otherwise be interpreted as a metacharacter, so it is |
character would otherwise be interpreted as a metacharacter, so it is |
3105 |
always safe to precede a non-alphanumeric with backslash to specify |
always safe to precede a non-alphanumeric with backslash to specify |
3106 |
that it stands for itself. In particular, if you want to match a back- |
that it stands for itself. In particular, if you want to match a back- |
3107 |
slash, you write \\. |
slash, you write \\. |
3108 |
|
|
3109 |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
If a pattern is compiled with the PCRE_EXTENDED option, whitespace in |
3110 |
the pattern (other than in a character class) and characters between a |
the pattern (other than in a character class) and characters between a |
3111 |
# outside a character class and the next newline are ignored. An escap- |
# outside a character class and the next newline are ignored. An escap- |
3112 |
ing backslash can be used to include a whitespace or # character as |
ing backslash can be used to include a whitespace or # character as |
3113 |
part of the pattern. |
part of the pattern. |
3114 |
|
|
3115 |
If you want to remove the special meaning from a sequence of charac- |
If you want to remove the special meaning from a sequence of charac- |
3116 |
ters, you can do so by putting them between \Q and \E. This is differ- |
ters, you can do so by putting them between \Q and \E. This is differ- |
3117 |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
ent from Perl in that $ and @ are handled as literals in \Q...\E |
3118 |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
sequences in PCRE, whereas in Perl, $ and @ cause variable interpola- |
3119 |
tion. Note the following examples: |
tion. Note the following examples: |
3120 |
|
|
3121 |
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
3125 |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
3126 |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
3127 |
|
|
3128 |
The \Q...\E sequence is recognized both inside and outside character |
The \Q...\E sequence is recognized both inside and outside character |
3129 |
classes. |
classes. |
3130 |
|
|
3131 |
Non-printing characters |
Non-printing characters |
3132 |
|
|
3133 |
A second use of backslash provides a way of encoding non-printing char- |
A second use of backslash provides a way of encoding non-printing char- |
3134 |
acters in patterns in a visible manner. There is no restriction on the |
acters in patterns in a visible manner. There is no restriction on the |
3135 |
appearance of non-printing characters, apart from the binary zero that |
appearance of non-printing characters, apart from the binary zero that |
3136 |
terminates a pattern, but when a pattern is being prepared by text |
terminates a pattern, but when a pattern is being prepared by text |
3137 |
editing, it is usually easier to use one of the following escape |
editing, it is usually easier to use one of the following escape |
3138 |
sequences than the binary character it represents: |
sequences than the binary character it represents: |
3139 |
|
|
3140 |
\a alarm, that is, the BEL character (hex 07) |
\a alarm, that is, the BEL character (hex 07) |
3148 |
\xhh character with hex code hh |
\xhh character with hex code hh |
3149 |
\x{hhh..} character with hex code hhh.. |
\x{hhh..} character with hex code hhh.. |
3150 |
|
|
3151 |
The precise effect of \cx is as follows: if x is a lower case letter, |
The precise effect of \cx is as follows: if x is a lower case letter, |
3152 |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
it is converted to upper case. Then bit 6 of the character (hex 40) is |
3153 |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
inverted. Thus \cz becomes hex 1A, but \c{ becomes hex 3B, while \c; |
3154 |
becomes hex 7B. |
becomes hex 7B. |
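
The rule for \cx can be written out in a few lines of C. This is a sketch of the rule as stated above, not PCRE's own code; control_escape() is a hypothetical name, and ASCII input is assumed.

```c
#include <ctype.h>

/* Hypothetical helper: returns the character value that \cx stands for.
   A lower case letter is first converted to upper case, then bit 6
   (hex 40) of the character is inverted. */
int control_escape(int x)
{
    if (islower(x))
        x = toupper(x);
    return x ^ 0x40;
}
```

With this helper, control_escape('z') gives hex 1A, control_escape('{') gives hex 3B, and control_escape(';') gives hex 7B, matching the values quoted above.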
3155 |
|
|
3156 |
After \x, from zero to two hexadecimal digits are read (letters can be |
After \x, from zero to two hexadecimal digits are read (letters can be |
3157 |
in upper or lower case). Any number of hexadecimal digits may appear |
in upper or lower case). Any number of hexadecimal digits may appear |
3158 |
between \x{ and }, but the value of the character code must be less |
between \x{ and }, but the value of the character code must be less |
3159 |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is, |
3160 |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
the maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger |
3161 |
than the largest Unicode code point, which is 10FFFF. |
than the largest Unicode code point, which is 10FFFF. |
3162 |
|
|
3163 |
If characters other than hexadecimal digits appear between \x{ and }, |
If characters other than hexadecimal digits appear between \x{ and }, |
3164 |
or if there is no terminating }, this form of escape is not recognized. |
or if there is no terminating }, this form of escape is not recognized. |
3165 |
Instead, the initial \x will be interpreted as a basic hexadecimal |
Instead, the initial \x will be interpreted as a basic hexadecimal |
3166 |
escape, with no following digits, giving a character whose value is |
escape, with no following digits, giving a character whose value is |
3167 |
zero. |
zero. |
3168 |
|
|
3169 |
Characters whose value is less than 256 can be defined by either of the |
Characters whose value is less than 256 can be defined by either of the |
3170 |
two syntaxes for \x. There is no difference in the way they are han- |
two syntaxes for \x. There is no difference in the way they are han- |
3171 |
dled. For example, \xdc is exactly the same as \x{dc}. |
dled. For example, \xdc is exactly the same as \x{dc}. |
3172 |
|
|
3173 |
After \0 up to two further octal digits are read. If there are fewer |
After \0 up to two further octal digits are read. If there are fewer |
3174 |
than two digits, just those that are present are used. Thus the |
than two digits, just those that are present are used. Thus the |
3175 |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
sequence \0\x\07 specifies two binary zeros followed by a BEL character |
3176 |
(code value 7). Make sure you supply two digits after the initial zero |
(code value 7). Make sure you supply two digits after the initial zero |
3177 |
if the pattern character that follows is itself an octal digit. |
if the pattern character that follows is itself an octal digit. |
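
The reading of digits after \0 can be sketched like this. The function is a hypothetical illustration of the rule above, not PCRE's implementation; p is assumed to point just past the "\0" in the pattern.

```c
/* Hypothetical helper: reads up to two further octal digits starting at
   p, stores the number of digits consumed in *used, and returns the
   resulting character value. */
int read_octal_after_zero(const char *p, int *used)
{
    int value = 0;
    int n = 0;

    while (n < 2 && p[n] >= '0' && p[n] <= '7') {
        value = value * 8 + (p[n] - '0');
        n++;
    }
    *used = n;
    return value;
}
```

This shows why two digits should be supplied when an octal digit follows: for the pattern text \0123 the helper consumes "12" and leaves "3" as a literal, whereas \00012 yields a binary zero followed by the literal characters "12".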
3178 |
|
|
3179 |
The handling of a backslash followed by a digit other than 0 is compli- |
The handling of a backslash followed by a digit other than 0 is compli- |
3180 |
cated. Outside a character class, PCRE reads it and any following dig- |
cated. Outside a character class, PCRE reads it and any following dig- |
3181 |
its as a decimal number. If the number is less than 10, or if there |
its as a decimal number. If the number is less than 10, or if there |
3182 |
have been at least that many previous capturing left parentheses in the |
have been at least that many previous capturing left parentheses in the |
3183 |
expression, the entire sequence is taken as a back reference. A |
expression, the entire sequence is taken as a back reference. A |
3184 |
description of how this works is given later, following the discussion |
description of how this works is given later, following the discussion |
3185 |
of parenthesized subpatterns. |
of parenthesized subpatterns. |
3186 |
|
|
3187 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
3188 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
3189 |
up to three octal digits following the backslash, and uses them to gen- |
up to three octal digits following the backslash, and uses them to gen- |
3190 |
erate a data character. Any subsequent digits stand for themselves. In |
erate a data character. Any subsequent digits stand for themselves. In |
3191 |
non-UTF-8 mode, the value of a character specified in octal must be |
non-UTF-8 mode, the value of a character specified in octal must be |
3192 |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
3193 |
example: |
example: |
3194 |
|
|
3195 |
\040 is another way of writing a space |
\040 is another way of writing a space |
3207 |
\81 is either a back reference, or a binary zero |
\81 is either a back reference, or a binary zero |
3208 |
followed by the two characters "8" and "1" |
followed by the two characters "8" and "1" |
3209 |
|
|
3210 |
Note that octal values of 100 or greater must not be introduced by a |
Note that octal values of 100 or greater must not be introduced by a |
3211 |
leading zero, because no more than three octal digits are ever read. |
leading zero, because no more than three octal digits are ever read. |
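
The decision between a back reference and an octal character can be summarized in code. This is a sketch of the rule as described in this section, with hypothetical names; it is not taken from the PCRE source.

```c
/* Hypothetical helper: decides whether a backslash followed by a decimal
   number (first digit not 0) is a back reference.
     number        the decimal value of the digits after the backslash
     groups_so_far the count of capturing left parentheses seen earlier
                   in the pattern
     in_class      non-zero if the escape is inside a character class
   Returns 1 for a back reference, 0 if the digits are to be re-read as
   up to three octal digits. */
int is_back_reference(int number, int groups_so_far, int in_class)
{
    if (in_class)
        return 0;
    return number < 10 || groups_so_far >= number;
}
```

Under this rule \7 is always a back reference outside a character class, while \11 is a back reference only if at least 11 capturing parentheses precede it.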
3212 |
|
|
All the sequences that define a single character value can be used both
inside and outside character classes. In addition, inside a character
class, the sequence \b is interpreted as the backspace character (hex
08), and the sequences \R and \X are interpreted as the characters "R"
and "X", respectively. Outside a character class, these sequences have
different meanings (see below).

Absolute and relative back references

The sequence \g followed by an unsigned or a negative number, option-
ally enclosed in braces, is an absolute or relative back reference. A
named back reference can be coded as \g{name}. Back references are dis-
cussed later, following the discussion of parenthesized subpatterns.

Absolute and relative subroutine calls

For compatibility with Oniguruma, the non-Perl syntax \g followed by a
name or a number enclosed either in angle brackets or single quotes, is
an alternative syntax for referencing a subpattern as a "subroutine".
Details are discussed later. Note that \g{...} (Perl syntax) and
\g<...> (Oniguruma syntax) are not synonymous. The former is a back
reference; the latter is a subroutine call.

Generic character types

  \W     any "non-word" character

Each pair of escape sequences partitions the complete set of characters
into two disjoint sets. Any given character matches one, and only one,
of each pair.

These character type sequences can appear both inside and outside char-
acter classes. They each match one character of the appropriate type.
If the current matching point is at the end of the subject string, all
of them fail, since there is no character to match.

For compatibility with Perl, \s does not match the VT character (code
11). This makes it different from the POSIX "space" class. The \s
characters are HT (9), LF (10), FF (12), CR (13), and space (32). If
"use locale;" is included in a Perl script, \s may match the VT charac-
ter. In PCRE, it never does.

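The difference from the POSIX "space" class can be checked with a short
C sketch. The function names are hypothetical, and the comparison
assumes the default "C" locale, in which isspace() accepts VT.

```c
#include <assert.h>
#include <ctype.h>

/* Returns 1 if c is one of the characters listed above for \s:
   HT (9), LF (10), FF (12), CR (13), space (32). VT (11) is excluded. */
int pcre_style_space(int c)
{
    return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}

/* Returns 1 if the C library's isspace() and the \s set above disagree
   about c. The argument must be a single-byte character value. */
int differs_from_isspace(int c)
{
    assert(c >= 0 && c <= 255);
    return (isspace(c) != 0) != pcre_style_space(c);
}
```

In the "C" locale the two sets disagree only on code 11 (VT).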
In UTF-8 mode, characters with values greater than 128 never match \d,
\s, or \w, and always match \D, \S, and \W. This is true even when Uni-
code character property support is available. These sequences retain
their original meanings from before UTF-8 support was available, mainly
for efficiency reasons.

The sequences \h, \H, \v, and \V are Perl 5.10 features. In contrast to
the other sequences, these do match certain high-valued codepoints in
UTF-8 mode. The horizontal space characters are:

  U+0009     Horizontal tab
  U+2029     Paragraph separator

A "word" character is an underscore or any character less than 256 that
is a letter or digit. The definition of letters and digits is con-
trolled by PCRE's low-valued character tables, and may vary if locale-
specific matching is taking place (see "Locale support" in the pcreapi
page). For example, in a French locale such as "fr_FR" in Unix-like
systems, or "french" in Windows, some character codes greater than 128
are used for accented letters, and these are matched by \w. The use of
locales with Unicode is discouraged.

Newline sequences

Outside a character class, by default, the escape sequence \R matches
any Unicode newline sequence. This is a Perl 5.10 feature. In non-UTF-8
mode \R is equivalent to the following:

  (?>\r\n|\n|\x0b|\f|\r|\x85)

This is an example of an "atomic group", details of which are given
below. This particular group matches either the two-character sequence
CR followed by LF, or one of the single characters LF (linefeed,
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage
return, U+000D), or NEL (next line, U+0085). The two-character sequence
is treated as a single unit that cannot be split.

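The alternation above can be mirrored byte by byte in plain C. This is
only a sketch of the non-UTF-8 case: the function name
newline_seq_length is hypothetical, and the code does not call PCRE.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the number of bytes (0, 1, or 2) that \R would consume at
   s[pos] in non-UTF-8 mode, following the alternation shown above. */
size_t newline_seq_length(const unsigned char *s, size_t len, size_t pos)
{
    assert(s != NULL);
    if (pos >= len)
        return 0;
    /* CRLF is tried first and is taken as one indivisible unit. */
    if (s[pos] == '\r' && pos + 1 < len && s[pos + 1] == '\n')
        return 2;
    switch (s[pos]) {
    case '\n':   /* LF  */
    case 0x0b:   /* VT  */
    case '\f':   /* FF  */
    case '\r':   /* CR  */
    case 0x85:   /* NEL */
        return 1;
    default:
        return 0;
    }
}
```

Because CRLF is tested before a lone CR, a CR that is followed by LF is
never reported as a one-byte newline, which matches the "cannot be
split" behaviour of the atomic group.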
In UTF-8 mode, two additional characters whose codepoints are greater
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
rator, U+2029). Unicode character property support is not needed for
these characters to be recognized.

It is possible to restrict \R to match only CR, LF, or CRLF (instead of
the complete set of Unicode line endings) by setting the option
PCRE_BSR_ANYCRLF either at compile time or when the pattern is matched.
(BSR is an abbreviation for "backslash R".) This can be made the default
when PCRE is built; if this is the case, the other behaviour can be
requested via the PCRE_BSR_UNICODE option. It is also possible to
specify these settings by starting a pattern string with one of the
following sequences:

  (*BSR_ANYCRLF)   CR, LF, or CRLF only

These override the default and the options given to pcre_compile(), but
they can be overridden by options given to pcre_exec(). Note that these
special settings, which are not Perl-compatible, are recognized only at
the very start of a pattern, and that they must be in upper case. If
more than one of them is present, the last one is used. They can be
combined with a change of newline convention, for example, a pattern
can start with:

  (*ANY)(*BSR_ANYCRLF)

Unicode character properties

When PCRE is built with Unicode character property support, three addi-
tional escape sequences that match characters with specific properties
are available. When not in UTF-8 mode, these sequences are of course
limited to testing characters whose codepoints are less than 256, but
they do work in this mode. The extra escape sequences are:

  \p{xx}   a character with the xx property
  \P{xx}   a character without the xx property
  \X       an extended Unicode sequence

The property names represented by xx above are limited to the Unicode
script names, the general category properties, and "Any", which matches
any character (including newline). Other properties such as "InMusical-
Symbols" are not currently supported by PCRE. Note that \P{Any} does
not match any characters, so always causes a match failure.

Sets of Unicode characters are defined as belonging to certain scripts.
A character from one of these sets can be matched using a script name.
For example:

  \p{Greek}
  \P{Han}

Those that are not part of an identified script are lumped together as
"Common". The current list of scripts is:

Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese,
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform,
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic,
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira-
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin,
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko,
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician,
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa,
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi.

Each character has exactly one general category property, specified by
a two-letter abbreviation. For compatibility with Perl, negation can be
specified by including a circumflex between the opening brace and the
property name. For example, \p{^Lu} is the same as \P{Lu}.

If only one letter is specified with \p or \P, it includes all the gen-
eral category properties that start with that letter. In this case, in
the absence of negation, the curly brackets in the escape sequence are
optional; these two examples have the same effect:

  \p{L}

  Zp    Paragraph separator
  Zs    Space separator

The special property L& is also supported: it matches a character that
has the Lu, Ll, or Lt property, in other words, a letter that is not
classified as a modifier or "other".

The Cs (Surrogate) property applies only to characters in the range
U+D800 to U+DFFF. Such characters are not valid in UTF-8 strings (see
RFC 3629) and so cannot be tested by PCRE, unless UTF-8 validity check-
ing has been turned off (see the discussion of PCRE_NO_UTF8_CHECK in
the pcreapi page).

The long synonyms for these properties that Perl supports (such as
\p{Letter}) are not supported by PCRE, nor is it permitted to prefix
any of these properties with "Is".

No character that is in the Unicode table has the Cn (unassigned) prop-
erty. Instead, this property is assumed for any code point that is not
in the Unicode table.

Specifying caseless matching does not affect these escape sequences.
For example, \p{Lu} always matches only upper case letters.

The \X escape matches any number of Unicode characters that form an
extended Unicode sequence. \X is equivalent to

  (?>\PM\pM*)

That is, it matches a character without the "mark" property, followed
by zero or more characters with the "mark" property, and treats the
sequence as an atomic group (see below). Characters with the "mark"
property are typically accents that affect the preceding character.
None of them have codepoints less than 256, so in non-UTF-8 mode \X
matches any one character.

Matching characters by Unicode property is not fast, because PCRE has
to search a structure that contains data for over fifteen thousand
characters. That is why the traditional escape sequences such as \d and
\w do not use Unicode properties in PCRE.

Resetting the match start

The escape sequence \K, which is a Perl 5.10 feature, causes any previ-
ously matched characters not to be included in the final matched
sequence. For example, the pattern:

  foo\Kbar

matches "foobar", but reports that it has matched "bar". This feature
is similar to a lookbehind assertion (described below). However, in
this case, the part of the subject before the real match does not have
to be of fixed length, as lookbehind assertions do. The use of \K does
not interfere with the setting of captured substrings. For example,
when the pattern

  (foo)\Kbar

Simple assertions

The final use of backslash is for certain simple assertions. An asser-
tion specifies a condition that has to be met at a particular point in
a match, without consuming any characters from the subject string. The
use of subpatterns for more complicated assertions is described below.
The backslashed assertions are:

  \b     matches at a word boundary
  \z     matches only at the end of the subject
  \G     matches at the first matching position in the subject

These assertions may not appear in character classes (but note that \b
has a different meaning, namely the backspace character, inside a char-
acter class).

A word boundary is a position in the subject string where the current
character and the previous character do not both match \w or \W (i.e.
one matches \w and the other matches \W), or the start or end of the
string if the first or last character matches \w, respectively.

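The word-boundary definition can be expressed directly in C. The sketch
below is an illustration only: the function names are hypothetical, and
\w is approximated by ASCII letters, digits, and underscore rather than
by PCRE's character tables.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* ASCII-only stand-in for \w: a letter, a digit, or an underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* Returns 1 if position pos (0..len) in s is a word boundary: the
   characters on either side differ in "wordness". Positions before the
   start and after the end of the subject count as non-word. */
int is_word_boundary(const char *s, size_t len, size_t pos)
{
    int before, after;

    assert(s != NULL && pos <= len);
    before = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
    after = pos < len && is_word_char((unsigned char)s[pos]);
    return before != after;
}
```

Treating the outside of the subject as non-word reproduces the rule
that the start or end of the string is a boundary only when the first
or last character matches \w.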
The \A, \Z, and \z assertions differ from the traditional circumflex
and dollar (described in the next section) in that they only ever match
at the very start and end of the subject string, whatever options are
set. Thus, they are independent of multiline mode. These three asser-
tions are not affected by the PCRE_NOTBOL or PCRE_NOTEOL options, which
affect only the behaviour of the circumflex and dollar metacharacters.
However, if the startoffset argument of pcre_exec() is non-zero, indi-
cating that matching is to start at a point other than the beginning of
the subject, \A can never match. The difference between \Z and \z is
that \Z matches before a newline at the end of the string as well as at
the very end, whereas \z matches only at the end.

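The difference between \Z and \z comes down to one extra position, as
this C sketch shows. The function names are hypothetical, and the
newline is assumed to be a single LF character.

```c
#include <assert.h>
#include <stddef.h>

/* \z: true only at the very end of the subject. */
int at_small_z(size_t len, size_t pos)
{
    return pos == len;
}

/* \Z: true at the very end, and also just before a newline that is the
   last character of the subject. Assumes a single LF newline. */
int at_big_z(const char *s, size_t len, size_t pos)
{
    assert(s != NULL && pos <= len);
    return pos == len || (pos + 1 == len && s[pos] == '\n');
}
```

For the subject "abc" followed by LF, both assertions hold at offset 4,
but only \Z also holds at offset 3.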
The \G assertion is true only when the current matching position is at
the start point of the match, as specified by the startoffset argument
of pcre_exec(). It differs from \A when the value of startoffset is
non-zero. By calling pcre_exec() multiple times with appropriate argu-
ments, you can mimic Perl's /g option, and it is in this kind of imple-
mentation where \G can be useful.

Note, however, that PCRE's interpretation of \G, as the start of the
current match, is subtly different from Perl's, which defines it as the
end of the previous match. In Perl, these can be different when the
previously matched string was empty. Because PCRE does just one match
at a time, it cannot reproduce this behaviour.

If all the alternatives of a pattern begin with \G, the expression is
anchored to the starting match position, and the "anchored" flag is set
in the compiled regular expression.

CIRCUMFLEX AND DOLLAR

Outside a character class, in the default matching mode, the circumflex
character is an assertion that is true only if the current matching
point is at the start of the subject string. If the startoffset argu-
ment of pcre_exec() is non-zero, circumflex can never match if the
PCRE_MULTILINE option is unset. Inside a character class, circumflex
has an entirely different meaning (see below).

Circumflex need not be the first character of the pattern if a number
of alternatives are involved, but it should be the first thing in each
alternative in which it appears if the pattern is ever to match that
branch. If all possible alternatives start with a circumflex, that is,
if the pattern is constrained to match only at the start of the sub-
ject, it is said to be an "anchored" pattern. (There are also other
constructs that can cause a pattern to be anchored.)

A dollar character is an assertion that is true only if the current
matching point is at the end of the subject string, or immediately
before a newline at the end of the string (by default). Dollar need not
be the last character of the pattern if a number of alternatives are
involved, but it should be the last item in any branch in which it
appears. Dollar has no special meaning in a character class.

The meaning of dollar can be changed so that it matches only at the
very end of the string, by setting the PCRE_DOLLAR_ENDONLY option at
compile time. This does not affect the \Z assertion.

The meanings of the circumflex and dollar characters are changed if the
PCRE_MULTILINE option is set. When this is the case, a circumflex
matches immediately after internal newlines as well as at the start of
the subject string. It does not match after a newline that ends the
string. A dollar matches before any newlines in the string, as well as
at the very end, when PCRE_MULTILINE is set. When newline is specified
as the two-character sequence CRLF, isolated CR and LF characters do
not indicate newlines.

For example, the pattern /^abc$/ matches the subject string "def\nabc"
(where \n represents a newline) in multiline mode, but not otherwise.
Consequently, patterns that are anchored in single line mode because
all branches start with ^ are not anchored in multiline mode, and a
match for circumflex is possible when the startoffset argument of
pcre_exec() is non-zero. The PCRE_DOLLAR_ENDONLY option is ignored if
PCRE_MULTILINE is set.

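The position rules for circumflex and dollar can be written out as two
small C predicates. This is a sketch under stated assumptions, not
PCRE's implementation: the function names are hypothetical, the newline
is a single LF, startoffset is taken to be zero, and the
PCRE_DOLLAR_ENDONLY, PCRE_NOTBOL, and PCRE_NOTEOL options are not
modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Circumflex: start of subject; in multiline mode also just after an
   internal newline (but not after a newline that ends the string). */
int circumflex_matches(const char *s, size_t len, size_t pos, int multiline)
{
    assert(s != NULL && pos <= len);
    if (pos == 0)
        return 1;
    return multiline && pos < len && s[pos - 1] == '\n';
}

/* Dollar: end of subject or just before a final newline; in multiline
   mode also before any newline. */
int dollar_matches(const char *s, size_t len, size_t pos, int multiline)
{
    assert(s != NULL && pos <= len);
    if (pos == len)
        return 1;
    if (s[pos] != '\n')
        return 0;
    return multiline || pos + 1 == len;
}
```

For the subject "def\nabc", circumflex holds at offset 4 only when
multiline is non-zero, which is why /^abc$/ needs multiline mode to
match there.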
Note that the sequences \A, \Z, and \z can be used to match the start
and end of the subject in both modes, and if all branches of a pattern
start with \A it is always anchored, whether or not PCRE_MULTILINE is
set.


FULL STOP (PERIOD, DOT)

Outside a character class, a dot in the pattern matches any one charac-
ter in the subject string except (by default) a character that signi-
fies the end of a line. In UTF-8 mode, the matched character may be
more than one byte long.

When a line ending is defined as a single character, dot never matches
that character; when the two-character sequence CRLF is used, dot does
not match CR if it is immediately followed by LF, but otherwise it
matches all characters (including isolated CRs and LFs). When any Uni-
code line endings are being recognized, dot does not match CR or LF or
any of the other line ending characters.

3634 |
The behaviour of dot with regard to newlines can be changed. If the |
The behaviour of dot with regard to newlines can be changed. If the |
3635 |
PCRE_DOTALL option is set, a dot matches any one character, without |
PCRE_DOTALL option is set, a dot matches any one character, without |
3636 |
exception. If the two-character sequence CRLF is present in the subject |
exception. If the two-character sequence CRLF is present in the subject |
3637 |
string, it takes two dots to match it. |
string, it takes two dots to match it. |
3638 |
|
|
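A minimal sketch of the dot behaviour follows, again using Python's re
module as a stand-in for PCRE (an assumption, not the PCRE API). Python
has no configurable newline convention: its dot excludes only LF by
default, and re.DOTALL plays the role of PCRE_DOTALL.

```python
import re

# By default, Python's dot refuses to match a line feed.
default_dot = re.search(r"a.c", "a\nc")

# With re.DOTALL (the analogue of PCRE_DOTALL), dot matches any character.
dotall_dot = re.search(r"a.c", "a\nc", re.DOTALL)

# CRLF is two characters, so one dot is not enough even in DOTALL mode.
one_dot_crlf = re.fullmatch(r"a.c", "a\r\nc", re.DOTALL)
two_dots_crlf = re.fullmatch(r"a..c", "a\r\nc", re.DOTALL)
```

The last two lines illustrate the point that a CRLF pair in the subject
needs two dots to be matched.
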
The handling of dot is entirely independent of the handling of
circumflex and dollar, the only relationship being that they both
involve newlines. Dot has no special meaning in a character class.


MATCHING A SINGLE BYTE

Outside a character class, the escape sequence \C matches any one byte,
both in and out of UTF-8 mode. Unlike a dot, it always matches any
line-ending characters. The feature is provided in Perl in order to
match individual bytes in UTF-8 mode. Because it breaks up UTF-8
characters into individual bytes, what remains in the string may be a
malformed UTF-8 string. For this reason, the \C escape sequence is best
avoided.

PCRE does not allow \C to appear in lookbehind assertions (described
below), because in UTF-8 mode this would make it impossible to
calculate the length of the lookbehind.


An opening square bracket introduces a character class, terminated by a
closing square bracket. A closing square bracket on its own is not
special. If a closing square bracket is required as a member of the
class, it should be the first data character in the class (after an
initial circumflex, if present) or escaped with a backslash.

A character class matches a single character in the subject. In UTF-8
mode, the character may occupy more than one byte. A matched character
must be in the set of characters defined by the class, unless the first
character in the class definition is a circumflex, in which case the
subject character must not be in the set defined by the class. If a
circumflex is actually required as a member of the class, ensure it is
not the first character, or escape it with a backslash.

For example, the character class [aeiou] matches any lower case vowel,
while [^aeiou] matches any character that is not a lower case vowel.
Note that a circumflex is just a convenient notation for specifying the
characters that are in the class by enumerating those that are not. A
class that starts with a circumflex is not an assertion: it still
consumes a character from the subject string, and therefore it fails if
the current pointer is at the end of the string.

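The difference between a negated class and an assertion can be seen in
a small sketch. It is written with Python's re module as an assumed
stand-in for PCRE; the negative lookahead (?!u) is brought in purely
for contrast, and the subject strings are illustrative.

```python
import re

# [aeiou] matches a lower case vowel; [^aeiou] matches any other character.
vowel = re.fullmatch(r"[aeiou]", "e")
non_vowel = re.fullmatch(r"[^aeiou]", "x")

# A negated class must consume a character, so it fails at end of subject...
negated_at_end = re.search(r"q[^u]", "Iraq")

# ...whereas a negative lookahead is an assertion and succeeds there.
lookahead_at_end = re.search(r"q(?!u)", "Iraq")
```
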
In UTF-8 mode, characters with values greater than 255 can be included
in a class as a literal string of bytes, or by using the \x{ escaping
mechanism.

When caseless matching is set, any letters in a class represent both
their upper case and lower case versions, so for example, a caseless
[aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
match "A", whereas a caseful version would. In UTF-8 mode, PCRE always
understands the concept of case for characters whose values are less
than 128, so caseless matching is always possible. For characters with
higher values, the concept of case is supported if PCRE is compiled
with Unicode property support, but not otherwise. If you want to use
caseless matching for characters 128 and above, you must ensure that
PCRE is compiled with Unicode property support as well as with UTF-8
support.

Characters that might indicate line breaks are never treated in any
special way when matching character classes, whatever line-ending
sequence is in use, and whatever setting of the PCRE_DOTALL and
PCRE_MULTILINE options is used. A class such as [^a] always matches one
of these characters.

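The two points above (caseless classes, and classes ignoring line-break
conventions) can be sketched as follows. Python's re module is assumed
as a stand-in for PCRE, with re.IGNORECASE taking the place of
PCRE_CASELESS; only ASCII letters are used, so Unicode property support
does not come into play.

```python
import re

# A negated class matches a line feed even though dot (without DOTALL) does not.
class_matches_lf = re.fullmatch(r"[^a]", "\n")
dot_matches_lf = re.fullmatch(r".", "\n")

# Caseless matching applies to the letters inside a class as well.
caseless_class = re.fullmatch(r"[aeiou]", "A", re.IGNORECASE)
caseless_negated = re.fullmatch(r"[^aeiou]", "A", re.IGNORECASE)
caseful_negated = re.fullmatch(r"[^aeiou]", "A")
```
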
The minus (hyphen) character can be used to specify a range of
characters in a character class. For example, [d-m] matches any letter
between d and m, inclusive. If a minus character is required in a
class, it must be escaped with a backslash or appear in a position
where it cannot be interpreted as indicating a range, typically as the
first or last character in the class.

It is not possible to have the literal character "]" as the end
character of a range. A pattern such as [W-]46] is interpreted as a
class of two characters ("W" and "-") followed by a literal string
"46]", so it would match "W46]" or "-46]". However, if the "]" is
escaped with a backslash it is interpreted as the end of range, so
[W-\]46] is interpreted as a class containing a range followed by two
other characters. The octal or hexadecimal representation of "]" can
also be used to end a range.

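The [W-]46] example can be checked with the sketch below. It relies on
Python's re module, which is assumed here to parse these particular
classes the same way as PCRE; the names unescaped, escaped and hex_form
are illustrative only.

```python
import re

# [W-] is a two-character class; "46]" is then literal text.
unescaped = re.compile(r"[W-]46]")

# Escaping "]" makes it the end of the range W..], followed by "4" and "6".
escaped = re.compile(r"[W-\]46]")

# The hexadecimal form \x5d of "]" can end the range as well.
hex_form = re.compile(r"[W-\x5d46]")

print(bool(unescaped.fullmatch("W46]")))
print(bool(escaped.fullmatch("X")))
```
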
Ranges operate in the collating sequence of character values. They can
also be used for characters specified numerically, for example
[\000-\037]. In UTF-8 mode, ranges can include characters whose values
are greater than 255, for example [\x{100}-\x{2ff}].

If a range that includes letters is used when caseless matching is set,
it matches the letters in either case. For example, [W-c] is equivalent
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if
character tables for a French locale are in use, [\xc8-\xcb] matches
accented E characters in both cases. In UTF-8 mode, PCRE supports the
concept of case for characters with values of 128 and above only when
it is compiled with Unicode property support.

The character types \d, \D, \p, \P, \s, \S, \w, and \W may also appear
in a character class, and add the characters that they match to the
class. For example, [\dABCDEF] matches any hexadecimal digit. A
circumflex can conveniently be used with the upper case character types
to specify a more restricted set of characters than the matching lower
case type. For example, the class [^\W_] matches any letter or digit,
but not underscore.

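A short sketch of character types inside classes follows, using
Python's re module as an assumed stand-in for PCRE. Only ASCII subjects
are tested, because Python's \d and \w also cover non-ASCII characters
when applied to str patterns.

```python
import re

# \d contributes the digits; A-F are listed explicitly.
hex_digit = re.compile(r"[\dABCDEF]")

# Negating \W leaves the "word" characters; adding _ removes underscore too.
letter_or_digit = re.compile(r"[^\W_]")

print(bool(hex_digit.fullmatch("C")))
print(bool(letter_or_digit.fullmatch("_")))
```

Note that [\dABCDEF] as written accepts only the upper case letters A
to F unless caseless matching is also set.
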
The only metacharacters that are recognized in character classes are
backslash, hyphen (only where it can be interpreted as specifying a
range), circumflex (only at the start), opening square bracket (only
when it can be interpreted as introducing a POSIX class name - see the
next section), and the terminating closing square bracket. However,
escaping other non-alphanumeric characters does no harm.


POSIX CHARACTER CLASSES

Perl supports the POSIX notation for character classes. This uses names
enclosed by [: and :] within the enclosing square brackets. PCRE also
supports this notation. For example,

  [01[:alpha:]%]

  word      "word" characters (same as \w)
  xdigit    hexadecimal digits

The "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
and space (32). Notice that this list includes the VT character (code
11). This makes "space" different to \s, which does not include VT (for
Perl compatibility).

The name "word" is a Perl extension, and "blank" is a GNU extension
from Perl 5.8. Another Perl extension is negation, which is indicated
by a ^ character after the colon. For example,

  [12[:^digit:]]

matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the
POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
these are not supported, and an error is given if they are encountered.

string). The matching process tries each alternative in turn, from left
to right, and the first one that succeeds is used. If the alternatives
are within a subpattern (defined below), "succeeds" means matching the
rest of the main pattern as well as the alternative in the subpattern.

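The left-to-right rule for alternatives can be illustrated with a small
sketch. Python's re module is assumed as a stand-in for PCRE (both try
alternatives in order), and the patterns and subjects are invented for
the illustration.

```python
import re

# The first alternative that succeeds wins, even if a later one is longer.
first_wins = re.search(r"cat|category", "category")

# Inside a group, "succeeds" includes matching the rest of the pattern:
# "cat" cannot be followed by "!" here, so the second alternative is used.
rest_matters = re.search(r"(cat|category)!", "category!")

print(first_wins.group())
print(rest_matters.group(1))
```
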

INTERNAL OPTION SETTING

The settings of the PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and
PCRE_EXTENDED options (which are Perl-compatible) can be changed from
within the pattern by a sequence of Perl option letters enclosed
between "(?" and ")". The option letters are

  i  for PCRE_CASELESS

For example, (?im) sets caseless, multiline matching. It is also
possible to unset these options by preceding the letter with a hyphen,
and a combined setting and unsetting such as (?im-sx), which sets
PCRE_CASELESS and PCRE_MULTILINE while unsetting PCRE_DOTALL and
PCRE_EXTENDED, is also permitted. If a letter appears both before and
after the hyphen, the option is unset.

The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA
can be changed in the same way as the Perl-compatible options by using
the characters J, U and X respectively.

When an option change occurs at top level (that is, not inside
subpattern parentheses), the change applies to the remainder of the
pattern that follows. If the change is placed right at the start of a
pattern, PCRE extracts it into the global options (and it will
therefore show up in data extracted by the pcre_fullinfo() function).

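An option setting at the start of a pattern can be sketched as below.
The example uses Python's re module as an assumed stand-in for PCRE;
Python accepts such global settings only at the very start of the
pattern, and its flags attribute is offered merely as a loose analogue
of the options reported by pcre_fullinfo().

```python
import re

# Leading (?im) turns on caseless and multiline matching for the whole pattern.
pattern = re.compile(r"(?im)^abc$")

match = pattern.search("def\nABC")

# The inline letters show up in the compiled pattern's flags.
has_caseless = bool(pattern.flags & re.IGNORECASE)
has_multiline = bool(pattern.flags & re.MULTILINE)

print(match.group(), has_caseless, has_multiline)
```
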
An option change within a subpattern (see below for a description of
subpatterns) affects only that part of the current pattern that follows
it, so

  (a(?i)b)c

matches "abc" and "aBc" and no other strings (assuming PCRE_CASELESS is
not used). By this means, options can be made to have different
settings in different parts of the pattern. Any changes made in one
alternative do carry on into subsequent branches within the same
subpattern. For example,

  (a(?i)b|c)

matches "ab", "aB", "c", and "C", even though when matching "C" the
first branch is abandoned before the option setting. This is because
the effects of option settings happen at compile time. There would be
some very weird behaviour otherwise.

Note: There are other PCRE-specific options that can be set by the
application when the compile or match functions are called. In some
cases the pattern can contain special leading sequences to override
what the application has set or what has been defaulted. Details are
given in the section entitled "Newline sequences" above.


  cat(aract|erpillar|)

matches one of the words "cat", "cataract", or "caterpillar". Without
the parentheses, it would match "cataract", "erpillar" or an empty
string.

2. It sets up the subpattern as a capturing subpattern. This means
that, when the whole pattern matches, that portion of the subject
string that matched the subpattern is passed back to the caller via the
ovector argument of pcre_exec(). Opening parentheses are counted from
left to right (starting from 1) to obtain numbers for the capturing
subpatterns.

For example, if the string "the red king" is matched against the
pattern

  the ((red|white) (king|queen))

the captured substrings are "red king", "red", and "king", and are
numbered 1, 2, and 3, respectively.

The fact that plain parentheses fulfil two functions is not always
helpful. There are often times when a grouping subpattern is required
without a capturing requirement. If an opening parenthesis is followed
by a question mark and a colon, the subpattern does not do any
capturing, and is not counted when computing the number of any
subsequent capturing subpatterns. For example, if the string "the white
queen" is matched against the pattern

  the ((?:red|white) (king|queen))

the captured substrings are "white queen" and "queen", and are numbered
1 and 2. The maximum number of capturing subpatterns is 65535.

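The two numbering examples above can be reproduced with a short sketch.
It uses Python's re module as an assumed stand-in for PCRE; the groups()
method returns the captured substrings in numeric order, which here
takes the place of reading the ovector filled in by pcre_exec().

```python
import re

# Capturing groups are numbered by their opening parenthesis, left to right.
capturing = re.search(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped in the numbering.
non_capturing = re.search(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

print(capturing.groups())
print(non_capturing.groups())
```
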
As a convenient shorthand, if any option settings are required at the
start of a non-capturing subpattern, the option letters may appear
between the "?" and the ":". Thus the two patterns

  (?i:saturday|sunday)
  (?:(?i)saturday|sunday)

match exactly the same set of strings. Because alternative branches are
tried from left to right, and options are not reset until the end of
the subpattern is reached, an option setting in one branch does affect
subsequent branches, so the above patterns match "SUNDAY" as well as
"Saturday".

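The shorthand form can be tried with the sketch below, which assumes
Python's re module as a stand-in for PCRE. Only the (?i:...) spelling
is shown, because recent Python versions reject an option setting such
as (?i) in the middle of a pattern; the pattern named scoped is an
invented example of the option ending with the group.

```python
import re

# The option letter between "?" and ":" applies to every branch of the group.
days = re.compile(r"(?i:saturday|sunday)")

# Outside the group the option no longer applies.
scoped = re.compile(r"(?i:sun)day")

print(bool(days.fullmatch("SUNDAY")))
print(bool(scoped.fullmatch("SUNDAY")))
```
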

DUPLICATE SUBPATTERN NUMBERS

Perl 5.10 introduced a feature whereby each alternative in a subpattern
uses the same numbers for its capturing parentheses. Such a subpattern
starts with (?| and is itself a non-capturing subpattern. For example,
consider this pattern:

  (?|(Sat)ur|(Sun))day

Because the two alternatives are inside a (?| group, both sets of
capturing parentheses are numbered one. Thus, when the pattern matches,
you can look at captured substring number one, whichever alternative
matched. This construct is useful when you want to capture part, but
not all, of one of a number of alternatives. Inside a (?| group,
parentheses are numbered as usual, but the number is reset at the start
of each branch. The numbers of any capturing buffers that follow the
subpattern start after the highest number used in any branch. The
following example is taken from the Perl documentation. The numbers
underneath show in which buffer the captured content will be stored.

  # before  ---------------branch-reset----------- after
  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
  # 1            2         2  3        2     3     4

3944 |
A backreference or a recursive call to a numbered subpattern always |
A backreference or a recursive call to a numbered subpattern always |
3945 |
refers to the first one in the pattern with the given number. |
refers to the first one in the pattern with the given number. |
3946 |
|
|
3947 |
An alternative approach to using this "branch reset" feature is to use |
An alternative approach to using this "branch reset" feature is to use |
3948 |
duplicate named subpatterns, as described in the next section. |
duplicate named subpatterns, as described in the next section. |
3949 |
|
|
3950 |
|
|
3951 |
NAMED SUBPATTERNS |
NAMED SUBPATTERNS |
3952 |
|
|
3953 |
Identifying capturing parentheses by number is simple, but it can be |
Identifying capturing parentheses by number is simple, but it can be |
3954 |
very hard to keep track of the numbers in complicated regular expres- |
very hard to keep track of the numbers in complicated regular expres- |
3955 |
sions. Furthermore, if an expression is modified, the numbers may |
sions. Furthermore, if an expression is modified, the numbers may |
3956 |
change. To help with this difficulty, PCRE supports the naming of sub- |
change. To help with this difficulty, PCRE supports the naming of sub- |
3957 |
patterns. This feature was not added to Perl until release 5.10. Python |
patterns. This feature was not added to Perl until release 5.10. Python |
3958 |
had the feature earlier, and PCRE introduced it at release 4.0, using |
had the feature earlier, and PCRE introduced it at release 4.0, using |
3959 |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
3960 |
tax. |
tax. |
3961 |
|
|
3962 |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
3963 |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
3964 |
to capturing parentheses from other parts of the pattern, such as back- |
to capturing parentheses from other parts of the pattern, such as back- |
3965 |
references, recursion, and conditions, can be made by name as well as |
references, recursion, and conditions, can be made by name as well as |
3966 |
by number. |
by number. |
3967 |
|
|
3968 |
Names consist of up to 32 alphanumeric characters and underscores. |
Names consist of up to 32 alphanumeric characters and underscores. |
3969 |
Named capturing parentheses are still allocated numbers as well as |
Named capturing parentheses are still allocated numbers as well as |
3970 |
names, exactly as if the names were not present. The PCRE API provides |
names, exactly as if the names were not present. The PCRE API provides |
3971 |
function calls for extracting the name-to-number translation table from |
function calls for extracting the name-to-number translation table from |
3972 |
a compiled pattern. There is also a convenience function for extracting |
a compiled pattern. There is also a convenience function for extracting |
3973 |
a captured substring by name. |
a captured substring by name. |
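
The relationship between names and numbers described above can be illustrated with a minimal sketch. It assumes Python's "re" module as a stand-in, which is a different engine from PCRE but accepts the same (?P<name>...) syntax. The date pattern and the names year, month, and day are hypothetical, chosen only for illustration. In PCRE itself the equivalent steps go through the C API described in the pcreapi documentation.

```python
import re

# Assumption: Python's "re" module stands in for PCRE here. The pattern and
# the names year/month/day are illustrative, not taken from the PCRE docs.
date_re = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})")

# Name-to-number translation table: named groups keep their ordinary numbers.
name_table = dict(date_re.groupindex)

m = date_re.match("2009-03-08")
by_name = m.group("month")                  # extract the substring by name
by_number = m.group(name_table["month"])    # same group, looked up by number
print(name_table, by_name, by_number)
```

Both lookups return the same substring, because a named group is still an ordinary numbered group.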
3974 |
|
|
3975 |
By default, a name must be unique within a pattern, but it is possible |
By default, a name must be unique within a pattern, but it is possible |
3976 |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
3977 |
time. This can be useful for patterns where only one instance of the |
time. This can be useful for patterns where only one instance of the |
3978 |
named parentheses can match. Suppose you want to match the name of a |
named parentheses can match. Suppose you want to match the name of a |
3979 |
weekday, either as a 3-letter abbreviation or as the full name, and in |
weekday, either as a 3-letter abbreviation or as the full name, and in |
3980 |
both cases you want to extract the abbreviation. This pattern (ignoring |
both cases you want to extract the abbreviation. This pattern (ignoring |
3981 |
the line breaks) does the job: |
the line breaks) does the job: |
3982 |
|
|
3986 |
(?<DN>Thu)(?:rsday)?| |
(?<DN>Thu)(?:rsday)?| |
3987 |
(?<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
3988 |
|
|
3989 |
There are five capturing substrings, but only one is ever set after a |
There are five capturing substrings, but only one is ever set after a |
3990 |
match. (An alternative way of solving this problem is to use a "branch |
match. (An alternative way of solving this problem is to use a "branch |
3991 |
reset" subpattern, as described in the previous section.) |
reset" subpattern, as described in the previous section.) |
3992 |
|
|
3993 |
The convenience function for extracting the data by name returns the |
The convenience function for extracting the data by name returns the |
3994 |
substring for the first (and in this example, the only) subpattern of |
substring for the first (and in this example, the only) subpattern of |
3995 |
that name that matched. This saves searching to find which numbered |
that name that matched. This saves searching to find which numbered |
3996 |
subpattern it was. If you make a reference to a non-unique named sub- |
subpattern it was. If you make a reference to a non-unique named sub- |
3997 |
pattern from elsewhere in the pattern, the one that corresponds to the |
pattern from elsewhere in the pattern, the one that corresponds to the |
3998 |
lowest number is used. For further details of the interfaces for han- |
lowest number is used. For further details of the interfaces for han- |
3999 |
dling named subpatterns, see the pcreapi documentation. |
dling named subpatterns, see the pcreapi documentation. |
4000 |
|
|
4001 |
|
Warning: You cannot use different names to distinguish between two sub- |
4002 |
|
patterns with the same number (see the previous section) because PCRE |
4003 |
|
uses only the numbers when matching. |
4004 |
|
|
4005 |
|
|
4006 |
REPETITION |
REPETITION |
4007 |
|
|
4008 |
Repetition is specified by quantifiers, which can follow any of the |
Repetition is specified by quantifiers, which can follow any of the |
4009 |
following items: |
following items: |
4010 |
|
|
4011 |
a literal data character |
a literal data character |
4018 |
a back reference (see next section) |
a back reference (see next section) |
4019 |
a parenthesized subpattern (unless it is an assertion) |
a parenthesized subpattern (unless it is an assertion) |
4020 |
|
|
4021 |
The general repetition quantifier specifies a minimum and maximum num- |
The general repetition quantifier specifies a minimum and maximum num- |
4022 |
ber of permitted matches, by giving the two numbers in curly brackets |
ber of permitted matches, by giving the two numbers in curly brackets |
4023 |
(braces), separated by a comma. The numbers must be less than 65536, |
(braces), separated by a comma. The numbers must be less than 65536, |
4024 |
and the first must be less than or equal to the second. For example: |
and the first must be less than or equal to the second. For example: |
4025 |
|
|
4026 |
z{2,4} |
z{2,4} |
4027 |
|
|
4028 |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
matches "zz", "zzz", or "zzzz". A closing brace on its own is not a |
4029 |
special character. If the second number is omitted, but the comma is |
special character. If the second number is omitted, but the comma is |
4030 |
present, there is no upper limit; if the second number and the comma |
present, there is no upper limit; if the second number and the comma |
4031 |
are both omitted, the quantifier specifies an exact number of required |
are both omitted, the quantifier specifies an exact number of required |
4032 |
matches. Thus |
matches. Thus |
4033 |
|
|
4034 |
[aeiou]{3,} |
[aeiou]{3,} |
4037 |
|
|
4038 |
\d{8} |
\d{8} |
4039 |
|
|
4040 |
matches exactly 8 digits. An opening curly bracket that appears in a |
matches exactly 8 digits. An opening curly bracket that appears in a |
4041 |
position where a quantifier is not allowed, or one that does not match |
position where a quantifier is not allowed, or one that does not match |
4042 |
the syntax of a quantifier, is taken as a literal character. For exam- |
the syntax of a quantifier, is taken as a literal character. For exam- |
4043 |
ple, {,6} is not a quantifier, but a literal string of four characters. |
ple, {,6} is not a quantifier, but a literal string of four characters. |
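
The three quantifier forms above can be checked with a short sketch. It assumes Python's "re" module as a stand-in for PCRE; the helper name accepted and the candidate strings are hypothetical. One caveat: Python treats {,6} as a quantifier meaning {0,6}, so the literal-brace rule just described is specific to PCRE and Perl and is deliberately not demonstrated here.

```python
import re

# Assumption: Python's "re" stands in for PCRE; fullmatch() anchors both ends
# so that each candidate must be matched in its entirety.
def accepted(pattern, candidates):
    """Return the candidates that the pattern matches in full."""
    return [s for s in candidates if re.fullmatch(pattern, s)]

bounded = accepted(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])      # min and max
open_ended = accepted(r"[aeiou]{3,}", ["ae", "aei", "aeiouaeiou"])      # no upper limit
exact = accepted(r"\d{8}", ["1234567", "12345678", "123456789"])        # exact count
print(bounded, open_ended, exact)
```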
4044 |
|
|
4045 |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
In UTF-8 mode, quantifiers apply to UTF-8 characters rather than to |
4183 |
|
|
4184 |
(?>\d+)foo |
(?>\d+)foo |
4185 |
|
|
4186 |
This kind of parenthesis "locks up" the part of the pattern it con- |
This kind of parenthesis "locks up" the part of the pattern it con- |
4187 |
tains once it has matched, and a failure further into the pattern is |
tains once it has matched, and a failure further into the pattern is |
4188 |
prevented from backtracking into it. Backtracking past it to previous |
prevented from backtracking into it. Backtracking past it to previous |
4189 |
items, however, works as normal. |
items, however, works as normal. |
4190 |
|
|
4191 |
An alternative description is that a subpattern of this type matches |
An alternative description is that a subpattern of this type matches |
4192 |
the string of characters that an identical standalone pattern would |
the string of characters that an identical standalone pattern would |
4193 |
match, if anchored at the current point in the subject string. |
match, if anchored at the current point in the subject string. |
4194 |
|
|
4195 |
Atomic grouping subpatterns are not capturing subpatterns. Simple cases |
Atomic grouping subpatterns are not capturing subpatterns. Simple cases |
4196 |
such as the above example can be thought of as a maximizing repeat that |
such as the above example can be thought of as a maximizing repeat that |
4197 |
must swallow everything it can. So, while both \d+ and \d+? are pre- |
must swallow everything it can. So, while both \d+ and \d+? are pre- |
4198 |
pared to adjust the number of digits they match in order to make the |
pared to adjust the number of digits they match in order to make the |
4199 |
rest of the pattern match, (?>\d+) can only match an entire sequence of |
rest of the pattern match, (?>\d+) can only match an entire sequence of |
4200 |
digits. |
digits. |
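
The difference between an ordinary repeat and an atomic group can be made visible with a small sketch. It assumes Python's "re" module as a stand-in for PCRE. Because older Python versions do not accept (?>...), the atomic group is emulated here with a lookahead and a back reference, (?=(\d+))\1, which is a substitute technique and not PCRE syntax (Python 3.11 and later accept (?>\d+) directly). The trailing \d and the subject string are hypothetical, chosen so that the two patterns give different results.

```python
import re

# Assumption: Python's "re" stands in for PCRE. The atomic group (?>\d+) is
# emulated as (?=(\d+))\1, which works because a lookahead is never re-entered
# on backtracking. This emulation is a swapped-in technique, not PCRE syntax.
ordinary = re.compile(r"\d+\d")
atomic_like = re.compile(r"(?=(\d+))\1\d")

subject = "12345"
ordinary_match = ordinary.match(subject)     # \d+ gives back one digit for \d
atomic_match = atomic_like.match(subject)    # nothing is given back, so \d fails
print(ordinary_match.group(0), atomic_match)
```

The ordinary pattern matches the whole subject, while the atomic-style pattern fails: the digits it has swallowed cannot be released to satisfy the final \d.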
4201 |
|
|
4202 |
Atomic groups in general can of course contain arbitrarily complicated |
Atomic groups in general can of course contain arbitrarily complicated |
4203 |
subpatterns, and can be nested. However, when the subpattern for an |
subpatterns, and can be nested. However, when the subpattern for an |
4204 |
atomic group is just a single repeated item, as in the example above, a |
atomic group is just a single repeated item, as in the example above, a |
4205 |
simpler notation, called a "possessive quantifier" can be used. This |
simpler notation, called a "possessive quantifier" can be used. This |
4206 |
consists of an additional + character following a quantifier. Using |
consists of an additional + character following a quantifier. Using |
4207 |
this notation, the previous example can be rewritten as |
this notation, the previous example can be rewritten as |
4208 |
|
|
4209 |
\d++foo |
\d++foo |
4213 |
|
|
4214 |
(abc|xyz){2,3}+ |
(abc|xyz){2,3}+ |
4215 |
|
|
4216 |
Possessive quantifiers are always greedy; the setting of the |
Possessive quantifiers are always greedy; the setting of the |
4217 |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
4218 |
simpler forms of atomic group. However, there is no difference in the |
simpler forms of atomic group. However, there is no difference in the |
4219 |
meaning of a possessive quantifier and the equivalent atomic group, |
meaning of a possessive quantifier and the equivalent atomic group, |
4220 |
though there may be a performance difference; possessive quantifiers |
though there may be a performance difference; possessive quantifiers |
4221 |
should be slightly faster. |
should be slightly faster. |
4222 |
|
|
4223 |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
4224 |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
4225 |
edition of his book. Mike McCloskey liked it, so implemented it when he |
edition of his book. Mike McCloskey liked it, so implemented it when he |
4226 |
built Sun's Java package, and PCRE copied it from there. It ultimately |
built Sun's Java package, and PCRE copied it from there. It ultimately |
4227 |
found its way into Perl at release 5.10. |
found its way into Perl at release 5.10. |
4228 |
|
|
4229 |
PCRE has an optimization that automatically "possessifies" certain sim- |
PCRE has an optimization that automatically "possessifies" certain sim- |
4230 |
ple pattern constructs. For example, the sequence A+B is treated as |
ple pattern constructs. For example, the sequence A+B is treated as |
4231 |
A++B because there is no point in backtracking into a sequence of A's |
A++B because there is no point in backtracking into a sequence of A's |
4232 |
when B must follow. |
when B must follow. |
4233 |
|
|
4234 |
When a pattern contains an unlimited repeat inside a subpattern that |
When a pattern contains an unlimited repeat inside a subpattern that |
4235 |
can itself be repeated an unlimited number of times, the use of an |
can itself be repeated an unlimited number of times, the use of an |
4236 |
atomic group is the only way to avoid some failing matches taking a |
atomic group is the only way to avoid some failing matches taking a |
4237 |
very long time indeed. The pattern |
very long time indeed. The pattern |
4238 |
|
|
4239 |
(\D+|<\d+>)*[!?] |
(\D+|<\d+>)*[!?] |
4240 |
|
|
4241 |
matches an unlimited number of substrings that either consist of non- |
matches an unlimited number of substrings that either consist of non- |
4242 |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
4243 |
matches, it runs quickly. However, if it is applied to |
matches, it runs quickly. However, if it is applied to |
4244 |
|
|
4245 |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
4246 |
|
|
4247 |
it takes a long time before reporting failure. This is because the |
it takes a long time before reporting failure. This is because the |
4248 |
string can be divided between the internal \D+ repeat and the external |
string can be divided between the internal \D+ repeat and the external |
4249 |
* repeat in a large number of ways, and all have to be tried. (The |
* repeat in a large number of ways, and all have to be tried. (The |
4250 |
example uses [!?] rather than a single character at the end, because |
example uses [!?] rather than a single character at the end, because |
4251 |
both PCRE and Perl have an optimization that allows for fast failure |
both PCRE and Perl have an optimization that allows for fast failure |
4252 |
when a single character is used. They remember the last single charac- |
when a single character is used. They remember the last single charac- |
4253 |
ter that is required for a match, and fail early if it is not present |
ter that is required for a match, and fail early if it is not present |
4254 |
in the string.) If the pattern is changed so that it uses an atomic |
in the string.) If the pattern is changed so that it uses an atomic |
4255 |
group, like this: |
group, like this: |
4256 |
|
|
4257 |
((?>\D+)|<\d+>)*[!?] |
((?>\D+)|<\d+>)*[!?] |
4258 |
|
|
4259 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
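
The effect can be observed with a rough timing sketch. It assumes Python's "re" module as a stand-in for PCRE, so the absolute timings say nothing about PCRE itself. As an assumption for portability, the atomic group (?>\D+) is emulated as (?=(\D+))\2, a substitute technique and not PCRE syntax. The subject is kept to 18 characters because the work done by the unprotected pattern can grow roughly exponentially with the length of the subject.

```python
import re
import time

# Assumption: Python's "re" stands in for PCRE, and the atomic group (?>\D+)
# is emulated as (?=(\D+))\2 (a swapped-in technique, not PCRE syntax).
plain = re.compile(r"(\D+|<\d+>)*[!?]")
atomic_like = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

# Kept short on purpose: the plain pattern tries many ways to split the string.
subject = "a" * 18

def timed_match(pattern, text):
    """Return (match object or None, elapsed seconds) for an anchored match."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start

plain_result, plain_seconds = timed_match(plain, subject)
atomic_result, atomic_seconds = timed_match(atomic_like, subject)
print(f"plain: {plain_seconds:.4f}s  atomic-like: {atomic_seconds:.6f}s")
```

Both calls report failure; only the time taken to reach that answer is expected to differ.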
4260 |
|
|
4261 |
|
|
4262 |
BACK REFERENCES |
BACK REFERENCES |
5019 |
|
|
5020 |
REVISION |
REVISION |
5021 |
|
|
5022 |
Last updated: 19 April 2008 |
Last updated: 08 March 2009 |
5023 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
5024 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5025 |
|
|
5026 |
|
|
5548 |
0: dogsbody |
0: dogsbody |
5549 |
1: dog |
1: dog |
5550 |
|
|
5551 |
The pattern matches the words "dog" or "dogsbody". When the subject is |
The pattern matches the words "dog" or "dogsbody". When the subject is |
5552 |
presented in several parts ("do" and "gsb" being the first two) the |
presented in several parts ("do" and "gsb" being the first two) the |
5553 |
match stops when "dog" has been found, and it is not possible to con- |
match stops when "dog" has been found, and it is not possible to con- |
5554 |
tinue. On the other hand, if "dogsbody" is presented as a single |
tinue. On the other hand, if "dogsbody" is presented as a single |
5555 |
string, both matches are found. |
string, both matches are found. |
5556 |
|
|
5557 |
Because of this phenomenon, it does not usually make sense to end a |
Because of this phenomenon, it does not usually make sense to end a |
5558 |
pattern that is going to be matched in this way with a variable repeat. |
pattern that is going to be matched in this way with a variable repeat. |
5559 |
|
|
5560 |
4. Patterns that contain alternatives at the top level which do not all |
4. Patterns that contain alternatives at the top level which do not all |
5901 |
command for linking an application that uses them. Because the POSIX |
command for linking an application that uses them. Because the POSIX |
5902 |
functions call the native ones, it is also necessary to add -lpcre. |
functions call the native ones, it is also necessary to add -lpcre. |
5903 |
|
|
5904 |
I have implemented only those option bits that can be reasonably mapped |
I have implemented only those POSIX option bits that can be reasonably |
5905 |
to PCRE native options. In addition, the option REG_EXTENDED is defined |
mapped to PCRE native options. In addition, the option REG_EXTENDED is |
5906 |
with the value zero. This has no effect, but since programs that are |
defined with the value zero. This has no effect, but since programs |
5907 |
written to the POSIX interface often use it, this makes it easier to |
that are written to the POSIX interface often use it, this makes it |
5908 |
slot in PCRE as a replacement library. Other POSIX options are not even |
easier to slot in PCRE as a replacement library. Other POSIX options |
5909 |
defined. |
are not even defined. |
5910 |
|
|
5911 |
When PCRE is called via these functions, it is only the API that is |
When PCRE is called via these functions, it is only the API that is |
5912 |
POSIX-like in style. The syntax and semantics of the regular expres- |
POSIX-like in style. The syntax and semantics of the regular expres- |
5986 |
MATCHING NEWLINE CHARACTERS |
MATCHING NEWLINE CHARACTERS |
5987 |
|
|
5988 |
This area is not simple, because POSIX and Perl take different views of |
This area is not simple, because POSIX and Perl take different views of |
5989 |
things. It is not possible to get PCRE to obey POSIX semantics, but |
things. It is not possible to get PCRE to obey POSIX semantics, but |
5990 |
then PCRE was never intended to be a POSIX engine. The following table |
then PCRE was never intended to be a POSIX engine. The following table |
5991 |
lists the different possibilities for matching newline characters in |
lists the different possibilities for matching newline characters in |
5992 |
PCRE: |
PCRE: |
5993 |
|
|
5994 |
Default Change with |
Default Change with |
6010 |
^ matches \n in middle no REG_NEWLINE |
^ matches \n in middle no REG_NEWLINE |
6011 |
|
|
6012 |
PCRE's behaviour is the same as Perl's, except that there is no equiva- |
PCRE's behaviour is the same as Perl's, except that there is no equiva- |
6013 |
lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is |
lent for PCRE_DOLLAR_ENDONLY in Perl. In both PCRE and Perl, there is |
6014 |
no way to stop newline from matching [^a]. |
no way to stop newline from matching [^a]. |
6015 |
|
|
6016 |
The default POSIX newline handling can be obtained by setting |
The default POSIX newline handling can be obtained by setting |
6017 |
PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE |
PCRE_DOTALL and PCRE_DOLLAR_ENDONLY, but there is no way to make PCRE |
6018 |
behave exactly as for the REG_NEWLINE action. |
behave exactly as for the REG_NEWLINE action. |
6019 |
|
|
6020 |
|
|
6021 |
MATCHING A PATTERN |
MATCHING A PATTERN |
6022 |
|
|
6023 |
The function regexec() is called to match a compiled pattern preg |
The function regexec() is called to match a compiled pattern preg |
6024 |
against a given string, which is by default terminated by a zero byte |
against a given string, which is by default terminated by a zero byte |
6025 |
(but see REG_STARTEND below), subject to the options in eflags. These |
(but see REG_STARTEND below), subject to the options in eflags. These |
6026 |
can be: |
can be: |
6027 |
|
|
6028 |
REG_NOTBOL |
REG_NOTBOL |
6030 |
The PCRE_NOTBOL option is set when calling the underlying PCRE matching |
The PCRE_NOTBOL option is set when calling the underlying PCRE matching |
6031 |
function. |
function. |
6032 |
|
|
6033 |
|
REG_NOTEMPTY |
6034 |
|
|
6035 |
|
The PCRE_NOTEMPTY option is set when calling the underlying PCRE match- |
6036 |
|
ing function. Note that REG_NOTEMPTY is not part of the POSIX standard. |
6037 |
|
However, setting this option can give more POSIX-like behaviour in some |
6038 |
|
situations. |
6039 |
|
|
6040 |
REG_NOTEOL |
REG_NOTEOL |
6041 |
|
|
6042 |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
The PCRE_NOTEOL option is set when calling the underlying PCRE matching |
6099 |
|
|
6100 |
REVISION |
REVISION |
6101 |
|
|
6102 |
Last updated: 05 April 2008 |
Last updated: 11 March 2009 |
6103 |
Copyright (c) 1997-2008 University of Cambridge. |
Copyright (c) 1997-2009 University of Cambridge. |
6104 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6105 |
|
|
6106 |
|
|
6204 |
need more, consider using the more general interface |
need more, consider using the more general interface |
6205 |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
6206 |
|
|
6207 |
|
NOTE: Do not use no_arg, which is used internally to mark the end of a |
6208 |
|
list of optional arguments, as a placeholder for missing arguments, as |
6209 |
|
this can lead to segfaults. |
6210 |
|
|
6211 |
|
|
6212 |
QUOTING METACHARACTERS |
QUOTING METACHARACTERS |
6213 |
|
|
6441 |
|
|
6442 |
REVISION |
REVISION |
6443 |
|
|
6444 |
Last updated: 12 November 2007 |
Last updated: 17 March 2009 |
6445 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
6446 |
|
|
6447 |
|
|