18 |
|
|
19 |
The PCRE library is a set of functions that implement regular expres- |
The PCRE library is a set of functions that implement regular expres- |
20 |
sion pattern matching using the same syntax and semantics as Perl, with |
sion pattern matching using the same syntax and semantics as Perl, with |
21 |
just a few differences. The current implementation of PCRE (release |
just a few differences. (Certain features that appeared in Python and |
22 |
6.x) corresponds approximately with Perl 5.8, including support for |
PCRE before they appeared in Perl are also available using the Python |
23 |
UTF-8 encoded strings and Unicode general category properties. However, |
syntax.) |
24 |
this support has to be explicitly enabled; it is not the default. |
|
25 |
|
The current implementation of PCRE (release 7.x) corresponds approxi- |
26 |
In addition to the Perl-compatible matching function, PCRE also con- |
mately with Perl 5.10, including support for UTF-8 encoded strings and |
27 |
tains an alternative matching function that matches the same compiled |
Unicode general category properties. However, UTF-8 and Unicode support |
28 |
patterns in a different way. In certain circumstances, the alternative |
has to be explicitly enabled; it is not the default. The Unicode tables |
29 |
function has some advantages. For a discussion of the two matching |
correspond to Unicode release 5.0.0. |
30 |
algorithms, see the pcrematching page. |
|
31 |
|
In addition to the Perl-compatible matching function, PCRE contains an |
32 |
PCRE is written in C and released as a C library. A number of people |
alternative matching function that matches the same compiled patterns |
33 |
have written wrappers and interfaces of various kinds. In particular, |
in a different way. In certain circumstances, the alternative function |
34 |
Google Inc. have provided a comprehensive C++ wrapper. This is now |
has some advantages. For a discussion of the two matching algorithms, |
35 |
|
see the pcrematching page. |
36 |
|
|
37 |
|
PCRE is written in C and released as a C library. A number of people |
38 |
|
have written wrappers and interfaces of various kinds. In particular, |
39 |
|
Google Inc. have provided a comprehensive C++ wrapper. This is now |
40 |
included as part of the PCRE distribution. The pcrecpp page has details |
included as part of the PCRE distribution. The pcrecpp page has details |
41 |
of this interface. Other people's contributions can be found in the |
of this interface. Other people's contributions can be found in the |
42 |
Contrib directory at the primary FTP site, which is: |
Contrib directory at the primary FTP site, which is: |
43 |
|
|
44 |
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre |
ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre |
45 |
|
|
46 |
Details of exactly which Perl regular expression features are and are |
Details of exactly which Perl regular expression features are and are |
47 |
not supported by PCRE are given in separate documents. See the pcrepat- |
not supported by PCRE are given in separate documents. See the pcrepat- |
48 |
tern and pcrecompat pages. |
tern and pcrecompat pages. |
49 |
|
|
50 |
Some features of PCRE can be included, excluded, or changed when the |
Some features of PCRE can be included, excluded, or changed when the |
51 |
library is built. The pcre_config() function makes it possible for a |
library is built. The pcre_config() function makes it possible for a |
52 |
client to discover which features are available. The features them- |
client to discover which features are available. The features them- |
53 |
selves are described in the pcrebuild page. Documentation about build- |
selves are described in the pcrebuild page. Documentation about build- |
54 |
ing PCRE for various operating systems can be found in the README file |
ing PCRE for various operating systems can be found in the README file |
55 |
in the source distribution. |
in the source distribution. |
56 |
|
|
57 |
The library contains a number of undocumented internal functions and |
The library contains a number of undocumented internal functions and |
58 |
data tables that are used by more than one of the exported external |
data tables that are used by more than one of the exported external |
59 |
functions, but which are not intended for use by external callers. |
functions, but which are not intended for use by external callers. |
60 |
Their names all begin with "_pcre_", which hopefully will not provoke |
Their names all begin with "_pcre_", which hopefully will not provoke |
61 |
any name clashes. In some environments, it is possible to control which |
any name clashes. In some environments, it is possible to control which |
62 |
external symbols are exported when a shared library is built, and in |
external symbols are exported when a shared library is built, and in |
63 |
these cases the undocumented symbols are not exported. |
these cases the undocumented symbols are not exported. |
64 |
|
|
65 |
|
|
66 |
USER DOCUMENTATION |
USER DOCUMENTATION |
67 |
|
|
68 |
The user documentation for PCRE comprises a number of different sec- |
The user documentation for PCRE comprises a number of different sec- |
69 |
tions. In the "man" format, each of these is a separate "man page". In |
tions. In the "man" format, each of these is a separate "man page". In |
70 |
the HTML format, each is a separate page, linked from the index page. |
the HTML format, each is a separate page, linked from the index page. |
71 |
In the plain text format, all the sections are concatenated, for ease |
In the plain text format, all the sections are concatenated, for ease |
72 |
of searching. The sections are as follows: |
of searching. The sections are as follows: |
73 |
|
|
74 |
pcre this document |
pcre this document |
89 |
pcrestack discussion of stack usage |
pcrestack discussion of stack usage |
90 |
pcretest description of the pcretest testing command |
pcretest description of the pcretest testing command |
91 |
|
|
92 |
In addition, in the "man" and HTML formats, there is a short page for |
In addition, in the "man" and HTML formats, there is a short page for |
93 |
each C library function, listing its arguments and results. |
each C library function, listing its arguments and results. |
94 |
|
|
95 |
|
|
96 |
LIMITATIONS |
LIMITATIONS |
97 |
|
|
98 |
There are some size limitations in PCRE but it is hoped that they will |
There are some size limitations in PCRE but it is hoped that they will |
99 |
never in practice be relevant. |
never in practice be relevant. |
100 |
|
|
101 |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
The maximum length of a compiled pattern is 65539 (sic) bytes if PCRE |
102 |
is compiled with the default internal linkage size of 2. If you want to |
is compiled with the default internal linkage size of 2. If you want to |
103 |
process regular expressions that are truly enormous, you can compile |
process regular expressions that are truly enormous, you can compile |
104 |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
PCRE with an internal linkage size of 3 or 4 (see the README file in |
105 |
the source distribution and the pcrebuild documentation for details). |
the source distribution and the pcrebuild documentation for details). |
106 |
In these cases the limit is substantially larger. However, the speed |
In these cases the limit is substantially larger. However, the speed |
107 |
of execution will be slower. |
of execution is slower. |
108 |
|
|
109 |
All values in repeating quantifiers must be less than 65536. The |
All values in repeating quantifiers must be less than 65536. The |
110 |
maximum compiled length of a subpattern with an explicit repeat count is |
maximum compiled length of a subpattern with an explicit repeat count is |
111 |
30000 bytes. The maximum number of capturing subpatterns is 65535. |
30000 bytes. The maximum number of capturing subpatterns is 65535. |
112 |
|
|
113 |
There is no limit to the number of non-capturing subpatterns, but the |
There is no limit to the number of parenthesized subpatterns, but there |
114 |
maximum depth of nesting of all kinds of parenthesized subpattern, |
can be no more than 65535 capturing subpatterns. |
|
including capturing subpatterns, assertions, and other types of subpat- |
|
|
tern, is 200. |
|
115 |
|
|
116 |
The maximum length of a name for a named subpattern is 32, and the |
The maximum length of a name for a named subpattern is 32 characters, and |
117 |
maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
118 |
|
|
119 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
120 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
121 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
122 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
123 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
124 |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
125 |
|
|
126 |
|
|
127 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
128 |
|
|
129 |
From release 3.3, PCRE has had some support for character strings |
From release 3.3, PCRE has had some support for character strings |
130 |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
131 |
to cover most common requirements, and in release 5.0 additional sup- |
to cover most common requirements, and in release 5.0 additional sup- |
132 |
port for Unicode general category properties was added. |
port for Unicode general category properties was added. |
133 |
|
|
134 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
135 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
136 |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
137 |
any subject strings that are matched against it are treated as UTF-8 |
any subject strings that are matched against it are treated as UTF-8 |
138 |
strings instead of just strings of bytes. |
strings instead of just strings of bytes. |
139 |
|
|
140 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
141 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
142 |
is limited to testing the PCRE_UTF8 flag in several places, so should |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
143 |
not be very large. |
very big. |
144 |
|
|
145 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
146 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
147 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
148 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
149 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
150 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
151 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
152 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
153 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
154 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
155 |
does not support this. |
does not support this. |
156 |
|
|
157 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
158 |
|
|
159 |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
160 |
subjects are checked for validity on entry to the relevant functions. |
subjects are checked for validity on entry to the relevant functions. |
161 |
If an invalid UTF-8 string is passed, an error return is given. In some |
If an invalid UTF-8 string is passed, an error return is given. In some |
162 |
situations, you may already know that your strings are valid, and |
situations, you may already know that your strings are valid, and |
163 |
therefore want to skip these checks in order to improve performance. If |
therefore want to skip these checks in order to improve performance. If |
164 |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
165 |
PCRE assumes that the pattern or subject it is given (respectively) |
PCRE assumes that the pattern or subject it is given (respectively) |
166 |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
167 |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
168 |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
169 |
crash. |
crash. |
170 |
|
|
171 |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
172 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
173 |
|
|
174 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
175 |
characters for values greater than \177. |
characters for values greater than \177. |
176 |
|
|
177 |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
178 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
179 |
|
|
180 |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
181 |
gle byte. |
gle byte. |
182 |
|
|
183 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
184 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
185 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
186 |
|
|
187 |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
188 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
189 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
190 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
191 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
192 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
193 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
194 |
\p{Nd}. |
\p{Nd}. |
195 |
|
|
196 |
8. Similarly, characters that match the POSIX named character classes |
8. Similarly, characters that match the POSIX named character classes |
197 |
are all low-valued characters. |
are all low-valued characters. |
198 |
|
|
199 |
9. Case-insensitive matching applies only to characters whose values |
9. Case-insensitive matching applies only to characters whose values |
200 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
201 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
202 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
203 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
204 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
205 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
206 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
207 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
208 |
ported by PCRE. |
ported by PCRE. |
209 |
|
|
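As a rough illustration of points 1 and 2 above, the following C sketch
shows why a value such as 0xb3 occupies two bytes in UTF-8, and what a
basic validity check involves. It is not PCRE's own code: the names
utf8_encode2 and utf8_valid are hypothetical, and the check is
deliberately simplified (it rejects bad lead bytes, truncated sequences,
and overlong two-byte forms, but does not screen surrogates or overlong
three-byte and four-byte forms).

```c
#include <stddef.h>

/* Encode a code point in the range 0x80..0x7ff as two UTF-8 bytes.
   Returns 2 on success, or 0 if the value is outside that range. */
int utf8_encode2(unsigned int c, unsigned char out[2])
{
    if (c < 0x80 || c > 0x7ff) return 0;
    out[0] = (unsigned char)(0xc0 | (c >> 6));    /* 110xxxxx lead byte */
    out[1] = (unsigned char)(0x80 | (c & 0x3f));  /* 10xxxxxx continuation */
    return 2;
}

/* Simplified validity check (an assumption-laden sketch, see above).
   Returns 1 if the byte string looks like valid UTF-8, 0 otherwise. */
int utf8_valid(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len) {
        unsigned char b = s[i];
        size_t extra, k;

        if (b < 0x80) extra = 0;                      /* ASCII */
        else if (b >= 0xc2 && b <= 0xdf) extra = 1;   /* two-byte lead */
        else if (b >= 0xe0 && b <= 0xef) extra = 2;   /* three-byte lead */
        else if (b >= 0xf0 && b <= 0xf4) extra = 3;   /* four-byte lead */
        else return 0;  /* stray continuation, 0xc0/0xc1, or above 0xf4 */

        if (extra > len - i - 1) return 0;            /* truncated */
        for (k = 1; k <= extra; k++)
            if ((s[i + k] & 0xc0) != 0x80) return 0;  /* bad continuation */
        i += extra + 1;
    }
    return 1;
}
```

With these helpers, 0xb3 encodes as the two bytes 0xc2 0xb3, and a lone
0xb3 byte is rejected as invalid.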
210 |
|
|
212 |
|
|
213 |
Philip Hazel |
Philip Hazel |
214 |
University Computing Service, |
University Computing Service, |
215 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QH, England. |
216 |
|
|
217 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
218 |
so I've taken it away. If you want to email me, use my initial and sur- |
so I've taken it away. If you want to email me, use my initial and sur- |
219 |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
name, separated by a dot, at the domain ucs.cam.ac.uk. |
220 |
|
|
221 |
Last updated: 05 June 2006 |
Last updated: 23 November 2006 |
222 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
223 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
224 |
|
|
308 |
|
|
309 |
--enable-newline-is-crlf |
--enable-newline-is-crlf |
310 |
|
|
311 |
to the configure command. Whatever line ending convention is selected |
to the configure command. There is a fourth option, specified by |
312 |
when PCRE is built can be overridden when the library functions are |
|
313 |
called. At build time it is conventional to use the standard for your |
--enable-newline-is-any |
314 |
operating system. |
|
315 |
|
which causes PCRE to recognize any Unicode newline sequence. |
316 |
|
|
317 |
|
Whatever line ending convention is selected when PCRE is built can be |
318 |
|
overridden when the library functions are called. At build time it is |
319 |
|
conventional to use the standard for your operating system. |
320 |
|
|
321 |
|
|
322 |
BUILDING SHARED AND STATIC LIBRARIES |
BUILDING SHARED AND STATIC LIBRARIES |
323 |
|
|
324 |
The PCRE building process uses libtool to build both shared and static |
The PCRE building process uses libtool to build both shared and static |
325 |
Unix libraries by default. You can suppress one of these by adding one |
Unix libraries by default. You can suppress one of these by adding one |
326 |
of |
of |
327 |
|
|
328 |
--disable-shared |
--disable-shared |
334 |
POSIX MALLOC USAGE |
POSIX MALLOC USAGE |
335 |
|
|
336 |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
When PCRE is called through the POSIX interface (see the pcreposix doc- |
337 |
umentation), additional working storage is required for holding the |
umentation), additional working storage is required for holding the |
338 |
pointers to capturing substrings, because PCRE requires three integers |
pointers to capturing substrings, because PCRE requires three integers |
339 |
per substring, whereas the POSIX interface provides only two. If the |
per substring, whereas the POSIX interface provides only two. If the |
340 |
number of expected substrings is small, the wrapper function uses space |
number of expected substrings is small, the wrapper function uses space |
341 |
on the stack, because this is faster than using malloc() for each call. |
on the stack, because this is faster than using malloc() for each call. |
342 |
The default threshold above which the stack is no longer used is 10; it |
The default threshold above which the stack is no longer used is 10; it |
349 |
|
|
350 |
HANDLING VERY LARGE PATTERNS |
HANDLING VERY LARGE PATTERNS |
351 |
|
|
352 |
Within a compiled pattern, offset values are used to point from one |
Within a compiled pattern, offset values are used to point from one |
353 |
part to another (for example, from an opening parenthesis to an alter- |
part to another (for example, from an opening parenthesis to an alter- |
354 |
nation metacharacter). By default, two-byte values are used for these |
nation metacharacter). By default, two-byte values are used for these |
355 |
offsets, leading to a maximum size for a compiled pattern of around |
offsets, leading to a maximum size for a compiled pattern of around |
356 |
64K. This is sufficient to handle all but the most gigantic patterns. |
64K. This is sufficient to handle all but the most gigantic patterns. |
357 |
Nevertheless, some people do want to process enormous patterns, so it |
Nevertheless, some people do want to process enormous patterns, so it |
358 |
is possible to compile PCRE to use three-byte or four-byte offsets by |
is possible to compile PCRE to use three-byte or four-byte offsets by |
359 |
adding a setting such as |
adding a setting such as |
360 |
|
|
361 |
--with-link-size=3 |
--with-link-size=3 |
362 |
|
|
363 |
to the configure command. The value given must be 2, 3, or 4. Using |
to the configure command. The value given must be 2, 3, or 4. Using |
364 |
longer offsets slows down the operation of PCRE because it has to load |
longer offsets slows down the operation of PCRE because it has to load |
365 |
additional bytes when handling them. |
additional bytes when handling them. |
366 |
|
|
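The trade-off can be pictured with a small C sketch. It is an
illustration only, not PCRE's internal code: the helper names max_link,
put_link, and get_link are hypothetical, and the big-endian byte order
is an assumption. It stores an offset in a field of link_size bytes,
which shows why two bytes cap an offset at 65535 and why wider links
mean more bytes to load each time an offset is read.

```c
/* Largest offset that fits in link_size bytes (2, 3, or 4). */
unsigned long max_link(int link_size)
{
    /* Guard the 4-byte case so the shift never reaches the type width. */
    return (link_size >= 4) ? 0xffffffffUL
                            : (1UL << (8 * link_size)) - 1;
}

/* Store value in link_size bytes, most significant byte first.
   Bits that do not fit are silently dropped. */
void put_link(unsigned char *p, int link_size, unsigned long value)
{
    int i;
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xff);
        value >>= 8;
    }
}

/* Read back an offset; one byte load per byte of link size. */
unsigned long get_link(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    int i;
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}
```

In this sketch an offset of 65536 does not survive a round trip through
a two-byte field, but does through a three-byte one.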
367 |
If you build PCRE with an increased link size, test 2 (and test 5 if |
If you build PCRE with an increased link size, test 2 (and test 5 if |
368 |
you are using UTF-8) will fail. Part of the output of these tests is a |
you are using UTF-8) will fail. Part of the output of these tests is a |
369 |
representation of the compiled pattern, and this changes with the link |
representation of the compiled pattern, and this changes with the link |
370 |
size. |
size. |
371 |
|
|
372 |
|
|
373 |
AVOIDING EXCESSIVE STACK USAGE |
AVOIDING EXCESSIVE STACK USAGE |
374 |
|
|
375 |
When matching with the pcre_exec() function, PCRE implements backtrack- |
When matching with the pcre_exec() function, PCRE implements backtrack- |
376 |
ing by making recursive calls to an internal function called match(). |
ing by making recursive calls to an internal function called match(). |
377 |
In environments where the size of the stack is limited, this can se- |
In environments where the size of the stack is limited, this can se- |
378 |
verely limit PCRE's operation. (The Unix environment does not usually |
verely limit PCRE's operation. (The Unix environment does not usually |
379 |
suffer from this problem, but it may sometimes be necessary to increase |
suffer from this problem, but it may sometimes be necessary to increase |
380 |
the maximum stack size. There is a discussion in the pcrestack docu- |
the maximum stack size. There is a discussion in the pcrestack docu- |
381 |
mentation.) An alternative approach to recursion that uses memory from |
mentation.) An alternative approach to recursion that uses memory from |
382 |
the heap to remember data, instead of using recursive function calls, |
the heap to remember data, instead of using recursive function calls, |
383 |
has been implemented to work round the problem of limited stack size. |
has been implemented to work round the problem of limited stack size. |
384 |
If you want to build a version of PCRE that works this way, add |
If you want to build a version of PCRE that works this way, add |
385 |
|
|
386 |
--disable-stack-for-recursion |
--disable-stack-for-recursion |
387 |
|
|
388 |
to the configure command. With this configuration, PCRE will use the |
to the configure command. With this configuration, PCRE will use the |
389 |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
pcre_stack_malloc and pcre_stack_free variables to call memory manage- |
390 |
ment functions. Separate functions are provided because the usage is |
ment functions. Separate functions are provided because the usage is |
391 |
very predictable: the block sizes requested are always the same, and |
very predictable: the block sizes requested are always the same, and |
392 |
the blocks are always freed in reverse order. A calling program might |
the blocks are always freed in reverse order. A calling program might |
393 |
be able to implement optimized functions that perform better than the |
be able to implement optimized functions that perform better than the |
394 |
standard malloc() and free() functions. PCRE runs noticeably more |
standard malloc() and free() functions. PCRE runs noticeably more |
395 |
slowly when built in this way. This option affects only the pcre_exec() |
slowly when built in this way. This option affects only the pcre_exec() |
396 |
function; it is not relevant for the pcre_dfa_exec() function. |
function; it is not relevant for the pcre_dfa_exec() function. |
397 |
|
|
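Because the requests are all the same size and are released in reverse
order, a simple free-list pool can serve them cheaply. The C sketch
below is one hypothetical way to write such a pair of functions; the
names pool_malloc and pool_free, and the free-list design, are
assumptions and not part of PCRE. The signatures mirror malloc() and
free(), so a program could assign the two functions to pcre_stack_malloc
and pcre_stack_free before calling pcre_exec(). The sketch is not
thread-safe and never returns memory to the system.

```c
#include <stdlib.h>

/* Hidden header placed in front of every block. The union forces an
   alignment suitable for the data that follows it. */
typedef union pool_hdr {
    struct { union pool_hdr *next; size_t size; } h;
    long double align;
} pool_hdr;

static pool_hdr *free_list = NULL;   /* most recently freed block first */

void *pool_malloc(size_t size)
{
    pool_hdr *b = free_list;

    if (b != NULL && b->h.size == size) {
        free_list = b->h.next;       /* reuse the last block freed */
    } else {
        b = malloc(sizeof(pool_hdr) + size);
        if (b == NULL) return NULL;
        b->h.size = size;
    }
    return (void *)(b + 1);          /* caller's data follows the header */
}

void pool_free(void *p)
{
    pool_hdr *b;

    if (p == NULL) return;
    b = (pool_hdr *)p - 1;           /* step back to the hidden header */
    b->h.next = free_list;           /* push onto the free list */
    free_list = b;
}
```

When blocks are freed in reverse order of allocation, the next run of
requests of the same size is served entirely from the free list, without
calling malloc() again.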
398 |
|
|
399 |
LIMITING PCRE RESOURCE USAGE |
LIMITING PCRE RESOURCE USAGE |
400 |
|
|
401 |
Internally, PCRE has a function called match(), which it calls repeat- |
Internally, PCRE has a function called match(), which it calls repeat- |
402 |
edly (sometimes recursively) when matching a pattern with the |
edly (sometimes recursively) when matching a pattern with the |
403 |
pcre_exec() function. By controlling the maximum number of times this |
pcre_exec() function. By controlling the maximum number of times this |
404 |
function may be called during a single matching operation, a limit can |
function may be called during a single matching operation, a limit can |
405 |
be placed on the resources used by a single call to pcre_exec(). The |
be placed on the resources used by a single call to pcre_exec(). The |
406 |
limit can be changed at run time, as described in the pcreapi documen- |
limit can be changed at run time, as described in the pcreapi documen- |
407 |
tation. The default is 10 million, but this can be changed by adding a |
tation. The default is 10 million, but this can be changed by adding a |
408 |
setting such as |
setting such as |
409 |
|
|
410 |
--with-match-limit=500000 |
--with-match-limit=500000 |
411 |
|
|
412 |
to the configure command. This setting has no effect on the |
to the configure command. This setting has no effect on the |
413 |
pcre_dfa_exec() matching function. |
pcre_dfa_exec() matching function. |
414 |
|
|
415 |
In some environments it is desirable to limit the depth of recursive |
In some environments it is desirable to limit the depth of recursive |
416 |
calls of match() more strictly than the total number of calls, in order |
calls of match() more strictly than the total number of calls, in order |
417 |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
to restrict the maximum amount of stack (or heap, if --disable-stack- |
418 |
for-recursion is specified) that is used. A second limit controls this; |
for-recursion is specified) that is used. A second limit controls this; |
419 |
it defaults to the value that is set for --with-match-limit, which |
it defaults to the value that is set for --with-match-limit, which |
420 |
imposes no additional constraints. However, you can set a lower limit |
imposes no additional constraints. However, you can set a lower limit |
421 |
by adding, for example, |
by adding, for example, |
422 |
|
|
423 |
--with-match-limit-recursion=10000 |
--with-match-limit-recursion=10000 |
424 |
|
|
425 |
to the configure command. This value can also be overridden at run |
to the configure command. This value can also be overridden at run |
426 |
time. |
time. |
427 |
|
|
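The difference between the two limits can be pictured with a toy
backtracking matcher. This is a sketch under stated assumptions, not
PCRE's match() function: the pattern language (literal characters plus
"*" for any run of characters), the names match_sketch and limits, and
the result codes are all hypothetical. One counter caps the total number
of calls, and a separate check caps how deeply the calls may nest.

```c
typedef struct {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* cap on the total number of calls */
    unsigned long depth_limit;  /* cap on the nesting depth */
} limits;

enum { MATCHED = 1, NO_MATCH = 0, ERR_CALL_LIMIT = -1, ERR_DEPTH_LIMIT = -2 };

/* Match pat against the whole of sub. The top-level caller passes
   depth 1; each recursive call passes depth + 1. */
int match_sketch(const char *pat, const char *sub,
                 unsigned long depth, limits *lim)
{
    int rc;

    if (++lim->calls > lim->call_limit) return ERR_CALL_LIMIT;
    if (depth > lim->depth_limit) return ERR_DEPTH_LIMIT;

    if (*pat == '\0') return (*sub == '\0') ? MATCHED : NO_MATCH;

    if (*pat == '*') {
        /* Try each possible length for the star, shortest first. */
        for (;;) {
            rc = match_sketch(pat + 1, sub, depth + 1, lim);
            if (rc != NO_MATCH) return rc;   /* success or a limit error */
            if (*sub == '\0') return NO_MATCH;
            sub++;
        }
    }

    if (*sub != '\0' && *pat == *sub)
        return match_sketch(pat + 1, sub + 1, depth + 1, lim);
    return NO_MATCH;
}
```

Matching "a*b" against "aXXb" takes six calls in this sketch but nests
only four deep, so a depth limit can be set well below the call limit
and still allow the match to complete.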
428 |
|
|
429 |
USING EBCDIC CODE |
USING EBCDIC CODE |
430 |
|
|
431 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
432 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
433 |
PCRE can, however, be compiled to run in an EBCDIC environment by |
PCRE can, however, be compiled to run in an EBCDIC environment by |
434 |
adding |
adding |
435 |
|
|
436 |
--enable-ebcdic |
--enable-ebcdic |
437 |
|
|
438 |
to the configure command. |
to the configure command. |
439 |
|
|
440 |
Last updated: 06 June 2006 |
|
441 |
|
SEE ALSO |
442 |
|
|
443 |
|
pcreapi(3), pcre_config(3). |
444 |
|
|
445 |
|
Last updated: 30 November 2006 |
446 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
447 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
448 |
|
|
479 |
<something> <something else> <something further> |
<something> <something else> <something further> |
480 |
|
|
481 |
there are three possible answers. The standard algorithm finds only one |
there are three possible answers. The standard algorithm finds only one |
482 |
of them, whereas the DFA algorithm finds all three. |
of them, whereas the alternative algorithm finds all three. |
483 |
|
|
484 |
|
|
485 |
REGULAR EXPRESSIONS AS TREES |
REGULAR EXPRESSIONS AS TREES |
520 |
This provides support for capturing parentheses and back references. |
This provides support for capturing parentheses and back references. |
521 |
|
|
522 |
|
|
523 |
THE DFA MATCHING ALGORITHM |
THE ALTERNATIVE MATCHING ALGORITHM |
524 |
|
|
525 |
DFA stands for "deterministic finite automaton", but you do not need to |
This algorithm conducts a breadth-first search of the tree. Starting |
526 |
understand the origins of that name. This algorithm conducts a breadth- |
from the first matching point in the subject, it scans the subject |
527 |
first search of the tree. Starting from the first matching point in the |
string from left to right, once, character by character, and as it does |
528 |
subject, it scans the subject string from left to right, once, charac- |
this, it remembers all the paths through the tree that represent valid |
529 |
ter by character, and as it does this, it remembers all the paths |
matches. In Friedl's terminology, this is a kind of "DFA algorithm", |
530 |
through the tree that represent valid matches. |
though it is not implemented as a traditional finite state machine (it |
531 |
|
keeps multiple states active simultaneously). |
532 |
The scan continues until either the end of the subject is reached, or |
|
533 |
there are no more unterminated paths. At this point, terminated paths |
The scan continues until either the end of the subject is reached, or |
534 |
represent the different matching possibilities (if there are none, the |
there are no more unterminated paths. At this point, terminated paths |
535 |
match has failed). Thus, if there is more than one possible match, |
represent the different matching possibilities (if there are none, the |
536 |
|
match has failed). Thus, if there is more than one possible match, |
537 |
this algorithm finds all of them, and in particular, it finds the long- |
this algorithm finds all of them, and in particular, it finds the long- |
538 |
est. In PCRE, there is an option to stop the algorithm after the first |
est. In PCRE, there is an option to stop the algorithm after the first |
539 |
match (which is necessarily the shortest) has been found. |
match (which is necessarily the shortest) has been found. |
540 |
|
|
541 |
Note that all the matches that are found start at the same point in the |
Note that all the matches that are found start at the same point in the |
543 |
|
|
544 |
cat(er(pillar)?) |
cat(er(pillar)?) |
545 |
|
|
546 |
is matched against the string "the caterpillar catchment", the result |
is matched against the string "the caterpillar catchment", the result |
547 |
will be the three strings "cat", "cater", and "caterpillar" that start |
will be the three strings "cat", "cater", and "caterpillar" that start |
548 |
at the fifth character of the subject. The algorithm does not automat- |
at the fifth character of the subject. The algorithm does not automat- |
549 |
ically move on to find matches that start at later positions. |
ically move on to find matches that start at later positions. |
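
The behaviour described above can be sketched in C for this one pattern. The function below is written by hand for cat(er(pillar)?) and is not PCRE's implementation; the name all_matches_at and its interface are assumptions made for illustration. It scans once from a given start offset (counting from zero, so "caterpillar" in the example subject begins at offset 4) and records the length of every match that ends along the way.

```c
#include <stddef.h>

/*
 * Hand-coded illustration for the single pattern cat(er(pillar)?).
 * Every match is a prefix of "caterpillar" that ends after 3, 5, or 11
 * characters, so one left-to-right scan can report all of them.
 * Stores the match lengths (shortest first) in lengths[] and returns
 * how many were found; lengths[] needs room for three entries.
 */
size_t all_matches_at(const char *subject, size_t subject_len,
                      size_t start, size_t lengths[3])
{
    static const char longest[] = "caterpillar";
    static const size_t accept[] = { 3, 5, 11 };
    size_t found = 0;

    for (size_t i = 0; i < 11 && start + i < subject_len; i++) {
        if (subject[start + i] != longest[i])
            break;                      /* no unterminated paths remain */
        if (i + 1 == accept[found])
            lengths[found++] = i + 1;   /* a path terminated here */
    }
    return found;
}
```

Calling this with the subject "the caterpillar catchment" and start offset 4 should report three lengths; calling it again at the offset of "catchment" is a separate scan, which mirrors the point that the algorithm does not move on by itself.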
550 |
|
|
551 |
There are a number of features of PCRE regular expressions that are not |
There are a number of features of PCRE regular expressions that are not |
552 |
supported by the DFA matching algorithm. They are as follows: |
supported by the alternative matching algorithm. They are as follows: |
553 |
|
|
554 |
1. Because the algorithm finds all possible matches, the greedy or |
1. Because the algorithm finds all possible matches, the greedy or |
555 |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
ungreedy nature of repetition quantifiers is not relevant. Greedy and |
556 |
ungreedy quantifiers are treated in exactly the same way. |
ungreedy quantifiers are treated in exactly the same way. However, pos- |
557 |
|
sessive quantifiers can make a difference when what follows could also |
558 |
|
match what is quantified, for example in a pattern like this: |
559 |
|
|
560 |
|
^a++\w! |
561 |
|
|
562 |
|
This pattern matches "aaab!" but not "aaa!", which would be matched by |
563 |
|
a non-possessive quantifier. Similarly, if an atomic group is present, |
564 |
|
it is matched as if it were a standalone pattern at the current point, |
565 |
|
and the longest match is then "locked in" for the rest of the overall |
566 |
|
pattern. |
567 |
|
|
568 |
2. When dealing with multiple paths through the tree simultaneously, it |
2. When dealing with multiple paths through the tree simultaneously, it |
569 |
is not straightforward to keep track of captured substrings for the |
is not straightforward to keep track of captured substrings for the |
570 |
different matching possibilities, and PCRE's implementation of this |
different matching possibilities, and PCRE's implementation of this |
571 |
algorithm does not attempt to do this. This means that no captured sub- |
algorithm does not attempt to do this. This means that no captured sub- |
572 |
strings are available. |
strings are available. |
573 |
|
|
574 |
3. Because no substrings are captured, back references within the pat- |
3. Because no substrings are captured, back references within the pat- |
575 |
tern are not supported, and cause errors if encountered. |
tern are not supported, and cause errors if encountered. |
576 |
|
|
577 |
4. For the same reason, conditional expressions that use a backrefer- |
4. For the same reason, conditional expressions that use a backrefer- |
578 |
ence as the condition are not supported. |
ence as the condition or test for a specific group recursion are not |
579 |
|
supported. |
580 |
|
|
581 |
5. Callouts are supported, but the value of the capture_top field is |
5. Callouts are supported, but the value of the capture_top field is |
582 |
always 1, and the value of the capture_last field is always -1. |
always 1, and the value of the capture_last field is always -1. |
583 |
|
|
584 |
6. The \C escape sequence, which (in the standard algorithm) matches a |
6. The \C escape sequence, which (in the standard algorithm) matches a |
585 |
single byte, even in UTF-8 mode, is not supported because the DFA algo- |
single byte, even in UTF-8 mode, is not supported because the alterna- |
586 |
rithm moves through the subject string one character at a time, for all |
tive algorithm moves through the subject string one character at a |
587 |
active paths through the tree. |
time, for all active paths through the tree. |
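
The remark about possessive quantifiers in item 1 can be checked with a small hand-written sketch. The function below models only the outcomes of the two fixed patterns ^a++\w! and ^a+\w!; it is not how either PCRE algorithm is implemented, and the names is_word and match_a_run are assumptions made for illustration.

```c
#include <ctype.h>
#include <stddef.h>

/* \w in the default "C" locale sense: letter, digit, or underscore. */
static int is_word(char c)
{
    return isalnum((unsigned char)c) || c == '_';
}

/*
 * Hand-coded matchers for two fixed patterns, anchored at offset 0.
 * possessive != 0 models ^a++\w!  (the run of a's is never shortened);
 * possessive == 0 models ^a+\w!   (the run may be shortened).
 * Returns 1 if the pattern matches at the start of s, else 0.
 */
int match_a_run(const char *s, size_t len, int possessive)
{
    size_t run = 0;
    while (run < len && s[run] == 'a')
        run++;
    if (run == 0)
        return 0;                        /* a+ needs at least one 'a' */

    /* Try the longest run first; only the non-possessive form may
       retry with shorter runs (always keeping at least one 'a'). */
    for (size_t n = run; n >= 1; n--) {
        if (n + 1 < len && is_word(s[n]) && s[n + 1] == '!')
            return 1;
        if (possessive)
            break;
    }
    return 0;
}
```

With the possessive form, "aaab!" matches and "aaa!" does not, because the final "a" cannot be released to satisfy \w; with the non-possessive form both subjects match.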
588 |
|
|
589 |
|
|
590 |
ADVANTAGES OF THE DFA ALGORITHM |
ADVANTAGES OF THE ALTERNATIVE ALGORITHM |
591 |
|
|
592 |
Using the DFA matching algorithm provides the following advantages: |
Using the alternative matching algorithm provides the following advan- |
593 |
|
tages: |
594 |
|
|
595 |
1. All possible matches (at a single point in the subject) are automat- |
1. All possible matches (at a single point in the subject) are automat- |
596 |
ically found, and in particular, the longest match is found. To find |
ically found, and in particular, the longest match is found. To find |
597 |
more than one match using the standard algorithm, you have to do kludgy |
more than one match using the standard algorithm, you have to do kludgy |
598 |
things with callouts. |
things with callouts. |
599 |
|
|
600 |
2. There is much better support for partial matching. The restrictions |
2. There is much better support for partial matching. The restrictions |
601 |
on the content of the pattern that apply when using the standard algo- |
on the content of the pattern that apply when using the standard algo- |
602 |
rithm for partial matching do not apply to the DFA algorithm. For non- |
rithm for partial matching do not apply to the alternative algorithm. |
603 |
anchored patterns, the starting position of a partial match is avail- |
For non-anchored patterns, the starting position of a partial match is |
604 |
able. |
available. |
605 |
|
|
606 |
3. Because the DFA algorithm scans the subject string just once, and |
3. Because the alternative algorithm scans the subject string just |
607 |
never needs to backtrack, it is possible to pass very long subject |
once, and never needs to backtrack, it is possible to pass very long |
608 |
strings to the matching function in several pieces, checking for par- |
subject strings to the matching function in several pieces, checking |
609 |
tial matching each time. |
for partial matching each time. |
610 |
|
|
611 |
|
|
612 |
DISADVANTAGES OF THE DFA ALGORITHM |
DISADVANTAGES OF THE ALTERNATIVE ALGORITHM |
613 |
|
|
614 |
The DFA algorithm suffers from a number of disadvantages: |
The alternative algorithm suffers from a number of disadvantages: |
615 |
|
|
616 |
1. It is substantially slower than the standard algorithm. This is |
1. It is substantially slower than the standard algorithm. This is |
617 |
partly because it has to search for all possible matches, but is also |
partly because it has to search for all possible matches, but is also |
618 |
because it is less susceptible to optimization. |
because it is less susceptible to optimization. |
619 |
|
|
620 |
2. Capturing parentheses and back references are not supported. |
2. Capturing parentheses and back references are not supported. |
621 |
|
|
622 |
3. The "atomic group" feature of PCRE regular expressions is supported, |
3. Although atomic groups are supported, their use does not provide the |
623 |
but does not provide the advantage that it does for the standard algo- |
performance advantage that it does for the standard algorithm. |
|
rithm. |
|
624 |
|
|
625 |
Last updated: 06 June 2006 |
Last updated: 24 November 2006 |
626 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
627 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
628 |
|
|
717 |
PCRE API OVERVIEW |
PCRE API OVERVIEW |
718 |
|
|
719 |
PCRE has its own native API, which is described in this document. There |
PCRE has its own native API, which is described in this document. There |
720 |
is also a set of wrapper functions that correspond to the POSIX regular |
are also some wrapper functions that correspond to the POSIX regular |
721 |
expression API. These are described in the pcreposix documentation. |
expression API. These are described in the pcreposix documentation. |
722 |
Both of these APIs define a set of C function calls. A C++ wrapper is |
Both of these APIs define a set of C function calls. A C++ wrapper is |
723 |
distributed with PCRE. It is documented in the pcrecpp page. |
distributed with PCRE. It is documented in the pcrecpp page. |
740 |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
A second matching function, pcre_dfa_exec(), which is not Perl-compati- |
741 |
ble, is also provided. This uses a different algorithm for the match- |
ble, is also provided. This uses a different algorithm for the match- |
742 |
ing. The alternative algorithm finds all possible matches (at a given |
ing. The alternative algorithm finds all possible matches (at a given |
743 |
point in the subject). However, this algorithm does not return captured |
point in the subject), and scans the subject just once. However, this |
744 |
substrings. A description of the two matching algorithms and their |
algorithm does not return captured substrings. A description of the two |
745 |
advantages and disadvantages is given in the pcrematching documenta- |
matching algorithms and their advantages and disadvantages is given in |
746 |
tion. |
the pcrematching documentation. |
747 |
|
|
748 |
In addition to the main compiling and matching functions, there are |
In addition to the main compiling and matching functions, there are |
749 |
convenience functions for extracting captured substrings from a subject |
convenience functions for extracting captured substrings from a subject |
804 |
|
|
805 |
|
|
806 |
NEWLINES |
NEWLINES |
807 |
PCRE supports three different conventions for indicating line breaks in |
|
808 |
strings: a single CR character, a single LF character, or the two-char- |
PCRE supports four different conventions for indicating line breaks in |
809 |
acter sequence CRLF. All three are used as "standard" by different |
strings: a single CR (carriage return) character, a single LF (line- |
810 |
operating systems. When PCRE is built, a default can be specified. The |
feed) character, the two-character sequence CRLF, or any Unicode new- |
811 |
default default is LF, which is the Unix standard. When PCRE is run, |
line sequence. The Unicode newline sequences are the three just men- |
812 |
the default can be overridden, either when a pattern is compiled, or |
tioned, plus the single characters VT (vertical tab, U+000B), FF (form- |
813 |
when it is matched. |
feed, U+000C), NEL (next line, U+0085), LS (line separator, U+2028), |
814 |
|
and PS (paragraph separator, U+2029). |
815 |
|
|
816 |
|
Each of the first three conventions is used by at least one operating |
817 |
|
system as its standard newline sequence. When PCRE is built, a default |
818 |
|
can be specified. The default default is LF, which is the Unix stan- |
819 |
|
dard. When PCRE is run, the default can be overridden, either when a |
820 |
|
pattern is compiled, or when it is matched. |
821 |
|
|
822 |
In the PCRE documentation the word "newline" is used to mean "the char- |
In the PCRE documentation the word "newline" is used to mean "the char- |
823 |
acter or pair of characters that indicate a line break". |
acter or pair of characters that indicate a line break". The choice of |
824 |
|
newline convention affects the handling of the dot, circumflex, and |
825 |
|
dollar metacharacters, the handling of #-comments in /x mode, and, when |
826 |
|
CRLF is a recognized line ending sequence, the match position advance- |
827 |
|
ment for a non-anchored pattern. The choice of newline convention does |
828 |
|
not affect the interpretation of the \n or \r escape sequences. |
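
As a rough illustration of the four conventions, the sketch below reports how many bytes a newline sequence occupies at a given position. It is not taken from the PCRE sources: the enum, the function name, and the treatment of NEL as the single byte 0x85 outside UTF-8 mode are assumptions made for this example.

```c
#include <stddef.h>

enum newline_convention { NL_CR, NL_LF, NL_CRLF, NL_ANY };

/*
 * Returns the length in bytes of the newline sequence that starts at
 * s[pos] under the given convention, or 0 if there is none.
 * utf8 != 0 means the subject is to be read as UTF-8.
 */
size_t newline_length(const unsigned char *s, size_t len, size_t pos,
                      enum newline_convention conv, int utf8)
{
    if (pos >= len)
        return 0;

    unsigned char c = s[pos];
    int crlf = (c == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);

    switch (conv) {
    case NL_CR:   return c == 0x0D ? 1 : 0;
    case NL_LF:   return c == 0x0A ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_ANY:
        if (crlf)
            return 2;
        if (c >= 0x0A && c <= 0x0D)          /* LF, VT, FF, CR */
            return 1;
        if (!utf8)                           /* assumption: NEL as one byte */
            return c == 0x85 ? 1 : 0;
        if (c == 0xC2 && pos + 1 < len && s[pos + 1] == 0x85)
            return 2;                        /* NEL, U+0085 */
        if (c == 0xE2 && pos + 2 < len && s[pos + 1] == 0x80 &&
            (s[pos + 2] == 0xA8 || s[pos + 2] == 0xA9))
            return 3;                        /* LS U+2028, PS U+2029 */
        return 0;
    }
    return 0;
}
```

Note that under the LF convention a CR is ordinary data, so in the two-byte sequence CR LF only the second byte counts as a newline.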
829 |
|
|
830 |
|
|
831 |
MULTITHREADING |
MULTITHREADING |
832 |
|
|
833 |
The PCRE functions can be used in multi-threading applications, with |
The PCRE functions can be used in multi-threading applications, with |
834 |
the proviso that the memory management functions pointed to by |
the proviso that the memory management functions pointed to by |
835 |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the |
836 |
callout function pointed to by pcre_callout, are shared by all threads. |
callout function pointed to by pcre_callout, are shared by all threads. |
837 |
|
|
838 |
The compiled form of a regular expression is not altered during match- |
The compiled form of a regular expression is not altered during match- |
839 |
ing, so the same compiled pattern can safely be used by several threads |
ing, so the same compiled pattern can safely be used by several threads |
840 |
at once. |
at once. |
841 |
|
|
843 |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
SAVING PRECOMPILED PATTERNS FOR LATER USE |
844 |
|
|
845 |
The compiled form of a regular expression can be saved and re-used at a |
The compiled form of a regular expression can be saved and re-used at a |
846 |
later time, possibly by a different program, and even on a host other |
later time, possibly by a different program, and even on a host other |
847 |
than the one on which it was compiled. Details are given in the |
than the one on which it was compiled. Details are given in the |
848 |
pcreprecompile documentation. |
pcreprecompile documentation. |
849 |
|
|
850 |
|
|
852 |
|
|
853 |
int pcre_config(int what, void *where); |
int pcre_config(int what, void *where); |
854 |
|
|
855 |
The function pcre_config() makes it possible for a PCRE client to dis- |
The function pcre_config() makes it possible for a PCRE client to dis- |
856 |
cover which optional features have been compiled into the PCRE library. |
cover which optional features have been compiled into the PCRE library. |
857 |
The pcrebuild documentation has more details about these optional fea- |
The pcrebuild documentation has more details about these optional fea- |
858 |
tures. |
tures. |
859 |
|
|
860 |
The first argument for pcre_config() is an integer, specifying which |
The first argument for pcre_config() is an integer, specifying which |
861 |
information is required; the second argument is a pointer to a variable |
information is required; the second argument is a pointer to a variable |
862 |
into which the information is placed. The following information is |
into which the information is placed. The following information is |
863 |
available: |
available: |
864 |
|
|
865 |
PCRE_CONFIG_UTF8 |
PCRE_CONFIG_UTF8 |
866 |
|
|
867 |
The output is an integer that is set to one if UTF-8 support is avail- |
The output is an integer that is set to one if UTF-8 support is avail- |
868 |
able; otherwise it is set to zero. |
able; otherwise it is set to zero. |
869 |
|
|
870 |
PCRE_CONFIG_UNICODE_PROPERTIES |
PCRE_CONFIG_UNICODE_PROPERTIES |
871 |
|
|
872 |
The output is an integer that is set to one if support for Unicode |
The output is an integer that is set to one if support for Unicode |
873 |
character properties is available; otherwise it is set to zero. |
character properties is available; otherwise it is set to zero. |
874 |
|
|
875 |
PCRE_CONFIG_NEWLINE |
PCRE_CONFIG_NEWLINE |
876 |
|
|
877 |
The output is an integer whose value specifies the default character |
The output is an integer whose value specifies the default character |
878 |
sequence that is recognized as meaning "newline". The three values that |
sequence that is recognized as meaning "newline". The four values that |
879 |
are supported are: 10 for LF, 13 for CR, and 3338 for CRLF. The default |
are supported are: 10 for LF, 13 for CR, 3338 for CRLF, and -1 for ANY. |
880 |
should normally be the standard sequence for your operating system. |
The default should normally be the standard sequence for your operating |
881 |
|
system. |
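
The numeric values listed above can be decoded with a few lines of C. The helper below is a sketch for illustration only; its name is an assumption and it is not part of the PCRE API.

```c
#include <stddef.h>

/*
 * Maps the integer reported for PCRE_CONFIG_NEWLINE to a short name.
 * The two-character value packs CR (13) and LF (10) into one number:
 * 13 * 256 + 10 = 3338. Returns NULL for any other value.
 */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";   /* 3338 */
    case -1:             return "ANY";
    default:             return NULL;
    }
}
```

In a program linked against PCRE, the value would be obtained with a call of the form pcre_config(PCRE_CONFIG_NEWLINE, &value), where value is an int, and then passed to a helper such as this one.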
882 |
|
|
883 |
PCRE_CONFIG_LINK_SIZE |
PCRE_CONFIG_LINK_SIZE |
884 |
|
|
947 |
fully relocatable, because it may contain a copy of the tableptr argu- |
fully relocatable, because it may contain a copy of the tableptr argu- |
948 |
ment, which is an address (see below). |
ment, which is an address (see below). |
949 |
|
|
950 |
The options argument contains independent bits that affect the compila- |
The options argument contains various bit settings that affect the com- |
951 |
tion. It should be zero if no options are required. The available |
pilation. It should be zero if no options are required. The available |
952 |
options are described below. Some of them, in particular, those that |
options are described below. Some of them, in particular, those that |
953 |
are compatible with Perl, can also be set and unset from within the |
are compatible with Perl, can also be set and unset from within the |
954 |
pattern (see the detailed description in the pcrepattern documenta- |
pattern (see the detailed description in the pcrepattern documenta- |
1038 |
not match when the current position is at a newline. This option is |
not match when the current position is at a newline. This option is |
1039 |
equivalent to Perl's /s option, and it can be changed within a pattern |
equivalent to Perl's /s option, and it can be changed within a pattern |
1040 |
by a (?s) option setting. A negative class such as [^a] always matches |
by a (?s) option setting. A negative class such as [^a] always matches |
1041 |
newlines, independent of the setting of this option. |
newline characters, independent of the setting of this option. |
1042 |
|
|
1043 |
PCRE_DUPNAMES |
PCRE_DUPNAMES |
1044 |
|
|
1102 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1103 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
1104 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
1105 |
|
PCRE_NEWLINE_ANY |
1106 |
|
|
1107 |
These options override the default newline definition that was chosen |
These options override the default newline definition that was chosen |
1108 |
when PCRE was built. Setting the first or the second specifies that a |
when PCRE was built. Setting the first or the second specifies that a |
1109 |
newline is indicated by a single character (CR or LF, respectively). |
newline is indicated by a single character (CR or LF, respectively). |
1110 |
Setting both of them specifies that a newline is indicated by the two- |
Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by the |
1111 |
character CRLF sequence. For convenience, PCRE_NEWLINE_CRLF is defined |
two-character CRLF sequence. Setting PCRE_NEWLINE_ANY specifies that |
1112 |
to contain both bits. The only time that a line break is relevant when |
any Unicode newline sequence should be recognized. The Unicode newline |
1113 |
compiling a pattern is if PCRE_EXTENDED is set, and an unescaped # out- |
sequences are the three just mentioned, plus the single characters VT |
1114 |
side a character class is encountered. This indicates a comment that |
(vertical tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), |
1115 |
lasts until after the next newline. |
LS (line separator, U+2028), and PS (paragraph separator, U+2029). The |
1116 |
|
last two are recognized only in UTF-8 mode. |
1117 |
|
|
1118 |
|
The newline setting in the options word uses three bits that are |
1119 |
|
treated as a number, giving eight possibilities. Currently only five |
1120 |
|
are used (default plus the four values above). This means that if you |
1121 |
|
set more than one newline option, the combination may or may not be |
1122 |
|
sensible. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equiva- |
1123 |
|
lent to PCRE_NEWLINE_CRLF, but other combinations yield unused numbers |
1124 |
|
and cause an error. |
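
The three-bit field can be pictured with the following sketch. The macro names are deliberately not the real ones, and the bit values are what I believe pcre.h uses in the 7.x series; treat both as assumptions and check your own header before relying on them.

```c
/* Assumed bit values for the four newline options (check pcre.h). */
#define NL_OPT_CR    0x00100000UL
#define NL_OPT_LF    0x00200000UL
#define NL_OPT_CRLF  0x00300000UL
#define NL_OPT_ANY   0x00400000UL
#define NL_OPT_MASK  0x00700000UL   /* the three-bit field */

/*
 * Extracts the newline field as a number 0..7 and checks that it is one
 * of the five values in use: 0 (build-time default), 1 (CR), 2 (LF),
 * 3 (CRLF), 4 (ANY). Returns the number, or -1 for an unused one.
 */
int newline_field(unsigned long options)
{
    int n = (int)((options & NL_OPT_MASK) >> 20);
    return n <= 4 ? n : -1;
}
```

Under these assumed values, OR-ing the CR and LF bits gives the same number as the CRLF option, whereas OR-ing CR with ANY gives 5, which is one of the unused numbers.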
1125 |
|
|
1126 |
|
The only time that a line break is specially recognized when compiling |
1127 |
|
a pattern is if PCRE_EXTENDED is set, and an unescaped # outside a |
1128 |
|
character class is encountered. This indicates a comment that lasts |
1129 |
|
until after the next line break sequence. In other circumstances, line |
1130 |
|
break sequences are treated as literal data, except that in |
1131 |
|
PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters |
1132 |
|
and are therefore ignored. |
1133 |
|
|
1134 |
The newline option set at compile time becomes the default that is used |
The newline option that is set at compile time becomes the default that |
1135 |
for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden. |
1136 |
|
|
1137 |
PCRE_NO_AUTO_CAPTURE |
PCRE_NO_AUTO_CAPTURE |
1138 |
|
|
1175 |
|
|
1176 |
The following table lists the error codes that may be returned by |
The following table lists the error codes that may be returned by |
1177 |
pcre_compile2(), along with the error messages that may be returned by |
pcre_compile2(), along with the error messages that may be returned by |
1178 |
both compiling functions. |
both compiling functions. As PCRE has developed, some error codes have |
1179 |
|
fallen out of use. To avoid confusion, they have not been re-used. |
1180 |
|
|
1181 |
0 no error |
0 no error |
1182 |
1 \ at end of pattern |
1 \ at end of pattern |
1188 |
7 invalid escape sequence in character class |
7 invalid escape sequence in character class |
1189 |
8 range out of order in character class |
8 range out of order in character class |
1190 |
9 nothing to repeat |
9 nothing to repeat |
1191 |
10 operand of unlimited repeat could match the empty string |
10 [this code is not in use] |
1192 |
11 internal error: unexpected repeat |
11 internal error: unexpected repeat |
1193 |
12 unrecognized character after (? |
12 unrecognized character after (? |
1194 |
13 POSIX named classes are supported only within a class |
13 POSIX named classes are supported only within a class |
1197 |
16 erroffset passed as NULL |
16 erroffset passed as NULL |
1198 |
17 unknown option bit(s) set |
17 unknown option bit(s) set |
1199 |
18 missing ) after comment |
18 missing ) after comment |
1200 |
19 parentheses nested too deeply |
19 [this code is not in use] |
1201 |
20 regular expression too large |
20 regular expression too large |
1202 |
21 failed to get memory |
21 failed to get memory |
1203 |
22 unmatched parentheses |
22 unmatched parentheses |
1211 |
30 unknown POSIX class name |
30 unknown POSIX class name |
1212 |
31 POSIX collating elements are not supported |
31 POSIX collating elements are not supported |
1213 |
32 this version of PCRE is not compiled with PCRE_UTF8 support |
32 this version of PCRE is not compiled with PCRE_UTF8 support |
1214 |
33 spare error |
33 [this code is not in use] |
1215 |
34 character value in \x{...} sequence is too large |
34 character value in \x{...} sequence is too large |
1216 |
35 invalid condition (?(0) |
35 invalid condition (?(0) |
1217 |
36 \C not allowed in lookbehind assertion |
36 \C not allowed in lookbehind assertion |
1220 |
39 closing ) for (?C expected |
39 closing ) for (?C expected |
1221 |
40 recursive call could loop indefinitely |
40 recursive call could loop indefinitely |
1222 |
41 unrecognized character after (?P |
41 unrecognized character after (?P |
1223 |
42 syntax error after (?P |
42 syntax error in subpattern name (missing terminator) |
1224 |
43 two named subpatterns have the same name |
43 two named subpatterns have the same name |
1225 |
44 invalid UTF-8 string |
44 invalid UTF-8 string |
1226 |
45 support for \P, \p, and \X has not been compiled |
45 support for \P, \p, and \X has not been compiled |
1230 |
49 too many named subpatterns (maximum 10,000) |
49 too many named subpatterns (maximum 10,000) |
1231 |
50 repeated subpattern is too long |
50 repeated subpattern is too long |
1232 |
51 octal value is greater than \377 (not in UTF-8 mode) |
51 octal value is greater than \377 (not in UTF-8 mode) |
1233 |
|
52 internal error: overran compiling workspace |
1234 |
|
53 internal error: previously-checked referenced subpattern not |
1235 |
|
found |
1236 |
|
54 DEFINE group contains more than one branch |
1237 |
|
55 repeating a DEFINE group is not allowed |
1238 |
|
56 inconsistent NEWLINE options |
1239 |
|
|
1240 |
|
|
1241 |
STUDYING A PATTERN |
STUDYING A PATTERN |
1394 |
is still recognized for backwards compatibility.) |
is still recognized for backwards compatibility.) |
1395 |
|
|
1396 |
If there is a fixed first byte, for example, from a pattern such as |
If there is a fixed first byte, for example, from a pattern such as |
1397 |
(cat|cow|coyote). Otherwise, if either |
(cat|cow|coyote), its value is returned. Otherwise, if either |
1398 |
|
|
1399 |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
(a) the pattern was compiled with the PCRE_MULTILINE option, and every |
1400 |
branch starts with "^", or |
branch starts with "^", or |
1451 |
PCRE_EXTENDED is set, so white space - including newlines - is |
PCRE_EXTENDED is set, so white space - including newlines - is |
1452 |
ignored): |
ignored): |
1453 |
|
|
1454 |
(?P<date> (?P<year>(\d\d)?\d\d) - |
(?<date> (?<year>(\d\d)?\d\d) - |
1455 |
(?P<month>\d\d) - (?P<day>\d\d) ) |
(?<month>\d\d) - (?<day>\d\d) ) |
1456 |
|
|
1457 |
There are four named subpatterns, so the table has four entries, and |
There are four named subpatterns, so the table has four entries, and |
1458 |
each entry in the table is eight bytes long. The table is as follows, |
each entry in the table is eight bytes long. The table is as follows, |
1679 |
PCRE_NEWLINE_CR |
PCRE_NEWLINE_CR |
1680 |
PCRE_NEWLINE_LF |
PCRE_NEWLINE_LF |
1681 |
PCRE_NEWLINE_CRLF |
PCRE_NEWLINE_CRLF |
1682 |
|
PCRE_NEWLINE_ANY |
1683 |
|
|
1684 |
These options override the newline definition that was chosen or |
These options override the newline definition that was chosen or |
1685 |
defaulted when the pattern was compiled. For details, see the descrip- |
defaulted when the pattern was compiled. For details, see the descrip- |
1686 |
tion pcre_compile() above. During matching, the newline choice affects |
tion of pcre_compile() above. During matching, the newline choice |
1687 |
the behaviour of the dot, circumflex, and dollar metacharacters. |
affects the behaviour of the dot, circumflex, and dollar metacharac- |
1688 |
|
ters. It may also alter the way the match position is advanced after a |
1689 |
|
match failure for an unanchored pattern. When PCRE_NEWLINE_CRLF or |
1690 |
|
PCRE_NEWLINE_ANY is set, and a match attempt fails when the current |
1691 |
|
position is at a CRLF sequence, the match position is advanced by two |
1692 |
|
characters instead of one, in other words, to after the CRLF. |
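
The advance rule just described amounts to the following, shown here as a sketch rather than as PCRE's own code. The function name is an assumption, and the sketch assumes single-byte characters; in UTF-8 mode the ordinary advance is by one character, not one byte.

```c
#include <stddef.h>

/*
 * Returns the next start offset to try after a failed match attempt at
 * pos. crlf_is_newline != 0 models the PCRE_NEWLINE_CRLF and
 * PCRE_NEWLINE_ANY settings, where a CRLF at pos is skipped as a unit.
 */
size_t next_start_offset(const char *s, size_t len, size_t pos,
                         int crlf_is_newline)
{
    if (crlf_is_newline && pos + 1 < len &&
        s[pos] == '\r' && s[pos + 1] == '\n')
        return pos + 2;                 /* step over the whole CRLF */
    return pos + 1;
}
```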
1693 |
|
|
1694 |
PCRE_NOTBOL |
PCRE_NOTBOL |
1695 |
|
|
1696 |
This option specifies that the first character of the subject string is not |
This option specifies that the first character of the subject string is not |
1697 |
the beginning of a line, so the circumflex metacharacter should not |
the beginning of a line, so the circumflex metacharacter should not |
1698 |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
match before it. Setting this without PCRE_MULTILINE (at compile time) |
1699 |
causes circumflex never to match. This option affects only the behav- |
causes circumflex never to match. This option affects only the behav- |
1700 |
iour of the circumflex metacharacter. It does not affect \A. |
iour of the circumflex metacharacter. It does not affect \A. |
1701 |
|
|
1702 |
PCRE_NOTEOL |
PCRE_NOTEOL |
1703 |
|
|
1704 |
This option specifies that the end of the subject string is not the end |
This option specifies that the end of the subject string is not the end |
1705 |
of a line, so the dollar metacharacter should not match it nor (except |
of a line, so the dollar metacharacter should not match it nor (except |
1706 |
in multiline mode) a newline immediately before it. Setting this with- |
in multiline mode) a newline immediately before it. Setting this with- |
1707 |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
out PCRE_MULTILINE (at compile time) causes dollar never to match. This |
1708 |
option affects only the behaviour of the dollar metacharacter. It does |
option affects only the behaviour of the dollar metacharacter. It does |
1709 |
not affect \Z or \z. |
not affect \Z or \z. |
1710 |
|
|
1711 |
PCRE_NOTEMPTY |
PCRE_NOTEMPTY |
1712 |
|
|
1713 |
An empty string is not considered to be a valid match if this option is |
An empty string is not considered to be a valid match if this option is |
1714 |
set. If there are alternatives in the pattern, they are tried. If all |
set. If there are alternatives in the pattern, they are tried. If all |
1715 |
the alternatives match the empty string, the entire match fails. For |
the alternatives match the empty string, the entire match fails. For |
1716 |
example, if the pattern |
example, if the pattern |
1717 |
|
|
1718 |
a?b? |
a?b? |
1719 |
|
|
1720 |
is applied to a string not beginning with "a" or "b", it matches the |
is applied to a string not beginning with "a" or "b", it matches the |
1721 |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
empty string at the start of the subject. With PCRE_NOTEMPTY set, this |
1722 |
match is not valid, so PCRE searches further into the string for occur- |
match is not valid, so PCRE searches further into the string for occur- |
1723 |
rences of "a" or "b". |
rences of "a" or "b". |
1724 |
|
|
1725 |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe- |
1726 |
cial case of a pattern match of the empty string within its split() |
cial case of a pattern match of the empty string within its split() |
1727 |
function, and when using the /g modifier. It is possible to emulate |
function, and when using the /g modifier. It is possible to emulate |
1728 |
Perl's behaviour after matching a null string by first trying the match |
Perl's behaviour after matching a null string by first trying the match |
1729 |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then |
1730 |
if that fails by advancing the starting offset (see below) and trying |
if that fails by advancing the starting offset (see below) and trying |
1731 |
an ordinary match again. There is some code that demonstrates how to do |
an ordinary match again. There is some code that demonstrates how to do |
1732 |
this in the pcredemo.c sample program. |
this in the pcredemo.c sample program. |
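
The retry-then-advance procedure described above can be sketched without the library. This is a minimal model, not the code from pcredemo.c: DEMO_NOTEMPTY, DEMO_ANCHORED, demo_match_fn, demo_match_a_star() and demo_count_matches() are all hypothetical names, and the toy matcher only understands the pattern a*. In real code the match call would be pcre_exec() with the PCRE_NOTEMPTY and PCRE_ANCHORED options.

```c
/* Hypothetical stand-ins for PCRE_NOTEMPTY and PCRE_ANCHORED; the real
   bit values come from pcre.h. */
#define DEMO_NOTEMPTY 0x1
#define DEMO_ANCHORED 0x2

/* Hypothetical matcher signature: returns 1 and fills ov[0] and ov[1]
   on success, or a negative value when there is no match. */
typedef int (*demo_match_fn)(const char *subject, int length,
                             int startoffset, int options, int *ov);

/* Toy matcher for the pattern a* (zero or more 'a' characters). It
   honours the two demo options in the way the text describes. */
int demo_match_a_star(const char *subject, int length,
                      int startoffset, int options, int *ov)
{
    int pos, end;
    for (pos = startoffset; pos <= length; pos++) {
        end = pos;
        while (end < length && subject[end] == 'a') end++;
        /* An empty match is rejected when DEMO_NOTEMPTY is set. */
        if (end > pos || !(options & DEMO_NOTEMPTY)) {
            ov[0] = pos;
            ov[1] = end;
            return 1;
        }
        /* An anchored attempt tries the starting position only. */
        if (options & DEMO_ANCHORED) break;
    }
    return -1;
}

/* Counts matches the way Perl's /g modifier would. After an empty
   match, retry at the same offset with NOTEMPTY and ANCHORED; if that
   fails, advance by one and go back to an ordinary match. */
int demo_count_matches(demo_match_fn match, const char *subject, int length)
{
    int ov[2];
    int count = 0, start = 0, options = 0;

    while (start <= length) {
        int rc = match(subject, length, start, options, ov);
        if (rc < 0) {
            if (options == 0) break;  /* ordinary match failed: finished */
            start++;                  /* byte advance; UTF-8 needs a full character */
            options = 0;
            continue;
        }
        count++;
        start = ov[1];
        options = (ov[0] == ov[1]) ? (DEMO_NOTEMPTY | DEMO_ANCHORED) : 0;
    }
    return count;
}
```

With the subject "baab" this model reports four matches: an empty string at offset 0, "aa" at offset 1, and empty strings at offsets 3 and 4.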
1733 |
|
|
1734 |
PCRE_NO_UTF8_CHECK |
PCRE_NO_UTF8_CHECK |
1735 |
|
|
1736 |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
When PCRE_UTF8 is set at compile time, the validity of the subject as a |
1737 |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
UTF-8 string is automatically checked when pcre_exec() is subsequently |
1738 |
called. The value of startoffset is also checked to ensure that it |
called. The value of startoffset is also checked to ensure that it |
1739 |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
points to the start of a UTF-8 character. If an invalid UTF-8 sequence |
1740 |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If |
1741 |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
startoffset contains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is |
1742 |
returned. |
returned. |
1743 |
|
|
1744 |
If you already know that your subject is valid, and you want to skip |
If you already know that your subject is valid, and you want to skip |
1745 |
these checks for performance reasons, you can set the |
these checks for performance reasons, you can set the |
1746 |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might want to |
1747 |
do this for the second and subsequent calls to pcre_exec() if you are |
do this for the second and subsequent calls to pcre_exec() if you are |
1748 |
making repeated calls to find all the matches in a single subject |
making repeated calls to find all the matches in a single subject |
1749 |
string. However, you should be sure that the value of startoffset |
string. However, you should be sure that the value of startoffset |
1750 |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
points to the start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is |
1751 |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
set, the effect of passing an invalid UTF-8 string as a subject, or a |
1752 |
value of startoffset that does not point to the start of a UTF-8 char- |
value of startoffset that does not point to the start of a UTF-8 char- |
1753 |
acter, is undefined. Your program may crash. |
acter, is undefined. Your program may crash. |
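
Before setting PCRE_NO_UTF8_CHECK, a caller may want its own cheap test that startoffset does not fall in the middle of a character. The following is a sketch under stated assumptions: demo_is_utf8_char_start() is a hypothetical helper, not part of PCRE, and it checks only the byte at the offset. It does not validate the subject as a whole, and it treats an offset equal to the subject length as acceptable.

```c
/* Returns 1 when offset looks like the start of a UTF-8 character in
   subject[0..length), and 0 otherwise. UTF-8 continuation bytes have
   the bit pattern 10xxxxxx, so any other byte can start a character. */
int demo_is_utf8_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length) return 0;
    if (offset == length) return 1;   /* end of subject (assumed acceptable) */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

For the six-byte subject "h\xC3\xA9llo", offsets 1 and 3 pass this test, while offset 2 points at the continuation byte 0xA9 and fails.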
1754 |
|
|
1755 |
PCRE_PARTIAL |
PCRE_PARTIAL |
1756 |
|
|
1757 |
This option turns on the partial matching feature. If the subject |
This option turns on the partial matching feature. If the subject |
1758 |
string fails to match the pattern, but at some point during the match- |
string fails to match the pattern, but at some point during the match- |
1759 |
ing process the end of the subject was reached (that is, the subject |
ing process the end of the subject was reached (that is, the subject |
1760 |
partially matches the pattern and the failure to match occurred only |
partially matches the pattern and the failure to match occurred only |
1761 |
because there were not enough subject characters), pcre_exec() returns |
because there were not enough subject characters), pcre_exec() returns |
1762 |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL is |
1763 |
used, there are restrictions on what may appear in the pattern. These |
used, there are restrictions on what may appear in the pattern. These |
1764 |
are discussed in the pcrepartial documentation. |
are discussed in the pcrepartial documentation. |
1765 |
|
|
1766 |
The string to be matched by pcre_exec() |
The string to be matched by pcre_exec() |
1767 |
|
|
1768 |
The subject string is passed to pcre_exec() as a pointer in subject, a |
The subject string is passed to pcre_exec() as a pointer in subject, a |
1769 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
length in length, and a starting byte offset in startoffset. In UTF-8 |
1770 |
mode, the byte offset must point to the start of a UTF-8 character. |
mode, the byte offset must point to the start of a UTF-8 character. |
1771 |
Unlike the pattern string, the subject may contain binary zero bytes. |
Unlike the pattern string, the subject may contain binary zero bytes. |
1772 |
When the starting offset is zero, the search for a match starts at the |
When the starting offset is zero, the search for a match starts at the |
1773 |
beginning of the subject, and this is by far the most common case. |
beginning of the subject, and this is by far the most common case. |
1774 |
|
|
1775 |
A non-zero starting offset is useful when searching for another match |
A non-zero starting offset is useful when searching for another match |
1776 |
in the same subject by calling pcre_exec() again after a previous suc- |
in the same subject by calling pcre_exec() again after a previous suc- |
1777 |
cess. Setting startoffset differs from just passing over a shortened |
cess. Setting startoffset differs from just passing over a shortened |
1778 |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
string and setting PCRE_NOTBOL in the case of a pattern that begins |
1779 |
with any kind of lookbehind. For example, consider the pattern |
with any kind of lookbehind. For example, consider the pattern |
1780 |
|
|
1781 |
\Biss\B |
\Biss\B |
1782 |
|
|
1783 |
which finds occurrences of "iss" in the middle of words. (\B matches |
which finds occurrences of "iss" in the middle of words. (\B matches |
1784 |
only if the current position in the subject is not a word boundary.) |
only if the current position in the subject is not a word boundary.) |
1785 |
When applied to the string "Mississippi" the first call to pcre_exec() |
When applied to the string "Mississippi" the first call to pcre_exec() |
1786 |
finds the first occurrence. If pcre_exec() is called again with just |
finds the first occurrence. If pcre_exec() is called again with just |
1787 |
the remainder of the subject, namely "issippi", it does not match, |
the remainder of the subject, namely "issippi", it does not match, |
1788 |
because \B is always false at the start of the subject, which is deemed |
because \B is always false at the start of the subject, which is deemed |
1789 |
to be a word boundary. However, if pcre_exec() is passed the entire |
to be a word boundary. However, if pcre_exec() is passed the entire |
1790 |
string again, but with startoffset set to 4, it finds the second occur- |
string again, but with startoffset set to 4, it finds the second occur- |
1791 |
rence of "iss" because it is able to look behind the starting point to |
rence of "iss" because it is able to look behind the starting point to |
1792 |
discover that it is preceded by a letter. |
discover that it is preceded by a letter. |
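
The difference between a starting offset and a shortened subject can be modelled by hand. The code below is an illustration only and does not call PCRE: demo_not_boundary() and demo_find_inner_iss() are hypothetical helpers that imitate \B and the pattern \Biss\B, treating letters, digits and underscore as word characters.

```c
#include <ctype.h>
#include <string.h>

/* 1 if position i lies inside the subject and holds a word character. */
static int demo_is_word(const char *s, int len, int i)
{
    return i >= 0 && i < len &&
           (isalnum((unsigned char)s[i]) || s[i] == '_');
}

/* Models \B: true when the characters on both sides of pos are both
   word characters or both non-word characters. Positions outside the
   subject count as non-word, so offset 0 of a word is a boundary. */
int demo_not_boundary(const char *s, int len, int pos)
{
    return demo_is_word(s, len, pos - 1) == demo_is_word(s, len, pos);
}

/* Models a search for \Biss\B beginning at startoffset. Returns the
   offset of the match, or -1 if there is none. The whole subject stays
   visible, so the test at startoffset can look at the previous byte. */
int demo_find_inner_iss(const char *s, int len, int startoffset)
{
    int pos;
    for (pos = startoffset; pos + 3 <= len; pos++) {
        if (demo_not_boundary(s, len, pos) &&
            memcmp(s + pos, "iss", 3) == 0 &&
            demo_not_boundary(s, len, pos + 3))
            return pos;
    }
    return -1;
}
```

Called on "Mississippi" with a starting offset of 4, the model finds the second "iss" at offset 4. Called on the shortened subject (the pointer advanced by 4, length 7, starting offset 0), it finds nothing, because the start of the shortened subject is treated as a word boundary.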
1793 |
|
|
1794 |
If a non-zero starting offset is passed when the pattern is anchored, |
If a non-zero starting offset is passed when the pattern is anchored, |
1795 |
one attempt to match at the given offset is made. This can only succeed |
one attempt to match at the given offset is made. This can only succeed |
1796 |
if the pattern does not require the match to be at the start of the |
if the pattern does not require the match to be at the start of the |
1797 |
subject. |
subject. |
1798 |
|
|
1799 |
How pcre_exec() returns captured substrings |
How pcre_exec() returns captured substrings |
1800 |
|
|
1801 |
In general, a pattern matches a certain portion of the subject, and in |
In general, a pattern matches a certain portion of the subject, and in |
1802 |
addition, further substrings from the subject may be picked out by |
addition, further substrings from the subject may be picked out by |
1803 |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
parts of the pattern. Following the usage in Jeffrey Friedl's book, |
1804 |
this is called "capturing" in what follows, and the phrase "capturing |
this is called "capturing" in what follows, and the phrase "capturing |
1805 |
subpattern" is used for a fragment of a pattern that picks out a sub- |
subpattern" is used for a fragment of a pattern that picks out a sub- |
1806 |
string. PCRE supports several other kinds of parenthesized subpattern |
string. PCRE supports several other kinds of parenthesized subpattern |
1807 |
that do not cause substrings to be captured. |
that do not cause substrings to be captured. |
1808 |
|
|
1809 |
Captured substrings are returned to the caller via a vector of integer |
Captured substrings are returned to the caller via a vector of integer |
1810 |
offsets whose address is passed in ovector. The number of elements in |
offsets whose address is passed in ovector. The number of elements in |
1811 |
the vector is passed in ovecsize, which must be a non-negative number. |
the vector is passed in ovecsize, which must be a non-negative number. |
1812 |
Note: this argument is NOT the size of ovector in bytes. |
Note: this argument is NOT the size of ovector in bytes. |
1813 |
|
|
1814 |
The first two-thirds of the vector is used to pass back captured sub- |
The first two-thirds of the vector is used to pass back captured sub- |
1815 |
strings, each substring using a pair of integers. The remaining third |
strings, each substring using a pair of integers. The remaining third |
1816 |
of the vector is used as workspace by pcre_exec() while matching cap- |
of the vector is used as workspace by pcre_exec() while matching cap- |
1817 |
turing subpatterns, and is not available for passing back information. |
turing subpatterns, and is not available for passing back information. |
1818 |
The length passed in ovecsize should always be a multiple of three. If |
The length passed in ovecsize should always be a multiple of three. If |
1819 |
it is not, it is rounded down. |
it is not, it is rounded down. |
1820 |
|
|
1821 |
When a match is successful, information about captured substrings is |
When a match is successful, information about captured substrings is |
1822 |
returned in pairs of integers, starting at the beginning of ovector, |
returned in pairs of integers, starting at the beginning of ovector, |
1823 |
and continuing up to two-thirds of its length at the most. The first |
and continuing up to two-thirds of its length at the most. The first |
1824 |
element of a pair is set to the offset of the first character in a sub- |
element of a pair is set to the offset of the first character in a sub- |
1825 |
string, and the second is set to the offset of the first character |
string, and the second is set to the offset of the first character |
1826 |
after the end of a substring. The first pair, ovector[0] and ovec- |
after the end of a substring. The first pair, ovector[0] and ovec- |
1827 |
tor[1], identify the portion of the subject string matched by the |
tor[1], identify the portion of the subject string matched by the |
1828 |
entire pattern. The next pair is used for the first capturing subpat- |
entire pattern. The next pair is used for the first capturing subpat- |
1829 |
tern, and so on. The value returned by pcre_exec() is one more than the |
tern, and so on. The value returned by pcre_exec() is one more than the |
1830 |
highest numbered pair that has been set. For example, if two substrings |
highest numbered pair that has been set. For example, if two substrings |
1831 |
have been captured, the returned value is 3. If there are no capturing |
have been captured, the returned value is 3. If there are no capturing |
1832 |
subpatterns, the return value from a successful match is 1, indicating |
subpatterns, the return value from a successful match is 1, indicating |
1833 |
that just the first pair of offsets has been set. |
that just the first pair of offsets has been set. |
1834 |
|
|
1835 |
If a capturing subpattern is matched repeatedly, it is the last portion |
If a capturing subpattern is matched repeatedly, it is the last portion |
1836 |
of the string that it matched that is returned. |
of the string that it matched that is returned. |
1837 |
|
|
1838 |
If the vector is too small to hold all the captured substring offsets, |
If the vector is too small to hold all the captured substring offsets, |
1839 |
it is used as far as possible (up to two-thirds of its length), and the |
it is used as far as possible (up to two-thirds of its length), and the |
1840 |
function returns a value of zero. In particular, if the substring off- |
function returns a value of zero. In particular, if the substring off- |
1841 |
sets are not of interest, pcre_exec() may be called with ovector passed |
sets are not of interest, pcre_exec() may be called with ovector passed |
1842 |
as NULL and ovecsize as zero. However, if the pattern contains back |
as NULL and ovecsize as zero. However, if the pattern contains back |
1843 |
references and the ovector is not big enough to remember the related |
references and the ovector is not big enough to remember the related |
1844 |
substrings, PCRE has to get additional memory for use during matching. |
substrings, PCRE has to get additional memory for use during matching. |
1845 |
Thus it is usually advisable to supply an ovector. |
Thus it is usually advisable to supply an ovector. |
1846 |
|
|
1847 |
The pcre_info() function can be used to find out how many capturing |
The pcre_info() function can be used to find out how many capturing |
1848 |
subpatterns there are in a compiled pattern. The smallest size for |
subpatterns there are in a compiled pattern. The smallest size for |
1849 |
ovector that will allow for n captured substrings, in addition to the |
ovector that will allow for n captured substrings, in addition to the |
1850 |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
offsets of the substring matched by the whole pattern, is (n+1)*3. |
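
The sizing rules above amount to two small calculations. The helper names here are hypothetical and are not part of the PCRE API. Note that the results count int elements, which is what ovecsize expects, not bytes.

```c
/* Smallest ovector size, in int elements, that can return the whole
   match plus capture_count captured substrings: (n+1)*3. */
int demo_ovector_elements(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that can be returned for a given ovecsize.
   The size is rounded down to a multiple of three and two-thirds of it
   hold pairs, which works out to ovecsize / 3 pairs. */
int demo_usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

For a pattern with two capturing subpatterns, demo_ovector_elements(2) gives 9. An ovecsize of 10 is rounded down to 9 and so still yields three usable pairs.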
1851 |
|
|
1852 |
It is possible for capturing subpattern number n+1 to match some part |
It is possible for capturing subpattern number n+1 to match some part |
1853 |
of the subject when subpattern n has not been used at all. For example, |
of the subject when subpattern n has not been used at all. For example, |
1854 |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
if the string "abc" is matched against the pattern (a|(z))(bc) the |
1855 |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
return from the function is 4, and subpatterns 1 and 3 are matched, but |
1856 |
2 is not. When this happens, both values in the offset pairs corre- |
2 is not. When this happens, both values in the offset pairs corre- |
1857 |
sponding to unused subpatterns are set to -1. |
sponding to unused subpatterns are set to -1. |
1858 |
|
|
1859 |
Offset values that correspond to unused subpatterns at the end of the |
Offset values that correspond to unused subpatterns at the end of the |
1860 |
expression are also set to -1. For example, if the string "abc" is |
expression are also set to -1. For example, if the string "abc" is |
1861 |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not |
1862 |
matched. The return from the function is 2, because the highest used |
matched. The return from the function is 2, because the highest used |
1863 |
capturing subpattern number is 1. However, you can refer to the offsets |
capturing subpattern number is 1. However, you can refer to the offsets |
1864 |
for the second and third capturing subpatterns if you wish (assuming |
for the second and third capturing subpatterns if you wish (assuming |
1865 |
the vector is large enough, of course). |
the vector is large enough, of course). |
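
Because unused subpatterns are reported as -1, code that walks the ovector should test each pair before using it. A minimal sketch, using hypothetical helper names and an ovector filled in by hand to mirror the two examples above instead of calling pcre_exec():

```c
/* 1 if capturing subpattern n has offsets set in ovector, else 0.
   Unused subpatterns have both values of their pair set to -1. */
int demo_capture_is_set(const int *ovector, int n)
{
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}

/* Counts the capturing subpatterns that were actually used, given the
   positive return value rc from a successful match. Pair 0 (the whole
   match) is skipped. */
int demo_count_set_captures(const int *ovector, int rc)
{
    int n, count = 0;
    for (n = 1; n < rc; n++)
        if (demo_capture_is_set(ovector, n)) count++;
    return count;
}
```

For "abc" against (a|(z))(bc) the pairs would be {0,3}, {0,1}, {-1,-1}, {1,3} with a return value of 4, so two of the three subpatterns count as set. For "abc" against (abc)(x(yz)?)? the pairs would be {0,3}, {0,3}, {-1,-1}, {-1,-1} with a return value of 2.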
1866 |
|
|
1867 |
Some convenience functions are provided for extracting the captured |
Some convenience functions are provided for extracting the captured |
1868 |
substrings as separate strings. These are described below. |
substrings as separate strings. These are described below. |
1869 |
|
|
1870 |
Error return values from pcre_exec() |
Error return values from pcre_exec() |
1871 |
|
|
1872 |
If pcre_exec() fails, it returns a negative number. The following are |
If pcre_exec() fails, it returns a negative number. The following are |
1873 |
defined in the header file: |
defined in the header file: |
1874 |
|
|
1875 |
PCRE_ERROR_NOMATCH (-1) |
PCRE_ERROR_NOMATCH (-1) |
1878 |
|
|
1879 |
PCRE_ERROR_NULL (-2) |
PCRE_ERROR_NULL (-2) |
1880 |
|
|
1881 |
Either code or subject was passed as NULL, or ovector was NULL and |
Either code or subject was passed as NULL, or ovector was NULL and |
1882 |
ovecsize was not zero. |
ovecsize was not zero. |
1883 |
|
|
1884 |
PCRE_ERROR_BADOPTION (-3) |
PCRE_ERROR_BADOPTION (-3) |
1887 |
|
|
1888 |
PCRE_ERROR_BADMAGIC (-4) |
PCRE_ERROR_BADMAGIC (-4) |
1889 |
|
|
1890 |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
PCRE stores a 4-byte "magic number" at the start of the compiled code, |
1891 |
to catch the case when it is passed a junk pointer and to detect when a |
to catch the case when it is passed a junk pointer and to detect when a |
1892 |
pattern that was compiled in an environment of one endianness is run in |
pattern that was compiled in an environment of one endianness is run in |
1893 |
an environment with the other endianness. This is the error that PCRE |
an environment with the other endianness. This is the error that PCRE |
1894 |
gives when the magic number is not present. |
gives when the magic number is not present. |
1895 |
|
|
1896 |
PCRE_ERROR_UNKNOWN_NODE (-5) |
PCRE_ERROR_UNKNOWN_OPCODE (-5) |
1897 |
|
|
1898 |
While running the pattern match, an unknown item was encountered in the |
While running the pattern match, an unknown item was encountered in the |
1899 |
compiled pattern. This error could be caused by a bug in PCRE or by |
compiled pattern. This error could be caused by a bug in PCRE or by |
1900 |
overwriting of the compiled pattern. |
overwriting of the compiled pattern. |
1901 |
|
|
1902 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
1903 |
|
|
1904 |
If a pattern contains back references, but the ovector that is passed |
If a pattern contains back references, but the ovector that is passed |
1905 |
to pcre_exec() is not big enough to remember the referenced substrings, |
to pcre_exec() is not big enough to remember the referenced substrings, |
1906 |
PCRE gets a block of memory at the start of matching to use for this |
PCRE gets a block of memory at the start of matching to use for this |
1907 |
purpose. If the call via pcre_malloc() fails, this error is given. The |
purpose. If the call via pcre_malloc() fails, this error is given. The |
1908 |
memory is automatically freed at the end of matching. |
memory is automatically freed at the end of matching. |
1909 |
|
|
1910 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
1911 |
|
|
1912 |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
This error is used by the pcre_copy_substring(), pcre_get_substring(), |
1913 |
and pcre_get_substring_list() functions (see below). It is never |
and pcre_get_substring_list() functions (see below). It is never |
1914 |
returned by pcre_exec(). |
returned by pcre_exec(). |
1915 |
|
|
1916 |
PCRE_ERROR_MATCHLIMIT (-8) |
PCRE_ERROR_MATCHLIMIT (-8) |
1917 |
|
|
1918 |
The backtracking limit, as specified by the match_limit field in a |
The backtracking limit, as specified by the match_limit field in a |
1919 |
pcre_extra structure (or defaulted) was reached. See the description |
pcre_extra structure (or defaulted) was reached. See the description |
1920 |
above. |
above. |
1921 |
|
|
|
PCRE_ERROR_RECURSIONLIMIT (-21) |
|
|
|
|
|
The internal recursion limit, as specified by the match_limit_recursion |
|
|
field in a pcre_extra structure (or defaulted) was reached. See the |
|
|
description above. |
|
|
|
|
1922 |
PCRE_ERROR_CALLOUT (-9) |
PCRE_ERROR_CALLOUT (-9) |
1923 |
|
|
1924 |
This error is never generated by pcre_exec() itself. It is provided for |
This error is never generated by pcre_exec() itself. It is provided for |
1925 |
use by callout functions that want to yield a distinctive error code. |
use by callout functions that want to yield a distinctive error code. |
1926 |
See the pcrecallout documentation for details. |
See the pcrecallout documentation for details. |
1927 |
|
|
1928 |
PCRE_ERROR_BADUTF8 (-10) |
PCRE_ERROR_BADUTF8 (-10) |
1929 |
|
|
1930 |
A string that contains an invalid UTF-8 byte sequence was passed as a |
A string that contains an invalid UTF-8 byte sequence was passed as a |
1931 |
subject. |
subject. |
1932 |
|
|
1933 |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
PCRE_ERROR_BADUTF8_OFFSET (-11) |
1934 |
|
|
1935 |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
The UTF-8 byte sequence that was passed as a subject was valid, but the |
1936 |
value of startoffset did not point to the beginning of a UTF-8 charac- |
value of startoffset did not point to the beginning of a UTF-8 charac- |
1937 |
ter. |
ter. |
1938 |
|
|
1939 |
PCRE_ERROR_PARTIAL (-12) |
PCRE_ERROR_PARTIAL (-12) |
1940 |
|
|
1941 |
The subject string did not match, but it did match partially. See the |
The subject string did not match, but it did match partially. See the |
1942 |
pcrepartial documentation for details of partial matching. |
pcrepartial documentation for details of partial matching. |
1943 |
|
|
1944 |
PCRE_ERROR_BADPARTIAL (-13) |
PCRE_ERROR_BADPARTIAL (-13) |
1945 |
|
|
1946 |
The PCRE_PARTIAL option was used with a compiled pattern containing |
The PCRE_PARTIAL option was used with a compiled pattern containing |
1947 |
items that are not supported for partial matching. See the pcrepartial |
items that are not supported for partial matching. See the pcrepartial |
1948 |
documentation for details of partial matching. |
documentation for details of partial matching. |
1949 |
|
|
1950 |
PCRE_ERROR_INTERNAL (-14) |
PCRE_ERROR_INTERNAL (-14) |
1951 |
|
|
1952 |
An unexpected internal error has occurred. This error could be caused |
An unexpected internal error has occurred. This error could be caused |
1953 |
by a bug in PCRE or by overwriting of the compiled pattern. |
by a bug in PCRE or by overwriting of the compiled pattern. |
1954 |
|
|
1955 |
PCRE_ERROR_BADCOUNT (-15) |
PCRE_ERROR_BADCOUNT (-15) |
1956 |
|
|
1957 |
This error is given if the value of the ovecsize argument is negative. |
This error is given if the value of the ovecsize argument is negative. |
1958 |
|
|
1959 |
|
PCRE_ERROR_RECURSIONLIMIT (-21) |
1960 |
|
|
1961 |
|
The internal recursion limit, as specified by the match_limit_recursion |
1962 |
|
field in a pcre_extra structure (or defaulted) was reached. See the |
1963 |
|
description above. |
1964 |
|
|
1965 |
|
PCRE_ERROR_NULLWSLIMIT (-22) |
1966 |
|
|
1967 |
|
When a group that can match an empty substring is repeated with an |
1968 |
|
unbounded upper limit, the subject position at the start of the group |
1969 |
|
must be remembered, so that a test for an empty string can be made when |
1970 |
|
the end of the group is reached. Some workspace is required for this; |
1971 |
|
if it runs out, this error is given. |
1972 |
|
|
1973 |
|
PCRE_ERROR_BADNEWLINE (-23) |
1974 |
|
|
1975 |
|
An invalid combination of PCRE_NEWLINE_xxx options was given. |
1976 |
|
|
1977 |
|
Error numbers -16 to -20 are not used by pcre_exec(). |
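
For logging, the negative return values listed above can be turned into names. This is a sketch: demo_exec_error_name() is a hypothetical helper, and the numeric values are copied from the list in this section. Code that includes pcre.h would normally switch on the PCRE_ERROR_xxx macros themselves.

```c
/* Maps a pcre_exec() return value to the name used in this section.
   The numbers are taken from the list above, not from pcre.h. */
const char *demo_exec_error_name(int rc)
{
    switch (rc) {
    case -1:  return "PCRE_ERROR_NOMATCH";
    case -2:  return "PCRE_ERROR_NULL";
    case -3:  return "PCRE_ERROR_BADOPTION";
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -10: return "PCRE_ERROR_BADUTF8";
    case -11: return "PCRE_ERROR_BADUTF8_OFFSET";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -22: return "PCRE_ERROR_NULLWSLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:
        /* Non-negative values are not errors; -16 to -20 are not used
           by pcre_exec(). */
        return rc >= 0 ? "(not an error)" : "(not used by pcre_exec)";
    }
}
```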
1978 |
|
|
1979 |
|
|
1980 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
1990 |
int pcre_get_substring_list(const char *subject, |
int pcre_get_substring_list(const char *subject, |
1991 |
int *ovector, int stringcount, const char ***listptr); |
int *ovector, int stringcount, const char ***listptr); |
1992 |
|
|
1993 |
Captured substrings can be accessed directly by using the offsets |
Captured substrings can be accessed directly by using the offsets |
1994 |
returned by pcre_exec() in ovector. For convenience, the functions |
returned by pcre_exec() in ovector. For convenience, the functions |
1995 |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
pcre_copy_substring(), pcre_get_substring(), and pcre_get_sub- |
1996 |
string_list() are provided for extracting captured substrings as new, |
string_list() are provided for extracting captured substrings as new, |
1997 |
separate, zero-terminated strings. These functions identify substrings |
separate, zero-terminated strings. These functions identify substrings |
1998 |
by number. The next section describes functions for extracting named |
by number. The next section describes functions for extracting named |
1999 |
substrings. |
substrings. |
2000 |
|
|
2001 |
A substring that contains a binary zero is correctly extracted and has |
A substring that contains a binary zero is correctly extracted and has |
2002 |
a further zero added on the end, but the result is not, of course, a C |
a further zero added on the end, but the result is not, of course, a C |
2003 |
string. However, you can process such a string by referring to the |
string. However, you can process such a string by referring to the |
2004 |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
length that is returned by pcre_copy_substring() and pcre_get_sub- |
2005 |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
string(). Unfortunately, the interface to pcre_get_substring_list() is |
2006 |
not adequate for handling strings containing binary zeros, because the |
not adequate for handling strings containing binary zeros, because the |
2007 |
end of the final string is not independently indicated. |
end of the final string is not independently indicated. |
2008 |
|
|
2009 |
The first three arguments are the same for all three of these func- |
The first three arguments are the same for all three of these func- |
2010 |
tions: subject is the subject string that has just been successfully |
tions: subject is the subject string that has just been successfully |
2011 |
matched, ovector is a pointer to the vector of integer offsets that was |
matched, ovector is a pointer to the vector of integer offsets that was |
2012 |
passed to pcre_exec(), and stringcount is the number of substrings that |
passed to pcre_exec(), and stringcount is the number of substrings that |
2013 |
were captured by the match, including the substring that matched the |
were captured by the match, including the substring that matched the |
2014 |
entire regular expression. This is the value returned by pcre_exec() if |
entire regular expression. This is the value returned by pcre_exec() if |
2015 |
it is greater than zero. If pcre_exec() returned zero, indicating that |
it is greater than zero. If pcre_exec() returned zero, indicating that |
2016 |
it ran out of space in ovector, the value passed as stringcount should |
it ran out of space in ovector, the value passed as stringcount should |
2017 |
be the number of elements in the vector divided by three. |
be the number of elements in the vector divided by three. |
2018 |
|
|
2019 |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
The functions pcre_copy_substring() and pcre_get_substring() extract a |
2020 |
single substring, whose number is given as stringnumber. A value of |
single substring, whose number is given as stringnumber. A value of |
2021 |
zero extracts the substring that matched the entire pattern, whereas |
zero extracts the substring that matched the entire pattern, whereas |
2022 |
higher values extract the captured substrings. For pcre_copy_sub- |
higher values extract the captured substrings. For pcre_copy_sub- |
2023 |
string(), the string is placed in buffer, whose length is given by |
string(), the string is placed in buffer, whose length is given by |
2024 |
buffersize, while for pcre_get_substring() a new block of memory is |
buffersize, while for pcre_get_substring() a new block of memory is |
2025 |
obtained via pcre_malloc, and its address is returned via stringptr. |
obtained via pcre_malloc, and its address is returned via stringptr. |
2026 |
The yield of the function is the length of the string, not including |
The yield of the function is the length of the string, not including |
2027 |
the terminating zero, or one of |
the terminating zero, or one of these error codes: |
2028 |
|
|
2029 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2030 |
|
|
2031 |
The buffer was too small for pcre_copy_substring(), or the attempt to |
The buffer was too small for pcre_copy_substring(), or the attempt to |
2032 |
get memory failed for pcre_get_substring(). |
get memory failed for pcre_get_substring(). |
2033 |
|
|
2034 |
PCRE_ERROR_NOSUBSTRING (-7) |
PCRE_ERROR_NOSUBSTRING (-7) |
2035 |
|
|
2036 |
There is no substring whose number is stringnumber. |
There is no substring whose number is stringnumber. |
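
The behaviour described for pcre_copy_substring() can be modelled directly on top of the ovector. The function below is a sketch of the documented contract, not the library's source: demo_copy_capture() and the two DEMO_ERROR macros are hypothetical names, with the error values copied from the text above. An unset subpattern (offsets of -1) is assumed to yield an empty string.

```c
#include <string.h>

#define DEMO_ERROR_NOMEMORY    (-6)   /* value listed for PCRE_ERROR_NOMEMORY */
#define DEMO_ERROR_NOSUBSTRING (-7)   /* value listed for PCRE_ERROR_NOSUBSTRING */

/* Copies substring number stringnumber into buffer as a zero-terminated
   string. Returns its length (without the terminating zero) or one of
   the two error values. */
int demo_copy_capture(const char *subject, const int *ovector,
                      int stringcount, int stringnumber,
                      char *buffer, int buffersize)
{
    int start, length;

    if (stringnumber < 0 || stringnumber >= stringcount)
        return DEMO_ERROR_NOSUBSTRING;

    start = ovector[2 * stringnumber];
    length = (start < 0) ? 0 : ovector[2 * stringnumber + 1] - start;

    /* One extra byte is needed for the terminating zero. */
    if (buffersize < length + 1)
        return DEMO_ERROR_NOMEMORY;

    if (length > 0)
        memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

Because the length is returned separately, a caller can still handle a substring that contains binary zeros by using the return value instead of strlen().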
2037 |
|
|
2038 |
The pcre_get_substring_list() function extracts all available sub- |
The pcre_get_substring_list() function extracts all available sub- |
2039 |
strings and builds a list of pointers to them. All this is done in a |
strings and builds a list of pointers to them. All this is done in a |
2040 |
single block of memory that is obtained via pcre_malloc. The address of |
single block of memory that is obtained via pcre_malloc. The address of |
2041 |
the memory block is returned via listptr, which is also the start of |
the memory block is returned via listptr, which is also the start of |
2042 |
the list of string pointers. The end of the list is marked by a NULL |
the list of string pointers. The end of the list is marked by a NULL |
2043 |
pointer. The yield of the function is zero if all went well, or |
pointer. The yield of the function is zero if all went well, or the |
2044 |
|
error code |
2045 |
|
|
2046 |
PCRE_ERROR_NOMEMORY (-6) |
PCRE_ERROR_NOMEMORY (-6) |
2047 |
|
|
2083 |
To extract a substring by name, you first have to find the associated |
To extract a substring by name, you first have to find the associated |
2084 |
number. For example, for this pattern |
number. For example, for this pattern |
2085 |
|
|
2086 |
(a+)b(?P<xxx>\d+)... |
(a+)b(?<xxx>\d+)... |
2087 |
|
|
2088 |
the number of the subpattern called "xxx" is 2. If the name is known to |
the number of the subpattern called "xxx" is 2. If the name is known to |
2089 |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
be unique (PCRE_DUPNAMES was not set), you can find the number from the |
2134 |
third and fourth are pointers to variables which are updated by the |
third and fourth are pointers to variables which are updated by the |
2135 |
function. After it has run, they point to the first and last entries in |
function. After it has run, they point to the first and last entries in |
2136 |
the name-to-number table for the given name. The function itself |
the name-to-number table for the given name. The function itself |
2137 |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING if there |
returns the length of each entry, or PCRE_ERROR_NOSUBSTRING (-7) if |
2138 |
are none. The format of the table is described above in the section |
there are none. The format of the table is described above in the sec- |
2139 |
entitled Information about a pattern. Given all the relevant entries |
tion entitled Information about a pattern. Given all the relevant |
2140 |
for the name, you can extract each of their numbers, and hence the cap- |
entries for the name, you can extract each of their numbers, and hence |
2141 |
tured data, if any. |
the captured data, if any. |
2142 |
|
|
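As a rough illustration of the table walk described above, the sketch below steps from the first entry to the last in strides of the returned entry length and reads each group number. It assumes the 8-bit table layout given in the section on information about a pattern (two bytes of group number, most significant byte first, followed by the zero-terminated name). collect_group_numbers() is a hypothetical helper, not part of PCRE.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE). `first` and `last` are the
   pointers set by pcre_get_stringtable_entries(), and `entry_len` is the
   value that function returned. Assumed entry layout: 2 bytes of group
   number, most significant byte first, then the zero-terminated name.
   Stores up to `max` group numbers in `numbers` and returns the number
   of entries between `first` and `last`, inclusive. */
int collect_group_numbers(const char *first, const char *last,
                          int entry_len, int *numbers, int max)
{
    ptrdiff_t n, i;

    if (first == NULL || last == NULL || entry_len <= 2 || last < first)
        return 0;

    n = (last - first) / entry_len + 1;
    for (i = 0; i < n; i++) {
        const unsigned char *p =
            (const unsigned char *)(first + i * entry_len);
        if (i < max)
            numbers[i] = (p[0] << 8) | p[1];
    }
    return (int)n;
}
```

Each number collected this way can then be used like any other capture number, for example as an index into the offsets vector.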
2143 |
|
|
2144 |
FINDING ALL POSSIBLE MATCHES |
FINDING ALL POSSIBLE MATCHES |
2167 |
int *workspace, int wscount); |
int *workspace, int wscount); |
2168 |
|
|
2169 |
The function pcre_dfa_exec() is called to match a subject string |
The function pcre_dfa_exec() is called to match a subject string |
2170 |
against a compiled pattern, using a "DFA" matching algorithm. This has |
against a compiled pattern, using a matching algorithm that scans the |
2171 |
different characteristics to the normal algorithm, and is not compati- |
subject string just once, and does not backtrack. This has different |
2172 |
ble with Perl. Some of the features of PCRE patterns are not supported. |
characteristics to the normal algorithm, and is not compatible with |
2173 |
Nevertheless, there are times when this kind of matching can be useful. |
Perl. Some of the features of PCRE patterns are not supported. Never- |
2174 |
For a discussion of the two matching algorithms, see the pcrematching |
theless, there are times when this kind of matching can be useful. For |
2175 |
documentation. |
a discussion of the two matching algorithms, see the pcrematching docu- |
2176 |
|
mentation. |
2177 |
|
|
2178 |
The arguments for the pcre_dfa_exec() function are the same as for |
The arguments for the pcre_dfa_exec() function are the same as for |
2179 |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
pcre_exec(), plus two extras. The ovector argument is used in a differ- |
2180 |
ent way, and this is described below. The other common arguments are |
ent way, and this is described below. The other common arguments are |
2181 |
used in the same way as for pcre_exec(), so their description is not |
used in the same way as for pcre_exec(), so their description is not |
2182 |
repeated here. |
repeated here. |
2183 |
|
|
2184 |
The two additional arguments provide workspace for the function. The |
The two additional arguments provide workspace for the function. The |
2185 |
workspace vector should contain at least 20 elements. It is used for |
workspace vector should contain at least 20 elements. It is used for |
2186 |
keeping track of multiple paths through the pattern tree. More |
keeping track of multiple paths through the pattern tree. More |
2187 |
workspace will be needed for patterns and subjects where there are a |
workspace will be needed for patterns and subjects where there are a |
2188 |
lot of potential matches. |
lot of potential matches. |
2189 |
|
|
2190 |
Here is an example of a simple call to pcre_dfa_exec(): |
Here is an example of a simple call to pcre_dfa_exec(): |
2206 |
|
|
2207 |
Option bits for pcre_dfa_exec() |
Option bits for pcre_dfa_exec() |
2208 |
|
|
2209 |
The unused bits of the options argument for pcre_dfa_exec() must be |
The unused bits of the options argument for pcre_dfa_exec() must be |
2210 |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
zero. The only bits that may be set are PCRE_ANCHORED, PCRE_NEW- |
2211 |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
LINE_xxx, PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK, |
2212 |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last |
2213 |
three of these are the same as for pcre_exec(), so their description is |
three of these are the same as for pcre_exec(), so their description is |
2214 |
not repeated here. |
not repeated here. |
2215 |
|
|
2216 |
PCRE_PARTIAL |
PCRE_PARTIAL |
2217 |
|
|
2218 |
This has the same general effect as it does for pcre_exec(), but the |
This has the same general effect as it does for pcre_exec(), but the |
2219 |
details are slightly different. When PCRE_PARTIAL is set for |
details are slightly different. When PCRE_PARTIAL is set for |
2220 |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is converted into |
2221 |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
PCRE_ERROR_PARTIAL if the end of the subject is reached, there have |
2222 |
been no complete matches, but there is still at least one matching pos- |
been no complete matches, but there is still at least one matching pos- |
2223 |
sibility. The portion of the string that provided the partial match is |
sibility. The portion of the string that provided the partial match is |
2224 |
set as the first matching string. |
set as the first matching string. |
2225 |
|
|
2226 |
PCRE_DFA_SHORTEST |
PCRE_DFA_SHORTEST |
2227 |
|
|
2228 |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
Setting the PCRE_DFA_SHORTEST option causes the matching algorithm to |
2229 |
stop as soon as it has found one match. Because of the way the DFA |
stop as soon as it has found one match. Because of the way the alterna- |
2230 |
algorithm works, this is necessarily the shortest possible match at the |
tive algorithm works, this is necessarily the shortest possible match |
2231 |
first possible matching point in the subject string. |
at the first possible matching point in the subject string. |
2232 |
|
|
2233 |
PCRE_DFA_RESTART |
PCRE_DFA_RESTART |
2234 |
|
|
2235 |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
When pcre_dfa_exec() is called with the PCRE_PARTIAL option, and |
2236 |
returns a partial match, it is possible to call it again, with addi- |
returns a partial match, it is possible to call it again, with addi- |
2237 |
tional subject characters, and have it continue with the same match. |
tional subject characters, and have it continue with the same match. |
2238 |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
The PCRE_DFA_RESTART option requests this action; when it is set, the |
2239 |
workspace and wscount arguments must reference the same vector as before |
workspace and wscount arguments must reference the same vector as before |
2240 |
because data about the match so far is left in them after a partial |
because data about the match so far is left in them after a partial |
2241 |
match. There is more discussion of this facility in the pcrepartial |
match. There is more discussion of this facility in the pcrepartial |
2242 |
documentation. |
documentation. |
2243 |
|
|
2244 |
Successful returns from pcre_dfa_exec() |
Successful returns from pcre_dfa_exec() |
2245 |
|
|
2246 |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
When pcre_dfa_exec() succeeds, it may have matched more than one sub- |
2247 |
string in the subject. Note, however, that all the matches from one run |
string in the subject. Note, however, that all the matches from one run |
2248 |
of the function start at the same point in the subject. The shorter |
of the function start at the same point in the subject. The shorter |
2249 |
matches are all initial substrings of the longer matches. For example, |
matches are all initial substrings of the longer matches. For example, |
2250 |
if the pattern |
if the pattern |
2251 |
|
|
2252 |
<.*> |
<.*> |
2261 |
<something> <something else> |
<something> <something else> |
2262 |
<something> <something else> <something further> |
<something> <something else> <something further> |
2263 |
|
|
2264 |
On success, the yield of the function is a number greater than zero, |
On success, the yield of the function is a number greater than zero, |
2265 |
which is the number of matched substrings. The substrings themselves |
which is the number of matched substrings. The substrings themselves |
2266 |
are returned in ovector. Each string uses two elements; the first is |
are returned in ovector. Each string uses two elements; the first is |
2267 |
the offset to the start, and the second is the offset to the end. All |
the offset to the start, and the second is the offset to the end. In |
2268 |
the strings have the same start offset. (Space could have been saved by |
fact, all the strings have the same start offset. (Space could have |
2269 |
giving this only once, but it was decided to retain some compatibility |
been saved by giving this only once, but it was decided to retain some |
2270 |
with the way pcre_exec() returns data, even though the meaning of the |
compatibility with the way pcre_exec() returns data, even though the |
2271 |
strings is different.) |
meaning of the strings is different.) |
2272 |
|
|
2273 |
The strings are returned in reverse order of length; that is, the long- |
The strings are returned in reverse order of length; that is, the long- |
2274 |
est matching string is given first. If there were too many matches to |
est matching string is given first. If there were too many matches to |
2275 |
fit into ovector, the yield of the function is zero, and the vector is |
fit into ovector, the yield of the function is zero, and the vector is |
2276 |
filled with the longest matches. |
filled with the longest matches. |
2277 |
|
|
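The layout of ovector described above can be read back with a couple of small helpers. This is only a sketch: dfa_pair_count() and dfa_copy_match() are hypothetical names, and the zero-yield case assumes that ovecsize counts int elements, so that ovecsize / 2 pairs fit in the vector.

```c
#include <stddef.h>
#include <string.h>

/* Number of (start, end) pairs that can be read from ovector after
   pcre_dfa_exec() returned `rc`. A yield of zero means there were too
   many matches to fit, so the vector holds only the longest ones; this
   sketch assumes that is ovecsize / 2 pairs. */
int dfa_pair_count(int rc, int ovecsize)
{
    if (rc < 0)
        return 0;                 /* error or no match: nothing to read */
    return (rc == 0) ? ovecsize / 2 : rc;
}

/* Copy match number `index` (0 is the longest) into buf as a C string.
   Returns the match length, or -1 if buf is too small. */
int dfa_copy_match(const char *subject, const int *ovector, int index,
                   char *buf, size_t bufsize)
{
    int start = ovector[2 * index];       /* same for every match */
    int end = ovector[2 * index + 1];
    int len = end - start;

    if (len < 0 || (size_t)len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

Because the longest match comes first, a caller that wants only the match Perl-style greedy matching would most likely prefer can read index 0 and ignore the rest.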
2278 |
Error returns from pcre_dfa_exec() |
Error returns from pcre_dfa_exec() |
2279 |
|
|
2280 |
The pcre_dfa_exec() function returns a negative number when it fails. |
The pcre_dfa_exec() function returns a negative number when it fails. |
2281 |
Many of the errors are the same as for pcre_exec(), and these are |
Many of the errors are the same as for pcre_exec(), and these are |
2282 |
described above. There are in addition the following errors that are |
described above. There are in addition the following errors that are |
2283 |
specific to pcre_dfa_exec(): |
specific to pcre_dfa_exec(): |
2284 |
|
|
2285 |
PCRE_ERROR_DFA_UITEM (-16) |
PCRE_ERROR_DFA_UITEM (-16) |
2286 |
|
|
2287 |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
This return is given if pcre_dfa_exec() encounters an item in the pat- |
2288 |
tern that it does not support, for instance, the use of \C or a back |
tern that it does not support, for instance, the use of \C or a back |
2289 |
reference. |
reference. |
2290 |
|
|
2291 |
PCRE_ERROR_DFA_UCOND (-17) |
PCRE_ERROR_DFA_UCOND (-17) |
2292 |
|
|
2293 |
This return is given if pcre_dfa_exec() encounters a condition item in |
This return is given if pcre_dfa_exec() encounters a condition item |
2294 |
a pattern that uses a back reference for the condition. This is not |
that uses a back reference for the condition, or a test for recursion |
2295 |
supported. |
in a specific group. These are not supported. |
2296 |
|
|
2297 |
PCRE_ERROR_DFA_UMLIMIT (-18) |
PCRE_ERROR_DFA_UMLIMIT (-18) |
2298 |
|
|
2299 |
This return is given if pcre_dfa_exec() is called with an extra block |
This return is given if pcre_dfa_exec() is called with an extra block |
2300 |
that contains a setting of the match_limit field. This is not supported |
that contains a setting of the match_limit field. This is not supported |
2301 |
(it is meaningless). |
(it is meaningless). |
2302 |
|
|
2303 |
PCRE_ERROR_DFA_WSSIZE (-19) |
PCRE_ERROR_DFA_WSSIZE (-19) |
2304 |
|
|
2305 |
This return is given if pcre_dfa_exec() runs out of space in the |
This return is given if pcre_dfa_exec() runs out of space in the |
2306 |
workspace vector. |
workspace vector. |
2307 |
|
|
2308 |
PCRE_ERROR_DFA_RECURSE (-20) |
PCRE_ERROR_DFA_RECURSE (-20) |
2309 |
|
|
2310 |
When a recursive subpattern is processed, the matching function calls |
When a recursive subpattern is processed, the matching function calls |
2311 |
itself recursively, using private vectors for ovector and workspace. |
itself recursively, using private vectors for ovector and workspace. |
2312 |
This error is given if the output vector is not large enough. This |
This error is given if the output vector is not large enough. This |
2313 |
should be extremely rare, as a vector of size 1000 is used. |
should be extremely rare, as a vector of size 1000 is used. |
2314 |
|
|
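For diagnostics, the five codes listed above might be turned into short messages along the following lines. The EX_ constants are local stand-ins that copy the numeric values quoted in the text; real code would use the PCRE_ERROR_DFA_xxx macros from pcre.h. dfa_error_text() is a hypothetical helper.

```c
/* Local mirrors of the values quoted in the text (assumption: real code
   would include pcre.h and use its PCRE_ERROR_DFA_xxx macros). */
enum {
    EX_ERROR_DFA_UITEM   = -16,
    EX_ERROR_DFA_UCOND   = -17,
    EX_ERROR_DFA_UMLIMIT = -18,
    EX_ERROR_DFA_WSSIZE  = -19,
    EX_ERROR_DFA_RECURSE = -20
};

/* Return a short description for the errors that are specific to
   pcre_dfa_exec(), or a generic string for any other value. */
const char *dfa_error_text(int rc)
{
    switch (rc) {
    case EX_ERROR_DFA_UITEM:
        return "unsupported item in pattern";
    case EX_ERROR_DFA_UCOND:
        return "unsupported condition in pattern";
    case EX_ERROR_DFA_UMLIMIT:
        return "match_limit is not supported";
    case EX_ERROR_DFA_WSSIZE:
        return "workspace vector too small";
    case EX_ERROR_DFA_RECURSE:
        return "recursion output vector too small";
    default:
        return "not specific to pcre_dfa_exec()";
    }
}
```

Of these, the workspace error is the one a caller can usually recover from, by retrying with a larger workspace vector.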
2315 |
Last updated: 08 June 2006 |
|
2316 |
|
SEE ALSO |
2317 |
|
|
2318 |
|
pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar- |
2319 |
|
tial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3). |
2320 |
|
|
2321 |
|
Last updated: 30 November 2006 |
2322 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
2323 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2324 |
|
|
2492 |
DIFFERENCES BETWEEN PCRE AND PERL |
DIFFERENCES BETWEEN PCRE AND PERL |
2493 |
|
|
2494 |
This document describes the differences in the ways that PCRE and Perl |
This document describes the differences in the ways that PCRE and Perl |
2495 |
handle regular expressions. The differences described here are with |
handle regular expressions. The differences described here are mainly |
2496 |
respect to Perl 5.8. |
with respect to Perl 5.8, though PCRE version 7.0 contains some fea- |
2497 |
|
tures that are expected to be in the forthcoming Perl 5.10. |
2498 |
|
|
2499 |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
1. PCRE has only a subset of Perl's UTF-8 and Unicode support. Details |
2500 |
of what it does have are given in the section on UTF-8 support in the |
of what it does have are given in the section on UTF-8 support in the |
2501 |
main pcre page. |
main pcre page. |
2502 |
|
|
2503 |
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl |
2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl |
2504 |
permits them, but they do not mean what you might think. For example, |
permits them, but they do not mean what you might think. For example, |
2505 |
(?!a){3} does not assert that the next three characters are not "a". It |
(?!a){3} does not assert that the next three characters are not "a". It |
2506 |
just asserts that the next character is not "a" three times. |
just asserts that the next character is not "a" three times. |
2507 |
|
|
2508 |
3. Capturing subpatterns that occur inside negative lookahead asser- |
3. Capturing subpatterns that occur inside negative lookahead asser- |
2509 |
tions are counted, but their entries in the offsets vector are never |
tions are counted, but their entries in the offsets vector are never |
2510 |
set. Perl sets its numerical variables from any such patterns that are |
set. Perl sets its numerical variables from any such patterns that are |
2511 |
matched before the assertion fails to match something (thereby succeed- |
matched before the assertion fails to match something (thereby succeed- |
2512 |
ing), but only if the negative lookahead assertion contains just one |
ing), but only if the negative lookahead assertion contains just one |
2513 |
branch. |
branch. |
2514 |
|
|
2515 |
4. Though binary zero characters are supported in the subject string, |
4. Though binary zero characters are supported in the subject string, |
2516 |
they are not allowed in a pattern string because it is passed as a nor- |
they are not allowed in a pattern string because it is passed as a nor- |
2517 |
mal C string, terminated by zero. The escape sequence \0 can be used in |
mal C string, terminated by zero. The escape sequence \0 can be used in |
2518 |
the pattern to represent a binary zero. |
the pattern to represent a binary zero. |
2519 |
|
|
2520 |
5. The following Perl escape sequences are not supported: \l, \u, \L, |
5. The following Perl escape sequences are not supported: \l, \u, \L, |
2521 |
\U, and \N. In fact these are implemented by Perl's general string-han- |
\U, and \N. In fact these are implemented by Perl's general string-han- |
2522 |
dling and are not part of its pattern matching engine. If any of these |
dling and are not part of its pattern matching engine. If any of these |
2523 |
are encountered by PCRE, an error is generated. |
are encountered by PCRE, an error is generated. |
2524 |
|
|
2525 |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
6. The Perl escape sequences \p, \P, and \X are supported only if PCRE |
2526 |
is built with Unicode character property support. The properties that |
is built with Unicode character property support. The properties that |
2527 |
can be tested with \p and \P are limited to the general category prop- |
can be tested with \p and \P are limited to the general category prop- |
2528 |
erties such as Lu and Nd, script names such as Greek or Han, and the |
erties such as Lu and Nd, script names such as Greek or Han, and the |
2529 |
derived properties Any and L&. |
derived properties Any and L&. |
2530 |
|
|
2531 |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
7. PCRE does support the \Q...\E escape for quoting substrings. Charac- |
2532 |
ters in between are treated as literals. This is slightly different |
ters in between are treated as literals. This is slightly different |
2533 |
from Perl in that $ and @ are also handled as literals inside the |
from Perl in that $ and @ are also handled as literals inside the |
2534 |
quotes. In Perl, they cause variable interpolation (but of course PCRE |
quotes. In Perl, they cause variable interpolation (but of course PCRE |
2535 |
does not have variables). Note the following examples: |
does not have variables). Note the following examples: |
2536 |
|
|
2537 |
Pattern PCRE matches Perl matches |
Pattern PCRE matches Perl matches |
2541 |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
\Qabc\$xyz\E abc\$xyz abc\$xyz |
2542 |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
\Qabc\E\$\Qxyz\E abc$xyz abc$xyz |
2543 |
|
|
2544 |
The \Q...\E sequence is recognized both inside and outside character |
The \Q...\E sequence is recognized both inside and outside character |
2545 |
classes. |
classes. |
2546 |
|
|
2547 |
8. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) |
8. Fairly obviously, PCRE does not support the (?{code}) and (??{code}) |
2548 |
constructions. However, there is support for recursive patterns using |
constructions. However, there is support for recursive patterns. This |
2549 |
the non-Perl items (?R), (?number), and (?P>name). Also, the PCRE |
is not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE |
2550 |
"callout" feature allows an external function to be called during pat- |
"callout" feature allows an external function to be called during pat- |
2551 |
tern matching. See the pcrecallout documentation for details. |
tern matching. See the pcrecallout documentation for details. |
2552 |
|
|
2553 |
9. There are some differences that are concerned with the settings of |
9. Subpatterns that are called recursively or as "subroutines" are |
2554 |
captured strings when part of a pattern is repeated. For example, |
always treated as atomic groups in PCRE. This is like Python, but |
2555 |
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 |
unlike Perl. |
2556 |
|
|
2557 |
|
10. There are some differences that are concerned with the settings of |
2558 |
|
captured strings when part of a pattern is repeated. For example, |
2559 |
|
matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2 |
2560 |
unset, but in PCRE it is set to "b". |
unset, but in PCRE it is set to "b". |
2561 |
|
|
2562 |
10. PCRE provides some extensions to the Perl regular expression facil- |
11. PCRE provides some extensions to the Perl regular expression facil- |
2563 |
ities: |
ities. Perl 5.10 will include new features that are not in earlier |
2564 |
|
versions, some of which (such as named parentheses) have been in PCRE |
2565 |
|
for some time. This list is with respect to Perl 5.10: |
2566 |
|
|
2567 |
(a) Although lookbehind assertions must match fixed length strings, |
(a) Although lookbehind assertions must match fixed length strings, |
2568 |
each alternative branch of a lookbehind assertion can match a different |
each alternative branch of a lookbehind assertion can match a different |
2569 |
length of string. Perl requires them all to have the same length. |
length of string. Perl requires them all to have the same length. |
2570 |
|
|
2571 |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ |
(b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $ |
2572 |
meta-character matches only at the very end of the string. |
meta-character matches only at the very end of the string. |
2573 |
|
|
2574 |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
(c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe- |
2575 |
cial meaning is faulted. Otherwise, like Perl, the backslash is |
cial meaning is faulted. Otherwise, like Perl, the backslash is |
2576 |
ignored. (Perl can be made to issue a warning.) |
ignored. (Perl can be made to issue a warning.) |
2577 |
|
|
2578 |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
(d) If PCRE_UNGREEDY is set, the greediness of the repetition quanti- |
2579 |
fiers is inverted, that is, by default they are not greedy, but if fol- |
fiers is inverted, that is, by default they are not greedy, but if fol- |
2580 |
lowed by a question mark they are. |
lowed by a question mark they are. |
2581 |
|
|
2582 |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
(e) PCRE_ANCHORED can be used at matching time to force a pattern to be |
2583 |
tried only at the first matching position in the subject string. |
tried only at the first matching position in the subject string. |
2584 |
|
|
2585 |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- |
(f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP- |
2586 |
TURE options for pcre_exec() have no Perl equivalents. |
TURE options for pcre_exec() have no Perl equivalents. |
2587 |
|
|
2588 |
(g) The (?R), (?number), and (?P>name) constructs allow for recursive |
(g) The callout facility is PCRE-specific. |
|
pattern matching (Perl can do this using the (?p{code}) construct, |
|
|
which PCRE cannot support.) |
|
|
|
|
|
(h) PCRE supports named capturing substrings, using the Python syntax. |
|
|
|
|
|
(i) PCRE supports the possessive quantifier "++" syntax, taken from |
|
|
Sun's Java package. |
|
|
|
|
|
(j) The (R) condition, for testing recursion, is a PCRE extension. |
|
|
|
|
|
(k) The callout facility is PCRE-specific. |
|
2589 |
|
|
2590 |
(l) The partial matching facility is PCRE-specific. |
(h) The partial matching facility is PCRE-specific. |
2591 |
|
|
2592 |
(m) Patterns compiled by PCRE can be saved and re-used at a later time, |
(i) Patterns compiled by PCRE can be saved and re-used at a later time, |
2593 |
even on different hosts that have the other endianness. |
even on different hosts that have the other endianness. |
2594 |
|
|
2595 |
(n) The alternative matching function (pcre_dfa_exec()) matches in a |
(j) The alternative matching function (pcre_dfa_exec()) matches in a |
2596 |
different way and is not Perl-compatible. |
different way and is not Perl-compatible. |
2597 |
|
|
2598 |
Last updated: 06 June 2006 |
Last updated: 28 November 2006 |
2599 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
2600 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2601 |
|
|
2632 |
function, and how it differs from the normal function, are discussed in |
function, and how it differs from the normal function, are discussed in |
2633 |
the pcrematching page. |
the pcrematching page. |
2634 |
|
|
2635 |
|
|
2636 |
|
CHARACTERS AND METACHARACTERS |
2637 |
|
|
2638 |
A regular expression is a pattern that is matched against a subject |
A regular expression is a pattern that is matched against a subject |
2639 |
string from left to right. Most characters stand for themselves in a |
string from left to right. Most characters stand for themselves in a |
2640 |
pattern, and match the corresponding characters in the subject. As a |
pattern, and match the corresponding characters in the subject. As a |
2659 |
|
|
2660 |
There are two different sets of metacharacters: those that are recog- |
There are two different sets of metacharacters: those that are recog- |
2661 |
nized anywhere in the pattern except within square brackets, and those |
nized anywhere in the pattern except within square brackets, and those |
2662 |
that are recognized in square brackets. Outside square brackets, the |
that are recognized within square brackets. Outside square brackets, |
2663 |
metacharacters are as follows: |
the metacharacters are as follows: |
2664 |
|
|
2665 |
\ general escape character with several uses |
\ general escape character with several uses |
2666 |
^ assert start of string (or line, in multiline mode) |
^ assert start of string (or line, in multiline mode) |
2782 |
|
|
2783 |
Inside a character class, or if the decimal number is greater than 9 |
Inside a character class, or if the decimal number is greater than 9 |
2784 |
and there have not been that many capturing subpatterns, PCRE re-reads |
and there have not been that many capturing subpatterns, PCRE re-reads |
2785 |
up to three octal digits following the backslash, ane uses them to gen- |
up to three octal digits following the backslash, and uses them to gen- |
2786 |
erate a data character. Any subsequent digits stand for themselves. In |
erate a data character. Any subsequent digits stand for themselves. In |
2787 |
non-UTF-8 mode, the value of a character specified in octal must be |
non-UTF-8 mode, the value of a character specified in octal must be |
2788 |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
less than \400. In UTF-8 mode, values up to \777 are permitted. For |
2809 |
All the sequences that define a single character value can be used both |
All the sequences that define a single character value can be used both |
2810 |
inside and outside character classes. In addition, inside a character |
inside and outside character classes. In addition, inside a character |
2811 |
class, the sequence \b is interpreted as the backspace character (hex |
class, the sequence \b is interpreted as the backspace character (hex |
2812 |
08), and the sequence \X is interpreted as the character "X". Outside a |
08), and the sequences \R and \X are interpreted as the characters "R" |
2813 |
character class, these sequences have different meanings (see below). |
and "X", respectively. Outside a character class, these sequences have |
2814 |
|
different meanings (see below). |
2815 |
|
|
2816 |
|
Absolute and relative back references |
2817 |
|
|
2818 |
|
The sequence \g followed by a positive or negative number, optionally |
2819 |
|
enclosed in braces, is an absolute or relative back reference. Back |
2820 |
|
references are discussed later, following the discussion of parenthe- |
2821 |
|
sized subpatterns. |
2822 |
|
|
2823 |
Generic character types |
Generic character types |
2824 |
|
|
2825 |
The third use of backslash is for specifying generic character types. |
Another use of backslash is for specifying generic character types. The |
2826 |
The following are always recognized: |
following are always recognized: |
2827 |
|
|
2828 |
\d any decimal digit |
\d any decimal digit |
2829 |
\D any character that is not a decimal digit |
\D any character that is not a decimal digit |
2860 |
code character property support is available. The use of locales with |
code character property support is available. The use of locales with |
2861 |
Unicode is discouraged. |
Unicode is discouraged. |
2862 |
|
|
2863 |
|
Newline sequences |
2864 |
|
|
2865 |
|
Outside a character class, the escape sequence \R matches any Unicode |
2866 |
|
newline sequence. This is an extension to Perl. In non-UTF-8 mode \R is |
2867 |
|
equivalent to the following: |
2868 |
|
|
2869 |
|
(?>\r\n|\n|\x0b|\f|\r|\x85) |
2870 |
|
|
2871 |
|
This is an example of an "atomic group", details of which are given |
2872 |
|
below. This particular group matches either the two-character sequence |
2873 |
|
CR followed by LF, or one of the single characters LF (linefeed, |
2874 |
|
U+000A), VT (vertical tab, U+000B), FF (formfeed, U+000C), CR (carriage |
2875 |
|
return, U+000D), or NEL (next line, U+0085). The two-character sequence |
2876 |
|
is treated as a single unit that cannot be split. |
2877 |
|
|
2878 |
|
In UTF-8 mode, two additional characters whose codepoints are greater |
2879 |
|
than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa- |
2880 |
|
rator, U+2029). Unicode character property support is not needed for |
2881 |
|
these characters to be recognized. |
2882 |
|
|
2883 |
|
Inside a character class, \R matches the letter "R". |
2884 |
|
|
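The non-UTF-8 behaviour of \R outside a character class can be imitated by hand, which may help to show why CRLF counts as one unit. The function below is a sketch only: newline_seq_len() is a hypothetical name, it assumes an ASCII-based character set, and it does not cover the two extra UTF-8 mode characters (LS and PS).

```c
#include <stddef.h>

/* Hypothetical helper: length in bytes of the newline sequence that \R
   would match at subject[pos] in non-UTF-8 mode, or 0 if there is none.
   CR followed by LF is taken as a single two-byte unit, mirroring the
   atomic group (?>\r\n|\n|\x0b|\f|\r|\x85). */
size_t newline_seq_len(const char *subject, size_t length, size_t pos)
{
    unsigned char c;

    if (pos >= length)
        return 0;
    c = (unsigned char)subject[pos];
    if (c == '\r')
        return (pos + 1 < length && subject[pos + 1] == '\n') ? 2 : 1;
    if (c == '\n' || c == 0x0b || c == '\f' || c == 0x85)
        return 1;                 /* LF, VT, FF or NEL */
    return 0;
}
```

Note that at a CR that is followed by LF the function reports 2 and never 1, which corresponds to the atomic group refusing to give back the LF.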
2885 |
Unicode character properties |
Unicode character properties |
2886 |
|
|
2887 |
When PCRE is built with Unicode character property support, three addi- |
When PCRE is built with Unicode character property support, three addi- |
2908 |
Those that are not part of an identified script are lumped together as |
Those that are not part of an identified script are lumped together as |
2909 |
"Common". The current list of scripts is: |
"Common". The current list of scripts is: |
2910 |
|
|
2911 |
Arabic, Armenian, Bengali, Bopomofo, Braille, Buginese, Buhid, Cana- |
Arabic, Armenian, Balinese, Bengali, Bopomofo, Braille, Buginese, |
2912 |
dian_Aboriginal, Cherokee, Common, Coptic, Cypriot, Cyrillic, Deseret, |
Buhid, Canadian_Aboriginal, Cherokee, Common, Coptic, Cuneiform, |
2913 |
Devanagari, Ethiopic, Georgian, Glagolitic, Gothic, Greek, Gujarati, |
Cypriot, Cyrillic, Deseret, Devanagari, Ethiopic, Georgian, Glagolitic, |
2914 |
Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hiragana, Inherited, Kannada, |
Gothic, Greek, Gujarati, Gurmukhi, Han, Hangul, Hanunoo, Hebrew, Hira- |
2915 |
Katakana, Kharoshthi, Khmer, Lao, Latin, Limbu, Linear_B, Malayalam, |
gana, Inherited, Kannada, Katakana, Kharoshthi, Khmer, Lao, Latin, |
2916 |
Mongolian, Myanmar, New_Tai_Lue, Ogham, Old_Italic, Old_Persian, Oriya, |
Limbu, Linear_B, Malayalam, Mongolian, Myanmar, New_Tai_Lue, Nko, |
2917 |
Osmanya, Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tag- |
Ogham, Old_Italic, Old_Persian, Oriya, Osmanya, Phags_Pa, Phoenician, |
2918 |
banwa, Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, |
Runic, Shavian, Sinhala, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, |
2919 |
Ugaritic, Yi. |
Tai_Le, Tamil, Telugu, Thaana, Thai, Tibetan, Tifinagh, Ugaritic, Yi. |
2920 |
|
|
2921 |
Each character has exactly one general category property, specified by |
Each character has exactly one general category property, specified by |
2922 |
a two-letter abbreviation. For compatibility with Perl, negation can be |
a two-letter abbreviation. For compatibility with Perl, negation can be |
3009 |
|
|
3010 |
Simple assertions |
Simple assertions |
3011 |
|
|
3012 |
The fourth use of backslash is for certain simple assertions. An asser- |
The final use of backslash is for certain simple assertions. An asser- |
3013 |
tion specifies a condition that has to be met at a particular point in |
tion specifies a condition that has to be met at a particular point in |
3014 |
a match, without consuming any characters from the subject string. The |
a match, without consuming any characters from the subject string. The |
3015 |
use of subpatterns for more complicated assertions is described below. |
use of subpatterns for more complicated assertions is described below. |
3017 |
|
|
3018 |
\b matches at a word boundary |
\b matches at a word boundary |
3019 |
\B matches when not at a word boundary |
\B matches when not at a word boundary |
3020 |
\A matches at start of subject |
\A matches at the start of the subject |
3021 |
\Z matches at end of subject or before newline at end |
\Z matches at the end of the subject |
3022 |
\z matches at end of subject |
also matches before a newline at the end of the subject |
3023 |
\G matches at first matching position in subject |
\z matches only at the end of the subject |
3024 |
|
\G matches at the first matching position in the subject |
3025 |
|
|
3026 |
These assertions may not appear in character classes (but note that \b |
These assertions may not appear in character classes (but note that \b |
3027 |
has a different meaning, namely the backspace character, inside a char- |
has a different meaning, namely the backspace character, inside a char- |
3118 |
Outside a character class, a dot in the pattern matches any one charac- |
Outside a character class, a dot in the pattern matches any one charac- |
3119 |
ter in the subject string except (by default) a character that signi- |
ter in the subject string except (by default) a character that signi- |
3120 |
fies the end of a line. In UTF-8 mode, the matched character may be |
fies the end of a line. In UTF-8 mode, the matched character may be |
3121 |
more than one byte long. When a line ending is defined as a single |
more than one byte long. |
|
character (CR or LF), dot never matches that character; when the two- |
|
|
character sequence CRLF is used, dot does not match CR if it is immedi- |
|
|
ately followed by LF, but otherwise it matches all characters (includ- |
|
|
ing isolated CRs and LFs). |
|
|
|
|
|
The behaviour of dot with regard to newlines can be changed. If the |
|
|
PCRE_DOTALL option is set, a dot matches any one character, without |
|
|
exception. If newline is defined as the two-character sequence CRLF, it |
|
|
takes two dots to match it. |
|
3122 |
|
|
3123 |
The handling of dot is entirely independent of the handling of circum- |
When a line ending is defined as a single character, dot never matches |
3124 |
flex and dollar, the only relationship being that they both involve |
that character; when the two-character sequence CRLF is used, dot does |
3125 |
|
not match CR if it is immediately followed by LF, but otherwise it |
3126 |
|
matches all characters (including isolated CRs and LFs). When any Uni- |
3127 |
|
code line endings are being recognized, dot does not match CR or LF or |
3128 |
|
any of the other line ending characters. |
3129 |
|
|
3130 |
|
The behaviour of dot with regard to newlines can be changed. If the |
3131 |
|
PCRE_DOTALL option is set, a dot matches any one character, without |
3132 |
|
exception. If the two-character sequence CRLF is present in the subject |
3133 |
|
string, it takes two dots to match it. |
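
The dot behaviour described above can be tried in any engine that has a dot-all switch. The following is a minimal sketch using Python's re module as a stand-in, which is an assumption: it is not PCRE, it treats only LF as the line ending for dot, and its re.DOTALL flag plays the role of PCRE_DOTALL.

```python
import re

# By default, dot refuses to match the line-ending character (LF here).
default_lf = re.fullmatch(r".", "\n")

# With the dot-all flag set, dot matches any one character, LF included.
dotall_lf = re.fullmatch(r".", "\n", re.DOTALL)

# CRLF is two characters, so a single dot cannot match it even with dot-all,
one_dot_crlf = re.fullmatch(r".", "\r\n", re.DOTALL)

# and it takes two dots to match the pair.
two_dots_crlf = re.fullmatch(r"..", "\r\n", re.DOTALL)

print(default_lf, dotall_lf, one_dot_crlf, two_dots_crlf)
```

Only the first and third results are None; the other two are match objects.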
3134 |
|
|
3135 |
|
The handling of dot is entirely independent of the handling of circum- |
3136 |
|
flex and dollar, the only relationship being that they both involve |
3137 |
newlines. Dot has no special meaning in a character class. |
newlines. Dot has no special meaning in a character class. |
3138 |
|
|
3139 |
|
|
3140 |
MATCHING A SINGLE BYTE |
MATCHING A SINGLE BYTE |
3141 |
|
|
3142 |
Outside a character class, the escape sequence \C matches any one byte, |
Outside a character class, the escape sequence \C matches any one byte, |
3143 |
both in and out of UTF-8 mode. Unlike a dot, it always matches CR and |
both in and out of UTF-8 mode. Unlike a dot, it always matches any |
3144 |
LF. The feature is provided in Perl in order to match individual bytes |
line-ending characters. The feature is provided in Perl in order to |
3145 |
in UTF-8 mode. Because it breaks up UTF-8 characters into individual |
match individual bytes in UTF-8 mode. Because it breaks up UTF-8 char- |
3146 |
bytes, what remains in the string may be a malformed UTF-8 string. For |
acters into individual bytes, what remains in the string may be a mal- |
3147 |
this reason, the \C escape sequence is best avoided. |
formed UTF-8 string. For this reason, the \C escape sequence is best |
3148 |
|
avoided. |
3149 |
|
|
3150 |
PCRE does not allow \C to appear in lookbehind assertions (described |
PCRE does not allow \C to appear in lookbehind assertions (described |
3151 |
below), because in UTF-8 mode this would make it impossible to calcu- |
below), because in UTF-8 mode this would make it impossible to calcu- |
3192 |
PCRE is compiled with Unicode property support as well as with UTF-8 |
PCRE is compiled with Unicode property support as well as with UTF-8 |
3193 |
support. |
support. |
3194 |
|
|
3195 |
Characters that might indicate line breaks (CR and LF) are never |
Characters that might indicate line breaks are never treated in any |
3196 |
treated in any special way when matching character classes, whatever |
special way when matching character classes, whatever line-ending |
3197 |
line-ending sequence is in use, and whatever setting of the PCRE_DOTALL |
sequence is in use, and whatever setting of the PCRE_DOTALL and |
3198 |
and PCRE_MULTILINE options is used. A class such as [^a] always matches |
PCRE_MULTILINE options is used. A class such as [^a] always matches one |
3199 |
one of these characters. |
of these characters. |
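
As an illustration of the point about character classes, the sketch below checks that a negated class matches CR and LF under every combination of the multiline and dot-all settings. It uses Python's re module as a stand-in for PCRE (an assumption), with re.MULTILINE and re.DOTALL standing in for PCRE_MULTILINE and PCRE_DOTALL.

```python
import re

flag_sets = (0, re.MULTILINE, re.DOTALL, re.MULTILINE | re.DOTALL)

# [^a] means "any character other than a"; CR and LF qualify like any
# other character, whatever the option settings are.
results = [
    re.fullmatch(r"[^a]", ch, flags) is not None
    for flags in flag_sets
    for ch in ("\r", "\n")
]

print(all(results))
```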
3200 |
|
|
3201 |
The minus (hyphen) character can be used to specify a range of charac- |
The minus (hyphen) character can be used to specify a range of charac- |
3202 |
ters in a character class. For example, [d-m] matches any letter |
ters in a character class. For example, [d-m] matches any letter |
3328 |
PCRE extracts it into the global options (and it will therefore show up |
PCRE extracts it into the global options (and it will therefore show up |
3329 |
in data extracted by the pcre_fullinfo() function). |
in data extracted by the pcre_fullinfo() function). |
3330 |
|
|
3331 |
An option change within a subpattern affects only that part of the cur- |
An option change within a subpattern (see below for a description of |
3332 |
rent pattern that follows it, so |
subpatterns) affects only that part of the current pattern that follows |
3333 |
|
it, so |
3334 |
|
|
3335 |
(a(?i)b)c |
(a(?i)b)c |
3336 |
|
|
3337 |
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not |
matches abc and aBc and no other strings (assuming PCRE_CASELESS is not |
3338 |
used). By this means, options can be made to have different settings |
used). By this means, options can be made to have different settings |
3339 |
in different parts of the pattern. Any changes made in one alternative |
in different parts of the pattern. Any changes made in one alternative |
3340 |
do carry on into subsequent branches within the same subpattern. For |
do carry on into subsequent branches within the same subpattern. For |
3341 |
example, |
example, |
3342 |
|
|
3343 |
(a(?i)b|c) |
(a(?i)b|c) |
3344 |
|
|
3345 |
matches "ab", "aB", "c", and "C", even though when matching "C" the |
matches "ab", "aB", "c", and "C", even though when matching "C" the |
3346 |
first branch is abandoned before the option setting. This is because |
first branch is abandoned before the option setting. This is because |
3347 |
the effects of option settings happen at compile time. There would be |
the effects of option settings happen at compile time. There would be |
3348 |
some very weird behaviour otherwise. |
some very weird behaviour otherwise. |
3349 |
|
|
3350 |
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA |
The PCRE-specific options PCRE_DUPNAMES, PCRE_UNGREEDY, and PCRE_EXTRA |
3351 |
can be changed in the same way as the Perl-compatible options by using |
can be changed in the same way as the Perl-compatible options by using |
3352 |
the characters J, U and X respectively. |
the characters J, U and X respectively. |
3353 |
|
|
3354 |
|
|
3361 |
|
|
3362 |
cat(aract|erpillar|) |
cat(aract|erpillar|) |
3363 |
|
|
3364 |
matches one of the words "cat", "cataract", or "caterpillar". Without |
matches one of the words "cat", "cataract", or "caterpillar". Without |
3365 |
the parentheses, it would match "cataract", "erpillar" or the empty |
the parentheses, it would match "cataract", "erpillar" or an empty |
3366 |
string. |
string. |
3367 |
|
|
3368 |
2. It sets up the subpattern as a capturing subpattern. This means |
2. It sets up the subpattern as a capturing subpattern. This means |
3369 |
that, when the whole pattern matches, that portion of the subject |
that, when the whole pattern matches, that portion of the subject |
3370 |
string that matched the subpattern is passed back to the caller via the |
string that matched the subpattern is passed back to the caller via the |
3371 |
ovector argument of pcre_exec(). Opening parentheses are counted from |
ovector argument of pcre_exec(). Opening parentheses are counted from |
3372 |
left to right (starting from 1) to obtain numbers for the capturing |
left to right (starting from 1) to obtain numbers for the capturing |
3373 |
subpatterns. |
subpatterns. |
3374 |
|
|
3375 |
For example, if the string "the red king" is matched against the pat- |
For example, if the string "the red king" is matched against the pat- |
3376 |
tern |
tern |
3377 |
|
|
3378 |
the ((red|white) (king|queen)) |
the ((red|white) (king|queen)) |
3380 |
the captured substrings are "red king", "red", and "king", and are num- |
the captured substrings are "red king", "red", and "king", and are num- |
3381 |
bered 1, 2, and 3, respectively. |
bered 1, 2, and 3, respectively. |
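
The numbering rule can be checked with a short script. This sketch uses Python's re module rather than PCRE itself (an assumption made for convenience; the two engines count opening parentheses in the same way for this pattern).

```python
import re

m = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# Groups are numbered by the position of their opening parenthesis,
# counting from 1 and reading left to right.
numbered = {n: m.group(n) for n in (1, 2, 3)}

print(numbered)
```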
3382 |
|
|
3383 |
The fact that plain parentheses fulfil two functions is not always |
The fact that plain parentheses fulfil two functions is not always |
3384 |
helpful. There are often times when a grouping subpattern is required |
helpful. There are often times when a grouping subpattern is required |
3385 |
without a capturing requirement. If an opening parenthesis is followed |
without a capturing requirement. If an opening parenthesis is followed |
3386 |
by a question mark and a colon, the subpattern does not do any captur- |
by a question mark and a colon, the subpattern does not do any captur- |
3387 |
ing, and is not counted when computing the number of any subsequent |
ing, and is not counted when computing the number of any subsequent |
3388 |
capturing subpatterns. For example, if the string "the white queen" is |
capturing subpatterns. For example, if the string "the white queen" is |
3389 |
matched against the pattern |
matched against the pattern |
3390 |
|
|
3391 |
the ((?:red|white) (king|queen)) |
the ((?:red|white) (king|queen)) |
3392 |
|
|
3393 |
the captured substrings are "white queen" and "queen", and are numbered |
the captured substrings are "white queen" and "queen", and are numbered |
3394 |
1 and 2. The maximum number of capturing subpatterns is 65535, and the |
1 and 2. The maximum number of capturing subpatterns is 65535. |
|
maximum depth of nesting of all subpatterns, both capturing and non- |
|
|
capturing, is 200. |
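
A quick way to see that a (?: group is left out of the numbering is to list the captured substrings. The sketch below again uses Python's re module as a stand-in for PCRE (an assumption).

```python
import re

m = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")

# Only two groups capture; (?:red|white) groups without capturing and
# does not take a number.
captured = m.groups()
group_count = m.re.groups

print(captured, group_count)
```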
|
3395 |
|
|
3396 |
As a convenient shorthand, if any option settings are required at the |
As a convenient shorthand, if any option settings are required at the |
3397 |
start of a non-capturing subpattern, the option letters may appear |
start of a non-capturing subpattern, the option letters may appear |
3398 |
between the "?" and the ":". Thus the two patterns |
between the "?" and the ":". Thus the two patterns |
3399 |
|
|
3400 |
(?i:saturday|sunday) |
(?i:saturday|sunday) |
3401 |
(?:(?i)saturday|sunday) |
(?:(?i)saturday|sunday) |
3402 |
|
|
3403 |
match exactly the same set of strings. Because alternative branches are |
match exactly the same set of strings. Because alternative branches are |
3404 |
tried from left to right, and options are not reset until the end of |
tried from left to right, and options are not reset until the end of |
3405 |
the subpattern is reached, an option setting in one branch does affect |
the subpattern is reached, an option setting in one branch does affect |
3406 |
subsequent branches, so the above patterns match "SUNDAY" as well as |
subsequent branches, so the above patterns match "SUNDAY" as well as |
3407 |
"Saturday". |
"Saturday". |
3408 |
|
|
3409 |
|
|
3410 |
NAMED SUBPATTERNS |
NAMED SUBPATTERNS |
3411 |
|
|
3412 |
Identifying capturing parentheses by number is simple, but it can be |
Identifying capturing parentheses by number is simple, but it can be |
3413 |
very hard to keep track of the numbers in complicated regular expres- |
very hard to keep track of the numbers in complicated regular expres- |
3414 |
sions. Furthermore, if an expression is modified, the numbers may |
sions. Furthermore, if an expression is modified, the numbers may |
3415 |
change. To help with this difficulty, PCRE supports the naming of sub- |
change. To help with this difficulty, PCRE supports the naming of sub- |
3416 |
patterns, something that Perl does not provide. The Python syntax |
patterns. This feature was not added to Perl until release 5.10. Python |
3417 |
(?P<name>...) is used. References to capturing parentheses from other |
had the feature earlier, and PCRE introduced it at release 4.0, using |
3418 |
parts of the pattern, such as backreferences, recursion, and condi- |
the Python syntax. PCRE now supports both the Perl and the Python syn- |
3419 |
tions, can be made by name as well as by number. |
tax. |
3420 |
|
|
3421 |
Names consist of up to 32 alphanumeric characters and underscores. |
In PCRE, a subpattern can be named in one of three ways: (?<name>...) |
3422 |
Named capturing parentheses are still allocated numbers as well as |
or (?'name'...) as in Perl, or (?P<name>...) as in Python. References |
3423 |
names. The PCRE API provides function calls for extracting the name-to- |
to capturing parentheses from other parts of the pattern, such as back- |
3424 |
number translation table from a compiled pattern. There is also a con- |
references, recursion, and conditions, can be made by name as well as |
3425 |
venience function for extracting a captured substring by name. |
by number. |
3426 |
|
|
3427 |
|
Names consist of up to 32 alphanumeric characters and underscores. |
3428 |
|
Named capturing parentheses are still allocated numbers as well as |
3429 |
|
names, exactly as if the names were not present. The PCRE API provides |
3430 |
|
function calls for extracting the name-to-number translation table from |
3431 |
|
a compiled pattern. There is also a convenience function for extracting |
3432 |
|
a captured substring by name. |
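
The relationship between names and numbers can be illustrated as follows. This is a sketch in Python's re module, not the PCRE API: the groupindex attribute is Python's counterpart of a name-to-number table, and the date pattern and group names are made up for the example.

```python
import re

pattern = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(\d{2})")
m = pattern.fullmatch("2007-06-19")

# Named groups still receive numbers, exactly as if they were unnamed.
name_to_number = dict(pattern.groupindex)

by_name = m.group("month")    # extraction by name
by_number = m.group(2)        # the same group, addressed by number
unnamed = m.group(3)          # an unnamed group is numbered in sequence

print(name_to_number, by_name, by_number, unnamed)
```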
3433 |
|
|
3434 |
By default, a name must be unique within a pattern, but it is possible |
By default, a name must be unique within a pattern, but it is possible |
3435 |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
to relax this constraint by setting the PCRE_DUPNAMES option at compile |
3439 |
both cases you want to extract the abbreviation. This pattern (ignoring |
both cases you want to extract the abbreviation. This pattern (ignoring |
3440 |
the line breaks) does the job: |
the line breaks) does the job: |
3441 |
|
|
3442 |
(?P<DN>Mon|Fri|Sun)(?:day)?| |
(?<DN>Mon|Fri|Sun)(?:day)?| |
3443 |
(?P<DN>Tue)(?:sday)?| |
(?<DN>Tue)(?:sday)?| |
3444 |
(?P<DN>Wed)(?:nesday)?| |
(?<DN>Wed)(?:nesday)?| |
3445 |
(?P<DN>Thu)(?:rsday)?| |
(?<DN>Thu)(?:rsday)?| |
3446 |
(?P<DN>Sat)(?:urday)? |
(?<DN>Sat)(?:urday)? |
3447 |
|
|
3448 |
There are five capturing substrings, but only one is ever set after a |
There are five capturing substrings, but only one is ever set after a |
3449 |
match. The convenience function for extracting the data by name |
match. The convenience function for extracting the data by name |
3450 |
returns the substring for the first, and in this example, the only, |
returns the substring for the first (and in this example, the only) |
3451 |
subpattern of that name that matched. This saves searching to find |
subpattern of that name that matched. This saves searching to find |
3452 |
which numbered subpattern it was. If you make a reference to a non- |
which numbered subpattern it was. If you make a reference to a non- |
3453 |
unique named subpattern from elsewhere in the pattern, the one that |
unique named subpattern from elsewhere in the pattern, the one that |
3462 |
following items: |
following items: |
3463 |
|
|
3464 |
a literal data character |
a literal data character |
3465 |
the . metacharacter |
the dot metacharacter |
3466 |
the \C escape sequence |
the \C escape sequence |
3467 |
the \X escape sequence (in UTF-8 mode with Unicode properties) |
the \X escape sequence (in UTF-8 mode with Unicode properties) |
3468 |
|
the \R escape sequence |
3469 |
an escape such as \d that matches a single character |
an escape such as \d that matches a single character |
3470 |
a character class |
a character class |
3471 |
a back reference (see next section) |
a back reference (see next section) |
3505 |
The quantifier {0} is permitted, causing the expression to behave as if |
The quantifier {0} is permitted, causing the expression to behave as if |
3506 |
the previous item and the quantifier were not present. |
the previous item and the quantifier were not present. |
3507 |
|
|
3508 |
For convenience (and historical compatibility) the three most common |
For convenience, the three most common quantifiers have single-charac- |
3509 |
quantifiers have single-character abbreviations: |
ter abbreviations: |
3510 |
|
|
3511 |
* is equivalent to {0,} |
* is equivalent to {0,} |
3512 |
+ is equivalent to {1,} |
+ is equivalent to {1,} |
3558 |
which matches one digit by preference, but can match two if that is the |
which matches one digit by preference, but can match two if that is the |
3559 |
only way the rest of the pattern matches. |
only way the rest of the pattern matches. |
3560 |
|
|
3561 |
If the PCRE_UNGREEDY option is set (an option which is not available in |
If the PCRE_UNGREEDY option is set (an option that is not available in |
3562 |
Perl), the quantifiers are not greedy by default, but individual ones |
Perl), the quantifiers are not greedy by default, but individual ones |
3563 |
can be made greedy by following them with a question mark. In other |
can be made greedy by following them with a question mark. In other |
3564 |
words, it inverts the default behaviour. |
words, it inverts the default behaviour. |
3569 |
minimum or maximum. |
minimum or maximum. |
3570 |
|
|
3571 |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- |
If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equiv- |
3572 |
alent to Perl's /s) is set, thus allowing the . to match newlines, the |
alent to Perl's /s) is set, thus allowing the dot to match newlines, |
3573 |
pattern is implicitly anchored, because whatever follows will be tried |
the pattern is implicitly anchored, because whatever follows will be |
3574 |
against every character position in the subject string, so there is no |
tried against every character position in the subject string, so there |
3575 |
point in retrying the overall match at any position after the first. |
is no point in retrying the overall match at any position after the |
3576 |
PCRE normally treats such a pattern as though it were preceded by \A. |
first. PCRE normally treats such a pattern as though it were preceded |
3577 |
|
by \A. |
3578 |
|
|
3579 |
In cases where it is known that the subject string contains no new- |
In cases where it is known that the subject string contains no new- |
3580 |
lines, it is worth setting PCRE_DOTALL in order to obtain this opti- |
lines, it is worth setting PCRE_DOTALL in order to obtain this opti- |
3581 |
mization, or alternatively using ^ to indicate anchoring explicitly. |
mization, or alternatively using ^ to indicate anchoring explicitly. |
3582 |
|
|
3583 |
However, there is one situation where the optimization cannot be used. |
However, there is one situation where the optimization cannot be used. |
3584 |
When .* is inside capturing parentheses that are the subject of a |
When .* is inside capturing parentheses that are the subject of a |
3585 |
backreference elsewhere in the pattern, a match at the start may fail, |
backreference elsewhere in the pattern, a match at the start may fail |
3586 |
and a later one succeed. Consider, for example: |
where a later one succeeds. Consider, for example: |
3587 |
|
|
3588 |
(.*)abc\1 |
(.*)abc\1 |
3589 |
|
|
3590 |
If the subject is "xyz123abc123" the match point is the fourth charac- |
If the subject is "xyz123abc123" the match point is the fourth charac- |
3591 |
ter. For this reason, such a pattern is not implicitly anchored. |
ter. For this reason, such a pattern is not implicitly anchored. |
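
The example can be reproduced with an unanchored search. The sketch below uses Python's re module as a stand-in for PCRE (an assumption); note that it reports zero-based offsets, so the fourth character is offset 3.

```python
import re

subject = "xyz123abc123"

# An anchored attempt fails: no prefix starting at offset 0 is repeated
# after "abc".
anchored = re.match(r"(.*)abc\1", subject)

# An unanchored search succeeds further along the subject.
m = re.search(r"(.*)abc\1", subject)
start = m.start()
captured = m.group(1)

print(anchored, start, captured)
```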
3592 |
|
|
3593 |
When a capturing subpattern is repeated, the value captured is the sub- |
When a capturing subpattern is repeated, the value captured is the sub- |
3596 |
(tweedle[dume]{3}\s*)+ |
(tweedle[dume]{3}\s*)+ |
3597 |
|
|
3598 |
has matched "tweedledum tweedledee" the value of the captured substring |
has matched "tweedledum tweedledee" the value of the captured substring |
3599 |
is "tweedledee". However, if there are nested capturing subpatterns, |
is "tweedledee". However, if there are nested capturing subpatterns, |
3600 |
the corresponding captured values may have been set in previous itera- |
the corresponding captured values may have been set in previous itera- |
3601 |
tions. For example, after |
tions. For example, after |
3602 |
|
|
3603 |
/(a|(b))+/ |
/(a|(b))+/ |
3607 |
|
|
3608 |
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS |
3609 |
|
|
3610 |
With both maximizing and minimizing repetition, failure of what follows |
With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") |
3611 |
normally causes the repeated item to be re-evaluated to see if a dif- |
repetition, failure of what follows normally causes the repeated item |
3612 |
ferent number of repeats allows the rest of the pattern to match. Some- |
to be re-evaluated to see if a different number of repeats allows the |
3613 |
times it is useful to prevent this, either to change the nature of the |
rest of the pattern to match. Sometimes it is useful to prevent this, |
3614 |
match, or to cause it to fail earlier than it otherwise might, when the |
either to change the nature of the match, or to cause it to fail earlier |
3615 |
author of the pattern knows there is no point in carrying on. |
than it otherwise might, when the author of the pattern knows there is |
3616 |
|
no point in carrying on. |
3617 |
|
|
3618 |
Consider, for example, the pattern \d+foo when applied to the subject |
Consider, for example, the pattern \d+foo when applied to the subject |
3619 |
line |
line |
3627 |
the means for specifying that once a subpattern has matched, it is not |
the means for specifying that once a subpattern has matched, it is not |
3628 |
to be re-evaluated in this way. |
to be re-evaluated in this way. |
3629 |
|
|
3630 |
If we use atomic grouping for the previous example, the matcher would |
If we use atomic grouping for the previous example, the matcher gives |
3631 |
give up immediately on failing to match "foo" the first time. The nota- |
up immediately on failing to match "foo" the first time. The notation |
3632 |
tion is a kind of special parenthesis, starting with (?> as in this |
is a kind of special parenthesis, starting with (?> as in this example: |
|
example: |
|
3633 |
|
|
3634 |
(?>\d+)foo |
(?>\d+)foo |
3635 |
|
|
3661 |
Possessive quantifiers are always greedy; the setting of the |
Possessive quantifiers are always greedy; the setting of the |
3662 |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
PCRE_UNGREEDY option is ignored. They are a convenient notation for the |
3663 |
simpler forms of atomic group. However, there is no difference in the |
simpler forms of atomic group. However, there is no difference in the |
3664 |
meaning or processing of a possessive quantifier and the equivalent |
meaning of a possessive quantifier and the equivalent atomic group, |
3665 |
atomic group. |
though there may be a performance difference; possessive quantifiers |
3666 |
|
should be slightly faster. |
3667 |
The possessive quantifier syntax is an extension to the Perl syntax. |
|
3668 |
Jeffrey Friedl originated the idea (and the name) in the first edition |
The possessive quantifier syntax is an extension to the Perl 5.8 syn- |
3669 |
of his book. Mike McCloskey liked it, so implemented it when he built |
tax. Jeffrey Friedl originated the idea (and the name) in the first |
3670 |
Sun's Java package, and PCRE copied it from there. |
edition of his book. Mike McCloskey liked it, so implemented it when he |
3671 |
|
built Sun's Java package, and PCRE copied it from there. It ultimately |
3672 |
When a pattern contains an unlimited repeat inside a subpattern that |
found its way into Perl at release 5.10. |
3673 |
can itself be repeated an unlimited number of times, the use of an |
|
3674 |
atomic group is the only way to avoid some failing matches taking a |
PCRE has an optimization that automatically "possessifies" certain sim- |
3675 |
|
ple pattern constructs. For example, the sequence A+B is treated as |
3676 |
|
A++B because there is no point in backtracking into a sequence of A's |
3677 |
|
when B must follow. |
3678 |
|
|
3679 |
|
When a pattern contains an unlimited repeat inside a subpattern that |
3680 |
|
can itself be repeated an unlimited number of times, the use of an |
3681 |
|
atomic group is the only way to avoid some failing matches taking a |
3682 |
very long time indeed. The pattern |
very long time indeed. The pattern |
3683 |
|
|
3684 |
(\D+|<\d+>)*[!?] |
(\D+|<\d+>)*[!?] |
3685 |
|
|
3686 |
matches an unlimited number of substrings that either consist of non- |
matches an unlimited number of substrings that either consist of non- |
3687 |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
digits, or digits enclosed in <>, followed by either ! or ?. When it |
3688 |
matches, it runs quickly. However, if it is applied to |
matches, it runs quickly. However, if it is applied to |
3689 |
|
|
3690 |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa |
3691 |
|
|
3692 |
it takes a long time before reporting failure. This is because the |
it takes a long time before reporting failure. This is because the |
3693 |
string can be divided between the internal \D+ repeat and the external |
string can be divided between the internal \D+ repeat and the external |
3694 |
* repeat in a large number of ways, and all have to be tried. (The |
* repeat in a large number of ways, and all have to be tried. (The |
3695 |
example uses [!?] rather than a single character at the end, because |
example uses [!?] rather than a single character at the end, because |
3696 |
both PCRE and Perl have an optimization that allows for fast failure |
both PCRE and Perl have an optimization that allows for fast failure |
3697 |
when a single character is used. They remember the last single charac- |
when a single character is used. They remember the last single charac- |
3698 |
ter that is required for a match, and fail early if it is not present |
ter that is required for a match, and fail early if it is not present |
3699 |
in the string.) If the pattern is changed so that it uses an atomic |
in the string.) If the pattern is changed so that it uses an atomic |
3700 |
group, like this: |
group, like this: |
3701 |
|
|
3702 |
((?>\D+)|<\d+>)*[!?] |
((?>\D+)|<\d+>)*[!?] |
3703 |
|
|
3704 |
sequences of non-digits cannot be broken, and failure happens quickly. |
sequences of non-digits cannot be broken, and failure happens quickly. |
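
The fast failure can be observed without PCRE. The sketch below uses Python's re module and deliberately swaps in a different technique: instead of the (?> syntax, which older Python versions do not accept, it emulates the atomic group with a lookahead followed by a back reference. This rests on the assumption that the engine never backtracks into a lookahead once it has succeeded. The unprotected pattern is not run here because it could take a very long time on this subject.

```python
import re

SUBJECT = "a" * 52   # contains neither "!" nor "?", so no match is possible

# Emulated atomic group: the lookahead captures the longest run of
# non-digits, and \1 then consumes exactly that run. Because the
# lookahead is not re-entered on failure, the run cannot be split up.
EMULATED = r"(?:(?=(\D+))\1|<\d+>)*[!?]"

failed = re.match(EMULATED, SUBJECT)

# The second alternative is still reachable when the first one fails.
succeeded = re.match(EMULATED, "<123>!")

print(failed, succeeded.group(0))
```

As with a real atomic group, a run of non-digits that swallows the final "!" or "?" is not given back, so the rewritten pattern does not match every string that the original one does.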
3705 |
|
|
3706 |
|
|
3707 |
BACK REFERENCES |
BACK REFERENCES |
3708 |
|
|
3709 |
Outside a character class, a backslash followed by a digit greater than |
Outside a character class, a backslash followed by a digit greater than |
3710 |
0 (and possibly further digits) is a back reference to a capturing sub- |
0 (and possibly further digits) is a back reference to a capturing sub- |
3711 |
pattern earlier (that is, to its left) in the pattern, provided there |
pattern earlier (that is, to its left) in the pattern, provided there |
3712 |
have been that many previous capturing left parentheses. |
have been that many previous capturing left parentheses. |
3713 |
|
|
3714 |
However, if the decimal number following the backslash is less than 10, |
However, if the decimal number following the backslash is less than 10, |
3715 |
it is always taken as a back reference, and causes an error only if |
it is always taken as a back reference, and causes an error only if |
3716 |
there are not that many capturing left parentheses in the entire pat- |
there are not that many capturing left parentheses in the entire pat- |
3717 |
tern. In other words, the parentheses that are referenced need not be |
tern. In other words, the parentheses that are referenced need not be |
3718 |
to the left of the reference for numbers less than 10. A "forward back |
to the left of the reference for numbers less than 10. A "forward back |
3719 |
reference" of this type can make sense when a repetition is involved |
reference" of this type can make sense when a repetition is involved |
3720 |
and the subpattern to the right has participated in an earlier itera- |
and the subpattern to the right has participated in an earlier itera- |
3721 |
tion. |
tion. |
3722 |
|
|
3723 |
It is not possible to have a numerical "forward back reference" to sub- |
It is not possible to have a numerical "forward back reference" to a |
3724 |
pattern whose number is 10 or more. However, a back reference to any |
subpattern whose number is 10 or more using this syntax because a |
3725 |
subpattern is possible using named parentheses (see below). See also |
sequence such as \50 is interpreted as a character defined in octal. |
3726 |
the subsection entitled "Non-printing characters" above for further |
See the subsection entitled "Non-printing characters" above for further |
3727 |
details of the handling of digits following a backslash. |
details of the handling of digits following a backslash. There is no |
3728 |
|
such problem when named parentheses are used. A back reference to any |
3729 |
|
subpattern is possible using named parentheses (see below). |
3730 |
|
|
3731 |
|
Another way of avoiding the ambiguity inherent in the use of digits |
3732 |
|
following a backslash is to use the \g escape sequence, which is a fea- |
3733 |
|
ture introduced in Perl 5.10. This escape must be followed by a posi- |
3734 |
|
tive or a negative number, optionally enclosed in braces. These exam- |
3735 |
|
ples are all identical: |
3736 |
|
|
3737 |
|
(ring), \1 |
3738 |
|
(ring), \g1 |
3739 |
|
(ring), \g{1} |
3740 |
|
|
3741 |
|
A positive number specifies an absolute reference without the ambiguity |
3742 |
|
that is present in the older syntax. It is also useful when literal |
3743 |
|
digits follow the reference. A negative number is a relative reference. |
3744 |
|
Consider this example: |
3745 |
|
|
3746 |
|
(abc(def)ghi)\g{-1} |
3747 |
|
|
3748 |
|
The sequence \g{-1} is a reference to the most recently started captur- |
3749 |
|
ing subpattern before \g, that is, it is equivalent to \2. Similarly, |
3750 |
|
\g{-2} would be equivalent to \1. The use of relative references can be |
3751 |
|
helpful in long patterns, and also in patterns that are created by |
3752 |
|
joining together fragments that contain references within themselves. |
3753 |
|
|
3754 |
A back reference matches whatever actually matched the capturing sub- |
A back reference matches whatever actually matched the capturing sub- |
3755 |
pattern in the current subject string, rather than anything matching |
pattern in the current subject string, rather than anything matching |
3768 |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
matches "rah rah" and "RAH RAH", but not "RAH rah", even though the |
3769 |
original capturing subpattern is matched caselessly. |
original capturing subpattern is matched caselessly. |
3770 |
|
|
3771 |
Back references to named subpatterns use the Python syntax (?P=name). |
Back references to named subpatterns use the Perl syntax \k<name> or |
3772 |
We could rewrite the above example as follows: |
\k'name' or the Python syntax (?P=name). We could rewrite the above |
3773 |
|
example in either of the following ways: |
3774 |
|
|
3775 |
|
(?<p1>(?i)rah)\s+\k<p1> |
3776 |
(?P<p1>(?i)rah)\s+(?P=p1) |
(?P<p1>(?i)rah)\s+(?P=p1) |
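The case-sensitivity rule for back references can be checked outside PCRE. What follows is a minimal sketch using Python's re module, which is a different engine from PCRE but accepts the (?P<name>...) and (?P=name) spellings. One assumption is made: the inline option is written as the scoped group (?i:rah), because recent Python versions reject a (?i) setting that is not at the start of the pattern. The helper name matches() is hypothetical.

```python
import re

# Group p1 is matched caselessly. The back reference (?P=p1) lies outside
# the scope of the (?i:...) group, so it compares case-sensitively against
# the text that p1 actually captured.
pattern = re.compile(r"(?P<p1>(?i:rah))\s+(?P=p1)")

def matches(subject):
    """Return True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, matches(subject))
```

The third subject is rejected because the back reference must repeat "RAH" exactly, not merely something that the subpattern could match.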
3777 |
|
|
3778 |
A subpattern that is referenced by name may appear in the pattern |
A subpattern that is referenced by name may appear in the pattern |
3779 |
before or after the reference. |
before or after the reference. |
3780 |
|
|
3781 |
There may be more than one back reference to the same subpattern. If a |
There may be more than one back reference to the same subpattern. If a |
3782 |
subpattern has not actually been used in a particular match, any back |
subpattern has not actually been used in a particular match, any back |
3783 |
references to it always fail. For example, the pattern |
references to it always fail. For example, the pattern |
3784 |
|
|
3785 |
(a|(bc))\2 |
(a|(bc))\2 |
3786 |
|
|
3787 |
always fails if it starts to match "a" rather than "bc". Because there |
always fails if it starts to match "a" rather than "bc". Because there |
3788 |
may be many capturing parentheses in a pattern, all digits following |
may be many capturing parentheses in a pattern, all digits following |
3789 |
the backslash are taken as part of a potential back reference number. |
the backslash are taken as part of a potential back reference number. |
3790 |
If the pattern continues with a digit character, some delimiter must be |
If the pattern continues with a digit character, some delimiter must be |
3791 |
used to terminate the back reference. If the PCRE_EXTENDED option is |
used to terminate the back reference. If the PCRE_EXTENDED option is |
3792 |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
set, this can be whitespace. Otherwise an empty comment (see "Com- |
3793 |
ments" below) can be used. |
ments" below) can be used. |
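The two points above (a reference to an unused subpattern fails, and a delimiter is needed before a following digit) can be sketched with Python's re module. This is an illustration under the assumption that Python's behaviour mirrors PCRE for these cases; re.VERBOSE stands in for PCRE_EXTENDED, and the helper name matched_text() is hypothetical.

```python
import re

unused_group = re.compile(r"(a|(bc))\2")

def matched_text(pattern, subject):
    """Return the full matched text, or None if the subject does not match."""
    m = pattern.fullmatch(subject)
    return m.group(0) if m else None

# Group 2 is set only when the "bc" alternative is taken.
print(matched_text(unused_group, "bcbc"))
print(matched_text(unused_group, "aa"))

# Terminating a back reference that is followed by a literal digit:
# whitespace in extended mode, or an empty comment otherwise.
with_space = re.compile(r"(a)\1 0", re.VERBOSE)
with_comment = re.compile(r"(a)\1(?#)0")
print(matched_text(with_space, "aa0"))
print(matched_text(with_comment, "aa0"))
```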
3794 |
|
|
3795 |
A back reference that occurs inside the parentheses to which it refers |
A back reference that occurs inside the parentheses to which it refers |
3796 |
fails when the subpattern is first used, so, for example, (a\1) never |
fails when the subpattern is first used, so, for example, (a\1) never |
3797 |
matches. However, such references can be useful inside repeated sub- |
matches. However, such references can be useful inside repeated sub- |
3798 |
patterns. For example, the pattern |
patterns. For example, the pattern |
3799 |
|
|
3800 |
(a|b\1)+ |
(a|b\1)+ |
3801 |
|
|
3802 |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iter- |
3803 |
ation of the subpattern, the back reference matches the character |
ation of the subpattern, the back reference matches the character |
3804 |
string corresponding to the previous iteration. In order for this to |
string corresponding to the previous iteration. In order for this to |
3805 |
work, the pattern must be such that the first iteration does not need |
work, the pattern must be such that the first iteration does not need |
3806 |
to match the back reference. This can be done using alternation, as in |
to match the back reference. This can be done using alternation, as in |
3807 |
the example above, or by a quantifier with a minimum of zero. |
the example above, or by a quantifier with a minimum of zero. |
3808 |
|
|
3809 |
|
|
3810 |
ASSERTIONS |
ASSERTIONS |
3811 |
|
|
3812 |
An assertion is a test on the characters following or preceding the |
An assertion is a test on the characters following or preceding the |
3813 |
current matching point that does not actually consume any characters. |
current matching point that does not actually consume any characters. |
3814 |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
The simple assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are |
3815 |
described above. |
described above. |
3816 |
|
|
3817 |
More complicated assertions are coded as subpatterns. There are two |
More complicated assertions are coded as subpatterns. There are two |
3818 |
kinds: those that look ahead of the current position in the subject |
kinds: those that look ahead of the current position in the subject |
3819 |
string, and those that look behind it. An assertion subpattern is |
string, and those that look behind it. An assertion subpattern is |
3820 |
matched in the normal way, except that it does not cause the current |
matched in the normal way, except that it does not cause the current |
3821 |
matching position to be changed. |
matching position to be changed. |
3822 |
|
|
3823 |
Assertion subpatterns are not capturing subpatterns, and may not be |
Assertion subpatterns are not capturing subpatterns, and may not be |
3824 |
repeated, because it makes no sense to assert the same thing several |
repeated, because it makes no sense to assert the same thing several |
3825 |
times. If any kind of assertion contains capturing subpatterns within |
times. If any kind of assertion contains capturing subpatterns within |
3826 |
it, these are counted for the purposes of numbering the capturing sub- |
it, these are counted for the purposes of numbering the capturing sub- |
3827 |
patterns in the whole pattern. However, substring capturing is carried |
patterns in the whole pattern. However, substring capturing is carried |
3828 |
out only for positive assertions, because it does not make sense for |
out only for positive assertions, because it does not make sense for |
3829 |
negative assertions. |
negative assertions. |
3830 |
|
|
3831 |
Lookahead assertions |
Lookahead assertions |
3835 |
|
|
3836 |
\w+(?=;) |
\w+(?=;) |
3837 |
|
|
3838 |
matches a word followed by a semicolon, but does not include the semi- |
matches a word followed by a semicolon, but does not include the semi- |
3839 |
colon in the match, and |
colon in the match, and |
3840 |
|
|
3841 |
foo(?!bar) |
foo(?!bar) |
3842 |
|
|
3843 |
matches any occurrence of "foo" that is not followed by "bar". Note |
matches any occurrence of "foo" that is not followed by "bar". Note |
3844 |
that the apparently similar pattern |
that the apparently similar pattern |
3845 |
|
|
3846 |
(?!foo)bar |
(?!foo)bar |
3847 |
|
|
3848 |
does not find an occurrence of "bar" that is preceded by something |
does not find an occurrence of "bar" that is preceded by something |
3849 |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
other than "foo"; it finds any occurrence of "bar" whatsoever, because |
3850 |
the assertion (?!foo) is always true when the next three characters are |
the assertion (?!foo) is always true when the next three characters are |
3851 |
"bar". A lookbehind assertion is needed to achieve the other effect. |
"bar". A lookbehind assertion is needed to achieve the other effect. |
3852 |
|
|
3853 |
If you want to force a matching failure at some point in a pattern, the |
If you want to force a matching failure at some point in a pattern, the |
3854 |
most convenient way to do it is with (?!) because an empty string |
most convenient way to do it is with (?!) because an empty string |
3855 |
always matches, so an assertion that requires there not to be an empty |
always matches, so an assertion that requires there not to be an empty |
3856 |
string must always fail. |
string must always fail. |
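The lookahead examples in this section can be tried with Python's re module, which uses the same (?= and (?! syntax. This is a sketch under the assumption that Python agrees with PCRE on these simple cases; the helper name find() and the sample subjects are hypothetical.

```python
import re

def find(pattern, subject):
    """Return (start offset, matched text) of the first match, or None."""
    m = re.search(pattern, subject)
    return (m.start(), m.group(0)) if m else None

# A word followed by a semicolon; the semicolon is not part of the match.
print(find(r"\w+(?=;)", "int x;"))

# "foo" not followed by "bar": the first "foo" is skipped.
print(find(r"foo(?!bar)", "foobar foobaz"))

# (?!foo)bar still finds "bar" after "foo", because the assertion looks
# at the characters that follow the current position, which are "bar".
print(find(r"(?!foo)bar", "foobar"))

# (?!) forces a failure at every position.
print(find(r"(?!)", "anything"))
```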
3857 |
|
|
3858 |
Lookbehind assertions |
Lookbehind assertions |
3859 |
|
|
3860 |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
Lookbehind assertions start with (?<= for positive assertions and (?<! |
3861 |
for negative assertions. For example, |
for negative assertions. For example, |
3862 |
|
|
3863 |
(?<!foo)bar |
(?<!foo)bar |
3864 |
|
|
3865 |
does find an occurrence of "bar" that is not preceded by "foo". The |
does find an occurrence of "bar" that is not preceded by "foo". The |
3866 |
contents of a lookbehind assertion are restricted such that all the |
contents of a lookbehind assertion are restricted such that all the |
3867 |
strings it matches must have a fixed length. However, if there are sev- |
strings it matches must have a fixed length. However, if there are sev- |
3868 |
eral top-level alternatives, they do not all have to have the same |
eral top-level alternatives, they do not all have to have the same |
3869 |
fixed length. Thus |
fixed length. Thus |
3870 |
|
|
3871 |
(?<=bullock|donkey) |
(?<=bullock|donkey) |
3874 |
|
|
3875 |
(?<!dogs?|cats?) |
(?<!dogs?|cats?) |
3876 |
|
|
3877 |
causes an error at compile time. Branches that match different length |
causes an error at compile time. Branches that match different length |
3878 |
strings are permitted only at the top level of a lookbehind assertion. |
strings are permitted only at the top level of a lookbehind assertion. |
3879 |
This is an extension compared with Perl (at least for 5.8), which |
This is an extension compared with Perl (at least for 5.8), which |
3880 |
requires all branches to match the same length of string. An assertion |
requires all branches to match the same length of string. An assertion |
3881 |
such as |
such as |
3882 |
|
|
3883 |
(?<=ab(c|de)) |
(?<=ab(c|de)) |
3884 |
|
|
3885 |
is not permitted, because its single top-level branch can match two |
is not permitted, because its single top-level branch can match two |
3886 |
different lengths, but it is acceptable if rewritten to use two top- |
different lengths, but it is acceptable if rewritten to use two top- |
3887 |
level branches: |
level branches: |
3888 |
|
|
3889 |
(?<=abc|abde) |
(?<=abc|abde) |
3890 |
|
|
3891 |
The implementation of lookbehind assertions is, for each alternative, |
The implementation of lookbehind assertions is, for each alternative, |
3892 |
to temporarily move the current position back by the fixed width and |
to temporarily move the current position back by the fixed length and |
3893 |
then try to match. If there are insufficient characters before the cur- |
then try to match. If there are insufficient characters before the cur- |
3894 |
rent position, the match is deemed to fail. |
rent position, the assertion fails. |
3895 |
|
|
3896 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
PCRE does not allow the \C escape (which matches a single byte in UTF-8 |
3897 |
mode) to appear in lookbehind assertions, because it makes it impossi- |
mode) to appear in lookbehind assertions, because it makes it impossi- |
3898 |
ble to calculate the length of the lookbehind. The \X escape, which can |
ble to calculate the length of the lookbehind. The \X and \R escapes, |
3899 |
match different numbers of bytes, is also not permitted. |
which can match different numbers of bytes, are also not permitted. |
3900 |
|
|
3901 |
Atomic groups can be used in conjunction with lookbehind assertions to |
Possessive quantifiers can be used in conjunction with lookbehind |
3902 |
specify efficient matching at the end of the subject string. Consider a |
assertions to specify efficient matching at the end of the subject |
3903 |
simple pattern such as |
string. Consider a simple pattern such as |
3904 |
|
|
3905 |
abcd$ |
abcd$ |
3906 |
|
|
3907 |
when applied to a long string that does not match. Because matching |
when applied to a long string that does not match. Because matching |
3908 |
proceeds from left to right, PCRE will look for each "a" in the subject |
proceeds from left to right, PCRE will look for each "a" in the subject |
3909 |
and then see if what follows matches the rest of the pattern. If the |
and then see if what follows matches the rest of the pattern. If the |
3910 |
pattern is specified as |
pattern is specified as |
3911 |
|
|
3912 |
^.*abcd$ |
^.*abcd$ |
3913 |
|
|
3914 |
the initial .* matches the entire string at first, but when this fails |
the initial .* matches the entire string at first, but when this fails |
3915 |
(because there is no following "a"), it backtracks to match all but the |
(because there is no following "a"), it backtracks to match all but the |
3916 |
last character, then all but the last two characters, and so on. Once |
last character, then all but the last two characters, and so on. Once |
3917 |
again the search for "a" covers the entire string, from right to left, |
again the search for "a" covers the entire string, from right to left, |
3918 |
so we are no better off. However, if the pattern is written as |
so we are no better off. However, if the pattern is written as |
3919 |
|
|
|
^(?>.*)(?<=abcd) |
|
|
|
|
|
or, equivalently, using the possessive quantifier syntax, |
|
|
|
|
3920 |
^.*+(?<=abcd) |
^.*+(?<=abcd) |
3921 |
|
|
3922 |
there can be no backtracking for the .* item; it can match only the |
there can be no backtracking for the .*+ item; it can match only the |
3923 |
entire string. The subsequent lookbehind assertion does a single test |
entire string. The subsequent lookbehind assertion does a single test |
3924 |
on the last four characters. If it fails, the match fails immediately. |
on the last four characters. If it fails, the match fails immediately. |
3925 |
For long strings, this approach makes a significant difference to the |
For long strings, this approach makes a significant difference to the |
3926 |
processing time. |
processing time. |
3927 |
|
|
3928 |
Using multiple assertions |
Using multiple assertions |
3931 |
|
|
3932 |
(?<=\d{3})(?<!999)foo |
(?<=\d{3})(?<!999)foo |
3933 |
|
|
3934 |
matches "foo" preceded by three digits that are not "999". Notice that |
matches "foo" preceded by three digits that are not "999". Notice that |
3935 |
each of the assertions is applied independently at the same point in |
each of the assertions is applied independently at the same point in |
3936 |
the subject string. First there is a check that the previous three |
the subject string. First there is a check that the previous three |
3937 |
characters are all digits, and then there is a check that the same |
characters are all digits, and then there is a check that the same |
3938 |
three characters are not "999". This pattern does not match "foo" pre- |
three characters are not "999". This pattern does not match "foo" pre- |
3939 |
ceded by six characters, the first of which are digits and the last |
ceded by six characters, the first of which are digits and the last |
3940 |
three of which are not "999". For example, it doesn't match "123abc- |
three of which are not "999". For example, it doesn't match "123abc- |
3941 |
foo". A pattern to do that is |
foo". A pattern to do that is |
3942 |
|
|
3943 |
(?<=\d{3}...)(?<!999)foo |
(?<=\d{3}...)(?<!999)foo |
3944 |
|
|
3945 |
This time the first assertion looks at the preceding six characters, |
This time the first assertion looks at the preceding six characters, |
3946 |
checking that the first three are digits, and then the second assertion |
checking that the first three are digits, and then the second assertion |
3947 |
checks that the preceding three characters are not "999". |
checks that the preceding three characters are not "999". |
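The difference between the two patterns can be seen by running them over a few subjects. The sketch below uses Python's re module as a stand-in for PCRE, on the assumption that fixed-length lookbehinds behave the same way in both; the names three_back, six_back and hits() are hypothetical.

```python
import re

# Both assertions test the three characters immediately before "foo".
three_back = re.compile(r"(?<=\d{3})(?<!999)foo")

# The first assertion tests six characters back (three digits, then any
# three characters); the second still tests only the last three.
six_back = re.compile(r"(?<=\d{3}...)(?<!999)foo")

def hits(pattern, subject):
    """Return True if the pattern matches somewhere in the subject."""
    return pattern.search(subject) is not None

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    print(subject, hits(three_back, subject), hits(six_back, subject))
```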
3948 |
|
|
3950 |
|
|
3951 |
(?<=(?<!foo)bar)baz |
(?<=(?<!foo)bar)baz |
3952 |
|
|
3953 |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
matches an occurrence of "baz" that is preceded by "bar" which in turn |
3954 |
is not preceded by "foo", while |
is not preceded by "foo", while |
3955 |
|
|
3956 |
(?<=\d{3}(?!999)...)foo |
(?<=\d{3}(?!999)...)foo |
3957 |
|
|
3958 |
is another pattern that matches "foo" preceded by three digits and any |
is another pattern that matches "foo" preceded by three digits and any |
3959 |
three characters that are not "999". |
three characters that are not "999". |
3960 |
|
|
3961 |
|
|
3962 |
CONDITIONAL SUBPATTERNS |
CONDITIONAL SUBPATTERNS |
3963 |
|
|
3964 |
It is possible to cause the matching process to obey a subpattern con- |
It is possible to cause the matching process to obey a subpattern con- |
3965 |
ditionally or to choose between two alternative subpatterns, depending |
ditionally or to choose between two alternative subpatterns, depending |
3966 |
on the result of an assertion, or whether a previous capturing subpat- |
on the result of an assertion, or whether a previous capturing subpat- |
3967 |
tern matched or not. The two possible forms of conditional subpattern |
tern matched or not. The two possible forms of conditional subpattern |
3968 |
are |
are |
3969 |
|
|
3970 |
(?(condition)yes-pattern) |
(?(condition)yes-pattern) |
3971 |
(?(condition)yes-pattern|no-pattern) |
(?(condition)yes-pattern|no-pattern) |
3972 |
|
|
3973 |
If the condition is satisfied, the yes-pattern is used; otherwise the |
If the condition is satisfied, the yes-pattern is used; otherwise the |
3974 |
no-pattern (if present) is used. If there are more than two alterna- |
no-pattern (if present) is used. If there are more than two alterna- |
3975 |
tives in the subpattern, a compile-time error occurs. |
tives in the subpattern, a compile-time error occurs. |
3976 |
|
|
3977 |
There are three kinds of condition. If the text between the parentheses |
There are four kinds of condition: references to subpatterns, refer- |
3978 |
consists of a sequence of digits, or a sequence of alphanumeric charac- |
ences to recursion, a pseudo-condition called DEFINE, and assertions. |
3979 |
ters and underscores, the condition is satisfied if the capturing sub- |
|
3980 |
pattern of that number or name has previously matched. There is a pos- |
Checking for a used subpattern by number |
3981 |
sible ambiguity here, because subpattern names may consist entirely of |
|
3982 |
digits. PCRE looks first for a named subpattern; if it cannot find one |
If the text between the parentheses consists of a sequence of digits, |
3983 |
and the text consists entirely of digits, it looks for a subpattern of |
the condition is true if the capturing subpattern of that number has |
3984 |
that number, which must be greater than zero. Using subpattern names |
previously matched. |
|
that consist entirely of digits is not recommended. |
|
3985 |
|
|
3986 |
Consider the following pattern, which contains non-significant white |
Consider the following pattern, which contains non-significant white |
3987 |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
space to make it more readable (assume the PCRE_EXTENDED option) and to |
3998 |
tern is executed and a closing parenthesis is required. Otherwise, |
tern is executed and a closing parenthesis is required. Otherwise, |
3999 |
since no-pattern is not present, the subpattern matches nothing. In |
since no-pattern is not present, the subpattern matches nothing. In |
4000 |
other words, this pattern matches a sequence of non-parentheses, |
other words, this pattern matches a sequence of non-parentheses, |
4001 |
optionally enclosed in parentheses. Rewriting it to use a named subpat- |
optionally enclosed in parentheses. |
4002 |
tern gives this: |
|
4003 |
|
Checking for a used subpattern by name |
4004 |
|
|
4005 |
|
Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a |
4006 |
|
used subpattern by name. For compatibility with earlier versions of |
4007 |
|
PCRE, which had this facility before Perl, the syntax (?(name)...) is |
4008 |
|
also recognized. However, there is a possible ambiguity with this syn- |
4009 |
|
tax, because subpattern names may consist entirely of digits. PCRE |
4010 |
|
looks first for a named subpattern; if it cannot find one and the name |
4011 |
|
consists entirely of digits, PCRE looks for a subpattern of that num- |
4012 |
|
ber, which must be greater than zero. Using subpattern names that con- |
4013 |
|
sist entirely of digits is not recommended. |
4014 |
|
|
4015 |
|
Rewriting the above example to use a named subpattern gives this: |
4016 |
|
|
4017 |
(?P<OPEN> \( )? [^()]+ (?(OPEN) \) ) |
(?<OPEN> \( )? [^()]+ (?(<OPEN>) \) ) |
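As an illustration, the named-condition pattern can be exercised with Python's re module. Python accepts only the (?P<OPEN>...) and (?(OPEN)...) spellings, so that form is used here; re.VERBOSE plays the role of PCRE_EXTENDED. This is a sketch that assumes Python and PCRE agree on this construct, and the helper name accepts() is hypothetical.

```python
import re

# An optional opening parenthesis, a run of non-parentheses, and a closing
# parenthesis that is required only if the OPEN group took part in the match.
pattern = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)

def accepts(subject):
    """Return True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None

for subject in ("abc", "(abc)", "(abc", "abc)"):
    print(subject, accepts(subject))
```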
4018 |
|
|
4019 |
|
|
4020 |
|
Checking for pattern recursion |
4021 |
|
|
4022 |
If the condition is the string (R), and there is no subpattern with the |
If the condition is the string (R), and there is no subpattern with the |
4023 |
name R, the condition is satisfied if a recursive call to the pattern |
name R, the condition is true if a recursive call to the whole pattern |
4024 |
or subpattern has been made. At "top level", the condition is false. |
or any subpattern has been made. If digits or a name preceded by amper- |
4025 |
This is a PCRE extension. Recursive patterns are described in the next |
sand follow the letter R, for example: |
4026 |
section. |
|
4027 |
|
(?(R3)...) or (?(R&name)...) |
4028 |
|
|
4029 |
|
the condition is true if the most recent recursion is into the subpat- |
4030 |
|
tern whose number or name is given. This condition does not check the |
4031 |
|
entire recursion stack. |
4032 |
|
|
4033 |
|
At "top level", all these recursion test conditions are false. Recur- |
4034 |
|
sive patterns are described below. |
4035 |
|
|
4036 |
|
Defining subpatterns for use by reference only |
4037 |
|
|
4038 |
|
If the condition is the string (DEFINE), and there is no subpattern |
4039 |
|
with the name DEFINE, the condition is always false. In this case, |
4040 |
|
there may be only one alternative in the subpattern. It is always |
4041 |
|
skipped if control reaches this point in the pattern; the idea of |
4042 |
|
DEFINE is that it can be used to define "subroutines" that can be ref- |
4043 |
|
erenced from elsewhere. (The use of "subroutines" is described below.) |
4044 |
|
For example, a pattern to match an IPv4 address could be written like |
4045 |
|
this (ignore whitespace and line breaks): |
4046 |
|
|
4047 |
|
(?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) ) |
4048 |
|
\b (?&byte) (\.(?&byte)){3} \b |
4049 |
|
|
4050 |
|
The first part of the pattern is a DEFINE group inside which another |
4051 |
|
group named "byte" is defined. This matches an individual component of |
4052 |
|
an IPv4 address (a number less than 256). When matching takes place, |
4053 |
|
this part of the pattern is skipped because DEFINE acts like a false |
4054 |
|
condition. |
4055 |
|
|
4056 |
|
The rest of the pattern uses references to the named group to match the |
4057 |
|
four dot-separated components of an IPv4 address, insisting on a word |
4058 |
|
boundary at each end. |
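For engines without DEFINE, a similar effect can be approximated by composing the pattern from a string fragment. The sketch below does this with Python's re module, which has neither DEFINE nor (?&name) calls; it is a substitute technique and not the PCRE mechanism itself. The names BYTE, IPV4 and is_ipv4() are hypothetical.

```python
import re

# The "byte" fragment from the example above, written as a non-capturing group.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Substitute the fragment wherever the PCRE pattern would use (?&byte).
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

def is_ipv4(text):
    """Return True if the whole text is a dotted-quad IPv4 address."""
    return IPV4.fullmatch(text) is not None

for text in ("192.168.0.1", "10.0.0.255", "256.1.1.1", "1.2.3"):
    print(text, is_ipv4(text))
```

Unlike a DEFINE group, string composition repeats the fragment in the compiled pattern each time it is used.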
4059 |
|
|
4060 |
|
Assertion conditions |
4061 |
|
|
4062 |
If the condition is not a sequence of digits or (R), it must be an |
If the condition is not in any of the above formats, it must be an |
4063 |
assertion. This may be a positive or negative lookahead or lookbehind |
assertion. This may be a positive or negative lookahead or lookbehind |
4064 |
assertion. Consider this pattern, again containing non-significant |
assertion. Consider this pattern, again containing non-significant |
4065 |
white space, and with the two alternatives on the second line: |
white space, and with the two alternatives on the second line: |
4094 |
unlimited nested parentheses. Without the use of recursion, the best |
unlimited nested parentheses. Without the use of recursion, the best |
4095 |
that can be done is to use a pattern that matches up to some fixed |
that can be done is to use a pattern that matches up to some fixed |
4096 |
depth of nesting. It is not possible to handle an arbitrary nesting |
depth of nesting. It is not possible to handle an arbitrary nesting |
4097 |
depth. Perl provides a facility that allows regular expressions to |
depth. |
4098 |
recurse (amongst other things). It does this by interpolating Perl code |
|
4099 |
in the expression at run time, and the code can refer to the expression |
For some time, Perl has provided a facility that allows regular expres- |
4100 |
itself. A Perl pattern to solve the parentheses problem can be created |
sions to recurse (amongst other things). It does this by interpolating |
4101 |
like this: |
Perl code in the expression at run time, and the code can refer to the |
4102 |
|
expression itself. A Perl pattern using code interpolation to solve the |
4103 |
|
parentheses problem can be created like this: |
4104 |
|
|
4105 |
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
$re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x; |
4106 |
|
|
4107 |
The (?p{...}) item interpolates Perl code at run time, and in this case |
The (?p{...}) item interpolates Perl code at run time, and in this case |
4108 |
refers recursively to the pattern in which it appears. Obviously, PCRE |
refers recursively to the pattern in which it appears. |
4109 |
cannot support the interpolation of Perl code. Instead, it supports |
|
4110 |
some special syntax for recursion of the entire pattern, and also for |
Obviously, PCRE cannot support the interpolation of Perl code. Instead, |
4111 |
individual subpattern recursion. |
it supports special syntax for recursion of the entire pattern, and |
4112 |
|
also for individual subpattern recursion. After its introduction in |
4113 |
|
PCRE and Python, this kind of recursion was introduced into Perl at |
4114 |
|
release 5.10. |
4115 |
|
|
4116 |
The special item that consists of (? followed by a number greater than |
A special item that consists of (? followed by a number greater than |
4117 |
zero and a closing parenthesis is a recursive call of the subpattern of |
zero and a closing parenthesis is a recursive call of the subpattern of |
4118 |
the given number, provided that it occurs inside that subpattern. (If |
the given number, provided that it occurs inside that subpattern. (If |
4119 |
not, it is a "subroutine" call, which is described in the next sec- |
not, it is a "subroutine" call, which is described in the next sec- |
4120 |
tion.) The special item (?R) is a recursive call of the entire regular |
tion.) The special item (?R) or (?0) is a recursive call of the entire |
4121 |
expression. |
regular expression. |
4122 |
|
|
4123 |
A recursive subpattern call is always treated as an atomic group. That |
In PCRE (like Python, but unlike Perl), a recursive subpattern call is |
4124 |
is, once it has matched some of the subject string, it is never re- |
always treated as an atomic group. That is, once it has matched some of |
4125 |
entered, even if it contains untried alternatives and there is a subse- |
the subject string, it is never re-entered, even if it contains untried |
4126 |
quent matching failure. |
alternatives and there is a subsequent matching failure. |
4127 |
|
|
4128 |
This PCRE pattern solves the nested parentheses problem (assume the |
This PCRE pattern solves the nested parentheses problem (assume the |
4129 |
PCRE_EXTENDED option is set so that white space is ignored): |
PCRE_EXTENDED option is set so that white space is ignored): |
4130 |
|
|
4131 |
\( ( (?>[^()]+) | (?R) )* \) |
\( ( (?>[^()]+) | (?R) )* \) |
4132 |
|
|
4133 |
First it matches an opening parenthesis. Then it matches any number of |
First it matches an opening parenthesis. Then it matches any number of |
4134 |
substrings which can either be a sequence of non-parentheses, or a |
substrings which can either be a sequence of non-parentheses, or a |
4135 |
recursive match of the pattern itself (that is, a correctly parenthe- |
recursive match of the pattern itself (that is, a correctly parenthe- |
4136 |
sized substring). Finally there is a closing parenthesis. |
sized substring). Finally there is a closing parenthesis. |
4137 |
|
|
4138 |
If this were part of a larger pattern, you would not want to recurse |
If this were part of a larger pattern, you would not want to recurse |
4139 |
the entire pattern, so instead you could use this: |
the entire pattern, so instead you could use this: |
4140 |
|
|
4141 |
( \( ( (?>[^()]+) | (?1) )* \) ) |
( \( ( (?>[^()]+) | (?1) )* \) ) |
4142 |
|
|
4143 |
We have put the pattern into parentheses, and caused the recursion to |
We have put the pattern into parentheses, and caused the recursion to |
4144 |
refer to them instead of the whole pattern. In a larger pattern, keep- |
refer to them instead of the whole pattern. In a larger pattern, keep- |
4145 |
ing track of parenthesis numbers can be tricky. It may be more conve- |
ing track of parenthesis numbers can be tricky. It may be more conve- |
4146 |
nient to use named parentheses instead. For this, PCRE uses (?P>name), |
nient to use named parentheses instead. The Perl syntax for this is |
4147 |
which is an extension to the Python syntax that PCRE uses for named |
(?&name); PCRE's earlier syntax (?P>name) is also supported. We could |
4148 |
parentheses (Perl does not provide named parentheses). We could rewrite |
rewrite the above example as follows: |
4149 |
the above example as follows: |
|
4150 |
|
(?<pn> \( ( (?>[^()]+) | (?&pn) )* \) ) |
4151 |
(?P<pn> \( ( (?>[^()]+) | (?P>pn) )* \) ) |
|
4152 |
|
If there is more than one subpattern with the same name, the earliest |
4153 |
This particular example pattern contains nested unlimited repeats, and |
one is used. This particular example pattern contains nested unlimited |
4154 |
so the use of atomic grouping for matching strings of non-parentheses |
repeats, and so the use of atomic grouping for matching strings of non- |
4155 |
is important when applying the pattern to strings that do not match. |
parentheses is important when applying the pattern to strings that do |
4156 |
For example, when this pattern is applied to |
not match. For example, when this pattern is applied to |
4157 |
|
|
4158 |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
(aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() |
4159 |
|
|
4160 |
it yields "no match" quickly. However, if atomic grouping is not used, |
it yields "no match" quickly. However, if atomic grouping is not used, |
4161 |
the match runs for a very long time indeed because there are so many |
the match runs for a very long time indeed because there are so many |
4162 |
different ways the + and * repeats can carve up the subject, and all |
different ways the + and * repeats can carve up the subject, and all |
4163 |
have to be tested before failure can be reported. |
have to be tested before failure can be reported. |
4164 |
|
|
4165 |
At the end of a match, the values set for any capturing subpatterns are |
At the end of a match, the values set for any capturing subpatterns are |
4166 |
those from the outermost level of the recursion at which the subpattern |
those from the outermost level of the recursion at which the subpattern |
4167 |
value is set. If you want to obtain intermediate values, a callout |
value is set. If you want to obtain intermediate values, a callout |
4168 |
function can be used (see the next section and the pcrecallout documen- |
function can be used (see below and the pcrecallout documentation). If |
4169 |
tation). If the pattern above is matched against |
the pattern above is matched against |
4170 |
|
|
4171 |
(ab(cd)ef) |
(ab(cd)ef) |
4172 |
|
|
4173 |
the value for the capturing parentheses is "ef", which is the last |
the value for the capturing parentheses is "ef", which is the last |
4174 |
value taken on at the top level. If additional parentheses are added, |
value taken on at the top level. If additional parentheses are added, |
4175 |
giving |
giving |
4176 |
|
|
4177 |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
\( ( ( (?>[^()]+) | (?R) )* ) \) |
4178 |
^ ^ |
^ ^ |
4179 |
^ ^ |
^ ^ |
4180 |
|
|
4181 |
the string they capture is "ab(cd)ef", the contents of the top level |
the string they capture is "ab(cd)ef", the contents of the top level |
4182 |
parentheses. If there are more than 15 capturing parentheses in a pat- |
parentheses. If there are more than 15 capturing parentheses in a pat- |
4183 |
tern, PCRE has to obtain extra memory to store data during a recursion, |
tern, PCRE has to obtain extra memory to store data during a recursion, |
4184 |
which it does by using pcre_malloc, freeing it via pcre_free after- |
which it does by using pcre_malloc, freeing it via pcre_free after- |
4185 |
wards. If no memory can be obtained, the match fails with the |
wards. If no memory can be obtained, the match fails with the |
4186 |
PCRE_ERROR_NOMEMORY error. |
PCRE_ERROR_NOMEMORY error. |
4187 |
|
|
4188 |
Do not confuse the (?R) item with the condition (R), which tests for |
Do not confuse the (?R) item with the condition (R), which tests for |
4189 |
recursion. Consider this pattern, which matches text in angle brack- |
recursion. Consider this pattern, which matches text in angle brack- |
4190 |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
ets, allowing for arbitrary nesting. Only digits are allowed in nested |
4191 |
brackets (that is, when recursing), whereas any characters are permit- |
brackets (that is, when recursing), whereas any characters are permit- |
4192 |
ted at the outer level. |
ted at the outer level. |
4193 |
|
|
4194 |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
< (?: (?(R) \d++ | [^<>]*+) | (?R)) * > |
4195 |
|
|
4196 |
In this pattern, (?(R) is the start of a conditional subpattern, with |
In this pattern, (?(R) is the start of a conditional subpattern, with |
4197 |
two different alternatives for the recursive and non-recursive cases. |
two different alternatives for the recursive and non-recursive cases. |
4198 |
The (?R) item is the actual recursive call. |
The (?R) item is the actual recursive call. |
4199 |
|
|
4200 |
|
|
4201 |
SUBPATTERNS AS SUBROUTINES |
SUBPATTERNS AS SUBROUTINES |
4202 |
|
|
4203 |
If the syntax for a recursive subpattern reference (either by number or |
If the syntax for a recursive subpattern reference (either by number or |
4204 |
by name) is used outside the parentheses to which it refers, it oper- |
by name) is used outside the parentheses to which it refers, it oper- |
4205 |
ates like a subroutine in a programming language. An earlier example |
ates like a subroutine in a programming language. The "called" subpat- |
4206 |
|
tern may be defined before or after the reference. An earlier example |
4207 |
pointed out that the pattern |
pointed out that the pattern |
4208 |
|
|
4209 |
(sens|respons)e and \1ibility |
(sens|respons)e and \1ibility |
4214 |
(sens|respons)e and (?1)ibility |
(sens|respons)e and (?1)ibility |
4215 |
|
|
4216 |
is used, it does match "sense and responsibility" as well as the other |
is used, it does match "sense and responsibility" as well as the other |
4217 |
two strings. Such references, if given numerically, must follow the |
two strings. Another example is given in the discussion of DEFINE |
4218 |
subpattern to which they refer. However, named references can refer to |
above. |
|
later subpatterns. |
|
4219 |
|
|
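The difference between a back reference and a subroutine call can be checked with a small sketch. Python's standard re module is used here purely for illustration; it is not PCRE. It supports \1 but has no (?1) syntax, so the subroutine call is emulated (an assumption of this example) by writing the group's pattern text a second time. The emulation does not reproduce the atomic-group behaviour of a real PCRE subroutine call.

```python
import re

GROUP = r"(sens|respons)"

# Back reference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(GROUP + r"e and \1ibility")

# Emulated subroutine call: the group's *pattern* is applied again, so
# either alternative may match at the second position.
subroutine = re.compile(GROUP + r"e and (?:sens|respons)ibility")

subjects = [
    "sense and sensibility",
    "response and responsibility",
    "sense and responsibility",
]

for s in subjects:
    print(f"{s!r}: backref={bool(backref.fullmatch(s))} "
          f"subroutine={bool(subroutine.fullmatch(s))}")
```

Only the third subject separates the two forms: the back reference rejects it, while the emulated subroutine call accepts it.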
4220 |
Like recursive subpatterns, a "subroutine" call is always treated as an |
Like recursive subpatterns, a "subroutine" call is always treated as an |
4221 |
atomic group. That is, once it has matched some of the subject string, |
atomic group. That is, once it has matched some of the subject string, |
4222 |
it is never re-entered, even if it contains untried alternatives and |
it is never re-entered, even if it contains untried alternatives and |
4223 |
there is a subsequent matching failure. |
there is a subsequent matching failure. |
4224 |
|
|
4225 |
|
When a subpattern is used as a subroutine, processing options such as |
4226 |
|
case-independence are fixed when the subpattern is defined. They cannot |
4227 |
|
be changed for different calls. For example, consider this pattern: |
4228 |
|
|
4229 |
|
(abc)(?i:(?1)) |
4230 |
|
|
4231 |
|
It matches "abcabc". It does not match "abcABC" because the change of |
4232 |
|
processing option does not affect the called subpattern. |
4233 |
|
|
4234 |
|
|
4235 |
CALLOUTS |
CALLOUTS |
4236 |
|
|
4266 |
gether. A complete description of the interface to the callout function |
gether. A complete description of the interface to the callout function |
4267 |
is given in the pcrecallout documentation. |
is given in the pcrecallout documentation. |
4268 |
|
|
4269 |
Last updated: 06 June 2006 |
|
4270 |
|
SEE ALSO |
4271 |
|
|
4272 |
|
pcreapi(3), pcrecallout(3), pcrematching(3), pcre(3). |
4273 |
|
|
4274 |
|
Last updated: 06 December 2006 |
4275 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
4276 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4277 |
|
|
4377 |
The first data string is matched completely, so pcretest shows the |
The first data string is matched completely, so pcretest shows the |
4378 |
matched substrings. The remaining four strings do not match the com- |
matched substrings. The remaining four strings do not match the com- |
4379 |
plete pattern, but the first two are partial matches. The same test, |
plete pattern, but the first two are partial matches. The same test, |
4380 |
using DFA matching (by means of the \D escape sequence), produces the |
using pcre_dfa_exec() matching (by means of the \D escape sequence), |
4381 |
following output: |
produces the following output: |
4382 |
|
|
4383 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
4384 |
data> 25jun04\P\D |
data> 25jun04\P\D |
4400 |
|
|
4401 |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
When a partial match has been found using pcre_dfa_exec(), it is possi- |
4402 |
ble to continue the match by providing additional subject data and |
ble to continue the match by providing additional subject data and |
4403 |
calling pcre_dfa_exec() again with the PCRE_DFA_RESTART option and the |
calling pcre_dfa_exec() again with the same compiled regular expres- |
4404 |
same working space (where details of the previous partial match are |
sion, this time setting the PCRE_DFA_RESTART option. You must also pass |
4405 |
stored). Here is an example using pcretest, where the \R escape |
the same working space as before, because this is where details of the |
4406 |
sequence sets the PCRE_DFA_RESTART option and the \D escape sequence |
previous partial match are stored. Here is an example using pcretest, |
4407 |
requests the use of pcre_dfa_exec(): |
using the \R escape sequence to set the PCRE_DFA_RESTART option (\P and |
4408 |
|
\D are as above): |
4409 |
|
|
4410 |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/ |
4411 |
data> 23ja\P\D |
data> 23ja\P\D |
4413 |
data> n05\R\D |
data> n05\R\D |
4414 |
0: n05 |
0: n05 |
4415 |
|
|
4416 |
The first call has "23ja" as the subject, and requests partial match- |
The first call has "23ja" as the subject, and requests partial match- |
4417 |
ing; the second call has "n05" as the subject for the continued |
ing; the second call has "n05" as the subject for the continued |
4418 |
(restarted) match. Notice that when the match is complete, only the |
(restarted) match. Notice that when the match is complete, only the |
4419 |
last part is shown; PCRE does not retain the previously partially- |
last part is shown; PCRE does not retain the previously partially- |
4420 |
matched string. It is up to the calling program to do that if it needs |
matched string. It is up to the calling program to do that if it needs |
4421 |
to. |
to. |
4422 |
|
|
4423 |
This facility can be used to pass very long subject strings to |
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial |
4424 |
pcre_dfa_exec(). However, some care is needed for certain types of pat- |
matching over multiple segments. This facility can be used to pass very |
4425 |
tern. |
long subject strings to pcre_dfa_exec(). However, some care is needed |
4426 |
|
for certain types of pattern. |
4427 |
|
|
4428 |
1. If the pattern contains tests for the beginning or end of a line, |
1. If the pattern contains tests for the beginning or end of a line, |
4429 |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
you need to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropri- |
4432 |
|
|
4433 |
2. If the pattern contains backward assertions (including \b or \B), |
2. If the pattern contains backward assertions (including \b or \B), |
4434 |
you need to arrange for some overlap in the subject strings to allow |
you need to arrange for some overlap in the subject strings to allow |
4435 |
for this. For example, you could pass the subject in chunks that were |
for this. For example, you could pass the subject in chunks that are |
4436 |
500 bytes long, but in a buffer of 700 bytes, with the starting offset |
500 bytes long, but in a buffer of 700 bytes, with the starting offset |
4437 |
set to 200 and the previous 200 bytes at the start of the buffer. |
set to 200 and the previous 200 bytes at the start of the buffer. |
4438 |
|
|
4482 |
|
|
4483 |
where no string can be a partial match for both alternatives. |
where no string can be a partial match for both alternatives. |
4484 |
|
|
4485 |
Last updated: 16 January 2006 |
Last updated: 30 November 2006 |
4486 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
4487 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4488 |
|
|
4594 |
makes up a compiled pattern was changed for release 5.0. If you have |
makes up a compiled pattern was changed for release 5.0. If you have |
4595 |
any saved patterns that were compiled with previous releases (not a |
any saved patterns that were compiled with previous releases (not a |
4596 |
facility that was previously advertised), you will have to recompile |
facility that was previously advertised), you will have to recompile |
4597 |
them for release 5.0. However, from now on, it should be possible to |
them for release 5.0 and above. |
|
make changes in a compatible manner. |
|
4598 |
|
|
4599 |
Notwithstanding the above, if you have any saved patterns in UTF-8 mode |
If you have any saved patterns in UTF-8 mode that use \p or \P that |
4600 |
that use \p or \P that were compiled with any release up to and includ- |
were compiled with any release up to and including 6.4, you will have |
4601 |
ing 6.4, you will have to recompile them for release 6.5 and above. |
to recompile them for release 6.5 and above. |
4602 |
|
|
4603 |
Last updated: 01 February 2006 |
All saved patterns from earlier releases must be recompiled for release |
4604 |
|
7.0 or higher, because there was an internal reorganization at that |
4605 |
|
release. |
4606 |
|
|
4607 |
|
Last updated: 28 November 2006 |
4608 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
4609 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4610 |
|
|
4618 |
|
|
4619 |
PCRE PERFORMANCE |
PCRE PERFORMANCE |
4620 |
|
|
4621 |
Certain items that may appear in regular expression patterns are more |
Two aspects of performance are discussed below: memory usage and pro- |
4622 |
efficient than others. It is more efficient to use a character class |
cessing time. The way you express your pattern as a regular expression |
4623 |
like [aeiou] than a set of alternatives such as (a|e|i|o|u). In gen- |
can affect both of them. |
4624 |
eral, the simplest construction that provides the required behaviour is |
|
4625 |
usually the most efficient. Jeffrey Friedl's book contains a lot of |
|
4626 |
useful general discussion about optimizing regular expressions for |
MEMORY USAGE |
4627 |
efficient performance. This document contains a few observations about |
|
4628 |
PCRE. |
Patterns are compiled by PCRE into a reasonably efficient byte code, so |
4629 |
|
that most simple patterns do not use much memory. However, there is one |
4630 |
|
case where memory usage can be unexpectedly large. When a parenthesized |
4631 |
|
subpattern has a quantifier with a minimum greater than 1 and/or a lim- |
4632 |
|
ited maximum, the whole subpattern is repeated in the compiled code. |
4633 |
|
For example, the pattern |
4634 |
|
|
4635 |
|
(abc|def){2,4} |
4636 |
|
|
4637 |
|
is compiled as if it were |
4638 |
|
|
4639 |
|
(abc|def)(abc|def)((abc|def)(abc|def)?)? |
4640 |
|
|
4641 |
|
(Technical aside: It is done this way so that backtrack points within |
4642 |
|
each of the repetitions can be independently maintained.) |
4643 |
|
|
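The expansion described above can be mimicked at the text level to see how quickly nested counted repeats grow. This is only a sketch: expand_quantifier is a hypothetical helper that rewrites pattern text, whereas PCRE replicates its compiled byte code, and the textual rewrite also changes capture group numbering. String length is therefore a rough stand-in for size, not a byte count.

```python
def expand_quantifier(group: str, minimum: int, maximum: int) -> str:
    """Rewrite group{minimum,maximum} as explicit copies of the group.

    The mandatory copies come first; the optional ones are nested so that
    each can only be tried after the one before it has matched.
    """
    tail = ""
    for _ in range(maximum - minimum):
        tail = group + "?" if not tail else "(" + group + tail + ")?"
    return group * minimum + tail


print(expand_quantifier("(abc|def)", 2, 4))
# -> (abc|def)(abc|def)((abc|def)(abc|def)?)?

# Nested repeats multiply: ((ab){1,1000}c){1,3}
inner = expand_quantifier("(ab)", 1, 1000)
outer = expand_quantifier("(" + inner + "c)", 1, 3)
print(len(inner), len(outer))
```

Every level of nesting multiplies the number of copies, which is why a short pattern with large counts can approach the compiled-size limit.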
4644 |
|
For regular expressions whose quantifiers use only small numbers, this |
4645 |
|
is not usually a problem. However, if the numbers are large, and par- |
4646 |
|
ticularly if such repetitions are nested, the memory usage can become |
4647 |
|
an embarrassment. For example, the very simple pattern |
4648 |
|
|
4649 |
|
((ab){1,1000}c){1,3} |
4650 |
|
|
4651 |
|
uses 51K bytes when compiled. When PCRE is compiled with its default |
4652 |
|
internal pointer size of two bytes, the size limit on a compiled pat- |
4653 |
|
tern is 64K, and this is reached with the above pattern if the outer |
4654 |
|
repetition is increased from 3 to 4. PCRE can be compiled to use larger |
4655 |
|
internal pointers and thus handle larger compiled patterns, but it is |
4656 |
|
better to try to rewrite your pattern to use less memory if you can. |
4657 |
|
|
4658 |
|
One way of reducing the memory usage for such patterns is to make use |
4659 |
|
of PCRE's "subroutine" facility. Re-writing the above pattern as |
4660 |
|
|
4661 |
|
((ab)(?2){0,999}c)(?1){0,2} |
4662 |
|
|
4663 |
|
reduces the memory requirements to 18K, and indeed it remains under 20K |
4664 |
|
even with the outer repetition increased to 100. However, this pattern |
4665 |
|
is not exactly equivalent, because the "subroutine" calls are treated |
4666 |
|
as atomic groups into which there can be no backtracking if there is a |
4667 |
|
subsequent matching failure. Therefore, PCRE cannot do this kind of |
4668 |
|
rewriting automatically. Furthermore, there is a noticeable loss of |
4669 |
|
speed when executing the modified pattern. Nevertheless, if the atomic |
4670 |
|
grouping is not a problem and the loss of speed is acceptable, this |
4671 |
|
kind of rewriting will allow you to process patterns that PCRE cannot |
4672 |
|
otherwise handle. |
4673 |
|
|
4674 |
|
|
4675 |
|
PROCESSING TIME |
4676 |
|
|
4677 |
|
Certain items in regular expression patterns are processed more effi- |
4678 |
|
ciently than others. It is more efficient to use a character class like |
4679 |
|
[aeiou] than a set of single-character alternatives such as |
4680 |
|
(a|e|i|o|u). In general, the simplest construction that provides the |
4681 |
|
required behaviour is usually the most efficient. Jeffrey Friedl's book |
4682 |
|
contains a lot of useful general discussion about optimizing regular |
4683 |
|
expressions for efficient performance. This document contains a few |
4684 |
|
observations about PCRE. |
4685 |
|
|
4686 |
Using Unicode character properties (the \p, \P, and \X escapes) is |
Using Unicode character properties (the \p, \P, and \X escapes) is |
4687 |
slow, because PCRE has to scan a structure that contains data for over |
slow, because PCRE has to scan a structure that contains data for over |
4715 |
take a long time to run when applied to a string that does not match. |
take a long time to run when applied to a string that does not match. |
4716 |
Consider the pattern fragment |
Consider the pattern fragment |
4717 |
|
|
4718 |
(a+)* |
^(a+)* |
4719 |
|
|
4720 |
This can match "aaaa" in 33 different ways, and this number increases |
This can match "aaaa" in 16 different ways, and this number increases |
4721 |
very rapidly as the string gets longer. (The * repeat can match 0, 1, |
very rapidly as the string gets longer. (The * repeat can match 0, 1, |
4722 |
2, 3, or 4 times, and for each of those cases other than 0, the + |
2, 3, or 4 times, and for each of those cases other than 0 or 4, the + |
4723 |
repeats can match different numbers of times.) When the remainder of |
repeats can match different numbers of times.) When the remainder of |
4724 |
the pattern is such that the entire match is going to fail, PCRE has in |
the pattern is such that the entire match is going to fail, PCRE has in |
4725 |
principle to try every possible variation, and this can take an |
principle to try every possible variation, and this can take an |
4726 |
extremely long time. |
extremely long time, even for relatively short strings. |
4727 |
|
|
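The figure of 16 can be reproduced by brute force. The sketch below is a counting model only, not PCRE's matcher: it treats each way of matching as a sequence of lengths, one per iteration of the + repeat, that together cover some prefix of the subject. The function names are hypothetical.

```python
def compositions(k):
    """Yield every way to write k as an ordered sum of positive integers."""
    if k == 0:
        yield ()
        return
    for first in range(1, k + 1):
        for rest in compositions(k - first):
            yield (first,) + rest


def ways(n):
    """All ways ^(a+)* can match a prefix of a string of n 'a' characters."""
    return [split for prefix in range(n + 1) for split in compositions(prefix)]


for split in ways(4):
    print(split)          # () is the case where the * repeat matches 0 times

print(len(ways(4)))       # -> 16
print([len(ways(n)) for n in range(1, 11)])
```

Under this model the count doubles with each additional character, so every way has to be ruled out before failure can be reported and the work grows exponentially with the length of the run.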
4728 |
An optimization catches some of the more simple cases such as |
An optimization catches some of the more simple cases such as |
4729 |
|
|
4744 |
In many cases, the solution to this kind of performance issue is to use |
In many cases, the solution to this kind of performance issue is to use |
4745 |
an atomic group or a possessive quantifier. |
an atomic group or a possessive quantifier. |
4746 |
|
|
4747 |
Last updated: 28 February 2005 |
Last updated: 20 September 2006 |
4748 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
4749 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
4750 |
|
|
4751 |
|
|
4959 |
|
|
4960 |
Philip Hazel |
Philip Hazel |
4961 |
University Computing Service, |
University Computing Service, |
4962 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QH, England. |
4963 |
|
|
4964 |
Last updated: 16 January 2006 |
Last updated: 16 January 2006 |
4965 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
5054 |
number of sub-patterns, "i"th captured sub-pattern is |
number of sub-patterns, "i"th captured sub-pattern is |
5055 |
ignored. |
ignored. |
5056 |
|
|
5057 |
|
CAVEAT: An optional sub-pattern that does not exist in the matched |
5058 |
|
string is assigned the empty string. Therefore, the following will |
5059 |
|
return false (because the empty string is not a valid number): |
5060 |
|
|
5061 |
|
int number; |
5062 |
|
pcrecpp::RE("[a-z]+(\\d+)?").FullMatch("abc", &number); |
5063 |
|
|
5064 |
The matching interface supports at most 16 arguments per call. If you |
The matching interface supports at most 16 arguments per call. If you |
5065 |
need more, consider using the more general interface |
need more, consider using the more general interface |
5066 |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
pcrecpp::RE::DoMatch. See pcrecpp.h for the signature for DoMatch. |
5067 |
|
|
5068 |
|
|
5069 |
|
QUOTING METACHARACTERS |
5070 |
|
|
5071 |
|
You can use the "QuoteMeta" operation to insert backslashes before all |
5072 |
|
potentially meaningful characters in a string. The returned string, |
5073 |
|
used as a regular expression, will exactly match the original string. |
5074 |
|
|
5075 |
|
Example: |
5076 |
|
string quoted = RE::QuoteMeta(unquoted); |
5077 |
|
|
5078 |
|
Note that it's legal to escape a character even if it has no special |
5079 |
|
meaning in a regular expression -- so this function does that. (This |
5080 |
|
also makes it identical to the perl function of the same name; see |
5081 |
|
"perldoc -f quotemeta".) For example, "1.5-2.0?" becomes |
5082 |
|
"1\.5\-2\.0\?". |
5083 |
|
|
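The quoting rule can be sketched in a few lines. The version below is not the pcrecpp implementation; it assumes the rule is "put a backslash before every character that is not an ASCII letter, digit, or underscore", which is consistent with the example above. The name quote_meta is hypothetical, and Python's re module stands in for PCRE when checking the result.

```python
import re
import string

# Characters assumed never to need quoting in this sketch.
SAFE = set(string.ascii_letters + string.digits + "_")


def quote_meta(unquoted: str) -> str:
    """Backslash-escape every character outside SAFE."""
    return "".join(c if c in SAFE else "\\" + c for c in unquoted)


quoted = quote_meta("1.5-2.0?")
print(quoted)                                   # -> 1\.5\-2\.0\?

# The quoted string, used as a pattern, matches the original text exactly.
print(bool(re.fullmatch(quoted, "1.5-2.0?")))   # -> True
print(bool(re.fullmatch(quoted, "1x5-2y0")))    # -> False
```

Escaping characters that have no special meaning, such as "-" outside a character class, is harmless and keeps the rule simple.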
5084 |
|
|
5085 |
PARTIAL MATCHES |
PARTIAL MATCHES |
5086 |
|
|
5087 |
You can use the "PartialMatch" operation when you want the pattern to |
You can use the "PartialMatch" operation when you want the pattern to |
5293 |
AUTHOR |
AUTHOR |
5294 |
|
|
5295 |
The C++ wrapper was contributed by Google Inc. |
The C++ wrapper was contributed by Google Inc. |
5296 |
Copyright (c) 2005 Google Inc. |
Copyright (c) 2006 Google Inc. |
5297 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5298 |
|
|
5299 |
|
|
5423 |
quantifier is used to stop any backtracking into the runs of non-"<" |
quantifier is used to stop any backtracking into the runs of non-"<" |
5424 |
characters, but that is not related to stack usage. |
characters, but that is not related to stack usage. |
5425 |
|
|
5426 |
|
This example shows that one way of avoiding stack problems when match- |
5427 |
|
ing long subject strings is to write repeated parenthesized subpatterns |
5428 |
|
to match more than one character whenever possible. |
5429 |
|
|
5430 |
In environments where stack memory is constrained, you might want to |
In environments where stack memory is constrained, you might want to |
5431 |
compile PCRE to use heap memory instead of stack for remembering back- |
compile PCRE to use heap memory instead of stack for remembering back- |
5432 |
up points. This makes it run a lot more slowly, however. Details of how |
up points. This makes it run a lot more slowly, however. Details of how |
5433 |
to do this are given in the pcrebuild documentation. |
to do this are given in the pcrebuild documentation. |
5434 |
|
|
5435 |
In Unix-like environments, there is not often a problem with the stack, |
In Unix-like environments, there is not often a problem with the stack |
5436 |
though the default limit on stack size varies from system to system. |
unless very long strings are involved, though the default limit on |
5437 |
Values from 8Mb to 64Mb are common. You can find your default limit by |
stack size varies from system to system. Values from 8Mb to 64Mb are |
5438 |
running the command: |
common. You can find your default limit by running the command: |
5439 |
|
|
5440 |
ulimit -s |
ulimit -s |
5441 |
|
|
5442 |
The effect of running out of stack is often SIGSEGV, though sometimes |
Unfortunately, the effect of running out of stack is often SIGSEGV, |
5443 |
an error message is given. You can normally increase the limit on stack |
though sometimes a more explicit error message is given. You can nor- |
5444 |
size by code such as this: |
mally increase the limit on stack size by code such as this: |
5445 |
|
|
5446 |
struct rlimit rlim; |
struct rlimit rlim; |
5447 |
getrlimit(RLIMIT_STACK, &rlim); |
getrlimit(RLIMIT_STACK, &rlim); |
5463 |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
recursion. Thus, if you want to limit your stack usage to 8Mb, you |
5464 |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
should set the limit at 16000 recursions. A 64Mb stack, on the other |
5465 |
hand, can support around 128000 recursions. The pcretest test program |
hand, can support around 128000 recursions. The pcretest test program |
5466 |
has a command line option (-S) that can be used to increase its stack. |
has a command line option (-S) that can be used to increase the size of |
5467 |
|
its stack. |
5468 |
|
|
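The two figures quoted above (16000 recursions for 8Mb, 128000 for 64Mb) both work out to about 500 bytes of stack per recursion if 1Mb is taken as 1,000,000 bytes. The sketch below turns that into a small calculation; the per-recursion cost is inferred here from those two numbers and is an assumption that will vary with platform and build, and the function name is hypothetical.

```python
# Assumed average stack cost of one recursion, inferred from the two
# figures quoted in the text (8Mb -> 16000, 64Mb -> 128000).
BYTES_PER_RECURSION = 500


def recursion_limit_for_stack(stack_bytes: int,
                              bytes_per_recursion: int = BYTES_PER_RECURSION) -> int:
    """Return a recursion limit that should keep stack use under stack_bytes."""
    if stack_bytes < 0 or bytes_per_recursion <= 0:
        raise ValueError("stack size must be >= 0 and cost per recursion > 0")
    return stack_bytes // bytes_per_recursion


for megabytes in (8, 64):
    limit = recursion_limit_for_stack(megabytes * 1_000_000)
    print(f"{megabytes}Mb stack -> about {limit} recursions")
```

Treat the result as a starting point and leave some headroom, since the rest of the program uses stack as well.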
5469 |
Last updated: 29 June 2006 |
Last updated: 14 September 2006 |
5470 |
Copyright (c) 1997-2006 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |
5471 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
5472 |
|
|