114 |
There is no limit to the number of parenthesized subpatterns, but there |
There is no limit to the number of parenthesized subpatterns, but there |
115 |
can be no more than 65535 capturing subpatterns. |
can be no more than 65535 capturing subpatterns. |
116 |
|
|
117 |
|
If a non-capturing subpattern with an unlimited repetition quantifier |
118 |
|
can match an empty string, there is a limit of 1000 on the number of |
119 |
|
times it can be repeated while not matching an empty string - if it |
120 |
|
does match an empty string, the loop is immediately broken. |
121 |
|
|
122 |
The maximum length of the name of a named subpattern is 32 characters, and |
The maximum length of the name of a named subpattern is 32 characters, and |
123 |
the maximum number of named subpatterns is 10000. |
the maximum number of named subpatterns is 10000. |
124 |
|
|
125 |
The maximum length of a subject string is the largest positive number |
The maximum length of a subject string is the largest positive number |
126 |
that an integer variable can hold. However, when using the traditional |
that an integer variable can hold. However, when using the traditional |
127 |
matching function, PCRE uses recursion to handle subpatterns and indef- |
matching function, PCRE uses recursion to handle subpatterns and indef- |
128 |
inite repetition. This means that the available stack space may limit |
inite repetition. This means that the available stack space may limit |
129 |
the size of a subject string that can be processed by certain patterns. |
the size of a subject string that can be processed by certain patterns. |
130 |
For a discussion of stack issues, see the pcrestack documentation. |
For a discussion of stack issues, see the pcrestack documentation. |
131 |
|
|
132 |
|
|
133 |
UTF-8 AND UNICODE PROPERTY SUPPORT |
UTF-8 AND UNICODE PROPERTY SUPPORT |
134 |
|
|
135 |
From release 3.3, PCRE has had some support for character strings |
From release 3.3, PCRE has had some support for character strings |
136 |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
encoded in the UTF-8 format. For release 4.0 this was greatly extended |
137 |
to cover most common requirements, and in release 5.0 additional sup- |
to cover most common requirements, and in release 5.0 additional sup- |
138 |
port for Unicode general category properties was added. |
port for Unicode general category properties was added. |
139 |
|
|
140 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
In order to process UTF-8 strings, you must build PCRE to include UTF-8 |
141 |
support in the code, and, in addition, you must call pcre_compile() |
support in the code, and, in addition, you must call pcre_compile() |
142 |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
with the PCRE_UTF8 option flag. When you do this, both the pattern and |
143 |
any subject strings that are matched against it are treated as UTF-8 |
any subject strings that are matched against it are treated as UTF-8 |
144 |
strings instead of just strings of bytes. |
strings instead of just strings of bytes. |
145 |
|
|
146 |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
If you compile PCRE with UTF-8 support, but do not use it at run time, |
147 |
the library will be a bit bigger, but the additional run time overhead |
the library will be a bit bigger, but the additional run time overhead |
148 |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
is limited to testing the PCRE_UTF8 flag occasionally, so should not be |
149 |
very big. |
very big. |
150 |
|
|
151 |
If PCRE is built with Unicode character property support (which implies |
If PCRE is built with Unicode character property support (which implies |
152 |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
UTF-8 support), the escape sequences \p{..}, \P{..}, and \X are sup- |
153 |
ported. The available properties that can be tested are limited to the |
ported. The available properties that can be tested are limited to the |
154 |
general category properties such as Lu for an upper case letter or Nd |
general category properties such as Lu for an upper case letter or Nd |
155 |
for a decimal number, the Unicode script names such as Arabic or Han, |
for a decimal number, the Unicode script names such as Arabic or Han, |
156 |
and the derived properties Any and L&. A full list is given in the |
and the derived properties Any and L&. A full list is given in the |
157 |
pcrepattern documentation. Only the short names for properties are sup- |
pcrepattern documentation. Only the short names for properties are sup- |
158 |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
ported. For example, \p{L} matches a letter. Its Perl synonym, \p{Let- |
159 |
ter}, is not supported. Furthermore, in Perl, many properties may |
ter}, is not supported. Furthermore, in Perl, many properties may |
160 |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
optionally be prefixed by "Is", for compatibility with Perl 5.6. PCRE |
161 |
does not support this. |
does not support this. |
162 |
|
|
163 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
164 |
|
|
165 |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
1. When you set the PCRE_UTF8 flag, the strings passed as patterns and |
166 |
subjects are checked for validity on entry to the relevant functions. |
subjects are checked for validity on entry to the relevant functions. |
167 |
If an invalid UTF-8 string is passed, an error return is given. In some |
If an invalid UTF-8 string is passed, an error return is given. In some |
168 |
situations, you may already know that your strings are valid, and |
situations, you may already know that your strings are valid, and |
169 |
therefore want to skip these checks in order to improve performance. If |
therefore want to skip these checks in order to improve performance. If |
170 |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, |
171 |
PCRE assumes that the pattern or subject it is given (respectively) |
PCRE assumes that the pattern or subject it is given (respectively) |
172 |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
contains only valid UTF-8 codes. In this case, it does not diagnose an |
173 |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE when |
174 |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program may |
175 |
crash. |
crash. |
176 |
|
|
177 |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
2. An unbraced hexadecimal escape sequence (such as \xb3) matches a |
178 |
two-byte UTF-8 character if the value is greater than 127. |
two-byte UTF-8 character if the value is greater than 127. |
179 |
|
|
180 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
3. Octal numbers up to \777 are recognized, and match two-byte UTF-8 |
181 |
characters for values greater than \177. |
characters for values greater than \177. |
182 |
|
|
183 |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
4. Repeat quantifiers apply to complete UTF-8 characters, not to indi- |
184 |
vidual bytes, for example: \x{100}{3}. |
vidual bytes, for example: \x{100}{3}. |
185 |
|
|
186 |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
5. The dot metacharacter matches one UTF-8 character instead of a sin- |
187 |
gle byte. |
gle byte. |
188 |
|
|
189 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
6. The escape sequence \C can be used to match a single byte in UTF-8 |
190 |
mode, but its use can lead to some strange effects. This facility is |
mode, but its use can lead to some strange effects. This facility is |
191 |
not available in the alternative matching function, pcre_dfa_exec(). |
not available in the alternative matching function, pcre_dfa_exec(). |
192 |
|
|
193 |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
7. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
194 |
test characters of any code value, but the characters that PCRE recog- |
test characters of any code value, but the characters that PCRE recog- |
195 |
nizes as digits, spaces, or word characters remain the same set as |
nizes as digits, spaces, or word characters remain the same set as |
196 |
before, all with values less than 256. This remains true even when PCRE |
before, all with values less than 256. This remains true even when PCRE |
197 |
includes Unicode property support, because to do otherwise would slow |
includes Unicode property support, because to do otherwise would slow |
198 |
down PCRE in many common cases. If you really want to test for a wider |
down PCRE in many common cases. If you really want to test for a wider |
199 |
sense of, say, "digit", you must use Unicode property tests such as |
sense of, say, "digit", you must use Unicode property tests such as |
200 |
\p{Nd}. |
\p{Nd}. |
201 |
|
|
202 |
8. Similarly, characters that match the POSIX named character classes |
8. Similarly, characters that match the POSIX named character classes |
203 |
are all low-valued characters. |
are all low-valued characters. |
204 |
|
|
205 |
9. However, the Perl 5.10 horizontal and vertical whitespace matching |
9. However, the Perl 5.10 horizontal and vertical whitespace matching |
206 |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char- |
207 |
acters. |
acters. |
208 |
|
|
209 |
10. Case-insensitive matching applies only to characters whose values |
10. Case-insensitive matching applies only to characters whose values |
210 |
are less than 128, unless PCRE is built with Unicode property support. |
are less than 128, unless PCRE is built with Unicode property support. |
211 |
Even when Unicode property support is available, PCRE still uses its |
Even when Unicode property support is available, PCRE still uses its |
212 |
own character tables when checking the case of low-valued characters, |
own character tables when checking the case of low-valued characters, |
213 |
so as not to degrade performance. The Unicode property information is |
so as not to degrade performance. The Unicode property information is |
214 |
used only for characters with higher values. Even when Unicode property |
used only for characters with higher values. Even when Unicode property |
215 |
support is available, PCRE supports case-insensitive matching only when |
support is available, PCRE supports case-insensitive matching only when |
216 |
there is a one-to-one mapping between a letter's cases. There are a |
there is a one-to-one mapping between a letter's cases. There are a |
217 |
small number of many-to-one mappings in Unicode; these are not sup- |
small number of many-to-one mappings in Unicode; these are not sup- |
218 |
ported by PCRE. |
ported by PCRE. |
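
As a rough illustration of items 1 to 5 above, the following C sketch may help. It is not part of PCRE: utf8_encode() and utf8_count() are hypothetical helper names, and the checks follow the RFC 3629 rules (at most four bytes, no surrogates, no overlong forms), which is an assumption made here and may differ from the checks PCRE itself applies. It shows why \xb3 and \777 each occupy two bytes in UTF-8 mode, and how an application might confirm that a string is well formed before deciding to set PCRE_NO_UTF8_CHECK.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a PCRE function): encode code point cp as UTF-8.
   Returns the number of bytes written to out (1 to 4), or 0 if cp is a
   surrogate or lies above 0x10FFFF (RFC 3629 rules, assumed here). */
int utf8_encode(unsigned long cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 128..2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* three bytes */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Hypothetical helper (not a PCRE function): return the number of UTF-8
   characters in s[0..len), or -1 if the bytes are not well-formed UTF-8. */
long utf8_count(const unsigned char *s, size_t len)
{
    /* Smallest code point that legitimately needs 1, 2, 3 or 4 bytes. */
    static const unsigned long min_cp[4] = { 0, 0x80, 0x800, 0x10000 };
    long count = 0;
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        unsigned long cp;
        size_t extra;                      /* continuation bytes expected */

        if (c < 0x80)                { extra = 0; cp = c; }
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else return -1;                    /* stray continuation or bad lead */

        if (extra > len - i - 1)
            return -1;                     /* sequence is truncated */

        for (size_t k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return -1;                 /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;

        i += extra + 1;
        count++;                           /* one character, not one byte */
    }
    return count;
}
```

For example, utf8_encode(0xB3, buf) stores the two bytes 0xC2 0xB3, and utf8_count() applied to the six bytes that encode \x{100} three times returns 3, which matches the character-based view taken by quantifiers and the dot metacharacter in UTF-8 mode.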
219 |
|
|
220 |
|
|
224 |
University Computing Service |
University Computing Service |
225 |
Cambridge CB2 3QH, England. |
Cambridge CB2 3QH, England. |
226 |
|
|
227 |
Putting an actual email address here seems to have been a spam magnet, |
Putting an actual email address here seems to have been a spam magnet, |
228 |
so I've taken it away. If you want to email me, use my two initials, |
so I've taken it away. If you want to email me, use my two initials, |
229 |
followed by the two digits 10, at the domain cam.ac.uk. |
followed by the two digits 10, at the domain cam.ac.uk. |
230 |
|
|
231 |
|
|
232 |
REVISION |
REVISION |
233 |
|
|
234 |
Last updated: 13 June 2007 |
Last updated: 30 July 2007 |
235 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
236 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
237 |
|
|
464 |
|
|
465 |
PCRE assumes by default that it will run in an environment where the |
PCRE assumes by default that it will run in an environment where the |
466 |
character code is ASCII (or Unicode, which is a superset of ASCII). |
character code is ASCII (or Unicode, which is a superset of ASCII). |
467 |
PCRE can, however, be compiled to run in an EBCDIC environment by |
This is the case for most computer operating systems. PCRE can, how- |
468 |
adding |
ever, be compiled to run in an EBCDIC environment by adding |
469 |
|
|
470 |
--enable-ebcdic |
--enable-ebcdic |
471 |
|
|
472 |
to the configure command. This setting implies --enable-rebuild-charta- |
to the configure command. This setting implies --enable-rebuild-charta- |
473 |
bles. |
bles. You should only use it if you know that you are in an EBCDIC |
474 |
|
environment (for example, an IBM mainframe operating system). |
475 |
|
|
476 |
|
|
477 |
SEE ALSO |
SEE ALSO |
488 |
|
|
489 |
REVISION |
REVISION |
490 |
|
|
491 |
Last updated: 05 June 2007 |
Last updated: 30 July 2007 |
492 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
493 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
494 |
|
|
1565 |
Return a copy of the options with which the pattern was compiled. The |
Return a copy of the options with which the pattern was compiled. The |
1566 |
fourth argument should point to an unsigned long int variable. These |
fourth argument should point to an unsigned long int variable. These |
1567 |
option bits are those specified in the call to pcre_compile(), modified |
option bits are those specified in the call to pcre_compile(), modified |
1568 |
by any top-level option settings within the pattern itself. |
by any top-level option settings at the start of the pattern itself. In |
1569 |
|
other words, they are the options that will be in force when matching |
1570 |
|
starts. For example, if the pattern /(?im)abc(?-i)d/ is compiled with |
1571 |
|
the PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE, |
1572 |
|
and PCRE_EXTENDED. |
1573 |
|
|
1574 |
A pattern is automatically anchored by PCRE if all of its top-level |
A pattern is automatically anchored by PCRE if all of its top-level |
1575 |
alternatives begin with one of the following: |
alternatives begin with one of the following: |
2060 |
field in a pcre_extra structure (or defaulted) was reached. See the |
field in a pcre_extra structure (or defaulted) was reached. See the |
2061 |
description above. |
description above. |
2062 |
|
|
|
PCRE_ERROR_NULLWSLIMIT (-22) |
|
|
|
|
|
When a group that can match an empty substring is repeated with an |
|
|
unbounded upper limit, the subject position at the start of the group |
|
|
must be remembered, so that a test for an empty string can be made when |
|
|
the end of the group is reached. Some workspace is required for this; |
|
|
if it runs out, this error is given. |
|
|
|
|
2063 |
PCRE_ERROR_BADNEWLINE (-23) |
PCRE_ERROR_BADNEWLINE (-23) |
2064 |
|
|
2065 |
An invalid combination of PCRE_NEWLINE_xxx options was given. |
An invalid combination of PCRE_NEWLINE_xxx options was given. |
2066 |
|
|
2067 |
Error numbers -16 to -20 are not used by pcre_exec(). |
Error numbers -16 to -20 and -22 are not used by pcre_exec(). |
2068 |
|
|
2069 |
|
|
2070 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
2419 |
|
|
2420 |
REVISION |
REVISION |
2421 |
|
|
2422 |
Last updated: 13 June 2007 |
Last updated: 30 July 2007 |
2423 |
Copyright (c) 1997-2007 University of Cambridge. |
Copyright (c) 1997-2007 University of Cambridge. |
2424 |
------------------------------------------------------------------------------ |
------------------------------------------------------------------------------ |
2425 |
|
|