156 |
support), the escape sequences \ep{..}, \eP{..}, and \eX are supported. |
support), the escape sequences \ep{..}, \eP{..}, and \eX are supported. |
157 |
The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
158 |
category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
159 |
number. A full list is given in the |
number, the Unicode script names such as Arabic or Han, and the derived |
160 |
|
properties Any and L&. A full list is given in the |
161 |
.\" HREF |
.\" HREF |
162 |
\fBpcrepattern\fP |
\fBpcrepattern\fP |
163 |
.\" |
.\" |
164 |
documentation. The PCRE library is increased in size by about 90K when Unicode |
documentation. Only the short names for properties are supported. For example, |
165 |
property support is included. |
\ep{L} matches a letter. Its Perl synonym, \ep{Letter}, is not supported. |
166 |
|
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for |
167 |
|
compatibility with Perl 5.6. PCRE does not support this. |
168 |
.P |
.P |
169 |
The following comments apply when PCRE is running in UTF-8 mode: |
The following comments apply when PCRE is running in UTF-8 mode: |
170 |
.P |
.P |
179 |
PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program |
PCRE when PCRE_NO_UTF8_CHECK is set, the results are undefined. Your program |
180 |
may crash. |
may crash. |
181 |
.P |
.P |
182 |
2. In a pattern, the escape sequence \ex{...}, where the contents of the braces |
2. An unbraced hexadecimal escape sequence (such as \exb3) matches a two-byte |
183 |
is a string of hexadecimal digits, is interpreted as a UTF-8 character whose |
UTF-8 character if the value is greater than 127. |
|
code number is the given hexadecimal number, for example: \ex{1234}. If a |
|
|
non-hexadecimal digit appears between the braces, the item is not recognized. |
|
|
This escape sequence can be used either as a literal, or within a character |
|
|
class. |
|
184 |
.P |
.P |
185 |
3. The original hexadecimal escape sequence, \exhh, matches a two-byte UTF-8 |
3. Repeat quantifiers apply to complete UTF-8 characters, not to individual |
|
character if the value is greater than 127. |
|
|
.P |
|
|
4. Repeat quantifiers apply to complete UTF-8 characters, not to individual |
|
186 |
bytes, for example: \ex{100}{3}. |
bytes, for example: \ex{100}{3}. |
187 |
.P |
.P |
188 |
5. The dot metacharacter matches one UTF-8 character instead of a single byte. |
4. The dot metacharacter matches one UTF-8 character instead of a single byte. |
189 |
.P |
.P |
190 |
6. The escape sequence \eC can be used to match a single byte in UTF-8 mode, |
5. The escape sequence \eC can be used to match a single byte in UTF-8 mode, |
191 |
but its use can lead to some strange effects. This facility is not available in |
but its use can lead to some strange effects. This facility is not available in |
192 |
the alternative matching function, \fBpcre_dfa_exec()\fP. |
the alternative matching function, \fBpcre_dfa_exec()\fP. |
193 |
.P |
.P |
194 |
7. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
6. The character escapes \eb, \eB, \ed, \eD, \es, \eS, \ew, and \eW correctly |
195 |
test characters of any code value, but the characters that PCRE recognizes as |
test characters of any code value, but the characters that PCRE recognizes as |
196 |
digits, spaces, or word characters remain the same set as before, all with |
digits, spaces, or word characters remain the same set as before, all with |
197 |
values less than 256. This remains true even when PCRE includes Unicode |
values less than 256. This remains true even when PCRE includes Unicode |
199 |
cases. If you really want to test for a wider sense of, say, "digit", you |
cases. If you really want to test for a wider sense of, say, "digit", you |
200 |
must use Unicode property tests such as \ep{Nd}. |
must use Unicode property tests such as \ep{Nd}. |
201 |
.P |
.P |
202 |
8. Similarly, characters that match the POSIX named character classes are all |
7. Similarly, characters that match the POSIX named character classes are all |
203 |
low-valued characters. |
low-valued characters. |
204 |
.P |
.P |
205 |
9. Case-insensitive matching applies only to characters whose values are less |
8. Case-insensitive matching applies only to characters whose values are less |
206 |
than 128, unless PCRE is built with Unicode property support. Even when Unicode |
than 128, unless PCRE is built with Unicode property support. Even when Unicode |
207 |
property support is available, PCRE still uses its own character tables when |
property support is available, PCRE still uses its own character tables when |
208 |
checking the case of low-valued characters, so as not to degrade performance. |
checking the case of low-valued characters, so as not to degrade performance. |
209 |
The Unicode property information is used only for characters with higher |
The Unicode property information is used only for characters with higher |
210 |
values. |
values. Even when Unicode property support is available, PCRE supports |
211 |
|
case-insensitive matching only when there is a one-to-one mapping between a |
212 |
|
letter's cases. There are a small number of many-to-one mappings in Unicode; |
213 |
|
these are not supported by PCRE. |
214 |
. |
. |
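The byte counts behind the hexadecimal escape and quantifier items above can be checked with a small stand-alone sketch. It does not call PCRE at all: utf8_encode() is a hypothetical helper written for this illustration, not part of the PCRE API, and it assumes the caller passes a valid code point no greater than 0x10FFFF.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a PCRE function): encode one code point as
 * UTF-8 into buf, which must have room for 4 bytes, and return the
 * number of bytes written. Assumes cp <= 0x10FFFF; surrogates are not
 * rejected in this sketch.
 */
static size_t utf8_encode(unsigned long cp, unsigned char *buf)
{
    if (cp < 0x80) {
        /* Values up to 127 are a single byte. */
        buf[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        /* 128..2047 need two bytes: 110xxxxx 10xxxxxx. */
        buf[0] = (unsigned char)(0xC0 | (cp >> 6));
        buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        /* 2048..65535 need three bytes. */
        buf[0] = (unsigned char)(0xE0 | (cp >> 12));
        buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* Everything else up to 0x10FFFF needs four bytes. */
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With this helper, the value 0xb3 encodes as the two bytes 0xc2 0xb3, which is why \exb3 has to match a two-byte character in UTF-8 mode. The value 0x100 also takes two bytes (0xc4 0x80), so \ex{100}{3} consumes six bytes of subject text while matching three characters.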
215 |
.SH AUTHOR |
.SH AUTHOR |
216 |
.rs |
.rs |
226 |
by a dot, at the domain ucs.cam.ac.uk. |
by a dot, at the domain ucs.cam.ac.uk. |
227 |
.sp |
.sp |
228 |
.in 0 |
.in 0 |
229 |
Last updated: 07 March 2005 |
Last updated: 24 January 2006 |
230 |
.br |
.br |
231 |
Copyright (c) 1997-2005 University of Cambridge. |
Copyright (c) 1997-2006 University of Cambridge. |