13 |
man page, in case the conversion went wrong. |
man page, in case the conversion went wrong. |
14 |
<br> |
<br> |
15 |
<br><b> |
<br><b> |
16 |
UTF-8, UTF-16, AND UNICODE PROPERTY SUPPORT |
UTF-8, UTF-16, UTF-32, AND UNICODE PROPERTY SUPPORT |
17 |
</b><br> |
</b><br> |
18 |
<P> |
<P> |
19 |
From Release 8.30, in addition to its previous UTF-8 support, PCRE also |
As well as UTF-8 support, PCRE also supports UTF-16 (from release 8.30) and |
20 |
supports UTF-16 by means of a separate 16-bit library. This can be built as |
UTF-32 (from release 8.32), by means of two additional libraries. They can be |
21 |
well as, or instead of, the 8-bit library. |
built as well as, or instead of, the 8-bit library. |
|
</P> |
|
|
<P> |
|
|
From Release 8.32, in addition to its previous UTF-8 and UTF-16 support, |
|
|
PCRE also supports UTF-32 by means of a separate 32-bit library. This can be |
|
|
built as well as, or instead of, the 8-bit and 16-bit libraries. |
|
22 |
</P> |
</P> |
23 |
<br><b> |
<br><b> |
24 |
UTF-8 SUPPORT |
UTF-8 SUPPORT |
28 |
support, and, in addition, you must call |
support, and, in addition, you must call |
29 |
<a href="pcre_compile.html"><b>pcre_compile()</b></a> |
<a href="pcre_compile.html"><b>pcre_compile()</b></a> |
30 |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
with the PCRE_UTF8 option flag, or the pattern must start with the sequence |
31 |
(*UTF8). When either of these is the case, both the pattern and any subject |
(*UTF8) or (*UTF). When either of these is the case, both the pattern and any |
32 |
strings that are matched against it are treated as UTF-8 strings instead of |
subject strings that are matched against it are treated as UTF-8 strings |
33 |
strings of 1-byte characters. |
instead of strings of individual 1-byte characters. |
34 |
</P> |
</P> |
35 |
<br><b> |
<br><b> |
36 |
UTF-16 SUPPORT |
UTF-16 AND UTF-32 SUPPORT |
37 |
</b><br> |
</b><br> |
38 |
<P> |
<P> |
39 |
In order to process UTF-16 strings, you must build PCRE's 16-bit library with UTF |
In order to process UTF-16 or UTF-32 strings, you must build PCRE's 16-bit or |
40 |
support, and, in addition, you must call |
32-bit library with UTF support, and, in addition, you must call |
41 |
<a href="pcre_compile.html"><b>pcre16_compile()</b></a> |
<a href="pcre16_compile.html"><b>pcre16_compile()</b></a> |
42 |
with the PCRE_UTF16 option flag, or the pattern must start with the sequence |
or |
43 |
(*UTF16). When either of these is the case, both the pattern and any subject |
<a href="pcre32_compile.html"><b>pcre32_compile()</b></a> |
44 |
strings that are matched against it are treated as UTF-16 strings instead of |
with the PCRE_UTF16 or PCRE_UTF32 option flag, as appropriate. Alternatively, |
45 |
strings of 16-bit characters. |
the pattern must start with the sequence (*UTF16) or (*UTF32), as appropriate, or |
46 |
</P> |
(*UTF), which can be used with either library. When UTF mode is set, both the |
47 |
<br><b> |
pattern and any subject strings that are matched against it are treated as |
48 |
UTF-32 SUPPORT |
UTF-16 or UTF-32 strings instead of strings of individual 16-bit or 32-bit |
49 |
</b><br> |
characters. |
|
<P> |
|
|
In order to process UTF-32 strings, you must build PCRE's 32-bit library with UTF |
|
|
support, and, in addition, you must call |
|
|
<a href="pcre_compile.html"><b>pcre32_compile()</b></a> |
|
|
with the PCRE_UTF32 option flag, or the pattern must start with the sequence |
|
|
(*UTF32). When either of these is the case, both the pattern and any subject |
|
|
strings that are matched against it are treated as UTF-32 strings instead of |
|
|
strings of 32-bit characters. |
|
50 |
</P> |
</P> |
51 |
<br><b> |
<br><b> |
52 |
UTF SUPPORT OVERHEAD |
UTF SUPPORT OVERHEAD |
65 |
The available properties that can be tested are limited to the general |
The available properties that can be tested are limited to the general |
66 |
category properties such as Lu for an upper case letter or Nd for a decimal |
category properties such as Lu for an upper case letter or Nd for a decimal |
67 |
number, the Unicode script names such as Arabic or Han, and the derived |
number, the Unicode script names such as Arabic or Han, and the derived |
68 |
properties Any and L&. A full list is given in the |
properties Any and L&. Full lists are given in the |
69 |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
70 |
|
and |
71 |
|
<a href="pcresyntax.html"><b>pcresyntax</b></a> |
72 |
documentation. Only the short names for properties are supported. For example, |
documentation. Only the short names for properties are supported. For example, |
73 |
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported. |
\p{L} matches a letter. Its Perl synonym, \p{Letter}, is not supported. |
74 |
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for |
Furthermore, in Perl, many properties may optionally be prefixed by "Is", for |
85 |
which are themselves derived from the Unicode specification. Earlier releases |
which are themselves derived from the Unicode specification. Earlier releases |
86 |
of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit |
of PCRE followed the rules of RFC 2279, which allows the full range of 31-bit |
87 |
values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 |
values (0 to 0x7FFFFFFF). The current check allows only values in the range U+0 |
88 |
to U+10FFFF, excluding the surrogate area, and the non-characters. |
to U+10FFFF, excluding the surrogate area and the non-characters. |
89 |
</P> |
</P> |
90 |
<P> |
<P> |
91 |
Excluded code points are the "Surrogate Area" of Unicode. They are reserved |
Characters in the "Surrogate Area" of Unicode are reserved for use by UTF-16, |
92 |
for use by UTF-16, where they are used in pairs to encode codepoints with |
where they are used in pairs to encode codepoints with values greater than |
93 |
values greater than 0xFFFF. The code points that are encoded by UTF-16 pairs |
0xFFFF. The code points that are encoded by UTF-16 pairs are available |
94 |
are available independently in the UTF-8 encoding. (In other words, the whole |
independently in the UTF-8 and UTF-32 encodings. (In other words, the whole |
95 |
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8.) |
surrogate thing is a fudge for UTF-16 which unfortunately messes up UTF-8 and |
96 |
|
UTF-32.) |
97 |
</P> |
</P> |
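To illustrate the surrogate mechanism described above, here is a minimal Python sketch of how UTF-16 splits a code point above 0xFFFF into a pair of surrogates and recombines it. It is not part of PCRE; the function names are hypothetical and chosen only for this example.

```python
def to_surrogate_pair(cp):
    """Split a code point above 0xFFFF into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = cp - 0x10000             # 20-bit value to distribute over the pair
    high = 0xD800 + (offset >> 10)    # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

Because the values 0xD800 to 0xDFFF are consumed by this scheme, they cannot stand for characters of their own, which is why the validity checks reject them in every UTF mode.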
98 |
<P> |
<P> |
99 |
Also excluded are the "Non-Characters" code points, which are U+FDD0 to U+FDEF |
Also excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF |
100 |
and the last two code points in each plane, U+??FFFE and U+??FFFF. |
and the last two code points in each plane, U+??FFFE and U+??FFFF. |
101 |
</P> |
</P> |
102 |
<P> |
<P> |
109 |
<P> |
<P> |
110 |
In some situations, you may already know that your strings are valid, and |
In some situations, you may already know that your strings are valid, and |
111 |
therefore want to skip these checks in order to improve performance, for |
therefore want to skip these checks in order to improve performance, for |
112 |
example in the case of a long subject string that is being scanned repeatedly |
example in the case of a long subject string that is being scanned repeatedly. |
113 |
with different patterns. If you set the PCRE_NO_UTF8_CHECK flag at compile time |
If you set the PCRE_NO_UTF8_CHECK flag at compile time or at run time, PCRE |
114 |
or at run time, PCRE assumes that the pattern or subject it is given |
assumes that the pattern or subject it is given (respectively) contains only |
115 |
(respectively) contains only valid UTF-8 codes. In this case, it does not |
valid UTF-8 codes. In this case, it does not diagnose an invalid UTF-8 string. |
116 |
diagnose an invalid UTF-8 string. |
</P> |
117 |
</P> |
<P> |
118 |
<P> |
Note that passing PCRE_NO_UTF8_CHECK to <b>pcre_compile()</b> just disables the |
119 |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, what |
check for the pattern; it does not also apply to subject strings. If you want |
120 |
happens depends on why the string is invalid. If the string conforms to the |
to disable the check for a subject string you must pass this option to |
121 |
"old" definition of UTF-8 (RFC 2279), it is processed as a string of characters |
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>. |
122 |
in the range 0 to 0x7FFFFFFF by <b>pcre_dfa_exec()</b> and the interpreted |
</P> |
123 |
version of <b>pcre_exec()</b>. In other words, apart from the initial validity |
<P> |
124 |
test, these functions (when in UTF-8 mode) handle strings according to the more |
If you pass an invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set, the result |
125 |
liberal rules of RFC 2279. However, the just-in-time (JIT) optimization for |
is undefined and your program may crash. |
|
<b>pcre_exec()</b> supports only RFC 3629. If you are using JIT optimization, or |
|
|
if the string does not even conform to RFC 2279, the result is undefined. Your |
|
|
program may crash. |
|
|
</P> |
|
|
<P> |
|
|
If you want to process strings of values in the full range 0 to 0x7FFFFFFF, |
|
|
encoded in a UTF-8-like manner as per the old RFC, you can set |
|
|
PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in this |
|
|
situation, you will have to apply your own validity check, and avoid the use of |
|
|
JIT optimization. |
|
126 |
<a name="utf16strings"></a></P> |
<a name="utf16strings"></a></P> |
127 |
<br><b> |
<br><b> |
128 |
Validity of UTF-16 strings |
Validity of UTF-16 strings |
135 |
must be used in pairs in the correct manner. |
must be used in pairs in the correct manner. |
136 |
</P> |
</P> |
137 |
<P> |
<P> |
138 |
Excluded are the "Non-Characters" code points, which are U+FDD0 to U+FDEF |
Excluded are the "Non-Character" code points, which are U+FDD0 to U+FDEF |
139 |
and the last two code points in each plane, U+??FFFE and U+??FFFF. |
and the last two code points in each plane, U+??FFFE and U+??FFFF. |
140 |
</P> |
</P> |
141 |
<P> |
<P> |
151 |
the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that |
the PCRE_NO_UTF16_CHECK flag at compile time or at run time, PCRE assumes that |
152 |
the pattern or subject it is given (respectively) contains only valid UTF-16 |
the pattern or subject it is given (respectively) contains only valid UTF-16 |
153 |
sequences. In this case, it does not diagnose an invalid UTF-16 string. |
sequences. In this case, it does not diagnose an invalid UTF-16 string. |
154 |
|
However, if an invalid string is passed, the result is undefined. |
155 |
<a name="utf32strings"></a></P> |
<a name="utf32strings"></a></P> |
156 |
<br><b> |
<br><b> |
157 |
Validity of UTF-32 strings |
Validity of UTF-32 strings |
161 |
passed as patterns and subjects are (by default) checked for validity on entry |
passed as patterns and subjects are (by default) checked for validity on entry |
162 |
to the relevant functions. This check allows only values in the range U+0 |
to the relevant functions. This check allows only values in the range U+0 |
163 |
to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF, and the |
to U+10FFFF, excluding the surrogate area U+D800 to U+DFFF, and the |
164 |
"Non-Characters" code points, which are U+FDD0 to U+FDEF and the last two |
"Non-Character" code points, which are U+FDD0 to U+FDEF and the last two |
165 |
characters in each plane, U+??FFFE and U+??FFFF. |
characters in each plane, U+??FFFE and U+??FFFF. |
166 |
</P> |
</P> |
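The rule stated in this paragraph can be written as a small predicate. The following Python sketch mirrors the text only; it is not PCRE's actual implementation, and the name is_valid_code_point is an assumption made for illustration.

```python
def is_valid_code_point(cp):
    """Return True if cp passes the validity rule described in the text."""
    if cp < 0 or cp > 0x10FFFF:
        return False                  # outside the range U+0 to U+10FFFF
    if 0xD800 <= cp <= 0xDFFF:
        return False                  # surrogate area
    if 0xFDD0 <= cp <= 0xFDEF:
        return False                  # non-character block
    if (cp & 0xFFFE) == 0xFFFE:
        return False                  # U+??FFFE and U+??FFFF in each plane
    return True
```

Note that under this rule U+10FFFF itself is rejected, because it is the last code point of its plane.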
167 |
<P> |
<P> |
177 |
the PCRE_NO_UTF32_CHECK flag at compile time or at run time, PCRE assumes that |
the PCRE_NO_UTF32_CHECK flag at compile time or at run time, PCRE assumes that |
178 |
the pattern or subject it is given (respectively) contains only valid UTF-32 |
the pattern or subject it is given (respectively) contains only valid UTF-32 |
179 |
sequences. In this case, it does not diagnose an invalid UTF-32 string. |
sequences. In this case, it does not diagnose an invalid UTF-32 string. |
180 |
</P> |
However, if an invalid string is passed, the result is undefined. |
|
<P> |
|
|
UTF-32 only uses the lowest 21 bits of the 32-bit characters, and the |
|
|
application may use the upper bits for internal purposes. To allow you to |
|
|
pass these strings to PCRE unmodified (thus avoiding the costly operation of |
|
|
creating a copy of the string with the upper bits masked), PCRE accepts |
|
|
these 32-bit character strings as-is, but only uses the lowest 21 bits for |
|
|
matching, if you pass the PCRE_NO_UTF32_CHECK flag to <b>pcre32_exec()</b> and |
|
|
<b>pcre32_dfa_exec()</b>. However, in this situation, you will have to apply |
|
|
your own validity check, and avoid the use of JIT optimization. |
|
|
(The latter restriction may be lifted in a later version of PCRE.) |
|
181 |
</P> |
</P> |
182 |
<br><b> |
<br><b> |
183 |
General comments about UTF modes |
General comments about UTF modes |
184 |
</b><br> |
</b><br> |
185 |
<P> |
<P> |
186 |
1. Codepoints less than 256 can be specified by either braced or unbraced |
1. Codepoints less than 256 can be specified in patterns by either braced or |
187 |
hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger values |
unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3). Larger |
188 |
have to use braced sequences. |
values have to use braced sequences. |
189 |
</P> |
</P> |
190 |
<P> |
<P> |
191 |
2. Octal numbers up to \777 are recognized, and in UTF-8 mode, they match |
2. Octal numbers up to \777 are recognized, and in UTF-8 mode they match |
192 |
two-byte characters for values greater than \177. |
two-byte characters for values greater than \177. |
193 |
</P> |
</P> |
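The statement in item 2 about two-byte characters can be checked with a short Python sketch. It uses only Python's built-in UTF-8 encoder, not PCRE, and the helper name utf8_length is hypothetical.

```python
def utf8_length(cp):
    """Return the number of bytes used to encode code point cp in UTF-8."""
    return len(chr(cp).encode("utf-8"))


# Octal \177 is the last value that fits in one byte; \200 through \777
# (U+0080 to U+01FF) all need two bytes.
lengths = {oct(v): utf8_length(v) for v in (0o177, 0o200, 0o777)}
```

Two bytes suffice up to U+07FF, so every value that a three-digit octal escape can express above \177 falls in the two-byte range.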
194 |
<P> |
<P> |
206 |
multi-unit characters (see the description of \C in the |
multi-unit characters (see the description of \C in the |
207 |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
<a href="pcrepattern.html"><b>pcrepattern</b></a> |
208 |
documentation). The use of \C is not supported in the alternative matching |
documentation). The use of \C is not supported in the alternative matching |
209 |
function <b>pcre[16|32]_dfa_exec()</b>, nor is it supported in UTF mode by the JIT |
function <b>pcre[16|32]_dfa_exec()</b>, nor is it supported in UTF mode by the |
210 |
optimization of <b>pcre[16|32]_exec()</b>. If JIT optimization is requested for a |
JIT optimization of <b>pcre[16|32]_exec()</b>. If JIT optimization is requested |
211 |
UTF pattern that contains \C, it will not succeed, and so the matching will |
for a UTF pattern that contains \C, it will not succeed, and so the matching |
212 |
be carried out by the normal interpretive function. |
will be carried out by the normal interpretive function. |
213 |
</P> |
</P> |
214 |
<P> |
<P> |
215 |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
6. The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly |
261 |
REVISION |
REVISION |
262 |
</b><br> |
</b><br> |
263 |
<P> |
<P> |
264 |
Last updated: 25 September 2012 |
Last updated: 11 November 2012 |
265 |
<br> |
<br> |
266 |
Copyright © 1997-2012 University of Cambridge. |
Copyright © 1997-2012 University of Cambridge. |
267 |
<br> |
<br> |