1323 |
LOCALE SUPPORT |
LOCALE SUPPORT |
1324 |
|
|
1325 |
PCRE handles caseless matching, and determines whether characters are |
PCRE handles caseless matching, and determines whether characters are |
1326 |
letters digits, or whatever, by reference to a set of tables, indexed |
letters, digits, or whatever, by reference to a set of tables, indexed |
1327 |
by character value. When running in UTF-8 mode, this applies only to |
by character value. When running in UTF-8 mode, this applies only to |
1328 |
characters with codes less than 128. Higher-valued codes never match |
characters with codes less than 128. Higher-valued codes never match |
1329 |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
escapes such as \w or \d, but can be tested with \p if PCRE is built |
1330 |
with Unicode character property support. The use of locales with Uni- |
with Unicode character property support. The use of locales with Uni- |
1331 |
code is discouraged. |
code is discouraged. If you are handling characters with codes greater |
1332 |
|
than 128, you should either use UTF-8 and Unicode, or use locales, but |
1333 |
An internal set of tables is created in the default C locale when PCRE |
not try to mix the two. |
1334 |
is built. This is used when the final argument of pcre_compile() is |
|
1335 |
NULL, and is sufficient for many applications. An alternative set of |
PCRE contains an internal set of tables that are used when the final |
1336 |
tables can, however, be supplied. These may be created in a different |
argument of pcre_compile() is NULL. These are sufficient for many |
1337 |
locale from the default. As more and more applications change to using |
applications. Normally, the internal tables recognize only ASCII char- |
1338 |
Unicode, the need for this locale support is expected to die away. |
acters. However, when PCRE is built, it is possible to cause the inter- |
1339 |
|
nal tables to be rebuilt in the default "C" locale of the local system, |
1340 |
External tables are built by calling the pcre_maketables() function, |
which may cause them to be different. |
1341 |
which has no arguments, in the relevant locale. The result can then be |
|
1342 |
passed to pcre_compile() or pcre_exec() as often as necessary. For |
The internal tables can always be overridden by tables supplied by the |
1343 |
example, to build and use tables that are appropriate for the French |
application that calls PCRE. These may be created in a different locale |
1344 |
locale (where accented characters with values greater than 128 are |
from the default. As more and more applications change to using Uni- |
1345 |
|
code, the need for this locale support is expected to die away. |
1346 |
|
|
1347 |
|
External tables are built by calling the pcre_maketables() function, |
1348 |
|
which has no arguments, in the relevant locale. The result can then be |
1349 |
|
passed to pcre_compile() or pcre_exec() as often as necessary. For |
1350 |
|
example, to build and use tables that are appropriate for the French |
1351 |
|
locale (where accented characters with values greater than 128 are |
1352 |
treated as letters), the following code could be used: |
treated as letters), the following code could be used: |
1353 |
|
|
1354 |
setlocale(LC_CTYPE, "fr_FR"); |
setlocale(LC_CTYPE, "fr_FR"); |
1355 |
tables = pcre_maketables(); |
tables = pcre_maketables(); |
1356 |
re = pcre_compile(..., tables); |
re = pcre_compile(..., tables); |
1357 |
|
|
1358 |
|
The locale name "fr_FR" is used on Linux and other Unix-like systems; |
1359 |
|
if you are using Windows, the name for the French locale is "french". |
1360 |
|
|
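The idea of a classification table that is indexed by character value and built in whatever locale is current can be sketched in standard C alone. The sketch below is only an illustration: it is not PCRE's actual table layout, and the names TABLE_SIZE, build_word_table and build_word_table_in_locale are hypothetical helpers that do not exist in the PCRE API. It assumes that a "word" character is a letter, a digit, or an underscore, as classified by the <ctype.h> functions under the current LC_CTYPE setting.

```c
#include <ctype.h>
#include <locale.h>

#define TABLE_SIZE 256

/* Hypothetical helper, not part of PCRE: fill a table, indexed by
   character value, with 1 for "word" characters (letter, digit or
   underscore) and 0 otherwise, using the current LC_CTYPE locale. */
static void build_word_table(unsigned char table[TABLE_SIZE])
{
    for (int c = 0; c < TABLE_SIZE; c++)
        table[c] = (isalnum(c) || c == '_') ? 1 : 0;
}

/* Hypothetical helper: switch LC_CTYPE to the named locale and build
   the table there. Returns 0 if the locale is not available on this
   system (setlocale() returns NULL), in which case the table is left
   untouched; returns 1 on success. */
static int build_word_table_in_locale(const char *locale_name,
                                      unsigned char table[TABLE_SIZE])
{
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return 0;
    build_word_table(table);
    return 1;
}
```

Built in the "C" locale, such a table marks only ASCII letters, digits and the underscore, so the entry for 0xE9 is 0. Built in a single-byte French locale, if one is installed, the same loop may mark some codes above 127 as letters, which is the effect the text describes for tables passed to pcre_compile(). Checking the return value of setlocale() matters because locale names such as "fr_FR" or "french" are system dependent.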
1361 |
When pcre_maketables() runs, the tables are built in memory that is |
When pcre_maketables() runs, the tables are built in memory that is |
1362 |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
obtained via pcre_malloc. It is the caller's responsibility to ensure |
1363 |
that the memory containing the tables remains available for as long as |
that the memory containing the tables remains available for as long as |
2928 |
is a letter or digit. The definition of letters and digits is con- |
is a letter or digit. The definition of letters and digits is con- |
2929 |
trolled by PCRE's low-valued character tables, and may vary if locale- |
trolled by PCRE's low-valued character tables, and may vary if locale- |
2930 |
specific matching is taking place (see "Locale support" in the pcreapi |
specific matching is taking place (see "Locale support" in the pcreapi |
2931 |
page). For example, in the "fr_FR" (French) locale, some character |
page). For example, in a French locale such as "fr_FR" in Unix-like |
2932 |
codes greater than 128 are used for accented letters, and these are |
systems, or "french" in Windows, some character codes greater than 128 |
2933 |
matched by \w. |
are used for accented letters, and these are matched by \w. |
2934 |
|
|
2935 |
In UTF-8 mode, characters with values greater than 128 never match \d, |
In UTF-8 mode, characters with values greater than 128 never match \d, |
2936 |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
\s, or \w, and always match \D, \S, and \W. This is true even when Uni- |
3299 |
If a range that includes letters is used when caseless matching is set, |
If a range that includes letters is used when caseless matching is set, |
3300 |
it matches the letters in either case. For example, [W-c] is equivalent |
it matches the letters in either case. For example, [W-c] is equivalent |
3301 |
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if |
to [][\\^_`wxyzabc], matched caselessly, and in non-UTF-8 mode, if |
3302 |
character tables for the "fr_FR" locale are in use, [\xc8-\xcb] matches |
character tables for a French locale are in use, [\xc8-\xcb] matches |
3303 |
accented E characters in both cases. In UTF-8 mode, PCRE supports the |
accented E characters in both cases. In UTF-8 mode, PCRE supports the |
3304 |
concept of case for characters with values greater than 128 only when |
concept of case for characters with values greater than 128 only when |
3305 |
it is compiled with Unicode property support. |
it is compiled with Unicode property support. |