118 |
The following comments apply when PCRE is running in UTF-8 |
The following comments apply when PCRE is running in UTF-8 |
119 |
mode: |
mode: |
120 |
|
|
121 |
1. PCRE assumes that the strings it is given contain valid |
1. When you set the PCRE_UTF8 flag, the strings passed as |
122 |
UTF-8 codes. It does not diagnose invalid UTF-8 strings. If |
patterns and subjects are checked for validity on entry to |
123 |
you pass invalid UTF-8 strings to PCRE, the results are |
the relevant functions. If an invalid UTF-8 string is |
124 |
undefined. |
passed, an error return is given. In some situations, you |
125 |
|
may already know that your strings are valid, and therefore |
126 |
|
want to skip these checks in order to improve performance. |
127 |
|
If you set the PCRE_NO_UTF8_CHECK flag at compile time or at |
128 |
|
run time, PCRE assumes that the pattern or subject it is |
129 |
|
given (respectively) contains only valid UTF-8 codes. In |
130 |
|
this case, it does not diagnose an invalid UTF-8 string. If |
131 |
|
you pass an invalid UTF-8 string to PCRE when |
132 |
|
PCRE_NO_UTF8_CHECK is set, the results are undefined. Your |
133 |
|
program may crash. |
134 |
|
|
135 |
2. In a pattern, the escape sequence \x{...}, where the con- |
2. In a pattern, the escape sequence \x{...}, where the con- |
136 |
tents of the braces is a string of hexadecimal digits, is |
tents of the braces is a string of hexadecimal digits, is |
173 |
Cambridge CB2 3QG, England. |
Cambridge CB2 3QG, England. |
174 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
175 |
|
|
176 |
Last updated: 04 February 2003 |
Last updated: 20 August 2003 |
177 |
Copyright (c) 1997-2003 University of Cambridge. |
Copyright (c) 1997-2003 University of Cambridge. |
178 |
----------------------------------------------------------------------------- |
----------------------------------------------------------------------------- |
179 |
|
|
663 |
option changes the behaviour of PCRE are given in the sec- |
option changes the behaviour of PCRE are given in the sec- |
664 |
tion on UTF-8 support in the main pcre page. |
tion on UTF-8 support in the main pcre page. |
665 |
|
|
666 |
|
PCRE_NO_UTF8_CHECK |
667 |
|
|
668 |
|
When PCRE_UTF8 is set, the validity of the pattern as a |
669 |
|
UTF-8 string is automatically checked. If an invalid UTF-8 |
670 |
|
sequence of bytes is found, pcre_compile() returns an error. |
671 |
|
If you already know that your pattern is valid, and you want |
672 |
|
to skip this check for performance reasons, you can set the |
673 |
|
PCRE_NO_UTF8_CHECK option. When it is set, the effect of |
674 |
|
passing an invalid UTF-8 string as a pattern is undefined. |
675 |
|
It may cause your program to crash. Note that there is a |
676 |
|
similar option for suppressing the checking of subject |
677 |
|
strings passed to pcre_exec(). |
678 |
|
|
679 |
|
|
680 |
|
|
681 |
STUDYING A PATTERN |
STUDYING A PATTERN |
682 |
|
|
770 |
compiled pattern. It replaces the obsolete pcre_info() func- |
compiled pattern. It replaces the obsolete pcre_info() func- |
771 |
tion, which is nevertheless retained for backwards compati- |
tion, which is nevertheless retained for backwards compati- |
772 |
bility (and is documented below). |
bility (and is documented below). |
|
|
|
773 |
The first argument for pcre_fullinfo() is a pointer to the |
The first argument for pcre_fullinfo() is a pointer to the |
774 |
compiled pattern. The second argument is the result of |
compiled pattern. The second argument is the result of |
775 |
pcre_study(), or NULL if the pattern was not studied. The |
pcre_study(), or NULL if the pattern was not studied. The |
1036 |
turned out to be anchored by virtue of its contents, it can- |
turned out to be anchored by virtue of its contents, it can- |
1037 |
not be made unanchored at matching time. |
not be made unanchored at matching time. |
1038 |
|
|
1039 |
|
When PCRE_UTF8 was set at compile time, the validity of the |
1040 |
|
subject as a UTF-8 string is automatically checked. If an |
1041 |
|
invalid UTF-8 sequence of bytes is found, pcre_exec() |
1042 |
|
returns the error PCRE_ERROR_BADUTF8. If you already know |
1043 |
|
that your subject is valid, and you want to skip this check |
1044 |
|
for performance reasons, you can set the PCRE_NO_UTF8_CHECK |
1045 |
|
option when calling pcre_exec(). When this option is set, |
1046 |
|
the effect of passing an invalid UTF-8 string as a subject |
1047 |
|
is undefined. It may cause your program to crash. |
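
The added text says PCRE_NO_UTF8_CHECK is only safe when the caller already knows the subject is valid. One way a caller might establish that is to validate a buffer once up front and then reuse it across many pcre_exec() calls. The sketch below is a minimal illustration of such a check. The function name is_valid_utf8 is hypothetical and not part of PCRE, and the rules it applies are those of RFC 3629 (at most four bytes, no overlong forms, no surrogates), which is an assumption: PCRE's own check may accept or reject a different set of sequences.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: returns 1 if the `len` bytes at `s`
   are well-formed UTF-8 under RFC 3629 rules (an assumption; PCRE's internal
   check may differ), and 0 otherwise. */
static int is_valid_utf8(const unsigned char *s, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = s[i];
        size_t extra;       /* continuation bytes expected after the lead byte */
        size_t k;
        unsigned long cp;   /* code point decoded so far */
        unsigned long min;  /* smallest code point allowed for this length */

        if (c < 0x80) {     /* plain ASCII byte */
            i++;
            continue;
        } else if ((c & 0xE0) == 0xC0) {
            extra = 1; cp = c & 0x1F; min = 0x80;
        } else if ((c & 0xF0) == 0xE0) {
            extra = 2; cp = c & 0x0F; min = 0x800;
        } else if ((c & 0xF8) == 0xF0) {
            extra = 3; cp = c & 0x07; min = 0x10000;
        } else {
            return 0;       /* stray continuation byte or invalid lead byte */
        }

        if (len - i <= extra)
            return 0;       /* sequence is truncated at end of buffer */

        for (k = 1; k <= extra; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;   /* expected a 10xxxxxx continuation byte */
            cp = (cp << 6) | (unsigned long)(s[i + k] & 0x3F);
        }

        if (cp < min)
            return 0;       /* overlong encoding */
        if (cp > 0x10FFFFUL)
            return 0;       /* beyond the Unicode range */
        if (cp >= 0xD800UL && cp <= 0xDFFFUL)
            return 0;       /* UTF-16 surrogate */

        i += extra + 1;
    }
    return 1;
}
```

A caller that has run a check like this on a subject could then pass PCRE_NO_UTF8_CHECK to pcre_exec() for that same, unmodified buffer. If the buffer changes, the check has to be repeated, or the option dropped.
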
1048 |
|
|
1049 |
There are also three further options that can be set only at |
There are also three further options that can be set only at |
1050 |
matching time: |
matching time: |
1051 |
|
|
1135 |
used for a fragment of a pattern that picks out a substring. |
used for a fragment of a pattern that picks out a substring. |
1136 |
PCRE supports several other kinds of parenthesized subpat- |
PCRE supports several other kinds of parenthesized subpat- |
1137 |
tern that do not cause substrings to be captured. |
tern that do not cause substrings to be captured. |
|
|
|
1138 |
Captured substrings are returned to the caller via a vector |
Captured substrings are returned to the caller via a vector |
1139 |
of integer offsets whose address is passed in ovector. The |
of integer offsets whose address is passed in ovector. The |
1140 |
number of elements in the vector is passed in ovecsize. The |
number of elements in the vector is passed in ovecsize. The |
1250 |
distinctive error code. See the pcrecallout documentation |
distinctive error code. See the pcrecallout documentation |
1251 |
for details. |
for details. |
1252 |
|
|
1253 |
|
PCRE_ERROR_BADUTF8 (-10) |
1254 |
|
|
1255 |
|
A string that contains an invalid UTF-8 byte sequence was |
1256 |
|
passed as a subject. |
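
As an illustration of how a caller might react to this code, the sketch below classifies a pcre_exec() return value. The helper name describe_exec_result is hypothetical. The value -10 is taken from the text above and is defined locally only so that the fragment compiles without pcre.h; real code should include pcre.h and use its definition.

```c
/* Value copied from the documentation text; defined here only so that the
   sketch compiles without pcre.h. Real code should include pcre.h. */
#ifndef PCRE_ERROR_BADUTF8
#define PCRE_ERROR_BADUTF8 (-10)
#endif

/* Hypothetical helper: turn a pcre_exec() return value into a short
   description. Non-negative values mean the match succeeded. */
static const char *describe_exec_result(int rc)
{
    if (rc >= 0)
        return "match";
    if (rc == PCRE_ERROR_BADUTF8)
        return "subject is not valid UTF-8";
    return "no match or other error";
}
```

Because this error is only reported when the check is enabled, a caller that sets PCRE_NO_UTF8_CHECK should not expect to see it.
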
1257 |
|
|
1258 |
|
|
1259 |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
EXTRACTING CAPTURED SUBSTRINGS BY NUMBER |
1260 |
|
|
1291 |
returned zero, indicating that it ran out of space in ovec- |
returned zero, indicating that it ran out of space in ovec- |
1292 |
tor, the value passed as stringcount should be the size of |
tor, the value passed as stringcount should be the size of |
1293 |
the vector divided by three. |
the vector divided by three. |
|
|
|
1294 |
The functions pcre_copy_substring() and pcre_get_substring() |
The functions pcre_copy_substring() and pcre_get_substring() |
1295 |
extract a single substring, whose number is given as string- |
extract a single substring, whose number is given as string- |
1296 |
number. A value of zero extracts the substring that matched |
number. A value of zero extracts the substring that matched |
1387 |
succeeds, they then call pcre_copy_substring() or |
succeeds, they then call pcre_copy_substring() or |
1388 |
pcre_get_substring(), as appropriate. |
pcre_get_substring(), as appropriate. |
1389 |
|
|
1390 |
Last updated: 03 February 2003 |
Last updated: 20 August 2003 |
1391 |
Copyright (c) 1997-2003 University of Cambridge. |
Copyright (c) 1997-2003 University of Cambridge. |
1392 |
----------------------------------------------------------------------------- |
----------------------------------------------------------------------------- |
1393 |
|
|
1455 |
The current_position field contains the offset within the |
The current_position field contains the offset within the |
1456 |
subject of the current match pointer. |
subject of the current match pointer. |
1457 |
|
|
1458 |
The capture_top field contains the number of the highest |
The capture_top field contains one more than the number of |
1459 |
captured substring so far. |
the highest numbered captured substring so far. If no sub- |
1460 |
|
strings have been captured, the value of capture_top is one. |
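
The revised wording makes capture_top an exclusive upper bound, so a loop over the groups captured so far runs from 1 up to, but not including, capture_top. The sketch below shows that loop. The function name count_set_groups is hypothetical, and it assumes the usual pair layout of the offsets vector, with unset groups marked by -1; neither detail is stated in the text above.

```c
/* Hypothetical helper: count the capturing groups that currently hold a
   value. `offsets` is assumed to use the usual pair layout (start and end
   offset of group i at indices 2*i and 2*i+1), with -1 marking an unset
   group. `capture_top` is one more than the highest group number captured
   so far, so it is 1 when nothing has been captured. */
static int count_set_groups(const int *offsets, int capture_top)
{
    int n = 0;
    int i;

    for (i = 1; i < capture_top; i++) {
        if (offsets[2 * i] >= 0)
            n++;
    }
    return n;
}
```

With capture_top equal to one the loop body never runs, which matches the "no substrings have been captured" case described above.
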
1461 |
|
|
1462 |
The capture_last field contains the number of the most |
The capture_last field contains the number of the most |
1463 |
recently captured substring. |
recently captured substring. |
3126 |
that is POSIX-like in style. The syntax and semantics of the |
that is POSIX-like in style. The syntax and semantics of the |
3127 |
regular expressions themselves are still those of Perl, sub- |
regular expressions themselves are still those of Perl, sub- |
3128 |
ject to the setting of various PCRE options, as described |
ject to the setting of various PCRE options, as described |
3129 |
below. |
below. "POSIX-like in style" means that the API approximates |
3130 |
|
to the POSIX definition; it is not fully POSIX-compatible, |
3131 |
|
and in multi-byte encoding domains it is probably even less |
3132 |
|
compatible. |
3133 |
|
|
3134 |
The header for these functions is supplied as pcreposix.h to |
The header for these functions is supplied as pcreposix.h to |
3135 |
avoid any potential clash with other POSIX libraries. It |
avoid any potential clash with other POSIX libraries. It |