37 |
<LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A> |
<LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A> |
38 |
<LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A> |
<LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A> |
39 |
<LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A> |
<LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A> |
40 |
<LI><A NAME="TOC30" HREF="#SEC30">AUTHOR</A> |
<LI><A NAME="TOC30" HREF="#SEC30">UTF-8 SUPPORT</A> |
41 |
|
<LI><A NAME="TOC31" HREF="#SEC31">AUTHOR</A> |
42 |
</UL> |
</UL> |
43 |
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A> |
<LI><A NAME="SEC1" HREF="#TOC1">NAME</A> |
44 |
<P> |
<P> |
77 |
<B>int *<I>ovector</I>, int <I>stringcount</I>, const char ***<I>listptr</I>);</B> |
<B>int *<I>ovector</I>, int <I>stringcount</I>, const char ***<I>listptr</I>);</B> |
78 |
</P> |
</P> |
79 |
<P> |
<P> |
80 |
|
<B>void pcre_free_substring(const char *<I>stringptr</I>);</B> |
81 |
|
</P> |
82 |
|
<P> |
83 |
|
<B>void pcre_free_substring_list(const char **<I>stringptr</I>);</B> |
84 |
|
</P> |
85 |
|
<P> |
86 |
<B>const unsigned char *pcre_maketables(void);</B> |
<B>const unsigned char *pcre_maketables(void);</B> |
87 |
</P> |
</P> |
88 |
<P> |
<P> |
107 |
The PCRE library is a set of functions that implement regular expression |
The PCRE library is a set of functions that implement regular expression |
108 |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
pattern matching using the same syntax and semantics as Perl 5, with just a few |
109 |
differences (see below). The current implementation corresponds to Perl 5.005, |
differences (see below). The current implementation corresponds to Perl 5.005, |
110 |
with some additional features from the Perl development release. |
with some additional features from later versions. This includes some |
111 |
|
experimental, incomplete support for UTF-8 encoded strings. Details of exactly |
112 |
|
what is and what is not supported are given below. |
113 |
</P> |
</P> |
114 |
<P> |
<P> |
115 |
PCRE has its own native API, which is described in this document. There is also |
PCRE has its own native API, which is described in this document. There is also |
126 |
</P> |
</P> |
127 |
<P> |
<P> |
128 |
The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B> |
The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B> |
129 |
are used for compiling and matching regular expressions, while |
are used for compiling and matching regular expressions. |
130 |
<B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and |
</P> |
131 |
|
<P> |
132 |
|
The functions <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and |
133 |
<B>pcre_get_substring_list()</B> are convenience functions for extracting |
<B>pcre_get_substring_list()</B> are convenience functions for extracting |
134 |
captured substrings from a matched subject string. The function |
captured substrings from a matched subject string; <B>pcre_free_substring()</B> |
135 |
<B>pcre_maketables()</B> is used (optionally) to build a set of character tables |
and <B>pcre_free_substring_list()</B> are also provided, to free the memory used |
136 |
in the current locale for passing to <B>pcre_compile()</B>. |
for extracted strings. |
137 |
|
</P> |
138 |
|
<P> |
139 |
|
The function <B>pcre_maketables()</B> is used (optionally) to build a set of |
140 |
|
character tables in the current locale for passing to <B>pcre_compile()</B>. |
141 |
</P> |
</P> |
142 |
<P> |
<P> |
143 |
The function <B>pcre_fullinfo()</B> is used to find out information about a |
The function <B>pcre_fullinfo()</B> is used to find out information about a |
312 |
greedy by default, but become greedy if followed by "?". It is not compatible |
greedy by default, but become greedy if followed by "?". It is not compatible |
313 |
with Perl. It can also be set by a (?U) option setting within the pattern. |
with Perl. It can also be set by a (?U) option setting within the pattern. |
314 |
</P> |
</P> |
315 |
|
<P> |
316 |
|
<PRE> |
317 |
|
PCRE_UTF8 |
318 |
|
</PRE> |
319 |
|
</P> |
320 |
|
<P> |
321 |
|
This option causes PCRE to regard both the pattern and the subject as strings |
322 |
|
of UTF-8 characters instead of just byte strings. However, it is available only |
323 |
|
if PCRE has been built to include UTF-8 support. If not, the use of this option |
324 |
|
provokes an error. Support for UTF-8 is new, experimental, and incomplete. |
325 |
|
Details of exactly what it entails are given below. |
326 |
|
</P> |
327 |
<LI><A NAME="SEC6" HREF="#TOC1">STUDYING A PATTERN</A> |
<LI><A NAME="SEC6" HREF="#TOC1">STUDYING A PATTERN</A> |
328 |
<P> |
<P> |
329 |
When a pattern is going to be used several times, it is worth spending more |
When a pattern is going to be used several times, it is worth spending more |
770 |
value of zero extracts the substring that matched the entire pattern, while |
value of zero extracts the substring that matched the entire pattern, while |
771 |
higher values extract the captured substrings. For <B>pcre_copy_substring()</B>, |
higher values extract the captured substrings. For <B>pcre_copy_substring()</B>, |
772 |
the string is placed in <I>buffer</I>, whose length is given by |
the string is placed in <I>buffer</I>, whose length is given by |
773 |
<I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of store is |
<I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of memory is |
774 |
obtained via <B>pcre_malloc</B>, and its address is returned via |
obtained via <B>pcre_malloc</B>, and its address is returned via |
775 |
<I>stringptr</I>. The yield of the function is the length of the string, not |
<I>stringptr</I>. The yield of the function is the length of the string, not |
776 |
including the terminating zero, or one of |
including the terminating zero, or one of |
816 |
inspecting the appropriate offset in <I>ovector</I>, which is negative for unset |
inspecting the appropriate offset in <I>ovector</I>, which is negative for unset |
817 |
substrings. |
substrings. |
818 |
</P> |
</P> |
819 |
|
<P> |
820 |
|
The two convenience functions <B>pcre_free_substring()</B> and |
821 |
|
<B>pcre_free_substring_list()</B> can be used to free the memory returned by |
822 |
|
a previous call of <B>pcre_get_substring()</B> or |
823 |
|
<B>pcre_get_substring_list()</B>, respectively. They do nothing more than call |
824 |
|
the function pointed to by <B>pcre_free</B>, which of course could be called |
825 |
|
directly from a C program. However, PCRE is used in some situations where it is |
826 |
|
linked via a special interface to another programming language which cannot use |
827 |
|
<B>pcre_free</B> directly; it is for these cases that the functions are |
828 |
|
provided. |
829 |
|
</P> |
830 |
<LI><A NAME="SEC11" HREF="#TOC1">LIMITATIONS</A> |
<LI><A NAME="SEC11" HREF="#TOC1">LIMITATIONS</A> |
831 |
<P> |
<P> |
832 |
There are some size limitations in PCRE but it is hoped that they will never in |
There are some size limitations in PCRE but it is hoped that they will never in |
946 |
described below. Regular expressions are also described in the Perl |
described below. Regular expressions are also described in the Perl |
947 |
documentation and in a number of other books, some of which have copious |
documentation and in a number of other books, some of which have copious |
948 |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
examples. Jeffrey Friedl's "Mastering Regular Expressions", published by |
949 |
O'Reilly (ISBN 1-56592-257), covers them in great detail. The description |
O'Reilly (ISBN 1-56592-257), covers them in great detail. |
950 |
here is intended as reference documentation. |
</P> |
951 |
|
<P> |
952 |
|
The description here is intended as reference documentation. The basic |
953 |
|
operation of PCRE is on strings of bytes. However, there are the beginnings of |
954 |
|
some support for UTF-8 character strings. To use this support you must |
955 |
|
configure PCRE to include it, and then call <B>pcre_compile()</B> with the |
956 |
|
PCRE_UTF8 option. How this affects the pattern matching is described in the |
957 |
|
final section of this document. |
958 |
</P> |
</P> |
959 |
<P> |
<P> |
960 |
A regular expression is a pattern that is matched against a subject string from |
A regular expression is a pattern that is matched against a subject string from |
1763 |
</PRE> |
</PRE> |
1764 |
</P> |
</P> |
1765 |
<P> |
<P> |
1766 |
matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of |
matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of |
1767 |
the subpattern, the back reference matches the character string corresponding |
the subpattern, the back reference matches the character string corresponding |
1768 |
to the previous iteration. In order for this to work, the pattern must be such |
to the previous iteration. In order for this to work, the pattern must be such |
1769 |
that the first iteration does not need to match the back reference. This can be |
that the first iteration does not need to match the back reference. This can be |
2285 |
applied to a whole line of "a" characters, whereas the latter takes an |
applied to a whole line of "a" characters, whereas the latter takes an |
2286 |
appreciable time with strings longer than about 20 characters. |
appreciable time with strings longer than about 20 characters. |
2287 |
</P> |
</P> |
2288 |
<LI><A NAME="SEC30" HREF="#TOC1">AUTHOR</A> |
<LI><A NAME="SEC30" HREF="#TOC1">UTF-8 SUPPORT</A> |
2289 |
|
<P> |
2290 |
|
Starting at release 3.3, PCRE has some support for character strings encoded |
2291 |
|
in the UTF-8 format. This is incomplete, and is regarded as experimental. In |
2292 |
|
order to use it, you must configure PCRE to include UTF-8 support in the code, |
2293 |
|
and, in addition, you must call <B>pcre_compile()</B> with the PCRE_UTF8 option |
2294 |
|
flag. When you do this, both the pattern and any subject strings that are |
2295 |
|
matched against it are treated as UTF-8 strings instead of just strings of |
2296 |
|
bytes, but only in the cases that are mentioned below. |
2297 |
|
</P> |
2298 |
|
<P> |
2299 |
|
If you compile PCRE with UTF-8 support, but do not use it at run time, the |
2300 |
|
library will be a bit bigger, but the additional run time overhead is limited |
2301 |
|
to testing the PCRE_UTF8 flag in several places, so should not be very large. |
2302 |
|
</P> |
2303 |
|
<P> |
2304 |
|
PCRE assumes that the strings it is given contain valid UTF-8 codes. It does |
2305 |
|
not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE, |
2306 |
|
the results are undefined. |
2307 |
|
</P> |
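<P>
Because PCRE does not diagnose invalid UTF-8, a caller may wish to check a
string before passing it in. The following is a minimal sketch of such a check,
not part of PCRE: the function names <B>utf8_sequence_length()</B> and
<B>utf8_is_well_formed()</B> are hypothetical. It assumes the original
one-to-six byte UTF-8 scheme and tests structure only (a lead byte followed by
the right number of continuation bytes); it does not reject overlong encodings.
</P>

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Returns the number of bytes in the UTF-8 sequence introduced by lead
   byte c, or 0 if c cannot start a sequence. */
static int utf8_sequence_length(unsigned char c)
{
    if (c < 0x80) return 1;   /* plain ASCII */
    if (c < 0xC0) return 0;   /* continuation byte, not a lead byte */
    if (c < 0xE0) return 2;
    if (c < 0xF0) return 3;
    if (c < 0xF8) return 4;
    if (c < 0xFC) return 5;
    if (c < 0xFE) return 6;
    return 0;                 /* 0xFE and 0xFF never appear in UTF-8 */
}

/* Hypothetical helper, not a PCRE function.
   Returns 1 if the first 'length' bytes of s form structurally valid
   UTF-8 sequences, 0 otherwise. Overlong forms are NOT detected. */
int utf8_is_well_formed(const char *s, size_t length)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    while (i < length) {
        int n = utf8_sequence_length(p[i]);
        int k;

        /* Reject bad lead bytes and sequences cut off by the end. */
        if (n == 0 || (size_t)n > length - i) return 0;

        /* Every following byte must have the form 10xxxxxx. */
        for (k = 1; k < n; k++)
            if ((p[i + k] & 0xC0) != 0x80) return 0;

        i += (size_t)n;
    }
    return 1;
}
```

<P>
A caller would run the check on both the pattern and the subject and pass only
strings that succeed to the matching functions.
</P>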
2308 |
|
<P> |
2309 |
|
Running with PCRE_UTF8 set causes these changes in the way PCRE works: |
2310 |
|
</P> |
2311 |
|
<P> |
2312 |
|
1. In a pattern, the escape sequence \x{...}, where the contents of the braces |
2313 |
|
are a string of hexadecimal digits, is interpreted as a UTF-8 character whose |
2314 |
|
code number is the given hexadecimal number, for example: \x{1234}. This |
2315 |
|
inserts from one to six literal bytes into the pattern, using the UTF-8 |
2316 |
|
encoding. If a non-hexadecimal digit appears between the braces, the item is |
2317 |
|
not recognized. |
2318 |
|
</P> |
2319 |
|
<P> |
2320 |
|
2. The original hexadecimal escape sequence, \xhh, generates a two-byte UTF-8 |
2321 |
|
character if its value is greater than 127. |
2322 |
|
</P> |
2323 |
|
<P> |
2324 |
|
3. Repeat quantifiers are NOT correctly handled if they follow a multibyte |
2325 |
|
character. For example, \x{100}* and \xc3+ do not work. If you want to |
2326 |
|
repeat such characters, you must enclose them in non-capturing parentheses, |
2327 |
|
for example (?:\x{100}), at present. |
2328 |
|
</P> |
2329 |
|
<P> |
2330 |
|
4. The dot metacharacter matches one UTF-8 character instead of a single byte. |
2331 |
|
</P> |
2332 |
|
<P> |
2333 |
|
5. Unlike literal UTF-8 characters, the dot metacharacter followed by a |
2334 |
|
repeat quantifier does operate correctly on UTF-8 characters instead of |
2335 |
|
single bytes. |
2336 |
|
</P> |
2337 |
|
<P> |
2338 |
|
6. Although the \x{...} escape is permitted in a character class, characters |
2339 |
|
whose values are greater than 255 cannot be included in a class. |
2340 |
|
</P> |
2341 |
|
<P> |
2342 |
|
7. A class is matched against a UTF-8 character instead of just a single byte, |
2343 |
|
but it can match only characters whose values are less than 256. Characters |
2344 |
|
with greater values always fail to match a class. |
2345 |
|
</P> |
2346 |
|
<P> |
2347 |
|
8. Repeated classes work correctly on multiple characters. |
2348 |
|
</P> |
2349 |
|
<P> |
2350 |
|
9. Classes containing just a single character whose value is greater than 127 |
2351 |
|
(but less than 256), for example, [\x80] or [^\x{93}], do not work because |
2352 |
|
these are optimized into single byte matches. In the first case, of course, |
2353 |
|
the class brackets are just redundant. |
2354 |
|
</P> |
2355 |
|
<P> |
2356 |
|
10. Lookbehind assertions move backwards in the subject by a fixed number of |
2357 |
|
characters instead of a fixed number of bytes. Simple cases have been tested |
2358 |
|
to work correctly, but there may be hidden gotchas herein. |
2359 |
|
</P> |
2360 |
|
<P> |
2361 |
|
11. The character types such as \d and \w do not work correctly with UTF-8 |
2362 |
|
characters. They continue to test a single byte. |
2363 |
|
</P> |
2364 |
|
<P> |
2365 |
|
12. Anything not explicitly mentioned here continues to work in bytes rather |
2366 |
|
than in characters. |
2367 |
|
</P> |
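<P>
The list above says that \x{...} inserts from one to six literal bytes into the
pattern, and that \xhh with a value greater than 127 becomes a two-byte
character. The sketch below illustrates that byte layout. It is an
illustration of the UTF-8 encoding itself, not PCRE's own code, and the name
<B>utf8_encode()</B> is hypothetical.
</P>

```c
#include <stddef.h>

/* Hypothetical helper, not a PCRE function.
   Encodes code point cp as UTF-8 using the original one-to-six byte
   scheme. buf must have room for at least 6 bytes. Returns the number
   of bytes written, or 0 if cp is above 0x7FFFFFFF. */
int utf8_encode(unsigned long cp, unsigned char *buf)
{
    /* Lead-byte prefix and largest code point for each sequence length. */
    static const unsigned char lead[] =
        { 0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC };
    static const unsigned long limit[] =
        { 0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL };
    int n, i;

    /* n becomes the number of continuation bytes needed (0 to 5). */
    for (n = 0; n < 6; n++)
        if (cp <= limit[n]) break;
    if (n == 6) return 0;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = n; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }

    /* The remaining high bits go into the lead byte. */
    buf[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

<P>
With this layout the code point 0x1234 is written as the three bytes E1 88 B4,
and the value 0xC3 as the two bytes C3 83, which is why a quantifier that
applies only to the last byte does not repeat the whole character.
</P>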
2368 |
|
<P> |
2369 |
|
The following UTF-8 features of Perl 5.6 are not implemented: |
2370 |
|
</P> |
2371 |
|
<P> |
2372 |
|
1. The escape sequence \C to match a single byte. |
2373 |
|
</P> |
2374 |
|
<P> |
2375 |
|
2. The use of Unicode tables and properties and escapes \p, \P, and \X. |
2376 |
|
</P> |
2377 |
|
<LI><A NAME="SEC31" HREF="#TOC1">AUTHOR</A> |
2378 |
<P> |
<P> |
2379 |
Philip Hazel <ph10@cam.ac.uk> |
Philip Hazel <ph10@cam.ac.uk> |
2380 |
<BR> |
<BR> |
2387 |
Phone: +44 1223 334714 |
Phone: +44 1223 334714 |
2388 |
</P> |
</P> |
2389 |
<P> |
<P> |
2390 |
Last updated: 27 January 2000 |
Last updated: 28 August 2000, |
2391 |
|
<BR> |
2392 |
|
the 250th anniversary of the death of J.S. Bach. |
2393 |
<BR> |
<BR> |
2394 |
Copyright (c) 1997-2000 University of Cambridge. |
Copyright (c) 1997-2000 University of Cambridge. |