Diff of /code/trunk/doc/pcre.html


revision 47 by nigel, Sat Feb 24 21:39:29 2007 UTC revision 49 by nigel, Sat Feb 24 21:39:33 2007 UTC
# Line 37  conversion went wrong. Line 37  conversion went wrong.
37  <LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A>  <LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A>
38  <LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>  <LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>
39  <LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>  <LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>
40  <LI><A NAME="TOC30" HREF="#SEC30">AUTHOR</A>  <LI><A NAME="TOC30" HREF="#SEC30">UTF-8 SUPPORT</A>
41    <LI><A NAME="TOC31" HREF="#SEC31">AUTHOR</A>
42  </UL>  </UL>
43  <LI><A NAME="SEC1" HREF="#TOC1">NAME</A>  <LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
44  <P>  <P>
# Line 76  pcre - Perl-compatible regular expressio Line 77  pcre - Perl-compatible regular expressio
77  <B>int *<I>ovector</I>, int <I>stringcount</I>, const char ***<I>listptr</I>);</B>  <B>int *<I>ovector</I>, int <I>stringcount</I>, const char ***<I>listptr</I>);</B>
78  </P>  </P>
79  <P>  <P>
80    <B>void pcre_free_substring(const char *<I>stringptr</I>);</B>
81    </P>
82    <P>
83    <B>void pcre_free_substring_list(const char **<I>stringptr</I>);</B>
84    </P>
85    <P>
86  <B>const unsigned char *pcre_maketables(void);</B>  <B>const unsigned char *pcre_maketables(void);</B>
87  </P>  </P>
88  <P>  <P>
# Line 100  pcre - Perl-compatible regular expressio Line 107  pcre - Perl-compatible regular expressio
107  The PCRE library is a set of functions that implement regular expression  The PCRE library is a set of functions that implement regular expression
108  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
109  differences (see below). The current implementation corresponds to Perl 5.005,  differences (see below). The current implementation corresponds to Perl 5.005,
110  with some additional features from the Perl development release.  with some additional features from later versions. This includes some
111    experimental, incomplete support for UTF-8 encoded strings. Details of exactly
112    what is and what is not supported are given below.
113  </P>  </P>
114  <P>  <P>
115  PCRE has its own native API, which is described in this document. There is also  PCRE has its own native API, which is described in this document. There is also
# Line 117  use these to include support for differe Line 126  use these to include support for differe
126  </P>  </P>
127  <P>  <P>
128  The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>  The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>
129  are used for compiling and matching regular expressions, while  are used for compiling and matching regular expressions.
130  <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and  </P>
131    <P>
132    The functions <B>pcre_copy_substring()</B>, <B>pcre_get_substring()</B>, and
133  <B>pcre_get_substring_list()</B> are convenience functions for extracting  <B>pcre_get_substring_list()</B> are convenience functions for extracting
134  captured substrings from a matched subject string. The function  captured substrings from a matched subject string; <B>pcre_free_substring()</B>
135  <B>pcre_maketables()</B> is used (optionally) to build a set of character tables  and <B>pcre_free_substring_list()</B> are also provided, to free the memory used
136  in the current locale for passing to <B>pcre_compile()</B>.  for extracted strings.
137    </P>
138    <P>
139    The function <B>pcre_maketables()</B> is used (optionally) to build a set of
140    character tables in the current locale for passing to <B>pcre_compile()</B>.
141  </P>  </P>
142  <P>  <P>
143  The function <B>pcre_fullinfo()</B> is used to find out information about a  The function <B>pcre_fullinfo()</B> is used to find out information about a
# Line 297  This option inverts the "greediness" of Line 312  This option inverts the "greediness" of
312  greedy by default, but become greedy if followed by "?". It is not compatible  greedy by default, but become greedy if followed by "?". It is not compatible
313  with Perl. It can also be set by a (?U) option setting within the pattern.  with Perl. It can also be set by a (?U) option setting within the pattern.
314  </P>  </P>
315    <P>
316    <PRE>
317      PCRE_UTF8
318    </PRE>
319    </P>
320    <P>
321    This option causes PCRE to regard both the pattern and the subject as strings
322    of UTF-8 characters instead of just byte strings. However, it is available only
323    if PCRE has been built to include UTF-8 support. If not, the use of this option
324    provokes an error. Support for UTF-8 is new, experimental, and incomplete.
325    Details of exactly what it entails are given below.
326    </P>
327  <LI><A NAME="SEC6" HREF="#TOC1">STUDYING A PATTERN</A>  <LI><A NAME="SEC6" HREF="#TOC1">STUDYING A PATTERN</A>
328  <P>  <P>
329  When a pattern is going to be used several times, it is worth spending more  When a pattern is going to be used several times, it is worth spending more
# Line 743  extract a single substring, whose number Line 770  extract a single substring, whose number
770  value of zero extracts the substring that matched the entire pattern, while  value of zero extracts the substring that matched the entire pattern, while
771  higher values extract the captured substrings. For <B>pcre_copy_substring()</B>,  higher values extract the captured substrings. For <B>pcre_copy_substring()</B>,
772  the string is placed in <I>buffer</I>, whose length is given by  the string is placed in <I>buffer</I>, whose length is given by
773  <I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of store is  <I>buffersize</I>, while for <B>pcre_get_substring()</B> a new block of memory is
774  obtained via <B>pcre_malloc</B>, and its address is returned via  obtained via <B>pcre_malloc</B>, and its address is returned via
775  <I>stringptr</I>. The yield of the function is the length of the string, not  <I>stringptr</I>. The yield of the function is the length of the string, not
776  including the terminating zero, or one of  including the terminating zero, or one of
# Line 789  string. This can be distinguished from a Line 816  string. This can be distinguished from a
816  inspecting the appropriate offset in <I>ovector</I>, which is negative for unset  inspecting the appropriate offset in <I>ovector</I>, which is negative for unset
817  substrings.  substrings.
818  </P>  </P>
819    <P>
820    The two convenience functions <B>pcre_free_substring()</B> and
821    <B>pcre_free_substring_list()</B> can be used to free the memory returned by
822    a previous call of <B>pcre_get_substring()</B> or
823    <B>pcre_get_substring_list()</B>, respectively. They do nothing more than call
824    the function pointed to by <B>pcre_free</B>, which of course could be called
825    directly from a C program. However, PCRE is used in some situations where it is
826    linked via a special interface to another programming language which cannot use
827    <B>pcre_free</B> directly; it is for these cases that the functions are
828    provided.
829    </P>
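The forwarding arrangement described in the paragraph above can be sketched as follows. This is a minimal illustration, not PCRE's source: every name beginning with demo_ is hypothetical, and demo_free merely stands in for the pcre_free pointer so that the sketch compiles without the library.

```c
#include <stdlib.h>

/* Stand-in for a replaceable deallocator pointer in the style of pcre_free.
   All demo_ names are assumptions made for this sketch. */
void (*demo_free)(void *) = free;

/* Wrapper in the style described above: it does nothing more than call
   whatever function demo_free currently points to. */
void demo_free_substring(const char *stringptr)
{
    (*demo_free)((void *)stringptr);
}

/* A replacement deallocator that counts calls, showing that the wrapper
   follows the pointer rather than calling free() directly. */
int demo_free_calls = 0;

void demo_counting_free(void *block)
{
    demo_free_calls++;
    free(block);
}
```

A host that reaches the library through a foreign-function interface can call the wrapper by name even when it has no way to read a C function-pointer variable, which is the situation the paragraph describes.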
830  <LI><A NAME="SEC11" HREF="#TOC1">LIMITATIONS</A>  <LI><A NAME="SEC11" HREF="#TOC1">LIMITATIONS</A>
831  <P>  <P>
832  There are some size limitations in PCRE but it is hoped that they will never in  There are some size limitations in PCRE but it is hoped that they will never in
# Line 908  The syntax and semantics of the regular Line 946  The syntax and semantics of the regular
946  described below. Regular expressions are also described in the Perl  described below. Regular expressions are also described in the Perl
947  documentation and in a number of other books, some of which have copious  documentation and in a number of other books, some of which have copious
948  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
949  O'Reilly (ISBN 1-56592-257), covers them in great detail. The description  O'Reilly (ISBN 1-56592-257), covers them in great detail.
950  here is intended as reference documentation.  </P>
951    <P>
952    The description here is intended as reference documentation. The basic
953    operation of PCRE is on strings of bytes. However, there is the beginnings of
954    some support for UTF-8 character strings. To use this support you must
955    configure PCRE to include it, and then call <B>pcre_compile()</B> with the
956    PCRE_UTF8 option. How this affects the pattern matching is described in the
957    final section of this document.
958  </P>  </P>
959  <P>  <P>
960  A regular expression is a pattern that is matched against a subject string from  A regular expression is a pattern that is matched against a subject string from
# Line 1718  example, the pattern Line 1763  example, the pattern
1763  </PRE>  </PRE>
1764  </P>  </P>
1765  <P>  <P>
1766  matches any number of "a"s and also "aba", "ababaa" etc. At each iteration of  matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
1767  the subpattern, the back reference matches the character string corresponding  the subpattern, the back reference matches the character string corresponding
1768  to the previous iteration. In order for this to work, the pattern must be such  to the previous iteration. In order for this to work, the pattern must be such
1769  that the first iteration does not need to match the back reference. This can be  that the first iteration does not need to match the back reference. This can be
# Line 2240  with the pattern above. The former gives Line 2285  with the pattern above. The former gives
2285  applied to a whole line of "a" characters, whereas the latter takes an  applied to a whole line of "a" characters, whereas the latter takes an
2286  appreciable time with strings longer than about 20 characters.  appreciable time with strings longer than about 20 characters.
2287  </P>  </P>
2288  <LI><A NAME="SEC30" HREF="#TOC1">AUTHOR</A>  <LI><A NAME="SEC30" HREF="#TOC1">UTF-8 SUPPORT</A>
2289    <P>
2290    Starting at release 3.3, PCRE has some support for character strings encoded
2291    in the UTF-8 format. This is incomplete, and is regarded as experimental. In
2292    order to use it, you must configure PCRE to include UTF-8 support in the code,
2293    and, in addition, you must call <B>pcre_compile()</B> with the PCRE_UTF8 option
2294    flag. When you do this, both the pattern and any subject strings that are
2295    matched against it are treated as UTF-8 strings instead of just strings of
2296    bytes, but only in the cases that are mentioned below.
2297    </P>
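The difference between treating a subject as bytes and treating it as UTF-8 characters can be sketched with a small counting routine. This is an illustration only, not code from PCRE; the function name utf8_char_count is hypothetical, and the input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/* Count the characters in a UTF-8 string of len bytes.
   Continuation bytes have the form 10xxxxxx, so every byte that does
   not match that form starts a new character. */
size_t utf8_char_count(const char *s, size_t len)
{
    size_t i, count = 0;
    for (i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the four-byte string consisting of "a", the two-byte encoding of character 0x100, and "b", the routine reports three characters.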
2298    <P>
2299    If you compile PCRE with UTF-8 support, but do not use it at run time, the
2300    library will be a bit bigger, but the additional run time overhead is limited
2301    to testing the PCRE_UTF8 flag in several places, so should not be very large.
2302    </P>
2303    <P>
2304    PCRE assumes that the strings it is given contain valid UTF-8 codes. It does
2305    not diagnose invalid UTF-8 strings. If you pass invalid UTF-8 strings to PCRE,
2306    the results are undefined.
2307    </P>
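Because the library does not diagnose invalid input, a caller may want to check a string before passing it on. The sketch below is one possible caller-side check, not part of PCRE; the function names are hypothetical. It is deliberately structural only: it accepts the one-to-six-byte forms mentioned in this document and does not reject overlong encodings or specific code point ranges.

```c
#include <stddef.h>

/* Length of the sequence introduced by lead byte c, or 0 if c cannot
   start a sequence (a continuation byte, or 0xFE/0xFF). */
static int utf8_seq_len(unsigned char c)
{
    if (c < 0x80) return 1;              /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2;    /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3;    /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4;    /* 11110xxx */
    if ((c & 0xFC) == 0xF8) return 5;    /* 111110xx */
    if ((c & 0xFE) == 0xFC) return 6;    /* 1111110x */
    return 0;
}

/* Return 1 if s[0..len-1] is structurally well-formed UTF-8, else 0. */
int utf8_is_well_formed(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    while (i < len) {
        int n = utf8_seq_len(p[i]);
        int k;

        /* Reject a bad lead byte or a sequence cut off by the end. */
        if (n == 0 || (size_t)n > len - i)
            return 0;

        /* Every following byte must be a continuation byte. */
        for (k = 1; k < n; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;
        }
        i += (size_t)n;
    }
    return 1;
}
```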
2308    <P>
2309    Running with PCRE_UTF8 set causes these changes in the way PCRE works:
2310    </P>
2311    <P>
2312    1. In a pattern, the escape sequence \x{...}, where the contents of the braces
2313    is a string of hexadecimal digits, is interpreted as a UTF-8 character whose
2314    code number is the given hexadecimal number, for example: \x{1234}. This
2315    inserts from one to six literal bytes into the pattern, using the UTF-8
2316    encoding. If a non-hexadecimal digit appears between the braces, the item is
2317    not recognized.
2318    </P>
2319    <P>
2320    2. The original hexadecimal escape sequence, \xhh, generates a two-byte UTF-8
2321    character if its value is greater than 127.
2322    </P>
2323    <P>
2324    3. Repeat quantifiers are NOT correctly handled if they follow a multibyte
2325    character. For example, \x{100}* and \xc3+ do not work. If you want to
2326    repeat such characters, you must enclose them in non-capturing parentheses,
2327    for example (?:\x{100}), at present.
2328    </P>
2329    <P>
2330    4. The dot metacharacter matches one UTF-8 character instead of a single byte.
2331    </P>
2332    <P>
2333    5. Unlike literal UTF-8 characters, the dot metacharacter followed by a
2334    repeat quantifier does operate correctly on UTF-8 characters instead of
2335    single bytes.
2336    </P>
2337    <P>
2338    4. Although the \x{...} escape is permitted in a character class, characters
2339    whose values are greater than 255 cannot be included in a class.
2340    </P>
2341    <P>
2342    5. A class is matched against a UTF-8 character instead of just a single byte,
2343    but it can match only characters whose values are less than 256. Characters
2344    with greater values always fail to match a class.
2345    </P>
2346    <P>
2347    6. Repeated classes work correctly on multiple characters.
2348    </P>
2349    <P>
2350    7. Classes containing just a single character whose value is greater than 127
2351    (but less than 256), for example, [\x80] or [^\x{93}], do not work because
2352    these are optimized into single byte matches. In the first case, of course,
2353    the class brackets are just redundant.
2354    </P>
2355    <P>
2356    8. Lookbehind assertions move backwards in the subject by a fixed number of
2357    characters instead of a fixed number of bytes. Simple cases have been tested
2358    to work correctly, but there may be hidden gotchas herein.
2359    </P>
2360    <P>
2361    9. The character types such as \d and \w do not work correctly with UTF-8
2362    characters. They continue to test a single byte.
2363    </P>
2364    <P>
2365    10. Anything not explicitly mentioned here continues to work in bytes rather
2366    than in characters.
2367    </P>
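Items 1 and 2 of the list above say that \x{...} inserts from one to six literal bytes and that \xhh with a value above 127 becomes a two-byte character. The sketch below shows how a code number maps to those bytes under the original one-to-six-byte UTF-8 scheme. It is an illustration under that assumption, not PCRE's own routine, and the name utf8_encode is hypothetical.

```c
/* Encode code point cp as UTF-8 into buf, which must hold at least
   6 bytes. Returns the number of bytes written, or 0 if cp is too
   large for the six-byte form. */
int utf8_encode(unsigned long cp, unsigned char *buf)
{
    /* Largest code point that fits in 1, 2, ... 6 bytes. */
    static const unsigned long limit[6] = {
        0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
    };
    /* Marker bits of the lead byte for each length. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    int n, i;

    /* n becomes the number of continuation bytes needed. */
    for (n = 0; n < 6; n++) {
        if (cp <= limit[n])
            break;
    }
    if (n == 6)
        return 0;

    /* Fill continuation bytes from the end, six bits at a time. */
    for (i = n; i > 0; i--) {
        buf[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    /* The remaining high bits go into the lead byte. */
    buf[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

With this mapping, code number 0x1234 becomes the three bytes E1 88 B4, 0x100 becomes C4 80, and 0xC3 becomes the two bytes C3 83.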
2368    <P>
2369    The following UTF-8 features of Perl 5.6 are not implemented:
2370    </P>
2371    <P>
2372    1. The escape sequence \C to match a single byte.
2373    </P>
2374    <P>
2375    2. The use of Unicode tables and properties and escapes \p, \P, and \X.
2376    </P>
2377    <LI><A NAME="SEC31" HREF="#TOC1">AUTHOR</A>
2378  <P>  <P>
2379  Philip Hazel &#60;ph10@cam.ac.uk&#62;  Philip Hazel &#60;ph10@cam.ac.uk&#62;
2380  <BR>  <BR>
# Line 2253  Cambridge CB2 3QG, England. Line 2387  Cambridge CB2 3QG, England.
2387  Phone: +44 1223 334714  Phone: +44 1223 334714
2388  </P>  </P>
2389  <P>  <P>
2390  Last updated: 27 January 2000  Last updated: 28 August 2000,
2391    <BR>
2392      the 250th anniversary of the death of J.S. Bach.
2393  <BR>  <BR>
2394  Copyright (c) 1997-2000 University of Cambridge.  Copyright (c) 1997-2000 University of Cambridge.
