Diff of /code/trunk/doc/pcre.txt

revision 48 by nigel, Sat Feb 24 21:39:29 2007 UTC revision 49 by nigel, Sat Feb 24 21:39:33 2007 UTC
# Line 28  SYNOPSIS Line 28  SYNOPSIS
28       int pcre_get_substring_list(const char *subject,       int pcre_get_substring_list(const char *subject,
29            int *ovector, int stringcount, const char ***listptr);            int *ovector, int stringcount, const char ***listptr);
30    
31         void pcre_free_substring(const char *stringptr);
32    
33         void pcre_free_substring_list(const char **stringptr);
34    
35       const unsigned char *pcre_maketables(void);       const unsigned char *pcre_maketables(void);
36    
37       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
# Line 48  DESCRIPTION Line 52  DESCRIPTION
52       The PCRE library is a set of functions that implement  regu-       The PCRE library is a set of functions that implement  regu-
53       lar  expression  pattern  matching using the same syntax and       lar  expression  pattern  matching using the same syntax and
54       semantics as Perl  5,  with  just  a  few  differences  (see       semantics as Perl  5,  with  just  a  few  differences  (see
55    
56       below).  The  current  implementation  corresponds  to  Perl       below).  The  current  implementation  corresponds  to  Perl
57       5.005, with some additional features from the Perl  develop-       5.005, with some additional features  from  later  versions.
58       ment release.       This  includes  some  experimental,  incomplete  support for
59         UTF-8 encoded strings. Details of exactly what is  and  what
60         is not supported are given below.
61    
62       PCRE has its own native API,  which  is  described  in  this       PCRE has its own native API,  which  is  described  in  this
63       document.  There  is  also  a  set of wrapper functions that       document.  There  is  also  a  set of wrapper functions that
# Line 67  DESCRIPTION Line 74  DESCRIPTION
74       releases.       releases.
75    
76       The functions pcre_compile(), pcre_study(), and  pcre_exec()       The functions pcre_compile(), pcre_study(), and  pcre_exec()
77       are  used  for  compiling  and matching regular expressions,       are used for compiling and matching regular expressions.
78       while   pcre_copy_substring(),   pcre_get_substring(),   and  
79       pcre_get_substring_list()   are  convenience  functions  for       The functions  pcre_copy_substring(),  pcre_get_substring(),
80         and  pcre_get_substring_list() are convenience functions for
81       extracting  captured  substrings  from  a  matched   subject       extracting  captured  substrings  from  a  matched   subject
82       string.  The function pcre_maketables() is used (optionally)       string; pcre_free_substring() and pcre_free_substring_list()
83       to build a set of character tables in the current locale for       are also provided, to free the  memory  used  for  extracted
84       passing to pcre_compile().       strings.
85    
86         The function pcre_maketables() is used (optionally) to build
87         a  set of character tables in the current locale for passing
88         to pcre_compile().
89    
90       The function pcre_fullinfo() is used to find out information       The function pcre_fullinfo() is used to find out information
91       about a compiled pattern; pcre_info() is an obsolete version       about a compiled pattern; pcre_info() is an obsolete version
# Line 92  DESCRIPTION Line 104  DESCRIPTION
104    
105    
106  MULTI-THREADING  MULTI-THREADING
107       The PCRE functions can be used in  multi-threading  applica-       The  PCRE  functions  can   be   used   in   multi-threading
108       tions, with the proviso that the memory management functions  
109       pointed to by pcre_malloc and pcre_free are  shared  by  all  
110       threads.  
111    
112    
113    SunOS 5.8                 Last change:                          2
114    
115    
116    
117         applications,  with  the  proviso that the memory management
118         functions pointed to by pcre_malloc and pcre_free are shared
119         by all threads.
120    
121       The compiled form of a regular  expression  is  not  altered       The compiled form of a regular  expression  is  not  altered
122       during  matching, so the same compiled pattern can safely be       during  matching, so the same compiled pattern can safely be
# Line 103  MULTI-THREADING Line 124  MULTI-THREADING
124    
125    
126    
   
127  COMPILING A PATTERN  COMPILING A PATTERN
128       The function pcre_compile() is called to compile  a  pattern       The function pcre_compile() is called to compile  a  pattern
129       into  an internal form. The pattern is a C string terminated       into  an internal form. The pattern is a C string terminated
# Line 235  COMPILING A PATTERN Line 255  COMPILING A PATTERN
255       followed by "?". It is not compatible with Perl. It can also       followed by "?". It is not compatible with Perl. It can also
256       be set by a (?U) option setting within the pattern.       be set by a (?U) option setting within the pattern.
257    
258           PCRE_UTF8
259    
260         This option causes PCRE to regard both the pattern  and  the
261         subject  as strings of UTF-8 characters instead of just byte
262         strings. However, it is available  only  if  PCRE  has  been
263         built  to  include  UTF-8  support.  If not, the use of this
264         option provokes an error. Support for UTF-8 is new,  experi-
265         mental,  and incomplete.  Details of exactly what it entails
266         are given below.
267    
268    
269    
270  STUDYING A PATTERN  STUDYING A PATTERN
271       When a pattern is going to be  used  several  times,  it  is       When a pattern is going to be  used  several  times,  it  is
272       worth  spending  more time analyzing it in order to speed up       worth  spending  more time analyzing it in order to speed up
273       the time taken for matching. The function pcre_study() takes       the time taken for matching. The function pcre_study() takes
274    
275       a  pointer  to a compiled pattern as its first argument, and       a  pointer  to a compiled pattern as its first argument, and
276       returns a  pointer  to  a  pcre_extra  block  (another  void       returns a  pointer  to  a  pcre_extra  block  (another  void
277       typedef)  containing  additional  information about the pat-       typedef)  containing  additional  information about the pat-
# Line 344  INFORMATION ABOUT A PATTERN Line 375  INFORMATION ABOUT A PATTERN
375    
376         PCRE_INFO_BACKREFMAX         PCRE_INFO_BACKREFMAX
377    
378       Return the number of the highest back reference in the  pat-       Return the number of  the  highest  back  reference  in  the
379       tern.  The  fourth argument should point to an int variable.       pattern.  The  fourth  argument should point to an int vari-
380       Zero is returned if there are no back references.       able. Zero is returned if there are no back references.
381    
382         PCRE_INFO_FIRSTCHAR         PCRE_INFO_FIRSTCHAR
383    
# Line 605  MATCHING A PATTERN Line 636  MATCHING A PATTERN
636    
637  EXTRACTING CAPTURED SUBSTRINGS  EXTRACTING CAPTURED SUBSTRINGS
638       Captured substrings can be accessed directly  by  using  the       Captured substrings can be accessed directly  by  using  the
648       offsets returned by pcre_exec() in ovector. For convenience,       offsets returned by pcre_exec() in ovector. For convenience,
649       the functions  pcre_copy_substring(),  pcre_get_substring(),       the functions  pcre_copy_substring(),  pcre_get_substring(),
650       and  pcre_get_substring_list()  are  provided for extracting       and  pcre_get_substring_list()  are  provided for extracting
# Line 631  EXTRACTING CAPTURED SUBSTRINGS Line 671  EXTRACTING CAPTURED SUBSTRINGS
671       the entire pattern, while higher values extract the captured       the entire pattern, while higher values extract the captured
672       substrings. For pcre_copy_substring(), the string is  placed       substrings. For pcre_copy_substring(), the string is  placed
673       in  buffer,  whose  length is given by buffersize, while for       in  buffer,  whose  length is given by buffersize, while for
674       pcre_get_substring() a new block of store  is  obtained  via       pcre_get_substring() a new block of memory is  obtained  via
675       pcre_malloc,  and its address is returned via stringptr. The       pcre_malloc,  and its address is returned via stringptr. The
676       yield of the function is  the  length  of  the  string,  not       yield of the function is  the  length  of  the  string,  not
677       including the terminating zero, or one of       including the terminating zero, or one of
# Line 665  EXTRACTING CAPTURED SUBSTRINGS Line 705  EXTRACTING CAPTURED SUBSTRINGS
705       inspecting the appropriate offset in ovector, which is nega-       inspecting the appropriate offset in ovector, which is nega-
706       tive for unset substrings.       tive for unset substrings.
707    
708         The  two  convenience  functions  pcre_free_substring()  and
709         pcre_free_substring_list()  can  be  used to free the memory
710         returned by  a  previous  call  of  pcre_get_substring()  or
711         pcre_get_substring_list(),  respectively.  They  do  nothing
712         more than call the function pointed to by  pcre_free,  which
713         of  course  could  be called directly from a C program. How-
714         ever, PCRE is used in some situations where it is linked via
715         a  special  interface  to another programming language which
716         cannot use pcre_free directly; it is for  these  cases  that
717         the functions are provided.
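
The forwarding behaviour described above can be sketched without linking against PCRE at all. Everything in this block is hypothetical: demo_free stands in for the library's replaceable pcre_free pointer, and demo_free_substring() and demo_free_substring_list() stand in for the two new functions. It is only an illustration of "nothing more than call the function pointed to by pcre_free", not the library's source.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the replaceable pcre_free pointer. */
static void (*demo_free)(void *) = free;

/* Hypothetical wrappers: each one only forwards to the pointer above. */
static void demo_free_substring(const char *stringptr)
{
    demo_free((void *)stringptr);
}

static void demo_free_substring_list(const char **listptr)
{
    demo_free((void *)listptr);
}

/* A counting replacement, used to show that the wrappers really do go
   through the pointer rather than calling free() themselves. */
static int demo_free_calls = 0;

static void demo_counting_free(void *block)
{
    demo_free_calls++;
    free(block);
}

/* Allocates one string and one list, installs the counting hook, releases
   both blocks through the wrappers, and returns the number of hook calls
   (or -1 if allocation failed). */
static int demo_run(void)
{
    char *copy = malloc(6);
    const char **list = malloc(2 * sizeof *list);

    if (copy == NULL || list == NULL) {
        free(copy);
        free((void *)list);
        return -1;
    }
    memcpy(copy, "hello", 6);
    list[0] = copy;
    list[1] = NULL;

    demo_free = demo_counting_free;
    demo_free_substring(copy);
    demo_free_substring_list(list);
    return demo_free_calls;
}
```

A caller in another language that cannot reach a C function pointer can still call a plain exported function, which is the case the text says these wrappers exist for.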
718    
719    
720    
# Line 733  DIFFERENCES FROM PERL Line 783  DIFFERENCES FROM PERL
783       (?p{code})  constructions. However, there is some experimen-       (?p{code})  constructions. However, there is some experimen-
784       tal support for recursive patterns using the  non-Perl  item       tal support for recursive patterns using the  non-Perl  item
785       (?R).       (?R).
786    
787       8. There are at the time of writing some  oddities  in  Perl       8. There are at the time of writing some  oddities  in  Perl
788       5.005_02  concerned  with  the  settings of captured strings       5.005_02  concerned  with  the  settings of captured strings
789       when part of a pattern is repeated.  For  example,  matching       when part of a pattern is repeated.  For  example,  matching
# Line 785  REGULAR EXPRESSION DETAILS Line 836  REGULAR EXPRESSION DETAILS
836       The syntax and semantics of  the  regular  expressions  sup-       The syntax and semantics of  the  regular  expressions  sup-
837       ported  by PCRE are described below. Regular expressions are       ported  by PCRE are described below. Regular expressions are
838       also described in the Perl documentation and in a number  of       also described in the Perl documentation and in a number  of
   
839       other  books,  some  of which have copious examples. Jeffrey       other  books,  some  of which have copious examples. Jeffrey
840       Friedl's  "Mastering  Regular  Expressions",  published   by       Friedl's  "Mastering  Regular  Expressions",  published   by
841       O'Reilly  (ISBN  1-56592-257),  covers them in great detail.       O'Reilly (ISBN 1-56592-257), covers them in great detail.
842    
843       The description here is intended as reference documentation.       The description here is intended as reference documentation.
844         The basic operation of PCRE is on strings of bytes. However,
845         there are the beginnings of some support for UTF-8  character
846         strings.  To  use  this  support  you must configure PCRE to
847         include it, and then call pcre_compile() with the  PCRE_UTF8
848         option.  How  this affects the pattern matching is described
849         in the final section of this document.
850    
851       A regular expression is a pattern that is matched against  a       A regular expression is a pattern that is matched against  a
852       subject string from left to right. Most characters stand for       subject string from left to right. Most characters stand for
# Line 1004  CIRCUMFLEX AND DOLLAR Line 1061  CIRCUMFLEX AND DOLLAR
1061       Outside a character class, in the default matching mode, the       Outside a character class, in the default matching mode, the
1062       circumflex  character  is an assertion which is true only if       circumflex  character  is an assertion which is true only if
1063       the current matching point is at the start  of  the  subject       the current matching point is at the start  of  the  subject
1064    
1065       string.  If  the startoffset argument of pcre_exec() is non-       string.  If  the startoffset argument of pcre_exec() is non-
1066       zero, circumflex can never match. Inside a character  class,       zero, circumflex can never match. Inside a character  class,
1067       circumflex has an entirely different meaning (see below).       circumflex has an entirely different meaning (see below).
# Line 1056  FULL STOP (PERIOD, DOT) Line 1114  FULL STOP (PERIOD, DOT)
1114       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1115       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1116       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default)  newline.   If  the  PCRE_DOTALL
1117    
1118       option  is set, dots match newlines as well. The handling of       option  is set, dots match newlines as well. The handling of
1119       dot is entirely independent of the  handling  of  circumflex       dot is entirely independent of the  handling  of  circumflex
1120       and  dollar,  the  only  relationship  being  that they both       and  dollar,  the  only  relationship  being  that they both
# Line 1517  BACK REFERENCES Line 1576  BACK REFERENCES
1576       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
1577       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
1578       example, (a\1) never matches.  However, such references  can       example, (a\1) never matches.  However, such references  can
1579       be  useful  inside  repeated  subpatterns.  For example, the       be useful inside repeated subpatterns. For example, the pat-
1580       pattern       tern
1581    
1582         (a|b\1)+         (a|b\1)+
1583    
1584       matches any number of "a"s and also "aba", "ababaa" etc.  At       matches any number of "a"s and also "aba", "ababbaa" etc. At
1585       each iteration of the subpattern, the back reference matches       each iteration of the subpattern, the back reference matches
1586       the character string corresponding to  the  previous  itera-       the  character  string   corresponding   to   the   previous
1587       tion.  In  order  for this to work, the pattern must be such       iteration.  In  order  for this to work, the pattern must be
1588       that the first iteration does not need  to  match  the  back       such that the first iteration does not  need  to  match  the
1589       reference.  This  can  be  done using alternation, as in the       back  reference.  This  can be done using alternation, as in
1590       example above, or by a quantifier with a minimum of zero.       the example above, or by a  quantifier  with  a  minimum  of
1591         zero.
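
The iteration rule for (a|b\1)+ can be checked by hand with a small sketch. The function below is hypothetical and hard-wired to this one pattern, matched against the whole subject; it is not a general regex engine and not how PCRE is implemented. It relies on the fact that the two alternatives start with different characters, so no backtracking is needed.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if the whole of subject matches (a|b\1)+, otherwise 0. */
static int demo_match_backref(const char *subject)
{
    const char *p = subject;
    const char *prev = NULL;  /* text matched by the previous iteration (\1) */
    size_t prev_len = 0;
    int iterations = 0;

    while (*p != '\0') {
        if (*p == 'a') {
            /* First alternative: a single "a". */
            prev = p;
            prev_len = 1;
            p += 1;
        } else if (*p == 'b' && prev != NULL &&
                   strncmp(p + 1, prev, prev_len) == 0) {
            /* Second alternative: "b" followed by the previous iteration's
               text. It cannot match first, because \1 is still unset. */
            prev = p;
            prev_len += 1;
            p += prev_len;
        } else {
            return 0;
        }
        iterations++;
    }
    return iterations > 0;
}
```

With this sketch, "ababbaa" is accepted as the iterations "a", "ba", "bba", "a", while "ababaa" is rejected because the third iteration would need "bba".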
1592    
1593    
1594    
# Line 1681  ONCE-ONLY SUBPATTERNS Line 1741  ONCE-ONLY SUBPATTERNS
1741    
1742       This kind of parenthesis "locks up" the  part of the pattern       This kind of parenthesis "locks up" the  part of the pattern
1743       it  contains once it has matched, and a failure further into       it  contains once it has matched, and a failure further into
1744       the pattern is prevented from backtracking  into  it.  Back-       the  pattern  is  prevented  from  backtracking   into   it.
1745       tracking  past  it to previous items, however, works as nor-       Backtracking  past  it  to previous items, however, works as
1746       mal.       normal.
1747    
1748       An alternative description is that a subpattern of this type       An alternative description is that a subpattern of this type
1749       matches  the  string  of  characters that an identical stan-       matches  the  string  of  characters that an identical stan-
# Line 1941  PERFORMANCE Line 2001  PERFORMANCE
2001       repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of       repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of
2002       those  cases other than 0, the + repeats can match different       those  cases other than 0, the + repeats can match different
2003       numbers of times.) When the remainder of the pattern is such       numbers of times.) When the remainder of the pattern is such
2004       that  the entire match is going to fail, PCRE has in princi-       that  the  entire  match  is  going  to  fail,  PCRE  has in
2005       ple to try every possible variation, and this  can  take  an       principle to try every possible variation, and this can take
2006       extremely long time.       an extremely long time.
2007    
2008       An optimization catches some of the more simple  cases  such       An optimization catches some of the more simple  cases  such
2009       as       as
# Line 1966  PERFORMANCE Line 2026  PERFORMANCE
2026    
2027    
2028    
2029    UTF-8 SUPPORT
2030         Starting at release 3.3, PCRE has some support for character
2031         strings encoded in the UTF-8 format. This is incomplete, and
2032         is regarded as experimental. In order to use  it,  you  must
2033         configure PCRE to include UTF-8 support in the code, and, in
2034         addition, you must call pcre_compile()  with  the  PCRE_UTF8
2035         option flag. When you do this, both the pattern and any sub-
2036         ject strings that are matched  against  it  are  treated  as
2037         UTF-8  strings instead of just strings of bytes, but only in
2038         the cases that are mentioned below.
2039    
2040         If you compile PCRE with UTF-8 support, but do not use it at
2041         run  time,  the  library will be a bit bigger, but the addi-
2042         tional run time overhead is limited to testing the PCRE_UTF8
2043         flag in several places, so should not be very large.
2044    
2045         PCRE assumes that the strings  it  is  given  contain  valid
2046         UTF-8  codes. It does not diagnose invalid UTF-8 strings. If
2047         you pass invalid UTF-8 strings  to  PCRE,  the  results  are
2048         undefined.
2049    
2050         Running with PCRE_UTF8 set causes these changes in  the  way
2051         PCRE works:
2052    
2053         1. In a pattern, the escape sequence \x{...}, where the con-
2054         tents  of  the  braces is a string of hexadecimal digits, is
2055         interpreted as a UTF-8 character whose code  number  is  the
2056         given   hexadecimal  number,  for  example:  \x{1234}.  This
2057         inserts from one to six  literal  bytes  into  the  pattern,
2058         using the UTF-8 encoding. If a non-hexadecimal digit appears
2059         between the braces, the item is not recognized.
2060    
2061         2. The original hexadecimal escape sequence, \xhh, generates
2062         a two-byte UTF-8 character if its value is greater than 127.
2063    
2064         3. Repeat quantifiers are NOT correctly handled if they fol-
2065         low  a  multibyte character. For example, \x{100}* and \xc3+
2066         do not work. If you want to repeat such characters, you must
2067         enclose  them  in  non-capturing  parentheses,  for  example
2068         (?:\x{100}), at present.
2069    
2070         4. The dot metacharacter matches one UTF-8 character instead
2071         of a single byte.
2072    
2073         5. Unlike literal UTF-8 characters,  the  dot  metacharacter
2074         followed  by  a  repeat quantifier does operate correctly on
2075         UTF-8 characters instead of single bytes.
2076    
2077         6. Although the \x{...} escape is permitted in  a  character
2078         class,  characters  whose values are greater than 255 cannot
2079         be included in a class.
2080    
2081         7. A class is matched against a UTF-8 character  instead  of
2082         just  a  single byte, but it can match only characters whose
2083         values are less than 256.  Characters  with  greater  values
2084         always fail to match a class.
2085    
2086         8. Repeated classes work correctly on multiple characters.
2087    
2088         9. Classes containing just a single character whose value is
2089         greater than 127 (but less than 256), for example, [\x80] or
2090         [^\x{93}], do not work because these are optimized into sin-
2091         gle  byte  matches.  In the first case, of course, the class
2092         brackets are just redundant.
2093    
2094         10. Lookbehind assertions move backwards in the subject by a
2095         fixed  number  of  characters  instead  of a fixed number of
2096         bytes. Simple cases have been tested to work correctly,  but
2097         there may be hidden gotchas herein.
2098    
2099         11. The character types such  as  \d  and  \w  do  not  work
2100         correctly  with  UTF-8  characters.  They continue to test a
2101         single byte.
2102    
2103         12. Anything not explicitly mentioned here continues to work
2104         in bytes rather than in characters.
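
As a rough illustration of the "one to six literal bytes" mentioned for \x{...}, the sketch below encodes a code number in the original one-to-six-byte UTF-8 scheme and reports how many bytes a character occupies from its first byte. Both function names are hypothetical, and this is not PCRE's code; it only shows the encoding that the list above relies on.

```c
#include <stddef.h>

/* Encodes cp (at most 0x7fffffff) into out[], which must have room for
   6 bytes. Returns the number of bytes written (1 to 6), or 0 if cp is
   too large to encode. */
static size_t demo_utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest value that fits in 1, 2, ... 6 bytes, and the lead-byte
       prefix used for each length. */
    static const unsigned long limit[6] =
        { 0x7fUL, 0x7ffUL, 0xffffUL, 0x1fffffUL, 0x3ffffffUL, 0x7fffffffUL };
    static const unsigned char lead[6] =
        { 0x00, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc };
    size_t n = 0;  /* number of continuation bytes */
    size_t i;

    while (n < 6 && cp > limit[n])
        n++;
    if (n == 6)
        return 0;

    /* Fill continuation bytes from the end, 6 bits at a time. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3f));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}

/* Returns the length in bytes of the character that starts with the byte
   first, or 0 if first cannot start a character. */
static size_t demo_utf8_length(unsigned char first)
{
    if (first < 0x80) return 1;
    if (first < 0xc0) return 0;  /* continuation byte */
    if (first < 0xe0) return 2;
    if (first < 0xf0) return 3;
    if (first < 0xf8) return 4;
    if (first < 0xfc) return 5;
    if (first < 0xfe) return 6;
    return 0;
}
```

Under this scheme \x{1234} becomes the three bytes E1 88 B4, and a value above 127 such as \xe9 becomes the two bytes C3 A9, which is why a quantifier applied to the final byte alone would not repeat the whole character.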
2105    
2106         The following UTF-8 features of  Perl  5.6  are  not  imple-
2107         mented:
2108    
2109         1. The escape sequence \C to match a single byte.
2110    
2111         2. The use of Unicode tables and properties and escapes  \p,
2112         \P, and \X.
2113    
2114    
2115    
2116  AUTHOR  AUTHOR
2117       Philip Hazel <ph10@cam.ac.uk>       Philip Hazel <ph10@cam.ac.uk>
2118       University Computing Service,       University Computing Service,
# Line 1973  AUTHOR Line 2120  AUTHOR
2120       Cambridge CB2 3QG, England.       Cambridge CB2 3QG, England.
2121       Phone: +44 1223 334714       Phone: +44 1223 334714
2122    
2123       Last updated: 27 January 2000       Last updated: 28 August 2000,
2124           the 250th anniversary of the death of J.S. Bach.
2125       Copyright (c) 1997-2000 University of Cambridge.       Copyright (c) 1997-2000 University of Cambridge.

