Diff of /code/trunk/doc/pcre.txt

revision 43 by nigel, Sat Feb 24 21:39:21 2007 UTC revision 51 by nigel, Sat Feb 24 21:39:37 2007 UTC
# Line 28  SYNOPSIS Line 28  SYNOPSIS
28       int pcre_get_substring_list(const char *subject,       int pcre_get_substring_list(const char *subject,
29            int *ovector, int stringcount, const char ***listptr);            int *ovector, int stringcount, const char ***listptr);
30    
31         void pcre_free_substring(const char *stringptr);
32    
33         void pcre_free_substring_list(const char **stringptr);
34    
35       const unsigned char *pcre_maketables(void);       const unsigned char *pcre_maketables(void);
36    
37       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
# Line 48  DESCRIPTION Line 52  DESCRIPTION
52       The PCRE library is a set of functions that implement  regu-       The PCRE library is a set of functions that implement  regu-
53       lar  expression  pattern  matching using the same syntax and       lar  expression  pattern  matching using the same syntax and
54       semantics as Perl  5,  with  just  a  few  differences  (see       semantics as Perl  5,  with  just  a  few  differences  (see
55    
56       below).  The  current  implementation  corresponds  to  Perl       below).  The  current  implementation  corresponds  to  Perl
57       5.005, with some additional features from the Perl  develop-       5.005, with some additional features  from  later  versions.
58       ment release.       This  includes  some  experimental,  incomplete  support for
59         UTF-8 encoded strings. Details of exactly what is  and  what
60         is not supported are given below.
61    
62       PCRE has its own native API,  which  is  described  in  this       PCRE has its own native API,  which  is  described  in  this
63       document.  There  is  also  a  set of wrapper functions that       document.  There  is  also  a  set of wrapper functions that
# Line 67  DESCRIPTION Line 74  DESCRIPTION
74       releases.       releases.
75    
76       The functions pcre_compile(), pcre_study(), and  pcre_exec()       The functions pcre_compile(), pcre_study(), and  pcre_exec()
77       are  used  for  compiling  and matching regular expressions,       are used for compiling and matching regular expressions.
78       while   pcre_copy_substring(),   pcre_get_substring(),   and  
79       pcre_get_substring_list()   are  convenience  functions  for       The functions  pcre_copy_substring(),  pcre_get_substring(),
80         and  pcre_get_substring_list() are convenience functions for
81       extracting  captured  substrings  from  a  matched   subject       extracting  captured  substrings  from  a  matched   subject
82       string.  The function pcre_maketables() is used (optionally)       string; pcre_free_substring() and pcre_free_substring_list()
83       to build a set of character tables in the current locale for       are also provided, to free the  memory  used  for  extracted
84       passing to pcre_compile().       strings.
85    
86         The function pcre_maketables() is used (optionally) to build
87         a  set of character tables in the current locale for passing
88         to pcre_compile().
89    
90       The function pcre_fullinfo() is used to find out information       The function pcre_fullinfo() is used to find out information
91       about a compiled pattern; pcre_info() is an obsolete version       about a compiled pattern; pcre_info() is an obsolete version
# Line 92  DESCRIPTION Line 104  DESCRIPTION
104    
105    
106  MULTI-THREADING  MULTI-THREADING
107       The PCRE functions can be used in  multi-threading  applica-       The  PCRE  functions  can   be   used   in   multi-threading
108       tions, with the proviso that the memory management functions  
109       pointed to by pcre_malloc and pcre_free are  shared  by  all  
110       threads.  
111    
112    
113    SunOS 5.8                 Last change:                          2
114    
115    
116    
117         applications,  with  the  proviso that the memory management
118         functions pointed to by pcre_malloc and pcre_free are shared
119         by all threads.
120    
121       The compiled form of a regular  expression  is  not  altered       The compiled form of a regular  expression  is  not  altered
122       during  matching, so the same compiled pattern can safely be       during  matching, so the same compiled pattern can safely be
# Line 103  MULTI-THREADING Line 124  MULTI-THREADING
124    
125    
126    
   
127  COMPILING A PATTERN  COMPILING A PATTERN
128       The function pcre_compile() is called to compile  a  pattern       The function pcre_compile() is called to compile  a  pattern
129       into  an internal form. The pattern is a C string terminated       into  an internal form. The pattern is a C string terminated
# Line 235  COMPILING A PATTERN Line 255  COMPILING A PATTERN
255       followed by "?". It is not compatible with Perl. It can also       followed by "?". It is not compatible with Perl. It can also
256       be set by a (?U) option setting within the pattern.       be set by a (?U) option setting within the pattern.
257    
258           PCRE_UTF8
259    
260         This option causes PCRE to regard both the pattern  and  the
261         subject  as strings of UTF-8 characters instead of just byte
262         strings. However, it is available  only  if  PCRE  has  been
263         built  to  include  UTF-8  support.  If not, the use of this
264         option provokes an error. Support for UTF-8 is new,  experi-
265         mental,  and incomplete.  Details of exactly what it entails
266         are given below.
267    
268    
269    
270  STUDYING A PATTERN  STUDYING A PATTERN
271       When a pattern is going to be  used  several  times,  it  is       When a pattern is going to be  used  several  times,  it  is
272       worth  spending  more time analyzing it in order to speed up       worth  spending  more time analyzing it in order to speed up
273       the time taken for matching. The function pcre_study() takes       the time taken for matching. The function pcre_study() takes
274    
275       a  pointer  to a compiled pattern as its first argument, and       a  pointer  to a compiled pattern as its first argument, and
276       returns a  pointer  to  a  pcre_extra  block  (another  void       returns a  pointer  to  a  pcre_extra  block  (another  void
277       typedef)  containing  additional  information about the pat-       typedef)  containing  additional  information about the pat-
# Line 344  INFORMATION ABOUT A PATTERN Line 375  INFORMATION ABOUT A PATTERN
375    
376         PCRE_INFO_BACKREFMAX         PCRE_INFO_BACKREFMAX
377    
378       Return the number of the highest back reference in the  pat-       Return the number of  the  highest  back  reference  in  the
379       tern.  The  fourth argument should point to an int variable.       pattern.  The  fourth  argument should point to an int vari-
380       Zero is returned if there are no back references.       able. Zero is returned if there are no back references.
381    
382         PCRE_INFO_FIRSTCHAR         PCRE_INFO_FIRSTCHAR
383    
384       Return information about the first character of any  matched       Return information about the first character of any  matched
385       string,  for  a  non-anchored  pattern.  If there is a fixed       string,  for  a  non-anchored  pattern.  If there is a fixed
386       first   character,   e.g.   from   a   pattern    such    as       first   character,   e.g.   from   a   pattern    such    as
387       (cat|cow|coyote), then it is returned in the integer pointed       (cat|cow|coyote),  it  is returned in the integer pointed to
388       to by where. Otherwise, if either       by where. Otherwise, if either
389    
390       (a) the pattern was compiled with the PCRE_MULTILINE option,       (a) the pattern was compiled with the PCRE_MULTILINE option,
391       and every branch starts with "^", or       and every branch starts with "^", or
# Line 363  INFORMATION ABOUT A PATTERN Line 394  INFORMATION ABOUT A PATTERN
394       PCRE_DOTALL is not set (if it were set, the pattern would be       PCRE_DOTALL is not set (if it were set, the pattern would be
395       anchored),       anchored),
396    
397       then -1 is returned, indicating  that  the  pattern  matches       -1 is returned, indicating that the pattern matches only  at
398       only  at  the  start  of  a subject string or after any "\n"       the  start  of a subject string or after any "\n" within the
399       within the string. Otherwise -2 is  returned.  For  anchored       string. Otherwise -2 is returned.  For anchored patterns, -2
400       patterns, -2 is returned.       is returned.
401    
402         PCRE_INFO_FIRSTTABLE         PCRE_INFO_FIRSTTABLE
403    
# Line 605  MATCHING A PATTERN Line 636  MATCHING A PATTERN
636    
637  EXTRACTING CAPTURED SUBSTRINGS  EXTRACTING CAPTURED SUBSTRINGS
638       Captured substrings can be accessed directly  by  using  the       Captured substrings can be accessed directly  by  using  the
648       offsets returned by pcre_exec() in ovector. For convenience,       offsets returned by pcre_exec() in ovector. For convenience,
649       the functions  pcre_copy_substring(),  pcre_get_substring(),       the functions  pcre_copy_substring(),  pcre_get_substring(),
650       and  pcre_get_substring_list()  are  provided for extracting       and  pcre_get_substring_list()  are  provided for extracting
# Line 622  EXTRACTING CAPTURED SUBSTRINGS Line 662  EXTRACTING CAPTURED SUBSTRINGS
662       entire regular expression. This is  the  value  returned  by       entire regular expression. This is  the  value  returned  by
663       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()
664       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
665       tor, then the value passed as stringcount should be the size       tor,  the  value passed as stringcount should be the size of
666       of the vector divided by three.       the vector divided by three.
667    
668       The functions pcre_copy_substring() and pcre_get_substring()       The functions pcre_copy_substring() and pcre_get_substring()
669       extract a single substring, whose number is given as string-       extract a single substring, whose number is given as string-
# Line 631  EXTRACTING CAPTURED SUBSTRINGS Line 671  EXTRACTING CAPTURED SUBSTRINGS
671       the entire pattern, while higher values extract the captured       the entire pattern, while higher values extract the captured
672       substrings. For pcre_copy_substring(), the string is  placed       substrings. For pcre_copy_substring(), the string is  placed
673       in  buffer,  whose  length is given by buffersize, while for       in  buffer,  whose  length is given by buffersize, while for
674       pcre_get_substring() a new block of store  is  obtained  via       pcre_get_substring() a new block of memory is  obtained  via
675       pcre_malloc,  and its address is returned via stringptr. The       pcre_malloc,  and its address is returned via stringptr. The
676       yield of the function is  the  length  of  the  string,  not       yield of the function is  the  length  of  the  string,  not
677       including the terminating zero, or one of       including the terminating zero, or one of
# Line 665  EXTRACTING CAPTURED SUBSTRINGS Line 705  EXTRACTING CAPTURED SUBSTRINGS
705       inspecting the appropriate offset in ovector, which is nega-       inspecting the appropriate offset in ovector, which is nega-
706       tive for unset substrings.       tive for unset substrings.
707    
708         The  two  convenience  functions  pcre_free_substring()  and
709         pcre_free_substring_list()  can  be  used to free the memory
710         returned by  a  previous  call  of  pcre_get_substring()  or
711         pcre_get_substring_list(),  respectively.  They  do  nothing
712         more than call the function pointed to by  pcre_free,  which
713         of  course  could  be called directly from a C program. How-
714         ever, PCRE is used in some situations where it is linked via
715         a  special  interface  to another programming language which
716         cannot use pcre_free directly; it is for  these  cases  that
717         the functions are provided.
718    
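The relationship between ovector, stringcount, and the extracted strings can be sketched without linking against PCRE at all. The helper below is hypothetical (its name and its return codes are assumptions made for this illustration, not the library's implementation or its error values); it only mirrors the offset arithmetic that the text describes for pcre_copy_substring(), using an ovector filled in by hand in place of a real pcre_exec() call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, NOT part of PCRE. Copies captured substring number
   stringnumber out of subject, using the offset pairs in ovector.
   Returns the string length, or an illustrative negative code:
     -1  stringnumber is outside 0 .. stringcount-1
     -2  the substring is unset (its offset in ovector is negative)
     -3  buffer is too small for the string plus its terminating zero */
int copy_from_ovector(const char *subject, const int *ovector,
                      int stringcount, int stringnumber,
                      char *buffer, int buffersize)
{
    int start, length;

    assert(subject != NULL && ovector != NULL && buffer != NULL);

    if (stringnumber < 0 || stringnumber >= stringcount)
        return -1;

    /* Pair n occupies ovector[2n] (start) and ovector[2n+1] (end). */
    start = ovector[2 * stringnumber];
    if (start < 0)
        return -2;

    length = ovector[2 * stringnumber + 1] - start;
    if (length + 1 > buffersize)
        return -3;

    memcpy(buffer, subject + start, (size_t)length);
    buffer[length] = '\0';
    return length;
}
```

With the subject "abc" and a hand-filled vector { 0, 3, -1, -1, 0, 3 } (pair 0 is the whole match, pair 1 is unset, pair 2 covers "abc"), string 0 and string 2 both yield "abc" with length 3, and string 1 is reported as unset. In a real program the strings returned by pcre_get_substring() and pcre_get_substring_list() would additionally be released with pcre_free_substring() and pcre_free_substring_list().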
719    
720    
# Line 733  DIFFERENCES FROM PERL Line 783  DIFFERENCES FROM PERL
783       (?p{code})  constructions. However, there is some experimen-       (?p{code})  constructions. However, there is some experimen-
784       tal support for recursive patterns using the  non-Perl  item       tal support for recursive patterns using the  non-Perl  item
785       (?R).       (?R).
786    
787       8. There are at the time of writing some  oddities  in  Perl       8. There are at the time of writing some  oddities  in  Perl
788       5.005_02  concerned  with  the  settings of captured strings       5.005_02  concerned  with  the  settings of captured strings
789       when part of a pattern is repeated.  For  example,  matching       when part of a pattern is repeated.  For  example,  matching
790       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value
791       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2
792       unset.    However,    if   the   pattern   is   changed   to       unset.    However,    if   the   pattern   is   changed   to
793       /^(aa(b(b))?)+$/ then $2 (and $3) get set.       /^(aa(b(b))?)+$/ then $2 (and $3) are set.
794    
795       In Perl 5.004 $2 is set in both cases, and that is also true       In Perl 5.004 $2 is set in both cases, and that is also true
796       of PCRE. If in the future Perl changes to a consistent state       of PCRE. If in the future Perl changes to a consistent state
# Line 785  REGULAR EXPRESSION DETAILS Line 836  REGULAR EXPRESSION DETAILS
836       The syntax and semantics of  the  regular  expressions  sup-       The syntax and semantics of  the  regular  expressions  sup-
837       ported  by PCRE are described below. Regular expressions are       ported  by PCRE are described below. Regular expressions are
838       also described in the Perl documentation and in a number  of       also described in the Perl documentation and in a number  of
   
839       other  books,  some  of which have copious examples. Jeffrey       other  books,  some  of which have copious examples. Jeffrey
840       Friedl's  "Mastering  Regular  Expressions",  published   by       Friedl's  "Mastering  Regular  Expressions",  published   by
841       O'Reilly  (ISBN  1-56592-257),  covers them in great detail.       O'Reilly (ISBN 1-56592-257), covers them in great detail.
842    
843       The description here is intended as reference documentation.       The description here is intended as reference documentation.
844         The basic operation of PCRE is on strings of bytes. However,
845         there are the beginnings of some support for UTF-8 character
846         strings.  To  use  this  support  you must configure PCRE to
847         include it, and then call pcre_compile() with the  PCRE_UTF8
848         option.  How  this affects the pattern matching is described
849         in the final section of this document.
850    
851       A regular expression is a pattern that is matched against  a       A regular expression is a pattern that is matched against  a
852       subject string from left to right. Most characters stand for       subject string from left to right. Most characters stand for
# Line 1004  CIRCUMFLEX AND DOLLAR Line 1061  CIRCUMFLEX AND DOLLAR
1061       Outside a character class, in the default matching mode, the       Outside a character class, in the default matching mode, the
1062       circumflex  character  is an assertion which is true only if       circumflex  character  is an assertion which is true only if
1063       the current matching point is at the start  of  the  subject       the current matching point is at the start  of  the  subject
1064    
1065       string.  If  the startoffset argument of pcre_exec() is non-       string.  If  the startoffset argument of pcre_exec() is non-
1066       zero, circumflex can never match. Inside a character  class,       zero, circumflex can never match. Inside a character  class,
1067       circumflex has an entirely different meaning (see below).       circumflex has an entirely different meaning (see below).
# Line 1056  FULL STOP (PERIOD, DOT) Line 1114  FULL STOP (PERIOD, DOT)
1114       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1115       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1116       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default)  newline.   If  the  PCRE_DOTALL
1117       option  is  set,  then dots match newlines as well. The han-  
1118       dling of dot is entirely independent of the handling of cir-       option  is set, dots match newlines as well. The handling of
1119       cumflex  and  dollar,  the only relationship being that they       dot is entirely independent of the  handling  of  circumflex
1120       both involve newline characters.  Dot has no special meaning       and  dollar,  the  only  relationship  being  that they both
1121       in a character class.       involve newline characters. Dot has no special meaning in  a
1122         character class.
1123    
1124    
1125    
# Line 1403  REPETITION Line 1462  REPETITION
1462    
1463         /* first comment */  not comment  /* second comment */         /* first comment */  not comment  /* second comment */
1464    
1465       fails, because it matches  the  entire  string  due  to  the       fails, because it matches the entire  string  owing  to  the
1466       greediness of the .*  item.       greediness of the .*  item.
1467    
1468       However, if a quantifier is followed  by  a  question  mark,       However, if a quantifier is followed by a question mark,  it
1469       then it ceases to be greedy, and instead matches the minimum       ceases  to be greedy, and instead matches the minimum number
1470       number of times possible, so the pattern       of times possible, so the pattern
1471    
1472         /\*.*?\*/         /\*.*?\*/
1473    
# Line 1425  REPETITION Line 1484  REPETITION
1484       that is the only way the rest of the pattern matches.       that is the only way the rest of the pattern matches.
1485    
1486       If the PCRE_UNGREEDY option is set (an option which  is  not       If the PCRE_UNGREEDY option is set (an option which  is  not
1487       available  in  Perl)  then the quantifiers are not greedy by       available  in  Perl),  the  quantifiers  are  not  greedy by
1488       default, but individual ones can be made greedy by following       default, but individual ones can be made greedy by following
1489       them  with  a  question mark. In other words, it inverts the       them  with  a  question mark. In other words, it inverts the
1490       default behaviour.       default behaviour.
# Line 1437  REPETITION Line 1496  REPETITION
1496    
1497       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
1498       option (equivalent to Perl's /s) is set, thus allowing the .       option (equivalent to Perl's /s) is set, thus allowing the .
1499       to match newlines, then the pattern is implicitly  anchored,       to match  newlines,  the  pattern  is  implicitly  anchored,
1500       because whatever follows will be tried against every charac-       because whatever follows will be tried against every charac-
1501       ter position in the subject string, so there is no point  in       ter position in the subject string, so there is no point  in
1502       retrying  the overall match at any position after the first.       retrying  the overall match at any position after the first.
# Line 1490  BACK REFERENCES Line 1549  BACK REFERENCES
1549    
1550       matches "sense and sensibility" and "response and  responsi-       matches "sense and sensibility" and "response and  responsi-
1551       bility",  but  not  "sense  and  responsibility". If caseful       bility",  but  not  "sense  and  responsibility". If caseful
1552       matching is in force at the time of the back reference, then       matching is in force at the time of the back reference,  the
1553       the case of letters is relevant. For example,       case of letters is relevant. For example,
1554    
1555         ((?i)rah)\s+\1         ((?i)rah)\s+\1
1556    
# Line 1501  BACK REFERENCES Line 1560  BACK REFERENCES
1560    
1561       There may be more than one back reference to the  same  sub-       There may be more than one back reference to the  same  sub-
1562       pattern.  If  a  subpattern  has not actually been used in a       pattern.  If  a  subpattern  has not actually been used in a
1563       particular match, then any  back  references  to  it  always       particular match, any back references to it always fail. For
1564       fail. For example, the pattern       example, the pattern
1565    
1566         (a|(bc))\2         (a|(bc))\2
1567    
# Line 1510  BACK REFERENCES Line 1569  BACK REFERENCES
1569       Because  there  may  be up to 99 back references, all digits       Because  there  may  be up to 99 back references, all digits
1570       following the backslash are taken as  part  of  a  potential       following the backslash are taken as  part  of  a  potential
1571       back reference number. If the pattern continues with a digit       back reference number. If the pattern continues with a digit
1572       character, then some delimiter must be used to terminate the       character, some delimiter must be used to terminate the back
1573       back reference. If the PCRE_EXTENDED option is set, this can       reference.   If the PCRE_EXTENDED option is set, this can be
1574       be whitespace.  Otherwise an empty comment can be used.       whitespace. Otherwise an empty comment can be used.
1575    
1576       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
1577       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
1578       example, (a\1) never matches.  However, such references  can       example, (a\1) never matches.  However, such references  can
1579       be  useful  inside  repeated  subpatterns.  For example, the       be useful inside repeated subpatterns. For example, the pat-
1580       pattern       tern
1581    
1582         (a|b\1)+         (a|b\1)+
1583    
1584       matches any number of "a"s and also "aba", "ababaa" etc.  At       matches any number of "a"s and also "aba", "ababbaa" etc. At
1585       each iteration of the subpattern, the back reference matches       each iteration of the subpattern, the back reference matches
1586       the character string corresponding to  the  previous  itera-       the  character  string   corresponding   to   the   previous
1587       tion.  In  order  for this to work, the pattern must be such       iteration.  In  order  for this to work, the pattern must be
1588       that the first iteration does not need  to  match  the  back       such that the first iteration does not  need  to  match  the
1589       reference.  This  can  be  done using alternation, as in the       back  reference.  This  can be done using alternation, as in
1590       example above, or by a quantifier with a minimum of zero.       the example above, or by a  quantifier  with  a  minimum  of
1591         zero.
1592    
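The behaviour of (a|b\1)+ can be traced with a small hand-written matcher. This is only a sketch of the semantics described above, not PCRE's matching engine: the function names are invented for the illustration, and it assumes the pattern is anchored at both ends, as if it were written ^(a|b\1)+$. The capture from the previous iteration is carried along as a start offset and length within the subject.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Try to match the rest of the subject from pos. cap_start/cap_len describe
   what group 1 captured on the previous iteration; have_cap is zero until
   the first iteration has completed. */
static int match_from(const char *s, size_t len, size_t pos,
                      size_t cap_start, size_t cap_len, int have_cap)
{
    /* One or more iterations done and the whole subject is consumed. */
    if (have_cap && pos == len)
        return 1;
    if (pos >= len)
        return 0;

    /* First alternative: a literal "a", which becomes the new capture. */
    if (s[pos] == 'a' && match_from(s, len, pos + 1, pos, 1, 1))
        return 1;

    /* Second alternative: "b" followed by the previous capture. It always
       fails on the first iteration, because group 1 is not yet set. */
    if (have_cap && s[pos] == 'b' &&
        pos + 1 + cap_len <= len &&
        memcmp(s + pos + 1, s + cap_start, cap_len) == 0 &&
        match_from(s, len, pos + 1 + cap_len, pos, 1 + cap_len, 1))
        return 1;

    return 0;
}

/* Returns 1 if the whole of s matches (a|b\1)+, otherwise 0. */
int matches_a_or_b_backref(const char *s)
{
    assert(s != NULL);
    return match_from(s, strlen(s), 0, 0, 0, 0);
}
```

For "ababbaa" the successive captures are "a", "ba", "bba", and "a", so the whole string matches. For "ababaa" the third iteration would need "bba" and finds "baa", so the anchored match fails.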
1593    
1594    
# Line 1612  ASSERTIONS Line 1672  ASSERTIONS
1672       matches "foo" preceded by three digits that are  not  "999".       matches "foo" preceded by three digits that are  not  "999".
1673       Notice  that each of the assertions is applied independently       Notice  that each of the assertions is applied independently
1674       at the same point in the subject string. First  there  is  a       at the same point in the subject string. First  there  is  a
1675       check  that  the  previous  three characters are all digits,       check that the previous three characters are all digits, and
1676       then there is a check that the same three characters are not       then there is a check that the same three characters are not
1677       "999".   This  pattern  does not match "foo" preceded by six       "999".   This  pattern  does not match "foo" preceded by six
1678       characters, the first three of which are digits and the last three       characters, the first three of which are digits and the last three
# Line 1681  ONCE-ONLY SUBPATTERNS Line 1741  ONCE-ONLY SUBPATTERNS
1741    
1742       This kind of parenthesis "locks up" the  part of the pattern       This kind of parenthesis "locks up" the  part of the pattern
1743       it  contains once it has matched, and a failure further into       it  contains once it has matched, and a failure further into
1744       the pattern is prevented from backtracking  into  it.  Back-       the  pattern  is  prevented  from  backtracking   into   it.
1745       tracking  past  it to previous items, however, works as nor-       Backtracking  past  it  to previous items, however, works as
1746       mal.       normal.
1747    
1748       An alternative description is that a subpattern of this type       An alternative description is that a subpattern of this type
1749       matches  the  string  of  characters that an identical stan-       matches  the  string  of  characters that an identical stan-
# Line 1713  ONCE-ONLY SUBPATTERNS Line 1773  ONCE-ONLY SUBPATTERNS
1773    
1774         ^.*abcd$         ^.*abcd$
1775    
1776       then the initial .* matches the entire string at first,  but       the initial .* matches the entire string at first, but  when
1777       when  this  fails  (because  there  is no following "a"), it       this  fails  (because  there  is no following "a"), it back-
1778       backtracks to match all but the last character, then all but       tracks to match all but the last character, then all but the
1779       the  last  two  characters, and so on. Once again the search       last  two  characters,  and so on. Once again the search for
1780       for "a" covers the entire string, from right to left, so  we       "a" covers the entire string, from right to left, so we  are
1781       are no better off. However, if the pattern is written as       no better off. However, if the pattern is written as
1782    
1783         ^(?>.*)(?<=abcd)         ^(?>.*)(?<=abcd)
1784    
1785       then there can be no backtracking for the .*  item;  it  can       there can be no backtracking for the .* item; it  can  match
1786       match  only  the  entire  string.  The subsequent lookbehind       only  the entire string. The subsequent lookbehind assertion
1787       assertion does a single test on the last four characters. If       does a single test on the last four characters. If it fails,
1788       it  fails,  the  match  fails immediately. For long strings,       the match fails immediately. For long strings, this approach
1789       this approach makes a significant difference to the process-       makes a significant difference to the processing time.
      ing time.  
1790    
1791       When a pattern contains an unlimited repeat inside a subpat-       When a pattern contains an unlimited repeat inside a subpat-
1792       tern  that  can  itself  be  repeated an unlimited number of       tern  that  can  itself  be  repeated an unlimited number of
# Line 1777  CONDITIONAL SUBPATTERNS Line 1836  CONDITIONAL SUBPATTERNS
1836       error occurs.       error occurs.
1837    
1838       There are two kinds of condition. If the  text  between  the       There are two kinds of condition. If the  text  between  the
1839       parentheses  consists  of  a  sequence  of  digits, then the       parentheses  consists of a sequence of digits, the condition
1840       condition is satisfied if the capturing subpattern  of  that       is satisfied if the capturing subpattern of that number  has
1841       number  has  previously matched. Consider the following pat-       previously  matched.  The  number must be greater than zero.
1842       tern, which contains non-significant white space to make  it       Consider  the  following  pattern,   which   contains   non-
1843       more  readable  (assume  the  PCRE_EXTENDED  option)  and to       significant white space to make it more readable (assume the
1844       divide it into three parts for ease of discussion:       PCRE_EXTENDED option) and to divide it into three parts  for
1845         ease of discussion:
1846    
1847         ( \( )?    [^()]+    (?(1) \) )         ( \( )?    [^()]+    (?(1) \) )
1848    
# Line 1888  RECURSIVE PATTERNS Line 1948  RECURSIVE PATTERNS
1948    
1949         \( ( ( (?>[^()]+) | (?R) )* ) \)         \( ( ( (?>[^()]+) | (?R) )* ) \)
1950            ^                        ^            ^                        ^
1951            ^                        ^ then the string they capture            ^                        ^ the string they  capture  is
1952       is "ab(cd)ef", the contents of the top level parentheses. If       "ab(cd)ef",  the  contents  of the top level parentheses. If
1953       there are more than 15 capturing parentheses in  a  pattern,       there are more than 15 capturing parentheses in  a  pattern,
1954       PCRE  has  to  obtain  extra  memory  to store data during a       PCRE  has  to  obtain  extra  memory  to store data during a
1955       recursion, which it does by using  pcre_malloc,  freeing  it       recursion, which it does by using  pcre_malloc,  freeing  it
# Line 1967  PERFORMANCE Line 2027  PERFORMANCE
2027    
2028    
2029    
2030    UTF-8 SUPPORT
2031         Starting at release 3.3, PCRE has some support for character
2032         strings encoded in the UTF-8 format. This is incomplete, and
2033         is regarded as experimental. In order to use  it,  you  must
2034         configure PCRE to include UTF-8 support in the code, and, in
2035         addition, you must call pcre_compile()  with  the  PCRE_UTF8
2036         option flag. When you do this, both the pattern and any sub-
2037         ject strings that are matched  against  it  are  treated  as
2038         UTF-8  strings instead of just strings of bytes, but only in
2039         the cases that are mentioned below.
2040    
2041         If you compile PCRE with UTF-8 support, but do not use it at
2042         run  time,  the  library will be a bit bigger, but the addi-
2043         tional run time overhead is limited to testing the PCRE_UTF8
2044         flag in several places, so should not be very large.
2045    
2046         PCRE assumes that the strings  it  is  given  contain  valid
2047         UTF-8  codes. It does not diagnose invalid UTF-8 strings. If
2048         you pass invalid UTF-8 strings  to  PCRE,  the  results  are
2049         undefined.
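
Because PCRE does not diagnose invalid UTF-8, a caller may want to check a
subject string before passing it in. The following is a minimal sketch of
such a check, not part of PCRE: the name utf8_is_wellformed is hypothetical,
and the function tests only the byte structure (lead byte followed by the
right number of continuation bytes) under the original 1-to-6-byte scheme.
It does not reject overlong encodings.

```c
#include <stddef.h>

/* Hypothetical helper, NOT a PCRE function.
   Returns 1 if buf[0..len) is structurally well-formed UTF-8 (1-to-6-byte
   scheme), 0 otherwise. Overlong encodings are not rejected. */
int utf8_is_wellformed(const unsigned char *buf, size_t len)
{
    size_t i = 0;
    size_t k, extra;

    while (i < len) {
        unsigned char c = buf[i];

        /* The lead byte determines how many continuation bytes follow. */
        if (c < 0x80)                extra = 0;
        else if ((c & 0xE0) == 0xC0) extra = 1;
        else if ((c & 0xF0) == 0xE0) extra = 2;
        else if ((c & 0xF8) == 0xF0) extra = 3;
        else if ((c & 0xFC) == 0xF8) extra = 4;
        else if ((c & 0xFE) == 0xFC) extra = 5;
        else return 0;               /* stray continuation byte, 0xFE or 0xFF */

        if (extra > len - i - 1)
            return 0;                /* sequence is truncated */

        /* Every continuation byte must have the form 10xxxxxx. */
        for (k = 1; k <= extra; k++)
            if ((buf[i + k] & 0xC0) != 0x80)
                return 0;

        i += extra + 1;
    }
    return 1;
}
```

A caller would run this on the subject (and on the pattern) and skip the
match if it returns 0, since the results of matching are otherwise undefined.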
2050    
2051         Running with PCRE_UTF8 set causes these changes in  the  way
2052         PCRE works:
2053    
2054         1. In a pattern, the escape sequence \x{...}, where the con-
2055         tents  of  the braces are a string of hexadecimal digits, is
2056         interpreted as a UTF-8 character whose code  number  is  the
2057         given   hexadecimal  number,  for  example:  \x{1234}.  This
2058         inserts from one to six  literal  bytes  into  the  pattern,
2059         using the UTF-8 encoding. If a non-hexadecimal digit appears
2060         between the braces, the item is not recognized.
2061    
2062         2. The original hexadecimal escape sequence, \xhh, generates
2063         a two-byte UTF-8 character if its value is greater than 127.
2064    
2065         3. Repeat quantifiers are NOT correctly handled if they fol-
2066         low  a  multibyte character. For example, \x{100}* and \xc3+
2067         do not work. If you want to repeat such characters, you must
2068         enclose  them  in  non-capturing  parentheses,  for  example
2069         (?:\x{100}), at present.
2070    
2071         4. The dot metacharacter matches one UTF-8 character instead
2072         of a single byte.
2073    
2074         5. Unlike literal UTF-8 characters,  the  dot  metacharacter
2075         followed  by  a  repeat quantifier does operate correctly on
2076         UTF-8 characters instead of single bytes.
2077    
2078         6. Although the \x{...} escape is permitted in  a  character
2079         class,  characters  whose values are greater than 255 cannot
2080         be included in a class.
2081    
2082         7. A class is matched against a UTF-8 character  instead  of
2083         just  a  single byte, but it can match only characters whose
2084         values are less than 256.  Characters  with  greater  values
2085         always fail to match a class.
2086    
2087         8. Repeated classes work correctly on multiple characters.
2088    
2089         9. Classes containing just a single character whose value is
2090         greater than 127 (but less than 256), for example, [\x80] or
2091         [^\x{93}], do not work because these are optimized into sin-
2092         gle  byte  matches.  In the first case, of course, the class
2093         brackets are just redundant.
2094    
2095         10. Lookbehind assertions move backwards in the subject by a
2096         fixed  number  of  characters  instead  of a fixed number of
2097         bytes. Simple cases have been tested to work correctly,  but
2098         there may be hidden gotchas herein.
2099    
2100         11. The character types  such  as  \d  and  \w  do not  work
2101         correctly  with  UTF-8  characters.  They continue to test a
2102         single byte.
2103    
2104         12. Anything not explicitly mentioned here continues to work
2105         in bytes rather than in characters.
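
The list above says that \x{...} inserts from one to six literal bytes into
the pattern, and that \xhh with a value above 127 becomes a two-byte
character. The sketch below illustrates that byte expansion. It is not
PCRE's own code: the name utf8_encode is hypothetical, and it assumes the
original UTF-8 scheme covering code points 0 to 0x7FFFFFFF.

```c
#include <stddef.h>

/* Hypothetical helper, NOT a PCRE function.
   Encodes code point cp into out[], which must have room for 6 bytes.
   Returns the number of bytes written (1 to 6), or 0 if cp > 0x7FFFFFFF. */
size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    /* Largest code point representable with n continuation bytes. */
    static const unsigned long limit[6] = {
        0x7FUL, 0x7FFUL, 0xFFFFUL, 0x1FFFFFUL, 0x3FFFFFFUL, 0x7FFFFFFFUL
    };
    /* Marker bits of the lead byte for n continuation bytes. */
    static const unsigned char lead[6] = {
        0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC
    };
    size_t n, i;

    for (n = 0; n < 6; n++)
        if (cp <= limit[n])
            break;
    if (n == 6)
        return 0;                    /* out of range for this scheme */

    /* Fill the continuation bytes from the end, six bits at a time. */
    for (i = n; i > 0; i--) {
        out[i] = (unsigned char)(0x80 | (cp & 0x3F));
        cp >>= 6;
    }
    out[0] = (unsigned char)(lead[n] | cp);
    return n + 1;
}
```

With this sketch, the code point written as \x{1234} expands to the three
bytes E1 88 B4, and the value written as \xe9 expands to the two bytes C3 A9.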
2106    
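The lookbehind item in the list above distinguishes moving back by a fixed
number of characters from moving back by a fixed number of bytes. A rough
sketch of character-wise stepping follows. The name utf8_step_back is
hypothetical, the buffer is assumed to hold valid UTF-8, and this is an
illustration of the idea rather than PCRE's implementation.

```c
#include <stddef.h>

/* Hypothetical helper, NOT a PCRE function.
   Starting at byte offset pos in a valid UTF-8 buffer, moves back n
   characters and returns the new byte offset, or (size_t)-1 if fewer
   than n characters precede pos. */
size_t utf8_step_back(const unsigned char *buf, size_t pos, size_t n)
{
    while (n > 0) {
        if (pos == 0)
            return (size_t)-1;       /* ran off the start of the subject */
        pos--;
        /* Skip continuation bytes (10xxxxxx) to reach the lead byte. */
        while (pos > 0 && (buf[pos] & 0xC0) == 0x80)
            pos--;
        n--;
    }
    return pos;
}
```

For the four-byte subject 61 C4 80 62 ("a", U+0100, "b"), stepping back two
characters from the end lands on offset 1, whereas stepping back two bytes
would land on offset 2, in the middle of the two-byte character.
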
2107         The following UTF-8 features of  Perl  5.6  are  not  imple-
2108         mented:
2109         1. The escape sequence \C to match a single byte.
2110    
2111         2. The use of Unicode tables and properties and escapes  \p,
2112         \P, and \X.
2113    
2114    
2115    
2116  AUTHOR  AUTHOR
2117       Philip Hazel <ph10@cam.ac.uk>       Philip Hazel <ph10@cam.ac.uk>
2118       University Computing Service,       University Computing Service,
# Line 1974  AUTHOR Line 2120  AUTHOR
2120       Cambridge CB2 3QG, England.       Cambridge CB2 3QG, England.
2121       Phone: +44 1223 334714       Phone: +44 1223 334714
2122    
2123       Last updated: 27 January 2000       Last updated: 28 August 2000,
2124           the 250th anniversary of the death of J.S. Bach.
2125       Copyright (c) 1997-2000 University of Cambridge.       Copyright (c) 1997-2000 University of Cambridge.
