Diff of /code/tags/pcre-5.0/doc/pcre.txt

revision 41 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 49 by nigel, Sat Feb 24 21:39:33 2007 UTC
# Line 28  SYNOPSIS Line 28  SYNOPSIS
28       int pcre_get_substring_list(const char *subject,       int pcre_get_substring_list(const char *subject,
29            int *ovector, int stringcount, const char ***listptr);            int *ovector, int stringcount, const char ***listptr);
30    
31         void pcre_free_substring(const char *stringptr);
32    
33         void pcre_free_substring_list(const char **stringptr);
34    
35       const unsigned char *pcre_maketables(void);       const unsigned char *pcre_maketables(void);
36    
37         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
38              int what, void *where);
39    
40       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
41    
42       char *pcre_version(void);       char *pcre_version(void);
# Line 45  DESCRIPTION Line 52  DESCRIPTION
52       The PCRE library is a set of functions that implement  regu-       The PCRE library is a set of functions that implement  regu-
53       lar  expression  pattern  matching using the same syntax and       lar  expression  pattern  matching using the same syntax and
54       semantics as Perl  5,  with  just  a  few  differences  (see       semantics as Perl  5,  with  just  a  few  differences  (see
55    
56       below).  The  current  implementation  corresponds  to  Perl       below).  The  current  implementation  corresponds  to  Perl
57       5.005.       5.005, with some additional features  from  later  versions.
58         This  includes  some  experimental,  incomplete  support for
59         UTF-8 encoded strings. Details of exactly what is  and  what
60         is not supported are given below.
61    
62       PCRE has its own native API,  which  is  described  in  this       PCRE has its own native API,  which  is  described  in  this
63       document.  There  is  also  a  set of wrapper functions that       document.  There  is  also  a  set of wrapper functions that
64       correspond to the POSIX API.  These  are  described  in  the       correspond to the POSIX regular expression API.   These  are
65       pcreposix documentation.       described in the pcreposix documentation.
66    
67       The native API function prototypes are defined in the header       The native API function prototypes are defined in the header
68       file  pcre.h,  and  on  Unix  systems  the library itself is       file  pcre.h,  and  on  Unix  systems  the library itself is
69       called libpcre.a, so can be accessed by adding -lpcre to the       called libpcre.a, so can be accessed by adding -lpcre to the
70       command for linking an application which calls it.       command  for  linking  an  application  which  calls it. The
71         header file defines the macros PCRE_MAJOR and PCRE_MINOR  to
72         contain the major and minor release numbers for the library.
73         Applications can use these to include support for  different
74         releases.
75    
76       The functions pcre_compile(), pcre_study(), and  pcre_exec()       The functions pcre_compile(), pcre_study(), and  pcre_exec()
77       are  used  for  compiling  and matching regular expressions,       are used for compiling and matching regular expressions.
78       while   pcre_copy_substring(),   pcre_get_substring(),   and  
79       pcre_get_substring_list()   are  convenience  functions  for       The functions  pcre_copy_substring(),  pcre_get_substring(),
80         and  pcre_get_substring_list() are convenience functions for
81       extracting  captured  substrings  from  a  matched   subject       extracting  captured  substrings  from  a  matched   subject
82       string.  The function pcre_maketables() is used (optionally)       string; pcre_free_substring() and pcre_free_substring_list()
83       to build a set of character tables in the current locale for       are also provided, to free the  memory  used  for  extracted
84       passing to pcre_compile().       strings.
85    
86       The function pcre_info() is used  to  find  out  information       The function pcre_maketables() is used (optionally) to build
87       about  a compiled pattern, while the function pcre_version()       a  set of character tables in the current locale for passing
88       returns a pointer to a string containing the version of PCRE       to pcre_compile().
89       and its date of release.  
90         The function pcre_fullinfo() is used to find out information
91         about a compiled pattern; pcre_info() is an obsolete version
92         which returns only some of the available information, but is
93         retained   for   backwards   compatibility.    The  function
94         pcre_version() returns a pointer to a string containing  the
95         version of PCRE and its date of release.
96    
97       The global variables  pcre_malloc  and  pcre_free  initially       The global variables  pcre_malloc  and  pcre_free  initially
98       contain the entry points of the standard malloc() and free()       contain the entry points of the standard malloc() and free()
# Line 81  DESCRIPTION Line 104  DESCRIPTION
104    
105    
106  MULTI-THREADING  MULTI-THREADING
107       The PCRE functions can be used in  multi-threading  applica-       The  PCRE  functions  can   be   used   in   multi-threading
108       tions, with the proviso that the memory management functions  
109       pointed to by pcre_malloc and pcre_free are  shared  by  all  
110       threads.  
111    
112    
113    SunOS 5.8                 Last change:                          2
114    
115    
116    
117         applications,  with  the  proviso that the memory management
118         functions pointed to by pcre_malloc and pcre_free are shared
119         by all threads.
120    
121       The compiled form of a regular  expression  is  not  altered       The compiled form of a regular  expression  is  not  altered
122       during  matching, so the same compiled pattern can safely be       during  matching, so the same compiled pattern can safely be
# Line 187  COMPILING A PATTERN Line 219  COMPILING A PATTERN
219    
220         PCRE_EXTRA         PCRE_EXTRA
221    
222       This option turns on additional functionality of  PCRE  that       This option was invented in  order  to  turn  on  additional
223       is  incompatible  with Perl. Any backslash in a pattern that       functionality of PCRE that is incompatible with Perl, but it
224       is followed by a letter that has no special  meaning  causes       is currently of very little use. When set, any backslash  in
225       an  error,  thus  reserving  these  combinations  for future       a  pattern  that is followed by a letter that has no special
226       expansion. By default, as in Perl, a backslash followed by a       meaning causes an error, thus reserving  these  combinations
227       letter  with  no  special  meaning  is treated as a literal.       for  future  expansion.  By default, as in Perl, a backslash
228       There are at present no other features  controlled  by  this       followed by a letter with no special meaning is treated as a
229       option.       literal.  There  are at present no other features controlled
230         by this option. It can also be set by a (?X) option  setting
231         within a pattern.
232    
233         PCRE_MULTILINE         PCRE_MULTILINE
234    
# Line 207  COMPILING A PATTERN Line 241  COMPILING A PATTERN
241       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
242    
243       When PCRE_MULTILINE it is set, the "start of line" and  "end       When PCRE_MULTILINE it is set, the "start of line" and  "end
244       of   line"   constructs   match   immediately  following  or       of  line"  constructs match immediately following or immedi-
245       immediately  before  any  newline  in  the  subject  string,       ately before any newline  in  the  subject  string,  respec-
246       respectively,  as well as at the very start and end. This is       tively,  as  well  as  at  the  very  start and end. This is
247       equivalent to Perl's /m option. If there are no "\n" charac-       equivalent to Perl's /m option. If there are no "\n" charac-
248       ters  in  a subject string, or no occurrences of ^ or $ in a       ters  in  a subject string, or no occurrences of ^ or $ in a
249       pattern, setting PCRE_MULTILINE has no effect.       pattern, setting PCRE_MULTILINE has no effect.
# Line 221  COMPILING A PATTERN Line 255  COMPILING A PATTERN
255       followed by "?". It is not compatible with Perl. It can also       followed by "?". It is not compatible with Perl. It can also
256       be set by a (?U) option setting within the pattern.       be set by a (?U) option setting within the pattern.
257    
258           PCRE_UTF8
259    
260         This option causes PCRE to regard both the pattern  and  the
261         subject  as strings of UTF-8 characters instead of just byte
262         strings. However, it is available  only  if  PCRE  has  been
263         built  to  include  UTF-8  support.  If not, the use of this
264         option provokes an error. Support for UTF-8 is new,  experi-
265         mental,  and incomplete.  Details of exactly what it entails
266         are given below.
267    
268    
269    
270  STUDYING A PATTERN  STUDYING A PATTERN
271       When a pattern is going to be  used  several  times,  it  is       When a pattern is going to be  used  several  times,  it  is
272       worth  spending  more time analyzing it in order to speed up       worth  spending  more time analyzing it in order to speed up
273       the time taken for matching. The function pcre_study() takes       the time taken for matching. The function pcre_study() takes
274    
275       a  pointer  to a compiled pattern as its first argument, and       a  pointer  to a compiled pattern as its first argument, and
276       returns a  pointer  to  a  pcre_extra  block  (another  void       returns a  pointer  to  a  pcre_extra  block  (another  void
277       typedef)  containing  additional  information about the pat-       typedef)  containing  additional  information about the pat-
# Line 284  LOCALE SUPPORT Line 329  LOCALE SUPPORT
329    
330    
331  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
332       The pcre_info() function returns information  about  a  com-       The pcre_fullinfo() function  returns  information  about  a
333       piled pattern.  Its yield is the number of capturing subpat-       compiled pattern. It replaces the obsolete pcre_info() func-
334       terns, or one of the following negative numbers:       tion, which is nevertheless retained for backwards compabil-
335         ity (and is documented below).
336    
337         The first argument for pcre_fullinfo() is a pointer  to  the
338         compiled  pattern.  The  second  argument  is  the result of
339         pcre_study(), or NULL if the pattern was  not  studied.  The
340         third  argument  specifies  which  piece  of  information is
341         required, while the fourth argument is a pointer to a  vari-
342         able  to receive the data. The yield of the function is zero
343         for success, or one of the following negative numbers:
344    
345         PCRE_ERROR_NULL       the argument code was NULL         PCRE_ERROR_NULL       the argument code was NULL
346                                 the argument where was NULL
347         PCRE_ERROR_BADMAGIC   the "magic number" was not found         PCRE_ERROR_BADMAGIC   the "magic number" was not found
348           PCRE_ERROR_BADOPTION  the value of what was invalid
349    
350       If the optptr argument is not NULL, a copy  of  the  options       The possible values for the third argument  are  defined  in
351       with which the pattern was compiled is placed in the integer       pcre.h, and are as follows:
352       it points to. These option bits are those specified  in  the  
353           PCRE_INFO_OPTIONS
354    
355         Return a copy of the options with which the pattern was com-
356         piled.  The fourth argument should point to au unsigned long
357         int variable. These option bits are those specified  in  the
358       call  to  pcre_compile(),  modified  by any top-level option       call  to  pcre_compile(),  modified  by any top-level option
359       settings  within  the   pattern   itself,   and   with   the       settings  within  the   pattern   itself,   and   with   the
360       PCRE_ANCHORED  bit  set  if  the form of the pattern implies       PCRE_ANCHORED  bit  forcibly  set if the form of the pattern
361       that it can match only at the start of a subject string.       implies that it can match only at the  start  of  a  subject
362         string.
363    
364       If the pattern is not anchored and the firstcharptr argument         PCRE_INFO_SIZE
365       is  not  NULL, it is used to pass back information about the  
366       first character of any matched string. If there is  a  fixed       Return the size of the compiled pattern, that is, the  value
367       first    character,    e.g.   from   a   pattern   such   as       that  was  passed as the argument to pcre_malloc() when PCRE
368       (cat|cow|coyote), then it is returned in the integer pointed       was getting memory in which to place the compiled data.  The
369       to by firstcharptr. Otherwise, if either       fourth argument should point to a size_t variable.
370    
371           PCRE_INFO_CAPTURECOUNT
372    
373         Return the number of capturing subpatterns in  the  pattern.
374         The fourth argument should point to an int variable.
375    
376           PCRE_INFO_BACKREFMAX
377    
378         Return the number of  the  highest  back  reference  in  the
379         pattern.  The  fourth  argument should point to an int vari-
380         able. Zero is returned if there are no back references.
381    
382           PCRE_INFO_FIRSTCHAR
383    
384         Return information about the first character of any  matched
385         string,  for  a  non-anchored  pattern.  If there is a fixed
386         first   character,   e.g.   from   a   pattern    such    as
387         (cat|cow|coyote),  it  is returned in the integer pointed to
388         by where. Otherwise, if either
389    
390       (a) the pattern was compiled with the PCRE_MULTILINE option,       (a) the pattern was compiled with the PCRE_MULTILINE option,
391       and every branch starts with "^", or       and every branch starts with "^", or
# Line 312  INFORMATION ABOUT A PATTERN Line 393  INFORMATION ABOUT A PATTERN
393       (b) every  branch  of  the  pattern  starts  with  ".*"  and       (b) every  branch  of  the  pattern  starts  with  ".*"  and
394       PCRE_DOTALL is not set (if it were set, the pattern would be       PCRE_DOTALL is not set (if it were set, the pattern would be
395       anchored),       anchored),
396       then -1 is returned, indicating  that  the  pattern  matches  
397       only  at  the  start  of  a subject string or after any "\n"       -1 is returned, indicating that the pattern matches only  at
398       within the string. Otherwise -2 is returned.       the  start  of a subject string or after any "\n" within the
399         string. Otherwise -2 is returned.  For anchored patterns, -2
400         is returned.
401    
402           PCRE_INFO_FIRSTTABLE
403    
404         If the pattern was studied, and this resulted  in  the  con-
405         struction of a 256-bit table indicating a fixed set of char-
406         acters for the first character in  any  matching  string,  a
407         pointer   to  the  table  is  returned.  Otherwise  NULL  is
408         returned. The fourth argument should point  to  an  unsigned
409         char * variable.
410    
411           PCRE_INFO_LASTLITERAL
412    
413         For a non-anchored pattern, return the value of  the  right-
414         most  literal  character  which  must  exist  in any matched
415         string, other than at its start. The fourth argument  should
416         point  to an int variable. If there is no such character, or
417         if the pattern is anchored, -1 is returned. For example, for
418         the pattern /a\d+z\d+/ the returned value is 'z'.
419    
420         The pcre_info() function is now obsolete because its  inter-
421         face  is  too  restrictive  to return all the available data
422         about  a  compiled  pattern.   New   programs   should   use
423         pcre_fullinfo()  instead.  The  yield  of pcre_info() is the
424         number of capturing subpatterns, or  one  of  the  following
425         negative numbers:
426    
427           PCRE_ERROR_NULL       the argument code was NULL
428           PCRE_ERROR_BADMAGIC   the "magic number" was not found
429    
430         If the optptr argument is not NULL, a copy  of  the  options
431         with which the pattern was compiled is placed in the integer
432         it points to (see PCRE_INFO_OPTIONS above).
433    
434         If the pattern is not anchored and the firstcharptr argument
435         is  not  NULL, it is used to pass back information about the
436         first    character    of    any    matched    string    (see
437         PCRE_INFO_FIRSTCHAR above).
438    
439    
440    
# Line 516  MATCHING A PATTERN Line 636  MATCHING A PATTERN
636    
637  EXTRACTING CAPTURED SUBSTRINGS  EXTRACTING CAPTURED SUBSTRINGS
638       Captured substrings can be accessed directly  by  using  the       Captured substrings can be accessed directly  by  using  the
639    
640    
641    
642    
643    
644    SunOS 5.8                 Last change:                         12
645    
646    
647    
648       offsets returned by pcre_exec() in ovector. For convenience,       offsets returned by pcre_exec() in ovector. For convenience,
649       the functions  pcre_copy_substring(),  pcre_get_substring(),       the functions  pcre_copy_substring(),  pcre_get_substring(),
650       and  pcre_get_substring_list()  are  provided for extracting       and  pcre_get_substring_list()  are  provided for extracting
# Line 533  EXTRACTING CAPTURED SUBSTRINGS Line 662  EXTRACTING CAPTURED SUBSTRINGS
662       entire regular expression. This is  the  value  returned  by       entire regular expression. This is  the  value  returned  by
663       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()
664       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
665       tor, then the value passed as stringcount should be the size       tor,  the  value passed as stringcount should be the size of
666       of the vector divided by three.       the vector divided by three.
667    
668       The functions pcre_copy_substring() and pcre_get_substring()       The functions pcre_copy_substring() and pcre_get_substring()
669       extract a single substring, whose number is given as string-       extract a single substring, whose number is given as string-
# Line 542  EXTRACTING CAPTURED SUBSTRINGS Line 671  EXTRACTING CAPTURED SUBSTRINGS
671       the entire pattern, while higher values extract the captured       the entire pattern, while higher values extract the captured
672       substrings. For pcre_copy_substring(), the string is  placed       substrings. For pcre_copy_substring(), the string is  placed
673       in  buffer,  whose  length is given by buffersize, while for       in  buffer,  whose  length is given by buffersize, while for
674       pcre_get_substring() a new block of store  is  obtained  via       pcre_get_substring() a new block of memory is  obtained  via
675       pcre_malloc,  and its address is returned via stringptr. The       pcre_malloc,  and its address is returned via stringptr. The
676       yield of the function is  the  length  of  the  string,  not       yield of the function is  the  length  of  the  string,  not
677       including the terminating zero, or one of       including the terminating zero, or one of
# Line 576  EXTRACTING CAPTURED SUBSTRINGS Line 705  EXTRACTING CAPTURED SUBSTRINGS
705       inspecting the appropriate offset in ovector, which is nega-       inspecting the appropriate offset in ovector, which is nega-
706       tive for unset substrings.       tive for unset substrings.
707    
708         The  two  convenience  functions  pcre_free_substring()  and
709         pcre_free_substring_list()  can  be  used to free the memory
710         returned by  a  previous  call  of  pcre_get_substring()  or
711         pcre_get_substring_list(),  respectively.  They  do  nothing
712         more than call the function pointed to by  pcre_free,  which
713         of  course  could  be called directly from a C program. How-
714         ever, PCRE is used in some situations where it is linked via
715         a  special  interface  to another programming language which
716         cannot use pcre_free directly; it is for  these  cases  that
717         the functions are provided.
718    
719    
720    
# Line 640  DIFFERENCES FROM PERL Line 779  DIFFERENCES FROM PERL
779       6. The Perl \G assertion is  not  supported  as  it  is  not       6. The Perl \G assertion is  not  supported  as  it  is  not
780       relevant to single pattern matches.       relevant to single pattern matches.
781    
782       7. Fairly obviously, PCRE does  not  support  the  (?{code})       7. Fairly obviously, PCRE does not support the (?{code}) and
783       construction.       (?p{code})  constructions. However, there is some experimen-
784         tal support for recursive patterns using the  non-Perl  item
785         (?R).
786    
787       8. There are at the time of writing some  oddities  in  Perl       8. There are at the time of writing some  oddities  in  Perl
788       5.005_02  concerned  with  the  settings of captured strings       5.005_02  concerned  with  the  settings of captured strings
# Line 649  DIFFERENCES FROM PERL Line 790  DIFFERENCES FROM PERL
790       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value
791       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2
792       unset.    However,    if   the   pattern   is   changed   to       unset.    However,    if   the   pattern   is   changed   to
793       /^(aa(b(b))?)+$/ then $2 (and $3) get set.       /^(aa(b(b))?)+$/ then $2 (and $3) are set.
794    
795       In Perl 5.004 $2 is set in both cases, and that is also true       In Perl 5.004 $2 is set in both cases, and that is also true
796       of PCRE. If in the future Perl changes to a consistent state       of PCRE. If in the future Perl changes to a consistent state
# Line 675  DIFFERENCES FROM PERL Line 816  DIFFERENCES FROM PERL
816       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter
817       with no special meaning is faulted.       with no special meaning is faulted.
818    
819       (d)  If  PCRE_UNGREEDY  is  set,  the  greediness   of   the       (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-
820       repetition quantifiers is inverted, that is, by default they       tion  quantifiers  is inverted, that is, by default they are
821       are not greedy, but if followed by a question mark they are.       not greedy, but if followed by a question mark they are.
822    
823       (e) PCRE_ANCHORED can be used to force a pattern to be tried       (e) PCRE_ANCHORED can be used to force a pattern to be tried
824       only at the start of the subject.       only at the start of the subject.
# Line 685  DIFFERENCES FROM PERL Line 826  DIFFERENCES FROM PERL
826       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options
827       for pcre_exec() have no Perl equivalents.       for pcre_exec() have no Perl equivalents.
828    
829         (g) The (?R) construct allows for recursive pattern matching
830         (Perl  5.6 can do this using the (?p{code}) construct, which
831         PCRE cannot of course support.)
832    
833    
834    
835  REGULAR EXPRESSION DETAILS  REGULAR EXPRESSION DETAILS
# Line 693  REGULAR EXPRESSION DETAILS Line 838  REGULAR EXPRESSION DETAILS
838       also described in the Perl documentation and in a number  of       also described in the Perl documentation and in a number  of
839       other  books,  some  of which have copious examples. Jeffrey       other  books,  some  of which have copious examples. Jeffrey
840       Friedl's  "Mastering  Regular  Expressions",  published   by       Friedl's  "Mastering  Regular  Expressions",  published   by
841       O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.       O'Reilly (ISBN 1-56592-257), covers them in great detail.
842    
843       The description here is intended as reference documentation.       The description here is intended as reference documentation.
844         The basic operation of PCRE is on strings of bytes. However,
845         there is the beginnings of some support for UTF-8  character
846         strings.  To  use  this  support  you must configure PCRE to
847         include it, and then call pcre_compile() with the  PCRE_UTF8
848         option.  How  this affects the pattern matching is described
849         in the final section of this document.
850    
851       A regular expression is a pattern that is matched against  a       A regular expression is a pattern that is matched against  a
852       subject string from left to right. Most characters stand for       subject string from left to right. Most characters stand for
# Line 780  BACKSLASH Line 932  BACKSLASH
932         \f     formfeed (hex 0C)         \f     formfeed (hex 0C)
933         \n     newline (hex 0A)         \n     newline (hex 0A)
934         \r     carriage return (hex 0D)         \r     carriage return (hex 0D)
935           \t     tab (hex 09)
             tab (hex 09)  
936         \xhh   character with hex code hh         \xhh   character with hex code hh
937         \ddd   character with octal code ddd, or backreference         \ddd   character with octal code ddd, or backreference
938    
# Line 833  BACKSLASH Line 984  BACKSLASH
984       Note that octal values of 100 or greater must not be  intro-       Note that octal values of 100 or greater must not be  intro-
985       duced  by  a  leading zero, because no more than three octal       duced  by  a  leading zero, because no more than three octal
986       digits are ever read.       digits are ever read.
987    
988       All the sequences that define a single  byte  value  can  be       All the sequences that define a single  byte  value  can  be
989       used both inside and outside character classes. In addition,       used both inside and outside character classes. In addition,
990       inside a character class, the sequence "\b"  is  interpreted       inside a character class, the sequence "\b"  is  interpreted
# Line 885  BACKSLASH Line 1037  BACKSLASH
1037       These assertions may not appear in  character  classes  (but       These assertions may not appear in  character  classes  (but
1038       note that "\b" has a different meaning, namely the backspace       note that "\b" has a different meaning, namely the backspace
1039       character, inside a character class).       character, inside a character class).
1040    
1041       A word boundary is a position in the  subject  string  where       A word boundary is a position in the  subject  string  where
1042       the current character and the previous character do not both       the current character and the previous character do not both
1043       match \w or \W (i.e. one matches \w and  the  other  matches       match \w or \W (i.e. one matches \w and  the  other  matches
# Line 908  CIRCUMFLEX AND DOLLAR Line 1061  CIRCUMFLEX AND DOLLAR
1061       Outside a character class, in the default matching mode, the       Outside a character class, in the default matching mode, the
1062       circumflex  character  is an assertion which is true only if       circumflex  character  is an assertion which is true only if
1063       the current matching point is at the start  of  the  subject       the current matching point is at the start  of  the  subject
1064    
1065       string.  If  the startoffset argument of pcre_exec() is non-       string.  If  the startoffset argument of pcre_exec() is non-
1066       zero, circumflex can never match. Inside a character  class,       zero, circumflex can never match. Inside a character  class,
1067       circumflex has an entirely different meaning (see below).       circumflex has an entirely different meaning (see below).
# Line 960  FULL STOP (PERIOD, DOT) Line 1114  FULL STOP (PERIOD, DOT)
1114       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1115       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1116       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default)  newline.   If  the  PCRE_DOTALL
1117       option  is  set,  then dots match newlines as well. The han-  
1118       dling of dot is entirely independent of the handling of cir-       option  is set, dots match newlines as well. The handling of
1119       cumflex  and  dollar,  the only relationship being that they       dot is entirely independent of the  handling  of  circumflex
1120       both involve newline characters.  Dot has no special meaning       and  dollar,  the  only  relationship  being  that they both
1121       in a character class.       involve newline characters. Dot has no special meaning in  a
1122         character class.
1123    
1124    
1125    
# Line 1046  SQUARE BRACKETS Line 1201  SQUARE BRACKETS
1201    
1202    
1203    
1204    POSIX CHARACTER CLASSES
1205         Perl 5.6 (not yet released at the time of writing) is  going
1206         to  support  the POSIX notation for character classes, which
1207         uses names enclosed by  [:  and  :]   within  the  enclosing
1208         square brackets. PCRE supports this notation. For example,
1209    
1210           [01[:alpha:]%]
1211    
1212         matches "0", "1", any alphabetic character, or "%". The sup-
1213         ported class names are
1214    
1215           alnum    letters and digits
1216           alpha    letters
1217           ascii    character codes 0 - 127
1218           cntrl    control characters
1219           digit    decimal digits (same as \d)
1220           graph    printing characters, excluding space
1221           lower    lower case letters
1222           print    printing characters, including space
1223           punct    printing characters, excluding letters and digits
1224           space    white space (same as \s)
1225           upper    upper case letters
1226           word     "word" characters (same as \w)
1227           xdigit   hexadecimal digits
1228    
1229         The names "ascii" and "word" are  Perl  extensions.  Another
1230         Perl  extension is negation, which is indicated by a ^ char-
1231         acter after the colon. For example,
1232    
1233           [12[:^digit:]]
1234    
1235         matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also
1236         recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
1237         "collating element", but these are  not  supported,  and  an
1238         error is given if they are encountered.
1239    
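The class names listed above can be illustrated with a small membership sketch. It assumes ASCII only and covers just a few of the names. Python's re module does not accept the [:name:] notation, so the two example classes are modelled by a hypothetical helper, in_class, rather than by a real pattern. None of these names come from PCRE.

```python
import string

# ASCII-only membership sets for some of the class names in the table.
# This is an illustrative model, not PCRE's own tables.
POSIX_CLASSES = {
    "alpha": set(string.ascii_letters),
    "digit": set(string.digits),
    "alnum": set(string.ascii_letters + string.digits),
    "lower": set(string.ascii_lowercase),
    "upper": set(string.ascii_uppercase),
    "xdigit": set(string.hexdigits),
    "punct": set(string.punctuation),
    "word": set(string.ascii_letters + string.digits + "_"),
}


def in_class(ch, literals="", names=(), negated_names=()):
    """Return True if the single character ch is accepted by a class made of
    literal characters, [:name:] items and [:^name:] items."""
    if ch in literals:
        return True
    if any(ch in POSIX_CLASSES[name] for name in names):
        return True
    return any(ch not in POSIX_CLASSES[name] for name in negated_names)


def first_example(ch):
    # Models [01[:alpha:]%]
    return in_class(ch, literals="01%", names=("alpha",))


def second_example(ch):
    # Models [12[:^digit:]]
    return in_class(ch, literals="12", negated_names=("digit",))
```

In this sketch first_example accepts "0", "x" and "%" but rejects "2", while second_example accepts "1" and "a" but rejects "3".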
1240    
1241    
1242  VERTICAL BAR  VERTICAL BAR
1243       Vertical bar characters are  used  to  separate  alternative       Vertical bar characters are  used  to  separate  alternative
1244       patterns. For example, the pattern       patterns. For example, the pattern
# Line 1197  REPETITION Line 1390  REPETITION
1390       Repetition is specified by quantifiers, which can follow any       Repetition is specified by quantifiers, which can follow any
1391       of the following items:       of the following items:
1392    
   
1393         a single character, possibly escaped         a single character, possibly escaped
1394         the . metacharacter         the . metacharacter
1395         a character class         a character class
# Line 1273  REPETITION Line 1465  REPETITION
1465       fails, because it matches  the  entire  string  due  to  the       fails, because it matches  the  entire  string  due  to  the
1466       greediness of the .*  item.       greediness of the .*  item.
1467    
1468       However, if a quantifier is followed  by  a  question  mark,       However, if a quantifier is followed by a question mark,  it
1469       then it ceases to be greedy, and instead matches the minimum       ceases  to be greedy, and instead matches the minimum number
1470       number of times possible, so the pattern       of times possible, so the pattern
1471    
1472         /\*.*?\*/         /\*.*?\*/
1473    
# Line 1292  REPETITION Line 1484  REPETITION
1484       that is the only way the rest of the pattern matches.       that is the only way the rest of the pattern matches.
1485    
1486       If the PCRE_UNGREEDY option is set (an option which  is  not       If the PCRE_UNGREEDY option is set (an option which  is  not
1487       available  in  Perl)  then the quantifiers are not greedy by       available  in  Perl),  the  quantifiers  are  not  greedy by
1488       default, but individual ones can be made greedy by following       default, but individual ones can be made greedy by following
1489       them  with  a  question mark. In other words, it inverts the       them  with  a  question mark. In other words, it inverts the
1490       default behaviour.       default behaviour.
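The difference between a greedy and a minimizing quantifier can be seen in a short sketch. It uses Python's re module, which is a different engine from PCRE but treats a question mark after a quantifier the same way. The subject string is an assumption chosen for illustration. Python's re has no counterpart of PCRE_UNGREEDY, so only the per-quantifier form is shown.

```python
import re

# Illustrative subject: two C comments separated by ordinary text.
subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the end of the subject and backtracks to the last "*/".
greedy = re.search(r"/\*.*\*/", subject).group(0)

# Minimizing: .*? stops at the first "*/" that lets the pattern match.
lazy = re.search(r"/\*.*?\*/", subject).group(0)

print(greedy)
print(lazy)
```

Here the greedy form returns the whole subject, and the minimizing form returns only the first comment.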
# Line 1304  REPETITION Line 1496  REPETITION
1496    
1497       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
1498       option (equivalent to Perl's /s) is set, thus allowing the .       option (equivalent to Perl's /s) is set, thus allowing the .
1499       to match newlines, then the pattern is implicitly  anchored,       to match  newlines,  the  pattern  is  implicitly  anchored,
1500       because whatever follows will be tried against every charac-       because whatever follows will be tried against every charac-
1501       ter position in the subject string, so there is no point  in       ter position in the subject string, so there is no point  in
1502       retrying  the overall match at any position after the first.       retrying  the overall match at any position after the first.
# Line 1357  BACK REFERENCES Line 1549  BACK REFERENCES
1549    
1550       matches "sense and sensibility" and "response and  responsi-       matches "sense and sensibility" and "response and  responsi-
1551       bility",  but  not  "sense  and  responsibility". If caseful       bility",  but  not  "sense  and  responsibility". If caseful
1552       matching is in force at the time of the back reference, then       matching is in force at the time of the back reference,  the
1553       the case of letters is relevant. For example,       case of letters is relevant. For example,
1554    
1555         ((?i)rah)\s+\1         ((?i)rah)\s+\1
1556    
# Line 1368  BACK REFERENCES Line 1560  BACK REFERENCES
1560    
1561       There may be more than one back reference to the  same  sub-       There may be more than one back reference to the  same  sub-
1562       pattern.  If  a  subpattern  has not actually been used in a       pattern.  If  a  subpattern  has not actually been used in a
1563       particular match, then any  back  references  to  it  always       particular match, any back references to it always fail. For
1564       fail. For example, the pattern       example, the pattern
1565    
1566         (a|(bc))\2         (a|(bc))\2
1567    
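The two back reference patterns above can be tried in a short sketch. It uses Python's re module, an assumption made for testability; it is not PCRE, although it handles these two cases in the same way. Recent Python versions reject an inline (?i) that is not at the start of the pattern, so the scoped form (?i:...) stands in for it.

```python
import re

# Caseless matching applies only inside the group; the back reference
# itself is caseful and must repeat exactly the text that was captured.
rah = re.compile(r"((?i:rah))\s+\1")

# Group 2 is set only when the "bc" alternative is taken, so \2 fails
# whenever the match went through the "a" alternative.
unset = re.compile(r"(a|(bc))\2")


def full(pattern, subject):
    """Return True if the pattern matches the whole subject."""
    return pattern.fullmatch(subject) is not None
```

With these definitions, full(rah, "rah rah") and full(rah, "RAH RAH") are true and full(rah, "RAH rah") is false. Likewise full(unset, "bcbc") is true, while unset.search("aa") finds nothing.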
# Line 1377  BACK REFERENCES Line 1569  BACK REFERENCES
1569       Because  there  may  be up to 99 back references, all digits       Because  there  may  be up to 99 back references, all digits
1570       following the backslash are taken as  part  of  a  potential       following the backslash are taken as  part  of  a  potential
1571       back reference number. If the pattern continues with a digit       back reference number. If the pattern continues with a digit
1572       character, then some delimiter must be used to terminate the       character, some delimiter must be used to terminate the back
1573       back reference. If the PCRE_EXTENDED option is set, this can       reference.   If the PCRE_EXTENDED option is set, this can be
1574       be whitespace.  Otherwise an empty comment can be used.       whitespace. Otherwise an empty comment can be used.
1575    
1576       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
1577       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
# Line 1389  BACK REFERENCES Line 1581  BACK REFERENCES
1581    
1582         (a|b\1)+         (a|b\1)+
1583    
1584       matches any number of "a"s and also "aba", "ababaa" etc.  At       matches any number of "a"s and also "aba", "ababbaa" etc. At
1585       each iteration of the subpattern, the back reference matches       each iteration of the subpattern, the back reference matches
1586       the character string corresponding to  the  previous  itera-       the  character  string   corresponding   to   the   previous
1587       tion.  In  order  for this to work, the pattern must be such       iteration.  In  order  for this to work, the pattern must be
1588       that the first iteration does not need  to  match  the  back       such that the first iteration does not  need  to  match  the
1589       reference.  This  can  be  done using alternation, as in the       back  reference.  This  can be done using alternation, as in
1590       example above, or by a quantifier with a minimum of zero.       the example above, or by a  quantifier  with  a  minimum  of
1591         zero.
1592    
1593    
1594    
# Line 1407  ASSERTIONS Line 1600  ASSERTIONS
1600       cated assertions are coded as  subpatterns.  There  are  two       cated assertions are coded as  subpatterns.  There  are  two
1601       kinds:  those that look ahead of the current position in the       kinds:  those that look ahead of the current position in the
1602       subject string, and those that look behind it.       subject string, and those that look behind it.
1603    
1604       An assertion subpattern is matched in the normal way, except       An assertion subpattern is matched in the normal way, except
1605       that  it  does not cause the current matching position to be       that  it  does not cause the current matching position to be
1606       changed. Lookahead assertions start with  (?=  for  positive       changed. Lookahead assertions start with  (?=  for  positive
# Line 1478  ASSERTIONS Line 1672  ASSERTIONS
1672       matches "foo" preceded by three digits that are  not  "999".       matches "foo" preceded by three digits that are  not  "999".
1673       Notice  that each of the assertions is applied independently       Notice  that each of the assertions is applied independently
1674       at the same point in the subject string. First  there  is  a       at the same point in the subject string. First  there  is  a
1675       check  that  the  previous  three characters are all digits,       check that the previous three characters are all digits, and
1676       then there is a check that the same three characters are not       then there is a check that the same three characters are not
1677       "999".   This  pattern  does not match "foo" preceded by six       "999".   This  pattern  does not match "foo" preceded by six
1678       characters, the first three of which are digits and the last three       characters, the first three of which are digits and the last three
# Line 1547  ONCE-ONLY SUBPATTERNS Line 1741  ONCE-ONLY SUBPATTERNS
1741    
1742       This kind of parenthesis "locks up" the  part of the pattern       This kind of parenthesis "locks up" the  part of the pattern
1743       it  contains once it has matched, and a failure further into       it  contains once it has matched, and a failure further into
1744       the pattern is prevented from backtracking  into  it.  Back-       the  pattern  is  prevented  from  backtracking   into   it.
1745       tracking  past  it to previous items, however, works as nor-       Backtracking  past  it  to previous items, however, works as
1746       mal.       normal.
1747    
1748       An alternative description is that a subpattern of this type       An alternative description is that a subpattern of this type
1749       matches  the  string  of  characters that an identical stan-       matches  the  string  of  characters that an identical stan-
# Line 1572  ONCE-ONLY SUBPATTERNS Line 1766  ONCE-ONLY SUBPATTERNS
1766    
1767         abcd$         abcd$
1768    
1769       when applied to a long  string  which  does  not  match  it.       when applied to a long string which does not match.  Because
1770       Because matching proceeds from left to right, PCRE will look       matching  proceeds  from  left  to right, PCRE will look for
1771       for each "a" in the subject and then  see  if  what  follows       each "a" in the subject and then see if what follows matches
1772       matches the rest of the pattern. If the pattern is specified       the rest of the pattern. If the pattern is specified as
      as  
1773    
1774         ^.*abcd$         ^.*abcd$
1775    
1776       then the initial .* matches the entire string at first,  but       the initial .* matches the entire string at first, but  when
1777       when  this  fails,  it  backtracks to match all but the last       this  fails  (because  there  is no following "a"), it back-
1778       character, then all but the last two characters, and so  on.       tracks to match all but the last character, then all but the
1779       Once again the search for "a" covers the entire string, from       last  two  characters,  and so on. Once again the search for
1780       right to left, so we are no better off. However, if the pat-       "a" covers the entire string, from right to left, so we  are
1781       tern is written as       no better off. However, if the pattern is written as
1782    
1783         ^(?>.*)(?<=abcd)         ^(?>.*)(?<=abcd)
1784    
1785       then there can be no backtracking for the .*  item;  it  can       there can be no backtracking for the .* item; it  can  match
1786       match  only  the  entire  string.  The subsequent lookbehind       only  the entire string. The subsequent lookbehind assertion
1787       assertion does a single test on the last four characters. If       does a single test on the last four characters. If it fails,
1788       it  fails,  the  match  fails immediately. For long strings,       the match fails immediately. For long strings, this approach
1789       this approach makes a significant difference to the process-       makes a significant difference to the processing time.
1790       ing time.  
1791         When a pattern contains an unlimited repeat inside a subpat-
1792         tern  that  can  itself  be  repeated an unlimited number of
1793         times, the use of a once-only subpattern is the only way  to
1794         avoid  some  failing matches taking a very long time indeed.
1795         The pattern
1796    
1797           (\D+|<\d+>)*[!?]
1798    
1799         matches an unlimited number of substrings that  either  con-
1800         sist  of  non-digits,  or digits enclosed in <>, followed by
1801         either ! or ?. When it matches, it runs quickly. However, if
1802         it is applied to
1803    
1804           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1805    
1806         it takes a long  time  before  reporting  failure.  This  is
1807         because the string can be divided between the two repeats in
1808         a large number of ways, and all have to be tried. (The exam-
1809         ple  used  [!?]  rather  than a single character at the end,
1810         because both PCRE and Perl have an optimization that  allows
1811         for  fast  failure  when  a  single  character is used. They
1812         remember the last single character that is  required  for  a
1813         match,  and  fail early if it is not present in the string.)
1814         If the pattern is changed to
1815    
1816           ((?>\D+)|<\d+>)*[!?]
1817    
1818         sequences of non-digits cannot be broken, and  failure  hap-
1819         pens quickly.
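The effect of locking up a repeat can be sketched in Python. This is an assumption made for illustration: Python's re module accepts (?>...) directly only from version 3.11, so the sketch swaps in a known emulation, a lookahead that captures the run followed by a back reference that consumes it. The engine does not backtrack into a lookahead, which gives the same "cannot be broken" behaviour.

```python
import re

# Plain form from the text. It is kept away from long failing subjects
# because the number of ways to split the string grows exponentially.
plain = re.compile(r"(\D+|<\d+>)*[!?]")

# Emulated once-only group: (?=(\D+)) captures the longest run of
# non-digits and \1 then consumes exactly that run, with no way to shorten it.
locked = re.compile(r"(?:(?=(\D+))\1|<\d+>)*[!?]")

# Emulation of ^(?>.*)(?<=abcd): take the whole subject in one step, then
# make a single lookbehind test on the last four characters.
tail = re.compile(r"^(?=(.*))\1(?<=abcd)")

short_fail = "a" * 12   # small enough for the plain form to give up quickly
long_fail = "a" * 52    # the subject used in the text

print(plain.match(short_fail))   # None, after trying every split
print(locked.match(long_fail))   # None, reported almost at once
```

One caveat in this sketch: the locked form is not a drop-in replacement for the plain one. Because \D also matches "!" and "<", a locked run of non-digits can swallow the final "!", so "abc!" matches the plain pattern but not the locked one.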
1820    
1821    
1822    
# Line 1614  CONDITIONAL SUBPATTERNS Line 1836  CONDITIONAL SUBPATTERNS
1836       error occurs.       error occurs.
1837    
1838       There are two kinds of condition. If the  text  between  the       There are two kinds of condition. If the  text  between  the
1839       parentheses  consists  of  a  sequence  of  digits, then the       parentheses  consists of a sequence of digits, the condition
1840       condition is satisfied if the capturing subpattern  of  that       is satisfied if the capturing subpattern of that number  has
1841       number  has  previously matched. Consider the following pat-       previously  matched.  Consider  the following pattern, which
1842       tern, which contains non-significant white space to make  it       contains non-significant white space to make it  more  read-
1843       more  readable  (assume  the  PCRE_EXTENDED  option)  and to       able (assume the PCRE_EXTENDED option) and to divide it into
1844       divide it into three parts for ease of discussion:       three parts for ease of discussion:
1845    
1846         ( \( )?    [^()]+    (?(1) \) )         ( \( )?    [^()]+    (?(1) \) )
1847    
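The conditional pattern above can be tried as written in Python's re module, which supports the (?(1)...) form and, through re.VERBOSE, ignores the white space much as PCRE_EXTENDED does. Using Python rather than PCRE is an assumption made for testability, and is_match is a hypothetical helper name.

```python
import re

# Part 1: optional opening parenthesis, captured as group 1.
# Part 2: one or more characters that are not parentheses.
# Part 3: a closing parenthesis is required only if group 1 matched.
cond = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_match(subject):
    """Return True if the whole subject is matched by the pattern."""
    return cond.fullmatch(subject) is not None
```

With this helper, "(abc)" and "abc" are accepted, while "(abc" and "abc)" are rejected when the whole subject must match.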
# Line 1668  COMMENTS Line 1890  COMMENTS
1890    
1891    
1892    
1893    RECURSIVE PATTERNS
1894         Consider the problem of matching a  string  in  parentheses,
1895         allowing  for  unlimited nested parentheses. Without the use
1896         of recursion, the best that can be done is to use a  pattern
1897         that  matches  up  to some fixed depth of nesting. It is not
1898         possible to handle an arbitrary nesting depth. Perl 5.6  has
1899         provided   an  experimental  facility  that  allows  regular
1900         expressions to recurse (amongst other things). It does  this
1901         by  interpolating  Perl  code in the expression at run time,
1902         and the code can refer to the expression itself. A Perl pat-
1903         tern  to  solve  the parentheses problem can be created like
1904         this:
1905    
1906           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1907    
1908         The (?p{...}) item interpolates Perl code at run  time,  and
1909         in  this  case refers recursively to the pattern in which it
1910         appears. Obviously, PCRE cannot support the interpolation of
1911         Perl  code.  Instead,  the special item (?R) is provided for
1912         the specific case of recursion. This PCRE pattern solves the
1913         parentheses  problem (assume the PCRE_EXTENDED option is set
1914         so that white space is ignored):
1915    
1916           \( ( (?>[^()]+) | (?R) )* \)
1917    
1918         First it matches an opening parenthesis. Then it matches any
1919         number  of substrings which can either be a sequence of non-
1920         parentheses, or a recursive  match  of  the  pattern  itself
1921         (i.e. a correctly parenthesized substring). Finally there is
1922         a closing parenthesis.
1923    
1924         This particular example pattern  contains  nested  unlimited
1925         repeats, and so the use of a once-only subpattern for match-
1926         ing strings of non-parentheses is  important  when  applying
1927         the  pattern to strings that do not match. For example, when
1928         it is applied to
1929    
1930           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1931    
1932         it yields "no match" quickly. However, if a  once-only  sub-
1933         pattern  is  not  used,  the match runs for a very long time
1934         indeed because there are so many different ways the + and  *
1935         repeats  can carve up the subject, and all have to be tested
1936         before failure can be reported.
1937    
1938         The values set for any capturing subpatterns are those  from
1939         the outermost level of the recursion at which the subpattern
1940         value is set. If the pattern above is matched against
1941    
1942           (ab(cd)ef)
1943    
1944         the value for the capturing parentheses is  "ef",  which  is
1945         the  last  value  taken  on  at the top level. If additional
1946         parentheses are added, giving
1947    
1948           \( ( ( (?>[^()]+) | (?R) )* ) \)
1949              ^                        ^
1950              ^                        ^
1951         the string they capture is "ab(cd)ef", the contents of the top level parentheses. If
1952         there are more than 15 capturing parentheses in  a  pattern,
1953         PCRE  has  to  obtain  extra  memory  to store data during a
1954         recursion, which it does by using  pcre_malloc,  freeing  it
1955         via  pcre_free  afterwards. If no memory can be obtained, it
1956         saves data for the first 15 capturing parentheses  only,  as
1957         there is no way to give an out-of-memory error from within a
1958         recursion.
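Python's re module has no (?R) item, so the recursion described above is sketched here as a hand-written function that mirrors the shape of the pattern. It is an illustration only, not how PCRE implements recursion, and match_parens is a hypothetical name. The function is anchored at the offset it is given.

```python
def match_parens(subject, pos=0):
    """Return the offset just past a correctly parenthesized substring that
    starts at pos, or -1 if there is none. Mirrors \\( ( (?>[^()]+) | (?R) )* \\)."""
    # \(  : an opening parenthesis is required first.
    if pos >= len(subject) or subject[pos] != "(":
        return -1
    pos += 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            # \)  : the closing parenthesis ends this level.
            return pos + 1
        if ch == "(":
            # (?R) : a nested, correctly parenthesized substring.
            end = match_parens(subject, pos)
            if end < 0:
                return -1
            pos = end
        else:
            # (?>[^()]+) : take the whole run of non-parentheses in one go.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    return -1
```

Applied to "(ab(cd)ef)" the function returns 10, the length of the subject. Applied to an opening parenthesis followed by a long run of "a" and then "()", it returns -1 after a single left-to-right pass, since the run of non-parentheses is never split.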
1959    
1960    
1961    
1962  PERFORMANCE  PERFORMANCE
1963       Certain items that may appear in patterns are more efficient       Certain items that may appear in patterns are more efficient
1964       than  others.  It is more efficient to use a character class       than  others.  It is more efficient to use a character class
# Line 1710  PERFORMANCE Line 2001  PERFORMANCE
2001       repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of       repeat can match 0, 1, 2, 3, or 4 times,  and  for  each  of
2002       those  cases other than 0, the + repeats can match different       those  cases other than 0, the + repeats can match different
2003       numbers of times.) When the remainder of the pattern is such       numbers of times.) When the remainder of the pattern is such
2004       that  the entire match is going to fail, PCRE has in princi-       that  the  entire  match  is  going  to  fail,  PCRE  has in
2005       ple to try every possible variation, and this  can  take  an       principle to try every possible variation, and this can take
2006       extremely long time.       an extremely long time.
2007    
2008       An optimization catches some of the more simple  cases  such       An optimization catches some of the more simple  cases  such
2009       as       as
# Line 1735  PERFORMANCE Line 2026  PERFORMANCE
2026    
2027    
2028    
2029    UTF-8 SUPPORT
2030         Starting at release 3.3, PCRE has some support for character
2031         strings encoded in the UTF-8 format. This is incomplete, and
2032         is regarded as experimental. In order to use  it,  you  must
2033         configure PCRE to include UTF-8 support in the code, and, in
2034         addition, you must call pcre_compile()  with  the  PCRE_UTF8
2035         option flag. When you do this, both the pattern and any sub-
2036         ject strings that are matched  against  it  are  treated  as
2037         UTF-8  strings instead of just strings of bytes, but only in
2038         the cases that are mentioned below.
2039    
2040         If you compile PCRE with UTF-8 support, but do not use it at
2041         run  time,  the  library will be a bit bigger, but the addi-
2042         tional run time overhead is limited to testing the PCRE_UTF8
2043         flag in several places, so should not be very large.
2044    
2045         PCRE assumes that the strings  it  is  given  contain  valid
2046         UTF-8  codes. It does not diagnose invalid UTF-8 strings. If
2047         you pass invalid UTF-8 strings  to  PCRE,  the  results  are
2048         undefined.
2049    
2050         Running with PCRE_UTF8 set causes these changes in  the  way
2051         PCRE works:
2052    
2053         1. In a pattern, the escape sequence \x{...}, where the con-
2054         tents  of  the  braces is a string of hexadecimal digits, is
2055         interpreted as a UTF-8 character whose code  number  is  the
2056         given   hexadecimal  number,  for  example:  \x{1234}.  This
2057         inserts from one to six  literal  bytes  into  the  pattern,
2058         using the UTF-8 encoding. If a non-hexadecimal digit appears
2059         between the braces, the item is not recognized.
2060    
2061         2. The original hexadecimal escape sequence, \xhh, generates
2062         a two-byte UTF-8 character if its value is greater than 127.
2063    
2064         3. Repeat quantifiers are NOT correctly handled if they fol-
2065         low  a  multibyte character. For example, \x{100}* and \xc3+
2066         do not work. If you want to repeat such characters, you must
2067         enclose  them  in  non-capturing  parentheses,  for  example
2068         (?:\x{100}), at present.
2069    
2070         4. The dot metacharacter matches one UTF-8 character instead
2071         of a single byte.
2072    
2073         5. Unlike literal UTF-8 characters,  the  dot  metacharacter
2074         followed  by  a  repeat quantifier does operate correctly on
2075         UTF-8 characters instead of single bytes.
2076    
2077         6. Although the \x{...} escape is permitted in  a  character
2078         class,  characters  whose values are greater than 255 cannot
2079         be included in a class.
2080    
2081         7. A class is matched against a UTF-8 character  instead  of
2082         just  a  single byte, but it can match only characters whose
2083         values are less than 256.  Characters  with  greater  values
2084         always fail to match a class.
2085    
2086         8. Repeated classes work correctly on multiple characters.
2087    
2088         9. Classes containing just a single character whose value is
2089         greater than 127 (but less than 256), for example, [\x80] or
2090         [^\x{93}], do not work because these are optimized into sin-
2091         gle  byte  matches.  In the first case, of course, the class
2092         brackets are just redundant.
2093    
2094         10. Lookbehind assertions move backwards in the subject by a
2095         fixed  number  of  characters  instead  of a fixed number of
2096         bytes. Simple cases have been tested to work correctly,  but
2097         there may be hidden gotchas herein.
2098    
2099         11. The character types such  as  \d  and  \w  do  not  work
2100         correctly  with  UTF-8  characters.  They continue to test a
2101         single byte.
2102    
2103         12. Anything not explicitly mentioned here continues to work
2104         in bytes rather than in characters.
2105    
2106         The following UTF-8 features of  Perl  5.6  are  not  imple-
2107         mented:
2108    
2109         1. The escape sequence \C to match a single byte.
2110    
2111         2. The use of Unicode tables and properties and escapes  \p,
2112         \P, and \X.
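The statement that \x{...} inserts from one to six bytes can be made concrete with a small encoder sketch. It follows the original UTF-8 scheme, which allowed sequences of up to six bytes for code numbers up to 0x7FFFFFFF. The function name utf8_bytes is hypothetical and the code is an illustration, not PCRE's own routine.

```python
def utf8_bytes(code):
    """Encode a code number as 1 to 6 bytes using the original UTF-8 scheme."""
    if code < 0:
        raise ValueError("code number must be non-negative")
    if code < 0x80:
        return bytes([code])
    # (length, exclusive upper limit, marker bits of the leading byte)
    layouts = (
        (2, 0x800, 0xC0),
        (3, 0x10000, 0xE0),
        (4, 0x200000, 0xF0),
        (5, 0x4000000, 0xF8),
        (6, 0x80000000, 0xFC),
    )
    for nbytes, limit, lead in layouts:
        if code < limit:
            out = []
            # Continuation bytes carry six bits each, lowest bits last.
            for _ in range(nbytes - 1):
                out.append(0x80 | (code & 0x3F))
                code >>= 6
            out.append(lead | code)
            return bytes(reversed(out))
    raise ValueError("code number too large for UTF-8")


print(utf8_bytes(0x1234).hex())   # the \x{1234} example from the text
print(utf8_bytes(0xE9).hex())     # a value above 127 takes two bytes
```

For code numbers that Python itself can encode, the result agrees with the built-in encoder; for example, utf8_bytes(0x1234) equals "\u1234".encode("utf-8").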
2113    
2114    
2115    
2116  AUTHOR  AUTHOR
2117       Philip Hazel <ph10@cam.ac.uk>       Philip Hazel <ph10@cam.ac.uk>
2118       University Computing Service,       University Computing Service,
# Line 1742  AUTHOR Line 2120  AUTHOR
2120       Cambridge CB2 3QG, England.       Cambridge CB2 3QG, England.
2121       Phone: +44 1223 334714       Phone: +44 1223 334714
2122    
2123       Last updated: 29 July 1999       Last updated: 28 August 2000,
2124       Copyright (c) 1997-1999 University of Cambridge.         the 250th anniversary of the death of J.S. Bach.
2125         Copyright (c) 1997-2000 University of Cambridge.
