Diff of /code/tags/pcre-3.2/doc/pcre.txt

revision 41 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 47 by nigel, Sat Feb 24 21:39:29 2007 UTC
# Line 30  SYNOPSIS Line 30  SYNOPSIS
30    
31       const unsigned char *pcre_maketables(void);       const unsigned char *pcre_maketables(void);
32    
33         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
34              int what, void *where);
35    
36       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
37    
38       char *pcre_version(void);       char *pcre_version(void);
# Line 46  DESCRIPTION Line 49  DESCRIPTION
49       lar  expression  pattern  matching using the same syntax and       lar  expression  pattern  matching using the same syntax and
50       semantics as Perl  5,  with  just  a  few  differences  (see       semantics as Perl  5,  with  just  a  few  differences  (see
51       below).  The  current  implementation  corresponds  to  Perl       below).  The  current  implementation  corresponds  to  Perl
52       5.005.       5.005, with some additional features from the Perl  develop-
53         ment release.
54    
55       PCRE has its own native API,  which  is  described  in  this       PCRE has its own native API,  which  is  described  in  this
56       document.  There  is  also  a  set of wrapper functions that       document.  There  is  also  a  set of wrapper functions that
57       correspond to the POSIX API.  These  are  described  in  the       correspond to the POSIX regular expression API.   These  are
58       pcreposix documentation.       described in the pcreposix documentation.
59    
60       The native API function prototypes are defined in the header       The native API function prototypes are defined in the header
61       file  pcre.h,  and  on  Unix  systems  the library itself is       file  pcre.h,  and  on  Unix  systems  the library itself is
62       called libpcre.a, so can be accessed by adding -lpcre to the       called libpcre.a, so can be accessed by adding -lpcre to the
63       command for linking an application which calls it.       command  for  linking  an  application  which  calls it. The
64         header file defines the macros PCRE_MAJOR and PCRE_MINOR  to
65         contain the major and minor release numbers for the library.
66         Applications can use these to include support for  different
67         releases.
68    
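The version macros described above lend themselves to conditional compilation. The following is a minimal sketch, not taken from the PCRE distribution: the stand-in definitions of PCRE_MAJOR and PCRE_MINOR exist only so that the fragment compiles without pcre.h, the macro PCRE_VERSION_AT_LEAST and the function can_use_fullinfo() are hypothetical names, and the 3.0 threshold is an assumption chosen for illustration.

```c
/* Stand-in values so the sketch compiles on its own. A real program would
   write #include <pcre.h> instead and delete these two definitions. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 3
#define PCRE_MINOR 2
#endif

/* Hypothetical helper: true if the header describes release maj.min
   or a later one. */
#define PCRE_VERSION_AT_LEAST(maj, min) \
  (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Returns 1 if this build may call pcre_fullinfo(), or 0 if it should
   fall back to pcre_info(). The 3.0 threshold is an assumption. */
int can_use_fullinfo(void)
{
#if PCRE_VERSION_AT_LEAST(3, 0)
  return 1;
#else
  return 0;
#endif
}
```

Because the decision is made by the preprocessor, code for the unavailable branch is never compiled, so one source file can serve several releases.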
69       The functions pcre_compile(), pcre_study(), and  pcre_exec()       The functions pcre_compile(), pcre_study(), and  pcre_exec()
70       are  used  for  compiling  and matching regular expressions,       are  used  for  compiling  and matching regular expressions,
# Line 66  DESCRIPTION Line 75  DESCRIPTION
75       to build a set of character tables in the current locale for       to build a set of character tables in the current locale for
76       passing to pcre_compile().       passing to pcre_compile().
77    
78       The function pcre_info() is used  to  find  out  information       The function pcre_fullinfo() is used to find out information
79       about  a compiled pattern, while the function pcre_version()       about a compiled pattern; pcre_info() is an obsolete version
80       returns a pointer to a string containing the version of PCRE       which returns only some of the available information, but is
81       and its date of release.       retained   for   backwards   compatibility.    The  function
82         pcre_version() returns a pointer to a string containing  the
83         version of PCRE and its date of release.
84    
85       The global variables  pcre_malloc  and  pcre_free  initially       The global variables  pcre_malloc  and  pcre_free  initially
86       contain the entry points of the standard malloc() and free()       contain the entry points of the standard malloc() and free()
# Line 92  MULTI-THREADING Line 103  MULTI-THREADING
103    
104    
105    
106    
107  COMPILING A PATTERN  COMPILING A PATTERN
108       The function pcre_compile() is called to compile  a  pattern       The function pcre_compile() is called to compile  a  pattern
109       into  an internal form. The pattern is a C string terminated       into  an internal form. The pattern is a C string terminated
# Line 187  COMPILING A PATTERN Line 199  COMPILING A PATTERN
199    
200         PCRE_EXTRA         PCRE_EXTRA
201    
202       This option turns on additional functionality of  PCRE  that       This option was invented in  order  to  turn  on  additional
203       is  incompatible  with Perl. Any backslash in a pattern that       functionality of PCRE that is incompatible with Perl, but it
204       is followed by a letter that has no special  meaning  causes       is currently of very little use. When set, any backslash  in
205       an  error,  thus  reserving  these  combinations  for future       a  pattern  that is followed by a letter that has no special
206       expansion. By default, as in Perl, a backslash followed by a       meaning causes an error, thus reserving  these  combinations
207       letter  with  no  special  meaning  is treated as a literal.       for  future  expansion.  By default, as in Perl, a backslash
208       There are at present no other features  controlled  by  this       followed by a letter with no special meaning is treated as a
209       option.       literal.  There  are at present no other features controlled
210         by this option. It can also be set by a (?X) option  setting
211         within a pattern.
212    
213         PCRE_MULTILINE         PCRE_MULTILINE
214    
# Line 207  COMPILING A PATTERN Line 221  COMPILING A PATTERN
221       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.       PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.
222    
223       When PCRE_MULTILINE is set, the "start of line" and  "end       When PCRE_MULTILINE is set, the "start of line" and  "end
224       of   line"   constructs   match   immediately  following  or       of  line"  constructs match immediately following or immedi-
225       immediately  before  any  newline  in  the  subject  string,       ately before any newline  in  the  subject  string,  respec-
226       respectively,  as well as at the very start and end. This is       tively,  as  well  as  at  the  very  start and end. This is
227       equivalent to Perl's /m option. If there are no "\n" charac-       equivalent to Perl's /m option. If there are no "\n" charac-
228       ters  in  a subject string, or no occurrences of ^ or $ in a       ters  in  a subject string, or no occurrences of ^ or $ in a
229       pattern, setting PCRE_MULTILINE has no effect.       pattern, setting PCRE_MULTILINE has no effect.
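The "start of line" rule can be sketched in plain C without calling the library. This is an illustration of the wording above only: the function name is_line_start() is hypothetical, and the sketch ignores run-time options such as PCRE_NOTBOL and any special treatment the library may give to a newline at the very end of the subject.

```c
#include <stddef.h>

/* Returns 1 if offset pos in subject[0..len) is a "start of line".
   Without multiline, only offset 0 qualifies. With multiline, an offset
   immediately following a '\n' qualifies as well. */
int is_line_start(const char *subject, size_t len, size_t pos, int multiline)
{
  if (pos > len) return 0;          /* outside the subject */
  if (pos == 0) return 1;           /* the very start always qualifies */
  return multiline && subject[pos - 1] == '\n';
}
```

For the subject "ab\ncd", offset 3 is a start of line only when the multiline flag is set, which is why the option has no effect on subjects that contain no newline.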
# Line 284  LOCALE SUPPORT Line 298  LOCALE SUPPORT
298    
299    
300  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
301       The pcre_info() function returns information  about  a  com-       The pcre_fullinfo() function  returns  information  about  a
302       piled pattern.  Its yield is the number of capturing subpat-       compiled pattern. It replaces the obsolete pcre_info() func-
303       terns, or one of the following negative numbers:       tion, which is nevertheless retained for backwards compati-
304         bility (and is documented below).
305    
306         The first argument for pcre_fullinfo() is a pointer  to  the
307         compiled  pattern.  The  second  argument  is  the result of
308         pcre_study(), or NULL if the pattern was  not  studied.  The
309         third  argument  specifies  which  piece  of  information is
310         required, while the fourth argument is a pointer to a  vari-
311         able  to receive the data. The yield of the function is zero
312         for success, or one of the following negative numbers:
313    
314         PCRE_ERROR_NULL       the argument code was NULL         PCRE_ERROR_NULL       the argument code was NULL
315                                 the argument where was NULL
316         PCRE_ERROR_BADMAGIC   the "magic number" was not found         PCRE_ERROR_BADMAGIC   the "magic number" was not found
317           PCRE_ERROR_BADOPTION  the value of what was invalid
318    
319       If the optptr argument is not NULL, a copy  of  the  options       The possible values for the third argument  are  defined  in
320       with which the pattern was compiled is placed in the integer       pcre.h, and are as follows:
321       it points to. These option bits are those specified  in  the  
322           PCRE_INFO_OPTIONS
323    
324         Return a copy of the options with which the pattern was com-
325         piled.  The fourth argument should point to an unsigned long
326         int variable. These option bits are those specified  in  the
327       call  to  pcre_compile(),  modified  by any top-level option       call  to  pcre_compile(),  modified  by any top-level option
328       settings  within  the   pattern   itself,   and   with   the       settings  within  the   pattern   itself,   and   with   the
329       PCRE_ANCHORED  bit  set  if  the form of the pattern implies       PCRE_ANCHORED  bit  forcibly  set if the form of the pattern
330       that it can match only at the start of a subject string.       implies that it can match only at the  start  of  a  subject
331         string.
332    
333       If the pattern is not anchored and the firstcharptr argument         PCRE_INFO_SIZE
334       is  not  NULL, it is used to pass back information about the  
335       first character of any matched string. If there is  a  fixed       Return the size of the compiled pattern, that is, the  value
336       first    character,    e.g.   from   a   pattern   such   as       that  was  passed as the argument to pcre_malloc() when PCRE
337       (cat|cow|coyote), then it is returned in the integer pointed       was getting memory in which to place the compiled data.  The
338       to by firstcharptr. Otherwise, if either       fourth argument should point to a size_t variable.
339    
340           PCRE_INFO_CAPTURECOUNT
341    
342         Return the number of capturing subpatterns in  the  pattern.
343         The fourth argument should point to an int variable.
344    
345           PCRE_INFO_BACKREFMAX
346    
347         Return the number of the highest back reference in the  pat-
348         tern.  The  fourth argument should point to an int variable.
349         Zero is returned if there are no back references.
350    
351           PCRE_INFO_FIRSTCHAR
352    
353         Return information about the first character of any  matched
354         string,  for  a  non-anchored  pattern.  If there is a fixed
355         first   character,   e.g.   from   a   pattern    such    as
356         (cat|cow|coyote),  it  is returned in the integer pointed to
357         by where. Otherwise, if either
358    
359       (a) the pattern was compiled with the PCRE_MULTILINE option,       (a) the pattern was compiled with the PCRE_MULTILINE option,
360       and every branch starts with "^", or       and every branch starts with "^", or
# Line 312  INFORMATION ABOUT A PATTERN Line 362  INFORMATION ABOUT A PATTERN
362       (b) every  branch  of  the  pattern  starts  with  ".*"  and       (b) every  branch  of  the  pattern  starts  with  ".*"  and
363       PCRE_DOTALL is not set (if it were set, the pattern would be       PCRE_DOTALL is not set (if it were set, the pattern would be
364       anchored),       anchored),
365       then -1 is returned, indicating  that  the  pattern  matches  
366       only  at  the  start  of  a subject string or after any "\n"       -1 is returned, indicating that the pattern matches only  at
367       within the string. Otherwise -2 is returned.       the  start  of a subject string or after any "\n" within the
368         string. Otherwise -2 is returned.  For anchored patterns, -2
369         is returned.
370    
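A caller has to distinguish three outcomes of a PCRE_INFO_FIRSTCHAR request. The sketch below shows one way to decode the integer that the library stores. The enum and the function classify_firstchar() are hypothetical names introduced here, and treating any other negative value as "no information" is an assumption.

```c
/* Hypothetical classification of the integer stored through "where". */
enum firstchar_kind {
  FIRSTCHAR_FIXED,       /* value is the fixed first character */
  FIRSTCHAR_LINE_START,  /* -1: matches only at start or after "\n" */
  FIRSTCHAR_NONE         /* -2: nothing known, or pattern is anchored */
};

enum firstchar_kind classify_firstchar(int value)
{
  if (value >= 0) return FIRSTCHAR_FIXED;
  if (value == -1) return FIRSTCHAR_LINE_START;
  return FIRSTCHAR_NONE;
}
```

For a pattern such as (cat|cow|coyote) the stored value would be 'c', which this function reports as FIRSTCHAR_FIXED.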
371           PCRE_INFO_FIRSTTABLE
372    
373         If the pattern was studied, and this resulted  in  the  con-
374         struction of a 256-bit table indicating a fixed set of char-
375         acters for the first character in  any  matching  string,  a
376         pointer   to  the  table  is  returned.  Otherwise  NULL  is
377         returned. The fourth argument should point  to  an  unsigned
378         char * variable.
379    
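A 256-bit table occupies 32 bytes, one bit per possible byte value. The following sketch shows how such a table might be queried. The bit layout (bit c % 8 of byte c / 8) is an assumption that the text above does not state, and table_has() and table_set() are hypothetical helper names.

```c
/* Returns 1 if byte value c is flagged in a 256-bit (32-byte) table.
   Assumed layout: bit (c % 8) of byte (c / 8). */
int table_has(const unsigned char *table, unsigned char c)
{
  return (table[c / 8] >> (c % 8)) & 1;
}

/* Flags byte value c in a table with the same assumed layout. This is
   needed only to build a table by hand for experiments. */
void table_set(unsigned char *table, unsigned char c)
{
  table[c / 8] |= (unsigned char)(1u << (c % 8));
}
```

An application that receives a non-NULL table pointer could use a test of this kind to skip subject positions whose character cannot start a match.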
380           PCRE_INFO_LASTLITERAL
381    
382         For a non-anchored pattern, return the value of  the  right-
383         most  literal  character  which  must  exist  in any matched
384         string, other than at its start. The fourth argument  should
385         point  to an int variable. If there is no such character, or
386         if the pattern is anchored, -1 is returned. For example, for
387         the pattern /a\d+z\d+/ the returned value is 'z'.
388    
389         The pcre_info() function is now obsolete because its  inter-
390         face  is  too  restrictive  to return all the available data
391         about  a  compiled  pattern.   New   programs   should   use
392         pcre_fullinfo()  instead.  The  yield  of pcre_info() is the
393         number of capturing subpatterns, or  one  of  the  following
394         negative numbers:
395    
396           PCRE_ERROR_NULL       the argument code was NULL
397           PCRE_ERROR_BADMAGIC   the "magic number" was not found
398    
399         If the optptr argument is not NULL, a copy  of  the  options
400         with which the pattern was compiled is placed in the integer
401         it points to (see PCRE_INFO_OPTIONS above).
402    
403         If the pattern is not anchored and the firstcharptr argument
404         is  not  NULL, it is used to pass back information about the
405         first    character    of    any    matched    string    (see
406         PCRE_INFO_FIRSTCHAR above).
407    
408    
409    
# Line 533  EXTRACTING CAPTURED SUBSTRINGS Line 622  EXTRACTING CAPTURED SUBSTRINGS
622       entire regular expression. This is  the  value  returned  by       entire regular expression. This is  the  value  returned  by
623       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()       pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()
624       returned zero, indicating that it ran out of space in  ovec-       returned zero, indicating that it ran out of space in  ovec-
625       tor, then the value passed as stringcount should be the size       tor,  the  value passed as stringcount should be the size of
626       of the vector divided by three.       the vector divided by three.
627    
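The rule for choosing stringcount can be written as a small helper. This is a sketch under stated assumptions: the name stringcount_from_rc() is hypothetical, rc stands for the value returned by pcre_exec(), ovecsize is the number of int elements in ovector, and returning -1 for a negative rc is a convention of this example only.

```c
/* Chooses the stringcount to pass to the substring functions.
   rc       value returned by pcre_exec()
   ovecsize number of int elements in ovector
   Returns -1 (this sketch's own convention) if there was no match. */
int stringcount_from_rc(int rc, int ovecsize)
{
  if (rc > 0) return rc;              /* normal case: use the yield */
  if (rc == 0) return ovecsize / 3;   /* ovector was too small */
  return -1;                          /* no match or error */
}
```

With a 30-element ovector, a zero return from pcre_exec() therefore gives a stringcount of 10.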
628       The functions pcre_copy_substring() and pcre_get_substring()       The functions pcre_copy_substring() and pcre_get_substring()
629       extract a single substring, whose number is given as string-       extract a single substring, whose number is given as string-
# Line 640  DIFFERENCES FROM PERL Line 729  DIFFERENCES FROM PERL
729       6. The Perl \G assertion is  not  supported  as  it  is  not       6. The Perl \G assertion is  not  supported  as  it  is  not
730       relevant to single pattern matches.       relevant to single pattern matches.
731    
732       7. Fairly obviously, PCRE does  not  support  the  (?{code})       7. Fairly obviously, PCRE does not support the (?{code}) and
733       construction.       (?p{code})  constructions. However, there is some experimen-
734         tal support for recursive patterns using the  non-Perl  item
735         (?R).
736       8. There are at the time of writing some  oddities  in  Perl       8. There are at the time of writing some  oddities  in  Perl
737       5.005_02  concerned  with  the  settings of captured strings       5.005_02  concerned  with  the  settings of captured strings
738       when part of a pattern is repeated.  For  example,  matching       when part of a pattern is repeated.  For  example,  matching
739       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value       "aba"  against the pattern /^(a(b)?)+$/ sets $2 to the value
740       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2       "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves  $2
741       unset.    However,    if   the   pattern   is   changed   to       unset.    However,    if   the   pattern   is   changed   to
742       /^(aa(b(b))?)+$/ then $2 (and $3) get set.       /^(aa(b(b))?)+$/ then $2 (and $3) are set.
743    
744       In Perl 5.004 $2 is set in both cases, and that is also true       In Perl 5.004 $2 is set in both cases, and that is also true
745       of PCRE. If in the future Perl changes to a consistent state       of PCRE. If in the future Perl changes to a consistent state
# Line 675  DIFFERENCES FROM PERL Line 765  DIFFERENCES FROM PERL
765       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter       (c) If PCRE_EXTRA is set, a backslash followed by  a  letter
766       with no special meaning is faulted.       with no special meaning is faulted.
767    
768       (d)  If  PCRE_UNGREEDY  is  set,  the  greediness   of   the       (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-
769       repetition quantifiers is inverted, that is, by default they       tion  quantifiers  is inverted, that is, by default they are
770       are not greedy, but if followed by a question mark they are.       not greedy, but if followed by a question mark they are.
771    
772       (e) PCRE_ANCHORED can be used to force a pattern to be tried       (e) PCRE_ANCHORED can be used to force a pattern to be tried
773       only at the start of the subject.       only at the start of the subject.
# Line 685  DIFFERENCES FROM PERL Line 775  DIFFERENCES FROM PERL
775       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options       (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY  options
776       for pcre_exec() have no Perl equivalents.       for pcre_exec() have no Perl equivalents.
777    
778         (g) The (?R) construct allows for recursive pattern matching
779         (Perl  5.6 can do this using the (?p{code}) construct, which
780         PCRE cannot of course support.)
781    
782    
783    
784  REGULAR EXPRESSION DETAILS  REGULAR EXPRESSION DETAILS
785       The syntax and semantics of  the  regular  expressions  sup-       The syntax and semantics of  the  regular  expressions  sup-
786       ported  by PCRE are described below. Regular expressions are       ported  by PCRE are described below. Regular expressions are
787       also described in the Perl documentation and in a number  of       also described in the Perl documentation and in a number  of
788    
789       other  books,  some  of which have copious examples. Jeffrey       other  books,  some  of which have copious examples. Jeffrey
790       Friedl's  "Mastering  Regular  Expressions",  published   by       Friedl's  "Mastering  Regular  Expressions",  published   by
791       O'Reilly  (ISBN 1-56592-257-3), covers them in great detail.       O'Reilly  (ISBN  1-56592-257),  covers them in great detail.
792       The description here is intended as reference documentation.       The description here is intended as reference documentation.
793    
794       A regular expression is a pattern that is matched against  a       A regular expression is a pattern that is matched against  a
# Line 780  BACKSLASH Line 875  BACKSLASH
875         \f     formfeed (hex 0C)         \f     formfeed (hex 0C)
876         \n     newline (hex 0A)         \n     newline (hex 0A)
877         \r     carriage return (hex 0D)         \r     carriage return (hex 0D)
878         \t     tab (hex 09)         \t     tab (hex 09)
879         \xhh   character with hex code hh         \xhh   character with hex code hh
880         \ddd   character with octal code ddd, or backreference         \ddd   character with octal code ddd, or backreference
881    
# Line 833  BACKSLASH Line 927  BACKSLASH
927       Note that octal values of 100 or greater must not be  intro-       Note that octal values of 100 or greater must not be  intro-
928       duced  by  a  leading zero, because no more than three octal       duced  by  a  leading zero, because no more than three octal
929       digits are ever read.       digits are ever read.
930    
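The three-digit limit explains why a leading zero changes the meaning of a longer octal value. The sketch below reads the digits in the way the text describes. The function read_octal() is a hypothetical name, and the sketch does not attempt the separate decision between an octal code and a back reference.

```c
/* Reads at most three octal digits starting at s (the characters that
   follow the backslash). Stores the number of digits consumed in
   *consumed and returns their value. */
int read_octal(const char *s, int *consumed)
{
  int value = 0;
  int n = 0;

  while (n < 3 && s[n] >= '0' && s[n] <= '7') {
    value = value * 8 + (s[n] - '0');
    n++;
  }
  *consumed = n;
  return value;
}
```

Given the digits "0100", only "010" is consumed, so the result is the value 8 followed by a literal "0" rather than the intended octal 100 (decimal 64).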
931       All the sequences that define a single  byte  value  can  be       All the sequences that define a single  byte  value  can  be
932       used both inside and outside character classes. In addition,       used both inside and outside character classes. In addition,
933       inside a character class, the sequence "\b"  is  interpreted       inside a character class, the sequence "\b"  is  interpreted
# Line 885  BACKSLASH Line 980  BACKSLASH
980       These assertions may not appear in  character  classes  (but       These assertions may not appear in  character  classes  (but
981       note that "\b" has a different meaning, namely the backspace       note that "\b" has a different meaning, namely the backspace
982       character, inside a character class).       character, inside a character class).
983    
984       A word boundary is a position in the  subject  string  where       A word boundary is a position in the  subject  string  where
985       the current character and the previous character do not both       the current character and the previous character do not both
986       match \w or \W (i.e. one matches \w and  the  other  matches       match \w or \W (i.e. one matches \w and  the  other  matches
# Line 960  FULL STOP (PERIOD, DOT) Line 1056  FULL STOP (PERIOD, DOT)
1056       Outside a character class, a dot in the pattern matches  any       Outside a character class, a dot in the pattern matches  any
1057       one character in the subject, including a non-printing char-       one character in the subject, including a non-printing char-
1058       acter, but not (by default)  newline.   If  the  PCRE_DOTALL       acter, but not (by default)  newline.   If  the  PCRE_DOTALL
1059       option  is  set,  then dots match newlines as well. The han-       option  is set, dots match newlines as well. The handling of
1060       dling of dot is entirely independent of the handling of cir-       dot is entirely independent of the  handling  of  circumflex
1061       cumflex  and  dollar,  the only relationship being that they       and  dollar,  the  only  relationship  being  that they both
1062       both involve newline characters.  Dot has no special meaning       involve newline characters. Dot has no special meaning in  a
1063       in a character class.       character class.
1064    
1065    
1066    
# Line 1046  SQUARE BRACKETS Line 1142  SQUARE BRACKETS
1142    
1143    
1144    
1145    POSIX CHARACTER CLASSES
1146         Perl 5.6 (not yet released at the time of writing) is  going
1147         to  support  the POSIX notation for character classes, which
1148         uses names enclosed by  [:  and  :]   within  the  enclosing
1149         square brackets. PCRE supports this notation. For example,
1150    
1151           [01[:alpha:]%]
1152    
1153         matches "0", "1", any alphabetic character, or "%". The sup-
1154         ported class names are
1155    
1156           alnum    letters and digits
1157           alpha    letters
1158           ascii    character codes 0 - 127
1159           cntrl    control characters
1160           digit    decimal digits (same as \d)
1161           graph    printing characters, excluding space
1162           lower    lower case letters
1163           print    printing characters, including space
1164           punct    printing characters, excluding letters and digits
1165           space    white space (same as \s)
1166           upper    upper case letters
1167           word     "word" characters (same as \w)
1168           xdigit   hexadecimal digits
1169    
1170         The names "ascii" and "word" are  Perl  extensions.  Another
1171         Perl  extension is negation, which is indicated by a ^ char-
1172         acter after the colon. For example,
1173    
1174           [12[:^digit:]]
1175    
1176         matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also
1177         recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a
1178         "collating element", but these are  not  supported,  and  an
1179         error is given if they are encountered.
1180    
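The class names listed above correspond closely to the classification functions in the standard C header ctype.h. The following sketch hand-codes the two example classes in that way. It is an approximation for illustration, not PCRE's implementation: in_posix_class(), matches_first_example() and matches_second_example() are hypothetical names, the results depend on the current C locale, and isspace() may differ from PCRE's \s for some control characters.

```c
#include <ctype.h>
#include <string.h>

/* Returns 1 if c is in the named class, 0 if it is not, and -1 if the
   name is not one of the names listed in the text. */
int in_posix_class(const char *name, int c)
{
  unsigned char uc = (unsigned char)c;

  if (strcmp(name, "alnum") == 0)  return isalnum(uc) != 0;
  if (strcmp(name, "alpha") == 0)  return isalpha(uc) != 0;
  if (strcmp(name, "ascii") == 0)  return uc < 128;
  if (strcmp(name, "cntrl") == 0)  return iscntrl(uc) != 0;
  if (strcmp(name, "digit") == 0)  return isdigit(uc) != 0;
  if (strcmp(name, "graph") == 0)  return isgraph(uc) != 0;
  if (strcmp(name, "lower") == 0)  return islower(uc) != 0;
  if (strcmp(name, "print") == 0)  return isprint(uc) != 0;
  if (strcmp(name, "punct") == 0)  return ispunct(uc) != 0;
  if (strcmp(name, "space") == 0)  return isspace(uc) != 0;
  if (strcmp(name, "upper") == 0)  return isupper(uc) != 0;
  if (strcmp(name, "word") == 0)   return isalnum(uc) || uc == '_';
  if (strcmp(name, "xdigit") == 0) return isxdigit(uc) != 0;
  return -1;
}

/* Hand-coded equivalent of [01[:alpha:]%] */
int matches_first_example(int c)
{
  return c == '0' || c == '1' || c == '%' || in_posix_class("alpha", c) == 1;
}

/* Hand-coded equivalent of [12[:^digit:]]: the ^ negates the class. */
int matches_second_example(int c)
{
  return c == '1' || c == '2' || in_posix_class("digit", c) == 0;
}
```

The -1 return for an unknown name mirrors the idea that an unsupported class name is an error rather than a class that matches nothing.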
1181    
1182    
1183  VERTICAL BAR  VERTICAL BAR
1184       Vertical bar characters are  used  to  separate  alternative       Vertical bar characters are  used  to  separate  alternative
1185       patterns. For example, the pattern       patterns. For example, the pattern
# Line 1197  REPETITION Line 1331  REPETITION
1331       Repetition is specified by quantifiers, which can follow any       Repetition is specified by quantifiers, which can follow any
1332       of the following items:       of the following items:
1333    
   
1334         a single character, possibly escaped         a single character, possibly escaped
1335         the . metacharacter         the . metacharacter
1336         a character class         a character class
# Line 1273  REPETITION Line 1406  REPETITION
1406       fails, because it matches  the  entire  string  due  to  the       fails, because it matches  the  entire  string  due  to  the
1407       greediness of the .*  item.       greediness of the .*  item.
1408    
1409       However, if a quantifier is followed  by  a  question  mark,       However, if a quantifier is followed by a question mark,  it
1410       then it ceases to be greedy, and instead matches the minimum       ceases  to be greedy, and instead matches the minimum number
1411       number of times possible, so the pattern       of times possible, so the pattern
1412    
1413         /\*.*?\*/         /\*.*?\*/
1414    
# Line 1292  REPETITION Line 1425  REPETITION
1425       that is the only way the rest of the pattern matches.       that is the only way the rest of the pattern matches.
1426    
1427       If the PCRE_UNGREEDY option is set (an option which  is  not       If the PCRE_UNGREEDY option is set (an option which  is  not
1428       available  in  Perl)  then the quantifiers are not greedy by       available  in  Perl),  the  quantifiers  are  not  greedy by
1429       default, but individual ones can be made greedy by following       default, but individual ones can be made greedy by following
1430       them  with  a  question mark. In other words, it inverts the       them  with  a  question mark. In other words, it inverts the
1431       default behaviour.       default behaviour.
# Line 1304  REPETITION Line 1437  REPETITION
1437    
1438       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL       If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
1439       option (equivalent to Perl's /s) is set, thus allowing the .       option (equivalent to Perl's /s) is set, thus allowing the .
1440       to match newlines, then the pattern is implicitly  anchored,       to match  newlines,  the  pattern  is  implicitly  anchored,
1441       because whatever follows will be tried against every charac-       because whatever follows will be tried against every charac-
1442       ter position in the subject string, so there is no point  in       ter position in the subject string, so there is no point  in
1443       retrying  the overall match at any position after the first.       retrying  the overall match at any position after the first.
# Line 1357  BACK REFERENCES Line 1490  BACK REFERENCES
1490    
1491       matches "sense and sensibility" and "response and  responsi-       matches "sense and sensibility" and "response and  responsi-
1492       bility",  but  not  "sense  and  responsibility". If caseful       bility",  but  not  "sense  and  responsibility". If caseful
1493       matching is in force at the time of the back reference, then       matching is in force at the time of the back reference,  the
1494       the case of letters is relevant. For example,       case of letters is relevant. For example,
1495    
1496         ((?i)rah)\s+\1         ((?i)rah)\s+\1
1497    
# Line 1368  BACK REFERENCES Line 1501  BACK REFERENCES
1501    
1502       There may be more than one back reference to the  same  sub-       There may be more than one back reference to the  same  sub-
1503       pattern.  If  a  subpattern  has not actually been used in a       pattern.  If  a  subpattern  has not actually been used in a
1504       particular match, then any  back  references  to  it  always       particular match, any back references to it always fail. For
1505       fail. For example, the pattern       example, the pattern
1506    
1507         (a|(bc))\2         (a|(bc))\2
1508    
# Line 1377  BACK REFERENCES Line 1510  BACK REFERENCES
1510       Because  there  may  be up to 99 back references, all digits       Because  there  may  be up to 99 back references, all digits
1511       following the backslash are taken as  part  of  a  potential       following the backslash are taken as  part  of  a  potential
1512       back reference number. If the pattern continues with a digit       back reference number. If the pattern continues with a digit
1513       character, then some delimiter must be used to terminate the       character, some delimiter must be used to terminate the back
1514       back reference. If the PCRE_EXTENDED option is set, this can       reference.   If the PCRE_EXTENDED option is set, this can be
1515       be whitespace.  Otherwise an empty comment can be used.       whitespace. Otherwise an empty comment can be used.
1516    
1517       A back reference that occurs inside the parentheses to which       A back reference that occurs inside the parentheses to which
1518       it  refers  fails when the subpattern is first used, so, for       it  refers  fails when the subpattern is first used, so, for
1519       example, (a\1) never matches.  However, such references  can       example, (a\1) never matches.  However, such references  can
1520       be useful inside repeated subpatterns. For example, the pat-       be  useful  inside  repeated  subpatterns.  For example, the
1521       tern       pattern
1522    
1523         (a|b\1)+         (a|b\1)+
1524    
# Line 1407  ASSERTIONS Line 1540  ASSERTIONS
1540       cated assertions are coded as  subpatterns.  There  are  two       cated assertions are coded as  subpatterns.  There  are  two
1541       kinds:  those that look ahead of the current position in the       kinds:  those that look ahead of the current position in the
1542       subject string, and those that look behind it.       subject string, and those that look behind it.
1543    
1544       An assertion subpattern is matched in the normal way, except       An assertion subpattern is matched in the normal way, except
1545       that  it  does not cause the current matching position to be       that  it  does not cause the current matching position to be
1546       changed. Lookahead assertions start with  (?=  for  positive       changed. Lookahead assertions start with  (?=  for  positive
# Line 1478  ASSERTIONS Line 1612  ASSERTIONS
1612       matches "foo" preceded by three digits that are  not  "999".       matches "foo" preceded by three digits that are  not  "999".
1613       Notice  that each of the assertions is applied independently       Notice  that each of the assertions is applied independently
1614       at the same point in the subject string. First  there  is  a       at the same point in the subject string. First  there  is  a
1615       check  that  the  previous  three characters are all digits,       check that the previous three characters are all digits, and
1616       then there is a check that the same three characters are not       then there is a check that the same three characters are not
1617       "999".   This  pattern  does not match "foo" preceded by six       "999".   This  pattern  does not match "foo" preceded by six
1618       characters, the first of which are digits and the last three       characters, the first of which are digits and the last three
# Line 1572  ONCE-ONLY SUBPATTERNS Line 1706  ONCE-ONLY SUBPATTERNS
1706    
1707         abcd$         abcd$
1708    
1709       when applied to a long  string  which  does  not  match  it.       when applied to a long string which does not match.  Because
1710       Because matching proceeds from left to right, PCRE will look       matching  proceeds  from  left  to right, PCRE will look for
1711       for each "a" in the subject and then  see  if  what  follows       each "a" in the subject and then see if what follows matches
1712       matches the rest of the pattern. If the pattern is specified       the rest of the pattern. If the pattern is specified as
      as  
1713    
1714         ^.*abcd$         ^.*abcd$
1715    
1716       then the initial .* matches the entire string at first,  but       the initial .* matches the entire string at first, but  when
1717       when  this  fails,  it  backtracks to match all but the last       this  fails  (because  there  is no following "a"), it back-
1718       character, then all but the last two characters, and so  on.       tracks to match all but the last character, then all but the
1719       Once again the search for "a" covers the entire string, from       last  two  characters,  and so on. Once again the search for
1720       right to left, so we are no better off. However, if the pat-       "a" covers the entire string, from right to left, so we  are
1721       tern is written as       no better off. However, if the pattern is written as
1722    
1723         ^(?>.*)(?<=abcd)         ^(?>.*)(?<=abcd)
1724    
1725       then there can be no backtracking for the .*  item;  it  can       there can be no backtracking for the .* item; it  can  match
1726       match  only  the  entire  string.  The subsequent lookbehind       only  the entire string. The subsequent lookbehind assertion
1727       assertion does a single test on the last four characters. If       does a single test on the last four characters. If it fails,
1728       it  fails,  the  match  fails immediately. For long strings,       the match fails immediately. For long strings, this approach
1729       this approach makes a significant difference to the process-       makes a significant difference to the processing time.
1730       ing time.  
1731         When a pattern contains an unlimited repeat inside a subpat-
1732         tern  that  can  itself  be  repeated an unlimited number of
1733         times, the use of a once-only subpattern is the only way  to
1734         avoid  some  failing matches taking a very long time indeed.
1735         The pattern
1736    
1737           (\D+|<\d+>)*[!?]
1738    
1739         matches an unlimited number of substrings that  either  con-
1740         sist  of  non-digits,  or digits enclosed in <>, followed by
1741         either ! or ?. When it matches, it runs quickly. However, if
1742         it is applied to
1743    
1744           aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1745    
1746         it takes a long  time  before  reporting  failure.  This  is
1747         because the string can be divided between the two repeats in
1748         a large number of ways, and all have to be tried. (The exam-
1749         ple  used  [!?]  rather  than a single character at the end,
1750         because both PCRE and Perl have an optimization that  allows
1751         for  fast  failure  when  a  single  character is used. They
1752         remember the last single character that is  required  for  a
1753         match,  and  fail early if it is not present in the string.)
1754         If the pattern is changed to
1755    
1756           ((?>\D+)|<\d+>)*[!?]
1757    
1758         sequences of non-digits cannot be broken, and  failure  hap-
1759         pens quickly.
1760    
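The phrase "a large number of ways" can be made concrete. The following Python sketch is an illustration only, not PCRE's matching code, and the function name count_splits is hypothetical. It counts how many ways a run of n non-digits can be cut into consecutive non-empty pieces, each piece standing for one iteration of \D+ inside the outer ( ... )* repeat. This is only the number of splits from a single starting position; a backtracking matcher may do more work than this.

```python
def count_splits(n: int) -> int:
    """Count the ways to cut a run of n characters into consecutive,
    non-empty pieces (one piece per iteration of the inner repeat)."""
    # ways[k] holds the number of ways to split the first k characters.
    ways = [0] * (n + 1)
    ways[0] = 1  # the empty prefix can be split in exactly one way
    for k in range(1, n + 1):
        # The last piece may begin at any earlier position j.
        ways[k] = sum(ways[j] for j in range(k))
    return ways[n]


if __name__ == "__main__":
    for n in (1, 4, 10, 20):
        print(n, count_splits(n))
```

The count doubles with every extra character (it equals 2 to the power n-1), which is why the failing match above takes so long. With the once-only form (?>\D+), only the split that takes the whole run in one piece remains to be tried from that position.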
1761    
1762    
# Line 1614  CONDITIONAL SUBPATTERNS Line 1776  CONDITIONAL SUBPATTERNS
1776       error occurs.       error occurs.
1777    
1778       There are two kinds of condition. If the  text  between  the       There are two kinds of condition. If the  text  between  the
1779       parentheses  consists  of  a  sequence  of  digits, then the       parentheses  consists of a sequence of digits, the condition
1780       condition is satisfied if the capturing subpattern  of  that       is satisfied if the capturing subpattern of that number  has
1781       number  has  previously matched. Consider the following pat-       previously  matched.  Consider  the following pattern, which
1782       tern, which contains non-significant white space to make  it       contains non-significant white space to make it  more  read-
1783       more  readable  (assume  the  PCRE_EXTENDED  option)  and to       able (assume the PCRE_EXTENDED option) and to divide it into
1784       divide it into three parts for ease of discussion:       three parts for ease of discussion:
1785    
1786         ( \( )?    [^()]+    (?(1) \) )         ( \( )?    [^()]+    (?(1) \) )
1787    
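The behaviour of this conditional pattern can be tried out with a short sketch. It assumes Python's re module as a stand-in, since re accepts the same (?(1)...) syntax and its re.VERBOSE flag plays the role of PCRE_EXTENDED. It does not call PCRE itself, and the helper name is hypothetical.

```python
import re

# Assumption: Python's re is used in place of PCRE; re.VERBOSE ignores
# the white space in the pattern, as PCRE_EXTENDED would.
PAREN_OPT = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def is_plain_or_parenthesized(text: str) -> bool:
    """True if text is a run of non-parentheses, optionally wrapped in
    exactly one matching pair of parentheses."""
    return PAREN_OPT.fullmatch(text) is not None


if __name__ == "__main__":
    for sample in ("abc", "(abc)", "(abc", "abc)"):
        print(repr(sample), is_plain_or_parenthesized(sample))
```

The closing parenthesis is required only when capturing subpattern 1 (the optional opening parenthesis) took part in the match, so "abc" and "(abc)" are accepted while "(abc" and "abc)" are rejected.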
# Line 1668  COMMENTS Line 1830  COMMENTS
1830    
1831    
1832    
1833    RECURSIVE PATTERNS
1834         Consider the problem of matching a  string  in  parentheses,
1835         allowing  for  unlimited nested parentheses. Without the use
1836         of recursion, the best that can be done is to use a  pattern
1837         that  matches  up  to some fixed depth of nesting. It is not
1838         possible to handle an arbitrary nesting depth. Perl 5.6  has
1839         provided   an  experimental  facility  that  allows  regular
1840         expressions to recurse (amongst other things). It does  this
1841         by  interpolating  Perl  code in the expression at run time,
1842         and the code can refer to the expression itself. A Perl pat-
1843         tern  to  solve  the parentheses problem can be created like
1844         this:
1845    
1846           $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
1847    
1848         The (?p{...}) item interpolates Perl code at run  time,  and
1849         in  this  case refers recursively to the pattern in which it
1850         appears. Obviously, PCRE cannot support the interpolation of
1851         Perl  code.  Instead,  the special item (?R) is provided for
1852         the specific case of recursion. This PCRE pattern solves the
1853         parentheses  problem (assume the PCRE_EXTENDED option is set
1854         so that white space is ignored):
1855    
1856           \( ( (?>[^()]+) | (?R) )* \)
1857    
1858         First it matches an opening parenthesis. Then it matches any
1859         number  of substrings which can either be a sequence of non-
1860         parentheses, or a recursive  match  of  the  pattern  itself
1861         (i.e. a correctly parenthesized substring). Finally there is
1862         a closing parenthesis.
1863    
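The three steps just described can be mirrored by a small hand-written recursive matcher. This Python sketch only illustrates the logic of the pattern; it is not how PCRE implements (?R), and the function name match_balanced is hypothetical.

```python
from typing import Optional


def match_balanced(s: str, pos: int = 0) -> Optional[int]:
    """Return the index just past a correctly parenthesized substring
    that starts at pos, or None if there is no such substring."""
    # Step 1: an opening parenthesis.
    if pos >= len(s) or s[pos] != "(":
        return None
    i = pos + 1
    # Step 2: any number of non-parentheses or nested groups.
    while i < len(s):
        ch = s[i]
        if ch == ")":
            # Step 3: the closing parenthesis ends this level.
            return i + 1
        if ch == "(":
            # Recursive match of the whole pattern, like (?R).
            end = match_balanced(s, i)
            if end is None:
                return None
            i = end
        else:
            i += 1
    # Ran off the end of the subject without a closing parenthesis.
    return None


if __name__ == "__main__":
    print(match_balanced("(ab(cd)ef)"))
    print(match_balanced("(ab(cd"))
```

Because each character is consumed exactly once and never given back, this sketch behaves like the pattern with the once-only subpattern in place: a subject with an unclosed parenthesis is rejected in a single pass.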
1864         This particular example pattern  contains  nested  unlimited
1865         repeats, and so the use of a once-only subpattern for match-
1866         ing strings of non-parentheses is  important  when  applying
1867         the  pattern to strings that do not match. For example, when
1868         it is applied to
1869    
1870           (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1871    
1872         it yields "no match" quickly. However, if a  once-only  sub-
1873         pattern  is  not  used,  the match runs for a very long time
1874         indeed because there are so many different ways the + and  *
1875         repeats  can carve up the subject, and all have to be tested
1876         before failure can be reported.
1877    
1878         The values set for any capturing subpatterns are those  from
1879         the outermost level of the recursion at which the subpattern
1880         value is set. If the pattern above is matched against
1881    
1882           (ab(cd)ef)
1883    
1884         the value for the capturing parentheses is  "ef",  which  is
1885         the  last  value  taken  on  at the top level. If additional
1886         parentheses are added, giving
1887    
1888           \( ( ( (?>[^()]+) | (?R) )* ) \)
1889              ^                        ^
1890              ^                        ^
             the string they  capture  is
1891         "ab(cd)ef",  the  contents  of the top level parentheses. If
1892         there are more than 15 capturing parentheses in  a  pattern,
1893         PCRE  has  to  obtain  extra  memory  to store data during a
1894         recursion, which it does by using  pcre_malloc,  freeing  it
1895         via  pcre_free  afterwards. If no memory can be obtained, it
1896         saves data for the first 15 capturing parentheses  only,  as
1897         there is no way to give an out-of-memory error from within a
1898         recursion.
1899    
1900    
1901    
1902  PERFORMANCE  PERFORMANCE
1903       Certain items that may appear in patterns are more efficient       Certain items that may appear in patterns are more efficient
1904       than  others.  It is more efficient to use a character class       than  others.  It is more efficient to use a character class
# Line 1742  AUTHOR Line 1973  AUTHOR
1973       Cambridge CB2 3QG, England.       Cambridge CB2 3QG, England.
1974       Phone: +44 1223 334714       Phone: +44 1223 334714
1975    
1976       Last updated: 29 July 1999       Last updated: 27 January 2000
1977       Copyright (c) 1997-1999 University of Cambridge.       Copyright (c) 1997-2000 University of Cambridge.
