Diff of /code/trunk/doc/pcre.html

revision 41 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 43 by nigel, Sat Feb 24 21:39:21 2007 UTC
# Line 25  conversion went wrong. Line 25  conversion went wrong.
25  <LI><A NAME="TOC15" HREF="#SEC15">CIRCUMFLEX AND DOLLAR</A>  <LI><A NAME="TOC15" HREF="#SEC15">CIRCUMFLEX AND DOLLAR</A>
26  <LI><A NAME="TOC16" HREF="#SEC16">FULL STOP (PERIOD, DOT)</A>  <LI><A NAME="TOC16" HREF="#SEC16">FULL STOP (PERIOD, DOT)</A>
27  <LI><A NAME="TOC17" HREF="#SEC17">SQUARE BRACKETS</A>  <LI><A NAME="TOC17" HREF="#SEC17">SQUARE BRACKETS</A>
28  <LI><A NAME="TOC18" HREF="#SEC18">VERTICAL BAR</A>  <LI><A NAME="TOC18" HREF="#SEC18">POSIX CHARACTER CLASSES</A>
29  <LI><A NAME="TOC19" HREF="#SEC19">INTERNAL OPTION SETTING</A>  <LI><A NAME="TOC19" HREF="#SEC19">VERTICAL BAR</A>
30  <LI><A NAME="TOC20" HREF="#SEC20">SUBPATTERNS</A>  <LI><A NAME="TOC20" HREF="#SEC20">INTERNAL OPTION SETTING</A>
31  <LI><A NAME="TOC21" HREF="#SEC21">REPETITION</A>  <LI><A NAME="TOC21" HREF="#SEC21">SUBPATTERNS</A>
32  <LI><A NAME="TOC22" HREF="#SEC22">BACK REFERENCES</A>  <LI><A NAME="TOC22" HREF="#SEC22">REPETITION</A>
33  <LI><A NAME="TOC23" HREF="#SEC23">ASSERTIONS</A>  <LI><A NAME="TOC23" HREF="#SEC23">BACK REFERENCES</A>
34  <LI><A NAME="TOC24" HREF="#SEC24">ONCE-ONLY SUBPATTERNS</A>  <LI><A NAME="TOC24" HREF="#SEC24">ASSERTIONS</A>
35  <LI><A NAME="TOC25" HREF="#SEC25">CONDITIONAL SUBPATTERNS</A>  <LI><A NAME="TOC25" HREF="#SEC25">ONCE-ONLY SUBPATTERNS</A>
36  <LI><A NAME="TOC26" HREF="#SEC26">COMMENTS</A>  <LI><A NAME="TOC26" HREF="#SEC26">CONDITIONAL SUBPATTERNS</A>
37  <LI><A NAME="TOC27" HREF="#SEC27">PERFORMANCE</A>  <LI><A NAME="TOC27" HREF="#SEC27">COMMENTS</A>
38  <LI><A NAME="TOC28" HREF="#SEC28">AUTHOR</A>  <LI><A NAME="TOC28" HREF="#SEC28">RECURSIVE PATTERNS</A>
39    <LI><A NAME="TOC29" HREF="#SEC29">PERFORMANCE</A>
40    <LI><A NAME="TOC30" HREF="#SEC30">AUTHOR</A>
41  </UL>  </UL>
42  <LI><A NAME="SEC1" HREF="#TOC1">NAME</A>  <LI><A NAME="SEC1" HREF="#TOC1">NAME</A>
43  <P>  <P>
# Line 77  pcre - Perl-compatible regular expressio Line 79  pcre - Perl-compatible regular expressio
79  <B>const unsigned char *pcre_maketables(void);</B>  <B>const unsigned char *pcre_maketables(void);</B>
80  </P>  </P>
81  <P>  <P>
82    <B>int pcre_fullinfo(const pcre *<I>code</I>, const pcre_extra *<I>extra</I>,</B>
83    <B>int <I>what</I>, void *<I>where</I>);</B>
84    </P>
85    <P>
86  <B>int pcre_info(const pcre *<I>code</I>, int *<I>optptr</I>, int</B>  <B>int pcre_info(const pcre *<I>code</I>, int *<I>optptr</I>, int</B>
87  <B>*<I>firstcharptr</I>);</B>  <B>*<I>firstcharptr</I>);</B>
88  </P>  </P>
# Line 93  pcre - Perl-compatible regular expressio Line 99  pcre - Perl-compatible regular expressio
99  <P>  <P>
100  The PCRE library is a set of functions that implement regular expression  The PCRE library is a set of functions that implement regular expression
101  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
102  differences (see below). The current implementation corresponds to Perl 5.005.  differences (see below). The current implementation corresponds to Perl 5.005,
103    with some additional features from the Perl development release.
104  </P>  </P>
105  <P>  <P>
106  PCRE has its own native API, which is described in this document. There is also  PCRE has its own native API, which is described in this document. There is also
107  a set of wrapper functions that correspond to the POSIX API. These are  a set of wrapper functions that correspond to the POSIX regular expression API.
108  described in the <B>pcreposix</B> documentation.  These are described in the <B>pcreposix</B> documentation.
109  </P>  </P>
110  <P>  <P>
111  The native API function prototypes are defined in the header file <B>pcre.h</B>,  The native API function prototypes are defined in the header file <B>pcre.h</B>,
112  and on Unix systems the library itself is called <B>libpcre.a</B>, so can be  and on Unix systems the library itself is called <B>libpcre.a</B>, so can be
113  accessed by adding <B>-lpcre</B> to the command for linking an application which  accessed by adding <B>-lpcre</B> to the command for linking an application which
114  calls it.  calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to
115    contain the major and minor release numbers for the library. Applications can
116    use these to include support for different releases.
117  </P>  </P>
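A minimal sketch of how an application might use these macros for conditional compilation. In real code the values come from including pcre.h; the stand-in definitions below are assumptions that only let the fragment compile on its own. The names PCRE_AT_LEAST and pcre_header_at_least are hypothetical and are not part of the PCRE API.

```c
/* In a real application: #include <pcre.h>, which defines PCRE_MAJOR and
   PCRE_MINOR. The stand-in values below are assumptions that only let this
   sketch compile without the header. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 3
#define PCRE_MINOR 0
#endif

/* Hypothetical helper: true when the header describes release maj.min or
   later. It expands to a constant expression, so it also works in #if. */
#define PCRE_AT_LEAST(maj, min) \
    (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Run-time wrapper around the same comparison (hypothetical name). */
int pcre_header_at_least(int major, int minor)
{
    return PCRE_AT_LEAST(major, minor);
}
```

An application could then wrap release-specific code in "#if PCRE_AT_LEAST(major, minor)" and keep a fallback in the "#else" branch, choosing whatever release threshold it needs to distinguish.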
118  <P>  <P>
119  The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>  The functions <B>pcre_compile()</B>, <B>pcre_study()</B>, and <B>pcre_exec()</B>
# Line 116  captured substrings from a matched subje Line 125  captured substrings from a matched subje
125  in the current locale for passing to <B>pcre_compile()</B>.  in the current locale for passing to <B>pcre_compile()</B>.
126  </P>  </P>
127  <P>  <P>
128  The function <B>pcre_info()</B> is used to find out information about a compiled  The function <B>pcre_fullinfo()</B> is used to find out information about a
129  pattern, while the function <B>pcre_version()</B> returns a pointer to a string  compiled pattern; <B>pcre_info()</B> is an obsolete version which returns only
130  containing the version of PCRE and its date of release.  some of the available information, but is retained for backwards compatibility.
131    The function <B>pcre_version()</B> returns a pointer to a string containing the
132    version of PCRE and its date of release.
133  </P>  </P>
134  <P>  <P>
135  The global variables <B>pcre_malloc</B> and <B>pcre_free</B> initially contain  The global variables <B>pcre_malloc</B> and <B>pcre_free</B> initially contain
# Line 246  sequence (?( which introduces a conditio Line 257  sequence (?( which introduces a conditio
257  </PRE>  </PRE>
258  </P>  </P>
259  <P>  <P>
260  This option turns on additional functionality of PCRE that is incompatible with  This option was invented in order to turn on additional functionality of PCRE
261  Perl. Any backslash in a pattern that is followed by a letter that has no  that is incompatible with Perl, but it is currently of very little use. When
262    set, any backslash in a pattern that is followed by a letter that has no
263  special meaning causes an error, thus reserving these combinations for future  special meaning causes an error, thus reserving these combinations for future
264  expansion. By default, as in Perl, a backslash followed by a letter with no  expansion. By default, as in Perl, a backslash followed by a letter with no
265  special meaning is treated as a literal. There are at present no other features  special meaning is treated as a literal. There are at present no other features
266  controlled by this option.  controlled by this option. It can also be set by a (?X) option setting within a
267    pattern.
268  </P>  </P>
269  <P>  <P>
270  <PRE>  <PRE>
# Line 342  memory containing the tables remains ava Line 355  memory containing the tables remains ava
355  </P>  </P>
356  <LI><A NAME="SEC8" HREF="#TOC1">INFORMATION ABOUT A PATTERN</A>  <LI><A NAME="SEC8" HREF="#TOC1">INFORMATION ABOUT A PATTERN</A>
357  <P>  <P>
358  The <B>pcre_info()</B> function returns information about a compiled pattern.  The <B>pcre_fullinfo()</B> function returns information about a compiled
359  Its yield is the number of capturing subpatterns, or one of the following  pattern. It replaces the obsolete <B>pcre_info()</B> function, which is
360  negative numbers:  nevertheless retained for backwards compatibility (and is documented below).
361    </P>
362    <P>
363    The first argument for <B>pcre_fullinfo()</B> is a pointer to the compiled
364    pattern. The second argument is the result of <B>pcre_study()</B>, or NULL if
365    the pattern was not studied. The third argument specifies which piece of
366    information is required, while the fourth argument is a pointer to a variable
367    to receive the data. The yield of the function is zero for success, or one of
368    the following negative numbers:
369  </P>  </P>
370  <P>  <P>
371  <PRE>  <PRE>
372    PCRE_ERROR_NULL       the argument <I>code</I> was NULL    PCRE_ERROR_NULL       the argument <I>code</I> was NULL
373                            the argument <I>where</I> was NULL
374    PCRE_ERROR_BADMAGIC   the "magic number" was not found    PCRE_ERROR_BADMAGIC   the "magic number" was not found
375      PCRE_ERROR_BADOPTION  the value of <I>what</I> was invalid
376  </PRE>  </PRE>
377  </P>  </P>
378  <P>  <P>
379  If the <I>optptr</I> argument is not NULL, a copy of the options with which the  The possible values for the third argument are defined in <B>pcre.h</B>, and are
380  pattern was compiled is placed in the integer it points to. These option bits  as follows:
381    </P>
382    <P>
383    <PRE>
384      PCRE_INFO_OPTIONS
385    </PRE>
386    </P>
387    <P>
388    Return a copy of the options with which the pattern was compiled. The fourth
389    argument should point to an <B>unsigned long int</B> variable. These option bits
390  are those specified in the call to <B>pcre_compile()</B>, modified by any  are those specified in the call to <B>pcre_compile()</B>, modified by any
391  top-level option settings within the pattern itself, and with the PCRE_ANCHORED  top-level option settings within the pattern itself, and with the PCRE_ANCHORED
392  bit set if the form of the pattern implies that it can match only at the start  bit forcibly set if the form of the pattern implies that it can match only at
393  of a subject string.  the start of a subject string.
394  </P>  </P>
395  <P>  <P>
396  If the pattern is not anchored and the <I>firstcharptr</I> argument is not NULL,  <PRE>
397  it is used to pass back information about the first character of any matched    PCRE_INFO_SIZE
398  string. If there is a fixed first character, e.g. from a pattern such as  </PRE>
399  (cat|cow|coyote), then it is returned in the integer pointed to by  </P>
400  <I>firstcharptr</I>. Otherwise, if either  <P>
401    Return the size of the compiled pattern, that is, the value that was passed as
402    the argument to <B>pcre_malloc()</B> when PCRE was getting memory in which to
403    place the compiled data. The fourth argument should point to a <B>size_t</B>
404    variable.
405    </P>
406    <P>
407    <PRE>
408      PCRE_INFO_CAPTURECOUNT
409    </PRE>
410    </P>
411    <P>
412    Return the number of capturing subpatterns in the pattern. The fourth argument
413    should point to an <B>int</B> variable.
414    </P>
415    <P>
416    <PRE>
417      PCRE_INFO_BACKREFMAX
418    </PRE>
419    </P>
420    <P>
421    Return the number of the highest back reference in the pattern. The fourth
422    argument should point to an <B>int</B> variable. Zero is returned if there are
423    no back references.
424    </P>
425    <P>
426    <PRE>
427      PCRE_INFO_FIRSTCHAR
428    </PRE>
429    </P>
430    <P>
431    Return information about the first character of any matched string, for a
432    non-anchored pattern. If there is a fixed first character, e.g. from a pattern
433    such as (cat|cow|coyote), then it is returned in the integer pointed to by
434    <I>where</I>. Otherwise, if either
435  </P>  </P>
436  <P>  <P>
437  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
# Line 378  starts with "^", or Line 444  starts with "^", or
444  <P>  <P>
445  then -1 is returned, indicating that the pattern matches only at the  then -1 is returned, indicating that the pattern matches only at the
446  start of a subject string or after any "\n" within the string. Otherwise -2 is  start of a subject string or after any "\n" within the string. Otherwise -2 is
447  returned.  returned. For anchored patterns, -2 is returned.
448    </P>
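The three kinds of PCRE_INFO_FIRSTCHAR result described above (a character value, -1, or -2) can be told apart with a small helper. This is only a sketch: the enum and the function name are hypothetical, and the integer passed in is assumed to be the value that pcre_fullinfo() stored through its fourth argument.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTCHAR result. */
enum firstchar_kind {
    FIRSTCHAR_FIXED,       /* value is the fixed first character          */
    FIRSTCHAR_LINE_START,  /* -1: matches only at start or after a "\n"   */
    FIRSTCHAR_UNKNOWN      /* -2: anchored pattern, or nothing is known   */
};

/* Maps the integer described above onto one of the three cases. Any other
   negative value is treated as "unknown" by this sketch. */
enum firstchar_kind classify_firstchar(int value)
{
    if (value >= 0)
        return FIRSTCHAR_FIXED;
    if (value == -1)
        return FIRSTCHAR_LINE_START;
    return FIRSTCHAR_UNKNOWN;
}
```

A caller would typically switch on the returned kind and use the original integer as the character only in the FIRSTCHAR_FIXED case.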
449    <P>
450    <PRE>
451      PCRE_INFO_FIRSTTABLE
452    </PRE>
453    </P>
454    <P>
455    If the pattern was studied, and this resulted in the construction of a 256-bit
456    table indicating a fixed set of characters for the first character in any
457    matching string, a pointer to the table is returned. Otherwise NULL is
458    returned. The fourth argument should point to an <B>unsigned char *</B>
459    variable.
460    </P>
461    <P>
462    <PRE>
463      PCRE_INFO_LASTLITERAL
464    </PRE>
465    </P>
466    <P>
467    For a non-anchored pattern, return the value of the rightmost literal character
468    which must exist in any matched string, other than at its start. The fourth
469    argument should point to an <B>int</B> variable. If there is no such character,
470    or if the pattern is anchored, -1 is returned. For example, for the pattern
471    /a\d+z\d+/ the returned value is 'z'.
472    </P>
473    <P>
474    The <B>pcre_info()</B> function is now obsolete because its interface is too
475    restrictive to return all the available data about a compiled pattern. New
476    programs should use <B>pcre_fullinfo()</B> instead. The yield of
477    <B>pcre_info()</B> is the number of capturing subpatterns, or one of the
478    following negative numbers:
479    </P>
480    <P>
481    <PRE>
482      PCRE_ERROR_NULL       the argument <I>code</I> was NULL
483      PCRE_ERROR_BADMAGIC   the "magic number" was not found
484    </PRE>
485    </P>
486    <P>
487    If the <I>optptr</I> argument is not NULL, a copy of the options with which the
488    pattern was compiled is placed in the integer it points to (see
489    PCRE_INFO_OPTIONS above).
490    </P>
491    <P>
492    If the pattern is not anchored and the <I>firstcharptr</I> argument is not NULL,
493    it is used to pass back information about the first character of any matched
494    string (see PCRE_INFO_FIRSTCHAR above).
495  </P>  </P>
496  <LI><A NAME="SEC9" HREF="#TOC1">MATCHING A PATTERN</A>  <LI><A NAME="SEC9" HREF="#TOC1">MATCHING A PATTERN</A>
497  <P>  <P>
# Line 735  are not part of its pattern matching eng Line 848  are not part of its pattern matching eng
848  pattern matches.  pattern matches.
849  </P>  </P>
850  <P>  <P>
851  7. Fairly obviously, PCRE does not support the (?{code}) construction.  7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
852    constructions. However, there is some experimental support for recursive
853    patterns using the non-Perl item (?R).
854  </P>  </P>
855  <P>  <P>
856  8. There are at the time of writing some oddities in Perl 5.005_02 concerned  8. There are at the time of writing some oddities in Perl 5.005_02 concerned
# Line 783  of the subject. Line 898  of the subject.
898  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
899  <B>pcre_exec()</B> have no Perl equivalents.  <B>pcre_exec()</B> have no Perl equivalents.
900  </P>  </P>
901    <P>
902    (g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do
903    this using the (?p{code}) construct, which PCRE cannot of course support.)
904    </P>
905  <LI><A NAME="SEC13" HREF="#TOC1">REGULAR EXPRESSION DETAILS</A>  <LI><A NAME="SEC13" HREF="#TOC1">REGULAR EXPRESSION DETAILS</A>
906  <P>  <P>
907  The syntax and semantics of the regular expressions supported by PCRE are  The syntax and semantics of the regular expressions supported by PCRE are
908  described below. Regular expressions are also described in the Perl  described below. Regular expressions are also described in the Perl
909  documentation and in a number of other books, some of which have copious  documentation and in a number of other books, some of which have copious
910  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
911  O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description  O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description
912  here is intended as reference documentation.  here is intended as reference documentation.
913  </P>  </P>
914  <P>  <P>
# Line 1144  All non-alphameric characters other than Line 1263  All non-alphameric characters other than
1263  terminating ] are non-special in character classes, but it does no harm if they  terminating ] are non-special in character classes, but it does no harm if they
1264  are escaped.  are escaped.
1265  </P>  </P>
1266  <LI><A NAME="SEC18" HREF="#TOC1">VERTICAL BAR</A>  <LI><A NAME="SEC18" HREF="#TOC1">POSIX CHARACTER CLASSES</A>
1267    <P>
1268    Perl 5.6 (not yet released at the time of writing) is going to support the
1269    POSIX notation for character classes, which uses names enclosed by [: and :]
1270    within the enclosing square brackets. PCRE supports this notation. For example,
1271    </P>
1272    <P>
1273    <PRE>
1274      [01[:alpha:]%]
1275    </PRE>
1276    </P>
1277    <P>
1278    matches "0", "1", any alphabetic character, or "%". The supported class names
1279    are
1280    </P>
1281    <P>
1282    <PRE>
1283      alnum    letters and digits
1284      alpha    letters
1285      ascii    character codes 0 - 127
1286      cntrl    control characters
1287      digit    decimal digits (same as \d)
1288      graph    printing characters, excluding space
1289      lower    lower case letters
1290      print    printing characters, including space
1291      punct    printing characters, excluding letters and digits
1292      space    white space (same as \s)
1293      upper    upper case letters
1294      word     "word" characters (same as \w)
1295      xdigit   hexadecimal digits
1296    </PRE>
1297    </P>
1298    <P>
1299    The names "ascii" and "word" are Perl extensions. Another Perl extension is
1300    negation, which is indicated by a ^ character after the colon. For example,
1301    </P>
1302    <P>
1303    <PRE>
1304      [12[:^digit:]]
1305    </PRE>
1306    </P>
1307    <P>
1308    matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
1309    syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1310    supported, and an error is given if they are encountered.
1311    </P>
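Because the [: and :] notation comes from POSIX, the first example class above can be tried with the POSIX regex functions that most Unix C libraries provide. The sketch below uses regcomp() and regexec() from regex.h rather than PCRE itself, purely as a stand-in for experimenting with the notation; the function name is hypothetical. The negated form [:^digit:] is not used, since the text above describes it as a Perl extension.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if the whole of s is one character from [01[:alpha:]%],
   0 if it is not, and -1 if the pattern fails to compile.
   Uses the system's POSIX regex library, not PCRE. */
int in_example_class(const char *s)
{
    regex_t re;
    int rc;

    if (regcomp(&re, "^[01[:alpha:]%]$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);   /* 0 means "matched" */
    regfree(&re);
    return rc == 0;
}
```

Which characters count as alphabetic depends on the locale in effect; the sketch never calls setlocale(), so it runs in the default "C" locale.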
1312    <LI><A NAME="SEC19" HREF="#TOC1">VERTICAL BAR</A>
1313  <P>  <P>
1314  Vertical bar characters are used to separate alternative patterns. For example,  Vertical bar characters are used to separate alternative patterns. For example,
1315  the pattern  the pattern
# Line 1162  and the first one that succeeds is used. Line 1327  and the first one that succeeds is used.
1327  subpattern (defined below), "succeeds" means matching the rest of the main  subpattern (defined below), "succeeds" means matching the rest of the main
1328  pattern as well as the alternative in the subpattern.  pattern as well as the alternative in the subpattern.
1329  </P>  </P>
1330  <LI><A NAME="SEC19" HREF="#TOC1">INTERNAL OPTION SETTING</A>  <LI><A NAME="SEC20" HREF="#TOC1">INTERNAL OPTION SETTING</A>
1331  <P>  <P>
1332  The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED  The settings of PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, and PCRE_EXTENDED
1333  can be changed from within the pattern by a sequence of Perl option letters  can be changed from within the pattern by a sequence of Perl option letters
# Line 1238  respectively. The (?X) flag setting is s Line 1403  respectively. The (?X) flag setting is s
1403  earlier in the pattern than any of the additional features it turns on, even  earlier in the pattern than any of the additional features it turns on, even
1404  when it is at top level. It is best put at the start.  when it is at top level. It is best put at the start.
1405  </P>  </P>
1406  <LI><A NAME="SEC20" HREF="#TOC1">SUBPATTERNS</A>  <LI><A NAME="SEC21" HREF="#TOC1">SUBPATTERNS</A>
1407  <P>  <P>
1408  Subpatterns are delimited by parentheses (round brackets), which can be nested.  Subpatterns are delimited by parentheses (round brackets), which can be nested.
1409  Marking part of a pattern as a subpattern does two things:  Marking part of a pattern as a subpattern does two things:
# Line 1309  from left to right, and options are not Line 1474  from left to right, and options are not
1474  is reached, an option setting in one branch does affect subsequent branches, so  is reached, an option setting in one branch does affect subsequent branches, so
1475  the above patterns match "SUNDAY" as well as "Saturday".  the above patterns match "SUNDAY" as well as "Saturday".
1476  </P>  </P>
1477  <LI><A NAME="SEC21" HREF="#TOC1">REPETITION</A>  <LI><A NAME="SEC22" HREF="#TOC1">REPETITION</A>
1478  <P>  <P>
1479  Repetition is specified by quantifiers, which can follow any of the following  Repetition is specified by quantifiers, which can follow any of the following
1480  items:  items:
# Line 1484  example, after Line 1649  example, after
1649  <P>  <P>
1650  matches "aba" the value of the second captured substring is "b".  matches "aba" the value of the second captured substring is "b".
1651  </P>  </P>
1652  <LI><A NAME="SEC22" HREF="#TOC1">BACK REFERENCES</A>  <LI><A NAME="SEC23" HREF="#TOC1">BACK REFERENCES</A>
1653  <P>  <P>
1654  Outside a character class, a backslash followed by a digit greater than 0 (and  Outside a character class, a backslash followed by a digit greater than 0 (and
1655  possibly further digits) is a back reference to a capturing subpattern earlier  possibly further digits) is a back reference to a capturing subpattern earlier
# Line 1560  that the first iteration does not need t Line 1725  that the first iteration does not need t
1725  done using alternation, as in the example above, or by a quantifier with a  done using alternation, as in the example above, or by a quantifier with a
1726  minimum of zero.  minimum of zero.
1727  </P>  </P>
1728  <LI><A NAME="SEC23" HREF="#TOC1">ASSERTIONS</A>  <LI><A NAME="SEC24" HREF="#TOC1">ASSERTIONS</A>
1729  <P>  <P>
1730  An assertion is a test on the characters following or preceding the current  An assertion is a test on the characters following or preceding the current
1731  matching point that does not actually consume any characters. The simple  matching point that does not actually consume any characters. The simple
# Line 1718  because it does not make sense for negat Line 1883  because it does not make sense for negat
1883  <P>  <P>
1884  Assertions count towards the maximum of 200 parenthesized subpatterns.  Assertions count towards the maximum of 200 parenthesized subpatterns.
1885  </P>  </P>
1886  <LI><A NAME="SEC24" HREF="#TOC1">ONCE-ONLY SUBPATTERNS</A>  <LI><A NAME="SEC25" HREF="#TOC1">ONCE-ONLY SUBPATTERNS</A>
1887  <P>  <P>
1888  With both maximizing and minimizing repetition, failure of what follows  With both maximizing and minimizing repetition, failure of what follows
1889  normally causes the repeated item to be re-evaluated to see if a different  normally causes the repeated item to be re-evaluated to see if a different
# Line 1782  pattern such as Line 1947  pattern such as
1947  </PRE>  </PRE>
1948  </P>  </P>
1949  <P>  <P>
1950  when applied to a long string which does not match it. Because matching  when applied to a long string which does not match. Because matching proceeds
1951  proceeds from left to right, PCRE will look for each "a" in the subject and  from left to right, PCRE will look for each "a" in the subject and then see if
1952  then see if what follows matches the rest of the pattern. If the pattern is  what follows matches the rest of the pattern. If the pattern is specified as
 specified as  
1953  </P>  </P>
1954  <P>  <P>
1955  <PRE>  <PRE>
# Line 1793  specified as Line 1957  specified as
1957  </PRE>  </PRE>
1958  </P>  </P>
1959  <P>  <P>
1960  then the initial .* matches the entire string at first, but when this fails, it  then the initial .* matches the entire string at first, but when this fails
1961  backtracks to match all but the last character, then all but the last two  (because there is no following "a"), it backtracks to match all but the last
1962  characters, and so on. Once again the search for "a" covers the entire string,  character, then all but the last two characters, and so on. Once again the
1963  from right to left, so we are no better off. However, if the pattern is written  search for "a" covers the entire string, from right to left, so we are no
1964  as  better off. However, if the pattern is written as
1965  </P>  </P>
1966  <P>  <P>
1967  <PRE>  <PRE>
# Line 1810  string. The subsequent lookbehind assert Line 1974  string. The subsequent lookbehind assert
1974  characters. If it fails, the match fails immediately. For long strings, this  characters. If it fails, the match fails immediately. For long strings, this
1975  approach makes a significant difference to the processing time.  approach makes a significant difference to the processing time.
1976  </P>  </P>
1977  <LI><A NAME="SEC25" HREF="#TOC1">CONDITIONAL SUBPATTERNS</A>  <P>
1978    When a pattern contains an unlimited repeat inside a subpattern that can itself
1979    be repeated an unlimited number of times, the use of a once-only subpattern is
1980    the only way to avoid some failing matches taking a very long time indeed.
1981    The pattern
1982    </P>
1983    <P>
1984    <PRE>
1985      (\D+|&#60;\d+&#62;)*[!?]
1986    </PRE>
1987    </P>
1988    <P>
1989    matches an unlimited number of substrings that either consist of non-digits, or
1990    digits enclosed in &#60;&#62;, followed by either ! or ?. When it matches, it runs
1991    quickly. However, if it is applied to
1992    </P>
1993    <P>
1994    <PRE>
1995      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1996    </PRE>
1997    </P>
1998    <P>
1999    it takes a long time before reporting failure. This is because the string can
2000    be divided between the two repeats in a large number of ways, and all have to
2001    be tried. (The example used [!?] rather than a single character at the end,
2002    because both PCRE and Perl have an optimization that allows for fast failure
2003    when a single character is used. They remember the last single character that
2004    is required for a match, and fail early if it is not present in the string.)
2005    If the pattern is changed to
2006    </P>
2007    <P>
2008    <PRE>
2009      ((?&#62;\D+)|&#60;\d+&#62;)*[!?]
2010    </PRE>
2011    </P>
2012    <P>
2013    sequences of non-digits cannot be broken, and failure happens quickly.
2014    </P>
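The effect can be reproduced with a toy backtracking matcher written by hand for this one pattern. It is not PCRE's algorithm and its counter is not a measure of PCRE's internal work; it only models the two behaviours described above, with hypothetical function names. When the atomic flag is set, the run of non-digits is never shortened, which is how this sketch imitates the once-only subpattern.

```c
#include <ctype.h>
#include <stddef.h>

unsigned long tries;   /* how many times the group loop was entered */

/* Toy matcher for (\D+|<\d+>)*[!?] anchored at s[pos]. A nonzero atomic
   makes \D+ behave like (?>\D+): it keeps its longest match only. */
int match_here(const char *s, size_t pos, int atomic)
{
    size_t run = 0, k, j;

    tries++;

    /* First alternative: \D+, trying the longest run first. */
    while (s[pos + run] != '\0' && !isdigit((unsigned char)s[pos + run]))
        run++;
    for (k = run; k >= 1; k--) {
        if (match_here(s, pos + k, atomic))
            return 1;
        if (atomic)
            break;      /* once-only: never give characters back */
    }

    /* Second alternative: <\d+> */
    if (s[pos] == '<') {
        j = pos + 1;
        while (isdigit((unsigned char)s[j]))
            j++;
        if (j > pos + 1 && s[j] == '>' && match_here(s, j + 1, atomic))
            return 1;
    }

    /* No further repetitions of the group: [!?] must match here. */
    return s[pos] == '!' || s[pos] == '?';
}

/* Resets the counter, attempts a match at offset 0, returns the count. */
unsigned long count_tries(const char *s, int atomic)
{
    tries = 0;
    (void)match_here(s, 0, atomic);
    return tries;
}
```

In this model a subject of n non-digit characters with no ! or ? costs 2 to the power n tries when atomic is zero, so each extra character doubles the work, while with atomic set the count stays at 2 however long the subject is.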
2015    <LI><A NAME="SEC26" HREF="#TOC1">CONDITIONAL SUBPATTERNS</A>
2016  <P>  <P>
2017  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
2018  conditionally or to choose between two alternative subpatterns, depending on  conditionally or to choose between two alternative subpatterns, depending on
# Line 1872  subject is matched against the first alt Line 2074  subject is matched against the first alt
2074  against the second. This pattern matches strings in one of the two forms  against the second. This pattern matches strings in one of the two forms
2075  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.  dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
2076  </P>  </P>
2077  <LI><A NAME="SEC26" HREF="#TOC1">COMMENTS</A>  <LI><A NAME="SEC27" HREF="#TOC1">COMMENTS</A>
2078  <P>  <P>
2079  The sequence (?# marks the start of a comment which continues up to the next  The sequence (?# marks the start of a comment which continues up to the next
2080  closing parenthesis. Nested parentheses are not permitted. The characters  closing parenthesis. Nested parentheses are not permitted. The characters
# Line 1883  If the PCRE_EXTENDED option is set, an u Line 2085  If the PCRE_EXTENDED option is set, an u
2085  character class introduces a comment that continues up to the next newline  character class introduces a comment that continues up to the next newline
2086  character in the pattern.  character in the pattern.
2087  </P>  </P>
2088  <LI><A NAME="SEC27" HREF="#TOC1">PERFORMANCE</A>  <LI><A NAME="SEC28" HREF="#TOC1">RECURSIVE PATTERNS</A>
2089    <P>
2090    Consider the problem of matching a string in parentheses, allowing for
2091    unlimited nested parentheses. Without the use of recursion, the best that can
2092    be done is to use a pattern that matches up to some fixed depth of nesting. It
2093    is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an
2094    experimental facility that allows regular expressions to recurse (amongst other
2095    things). It does this by interpolating Perl code in the expression at run time,
2096    and the code can refer to the expression itself. A Perl pattern to solve the
2097    parentheses problem can be created like this:
2098    </P>
2099    <P>
2100    <PRE>
2101      $re = qr{\( (?: (?&#62;[^()]+) | (?p{$re}) )* \)}x;
2102    </PRE>
2103    </P>
2104    <P>
2105    The (?p{...}) item interpolates Perl code at run time, and in this case refers
2106    recursively to the pattern in which it appears. Obviously, PCRE cannot support
2107    the interpolation of Perl code. Instead, the special item (?R) is provided for
2108    the specific case of recursion. This PCRE pattern solves the parentheses
2109    problem (assume the PCRE_EXTENDED option is set so that white space is
2110    ignored):
2111    </P>
2112    <P>
2113    <PRE>
2114      \( ( (?&#62;[^()]+) | (?R) )* \)
2115    </PRE>
2116    </P>
2117    <P>
2118    First it matches an opening parenthesis. Then it matches any number of
2119    substrings which can either be a sequence of non-parentheses, or a recursive
2120    match of the pattern itself (i.e. a correctly parenthesized substring). Finally
2121    there is a closing parenthesis.
2122    </P>
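The three steps just described map directly onto a recursive function. The sketch below is a hand-written matcher for this one pattern, anchored at a given offset; it does not use PCRE, and the function name is hypothetical. The recursive call plays the role of (?R), and stepping over ordinary characters one at a time plays the role of the run of non-parentheses.

```c
/* Returns the index just past the balanced group that starts at s[pos],
   or -1 if s[pos] does not begin a correctly parenthesized substring. */
long match_group(const char *s, long pos)
{
    if (s[pos] != '(')
        return -1;                      /* the opening parenthesis */
    pos++;

    for (;;) {
        if (s[pos] == '(') {            /* recursive match, like (?R) */
            long end = match_group(s, pos);
            if (end < 0)
                return -1;
            pos = end;
        } else if (s[pos] == ')') {     /* the closing parenthesis */
            return pos + 1;
        } else if (s[pos] == '\0') {    /* ran out of subject: no match */
            return -1;
        } else {
            pos++;                      /* one more non-parenthesis */
        }
    }
}
```

A caller that wants an unanchored search would try each offset of the subject in turn until the function returns a non-negative index.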
2123    <P>
2124    This particular example pattern contains nested unlimited repeats, and so the
2125    use of a once-only subpattern for matching strings of non-parentheses is
2126    important when applying the pattern to strings that do not match. For example,
2127    when it is applied to
2128    </P>
2129    <P>
2130    <PRE>
2131      (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2132    </PRE>
2133    </P>
2134    <P>
2135    it yields "no match" quickly. However, if a once-only subpattern is not used,
2136    the match runs for a very long time indeed because there are so many different
2137    ways the + and * repeats can carve up the subject, and all have to be tested
2138    before failure can be reported.
2139    </P>
2140    <P>
2141    The values set for any capturing subpatterns are those from the outermost level
2142    of the recursion at which the subpattern value is set. If the pattern above is
2143    matched against
2144    </P>
2145    <P>
2146    <PRE>
2147      (ab(cd)ef)
2148    </PRE>
2149    </P>
2150    <P>
2151    the value for the capturing parentheses is "ef", which is the last value taken
2152    on at the top level. If additional parentheses are added, giving
2153    </P>
2154    <P>
2155    <PRE>
2156      \( ( ( (?&#62;[^()]+) | (?R) )* ) \)
2157         ^                        ^
2158         ^                        ^
2159    </PRE>
2160    then the string they capture is "ab(cd)ef", the contents of the top level
2161    parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
2162    has to obtain extra memory to store data during a recursion, which it does by
2163    using <B>pcre_malloc</B>, freeing it via <B>pcre_free</B> afterwards. If no
2164    memory can be obtained, it saves data for the first 15 capturing parentheses
2165    only, as there is no way to give an out-of-memory error from within a
2166    recursion.
2167    </P>
2168    <LI><A NAME="SEC29" HREF="#TOC1">PERFORMANCE</A>
2169  <P>  <P>
2170  Certain items that may appear in patterns are more efficient than others. It is  Certain items that may appear in patterns are more efficient than others. It is
2171  more efficient to use a character class like [aeiou] than a set of alternatives  more efficient to use a character class like [aeiou] than a set of alternatives
# Line 1959  with the pattern above. The former gives Line 2241  with the pattern above. The former gives
2241  applied to a whole line of "a" characters, whereas the latter takes an  applied to a whole line of "a" characters, whereas the latter takes an
2242  appreciable time with strings longer than about 20 characters.  appreciable time with strings longer than about 20 characters.
2243  </P>  </P>
2244  <LI><A NAME="SEC28" HREF="#TOC1">AUTHOR</A>  <LI><A NAME="SEC30" HREF="#TOC1">AUTHOR</A>
2245  <P>  <P>
2246  Philip Hazel &#60;ph10@cam.ac.uk&#62;  Philip Hazel &#60;ph10@cam.ac.uk&#62;
2247  <BR>  <BR>
# Line 1972  Cambridge CB2 3QG, England. Line 2254  Cambridge CB2 3QG, England.
2254  Phone: +44 1223 334714  Phone: +44 1223 334714
2255  </P>  </P>
2256  <P>  <P>
2257  Last updated: 29 July 1999  Last updated: 27 January 2000
2258  <BR>  <BR>
2259  Copyright (c) 1997-1999 University of Cambridge.  Copyright (c) 1997-2000 University of Cambridge.

