Diff of /code/trunk/doc/pcre.3

revision 41 by nigel, Sat Feb 24 21:39:17 2007 UTC revision 47 by nigel, Sat Feb 24 21:39:29 2007 UTC
# Line 47  pcre - Perl-compatible regular expressio Line 47  pcre - Perl-compatible regular expressio
47  .B const unsigned char *pcre_maketables(void);  .B const unsigned char *pcre_maketables(void);
48  .PP  .PP
49  .br  .br
50    .B int pcre_fullinfo(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR,"
51    .ti +5n
52    .B int \fIwhat\fR, void *\fIwhere\fR);
53    .PP
54    .br
55  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int  .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int
56  .B *\fIfirstcharptr\fR);  .B *\fIfirstcharptr\fR);
57  .PP  .PP
# Line 64  pcre - Perl-compatible regular expressio Line 69  pcre - Perl-compatible regular expressio
69  .SH DESCRIPTION  .SH DESCRIPTION
70  The PCRE library is a set of functions that implement regular expression  The PCRE library is a set of functions that implement regular expression
71  pattern matching using the same syntax and semantics as Perl 5, with just a few  pattern matching using the same syntax and semantics as Perl 5, with just a few
72  differences (see below). The current implementation corresponds to Perl 5.005.  differences (see below). The current implementation corresponds to Perl 5.005,
73    with some additional features from the Perl development release.
74    
75  PCRE has its own native API, which is described in this document. There is also  PCRE has its own native API, which is described in this document. There is also
76  a set of wrapper functions that correspond to the POSIX API. These are  a set of wrapper functions that correspond to the POSIX regular expression API.
77  described in the \fBpcreposix\fR documentation.  These are described in the \fBpcreposix\fR documentation.
78    
79  The native API function prototypes are defined in the header file \fBpcre.h\fR,  The native API function prototypes are defined in the header file \fBpcre.h\fR,
80  and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be  and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be
81  accessed by adding \fB-lpcre\fR to the command for linking an application which  accessed by adding \fB-lpcre\fR to the command for linking an application which
82  calls it.  calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to
83    contain the major and minor release numbers for the library. Applications can
84    use these to include support for different releases.
85    
86  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR  The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR
87  are used for compiling and matching regular expressions, while  are used for compiling and matching regular expressions, while
# Line 83  captured substrings from a matched subje Line 91  captured substrings from a matched subje
91  \fBpcre_maketables()\fR is used (optionally) to build a set of character tables  \fBpcre_maketables()\fR is used (optionally) to build a set of character tables
92  in the current locale for passing to \fBpcre_compile()\fR.  in the current locale for passing to \fBpcre_compile()\fR.
93    
94  The function \fBpcre_info()\fR is used to find out information about a compiled  The function \fBpcre_fullinfo()\fR is used to find out information about a
95  pattern, while the function \fBpcre_version()\fR returns a pointer to a string  compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only
96  containing the version of PCRE and its date of release.  some of the available information, but is retained for backwards compatibility.
97    The function \fBpcre_version()\fR returns a pointer to a string containing the
98    version of PCRE and its date of release.
99    
100  The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain  The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain
101  the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions  the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions
# Line 182  sequence (?( which introduces a conditio Line 192  sequence (?( which introduces a conditio
192    
193    PCRE_EXTRA    PCRE_EXTRA
194    
195  This option turns on additional functionality of PCRE that is incompatible with  This option was invented in order to turn on additional functionality of PCRE
196  Perl. Any backslash in a pattern that is followed by a letter that has no  that is incompatible with Perl, but it is currently of very little use. When
197    set, any backslash in a pattern that is followed by a letter that has no
198  special meaning causes an error, thus reserving these combinations for future  special meaning causes an error, thus reserving these combinations for future
199  expansion. By default, as in Perl, a backslash followed by a letter with no  expansion. By default, as in Perl, a backslash followed by a letter with no
200  special meaning is treated as a literal. There are at present no other features  special meaning is treated as a literal. There are at present no other features
201  controlled by this option.  controlled by this option. It can also be set by a (?X) option setting within a
202    pattern.
203    
204    PCRE_MULTILINE    PCRE_MULTILINE
205    
# Line 261  memory containing the tables remains ava Line 273  memory containing the tables remains ava
273    
274    
275  .SH INFORMATION ABOUT A PATTERN  .SH INFORMATION ABOUT A PATTERN
276  The \fBpcre_info()\fR function returns information about a compiled pattern.  The \fBpcre_fullinfo()\fR function returns information about a compiled
277  Its yield is the number of capturing subpatterns, or one of the following  pattern. It replaces the obsolete \fBpcre_info()\fR function, which is
278  negative numbers:  nevertheless retained for backwards compatibility (and is documented below).
279    
280    The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled
281    pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if
282    the pattern was not studied. The third argument specifies which piece of
283    information is required, while the fourth argument is a pointer to a variable
284    to receive the data. The yield of the function is zero for success, or one of
285    the following negative numbers:
286    
287    PCRE_ERROR_NULL       the argument \fIcode\fR was NULL    PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
288                            the argument \fIwhere\fR was NULL
289    PCRE_ERROR_BADMAGIC   the "magic number" was not found    PCRE_ERROR_BADMAGIC   the "magic number" was not found
290      PCRE_ERROR_BADOPTION  the value of \fIwhat\fR was invalid
291    
292  If the \fIoptptr\fR argument is not NULL, a copy of the options with which the  The possible values for the third argument are defined in \fBpcre.h\fR, and are
293  pattern was compiled is placed in the integer it points to. These option bits  as follows:
294    
295      PCRE_INFO_OPTIONS
296    
297    Return a copy of the options with which the pattern was compiled. The fourth
298    argument should point to an \fBunsigned long int\fR variable. These option bits
299  are those specified in the call to \fBpcre_compile()\fR, modified by any  are those specified in the call to \fBpcre_compile()\fR, modified by any
300  top-level option settings within the pattern itself, and with the PCRE_ANCHORED  top-level option settings within the pattern itself, and with the PCRE_ANCHORED
301  bit set if the form of the pattern implies that it can match only at the start  bit forcibly set if the form of the pattern implies that it can match only at
302  of a subject string.  the start of a subject string.
303    
304  If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,    PCRE_INFO_SIZE
305  it is used to pass back information about the first character of any matched  
306  string. If there is a fixed first character, e.g. from a pattern such as  Return the size of the compiled pattern, that is, the value that was passed as
307  (cat|cow|coyote), then it is returned in the integer pointed to by  the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to
308  \fIfirstcharptr\fR. Otherwise, if either  place the compiled data. The fourth argument should point to a \fBsize_t\fR
309    variable.
310    
311      PCRE_INFO_CAPTURECOUNT
312    
313    Return the number of capturing subpatterns in the pattern. The fourth argument
314    should point to an \fBint\fR variable.
315    
316      PCRE_INFO_BACKREFMAX
317    
318    Return the number of the highest back reference in the pattern. The fourth
319    argument should point to an \fBint\fR variable. Zero is returned if there are
320    no back references.
321    
322      PCRE_INFO_FIRSTCHAR
323    
324    Return information about the first character of any matched string, for a
325    non-anchored pattern. If there is a fixed first character, e.g. from a pattern
326    such as (cat|cow|coyote), it is returned in the integer pointed to by
327    \fIwhere\fR. Otherwise, if either
328    
329  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch  (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch
330  starts with "^", or  starts with "^", or
# Line 287  starts with "^", or Line 332  starts with "^", or
332  (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set  (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not set
333  (if it were set, the pattern would be anchored),  (if it were set, the pattern would be anchored),
334    
335  then -1 is returned, indicating that the pattern matches only at the  -1 is returned, indicating that the pattern matches only at the start of a
336  start of a subject string or after any "\\n" within the string. Otherwise -2 is  subject string or after any "\\n" within the string. Otherwise -2 is returned.
337  returned.  For anchored patterns, -2 is returned.
338    
339      PCRE_INFO_FIRSTTABLE
340    
341    If the pattern was studied, and this resulted in the construction of a 256-bit
342    table indicating a fixed set of characters for the first character in any
343    matching string, a pointer to the table is returned. Otherwise NULL is
344    returned. The fourth argument should point to an \fBunsigned char *\fR
345    variable.
346    
347      PCRE_INFO_LASTLITERAL
348    
349    For a non-anchored pattern, return the value of the rightmost literal character
350    which must exist in any matched string, other than at its start. The fourth
351    argument should point to an \fBint\fR variable. If there is no such character,
352    or if the pattern is anchored, -1 is returned. For example, for the pattern
353    /a\\d+z\\d+/ the returned value is 'z'.
354    
355    The \fBpcre_info()\fR function is now obsolete because its interface is too
356    restrictive to return all the available data about a compiled pattern. New
357    programs should use \fBpcre_fullinfo()\fR instead. The yield of
358    \fBpcre_info()\fR is the number of capturing subpatterns, or one of the
359    following negative numbers:
360    
361      PCRE_ERROR_NULL       the argument \fIcode\fR was NULL
362      PCRE_ERROR_BADMAGIC   the "magic number" was not found
363    
364    If the \fIoptptr\fR argument is not NULL, a copy of the options with which the
365    pattern was compiled is placed in the integer it points to (see
366    PCRE_INFO_OPTIONS above).
367    
368    If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL,
369    it is used to pass back information about the first character of any matched
370    string (see PCRE_INFO_FIRSTCHAR above).
371    
372    
373  .SH MATCHING A PATTERN  .SH MATCHING A PATTERN
# Line 472  is a pointer to the vector of integer of Line 550  is a pointer to the vector of integer of
550  were captured by the match, including the substring that matched the entire  were captured by the match, including the substring that matched the entire
551  regular expression. This is the value returned by \fBpcre_exec\fR if it  regular expression. This is the value returned by \fBpcre_exec\fR if it
552  is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it  is greater than zero. If \fBpcre_exec()\fR returned zero, indicating that it
553  ran out of space in \fIovector\fR, then the value passed as  ran out of space in \fIovector\fR, the value passed as \fIstringcount\fR should
554  \fIstringcount\fR should be the size of the vector divided by three.  be the size of the vector divided by three.
555    
556  The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR  The functions \fBpcre_copy_substring()\fR and \fBpcre_get_substring()\fR
557  extract a single substring, whose number is given as \fIstringnumber\fR. A  extract a single substring, whose number is given as \fIstringnumber\fR. A
# Line 564  are not part of its pattern matching eng Line 642  are not part of its pattern matching eng
642  6. The Perl \\G assertion is not supported as it is not relevant to single  6. The Perl \\G assertion is not supported as it is not relevant to single
643  pattern matches.  pattern matches.
644    
645  7. Fairly obviously, PCRE does not support the (?{code}) construction.  7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})
646    constructions. However, there is some experimental support for recursive
647    patterns using the non-Perl item (?R).
648    
649  8. There are at the time of writing some oddities in Perl 5.005_02 concerned  8. There are at the time of writing some oddities in Perl 5.005_02 concerned
650  with the settings of captured strings when part of a pattern is repeated. For  with the settings of captured strings when part of a pattern is repeated. For
651  example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value  example, matching "aba" against the pattern /^(a(b)?)+$/ sets $2 to the value
652  "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if  "b", but matching "aabbaa" against /^(aa(bb)?)+$/ leaves $2 unset. However, if
653  the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) get set.  the pattern is changed to /^(aa(b(b))?)+$/ then $2 (and $3) are set.
654    
655  In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the  In Perl 5.004 $2 is set in both cases, and that is also true of PCRE. If in the
656  future Perl changes to a consistent state that is different, PCRE may change to  future Perl changes to a consistent state that is different, PCRE may change to
# Line 602  of the subject. Line 682  of the subject.
682  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for  (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for
683  \fBpcre_exec()\fR have no Perl equivalents.  \fBpcre_exec()\fR have no Perl equivalents.
684    
685    (g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do
686    this using the (?p{code}) construct, which PCRE cannot of course support.)
687    
688    
689  .SH REGULAR EXPRESSION DETAILS  .SH REGULAR EXPRESSION DETAILS
690  The syntax and semantics of the regular expressions supported by PCRE are  The syntax and semantics of the regular expressions supported by PCRE are
691  described below. Regular expressions are also described in the Perl  described below. Regular expressions are also described in the Perl
692  documentation and in a number of other books, some of which have copious  documentation and in a number of other books, some of which have copious
693  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by  examples. Jeffrey Friedl's "Mastering Regular Expressions", published by
694  O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description  O'Reilly (ISBN 1-56592-257), covers them in great detail. The description
695  here is intended as reference documentation.  here is intended as reference documentation.
696    
697  A regular expression is a pattern that is matched against a subject string from  A regular expression is a pattern that is matched against a subject string from
# Line 837  end of the subject in both modes, and if Line 920  end of the subject in both modes, and if
920  .SH FULL STOP (PERIOD, DOT)  .SH FULL STOP (PERIOD, DOT)
921  Outside a character class, a dot in the pattern matches any one character in  Outside a character class, a dot in the pattern matches any one character in
922  the subject, including a non-printing character, but not (by default) newline.  the subject, including a non-printing character, but not (by default) newline.
923  If the PCRE_DOTALL option is set, then dots match newlines as well. The  If the PCRE_DOTALL option is set, dots match newlines as well. The handling of
924  handling of dot is entirely independent of the handling of circumflex and  dot is entirely independent of the handling of circumflex and dollar, the only
925  dollar, the only relationship being that they both involve newline characters.  relationship being that they both involve newline characters. Dot has no
926  Dot has no special meaning in a character class.  special meaning in a character class.
927    
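The handling of dot described above can be tried out quickly. The following is a minimal sketch that uses Python's standard re module as a stand-in, on the assumption that its re.DOTALL flag behaves like PCRE_DOTALL for this purpose; it is not PCRE itself, and the subject strings are made up for illustration.

```python
import re

subject = "a\nb"

# By default a dot does not match a newline.
default_match = re.search(r"a.b", subject)

# With DOTALL set, a dot matches a newline as well.
dotall_match = re.search(r"a.b", subject, re.DOTALL)

# Inside a character class a dot is an ordinary character.
class_match = re.search(r"a[.]b", "a.b")
class_miss = re.search(r"a[.]b", "axb")
```

Only the second search succeeds on the two-line subject, and the bracketed dot accepts a literal "." but not "x".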
928    
929  .SH SQUARE BRACKETS  .SH SQUARE BRACKETS
# Line 906  terminating ] are non-special in charact Line 989  terminating ] are non-special in charact
989  are escaped.  are escaped.
990    
991    
992    .SH POSIX CHARACTER CLASSES
993    Perl 5.6 (not yet released at the time of writing) is going to support the
994    POSIX notation for character classes, which uses names enclosed by [: and :]
995    within the enclosing square brackets. PCRE supports this notation. For example,
996    
997      [01[:alpha:]%]
998    
999    matches "0", "1", any alphabetic character, or "%". The supported class names
1000    are
1001    
1002      alnum    letters and digits
1003      alpha    letters
1004      ascii    character codes 0 - 127
1005      cntrl    control characters
1006      digit    decimal digits (same as \\d)
1007      graph    printing characters, excluding space
1008      lower    lower case letters
1009      print    printing characters, including space
1010      punct    printing characters, excluding letters and digits
1011      space    white space (same as \\s)
1012      upper    upper case letters
1013      word     "word" characters (same as \\w)
1014      xdigit   hexadecimal digits
1015    
1016    The names "ascii" and "word" are Perl extensions. Another Perl extension is
1017    negation, which is indicated by a ^ character after the colon. For example,
1018    
1019      [12[:^digit:]]
1020    
1021    matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX
1022    syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1023    supported, and an error is given if they are encountered.
1024    
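To make the two example classes above concrete, here is a rough sketch of what they accept. Python's re module does not itself understand the [:name:] notation, so the classes are written out by hand; the expansions assume ASCII ("C" locale) semantics, and the helper name matches_one is hypothetical.

```python
import re

# Hand-written ASCII equivalents (an assumption: "C" locale semantics).
# [01[:alpha:]%]  ->  [01A-Za-z%]
alpha_like = re.compile(r"[01A-Za-z%]")
# [12[:^digit:]]  ->  [12\D]
non_digit_like = re.compile(r"[12\D]", re.ASCII)


def matches_one(pattern, ch):
    """Return True if the single character ch is accepted by pattern."""
    return pattern.fullmatch(ch) is not None
```

With these definitions, "0", "q" and "%" are accepted by the first class, while "3" is the kind of character rejected by the second.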
1025    
1026  .SH VERTICAL BAR  .SH VERTICAL BAR
1027  Vertical bar characters are used to separate alternative patterns. For example,  Vertical bar characters are used to separate alternative patterns. For example,
1028  the pattern  the pattern
# Line 1096  to the string Line 1213  to the string
1213  fails, because it matches the entire string due to the greediness of the .*  fails, because it matches the entire string due to the greediness of the .*
1214  item.  item.
1215    
1216  However, if a quantifier is followed by a question mark, then it ceases to be  However, if a quantifier is followed by a question mark, it ceases to be
1217  greedy, and instead matches the minimum number of times possible, so the  greedy, and instead matches the minimum number of times possible, so the
1218  pattern  pattern
1219    
# Line 1112  own right. Because it has two uses, it c Line 1229  own right. Because it has two uses, it c
1229  which matches one digit by preference, but can match two if that is the only  which matches one digit by preference, but can match two if that is the only
1230  way the rest of the pattern matches.  way the rest of the pattern matches.
1231    
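The difference between greedy and minimizing quantifiers is easy to observe. The sketch below uses Python's re module, which shares this part of the Perl syntax; the comment-like subject string is invented for illustration, and the last two lines use the doubled-question-mark form \d??\d that the text above describes.

```python
import re

subject = "/* first */ x /* second */"

# Greedy: .* runs on to the last "*/" in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first "*/".
lazy = re.search(r"/\*.*?\*/", subject).group()

# \d??\d prefers to match one digit ...
one_digit = re.match(r"\d??\d", "12").group()
# ... but takes two when that is the only way the rest can match.
two_digits = re.match(r"\d??\d$", "12").group()
```

Python has no counterpart of PCRE_UNGREEDY, so the inverted default described in the next paragraph is not shown here.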
1232  If the PCRE_UNGREEDY option is set (an option which is not available in Perl)  If the PCRE_UNGREEDY option is set (an option which is not available in Perl),
1233  then the quantifiers are not greedy by default, but individual ones can be made  the quantifiers are not greedy by default, but individual ones can be made
1234  greedy by following them with a question mark. In other words, it inverts the  greedy by following them with a question mark. In other words, it inverts the
1235  default behaviour.  default behaviour.
1236    
# Line 1122  is greater than 1 or with a limited maxi Line 1239  is greater than 1 or with a limited maxi
1239  compiled pattern, in proportion to the size of the minimum or maximum.  compiled pattern, in proportion to the size of the minimum or maximum.
1240    
1241  If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent  If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent
1242  to Perl's /s) is set, thus allowing the . to match newlines, then the pattern  to Perl's /s) is set, thus allowing the . to match newlines, the pattern is
1243  is implicitly anchored, because whatever follows will be tried against every  implicitly anchored, because whatever follows will be tried against every
1244  character position in the subject string, so there is no point in retrying the  character position in the subject string, so there is no point in retrying the
1245  overall match at any position after the first. PCRE treats such a pattern as  overall match at any position after the first. PCRE treats such a pattern as
1246  though it were preceded by \\A. In cases where it is known that the subject  though it were preceded by \\A. In cases where it is known that the subject
# Line 1167  itself. So the pattern Line 1284  itself. So the pattern
1284    
1285  matches "sense and sensibility" and "response and responsibility", but not  matches "sense and sensibility" and "response and responsibility", but not
1286  "sense and responsibility". If caseful matching is in force at the time of the  "sense and responsibility". If caseful matching is in force at the time of the
1287  back reference, then the case of letters is relevant. For example,  back reference, the case of letters is relevant. For example,
1288    
1289    ((?i)rah)\\s+\\1    ((?i)rah)\\s+\\1
1290    
# Line 1175  matches "rah rah" and "RAH RAH", but not Line 1292  matches "rah rah" and "RAH RAH", but not
1292  capturing subpattern is matched caselessly.  capturing subpattern is matched caselessly.
1293    
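The caseful back reference can be checked with a small experiment. This is a sketch in Python's re module rather than PCRE; recent Python versions reject a (?i) setting in the middle of a pattern, so the scoped form (?i:...) is used instead, on the assumption that it is equivalent for this example. The helper name matches is hypothetical.

```python
import re

# (?i:...) limits caseless matching to the group; the back reference
# outside it is compared casefully against what the group captured.
pattern = re.compile(r"((?i:rah))\s+\1")


def matches(subject):
    """Return True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None
```

Here matches("rah rah") and matches("RAH RAH") succeed, while matches("RAH rah") does not.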
1294  There may be more than one back reference to the same subpattern. If a  There may be more than one back reference to the same subpattern. If a
1295  subpattern has not actually been used in a particular match, then any back  subpattern has not actually been used in a particular match, any back
1296  references to it always fail. For example, the pattern  references to it always fail. For example, the pattern
1297    
1298    (a|(bc))\\2    (a|(bc))\\2
# Line 1183  references to it always fail. For exampl Line 1300  references to it always fail. For exampl
1300  always fails if it starts to match "a" rather than "bc". Because there may be  always fails if it starts to match "a" rather than "bc". Because there may be
1301  up to 99 back references, all digits following the backslash are taken  up to 99 back references, all digits following the backslash are taken
1302  as part of a potential back reference number. If the pattern continues with a  as part of a potential back reference number. If the pattern continues with a
1303  digit character, then some delimiter must be used to terminate the back  digit character, some delimiter must be used to terminate the back reference.
1304  reference. If the PCRE_EXTENDED option is set, this can be whitespace.  If the PCRE_EXTENDED option is set, this can be whitespace. Otherwise an empty
1305  Otherwise an empty comment can be used.  comment can be used.
1306    
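Both points in the paragraph above, the failing reference to an unused subpattern and the empty comment as a delimiter, can be sketched as follows. Python's re module is used as a stand-in that behaves the same way for these cases; the subject strings are made up.

```python
import re

pattern = re.compile(r"(a|(bc))\2")

# Group 2 takes no part when the first alternative matches "a",
# so the back reference \2 fails.
unset_case = pattern.search("aa")

# When "bc" is matched, group 2 is set and \2 can succeed.
set_case = pattern.search("bcbc")

# An empty comment (?#) terminates the back reference so that a
# literal digit can follow it.
delimited = re.fullmatch(r"(a)\1(?#)1", "aa1")
```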
1307  A back reference that occurs inside the parentheses to which it refers fails  A back reference that occurs inside the parentheses to which it refers fails
1308  when the subpattern is first used, so, for example, (a\\1) never matches.  when the subpattern is first used, so, for example, (a\\1) never matches.
# Line 1273  Several assertions (of any sort) may occ Line 1390  Several assertions (of any sort) may occ
1390  matches "foo" preceded by three digits that are not "999". Notice that each of  matches "foo" preceded by three digits that are not "999". Notice that each of
1391  the assertions is applied independently at the same point in the subject  the assertions is applied independently at the same point in the subject
1392  string. First there is a check that the previous three characters are all  string. First there is a check that the previous three characters are all
1393  digits, then there is a check that the same three characters are not "999".  digits, and then there is a check that the same three characters are not "999".
1394  This pattern does \fInot\fR match "foo" preceded by six characters, the first  This pattern does \fInot\fR match "foo" preceded by six characters, the first
1395  of which are digits and the last three of which are not "999". For example, it  of which are digits and the last three of which are not "999". For example, it
1396  doesn't match "123abcfoo". A pattern to do that is  doesn't match "123abcfoo". A pattern to do that is
# Line 1352  pattern such as Line 1469  pattern such as
1469    
1470    abcd$    abcd$
1471    
1472  when applied to a long string which does not match it. Because matching  when applied to a long string which does not match. Because matching proceeds
1473  proceeds from left to right, PCRE will look for each "a" in the subject and  from left to right, PCRE will look for each "a" in the subject and then see if
1474  then see if what follows matches the rest of the pattern. If the pattern is  what follows matches the rest of the pattern. If the pattern is specified as
 specified as  
1475    
1476    ^.*abcd$    ^.*abcd$
1477    
1478  then the initial .* matches the entire string at first, but when this fails, it  the initial .* matches the entire string at first, but when this fails (because
1479  backtracks to match all but the last character, then all but the last two  there is no following "a"), it backtracks to match all but the last character,
1480  characters, and so on. Once again the search for "a" covers the entire string,  then all but the last two characters, and so on. Once again the search for "a"
1481  from right to left, so we are no better off. However, if the pattern is written  covers the entire string, from right to left, so we are no better off. However,
1482  as  if the pattern is written as
1483    
1484    ^(?>.*)(?<=abcd)    ^(?>.*)(?<=abcd)
1485    
1486  then there can be no backtracking for the .* item; it can match only the entire  there can be no backtracking for the .* item; it can match only the entire
1487  string. The subsequent lookbehind assertion does a single test on the last four  string. The subsequent lookbehind assertion does a single test on the last four
1488  characters. If it fails, the match fails immediately. For long strings, this  characters. If it fails, the match fails immediately. For long strings, this
1489  approach makes a significant difference to the processing time.  approach makes a significant difference to the processing time.
1490    
1491    When a pattern contains an unlimited repeat inside a subpattern that can itself
1492    be repeated an unlimited number of times, the use of a once-only subpattern is
1493    the only way to avoid some failing matches taking a very long time indeed.
1494    The pattern
1495    
1496      (\\D+|<\\d+>)*[!?]
1497    
1498    matches an unlimited number of substrings that either consist of non-digits, or
1499    digits enclosed in <>, followed by either ! or ?. When it matches, it runs
1500    quickly. However, if it is applied to
1501    
1502      aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
1503    
1504    it takes a long time before reporting failure. This is because the string can
1505    be divided between the two repeats in a large number of ways, and all have to
1506    be tried. (The example used [!?] rather than a single character at the end,
1507    because both PCRE and Perl have an optimization that allows for fast failure
1508    when a single character is used. They remember the last single character that
1509    is required for a match, and fail early if it is not present in the string.)
1510    If the pattern is changed to
1511    
1512      ((?>\\D+)|<\\d+>)*[!?]
1513    
1514    sequences of non-digits cannot be broken, and failure happens quickly.
1515    
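The quick failure can be reproduced without a once-only subpattern by a well-known substitute: a lookahead that captures the run, followed by a back reference to it. The sketch below uses that swapped-in technique in Python's re module (older Python versions have no (?>...) syntax); it is an emulation under that assumption, not the PCRE construct itself. The unprotected pattern is deliberately not run, since it would take a very long time on this subject.

```python
import re

# Emulation of ((?>\D+)|<\d+>)*[!?]: the lookahead captures the longest
# run of non-digits and \2 consumes exactly that run, so the run cannot
# be given back piece by piece when the rest of the pattern fails.
atomic_like = re.compile(r"((?=(\D+))\2|<\d+>)*[!?]")

long_subject = "a" * 52

no_match = atomic_like.search(long_subject)   # reports failure quickly
found = atomic_like.search("aaa<12>!")        # still finds a match
```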
1516    
1517  .SH CONDITIONAL SUBPATTERNS  .SH CONDITIONAL SUBPATTERNS
1518  It is possible to cause the matching process to obey a subpattern  It is possible to cause the matching process to obey a subpattern
# Line 1387  no-pattern (if present) is used. If ther Line 1528  no-pattern (if present) is used. If ther
1528  subpattern, a compile-time error occurs.  subpattern, a compile-time error occurs.
1529    
1530  There are two kinds of condition. If the text between the parentheses consists  There are two kinds of condition. If the text between the parentheses consists
1531  of a sequence of digits, then the condition is satisfied if the capturing  of a sequence of digits, the condition is satisfied if the capturing subpattern
1532  subpattern of that number has previously matched. Consider the following  of that number has previously matched. Consider the following pattern, which
1533  pattern, which contains non-significant white space to make it more readable  contains non-significant white space to make it more readable (assume the
1534  (assume the PCRE_EXTENDED option) and to divide it into three parts for ease  PCRE_EXTENDED option) and to divide it into three parts for ease of discussion:
 of discussion:  
1535    
1536    ( \\( )?    [^()]+    (?(1) \\) )    ( \\( )?    [^()]+    (?(1) \\) )
1537    
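The conditional pattern above can be exercised directly, since Python's re module accepts the same (?(1)...) syntax. In this sketch re.VERBOSE stands in for PCRE_EXTENDED, and the helper name whole_match is hypothetical.

```python
import re

# re.VERBOSE plays the role of PCRE_EXTENDED here: white space in the
# pattern is ignored.
pattern = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)


def whole_match(subject):
    """Return True if the entire subject matches the pattern."""
    return pattern.fullmatch(subject) is not None
```

A closing parenthesis is required exactly when an opening one was matched, so "(abc)" and "abc" are accepted as whole subjects, while "(abc" and "abc)" are not.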
# Line 1431  character class introduces a comment tha Line 1571  character class introduces a comment tha
1571  character in the pattern.  character in the pattern.
1572    
1573    
1574    .SH RECURSIVE PATTERNS
1575    Consider the problem of matching a string in parentheses, allowing for
1576    unlimited nested parentheses. Without the use of recursion, the best that can
1577    be done is to use a pattern that matches up to some fixed depth of nesting. It
1578    is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an
1579    experimental facility that allows regular expressions to recurse (amongst other
1580    things). It does this by interpolating Perl code in the expression at run time,
1581    and the code can refer to the expression itself. A Perl pattern to solve the
1582    parentheses problem can be created like this:
1583    
1584      $re = qr{\\( (?: (?>[^()]+) | (?p{$re}) )* \\)}x;
1585    
1586    The (?p{...}) item interpolates Perl code at run time, and in this case refers
1587    recursively to the pattern in which it appears. Obviously, PCRE cannot support
1588    the interpolation of Perl code. Instead, the special item (?R) is provided for
1589    the specific case of recursion. This PCRE pattern solves the parentheses
1590    problem (assume the PCRE_EXTENDED option is set so that white space is
1591    ignored):
1592    
1593      \\( ( (?>[^()]+) | (?R) )* \\)
1594    
1595    First it matches an opening parenthesis. Then it matches any number of
1596    substrings which can either be a sequence of non-parentheses, or a recursive
1597    match of the pattern itself (i.e. a correctly parenthesized substring). Finally
1598    there is a closing parenthesis.
1599    
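The three steps just described can be sketched as an ordinary recursive function. This is a hand-written illustration of what the pattern accepts, not how PCRE implements (?R); Python's re module has no recursion item, and the function name match_parens is hypothetical.

```python
def match_parens(subject, start=0):
    """Return the index just past a balanced "(...)" that begins at
    start, or None if there is no such match at that position."""
    if start >= len(subject) or subject[start] != "(":
        return None
    pos = start + 1
    while pos < len(subject):
        ch = subject[pos]
        if ch == ")":
            return pos + 1                     # closing parenthesis
        if ch == "(":
            end = match_parens(subject, pos)   # the (?R) step
            if end is None:
                return None
            pos = end
        else:
            pos += 1                           # run of non-parentheses
    return None                                # subject ended before ")"
```

For example, match_parens("(ab(cd)ef)") consumes the whole subject, while a subject whose outer parenthesis is never closed yields None.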
1600    This particular example pattern contains nested unlimited repeats, and so the
1601    use of a once-only subpattern for matching strings of non-parentheses is
1602    important when applying the pattern to strings that do not match. For example,
1603    when it is applied to
1604    
1605      (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
1606    
1607    it yields "no match" quickly. However, if a once-only subpattern is not used,
1608    the match runs for a very long time indeed because there are so many different
1609    ways the + and * repeats can carve up the subject, and all have to be tested
1610    before failure can be reported.
1611    
1612    The values set for any capturing subpatterns are those from the outermost level
1613    of the recursion at which the subpattern value is set. If the pattern above is
1614    matched against
1615    
1616      (ab(cd)ef)
1617    
1618    the value for the capturing parentheses is "ef", which is the last value taken
1619    on at the top level. If additional parentheses are added, giving
1620    
1621      \\( ( ( (?>[^()]+) | (?R) )* ) \\)
1622         ^                        ^
1623         ^                        ^
1624    the string they capture is "ab(cd)ef", the contents of the top level
1625    parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE
1626    has to obtain extra memory to store data during a recursion, which it does by
1627    using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no
1628    memory can be obtained, it saves data for the first 15 capturing parentheses
1629    only, as there is no way to give an out-of-memory error from within a
1630    recursion.
1631    
1632    
1633  .SH PERFORMANCE  .SH PERFORMANCE
1634  Certain items that may appear in patterns are more efficient than others. It is  Certain items that may appear in patterns are more efficient than others. It is
1635  more efficient to use a character class like [aeiou] than a set of alternatives  more efficient to use a character class like [aeiou] than a set of alternatives
# Line 1497  Cambridge CB2 3QG, England. Line 1696  Cambridge CB2 3QG, England.
1696  .br  .br
1697  Phone: +44 1223 334714  Phone: +44 1223 334714
1698    
1699  Last updated: 29 July 1999  Last updated: 27 January 2000
1700  .br  .br
1701  Copyright (c) 1997-1999 University of Cambridge.  Copyright (c) 1997-2000 University of Cambridge.
