--- code/trunk/doc/pcre.3 2007/02/24 21:39:17 41 +++ code/trunk/doc/pcre.3 2007/02/24 21:39:21 43 @@ -47,6 +47,11 @@ .B const unsigned char *pcre_maketables(void); .PP .br +.B int pcre_fullinfo(const pcre *\fIcode\fR, "const pcre_extra *\fIextra\fR," +.ti +5n +.B int \fIwhat\fR, void *\fIwhere\fR); +.PP +.br .B int pcre_info(const pcre *\fIcode\fR, int *\fIoptptr\fR, int .B *\fIfirstcharptr\fR); .PP @@ -64,16 +69,19 @@ .SH DESCRIPTION The PCRE library is a set of functions that implement regular expression pattern matching using the same syntax and semantics as Perl 5, with just a few -differences (see below). The current implementation corresponds to Perl 5.005. +differences (see below). The current implementation corresponds to Perl 5.005, +with some additional features from the Perl development release. PCRE has its own native API, which is described in this document. There is also -a set of wrapper functions that correspond to the POSIX API. These are -described in the \fBpcreposix\fR documentation. +a set of wrapper functions that correspond to the POSIX regular expression API. +These are described in the \fBpcreposix\fR documentation. The native API function prototypes are defined in the header file \fBpcre.h\fR, and on Unix systems the library itself is called \fBlibpcre.a\fR, so can be accessed by adding \fB-lpcre\fR to the command for linking an application which -calls it. +calls it. The header file defines the macros PCRE_MAJOR and PCRE_MINOR to +contain the major and minor release numbers for the library. Applications can +use these to include support for different releases. The functions \fBpcre_compile()\fR, \fBpcre_study()\fR, and \fBpcre_exec()\fR are used for compiling and matching regular expressions, while @@ -83,9 +91,11 @@ \fBpcre_maketables()\fR is used (optionally) to build a set of character tables in the current locale for passing to \fBpcre_compile()\fR. 
-The function \fBpcre_info()\fR is used to find out information about a compiled -pattern, while the function \fBpcre_version()\fR returns a pointer to a string -containing the version of PCRE and its date of release. +The function \fBpcre_fullinfo()\fR is used to find out information about a +compiled pattern; \fBpcre_info()\fR is an obsolete version which returns only +some of the available information, but is retained for backwards compatibility. +The function \fBpcre_version()\fR returns a pointer to a string containing the +version of PCRE and its date of release. The global variables \fBpcre_malloc\fR and \fBpcre_free\fR initially contain the entry points of the standard \fBmalloc()\fR and \fBfree()\fR functions @@ -182,12 +192,14 @@ PCRE_EXTRA -This option turns on additional functionality of PCRE that is incompatible with -Perl. Any backslash in a pattern that is followed by a letter that has no +This option was invented in order to turn on additional functionality of PCRE +that is incompatible with Perl, but it is currently of very little use. When +set, any backslash in a pattern that is followed by a letter that has no special meaning causes an error, thus reserving these combinations for future expansion. By default, as in Perl, a backslash followed by a letter with no special meaning is treated as a literal. There are at present no other features -controlled by this option. +controlled by this option. It can also be set by a (?X) option setting within a +pattern. PCRE_MULTILINE @@ -261,25 +273,58 @@ .SH INFORMATION ABOUT A PATTERN -The \fBpcre_info()\fR function returns information about a compiled pattern. -Its yield is the number of capturing subpatterns, or one of the following -negative numbers: +The \fBpcre_fullinfo()\fR function returns information about a compiled +pattern. It replaces the obsolete \fBpcre_info()\fR function, which is +nevertheless retained for backwards compatibility (and is documented below). 
+ +The first argument for \fBpcre_fullinfo()\fR is a pointer to the compiled +pattern. The second argument is the result of \fBpcre_study()\fR, or NULL if +the pattern was not studied. The third argument specifies which piece of +information is required, while the fourth argument is a pointer to a variable +to receive the data. The yield of the function is zero for success, or one of +the following negative numbers: PCRE_ERROR_NULL the argument \fIcode\fR was NULL + the argument \fIwhere\fR was NULL PCRE_ERROR_BADMAGIC the "magic number" was not found + PCRE_ERROR_BADOPTION the value of \fIwhat\fR was invalid -If the \fIoptptr\fR argument is not NULL, a copy of the options with which the -pattern was compiled is placed in the integer it points to. These option bits +The possible values for the third argument are defined in \fBpcre.h\fR, and are +as follows: + + PCRE_INFO_OPTIONS + +Return a copy of the options with which the pattern was compiled. The fourth +argument should point to an \fBunsigned long int\fR variable. These option bits are those specified in the call to \fBpcre_compile()\fR, modified by any top-level option settings within the pattern itself, and with the PCRE_ANCHORED -bit set if the form of the pattern implies that it can match only at the start -of a subject string. +bit forcibly set if the form of the pattern implies that it can match only at +the start of a subject string. -If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, -it is used to pass back information about the first character of any matched -string. If there is a fixed first character, e.g. from a pattern such as -(cat|cow|coyote), then it is returned in the integer pointed to by -\fIfirstcharptr\fR. Otherwise, if either + PCRE_INFO_SIZE + +Return the size of the compiled pattern, that is, the value that was passed as +the argument to \fBpcre_malloc()\fR when PCRE was getting memory in which to +place the compiled data. 
The fourth argument should point to a \fBsize_t\fR +variable. + + PCRE_INFO_CAPTURECOUNT + +Return the number of capturing subpatterns in the pattern. The fourth argument +should point to an \fBint\fR variable. + + PCRE_INFO_BACKREFMAX + +Return the number of the highest back reference in the pattern. The fourth +argument should point to an \fBint\fR variable. Zero is returned if there are +no back references. + + PCRE_INFO_FIRSTCHAR + +Return information about the first character of any matched string, for a +non-anchored pattern. If there is a fixed first character, e.g. from a pattern +such as (cat|cow|coyote), then it is returned in the integer pointed to by +\fIwhere\fR. Otherwise, if either (a) the pattern was compiled with the PCRE_MULTILINE option, and every branch starts with "^", or @@ -289,7 +334,40 @@ then -1 is returned, indicating that the pattern matches only at the start of a subject string or after any "\\n" within the string. Otherwise -2 is -returned. +returned. For anchored patterns, -2 is returned. + + PCRE_INFO_FIRSTTABLE + +If the pattern was studied, and this resulted in the construction of a 256-bit +table indicating a fixed set of characters for the first character in any +matching string, a pointer to the table is returned. Otherwise NULL is +returned. The fourth argument should point to an \fBunsigned char *\fR +variable. + + PCRE_INFO_LASTLITERAL + +For a non-anchored pattern, return the value of the rightmost literal character +which must exist in any matched string, other than at its start. The fourth +argument should point to an \fBint\fR variable. If there is no such character, +or if the pattern is anchored, -1 is returned. For example, for the pattern +/a\\d+z\\d+/ the returned value is 'z'. + +The \fBpcre_info()\fR function is now obsolete because its interface is too +restrictive to return all the available data about a compiled pattern. New +programs should use \fBpcre_fullinfo()\fR instead. 
The yield of +\fBpcre_info()\fR is the number of capturing subpatterns, or one of the +following negative numbers: + + PCRE_ERROR_NULL the argument \fIcode\fR was NULL + PCRE_ERROR_BADMAGIC the "magic number" was not found + +If the \fIoptptr\fR argument is not NULL, a copy of the options with which the +pattern was compiled is placed in the integer it points to (see +PCRE_INFO_OPTIONS above). + +If the pattern is not anchored and the \fIfirstcharptr\fR argument is not NULL, +it is used to pass back information about the first character of any matched +string (see PCRE_INFO_FIRSTCHAR above). .SH MATCHING A PATTERN @@ -564,7 +642,9 @@ 6. The Perl \\G assertion is not supported as it is not relevant to single pattern matches. -7. Fairly obviously, PCRE does not support the (?{code}) construction. +7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code}) +constructions. However, there is some experimental support for recursive +patterns using the non-Perl item (?R). 8. There are at the time of writing some oddities in Perl 5.005_02 concerned with the settings of captured strings when part of a pattern is repeated. For @@ -602,13 +682,16 @@ (f) The PCRE_NOTBOL, PCRE_NOTEOL, and PCRE_NOTEMPTY options for \fBpcre_exec()\fR have no Perl equivalents. +(g) The (?R) construct allows for recursive pattern matching (Perl 5.6 can do +this using the (?p{code}) construct, which PCRE cannot of course support.) + .SH REGULAR EXPRESSION DETAILS The syntax and semantics of the regular expressions supported by PCRE are described below. Regular expressions are also described in the Perl documentation and in a number of other books, some of which have copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published by -O'Reilly (ISBN 1-56592-257-3), covers them in great detail. The description +O'Reilly (ISBN 1-56592-257), covers them in great detail. The description here is intended as reference documentation. 
A regular expression is a pattern that is matched against a subject string from @@ -906,6 +989,40 @@ are escaped. +.SH POSIX CHARACTER CLASSES +Perl 5.6 (not yet released at the time of writing) is going to support the +POSIX notation for character classes, which uses names enclosed by [: and :] +within the enclosing square brackets. PCRE supports this notation. For example, + + [01[:alpha:]%] + +matches "0", "1", any alphabetic character, or "%". The supported class names +are + + alnum letters and digits + alpha letters + ascii character codes 0 - 127 + cntrl control characters + digit decimal digits (same as \\d) + graph printing characters, excluding space + lower lower case letters + print printing characters, including space + punct printing characters, excluding letters and digits + space white space (same as \\s) + upper upper case letters + word "word" characters (same as \\w) + xdigit hexadecimal digits + +The names "ascii" and "word" are Perl extensions. Another Perl extension is +negation, which is indicated by a ^ character after the colon. For example, + + [12[:^digit:]] + +matches "1", "2", or any non-digit. PCRE (and Perl) also recognize the POSIX +syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not +supported, and an error is given if they are encountered. + + .SH VERTICAL BAR Vertical bar characters are used to separate alternative patterns. For example, the pattern @@ -1352,18 +1469,17 @@ abcd$ -when applied to a long string which does not match it. Because matching -proceeds from left to right, PCRE will look for each "a" in the subject and -then see if what follows matches the rest of the pattern. If the pattern is -specified as +when applied to a long string which does not match. Because matching proceeds +from left to right, PCRE will look for each "a" in the subject and then see if +what follows matches the rest of the pattern. 
If the pattern is specified as ^.*abcd$ -then the initial .* matches the entire string at first, but when this fails, it -backtracks to match all but the last character, then all but the last two -characters, and so on. Once again the search for "a" covers the entire string, -from right to left, so we are no better off. However, if the pattern is written -as +then the initial .* matches the entire string at first, but when this fails +(because there is no following "a"), it backtracks to match all but the last +character, then all but the last two characters, and so on. Once again the +search for "a" covers the entire string, from right to left, so we are no +better off. However, if the pattern is written as ^(?>.*)(?<=abcd) @@ -1372,6 +1488,31 @@ characters. If it fails, the match fails immediately. For long strings, this approach makes a significant difference to the processing time. +When a pattern contains an unlimited repeat inside a subpattern that can itself +be repeated an unlimited number of times, the use of a once-only subpattern is +the only way to avoid some failing matches taking a very long time indeed. +The pattern + + (\\D+|<\\d+>)*[!?] + +matches an unlimited number of substrings that either consist of non-digits, or +digits enclosed in <>, followed by either ! or ?. When it matches, it runs +quickly. However, if it is applied to + + aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa + +it takes a long time before reporting failure. This is because the string can +be divided between the two repeats in a large number of ways, and all have to +be tried. (The example used [!?] rather than a single character at the end, +because both PCRE and Perl have an optimization that allows for fast failure +when a single character is used. They remember the last single character that +is required for a match, and fail early if it is not present in the string.) +If the pattern is changed to + + ((?>\\D+)|<\\d+>)*[!?] 
+ +sequences of non-digits cannot be broken, and failure happens quickly. + .SH CONDITIONAL SUBPATTERNS It is possible to cause the matching process to obey a subpattern @@ -1431,6 +1572,65 @@ character in the pattern. +.SH RECURSIVE PATTERNS +Consider the problem of matching a string in parentheses, allowing for +unlimited nested parentheses. Without the use of recursion, the best that can +be done is to use a pattern that matches up to some fixed depth of nesting. It +is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an +experimental facility that allows regular expressions to recurse (amongst other +things). It does this by interpolating Perl code in the expression at run time, +and the code can refer to the expression itself. A Perl pattern to solve the +parentheses problem can be created like this: + + $re = qr{\\( (?: (?>[^()]+) | (?p{$re}) )* \\)}x; + +The (?p{...}) item interpolates Perl code at run time, and in this case refers +recursively to the pattern in which it appears. Obviously, PCRE cannot support +the interpolation of Perl code. Instead, the special item (?R) is provided for +the specific case of recursion. This PCRE pattern solves the parentheses +problem (assume the PCRE_EXTENDED option is set so that white space is +ignored): + + \\( ( (?>[^()]+) | (?R) )* \\) + +First it matches an opening parenthesis. Then it matches any number of +substrings which can either be a sequence of non-parentheses, or a recursive +match of the pattern itself (i.e. a correctly parenthesized substring). Finally +there is a closing parenthesis. + +This particular example pattern contains nested unlimited repeats, and so the +use of a once-only subpattern for matching strings of non-parentheses is +important when applying the pattern to strings that do not match. For example, +when it is applied to + + (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() + +it yields "no match" quickly. 
However, if a once-only subpattern is not used, +the match runs for a very long time indeed because there are so many different +ways the + and * repeats can carve up the subject, and all have to be tested +before failure can be reported. + +The values set for any capturing subpatterns are those from the outermost level +of the recursion at which the subpattern value is set. If the pattern above is +matched against + + (ab(cd)ef) + +the value for the capturing parentheses is "ef", which is the last value taken +on at the top level. If additional parentheses are added, giving + + \\( ( ( (?>[^()]+) | (?R) )* ) \\) + ^ ^ + ^ ^ +then the string they capture is "ab(cd)ef", the contents of the top level +parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE +has to obtain extra memory to store data during a recursion, which it does by +using \fBpcre_malloc\fR, freeing it via \fBpcre_free\fR afterwards. If no +memory can be obtained, it saves data for the first 15 capturing parentheses +only, as there is no way to give an out-of-memory error from within a +recursion. + + .SH PERFORMANCE Certain items that may appear in patterns are more efficient than others. It is more efficient to use a character class like [aeiou] than a set of alternatives @@ -1497,6 +1697,6 @@ .br Phone: +44 1223 334714 -Last updated: 29 July 1999 +Last updated: 27 January 2000 .br -Copyright (c) 1997-1999 University of Cambridge. +Copyright (c) 1997-2000 University of Cambridge.
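
Reviewer's note on the RECURSIVE PATTERNS section added by this change: the language accepted by the (?R) parentheses pattern can be illustrated without PCRE by a small hand-written recursive function. The sketch below is in Python and is only an illustration of what the pattern matches; it is not how PCRE implements recursion, and the name match_parens is hypothetical. It also assumes the match is attempted at one fixed position, whereas PCRE would search along the subject.

```python
def match_parens(s, pos=0):
    """Return the index just past a balanced "(...)" group that starts
    at s[pos], or -1 if no such group starts there.

    Hypothetical helper: mirrors the shape of
        \\( ( (?>[^()]+) | (?R) )* \\)
    but is not PCRE code.
    """
    # The leading \( of the pattern.
    if pos >= len(s) or s[pos] != "(":
        return -1
    pos += 1

    # The ( ... | ... )* loop: repeat until a ")" or the end of input.
    while pos < len(s) and s[pos] != ")":
        if s[pos] == "(":
            # The (?R) branch: recurse on a nested group.
            pos = match_parens(s, pos)
            if pos < 0:
                return -1
        else:
            # The (?>[^()]+) branch: consume the whole run of
            # non-parentheses and never hand any of it back.
            while pos < len(s) and s[pos] not in "()":
                pos += 1

    # The trailing \) of the pattern.
    if pos < len(s):
        return pos + 1
    return -1


if __name__ == "__main__":
    print(match_parens("(ab(cd)ef)"))          # balanced subject
    print(match_parens("(" + "a" * 53 + "()")) # unbalanced subject
```

Because each run of non-parentheses is consumed in one step and never split, the unbalanced subject from the text is rejected after a single left-to-right pass, which is the behaviour the once-only subpattern is meant to secure in the regular expression.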