--- code/trunk/doc/pcrepartial.3 2007/02/24 21:40:45 77
+++ code/trunk/doc/pcrepartial.3 2007/02/24 21:41:42 93
@@ -1,4 +1,4 @@
-.TH PCRE 3
+.TH PCREPARTIAL 3
 .SH NAME
 PCRE - Perl-compatible regular expressions
 .SH "PARTIAL MATCHING IN PCRE"
@@ -81,22 +81,23 @@
 uses the date example quoted above:
 .sp
     re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
-  data> 25jun04\P
+  data> 25jun04\eP
    0: 25jun04
    1: jun
-  data> 25dec3\P
+  data> 25dec3\eP
   Partial match
-  data> 3ju\P
+  data> 3ju\eP
   Partial match
-  data> 3juj\P
+  data> 3juj\eP
   No match
-  data> j\P
+  data> j\eP
   No match
 .sp
 The first data string is matched completely, so \fBpcretest\fP shows the
 matched substrings. The remaining four strings do not match the complete
-pattern, but the first two are partial matches. The same test, using DFA
-matching (by means of the \eD escape sequence), produces the following output:
+pattern, but the first two are partial matches. The same test, using
+\fBpcre_dfa_exec()\fP matching (by means of the \eD escape sequence), produces
+the following output:
 .sp
     re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
   data> 25jun04\eP\eD
@@ -119,11 +120,11 @@
 .sp
 When a partial match has been found using \fBpcre_dfa_exec()\fP, it is
 possible to continue the match by providing additional subject data and calling
-\fBpcre_dfa_exec()\fP again with the PCRE_DFA_RESTART option and the same
-working space (where details of the previous partial match are stored). Here is
-an example using \fBpcretest\fP, where the \eR escape sequence sets the
-PCRE_DFA_RESTART option and the \eD escape sequence requests the use of
-\fBpcre_dfa_exec()\fP:
+\fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
+time setting the PCRE_DFA_RESTART option. You must also pass the same working
+space as before, because this is where details of the previous partial match
+are stored. Here is an example using \fBpcretest\fP, using the \eR escape
+sequence to set the PCRE_DFA_RESTART option (\eP and \eD are as above):
 .sp
     re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
   data> 23ja\eP\eD
@@ -137,9 +138,10 @@
 not retain the previously partially-matched string. It is up to the calling
 program to do that if it needs to.
 .P
-This facility can be used to pass very long subject strings to
-\fBpcre_dfa_exec()\fP. However, some care is needed for certain types of
-pattern.
+You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
+over multiple segments. This facility can be used to pass very long subject
+strings to \fBpcre_dfa_exec()\fP. However, some care is needed for certain
+types of pattern.
 .P
 1. If the pattern contains tests for the beginning or end of a line, you need
 to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
@@ -147,7 +149,7 @@
 .P
 2. If the pattern contains backward assertions (including \eb or \eB), you need
 to arrange for some overlap in the subject strings to allow for this. For
-example, you could pass the subject in chunks that were 500 bytes long, but in
+example, you could pass the subject in chunks that are 500 bytes long, but in
 a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
 bytes at the start of the buffer.
 .P
@@ -175,10 +177,29 @@
 .P
 Because of this phenomenon, it does not usually make sense to end a pattern
 that is going to be matched in this way with a variable repeat.
+.P
+4. Patterns that contain alternatives at the top level which do not all
+start with the same pattern item may not work as expected. For example,
+consider this pattern:
+.sp
+  1234|3789
+.sp
+If the first part of the subject is "ABC123", a partial match of the first
+alternative is found at offset 3. There is no partial match for the second
+alternative, because such a match does not start at the same point in the
+subject string. Attempting to continue with the string "789" does not yield a
+match because only those alternatives that match at one point in the subject
+are remembered. The problem arises because the start of the second alternative
+matches within the first alternative. There is no problem with anchored
+patterns or patterns such as:
+.sp
+  1234|ABCD
+.sp
+where no string can be a partial match for both alternatives.
 .
 .
 .P
 .in 0
-Last updated: 28 February 2005
+Last updated: 30 November 2006
 .br
-Copyright (c) 1997-2005 University of Cambridge.
+Copyright (c) 1997-2006 University of Cambridge.
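
The overlap arrangement described in point 2 of the patched text (500-byte chunks in a 700-byte buffer, with the previous 200 bytes at the start and a starting offset of 200) can be sketched in C roughly as follows. This is only an illustration of the buffer handling, not code from PCRE: the names feed_in_segments, segment_fn, seg_log and log_segment are hypothetical, and the callback stands in for the place where a program would call \fBpcre_dfa_exec()\fP with PCRE_PARTIAL (and PCRE_DFA_RESTART on later segments). The sketch assumes that the very first segment has no earlier data, so its starting offset is 0.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK   500   /* bytes of new subject data per segment */
#define OVERLAP 200   /* bytes of earlier data kept for backward assertions */
#define MAX_LOG 8     /* capacity of the demonstration log below */

/* Hypothetical callback: stands in for the actual matching call.
   buf holds `length` bytes; new data begins at `start_offset`. */
typedef void (*segment_fn)(const char *buf, size_t length,
                           size_t start_offset, void *ctx);

/* Pass `subject` to `fn` in CHUNK-sized pieces, each preceded by up to
   OVERLAP bytes of the data that came before it. Returns the number of
   segments passed to `fn`. */
static size_t feed_in_segments(const char *subject, size_t subject_len,
                               segment_fn fn, void *ctx)
{
    char buf[OVERLAP + CHUNK];
    size_t kept = 0;    /* bytes of earlier data at the start of buf */
    size_t pos = 0;     /* bytes of subject consumed so far */
    size_t calls = 0;

    assert(fn != NULL);

    while (pos < subject_len) {
        size_t n = subject_len - pos;
        if (n > CHUNK)
            n = CHUNK;

        /* New data goes directly after the retained overlap. */
        memcpy(buf + kept, subject + pos, n);
        fn(buf, kept + n, kept, ctx);
        calls++;
        pos += n;

        /* Retain the last OVERLAP bytes (or fewer, if less is available)
           at the start of the buffer for the next segment. */
        size_t total = kept + n;
        size_t keep = total < OVERLAP ? total : OVERLAP;
        memmove(buf, buf + total - keep, keep);
        kept = keep;
    }
    return calls;
}

/* Demonstration callback: records what each segment looked like. */
struct seg_log {
    size_t count;
    size_t length[MAX_LOG];
    size_t start[MAX_LOG];
    char first_old[MAX_LOG];   /* first byte in the buffer */
    char first_new[MAX_LOG];   /* first byte of the new data */
};

static void log_segment(const char *buf, size_t length,
                        size_t start_offset, void *ctx)
{
    struct seg_log *rec = ctx;
    if (rec->count >= MAX_LOG)
        return;
    rec->length[rec->count] = length;
    rec->start[rec->count] = start_offset;
    rec->first_old[rec->count] = buf[0];
    rec->first_new[rec->count] = buf[start_offset];
    rec->count++;
}
```

In a real program the callback would also have to deal with the other points listed in the patched text, for example passing PCRE_NOTBOL or PCRE_NOTEOL where appropriate; the sketch leaves that out.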