Contents of /code/trunk/doc/pcrepartial.3

Revision 426 - (show annotations)
Wed Aug 26 15:38:32 2009 UTC (12 years, 1 month ago) by ph10
Remove restrictions on pcre_exec() partial matching.
PCRE - Perl-compatible regular expressions
.rs
.sp
In normal use of PCRE, if the subject string that is passed to
\fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
.P
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form \fIddmmmyy\fP, defined by this pattern:
.sp
  ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
.sp
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, possibly beeping and not reflecting the
character that has been typed. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered.
.P
PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
option, which can be set when calling \fBpcre_exec()\fP or
\fBpcre_dfa_exec()\fP.
.P
When PCRE_PARTIAL is set for \fBpcre_exec()\fP, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time during
the matching process the last part of the subject string matched part of the
pattern. If there are at least two slots in the offsets vector, they are filled
in with the offsets of the longest found string that partially matched. No
other captured data is set when PCRE_ERROR_PARTIAL is returned. The second
offset is always that for the end of the subject. Consider this pattern:
.sp
  /123\ew+X|dogY/
.sp
If this is matched against the subject string "abc123dog", both
alternatives fail to match, but the end of the subject is reached, so
PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH if the
PCRE_PARTIAL option is set. The offsets are set to 3 and 9, identifying
"123dog" as the longest partial match that was found. (In this example, there
are two partial matches, because "dog" on its own partially matches the second
alternative.)
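.P
As a rough illustration of how a caller might use the two offsets, the C sketch
below copies out the partially matched text. It assumes that the return code
and offsets vector come from a \fBpcre_exec()\fP call made with PCRE_PARTIAL.
The helper name partial_text() and the macro SKETCH_ERROR_PARTIAL are
hypothetical; the macro repeats the value -12 that pcre.h gives to
PCRE_ERROR_PARTIAL, so that the fragment compiles without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumption: same value as PCRE_ERROR_PARTIAL in pcre.h, repeated here so
   that this sketch does not need the PCRE headers. */
#define SKETCH_ERROR_PARTIAL (-12)

/* Hypothetical helper: if rc reports a partial match and the offsets vector
   has at least two slots, copy subject[ovector[0]..ovector[1]) into out as a
   NUL-terminated string and return its length. Return -1 otherwise. */
static int partial_text(int rc, const char *subject, const int *ovector,
                        int ovecsize, char *out, size_t outsize)
{
    int len;

    assert(subject != NULL && out != NULL && outsize > 0);
    if (rc != SKETCH_ERROR_PARTIAL || ovecsize < 2)
        return -1;                  /* no partial-match offsets available */
    len = ovector[1] - ovector[0];
    if (len < 0 || (size_t)len >= outsize)
        return -1;                  /* invalid offsets, or text does not fit */
    memcpy(out, subject + ovector[0], (size_t)len);
    out[len] = '\0';
    return len;
}
```

With the subject "abc123dog" and the offsets 3 and 9 from the example above,
this helper would produce "123dog".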
.P
When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that provided the
longest partial match is set as the first matching string, provided there are
at least two slots in the offsets vector.
.P
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
last literal byte in a pattern, and abandons matching immediately if such a
byte is not present in the subject string. This optimization cannot be used
for a subject string that might match only partially.
.
.
.rs
.sp
For releases of PCRE prior to 8.00, because of the way certain internal
optimizations were implemented in the \fBpcre_exec()\fP function, the
PCRE_PARTIAL option could not be used with all patterns. From release 8.00
onwards, the restrictions no longer apply, and partial matching can be
requested for any pattern.
.P
Items that were formerly restricted were repeated single characters and
repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
conform to the restrictions, \fBpcre_exec()\fP returned the error code
PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
PCRE_INFO_OKPARTIAL call to \fBpcre_fullinfo()\fP to find out if a compiled
pattern can be used for partial matching now always returns 1.
.
.
.rs
.sp
If the escape sequence \eP is present in a \fBpcretest\fP data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of \fBpcretest\fP that
uses the date example quoted above:
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 25jun04\eP
   0: 25jun04
   1: jun
  data> 25dec3\eP
  Partial match: 25dec3
  data> 3ju\eP
  Partial match: 3ju
  data> 3juj\eP
  No match
  data> j\eP
  No match
.sp
The first data string is matched completely, so \fBpcretest\fP shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. Similar output is obtained
when \fBpcre_dfa_exec()\fP is used.
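.P
The three outcomes in the run above can also be reproduced without PCRE, which
may help when reasoning about what "partial" means for this pattern. The C
sketch below is a hand-written classifier for the date pattern only. It is not
how PCRE works internally, all of its names (classify_date() and the DATE_
codes) are hypothetical, and it treats an empty string as no match, which is
an assumption of the sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes for this sketch; they are not PCRE constants. */
enum { DATE_NOMATCH = 0, DATE_PARTIAL = 1, DATE_FULL = 2 };

static const char *const months[] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Return nonzero if the first n bytes of s (n <= 3) begin some month name. */
static int month_prefix(const char *s, size_t n)
{
    size_t i;
    for (i = 0; i < sizeof months / sizeof months[0]; i++)
        if (strncmp(s, months[i], n) == 0)
            return 1;
    return 0;
}

/* Test s against one fixed layout: day_digits digits, a month, two digits. */
static int classify_layout(const char *s, size_t len, size_t day_digits)
{
    size_t total = day_digits + 5;   /* day + three letters + two digits */
    size_t i, mlen;

    if (len == 0 || len > total)
        return DATE_NOMATCH;         /* empty input counts as no match here */
    for (i = 0; i < len && i < day_digits; i++)
        if (!isdigit((unsigned char)s[i]))
            return DATE_NOMATCH;
    if (len > day_digits) {
        mlen = len - day_digits;
        if (mlen > 3)
            mlen = 3;
        if (!month_prefix(s + day_digits, mlen))
            return DATE_NOMATCH;
    }
    for (i = day_digits + 3; i < len; i++)
        if (!isdigit((unsigned char)s[i]))
            return DATE_NOMATCH;
    return len == total ? DATE_FULL : DATE_PARTIAL;
}

/* Classify s by trying the one-digit and two-digit day layouts; a full match
   under either layout wins over a partial match. */
int classify_date(const char *s)
{
    size_t len;
    int one, two;

    assert(s != NULL);
    len = strlen(s);
    one = classify_layout(s, len, 1);
    two = classify_layout(s, len, 2);
    return one > two ? one : two;
}
```

An application that checks keystrokes one by one could call classify_date()
after each character and reject the keystroke when the result is DATE_NOMATCH.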
.
.
.rs
.sp
Certain types of pattern may behave in unintuitive ways when partial matching
is requested, whichever matching function is used. For example, matching a
pattern that ends with (*FAIL), or any other assertion that causes a match to
fail without inspecting any data, yields PCRE_ERROR_PARTIAL rather than
PCRE_ERROR_NOMATCH:
.sp
    re> /a+(*FAIL)/
  data> aaa\eP
  Partial match: aaa
.sp
Although (*FAIL) itself could possibly be made a special case, there are other
assertions, for example (?!), which behave in the same way, and it is not
possible to catch all cases. For consistency, therefore, there are no
exceptions to the rule that PCRE_ERROR_PARTIAL is returned instead of
PCRE_ERROR_NOMATCH if at any time during the match the end of the subject
string was reached.
.
.
.SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
.rs
.sp
When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
to continue the match by providing additional subject data and calling
\fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
time setting the PCRE_DFA_RESTART option. You must also pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is a \fBpcretest\fP example in which the \eR escape
sequence sets the PCRE_DFA_RESTART option (\eP sets the PCRE_PARTIAL option,
and \eD specifies the use of \fBpcre_dfa_exec()\fP):
.sp
    re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
  data> 23ja\eP\eD
  Partial match: 23ja
  data> n05\eR\eD
   0: n05
.sp
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
.P
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
over multiple segments. This facility can be used to pass very long subject
strings to \fBpcre_dfa_exec()\fP.
.
.
.rs
.sp
From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
previous match with a new segment of data. Instead, new data must be added to
the previous subject string, and the entire match re-run, starting from the
point where the partial match occurred. Earlier data can be discarded.
Consider an unanchored pattern that matches dates:
.sp
    re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
  data> The date is 23ja\eP
  Partial match: 23ja
.sp
At this stage, an application could discard the text preceding "23ja", add on
text from the next segment, and call \fBpcre_exec()\fP again. Unlike
\fBpcre_dfa_exec()\fP, the entire matching string must always be available, and
the complete matching process occurs for each call, so more memory and more
processing time are needed.
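.P
The discard-and-append bookkeeping described above can be sketched in C as
follows. To keep the fragment self-contained, a toy function toy_exec() that
searches for a literal string stands in for \fBpcre_exec()\fP with
PCRE_PARTIAL; it is hypothetical, as are feed_segment(), struct
segment_matcher and the buffer size. The two TOY_ codes repeat the values of
PCRE_ERROR_NOMATCH (-1) and PCRE_ERROR_PARTIAL (-12).

```c
#include <assert.h>
#include <string.h>

/* Assumption: these mirror PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL. */
#define TOY_NOMATCH (-1)
#define TOY_PARTIAL (-12)
#define BUF_SIZE    256

/* Hypothetical stand-in for pcre_exec() with PCRE_PARTIAL: look for the
   literal needle in subject[0..len). Returns 1 and sets ov[0], ov[1] on a
   complete match, TOY_PARTIAL (with offsets) when the end of the subject is
   a proper prefix of the needle, and TOY_NOMATCH otherwise. */
static int toy_exec(const char *needle, const char *subject, int len, int *ov)
{
    int nlen = (int)strlen(needle);
    int start, i;

    for (start = 0; start < len; start++) {
        for (i = 0; i < nlen && start + i < len; i++)
            if (subject[start + i] != needle[i])
                break;
        if (i == nlen) { ov[0] = start; ov[1] = start + nlen; return 1; }
        if (start + i == len) { ov[0] = start; ov[1] = len; return TOY_PARTIAL; }
    }
    return TOY_NOMATCH;
}

struct segment_matcher {
    const char *needle;   /* literal standing in for a compiled pattern */
    char buf[BUF_SIZE];   /* subject text retained between calls */
    int used;             /* number of bytes of buf in use */
};

/* Append one segment and rerun the whole match over the retained text. On a
   partial match, drop everything before the partial match so that the next
   call starts from that point. On a complete match, ov[] indexes m->buf. */
static int feed_segment(struct segment_matcher *m, const char *seg, int *ov)
{
    int seglen = (int)strlen(seg);
    int rc;

    assert(m->used + seglen <= BUF_SIZE);
    memcpy(m->buf + m->used, seg, (size_t)seglen);
    m->used += seglen;

    rc = toy_exec(m->needle, m->buf, m->used, ov);
    if (rc == TOY_PARTIAL) {
        m->used -= ov[0];                        /* keep only the tail */
        memmove(m->buf, m->buf + ov[0], (size_t)m->used);
    } else if (rc == TOY_NOMATCH) {
        m->used = 0;    /* for a literal, nothing retained can start a match */
    }
    return rc;
}
```

Feeding "The date is 23ja" leaves only "23ja" in the buffer; feeding a second
segment that begins "n05" then completes the match at offset 0. With a real
pattern, the decision about how much earlier text may be dropped also depends
on the issues discussed in the following section.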
.
.
.rs
.sp
Certain types of pattern may give problems with multi-segment matching,
whichever matching function is used.
.P
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
.P
2. If the pattern contains backward assertions (including \eb or \eB), you need
to arrange for some overlap in the subject strings to allow for this. For
example, using \fBpcre_dfa_exec()\fP, you could pass the subject in chunks that
are 500 bytes long, but in a buffer of 700 bytes, with the starting offset set
to 200 and the previous 200 bytes at the start of the buffer.
.P
3. Matching a subject string that is split into multiple segments does not
always produce exactly the same result as matching over one single long string.
The difference arises when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches. This
means that as soon as the shortest match has been found, continuation to a new
subject segment is no longer possible. Consider this \fBpcretest\fP example:
.sp
    re> /dog(sbody)?/
  data> dogsb\eP
   0: dog
  data> do\eP\eD
  Partial match: do
  data> gsb\eR\eP\eD
   0: g
  data> dogsbody\eD
   0: dogsbody
   1: dog
.sp
The pattern matches "dog" or "dogsbody". The first data line passes the string
"dogsb" to \fBpcre_exec()\fP, setting the PCRE_PARTIAL option. Although the
string is a partial match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
because the shorter string "dog" is a complete match. Similarly, when the
subject is presented to \fBpcre_dfa_exec()\fP in several parts ("do" and "gsb"
being the first two) the match stops when "dog" has been found, and it is not
possible to continue. On the other hand, if "dogsbody" is presented as a single
string, \fBpcre_dfa_exec()\fP finds both matches.
.P
Because of this phenomenon, it does not usually make sense to end a pattern
that is going to be matched in this way with a variable repeat.
.P
4. Patterns that contain alternatives at the top level that do not all
start with the same pattern item may not work as expected when
\fBpcre_dfa_exec()\fP is used. For example, consider this pattern:
.sp
  1234|3789
.sp
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "7890" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
.sp
  1234|ABCD
.sp
where no string can be a partial match for both alternatives. This is not a
problem if \fBpcre_exec()\fP is used, because the entire match has to be rerun
each time:
.sp
    re> /1234|3789/
  data> ABC123\eP
  Partial match: 123
  data> 1237890
   0: 3789
.sp
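.P
The overlap arrangement mentioned in item 2 above (500-byte chunks in a
700-byte buffer, with the previous 200 bytes kept at the start) can be
sketched in C as shown below. Only the buffer handling is shown; the call to
the matching function is left out. The names struct overlap_buf and
load_chunk() are hypothetical, and the sizes are simply the ones used in the
text.

```c
#include <assert.h>
#include <string.h>

#define CHUNK    500                 /* new bytes supplied per call */
#define OVERLAP  200                 /* bytes carried over for lookbehind */
#define BUF_SIZE (CHUNK + OVERLAP)   /* 700 bytes, as in the text */

struct overlap_buf {
    char data[BUF_SIZE];
    int length;         /* number of valid bytes in data */
    int start_offset;   /* starting offset to pass to the matching function */
};

/* Load the next chunk of n bytes (n <= CHUNK). Up to OVERLAP bytes from the
   end of the previous contents are first moved to the start of the buffer,
   so that backward assertions can inspect them. Afterwards data[0..length)
   is the subject and start_offset is where matching should begin. */
static void load_chunk(struct overlap_buf *b, const char *chunk, int n)
{
    int keep = b->length < OVERLAP ? b->length : OVERLAP;

    assert(n >= 0 && n <= CHUNK);
    memmove(b->data, b->data + b->length - keep, (size_t)keep);
    memcpy(b->data + keep, chunk, (size_t)n);
    b->length = keep + n;
    b->start_offset = keep;
}
```

For the first chunk there is no earlier text, so the starting offset is 0; for
each later full-sized chunk it is 200, matching the description in item 2.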
.
.
.rs
.sp
.nf
Philip Hazel
University Computing Service
Cambridge CB2 3QH, England.
.fi
.
.
.rs
.sp
.nf
Last updated: 26 August 2009
Copyright (c) 1997-2009 University of Cambridge.
.fi