Contents of /code/trunk/doc/html/pcrepartial.html

Revision 453
Fri Sep 18 19:12:35 2009 UTC by ph10
File MIME type: text/html
File size: 17496 byte(s)
Add more explanation about recursive subpatterns, and make it possible to 
process the documentation without building a whole release.
1 <html>
2 <head>
3 <title>pcrepartial specification</title>
4 </head>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcrepartial man page</h1>
7 <p>
8 Return to the <a href="index.html">PCRE index page</a>.
9 </p>
10 <p>
11 This page is part of the PCRE HTML documentation. It was generated automatically
12 from the original man page. If there is any nonsense in it, please consult the
13 man page, in case the conversion went wrong.
14 <br>
15 <ul>
16 <li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
17 <li><a name="TOC2" href="#SEC2">PARTIAL MATCHING USING pcre_exec()</a>
18 <li><a name="TOC3" href="#SEC3">PARTIAL MATCHING USING pcre_dfa_exec()</a>
19 <li><a name="TOC4" href="#SEC4">PARTIAL MATCHING AND WORD BOUNDARIES</a>
20 <li><a name="TOC5" href="#SEC5">FORMERLY RESTRICTED PATTERNS</a>
21 <li><a name="TOC6" href="#SEC6">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
22 <li><a name="TOC7" href="#SEC7">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
23 <li><a name="TOC8" href="#SEC8">MULTI-SEGMENT MATCHING WITH pcre_exec()</a>
24 <li><a name="TOC9" href="#SEC9">ISSUES WITH MULTI-SEGMENT MATCHING</a>
25 <li><a name="TOC10" href="#SEC10">AUTHOR</a>
26 <li><a name="TOC11" href="#SEC11">REVISION</a>
27 </ul>
28 <br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
29 <P>
30 In normal use of PCRE, if the subject string that is passed to
31 <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
32 too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
33 are circumstances where it might be helpful to distinguish this case from other
34 cases in which there is no match.
35 </P>
36 <P>
37 Consider, for example, an application where a human is required to type in data
38 for a field with specific formatting requirements. An example might be a date
39 in the form <i>ddmmmyy</i>, defined by this pattern:
40 <pre>
41 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
42 </pre>
43 If the application sees the user's keystrokes one by one, and can check that
44 what has been typed so far is potentially valid, it is able to raise an error
45 as soon as a mistake is made, by beeping and not reflecting the character that
46 has been typed, for example. This immediate feedback is likely to be a better
47 user interface than a check that is delayed until the entire string has been
48 entered. Partial matching can also sometimes be useful when the subject string
49 is very long and is not all available at once.
50 </P>
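<P>
The three outcomes that such an application has to tell apart (complete match,
partial match, no match) can be illustrated without PCRE at all. The following
is a minimal hand-written sketch for the <i>ddmmmyy</i> pattern only; the names
<b>date_status</b>, <b>try_shape()</b> and <b>check_date()</b> are hypothetical
and are not part of the PCRE API. It is meant to show the idea of classifying
input typed so far, not how PCRE implements partial matching.
</P>

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical names for illustration; none of these come from PCRE. */
typedef enum { DATE_NOMATCH, DATE_PARTIAL, DATE_COMPLETE } date_status;

static const char *const MONTHS[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Compare s against one concrete shape: day_digits digits, a month name,
   then two digits. A proper prefix of the shape counts as partial. */
static date_status try_shape(const char *s, size_t len,
                             size_t day_digits, const char *month)
{
    size_t total = day_digits + 3 + 2;
    size_t i;

    assert(day_digits == 1 || day_digits == 2);
    if (len > total) return DATE_NOMATCH;
    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)s[i];
        if (i < day_digits || i >= day_digits + 3) {
            if (!isdigit(c)) return DATE_NOMATCH;
        } else if (s[i] != month[i - day_digits]) {
            return DATE_NOMATCH;
        }
    }
    return (len == total) ? DATE_COMPLETE : DATE_PARTIAL;
}

/* Classify what has been typed so far, preferring a complete match. */
date_status check_date(const char *s)
{
    size_t len = strlen(s);
    date_status best = DATE_NOMATCH;
    size_t d, m;

    if (len == 0) return DATE_NOMATCH;  /* an empty string is never partial */
    for (d = 1; d <= 2; d++) {
        for (m = 0; m < 12; m++) {
            date_status r = try_shape(s, len, d, MONTHS[m]);
            if (r > best) best = r;
        }
    }
    return best;
}
```

<P>
An application could call <b>check_date()</b> after each keystroke and reject
the character whenever the result is DATE_NOMATCH.
</P>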
51 <P>
52 PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
53 PCRE_PARTIAL_HARD options, which can be set when calling <b>pcre_exec()</b> or
54 <b>pcre_dfa_exec()</b>. For backwards compatibility, PCRE_PARTIAL is a synonym
55 for PCRE_PARTIAL_SOFT. The essential difference between the two options is
56 whether or not a partial match is preferred to an alternative complete match,
57 though the details differ between the two matching functions. If both options
58 are set, PCRE_PARTIAL_HARD takes precedence.
59 </P>
60 <P>
61 Setting a partial matching option disables one of PCRE's optimizations. PCRE
62 remembers the last literal byte in a pattern, and abandons matching immediately
63 if such a byte is not present in the subject string. This optimization cannot
64 be used for a subject string that might match only partially.
65 </P>
66 <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec()</a><br>
67 <P>
68 A partial match occurs during a call to <b>pcre_exec()</b> whenever the end of
69 the subject string is reached successfully, but matching cannot continue
70 because more characters are needed. However, at least one character must have
71 been matched. (In other words, a partial match can never be an empty string.)
72 </P>
73 <P>
74 If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching
75 continues as normal, and other alternatives in the pattern are tried. If no
76 complete match can be found, <b>pcre_exec()</b> returns PCRE_ERROR_PARTIAL
77 instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets
78 vector, the first of them is set to the offset of the earliest character that
79 was inspected when the partial match was found. For convenience, the second
80 offset points to the end of the string so that a substring can easily be
81 extracted.
82 </P>
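<P>
A caller typically has to turn the return value into a decision. The sketch
below shows one way to do that. The type <b>input_verdict</b> and the function
<b>classify_result()</b> are hypothetical. The two #define lines are stand-ins
so that the fragment compiles on its own; their numeric values are an
assumption, and real code should include <b>pcre.h</b> and use its definitions.
</P>

```c
/* Stand-ins so the sketch compiles without pcre.h. The values are assumed;
   real code should include <pcre.h> and rely on its definitions. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL (-12)
#endif

/* Hypothetical verdicts for an input-checking application. */
typedef enum {
    INPUT_ACCEPT,     /* a complete match was found                */
    INPUT_NEED_MORE,  /* only a partial match: wait for more input */
    INPUT_REJECT,     /* no match, complete or partial             */
    INPUT_ERROR       /* any other negative return code            */
} input_verdict;

/* Map the return value of a matching call to a verdict. */
input_verdict classify_result(int rc)
{
    if (rc >= 0) return INPUT_ACCEPT;
    if (rc == PCRE_ERROR_PARTIAL) return INPUT_NEED_MORE;
    if (rc == PCRE_ERROR_NOMATCH) return INPUT_REJECT;
    return INPUT_ERROR;
}
```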
83 <P>
84 For the majority of patterns, the first offset identifies the start of the
85 partially matched string. However, for patterns that contain lookbehind
86 assertions, or \K, or begin with \b or \B, earlier characters have been
87 inspected while carrying out the match. For example:
88 <pre>
89 /(?&#60;=abc)123/
90 </pre>
91 This pattern matches "123", but only if it is preceded by "abc". If the subject
92 string is "xyzabc12", the offsets after a partial match are for the substring
93 "abc12", because all these characters are needed if another match is tried
94 with extra characters added.
95 </P>
96 <P>
97 If there is more than one partial match, the first one that was found provides
98 the data that is returned. Consider this pattern:
99 <pre>
100 /123\w+X|dogY/
101 </pre>
102 If this is matched against the subject string "abc123dog", both
103 alternatives fail to match, but the end of the subject is reached during
104 matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The
105 offsets are set to 3 and 9, identifying "123dog" as the first partial match
106 that was found. (In this example, there are two partial matches, because "dog"
107 on its own partially matches the second alternative.)
108 </P>
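<P>
Given the two offsets, extracting the partially matched text is a plain
substring copy. The helper below is a minimal sketch;
<b>partial_substring()</b> is a hypothetical name, not a PCRE function, and it
assumes the offsets vector has already been filled in as described above.
</P>

```c
#include <assert.h>
#include <string.h>

/*
 * Copy subject[ovector[0] .. ovector[1]) into out as a NUL-terminated string.
 * Returns the number of bytes copied, or -1 if out is too small.
 * partial_substring() is a hypothetical helper, not part of PCRE.
 */
int partial_substring(const char *subject, const int *ovector,
                      char *out, size_t outsize)
{
    size_t n;

    assert(ovector[0] >= 0 && ovector[0] <= ovector[1]);
    n = (size_t)(ovector[1] - ovector[0]);
    if (n + 1 > outsize) return -1;      /* no room for text plus NUL */
    memcpy(out, subject + ovector[0], n);
    out[n] = '\0';
    return (int)n;
}
```

<P>
With the subject "abc123dog" and offsets 3 and 9, the helper copies "123dog".
</P>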
109 <P>
110 If PCRE_PARTIAL_HARD is set for <b>pcre_exec()</b>, it returns
111 PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
112 search for possible complete matches. The difference between the two options
113 can be illustrated by a pattern such as:
114 <pre>
115 /dog(sbody)?/
116 </pre>
117 This matches either "dog" or "dogsbody", greedily (that is, it prefers the
118 longer string if possible). If it is matched against the string "dog" with
119 PCRE_PARTIAL_SOFT, it yields a complete match for "dog". However, if
120 PCRE_PARTIAL_HARD is set, the result is PCRE_ERROR_PARTIAL. On the other hand,
121 if the pattern is made ungreedy the result is different:
122 <pre>
123 /dog(sbody)??/
124 </pre>
125 In this case the result is always a complete match because <b>pcre_exec()</b>
126 finds that first, and it never continues after finding a match. It might be
127 easier to follow this explanation by thinking of the two patterns like this:
128 <pre>
129 /dog(sbody)?/ is the same as /dogsbody|dog/
130 /dog(sbody)??/ is the same as /dog|dogsbody/
131 </pre>
132 The second pattern will never match "dogsbody" when <b>pcre_exec()</b> is
133 used, because it will always find the shorter match first.
134 </P>
135 <br><a name="SEC3" href="#TOC1">PARTIAL MATCHING USING pcre_dfa_exec()</a><br>
136 <P>
137 The <b>pcre_dfa_exec()</b> function moves along the subject string character by
138 character, without backtracking, searching for all possible matches
139 simultaneously. If the end of the subject is reached before the end of the
140 pattern, there is the possibility of a partial match, again provided that at
141 least one character has matched.
142 </P>
143 <P>
144 When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
145 have been no complete matches. Otherwise, the complete matches are returned.
146 However, if PCRE_PARTIAL_HARD is set, a partial match takes precedence over any
147 complete matches. The portion of the string that was inspected when the longest
148 partial match was found is set as the first matching string, provided there are
149 at least two slots in the offsets vector.
150 </P>
151 <P>
152 Because <b>pcre_dfa_exec()</b> always searches for all possible matches, and
153 there is no difference between greedy and ungreedy repetition, its behaviour is
154 different from <b>pcre_exec()</b> when PCRE_PARTIAL_HARD is set. Consider the
155 string "dog" matched against the ungreedy pattern shown above:
156 <pre>
157 /dog(sbody)??/
158 </pre>
159 Whereas <b>pcre_exec()</b> stops as soon as it finds the complete match for
160 "dog", <b>pcre_dfa_exec()</b> also finds the partial match for "dogsbody", and
161 so returns that when PCRE_PARTIAL_HARD is set.
162 </P>
163 <br><a name="SEC4" href="#TOC1">PARTIAL MATCHING AND WORD BOUNDARIES</a><br>
164 <P>
165 If a pattern ends with one of the sequences \b or \B, which test for word
166 boundaries, partial matching with PCRE_PARTIAL_SOFT can give counter-intuitive
167 results. Consider this pattern:
168 <pre>
169 /\bcat\b/
170 </pre>
171 This matches "cat", provided there is a word boundary at either end. If the
172 subject string is "the cat", the comparison of the final "t" with a following
173 character cannot take place, so a partial match is found. However,
174 <b>pcre_exec()</b> carries on with normal matching, which matches \b at the end
175 of the subject when the last character is a letter, thus finding a complete
176 match. The result, therefore, is <i>not</i> PCRE_ERROR_PARTIAL. The same thing
177 happens with <b>pcre_dfa_exec()</b>, because it also finds the complete match.
178 </P>
179 <P>
180 Using PCRE_PARTIAL_HARD in this case does yield PCRE_ERROR_PARTIAL, because
181 then the partial match takes precedence.
182 </P>
183 <br><a name="SEC5" href="#TOC1">FORMERLY RESTRICTED PATTERNS</a><br>
184 <P>
185 For releases of PCRE prior to 8.00, because of the way certain internal
186 optimizations were implemented in the <b>pcre_exec()</b> function, the
187 PCRE_PARTIAL option (predecessor of PCRE_PARTIAL_SOFT) could not be used with
188 all patterns. From release 8.00 onwards, the restrictions no longer apply, and
189 partial matching with <b>pcre_exec()</b> can be requested for any pattern.
190 </P>
191 <P>
192 Items that were formerly restricted were repeated single characters and
193 repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
194 conform to the restrictions, <b>pcre_exec()</b> returned the error code
195 PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
196 PCRE_INFO_OKPARTIAL call to <b>pcre_fullinfo()</b> to find out if a compiled
197 pattern can be used for partial matching now always returns 1.
198 </P>
199 <br><a name="SEC6" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
200 <P>
201 If the escape sequence \P is present in a <b>pcretest</b> data line, the
202 PCRE_PARTIAL_SOFT option is used for the match. Here is a run of <b>pcretest</b>
203 that uses the date example quoted above:
204 <pre>
205 re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
206 data&#62; 25jun04\P
207 0: 25jun04
208 1: jun
209 data&#62; 25dec3\P
210 Partial match: 25dec3
211 data&#62; 3ju\P
212 Partial match: 3ju
213 data&#62; 3juj\P
214 No match
215 data&#62; j\P
216 No match
217 </pre>
218 The first data string is matched completely, so <b>pcretest</b> shows the
219 matched substrings. The remaining four strings do not match the complete
220 pattern, but the first two are partial matches. Similar output is obtained
221 when <b>pcre_dfa_exec()</b> is used.
222 </P>
223 <P>
224 If the escape sequence \P is present more than once in a <b>pcretest</b> data
225 line, the PCRE_PARTIAL_HARD option is set for the match.
226 </P>
227 <br><a name="SEC7" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
228 <P>
229 When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
230 to continue the match by providing additional subject data and calling
231 <b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
232 time setting the PCRE_DFA_RESTART option. You must pass the same working
233 space as before, because this is where details of the previous partial match
234 are stored. Here is an example using <b>pcretest</b>, in which the \R escape
235 sequence sets the PCRE_DFA_RESTART option (\D specifies the use of
236 <b>pcre_dfa_exec()</b>):
237 <pre>
238 re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
239 data&#62; 23ja\P\D
240 Partial match: 23ja
241 data&#62; n05\R\D
242 0: n05
243 </pre>
244 The first call has "23ja" as the subject, and requests partial matching; the
245 second call has "n05" as the subject for the continued (restarted) match.
246 Notice that when the match is complete, only the last part is shown; PCRE does
247 not retain the previously partially-matched string. It is up to the calling
248 program to do that if it needs to.
249 </P>
250 <P>
251 You can set the PCRE_PARTIAL_SOFT or PCRE_PARTIAL_HARD options with
252 PCRE_DFA_RESTART to continue partial matching over multiple segments. This
253 facility can be used to pass very long subject strings to
254 <b>pcre_dfa_exec()</b>.
255 </P>
256 <br><a name="SEC8" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_exec()</a><br>
257 <P>
258 From release 8.00, <b>pcre_exec()</b> can also be used to do multi-segment
259 matching. Unlike <b>pcre_dfa_exec()</b>, it is not possible to restart the
260 previous match with a new segment of data. Instead, new data must be added to
261 the previous subject string, and the entire match re-run, starting from the
262 point where the partial match occurred. Earlier data can be discarded.
263 Consider an unanchored pattern that matches dates:
264 <pre>
265 re&#62; /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
266 data&#62; The date is 23ja\P
267 Partial match: 23ja
268 </pre>
269 At this stage, an application could discard the text preceding "23ja", add on
270 text from the next segment, and call <b>pcre_exec()</b> again. Unlike
271 <b>pcre_dfa_exec()</b>, the entire matching string must always be available, and
272 the complete matching process occurs for each call, so more memory and more
273 processing time are needed.
274 </P>
275 <P>
276 <b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
277 with \b or \B, the string that is returned for a partial match will include
278 characters that precede the partially matched string itself, because these must
279 be retained when adding on more characters for a subsequent matching attempt.
280 </P>
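<P>
The buffer handling described in this section (discard what precedes the
first returned offset, keep the rest, append the next segment) might be
sketched as follows. <b>carry_over()</b> is a hypothetical helper, not part of
PCRE; it assumes the caller passes the first offset from the previous partial
match as <i>keep_from</i>, and it leaves the call to the matching function to
the application.
</P>

```c
#include <assert.h>
#include <string.h>

/*
 * Drop the first keep_from bytes of buf, then append the next segment.
 * buf holds len bytes and has room for cap bytes in total.
 * Returns the new length, or (size_t)-1 if the result does not fit.
 * carry_over() is a hypothetical helper, not part of PCRE.
 */
size_t carry_over(char *buf, size_t len, size_t cap, size_t keep_from,
                  const char *next, size_t next_len)
{
    size_t kept;

    assert(keep_from <= len && len <= cap);
    kept = len - keep_from;
    if (next_len > cap - kept) return (size_t)-1;
    memmove(buf, buf + keep_from, kept);   /* source and target may overlap */
    memcpy(buf + kept, next, next_len);    /* result is not NUL-terminated  */
    return kept + next_len;
}
```

<P>
For the example above, the buffer "The date is 23ja" with <i>keep_from</i> set
to 12 and a next segment starting "n05" yields a new subject that begins
"23jan05", which can then be passed to the matching function with its new
length.
</P>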
281 <br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
282 <P>
283 Certain types of pattern may give problems with multi-segment matching,
284 whichever matching function is used.
285 </P>
286 <P>
287 1. If the pattern contains tests for the beginning or end of a line, you need
288 to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
289 subject string for any call does not contain the beginning or end of a line.
290 </P>
291 <P>
292 2. Lookbehind assertions at the start of a pattern are catered for in the
293 offsets that are returned for a partial match. However, in theory, a lookbehind
294 assertion later in the pattern could require even earlier characters to be
295 inspected, and it might not have been reached when a partial match occurs. This
296 is probably an extremely unlikely case; you could guard against it to a certain
297 extent by always including extra characters at the start.
298 </P>
299 <P>
300 3. Matching a subject string that is split into multiple segments may not
301 always produce exactly the same result as matching over one single long string,
302 especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
303 Word Boundaries" above describes an issue that arises if the pattern ends with
304 \b or \B. Another kind of difference may occur when there are multiple
305 matching possibilities, because a partial match result is given only when there
306 are no completed matches. This means that as soon as the shortest match has
307 been found, continuation to a new subject segment is no longer possible.
308 Consider again this <b>pcretest</b> example:
309 <pre>
310 re&#62; /dog(sbody)?/
311 data&#62; dogsb\P
312 0: dog
313 data&#62; do\P\D
314 Partial match: do
315 data&#62; gsb\R\P\D
316 0: g
317 data&#62; dogsbody\D
318 0: dogsbody
319 1: dog
320 </pre>
321 The first data line passes the string "dogsb" to <b>pcre_exec()</b>, setting the
322 PCRE_PARTIAL_SOFT option. Although the string is a partial match for
323 "dogsbody", the result is not PCRE_ERROR_PARTIAL, because the shorter string
324 "dog" is a complete match. Similarly, when the subject is presented to
325 <b>pcre_dfa_exec()</b> in several parts ("do" and "gsb" being the first two) the
326 match stops when "dog" has been found, and it is not possible to continue. On
327 the other hand, if "dogsbody" is presented as a single string,
328 <b>pcre_dfa_exec()</b> finds both matches.
329 </P>
330 <P>
331 Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when
332 matching multi-segment data. The example above then behaves differently:
333 <pre>
334 re&#62; /dog(sbody)?/
335 data&#62; dogsb\P\P
336 Partial match: dogsb
337 data&#62; do\P\D
338 Partial match: do
339 data&#62; gsb\R\P\P\D
340 Partial match: gsb
342 </PRE>
343 </P>
344 <P>
345 4. Patterns that contain alternatives at the top level which do not all
346 start with the same pattern item may not work as expected when
347 <b>pcre_dfa_exec()</b> is used. For example, consider this pattern:
348 <pre>
349 1234|3789
350 </pre>
351 If the first part of the subject is "ABC123", a partial match of the first
352 alternative is found at offset 3. There is no partial match for the second
353 alternative, because such a match does not start at the same point in the
354 subject string. Attempting to continue with the string "7890" does not yield a
355 match because only those alternatives that match at one point in the subject
356 are remembered. The problem arises because the start of the second alternative
357 matches within the first alternative. There is no problem with anchored
358 patterns or patterns such as:
359 <pre>
360 1234|ABCD
361 </pre>
362 where no string can be a partial match for both alternatives. This is not a
363 problem if <b>pcre_exec()</b> is used, because the entire match has to be rerun
364 each time:
365 <pre>
366 re&#62; /1234|3789/
367 data&#62; ABC123\P
368 Partial match: 123
369 data&#62; 1237890
370 0: 3789
372 </PRE>
373 </P>
374 <br><a name="SEC10" href="#TOC1">AUTHOR</a><br>
375 <P>
376 Philip Hazel
377 <br>
378 University Computing Service
379 <br>
380 Cambridge CB2 3QH, England.
381 <br>
382 </P>
383 <br><a name="SEC11" href="#TOC1">REVISION</a><br>
384 <P>
385 Last updated: 05 September 2009
386 <br>
387 Copyright &copy; 1997-2009 University of Cambridge.
388 <br>
389 <p>
390 Return to the <a href="index.html">PCRE index page</a>.
391 </p>


Name Value
svn:eol-style native
svn:keywords "Author Date Id Revision Url"
