<html>
<head>
<title>pcrepartial specification</title>
</head>
<body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
<h1>pcrepartial man page</h1>
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
<p>
This page is part of the PCRE HTML documentation. It was generated automatically
from the original man page. If there is any nonsense in it, please consult the
man page, in case the conversion went wrong.
<br>
<ul>
<li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
<li><a name="TOC2" href="#SEC2">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a>
<li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
</ul>
<br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
<P>
In normal use of PCRE, if the subject string that is passed to
<b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
are circumstances where it might be helpful to distinguish this case from other
cases in which there is no match.
</P>
<P>
Consider, for example, an application where a human is required to type in data
for a field with specific formatting requirements. An example might be a date
in the form <i>ddmmmyy</i>, defined by this pattern:
<pre>
^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
</pre>
If the application sees the user's keystrokes one by one, and can check that
what has been typed so far is potentially valid, it is able to raise an error
as soon as a mistake is made, possibly beeping and not reflecting the
character that has been typed. This immediate feedback is likely to be a better
user interface than a check that is delayed until the entire string has been
entered.
</P>
<P>
PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
option, which can be set when calling <b>pcre_exec()</b> or
<b>pcre_dfa_exec()</b>. When this flag is set for <b>pcre_exec()</b>, the return
code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
during the matching process the last part of the subject string matched part of
the pattern. Unfortunately, for non-anchored matching, it is not possible to
obtain the position of the start of the partial match. No captured data is set
when PCRE_ERROR_PARTIAL is returned.
</P>
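<P>
The following is a minimal sketch of how a caller might act on these return
codes when checking a field keystroke by keystroke. It assumes that
<b>pcre_exec()</b> has already been called with PCRE_PARTIAL on the text typed
so far. The names <i>field_state</i> and <i>classify_field</i> are hypothetical,
and the two numeric values are copied from <b>pcre.h</b> only so that the
fragment compiles without the library; real code should include <b>pcre.h</b>
and use PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL directly.
</P>

```c
/* Values of PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL as defined in pcre.h,
   repeated here only so that the sketch compiles on its own. */
#define SKETCH_ERROR_NOMATCH  (-1)
#define SKETCH_ERROR_PARTIAL  (-12)

/* Hypothetical outcomes for an input field checked after every keystroke. */
typedef enum {
    FIELD_COMPLETE,     /* the whole pattern matched */
    FIELD_KEEP_TYPING,  /* valid so far, but more input is needed */
    FIELD_REJECT,       /* the last keystroke made a match impossible */
    FIELD_FAILURE       /* some other error code came back */
} field_state;

/* rc is the value returned by pcre_exec() when called with PCRE_PARTIAL. */
field_state classify_field(int rc)
{
    if (rc >= 0)
        return FIELD_COMPLETE;   /* 0 is still a match: the ovector was too small */
    if (rc == SKETCH_ERROR_PARTIAL)
        return FIELD_KEEP_TYPING;
    if (rc == SKETCH_ERROR_NOMATCH)
        return FIELD_REJECT;
    return FIELD_FAILURE;
}
```

<P>
With the date pattern above, input such as "3ju" gives PCRE_ERROR_PARTIAL and
therefore FIELD_KEEP_TYPING, whereas "3juj" gives PCRE_ERROR_NOMATCH and
FIELD_REJECT.
</P>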
<P>
When PCRE_PARTIAL is set for <b>pcre_dfa_exec()</b>, the return code
PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
subject is reached, there have been no complete matches, but there is still at
least one matching possibility. The portion of the string that provided the
partial match is set as the first matching string.
</P>
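<P>
As an illustration, the partially matched text could be copied out of the
subject as sketched below. This assumes that "the first matching string" means
the first pair of offsets in the output vector, as it does for a complete
match. The function name <i>copy_partial_match</i> is hypothetical.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Copy the partially matched text, from subject[ovector[0]] up to but not
   including subject[ovector[1]], into out as a NUL-terminated string.
   Returns the number of bytes copied, or -1 if out is too small or the
   offsets are not usable. */
int copy_partial_match(const char *subject, const int *ovector,
                       char *out, size_t out_size)
{
    if (ovector[0] < 0 || ovector[1] < ovector[0])
        return -1;
    size_t length = (size_t)(ovector[1] - ovector[0]);
    if (length + 1 > out_size)
        return -1;
    memcpy(out, subject + ovector[0], length);
    out[length] = '\0';
    return (int)length;
}
```

<P>
A program that needs the whole matched text after a later continuation has to
keep such copies itself.
</P>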
<P>
Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
last literal byte in a pattern, and abandons matching immediately if such a
byte is not present in the subject string. This optimization cannot be used
for a subject string that might match only partially.
</P>
<br><a name="SEC2" href="#TOC1">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a><br>
<P>
Because of the way certain internal optimizations are implemented in the
<b>pcre_exec()</b> function, the PCRE_PARTIAL option cannot be used with all
patterns. These restrictions do not apply when <b>pcre_dfa_exec()</b> is used.
For <b>pcre_exec()</b>, repeated single characters such as
<pre>
a{2,4}
</pre>
and repeated single metasequences such as
<pre>
\d+
</pre>
are not permitted if the maximum number of occurrences is greater than one.
Optional items such as \d? (where the maximum is one) are permitted.
Quantifiers with any values are permitted after parentheses, so the invalid
examples above can be coded thus:
<pre>
(a){2,4}
(\d)+
</pre>
These constructions run more slowly, but for the kinds of application that are
envisaged for this facility, this is not felt to be a major restriction.
</P>
<P>
If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
<b>pcre_exec()</b> returns the error code PCRE_ERROR_BADPARTIAL (-13).
</P>
<P>
If the escape sequence \P is present in a <b>pcretest</b> data line, the
PCRE_PARTIAL flag is used for the match. Here is a run of <b>pcretest</b> that
uses the date example quoted above:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P
0: 25jun04
1: jun
data&#62; 25dec3\P
Partial match
data&#62; 3ju\P
Partial match
data&#62; 3juj\P
No match
data&#62; j\P
No match
</pre>
The first data string is matched completely, so <b>pcretest</b> shows the
matched substrings. The remaining four strings do not match the complete
pattern, but the first two are partial matches. The same test, using
<b>pcre_dfa_exec()</b> matching (by means of the \D escape sequence), produces
the following output:
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 25jun04\P\D
0: 25jun04
data&#62; 23dec3\P\D
Partial match: 23dec3
data&#62; 3ju\P\D
Partial match: 3ju
data&#62; 3juj\P\D
No match
data&#62; j\P\D
No match
</pre>
Notice that in this case the portion of the string that was matched is made
available.
</P>
<br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
<P>
When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
to continue the match by providing additional subject data and calling
<b>pcre_dfa_exec()</b> again with the same compiled regular expression, this
time setting the PCRE_DFA_RESTART option. You must also pass the same working
space as before, because this is where details of the previous partial match
are stored. Here is an example using <b>pcretest</b>, using the \R escape
sequence to set the PCRE_DFA_RESTART option (\P and \D are as above):
<pre>
re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
data&#62; 23ja\P\D
Partial match: 23ja
data&#62; n05\R\D
0: n05
</pre>
The first call has "23ja" as the subject, and requests partial matching; the
second call has "n05" as the subject for the continued (restarted) match.
Notice that when the match is complete, only the last part is shown; PCRE does
not retain the previously partially-matched string. It is up to the calling
program to do that if it needs to.
</P>
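<P>
The calling sequence can be sketched as a small driver loop. This is an
illustration under stated assumptions, not PCRE code: <i>segment_matcher</i> is
a hypothetical wrapper type around <b>pcre_dfa_exec()</b>, and
<i>toy_matcher</i> is a stand-in that merely compares against the fixed string
"23jan05" so that the loop can be run without the library. The two numeric
values are copied from <b>pcre.h</b>.
</P>

```c
#include <stddef.h>
#include <string.h>

/* Values of PCRE_ERROR_NOMATCH and PCRE_ERROR_PARTIAL from pcre.h, repeated
   so that the sketch compiles without the library. */
#define SKETCH_ERROR_NOMATCH  (-1)
#define SKETCH_ERROR_PARTIAL  (-12)

/* Hypothetical wrapper around pcre_dfa_exec(). A real one would pass
   PCRE_DFA_RESTART when restart is nonzero, PCRE_PARTIAL when partial is
   nonzero, and the same workspace vector on every call. */
typedef int (*segment_matcher)(const char *segment, size_t length,
                               int restart, int partial, void *state);

/* Feed the segments in order. Every segment except the last asks for a
   partial match; every segment except the first restarts the previous one. */
int match_segments(const char *const *segments, size_t count,
                   segment_matcher match, void *state)
{
    int rc = SKETCH_ERROR_NOMATCH;
    for (size_t i = 0; i < count; i++) {
        int restart = (i > 0);
        int partial = (i + 1 < count);
        rc = match(segments[i], strlen(segments[i]), restart, partial, state);
        if (rc != SKETCH_ERROR_PARTIAL)
            break;   /* complete match, no match, or an error */
    }
    return rc;
}

/* Toy stand-in for the wrapper, used only to exercise the driver: it
   "matches" the fixed string "23jan05", keeping its position in *state. */
int toy_matcher(const char *segment, size_t length,
                int restart, int partial, void *state)
{
    static const char target[] = "23jan05";
    size_t *pos = state;
    if (!restart)
        *pos = 0;
    size_t remaining = strlen(target) - *pos;
    size_t n = length < remaining ? length : remaining;
    if (length > remaining || memcmp(segment, target + *pos, n) != 0)
        return SKETCH_ERROR_NOMATCH;
    *pos += n;
    if (*pos == strlen(target))
        return 1;
    return partial ? SKETCH_ERROR_PARTIAL : SKETCH_ERROR_NOMATCH;
}
```

<P>
Called with the two segments "23ja" and "n05", the driver makes the same two
calls as the <b>pcretest</b> run above: the first with partial matching
requested, the second as a restart.
</P>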
<P>
You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
over multiple segments. This facility can be used to pass very long subject
strings to <b>pcre_dfa_exec()</b>. However, some care is needed for certain
types of pattern.
</P>
<P>
1. If the pattern contains tests for the beginning or end of a line, you need
to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
subject string for any call does not contain the beginning or end of a line.
</P>
<P>
2. If the pattern contains backward assertions (including \b or \B), you need
to arrange for some overlap in the subject strings to allow for this. For
example, you could pass the subject in chunks that are 500 bytes long, but in
a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
bytes at the start of the buffer.
</P>
<P>
3. Matching a subject string that is split into multiple segments does not
always produce exactly the same result as matching over one single long string.
The difference arises when there are multiple matching possibilities, because a
partial match result is given only when there are no completed matches in a
call to <b>pcre_dfa_exec()</b>. This means that as soon as the shortest match has
been found, continuation to a new subject segment is no longer possible.
Consider this <b>pcretest</b> example:
<pre>
re&#62; /dog(sbody)?/
data&#62; do\P\D
Partial match: do
data&#62; gsb\R\P\D
0: g
data&#62; dogsbody\D
0: dogsbody
1: dog
</pre>
The pattern matches the words "dog" or "dogsbody". When the subject is
presented in several parts ("do" and "gsb" being the first two) the match stops
when "dog" has been found, and it is not possible to continue. On the other
hand, if "dogsbody" is presented as a single string, both matches are found.
</P>
<P>
Because of this phenomenon, it does not usually make sense to end a pattern
that is going to be matched in this way with a variable repeat.
</P>
<P>
4. Patterns that contain alternatives at the top level which do not all
start with the same pattern item may not work as expected. For example,
consider this pattern:
<pre>
1234|3789
</pre>
If the first part of the subject is "ABC123", a partial match of the first
alternative is found at offset 3. There is no partial match for the second
alternative, because such a match does not start at the same point in the
subject string. Attempting to continue with the string "789" does not yield a
match because only those alternatives that match at one point in the subject
are remembered. The problem arises because the start of the second alternative
matches within the first alternative. There is no problem with anchored
patterns or patterns such as:
<pre>
1234|ABCD
</pre>
where no string can be a partial match for both alternatives.
</P>
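<P>
The overlapping buffer described in item 2 above could be maintained along the
following lines. This is only a sketch: the type <i>overlap_window</i> and the
function <i>window_load</i> are hypothetical names, and the sizes are the ones
used in the example in item 2.
</P>

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE   500   /* new bytes passed per call, as in item 2 */
#define OVERLAP_SIZE 200   /* bytes retained from the previous call */

/* Hypothetical sliding window: up to OVERLAP_SIZE old bytes, then new ones. */
typedef struct {
    char   data[OVERLAP_SIZE + CHUNK_SIZE];   /* the 700-byte buffer */
    size_t used;                              /* valid bytes in data */
} overlap_window;

/* Move the tail of the previous contents to the front of the buffer and
   append the next chunk. Returns the starting offset to give the matcher,
   or -1 if the chunk is larger than CHUNK_SIZE. */
int window_load(overlap_window *w, const char *chunk, size_t length)
{
    if (length > CHUNK_SIZE)
        return -1;
    size_t keep = w->used < OVERLAP_SIZE ? w->used : OVERLAP_SIZE;
    memmove(w->data, w->data + (w->used - keep), keep);
    memcpy(w->data + keep, chunk, length);
    w->used = keep + length;
    return (int)keep;
}
```

<P>
After each call, <i>data</i> and <i>used</i> would be passed to
<b>pcre_dfa_exec()</b> as the subject and its length, with the returned value as
the starting offset. The window must start with <i>used</i> set to zero, so the
first call returns an offset of zero.
</P>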
<P>
Last updated: 30 November 2006
<br>
Copyright &copy; 1997-2006 University of Cambridge.
<p>
Return to the <a href="index.html">PCRE index page</a>.
</p>
