Contents of /code/trunk/doc/pcrepartial.3

Revision 87 - (show annotations)
Sat Feb 24 21:41:21 2007 UTC (14 years, 7 months ago) by nigel
File size: 8227 byte(s)
Load pcre-6.5 into code/trunk.
1 .TH PCREPARTIAL 3
2 .SH NAME
3 PCRE - Perl-compatible regular expressions
4 .SH "PARTIAL MATCHING IN PCRE"
5 .rs
6 .sp
7 In normal use of PCRE, if the subject string that is passed to
8 \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
9 too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
10 are circumstances where it might be helpful to distinguish this case from other
11 cases in which there is no match.
12 .P
13 Consider, for example, an application where a human is required to type in data
14 for a field with specific formatting requirements. An example might be a date
15 in the form \fIddmmmyy\fP, defined by this pattern:
16 .sp
17 ^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$
18 .sp
19 If the application sees the user's keystrokes one by one, and can check that
20 what has been typed so far is potentially valid, it is able to raise an error
21 as soon as a mistake is made, possibly beeping and not echoing the
22 character that has been typed. This immediate feedback is likely to be a better
23 user interface than a check that is delayed until the entire string has been
24 entered.
25 .P
26 PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
27 option, which can be set when calling \fBpcre_exec()\fP or
28 \fBpcre_dfa_exec()\fP. When this flag is set for \fBpcre_exec()\fP, the return
29 code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
30 during the matching process the last part of the subject string matched part of
31 the pattern. Unfortunately, for non-anchored matching, it is not possible to
32 obtain the position of the start of the partial match. No captured data is set
33 when PCRE_ERROR_PARTIAL is returned.
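.P
The three possible outcomes (complete match, partial match, no match) can be
illustrated without PCRE at all. The following is a minimal hand-written
sketch for the date pattern above, not PCRE code: the names result,
try_layout and classify are hypothetical, and treating an empty subject as a
partial match is an assumption of this sketch rather than documented PCRE
behaviour.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Ordered so that a "better" outcome compares greater. */
enum result { NO_MATCH, PARTIAL_MATCH, FULL_MATCH };

static const char *months[12] = {
    "jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec"
};

/* Check the subject against one layout: ndigits day digits (1 or 2),
   a three-letter month, and two year digits. */
enum result try_layout(const char *s, size_t len, size_t ndigits)
{
    size_t total = ndigits + 3 + 2;
    size_t i, m, mlen;
    int found = 0;

    if (len > total) return NO_MATCH;          /* too long for this layout */

    for (i = 0; i < len && i < ndigits; i++)   /* day digits */
        if (!isdigit((unsigned char)s[i])) return NO_MATCH;

    if (len > ndigits) {                       /* month, possibly incomplete */
        mlen = len - ndigits;
        if (mlen > 3) mlen = 3;
        for (m = 0; m < 12; m++)
            if (strncmp(months[m], s + ndigits, mlen) == 0) { found = 1; break; }
        if (!found) return NO_MATCH;
    }

    for (i = ndigits + 3; i < len; i++)        /* year digits */
        if (!isdigit((unsigned char)s[i])) return NO_MATCH;

    /* Everything typed so far fits; it is complete only at full length.
       An empty subject is reported as partial (assumption of this sketch). */
    return len == total ? FULL_MATCH : PARTIAL_MATCH;
}

/* Try both the one-digit and two-digit day layouts and keep the best. */
enum result classify(const char *s)
{
    size_t len = strlen(s);
    enum result a = try_layout(s, len, 1);
    enum result b = try_layout(s, len, 2);
    return a > b ? a : b;
}
```

An application that checks keystrokes one at a time could call classify()
after each character and reject the keystroke as soon as NO_MATCH is returned.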
34 .P
35 When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
36 PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
37 subject is reached, there have been no complete matches, but there is still at
38 least one matching possibility. The portion of the string that provided the
39 partial match is set as the first matching string.
40 .P
41 Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
42 last literal byte in a pattern, and abandons matching immediately if such a
43 byte is not present in the subject string. This optimization cannot be used
44 for a subject string that might match only partially.
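.P
The reason this optimization has to be switched off can be seen in a small
model of the check. This is only a sketch of the idea, not PCRE's internal
code; the function name can_abandon_early and its parameters are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if matching can be abandoned before the matcher is run.
   req_byte models the last literal byte that any complete match must
   contain; partial is nonzero when partial matching was requested. */
int can_abandon_early(const char *subject, size_t len,
                      unsigned char req_byte, int partial)
{
    /* With partial matching the required byte may simply not have
       arrived yet, so its absence proves nothing. */
    if (partial) return 0;

    /* Without it, a subject lacking the byte can never match. */
    return memchr(subject, req_byte, len) == NULL;
}
```

For a hypothetical pattern whose last literal byte is "z", the subject "abc12"
can be rejected immediately in normal use, but must still be examined when a
partial match is acceptable.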
45 .
46 .
47 .SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"
48 .rs
49 .sp
50 Because of the way certain internal optimizations are implemented in the
51 \fBpcre_exec()\fP function, the PCRE_PARTIAL option cannot be used with all
52 patterns. These restrictions do not apply when \fBpcre_dfa_exec()\fP is used.
53 For \fBpcre_exec()\fP, repeated single characters such as
54 .sp
55 a{2,4}
56 .sp
57 and repeated single metasequences such as
58 .sp
59 \ed+
60 .sp
61 are not permitted if the maximum number of occurrences is greater than one.
62 Optional items such as \ed? (where the maximum is one) are permitted.
63 Quantifiers with any values are permitted after parentheses, so the invalid
64 examples above can be coded thus:
65 .sp
66 (a){2,4}
67 (\ed)+
68 .sp
69 These constructions run more slowly, but for the kinds of application that are
70 envisaged for this facility, this is not felt to be a major restriction.
71 .P
72 If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
73 \fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).
74 .
75 .
76 .SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"
77 .rs
78 .sp
79 If the escape sequence \eP is present in a \fBpcretest\fP data line, the
80 PCRE_PARTIAL flag is used for the match. Here is a run of \fBpcretest\fP that
81 uses the date example quoted above:
82 .sp
83 re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
84 data> 25jun04\eP
85 0: 25jun04
86 1: jun
87 data> 25dec3\eP
88 Partial match
89 data> 3ju\eP
90 Partial match
91 data> 3juj\eP
92 No match
93 data> j\eP
94 No match
95 .sp
96 The first data string is matched completely, so \fBpcretest\fP shows the
97 matched substrings. The remaining four strings do not match the complete
98 pattern, but the first two are partial matches. The same test, using DFA
99 matching (by means of the \eD escape sequence), produces the following output:
100 .sp
101 re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
102 data> 25jun04\eP\eD
103 0: 25jun04
104 data> 25dec3\eP\eD
105 Partial match: 25dec3
106 data> 3ju\eP\eD
107 Partial match: 3ju
108 data> 3juj\eP\eD
109 No match
110 data> j\eP\eD
111 No match
112 .sp
113 Notice that in this case the portion of the string that was matched is made
114 available.
115 .
116 .
117 .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
118 .rs
119 .sp
120 When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
121 to continue the match by providing additional subject data and calling
122 \fBpcre_dfa_exec()\fP again with the PCRE_DFA_RESTART option and the same
123 working space (where details of the previous partial match are stored). Here is
124 an example using \fBpcretest\fP, where the \eR escape sequence sets the
125 PCRE_DFA_RESTART option and the \eD escape sequence requests the use of
126 \fBpcre_dfa_exec()\fP:
127 .sp
128 re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
129 data> 23ja\eP\eD
130 Partial match: 23ja
131 data> n05\eR\eD
132 0: n05
133 .sp
134 The first call has "23ja" as the subject, and requests partial matching; the
135 second call has "n05" as the subject for the continued (restarted) match.
136 Notice that when the match is complete, only the last part is shown; PCRE does
137 not retain the previously partially-matched string. It is up to the calling
138 program to do that if it needs to.
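.P
Because only the last segment is reported, a caller that needs the whole
matched text has to keep the earlier segments itself. The following is one
possible caller-side sketch; the names seg_accum, accum_reset and accum_append
and the fixed capacity ACCUM_MAX are hypothetical and are not part of PCRE.

```c
#include <stddef.h>
#include <string.h>

#define ACCUM_MAX 64   /* arbitrary capacity chosen for this sketch */

/* Caller-owned copy of every segment passed to the matcher so far. */
struct seg_accum {
    char   text[ACCUM_MAX];
    size_t len;
};

/* Forget any previously stored segments. */
void accum_reset(struct seg_accum *a)
{
    a->len = 0;
    a->text[0] = 0;
}

/* Append one segment; returns 0 on success, -1 if it does not fit. */
int accum_append(struct seg_accum *a, const char *seg)
{
    size_t n = strlen(seg);

    if (n >= ACCUM_MAX - a->len) return -1;   /* keep room for terminator */
    memcpy(a->text + a->len, seg, n);
    a->len += n;
    a->text[a->len] = 0;
    return 0;
}
```

In the example above, the caller would append "23ja" after the partial match
and "n05" after the restarted match, leaving "23jan05" in the accumulator.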
139 .P
140 This facility can be used to pass very long subject strings to
141 \fBpcre_dfa_exec()\fP. However, some care is needed for certain types of
142 pattern.
143 .P
144 1. If the pattern contains tests for the beginning or end of a line, you need
145 to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
146 subject string for any call does not contain the beginning or end of a line.
147 .P
148 2. If the pattern contains backward assertions (including \eb or \eB), you need
149 to arrange for some overlap in the subject strings to allow for this. For
150 example, you could pass the subject in chunks that were 500 bytes long, but in
151 a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
152 bytes at the start of the buffer.
153 .P
154 3. Matching a subject string that is split into multiple segments does not
155 always produce exactly the same result as matching over one single long string.
156 The difference arises when there are multiple matching possibilities, because a
157 partial match result is given only when there are no completed matches in a
158 call to \fBpcre_dfa_exec()\fP. This means that as soon as the shortest match has
159 been found, continuation to a new subject segment is no longer possible.
160 Consider this \fBpcretest\fP example:
161 .sp
162 re> /dog(sbody)?/
163 data> do\eP\eD
164 Partial match: do
165 data> gsb\eR\eP\eD
166 0: g
167 data> dogsbody\eD
168 0: dogsbody
169 1: dog
170 .sp
171 The pattern matches the words "dog" or "dogsbody". When the subject is
172 presented in several parts ("do" and "gsb" being the first two) the match stops
173 when "dog" has been found, and it is not possible to continue. On the other
174 hand, if "dogsbody" is presented as a single string, both matches are found.
175 .P
176 Because of this phenomenon, it does not usually make sense to end a pattern
177 that is going to be matched in this way with a variable repeat.
178 .P
179 4. Patterns that contain alternatives at the top level which do not all
180 start with the same pattern item may not work as expected. For example,
181 consider this pattern:
182 .sp
183 1234|3789
184 .sp
185 If the first part of the subject is "ABC123", a partial match of the first
186 alternative is found at offset 3. There is no partial match for the second
187 alternative, because such a match does not start at the same point in the
188 subject string. Attempting to continue with the string "789" does not yield a
189 match because only those alternatives that match at one point in the subject
190 are remembered. The problem arises because the start of the second alternative
191 matches within the first alternative. There is no problem with anchored
192 patterns or patterns such as:
193 .sp
194 1234|ABCD
195 .sp
196 where no string can be a partial match for both alternatives.
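.P
The overlapping buffer described in point 2 above could be managed along the
following lines. This is a sketch under stated assumptions: the struct window,
the function window_load and the constants CHUNK and OVERLAP are hypothetical
names, and the sizes are simply the 500 and 200 bytes used in the text.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK   500   /* new bytes per call, as in the text */
#define OVERLAP 200   /* bytes of earlier data kept for lookbehind */

struct window {
    char   buf[OVERLAP + CHUNK];   /* the 700-byte buffer */
    size_t len;                    /* valid bytes currently in buf */
    size_t start;                  /* offset at which matching should begin */
};

/* Load the next chunk of n bytes (n <= CHUNK). The tail of the previous
   window is moved to the front so that backward assertions can see it.
   Returns 0 on success, -1 if the chunk is too large. */
int window_load(struct window *w, const char *chunk, size_t n)
{
    size_t keep;

    if (n > CHUNK) return -1;

    keep = w->len < OVERLAP ? w->len : OVERLAP;
    memmove(w->buf, w->buf + (w->len - keep), keep);   /* old tail first */
    memcpy(w->buf + keep, chunk, n);                   /* then new data */
    w->start = keep;
    w->len = keep + n;
    return 0;
}
```

The window must be zero-initialized before the first call. After each call,
buf and len describe the subject to pass to the matching function, and start
is the starting offset: 0 for the first chunk and 200 once a full chunk has
been seen.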
197 .
198 .
199 .P
200 .in 0
201 Last updated: 16 January 2006
202 .br
203 Copyright (c) 1997-2006 University of Cambridge.
