Contents of /code/trunk/doc/html/pcrepartial.html

Revision 77
Sat Feb 24 21:40:45 2007 UTC by nigel
File MIME type: text/html
File size: 8431 byte(s)
Load pcre-6.0 into code/trunk.
1 <html>
2 <head>
3 <title>pcrepartial specification</title>
4 </head>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcrepartial man page</h1>
7 <p>
8 Return to the <a href="index.html">PCRE index page</a>.
9 </p>
10 <p>
11 This page is part of the PCRE HTML documentation. It was generated automatically
12 from the original man page. If there is any nonsense in it, please consult the
13 man page, in case the conversion went wrong.
14 <br>
15 <ul>
16 <li><a name="TOC1" href="#SEC1">PARTIAL MATCHING IN PCRE</a>
17 <li><a name="TOC2" href="#SEC2">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a>
18 <li><a name="TOC3" href="#SEC3">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a>
19 <li><a name="TOC4" href="#SEC4">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a>
20 </ul>
21 <br><a name="SEC1" href="#TOC1">PARTIAL MATCHING IN PCRE</a><br>
22 <P>
23 In normal use of PCRE, if the subject string that is passed to
24 <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b> matches as far as it goes, but is
25 too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
26 are circumstances where it might be helpful to distinguish this case from other
27 cases in which there is no match.
28 </P>
29 <P>
30 Consider, for example, an application where a human is required to type in data
31 for a field with specific formatting requirements. An example might be a date
32 in the form <i>ddmmmyy</i>, defined by this pattern:
33 <pre>
34 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
35 </pre>
36 If the application sees the user's keystrokes one by one, and can check that
37 what has been typed so far is potentially valid, it is able to raise an error
38 as soon as a mistake is made, possibly beeping and not reflecting the
39 character that has been typed. This immediate feedback is likely to be a better
40 user interface than a check that is delayed until the entire string has been
41 entered.
42 </P>
43 <P>
44 PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
45 option, which can be set when calling <b>pcre_exec()</b> or
46 <b>pcre_dfa_exec()</b>. When this flag is set for <b>pcre_exec()</b>, the return
47 code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
48 during the matching process the last part of the subject string matched part of
49 the pattern. Unfortunately, for non-anchored matching, it is not possible to
50 obtain the position of the start of the partial match. No captured data is set
51 when PCRE_ERROR_PARTIAL is returned.
52 </P>
53 <P>
54 When PCRE_PARTIAL is set for <b>pcre_dfa_exec()</b>, the return code
55 PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
56 subject is reached, there have been no complete matches, but there is still at
57 least one matching possibility. The portion of the string that provided the
58 partial match is set as the first matching string.
59 </P>
60 <P>
61 Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
62 last literal byte in a pattern, and abandons matching immediately if such a
63 byte is not present in the subject string. This optimization cannot be used
64 for a subject string that might match only partially.
65 </P>
66 <br><a name="SEC2" href="#TOC1">RESTRICTED PATTERNS FOR PCRE_PARTIAL</a><br>
67 <P>
68 Because of the way certain internal optimizations are implemented in the
69 <b>pcre_exec()</b> function, the PCRE_PARTIAL option cannot be used with all
70 patterns. These restrictions do not apply when <b>pcre_dfa_exec()</b> is used.
71 For <b>pcre_exec()</b>, repeated single characters such as
72 <pre>
73 a{2,4}
74 </pre>
75 and repeated single metasequences such as
76 <pre>
77 \d+
78 </pre>
79 are not permitted if the maximum number of occurrences is greater than one.
80 Optional items such as \d? (where the maximum is one) are permitted.
81 Quantifiers with any values are permitted after parentheses, so the invalid
82 examples above can be coded thus:
83 <pre>
84 (a){2,4}
85 (\d)+
86 </pre>
87 These constructions run more slowly, but for the kinds of application that are
88 envisaged for this facility, this is not felt to be a major restriction.
89 </P>
90 <P>
91 If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
92 <b>pcre_exec()</b> returns the error code PCRE_ERROR_BADPARTIAL (-13).
93 </P>
94 <br><a name="SEC3" href="#TOC1">EXAMPLE OF PARTIAL MATCHING USING PCRETEST</a><br>
95 <P>
96 If the escape sequence \P is present in a <b>pcretest</b> data line, the
97 PCRE_PARTIAL flag is used for the match. Here is a run of <b>pcretest</b> that
98 uses the date example quoted above:
99 <pre>
100 re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
101 data&#62; 25jun04\P
102 0: 25jun04
103 1: jun
104 data&#62; 25dec3\P
105 Partial match
106 data&#62; 3ju\P
107 Partial match
108 data&#62; 3juj\P
109 No match
110 data&#62; j\P
111 No match
112 </pre>
113 The first data string is matched completely, so <b>pcretest</b> shows the
114 matched substrings. The remaining four strings do not match the complete
115 pattern, but the first two are partial matches. The same test, using DFA
116 matching (by means of the \D escape sequence), produces the following output:
117 <pre>
118 re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
119 data&#62; 25jun04\P\D
120 0: 25jun04
121 data&#62; 23dec3\P\D
122 Partial match: 23dec3
123 data&#62; 3ju\P\D
124 Partial match: 3ju
125 data&#62; 3juj\P\D
126 No match
127 data&#62; j\P\D
128 No match
129 </pre>
130 Notice that in this case the portion of the string that was matched is made
131 available.
132 </P>
133 <br><a name="SEC4" href="#TOC1">MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()</a><br>
134 <P>
135 When a partial match has been found using <b>pcre_dfa_exec()</b>, it is possible
136 to continue the match by providing additional subject data and calling
137 <b>pcre_dfa_exec()</b> again with the PCRE_DFA_RESTART option and the same
138 working space (where details of the previous partial match are stored). Here is
139 an example using <b>pcretest</b>, where the \R escape sequence sets the
140 PCRE_DFA_RESTART option and the \D escape sequence requests the use of
141 <b>pcre_dfa_exec()</b>:
142 <pre>
143 re&#62; /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
144 data&#62; 23ja\P\D
145 Partial match: 23ja
146 data&#62; n05\R\D
147 0: n05
148 </pre>
149 The first call has "23ja" as the subject, and requests partial matching; the
150 second call has "n05" as the subject for the continued (restarted) match.
151 Notice that when the match is complete, only the last part is shown; PCRE does
152 not retain the previously partially-matched string. It is up to the calling
153 program to do that if it needs to.
154 </P>
155 <P>
156 This facility can be used to pass very long subject strings to
157 <b>pcre_dfa_exec()</b>. However, some care is needed for certain types of
158 pattern.
159 </P>
160 <P>
161 1. If the pattern contains tests for the beginning or end of a line, you need
162 to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
163 subject string for any call does not contain the beginning or end of a line.
164 </P>
165 <P>
166 2. If the pattern contains backward assertions (including \b or \B), you need
167 to arrange for some overlap in the subject strings to allow for this. For
168 example, you could pass the subject in chunks that were 500 bytes long, but in
169 a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
170 bytes at the start of the buffer.
171 </P>
172 <P>
173 3. Matching a subject string that is split into multiple segments does not
174 always produce exactly the same result as matching over one single long string.
175 The difference arises when there are multiple matching possibilities, because a
176 partial match result is given only when there are no completed matches in a
177 call to <b>pcre_dfa_exec()</b>. This means that as soon as the shortest match has
178 been found, continuation to a new subject segment is no longer possible.
179 Consider this <b>pcretest</b> example:
180 <pre>
181 re&#62; /dog(sbody)?/
182 data&#62; do\P\D
183 Partial match: do
184 data&#62; gsb\R\P\D
185 0: g
186 data&#62; dogsbody\D
187 0: dogsbody
188 1: dog
189 </pre>
190 The pattern matches the words "dog" or "dogsbody". When the subject is
191 presented in several parts ("do" and "gsb" being the first two) the match stops
192 when "dog" has been found, and it is not possible to continue. On the other
193 hand, if "dogsbody" is presented as a single string, both matches are found.
194 </P>
195 <P>
196 Because of this phenomenon, it does not usually make sense to end a pattern
197 that is going to be matched in this way with a variable repeat.
198 </P>
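<P>
The overlap arrangement suggested in point 2 above (500-byte chunks in a
700-byte buffer, with the previous 200 bytes kept at the start) can be sketched
in plain C. The structure and function names here are hypothetical, and the
sizes are just the figures used in that example. The function only prepares
the buffer; the caller then passes the buffer, its length, and the returned
value as the starting offset to the matching function.
</P>

```c
#include <stddef.h>
#include <string.h>

#define OVERLAP 200                  /* bytes of earlier data to retain */
#define CHUNK   500                  /* maximum new bytes per call */
#define BUFSIZE (OVERLAP + CHUNK)

/* Hypothetical buffer type; zero-initialize it before first use. */
struct segment_buffer {
    char   data[BUFSIZE];
    size_t length;                   /* valid bytes currently in data */
};

/* Load the next chunk, keeping up to OVERLAP bytes from the end of the
   previous contents at the start of the buffer. Returns the offset at
   which the new data begins, which is the starting offset for matching. */
size_t load_chunk(struct segment_buffer *buf,
                  const char *chunk, size_t chunk_len)
{
    size_t keep = buf->length < OVERLAP ? buf->length : OVERLAP;

    /* Slide the tail of the old data down to the front of the buffer. */
    memmove(buf->data, buf->data + buf->length - keep, keep);

    /* Callers should pass at most CHUNK bytes; excess is ignored here. */
    if (chunk_len > CHUNK)
        chunk_len = CHUNK;
    memcpy(buf->data + keep, chunk, chunk_len);

    buf->length = keep + chunk_len;
    return keep;
}
```

<P>
On the first call nothing is retained, so the returned offset is zero; on
later calls with full chunks it is 200, matching the figures given in point 2.
</P>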
199 <P>
200 Last updated: 28 February 2005
201 <br>
202 Copyright &copy; 1997-2005 University of Cambridge.
203 <p>
204 Return to the <a href="index.html">PCRE index page</a>.
205 </p>
