Diff of /code/trunk/doc/pcrepartial.3

revision 75 by nigel, Sat Feb 24 21:40:37 2007 UTC revision 426 by ph10, Wed Aug 26 15:38:32 2009 UTC
# Line 1  Line 1 
1  .TH PCRE 3  .TH PCREPARTIAL 3
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PARTIAL MATCHING IN PCRE"  .SH "PARTIAL MATCHING IN PCRE"
5  .rs  .rs
6  .sp  .sp
7  In normal use of PCRE, if the subject string that is passed to  In normal use of PCRE, if the subject string that is passed to
8  \fBpcre_exec()\fP matches as far as it goes, but is too short to match the  \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
9  entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where  too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
10  it might be helpful to distinguish this case from other cases in which there is  are circumstances where it might be helpful to distinguish this case from other
11  no match.  cases in which there is no match.
12  .P  .P
13  Consider, for example, an application where a human is required to type in data  Consider, for example, an application where a human is required to type in data
14  for a field with specific formatting requirements. An example might be a date  for a field with specific formatting requirements. An example might be a date
# Line 24  user interface than a check that is dela Line 24  user interface than a check that is dela
24  entered.  entered.
25  .P  .P
26  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
27  option, which can be set when calling \fBpcre_exec()\fP. When this is done, the  option, which can be set when calling \fBpcre_exec()\fP or
28  return code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any  \fBpcre_dfa_exec()\fP.
29  time during the matching process the entire subject string matched part of the  .P
30  pattern. No captured data is set when this occurs.  When PCRE_PARTIAL is set for \fBpcre_exec()\fP, the return code
31    PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time during
32    the matching process the last part of the subject string matched part of the
33    pattern. If there are at least two slots in the offsets vector, they are filled
34    in with the offsets of the longest found string that partially matched. No
35    other captured data is set when PCRE_ERROR_PARTIAL is returned. The second
36    offset is always that for the end of the subject. Consider this pattern:
37    .sp
38      /123\ew+X|dogY/
39    .sp
40    If this is matched against the subject string "abc123dog", both
41    alternatives fail to match, but the end of the subject is reached, so
42    PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH if the
43    PCRE_PARTIAL option is set. The offsets are set to 3 and 9, identifying
44    "123dog" as the longest partial match that was found. (In this example, there
45    are two partial matches, because "dog" on its own partially matches the second
46    alternative.)
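
The offsets described above can be turned back into the partially matched text with ordinary pointer arithmetic. The following is a minimal sketch, not part of PCRE: the helper name copy_partial_match() is hypothetical, and the two error-code values are repeated locally (as defined in pcre.h) only so that the fragment compiles without the library. It assumes the caller has already obtained the return code and the first two offset slots from \fBpcre_exec()\fP.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Values as defined in pcre.h, repeated here only so that the sketch
   compiles on its own. Real code should include <pcre.h> instead. */
#ifndef PCRE_ERROR_NOMATCH
#define PCRE_ERROR_NOMATCH  (-1)
#endif
#ifndef PCRE_ERROR_PARTIAL
#define PCRE_ERROR_PARTIAL  (-12)
#endif

/* Hypothetical helper: if rc is PCRE_ERROR_PARTIAL, copy the partially
   matched text (subject[ovector[0]] up to subject[ovector[1]-1]) into
   out as a NUL-terminated string. Returns the length copied, or -1 if
   rc is any other code or out is too small. */
static int copy_partial_match(int rc, const char *subject,
                              const int *ovector,
                              char *out, size_t outsize)
{
    if (rc != PCRE_ERROR_PARTIAL)
        return -1;

    /* The second offset is the end of the subject, so it can never
       precede the first. */
    assert(ovector[0] >= 0 && ovector[1] >= ovector[0]);

    size_t len = (size_t)(ovector[1] - ovector[0]);
    if (len + 1 > outsize)
        return -1;

    memcpy(out, subject + ovector[0], len);
    out[len] = '\0';
    return (int)len;
}
```

With the subject "abc123dog" and offsets 3 and 9, as in the example above, the helper copies "123dog".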
47    .P
48    When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
49    PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
50    subject is reached, there have been no complete matches, but there is still at
51    least one matching possibility. The portion of the string that provided the
52    longest partial match is set as the first matching string, provided there are
53    at least two slots in the offsets vector.
54  .P  .P
55  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
56  last literal byte in a pattern, and abandons matching immediately if such a  last literal byte in a pattern, and abandons matching immediately if such a
# Line 35  byte is not present in the subject strin Line 58  byte is not present in the subject strin
58  for a subject string that might match only partially.  for a subject string that might match only partially.
59  .  .
60  .  .
61  .SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"  .SH "FORMERLY RESTRICTED PATTERNS FOR PCRE_PARTIAL"
62  .rs  .rs
63  .sp  .sp
64  Because of the way certain internal optimizations are implemented in PCRE, the  For releases of PCRE prior to 8.00, because of the way certain internal
65  PCRE_PARTIAL option cannot be used with all patterns. Repeated single  optimizations were implemented in the \fBpcre_exec()\fP function, the
66  characters such as  PCRE_PARTIAL option could not be used with all patterns. From release 8.00
67  .sp  onwards, the restrictions no longer apply, and partial matching can be
68    a{2,4}  requested for any pattern.
 .sp  
 and repeated single metasequences such as  
 .sp  
   \ed+  
 .sp  
 are not permitted if the maximum number of occurrences is greater than one.  
 Optional items such as \ed? (where the maximum is one) are permitted.  
 Quantifiers with any values are permitted after parentheses, so the invalid  
 examples above can be coded thus:  
 .sp  
   (a){2,4}  
   (\ed)+  
 .sp  
 These constructions run more slowly, but for the kinds of application that are  
 envisaged for this facility, this is not felt to be a major restriction.  
69  .P  .P
70  If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,  Items that were formerly restricted were repeated single characters and
71  \fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).  repeated metasequences. If PCRE_PARTIAL was set for a pattern that did not
72    conform to the restrictions, \fBpcre_exec()\fP returned the error code
73    PCRE_ERROR_BADPARTIAL (-13). This error code is no longer in use. The
74    PCRE_INFO_OKPARTIAL call to \fBpcre_fullinfo()\fP to find out if a compiled
75    pattern can be used for partial matching now always returns 1.
76  .  .
77  .  .
78  .SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"  .SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"
# Line 71  PCRE_PARTIAL flag is used for the match. Line 83  PCRE_PARTIAL flag is used for the match.
83  uses the date example quoted above:  uses the date example quoted above:
84  .sp  .sp
85      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
86    data> 25jun04\P    data> 25jun04\eP
87     0: 25jun04     0: 25jun04
88     1: jun     1: jun
89    data> 25dec3\P    data> 25dec3\eP
90    Partial match    Partial match: 25dec3
91    data> 3ju\P    data> 3ju\eP
92    Partial match    Partial match: 3ju
93    data> 3juj\P    data> 3juj\eP
94    No match    No match
95    data> j\P    data> j\eP
96    No match    No match
97  .sp  .sp
98  The first data string is matched completely, so \fBpcretest\fP shows the  The first data string is matched completely, so \fBpcretest\fP shows the
99  matched substrings. The remaining four strings do not match the complete  matched substrings. The remaining four strings do not match the complete
100  pattern, but the first two are partial matches.  pattern, but the first two are partial matches. Similar output is obtained
101    when \fBpcre_dfa_exec()\fP is used.
102    .
103    .
104    .SH "ISSUES WITH PARTIAL MATCHING"
105    .rs
106    .sp
107    Certain types of pattern may behave in unintuitive ways when partial matching
108    is requested, whichever matching function is used. For example, matching a
109    pattern that ends with (*FAIL), or any other assertion that causes a match to
110    fail without inspecting any data, yields PCRE_ERROR_PARTIAL rather than
111    PCRE_ERROR_NOMATCH:
112    .sp
113        re> /a+(*FAIL)/
114      data> aaa\eP
115      Partial match: aaa
116    .sp
117    Although (*FAIL) itself could possibly be made a special case, there are other
118    assertions, for example (?!), which behave in the same way, and it is not
119    possible to catch all cases. For consistency, therefore, there are no
120    exceptions to the rule that PCRE_ERROR_PARTIAL is returned instead of
121    PCRE_ERROR_NOMATCH if at any time during the match the end of the subject
122    string was reached.
123  .  .
124  .  .
125    .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
126    .rs
127    .sp
128    When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
129    to continue the match by providing additional subject data and calling
130    \fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
131    time setting the PCRE_DFA_RESTART option. You must also pass the same working
132    space as before, because this is where details of the previous partial match
133    are stored. Here is a \fBpcretest\fP example that uses the \eR escape
134    sequence to set the PCRE_DFA_RESTART option (\eP sets the PCRE_PARTIAL option,
135    and \eD specifies the use of \fBpcre_dfa_exec()\fP):
136    .sp
137        re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
138      data> 23ja\eP\eD
139      Partial match: 23ja
140      data> n05\eR\eD
141       0: n05
142    .sp
143    The first call has "23ja" as the subject, and requests partial matching; the
144    second call has "n05" as the subject for the continued (restarted) match.
145    Notice that when the match is complete, only the last part is shown; PCRE does
146    not retain the previously partially-matched string. It is up to the calling
147    program to do that if it needs to.
148  .P  .P
149  .in 0  You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
150  Last updated: 08 September 2004  over multiple segments. This facility can be used to pass very long subject
151  .br  strings to \fBpcre_dfa_exec()\fP.
152  Copyright (c) 1997-2004 University of Cambridge.  .
153    .
154    .SH "MULTI-SEGMENT MATCHING WITH pcre_exec()"
155    .rs
156    .sp
157    From release 8.00, \fBpcre_exec()\fP can also be used to do multi-segment
158    matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
159    previous match with a new segment of data. Instead, new data must be added to
160    the previous subject string, and the entire match re-run, starting from the
161    point where the partial match occurred. Earlier data can be discarded.
162    Consider an unanchored pattern that matches dates:
163    .sp
164        re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
165      data> The date is 23ja\eP
166      Partial match: 23ja
167    .sp
168    At this stage, an application could discard the text preceding "23ja", add on
169    text from the next segment, and call \fBpcre_exec()\fP again. Unlike
170    \fBpcre_dfa_exec()\fP, the entire matching string must always be available, and
171    the complete matching process occurs for each call, so more memory and more
172    processing time are needed.
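
The buffer handling between two \fBpcre_exec()\fP calls might look like the sketch below. It covers only the bookkeeping: discarding the text before the partial match and appending the next segment. The helper name carry_over_and_append() and its parameters are hypothetical; partial_start is assumed to be the first offset returned alongside PCRE_ERROR_PARTIAL. The matching call itself is left out so that the fragment does not depend on the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper. buf holds *len bytes of subject text, and
   partial_start is the offset at which the partial match began.
   Everything before partial_start is dropped, the partially matched
   tail is moved to the front of buf, and the next segment is appended.
   Returns 0 on success, or -1 if the result would exceed cap bytes. */
static int carry_over_and_append(char *buf, size_t *len, size_t cap,
                                 size_t partial_start,
                                 const char *segment, size_t seglen)
{
    assert(partial_start <= *len);

    size_t kept = *len - partial_start;   /* bytes of the partial match */
    if (kept + seglen > cap)
        return -1;

    /* memmove, not memcpy: source and destination may overlap. */
    memmove(buf, buf + partial_start, kept);
    memcpy(buf + kept, segment, seglen);
    *len = kept + seglen;
    return 0;
}
```

For the subject "The date is 23ja" with a partial match starting at offset 12, appending the segment "n05" leaves "23jan05" in the buffer, which is what the next call would be given as its subject.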
173    .
174    .
175    .SH "ISSUES WITH MULTI-SEGMENT MATCHING"
176    .rs
177    .sp
178    Certain types of pattern may give problems with multi-segment matching,
179    whichever matching function is used.
180    .P
181    1. If the pattern contains tests for the beginning or end of a line, you need
182    to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
183    subject string for any call does not contain the beginning or end of a line.
184    .P
185    2. If the pattern contains backward assertions (including \eb or \eB), you need
186    to arrange for some overlap in the subject strings to allow for this. For
187    example, using \fBpcre_dfa_exec()\fP, you could pass the subject in chunks that
188    are 500 bytes long, but in a buffer of 700 bytes, with the starting offset set
189    to 200 and the previous 200 bytes at the start of the buffer.
190    .P
191    3. Matching a subject string that is split into multiple segments does not
192    always produce exactly the same result as matching over one single long string.
193    The difference arises when there are multiple matching possibilities, because a
194    partial match result is given only when there are no completed matches. This
195    means that as soon as the shortest match has been found, continuation to a new
196    subject segment is no longer possible. Consider this \fBpcretest\fP example:
197    .sp
198        re> /dog(sbody)?/
199      data> dogsb\eP
200       0: dog
201      data> do\eP\eD
202      Partial match: do
203      data> gsb\eR\eP\eD
204       0: g
205      data> dogsbody\eD
206       0: dogsbody
207       1: dog
208    .sp
209    The pattern matches "dog" or "dogsbody". The first data line passes the string
210    "dogsb" to \fBpcre_exec()\fP, setting the PCRE_PARTIAL option. Although the
211    string is a partial match for "dogsbody", the result is not PCRE_ERROR_PARTIAL,
212    because the shorter string "dog" is a complete match. Similarly, when the
213    subject is presented to \fBpcre_dfa_exec()\fP in several parts ("do" and "gsb"
214    being the first two) the match stops when "dog" has been found, and it is not
215    possible to continue. On the other hand, if "dogsbody" is presented as a single
216    string, \fBpcre_dfa_exec()\fP finds both matches.
217    .P
218    Because of this phenomenon, it does not usually make sense to end a pattern
219    that is going to be matched in this way with a variable repeat.
220    .P
221    4. Patterns that contain alternatives at the top level which do not all
222    start with the same pattern item may not work as expected when
223    \fBpcre_dfa_exec()\fP is used. For example, consider this pattern:
224    .sp
225      1234|3789
226    .sp
227    If the first part of the subject is "ABC123", a partial match of the first
228    alternative is found at offset 3. There is no partial match for the second
229    alternative, because such a match does not start at the same point in the
230    subject string. Attempting to continue with the string "7890" does not yield a
231    match because only those alternatives that match at one point in the subject
232    are remembered. The problem arises because the start of the second alternative
233    matches within the first alternative. There is no problem with anchored
234    patterns or patterns such as:
235    .sp
236      1234|ABCD
237    .sp
238    where no string can be a partial match for both alternatives. This is not a
239    problem if \fBpcre_exec()\fP is used, because the entire match has to be rerun
240    each time:
241    .sp
242        re> /1234|3789/
243      data> ABC123\eP
244      Partial match: 123
245      data> 1237890
246       0: 3789
247    .sp
248    .
249    .
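
The overlapping buffer layout suggested in item 2 above (500-byte chunks in a 700-byte buffer, with the previous 200 bytes kept at the start) can be sketched as follows. The Window type and the window_init() and window_load() functions are hypothetical names introduced for illustration. The start_offset field is the value that would be passed as the starting offset argument of \fBpcre_dfa_exec()\fP; the call itself is omitted so that the fragment compiles without the library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK   500   /* bytes of new subject data per call */
#define OVERLAP 200   /* bytes of earlier data kept for backward assertions */

/* Hypothetical sliding window over a long subject. */
typedef struct {
    char   data[OVERLAP + CHUNK];
    size_t length;        /* number of valid bytes in data */
    size_t start_offset;  /* where matching should begin in data */
} Window;

static void window_init(Window *w)
{
    w->length = 0;
    w->start_offset = 0;
}

/* Keep up to OVERLAP bytes from the end of the current window at the
   front of the buffer, then place the new chunk after them. */
static void window_load(Window *w, const char *chunk, size_t chunklen)
{
    assert(chunklen <= CHUNK);

    size_t keep = w->length < OVERLAP ? w->length : OVERLAP;

    memmove(w->data, w->data + (w->length - keep), keep);
    memcpy(w->data + keep, chunk, chunklen);

    w->start_offset = keep;          /* 0 for the first chunk, then 200 */
    w->length = keep + chunklen;
}
```

For the first chunk there is no earlier data, so the starting offset is 0. For every later chunk it is 200, and the bytes before it are visible to backward assertions without being matched again.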
250    .SH AUTHOR
251    .rs
252    .sp
253    .nf
254    Philip Hazel
255    University Computing Service
256    Cambridge CB2 3QH, England.
257    .fi
258    .
259    .
260    .SH REVISION
261    .rs
262    .sp
263    .nf
264    Last updated: 26 August 2009
265    Copyright (c) 1997-2009 University of Cambridge.
266    .fi
