Diff of /code/trunk/doc/pcrepartial.3

revision 75 by nigel, Sat Feb 24 21:40:37 2007 UTC revision 169 by ph10, Mon Jun 4 10:49:21 2007 UTC
# Line 1  Line 1 
1  .TH PCRE 3  .TH PCREPARTIAL 3
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PARTIAL MATCHING IN PCRE"  .SH "PARTIAL MATCHING IN PCRE"
5  .rs  .rs
6  .sp  .sp
7  In normal use of PCRE, if the subject string that is passed to  In normal use of PCRE, if the subject string that is passed to
8  \fBpcre_exec()\fP matches as far as it goes, but is too short to match the  \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
9  entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where  too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
10  it might be helpful to distinguish this case from other cases in which there is  are circumstances where it might be helpful to distinguish this case from other
11  no match.  cases in which there is no match.
12  .P  .P
13  Consider, for example, an application where a human is required to type in data  Consider, for example, an application where a human is required to type in data
14  for a field with specific formatting requirements. An example might be a date  for a field with specific formatting requirements. An example might be a date
# Line 24  user interface than a check that is dela Line 24  user interface than a check that is dela
24  entered.  entered.
25  .P  .P
26  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
27  option, which can be set when calling \fBpcre_exec()\fP. When this is done, the  option, which can be set when calling \fBpcre_exec()\fP or
28  return code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any  \fBpcre_dfa_exec()\fP. When this flag is set for \fBpcre_exec()\fP, the return
29  time during the matching process the entire subject string matched part of the  code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
30  pattern. No captured data is set when this occurs.  during the matching process the last part of the subject string matched part of
31    the pattern. Unfortunately, for non-anchored matching, it is not possible to
32    obtain the position of the start of the partial match. No captured data is set
33    when PCRE_ERROR_PARTIAL is returned.
34    .P
35    When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
36    PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
37    subject is reached, there have been no complete matches, but there is still at
38    least one matching possibility. The portion of the string that provided the
39    partial match is set as the first matching string.
40  .P  .P
41  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
42  last literal byte in a pattern, and abandons matching immediately if such a  last literal byte in a pattern, and abandons matching immediately if such a
# Line 38  for a subject string that might match on Line 47  for a subject string that might match on
47  .SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"  .SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"
48  .rs  .rs
49  .sp  .sp
50  Because of the way certain internal optimizations are implemented in PCRE, the  Because of the way certain internal optimizations are implemented in the
51  PCRE_PARTIAL option cannot be used with all patterns. Repeated single  \fBpcre_exec()\fP function, the PCRE_PARTIAL option cannot be used with all
52  characters such as  patterns. These restrictions do not apply when \fBpcre_dfa_exec()\fP is used.
53    For \fBpcre_exec()\fP, repeated single characters such as
54  .sp  .sp
55    a{2,4}    a{2,4}
56  .sp  .sp
# Line 61  envisaged for this facility, this is not Line 71  envisaged for this facility, this is not
71  .P  .P
72  If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,  If PCRE_PARTIAL is set for a pattern that does not conform to the restrictions,
73  \fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).  \fBpcre_exec()\fP returns the error code PCRE_ERROR_BADPARTIAL (-13).
74    You can use the PCRE_INFO_OKPARTIAL call to \fBpcre_fullinfo()\fP to find out
75    if a compiled pattern can be used for partial matching.
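
A calling program has to tell these return codes apart. The following is a minimal sketch of one way to do that for the interactive-input case described earlier, not code taken from PCRE. The `input_state` type and `classify_exec_result()` function are hypothetical names, and the numeric values are copied from pcre.h only so that the fragment compiles without the library; real code should include <pcre.h> and use its macros.

```c
/* Values mirrored from pcre.h for illustration only (assumption: PCRE 7.x
   numbering). Real code should include <pcre.h> and use the PCRE_ERROR_*
   macros directly. */
enum {
    SKETCH_ERROR_NOMATCH    = -1,   /* PCRE_ERROR_NOMATCH */
    SKETCH_ERROR_PARTIAL    = -12,  /* PCRE_ERROR_PARTIAL */
    SKETCH_ERROR_BADPARTIAL = -13   /* PCRE_ERROR_BADPARTIAL */
};

/* Hypothetical application-level states for a field being typed in. */
typedef enum {
    INPUT_COMPLETE,            /* the whole pattern matched */
    INPUT_KEEP_TYPING,         /* partial match: more input could complete it */
    INPUT_REJECT,              /* no match and no partial match */
    INPUT_PATTERN_UNSUITABLE,  /* pattern breaks the PCRE_PARTIAL restrictions */
    INPUT_OTHER_ERROR          /* any other negative return code */
} input_state;

/* Map the return value of a matching call made with PCRE_PARTIAL set onto
   one of the states above. */
input_state classify_exec_result(int rc)
{
    if (rc >= 0)
        return INPUT_COMPLETE;   /* zero or positive means a full match */

    switch (rc) {
    case SKETCH_ERROR_PARTIAL:    return INPUT_KEEP_TYPING;
    case SKETCH_ERROR_NOMATCH:    return INPUT_REJECT;
    case SKETCH_ERROR_BADPARTIAL: return INPUT_PATTERN_UNSUITABLE;
    default:                      return INPUT_OTHER_ERROR;
    }
}
```

With the date pattern used in this document, an application following this sketch would accept "25jun04", keep waiting after "25dec3" or "3ju", and reject "3juj" as soon as it is typed.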
76  .  .
77  .  .
78  .SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"  .SH "EXAMPLE OF PARTIAL MATCHING USING PCRETEST"
# Line 71  PCRE_PARTIAL flag is used for the match. Line 83  PCRE_PARTIAL flag is used for the match.
83  uses the date example quoted above:  uses the date example quoted above:
84  .sp  .sp
85      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
86    data> 25jun04\P    data> 25jun04\eP
87     0: 25jun04     0: 25jun04
88     1: jun     1: jun
89    data> 25dec3\P    data> 25dec3\eP
90    Partial match    Partial match
91    data> 3ju\P    data> 3ju\eP
92    Partial match    Partial match
93    data> 3juj\P    data> 3juj\eP
94    No match    No match
95    data> j\P    data> j\eP
96    No match    No match
97  .sp  .sp
98  The first data string is matched completely, so \fBpcretest\fP shows the  The first data string is matched completely, so \fBpcretest\fP shows the
99  matched substrings. The remaining four strings do not match the complete  matched substrings. The remaining four strings do not match the complete
100  pattern, but the first two are partial matches.  pattern, but the first two are partial matches. The same test, using
101    \fBpcre_dfa_exec()\fP matching (by means of the \eD escape sequence), produces
102    the following output:
103    .sp
104        re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
105      data> 25jun04\eP\eD
106       0: 25jun04
107      data> 23dec3\eP\eD
108      Partial match: 23dec3
109      data> 3ju\eP\eD
110      Partial match: 3ju
111      data> 3juj\eP\eD
112      No match
113      data> j\eP\eD
114      No match
115    .sp
116    Notice that in this case the portion of the string that was matched is made
117    available.
118  .  .
119  .  .
120    .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
121    .rs
122    .sp
123    When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
124    to continue the match by providing additional subject data and calling
125    \fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
126    time setting the PCRE_DFA_RESTART option. You must also pass the same working
127    space as before, because this is where details of the previous partial match
128    are stored. Here is an example using \fBpcretest\fP, using the \eR escape
129    sequence to set the PCRE_DFA_RESTART option (\eP and \eD are as above):
130    .sp
131        re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
132      data> 23ja\eP\eD
133      Partial match: 23ja
134      data> n05\eR\eD
135       0: n05
136    .sp
137    The first call has "23ja" as the subject, and requests partial matching; the
138    second call has "n05" as the subject for the continued (restarted) match.
139    Notice that when the match is complete, only the last part is shown; PCRE does
140    not retain the previously partially-matched string. It is up to the calling
141    program to do that if it needs to.
142    .P
143    You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
144    over multiple segments. This facility can be used to pass very long subject
145    strings to \fBpcre_dfa_exec()\fP. However, some care is needed for certain
146    types of pattern.
147    .P
148    1. If the pattern contains tests for the beginning or end of a line, you need
149    to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
150    subject string for any call does not contain the beginning or end of a line.
151    .P
152    2. If the pattern contains backward assertions (including \eb or \eB), you need
153    to arrange for some overlap in the subject strings to allow for this. For
154    example, you could pass the subject in chunks that are 500 bytes long, but in
155    a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
156    bytes at the start of the buffer.
157  .P  .P
158  .in 0  3. Matching a subject string that is split into multiple segments does not
159  Last updated: 08 September 2004  always produce exactly the same result as matching over one single long string.
160  .br  The difference arises when there are multiple matching possibilities, because a
161  Copyright (c) 1997-2004 University of Cambridge.  partial match result is given only when there are no completed matches in a
162    call to \fBpcre_dfa_exec()\fP. This means that as soon as the shortest match has
163    been found, continuation to a new subject segment is no longer possible.
164    Consider this \fBpcretest\fP example:
165    .sp
166        re> /dog(sbody)?/
167      data> do\eP\eD
168      Partial match: do
169      data> gsb\eR\eP\eD
170       0: g
171      data> dogsbody\eD
172       0: dogsbody
173       1: dog
174    .sp
175    The pattern matches the words "dog" or "dogsbody". When the subject is
176    presented in several parts ("do" and "gsb" being the first two) the match stops
177    when "dog" has been found, and it is not possible to continue. On the other
178    hand, if "dogsbody" is presented as a single string, both matches are found.
179    .P
180    Because of this phenomenon, it does not usually make sense to end a pattern
181    that is going to be matched in this way with a variable repeat.
182    .P
183    4. Patterns that contain alternatives at the top level which do not all
184    start with the same pattern item may not work as expected. For example,
185    consider this pattern:
186    .sp
187      1234|3789
188    .sp
189    If the first part of the subject is "ABC123", a partial match of the first
190    alternative is found at offset 3. There is no partial match for the second
191    alternative, because such a match does not start at the same point in the
192    subject string. Attempting to continue with the string "789" does not yield a
193    match because only those alternatives that match at one point in the subject
194    are remembered. The problem arises because the start of the second alternative
195    matches within the first alternative. There is no problem with anchored
196    patterns or patterns such as:
197    .sp
198      1234|ABCD
199    .sp
200    where no string can be a partial match for both alternatives.
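
The overlap arrangement described in point 2 can be sketched as follows. This is an illustrative fragment under stated assumptions, not part of PCRE: the `segment_buffer` type and `load_segment()` function are hypothetical, and the sizes are simply the 500, 200 and 700 bytes used in the example above. The function only prepares the buffer; the value it returns is the starting offset the caller would then pass to the matching function.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE   500                          /* new bytes per call */
#define OVERLAP_SIZE 200                          /* bytes kept from the previous call */
#define BUFFER_SIZE  (OVERLAP_SIZE + CHUNK_SIZE)  /* 700 bytes in total */

/* Hypothetical holder for the subject passed to each matching call. */
typedef struct {
    char   data[BUFFER_SIZE];
    size_t length;            /* number of valid bytes in data */
} segment_buffer;

/* Move the tail of the previous subject to the front of the buffer, append
   the next chunk after it, and return the offset at which the new data
   begins. That offset is what the caller would use as the starting offset
   for the next match, so backward assertions can see the retained bytes.
   Chunks longer than CHUNK_SIZE are truncated defensively; a caller should
   never pass one. */
size_t load_segment(segment_buffer *buf, const char *chunk, size_t chunk_len)
{
    size_t keep = buf->length < OVERLAP_SIZE ? buf->length : OVERLAP_SIZE;

    if (chunk_len > CHUNK_SIZE)
        chunk_len = CHUNK_SIZE;

    /* memmove, not memcpy: the source and destination regions may overlap. */
    memmove(buf->data, buf->data + (buf->length - keep), keep);
    memcpy(buf->data + keep, chunk, chunk_len);
    buf->length = keep + chunk_len;

    return keep;
}
```

On the first call the buffer is empty, so the returned offset is 0. On every later call with full 500-byte chunks it is 200, and the buffer holds 700 valid bytes, which matches the layout given in point 2.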
201    .
202    .
203    .SH AUTHOR
204    .rs
205    .sp
206    .nf
207    Philip Hazel
208    University Computing Service
209    Cambridge CB2 3QH, England.
210    .fi
211    .
212    .
213    .SH REVISION
214    .rs
215    .sp
216    .nf
217    Last updated: 04 June 2007
218    Copyright (c) 1997-2007 University of Cambridge.
219    .fi
