Diff of /code/trunk/doc/pcrepartial.3

revision 75 by nigel, Sat Feb 24 21:40:37 2007 UTC revision 93 by nigel, Sat Feb 24 21:41:42 2007 UTC
# Line 1  Line 1 
1  .TH PCRE 3  .TH PCREPARTIAL 3
2  .SH NAME  .SH NAME
3  PCRE - Perl-compatible regular expressions  PCRE - Perl-compatible regular expressions
4  .SH "PARTIAL MATCHING IN PCRE"  .SH "PARTIAL MATCHING IN PCRE"
5  .rs  .rs
6  .sp  .sp
7  In normal use of PCRE, if the subject string that is passed to  In normal use of PCRE, if the subject string that is passed to
8  \fBpcre_exec()\fP matches as far as it goes, but is too short to match the  \fBpcre_exec()\fP or \fBpcre_dfa_exec()\fP matches as far as it goes, but is
9  entire pattern, PCRE_ERROR_NOMATCH is returned. There are circumstances where  too short to match the entire pattern, PCRE_ERROR_NOMATCH is returned. There
10  it might be helpful to distinguish this case from other cases in which there is  are circumstances where it might be helpful to distinguish this case from other
11  no match.  cases in which there is no match.
12  .P  .P
13  Consider, for example, an application where a human is required to type in data  Consider, for example, an application where a human is required to type in data
14  for a field with specific formatting requirements. An example might be a date  for a field with specific formatting requirements. An example might be a date
# Line 24  user interface than a check that is dela Line 24  user interface than a check that is dela
24  entered.  entered.
25  .P  .P
26  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL  PCRE supports the concept of partial matching by means of the PCRE_PARTIAL
27  option, which can be set when calling \fBpcre_exec()\fP. When this is done, the  option, which can be set when calling \fBpcre_exec()\fP or
28  return code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any  \fBpcre_dfa_exec()\fP. When this flag is set for \fBpcre_exec()\fP, the return
29  time during the matching process the entire subject string matched part of the  code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if at any time
30  pattern. No captured data is set when this occurs.  during the matching process the last part of the subject string matched part of
31    the pattern. Unfortunately, for non-anchored matching, it is not possible to
32    obtain the position of the start of the partial match. No captured data is set
33    when PCRE_ERROR_PARTIAL is returned.
34    .P
35    When PCRE_PARTIAL is set for \fBpcre_dfa_exec()\fP, the return code
36    PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end of the
37    subject is reached, there have been no complete matches, but there is still at
38    least one matching possibility. The portion of the string that provided the
39    partial match is set as the first matching string.
40  .P  .P
41  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the  Using PCRE_PARTIAL disables one of PCRE's optimizations. PCRE remembers the
42  last literal byte in a pattern, and abandons matching immediately if such a  last literal byte in a pattern, and abandons matching immediately if such a
# Line 38  for a subject string that might match on Line 47  for a subject string that might match on
47  .SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"  .SH "RESTRICTED PATTERNS FOR PCRE_PARTIAL"
48  .rs  .rs
49  .sp  .sp
50  Because of the way certain internal optimizations are implemented in PCRE, the  Because of the way certain internal optimizations are implemented in the
51  PCRE_PARTIAL option cannot be used with all patterns. Repeated single  \fBpcre_exec()\fP function, the PCRE_PARTIAL option cannot be used with all
52  characters such as  patterns. These restrictions do not apply when \fBpcre_dfa_exec()\fP is used.
53    For \fBpcre_exec()\fP, repeated single characters such as
54  .sp  .sp
55    a{2,4}    a{2,4}
56  .sp  .sp
# Line 71  PCRE_PARTIAL flag is used for the match. Line 81  PCRE_PARTIAL flag is used for the match.
81  uses the date example quoted above:  uses the date example quoted above:
82  .sp  .sp
83      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/      re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
84    data> 25jun04\P    data> 25jun04\eP
85     0: 25jun04     0: 25jun04
86     1: jun     1: jun
87    data> 25dec3\P    data> 25dec3\eP
88    Partial match    Partial match
89    data> 3ju\P    data> 3ju\eP
90    Partial match    Partial match
91    data> 3juj\P    data> 3juj\eP
92    No match    No match
93    data> j\P    data> j\eP
94    No match    No match
95  .sp  .sp
96  The first data string is matched completely, so \fBpcretest\fP shows the  The first data string is matched completely, so \fBpcretest\fP shows the
97  matched substrings. The remaining four strings do not match the complete  matched substrings. The remaining four strings do not match the complete
98  pattern, but the first two are partial matches.  pattern, but the first two are partial matches. The same test, using
99    \fBpcre_dfa_exec()\fP matching (by means of the \eD escape sequence), produces
100    the following output:
101    .sp
102        re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
103      data> 25jun04\eP\eD
104       0: 25jun04
105      data> 23dec3\eP\eD
106      Partial match: 23dec3
107      data> 3ju\eP\eD
108      Partial match: 3ju
109      data> 3juj\eP\eD
110      No match
111      data> j\eP\eD
112      No match
113    .sp
114    Notice that in this case the portion of the string that was matched is made
115    available.
116    .
117    .
118    .SH "MULTI-SEGMENT MATCHING WITH pcre_dfa_exec()"
119    .rs
120    .sp
121    When a partial match has been found using \fBpcre_dfa_exec()\fP, it is possible
122    to continue the match by providing additional subject data and calling
123    \fBpcre_dfa_exec()\fP again with the same compiled regular expression, this
124    time setting the PCRE_DFA_RESTART option. You must also pass the same working
125    space as before, because this is where details of the previous partial match
126    are stored. Here is an example using \fBpcretest\fP, in which the \eR escape
127    sequence sets the PCRE_DFA_RESTART option (\eP and \eD are as above):
128    .sp
129        re> /^\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed$/
130      data> 23ja\eP\eD
131      Partial match: 23ja
132      data> n05\eR\eD
133       0: n05
134    .sp
135    The first call has "23ja" as the subject, and requests partial matching; the
136    second call has "n05" as the subject for the continued (restarted) match.
137    Notice that when the match is complete, only the last part is shown; PCRE does
138    not retain the previously partially-matched string. It is up to the calling
139    program to do that if it needs to.
140    .P
141    You can set PCRE_PARTIAL with PCRE_DFA_RESTART to continue partial matching
142    over multiple segments. This facility can be used to pass very long subject
143    strings to \fBpcre_dfa_exec()\fP. However, some care is needed for certain
144    types of pattern.
145    .P
146    1. If the pattern contains tests for the beginning or end of a line, you need
147    to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the
148    subject string for any call does not contain the beginning or end of a line.
149    .P
150    2. If the pattern contains backward assertions (including \eb or \eB), you need
151    to arrange for some overlap in the subject strings to allow for this. For
152    example, you could pass the subject in chunks that are 500 bytes long, but in
153    a buffer of 700 bytes, with the starting offset set to 200 and the previous 200
154    bytes at the start of the buffer.
155    .P
156    3. Matching a subject string that is split into multiple segments does not
157    always produce exactly the same result as matching over one single long string.
158    The difference arises when there are multiple matching possibilities, because a
159    partial match result is given only when there are no completed matches in a
160    call to \fBpcre_dfa_exec()\fP. This means that as soon as the shortest match has
161    been found, continuation to a new subject segment is no longer possible.
162    Consider this \fBpcretest\fP example:
163    .sp
164        re> /dog(sbody)?/
165      data> do\eP\eD
166      Partial match: do
167      data> gsb\eR\eP\eD
168       0: g
169      data> dogsbody\eD
170       0: dogsbody
171       1: dog
172    .sp
173    The pattern matches the words "dog" or "dogsbody". When the subject is
174    presented in several parts ("do" and "gsb" being the first two) the match stops
175    when "dog" has been found, and it is not possible to continue. On the other
176    hand, if "dogsbody" is presented as a single string, both matches are found.
177    .P
178    Because of this phenomenon, it does not usually make sense to end a pattern
179    that is going to be matched in this way with a variable repeat.
180    .P
181    4. Patterns that contain alternatives at the top level which do not all
182    start with the same pattern item may not work as expected. For example,
183    consider this pattern:
184    .sp
185      1234|3789
186    .sp
187    If the first part of the subject is "ABC123", a partial match of the first
188    alternative is found at offset 3. There is no partial match for the second
189    alternative, because such a match does not start at the same point in the
190    subject string. Attempting to continue with the string "789" does not yield a
191    match because only those alternatives that match at one point in the subject
192    are remembered. The problem arises because the start of the second alternative
193    matches within the first alternative. There is no problem with anchored
194    patterns or patterns such as:
195    .sp
196      1234|ABCD
197    .sp
198    where no string can be a partial match for both alternatives.
199  .  .
200  .  .
201  .P  .P
202  .in 0  .in 0
203  Last updated: 08 September 2004  Last updated: 30 November 2006
204  .br  .br
205  Copyright (c) 1997-2004 University of Cambridge.  Copyright (c) 1997-2006 University of Cambridge.
