
Diff of /code/trunk/doc/pcrepartial.3

revision 552 by ph10, Mon Oct 19 14:38:48 2009 UTC revision 553 by ph10, Fri Oct 22 15:57:50 2010 UTC
# Line 21  what has been typed so far is potentiall Line 21  what has been typed so far is potentiall
21  as soon as a mistake is made, by beeping and not reflecting the character that  as soon as a mistake is made, by beeping and not reflecting the character that
22  has been typed, for example. This immediate feedback is likely to be a better  has been typed, for example. This immediate feedback is likely to be a better
23  user interface than a check that is delayed until the entire string has been  user interface than a check that is delayed until the entire string has been
24  entered. Partial matching can also sometimes be useful when the subject string  entered. Partial matching can also be useful when the subject string is very
25  is very long and is not all available at once.  long and is not all available at once.
26  .P  .P
27  PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and  PCRE supports partial matching by means of the PCRE_PARTIAL_SOFT and
28  PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or  PCRE_PARTIAL_HARD options, which can be set when calling \fBpcre_exec()\fP or
# Line 44  also disabled for partial matching. Line 44  also disabled for partial matching.
44  .SH "PARTIAL MATCHING USING pcre_exec()"  .SH "PARTIAL MATCHING USING pcre_exec()"
45  .rs  .rs
46  .sp  .sp
47  A partial match occurs during a call to \fBpcre_exec()\fP whenever the end of  A partial match occurs during a call to \fBpcre_exec()\fP when the end of the
48  the subject string is reached successfully, but matching cannot continue  subject string is reached successfully, but matching cannot continue because
49  because more characters are needed. However, at least one character must have  more characters are needed. However, at least one character in the subject must
50  been matched. (In other words, a partial match can never be an empty string.)  have been inspected. This character need not form part of the final matched
51  .P  string; lookbehind assertions and the \eK escape sequence provide ways of
52  If PCRE_PARTIAL_SOFT is set, the partial match is remembered, but matching  inspecting characters before the start of a matched substring. The requirement
53  continues as normal, and other alternatives in the pattern are tried. If no  for inspecting at least one character exists because an empty string can always
54  complete match can be found, \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL  be matched; without such a restriction there would always be a partial match of
55  instead of PCRE_ERROR_NOMATCH. If there are at least two slots in the offsets  an empty string at the end of the subject.
56  vector, the first of them is set to the offset of the earliest character that  .P
57  was inspected when the partial match was found. For convenience, the second  If there are at least two slots in the offsets vector when \fBpcre_exec()\fP
58  offset points to the end of the string so that a substring can easily be  returns with a partial match, the first slot is set to the offset of the
59  identified.  earliest character that was inspected when the partial match was found. For
60    convenience, the second offset points to the end of the subject so that a
61    substring can easily be identified.
62  .P  .P
63  For the majority of patterns, the first offset identifies the start of the  For the majority of patterns, the first offset identifies the start of the
64  partially matched string. However, for patterns that contain lookbehind  partially matched string. However, for patterns that contain lookbehind
# Line 68  inspected while carrying out the match. Line 70  inspected while carrying out the match.
70  This pattern matches "123", but only if it is preceded by "abc". If the subject  This pattern matches "123", but only if it is preceded by "abc". If the subject
71  string is "xyzabc12", the offsets after a partial match are for the substring  string is "xyzabc12", the offsets after a partial match are for the substring
72  "abc12", because all these characters are needed if another match is tried  "abc12", because all these characters are needed if another match is tried
73  with extra characters added.  with extra characters added to the subject.
74    .P
75    What happens when a partial match is identified depends on which of the two
76    partial matching options is set.
77    .
78    .
79    .SS "PCRE_PARTIAL_SOFT with pcre_exec()"
80    .rs
81    .sp
82    If PCRE_PARTIAL_SOFT is set when \fBpcre_exec()\fP identifies a partial match,
83    the partial match is remembered, but matching continues as normal, and other
84    alternatives in the pattern are tried. If no complete match can be found,
85    \fBpcre_exec()\fP returns PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH.
86    .P
87    This option is "soft" because it prefers a complete match over a partial match.
88    All the various matching items in a pattern behave as if the subject string is
89    potentially complete. For example, \ez, \eZ, and $ match at the end of the
90    subject, as normal, and for \eb and \eB the end of the subject is treated as a
91    non-alphanumeric.
92  .P  .P
93  If there is more than one partial match, the first one that was found provides  If there is more than one partial match, the first one that was found provides
94  the data that is returned. Consider this pattern:  the data that is returned. Consider this pattern:
# Line 77  the data that is returned. Consider this Line 97  the data that is returned. Consider this
97  .sp  .sp
98  If this is matched against the subject string "abc123dog", both  If this is matched against the subject string "abc123dog", both
99  alternatives fail to match, but the end of the subject is reached during  alternatives fail to match, but the end of the subject is reached during
100  matching, so PCRE_ERROR_PARTIAL is returned instead of PCRE_ERROR_NOMATCH. The  matching, so PCRE_ERROR_PARTIAL is returned. The offsets are set to 3 and 9,
101  offsets are set to 3 and 9, identifying "123dog" as the first partial match  identifying "123dog" as the first partial match that was found. (In this
102  that was found. (In this example, there are two partial matches, because "dog"  example, there are two partial matches, because "dog" on its own partially
103  on its own partially matches the second alternative.)  matches the second alternative.)
104  .P  .
105    .
106    .SS "PCRE_PARTIAL_HARD with pcre_exec()"
107    .rs
108    .sp
109  If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns  If PCRE_PARTIAL_HARD is set for \fBpcre_exec()\fP, it returns
110  PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to  PCRE_ERROR_PARTIAL as soon as a partial match is found, without continuing to
111  search for possible complete matches. The difference between the two options  search for possible complete matches. This option is "hard" because it prefers
112  can be illustrated by a pattern such as:  an earlier partial match over a later complete match. For this reason, the
113    assumption is made that the end of the supplied subject string may not be the
114    true end of the available data, and so, if \ez, \eZ, \eb, \eB, or $ are
115    encountered at the end of the subject, the result is PCRE_ERROR_PARTIAL.
116    .
117    .
118    .SS "Comparing hard and soft partial matching"
119    .rs
120    .sp
121    The difference between the two partial matching options can be illustrated by a
122    pattern such as:
123  .sp  .sp
124    /dog(sbody)?/    /dog(sbody)?/
125  .sp  .sp
# Line 115  The \fBpcre_dfa_exec()\fP function moves Line 149  The \fBpcre_dfa_exec()\fP function moves
149  character, without backtracking, searching for all possible matches  character, without backtracking, searching for all possible matches
150  simultaneously. If the end of the subject is reached before the end of the  simultaneously. If the end of the subject is reached before the end of the
151  pattern, there is the possibility of a partial match, again provided that at  pattern, there is the possibility of a partial match, again provided that at
152  least one character has matched.  least one character has been inspected.
153  .P  .P
154  When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there  When PCRE_PARTIAL_SOFT is set, PCRE_ERROR_PARTIAL is returned only if there
155  have been no complete matches. Otherwise, the complete matches are returned.  have been no complete matches. Otherwise, the complete matches are returned.
# Line 240  From release 8.00, \fBpcre_exec()\fP can Line 274  From release 8.00, \fBpcre_exec()\fP can
274  matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the  matching. Unlike \fBpcre_dfa_exec()\fP, it is not possible to restart the
275  previous match with a new segment of data. Instead, new data must be added to  previous match with a new segment of data. Instead, new data must be added to
276  the previous subject string, and the entire match re-run, starting from the  the previous subject string, and the entire match re-run, starting from the
277  point where the partial match occurred. Earlier data can be discarded.  point where the partial match occurred. Earlier data can be discarded. It is
278  Consider an unanchored pattern that matches dates:  best to use PCRE_PARTIAL_HARD in this situation, because it does not treat the
279    end of a segment as the end of the subject when matching \ez, \eZ, \eb, \eB,
280    and $. Consider an unanchored pattern that matches dates:
281  .sp  .sp
282      re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/      re> /\ed?\ed(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\ed\ed/
283    data> The date is 23ja\eP    data> The date is 23ja\eP\eP
284    Partial match: 23ja    Partial match: 23ja
285  .sp  .sp
286  At this stage, an application could discard the text preceding "23ja", add on  At this stage, an application could discard the text preceding "23ja", add on
# Line 265  be retained when adding on more characte Line 301  be retained when adding on more characte
301  Certain types of pattern may give problems with multi-segment matching,  Certain types of pattern may give problems with multi-segment matching,
302  whichever matching function is used.  whichever matching function is used.
303  .P  .P
304  1. If the pattern contains tests for the beginning or end of a line, you need  1. If the pattern contains a test for the beginning of a line, you need to pass
305  to pass the PCRE_NOTBOL or PCRE_NOTEOL options, as appropriate, when the  the PCRE_NOTBOL option when the subject string for any call does not start at the
306  subject string for any call does not contain the beginning or end of a line.  beginning of a line. There is also a PCRE_NOTEOL option, but in practice when
307    doing multi-segment matching you should be using PCRE_PARTIAL_HARD, which
308    includes the effect of PCRE_NOTEOL.
309  .P  .P
310  2. Lookbehind assertions at the start of a pattern are catered for in the  2. Lookbehind assertions at the start of a pattern are catered for in the
311  offsets that are returned for a partial match. However, in theory, a lookbehind  offsets that are returned for a partial match. However, in theory, a lookbehind
# Line 281  always produce exactly the same result a Line 319  always produce exactly the same result a
319  especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and  especially when PCRE_PARTIAL_SOFT is used. The section "Partial Matching and
320  Word Boundaries" above describes an issue that arises if the pattern ends with  Word Boundaries" above describes an issue that arises if the pattern ends with
321  \eb or \eB. Another kind of difference may occur when there are multiple  \eb or \eB. Another kind of difference may occur when there are multiple
322  matching possibilities, because a partial match result is given only when there  matching possibilities, because (for PCRE_PARTIAL_SOFT) a partial match result
323  are no completed matches. This means that as soon as the shortest match has  is given only when there are no completed matches. This means that as soon as
324  been found, continuation to a new subject segment is no longer possible.  the shortest match has been found, continuation to a new subject segment is no
325  Consider again this \fBpcretest\fP example:  longer possible. Consider again this \fBpcretest\fP example:
326  .sp  .sp
327      re> /dog(sbody)?/      re> /dog(sbody)?/
328    data> dogsb\eP    data> dogsb\eP
# Line 306  match stops when "dog" has been found, a Line 344  match stops when "dog" has been found, a
344  the other hand, if "dogsbody" is presented as a single string,  the other hand, if "dogsbody" is presented as a single string,
345  \fBpcre_dfa_exec()\fP finds both matches.  \fBpcre_dfa_exec()\fP finds both matches.
346  .P  .P
347  Because of these problems, it is probably best to use PCRE_PARTIAL_HARD when  Because of these problems, it is best to use PCRE_PARTIAL_HARD when matching
348  matching multi-segment data. The example above then behaves differently:  multi-segment data. The example above then behaves differently:
349  .sp  .sp
350      re> /dog(sbody)?/      re> /dog(sbody)?/
351    data> dogsb\eP\eP    data> dogsb\eP\eP
# Line 341  problem if \fBpcre_exec()\fP is used, be Line 379  problem if \fBpcre_exec()\fP is used, be
379  each time:  each time:
380  .sp  .sp
381      re> /1234|3789/      re> /1234|3789/
382    data> ABC123\eP    data> ABC123\eP\eP
383    Partial match: 123    Partial match: 123
384    data> 1237890    data> 1237890
385     0: 3789     0: 3789
386  .sp  .sp
387  Of course, instead of using PCRE_DFA_PARTIAL, the same technique of re-running  Of course, instead of using PCRE_DFA_RESTART, the same technique of re-running
388  the entire match can also be used with \fBpcre_dfa_exec()\fP. Another  the entire match can also be used with \fBpcre_dfa_exec()\fP. Another
389  possibility is to work with two buffers. If a partial match at offset \fIn\fP  possibility is to work with two buffers. If a partial match at offset \fIn\fP
390  in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on  in the first buffer is followed by "no match" when PCRE_DFA_RESTART is used on
# Line 368  Cambridge CB2 3QH, England. Line 406  Cambridge CB2 3QH, England.
406  .rs  .rs
407  .sp  .sp
408  .nf  .nf
409  Last updated: 19 October 2009  Last updated: 22 October 2010
410  Copyright (c) 1997-2009 University of Cambridge.  Copyright (c) 1997-2010 University of Cambridge.
411  .fi  .fi
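The multi-segment technique described above for \fBpcre_exec()\fP (discard the text before the partial match, add the next segment, and re-run the match) involves some buffer handling that the page leaves to the application. The following is a minimal sketch of that step only, under stated assumptions: carry_over() is a hypothetical helper, not part of PCRE, and its start argument is assumed to be the first offset that \fBpcre_exec()\fP stored in the offsets vector when it returned PCRE_ERROR_PARTIAL.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not a PCRE function).
 *
 * buf      holds the current subject, len bytes long, in a buffer of cap bytes.
 * start    is assumed to be ovector[0] after a PCRE_ERROR_PARTIAL return,
 *          that is, the earliest character inspected by the partial match.
 * next     is the next segment of data, next_len bytes long.
 *
 * Keeps buf[start..len), appends the new segment, and returns the new
 * subject length. Returns (size_t)-1 if the arguments are inconsistent or
 * the result would not fit in cap bytes; buf is left unchanged in that case.
 */
size_t carry_over(char *buf, size_t len, size_t cap, size_t start,
                  const char *next, size_t next_len)
{
    size_t kept;

    if (len > cap || start > len)
        return (size_t)-1;

    kept = len - start;
    if (next_len > cap - kept)
        return (size_t)-1;

    memmove(buf, buf + start, kept);     /* regions may overlap */
    memcpy(buf + kept, next, next_len);  /* append the new segment */
    return kept + next_len;
}
```

With the date example, the subject "The date is 23ja" gives a partial match starting at offset 12, so calling carry_over() with start set to 12 and a next segment beginning "n10" leaves a subject that begins "23jan10". The application would then call \fBpcre_exec()\fP again on the new subject, preferably with PCRE_PARTIAL_HARD as recommended above.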
