Diff of /code/trunk/doc/html/pcrepartial.html

revision 1297 by ph10, Wed Oct 31 17:42:29 2012 UTC revision 1298 by ph10, Fri Mar 22 16:13:13 2013 UTC
# Line 81  strings. This optimization is also disab Line 81  strings. This optimization is also disab
81  <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()</a><br>  <br><a name="SEC2" href="#TOC1">PARTIAL MATCHING USING pcre_exec() OR pcre[16|32]_exec()</a><br>
82  <P>  <P>
83  A partial match occurs during a call to <b>pcre_exec()</b> or  A partial match occurs during a call to <b>pcre_exec()</b> or
84  <b>pcre[16|32]_exec()</b> when the end of the subject string is reached successfully,  <b>pcre[16|32]_exec()</b> when the end of the subject string is reached
85  but matching cannot continue because more characters are needed. However, at  successfully, but matching cannot continue because more characters are needed.
86  least one character in the subject must have been inspected. This character  However, at least one character in the subject must have been inspected. This
87  need not form part of the final matched string; lookbehind assertions and the  character need not form part of the final matched string; lookbehind assertions
88  \K escape sequence provide ways of inspecting characters before the start of a  and the \K escape sequence provide ways of inspecting characters before the
89  matched substring. The requirement for inspecting at least one character exists  start of a matched substring. The requirement for inspecting at least one
90  because an empty string can always be matched; without such a restriction there  character exists because an empty string can always be matched; without such a
91  would always be a partial match of an empty string at the end of the subject.  restriction there would always be a partial match of an empty string at the end
92    of the subject.
93  </P>  </P>
94  <P>  <P>
95  If there are at least two slots in the offsets vector when a partial match is  If there are at least two slots in the offsets vector when a partial match is
96  returned, the first slot is set to the offset of the earliest character that  returned, the first slot is set to the offset of the earliest character that
97  was inspected. For convenience, the second offset points to the end of the  was inspected. For convenience, the second offset points to the end of the
98  subject so that a substring can easily be identified.  subject so that a substring can easily be identified. If there are at least
99    three slots in the offsets vector, the third slot is set to the offset of the
100    character where matching started.
101  </P>  </P>
102  <P>  <P>
103  For the majority of patterns, the first offset identifies the start of the  For the majority of patterns, the contents of the first and third slots will be
104  partially matched string. However, for patterns that contain lookbehind  the same. However, for patterns that contain lookbehind assertions, or begin
105  assertions, or \K, or begin with \b or \B, earlier characters have been  with \b or \B, characters before the one where matching started may have been
106  inspected while carrying out the match. For example:  inspected while carrying out the match. For example, consider this pattern:
107  <pre>  <pre>
108    /(?&#60;=abc)123/    /(?&#60;=abc)123/
109  </pre>  </pre>
110  This pattern matches "123", but only if it is preceded by "abc". If the subject  This pattern matches "123", but only if it is preceded by "abc". If the subject
111  string is "xyzabc12", the offsets after a partial match are for the substring  string is "xyzabc12", the first two offsets after a partial match are for the
112  "abc12", because all these characters are needed if another match is tried  substring "abc12", because all these characters were inspected. However, the
113  with extra characters added to the subject.  third offset is set to 6, because that is the offset where matching began.
114  </P>  </P>
115  <P>  <P>
116  What happens when a partial match is identified depends on which of the two  What happens when a partial match is identified depends on which of the two
# Line 334  processing time is needed. Line 337  processing time is needed.
337  <P>  <P>
338  <b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts  <b>Note:</b> If the pattern contains lookbehind assertions, or \K, or starts
339  with \b or \B, the string that is returned for a partial match includes  with \b or \B, the string that is returned for a partial match includes
340  characters that precede the partially matched string itself, because these must  characters that precede the start of what would be returned for a complete
341  be retained when adding on more characters for a subsequent matching attempt.  match, because it contains all the characters that were inspected during the
342  However, in some cases you may need to retain even earlier characters, as  partial match.
 discussed in the next section.  
343  </P>  </P>
344  <br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>  <br><a name="SEC9" href="#TOC1">ISSUES WITH MULTI-SEGMENT MATCHING</a><br>
345  <P>  <P>
# Line 356  includes the effect of PCRE_NOTEOL. Line 358  includes the effect of PCRE_NOTEOL.
358  offsets that are returned for a partial match. However a lookbehind assertion  offsets that are returned for a partial match. However a lookbehind assertion
359  later in the pattern could require even earlier characters to be inspected. You  later in the pattern could require even earlier characters to be inspected. You
360  can handle this case by using the PCRE_INFO_MAXLOOKBEHIND option of the  can handle this case by using the PCRE_INFO_MAXLOOKBEHIND option of the
361  <b>pcre_fullinfo()</b> or <b>pcre[16|32]_fullinfo()</b> functions to obtain the length  <b>pcre_fullinfo()</b> or <b>pcre[16|32]_fullinfo()</b> functions to obtain the
362  of the largest lookbehind in the pattern. This length is given in characters,  length of the longest lookbehind in the pattern. This length is given in
363  not bytes. If you always retain at least that many characters before the  characters, not bytes. If you always retain at least that many characters
364  partially matched string, all should be well. (Of course, near the start of the  before the partially matched string, all should be well. (Of course, near the
365  subject, fewer characters may be present; in that case all characters should be  start of the subject, fewer characters may be present; in that case all
366  retained.)  characters should be retained.)
367    </P>
368    <P>
369    From release 8.33, there is a more accurate way of deciding which characters to
370    retain. Instead of subtracting the length of the longest lookbehind from the
371    earliest inspected character (<i>offsets[0]</i>), the match start position
372    (<i>offsets[2]</i>) should be used, and the next match attempt started at the
373    <i>offsets[2]</i> character by setting the <i>startoffset</i> argument of
374    <b>pcre_exec()</b> or <b>pcre_dfa_exec()</b>.
375    </P>
376    <P>
377    For example, if the pattern "(?&#60;=123)abc" is partially
378    matched against the string "xx123a", the three offset values returned are 2, 6,
379    and 5. This indicates that the matching process that gave a partial match
380    started at offset 5, but the characters "123a" were all inspected. The maximum
381    lookbehind for that pattern is 3, so taking that away from 5 shows that we need
382    only keep "123a", and the next match attempt can be started at offset 3 (that
383    is, at "a") when further characters have been added. When the match start is
384    not the earliest inspected character, <b>pcretest</b> shows it explicitly:
385    <pre>
386        re&#62; "(?&#60;=123)abc"
387      data&#62; xx123a\P\P
388      Partial match at offset 5: 123a
389    </PRE>
390  </P>  </P>
391  <P>  <P>
392  3. Because a partial match must always contain at least one character, what  3. Because a partial match must always contain at least one character, what
# Line 465  Cambridge CB2 3QH, England. Line 490  Cambridge CB2 3QH, England.
490  </P>  </P>
491  <br><a name="SEC11" href="#TOC1">REVISION</a><br>  <br><a name="SEC11" href="#TOC1">REVISION</a><br>
492  <P>  <P>
493  Last updated: 24 June 2012  Last updated: 20 February 2013
494  <br>  <br>
495  Copyright &copy; 1997-2012 University of Cambridge.  Copyright &copy; 1997-2013 University of Cambridge.
496  <br>  <br>
497  <p>  <p>
498  Return to the <a href="index.html">PCRE index page</a>.  Return to the <a href="index.html">PCRE index page</a>.
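
The retention rule added in this revision (start from <i>offsets[2]</i> and subtract the longest lookbehind) can be sketched as plain arithmetic. This is only an illustration, not code from PCRE: the `retain_plan` struct and the `plan_retention()` function are hypothetical names, the `max_lookbehind` argument is assumed to have been obtained separately via PCRE_INFO_MAXLOOKBEHIND, and the sketch assumes the 8-bit library without UTF-8 mode, so that one character is one code unit.

```c
#include <assert.h>

/* Hypothetical helper type, not part of PCRE: what to keep from the
   current buffer after a partial match. */
typedef struct {
    int keep_from;    /* offset in the old buffer of the first character to retain */
    int keep_length;  /* number of characters to retain */
    int next_start;   /* startoffset for the next attempt, relative to the retained text */
} retain_plan;

/* offsets: the first three slots set by a partial match
     offsets[0] = earliest inspected character (not needed by this rule)
     offsets[1] = end of the subject
     offsets[2] = where matching started
   max_lookbehind: length of the longest lookbehind, in characters.
   Assumption: one code unit per character (8-bit, non-UTF mode). */
retain_plan plan_retention(const int offsets[3], int max_lookbehind)
{
    retain_plan plan;

    assert(max_lookbehind >= 0);
    assert(offsets[2] >= 0 && offsets[2] <= offsets[1]);

    /* Step back from the match start by the longest lookbehind. */
    plan.keep_from = offsets[2] - max_lookbehind;
    if (plan.keep_from < 0)
        plan.keep_from = 0;  /* near the start of the subject: keep everything */

    plan.keep_length = offsets[1] - plan.keep_from;

    /* After earlier characters are discarded, the match start moves down. */
    plan.next_start = offsets[2] - plan.keep_from;
    return plan;
}
```

For the example in the added text (offsets 2, 6, and 5 with a maximum lookbehind of 3), this sketch keeps 4 characters starting at offset 2, that is "123a", and gives 3 as the start offset for the next attempt, which agrees with the figures quoted above.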
