# Diff of /code/trunk/doc/html/pcrematching.html

revision 91 by nigel, Sat Feb 24 21:41:34 2007 UTC revision 111 by ph10, Thu Mar 8 16:53:09 2007 UTC
# Line 16  man page, in case the conversion went wr Line 16  man page, in case the conversion went wr
16  <li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a>  <li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a>
17  <li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>  <li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a>
18  <li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>  <li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a>
19  <li><a name="TOC4" href="#SEC4">THE DFA MATCHING ALGORITHM</a>  <li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a>
20  <li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE DFA ALGORITHM</a>  <li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
21  <li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE DFA ALGORITHM</a>  <li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a>
22    <li><a name="TOC7" href="#SEC7">AUTHOR</a>
23    <li><a name="TOC8" href="#SEC8">REVISION</a>
24  </ul>  </ul>
25  <br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br>  <br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br>
26  <P>  <P>
# Line 46  is matched against the string Line 48  is matched against the string
48    &#60;something&#62; &#60;something else&#62; &#60;something further&#62;    &#60;something&#62; &#60;something else&#62; &#60;something further&#62;
49  </pre>  </pre>
50  there are three possible answers. The standard algorithm finds only one of  there are three possible answers. The standard algorithm finds only one of
51  them, whereas the DFA algorithm finds all three.  them, whereas the alternative algorithm finds all three.
52  </P>  </P>
53  <br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>  <br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br>
54  <P>  <P>
# Line 83  straightforward for this algorithm to ke Line 85  straightforward for this algorithm to ke
85  matched by portions of the pattern in parentheses. This provides support for  matched by portions of the pattern in parentheses. This provides support for
86  capturing parentheses and back references.  capturing parentheses and back references.
87  </P>  </P>
88  <br><a name="SEC4" href="#TOC1">THE DFA MATCHING ALGORITHM</a><br>  <br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br>
89  <P>  <P>
90  DFA stands for "deterministic finite automaton", but you do not need to  This algorithm conducts a breadth-first search of the tree. Starting from the
91  understand the origins of that name. This algorithm conducts a breadth-first  first matching point in the subject, it scans the subject string from left to
92  search of the tree. Starting from the first matching point in the subject, it  right, once, character by character, and as it does this, it remembers all the
93  scans the subject string from left to right, once, character by character, and  paths through the tree that represent valid matches. In Friedl's terminology,
94  as it does this, it remembers all the paths through the tree that represent  this is a kind of "DFA algorithm", though it is not implemented as a
95  valid matches.  traditional finite state machine (it keeps multiple states active
96    simultaneously).
97  </P>  </P>
98  <P>  <P>
99  The scan continues until either the end of the subject is reached, or there are  The scan continues until either the end of the subject is reached, or there are
# Line 114  matches that start at later positions. Line 117  matches that start at later positions.
117  </P>  </P>
118  <P>  <P>
119  There are a number of features of PCRE regular expressions that are not  There are a number of features of PCRE regular expressions that are not
120  supported by the DFA matching algorithm. They are as follows:  supported by the alternative matching algorithm. They are as follows:
121  </P>  </P>
122  <P>  <P>
123  1. Because the algorithm finds all possible matches, the greedy or ungreedy  1. Because the algorithm finds all possible matches, the greedy or ungreedy
124  nature of repetition quantifiers is not relevant. Greedy and ungreedy  nature of repetition quantifiers is not relevant. Greedy and ungreedy
125  quantifiers are treated in exactly the same way.  quantifiers are treated in exactly the same way. However, possessive
126    quantifiers can make a difference when what follows could also match what is
127    quantified, for example in a pattern like this:
128    <pre>
129      ^a++\w!
130    </pre>
131    This pattern matches "aaab!" but not "aaa!", which would be matched by a
132    non-possessive quantifier. Similarly, if an atomic group is present, it is
133    matched as if it were a standalone pattern at the current point, and the
134    longest match is then "locked in" for the rest of the overall pattern.
135  </P>  </P>
136  <P>  <P>
137  2. When dealing with multiple paths through the tree simultaneously, it is not  2. When dealing with multiple paths through the tree simultaneously, it is not
# Line 133  not supported, and cause errors if encou Line 145  not supported, and cause errors if encou
145  </P>  </P>
146  <P>  <P>
147  4. For the same reason, conditional expressions that use a backreference as the  4. For the same reason, conditional expressions that use a backreference as the
148  condition are not supported.  condition or test for a specific group recursion are not supported.
149  </P>  </P>
150  <P>  <P>
151  5. Callouts are supported, but the value of the <i>capture_top</i> field is  5. Callouts are supported, but the value of the <i>capture_top</i> field is
# Line 142  always 1, and the value of the <i>captur Line 154  always 1, and the value of the <i>captur
154  <P>  <P>
155  6.  6.
156  The \C escape sequence, which (in the standard algorithm) matches a single  The \C escape sequence, which (in the standard algorithm) matches a single
157  byte, even in UTF-8 mode, is not supported because the DFA algorithm moves  byte, even in UTF-8 mode, is not supported because the alternative algorithm
158  through the subject string one character at a time, for all active paths  moves through the subject string one character at a time, for all active paths
159  through the tree.  through the tree.
160  </P>  </P>
161  <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE DFA ALGORITHM</a><br>  <br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
162  <P>  <P>
163  Using the DFA matching algorithm provides the following advantages:  Using the alternative matching algorithm provides the following advantages:
164  </P>  </P>
165  <P>  <P>
166  1. All possible matches (at a single point in the subject) are automatically  1. All possible matches (at a single point in the subject) are automatically
# Line 159  callouts. Line 171  callouts.
171  <P>  <P>
172  2. There is much better support for partial matching. The restrictions on the  2. There is much better support for partial matching. The restrictions on the
173  content of the pattern that apply when using the standard algorithm for partial  content of the pattern that apply when using the standard algorithm for partial
174  matching do not apply to the DFA algorithm. For non-anchored patterns, the  matching do not apply to the alternative algorithm. For non-anchored patterns,
175  starting position of a partial match is available.  the starting position of a partial match is available.
176  </P>  </P>
177  <P>  <P>
178  3. Because the DFA algorithm scans the subject string just once, and never  3. Because the alternative algorithm scans the subject string just once, and
179  needs to backtrack, it is possible to pass very long subject strings to the  never needs to backtrack, it is possible to pass very long subject strings to
180  matching function in several pieces, checking for partial matching each time.  the matching function in several pieces, checking for partial matching each
181    time.
182  </P>  </P>
183  <br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE DFA ALGORITHM</a><br>  <br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br>
184  <P>  <P>
185  The DFA algorithm suffers from a number of disadvantages:  The alternative algorithm suffers from a number of disadvantages:
186  </P>  </P>
187  <P>  <P>
188  1. It is substantially slower than the standard algorithm. This is partly  1. It is substantially slower than the standard algorithm. This is partly
# Line 180  less susceptible to optimization. Line 193  less susceptible to optimization.
193  2. Capturing parentheses and back references are not supported.  2. Capturing parentheses and back references are not supported.
194  </P>  </P>
195  <P>  <P>
196  3. The "atomic group" feature of PCRE regular expressions is supported, but  3. Although atomic groups are supported, their use does not provide the
197  does not provide the advantage that it does for the standard algorithm.  performance advantage that it does for the standard algorithm.
198    </P>
199    <br><a name="SEC7" href="#TOC1">AUTHOR</a><br>
200    <P>
201    Philip Hazel
202    <br>
203    University Computing Service
204    <br>
205    Cambridge CB2 3QH, England.
206    <br>
207  </P>  </P>
208    <br><a name="SEC8" href="#TOC1">REVISION</a><br>
209  <P>  <P>
210  Last updated: 06 June 2006  Last updated: 06 March 2007
211    <br>
212    Copyright &copy; 1997-2007 University of Cambridge.
213  <br>  <br>
Copyright &copy; 1997-2006 University of Cambridge.
214  <p>  <p>