16 
<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a> 
<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a> 
17 
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> 
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> 
18 
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> 
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> 
19 
<li><a name="TOC4" href="#SEC4">THE DFA MATCHING ALGORITHM</a> 
<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a> 
20 
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE DFA ALGORITHM</a> 
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 
21 
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE DFA ALGORITHM</a> 
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 
22 
</ul> 
</ul> 
23 
<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br> 
<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br> 
24 
<P> 
<P> 
46 
<something> <something else> <something further> 
<something> <something else> <something further> 
47 
</pre> 
</pre> 
48 
there are three possible answers. The standard algorithm finds only one of 
there are three possible answers. The standard algorithm finds only one of 
49 
them, whereas the DFA algorithm finds all three. 
them, whereas the alternative algorithm finds all three. 
50 
</P> 
</P> 
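<P>
The difference between finding one answer and finding all three can be sketched
as follows. This is only an illustration in Python, not PCRE code: the pattern
used here is an assumption (the lines that state the actual pattern are not
shown in this excerpt), and the helper <i>all_matches_at_start</i> is a
hypothetical name. It simply tests every prefix of the subject, which is not
how either PCRE algorithm works internally.
</P>

```python
import re

# Assumed pattern: greedy match from an opening to a closing angle bracket.
PATTERN = r"<.*>"
SUBJECT = "<something> <something else> <something further>"


def all_matches_at_start(pattern, subject):
    """Return every prefix of the subject that the whole pattern matches."""
    compiled = re.compile(pattern)
    return [
        subject[:end]
        for end in range(len(subject) + 1)
        if compiled.fullmatch(subject[:end])
    ]


# A backtracking matcher reports a single answer (here the longest one,
# because the quantifier is greedy).
single_answer = re.match(PATTERN, SUBJECT).group()

# Enumerating all answers at the same starting point gives three strings.
every_answer = all_matches_at_start(PATTERN, SUBJECT)
```

<P>
Here <i>single_answer</i> is the whole subject, while <i>every_answer</i> holds
the three strings that end at each closing angle bracket.
</P>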
51 
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br> 
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br> 
52 
<P> 
<P> 
54 
as a tree structure. An unlimited repetition in the pattern makes the tree of 
as a tree structure. An unlimited repetition in the pattern makes the tree of 
55 
infinite size, but it is still a tree. Matching the pattern to a given subject 
infinite size, but it is still a tree. Matching the pattern to a given subject 
56 
string (from a given starting point) can be thought of as a search of the tree. 
string (from a given starting point) can be thought of as a search of the tree. 
57 
There are two standard ways to search a tree: depth-first and breadth-first, 
There are two ways to search a tree: depth-first and breadth-first, and these 
58 
and these correspond to the two matching algorithms provided by PCRE. 
correspond to the two matching algorithms provided by PCRE. 
59 
</P> 
</P> 
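<P>
The two search orders can be sketched on a deliberately tiny "pattern": a list
of literal alternatives, which forms a very small tree. This is a toy model
under that assumption, written in Python with hypothetical function names; it
is not how PCRE represents or searches patterns.
</P>

```python
def depth_first(alternatives, subject):
    """Follow one branch at a time; stop at the first branch that matches."""
    for alt in alternatives:
        if subject.startswith(alt):
            return alt
    return None


def breadth_first(alternatives, subject):
    """Scan the subject once, keeping every branch that is still alive."""
    live = list(alternatives)
    matches = []
    for pos in range(len(subject) + 1):
        # Branches fully consumed at this position are complete matches.
        matches.extend(alt for alt in live if len(alt) == pos)
        if pos == len(subject):
            break
        # Keep only branches whose next character agrees with the subject.
        live = [
            alt for alt in live
            if len(alt) > pos and alt[pos] == subject[pos]
        ]
        if not live:
            break
    return matches


branches = ["ab", "a", "abc", "ax"]
first_found = depth_first(branches, "abcd")
all_found = breadth_first(branches, "abcd")
```

<P>
The depth-first search returns the first branch that succeeds, so the order of
the alternatives matters. The breadth-first search never revisits a subject
character and reports every branch that completes, shortest first in this
sketch.
</P>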
60 
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br> 
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br> 
61 
<P> 
<P> 
83 
matched by portions of the pattern in parentheses. This provides support for 
matched by portions of the pattern in parentheses. This provides support for 
84 
capturing parentheses and back references. 
capturing parentheses and back references. 
85 
</P> 
</P> 
86 
<br><a name="SEC4" href="#TOC1">THE DFA MATCHING ALGORITHM</a><br> 
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br> 
87 
<P> 
<P> 
88 
DFA stands for "deterministic finite automaton", but you do not need to 
This algorithm conducts a breadth-first search of the tree. Starting from the 
89 
understand the origins of that name. This algorithm conducts a breadth-first 
first matching point in the subject, it scans the subject string from left to 
90 
search of the tree. Starting from the first matching point in the subject, it 
right, once, character by character, and as it does this, it remembers all the 
91 
scans the subject string from left to right, once, character by character, and 
paths through the tree that represent valid matches. In Friedl's terminology, 
92 
as it does this, it remembers all the paths through the tree that represent 
this is a kind of "DFA algorithm", though it is not implemented as a 
93 
valid matches. 
traditional finite state machine (it keeps multiple states active 
94 

simultaneously). 
95 
</P> 
</P> 
96 
<P> 
<P> 
97 
The scan continues until either the end of the subject is reached, or there are 
The scan continues until either the end of the subject is reached, or there are 
115 
</P> 
</P> 
116 
<P> 
<P> 
117 
There are a number of features of PCRE regular expressions that are not 
There are a number of features of PCRE regular expressions that are not 
118 
supported by the DFA matching algorithm. They are as follows: 
supported by the alternative matching algorithm. They are as follows: 
119 
</P> 
</P> 
120 
<P> 
<P> 
121 
1. Because the algorithm finds all possible matches, the greedy or ungreedy 
1. Because the algorithm finds all possible matches, the greedy or ungreedy 
122 
nature of repetition quantifiers is not relevant. Greedy and ungreedy 
nature of repetition quantifiers is not relevant. Greedy and ungreedy 
123 
quantifiers are treated in exactly the same way. 
quantifiers are treated in exactly the same way. However, possessive 
124 

quantifiers can make a difference when what follows could also match what is 
125 

quantified, for example in a pattern like this: 
126 

<pre> 
127 

^a++\w! 
128 

</pre> 
129 

This pattern matches "aaab!" but not "aaa!", which would be matched by a 
130 

non-possessive quantifier. Similarly, if an atomic group is present, it is 
131 

matched as if it were a standalone pattern at the current point, and the 
132 

longest match is then "locked in" for the rest of the overall pattern. 
133 
</P> 
</P> 
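<P>
The effect of the possessive quantifier in ^a++\w! can be checked with a small
Python sketch. Two assumptions apply: Python's <i>re</i> module is a
backtracking matcher, not the alternative algorithm, and the possessive
quantifier is emulated here with a lookahead plus a back reference so that the
example does not depend on a recent Python version. The emulation gives the
same "no giving back" behaviour for this pattern.
</P>

```python
import re

# Emulation of ^a++\w! : the lookahead captures the longest run of "a",
# and \1 then consumes exactly that run with no chance to give any back.
possessive = re.compile(r"^(?=(a+))\1\w!")

# The same pattern with an ordinary greedy quantifier, which may backtrack.
ordinary = re.compile(r"^a+\w!")

results = {
    "possessive aaab!": possessive.match("aaab!") is not None,
    "possessive aaa!": possessive.match("aaa!") is not None,
    "ordinary aaa!": ordinary.match("aaa!") is not None,
}
```

<P>
With the ordinary quantifier, the run of "a" gives back one character so that
\w can match it; the possessive form keeps all of them, so "aaa!" fails.
</P>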
134 
<P> 
<P> 
135 
2. When dealing with multiple paths through the tree simultaneously, it is not 
2. When dealing with multiple paths through the tree simultaneously, it is not 
143 
</P> 
</P> 
144 
<P> 
<P> 
145 
4. For the same reason, conditional expressions that use a backreference as the 
4. For the same reason, conditional expressions that use a backreference as the 
146 
condition are not supported. 
condition or test for a specific group recursion are not supported. 
147 
</P> 
</P> 
148 
<P> 
<P> 
149 
5. Callouts are supported, but the value of the <i>capture_top</i> field is 
5. Callouts are supported, but the value of the <i>capture_top</i> field is 
152 
<P> 
<P> 
153 
6. 
6. 
154 
The \C escape sequence, which (in the standard algorithm) matches a single 
The \C escape sequence, which (in the standard algorithm) matches a single 
155 
byte, even in UTF-8 mode, is not supported because the DFA algorithm moves 
byte, even in UTF-8 mode, is not supported because the alternative algorithm 
156 
through the subject string one character at a time, for all active paths 
moves through the subject string one character at a time, for all active paths 
157 
through the tree. 
through the tree. 
158 
</P> 
</P> 
159 
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE DFA ALGORITHM</a><br> 
<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 
160 
<P> 
<P> 
161 
Using the DFA matching algorithm provides the following advantages: 
Using the alternative matching algorithm provides the following advantages: 
162 
</P> 
</P> 
163 
<P> 
<P> 
164 
1. All possible matches (at a single point in the subject) are automatically 
1. All possible matches (at a single point in the subject) are automatically 
169 
<P> 
<P> 
170 
2. There is much better support for partial matching. The restrictions on the 
2. There is much better support for partial matching. The restrictions on the 
171 
content of the pattern that apply when using the standard algorithm for partial 
content of the pattern that apply when using the standard algorithm for partial 
172 
matching do not apply to the DFA algorithm. For non-anchored patterns, the 
matching do not apply to the alternative algorithm. For non-anchored patterns, 
173 
starting position of a partial match is available. 
the starting position of a partial match is available. 
174 
</P> 
</P> 
175 
<P> 
<P> 
176 
3. Because the DFA algorithm scans the subject string just once, and never 
3. Because the alternative algorithm scans the subject string just once, and 
177 
needs to backtrack, it is possible to pass very long subject strings to the 
never needs to backtrack, it is possible to pass very long subject strings to 
178 
matching function in several pieces, checking for partial matching each time. 
the matching function in several pieces, checking for partial matching each 
179 

time. 
180 
</P> 
</P> 
181 
<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE DFA ALGORITHM</a><br> 
<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 
182 
<P> 
<P> 
183 
The DFA algorithm suffers from a number of disadvantages: 
The alternative algorithm suffers from a number of disadvantages: 
184 
</P> 
</P> 
185 
<P> 
<P> 
186 
1. It is substantially slower than the standard algorithm. This is partly 
1. It is substantially slower than the standard algorithm. This is partly 
191 
2. Capturing parentheses and back references are not supported. 
2. Capturing parentheses and back references are not supported. 
192 
</P> 
</P> 
193 
<P> 
<P> 
194 
3. The "atomic group" feature of PCRE regular expressions is supported, but 
3. Although atomic groups are supported, their use does not provide the 
195 
does not provide the advantage that it does for the standard algorithm. 
performance advantage that it does for the standard algorithm. 
196 
</P> 
</P> 
197 
<P> 
<P> 
198 
Last updated: 28 February 2005 
Last updated: 24 November 2006 
199 
<br> 
<br> 
200 
Copyright © 1997-2005 University of Cambridge. 
Copyright © 1997-2006 University of Cambridge. 
201 
<p> 
<p> 
202 
Return to the <a href="index.html">PCRE index page</a>. 
Return to the <a href="index.html">PCRE index page</a>. 
203 
</p> 
</p> 