16 
<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a> 
<li><a name="TOC1" href="#SEC1">PCRE MATCHING ALGORITHMS</a> 
17 
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> 
<li><a name="TOC2" href="#SEC2">REGULAR EXPRESSIONS AS TREES</a> 
18 
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> 
<li><a name="TOC3" href="#SEC3">THE STANDARD MATCHING ALGORITHM</a> 
19 
<li><a name="TOC4" href="#SEC4">THE DFA MATCHING ALGORITHM</a> 
<li><a name="TOC4" href="#SEC4">THE ALTERNATIVE MATCHING ALGORITHM</a> 
20 
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE DFA ALGORITHM</a> 
<li><a name="TOC5" href="#SEC5">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 
21 
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE DFA ALGORITHM</a> 
<li><a name="TOC6" href="#SEC6">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a> 
22 

<li><a name="TOC7" href="#SEC7">AUTHOR</a> 
23 

<li><a name="TOC8" href="#SEC8">REVISION</a> 
24 
</ul> 
</ul> 
25 
<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br> 
<br><a name="SEC1" href="#TOC1">PCRE MATCHING ALGORITHMS</a><br> 
26 
<P> 
<P> 
48 
<something> <something else> <something further> 
<something> <something else> <something further> 
49 
</pre> 
</pre> 
50 
there are three possible answers. The standard algorithm finds only one of 
there are three possible answers. The standard algorithm finds only one of 
51 
them, whereas the DFA algorithm finds all three. 
them, whereas the alternative algorithm finds all three. 
52 
</P> 
</P> 
53 
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br> 
<br><a name="SEC2" href="#TOC1">REGULAR EXPRESSIONS AS TREES</a><br> 
54 
<P> 
<P> 
61 
</P> 
</P> 
62 
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br> 
<br><a name="SEC3" href="#TOC1">THE STANDARD MATCHING ALGORITHM</a><br> 
63 
<P> 
<P> 
64 
In the terminology of Jeffrey Friedl's book \fIMastering Regular 
In the terminology of Jeffrey Friedl's book "Mastering Regular 
65 
Expressions\fP, the standard algorithm is an "NFA algorithm". It conducts a 
Expressions", the standard algorithm is an "NFA algorithm". It conducts a 
66 
depth-first search of the pattern tree. That is, it proceeds along a single 
depth-first search of the pattern tree. That is, it proceeds along a single 
67 
path through the tree, checking that the subject matches what is required. When 
path through the tree, checking that the subject matches what is required. When 
68 
there is a mismatch, the algorithm tries any alternatives at the current point, 
there is a mismatch, the algorithm tries any alternatives at the current point, 
85 
matched by portions of the pattern in parentheses. This provides support for 
matched by portions of the pattern in parentheses. This provides support for 
86 
capturing parentheses and back references. 
capturing parentheses and back references. 
87 
</P> 
</P> 
88 
<br><a name="SEC4" href="#TOC1">THE DFA MATCHING ALGORITHM</a><br> 
<br><a name="SEC4" href="#TOC1">THE ALTERNATIVE MATCHING ALGORITHM</a><br> 
89 
<P> 
<P> 
90 
DFA stands for "deterministic finite automaton", but you do not need to 
This algorithm conducts a breadth-first search of the tree. Starting from the 
91 
understand the origins of that name. This algorithm conducts a breadth-first 
first matching point in the subject, it scans the subject string from left to 
92 
search of the tree. Starting from the first matching point in the subject, it 
right, once, character by character, and as it does this, it remembers all the 
93 
scans the subject string from left to right, once, character by character, and 
paths through the tree that represent valid matches. In Friedl's terminology, 
94 
as it does this, it remembers all the paths through the tree that represent 
this is a kind of "DFA algorithm", though it is not implemented as a 
95 
valid matches. 
traditional finite state machine (it keeps multiple states active 
96 

simultaneously). 
97 

</P> 
98 

<P> 
99 

Although the general principle of this matching algorithm is that it scans the 
100 

subject string only once, without backtracking, there is one exception: when a 
101 

lookaround assertion is encountered, the characters following or preceding the 
102 

current point have to be independently inspected. 
103 
</P> 
</P> 
104 
<P> 
<P> 
105 
The scan continues until either the end of the subject is reached, or there are 
The scan continues until either the end of the subject is reached, or there are 
106 
no more unterminated paths. At this point, terminated paths represent the 
no more unterminated paths. At this point, terminated paths represent the 
107 
different matching possibilities (if there are none, the match has failed). 
different matching possibilities (if there are none, the match has failed). 
108 
Thus, if there is more than one possible match, this algorithm finds all of 
Thus, if there is more than one possible match, this algorithm finds all of 
109 
them, and in particular, it finds the longest. In PCRE, there is an option to 
them, and in particular, it finds the longest. The matches are returned in 
110 
stop the algorithm after the first match (which is necessarily the shortest) 
decreasing order of length. There is an option to stop the algorithm after the 
111 
has been found. 
first match (which is necessarily the shortest) is found. 
112 
</P> 
</P> 
113 
<P> 
<P> 
114 
Note that all the matches that are found start at the same point in the 
Note that all the matches that are found start at the same point in the 
115 
subject. If the pattern 
subject. If the pattern 
116 
<pre> 
<pre> 
117 
cat(er(pillar)?) 
cat(er(pillar)?)? 
118 
</pre> 
</pre> 
119 
is matched against the string "the caterpillar catchment", the result will be 
is matched against the string "the caterpillar catchment", the result will be 
120 
the three strings "cat", "cater", and "caterpillar" that start at the fourth 
the three strings "caterpillar", "cater", and "cat" that start at the fifth 
121 
character of the subject. The algorithm does not automatically move on to find 
character of the subject. The algorithm does not automatically move on to find 
122 
matches that start at later positions. 
matches that start at later positions. 
123 
</P> 
</P> 
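<P>
The single left-to-right scan described above can be sketched in Python. This
is only an illustration, not PCRE's implementation: the function name
<i>scan_once</i> and its parameters are hypothetical, and it handles only a list
of literal alternatives (standing in here for the caterpillar pattern) rather
than a compiled pattern tree.
</P>

```python
def scan_once(alternatives, subject, start):
    """Breadth-first scan from one starting point.

    Assumption: the pattern is given as a list of non-empty literal
    alternatives. Every live path is advanced by one subject character
    per step, and the subject is never re-read.
    """
    # Each live path is (alternative index, characters matched so far).
    live = [(i, 0) for i in range(len(alternatives))]
    matches = []
    pos = start
    while live:
        # Paths that have consumed their whole alternative terminate here.
        unterminated = []
        for i, n in live:
            if n == len(alternatives[i]):
                matches.append(subject[start:pos])
            else:
                unterminated.append((i, n))
        if pos == len(subject):
            break  # end of subject: remaining paths cannot complete
        ch = subject[pos]
        # Keep only the paths whose next character matches the subject.
        live = [(i, n + 1) for i, n in unterminated if alternatives[i][n] == ch]
        pos += 1
    # Shortest matches are found first; report them longest first.
    return matches[::-1]


result = scan_once(["cat", "cater", "caterpillar"],
                   "the caterpillar catchment", 4)
print(result)
```

<P>
The start offset 4 is zero-based, so it refers to the fifth character of the
subject. The sketch does not move on to the later "cat" in "catchment", which
mirrors the behaviour described above.
</P>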
124 
<P> 
<P> 
125 
There are a number of features of PCRE regular expressions that are not 
There are a number of features of PCRE regular expressions that are not 
126 
supported by the DFA matching algorithm. They are as follows: 
supported by the alternative matching algorithm. They are as follows: 
127 
</P> 
</P> 
128 
<P> 
<P> 
129 
1. Because the algorithm finds all possible matches, the greedy or ungreedy 
1. Because the algorithm finds all possible matches, the greedy or ungreedy 
130 
nature of repetition quantifiers is not relevant. Greedy and ungreedy 
nature of repetition quantifiers is not relevant. Greedy and ungreedy 
131 
quantifiers are treated in exactly the same way. 
quantifiers are treated in exactly the same way. However, possessive 
132 

quantifiers can make a difference when what follows could also match what is 
133 

quantified, for example in a pattern like this: 
134 

<pre> 
135 

^a++\w! 
136 

</pre> 
137 

This pattern matches "aaab!" but not "aaa!", which would be matched by a 
138 

non-possessive quantifier. Similarly, if an atomic group is present, it is 
139 

matched as if it were a standalone pattern at the current point, and the 
140 

longest match is then "locked in" for the rest of the overall pattern. 
141 
</P> 
</P> 
142 
<P> 
<P> 
143 
2. When dealing with multiple paths through the tree simultaneously, it is not 
2. When dealing with multiple paths through the tree simultaneously, it is not 
151 
</P> 
</P> 
152 
<P> 
<P> 
153 
4. For the same reason, conditional expressions that use a backreference as the 
4. For the same reason, conditional expressions that use a backreference as the 
154 
condition are not supported. 
condition or test for a specific group recursion are not supported. 
155 
</P> 
</P> 
156 
<P> 
<P> 
157 
5. Callouts are supported, but the value of the <i>capture_top</i> field is 
5. Because many paths through the tree may be active, the \K escape sequence, 
158 

which resets the start of the match when encountered (but may be on some paths 
159 

and not on others), is not supported. It causes an error if encountered. 
160 

</P> 
161 

<P> 
162 

6. Callouts are supported, but the value of the <i>capture_top</i> field is 
163 
always 1, and the value of the <i>capture_last</i> field is always -1. 
always 1, and the value of the <i>capture_last</i> field is always -1. 
164 
</P> 
</P> 
165 
<P> 
<P> 
166 
6. 
7. The \C escape sequence, which (in the standard algorithm) matches a single 
167 
The \C escape sequence, which (in the standard algorithm) matches a single 
byte, even in UTF-8 mode, is not supported because the alternative algorithm 
168 
byte, even in UTF-8 mode, is not supported because the DFA algorithm moves 
moves through the subject string one character at a time, for all active paths 

through the subject string one character at a time, for all active paths 

169 
through the tree. 
through the tree. 
170 
</P> 
</P> 

<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE DFA ALGORITHM</a><br> 

171 
<P> 
<P> 
172 
Using the DFA matching algorithm provides the following advantages: 
8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE) are not 
173 

supported. (*FAIL) is supported, and behaves like a failing negative assertion. 
174 

</P> 
175 

<br><a name="SEC5" href="#TOC1">ADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 
176 

<P> 
177 

Using the alternative matching algorithm provides the following advantages: 
178 
</P> 
</P> 
179 
<P> 
<P> 
180 
1. All possible matches (at a single point in the subject) are automatically 
1. All possible matches (at a single point in the subject) are automatically 
183 
callouts. 
callouts. 
184 
</P> 
</P> 
185 
<P> 
<P> 
186 
2. There is much better support for partial matching. The restrictions on the 
2. Because the alternative algorithm scans the subject string just once, and 
187 
content of the pattern that apply when using the standard algorithm for partial 
never needs to backtrack, it is possible to pass very long subject strings to 
188 
matching do not apply to the DFA algorithm. For non-anchored patterns, the 
the matching function in several pieces, checking for partial matching each 
189 
starting position of a partial match is available. 
time. Although it is possible to do multi-segment matching using the standard 
190 

algorithm (<b>pcre_exec()</b>), by retaining partially matched substrings, it is 
191 

more complicated. The 
192 

<a href="pcrepartial.html"><b>pcrepartial</b></a> 
193 

documentation gives details of partial matching and discusses multi-segment 
194 

matching. 
195 
</P> 
</P> 
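<P>
Because no backtracking is needed, the set of live paths is the only thing that
has to survive between pieces of the subject. The following Python sketch
illustrates that idea. It does not use the PCRE API: the class
<i>SegmentScanner</i> and its <i>feed</i> method are hypothetical names, the
pattern is reduced to a list of non-empty literal alternatives, and the match is
assumed to be anchored at the start of the first piece.
</P>

```python
class SegmentScanner:
    """Carries live paths across subject segments (illustrative only)."""

    def __init__(self, alternatives):
        # Assumption: alternatives are non-empty literal strings.
        self.alternatives = alternatives
        # Each live path is (alternative index, characters matched so far).
        self.live = [(i, 0) for i in range(len(alternatives))]
        self.consumed = ""   # subject text scanned while paths were live
        self.matches = []    # complete matches, shortest first

    def feed(self, segment):
        """Scan one segment; return True if a partial match is still open."""
        for ch in segment:
            if not self.live:
                break  # every path has terminated or failed
            # Advance each path whose next character matches.
            self.live = [(i, n + 1) for i, n in self.live
                         if self.alternatives[i][n] == ch]
            self.consumed += ch
            # Record paths that just completed, then drop them.
            for i, n in self.live:
                if n == len(self.alternatives[i]):
                    self.matches.append(self.consumed)
            self.live = [(i, n) for i, n in self.live
                         if n < len(self.alternatives[i])]
        return bool(self.live)


scanner = SegmentScanner(["cat", "caterpillar"])
more_needed = scanner.feed("cate")        # partial match still open
finished = not scanner.feed("rpillar!")   # all paths now closed
print(scanner.matches)
```

<P>
A return value of True from <i>feed</i> plays the role of a partial-match
result: more input could still extend a live path, so the caller passes the
next piece.
</P>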
196 

<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE ALTERNATIVE ALGORITHM</a><br> 
197 
<P> 
<P> 
198 
3. Because the DFA algorithm scans the subject string just once, and never 
The alternative algorithm suffers from a number of disadvantages: 

needs to backtrack, it is possible to pass very long subject strings to the 


matching function in several pieces, checking for partial matching each time. 


</P> 


<br><a name="SEC6" href="#TOC1">DISADVANTAGES OF THE DFA ALGORITHM</a><br> 


<P> 


The DFA algorithm suffers from a number of disadvantages: 

199 
</P> 
</P> 
200 
<P> 
<P> 
201 
1. It is substantially slower than the standard algorithm. This is partly 
1. It is substantially slower than the standard algorithm. This is partly 
206 
2. Capturing parentheses and back references are not supported. 
2. Capturing parentheses and back references are not supported. 
207 
</P> 
</P> 
208 
<P> 
<P> 
209 
3. The "atomic group" feature of PCRE regular expressions is supported, but 
3. Although atomic groups are supported, their use does not provide the 
210 
does not provide the advantage that it does for the standard algorithm. 
performance advantage that it does for the standard algorithm. 
211 

</P> 
212 

<br><a name="SEC7" href="#TOC1">AUTHOR</a><br> 
213 

<P> 
214 

Philip Hazel 
215 

<br> 
216 

University Computing Service 
217 

<br> 
218 

Cambridge CB2 3QH, England. 
219 

<br> 
220 
</P> 
</P> 
221 

<br><a name="SEC8" href="#TOC1">REVISION</a><br> 
222 
<P> 
<P> 
223 
Last updated: 06 June 2006 
Last updated: 17 November 2010 
224 

<br> 
225 

Copyright © 1997-2010 University of Cambridge. 
226 
<br> 
<br> 

Copyright © 1997-2006 University of Cambridge. 

227 
<p> 
<p> 
228 
Return to the <a href="index.html">PCRE index page</a>. 
Return to the <a href="index.html">PCRE index page</a>. 
229 
</p> 
</p> 