18 |
By contrast, the code originally written by Henry Spencer and subsequently |
By contrast, the code originally written by Henry Spencer and subsequently |
19 |
heavily modified for Perl actually compiles the expression twice: once in a |
heavily modified for Perl actually compiles the expression twice: once in a |
20 |
dummy mode in order to find out how much store will be needed, and then for |
dummy mode in order to find out how much store will be needed, and then for |
21 |
real. The execution function operates by backtracking and maximizing (or |
real. The execution function operates by backtracking and maximizing (or, |
22 |
minimizing in Perl) the amount of the subject that matches individual wild |
optionally, minimizing in Perl) the amount of the subject that matches |
23 |
portions of the pattern. This is a "NFA algorithm". |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
24 |
|
terminology. |
25 |
|
|
26 |
For this set of functions, I tried at first to invent an algorithm that used an |
For this set of functions, I tried at first to invent an algorithm that used an |
27 |
amount of store bounded by a multiple of the number of characters in the |
amount of store bounded by a multiple of the number of characters in the |
28 |
pattern, to save on compiling time. However, because of the greater complexity |
pattern, to save on compiling time. However, because of the greater complexity |
29 |
in Perl regular expressions, I couldn't do this. In any case, a first pass |
in Perl regular expressions, I couldn't do this. In any case, a first pass |
30 |
through the pattern is needed, in order to find internal flag settings like |
through the pattern is needed, in order to find internal flag settings like |
31 |
(?i). So it works by running a very degenerate first pass to calculate a |
(?i) at top level. So it works by running a very degenerate first pass to |
32 |
maximum store size, and then a second pass to do the real compile - which may |
calculate a maximum store size, and then a second pass to do the real compile - |
33 |
use a bit less than the predicted amount of store. The idea is that this is |
which may use a bit less than the predicted amount of store. The idea is that |
34 |
going to turn out faster because the first pass is degenerate and the second |
this is going to turn out faster because the first pass is degenerate and the |
35 |
can just store stuff straight into the vector. It does make the compiling |
second can just store stuff straight into the vector. It does make the |
36 |
functions bigger, of course, but they have got quite big anyway to handle all |
compiling functions bigger, of course, but they have got quite big anyway to |
37 |
the Perl stuff. |
handle all the Perl stuff. |
38 |
|
|
39 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
40 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
58 |
OP_WHITESPACE \s |
OP_WHITESPACE \s |
59 |
OP_NOT_WORDCHAR \W |
OP_NOT_WORDCHAR \W |
60 |
OP_WORDCHAR \w |
OP_WORDCHAR \w |
61 |
OP_CUT analogue of Prolog's "cut" |
OP_EODN match end of data or \n at end: \Z |
62 |
OP_EOD match end of data: \Z |
OP_EOD match end of data: \z |
63 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
64 |
|
|
65 |
|
|
197 |
Assertions |
Assertions |
198 |
---------- |
---------- |
199 |
|
|
200 |
Assertions are just like other subpatterns, but starting with one of the |
Forward assertions are just like other subpatterns, but starting with one of |
201 |
opcodes OP_ASSERT or OP_ASSERT_NOT. |
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes |
202 |
|
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion |
203 |
|
is OP_REVERSE, followed by a two byte count of the number of characters to move |
204 |
|
back. A separate count is present in each alternative of a lookbehind |
205 |
|
assertion, allowing them to have different fixed lengths. |
206 |
|
|
207 |
|
|
208 |
Once-only subpatterns |
Once-only subpatterns |
212 |
OP_ONCE. |
OP_ONCE. |
213 |
|
|
214 |
|
|
215 |
|
Conditional subpatterns |
216 |
|
----------------------- |
217 |
|
|
218 |
|
These are like other subpatterns, but they start with the opcode OP_COND. If |
219 |
|
the condition is a back reference, this is stored at the start of the |
220 |
|
subpattern using the opcode OP_CREF followed by one byte containing the |
221 |
|
reference number. Otherwise, a conditional subpattern will always start with |
222 |
|
one of the assertions. |
223 |
|
|
224 |
|
|
225 |
|
Changing options |
226 |
|
---------------- |
227 |
|
|
228 |
|
If any of the /i, /m, or /s options are changed within a parenthesized group, |
229 |
|
an OP_OPT opcode is compiled, followed by one byte containing the new settings |
230 |
|
of these flags. If there are several alternatives in a group, there is an |
231 |
|
occurrence of OP_OPT at the start of all those following the first options |
232 |
|
change, to set appropriate options for the start of the alternative. |
233 |
|
Immediately after the end of the group there is another such item to reset the |
234 |
|
flags to their previous values. Other changes of flag within the pattern can be |
235 |
|
handled entirely at compile time, and so do not cause anything to be put into |
236 |
|
the compiled data. |
237 |
|
|
238 |
|
|
239 |
Philip Hazel |
Philip Hazel |
240 |
December 1997 |
September 1998 |