23 |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
24 |
terminology. |
terminology. |
25 |
|
|
26 |
For this set of functions that forms PCRE, I tried at first to invent an |
For the set of functions that forms PCRE (which are unrelated to those |
27 |
algorithm that used an amount of store bounded by a multiple of the number of |
mentioned above), I tried at first to invent an algorithm that used an amount |
28 |
characters in the pattern, to save on compiling time. However, because of the |
of store bounded by a multiple of the number of characters in the pattern, to |
29 |
greater complexity in Perl regular expressions, I couldn't do this. In any |
save on compiling time. However, because of the greater complexity in Perl |
30 |
case, a first pass through the pattern is needed, in order to find internal |
regular expressions, I couldn't do this. In any case, a first pass through the |
31 |
flag settings like (?i) at top level. So it works by running a very degenerate |
pattern is needed, in order to find internal flag settings like (?i) at top |
32 |
first pass to calculate a maximum store size, and then a second pass to do the |
level. So PCRE works by running a very degenerate first pass to calculate a |
33 |
real compile - which may use a bit less than the predicted amount of store. The |
maximum store size, and then a second pass to do the real compile - which may |
34 |
idea is that this is going to turn out faster because the first pass is |
use a bit less than the predicted amount of store. The idea is that this is |
35 |
degenerate and the second can just store stuff straight into the vector. It |
going to turn out faster because the first pass is degenerate and the second |
36 |
does make the compiling functions bigger, of course, but they have got quite |
pass can just store stuff straight into the vector. It does make the compiling |
37 |
big anyway to handle all the Perl stuff. |
functions bigger, of course, but they have got quite big anyway to handle all |
38 |
|
the Perl stuff. |
39 |
|
|
40 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
41 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
62 |
OP_EODN match end of data or \n at end: \Z |
OP_EODN match end of data or \n at end: \Z |
63 |
OP_EOD match end of data: \z |
OP_EOD match end of data: \z |
64 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
65 |
|
OP_RECURSE match the pattern recursively |
66 |
|
|
67 |
|
|
68 |
Repeating single characters |
Repeating single characters |
127 |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
128 |
used for a repeated positive single-character class. |
used for a repeated positive single-character class. |
129 |
|
|
130 |
OP_CLASS is followed by a 32-byte bit map containing a 1 |
OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every |
131 |
bit for every character that is acceptable. The bits are counted from the least |
character that is acceptable. The bits are counted from the least significant |
132 |
significant end of each byte. |
end of each byte. |
133 |
|
|
134 |
|
|
135 |
Back references |
Back references |
161 |
Brackets and alternation |
Brackets and alternation |
162 |
------------------------ |
------------------------ |
163 |
|
|
164 |
A pair of non-identifying (round) brackets is wrapped round each expression at |
A pair of non-capturing (round) brackets is wrapped round each expression at |
165 |
compile time, so alternation always happens in the context of brackets. |
compile time, so alternation always happens in the context of brackets. |
166 |
Non-identifying brackets use the opcode OP_BRA, while identifying brackets use |
Non-capturing brackets use the opcode OP_BRA, while capturing brackets use |
167 |
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English |
OP_BRA+1, OP_BRA+2, etc. [Note for North Americans: "bracket" to some English |
168 |
speakers, including myself, can be round, square, or curly. Hence this usage.] |
speakers, including myself, can be round, square, curly, or pointy. Hence this |
169 |
|
usage.] |
170 |
|
|
171 |
A bracket opcode is followed by two bytes which give the offset to the next |
A bracket opcode is followed by two bytes which give the offset to the next |
172 |
alternative OP_ALT or, if there aren't any branches, to the matching KET |
alternative OP_ALT or, if there aren't any branches, to the matching KET |
239 |
|
|
240 |
|
|
241 |
Philip Hazel |
Philip Hazel |
242 |
January 1999 |
February 2000 |