23 |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
24 |
terminology. |
terminology. |
25 |
|
|
26 |
For this set of functions, I tried at first to invent an algorithm that used an |
For this set of functions that forms PCRE, I tried at first to invent an |
27 |
amount of store bounded by a multiple of the number of characters in the |
algorithm that used an amount of store bounded by a multiple of the number of |
28 |
pattern, to save on compiling time. However, because of the greater complexity |
characters in the pattern, to save on compiling time. However, because of the |
29 |
in Perl regular expressions, I couldn't do this. In any case, a first pass |
greater complexity in Perl regular expressions, I couldn't do this. In any |
30 |
through the pattern is needed, in order to find internal flag settings like |
case, a first pass through the pattern is needed, in order to find internal |
31 |
(?i) at top level. So it works by running a very degenerate first pass to |
flag settings like (?i) at top level. So it works by running a very degenerate |
32 |
calculate a maximum store size, and then a second pass to do the real compile - |
first pass to calculate a maximum store size, and then a second pass to do the |
33 |
which may use a bit less than the predicted amount of store. The idea is that |
real compile - which may use a bit less than the predicted amount of store. The |
34 |
this is going to turn out faster because the first pass is degenerate and the |
idea is that this is going to turn out faster because the first pass is |
35 |
second can just store stuff straight into the vector. It does make the |
degenerate and the second can just store stuff straight into the vector. It |
36 |
compiling functions bigger, of course, but they have got quite big anyway to |
does make the compiling functions bigger, of course, but they have got quite |
37 |
handle all the Perl stuff. |
big anyway to handle all the Perl stuff. |
38 |
|
|
39 |
The compiled form of a pattern is a vector of bytes, containing items of |
The compiled form of a pattern is a vector of bytes, containing items of |
40 |
variable length. The first byte in an item is an opcode, and the length of the |
variable length. The first byte in an item is an opcode, and the length of the |
118 |
Character classes |
Character classes |
119 |
----------------- |
----------------- |
120 |
|
|
121 |
OP_CLASS is used for a character class, and OP_NEGCLASS for a negated character |
OP_CLASS is used for a character class, provided there are at least two |
122 |
class, provided there are at least two characters in the class. If there is |
characters in the class. If there is only one character, OP_CHARS is used for a |
123 |
only one character, OP_CHARS is used for a positive class, and OP_NOT for a |
positive class, and OP_NOT for a negative one (that is, for something like |
124 |
negative one. A set of repeating opcodes (OP_NOTSTAR etc.) are used for a |
[^a]). Another set of repeating opcodes (OP_NOTSTAR etc.) are used for a |
125 |
repeated, negated, single-character class. |
repeated, negated, single-character class. The normal ones (OP_STAR etc.) are |
126 |
|
used for a repeated positive single-character class. |
127 |
|
|
128 |
Both OP_CLASS and OP_NEGCLASS are followed by a 32-byte bit map containing a 1 |
OP_CLASS is followed by a 32-byte bit map containing a 1 |
129 |
bit for every character that is acceptable. The bits are counted from the least |
bit for every character that is acceptable. The bits are counted from the least |
130 |
significant end of each byte. The reason for having two opcodes is to cope with |
significant end of each byte. |
|
negated character classes when caseless matching is specified at run time but |
|
|
not at compile time. If it is specified at compile time, the bit map is built |
|
|
appropriately. This is the only time that a distinction is made between |
|
|
OP_CLASS and OP_NEGCLASS, when the bit map was built in a caseful manner but |
|
|
matching must be caseless. For OP_CLASS, a character matches if either of its |
|
|
cases is in the bit map, but for OP_NEGCLASS, both of them must be present. |
|
131 |
|
|
132 |
|
|
133 |
Back references |
Back references |
139 |
Repeating character classes and back references |
Repeating character classes and back references |
140 |
----------------------------------------------- |
----------------------------------------------- |
141 |
|
|
142 |
In both cases, the repeat information follows the base item. The matching code |
Single-character classes are handled specially (see above). This applies to |
143 |
looks at the following opcode to see if it is one of |
OP_CLASS and OP_REF. In both cases, the repeat information follows the base |
144 |
|
item. The matching code looks at the following opcode to see if it is one of |
145 |
|
|
146 |
OP_CRSTAR |
OP_CRSTAR |
147 |
OP_CRMINSTAR |
OP_CRMINSTAR |
197 |
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes |
the opcodes OP_ASSERT or OP_ASSERT_NOT. Backward assertions use the opcodes |
198 |
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion |
OP_ASSERTBACK and OP_ASSERTBACK_NOT, and the first opcode inside the assertion |
199 |
is OP_REVERSE, followed by a two byte count of the number of characters to move |
is OP_REVERSE, followed by a two byte count of the number of characters to move |
200 |
back. A separate count is present in each alternative of a lookbehind |
back the pointer in the subject string. A separate count is present in each |
201 |
assertion, allowing them to have different fixed lengths. |
alternative of a lookbehind assertion, allowing them to have different fixed |
202 |
|
lengths. |
203 |
|
|
204 |
|
|
205 |
Once-only subpatterns |
Once-only subpatterns |
234 |
|
|
235 |
|
|
236 |
Philip Hazel |
Philip Hazel |
237 |
September 1998 |
October 1998 |