48 |
|
|
49 |
OP_END end of pattern |
OP_END end of pattern |
50 |
OP_ANY match any character |
OP_ANY match any character |
51 |
|
OP_ANYBYTE match any single byte, even in UTF-8 mode |
52 |
OP_SOD match start of data: \A |
OP_SOD match start of data: \A |
53 |
|
OP_SOM start of match (subject + offset): \G |
54 |
OP_CIRC ^ (start of data, or after \n in multiline) |
OP_CIRC ^ (start of data, or after \n in multiline) |
55 |
OP_NOT_WORD_BOUNDARY \B |
OP_NOT_WORD_BOUNDARY \B |
56 |
OP_WORD_BOUNDARY \b |
OP_WORD_BOUNDARY \b |
63 |
OP_EODN match end of data or \n at end: \Z |
OP_EODN match end of data or \n at end: \Z |
64 |
OP_EOD match end of data: \z |
OP_EOD match end of data: \z |
65 |
OP_DOLL $ (end of data, or before \n in multiline) |
OP_DOLL $ (end of data, or before \n in multiline) |
|
OP_RECURSE match the pattern recursively |
|
66 |
|
|
67 |
|
|
68 |
Repeating single characters |
Repeating single characters |
120 |
Character classes |
Character classes |
121 |
----------------- |
----------------- |
122 |
|
|
123 |
When characters less than 256 are involved, OP_CLASS is used for a character |
If there is only one character, OP_CHARS is used for a positive class, |
|
class. If there is only one character, OP_CHARS is used for a positive class, |
|
124 |
and OP_NOT for a negative one (that is, for something like [^a]). However, in |
and OP_NOT for a negative one (that is, for something like [^a]). However, in |
125 |
UTF-8 mode, this applies only to characters with values < 128, because OP_NOT |
UTF-8 mode, this applies only to characters with values < 128, because OP_NOT |
126 |
is confined to single bytes. |
is confined to single bytes. |
129 |
negated, single-character class. The normal ones (OP_STAR etc.) are used for a |
negated, single-character class. The normal ones (OP_STAR etc.) are used for a |
130 |
repeated positive single-character class. |
repeated positive single-character class. |
131 |
|
|
132 |
OP_CLASS is followed by a 32-byte bit map containing a 1 bit for every |
When there's more than one character in a class and all the characters are less |
133 |
character that is acceptable. The bits are counted from the least significant |
than 256, OP_CLASS is used for a positive class, and OP_NCLASS for a negative |
134 |
end of each byte. |
one. In either case, the opcode is followed by a 32-byte bit map containing a 1 |
135 |
|
bit for every character that is acceptable. The bits are counted from the least |
136 |
|
significant end of each byte. |
137 |
|
|
138 |
|
The reason for having both OP_CLASS and OP_NCLASS is so that, in UTF-8 mode, |
139 |
|
subject characters with values greater than 255 can be handled correctly. For
140 |
|
OP_CLASS they don't match, whereas for OP_NCLASS they do. |
141 |
|
|
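As a rough illustration of the bit map layout described above, the following
C sketch sets and tests bits in a 32-byte map, counting bits from the least
significant end of each byte. The helper names class_map_set and
class_map_match are hypothetical and are not taken from the PCRE source; the
negated_opcode flag stands in for the choice between OP_CLASS and OP_NCLASS.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: mark character c (0..255) as acceptable.
   Bit 0 of each byte is the least significant bit. */
static void class_map_set(uint8_t map[32], unsigned c)
{
    map[c / 8] |= (uint8_t)(1u << (c % 8));
}

/* Hypothetical helper: does the class accept character c?
   negated_opcode is false for OP_CLASS and true for OP_NCLASS.
   In both cases the map holds a 1 bit for every acceptable character,
   so the flag matters only for characters that lie outside the map. */
static bool class_map_match(const uint8_t map[32], bool negated_opcode,
                            unsigned long c)
{
    if (c > 255)
        return negated_opcode;   /* OP_CLASS: no match; OP_NCLASS: match */
    return (map[c / 8] & (1u << (c % 8))) != 0;
}
```

With a map in which only 'a' is set, class_map_match accepts 'a' and rejects
'b' under either opcode, while a character such as 0x100 is rejected under
OP_CLASS and accepted under OP_NCLASS.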
142 |
For classes containing characters with values > 255, OP_XCLASS is used. It |
For classes containing characters with values > 255, OP_XCLASS is used. It |
143 |
optionally uses a bit map (if any characters lie within it), followed by a list |
optionally uses a bit map (if any characters lie within it), followed by a list |
249 |
conditional subpattern always starts with one of the assertions. |
conditional subpattern always starts with one of the assertions. |
250 |
|
|
251 |
|
|
252 |
|
Recursion |
253 |
|
--------- |
254 |
|
|
255 |
|
Recursion either matches the current regex, or some subexpression. The opcode |
256 |
|
OP_RECURSE is followed by a value which is the offset to the starting bracket
257 |
|
from the start of the whole pattern. |
258 |
|
|
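The following C sketch shows one way the recursion target might be located
from the value that follows OP_RECURSE. The text above does not state how wide
that value is or how it is stored; the two-byte, most-significant-byte-first
encoding used here is an assumption made purely for illustration, and the
helper names recurse_offset and recurse_target are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: recurse_op points at the OP_RECURSE opcode byte.
   ASSUMPTION: the offset occupies the next two bytes, high byte first. */
static size_t recurse_offset(const uint8_t *recurse_op)
{
    return ((size_t)recurse_op[1] << 8) | (size_t)recurse_op[2];
}

/* Hypothetical helper: the offset is measured from the start of the whole
   compiled pattern, so the target is simply code_start plus the offset. */
static const uint8_t *recurse_target(const uint8_t *code_start,
                                     const uint8_t *recurse_op)
{
    return code_start + recurse_offset(recurse_op);
}
```

Under this reading, an offset of zero would point back at the start of the
whole pattern, and any other value at the opening bracket of a subexpression.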
259 |
|
|
260 |
|
Callout |
261 |
|
------- |
262 |
|
|
263 |
|
OP_CALLOUT is followed by one byte of data that holds a callout number in the |
264 |
|
range 0 to 255. |
265 |
|
|
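Because the callout item has a fixed layout (the opcode, then one data byte),
reading it is straightforward. A minimal C sketch, using the hypothetical
helper names callout_number and skip_callout, might look like this:

```c
#include <stdint.h>

/* Hypothetical helper: code points at the OP_CALLOUT opcode byte.
   The single data byte that follows holds the callout number, 0..255. */
static unsigned callout_number(const uint8_t *code)
{
    return code[1];
}

/* Hypothetical helper: step past the opcode and its one byte of data. */
static const uint8_t *skip_callout(const uint8_t *code)
{
    return code + 2;
}
```

Storing the number in a single unsigned byte is what limits callout numbers
to the range 0 to 255.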
266 |
|
|
267 |
Changing options |
Changing options |
268 |
---------------- |
---------------- |
269 |
|
|
278 |
data. |
data. |
279 |
|
|
280 |
Philip Hazel |
Philip Hazel |
281 |
August 2002 |
August 2003 |