1 |
Technical Notes about PCRE |
Technical Notes about PCRE |
2 |
-------------------------- |
-------------------------- |
3 |
|
|
4 |
|
These are very rough technical notes that record potentially useful information |
5 |
|
about PCRE internals. |
6 |
|
|
7 |
Historical note 1 |
Historical note 1 |
8 |
----------------- |
----------------- |
9 |
|
|
24 |
Historical note 2 |
Historical note 2 |
25 |
----------------- |
----------------- |
26 |
|
|
27 |
By contrast, the code originally written by Henry Spencer and subsequently |
By contrast, the code originally written by Henry Spencer (which was |
28 |
heavily modified for Perl actually compiles the expression twice: once in a |
subsequently heavily modified for Perl) compiles the expression twice: once in |
29 |
dummy mode in order to find out how much store will be needed, and then for |
a dummy mode in order to find out how much store will be needed, and then for |
30 |
real. The execution function operates by backtracking and maximizing (or, |
real. (The Perl version probably doesn't do this any more; I'm talking about |
31 |
optionally, minimizing in Perl) the amount of the subject that matches |
the original library.) The execution function operates by backtracking and |
32 |
individual wild portions of the pattern. This is an "NFA algorithm" in Friedl's |
maximizing (or, optionally, minimizing in Perl) the amount of the subject that |
33 |
terminology. |
matches individual wild portions of the pattern. This is an "NFA algorithm" in |
34 |
|
Friedl's terminology. |
35 |
|
|
36 |
OK, here's the real stuff |
OK, here's the real stuff |
37 |
------------------------- |
------------------------- |
47 |
predicted amount of store. The idea is that this is going to turn out faster |
predicted amount of store. The idea is that this is going to turn out faster |
48 |
because the first pass is degenerate and the second pass can just store stuff |
because the first pass is degenerate and the second pass can just store stuff |
49 |
straight into the vector, which it knows is big enough. It does make the |
straight into the vector, which it knows is big enough. It does make the |
50 |
compiling functions bigger, of course, but they have got quite big anyway to |
compiling functions bigger, of course, but they have become quite big anyway to |
51 |
handle all the Perl stuff. |
handle all the Perl stuff. |
52 |
|
|
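As an illustration only, here is a minimal sketch of the measure-then-emit idea described above. It is not PCRE's compile code: the toy item format, the opcode value 1, the end marker 0, and the names toy_compile and toy_compile_two_pass are all assumptions made for this example. Using a single function in a length-only mode is also a simplification chosen here.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy "compiler" (hypothetical format): each pattern character becomes a
   two-byte item, a made-up opcode 1 followed by the character, and the
   whole thing ends with a made-up end marker 0. When out is NULL nothing
   is stored and only the required length is computed. */
static size_t toy_compile(const char *pattern, unsigned char *out)
{
  size_t len = 0;
  const char *p;

  assert(pattern != NULL);
  for (p = pattern; *p != '\0'; p++)
    {
    if (out != NULL)
      {
      out[len] = 1;                     /* hypothetical "literal" opcode */
      out[len + 1] = (unsigned char)*p;
      }
    len += 2;
    }
  if (out != NULL) out[len] = 0;        /* hypothetical end marker */
  return len + 1;
}

/* First pass: measure. Then get a block of exactly the predicted size.
   Second pass: store straight into the block with no bounds checks,
   because the block is known to be big enough. */
static unsigned char *toy_compile_two_pass(const char *pattern, size_t *length)
{
  size_t needed = toy_compile(pattern, NULL);
  unsigned char *code = malloc(needed);
  size_t used;

  if (code == NULL) return NULL;
  used = toy_compile(pattern, code);
  assert(used == needed);               /* both passes must agree */
  *length = used;
  return code;
}
```

A caller would invoke toy_compile_two_pass("ab", &n) and later free() the returned block. The cost of this arrangement is that every emitting step needs a "just count" variant, which is the sense in which the compiling functions get bigger.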
53 |
Traditional matching function |
Traditional matching function |
67 |
simultaneously for all possible matches that start at one point in the subject |
simultaneously for all possible matches that start at one point in the subject |
68 |
string. (Going back to my roots: see Historical Note 1 above.) This function |
string. (Going back to my roots: see Historical Note 1 above.) This function |
69 |
interprets the same compiled pattern data as pcre_exec(); however, not all the |
interprets the same compiled pattern data as pcre_exec(); however, not all the |
70 |
facilities are available, and those that are don't always work in quite the |
facilities are available, and those that are do not always work in quite the |
71 |
same way. See the user documentation for details. |
same way. See the user documentation for details. |
72 |
|
|
73 |
Format of compiled patterns |
Format of compiled patterns |
161 |
|
|
162 |
OP_PROP and OP_NOTPROP are used for positive and negative matches of a |
OP_PROP and OP_NOTPROP are used for positive and negative matches of a |
163 |
character by testing its Unicode property (the \p and \P escape sequences). |
character by testing its Unicode property (the \p and \P escape sequences). |
164 |
Each is followed by a single byte that encodes the desired property value. |
Each is followed by two bytes that encode the desired property as a type and a |
165 |
|
value. |
166 |
|
|
167 |
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by two |
Repeats of these items use the OP_TYPESTAR etc. set of opcodes, followed by |
168 |
bytes: OP_PROP or OP_NOTPROP and then the desired property value. |
three bytes: OP_PROP or OP_NOTPROP and then the desired property type and |
169 |
|
value. |
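To make the revised layout concrete, here is a rough sketch of how code might step over these items, assuming the two-byte (type, value) form given in the newer wording. The numeric opcode values below are placeholders invented for this example (the real ones live in PCRE's internal headers), and prop_item_length is a hypothetical helper, not a PCRE function.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholder opcode numbers for illustration only; the real values are
   defined in PCRE's internal headers and will differ. */
enum { OP_NOTPROP = 1, OP_PROP = 2, OP_TYPESTAR = 3 };

/* Hypothetical helper: decode a property item starting at code[0].
   Returns the number of bytes the item occupies (3 for a single
   OP_PROP/OP_NOTPROP, 4 for an OP_TYPESTAR repeat of one), or 0 if the
   item is not a property test. On success, *type and *value receive the
   two property bytes and *negated is 1 for OP_NOTPROP. */
static size_t prop_item_length(const unsigned char *code,
  unsigned *type, unsigned *value, int *negated)
{
  size_t offset = 0;

  assert(code != NULL && type != NULL && value != NULL && negated != NULL);

  /* A repeat opcode comes first, then OP_PROP or OP_NOTPROP */
  if (code[0] == OP_TYPESTAR) offset = 1;

  if (code[offset] != OP_PROP && code[offset] != OP_NOTPROP) return 0;

  *negated = (code[offset] == OP_NOTPROP);
  *type = code[offset + 1];             /* property type byte */
  *value = code[offset + 2];            /* property value byte */
  return offset + 3;
}
```

For example, the byte sequence OP_TYPESTAR, OP_NOTPROP, type, value is reported as four bytes long with *negated set to 1. Only OP_TYPESTAR is handled here; the other repeat opcodes in the same set would need their own cases.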
170 |
|
|
171 |
|
|
172 |
Matching literal characters |
Matching literal characters |
345 |
data. |
data. |
346 |
|
|
347 |
Philip Hazel |
Philip Hazel |
348 |
January 2006 |
June 2006 |