1 |
Some comparisons of PCRE with the original Henry Spencer (1986) regular
|
2 |
expression functions were done on a SPARCstation IPC using gcc version 2.6.3
|
3 |
with -O optimization, to give some idea as to how the two libraries compare.
|
4 |
This is not a major statistical investigation.
|
5 |
|
6 |
|
7 |
Code size
|
8 |
---------
|
9 |
|
10 |
The code size of PCRE is a bit over twice the size of the Henry Spencer
|
11 |
functions (roughly 33K vs 14K bytes on a SPARCstation with gcc -O).
|
12 |
|
13 |
|
14 |
Store size for compiled expressions
|
15 |
-----------------------------------
|
16 |
|
17 |
For expressions that are compatible with both libraries, PCRE uses less store
|
18 |
for the examples tried, except in some cases that involve the use of character
|
19 |
classes. Except in the special case of a negated charcter class containing only
|
20 |
one character (e.g. [^a]), PCRE uses a 32-byte bit map for each character
|
21 |
class, in order to get the maximum matching speed. By contrast the Spencer code
|
22 |
uses a strchr() call.
|
23 |
|
24 |
The Spencer functions have an overhead of 92 bytes per expression, because
|
25 |
there is a table for up to 10 matched substrings held with every compiled
|
26 |
expression. In contrast, PCRE's overhead is just 9 bytes, since it requires the
|
27 |
caller to supply a vector to receive the offsets of the matched substrings. In
|
28 |
the table below, the size without the overhead is shown in brackets.
|
29 |
|
30 |
PCRE Spencer Pattern
|
31 |
---- ------- -------
|
32 |
|
33 |
18 (09) 109 (17) /^$/
|
34 |
25 (16) 120 (28) /^.*nter/
|
35 |
26 (17) 121 (29) /^12.34/
|
36 |
37 (28) 126 (34) /the quick brown fox/
|
37 |
50 (41) 114 (22) /^[]cde]/
|
38 |
50 (41) 114 (22) /^[^]cde]/
|
39 |
51 (42) 125 (33) /^[.^$|()*+?{,}]+/
|
40 |
52 (43) 126 (34) /^[0-9]+$/
|
41 |
56 (47) 153 (61) /^(abc)(abc)?zz/
|
42 |
57 (48) 133 (41) /^xxx[0-9]+$/
|
43 |
57 (48) 145 (53) /([0-9a-fA-F:]+)$/
|
44 |
62 (53) 171 (79) /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$$/
|
45 |
70 (61) 170 (78) /^(b+|a)(b+|a)?c/
|
46 |
74 (65) 173 (81) /^(ba|b*)(ba|b*)?bc/
|
47 |
99 (90) 235 (143) /^(a(b(c)))(d(e(f)))(h(i(j)))$/
|
48 |
119 (110) 157 (65) /^.+[0-9][0-9][0-9]$/
|
49 |
165 (156) 446 (354) /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
|
50 |
451 (442) 605 (513) /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
|
51 |
|
52 |
|
53 |
Compilation time
|
54 |
----------------
|
55 |
|
56 |
Timing was done using the clock() function to time 2000 compilations of each
|
57 |
expression and then dividing by twice the number of clocks per second, to get a
|
58 |
value in milliseconds. The variation observed over several runs was never more
|
59 |
than 0.01:
|
60 |
|
61 |
PCRE Spencer Pattern
|
62 |
---- ------- -------
|
63 |
|
64 |
0.04 0.07 /^$/
|
65 |
0.06 0.12 /^.*nter/
|
66 |
0.06 0.13 /^12.34/
|
67 |
0.06 0.09 /^[]cde]/
|
68 |
0.07 0.14 /^[0-9]+$/
|
69 |
0.07 0.10 /^[^]cde]/
|
70 |
0.08 0.17 /^xxx[0-9]+$/
|
71 |
0.08 0.14 /the quick brown fox/
|
72 |
0.09 0.14 /^[.^$|()*+?{,}]+/
|
73 |
0.10 0.33 /([0-9a-fA-F:]+)$/
|
74 |
0.12 0.26 /^.+[0-9][0-9][0-9]$/
|
75 |
0.12 0.42 /^(abc)(abc)?zz/
|
76 |
0.14 0.51 /^(b+|a)(b+|a)?c/
|
77 |
0.15 0.53 /^(ba|b*)(ba|b*)?bc/
|
78 |
0.19 0.51 /^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/
|
79 |
0.34 1.59 /^(a(b(c)))(d(e(f)))(h(i(j)))$/
|
80 |
0.47 1.32 /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
|
81 |
0.66 1.78 /^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
|
82 |
|
83 |
|
84 |
Execution time
|
85 |
--------------
|
86 |
|
87 |
Execution timing was done in a similar manner. Blank entries in the "pattern"
|
88 |
column below indicate the use of the same pattern as before.
|
89 |
|
90 |
PCRE Spencer Subject Pattern
|
91 |
---- ------- ------- -------
|
92 |
|
93 |
0.03 0.02 <null string> /^$/
|
94 |
0.04 0.04 enter /^.*nter/
|
95 |
0.04 0.04 uponter
|
96 |
0.03 0.03 12\r34 /^12.34/
|
97 |
0.03 0.03 0 /^[0-9]+$/
|
98 |
0.04 0.03 100
|
99 |
0.03 0.03 ]thing /^[]cde]/
|
100 |
0.03 0.03 ething
|
101 |
0.03 0.03 athing /^[^]cde]/
|
102 |
0.04 0.04 xxx0 /^xxx[0-9]+$/
|
103 |
0.04 0.04 xxx1234
|
104 |
0.04 0.07 .^\$(*+)|{?,?} /^[.^$|()*+?{,}]+/
|
105 |
0.03 0.03 the quick brown fox /the quick brown fox/
|
106 |
0.06 0.08 What do you know about the quick brown fox?
|
107 |
0.04 0.07 0abc /([0-9a-fA-F:]+)$/
|
108 |
0.04 0.07 abc
|
109 |
0.05 0.13 5f03:12C0::932e
|
110 |
0.06 0.07 x123 /^.+[0-9][0-9][0-9]$/
|
111 |
0.06 0.07 123456
|
112 |
0.06 0.09 abczz /^(abc)(abc)?zz/
|
113 |
0.06 0.12 abcabczz /^(abc)(abc)?zz/
|
114 |
/^([^!]+)!(.+)=apquxz\.ixr\.zzz\.ac\.uk$/
|
115 |
0.23 0.28 abc!pqr=apquxz.ixr.zzz.ac.uk
|
116 |
0.09 0.15 bc /^(b+|a)(b+|a)?c/
|
117 |
0.09 0.15 bbc
|
118 |
0.08 0.15 bbbc
|
119 |
0.09 0.15 bac
|
120 |
0.09 0.15 bbac
|
121 |
0.07 0.14 aac
|
122 |
0.09 0.15 abbbbbbbbbbbc
|
123 |
0.09 0.15 bbbbbbbbbbbac
|
124 |
0.09 0.18 babc /^(ba|b*)(ba|b*)?bc/
|
125 |
0.12 0.24 bbabc
|
126 |
0.07 0.15 bababc
|
127 |
0.06 0.10 a. /^[a-zA-Z0-9][a-zA-Z0-9\-]*(\.[a-zA-Z0-9][a-zA-z0-9\-]*)*\.$/
|
128 |
0.13 0.34 ab-c.pq-r.
|
129 |
0.24 0.58 sxk.zzz.ac.uk.
|
130 |
0.12 0.34 x-.y-.
|
131 |
0.20 0.38 abcdefhij /^(a(b(c)))(d(e(f)))(h(i(j)))$/
|
132 |
/^From +([^ ]+) +[a-zA-Z][a-zA-Z][a-zA-Z] +[a-zA-Z][a-zA-Z][a-zA-Z] +[0-9]?[0-9] +[0-9][0-9]:[0-9][0-9]/
|
133 |
0.18 0.30 From abcd Mon Sep 01 12:33:02 1997
|
134 |
|
135 |
In general, PCRE runs faster than the Spencer function, but remember, this
|
136 |
is just for one particular compiler on one set of hardware and operating
|
137 |
system. Until comprehensive tests have been run in other environments, the most
|
138 |
one can plausibly say is that it is probably no worse on average for the kinds
|
139 |
of expression tested here.
|
140 |
|
141 |
|
142 |
Speeding up matching
|
143 |
--------------------
|
144 |
|
145 |
A character class is much more efficient than a set of bracketed alternatives.
|
146 |
Matching /^[abc]{12}/ against "abcabcabcabc" took 0.03 ms, whereas
|
147 |
/^(a|b|c){12}/ took 0.33 ms. This is because brackets and alternatives involve
|
148 |
recursion.
|
149 |
|
150 |
|
151 |
Serious test
|
152 |
------------
|
153 |
|
154 |
One of the tests of PCRE is the monster regular expression from "Mastering
|
155 |
Regular Expressions" (O'Reilly's "hip owls" book, 1997, ISBN 1-56592-257-3)
|
156 |
which recognizes email addresses. There are two versions, unoptimized and
|
157 |
optimized. For interest, here are their timings, again on a SPARCstation IPC.
|
158 |
The compile times were 55 ms and 94 ms, and the compiled expressions
|
159 |
occupied 11010 and 15426 bytes of store, respectively. The following strings
|
160 |
were matched in the times shown (unoptimized first):
|
161 |
|
162 |
0.34 0.38 user@dom.ain
|
163 |
0.38 0.42 <user@dom.ain>
|
164 |
0.88 0.60 Alan Other <user@dom.ain>
|
165 |
1.87 0.82 "A. Other" <user.1234@dom.ain> (a comment)
|
166 |
1.77 1.19 A. Other <user.1234@dom.ain> (a comment)
|
167 |
2.21 0.42 "/s=user/ou=host/o=place/prmd=uu.yy/admd= /c=gb/"@x400-re.lay
|
168 |
|
169 |
The optimization of the expression clearly has a dramatic effect in some cases.
|
170 |
|
171 |
Philip Hazel <ph10@cus.cam.ac.uk>
|
172 |
October 1997
|