MAINTENANCE README FOR PCRE
---------------------------

The files in the "maint" directory of the PCRE source contain data, scripts,
and programs that are used for the maintenance of PCRE, but which do not form
part of the PCRE distribution tarballs. This document describes these files and
also contains some notes for maintainers. Its contents are:

  Files in the maint directory
  Updating to a new Unicode release
  Preparing for a PCRE release
  Making a PCRE release
  Future ideas (wish list)


Files in the maint directory
----------------------------

Builducptable    A Perl script that creates the contents of the ucptable.h file
                 from two Unicode data files, which themselves are downloaded
                 from the Unicode web site. Run this script in the "maint"
                 directory.

ManyConfigTests  A shell script that runs "configure, make, test" a number of
                 times with different configuration settings.

Unicode.tables   The files in this directory, Scripts.txt and UnicodeData.txt,
                 were downloaded from the Unicode web site. They contain
                 information about Unicode characters and scripts.

ucptest.c        A short C program for testing the Unicode property functions
                 in pcre_ucp_searchfuncs.c, mainly useful after rebuilding the
                 Unicode property table. Compile and run this in the "maint"
                 directory.

ucptestdata      A directory containing two files, testinput1 and testoutput1,
                 to use in conjunction with the ucptest program.

utf8.c           A short, freestanding C program for converting a Unicode code
                 point into a sequence of bytes in the UTF-8 encoding, and vice
                 versa. If its argument is a hex number such as 0x1234, it
                 outputs a list of the equivalent UTF-8 bytes. If its argument
                 is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
                 treats them as a UTF-8 character and outputs the equivalent
                 code point in hex.


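As an illustration of the conversion that utf8.c is described as performing,
here is a minimal sketch in C. It is not the code from utf8.c: the function
names utf8_encode and utf8_decode are hypothetical, and the sketch assumes the
four-byte form of UTF-8 (code points up to 0x10FFFF) without rejecting
surrogates or overlong sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from utf8.c): encode code point cp into buf, which
   must have room for 4 bytes. Returns the number of bytes written, or 0 if cp
   is above 0x10FFFF. */
static int utf8_encode(unsigned long cp, unsigned char *buf)
{
  assert(buf != NULL);
  if (cp < 0x80UL) {
    buf[0] = (unsigned char)cp;
    return 1;
  }
  if (cp < 0x800UL) {
    buf[0] = (unsigned char)(0xC0 | (cp >> 6));
    buf[1] = (unsigned char)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000UL) {
    buf[0] = (unsigned char)(0xE0 | (cp >> 12));
    buf[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp <= 0x10FFFFUL) {
    buf[0] = (unsigned char)(0xF0 | (cp >> 18));
    buf[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    buf[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    buf[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}

/* Hypothetical helper: decode exactly one character held in the len bytes at
   s. Returns the code point, or -1 if the lead byte is not valid, the length
   does not match the lead byte, or a continuation byte is malformed. */
static long utf8_decode(const unsigned char *s, int len)
{
  long cp;
  int need, i;
  assert(s != NULL);
  if (len < 1) return -1;
  if (s[0] < 0x80)                { cp = s[0];        need = 0; }
  else if ((s[0] & 0xE0) == 0xC0) { cp = s[0] & 0x1F; need = 1; }
  else if ((s[0] & 0xF0) == 0xE0) { cp = s[0] & 0x0F; need = 2; }
  else if ((s[0] & 0xF8) == 0xF0) { cp = s[0] & 0x07; need = 3; }
  else return -1;
  if (len != need + 1) return -1;
  for (i = 1; i <= need; i++) {
    if ((s[i] & 0xC0) != 0x80) return -1;
    cp = (cp << 6) | (s[i] & 0x3F);
  }
  return cp;
}
```

With the values used in the description of utf8.c, utf8_encode(0x1234, buf)
stores the bytes e1 88 b4 and returns 3, and utf8_decode applied to those three
bytes returns 0x1234.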
Updating to a new Unicode release
---------------------------------

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site, and the Builducptable script can then be run to
generate a new version of ucptable.h. The ucptest program can be used to check
that the resulting table works properly, using the data files in ucptestdata to
check a number of test characters.


Preparing for a PCRE release
----------------------------

This section contains a checklist of things that I consult before building a
distribution for a new release.

. Ensure that the version number and version date are correct in configure.ac,
  ChangeLog, and NEWS.

. If new build options have been added, ensure that they are added to the CMake
  files as well as to the autoconf files.

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
  options. The maint/ManyConfigTests script now encapsulates this testing.

. Run perltest.pl on the test data for tests 1 and 4. The output should match
  the PCRE test output, apart from the version identification at the top. The
  other tests are not Perl-compatible (they use various special PCRE options).

. Test with valgrind by running "RunTest valgrind". There is also "RunGrepTest
  valgrind", though that takes quite a long time.

. It may also be useful to test with Electric Fence, though the fact that it
  grumbles for missing free() calls can be a nuisance. (A missing free() in
  pcretest is hardly a big problem.) To build with EF, use:

    LIBS='/usr/lib/libefence.a -lpthread' with ./configure.

  Then all normal runs use it to check for buffer overflow. Also run everything
  with:

    EF_PROTECT_BELOW=1 <whatever>

  because there have been problems with lookbehinds that looked too far.

. Test with the emulated memmove() function by undefining HAVE_MEMMOVE and
  HAVE_BCOPY in config.h. You may see a number of "pcre_memmove defined but not
  used" warnings for the modules in which there is no call to memmove(). These
  can be ignored.

. Documentation: check AUTHORS, COPYING, ChangeLog (check date), INSTALL,
  LICENCE, NEWS (check date), NON-UNIX-USE, and README. Many of these won't
  need changing, but over the long term things do change.

. Man pages: Check all man pages for \ not followed by e or f or " because
  that indicates a markup error.

. When the release is built, test it on a number of different operating
  systems if possible, and using different compilers as well. For example,
  on Solaris it is helpful to test using Sun's cc compiler as a change from
  gcc. Adding -xarch=v9 to the cc options does a 64-bit test, but it also
  needs -S 64 for pcretest to increase the stack size for test 2.


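The man page check in the list above can be mechanised. The following is a
minimal sketch in C that applies the rule exactly as it is stated there: a
backslash is suspect unless the next character is e, f, or a double quote. The
name count_suspect_backslashes is hypothetical; it is not a tool in the PCRE
tree. Because the rule is taken literally, a doubled backslash or a backslash
at the end of a line is reported too.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: count the backslashes in one line of man page source
   that are not followed by 'e', 'f', or '"'. */
static int count_suspect_backslashes(const char *line)
{
  int count = 0;
  const char *p;
  assert(line != NULL);
  for (p = line; *p != '\0'; p++) {
    if (*p != '\\') continue;
    if (p[1] == 'e' || p[1] == 'f' || p[1] == '"')
      p++;          /* Accepted escape: step over the following character */
    else
      count++;      /* Anything else, including end of line, is suspect */
  }
  return count;
}
```

A caller would read each man page line by line (for example with fgets()) and
report the file name and line number whenever the function returns non-zero.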
Making a PCRE release
---------------------

Run PrepareRelease and commit the files that it changes (by removing trailing
spaces). Then run "make distcheck" to create the tarballs and the zipball.
Double-check with "svn status", then create an SVN tagged copy:

    svn copy svn://vcs.exim.org/pcre/code/trunk \
             svn://vcs.exim.org/pcre/code/tags/pcre-7.x

Don't forget to update Freshmeat when the new release is out, and to tell
webmaster@pcre.org and the mailing list.


Future ideas (wish list)
------------------------

This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for years;
others are relatively new.

. Optimization

  There are always ideas for new optimizations so as to speed up pattern
  matching. Most of them try to save work by recognizing a non-match without
  having to scan all the possibilities. These are some that I've recorded:

  * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
    slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
    OTOH, this is pathological - the user could easily fix it.

  * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
    to have little effect, and maybe makes things worse.

  * "Ends with literal string" - note that a single character doesn't gain much
    over the existing "required byte" (reqbyte) feature that just saves one
    byte.

  * These probably need to go in study():

    o Remember an initial string rather than just 1 char?

    o A required byte from alternatives - not just the last char, but an
      earlier one if common to all alternatives.

    o Minimum length of subject needed.

    o Friedl contains other ideas.

. If Perl gets to a consistent state over the settings of capturing
  subpatterns inside repeats, see if we can match it. One example of the
  difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
  leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
  because I think it needs much more state to be remembered.

. Perl 6 will be a revolution. Is it a revolution too far for PCRE?

. Unicode

  * Note that in Perl, \s matches \pZ and similarly for \d, \w and the POSIX
    character classes. For the moment, I've chosen not to support this for
    backward compatibility, for speed, and because it would be messy to
    implement.

  * A different approach to Unicode might be to use a typedef to do everything
    in unsigned shorts instead of unsigned chars. Actually, we'd have to have a
    new typedef to distinguish data from bits of compiled pattern that are in
    bytes, I think. There would need to be conversion functions in and out. I
    don't think this is particularly trivial - and anyway, Unicode now has
    characters that need more than 16 bits, so is this at all sensible?

  * There has been a request for direct support of 16-bit characters and
    UTF-16. However, since Unicode is moving beyond purely 16-bit characters,
    is this worth it at all? One possible way of handling 16-bit characters
    would be to "load" them in the same way that UTF-8 characters are loaded.

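To make the last Unicode idea concrete, here is a minimal sketch in C of what
"loading" one character from UTF-16 data might look like. Nothing like this
exists in PCRE; the name utf16_load and its interface are assumptions made for
illustration, and the code units are assumed to be held in unsigned shorts. A
character outside the 16-bit range arrives as a surrogate pair, which is
combined into a single code point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: read one character from a buffer of
   UTF-16 code units. On success, stores the code point in *cp and returns the
   number of code units consumed (1 or 2). Returns 0 if the buffer is empty or
   starts with an unpaired surrogate. */
static int utf16_load(const unsigned short *s, size_t avail, unsigned long *cp)
{
  assert(s != NULL && cp != NULL);
  if (avail == 0) return 0;

  /* Anything outside 0xD800-0xDFFF is a complete character on its own. */
  if (s[0] < 0xD800 || s[0] > 0xDFFF) {
    *cp = s[0];
    return 1;
  }

  /* A high surrogate must be followed by a low surrogate. */
  if (s[0] <= 0xDBFF && avail >= 2 && s[1] >= 0xDC00 && s[1] <= 0xDFFF) {
    *cp = 0x10000UL +
          (((unsigned long)(s[0] - 0xD800) << 10) |
            (unsigned long)(s[1] - 0xDC00));
    return 2;
  }

  return 0;
}
```

For example, the code units 0xD834 0xDD1E load as the single code point
0x1D11E, consuming two units.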
. Allow errorptr and erroroffset to be NULL. I don't like this idea.

. Line endings:

  * Option to use NUL as a line terminator in subject strings. This could now
    be done relatively easily since the extension to support LF, CR, and CRLF.
    If this is done, a suitable option for pcregrep is also required.

. Option to provide the pattern with a length instead of with a NUL terminator.
  This probably affects quite a few places in the code.

. Catch SIGSEGV for stack overflows?

. A feature to suspend a match via a callout was once requested.

. Option to convert results into character offsets and character lengths.

. Option for pcregrep to scan only the start of a file. I am not keen - this is
  the job of "head".

. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
  preceded by a blank line, instead of adding it to every matched line, and (b)
  support --outputfile=name.

. Consider making UTF-8 and UCP the default for PCRE n.0 for some n > 7.

. Add a user pointer to pcre_malloc/free functions -- some option would be
  needed to retain backward compatibility.

. Define a union for the results from pcre_fullinfo().

. Provide a "random access to the subject" facility so that the way in which it
  is stored is independent of PCRE. For efficiency, it probably isn't possible
  to switch this dynamically. It would have to be specified when PCRE was
  compiled. PCRE would then call a function every time it wanted a character.

. Wild thought: the ability to compile from PCRE's internal byte code to a real
  FSM and a very fast (third) matcher to process the result. There would be
  even more restrictions than for pcre_dfa_exec(), however. This is not easy.

. Should pcretest have some private locale data, to avoid relying on the
  available locales for the test data, since different operating systems have
  different ideas? This won't be as thorough a test, but perhaps that doesn't
  really matter.

. pcregrep: add -rs for a sorted recurse? Having to store file names and sort
  them will of course slow it down.

. Someone suggested --disable-callout to save code space when callouts are
  never wanted. This seems rather marginal.

. Check names that consist entirely of digits: PCRE allows, but do Perl and
  Python, etc?

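One item in the list above asks for results as character offsets and character
lengths. For a UTF-8 subject, that conversion could be done outside PCRE along
the following lines. This is only a sketch under stated assumptions: the name
utf8_char_offset is hypothetical, the subject is assumed to be valid UTF-8, and
the byte offset is assumed to fall on a character boundary, as the offsets
returned by a successful match would.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: convert a byte offset within a UTF-8
   subject into a character offset by counting the bytes before it that start
   a character, that is, all bytes that are not continuation bytes of the
   form 10xxxxxx. */
static size_t utf8_char_offset(const unsigned char *subject, size_t byte_offset)
{
  size_t i, chars = 0;
  assert(subject != NULL);
  for (i = 0; i < byte_offset; i++) {
    if ((subject[i] & 0xC0) != 0x80) chars++;
  }
  return chars;
}
```

Given a match described by start and end byte offsets, the character offset is
utf8_char_offset(subject, start) and the character length is the difference
between the results for end and start. Scanning from the beginning each time is
simple but costs time proportional to the offset.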
Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 27 December 2007