MAINTENANCE README FOR PCRE
===========================

The files in the "maint" directory of the PCRE source contain data, scripts,
and programs that are used for the maintenance of PCRE, but which do not form
part of the PCRE distribution tarballs. This document describes these files and
also contains some notes for maintainers. Its contents are:

  Files in the maint directory
  Updating to a new Unicode release
  Preparing for a PCRE release
  Updating version info for libtool
  Making a PCRE release
  Future ideas (wish list)


Files in the maint directory
============================

---------------- This file is now OBSOLETE and no longer used ----------------
Builducptable    A Perl script that creates the contents of the ucptable.h file
                 from two Unicode data files, which themselves are downloaded
                 from the Unicode web site. Run this script in the "maint"
                 directory.
---------------- This file is now OBSOLETE and no longer used ----------------

GenerateUtt.py   A Python script to generate part of the pcre_tables.c file
                 that contains Unicode script names in a long string with
                 offsets, which is tedious to maintain by hand.

ManyConfigTests  A shell script that runs "configure, make, test" a number of
                 times with different configuration settings.

MultiStage2.py   A Python script that generates the file pcre_ucd.c from three
                 Unicode data tables, which are themselves downloaded from the
                 Unicode web site. Run this script in the "maint" directory.
                 The generated file contains the tables for a 2-stage lookup
                 of Unicode properties.

pcre_chartables.c.non-standard
                 This is a set of character tables that came from a Windows
                 system. It has characters greater than 128 that are set as
                 spaces, amongst other things. I kept it so that it can be
                 used for testing from time to time.

README           This file.

Unicode.tables   The files in this directory (CaseFolding.txt,
                 DerivedGeneralCategory.txt, GraphemeBreakProperty.txt,
                 Scripts.txt and UnicodeData.txt) were downloaded from the
                 Unicode web site. They contain information about Unicode
                 characters and scripts.

ucptest.c        A short C program for testing the Unicode property macros
                 that do lookups in the pcre_ucd.c data, mainly useful after
                 rebuilding the Unicode property table. Compile and run this in
                 the "maint" directory (see comments at its head).

ucptestdata      A directory containing two files, testinput1 and testoutput1,
                 to use in conjunction with the ucptest program.

utf8.c           A short, freestanding C program for converting a Unicode code
                 point into a sequence of bytes in the UTF-8 encoding, and vice
                 versa. If its argument is a hex number such as 0x1234, it
                 outputs a list of the equivalent UTF-8 bytes. If its argument
                 is a sequence of concatenated UTF-8 bytes (e.g. e188b4) it
                 treats them as a UTF-8 character and outputs the equivalent
                 code point in hex.
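
The conversion that utf8.c performs can be illustrated with a minimal Python
sketch. This is not the C program itself: the function names and the output
formatting are assumptions made for illustration, and only the 0x1234 / e188b4
pair comes from the description above.

```python
def codepoint_to_utf8_hex(arg: str) -> str:
    """Convert a hex code point such as '0x1234' to its UTF-8 bytes in hex."""
    code_point = int(arg, 16)            # base 16 accepts an optional 0x prefix
    encoded = chr(code_point).encode("utf-8")
    return " ".join(f"{byte:02x}" for byte in encoded)


def utf8_hex_to_codepoint(arg: str) -> str:
    """Convert concatenated UTF-8 bytes in hex, such as 'e188b4', to a code point."""
    text = bytes.fromhex(arg).decode("utf-8")   # raises on malformed UTF-8
    if len(text) != 1:
        raise ValueError("expected exactly one UTF-8 character")
    return f"0x{ord(text):04x}"


print(codepoint_to_utf8_hex("0x1234"))   # e1 88 b4
print(utf8_hex_to_codepoint("e188b4"))   # 0x1234
```

Python's built-in codec rejects surrogate code points and overlong sequences,
which may or may not match what the C program accepts.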


Updating to a new Unicode release
=================================

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file ucp.h and both the MultiStage2.py and the
GenerateUtt.py scripts must be edited to add the new names. Then MultiStage2.py
can be run to generate a new version of pcre_ucd.c, and GenerateUtt.py can be
run to generate the tricky tables for inclusion in pcre_tables.c.
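
The reason the pcre_tables.c name table is tedious to maintain by hand is that
adding one name shifts the offset of every name after it. The sketch below
shows the general "one long string plus offsets" technique; the NUL-terminated
layout, the function names, and the three sample names are assumptions for
illustration and are not taken from GenerateUtt.py.

```python
def build_name_table(names):
    """Join names into one NUL-terminated string and record each start offset."""
    offsets = []
    position = 0
    for name in names:
        offsets.append(position)
        position += len(name) + 1        # +1 for the terminating NUL
    blob = "".join(name + "\0" for name in names)
    return blob, offsets


def name_at(blob, offset):
    """Read back the name that starts at the given offset."""
    return blob[offset:blob.index("\0", offset)]


blob, offsets = build_name_table(["Arabic", "Greek", "Latin"])
print(offsets)                  # [0, 7, 13]
print(name_at(blob, offsets[1]))  # Greek
```

Inserting a new name anywhere but at the end changes all later offsets, which
is why regenerating the table with a script is safer than editing it.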

If MultiStage2.py gives the error "ValueError: list.index(x): x not in list",
the cause is usually a missing (or misspelt) name in the list of scripts. I
couldn't find a straightforward list of scripts on the Unicode site, but
there's a useful Wikipedia page that lists them, and notes the Unicode version
in which they were introduced:

http://en.wikipedia.org/wiki/Unicode_scripts#Table_of_Unicode_scripts
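
Python's list.index() raises ValueError when the value is not in the list, and
the bare message does not always make the culprit obvious. The following
sketch shows one way to make such a failure name the missing script. The list
contents and the helper name are hypothetical; this is not code from
MultiStage2.py.

```python
SCRIPT_NAMES = ["Arabic", "Greek", "Latin"]   # hypothetical, shortened list


def script_index(name, names=SCRIPT_NAMES):
    """Return the position of a script name, naming the script if it is absent."""
    try:
        return names.index(name)
    except ValueError:
        raise ValueError(
            f"script {name!r} is missing from the list of scripts"
        ) from None


print(script_index("Greek"))    # 1
```

Calling script_index("Linear_B") with this list raises a ValueError whose text
includes the name that needs to be added.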

The ucptest program can be compiled and used to check that the new tables in
pcre_ucd.c work properly, using the data files in ucptestdata to check a number
of test characters. The source file ucptest.c must be updated whenever new
Unicode script names are added.
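
For readers unfamiliar with the 2-stage lookup that these tables implement,
here is a minimal sketch of the general technique. The block size of 128, the
function names, and the toy data are assumptions for illustration; the real
layout in pcre_ucd.c may differ in detail.

```python
BLOCK_SIZE = 128   # assumed block size for this sketch


def build_two_stage(values, block_size=BLOCK_SIZE):
    """Compress a flat per-code-point table into (stage1, stage2).

    stage2 holds each distinct block once; stage1 maps a block number to
    the index of its block within stage2.
    """
    if len(values) % block_size != 0:
        raise ValueError("table length must be a multiple of the block size")
    stage1, stage2, seen = [], [], {}
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(stage2) // block_size
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, code_point, block_size=BLOCK_SIZE):
    """Find the property value for a code point using the two tables."""
    block = stage1[code_point // block_size]
    return stage2[block * block_size + code_point % block_size]


flat = [0] * 256 + [1] * 128 + [0] * 128     # toy property table
stage1, stage2 = build_two_stage(flat)
print(stage1, len(stage2))                   # [0, 0, 1, 0] 256
```

The saving comes from the many blocks of code points that share identical
property values and so are stored only once in the second stage.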

Note also that both the pcresyntax.3 and pcrepattern.3 man pages contain lists
of Unicode script names.


Preparing for a PCRE release
============================

This section contains a checklist of things that I consult before building a
distribution for a new release.

. Ensure that the version number and version date are correct in configure.ac.

. Update the library version numbers in configure.ac according to the rules
  given below.

. If new build options have been added, ensure that they are added to the CMake
  files as well as to the autoconf files. The relevant files are CMakeLists.txt
  and config-cmake.h.in. After making a release tarball, test it out with CMake
  if there have been changes here.

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
  options. Also, test with valgrind by running "RunTest valgrind" and
  "RunGrepTest valgrind" (which takes quite a long time). The script
  maint/ManyConfigTests now encapsulates this testing. It runs tests with
  different configurations, and it also runs some of them with valgrind, all of
  which can take quite some time.

. Run perltest.pl on the test data for tests 1, 4, and 6. The output
  should match the PCRE test output, apart from the version identification at
  the start of each test. The other tests are not Perl-compatible (they use
  various PCRE-specific features or options).

. It is possible to test with the emulated memmove() function by undefining
  HAVE_MEMMOVE and HAVE_BCOPY in config.h, though I do not do this often. You
  may see a number of "pcre_memmove defined but not used" warnings for the
  modules in which there is no call to memmove(). These can be ignored.

. Documentation: check AUTHORS, ChangeLog (check version and date), LICENCE,
  NEWS (check version and date), NON-AUTOTOOLS-BUILD, and README. Many of these
  won't need changing, but over the long term things do change.

. I used to test new releases myself on a number of different operating
  systems, using different compilers as well. For example, on Solaris it is
  helpful to test using Sun's cc compiler as a change from gcc. Adding
  -xarch=v9 to the cc options does a 64-bit test, but it also needs -S 64 for
  pcretest to increase the stack size for test 2. Since I retired I can no
  longer do this, but instead I rely on putting out release candidates for
  folks on the pcre-dev list to test.


Updating version info for libtool
=================================

This set of rules for updating library version information came from a web page
whose URL I have forgotten. The version information consists of three parts:
(current, revision, age).

1. Start with version information of 0:0:0 for each libtool library.

2. Update the version information only immediately before a public release of
   your software. More frequent updates are unnecessary, and only guarantee
   that the current interface number gets larger faster.

3. If the library source code has changed at all since the last update, then
   increment revision; c:r:a becomes c:r+1:a.

4. If any interfaces have been added, removed, or changed since the last
   update, increment current, and set revision to 0.

5. If any interfaces have been added since the last public release, then
   increment age.

6. If any interfaces have been removed or changed since the last public
   release, then set age to 0.

The following explanation may help in understanding the above rules a bit
better. Consider that there are three possible kinds of reaction from users to
changes in a shared library:

1. Programs using the previous version may use the new version as a drop-in
   replacement, and programs using the new version can also work with the
   previous one. In other words, neither recompiling nor relinking is needed.
   In this case, increment revision only, don't touch current or age.

2. Programs using the previous version may use the new version as a drop-in
   replacement, but programs using the new version may use APIs not present in
   the previous one. In other words, a program linking against the new version
   may fail if linked against the old version at run time. In this case, set
   revision to 0, increment current and age.

3. Programs may need to be changed, recompiled, relinked in order to use the
   new version. Increment current, set revision and age to 0.
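
Rules 3 to 6 can be applied mechanically, in order. The sketch below is one
reading of those rules expressed in Python; the function and parameter names
are invented for illustration and are not part of libtool or of the PCRE
build.

```python
def next_version_info(current, revision, age, *, source_changed,
                      interfaces_added, interfaces_removed_or_changed):
    """Apply rules 3 to 6 in order and return the new (current, revision, age)."""
    if source_changed:                                      # rule 3
        revision += 1
    if interfaces_added or interfaces_removed_or_changed:   # rule 4
        current += 1
        revision = 0
    if interfaces_added:                                    # rule 5
        age += 1
    if interfaces_removed_or_changed:                       # rule 6
        age = 0
    return current, revision, age


# The three kinds of change described above, starting from 3:2:1.
print(next_version_info(3, 2, 1, source_changed=True,
                        interfaces_added=False,
                        interfaces_removed_or_changed=False))   # (3, 3, 1)
print(next_version_info(3, 2, 1, source_changed=True,
                        interfaces_added=True,
                        interfaces_removed_or_changed=False))   # (4, 0, 2)
print(next_version_info(3, 2, 1, source_changed=True,
                        interfaces_added=False,
                        interfaces_removed_or_changed=True))    # (4, 0, 0)
```

The three calls correspond to the three kinds of reaction listed above: a
drop-in replacement in both directions, a backward-compatible addition, and an
incompatible change.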


Making a PCRE release
=====================

Run PrepareRelease and commit the files that it changes (by removing trailing
spaces). The first thing this script does is to run CheckMan on the man pages;
if it finds any markup errors, it reports them and then aborts.

Once PrepareRelease has run clean, run "make distcheck" to create the tarballs
and the zipball. Double-check with "svn status", then create an SVN tagged
copy:

  svn copy svn://vcs.exim.org/pcre/code/trunk \
           svn://vcs.exim.org/pcre/code/tags/pcre-8.xx

Don't forget to update Freshmeat when the new release is out, and to tell
webmaster@pcre.org and the mailing list. Also, update the list of version
numbers in Bugzilla (edit products).


Future ideas (wish list)
========================

This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for years;
others are relatively new.

. Optimization

  There are always ideas for new optimizations so as to speed up pattern
  matching. Most of them try to save work by recognizing a non-match without
  having to scan all the possibilities. These are some that I've recorded:

  * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
    slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
    OTOH, this is pathological - the user could easily fix it.

  * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
    to have little effect, and maybe makes things worse.

  * "Ends with literal string" - note that a single character doesn't gain much
    over the existing "required byte" (reqbyte) feature that just remembers one
    data unit.

  * These probably need to go in pcre_study():

    o Remember an initial string rather than just 1 char?

    o A required data unit from alternatives - not just the last unit, but an
      earlier one if common to all alternatives.

    o Friedl contains other ideas.

  * pcre_study() does not set initial byte flags for Unicode property types
    such as \p; I don't know how much benefit there would be for, for example,
    setting the bits for 0-9 and all bytes >= xC0 when a pattern starts with
    \p{N}.

  * There is scope for more "auto-possessifying" in connection with \p and \P.
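
The "Convert to {0,125}?" idea in the first optimization item rests on nested
bounded repeats of a single literal multiplying: three nested {0,5} repeats
allow up to 5 x 5 x 5 = 125 copies. The sketch below checks the same
relationship on a smaller case (2 x 2 x 2 = 8) so that the failing cases stay
cheap. It uses Python's re module rather than PCRE, and it compares only which
strings match, not what the capturing groups contain.

```python
import re

NESTED = re.compile(r"((A{0,2}){0,2}){0,2}")   # small stand-in for the {0,5} case
FLAT = re.compile(r"A{0,8}")                   # 2 * 2 * 2 = 8


def same_language(max_len=10):
    """Check that both patterns accept exactly the same runs of 'A'."""
    return all(
        bool(NESTED.fullmatch("A" * n)) == bool(FLAT.fullmatch("A" * n))
        for n in range(max_len + 1)
    )


print(same_language())   # True
```

The nested form sets capturing groups and the flat form does not, so a real
rewrite inside the compiler would have to account for that difference.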

. If Perl gets to a consistent state over the settings of capturing sub-
  patterns inside repeats, see if we can match it. One example of the
  difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
  leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
  because I think it needs much more state to be remembered.

. Perl 6 will be a revolution. Is it a revolution too far for PCRE?

. Allow errorptr and erroroffset to be NULL. I don't like this idea.

. Line endings:

  * Option to use NUL as a line terminator in subject strings. This could now
    be done relatively easily since the extension to support LF, CR, and CRLF.
    If it is done, a suitable option for pcregrep is also required.

. Option to provide the pattern with a length instead of with a NUL terminator.
  This affects quite a few places in the code and is not trivial.

. Catch SIGSEGV for stack overflows?

. A feature to suspend a match via a callout was once requested.

. Option to convert results into character offsets and character lengths.

. Option for pcregrep to scan only the start of a file. I am not keen - this is
  the job of "head".

. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
  preceded by a blank line, instead of adding it to every matched line, and (b)
  support --outputfile=name.

. Consider making UTF-8 and UCP the default for PCRE n.0 for some n > 8.
  (And now presumably UTF-16 and UCP for the 16-bit library, and UTF-32 and UCP
  for the 32-bit library.)

. Add a user pointer to pcre_malloc/free functions -- some option would be
  needed to retain backward compatibility.

. Define a union for the results from pcre_fullinfo().

. Provide a "random access to the subject" facility so that the way in which it
  is stored is independent of PCRE. For efficiency, it probably isn't possible
  to switch this dynamically. It would have to be specified when PCRE was
  compiled. PCRE would then call a function every time it wanted a character.

. Wild thought: the ability to compile from PCRE's internal byte code to a real
  FSM and a very fast (third) matcher to process the result. There would be
  even more restrictions than for pcre_dfa_exec(), however. This is not easy.
  This is probably obsolete now that we have the JIT support.

. Should pcretest have some private locale data, to avoid relying on the
  available locales for the test data, since different operating systems have
  different ideas? This won't be as thorough a test, but perhaps that doesn't
  really matter.

. pcregrep: add -rs for a sorted recurse? Having to store file names and sort
  them will of course slow it down.

. Someone suggested --disable-callout to save code space when callouts are
  never wanted. This seems rather marginal.

. Check names that consist entirely of digits: PCRE allows, but do Perl and
  Python, etc?

. A user suggested a parameter to limit the length of string matched, for
  example if the parameter is N, the current match should fail if the matched
  substring exceeds N. This could apply to both match functions. The value
  could be a new field in the extra block.

. Callouts with arguments: (?Cn:ARG) for instance.

. A user is going to supply a patch to generalize the API for user-specific
  memory allocation so that it is more flexible in threaded environments. This
  was promised a long time ago, and never appeared. However, this is a live
  issue not only for threaded environments, but for libraries that use PCRE and
  want not to be beholden to their caller's memory allocation.

. Write a wrapper to maintain a structure with specified runtime parameters,
  such as recurse limit, and pass these to PCRE each time it is called. Also
  maybe malloc and free. A user sent a prototype. This relates to the previous
  item.

. Write a function that generates random matching strings for a compiled regex.

. pcregrep: an option to specify the output line separator, either as a string
  or select from a fixed list. This is not dead easy, because at the moment it
  outputs whatever is in the input file.

. Improve the code for duplicate checking in pcre_dfa_exec(). An incomplete,
  non-thread-safe patch showed that this can help performance for patterns
  where there are many alternatives. However, a simple thread-safe
  implementation that I tried made things worse in many simple cases, so this
  is not an obviously good thing.

. PCRE cannot at present distinguish between subpatterns with different names,
  but the same number (created by the use of ?|). In order to do so, a way of
  remembering *which* subpattern numbered n matched is needed. Bugzilla #760.
  Now that (*MARK) has been implemented, it can perhaps be used as a way round
  this problem.

. Instead of having #ifdef HAVE_CONFIG_H in each module, put #include
  "something" so that the #ifdef appears only in one place, in "something".

Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 09 November 2012