Old version:

The files in the "maintain" directory of the PCRE source contain data, scripts,
and programs that are used for the maintenance of PCRE, but do not form part of
the PCRE distribution tarballs.

132html         A Perl script that converts a .1 or .3 man page into HTML. It
                is called from MakeRelease. It "knows" the relevant troff
                constructs that are used in the PCRE man pages.

Builducptable   A Perl script that creates the contents of the ucptable.h file
                from two Unicode data files, which themselves are downloaded
                from the Unicode web site. Run this script in the "maintain"
                directory.

CleanTxt        A Perl script that cleans up the output of "nroff -man" by
                removing backspaces and other redundant text so as to produce
                a readable .txt file. It is called from MakeRelease.

Detrail         A Perl script that removes trailing spaces from files. It is
                called from MakeRelease.

Index.html      A file that is copied as index.html into the doc/html
                directory when the HTML documentation is being built. It works
                like this so that doc/html can be deleted and re-created from
                scratch.

MakeRelease     My script for creating a new release. It processes the
                documentation man pages into .text and HTML formats before
                creating tarballs and putting them in the Releases directory.

Tech.Notes      Some notes about the internals of the PCRE code.

Unicode.tables  The files in this directory, Scripts.txt and UnicodeData.txt,
                were downloaded from the Unicode web site. They contain
                information about Unicode characters and scripts.

ucptest.c       A short C program for testing the Unicode property functions in
                pcre_ucp_searchfuncs.c, mainly useful after rebuilding the
                Unicode property table. Compile and run this in the "maintain"
                directory.

ucptestdata     A directory containing two files, testinput1 and testoutput1,
                to use in conjunction with the ucptest program.

utf8.c          A short, freestanding C program for converting a Unicode code
                point into a sequence of bytes in the UTF-8 encoding, and vice
                versa. If its argument is a hex number such as 0x1234, it
                outputs a list of the equivalent UTF-8 bytes. If its argument
                is a sequence of concatenated UTF-8 bytes (e.g. e188b4), it
                treats them as a UTF-8 character and outputs the equivalent
                code point in hex.

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site, and the Builducptable script can then be run to
generate a new version of ucptable.h. The ucptest program can be used to check
that the resulting table works properly, using the data files in ucptestdata to
check a number of test characters.

****

New version:

MAINTENANCE README FOR PCRE
---------------------------

The files in the "maint" directory of the PCRE source contain data, scripts,
and programs that are used for the maintenance of PCRE, but which do not form
part of the PCRE distribution tarballs. This document describes these files and
also contains some notes for maintainers. Its contents are:

  Files in the maint directory
  Updating to a new Unicode release
  Preparing for a PCRE release
  Making a PCRE release
  Long-term ideas (wish list)


Files in the maint directory
----------------------------

----------------- This file is now OBSOLETE and no longer used ----------------
Builducptable   A Perl script that creates the contents of the ucptable.h file
                from two Unicode data files, which themselves are downloaded
                from the Unicode web site. Run this script in the "maint"
                directory.
----------------- This file is now OBSOLETE and no longer used ----------------

GenerateUtt.py  A Python script to generate part of the pcre_tables.c file
                that contains Unicode script names in a long string with
                offsets, which is tedious to maintain by hand.

ManyConfigTests A shell script that runs "configure, make, test" a number of
                times with different configuration settings.

MultiStage2.py  A Python script that generates the file pcre_ucd.c from three
                Unicode data tables, which are themselves downloaded from the
                Unicode web site. Run this script in the "maint" directory.
                The generated file contains the tables for a 2-stage lookup
                of Unicode properties.

Unicode.tables  The files in this directory, DerivedGeneralCategory.txt,
                Scripts.txt and UnicodeData.txt, were downloaded from the
                Unicode web site. They contain information about Unicode
                characters and scripts.

ucptest.c       A short C program for testing the Unicode property macros
                that do lookups in the pcre_ucd.c data, mainly useful after
                rebuilding the Unicode property table. Compile and run this in
                the "maint" directory (see comments at its head).

ucptestdata     A directory containing two files, testinput1 and testoutput1,
                to use in conjunction with the ucptest program.

utf8.c          A short, freestanding C program for converting a Unicode code
                point into a sequence of bytes in the UTF-8 encoding, and vice
                versa. If its argument is a hex number such as 0x1234, it
                outputs a list of the equivalent UTF-8 bytes. If its argument
                is a sequence of concatenated UTF-8 bytes (e.g. e188b4), it
                treats them as a UTF-8 character and outputs the equivalent
                code point in hex.
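The two conversions described for utf8.c can be illustrated with a minimal
Python sketch. This is not the code in utf8.c; the function names
codepoint_to_utf8 and utf8_hex_to_codepoint are hypothetical, and the sketch
assumes the modern UTF-8 range of 0 to 0x10FFFF.

```python
def codepoint_to_utf8(cp: int) -> bytes:
    """Encode one code point (0..0x10FFFF) as UTF-8 bytes (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp < 0x80:                      # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                     # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                   # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),   # 4 bytes: 11110xxx then three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf8_hex_to_codepoint(hex_bytes: str) -> int:
    """Decode a hex string such as 'e188b4' holding exactly one UTF-8 character."""
    text = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(text) != 1:
        raise ValueError("expected exactly one UTF-8 character")
    return ord(text)


print(codepoint_to_utf8(0x1234).hex())       # e188b4
print(hex(utf8_hex_to_codepoint("e188b4")))  # 0x1234
```

The two printed values mirror the examples in the utf8.c description: 0x1234
encodes to the bytes e1 88 b4, and e188b4 decodes back to 0x1234.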


Updating to a new Unicode release
---------------------------------

When there is a new release of Unicode, the files in Unicode.tables must be
refreshed from the web site. If the new version of Unicode adds new character
scripts, the source file ucp.h and both the MultiStage2.py and the
GenerateUtt.py scripts must be edited to add the new names. The
MultiStage2.py script can then be run to generate a new version of pcre_ucd.c,
and the GenerateUtt.py script can be run to generate the tricky tables for
inclusion in pcre_tables.c.
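
The "long string with offsets" layout that GenerateUtt.py is said to produce
can be pictured with the following sketch. It is an illustration only, not the
script itself: the helper names build_name_table and name_at, the NUL
separator, and the three script names are assumptions made for the example.

```python
def build_name_table(names):
    """Pack names into one NUL-separated string and record each start offset."""
    offsets = {}
    parts = []
    position = 0
    for name in sorted(names):
        offsets[name] = position
        parts.append(name + "\0")
        position += len(name) + 1   # +1 for the NUL terminator
    return "".join(parts), offsets


def name_at(packed, offset):
    """Read the NUL-terminated name that starts at the given offset."""
    end = packed.index("\0", offset)
    return packed[offset:end]


packed, offsets = build_name_table(["Latin", "Greek", "Arabic"])
print(offsets)                            # {'Arabic': 0, 'Greek': 7, 'Latin': 13}
print(name_at(packed, offsets["Greek"]))  # Greek
```

Adding a name shifts every offset after it, which is one reason a table of
this kind is tedious to keep correct by hand.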

The ucptest program can then be compiled and used to check that the new tables
in pcre_ucd.c work properly, using the data files in ucptestdata to check a
number of test characters.
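
The idea behind a 2-stage lookup, and the kind of check that ucptest performs,
can be sketched as follows. This is a generic illustration and does not
reproduce the layout of pcre_ucd.c: the block size of 128, the function names,
and the tiny property table are all assumptions.

```python
BLOCK_SIZE = 128  # assumed block size for this sketch


def build_two_stage(values, block_size=BLOCK_SIZE):
    """Split a flat per-code-point table into stage1 and stage2 tables."""
    if len(values) % block_size != 0:
        raise ValueError("table length must be a multiple of the block size")
    stage1 = []   # one entry per block: which unique block to use
    stage2 = []   # the unique blocks, stored one after another
    seen = {}     # block contents -> index of that block within stage2
    for start in range(0, len(values), block_size):
        block = tuple(values[start:start + block_size])
        if block not in seen:
            seen[block] = len(stage2) // block_size
            stage2.extend(block)
        stage1.append(seen[block])
    return stage1, stage2


def lookup(stage1, stage2, cp, block_size=BLOCK_SIZE):
    """Find the property of code point cp with two table reads."""
    block = stage1[cp // block_size]
    return stage2[block * block_size + cp % block_size]


# A toy table: 1024 code points, all "Cn" except the ASCII letters.
values = ["Cn"] * 1024
for cp in range(0x41, 0x5B):
    values[cp] = "Lu"
for cp in range(0x61, 0x7B):
    values[cp] = "Ll"

stage1, stage2 = build_two_stage(values)

# The check: every code point must give the same answer as the flat table.
assert all(lookup(stage1, stage2, cp) == values[cp] for cp in range(len(values)))
print(stage1)        # [0, 1, 1, 1, 1, 1, 1, 1]
print(len(stage2))   # 256
```

Identical blocks are stored once, so the seven all-"Cn" blocks in the toy table
share a single stage2 entry.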


Preparing for a PCRE release
----------------------------

This section contains a checklist of things that I consult before building a
distribution for a new release.

. Ensure that the version number and version date are correct in configure.ac,
  ChangeLog, and NEWS.

. If new build options have been added, ensure that they are added to the CMake
  files as well as to the autoconf files.

. Run ./autogen.sh to ensure everything is up-to-date.

. Compile and test with many different config options, and combinations of
  options. The maint/ManyConfigTests script now encapsulates this testing.

. Run perltest.pl on the test data for tests 1 and 4. The output should match
  the PCRE test output, apart from the version identification at the top. The
  other tests are not Perl-compatible (they use various special PCRE options).

. Test with valgrind by running "RunTest valgrind". There is also "RunGrepTest
  valgrind", though that takes quite a long time.

. It may also be useful to test with Electric Fence, though the fact that it
  grumbles for missing free() calls can be a nuisance. (A missing free() in
  pcretest is hardly a big problem.) To build with EF, use:

    LIBS='/usr/lib/libefence.a -lpthread' with ./configure.

  Then all normal runs use it to check for buffer overflow. Also run everything
  with:

    EF_PROTECT_BELOW=1 <whatever>

  because there have been problems with lookbehinds that looked too far.

. Test with the emulated memmove() function by undefining HAVE_MEMMOVE and
  HAVE_BCOPY in config.h. You may see a number of "pcre_memmove defined but not
  used" warnings for the modules in which there is no call to memmove(). These
  can be ignored.

. Documentation: check AUTHORS, COPYING, ChangeLog (check date), INSTALL,
  LICENCE, NEWS (check date), NON-UNIX-USE, and README. Many of these won't
  need changing, but over the long term things do change.

. Man pages: Check all man pages for \ not followed by e or f or " because
  that indicates a markup error.

. When the release is built, test it on a number of different operating
  systems if possible, and using different compilers as well. For example,
  on Solaris it is helpful to test using Sun's cc compiler as a change from
  gcc. Adding -xarch=v9 to the cc options does a 64-bit test, but it also
  needs -S 64 for pcretest to increase the stack size for test 2.
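
The man page check in the list above (a backslash not followed by e, f, or ")
can be automated along these lines. This is a sketch under one reading of the
rule, in which a backslash at the end of a line also counts as suspect; the
function name and the sample text are hypothetical.

```python
import re

# A backslash that is NOT followed by e, f, or a double quote.
SUSPECT_BACKSLASH = re.compile(r'\\(?![ef"])')


def find_suspect_backslashes(text):
    """Return (line number, line) pairs that contain a suspect backslash."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if SUSPECT_BACKSLASH.search(line):
            hits.append((number, line))
    return hits


sample = "\n".join([
    r'.\" a comment line',
    r'\fBpcre_compile()\fP uses \en for newline',
    r'this line has a stray \x escape',
])

for number, line in find_suspect_backslashes(sample):
    print(number, line)   # 3 this line has a stray \x escape
```

To scan real files, the text of each man page would be read and passed to
find_suspect_backslashes in the same way as the sample string.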


Making a PCRE release
---------------------

Run PrepareRelease and commit the files that it changes (by removing trailing
spaces). Then run "make distcheck" to create the tarballs and the zipball.
Double-check with "svn status", then create an SVN tagged copy:

  svn copy svn://vcs.exim.org/pcre/code/trunk \
           svn://vcs.exim.org/pcre/code/tags/pcre-7.x

Don't forget to update Freshmeat when the new release is out, and to tell
webmaster@pcre.org and the mailing list.


Future ideas (wish list)
------------------------

This section records a list of ideas so that they do not get forgotten. They
vary enormously in their usefulness and potential for implementation. Some are
very sensible; some are rather wacky. Some have been on this list for years;
others are relatively new.

. Optimization

  There are always ideas for new optimizations so as to speed up pattern
  matching. Most of them try to save work by recognizing a non-match without
  having to scan all the possibilities. These are some that I've recorded:

  * /((A{0,5}){0,5}){0,5}(something complex)/ on a non-matching string is very
    slow, though Perl is fast. Can we speed up somehow? Convert to {0,125}?
    OTOH, this is pathological - the user could easily fix it.

  * Turn ={4} into ==== ? (for speed). I once did an experiment, and it seems
    to have little effect, and maybe makes things worse.

  * "Ends with literal string" - note that a single character doesn't gain much
    over the existing "required byte" (reqbyte) feature that just saves one
    byte.

  * These probably need to go in study():

    o Remember an initial string rather than just 1 char?

    o A required byte from alternatives - not just the last char, but an
      earlier one if common to all alternatives.

    o Minimum length of subject needed.

    o Friedl contains other ideas.

. If Perl gets to a consistent state over the settings of capturing
  subpatterns inside repeats, see if we can match it. One example of the
  difference is the matching of /(main(O)?)+/ against mainOmain, where PCRE
  leaves $2 set. In Perl, it's unset. Changing this in PCRE will be very hard
  because I think it needs much more state to be remembered.

. Perl 6 will be a revolution. Is it a revolution too far for PCRE?

. Unicode

  * Note that in Perl, \s matches \pZ and similarly for \d, \w and the POSIX
    character classes. For the moment, I've chosen not to support this for
    backward compatibility, for speed, and because it would be messy to
    implement.

  * A different approach to Unicode might be to use a typedef to do everything
    in unsigned shorts instead of unsigned chars. Actually, we'd have to have a
    new typedef to distinguish data from bits of compiled pattern that are in
    bytes, I think. There would need to be conversion functions in and out. I
    don't think this is particularly trivial - and anyway, Unicode now has
    characters that need more than 16 bits, so is this at all sensible?

  * There has been a request for direct support of 16-bit characters and
    UTF-16. However, since Unicode is moving beyond purely 16-bit characters,
    is this worth it at all? One possible way of handling 16-bit characters
    would be to "load" them in the same way that UTF-8 characters are loaded.

. Allow errorptr and erroroffset to be NULL. I don't like this idea.

. Line endings:

  * Option to use NUL as a line terminator in subject strings. This could now
    be done relatively easily since the extension to support LF, CR, and CRLF.
    If this is done, a suitable option for pcregrep is also required.

. Option to provide the pattern with a length instead of with a NUL terminator.
  This probably affects quite a few places in the code.

. Catch SIGSEGV for stack overflows?

. A feature to suspend a match via a callout was once requested.

. Option to convert results into character offsets and character lengths.

. Option for pcregrep to scan only the start of a file. I am not keen - this is
  the job of "head".

. A (non-Unix) user wanted pcregrep options to (a) list a file name just once,
  preceded by a blank line, instead of adding it to every matched line, and (b)
  support --outputfile=name.

. Consider making UTF-8 and UCP the default for PCRE n.0 for some n > 7.

. Add a user pointer to pcre_malloc/free functions -- some option would be
  needed to retain backward compatibility.

. Define a union for the results from pcre_fullinfo().

. Provide a "random access to the subject" facility so that the way in which it
  is stored is independent of PCRE. For efficiency, it probably isn't possible
  to switch this dynamically. It would have to be specified when PCRE was
  compiled. PCRE would then call a function every time it wanted a character.

. Wild thought: the ability to compile from PCRE's internal byte code to a real
  FSM and a very fast (third) matcher to process the result. There would be
  even more restrictions than for pcre_dfa_exec(), however. This is not easy.

. Should pcretest have some private locale data, to avoid relying on the
  available locales for the test data, since different OS have different ideas?
  This won't be as thorough a test, but perhaps that doesn't really matter.

. pcregrep: add -rs for a sorted recurse? Having to store file names and sort
  them will of course slow it down.

. Someone suggested --disable-callout to save code space when callouts are
  never wanted. This seems rather marginal.

. Check names that consist entirely of digits: PCRE allows, but do Perl and
  Python, etc?

. A user suggested a parameter to limit the length of string matched, for
  example if the parameter is N, the current match should fail if the matched
  substring exceeds N. This could apply to both match functions. The value
  could be a new field in the extra block.

. Callouts with arguments: (?Cn:ARG) for instance.

. A user is going to supply a patch to generalize the API for user-specific
  memory allocation so that it is more flexible in threaded environments.
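
As an illustration of the wish-list item about converting results into
character offsets, the following sketch shows what such a conversion involves
for a UTF-8 subject, given that match offsets are reported in bytes. It is not
a proposed PCRE interface; the function name and the sample subject are
hypothetical.

```python
def byte_to_char_offsets(subject: bytes, byte_offsets):
    """Map byte offsets into a UTF-8 subject to character offsets."""
    result = []
    for offset in byte_offsets:
        # A character starts at every byte that is not a continuation
        # byte (continuation bytes have the bit pattern 10xxxxxx).
        count = sum(1 for b in subject[:offset] if (b & 0xC0) != 0x80)
        result.append(count)
    return result


subject = "na\u00efve caf\u00e9".encode("utf-8")   # 10 characters, 12 bytes
start, end = 7, 12                                 # byte offsets of the last word
char_start, char_end = byte_to_char_offsets(subject, [start, end])
print(char_start, char_end)             # 6 10
print(char_end - char_start)            # 4
```

The character length of a match is then the difference between the two
converted offsets, here 4 characters for a 5-byte match.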

Philip Hazel
Email local part: ph10
Email domain: cam.ac.uk
Last updated: 26 August 2008