Diff of /code/trunk/ChangeLog

revision 63 by nigel, Sat Feb 24 21:40:03 2007 UTC revision 73 by nigel, Sat Feb 24 21:40:30 2007 UTC
# Line 1  Line 1 
1  ChangeLog for PCRE  ChangeLog for PCRE
2  ------------------  ------------------
3    
4  Version 4.00 17-Feb-03  Version 4.5 01-Dec-03
5  ----------------------  ---------------------
6    
7     1. There has been some re-arrangement of the code for the match() function so
8        that it can be compiled in a version that does not call itself recursively.
9        Instead, it keeps those local variables that need separate instances for
10        each "recursion" in a frame on the heap, and gets/frees frames whenever it
11        needs to "recurse". Keeping track of where control must go is done by means
12        of setjmp/longjmp. The whole thing is implemented by a set of macros that
13        hide most of the details from the main code, and operates only if
14        NO_RECURSE is defined while compiling pcre.c. If PCRE is built using the
15        "configure" mechanism, "--disable-stack-for-recursion" turns on this way of
16        operating.
17    
18        To make it easier for callers to provide specially tailored get/free
19        functions for this usage, two new functions, pcre_stack_malloc and
20        pcre_stack_free, are used. They are always called in strict stacking order,
21        and the size of block requested is always the same.
22    
23        The PCRE_CONFIG_STACKRECURSE info parameter can be used to find out whether
24        PCRE has been compiled to use the stack or the heap for recursion. The
25        -C option of pcretest uses this to show which version is compiled.
26    
27        A new data escape, \S, is added to pcretest; it causes the amounts of store
28        obtained and freed by both kinds of malloc/free at match time to be added
29        to the output.
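
Because the get/free calls in item 1 arrive in strict stacking order and always ask for the same size, a caller-supplied pair can be very simple. The following is only a sketch under stated assumptions: FRAME_SIZE, MAX_FRAMES, frame_alloc and frame_free are hypothetical names, and the sizes are arbitrary. Functions of this shape (the same signatures as malloc and free) are what a caller would presumably supply as pcre_stack_malloc and pcre_stack_free.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sizes chosen for illustration only. */
#define FRAME_SIZE 256
#define MAX_FRAMES 64

/* A fixed pool of equally sized frames; the union forces generous alignment. */
static union {
  unsigned char bytes[FRAME_SIZE];
  long double align_ld;
  void *align_p;
} pool[MAX_FRAMES];

static size_t frames_in_use = 0;

/* Hand out the next frame; NULL if the request is too big or the pool is full. */
void *frame_alloc(size_t size)
{
  if (size > FRAME_SIZE || frames_in_use == MAX_FRAMES) return NULL;
  return pool[frames_in_use++].bytes;
}

/* Release a frame; strict stacking order means it must be the newest one. */
void frame_free(void *block)
{
  assert(frames_in_use > 0);
  assert(block == (void *)pool[frames_in_use - 1].bytes);
  frames_in_use--;
}
```

A pool like this avoids a general-purpose heap call for every "recursion", at the cost of a hard limit on depth.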
30    
31     2. Changed the locale test to use "fr_FR" instead of "fr" because that's
32        what's available on my current Linux desktop machine.
33    
34     3. When matching a UTF-8 string, the test for a valid string at the start has
35        been extended. If start_offset is not zero, PCRE now checks that it points
36        to a byte that is the start of a UTF-8 character. If not, it returns
37        PCRE_ERROR_BADUTF8_OFFSET (-11). Note: the whole string is still checked;
38        this is necessary because there may be backward assertions in the pattern.
39        When matching the same subject several times, it may save resources to use
40        PCRE_NO_UTF8_CHECK on all but the first call if the string is long.
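
The start_offset test in item 3 comes down to checking that the byte at the offset is not a UTF-8 continuation byte (binary 10xxxxxx). A minimal sketch, not PCRE's own code: the function name is hypothetical, and accepting an offset equal to the length is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if offset points at the first byte of a character (or at the very
   end of the subject), 0 if it points into the middle of a character or
   beyond the end. */
int offset_is_char_start(const unsigned char *subject, size_t length,
                         size_t offset)
{
  if (offset > length) return 0;
  if (offset == length) return 1;          /* assumption: end of subject is OK */
  return (subject[offset] & 0xc0) != 0x80; /* 10xxxxxx marks a continuation */
}
```

For the four bytes 'a', 0xc3, 0xa9, 'b', offsets 0, 1 and 3 pass and offset 2 fails.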
41    
42     4. The code for checking the validity of UTF-8 strings has been tightened so
43        that it rejects (a) strings containing 0xfe or 0xff bytes and (b) strings
44        containing "overlong sequences".
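
The two extra rejections in item 4 can be illustrated with a small validator. This is a sketch rather than PCRE's checker: utf8_is_valid is a hypothetical name, and sequences of up to six bytes are accepted on the assumption that the original, wider UTF-8 definition applies. It deliberately does not check for surrogates or an upper code point limit.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if the bytes form valid UTF-8 under the assumptions above. */
int utf8_is_valid(const unsigned char *s, size_t length)
{
  /* Smallest value that genuinely needs (index + 1) bytes; index 0 unused. */
  static const unsigned long min_value[] =
    { 0, 0x80, 0x800, 0x10000, 0x200000, 0x4000000 };
  size_t i = 0;

  while (i < length) {
    unsigned char c = s[i++];
    unsigned long value;
    size_t extra, k;

    if (c < 0x80) continue;            /* plain ASCII */
    if (c < 0xc0) return 0;            /* stray continuation byte */
    if (c >= 0xfe) return 0;           /* (a) 0xfe and 0xff never occur */

    if (c < 0xe0)      { extra = 1; value = c & 0x1f; }
    else if (c < 0xf0) { extra = 2; value = c & 0x0f; }
    else if (c < 0xf8) { extra = 3; value = c & 0x07; }
    else if (c < 0xfc) { extra = 4; value = c & 0x03; }
    else               { extra = 5; value = c & 0x01; }

    if (extra > length - i) return 0;  /* truncated sequence */
    for (k = 0; k < extra; k++) {
      unsigned char d = s[i++];
      if ((d & 0xc0) != 0x80) return 0;
      value = (value << 6) | (d & 0x3f);
    }
    if (value < min_value[extra]) return 0;  /* (b) overlong sequence */
  }
  return 1;
}
```

The bytes 0xc0 0x80 are the classic overlong case: they decode to zero, which fits in one byte, so the two-byte form is rejected.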
45    
46     5. Fixed a bug (appearing twice) that I could not find any way of exploiting!
47        I had written "if ((digitab[*p++] && chtab_digit) == 0)" where the "&&"
48        should have been "&", but it just so happened that all the cases this let
49        through by mistake were picked up later in the function.
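
To see why the "&&" in item 5 lets extra cases through, compare the two forms on a small flag table. The flag values, flags_for, and both test functions below are hypothetical stand-ins, not PCRE's tables.

```c
#include <assert.h>

/* Hypothetical flag bits for illustration. */
#define FLAG_DIGIT 0x04
#define FLAG_HEX   0x08

/* Stand-in for a table lookup: which classes does c belong to? (ASCII) */
static unsigned char flags_for(int c)
{
  if (c >= '0' && c <= '9') return FLAG_DIGIT | FLAG_HEX;
  if ((c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F')) return FLAG_HEX;
  return 0;
}

/* Buggy: "&&" only asks whether ANY flag is set, so 'a' looks like a digit. */
int not_digit_buggy(int c)
{
  return (flags_for(c) && FLAG_DIGIT) == 0;
}

/* Fixed: "&" isolates the digit bit. */
int not_digit_fixed(int c)
{
  return (flags_for(c) & FLAG_DIGIT) == 0;
}
```

The two versions agree for '7' and 'z' and disagree only for characters that carry some other flag, which is how a bug of this kind can stay hidden when later code catches those cases.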
50    
51     6. I had used a variable called "isblank" - this is a C99 function, causing
52        some compilers to warn. To avoid this, I renamed it (as "blankclass").
53    
54     7. Cosmetic: (a) only output another newline at the end of pcretest if it is
55        prompting; (b) run "./pcretest /dev/null" at the start of the test script
56        so the version is shown; (c) stop "make test" echoing "./RunTest".
57    
58     8. Added patches from David Burgess to enable PCRE to run on EBCDIC systems.
59    
60     9. The prototype for memmove() for systems that don't have it was using
61        size_t, but the inclusion of the header that defines size_t was later. I've
62        moved the #includes for the C headers earlier to avoid this.
63    
64    10. Added some adjustments to the code to make it easier to compile on certain
65        special systems:
66    
67          (a) Some "const" qualifiers were missing.
68          (b) Added the macro EXPORT before all exported functions; by default this
69              is defined to be empty.
70          (c) Changed the dftables auxiliary program (that builds chartables.c) so
71              that it reads its output file name as an argument instead of writing
72              to the standard output and assuming this can be redirected.
73    
74    11. In UTF-8 mode, if a recursive reference (e.g. (?1)) followed a character
75        class containing characters with values greater than 255, PCRE compilation
76        went into a loop.
77    
78    12. A recursive reference to a subpattern that was within another subpattern
79        that had a minimum quantifier of zero caused PCRE to crash. For example,
80        (x(y(?2))z)? provoked this bug with a subject that got as far as the
81        recursion. If the recursively-called subpattern itself had a zero repeat,
82        that was OK.
83    
84    13. In pcretest, the buffer for reading a data line was set at 30K, but the
85        buffer into which it was copied (for escape processing) was still set at
86        1024, so long lines caused crashes.
87    
88    14. A pattern such as /[ab]{1,3}+/ failed to compile, giving the error
89        "internal error: code overflow...". This applied to any character class
90        that was followed by a possessive quantifier.
91    
92    15. Modified the Makefile to add libpcre.la as a prerequisite for
93        libpcreposix.la because I was told this is needed for a parallel build to
94        work.
95    
96    16. If a pattern that contained .* following optional items at the start was
97        studied, the wrong optimizing data was generated, leading to matching
98        errors. For example, studying /[ab]*.*c/ concluded, erroneously, that any
99        matching string must start with a or b or c. The correct conclusion for
100        this pattern is that a match can start with any character.
101    
102    
103    Version 4.4 13-Aug-03
104    ---------------------
105    
106     1. In UTF-8 mode, a character class containing characters with values between
107        127 and 255 was not handled correctly if the compiled pattern was studied.
108        In fixing this, I have also improved the studying algorithm for such
109        classes (slightly).
110    
111     2. Three internal functions had redundant arguments passed to them. Removal
112        might give a very teeny performance improvement.
113    
114     3. Documentation bug: the value of the capture_top field in a callout is *one
115        more than* the number of the highest numbered captured substring.
116    
117     4. The Makefile linked pcretest and pcregrep with -lpcre, which could result
118        in incorrectly linking with a previously installed version. They now link
119        explicitly with libpcre.la.
120    
121     5. configure.in no longer needs to recognize Cygwin specially.
122    
123     6. A problem in pcre.in for Windows platforms is fixed.
124    
125     7. If a pattern was successfully studied, and the -d (or /D) flag was given to
126        pcretest, it used to include the size of the study block as part of its
127        output. Unfortunately, the structure contains a field that has a different
128        size on different hardware architectures. This meant that the tests that
129        showed this size failed. As the block is currently always of a fixed size,
130        this information isn't actually particularly useful in pcretest output, so
131        I have just removed it.
132    
133     8. Three pre-processor statements accidentally did not start in column 1.
134        Sadly, there are *still* compilers around that complain, even though
135        standard C has not required this for well over a decade. Sigh.
136    
137     9. In pcretest, the code for checking callouts passed small integers in the
138        callout_data field, which is a void * field. However, some picky compilers
139        complained about the casts involved for this on 64-bit systems. Now
140        pcretest passes the address of the small integer instead, which should get
141        rid of the warnings.
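
The change in item 9 replaces an integer-to-pointer cast, such as (void *)5, with the address of an int. A minimal sketch of the second form, using a hypothetical stand-in for the callout block (the real PCRE structure has more fields; only callout_data matters here):

```c
#include <assert.h>

/* Hypothetical stand-in for the block handed to a callout function. */
struct demo_callout_block {
  void *callout_data;
};

/* A callout that treats callout_data as the address of an int. No cast
   between an integer type and a pointer type is needed, so there is nothing
   for a 64-bit compiler to complain about. */
int demo_callout(struct demo_callout_block *block)
{
  return *(int *)block->callout_data;
}
```

The int must stay alive for as long as the callout can run, which is the price of avoiding the cast.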
142    
143    10. By default, when in UTF-8 mode, PCRE now checks for valid UTF-8 strings at
144        both compile and run time, and gives an error if an invalid UTF-8 sequence
145        is found. There is an option for disabling this check in cases where the
146        string is known to be correct and/or the maximum performance is wanted.
147    
148    11. In response to a bug report, I changed one line in Makefile.in from
149    
150            -Wl,--out-implib,.libs/lib@WIN_PREFIX@pcreposix.dll.a \
151        to
152            -Wl,--out-implib,.libs/@WIN_PREFIX@libpcreposix.dll.a \
153    
154        to look similar to other lines, but I have no way of telling whether this
155        is the right thing to do, as I do not use Windows. No doubt I'll get told
156        if it's wrong...
157    
158    
159    Version 4.3 21-May-03
160    ---------------------
161    
162    1. Two instances of @WIN_PREFIX@ omitted from the Windows targets in the
163       Makefile.
164    
165    2. Some refactoring to improve the quality of the code:
166    
167       (i)   The utf8_table... variables are now declared "const".
168    
169       (ii)  The code for \cx, which used the "case flipping" table to upper case
170             lower case letters, now just subtracts 32. This is ASCII-specific,
171             but the whole concept of \cx is ASCII-specific, so it seems
172             reasonable.
173    
174       (iii) PCRE was using its character types table to recognize decimal and
175             hexadecimal digits in the pattern. This is silly, because it handles
176             only 0-9, a-f, and A-F, but the character types table is locale-
177             specific, which means strange things might happen. A private
178             table is now used for this - though it costs 256 bytes, a table is
179             much faster than multiple explicit tests. Of course, the standard
180             character types table is still used for matching digits in subject
181             strings against \d.
182    
183       (iv)  Strictly, the identifier ESC_t is reserved by POSIX (all identifiers
184             ending in _t are). So I've renamed it as ESC_tee.
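
Points (ii) and (iii) above can be sketched together. All names here are hypothetical and the code assumes ASCII; the final inversion of the 0x40 bit in control_char follows the documented meaning of \cx and is not stated in this entry.

```c
#include <assert.h>

/* Hypothetical flag bits for a private, locale-independent digit table. */
#define DIGIT_DEC 0x01
#define DIGIT_HEX 0x02

static unsigned char digit_flags[256];  /* costs 256 bytes, as noted above */

/* Fill the table once; only 0-9, a-f and A-F are ever marked. */
void init_digit_flags(void)
{
  int c;
  for (c = '0'; c <= '9'; c++) digit_flags[c] = DIGIT_DEC | DIGIT_HEX;
  for (c = 'a'; c <= 'f'; c++) digit_flags[c] = DIGIT_HEX;
  for (c = 'A'; c <= 'F'; c++) digit_flags[c] = DIGIT_HEX;
}

int is_pattern_digit(int c)
{
  return (digit_flags[(unsigned char)c] & DIGIT_DEC) != 0;
}

int is_pattern_xdigit(int c)
{
  return (digit_flags[(unsigned char)c] & DIGIT_HEX) != 0;
}

/* \cx: upper-case a lower case letter by subtracting 32, then flip bit 0x40. */
int control_char(int c)
{
  if (c >= 'a' && c <= 'z') c -= 32;
  return c ^ 0x40;
}
```

Because the table never consults the locale, a pattern such as \x{e9} is parsed the same way whatever setlocale() has done.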
185    
186    3. The first argument for regexec() in the POSIX wrapper should have been
187       defined as "const".
188    
189    4. Changed pcretest to use malloc() for its buffers so that they can be
190       Electric Fenced for debugging.
191    
192    5. There were several places in the code where, in UTF-8 mode, PCRE would try
193       to read one or more bytes before the start of the subject string. Often this
194       had no effect on PCRE's behaviour, but in some circumstances it could
195       provoke a segmentation fault.
196    
197    6. A lookbehind at the start of a pattern in UTF-8 mode could also cause PCRE
198       to try to read one or more bytes before the start of the subject string.
199    
200    7. A lookbehind in a pattern matched in non-UTF-8 mode on a PCRE compiled with
201       UTF-8 support could misbehave in various ways if the subject string
202       contained bytes with the 0x80 bit set and the 0x40 bit unset in a lookbehind
203       area. (PCRE was not checking for the UTF-8 mode flag, and trying to move
204       back over UTF-8 characters.)
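
Items 5 to 7 share one theme: stepping backwards must stop at the start of the subject, and must skip continuation bytes (0x80 bit set, 0x40 bit unset) only when UTF-8 mode is on. A sketch with a hypothetical helper, not the code PCRE uses:

```c
#include <assert.h>
#include <stddef.h>

/* Move back one character from p, never reading before start. Returns NULL
   if p is already at the start. In UTF-8 mode the pointer is also moved back
   over continuation bytes; in byte mode it moves exactly one byte. */
const unsigned char *step_back(const unsigned char *p,
                               const unsigned char *start, int utf8)
{
  if (p == start) return NULL;
  p--;
  if (utf8) {
    while (p > start && (*p & 0xc0) == 0x80) p--;
  }
  return p;
}
```

With the bytes 'a', 0xc3, 0xa9, 'b', stepping back from 'b' lands on 0xc3 in UTF-8 mode and on 0xa9 in byte mode.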
205    
206    
207    Version 4.2 14-Apr-03
208    ---------------------
209    
210    1. Typo "#if SUPPORT_UTF8" instead of "#ifdef SUPPORT_UTF8" fixed.
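
The difference matters when the macro is defined with no value: "#ifdef" only asks whether the name exists, while "#if" has to evaluate it as an expression. A small sketch, assuming SUPPORT_UTF8 is defined empty; UTF8_COMPILED_IN and utf8_compiled_in are hypothetical names.

```c
#include <assert.h>

/* Assumption for this sketch: the macro is defined with no value. */
#define SUPPORT_UTF8

/* Writing "#if SUPPORT_UTF8" here would be a preprocessing error, because
   the macro expands to nothing and "#if" is left without an expression. */
#ifdef SUPPORT_UTF8
#define UTF8_COMPILED_IN 1
#else
#define UTF8_COMPILED_IN 0
#endif

int utf8_compiled_in(void)
{
  return UTF8_COMPILED_IN;
}
```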
211    
212    2. Changes to the building process, supplied by Ronald Landheer-Cieslak
213         [ON_WINDOWS]: new variable, "#" on non-Windows platforms
214         [NOT_ON_WINDOWS]: new variable, "#" on Windows platforms
215         [WIN_PREFIX]: new variable, "cyg" for Cygwin
216         * Makefile.in: use autoconf substitution for OBJEXT, EXEEXT, BUILD_OBJEXT
217           and BUILD_EXEEXT
218         Note: automatic setting of the BUILD variables is not yet working
219         set CPPFLAGS and BUILD_CPPFLAGS (but don't use yet) - should be used at
220           compile-time but not at link-time
221         [LINK]: use for linking executables only
222         make different versions for Windows and non-Windows
223         [LINKLIB]: new variable, copy of UNIX-style LINK, used for linking
224           libraries
225         [LINK_FOR_BUILD]: new variable
226         [OBJEXT]: use throughout
227         [EXEEXT]: use throughout
228         <winshared>: new target
229         <wininstall>: new target
230         <dftables.o>: use native compiler
231         <dftables>: use native linker
232         <install>: handle Windows platform correctly
233         <clean>: ditto
234         <check>: ditto
235         copy DLL to top builddir before testing
236    
237       As part of these changes, -no-undefined was removed again. This was reported
238       to give trouble on HP-UX 11.0, so getting rid of it seems like a good idea
239       in any case.
240    
241    3. Some tidies to get rid of compiler warnings:
242    
243       . In the match_data structure, match_limit was an unsigned long int, whereas
244         match_call_count was an int. I've made them both unsigned long ints.
245    
246       . In pcretest the fact that a const uschar * doesn't automatically cast to
247         a void * provoked a warning.
248    
249       . Turning on some more compiler warnings threw up some "shadow" variables
250         and a few more missing casts.
251    
252    4. If PCRE was compiled with UTF-8 support, but called without the PCRE_UTF8
253       option, a class that contained a single character with a value between 128
254       and 255 (e.g. /[\xFF]/) caused PCRE to crash.
255    
256    5. If PCRE was compiled with UTF-8 support, but called without the PCRE_UTF8
257       option, a class that contained several characters, but with at least one
258       whose value was between 128 and 255 caused PCRE to crash.
259    
260    
261    Version 4.1 12-Mar-03
262    ---------------------
263    
264    1. Compiling with gcc -pedantic found a couple of places where casts were
265    needed, and a string in dftables.c that was longer than standard compilers are
266    required to support.
267    
268    2. Compiling with Sun's compiler found a few more places where the code could
269    be tidied up in order to avoid warnings.
270    
271    3. The variables for cross-compiling were called HOST_CC and HOST_CFLAGS; the
272    first of these names is deprecated in the latest Autoconf in favour of the name
273    CC_FOR_BUILD, because "host" is typically used to mean the system on which the
274    compiled code will be run. I can't find a reference for HOST_CFLAGS, but by
275    analogy I have changed it to CFLAGS_FOR_BUILD.
276    
277    4. Added -no-undefined to the linking command in the Makefile, because this is
278    apparently helpful for Windows. To make it work, also added "-L. -lpcre" to the
279    linking step for the pcreposix library.
280    
281    5. PCRE was failing to diagnose the case of two named groups with the same
282    name.
283    
284    6. A problem with one of PCRE's optimizations was discovered. PCRE remembers a
285    literal character that is needed in the subject for a match, and scans along to
286    ensure that it is present before embarking on the full matching process. This
287    saves time in cases of nested unlimited repeats that are never going to match.
288    Problem: the scan can take a lot of time if the subject is very long (e.g.
289    megabytes), thus penalizing straightforward matches. It is now done only if the
290    amount of subject to be scanned is less than 1000 bytes.
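
The revised rule in item 6 can be sketched as a cheap pre-check. The names are hypothetical and memchr stands in for whatever scan PCRE actually performs; only the 1000-byte limit comes from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The limit stated in item 6. */
#define REQ_BYTE_MAX 1000

/* Returns 0 if a match can be ruled out cheaply because the needed byte is
   absent, and 1 if the full matching process must be attempted. */
int worth_trying_match(const char *subject, size_t length, int req_byte)
{
  if (length >= REQ_BYTE_MAX) return 1;  /* too long: skip the pre-scan */
  return memchr(subject, req_byte, length) != NULL;
}
```

For a long subject the function no longer proves anything, so a hopeless match is tried anyway, but straightforward matches on megabyte subjects stop paying for a scan they do not need.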
291    
292    7. A lesser problem with the same optimization is that it was recording the
293    first character of an anchored pattern as "needed", thus provoking a search
294    right along the subject, even when the first match of the pattern was going to
295    fail. The "needed" character is now not set for anchored patterns, unless it
296    follows something in the pattern that is of non-fixed length. Thus, it still
297    fulfils its original purpose of finding quick non-matches in cases of nested
298    unlimited repeats, but isn't used for simple anchored patterns such as /^abc/.
299    
300    
301    Version 4.0 17-Feb-03
302    ---------------------
303    
304  1. If a comment in an extended regex that started immediately after a meta-item  1. If a comment in an extended regex that started immediately after a meta-item
305  extended to the end of string, PCRE compiled incorrect data. This could lead to  extended to the end of string, PCRE compiled incorrect data. This could lead to
