ChangeLog for PCRE
------------------

Version 6.7 04-Jul-06
---------------------

 1. In order to handle tests when input lines are enormously long, pcretest has
    been re-factored so that it automatically extends its buffers when
    necessary. The code is crude, but this _is_ just a test program. The
    default size has been increased from 32K to 50K.
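
    The buffer extension described in item 1 can be sketched roughly as
    follows. This is only an illustration: line_buffer, buffer_init() and
    buffer_append() are hypothetical names, not pcretest's actual code, and
    the doubling strategy is an assumption. Overflow of the size arithmetic
    is ignored for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer; not pcretest's actual data structure. */
typedef struct {
  char *data;
  size_t size;   /* allocated bytes */
  size_t used;   /* bytes in use, excluding the terminating zero */
} line_buffer;

/* Allocate the buffer with an initial capacity. Returns 0 on success. */
static int buffer_init(line_buffer *b, size_t initial_size)
{
  assert(initial_size > 0);
  b->data = malloc(initial_size);
  if (b->data == NULL) return -1;
  b->size = initial_size;
  b->used = 0;
  b->data[0] = 0;
  return 0;
}

/* Append len bytes, doubling the allocation until the data plus a
   terminating zero fits. Returns 0 on success, -1 if memory runs out. */
static int buffer_append(line_buffer *b, const char *text, size_t len)
{
  size_t needed = b->used + len + 1;
  if (needed > b->size) {
    size_t new_size = b->size;
    char *new_data;
    while (new_size < needed) new_size *= 2;
    new_data = realloc(b->data, new_size);
    if (new_data == NULL) return -1;
    b->data = new_data;
    b->size = new_size;
  }
  memcpy(b->data + b->used, text, len);
  b->used += len;
  b->data[b->used] = 0;
  return 0;
}
```

    A caller would initialise the buffer once with the default size and then
    append each chunk of input as it is read, freeing b.data at the end.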

 2. The code in pcre_study() was using the value of the re argument before
    testing it for NULL. (Of course, in any sensible call of the function, it
    won't be NULL.)

 3. The memmove() emulation function in pcre_internal.h, which is used on
    systems that lack both memmove() and bcopy() - that is, hardly ever -
    was missing a "static" storage class specifier.
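
    To show why the "static" matters in item 3, here is a minimal sketch of
    an overlap-safe copy defined in a header. The name sketch_memmove() is
    hypothetical and the body is not the code in pcre_internal.h. With
    "static", every source file that includes the header gets its own private
    copy; without it, two such files would each export the same symbol and
    the link could fail.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a memmove() emulation. It compares the two
   pointers to choose a copy direction, which assumes they point into the
   same object whenever the regions overlap. */
static void *sketch_memmove(void *dest, const void *src, size_t n)
{
  unsigned char *d = dest;
  const unsigned char *s = src;
  assert(n == 0 || (dest != NULL && src != NULL));
  if (d == s || n == 0) return dest;
  if (d < s) {
    while (n-- > 0) *d++ = *s++;    /* destination is lower: copy forwards */
  } else {
    d += n;
    s += n;
    while (n-- > 0) *--d = *--s;    /* destination is higher: copy backwards */
  }
  return dest;
}
```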

 4. When UTF-8 mode was not set, PCRE looped when compiling certain patterns
    containing an extended class (one that cannot be represented by a bitmap
    because it contains high-valued characters or Unicode property items, e.g.
    [\pZ]). Almost always one would set UTF-8 mode when processing such a
    pattern, but PCRE should not loop if you do not (it no longer does).
    [Detail: two cases were found: (a) a repeated subpattern containing an
    extended class; (b) a recursive reference to a subpattern that followed a
    previous extended class. It wasn't skipping over the extended class
    correctly when UTF-8 mode was not set.]

 5. A negated single-character class was not being recognized as fixed-length
    in lookbehind assertions such as (?<=[^f]), leading to an incorrect
    compile error "lookbehind assertion is not fixed length".

 6. The RunPerlTest auxiliary script was showing an unexpected difference
    between PCRE and Perl for UTF-8 tests. It turns out that it is hard to
    write a Perl script that can interpret lines of an input file either as
    byte characters or as UTF-8, which is what "perltest" was being required to
    do for the non-UTF-8 and UTF-8 tests, respectively. Essentially what you
    can't do is switch easily at run time between having the "use utf8;" pragma
    or not. In the end, I fudged it by using the RunPerlTest script to insert
    "use utf8;" explicitly for the UTF-8 tests.

 7. In multiline (/m) mode, PCRE was matching ^ after a terminating newline at
    the end of the subject string, contrary to the documentation and to what
    Perl does. This was true of both matching functions. Now it matches only at
    the start of the subject and immediately after *internal* newlines.
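
    The corrected rule in item 7 can be stated as a small predicate. This is
    a sketch only: caret_matches_multiline() is a hypothetical helper, not
    part of PCRE, and it treats a single "\n" as the only newline.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if ^ may match at the given offset in multiline mode under the
   corrected rule, and 0 otherwise. */
static int caret_matches_multiline(const char *subject, size_t length,
                                   size_t offset)
{
  assert(offset <= length);
  if (offset == 0) return 1;             /* start of the subject */
  if (offset == length) return 0;        /* end, even after a final newline */
  return subject[offset - 1] == '\n';    /* just after an internal newline */
}
```

    For the 8-byte subject "abc\ndef\n" this accepts offsets 0 and 4 and
    rejects offset 8, which is the position the old code wrongly accepted.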

 8. A call of pcre_fullinfo() from pcretest to get the option bits was passing
    a pointer to an int instead of a pointer to an unsigned long int. This
    caused problems on 64-bit systems.

 9. Applied a patch from the folks at Google to pcrecpp.cc, to fix "another
    instance of the 'standard' template library not being so standard".

10. There was no check on the number of named subpatterns nor the maximum
    length of a subpattern name. The product of these values is used to compute
    the size of the memory block for a compiled pattern. By supplying a very
    long subpattern name and a large number of named subpatterns, the size
    computation could be caused to overflow. This is now prevented by limiting
    the length of names to 32 characters, and the number of named subpatterns
    to 10,000.

11. Subpatterns that are repeated with specific counts have to be replicated in
    the compiled pattern. The size of memory for this was computed from the
    length of the subpattern and the repeat count. The latter is limited to
    65535, but there was no limit on the former, meaning that integer overflow
    could in principle occur. The compiled length of a repeated subpattern is
    now limited to 30,000 bytes in order to prevent this.
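
    The idea behind items 10 and 11 is to bound both factors before
    multiplying. A rough sketch follows, using the limits quoted above. The
    function and macro names are hypothetical, and the real size formula in
    pcre_compile() has more terms than a bare product.

```c
#include <assert.h>
#include <limits.h>

#define MAX_NAME_LENGTH   32      /* characters, from item 10 */
#define MAX_NAME_COUNT    10000   /* named subpatterns, from item 10 */
#define MAX_REPEAT_LENGTH 30000   /* bytes, from item 11 */
#define MAX_REPEAT_COUNT  65535   /* repeat count, from item 11 */

/* Multiply two non-negative ints. Returns -1 if either operand exceeds its
   limit or if the product would not fit in an int. */
static int checked_product(int a, int a_limit, int b, int b_limit)
{
  assert(a_limit >= 0 && b_limit >= 0);
  if (a < 0 || b < 0 || a > a_limit || b > b_limit) return -1;
  if (a != 0 && b > INT_MAX / a) return -1;
  return a * b;
}

/* Simplified stand-in for the name table size computation. */
static int name_table_size(int name_length, int name_count)
{
  return checked_product(name_length, MAX_NAME_LENGTH,
                         name_count, MAX_NAME_COUNT);
}

/* Simplified stand-in for the space needed to replicate a subpattern. */
static int repeat_size(int compiled_length, int repeat_count)
{
  return checked_product(compiled_length, MAX_REPEAT_LENGTH,
                         repeat_count, MAX_REPEAT_COUNT);
}
```

    The division test in checked_product() is redundant when int has at
    least 32 bits, because both bounded products then fit, but it keeps the
    helper safe if the limits are ever raised.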

12. Added the optional facility to have named substrings with the same name.

13. Added the ability to use a named substring as a condition, using the
    Python syntax: (?(name)yes|no). This overloads (?(R)... and names that
    are numbers (not recommended). Forward references are permitted.

14. Added forward references in named backreferences (if you see what I mean).

15. In UTF-8 mode, with the PCRE_DOTALL option set, a quantified dot in the
    pattern could run off the end of the subject. For example, the pattern
    "(?s)(.{1,5})"8 did this with the subject "ab".

16. If PCRE_DOTALL or PCRE_MULTILINE were set, pcre_dfa_exec() behaved as if
    PCRE_CASELESS was set when matching characters that were quantified with ?
    or *.

17. A character class other than a single negated character that had a minimum
    but no maximum quantifier - for example [ab]{6,} - was not handled
    correctly by pcre_dfa_exec(). It would match only one character.

18. A valid (though odd) pattern that looked like a POSIX character
    class but used an invalid character after [ (for example [[,abc,]]) caused
    pcre_compile() to give the error "Failed: internal error: code overflow" or
    in some cases to crash with a glibc free() error. This could even happen if
    the pattern terminated after [[ but there just happened to be a sequence of
    letters, a binary zero, and a closing ] in the memory that followed.

19. Perl's treatment of octal escapes in the range \400 to \777 has changed
    over the years. Originally (before any Unicode support), just the bottom 8
    bits were taken. Thus, for example, \500 really meant \100. Nowadays the
    output from "man perlunicode" includes this:

      The regular expression compiler produces polymorphic opcodes. That
      is, the pattern adapts to the data and automatically switches to
      the Unicode character scheme when presented with Unicode data--or
      instead uses a traditional byte scheme when presented with byte
      data.

    Sadly, a wide octal escape does not cause a switch, and in a string with
    no other multibyte characters, these octal escapes are treated as before.
    Thus, in Perl, the pattern /\500/ actually matches \100 but the pattern
    /\500|\x{1ff}/ matches \500 or \777 because the whole thing is treated as a
    Unicode string.

    I have not perpetrated such confusion in PCRE. Up till now, it took just
    the bottom 8 bits, as in old Perl. I have now made octal escapes with
    values greater than \377 illegal in non-UTF-8 mode. In UTF-8 mode they
    translate to the appropriate multibyte character.
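
    The arithmetic in item 19 can be checked with a few lines of C. This is
    an illustrative sketch with hypothetical function names; it is not how
    pcre_compile() parses escapes.

```c
#include <assert.h>
#include <stddef.h>

/* Read up to three octal digits (the part after the backslash). */
static int octal_escape_value(const char *digits)
{
  int value = 0;
  int i;
  assert(digits != NULL);
  for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
    value = value * 8 + (digits[i] - '0');
  return value;
}

/* Old behaviour: keep only the bottom 8 bits, so \500 becomes \100. */
static int old_byte_value(int value)
{
  return value & 0xff;
}

/* New rule: values above \377 are legal only in UTF-8 mode. */
static int octal_escape_is_legal(int value, int utf8_mode)
{
  return utf8_mode || value <= 0377;
}
```

    With these helpers, "500" parses to 0500 (decimal 320), whose bottom 8
    bits are 0100, and "777" parses to 0x1ff, matching the Perl examples
    above.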

20. Applied some refactoring to reduce the number of warnings from Microsoft
    and Borland compilers. This has included removing the fudge introduced
    seven years ago for the OS/2 compiler (see 2.02/2 below) because it caused
    a warning about an unused variable.

21. PCRE has not included VT (character 0x0b) in the set of whitespace
    characters since release 4.0, because Perl (from release 5.004) does not.
    [Or at least, is documented not to: some releases seem to be in conflict
    with the documentation.] However, when a pattern was studied with
    pcre_study() and all its branches started with \s, PCRE still included VT
    as a possible starting character. Of course, this did no harm; it just
    caused an unnecessary match attempt.

22. Removed a now-redundant internal flag bit that recorded the fact that case
    dependency changed within the pattern. This was once needed for "required
    byte" processing, but is no longer used. This recovers a now-scarce options
    bit. Also moved the least significant internal flag bit to the
    most-significant bit of the word, which was not previously used (hangover
    from the days when it was an int rather than a uint) to free up another
    bit for the future.

23. Added support for CRLF line endings as well as CR and LF. As well as the
    default being selectable at build time, it can now be changed at runtime
    via the PCRE_NEWLINE_xxx flags. There are now options for pcregrep to
    specify that it is scanning data with non-default line endings.

24. Changed the definition of CXXLINK to make it agree with the definition of
    LINK in the Makefile, by replacing LDFLAGS with CXXFLAGS.

25. Applied Ian Taylor's patches to avoid using another stack frame for tail
    recursions. This makes a big difference to stack usage for some patterns.

26. If a subpattern containing a named recursion or subroutine reference such
    as (?P>B) was quantified, for example (xxx(?P>B)){3}, the calculation of
    the space required for the compiled pattern went wrong and gave too small a
    value. Depending on the environment, this could lead to "Failed: internal
    error: code overflow at offset 49" or "glibc detected double free or
    corruption" errors.

27. Applied patches from Google (a) to support the new newline modes and (b) to
    advance over multibyte UTF-8 characters in GlobalReplace.

28. Changed free() to pcre_free() in pcredemo.c. Apparently this makes a
    difference for some implementations of PCRE on some Windows versions.

29. Added some extra testing facilities to pcretest:

    \q<number>   in a data line sets the "match limit" value
    \Q<number>   in a data line sets the "match recursion limit" value
    -S <number>  sets the stack size, where <number> is in megabytes

    The -S option isn't available for Windows.


Version 6.6 06-Feb-06
---------------------
