pcretest - a program for testing Perl-compatible regular expressions.

pcretest [-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source] [destination]

pcretest was written as a test program for the PCRE regular expression
library itself, but it can also be used for experimenting with regular
expressions. This document describes the features of the test program;
for details of the regular expressions themselves, see the pcrepattern
documentation. For details of PCRE and its options, see the pcreapi
documentation.

-C        Output the version number of the PCRE library, and all
          available information about the optional features that are
          included, and then exit.

-d        Behave as if each regex had the /D modifier (see below); the
          internal form is output after compilation.

-i        Behave as if each regex had the /I modifier; information
          about the compiled pattern is given after compilation.

-m        Output the size of each compiled pattern after it has been
          compiled. This is equivalent to adding /M to each regular
          expression. For compatibility with earlier versions of
          pcretest, -s is a synonym for -m.

-o osize  Set the number of elements in the output vector that is used
          when calling PCRE to be osize. The default value is 45, which
          is enough for 14 capturing subexpressions. The vector size
          can be changed for individual matching calls by including \O
          in the data line (see below).

-p        Behave as if each regex had the /P modifier; the POSIX wrapper
          API is used to call PCRE. None of the other options has any
          effect when -p is set.

-t        Run each compile, study, and match many times with a timer,
          and output the resulting time per compile or match (in
          milliseconds). Do not set -t with -m, because you will then
          get the size output 20000 times and the timing will be
          distorted.

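The link between the default vector size of 45 and the figure of 14
capturing subexpressions can be checked with a little arithmetic. The
sketch below assumes the pcre_exec() convention described in the pcreapi
documentation, in which one third of the vector is workspace and the
remaining two thirds hold (start, end) offset pairs, pair 0 being the
whole match. The function name is hypothetical and is not part of
pcretest.

```python
def max_capturing_subexpressions(osize):
    """Number of capturing subexpressions an osize-element vector can report.

    Assumption: each reported string needs three vector elements (an offset
    pair plus one element of workspace), and string 0 is the whole match.
    """
    pairs = osize // 3          # offset pairs that fit in the vector
    return max(pairs - 1, 0)    # pair 0 is used for the whole match


print(max_capturing_subexpressions(45))  # 14
```

Under the same assumption, a pattern with 20 capturing subexpressions
would need -o 63 or larger.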
If pcretest is given two filename arguments, it reads from the first
and writes to the second. If it is given only one filename argument, it
reads from that file and writes to stdout. Otherwise, it reads from
stdin and writes to stdout, and prompts for each line of input, using
"re>" to prompt for regular expressions, and "data>" to prompt for data
lines.

The program handles any number of sets of input on a single input file.
Each set starts with a regular expression, and continues with any
number of data lines to be matched against the pattern.

Each line is matched separately and independently. If you want to do
multiple-line matches, you have to use the \n escape sequence in a
single line of input to encode the newline characters. The maximum
length of a data line is 30,000 characters.

An empty line signals the end of the data lines, at which point a new
regular expression is read. The regular expressions are given enclosed
in any non-alphanumeric delimiters other than backslash, for example

  /(a|bc)x+yz/

White space before the initial delimiter is ignored. A regular
expression may be continued over several input lines, in which case the
newline characters are included within it. It is possible to include
the delimiter within the pattern by escaping it, for example

  /abc\/def/

If you do so, the escape and the delimiter form part of the pattern,
but since delimiters are always non-alphanumeric, this does not affect
its interpretation. If the terminating delimiter is immediately
followed by a backslash, for example,

  /abc/\

then a backslash is added to the end of the pattern. This is done to
provide a way of testing the error condition that arises if a pattern
finishes with a backslash, because

  /abc\/

is interpreted as the first line of a pattern that starts with "abc/",
causing pcretest to read the next line as a continuation of the regular
expression.

The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively.
For example:

  /caseless/i

These modifier letters have the same effect as they do in Perl. There
are others that set PCRE options that do not correspond to anything in
Perl: /A, /E, /N, /U, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY,
PCRE_NO_AUTO_CAPTURE, PCRE_UNGREEDY, and PCRE_EXTRA respectively.

Searching for all possible matches within each subject string can be
requested by the /g or /G modifier. After finding a match, PCRE is
called again to search the remainder of the subject string. The
difference between /g and /G is that the former uses the startoffset
argument to pcre_exec() to start searching at a new point within the
entire string (which is in effect what Perl does), whereas the latter
passes over a shortened substring. This makes a difference to the
matching process if the pattern begins with a lookbehind assertion
(including \b or \B).

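The difference between the two approaches can be seen by analogy in
Python's re module, which is used here only as an illustration and is
not PCRE. Passing a start position to search() keeps the whole subject
visible, much as /g does with startoffset, while slicing the subject
corresponds to the shortened substring that /G passes. The pattern and
subject are the ones used in the /g output example later in this
document; the variable names are hypothetical.

```python
import re

pattern = re.compile(r"\Bi(\w\w)")
subject = "Mississippi"
offset = 4  # just past the first match, "iss" at characters 1 to 3

# Like /g: search the whole subject from an offset, so \B can still see
# the "s" that precedes the offset.
with_offset = pattern.search(subject, offset)

# Like /G: search a shortened subject, so the "i" at its start looks like
# the start of a word and \B fails there.
with_slice = pattern.search(subject[offset:])

print(with_offset.group(0), with_offset.start())          # iss 4
print(with_slice.group(0), with_slice.start() + offset)   # ipp 7
```

In this analogy the shortened-substring search skips the second "iss"
because the assertion at the start of the pattern cannot look back past
the cut.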
If any call to pcre_exec() in a /g or /G sequence matches an empty
string, the next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED
flags set in order to search for another, non-empty, match at the same
point. If this second match fails, the start offset is advanced by
one, and the normal match is retried. This imitates the way Perl
handles such cases when using the /g modifier or the split() function.

There are a number of other modifiers for controlling the way pcretest
operates.

The /+ modifier requests that as well as outputting the substring that
matched the entire pattern, pcretest should in addition output the
remainder of the subject string. This is useful for tests where the
subject contains multiple copies of the same substring.

The /L modifier must be followed directly by the name of a locale, for
example,

  /pattern/Lfr

For this reason, it must be the last modifier letter. The given locale
is set, pcre_maketables() is called to build a set of character tables
for the locale, and this is then passed to pcre_compile() when
compiling the regular expression. Without an /L modifier, NULL is
passed as the tables pointer; that is, /L applies only to the
expression on which it appears.

The /I modifier requests that pcretest output information about the
compiled expression (whether it is anchored, has a fixed first
character, and so on). It does this by calling pcre_fullinfo() after
compiling an expression, and outputting the information it gets back.
If the pattern is studied, the results of that are also output.

The /D modifier is a PCRE debugging feature, which also assumes /I. It
causes the internal form of compiled regular expressions to be output
after compilation. If the pattern was studied, the information returned
is also output.

The /S modifier causes pcre_study() to be called after the expression
has been compiled, and the results used when the expression is matched.

The /M modifier causes the size of memory block used to hold the
compiled pattern to be output.

The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
rather than its native API. When this is done, all other modifiers
except /i, /m, and /+ are ignored. REG_ICASE is set if /i is present,
and REG_NEWLINE is set if /m is present. The wrapper functions force
PCRE_DOLLAR_ENDONLY always, and PCRE_DOTALL unless REG_NEWLINE is set.

The /8 modifier causes pcretest to call PCRE with the PCRE_UTF8 option
set. This turns on support for UTF-8 character handling in PCRE,
provided that it was compiled with this support enabled. This modifier
also causes any non-printing characters in output strings to be printed
using the \x{hh...} notation if they are valid UTF-8 sequences.

If the /? modifier is used with /8, it causes pcretest to call
pcre_compile() with the PCRE_NO_UTF8_CHECK option, to suppress the
checking of the string for UTF-8 validity.

If the pattern contains any callout requests, pcretest's callout
function will be called. By default, it displays the callout number, and
the start and current positions in the text at the callout time. For
example, the output

  --->pqrabcdef
    0    ^  ^

indicates that callout number 0 occurred for a match attempt starting
at the fourth character of the subject string, when the pointer was at
the seventh character. The callout function returns zero (carry on
matching) by default.

Inserting callouts may be helpful when using pcretest to check
complicated regular expressions. For further information about
callouts, see the pcrecallout documentation.

For testing the PCRE library, additional control of callout behaviour
is available via escape sequences in the data, as described in the
following section. In particular, it is possible to pass in a number as
callout data (the default is zero). If the callout function receives a
non-zero number, it returns that value instead of zero.

Before each data line is passed to pcre_exec(), leading and trailing
whitespace is removed, and it is then scanned for \ escapes. Some of
these are pretty esoteric features, intended for checking out some of
the more complicated features of PCRE. If you are just testing
"ordinary" regular expressions, you probably don't need any of these.
The following escapes are recognized:

  \a         alarm (= BEL)
  \b         backspace
  \e         escape
  \f         formfeed
  \n         newline
  \r         carriage return
  \t         tab
  \v         vertical tab
  \nnn       octal character (up to 3 octal digits)
  \xhh       hexadecimal character (up to 2 hex digits)
  \x{hh...}  hexadecimal character, any number of digits
             in UTF-8 mode
  \A         pass the PCRE_ANCHORED option to pcre_exec()
  \B         pass the PCRE_NOTBOL option to pcre_exec()
  \Cdd       call pcre_copy_substring() for substring dd
             after a successful match (any decimal number
             less than 32)
  \Cname     call pcre_copy_named_substring() for substring
             "name" after a successful match (name terminated
             by next non-alphanumeric character)
  \C+        show the current captured substrings at callout
             time
  \C-        do not supply a callout function
  \C!n       return 1 instead of 0 when callout number n is
             reached
  \C!n!m     return 1 instead of 0 when callout number n is
             reached for the mth time
  \C*n       pass the number n (may be negative) as callout
             data
  \Gdd       call pcre_get_substring() for substring dd
             after a successful match (any decimal number
             less than 32)
  \Gname     call pcre_get_named_substring() for substring
             "name" after a successful match (name terminated
             by next non-alphanumeric character)
  \L         call pcre_get_substringlist() after a
             successful match
  \M         discover the minimum MATCH_LIMIT setting
  \N         pass the PCRE_NOTEMPTY option to pcre_exec()
  \Odd       set the size of the output vector passed to
             pcre_exec() to dd (any number of decimal
             digits)
  \S         output details of memory get/free calls during
             matching
  \Z         pass the PCRE_NOTEOL option to pcre_exec()
  \?         pass the PCRE_NO_UTF8_CHECK option to
             pcre_exec()

If \M is present, pcretest calls pcre_exec() several times, with
different values in the match_limit field of the pcre_extra data
structure, until it finds the minimum number that is needed for
pcre_exec() to complete. This number is a measure of the amount of
recursion and backtracking that takes place, and checking it out can be
instructive. For most simple matches, the number is quite small, but
for patterns with very large numbers of matching possibilities, it can
become large very quickly with increasing length of subject string.

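This document does not say how pcretest chooses the successive
match_limit values. One plausible strategy, shown here purely as an
assumption, is to double the limit until the match completes and then
bisect between the last failing and first succeeding values. In the
sketch, completes is a hypothetical stand-in for "pcre_exec() finished
without hitting the match limit", and both the function name and the
starting value of 64 are invented for illustration.

```python
def minimum_match_limit(completes, start=64):
    """Smallest limit for which completes(limit) is true.

    Assumptions: completes(n) is false for small n and true from some
    threshold upward, and a limit of 0 never suffices.
    """
    high = start
    while not completes(high):      # doubling phase: find a working limit
        high *= 2
    low = 0 if high == start else high // 2   # largest value known to fail
    while high - low > 1:           # bisection phase: narrow (low, high]
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high


# A stand-in predicate: pretend the match needs a limit of at least 1000.
print(minimum_match_limit(lambda limit: limit >= 1000))  # 1000
```

The number of trial calls grows only logarithmically with the answer,
which matters when the minimum is large.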
When \O is used, it may be higher or lower than the size set by the -o
option (or defaulted to 45); \O applies only to the call of pcre_exec()
for the line in which it appears.

A backslash followed by anything else just escapes the anything else.
If the very last character is a backslash, it is ignored. This gives a
way of passing an empty line as data, since a real empty line
terminates the data input.

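The character escapes can be illustrated with a small decoder. This is a
sketch of the rules as stated in this section, not pcretest's own code:
it covers only \a to \v, \nnn, \xhh, the "anything else" rule, and the
ignored final backslash. The option escapes (\A, \B, \C and so on) and
\x{hh...} are deliberately left out, and the function name is
hypothetical.

```python
SIMPLE_ESCAPES = {"a": 0x07, "b": 0x08, "e": 0x1B, "f": 0x0C,
                  "n": 0x0A, "r": 0x0D, "t": 0x09, "v": 0x0B}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_data_line(line):
    """Decode the character escapes in one data line into bytes."""
    text = line.strip()              # leading and trailing whitespace goes
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            out += ch.encode("latin-1")
            continue
        if i == len(text):           # a final backslash is ignored
            break
        esc = text[i]
        i += 1
        if esc in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[esc])
        elif esc in OCTAL_DIGITS:    # \nnn, up to 3 octal digits
            digits = esc
            while len(digits) < 3 and i < len(text) and text[i] in OCTAL_DIGITS:
                digits += text[i]
                i += 1
            out.append(int(digits, 8) & 0xFF)   # assumption: keep low 8 bits
        elif esc == "x" and text[i:i + 1] != "{":   # \xhh, up to 2 hex digits
            digits = ""
            while len(digits) < 2 and i < len(text) and text[i] in HEX_DIGITS:
                digits += text[i]
                i += 1
            out.append(int(digits or "0", 16))
        elif esc == "x" or esc.isupper() or esc == "?":
            raise NotImplementedError("not covered by this sketch: \\" + esc)
        else:
            out += esc.encode("latin-1")        # anything else is itself
    return bytes(out)


print(decode_data_line(r"  abc\n\x41\101  "))  # b'abc\nAA'
print(decode_data_line("\\"))                  # b''
```

The second call shows how a lone backslash yields an empty data line.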
If /P was present on the regex, causing the POSIX wrapper API to be
used, only \B and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL
to be passed to regexec() respectively.

The use of \x{hh...} to represent UTF-8 characters is not dependent on
the use of the /8 modifier on the pattern. It is recognized always.
There may be any number of hexadecimal digits inside the braces. The
result is from one to six bytes, encoded according to the UTF-8 rules.

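The "one to six bytes" refers to the original UTF-8 scheme, which covers
values up to 0x7FFFFFFF, whereas Python's built-in encoder stops at
0x10FFFF and never produces more than four bytes. The sketch below
therefore encodes by hand to show how the byte count grows with the
value. It illustrates the encoding rules only; the function name is
hypothetical and the code is not taken from pcretest.

```python
def utf8_encode(value):
    """Encode a value from 0 to 0x7FFFFFFF as 1 to 6 UTF-8 style bytes."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value out of range for six-byte UTF-8")
    if value < 0x80:
        return bytes([value])       # single byte, same as ASCII
    # Largest value for 2..6 byte sequences, with the matching lead-byte
    # marker bits.
    limits = [0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    markers = [0xC0, 0xE0, 0xF0, 0xF8, 0xFC]
    for extra, (limit, marker) in enumerate(zip(limits, markers), start=1):
        if value <= limit:
            break
    out = bytearray()
    for _ in range(extra):          # continuation bytes carry 6 bits each
        out.insert(0, 0x80 | (value & 0x3F))
        value >>= 6
    out.insert(0, marker | value)   # lead byte holds the remaining bits
    return bytes(out)


print(utf8_encode(0x41))              # b'A'
print(utf8_encode(0xE9).hex())        # c3a9
print(utf8_encode(0x20AC).hex())      # e282ac
print(len(utf8_encode(0x7FFFFFFF)))   # 6
```

For values up to 0x10FFFF, other than the surrogate range 0xD800 to
0xDFFF that Python refuses to encode, the result agrees with
chr(value).encode("utf-8").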
When a match succeeds, pcretest outputs the list of captured substrings
that pcre_exec() returns, starting with number 0 for the string that
matched the whole pattern. Here is an example of an interactive
pcretest run.

  $ pcretest
  PCRE version 4.00 08-Jan-2003

    re> /^abc(\d+)/
  data> abc123
   0: abc123
   1: 123
  data> xyz
  No match

If the strings contain any non-printing characters, they are output as
\xhh escapes, or as \x{...} escapes if the /8 modifier was present on
the pattern. If the pattern has the /+ modifier, then the output for
substring 0 is followed by the rest of the subject string, identified
by "0+" like this:

    re> /cat/+
  data> cataract
   0: cat
   0+ aract

If the pattern has the /g or /G modifier, the results of successive
matching attempts are output in sequence, like this:

    re> /\Bi(\w\w)/g
  data> Mississippi
   0: iss
   1: ss
   0: iss
   1: ss
   0: ipp
   1: pp

"No match" is output only if the first match attempt fails.

If any of the sequences \C, \G, or \L are present in a data line that
is successfully matched, the substrings extracted by the convenience
functions are output with C, G, or L after the string number instead of
a colon. This is in addition to the normal full list. The string length
(that is, the return from the extraction function) is given in
parentheses after each string for \C and \G.

Note that while patterns can be continued over several lines (a plain
">" prompt is used for continuations), data lines may not. However,
newlines can be included in data by means of the \n escape.

Philip Hazel <ph10@cam.ac.uk>
University Computing Service,
Cambridge CB2 3QG, England.

Last updated: 09 December 2003
Copyright (c) 1997-2003 University of Cambridge.
