Revision 1759, Mon Feb 10 16:45:25 2020 UTC, by ph10
Documentation update.
1 PCRETEST(1) General Commands Manual PCRETEST(1)
2
3
4
5 NAME
6 pcretest - a program for testing Perl-compatible regular expressions.
7
8 SYNOPSIS
9
10 pcretest [options] [input file [output file]]
11
12 pcretest was written as a test program for the PCRE regular expression
13 library itself, but it can also be used for experimenting with regular
14 expressions. This document describes the features of the test program;
15 for details of the regular expressions themselves, see the pcrepattern
16 documentation. For details of the PCRE library function calls and their
17 options, see the pcreapi, pcre16 and pcre32 documentation.
18
19 The input for pcretest is a sequence of regular expression patterns and
20 strings to be matched, as described below. The output shows the result
21 of each match. Options on the command line and the patterns control
22 PCRE options and exactly what is output.
23
24 As PCRE has evolved, it has acquired many different features, and as a
25 result, pcretest now has rather a lot of obscure options for testing
26 every possible feature. Some of these options are specifically designed
27 for use in conjunction with the test script and data files that are
28 distributed as part of PCRE, and are unlikely to be of use otherwise.
29 They are all documented here, but without much justification.
30
31
32 INPUT DATA FORMAT
33
34 Input to pcretest is processed line by line, either by calling the C
35 library's fgets() function, or via the libreadline library (see below).
36 In Unix-like environments, fgets() treats any bytes other than newline
37 as data characters. However, in some Windows environments character 26
38 (hex 1A) causes an immediate end of file, and no further data is read.
39 For maximum portability, therefore, it is safest to use only ASCII
40 characters in pcretest input files.
41
42 The input is processed using C's string functions, so must not
43 contain binary zeroes, even though in Unix-like environments, fgets()
44 treats any bytes other than newline as data characters.
45
46
47 PCRE's 8-BIT, 16-BIT AND 32-BIT LIBRARIES
48
49 From release 8.30, two separate PCRE libraries can be built. The origi-
50 nal one supports 8-bit character strings, whereas the newer 16-bit li-
51 brary supports character strings encoded in 16-bit units. From release
52 8.32, a third library can be built, supporting character strings en-
53 coded in 32-bit units. The pcretest program can be used to test all
54 three libraries. However, it is itself still an 8-bit program, reading
55 8-bit input and writing 8-bit output. When testing the 16-bit or
56 32-bit library, the patterns and data strings are converted to 16- or
57 32-bit format before being passed to the PCRE library functions. Re-
58 sults are converted to 8-bit for output.
59
60 References to functions and structures of the form pcre[16|32]_xx below
61 mean "pcre_xx when using the 8-bit library, pcre16_xx when using the
62 16-bit library, or pcre32_xx when using the 32-bit library".
63
64
65 COMMAND LINE OPTIONS
66
67 -8 If the 8-bit library has been built, this option causes it to
68 be used (this is the default). If the 8-bit library has not
69 been built, this option causes an error.
70
71 -16 If the 16-bit library has been built, this option causes it
72 to be used. If only the 16-bit library has been built, this
73 is the default. If the 16-bit library has not been built,
74 this option causes an error.
75
76 -32 If the 32-bit library has been built, this option causes it
77 to be used. If only the 32-bit library has been built, this
78 is the default. If the 32-bit library has not been built,
79 this option causes an error.
80
81 -b Behave as if each pattern has the /B (show byte code) modi-
82 fier; the internal form is output after compilation.
83
84 -C Output the version number of the PCRE library, and all avail-
85 able information about the optional features that are in-
86 cluded, and then exit with zero exit code. All other options
87 are ignored.
88
89 -C option Output information about a specific build-time option, then
90 exit. This functionality is intended for use in scripts such
91 as RunTest. The following options output the value and set
92 the exit code as indicated:
93
94 ebcdic-nl the code for LF (= NL) in an EBCDIC environment:
95 0x15 or 0x25
96 0 if used in an ASCII environment
97 exit code is always 0
98 linksize the configured internal link size (2, 3, or 4)
99 exit code is set to the link size
100 newline the default newline setting:
101 CR, LF, CRLF, ANYCRLF, or ANY
102 exit code is always 0
103 bsr the default setting for what \R matches:
104 ANYCRLF or ANY
105 exit code is always 0
106
107 The following options output 1 for true or 0 for false, and
108 set the exit code to the same value:
109
110 ebcdic compiled for an EBCDIC environment
111 jit just-in-time support is available
112 pcre16 the 16-bit library was built
113 pcre32 the 32-bit library was built
114 pcre8 the 8-bit library was built
115 ucp Unicode property support is available
116 utf UTF-8 and/or UTF-16 and/or UTF-32 support
117 is available
118
119 If an unknown option is given, an error message is output;
120 the exit code is 0.
121
122 -d Behave as if each pattern has the /D (debug) modifier; the
123 internal form and information about the compiled pattern is
124 output after compilation; -d is equivalent to -b -i.
125
126 -dfa Behave as if each data line contains the \D escape sequence;
127 this causes the alternative matching function,
128 pcre[16|32]_dfa_exec(), to be used instead of the standard
129 pcre[16|32]_exec() function (more detail is given below).
130
131 -help Output a brief summary of these options and then exit.
132
133 -i Behave as if each pattern has the /I modifier; information
134 about the compiled pattern is given after compilation.
135
136 -M Behave as if each data line contains the \M escape sequence;
137 this causes PCRE to discover the minimum MATCH_LIMIT and
138 MATCH_LIMIT_RECURSION settings by calling pcre[16|32]_exec()
139 repeatedly with different limits.
140
141 -m Output the size of each compiled pattern after it has been
142 compiled. This is equivalent to adding /M to each regular
143 expression. The size is given in bytes for all three libraries.
144
145 -O Behave as if each pattern has the /O modifier, that is,
146 disable auto-possessification for all patterns.
147
148 -o osize Set the number of elements in the output vector that is used
149 when calling pcre[16|32]_exec() or pcre[16|32]_dfa_exec() to
150 be osize. The default value is 45, which is enough for 14
151 capturing subexpressions for pcre[16|32]_exec() or 22 differ-
152 ent matches for pcre[16|32]_dfa_exec(). The vector size can
153 be changed for individual matching calls by including \O in
154 the data line (see below).
155
156 -p Behave as if each pattern has the /P modifier; the POSIX
157 wrapper API is used to call PCRE. None of the other options
158 has any effect when -p is set. This option can be used only
159 with the 8-bit library.
160
161 -q Do not output the version number of pcretest at the start of
162 execution.
163
164 -S size On Unix-like systems, set the size of the run-time stack to
165 size megabytes.
166
167 -s or -s+ Behave as if each pattern has the /S modifier; in other
168 words, force each pattern to be studied. If -s+ is used, all
169 the JIT compile options are passed to pcre[16|32]_study(),
170 causing just-in-time optimization to be set up if it is
171 available, for both full and partial matching. Specific JIT
172 compile options can be selected by following -s+ with a digit
173 in the range 1 to 7, which selects the JIT compile modes as
174 follows:
175
176 1 normal match only
177 2 soft partial match only
178 3 normal match and soft partial match
179 4 hard partial match only
180 6 soft and hard partial match
181 7 all three modes (default)
182
183 If -s++ is used instead of -s+ (with or without a following
184 digit), the text "(JIT)" is added to the first output line
185 after a match or no match when JIT-compiled code was actually
186 used.
187
188 Note that there are pattern options that can override -s, ei-
189 ther specifying no studying at all, or suppressing JIT compi-
190 lation.
191
192 If the /I or /D option is present on a pattern (requesting
193 output about the compiled pattern), information about the re-
194 sult of studying is not included when studying is caused only
195 by -s and neither -i nor -d is present on the command line.
196 This behaviour means that the output from tests that are run
197 with and without -s should be identical, except when options
198 that output information about the actual running of a match
199 are set.
200
201 The -M, -t, and -tm options, which give information about re-
202 sources used, are likely to produce different output with and
203 without -s. Output may also differ if the /C option is
204 present on an individual pattern. This uses callouts to trace
205 the matching process, and this may be different between
206 studied and non-studied patterns. If the pattern contains
207 (*MARK) items there may also be differences, for the same
208 reason. The -s command line option can be overridden for spe-
209 cific patterns that should never be studied (see the /S pat-
210 tern modifier below).
211
212 -t Run each compile, study, and match many times with a timer,
213 and output the resulting times per compile, study, or match
214 (in milliseconds). Do not set -m with -t, because you will
215 then get the size output a zillion times, and the timing will
216 be distorted. You can control the number of iterations that
217 are used for timing by following -t with a number (as a sepa-
218 rate item on the command line). For example, "-t 1000" iter-
219 ates 1000 times. The default is to iterate 500000 times.
220
221 -tm This is like -t except that it times only the matching phase,
222 not the compile or study phases.
223
224 -T -TM These behave like -t and -tm, but in addition, at the end of
225 a run, the total times for all compiles, studies, and matches
226 are output.
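
The sizing rule quoted for the -o option above (45 elements giving 14
capturing subexpressions or 22 DFA matches) can be checked with a
little arithmetic. The following Python sketch is illustrative only.
It assumes that pcre[16|32]_exec() keeps offset pairs in the first
two-thirds of the vector and treats the last third as workspace, and
that pcre[16|32]_dfa_exec() uses the whole vector for pairs. The
function names are hypothetical and are not part of pcretest.

```python
def exec_capture_capacity(osize):
    """Capturing subexpressions that fit when matching with exec().

    Assumption: two-thirds of the vector holds (start, end) pairs, so
    osize // 3 pairs fit, and pair 0 is taken by the whole match.
    """
    pairs = osize // 3
    return max(pairs - 1, 0)


def dfa_match_capacity(osize):
    """Matches that fit when matching with dfa_exec().

    Assumption: every two elements hold one (start, end) pair.
    """
    return osize // 2


if __name__ == "__main__":
    for size in (45, 30, 3):
        print(size, exec_capture_capacity(size), dfa_match_capacity(size))
```

Under these assumptions the default of 45 gives 15 pairs for
pcre[16|32]_exec(), that is, the whole match plus 14 captures, which
agrees with the figures in the -o description.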
227
228
229 DESCRIPTION
230
231 If pcretest is given two filename arguments, it reads from the first
232 and writes to the second. If it is given only one filename argument, it
233 reads from that file and writes to stdout. Otherwise, it reads from
234 stdin and writes to stdout, and prompts for each line of input, using
235 "re>" to prompt for regular expressions, and "data>" to prompt for data
236 lines.
237
238 When pcretest is built, a configuration option can specify that it
239 should be linked with the libreadline library. When this is done, if
240 the input is from a terminal, it is read using the readline() function.
241 This provides line-editing and history facilities. The output from the
242 -help option states whether or not readline() will be used.
243
244 The program handles any number of sets of input on a single input file.
245 Each set starts with a regular expression, and continues with any num-
246 ber of data lines to be matched against that pattern.
247
248 Each data line is matched separately and independently. If you want to
249 do multi-line matches, you have to use the \n escape sequence (or \r or
250 \r\n, etc., depending on the newline setting) in a single line of input
251 to encode the newline sequences. There is no limit on the length of
252 data lines; the input buffer is automatically extended if it is too
253 small.
254
255 An empty line signals the end of the data lines, at which point a new
256 regular expression is read. The regular expressions are given enclosed
257 in any non-alphanumeric delimiters other than backslash, for example:
258
259 /(a|bc)x+yz/
260
261 White space before the initial delimiter is ignored. A regular expres-
262 sion may be continued over several input lines, in which case the new-
263 line characters are included within it. It is possible to include the
264 delimiter within the pattern by escaping it, for example
265
266 /abc\/def/
267
268 If you do so, the escape and the delimiter form part of the pattern,
269 but since delimiters are always non-alphanumeric, this does not affect
270 its interpretation. If the terminating delimiter is immediately fol-
271 lowed by a backslash, for example,
272
273 /abc/\
274
275 then a backslash is added to the end of the pattern. This is done to
276 provide a way of testing the error condition that arises if a pattern
277 finishes with a backslash, because
278
279 /abc\/
280
281 is interpreted as the first line of a pattern that starts with "abc/",
282 causing pcretest to read the next line as a continuation of the regular
283 expression.
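
The delimiter rules above can be summarised in a short Python sketch.
It is an illustration of the rules as described, not the code that
pcretest uses, and the function name split_pattern_line is
hypothetical. It returns the pattern and the trailing modifier text,
or None when no unescaped closing delimiter is found, which is the
case in which pcretest would read a continuation line.

```python
def split_pattern_line(text):
    """Split '/pattern/modifiers' into (pattern, modifiers).

    Returns None if the closing delimiter has not been seen yet.
    """
    text = text.lstrip()              # leading white space is ignored
    if not text:
        raise ValueError("empty pattern line")
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric, not backslash")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2                    # escape and next char stay in the pattern
            continue
        if ch == delim:
            pattern, rest = text[1:i], text[i + 1:]
            if rest.startswith("\\"):
                # A backslash right after the delimiter is appended to the
                # pattern, to allow testing a pattern that ends in backslash.
                pattern += "\\"
                rest = rest[1:]
            return pattern, rest
        i += 1
    return None


if __name__ == "__main__":
    for line in ["/(a|bc)x+yz/i", "/abc\\/def/", "/abc/\\", "/abc\\/"]:
        print(repr(line), "->", split_pattern_line(line))
```

Note that the escaped delimiter in /abc\/def/ is kept together with its
backslash, as the text above describes.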
284
285
286 PATTERN MODIFIERS
287
288 A pattern may be followed by any number of modifiers, which are mostly
289 single characters, though some of these can be qualified by further
290 characters. Following Perl usage, these are referred to below as, for
291 example, "the /i modifier", even though the delimiter of the pattern
292 need not always be a slash, and no slash is used when writing modi-
293 fiers. White space may appear between the final pattern delimiter and
294 the first modifier, and between the modifiers themselves. For refer-
295 ence, here is a complete list of modifiers. They fall into several
296 groups that are described in detail in the following sections.
297
298 /8 set UTF mode
299 /9 set PCRE_NEVER_UTF (locks out UTF mode)
300 /? disable UTF validity check
301 /+ show remainder of subject after match
302 /= show all captures (not just those that are set)
303
304 /A set PCRE_ANCHORED
305 /B show compiled code
306 /C set PCRE_AUTO_CALLOUT
307 /D same as /B plus /I
308 /E set PCRE_DOLLAR_ENDONLY
309 /F flip byte order in compiled pattern
310 /f set PCRE_FIRSTLINE
311 /G find all matches (shorten string)
312 /g find all matches (use startoffset)
313 /I show information about pattern
314 /i set PCRE_CASELESS
315 /J set PCRE_DUPNAMES
316 /K show backtracking control names
317 /L set locale
318 /M show compiled memory size
319 /m set PCRE_MULTILINE
320 /N set PCRE_NO_AUTO_CAPTURE
321 /O set PCRE_NO_AUTO_POSSESS
322 /P use the POSIX wrapper
323 /Q test external stack check function
324 /S study the pattern after compilation
325 /s set PCRE_DOTALL
326 /T select character tables
327 /U set PCRE_UNGREEDY
328 /W set PCRE_UCP
329 /X set PCRE_EXTRA
330 /x set PCRE_EXTENDED
331 /Y set PCRE_NO_START_OPTIMIZE
332 /Z don't show lengths in /B output
333
334 /<any> set PCRE_NEWLINE_ANY
335 /<anycrlf> set PCRE_NEWLINE_ANYCRLF
336 /<cr> set PCRE_NEWLINE_CR
337 /<crlf> set PCRE_NEWLINE_CRLF
338 /<lf> set PCRE_NEWLINE_LF
339 /<bsr_anycrlf> set PCRE_BSR_ANYCRLF
340 /<bsr_unicode> set PCRE_BSR_UNICODE
341 /<JS> set PCRE_JAVASCRIPT_COMPAT
342
343
344 Perl-compatible modifiers
345
346 The /i, /m, /s, and /x modifiers set the PCRE_CASELESS, PCRE_MULTILINE,
347 PCRE_DOTALL, or PCRE_EXTENDED options, respectively, when
348 pcre[16|32]_compile() is called. These four modifier letters have the
349 same effect as they do in Perl. For example:
350
351 /caseless/i
352
353
354 Modifiers for other PCRE options
355
356 The following table shows additional modifiers for setting PCRE com-
357 pile-time options that do not correspond to anything in Perl:
358
359 /8 PCRE_UTF8 ) when using the 8-bit
360 /? PCRE_NO_UTF8_CHECK ) library
361
362 /8 PCRE_UTF16 ) when using the 16-bit
363 /? PCRE_NO_UTF16_CHECK ) library
364
365 /8 PCRE_UTF32 ) when using the 32-bit
366 /? PCRE_NO_UTF32_CHECK ) library
367
368 /9 PCRE_NEVER_UTF
369 /A PCRE_ANCHORED
370 /C PCRE_AUTO_CALLOUT
371 /E PCRE_DOLLAR_ENDONLY
372 /f PCRE_FIRSTLINE
373 /J PCRE_DUPNAMES
374 /N PCRE_NO_AUTO_CAPTURE
375 /O PCRE_NO_AUTO_POSSESS
376 /U PCRE_UNGREEDY
377 /W PCRE_UCP
378 /X PCRE_EXTRA
379 /Y PCRE_NO_START_OPTIMIZE
380 /<any> PCRE_NEWLINE_ANY
381 /<anycrlf> PCRE_NEWLINE_ANYCRLF
382 /<cr> PCRE_NEWLINE_CR
383 /<crlf> PCRE_NEWLINE_CRLF
384 /<lf> PCRE_NEWLINE_LF
385 /<bsr_anycrlf> PCRE_BSR_ANYCRLF
386 /<bsr_unicode> PCRE_BSR_UNICODE
387 /<JS> PCRE_JAVASCRIPT_COMPAT
388
389 The modifiers that are enclosed in angle brackets are literal strings
390 as shown, including the angle brackets, but the letters within can be
391 in either case. This example sets multiline matching with CRLF as the
392 line ending sequence:
393
394 /^abc/m<CRLF>
395
396 As well as turning on the PCRE_UTF8/16/32 option, the /8 modifier
397 causes all non-printing characters in output strings to be printed us-
398 ing the \x{hh...} notation. Otherwise, those less than 0x100 are output
399 in hex without the curly brackets.
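
The output convention for non-printing characters described above can
be sketched in Python as follows. This is only an approximation of the
rule as stated. The function name is hypothetical, and the choice of
printable range (32 to 126), lower-case hex digits, and a two-digit
minimum width are assumptions, not taken from the pcretest source.

```python
def format_output_char(code, utf_mode):
    """Render one character value the way the text above describes.

    utf_mode corresponds to the /8 modifier being present.
    """
    if 32 <= code < 127:              # assumed printable ASCII range
        return chr(code)
    if utf_mode or code >= 0x100:
        return "\\x{%02x}" % code     # braces: /8 set, or value >= 0x100
    return "\\x%02x" % code           # no braces for values below 0x100


if __name__ == "__main__":
    for code in (0x41, 0x07, 0xE9, 0x100):
        print(format_output_char(code, False), format_output_char(code, True))
```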
400
401 Full details of the PCRE options are given in the pcreapi documenta-
402 tion.
403
404 Finding all matches in a string
405
406 Searching for all possible matches within each subject string can be
407 requested by the /g or /G modifier. After finding a match, PCRE is
408 called again to search the remainder of the subject string. The differ-
409 ence between /g and /G is that the former uses the startoffset argument
410 to pcre[16|32]_exec() to start searching at a new point within the en-
411 tire string (which is in effect what Perl does), whereas the latter
412 passes over a shortened substring. This makes a difference to the
413 matching process if the pattern begins with a lookbehind assertion (in-
414 cluding \b or \B).
415
416 If any call to pcre[16|32]_exec() in a /g or /G sequence matches an
417 empty string, the next call is done with the PCRE_NOTEMPTY_ATSTART and
418 PCRE_ANCHORED flags set in order to search for another, non-empty,
419 match at the same point. If this second match fails, the start offset
420 is advanced, and the normal match is retried. This imitates the way
421 Perl handles such cases when using the /g modifier or the split() func-
422 tion. Normally, the start offset is advanced by one character, but if
423 the newline convention recognizes CRLF as a newline, and the current
424 character is CR followed by LF, an advance of two is used.
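
The control flow for empty matches described above can be sketched in
Python. This is a sketch of the loop only, not of PCRE itself. The
callback match_at is a hypothetical stand-in for pcre[16|32]_exec(),
and the toy matcher emulates PCRE_NOTEMPTY_ATSTART for the single
pattern a* by switching to a+, which is a simplification that works
only for this example.

```python
import re


def find_all(match_at, subject, crlf_is_newline=False):
    """Collect (start, end) spans in the manner of the /g modifier."""
    results = []
    start = 0
    retry_nonempty = False            # True right after an empty match
    while start <= len(subject):
        span = match_at(subject, start,
                        anchored=retry_nonempty,
                        notempty_atstart=retry_nonempty)
        if span is None:
            if not retry_nonempty:
                break                 # normal search failed: finished
            # The anchored non-empty retry failed, so advance the start
            # offset (by two over CRLF when that counts as a newline).
            at_crlf = subject[start:start + 2] == "\r\n"
            start += 2 if (crlf_is_newline and at_crlf) else 1
            retry_nonempty = False
            continue
        results.append(span)
        start = span[1]
        retry_nonempty = span[0] == span[1]
    return results


def match_a_star(subject, start, anchored, notempty_atstart):
    """Toy matcher for the pattern a* (see the assumptions above)."""
    pat = re.compile("a+" if notempty_atstart else "a*")
    m = pat.match(subject, start) if anchored else pat.search(subject, start)
    return m.span() if m else None


if __name__ == "__main__":
    print(find_all(match_a_star, "baab"))
```

For the subject "baab" the loop reports an empty match at offset 0, the
match "aa", and then empty matches at offsets 3 and 4.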
425
426 Other modifiers
427
428 There are yet more modifiers for controlling the way pcretest operates.
429
430 The /+ modifier requests that as well as outputting the substring that
431 matched the entire pattern, pcretest should in addition output the re-
432 mainder of the subject string. This is useful for tests where the sub-
433 ject contains multiple copies of the same substring. If the + modifier
434 appears twice, the same action is taken for captured substrings. In
435 each case the remainder is output on the following line with a plus
436 character following the capture number. Note that this modifier must
437 not immediately follow the /S modifier because /S+ and /S++ have other
438 meanings.
439
440 The /= modifier requests that the values of all potential captured
441 parentheses be output after a match. By default, only those up to the
442 highest one actually used in the match are output (corresponding to the
443 return code from pcre[16|32]_exec()). Values in the offsets vector cor-
444 responding to higher numbers should be set to -1, and these are output
445 as "<unset>". This modifier gives a way of checking that this is hap-
446 pening.
447
448 The /B modifier is a debugging feature. It requests that pcretest out-
449 put a representation of the compiled code after compilation. Normally
450 this information contains length and offset values; however, if /Z is
451 also present, this data is replaced by spaces. This is a special fea-
452 ture for use in the automatic test scripts; it ensures that the same
453 output is generated for different internal link sizes.
454
455 The /D modifier is a PCRE debugging feature, and is equivalent to /BI,
456 that is, both the /B and the /I modifiers.
457
458 The /F modifier causes pcretest to flip the byte order of the 2-byte
459 and 4-byte fields in the compiled pattern. This facility is for testing
460 the feature in PCRE that allows it to execute patterns that were com-
461 piled on a host with a different endianness. This feature is not avail-
462 able when the POSIX interface to PCRE is being used, that is, when the
463 /P pattern modifier is specified. See also the section about saving and
464 reloading compiled patterns below.
465
466 The /I modifier requests that pcretest output information about the
467 compiled pattern (whether it is anchored, has a fixed first character,
468 and so on). It does this by calling pcre[16|32]_fullinfo() after com-
469 piling a pattern. If the pattern is studied, the results of that are
470 also output. In this output, the word "char" means a non-UTF character,
471 that is, the value of a single data item (8-bit, 16-bit, or 32-bit, de-
472 pending on the library that is being tested).
473
474 The /K modifier requests pcretest to show names from backtracking con-
475 trol verbs that are returned from calls to pcre[16|32]_exec(). It
476 causes pcretest to create a pcre[16|32]_extra block if one has not al-
477 ready been created by a call to pcre[16|32]_study(), and to set the
478 PCRE_EXTRA_MARK flag and the mark field within it, every time that
479 pcre[16|32]_exec() is called. If the variable that the mark field
480 points to is non-NULL for a match, non-match, or partial match,
481 pcretest prints the string to which it points. For a match, this is
482 shown on a line by itself, tagged with "MK:". For a non-match it is
483 added to the message.
484
485 The /L modifier must be followed directly by the name of a locale, for
486 example,
487
488 /pattern/Lfr_FR
489
490 For this reason, it must be the last modifier. The given locale is set,
491 pcre[16|32]_maketables() is called to build a set of character tables
492 for the locale, and this is then passed to pcre[16|32]_compile() when
493 compiling the regular expression. Without an /L (or /T) modifier, NULL
494 is passed as the tables pointer; that is, /L applies only to the ex-
495 pression on which it appears.
496
497 The /M modifier causes the size in bytes of the memory block used to
498 hold the compiled pattern to be output. This does not include the size
499 of the pcre[16|32] block; it is just the actual compiled data. If the
500 pattern is successfully studied with the PCRE_STUDY_JIT_COMPILE option,
501 the size of the JIT compiled code is also output.
502
503 The /Q modifier is used to test the use of pcre_stack_guard. It must be
504 followed by '0' or '1', specifying the return code to be given from an
505 external function that is passed to PCRE and used for stack checking
506 during compilation (see the pcreapi documentation for details).
507
508 The /S modifier causes pcre[16|32]_study() to be called after the ex-
509 pression has been compiled, and the results used when the expression is
510 matched. There are a number of qualifying characters that may follow
511 /S. They may appear in any order.
512
513 If /S is followed by an exclamation mark, pcre[16|32]_study() is called
514 with the PCRE_STUDY_EXTRA_NEEDED option, causing it always to return a
515 pcre_extra block, even when studying discovers no useful information.
516
517 If /S is followed by a second S character, it suppresses studying, even
518 if it was requested externally by the -s command line option. This
519 makes it possible to specify that certain patterns are always studied,
520 and others are never studied, independently of -s. This feature is used
521 in the test files in a few cases where the output is different when the
522 pattern is studied.
523
524 If the /S modifier is followed by a + character, the call to
525 pcre[16|32]_study() is made with all the JIT study options, requesting
526 just-in-time optimization support if it is available, for both normal
527 and partial matching. If you want to restrict the JIT compiling modes,
528 you can follow /S+ with a digit in the range 1 to 7:
529
530 1 normal match only
531 2 soft partial match only
532 3 normal match and soft partial match
533 4 hard partial match only
534 6 soft and hard partial match
535 7 all three modes (default)
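
The digits in the table above read naturally as a bitmask in which 1
selects normal matching, 2 selects soft partial matching, and 4 selects
hard partial matching. The Python sketch below decodes a digit on that
basis. The names are hypothetical. The value 5 does not appear in the
table; decoding it as normal plus hard partial matching is an inference
from the bit pattern and should be treated as an assumption.

```python
JIT_MODE_BITS = (
    (1, "normal match"),
    (2, "soft partial match"),
    (4, "hard partial match"),
)


def decode_jit_digit(digit):
    """Return the JIT compile modes selected by the digit after /S+ or -s+."""
    if not 1 <= digit <= 7:
        raise ValueError("digit must be in the range 1 to 7")
    return [name for bit, name in JIT_MODE_BITS if digit & bit]


if __name__ == "__main__":
    for digit in range(1, 8):
        print(digit, ", ".join(decode_jit_digit(digit)))
```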
536
537 If /S++ is used instead of /S+ (with or without a following digit), the
538 text "(JIT)" is added to the first output line after a match or no
539 match when JIT-compiled code was actually used.
540
541 Note that there is also an independent /+ modifier; it must not be
542 given immediately after /S or /S+ because this will be misinterpreted.
543
544 If JIT studying is successful, the compiled JIT code will automatically
545 be used when pcre[16|32]_exec() is run, except when incompatible run-
546 time options are specified. For more details, see the pcrejit documen-
547 tation. See also the \J escape sequence below for a way of setting the
548 size of the JIT stack.
549
550 Finally, if /S is followed by a minus character, JIT compilation is
551 suppressed, even if it was requested externally by the -s command line
552 option. This makes it possible to specify that JIT is never to be used
553 for certain patterns.
554
555 The /T modifier must be followed by a single digit. It causes a spe-
556 cific set of built-in character tables to be passed to pcre[16|32]_com-
557 pile(). It is used in the standard PCRE tests to check behaviour with
558 different character tables. The digit specifies the tables as follows:
559
560 0 the default ASCII tables, as distributed in
561 pcre_chartables.c.dist
562 1 a set of tables defining ISO 8859 characters
563
564 In table 1, some characters whose codes are greater than 128 are iden-
565 tified as letters, digits, spaces, etc.
566
567 Using the POSIX wrapper API
568
569 The /P modifier causes pcretest to call PCRE via the POSIX wrapper API
570 rather than its native API. This supports only the 8-bit library. When
571 /P is set, the following modifiers set options for the regcomp() func-
572 tion:
573
574 /i REG_ICASE
575 /m REG_NEWLINE
576 /N REG_NOSUB
577 /s REG_DOTALL )
578 /U REG_UNGREEDY ) These options are not part of
579 /W REG_UCP ) the POSIX standard
580 /8 REG_UTF8 )
581
582 The /+ modifier works as described above. All other modifiers are ig-
583 nored.
584
585 Locking out certain modifiers
586
587 PCRE can be compiled with or without support for certain features such
588 as UTF-8/16/32 or Unicode properties. Accordingly, the standard tests
589 are split up into a number of different files that are selected for
590 running depending on which features are available. When updating the
591 tests, it is all too easy to put a new test into the wrong file by mis-
592 take; for example, to put a test that requires UTF support into a file
593 that is used when it is not available. To help detect such mistakes as
594 early as possible, there is a facility for locking out specific modi-
595 fiers. If an input line for pcretest starts with the string "< forbid "
596 the following sequence of characters is taken as a list of forbidden
597 modifiers. For example, in the test files that must not use UTF or Uni-
598 code property support, this line appears:
599
600 < forbid 8W
601
602 This locks out the /8 and /W modifiers. An immediate error is given if
603 they are subsequently encountered. If the character string contains <
604 but not >, all the multi-character modifiers that begin with < are
605 locked out. Otherwise, such modifiers must be explicitly listed, for
606 example:
607
608 < forbid <JS><cr>
609
610 There must be a single space between < and "forbid" for this feature to
611 be recognised. If there is not, the line is interpreted either as a
612 request to re-load a pre-compiled pattern (see "SAVING AND RELOADING
613 COMPILED PATTERNS" below) or, if there is another < character, as a
614 pattern that uses < as its delimiter.
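
The rules for a "< forbid " line can be sketched in Python as follows.
This is an illustration of the description above, not the parser used
by pcretest. The function name is hypothetical, and folding the
bracketed names to lower case is an assumption based on the earlier
statement that the letters within angle brackets can be in either case.

```python
def parse_forbid(line):
    """Parse a '< forbid ...' line.

    Returns (single_char_modifiers, bracketed_modifiers, all_bracketed),
    or None if the line does not start with exactly '< forbid '.
    """
    prefix = "< forbid "
    if not line.startswith(prefix):
        return None
    spec = line[len(prefix):].strip()
    single, bracketed, all_bracketed = set(), set(), False
    i = 0
    while i < len(spec):
        ch = spec[i]
        if ch == "<":
            close = spec.find(">", i)
            if close == -1:
                all_bracketed = True  # '<' without '>': lock out all <...>
                i += 1
            else:
                bracketed.add(spec[i:close + 1].lower())
                i = close + 1
        else:
            if not ch.isspace():
                single.add(ch)
            i += 1
    return single, bracketed, all_bracketed


if __name__ == "__main__":
    for line in ["< forbid 8W", "< forbid <JS><cr>", "< forbid 8<"]:
        print(repr(line), "->", parse_forbid(line))
```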
615
616
617 DATA LINES
618
619 Before each data line is passed to pcre[16|32]_exec(), leading and
620 trailing white space is removed, and it is then scanned for \ escapes.
621 Some of these are pretty esoteric features, intended for checking out
622 some of the more complicated features of PCRE. If you are just testing
623 "ordinary" regular expressions, you probably don't need any of these.
624 The following escapes are recognized:
625
626 \a alarm (BEL, \x07)
627 \b backspace (\x08)
628 \e escape (\x1b)
629 \f form feed (\x0c)
630 \n newline (\x0a)
631 \qdd set the PCRE_MATCH_LIMIT limit to dd
632 (any number of digits)
633 \r carriage return (\x0d)
634 \t tab (\x09)
635 \v vertical tab (\x0b)
636 \nnn octal character (up to 3 octal digits); always
637 a byte unless > 255 in UTF-8 or 16-bit or 32-bit mode
638 \o{dd...} octal character (any number of octal digits)
639 \xhh hexadecimal byte (up to 2 hex digits)
640 \x{hh...} hexadecimal character (any number of hex digits)
641 \A pass the PCRE_ANCHORED option to pcre[16|32]_exec()
642 or pcre[16|32]_dfa_exec()
643 \B pass the PCRE_NOTBOL option to pcre[16|32]_exec()
644 or pcre[16|32]_dfa_exec()
645 \Cdd call pcre[16|32]_copy_substring() for substring dd
646 after a successful match (number less than 32)
647 \Cname call pcre[16|32]_copy_named_substring() for substring
648 "name" after a successful match (name terminated
649 by next non-alphanumeric character)
650 \C+ show the current captured substrings at callout
651 time
652 \C- do not supply a callout function
653 \C!n return 1 instead of 0 when callout number n is
654 reached
655 \C!n!m return 1 instead of 0 when callout number n is
656 reached for the mth time
657 \C*n pass the number n (may be negative) as callout
658 data; this is used as the callout return value
659 \D use the pcre[16|32]_dfa_exec() match function
660 \F only shortest match for pcre[16|32]_dfa_exec()
661 \Gdd call pcre[16|32]_get_substring() for substring dd
662 after a successful match (number less than 32)
663 \Gname call pcre[16|32]_get_named_substring() for substring
664 "name" after a successful match (name termin-
665 ated by next non-alphanumeric character)
666 \Jdd set up a JIT stack of dd kilobytes maximum (any
667 number of digits)
668 \L call pcre[16|32]_get_substringlist() after a
669 successful match
670 \M discover the minimum MATCH_LIMIT and
671 MATCH_LIMIT_RECURSION settings
672 \N pass the PCRE_NOTEMPTY option to pcre[16|32]_exec()
673 or pcre[16|32]_dfa_exec(); if used twice, pass the
674 PCRE_NOTEMPTY_ATSTART option
675 \Odd set the size of the output vector passed to
676 pcre[16|32]_exec() to dd (any number of digits)
677 \P pass the PCRE_PARTIAL_SOFT option to pcre[16|32]_exec()
678 or pcre[16|32]_dfa_exec(); if used twice, pass the
679 PCRE_PARTIAL_HARD option
680 \Qdd set the PCRE_MATCH_LIMIT_RECURSION limit to dd
681 (any number of digits)
682 \R pass the PCRE_DFA_RESTART option to pcre[16|32]_dfa_exec()
683 \S output details of memory get/free calls during matching
684 \Y pass the PCRE_NO_START_OPTIMIZE option to
685 pcre[16|32]_exec()
686 or pcre[16|32]_dfa_exec()
687 \Z pass the PCRE_NOTEOL option to pcre[16|32]_exec()
688 or pcre[16|32]_dfa_exec()
689 \? pass the PCRE_NO_UTF[8|16|32]_CHECK option to
690 pcre[16|32]_exec() or pcre[16|32]_dfa_exec()
691 \>dd start the match at offset dd (optional "-"; then
692 any number of digits); this sets the startoffset
693 argument for pcre[16|32]_exec() or
694 pcre[16|32]_dfa_exec()
695 \<cr> pass the PCRE_NEWLINE_CR option to pcre[16|32]_exec()
696 or pcre[16|32]_dfa_exec()
697 \<lf> pass the PCRE_NEWLINE_LF option to pcre[16|32]_exec()
698 or pcre[16|32]_dfa_exec()
699 \<crlf> pass the PCRE_NEWLINE_CRLF option to pcre[16|32]_exec()
700 or pcre[16|32]_dfa_exec()
701 \<anycrlf> pass the PCRE_NEWLINE_ANYCRLF option to pcre[16|32]_exec()
702 or pcre[16|32]_dfa_exec()
703 \<any> pass the PCRE_NEWLINE_ANY option to pcre[16|32]_exec()
704 or pcre[16|32]_dfa_exec()
705
706 The use of \x{hh...} is not dependent on the use of the /8 modifier on
707 the pattern. It is recognized always. There may be any number of hexa-
708 decimal digits inside the braces; invalid values provoke error mes-
709 sages.
710
711 Note that \xhh specifies one byte rather than one character in UTF-8
712 mode; this makes it possible to construct invalid UTF-8 sequences for
713 testing purposes. On the other hand, \x{hh} is interpreted as a UTF-8
714 character in UTF-8 mode, generating more than one byte if the value is
715 greater than 127. When testing the 8-bit library not in UTF-8 mode,
716 \x{hh} generates one byte for values less than 256, and causes an error
717 for greater values.
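
The difference between the two escape forms in the 8-bit library can be sketched in Python as follows. This is only an illustration of the rules stated above, not pcretest's own code; the function name expand_hex_escape and its parameters are hypothetical.

```python
def expand_hex_escape(value: int, braced: bool, utf8: bool) -> bytes:
    """Return the subject bytes for \\xhh (braced=False) or \\x{hh...} (braced=True).

    Hypothetical helper that models the 8-bit library rules described above.
    """
    if not braced:
        # \xhh always yields exactly one byte, even in UTF-8 mode.
        if not 0 <= value <= 0xFF:
            raise ValueError("\\xhh takes at most two hex digits")
        return bytes([value])
    if utf8:
        # \x{hh...} is a character: values above 127 need more than one byte.
        return chr(value).encode("utf-8")
    if value > 0xFF:
        # Not in UTF-8 mode: only single-byte values are allowed.
        raise ValueError("character value greater than 255 in non-UTF-8 mode")
    return bytes([value])
```

With this model, \xe9 in UTF-8 mode gives the single byte 0xe9 (an invalid UTF-8 string on its own), whereas \x{e9} gives the two bytes 0xc3 0xa9.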
718
719 In UTF-16 mode, all 4-digit \x{hhhh} values are accepted. This makes it
720 possible to construct invalid UTF-16 sequences for testing purposes.
721
722 In UTF-32 mode, all 4- to 8-digit \x{...} values are accepted. This
723 makes it possible to construct invalid UTF-32 sequences for testing
724 purposes.
725
726 The escapes that specify line ending sequences are literal strings, ex-
727 actly as shown. No more than one newline setting should be present in
728 any data line.
729
730 A backslash followed by any other character just escapes that character.
731 If the very last character is a backslash, it is ignored. This gives a
732 way of passing an empty line as data, since a real empty line termi-
733 nates the data input.
734
735 The \J escape provides a way of setting the maximum stack size that is
736 used by the just-in-time optimization code. It is ignored if JIT opti-
737 mization is not being used. Providing a stack that is larger than the
738 default 32K is necessary only for very complicated patterns.
739
740 If \M is present, pcretest calls pcre[16|32]_exec() several times, with
741 different values in the match_limit and match_limit_recursion fields of
742 the pcre[16|32]_extra data structure, until it finds the minimum num-
743 bers for each parameter that allow pcre[16|32]_exec() to complete with-
744 out error. Because this is testing a specific feature of the normal in-
745 terpretive pcre[16|32]_exec() execution, the use of any JIT optimiza-
746 tion that might have been set up by the /S+ qualifier or -s+ option is
747 disabled.
748
749 The match_limit number is a measure of the amount of backtracking that
750 takes place, and checking it out can be instructive. For most simple
751 matches, the number is quite small, but for patterns with very large
752 numbers of matching possibilities, it can become large very quickly
753 with increasing length of subject string. The match_limit_recursion
754 number is a measure of how much stack (or, if PCRE is compiled with
755 NO_RECURSE, how much heap) memory is needed to complete the match at-
756 tempt.
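
This manual does not say how pcretest searches for the minimum limits requested by \M. One plausible strategy, sketched here in Python purely as an assumption, is to double a trial limit until the match completes and then bisect. The callable completes is a hypothetical stand-in for "call pcre[16|32]_exec() with this limit and report whether it finished without a limit error".

```python
from typing import Callable


def find_minimum_limit(completes: Callable[[int], bool],
                       ceiling: int = 10_000_000) -> int:
    """Return the smallest limit in [1, ceiling] for which completes(limit) is True.

    Assumes completes() is monotonic: once a limit is large enough,
    every larger limit also succeeds.
    """
    # Phase 1: double the trial value until the match completes.
    high = 1
    while not completes(high):
        if high >= ceiling:
            raise RuntimeError("match does not complete within the ceiling")
        high = min(high * 2, ceiling)
    # Phase 2: bisect between a value known to fail and the first success.
    low = high // 2  # fails by monotonicity (or is 0 when high == 1)
    while high - low > 1:
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high
```

The same search would be run once for match_limit and once for match_limit_recursion.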
757
758 When \O is used, the value specified may be higher or lower than the
759 size set by the -O command line option (or defaulted to 45); \O applies
760 only to the call of pcre[16|32]_exec() for the line in which it ap-
761 pears.
762
763 If the /P modifier was present on the pattern, causing the POSIX wrap-
764 per API to be used, the only option-setting sequences that have any ef-
765 fect are \B, \N, and \Z, causing REG_NOTBOL, REG_NOTEMPTY, and REG_NO-
766 TEOL, respectively, to be passed to regexec().
767
768
769 THE ALTERNATIVE MATCHING FUNCTION
770
771 By default, pcretest uses the standard PCRE matching function,
772 pcre[16|32]_exec() to match each data line. PCRE also supports an al-
773 ternative matching function, pcre[16|32]_dfa_exec(), which operates in
774 a different way, and has some restrictions. The differences between the
775 two functions are described in the pcrematching documentation.
776
777 If a data line contains the \D escape sequence, or if the command line
778 contains the -dfa option, the alternative matching function is used.
779 This function finds all possible matches at a given point. If, however,
780 the \F escape sequence is present in the data line, it stops after the
781 first match is found. This is always the shortest possible match.
782
783
784 DEFAULT OUTPUT FROM PCRETEST
785
786 This section describes the output when the normal matching function,
787 pcre[16|32]_exec(), is being used.
788
789 When a match succeeds, pcretest outputs the list of captured substrings
790 that pcre[16|32]_exec() returns, starting with number 0 for the string
791 that matched the whole pattern. Otherwise, it outputs "No match" when
792 the return is PCRE_ERROR_NOMATCH, and "Partial match:" followed by the
793 partially matching substring when pcre[16|32]_exec() returns PCRE_ER-
794 ROR_PARTIAL. (Note that this is the entire substring that was inspected
795 during the partial match; it may include characters before the actual
796 match start if a lookbehind assertion, \K, \b, or \B was involved.) For
797 any other return, pcretest outputs the PCRE negative error number and a
798 short descriptive phrase. If the error is a failed UTF string check,
799 the offset of the start of the failing character and the reason code
800 are also output, provided that the size of the output vector is at
801 least two. Here is an example of an interactive pcretest run.
802
803 $ pcretest
804 PCRE version 8.13 2011-04-30
805
806 re> /^abc(\d+)/
807 data> abc123
808 0: abc123
809 1: 123
810 data> xyz
811 No match
812
813 Unset capturing substrings that are not followed by one that is set are
814 not returned by pcre[16|32]_exec(), and are not shown by pcretest. In
815 the following example, there are two capturing substrings, but when the
816 first data line is matched, the second, unset substring is not shown.
817 An "internal" unset substring is shown as "<unset>", as for the second
818 data line.
819
820 re> /(a)|(b)/
821 data> a
822 0: a
823 1: a
824 data> b
825 0: b
826 1: <unset>
827 2: b
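
The rule for unset substrings can be illustrated with a short Python sketch. It uses Python's own re module as a stand-in for PCRE, and the helper name capture_lines is hypothetical; the output layout is simplified and is not claimed to be byte-for-byte what pcretest prints.

```python
import re
from typing import List


def capture_lines(pattern: str, subject: str) -> List[str]:
    """List captured substrings the way the text above describes."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    groups = [m.group(0), *m.groups()]
    # Unset groups at the end of the list are not returned at all.
    while groups[-1] is None:
        groups.pop()
    # An unset group that is followed by a set one is shown as <unset>.
    return [f"{i}: {'<unset>' if g is None else g}"
            for i, g in enumerate(groups)]
```

For the pattern (a)|(b), the subject "a" yields two lines, while "b" yields three, with group 1 reported as <unset>.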
828
829 If the strings contain any non-printing characters, they are output as
830 \xhh escapes if the value is less than 256 and UTF mode is not set.
831 Otherwise they are output as \x{hh...} escapes. See below for the defi-
832 nition of non-printing characters. If the pattern has the /+ modifier,
833 the output for substring 0 is followed by the rest of the subject
834 string, identified by "0+" like this:
835
836 re> /cat/+
837 data> cataract
838 0: cat
839 0+ aract
840
841 If the pattern has the /g or /G modifier, the results of successive
842 matching attempts are output in sequence, like this:
843
844 re> /\Bi(\w\w)/g
845 data> Mississippi
846 0: iss
847 1: ss
848 0: iss
849 1: ss
850 0: ipp
851 1: pp
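
As a rough cross-check of the /g example, the same sequence of matches can be reproduced with Python's re module. This is a sketch using a different regex engine, so it only demonstrates the idea of successive matching; the function name successive_matches is hypothetical, and pcretest's handling of empty matches under /g and /G is not modelled.

```python
import re
from typing import List, Tuple


def successive_matches(pattern: str, subject: str) -> List[Tuple[str, ...]]:
    """Collect (whole match, group 1, ...) for each successive match."""
    return [(m.group(0), *m.groups()) for m in re.finditer(pattern, subject)]


for whole, *groups in successive_matches(r"\Bi(\w\w)", "Mississippi"):
    print("0:", whole)
    for number, text in enumerate(groups, start=1):
        print(f"{number}:", text)
```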
852
853 "No match" is output only if the first match attempt fails. Here is an
854 example of a failure message (the offset 4 that is specified by \>4 is
855 past the end of the subject string):
856
857 re> /xyz/
858 data> xyz\>4
859 Error -24 (bad offset value)
860
861 If any of the sequences \C, \G, or \L are present in a data line that
862 is successfully matched, the substrings extracted by the convenience
863 functions are output with C, G, or L after the string number instead of
864 a colon. This is in addition to the normal full list. The string length
865 (that is, the return from the extraction function) is given in paren-
866 theses after each string for \C and \G.
867
868 Note that whereas patterns can be continued over several lines (a plain
869 ">" prompt is used for continuations), data lines may not. However, new-
870 lines can be included in data by means of the \n escape (or \r, \r\n,
871 etc., depending on the newline sequence setting).
872
873
874 OUTPUT FROM THE ALTERNATIVE MATCHING FUNCTION
875
876 When the alternative matching function, pcre[16|32]_dfa_exec(), is used
877 (by means of the \D escape sequence or the -dfa command line option),
878 the output consists of a list of all the matches that start at the
879 first point in the subject where there is at least one match. For exam-
880 ple:
881
882 re> /(tang|tangerine|tan)/
883 data> yellow tangerine\D
884 0: tangerine
885 1: tang
886 2: tan
887
888 (Using the normal matching function on this data finds only "tang".)
889 The longest matching string is always given first (and numbered zero).
890 After a PCRE_ERROR_PARTIAL return, the output is "Partial match:", fol-
891 lowed by the partially matching substring. (Note that this is the en-
892 tire substring that was inspected during the partial match; it may in-
893 clude characters before the actual match start if a lookbehind asser-
894 tion, \K, \b, or \B was involved.)
895
896 If /g is present on the pattern, the search for further matches resumes
897 at the end of the longest match. For example:
898
899 re> /(tang|tangerine|tan)/g
900 data> yellow tangerine and tangy sultana\D
901 0: tangerine
902 1: tang
903 2: tan
904 0: tang
905 1: tan
906 0: tan
907
908 Since the matching function does not support substring capture, the es-
909 cape sequences that are concerned with captured substrings are not rel-
910 evant.
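
The shape of the output in this section can be mimicked with a small Python sketch. It handles only a list of literal alternatives, which is enough for the tangerine examples, and it is not an implementation of the DFA algorithm; the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def first_match_set(alternatives: List[str], subject: str,
                    start: int = 0) -> Optional[Tuple[int, List[str]]]:
    """Find the first position at or after start where any alternative matches.

    Returns (position, matches) with the longest match first, or None.
    """
    for pos in range(start, len(subject) + 1):
        found = {alt for alt in alternatives if subject.startswith(alt, pos)}
        if found:
            return pos, sorted(found, key=len, reverse=True)
    return None


def global_match_sets(alternatives: List[str], subject: str) -> List[List[str]]:
    """Repeat the search as /g does, resuming at the end of the longest match."""
    results = []
    start = 0
    while True:
        hit = first_match_set(alternatives, subject, start)
        if hit is None:
            return results
        pos, matches = hit
        results.append(matches)
        # Advance by at least one character so an empty match cannot loop forever.
        start = pos + max(len(matches[0]), 1)
```

Applied to "yellow tangerine and tangy sultana", global_match_sets() returns three groups of matches that correspond to the three numbered groups in the /g example.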
911
912
913 RESTARTING AFTER A PARTIAL MATCH
914
915 When the alternative matching function has given the PCRE_ERROR_PARTIAL
916 return, indicating that the subject partially matched the pattern, you
917 can restart the match with additional subject data by means of the \R
918 escape sequence. For example:
919
920 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
921 data> 23ja\P\D
922 Partial match: 23ja
923 data> n05\R\D
924 0: n05
925
926 For further information about partial matching, see the pcrepartial
927 documentation.
928
929
930 CALLOUTS
931
932 If the pattern contains any callout requests, pcretest's callout func-
933 tion is called during matching. This works with both matching func-
934 tions. By default, the called function displays the callout number, the
935 start and current positions in the text at the callout time, and the
936 next pattern item to be tested. For example:
937
938 --->pqrabcdef
939   0    ^  ^     \d
940
941 This output indicates that callout number 0 occurred for a match at-
942 tempt starting at the fourth character of the subject string, when the
943 pointer was at the seventh character of the data, and when the next
944 pattern item was \d. Just one circumflex is output if the start and
945 current positions are the same.
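
The way the circumflexes line up with the subject can be sketched in Python as follows. The helper callout_lines is hypothetical, and the amount of space before the next pattern item is an assumption; only the positions of the circumflexes under the subject are taken from the description above.

```python
from typing import List


def callout_lines(subject: str, label: str, start: int, current: int,
                  next_item: str = "") -> List[str]:
    """Build the subject line and the marker line for one callout.

    start and current are zero-based offsets into the subject.
    """
    header = "--->" + subject
    # The label occupies the same four columns as the "--->" prefix.
    marker = f"{label:>3} " + " " * start + "^"
    if current > start:
        # A second circumflex marks the current position when it differs.
        marker += " " * (current - start - 1) + "^"
    # Assumed layout: pad just past the end of the subject, then the item.
    marker = marker.ljust(len(header) + 1) + next_item
    return [header, marker.rstrip()]


print("\n".join(callout_lines("pqrabcdef", "0", 3, 6, r"\d")))
```

Here the match attempt starts at offset 3 (the fourth character) and the current position is offset 6 (the seventh character).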
946
947 Callouts numbered 255 are assumed to be automatic callouts, inserted as
948 a result of the /C pattern modifier. In this case, instead of showing
949 the callout number, the offset in the pattern, preceded by a plus, is
950 output. For example:
951
952 re> /\d?[A-E]\*/C
953 data> E*
954 --->E*
955  +0 ^      \d?
956  +3 ^      [A-E]
957  +8 ^^     \*
958 +10 ^ ^
959 0: E*
960
961 If a pattern contains (*MARK) items, an additional line is output when-
962 ever a change of latest mark is passed to the callout function. For ex-
963 ample:
964
965 re> /a(*MARK:X)bc/C
966 data> abc
967 --->abc
968  +0 ^       a
969  +1 ^^      (*MARK:X)
970 +10 ^^      b
971 Latest Mark: X
972 +11 ^ ^     c
973 +12 ^  ^
974 0: abc
975
976 The mark changes between matching "a" and "b", but stays the same for
977 the rest of the match, so nothing more is output. If, as a result of
978 backtracking, the mark reverts to being unset, the text "<unset>" is
979 output.
980
981 The callout function in pcretest returns zero (carry on matching) by
982 default, but you can use a \C item in a data line (as described above)
983 to change this and other parameters of the callout.
984
985 Inserting callouts can be helpful when using pcretest to check compli-
986 cated regular expressions. For further information about callouts, see
987 the pcrecallout documentation.
988
989
990 NON-PRINTING CHARACTERS
991
992 When pcretest is outputting text in the compiled version of a pattern,
993 bytes other than 32-126 are always treated as non-printing characters
994 and are therefore shown as hex escapes.
995
996 When pcretest is outputting text that is a matched part of a subject
997 string, it behaves in the same way, unless a different locale has been
998 set for the pattern (using the /L modifier). In this case, the
999 isprint() function is used to distinguish printing and non-printing characters.
1000
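The default rule, together with the choice between \xhh and \x{hh...} described in the section on default output, can be sketched in Python like this. The function names are hypothetical, the use of lower-case hex digits is an assumption, and the locale-dependent isprint() case is not modelled.

```python
from typing import Iterable


def show_char(value: int, utf: bool = False) -> str:
    """Render one character value as it would appear in matched output."""
    if 32 <= value <= 126:
        # Printing characters are shown as themselves.
        return chr(value)
    if value < 256 and not utf:
        # Two-digit form for byte values outside UTF mode.
        return f"\\x{value:02x}"
    # Braced form for larger values, or for any value in UTF mode.
    return f"\\x{{{value:x}}}"


def show(values: Iterable[int], utf: bool = False) -> str:
    """Render a sequence of character values."""
    return "".join(show_char(v, utf) for v in values)
```

For example, the values for "a", newline, "b" are rendered as a\x0ab when UTF mode is not set.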
1001
1002 SAVING AND RELOADING COMPILED PATTERNS
1003
1004 The facilities described in this section are not available when the
1005 POSIX interface to PCRE is being used, that is, when the /P pattern
1006 modifier is specified.
1007
1008 When the POSIX interface is not in use, you can cause pcretest to write
1009 a compiled pattern to a file, by following the modifiers with > and a
1010 file name. For example:
1011
1012 /pattern/im >/some/file
1013
1014 See the pcreprecompile documentation for a discussion about saving and
1015 re-using compiled patterns. Note that if the pattern was successfully
1016 studied with JIT optimization, the JIT data cannot be saved.
1017
1018 The data that is written is binary. The first eight bytes are the
1019 length of the compiled pattern data followed by the length of the op-
1020 tional study data, each written as four bytes in big-endian order (most
1021 significant byte first). If there is no study data (either the pattern
1022 was not studied, or studying did not return any data), the second
1023 length is zero. The lengths are followed by an exact copy of the com-
1024 piled pattern. If there is additional study data, this (excluding any
1025 JIT data) follows immediately after the compiled pattern. After writing
1026 the file, pcretest expects to read a new pattern.
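
The layout just described is simple enough to split apart with a few lines of Python. This is a sketch for inspecting such a file, with a hypothetical function name; treating the two lengths as unsigned is an assumption, and the compiled pattern itself is opaque binary data that this code does not interpret.

```python
import struct
from typing import Tuple


def split_saved_pattern(data: bytes) -> Tuple[bytes, bytes]:
    """Split a saved-pattern file into (compiled pattern, study data)."""
    if len(data) < 8:
        raise ValueError("file is too short to hold the two length fields")
    # Two four-byte big-endian lengths: compiled pattern, then study data.
    pattern_len, study_len = struct.unpack(">II", data[:8])
    pattern_end = 8 + pattern_len
    study_end = pattern_end + study_len
    if len(data) < study_end:
        raise ValueError("file is shorter than its header claims")
    # study_len is zero when there is no study data, giving an empty result.
    return data[8:pattern_end], data[pattern_end:study_end]
```

A caller would read the file in binary mode, for example with open(name, "rb"), and pass the whole contents to the function.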
1027
1028 A saved pattern can be reloaded into pcretest by specifying < and a
1029 file name instead of a pattern. There must be no space between < and
1030 the file name, which must not contain a < character, as otherwise
1031 pcretest will interpret the line as a pattern delimited by < charac-
1032 ters. For example:
1033
1034 re> </some/file
1035 Compiled pattern loaded from /some/file
1036 No study data
1037
1038 If the pattern was previously studied with the JIT optimization, the
1039 JIT information cannot be saved and restored, and so is lost. When the
1040 pattern has been loaded, pcretest proceeds to read data lines in the
1041 usual way.
1042
1043 You can copy a file written by pcretest to a different host and reload
1044 it there, even if the new host has opposite endianness to the one on
1045 which the pattern was compiled. For example, you can compile on an x86
1046 machine and run on a SPARC machine. When a pattern is reloaded on a
1047 host with different endianness, the confirmation message is changed to:
1048
1049 Compiled pattern (byte-inverted) loaded from /some/file
1050
1051 The test suite contains some saved pre-compiled patterns with different
1052 endianness. These are reloaded using "<!" instead of just "<". This
1053 suppresses the "(byte-inverted)" text so that the output is the same on
1054 all hosts. It also forces debugging output once the pattern has been
1055 reloaded.
1056
1057 File names for saving and reloading can be absolute or relative, but
1058 note that the shell facility of expanding a file name that starts with
1059 a tilde (~) is not available.
1060
1061 The ability to save and reload files in pcretest is intended for test-
1062 ing and experimentation. It is not intended for production use because
1063 only a single pattern can be written to a file. Furthermore, there is
1064 no facility for supplying custom character tables for use with a
1065 reloaded pattern. If the original pattern was compiled with custom ta-
1066 bles, an attempt to match a subject string using a reloaded pattern is
1067 likely to cause pcretest to crash. Finally, if you attempt to load a
1068 file that is not in the correct format, the result is undefined.
1069
1070
1071 SEE ALSO
1072
1073 pcre(3), pcre16(3), pcre32(3), pcreapi(3), pcrecallout(3), pcrejit(3),
1074 pcrematching(3), pcrepartial(3), pcrepattern(3), pcreprecompile(3).
1075
1076
1077 AUTHOR
1078
1079 Philip Hazel
1080 University Computing Service
1081 Cambridge CB2 3QH, England.
1082
1083
1084 REVISION
1085
1086 Last updated: 10 February 2020
1087 Copyright (c) 1997-2020 University of Cambridge.
