Diff of /code/trunk/doc/pcretest.txt

revision 47 by nigel, Sat Feb 24 21:39:29 2007 UTC
revision 63 by nigel, Sat Feb 24 21:40:03 2007 UTC
# Line 1  Line 1 
 The pcretest program
 --------------------

1    NAME
2         pcretest - a program  for  testing  Perl-compatible  regular
3         expressions.
4    
5    
6    SYNOPSIS
7         pcretest [-C] [-d] [-i] [-m] [-o osize] [-p] [-t] [source]
8         [destination]
9    
10         pcretest was written as a test program for the PCRE  regular
11         expression  library  itself,  but  it  can  also be used for
12         experimenting  with  regular  expressions.   This   document
13         describes  the  features of the test program; for details of
14         the regular  expressions  themselves,  see  the  pcrepattern
15         documentation.  For details of PCRE and its options, see the
16         pcreapi documentation.
17    
18    
19    OPTIONS
20    
21    
22         -C        Output the version number of the PCRE library, and
23                   all   available  information  about  the  optional
24                   features that are included, and then exit.
25    
26         -d        Behave as if each regex had the /D  modifier  (see
27                   below); the internal form is output after compila-
28                   tion.
29    
30         -i        Behave as if  each  regex  had  the  /I  modifier;
31                   information  about  the  compiled pattern is given
32                   after compilation.
33    
34         -m        Output the size of each compiled pattern after  it
35                   has been compiled. This is equivalent to adding /M
36                   to each regular expression. For compatibility with
37                   earlier  versions of pcretest, -s is a synonym for
38                   -m.
39    
40         -o osize  Set the number of elements in  the  output  vector
41                   that  is  used  when calling PCRE to be osize. The
42                   default value is 45, which is enough for  14  cap-
43                   turing  subexpressions.  The  vector  size  can be
44                   changed for individual matching calls by including
45                   \O in the data line (see below).
46    
47         -p        Behave as if each regex had the /P modifier; the POSIX
48                   wrapper  API  is  used  to  call PCRE. None of the
49                   other options has any effect when -p is set.
50    
51         -t        Run each compile, study, and match many times with
52                   a timer, and output the resulting time per compile or
53                   match (in milliseconds). Do not set  -t  with  -m,
54                   because  you  will  then get the size output 20000
55                   times and the timing will be distorted.
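The relationship between the -o value and the number of capturing subexpressions can be checked with a little arithmetic. The sketch below assumes the convention described in the pcreapi documentation: the vector size is rounded down to a multiple of three, only the first two thirds hold offsets (as start/end pairs), and the first pair describes the whole match. The function name is made up for this illustration.

```python
def capturing_groups_for(osize):
    """Number of capturing subexpressions whose offsets fit in a vector of osize ints."""
    pairs = osize // 3           # two thirds of the vector, two ints per pair
    return max(pairs - 1, 0)     # pair 0 is the match for the whole pattern


print(capturing_groups_for(45))  # the default vector size
```

With the default of 45 this gives 14, which agrees with the figure quoted for -o above.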
56    
57    
58    DESCRIPTION
59    
60         If pcretest is given two filename arguments, it  reads  from
61         the  first and writes to the second. If it is given only one
62         filename argument, it reads from that  file  and  writes  to
63         stdout. Otherwise, it reads from stdin and writes to stdout,
64         and prompts for each line of input, using  "re>"  to  prompt
65         for  regular  expressions,  and  "data>"  to prompt for data
66         lines.
67    
68         The program handles any number of sets of input on a  single
69         input  file.  Each set starts with a regular expression, and
70         continues with any  number  of  data  lines  to  be  matched
71         against the pattern.
72    
73         Each line is matched separately and  independently.  If  you
74         want  to  do  multiple-line  matches, you have to use the \n
75         escape sequence in a single line of input to encode the newline
76         characters. The maximum length of a data line is 30,000
77         characters.
78    
79         An empty line signals the end of the data  lines,  at  which
80         point  a new regular expression is read. The regular expres-
81         sions are given enclosed in  any  non-alphameric  delimiters
82         other than backslash, for example
83    
84           /(a|bc)x+yz/
85    
86         White space before the initial delimiter is ignored. A regu-
87         lar expression may be continued over several input lines, in
88         which case the newline characters are included within it. It
89         is  possible  to include the delimiter within the pattern by
90         escaping it, for example
91    
92           /abc\/def/
93    
94         If you do so, the escape and the delimiter form part of  the
95         pattern,  but  since  delimiters  are always non-alphameric,
96         this does not affect its interpretation.  If the terminating
97         delimiter  is immediately followed by a backslash, for exam-
98         ple,
99    
100           /abc/\
101    
102         then a backslash is added to the end of the pattern. This is
103         done  to  provide  a way of testing the error condition that
104         arises if a pattern finishes with a backslash, because
105    
106           /abc\/
107    
108         is interpreted as the first line of a  pattern  that  starts
109         with  "abc/",  causing  pcretest  to read the next line as a
110         continuation of the regular expression.
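The delimiter rules above can be summarised in a short sketch. This is not pcretest's own code: `parse_pattern` is a hypothetical helper that takes the text read so far and returns the pattern and the modifier string, or None when the terminating delimiter has not been seen yet and another input line is needed.

```python
def parse_pattern(text):
    """Return (pattern, modifiers), or None if the closing delimiter is missing."""
    text = text.lstrip()                  # white space before the delimiter is ignored
    if not text:
        raise ValueError("no pattern given")
    delim = text[0]
    if delim.isalnum() or delim == "\\":
        raise ValueError("delimiter must be non-alphanumeric and not a backslash")
    i = 1
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            i += 2                        # the escape and the next character stay in the pattern
            continue
        if ch == delim:
            break                         # unescaped delimiter ends the pattern
        i += 1
    else:
        return None                       # no terminator yet: a continuation line is needed
    pattern = text[1:i]
    rest = text[i + 1:]
    if rest.startswith("\\"):
        pattern += "\\"                   # /abc/\ puts a backslash at the end of the pattern
        rest = rest[1:]
    return pattern, rest.strip()


print(parse_pattern(r"/abc\/def/"))
print(parse_pattern("/abc/\\"))
print(parse_pattern("/abc\\/"))
```

The last call returns None, which mirrors the case where /abc\/ is treated as the first line of a longer pattern.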
111    
112    
113    PATTERN MODIFIERS
114    
115         The pattern may be followed by i, m, s,  or  x  to  set  the
116         PCRE_CASELESS, PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED
117         options, respectively. For example:
118    
119           /caseless/i
120    
121         These modifier letters have the same effect as  they  do  in
122         Perl.  There  are  others which set PCRE options that do not
123         correspond  to  anything  in  Perl:   /A,  /E,  and  /X  set
124         PCRE_ANCHORED,  PCRE_DOLLAR_ENDONLY,  and PCRE_EXTRA respec-
125         tively.
126    
127         Searching for  all  possible  matches  within  each  subject
128         string  can  be  requested  by  the /g or /G modifier. After
129         finding  a  match,  PCRE  is  called  again  to  search  the
130         remainder  of  the subject string. The difference between /g
131         and /G is that the former uses the startoffset  argument  to
132         pcre_exec()  to  start  searching  at a new point within the
133         entire string (which is in effect what Perl  does),  whereas
134         the  latter  passes over a shortened substring. This makes a
135         difference to the matching process  if  the  pattern  begins
136         with a lookbehind assertion (including \b or \B).
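The effect of the two strategies can be imitated with Python's re module, which is used here only as a stand-in for PCRE. The sketch assumes that `search(subject, pos)` behaves like a start offset into the whole string (so \B can see the preceding character), while slicing the subject behaves like passing a shortened substring. Both function names are invented for the illustration, and empty matches are simply stepped over rather than retried.

```python
import re


def match_all_startoffset(pattern, subject):
    """Like /g: keep the whole subject and move the start offset."""
    regex = re.compile(pattern)
    found, pos = [], 0
    while pos <= len(subject):
        m = regex.search(subject, pos)
        if m is None:
            break
        found.append(m.group(0))
        # Step past an empty match so that the loop always advances.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    return found


def match_all_substring(pattern, subject):
    """Like /G: pass only the unmatched remainder of the subject."""
    regex = re.compile(pattern)
    found, rest = [], subject
    while True:
        m = regex.search(rest)
        if m is None:
            break
        found.append(m.group(0))
        cut = m.end() if m.end() > m.start() else m.end() + 1
        if cut > len(rest):
            break
        rest = rest[cut:]                 # earlier text is no longer visible to \B
    return found


print(match_all_startoffset(r"\Bi(\w\w)", "Mississippi"))
print(match_all_substring(r"\Bi(\w\w)", "Mississippi"))
```

The first call finds "iss", "iss" and "ipp". The second finds only "iss" and "ipp", because after the first match the remainder "issippi" starts with "i" and \B cannot match at the start of a string.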
137    
138         If any call to pcre_exec() in a /g or /G sequence matches an
139         empty  string,  the next call is done with the PCRE_NOTEMPTY
140         and PCRE_ANCHORED flags set in order to search for  another,
141         non-empty,  match  at  the same point.  If this second match
142         fails, the start offset is advanced by one, and  the  normal
143         match  is  retried.  This imitates the way Perl handles such
144         cases when using the /g modifier or the split() function.
145    
146         There are a number of other modifiers  for  controlling  the
147         way pcretest operates.
148    
149         The /+ modifier requests that as well as outputting the sub-
150         string  that  matched the entire pattern, pcretest should in
151         addition output the remainder of the subject string. This is
152         useful  for tests where the subject contains multiple copies
153         of the same substring.
154    
155         The /L modifier must be followed directly by the name  of  a
156         locale, for example,
157    
158           /pattern/Lfr
159    
160         For this reason, it must be the last  modifier  letter.  The
161         given  locale is set, pcre_maketables() is called to build a
162         set of character tables for the locale,  and  this  is  then
163         passed  to pcre_compile() when compiling the regular expres-
164         sion. Without an /L modifier, NULL is passed as  the  tables
165         pointer; that is, /L applies only to the expression on which
166         it appears.
167    
168         The /I modifier requests that  pcretest  output  information
169         about the compiled expression (whether it is anchored, has a
170         fixed first character, and so on). It does this  by  calling
171         pcre_fullinfo()  after  compiling an expression, and output-
172         ting the information it gets back. If the  pattern  is  stu-
173         died, the results of that are also output.
174    
175         The /D modifier is a  PCRE  debugging  feature,  which  also
176         assumes /I.  It causes the internal form of compiled regular
177         expressions to be output after compilation. If  the  pattern
178         was studied, the information returned is also output.
179    
180         The /S modifier causes pcre_study() to be called  after  the
181         expression  has been compiled, and the results used when the
182         expression is matched.
183    
184         The /M modifier causes the size of the memory block used to
185         hold the compiled pattern to be output.
186    
187         The /P modifier causes pcretest to call PCRE via  the  POSIX
188         wrapper  API  rather than its native API. When this is done,
189         all other modifiers except  /i,  /m,  and  /+  are  ignored.
190         REG_ICASE is set if /i is present, and REG_NEWLINE is set if
191         /m    is    present.    The    wrapper    functions    force
192         PCRE_DOLLAR_ENDONLY    always,    and   PCRE_DOTALL   unless
193         REG_NEWLINE is set.
194    
195         The /8 modifier  causes  pcretest  to  call  PCRE  with  the
196         PCRE_UTF8  option set. This turns on support for UTF-8 char-
197         acter handling in PCRE, provided that it was  compiled  with
198         this  support  enabled.  This  modifier also causes any non-
199         printing characters in output strings to  be  printed  using
200         the \x{hh...} notation if they are valid UTF-8 sequences.
201    
202    
203    CALLOUTS
204    
205         If the pattern contains  any  callout  requests,  pcretest's
206         callout function will be called. By default, it displays the
207         callout number, and the start and current positions  in  the
208         text at the callout time. For example, the output
209    
210           --->pqrabcdef
211             0    ^  ^
212    
213         indicates that callout number 0 occurred for a match attempt
214         starting at the fourth character of the subject string, when
215         the pointer was at the seventh character. The callout  func-
216         tion returns zero (carry on matching) by default.
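To make the layout of that display concrete, here is a small sketch that rebuilds the two lines from a callout number and two zero-based offsets. The column widths are inferred from the single example above, and `render_callout` is a hypothetical name, so the real program may format other cases differently.

```python
def render_callout(number, subject, start_match, current_position):
    """Return the two display lines for one callout (layout inferred from the example)."""
    width = max(start_match, current_position) + 1
    markers = [" "] * width
    markers[start_match] = "^"            # where this match attempt started
    markers[current_position] = "^"       # where the pointer is now
    return ["--->" + subject, f"{number:>3} " + "".join(markers)]


for line in render_callout(0, "pqrabcdef", 3, 6):
    print(line)
```

Offsets 3 and 6 are the fourth and seventh characters of the subject, so the output reproduces the example shown above.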
217    
218         Inserting callouts may be helpful  when  using  pcretest  to
219         check  complicated regular expressions. For further informa-
220         tion about callouts, see the pcrecallout documentation.
221    
222         For testing the PCRE library, additional control of  callout
223         behaviour  is available via escape sequences in the data, as
224         described in the following section.  In  particular,  it  is
225         possible to pass in a number as callout data (the default is
226         zero). If the callout function receives a  non-zero  number,
227         it returns that value instead of zero.
228    
229    
230    DATA LINES
231    
232         Before each data line is passed to pcre_exec(), leading  and
233         trailing whitespace is removed, and it is then scanned for \
234         escapes.  Some  of  these  are  pretty  esoteric   features,
235         intended  for  checking  out  some  of  the more complicated
236         features of PCRE. If you are just testing "ordinary" regular
237         expressions,  you probably don't need any of these. The fol-
238         lowing escapes are recognized:
239    
240           \a         alarm (= BEL)
241           \b         backspace
242           \e         escape
243           \f         formfeed
244           \n         newline
245           \r         carriage return
246           \t         tab
247           \v         vertical tab
248           \nnn       octal character (up to 3 octal digits)
249           \xhh       hexadecimal character (up to 2 hex digits)
250           \x{hh...}  hexadecimal character, any number of digits
251                        in UTF-8 mode
252           \A         pass the PCRE_ANCHORED option to pcre_exec()
253           \B         pass the PCRE_NOTBOL option to pcre_exec()
254           \Cdd       call pcre_copy_substring() for substring dd
255                        after a successful match (any decimal number
256                        less than 32)
257           \Cname     call pcre_copy_named_substring() for substring
258                        "name" after a successful match (name terminated
259                        by the next non-alphanumeric character)
260           \C+        show the current captured substrings at callout
261                        time
262    
263           \C-        do not supply a callout function
264           \C!n       return 1 instead of 0 when callout number n is
265                        reached
266           \C!n!m     return 1 instead of 0 when callout number n is
267                        reached for the mth time
268           \C*n       pass the number n (may be negative) as callout
269                        data
270           \Gdd       call pcre_get_substring() for substring dd
271                        after a successful match (any decimal number
272                        less than 32)
273           \Gname     call pcre_get_named_substring() for substring
274                        "name" after a successful match (name termin-
275                        ated by next non-alphanumeric character)
276           \L         call pcre_get_substring_list() after a
277                        successful match
278           \M         discover the minimum MATCH_LIMIT setting
279           \N         pass the PCRE_NOTEMPTY option to pcre_exec()
280           \Odd       set the size of the output vector passed to
281                        pcre_exec() to dd (any number of decimal
282                        digits)
283           \Z         pass the PCRE_NOTEOL option to pcre_exec()
284    
285         If \M is present, pcretest calls pcre_exec() several  times,
286         with  different  values  in  the  match_limit  field  of the
287         pcre_extra data structure, until it finds the minimum number
288         that is needed for pcre_exec() to complete. This number is a
289         measure of the amount of  recursion  and  backtracking  that
290         takes  place,  and  checking  it out can be instructive. For
291         most simple matches, the number is quite small, but for pat-
292         terns  with very large numbers of matching possibilities, it
293         can become large very quickly with increasing length of sub-
294         ject string.
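One way to find such a minimum is to double the limit until the match completes and then bisect. The sketch below illustrates that idea only; it is not taken from pcretest.c, whose search strategy may differ. `completes` is a hypothetical callback that stands in for a pcre_exec() call with a given match_limit, and it is assumed to be monotonic (once a limit is high enough, every higher limit is too) and to fail for a limit of zero.

```python
def minimum_match_limit(completes):
    """Smallest limit for which completes(limit) is true."""
    low, high = 0, 1
    while not completes(high):            # grow until the match can finish
        low, high = high, high * 2
    while high - low > 1:                 # completes(high) holds, completes(low) does not
        mid = (low + high) // 2
        if completes(mid):
            high = mid
        else:
            low = mid
    return high


# Stand-in for a pattern that needs a limit of at least 37 to complete.
print(minimum_match_limit(lambda limit: limit >= 37))
```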
295    
296         When \O is used, it may be higher or lower than the size set
297         by  the  -o  option (or defaulted to 45); \O applies only to
298         the call of pcre_exec() for the line in which it appears.
299    
300         A backslash followed by anything else just escapes the  any-
301         thing else. If the very last character is a backslash, it is
302         ignored. This gives a way of passing an empty line as  data,
303         since a real empty line terminates the data input.
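The scanning rules for data lines can be sketched as follows. This covers only part of the table above: the character escapes, \nnn, \xhh, and the four option escapes \A, \B, \N and \Z. The escapes that take arguments or trigger extra calls (\C, \G, \L, \M, \O) and \x{hh...} are deliberately left out. `decode_data_line` is a hypothetical helper, and details such as masking an octal value to one byte are assumptions of this sketch.

```python
SIMPLE = {"a": 7, "b": 8, "e": 27, "f": 12, "n": 10, "r": 13, "t": 9, "v": 11}
OPTIONS = {"A": "PCRE_ANCHORED", "B": "PCRE_NOTBOL",
           "N": "PCRE_NOTEMPTY", "Z": "PCRE_NOTEOL"}
OCTAL = "01234567"
HEX = "0123456789abcdefABCDEF"


def decode_data_line(line):
    """Return (subject_bytes, option_names) for a subset of the escapes."""
    text = line.strip()                   # leading and trailing whitespace is removed
    subject = bytearray()
    options = []
    i = 0
    while i < len(text):
        ch = text[i]
        i += 1
        if ch != "\\":
            subject += ch.encode("latin-1")   # assumes 8-bit input characters
            continue
        if i == len(text):
            break                         # a final backslash is ignored
        esc = text[i]
        i += 1
        if esc in SIMPLE:
            subject.append(SIMPLE[esc])
        elif esc in OCTAL:
            digits = esc
            while len(digits) < 3 and i < len(text) and text[i] in OCTAL:
                digits += text[i]
                i += 1
            subject.append(int(digits, 8) & 0xFF)   # one-byte masking is an assumption
        elif esc == "x":
            if i < len(text) and text[i] == "{":
                raise NotImplementedError(r"\x{hh...} is outside this sketch")
            digits = ""
            while len(digits) < 2 and i < len(text) and text[i] in HEX:
                digits += text[i]
                i += 1
            if not digits:
                raise ValueError(r"\x without hex digits is not handled here")
            subject.append(int(digits, 16))
        elif esc in OPTIONS:
            options.append(OPTIONS[esc])
        elif esc in "CGLMO":
            raise NotImplementedError("escape \\" + esc + " is outside this sketch")
        else:
            subject += esc.encode("latin-1")  # anything else just escapes itself
    return bytes(subject), options


print(decode_data_line(r"abc\n\x41\101\Z"))
print(decode_data_line("\\"))
```

The second call shows the empty-line case: a line holding only a backslash yields an empty subject.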
304    
305         If /P was present on the regex, causing the POSIX wrapper
306         API to be used, only \B and \Z have any effect, causing
307         REG_NOTBOL and REG_NOTEOL to be passed to regexec()
308         respectively.
309    
310         The use of \x{hh...} to represent UTF-8  characters  is  not
311         dependent  on  the use of the /8 modifier on the pattern. It
312         is recognized always. There may be any number of hexadecimal
313         digits  inside  the  braces.  The  result is from one to six
314         bytes, encoded according to the UTF-8 rules.
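The one-to-six byte range corresponds to the original UTF-8 scheme, which covers values up to 0x7FFFFFFF. The following sketch encodes a value that way; it is written by hand because Python's built-in codec stops at four bytes. `encode_utf8` is an illustrative name and is not part of pcretest.

```python
def encode_utf8(value):
    """Encode an integer as 1 to 6 bytes using the original UTF-8 scheme."""
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("value does not fit in six UTF-8 bytes")
    if value < 0x80:
        return bytes([value])             # plain ASCII, one byte
    # Largest value that fits with 1, 2, 3, 4 or 5 continuation bytes.
    limits = [0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for extra, limit in enumerate(limits, start=1):
        if value <= limit:
            break
    tail = []
    for _ in range(extra):
        tail.append(0x80 | (value & 0x3F))    # 10xxxxxx continuation byte
        value >>= 6
    lead = (0xFF << (7 - extra)) & 0xFF       # 110xxxxx, 1110xxxx, and so on
    return bytes([lead | value] + tail[::-1])


print(encode_utf8(0x20AC).hex())
print(len(encode_utf8(0x7FFFFFFF)))
```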
315    
316    
317    OUTPUT FROM PCRETEST
318    
319         When a match succeeds, pcretest outputs the list of captured
320         substrings  that pcre_exec() returns, starting with number 0
321         for the string that matched the whole pattern.  Here  is  an
322         example of an interactive pcretest run.
323    
324           $ pcretest
325           PCRE version 4.00 08-Jan-2003
326    
327             re> /^abc(\d+)/
328           data> abc123
329            0: abc123
330            1: 123
331           data> xyz
332           No match
333    
334         If the strings contain any non-printing characters, they are
335         output  as  \xhh  escapes,  or  as  \x{...} escapes if the /8
336         modifier was present on the pattern. If the pattern has  the
337         /+  modifier, then the output for substring 0 is followed by
338         the the rest of the subject string, identified by "0+"  like
339         this:
340    
341             re> /cat/+
342           data> cataract
343            0: cat
344            0+ aract
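The numbered output format in these examples can be imitated with a few lines of Python. The sketch uses Python's re module in place of PCRE, and `show_match` is a hypothetical helper; the two-character number field is inferred from the examples, and the placeholder used for a group that took no part in the match is a choice made for this sketch only.

```python
import re


def show_match(pattern, subject, show_rest=False):
    """Return output lines in the style of the examples above."""
    m = re.search(pattern, subject)
    if m is None:
        return ["No match"]
    lines = []
    for i in range(m.re.groups + 1):
        text = m.group(i)
        if text is None:
            text = "<unset>"              # placeholder chosen for this sketch
        lines.append(f"{i:2d}: {text}")
        if i == 0 and show_rest:          # what the /+ modifier asks for
            lines.append(f"{i:2d}+ {subject[m.end():]}")
    return lines


print("\n".join(show_match(r"^abc(\d+)", "abc123")))
print("\n".join(show_match("cat", "cataract", show_rest=True)))
```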
345    
346         If the pattern has the /g or /G  modifier,  the  results  of
347         successive  matching  attempts  are output in sequence, like
348         this:
349    
350             re> /\Bi(\w\w)/g
351           data> Mississippi
352            0: iss
353            1: ss
354            0: iss
355            1: ss
356            0: ipp
357            1: pp
358    
359         "No match" is output only if the first match attempt fails.
360    
361         If any of the sequences \C, \G, or \L are present in a  data
362         line  that is successfully matched, the substrings extracted
363         by the convenience functions are output  with  C,  G,  or  L
364         after the string number instead of a colon. This is in addi-
365         tion to the normal full list. The string  length  (that  is,
366         the  return  from  the  extraction  function)  is  given  in
367         parentheses after each string for \C and \G.
368    
369         Note that while patterns can be continued over several lines
370         (a  plain  ">" prompt is used for continuations), data lines
371         may not. However, newlines can be included in data by  means
372         of the \n escape.
373    
374    
375    AUTHOR
376    
377         Philip Hazel <ph10@cam.ac.uk>
378         University Computing Service,
379         Cambridge CB2 3QG, England.
380    
381    Last updated: 03 February 2003
382    Copyright (c) 1997-2003 University of Cambridge.

 This program is intended for testing PCRE, but it can also be used for
 experimenting with regular expressions.
   
 If it is given two filename arguments, it reads from the first and writes to  
 the second. If it is given only one filename argument, it reads from that file  
 and writes to stdout. Otherwise, it reads from stdin and writes to stdout, and  
 prompts for each line of input, using "re>" to prompt for regular expressions,  
 and "data>" to prompt for data lines.  
   
 The program handles any number of sets of input on a single input file. Each  
 set starts with a regular expression, and continues with any number of data  
 lines to be matched against the pattern. An empty line signals the end of the  
 data lines, at which point a new regular expression is read. The regular  
 expressions are given enclosed in any non-alphameric delimiters other than  
 backslash, for example  
   
   /(a|bc)x+yz/  
   
 White space before the initial delimiter is ignored. A regular expression may  
 be continued over several input lines, in which case the newline characters are  
 included within it. See the test input files in the testdata directory for many  
 examples. It is possible to include the delimiter within the pattern by  
 escaping it, for example  
   
   /abc\/def/  
   
 If you do so, the escape and the delimiter form part of the pattern, but since  
 delimiters are always non-alphameric, this does not affect its interpretation.  
 If the terminating delimiter is immediately followed by a backslash, for  
 example,  
   
   /abc/\  
   
 then a backslash is added to the end of the pattern. This is done to provide a  
 way of testing the error condition that arises if a pattern finishes with a  
 backslash, because  
   
   /abc\/  
   
 is interpreted as the first line of a pattern that starts with "abc/", causing  
 pcretest to read the next line as a continuation of the regular expression.  
   
 The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,  
 PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. For  
 example:  
   
   /caseless/i  
   
 These modifier letters have the same effect as they do in Perl. There are  
 others which set PCRE options that do not correspond to anything in Perl: /A,  
 /E, and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.  
   
 Searching for all possible matches within each subject string can be requested  
 by the /g or /G modifier. After finding a match, PCRE is called again to search  
 the remainder of the subject string. The difference between /g and /G is that  
 the former uses the startoffset argument to pcre_exec() to start searching at  
 a new point within the entire string (which is in effect what Perl does),  
 whereas the latter passes over a shortened substring. This makes a difference  
 to the matching process if the pattern begins with a lookbehind assertion  
 (including \b or \B).  
   
 If any call to pcre_exec() in a /g or /G sequence matches an empty string, the  
 next call is done with the PCRE_NOTEMPTY and PCRE_ANCHORED flags set in order  
 to search for another, non-empty, match at the same point. If this second match  
 fails, the start offset is advanced by one, and the normal match is retried.  
 This imitates the way Perl handles such cases when using the /g modifier or the  
 split() function.  
   
 There are a number of other modifiers for controlling the way pcretest  
 operates.  
   
 The /+ modifier requests that as well as outputting the substring that matched  
 the entire pattern, pcretest should in addition output the remainder of the  
 subject string. This is useful for tests where the subject contains multiple  
 copies of the same substring.  
   
 The /L modifier must be followed directly by the name of a locale, for example,  
   
   /pattern/Lfr  
   
 For this reason, it must be the last modifier letter. The given locale is set,  
 pcre_maketables() is called to build a set of character tables for the locale,  
 and this is then passed to pcre_compile() when compiling the regular  
 expression. Without an /L modifier, NULL is passed as the tables pointer; that  
 is, /L applies only to the expression on which it appears.  
   
 The /I modifier requests that pcretest output information about the compiled  
 expression (whether it is anchored, has a fixed first character, and so on). It  
 does this by calling pcre_fullinfo() after compiling an expression, and  
 outputting the information it gets back. If the pattern is studied, the results  
 of that are also output.  
   
 The /D modifier is a PCRE debugging feature, which also assumes /I. It causes  
 the internal form of compiled regular expressions to be output after  
 compilation.  
   
 The /S modifier causes pcre_study() to be called after the expression has been  
 compiled, and the results used when the expression is matched.  
   
 The /M modifier causes the size of memory block used to hold the compiled  
 pattern to be output.  
   
 Finally, the /P modifier causes pcretest to call PCRE via the POSIX wrapper API  
 rather than its native API. When this is done, all other modifiers except /i,  
 /m, and /+ are ignored. REG_ICASE is set if /i is present, and REG_NEWLINE is  
 set if /m is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always,  
 and PCRE_DOTALL unless REG_NEWLINE is set.  
   
 Before each data line is passed to pcre_exec(), leading and trailing whitespace  
 is removed, and it is then scanned for \ escapes. The following are recognized:  
   
   \a     alarm (= BEL)  
   \b     backspace  
   \e     escape  
   \f     formfeed  
   \n     newline  
   \r     carriage return  
   \t     tab  
   \v     vertical tab  
   \nnn   octal character (up to 3 octal digits)  
   \xhh   hexadecimal character (up to 2 hex digits)  
   
   \A     pass the PCRE_ANCHORED option to pcre_exec()  
   \B     pass the PCRE_NOTBOL option to pcre_exec()  
   \Cdd   call pcre_copy_substring() for substring dd after a successful match  
            (any decimal number less than 32)  
   \Gdd   call pcre_get_substring() for substring dd after a successful match  
            (any decimal number less than 32)  
   \L     call pcre_get_substring_list() after a successful match  
   \N     pass the PCRE_NOTEMPTY option to pcre_exec()  
   \Odd   set the size of the output vector passed to pcre_exec() to dd  
            (any number of decimal digits)  
   \Z     pass the PCRE_NOTEOL option to pcre_exec()  
   
 A backslash followed by anything else just escapes the anything else. If the  
 very last character is a backslash, it is ignored. This gives a way of passing  
 an empty line as data, since a real empty line terminates the data input.  
   
 If /P was present on the regex, causing the POSIX wrapper API to be used, only  
 \B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to  
 regexec() respectively.  
   
 When a match succeeds, pcretest outputs the list of captured substrings that  
 pcre_exec() returns, starting with number 0 for the string that matched the  
 whole pattern. Here is an example of an interactive pcretest run.  
   
   $ pcretest  
   PCRE version 2.06 08-Jun-1999  
   
     re> /^abc(\d+)/  
   data> abc123  
    0: abc123  
    1: 123  
   data> xyz  
   No match  
   
 If the strings contain any non-printing characters, they are output as \0x  
 escapes. If the pattern has the /+ modifier, then the output for substring 0 is  
 followed by the rest of the subject string, identified by "0+" like this:  
   
     re> /cat/+  
   data> cataract  
    0: cat  
    0+ aract  
   
 If the pattern has the /g or /G modifier, the results of successive matching  
 attempts are output in sequence, like this:  
   
     re> /\Bi(\w\w)/g  
   data> Mississippi  
    0: iss  
    1: ss  
    0: iss  
    1: ss  
    0: ipp  
    1: pp  
   
 "No match" is output only if the first match attempt fails.  
   
 If any of \C, \G, or \L are present in a data line that is successfully  
 matched, the substrings extracted by the convenience functions are output with  
 C, G, or L after the string number instead of a colon. This is in addition to  
 the normal full list. The string length (that is, the return from the  
 extraction function) is given in parentheses after each string for \C and \G.  
   
 Note that while patterns can be continued over several lines (a plain ">"  
 prompt is used for continuations), data lines may not. However newlines can be  
 included in data by means of the \n escape.  
   
 If the -p option is given to pcretest, it is equivalent to adding /P to each  
 regular expression: the POSIX wrapper API is used to call PCRE. None of the  
 following flags has any effect in this case.  
   
 If the option -d is given to pcretest, it is equivalent to adding /D to each  
 regular expression: the internal form is output after compilation.  
   
 If the option -i is given to pcretest, it is equivalent to adding /I to each  
 regular expression: information about the compiled pattern is given after  
 compilation.  
   
 If the option -m is given to pcretest, it outputs the size of each compiled  
 pattern after it has been compiled. It is equivalent to adding /M to each  
 regular expression. For compatibility with earlier versions of pcretest, -s is  
 a synonym for -m.  
   
 If the -t option is given, each compile, study, and match is run 20000 times  
 while being timed, and the resulting time per compile or match is output in  
 milliseconds. Do not set -t with -s, because you will then get the size output  
 20000 times and the timing will be distorted. If you want to change the number  
 of repetitions used for timing, edit the definition of LOOPREPEAT at the top of  
 pcretest.c  
   
 Philip Hazel <ph10@cam.ac.uk>  
 January 2000  
