Diff of /code/trunk/README

revision 3 by nigel, Sat Feb 24 21:38:01 2007 UTC revision 29 by nigel, Sat Feb 24 21:38:53 2007 UTC
# Line 1  Line 1 
1  README file for PCRE (Perl-compatible regular expressions)  README file for PCRE (Perl-compatible regular expressions)
2  ----------------------------------------------------------  ----------------------------------------------------------
3    
4    *******************************************************************************
5    *           IMPORTANT FOR THOSE UPGRADING FROM VERSIONS BEFORE 2.00           *
6    *                                                                             *
7    * Please note that there has been a change in the API such that a larger      *
8    * ovector is required at matching time, to provide some additional workspace. *
9    * The new man page has details. This change was necessary in order to support *
10    * some of the new functionality in Perl 5.005.                                *
11    *                                                                             *
12    *           IMPORTANT FOR THOSE UPGRADING FROM VERSION 2.00                   *
13    *                                                                             *
14    * Another (I hope this is the last!) change has been made to the API for the  *
15    * pcre_compile() function. An additional argument has been added to make it   *
16    * possible to pass over a pointer to character tables built in the current    *
17    * locale by pcre_maketables(). To use the default tables, this new argument   *
18    * should be passed as NULL.                                                   *
19    *******************************************************************************
20    
21  The distribution should contain the following files:  The distribution should contain the following files:
22    
23    ChangeLog         log of changes to the code    ChangeLog         log of changes to the code
24      LICENCE           conditions for the use of PCRE
25    Makefile          for building PCRE    Makefile          for building PCRE
   Performance       notes on performance  
26    README            this file    README            this file
27      RunTest           a shell script for running tests
28    Tech.Notes        notes on the encoding    Tech.Notes        notes on the encoding
29    pcre.3            man page for the functions    pcre.3            man page for the functions
30    pcreposix.3       man page for the POSIX wrapper API    pcreposix.3       man page for the POSIX wrapper API
31    maketables.c      auxiliary program for building chartables.c    dftables.c        auxiliary program for building chartables.c
32      get.c             )
33      maketables.c      )
34    study.c           ) source of    study.c           ) source of
35    pcre.c            )   the functions    pcre.c            )   the functions
36    pcreposix.c       )    pcreposix.c       )
# Line 21  The distribution should contain the foll Line 41  The distribution should contain the foll
41    pgrep.1           man page for pgrep    pgrep.1           man page for pgrep
42    pgrep.c           source of a grep utility that uses PCRE    pgrep.c           source of a grep utility that uses PCRE
43    perltest          Perl test program    perltest          Perl test program
44    testinput         test data, compatible with Perl    testinput         test data, compatible with Perl 5.004 and 5.005
45    testinput2        test data for error messages and non-Perl things    testinput2        test data for error messages and non-Perl things
46      testinput3        test data, compatible with Perl 5.005
47      testinput4        test data for locale-specific tests
48    testoutput        test results corresponding to testinput    testoutput        test results corresponding to testinput
49    testoutput2       test results corresponding to testinput2    testoutput2       test results corresponding to testinput2
50      testoutput3       test results corresponding to testinput3
51      testoutput4       test results corresponding to testinput4
52    
53    To build PCRE, edit Makefile for your system (it is a fairly simple make file,
54    and there are some comments at the top) and then run it. It builds two
55    libraries called libpcre.a and libpcreposix.a, a test program called pcretest,
56    and the pgrep command.
57    
58    To test PCRE, run the RunTest script in the pcre directory. This runs pcretest
59    on each of the testinput files in turn, and compares the output with the
60    contents of the corresponding testoutput file. A file called testtry is used to
61    hold the output from pcretest (which is documented below).
62    
63    To run pcretest on just one of the test files, give its number as an argument
64    to RunTest, for example:
65    
66      RunTest 3
67    
68    The first and third test files can also be fed directly into the perltest
69    program to check that Perl gives the same results. The third file requires the
70    additional features of release 5.005, which is why it is kept separate from the
71    main test input, which needs only Perl 5.004. In the long run, when 5.005 is
72    widespread, these two test files may get amalgamated.
73    
74    The second set of tests checks pcre_info(), pcre_study(), pcre_copy_substring(),
75    pcre_get_substring(), pcre_get_substring_list(), error detection and run-time
76    flags that are specific to PCRE, as well as the POSIX wrapper API.
77    
78    The fourth set of tests checks pcre_maketables(), the facility for building a
79    set of character tables for a specific locale and using them instead of the
80    default tables. The tests make use of the "fr" (French) locale. Before running
81    the test, the script checks for the presence of this locale by running the
82    "locale" command. If that command fails, or if it doesn't include "fr" in the
83    list of available locales, the fourth test cannot be run, and a comment is
84    output to say why. If running this test produces instances of the error
85    
86      ** Failed to set locale "fr"
87    
88  To build PCRE, edit Makefile for your system (it is a fairly simple make file)  in the comparison output, it means that locale is not available on your system,
89  and then run it. It builds a two libraries called libpcre.a and libpcreposix.a,  despite being listed by "locale". This does not mean that PCRE is broken.
 a test program called pcretest, and the pgrep command.  
   
 To test PCRE, run pcretest on the file testinput, and compare the output with  
 the contents of testoutput. There should be no differences. For example:  
   
   pcretest testinput /tmp/anything  
   diff /tmp/anything testoutput  
   
 Do the same with testinput2, comparing the output with testoutput2, but this  
 time using the -i flag for pcretest, i.e.  
   
   pcretest -i testinput2 /tmp/anything  
   diff /tmp/anything testoutput2  
   
 There are two sets of tests because the first set can also be fed directly into  
 the perltest program to check that Perl gives the same results. The second set  
 of tests check pcre_info(), pcre_study(), error detection and run-time flags  
 that are specific to PCRE, as well as the POSIX wrapper API.  
90    
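The locale check described for the fourth set of tests can be sketched in C as
follows. This is only an illustration: the helper name try_locale is
hypothetical, and the use of the LC_CTYPE category is an assumption rather than
something stated in the README.

```c
#include <locale.h>
#include <stdio.h>

/* Hypothetical helper: try to select the named locale for character
   classification. Returns 1 on success. On failure it prints a message in
   the same form as the one quoted in the text and returns 0.
   Assumption: LC_CTYPE is the relevant locale category. */
int try_locale(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL) {
        printf("** Failed to set locale \"%s\"\n", name);
        return 0;
    }
    return 1;
}
```

A caller would typically try the locale first and build locale-specific tables
only if the call succeeds, falling back to the default tables otherwise.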
91  To install PCRE, copy libpcre.a to any suitable library directory (e.g.  To install PCRE, copy libpcre.a to any suitable library directory (e.g.
92  /usr/local/lib), pcre.h to any suitable include directory (e.g.  /usr/local/lib), pcre.h to any suitable include directory (e.g.
# Line 63  themselves still follow Perl syntax and Line 104  themselves still follow Perl syntax and
104  for the POSIX-style functions is called pcreposix.h. The official POSIX name is  for the POSIX-style functions is called pcreposix.h. The official POSIX name is
105  regex.h, but I didn't want to risk possible problems with existing files of  regex.h, but I didn't want to risk possible problems with existing files of
106  that name by distributing it that way. To use it with an existing program that  that name by distributing it that way. To use it with an existing program that
107  uses the POSIX API it will have to be renamed or pointed at by a link.  uses the POSIX API, it will have to be renamed or pointed at by a link.
108    
109    
110  Character tables  Character tables
111  ----------------  ----------------
112    
113  PCRE uses four tables for manipulating and identifying characters. These are  PCRE uses four tables for manipulating and identifying characters. The final
114  compiled from a source file called chartables.c. This is not supplied in  argument of the pcre_compile() function is a pointer to a block of memory
115  the distribution, but is built by the program maketables (compiled from  containing the concatenated tables. A call to pcre_maketables() is used to
116  maketables.c), which uses the ANSI C character handling functions such as  generate a set of tables in the current locale. However, if the final argument
117  isalnum(), isalpha(), isupper(), islower(), etc. to build the table sources.  is passed as NULL, a set of default tables that is built into the binary is
118  This means that the default C locale set in your system may affect the contents  used.
119  of the tables. You can change the tables by editing chartables.c and then  
120  re-building PCRE. If you do this, you should probably also edit Makefile to  The source file called chartables.c contains the default set of tables. This is
121  ensure that the file doesn't ever get re-generated.  not supplied in the distribution, but is built by the program dftables
122    (compiled from dftables.c), which uses the ANSI C character handling functions
123  The first two tables pcre_lcc[] and pcre_fcc[] provide lower casing and a  such as isalnum(), isalpha(), isupper(), islower(), etc. to build the table
124  case flipping functions, respectively. The pcre_cbits[] table consists of four  sources. This means that the default C locale set in your system will control the
125  32-byte bit maps which identify digits, letters, "word" characters, and white  contents of the tables. You can change the default tables by editing
126  space, respectively. These are used when building 32-byte bit maps that  chartables.c and then re-building PCRE. If you do this, you should probably
127  represent character classes.  also edit Makefile to ensure that the file doesn't ever get re-generated.
128    
129    The first two 256-byte tables provide lower casing and case flipping functions,
130    respectively. The next table consists of three 32-byte bit maps which identify
131    digits, "word" characters, and white space, respectively. These are used when
132    building 32-byte bit maps that represent character classes.
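As a rough illustration of the layout described above, the following C sketch
builds two 256-byte tables and three 32-byte bit maps with the ANSI C character
handling functions. The names build_tables and in_class, the offset macros, and
the exact ordering are hypothetical and chosen only for this example; the real
tables are whatever dftables.c and pcre_maketables() produce. Treating "word"
characters as letters, digits, and underscore is also an assumption.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical layout, for illustration only. */
#define LCC_OFFSET    0      /* 256-byte lower casing table */
#define FCC_OFFSET    256    /* 256-byte case flipping table */
#define CBITS_OFFSET  512    /* three 32-byte bit maps follow */
#define CBIT_DIGIT    0      /* offsets of each map within the bit maps */
#define CBIT_WORD     32
#define CBIT_SPACE    64
#define TABLES_LENGTH (CBITS_OFFSET + 96)

/* Build the tables in the current locale. The caller frees the result. */
unsigned char *build_tables(void)
{
    unsigned char *t = malloc(TABLES_LENGTH);
    unsigned char *bits;
    int c;

    if (t == NULL) return NULL;
    memset(t, 0, TABLES_LENGTH);
    bits = t + CBITS_OFFSET;

    for (c = 0; c < 256; c++) {
        unsigned char bit = (unsigned char)(1 << (c & 7));
        t[LCC_OFFSET + c] = (unsigned char)tolower(c);
        t[FCC_OFFSET + c] =
            (unsigned char)(islower(c) ? toupper(c) : tolower(c));
        if (isdigit(c)) bits[CBIT_DIGIT + c / 8] |= bit;
        /* Assumption: "word" means alphanumeric or underscore. */
        if (isalnum(c) || c == '_') bits[CBIT_WORD + c / 8] |= bit;
        if (isspace(c)) bits[CBIT_SPACE + c / 8] |= bit;
    }
    return t;
}

/* Return non-zero if character c is set in the bit map at offset map. */
int in_class(const unsigned char *t, int map, int c)
{
    return (t[CBITS_OFFSET + map + c / 8] & (1 << (c & 7))) != 0;
}
```

Each bit map uses one bit per character value, so 256 characters fit in 32
bytes; a character class can then be assembled by OR-ing maps together.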
133    
134  The pcre_ctypes[] table has bits indicating various character types, as  The final 256-byte table has bits indicating various character types, as
135  follows:  follows:
136    
137      1   white space character      1   white space character
# Line 114  The program handles any number of sets o Line 160  The program handles any number of sets o
160  set starts with a regular expression, and continues with any number of data  set starts with a regular expression, and continues with any number of data
161  lines to be matched against the pattern. An empty line signals the end of the  lines to be matched against the pattern. An empty line signals the end of the
162  set. The regular expressions are given enclosed in any non-alphameric  set. The regular expressions are given enclosed in any non-alphameric
163  delimiters, for example  delimiters other than backslash, for example
164    
165    /(a|bc)x+yz/    /(a|bc)x+yz/
166    
167  and may be followed by i, m, s, or x to set the PCRE_CASELESS, PCRE_MULTILINE,  White space before the initial delimiter is ignored. A regular expression may
168  PCRE_DOTALL, or PCRE_EXTENDED options, respectively. These options have the  be continued over several input lines, in which case the newline characters are
169  same effect as they do in Perl.  included within it. See the testinput files for many examples. It is possible
170    to include the delimiter within the pattern by escaping it, for example
171    
172      /abc\/def/
173    
174    If you do so, the escape and the delimiter form part of the pattern, but since
175    delimiters are always non-alphameric, this does not affect its interpretation.
176    If the terminating delimiter is immediately followed by a backslash, for
177    example,
178    
179      /abc/\
180    
181    then a backslash is added to the end of the pattern. This provides a way of
182    testing the error condition that arises if a pattern finishes with a backslash,
183    because
184    
185      /abc\/
186    
187    is interpreted as the first line of a pattern that starts with "abc/", causing
188    pcretest to read the next line as a continuation of the regular expression.
189    
190    The pattern may be followed by i, m, s, or x to set the PCRE_CASELESS,
191    PCRE_MULTILINE, PCRE_DOTALL, or PCRE_EXTENDED options, respectively. These
192    options have the same effect as they do in Perl.
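The delimiter rules above can be sketched as a small C function that splits one
input line into a pattern and its trailing option letters. This is not the code
in pcretest.c; the function name split_pattern, its return codes, and the
buffer handling are hypothetical and exist only to make the rules concrete.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: split one input line into pattern and options.
   Returns 0 on success, 1 if the closing delimiter was not found (a caller
   would then read a continuation line), and -1 for an invalid delimiter or
   a buffer that is too small. */
int split_pattern(const char *line, char *pat, size_t patsize,
                  char *opts, size_t optsize)
{
    size_t n = 0, k = 0;
    char delim;

    if (patsize == 0 || optsize == 0) return -1;

    /* White space before the initial delimiter is ignored. */
    while (isspace((unsigned char)*line)) line++;

    /* The delimiter is any non-alphameric character except backslash. */
    delim = *line;
    if (delim == 0 || delim == '\\' || isalnum((unsigned char)delim))
        return -1;
    line++;

    while (*line != 0 && *line != delim) {
        /* An escape and the character after it both stay in the pattern. */
        if (*line == '\\' && line[1] != 0) {
            if (n + 1 >= patsize) return -1;
            pat[n++] = *line++;
        }
        if (n + 1 >= patsize) return -1;
        pat[n++] = *line++;
    }
    pat[n] = 0;
    if (*line == 0) return 1;           /* no closing delimiter on this line */
    line++;                             /* skip the closing delimiter */

    /* A backslash right after the delimiter is appended to the pattern. */
    if (*line == '\\') {
        if (n + 1 >= patsize) return -1;
        pat[n++] = *line++;
        pat[n] = 0;
    }

    /* Whatever follows, up to white space, is the option string. */
    while (*line != 0 && !isspace((unsigned char)*line)) {
        if (k + 1 >= optsize) return -1;
        opts[k++] = *line++;
    }
    opts[k] = 0;
    return 0;
}
```

With this sketch, the line /abc/\ yields a pattern ending in a backslash,
while /abc\/ returns 1 because the escaped delimiter does not terminate the
pattern.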
193    
194  There are also some upper case options that do not match Perl options: /A, /E,  There are also some upper case options that do not match Perl options: /A, /E,
195  and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.  and /X set PCRE_ANCHORED, PCRE_DOLLAR_ENDONLY, and PCRE_EXTRA respectively.
196  The /D option is a PCRE debugging feature. It causes the internal form of  
197  compiled regular expressions to be output after compilation. The /S option  The /L option must be followed directly by the name of a locale, for example,
198  causes pcre_study() to be called after the expression has been compiled, and  
199  the results used when the expression is matched. If /I is present as well as    /pattern/Lfr
200  /S, then pcre_study() is called with the PCRE_CASELESS option.  
201    For this reason, it must be the last option letter. The given locale is set,
202    pcre_maketables() is called to build a set of character tables for the locale,
203    and this is then passed to pcre_compile() when compiling the regular
204    expression. Without an /L option, NULL is passed as the tables pointer; that
205    is, /L applies only to the expression on which it appears.
206    
207    The /I option requests that pcretest output information about the compiled
208    expression (whether it is anchored, has a fixed first character, and so on). It
209    does this by calling pcre_info() after compiling an expression, and outputting
210    the information it gets back. If the pattern is studied, the results of that
211    are also output.
212    
213    The /D option is a PCRE debugging feature, which also assumes /I. It causes the
214    internal form of compiled regular expressions to be output after compilation.
215    
216    The /S option causes pcre_study() to be called after the expression has been
217    compiled, and the results used when the expression is matched.
218    
219  Finally, the /P option causes pcretest to call PCRE via the POSIX wrapper API  Finally, the /P option causes pcretest to call PCRE via the POSIX wrapper API
220  rather than its native API. When this is done, all other options except /i and  rather than its native API. When this is done, all other options except /i and
# Line 136  rather than its native API. When this is Line 222  rather than its native API. When this is
222  is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and  is present. The wrapper functions force PCRE_DOLLAR_ENDONLY always, and
223  PCRE_DOTALL unless REG_NEWLINE is set.  PCRE_DOTALL unless REG_NEWLINE is set.
224    
 A regular expression can extend over several lines of input; the newlines are  
 included in it. See the testinput file for many examples.  
   
225  Before each data line is passed to pcre_exec(), leading and trailing whitespace  Before each data line is passed to pcre_exec(), leading and trailing whitespace
226  is removed, and it is then scanned for \ escapes. The following are recognized:  is removed, and it is then scanned for \ escapes. The following are recognized:
227    
# Line 155  is removed, and it is then scanned for \ Line 238  is removed, and it is then scanned for \
238    
239    \A     pass the PCRE_ANCHORED option to pcre_exec()    \A     pass the PCRE_ANCHORED option to pcre_exec()
240    \B     pass the PCRE_NOTBOL option to pcre_exec()    \B     pass the PCRE_NOTBOL option to pcre_exec()
241    \E     pass the PCRE_DOLLAR_ENDONLY option to pcre_exec()    \Cdd   call pcre_copy_substring() for substring dd after a successful match
242    \I     pass the PCRE_CASELESS option to pcre_exec()             (any decimal number less than 32)
243    \M     pass the PCRE_MULTILINE option to pcre_exec()    \Gdd   call pcre_get_substring() for substring dd after a successful match
244    \S     pass the PCRE_DOTALL option to pcre_exec()             (any decimal number less than 32)
245      \L     call pcre_get_substring_list() after a successful match
246    \Odd   set the size of the output vector passed to pcre_exec() to dd    \Odd   set the size of the output vector passed to pcre_exec() to dd
247             (any number of decimal digits)             (any number of decimal digits)
248    \Z     pass the PCRE_NOTEOL option to pcre_exec()    \Z     pass the PCRE_NOTEOL option to pcre_exec()
# Line 171  If /P was present on the regex, causing Line 255  If /P was present on the regex, causing
255  \B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to  \B, and \Z have any effect, causing REG_NOTBOL and REG_NOTEOL to be passed to
256  regexec() respectively.  regexec() respectively.
257    
258  When a match succeeds, pcretest outputs the list of identified substrings that  When a match succeeds, pcretest outputs the list of captured substrings that
259  pcre_exec() returns, starting with number 0 for the string that matched the  pcre_exec() returns, starting with number 0 for the string that matched the
260  whole pattern. Here is an example of an interactive pcretest run.  whole pattern. Here is an example of an interactive pcretest run.
261    
# Line 179  whole pattern. Here is an example of an Line 263  whole pattern. Here is an example of an
263    Testing Perl-Compatible Regular Expressions    Testing Perl-Compatible Regular Expressions
264    PCRE version 0.90 08-Sep-1997    PCRE version 0.90 08-Sep-1997
265    
266        re> /^abc(\d+)/      re> /^abc(\d+)/
267      data> abc123    data> abc123
268     0: abc123      0: abc123
269     1: 123      1: 123
270      data> xyz    data> xyz
271    No match    No match
272    
273    If any of \C, \G, or \L are present in a data line that is successfully
274    matched, the substrings extracted by the convenience functions are output with
275    C, G, or L after the string number instead of a colon. This is in addition to
276    the normal full list. The string length (that is, the return from the
277    extraction function) is given in parentheses after each string for \C and \G.
278    
279  Note that while patterns can be continued over several lines (a plain ">"  Note that while patterns can be continued over several lines (a plain ">"
280  prompt is used for continuations), data lines may not. However newlines can be  prompt is used for continuations), data lines may not. However newlines can be
281  included in data by means of the \n escape.  included in data by means of the \n escape.
# Line 197  following flags has any effect in this c Line 287  following flags has any effect in this c
287  If the option -d is given to pcretest, it is equivalent to adding /D to each  If the option -d is given to pcretest, it is equivalent to adding /D to each
288  regular expression: the internal form is output after compilation.  regular expression: the internal form is output after compilation.
289    
290  If the option -i (for "information") is given to pcretest, it calls pcre_info()  If the option -i is given to pcretest, it is equivalent to adding /I to each
291  after compiling an expression, and outputs the information it gets back. If the  regular expression: information about the compiled pattern is given after
292  pattern is studied, the results of that are also output.  compilation.
293    
294  If the option -s is given to pcretest, it outputs the size of each compiled  If the option -s is given to pcretest, it outputs the size of each compiled
295  pattern after it has been compiled.  pattern after it has been compiled.
296    
297  If the -t option is given, each compile, study, and match is run 2000 times  If the -t option is given, each compile, study, and match is run 20000 times
298  while being timed, and the resulting time per compile or match is output in  while being timed, and the resulting time per compile or match is output in
299  milliseconds. Do not set -t with -s, because you will then get the size output  milliseconds. Do not set -t with -s, because you will then get the size output
300  2000 times and the timing will be distorted.  20000 times and the timing will be distorted. If you want to change the number
301    of repetitions used for timing, edit the definition of LOOPREPEAT at the top of
302    pcretest.c.
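The timing scheme described for -t amounts to running an operation LOOPREPEAT
times and dividing the elapsed time by the repetition count. The sketch below
shows the idea with the standard clock() function; the use of clock(), the
function name time_per_call_ms, and the sample workload count_call are
assumptions made for illustration, not a copy of pcretest.c.

```c
#include <time.h>

#define LOOPREPEAT 20000   /* repetition count quoted in the text */

/* Run fn(arg) LOOPREPEAT times and return the mean time per call in
   milliseconds, measured with clock() (processor time). */
double time_per_call_ms(void (*fn)(void *), void *arg)
{
    clock_t start = clock();
    int i;

    for (i = 0; i < LOOPREPEAT; i++) fn(arg);

    return ((double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC) / LOOPREPEAT;
}

/* Hypothetical workload: counts how many times it has been called. */
void count_call(void *arg)
{
    (*(long *)arg)++;
}
```

Anything the timed function prints is also repeated LOOPREPEAT times, which is
why combining -t with -s distorts the measurement.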
303    
304    
305    
# Line 216  The perltest program Line 308  The perltest program
308    
309  The perltest program tests Perl's regular expressions; it has the same  The perltest program tests Perl's regular expressions; it has the same
310  specification as pcretest, and so can be given identical input, except that  specification as pcretest, and so can be given identical input, except that
311  input patterns can be followed only by Perl's lower case options.  input patterns can be followed only by Perl's lower case options. The contents
312    of testinput and testinput3 meet this condition.
313    
314  The data lines are processed as Perl strings, so if they contain $ or @  The data lines are processed as Perl strings, so if they contain $ or @
315  characters, these have to be escaped. For this reason, all such characters in  characters, these have to be escaped. For this reason, all such characters in
# Line 225  for pcretest, and the special upper case Line 318  for pcretest, and the special upper case
318  recognizes are not used in this file. The output should be identical, apart  recognizes are not used in this file. The output should be identical, apart
319  from the initial identifying banner.  from the initial identifying banner.
320    
321  The testinput2 file is not suitable for feeding to Perltest, since it does  The testinput2 and testinput4 files are not suitable for feeding to Perltest,
322  make use of the special upper case options and escapes that pcretest uses to  since they do make use of the special upper case options and escapes that
323  test additional features of PCRE.  pcretest uses to test some features of PCRE. The first of these files also
324    contains malformed regular expressions, in order to check that PCRE diagnoses
325    them correctly.
326    
327  Philip Hazel <ph10@cam.ac.uk>  Philip Hazel <ph10@cam.ac.uk>
328  October 1997  February 1999
