
Diff of /code/trunk/doc/pcre.txt

revision 73 by nigel, Sat Feb 24 21:40:30 2007 UTC | revision 488 by ph10, Mon Jan 11 15:29:42 2010 UTC
# Line 1  Line 1 
1    -----------------------------------------------------------------------------
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
5  synopses of each function in the library have not been included. There are  synopses of each function in the library have not been included. Neither has
6  separate text files for the pcregrep and pcretest commands.  the pcredemo program. There are separate text files for the pcregrep and
7    pcretest commands.
8  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
9    
 PCRE(3)                                                                PCRE(3)  
10    
11    PCRE(3)                                                                PCRE(3)
12    
13    
14  NAME  NAME
15         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
16    
17  DESCRIPTION  
18    INTRODUCTION
19    
20         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
21         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
22         just  a  few  differences.  The current implementation of PCRE (release         just  a few differences. Some features that appeared in Python and PCRE
23         4.x) corresponds approximately with Perl  5.8,  including  support  for         before they appeared in Perl are also available using the  Python  syn-
24         UTF-8  encoded  strings.   However,  this  support has to be explicitly         tax,  there  is  some  support for one or two .NET and Oniguruma syntax
25         enabled; it is not the default.         items, and there is an option for requesting some  minor  changes  that
26           give better JavaScript compatibility.
27         PCRE is written in C and released as a C library. However, a number  of  
28         people  have  written  wrappers  and interfaces of various kinds. A C++         The  current implementation of PCRE corresponds approximately with Perl
29         class is included in these contributions, which can  be  found  in  the         5.10, including support for UTF-8 encoded strings and  Unicode  general
30           category  properties.  However,  UTF-8  and  Unicode  support has to be
31           explicitly enabled; it is not the default. The  Unicode  tables  corre-
32           spond to Unicode release 5.1.
33    
34           In  addition to the Perl-compatible matching function, PCRE contains an
35           alternative function that matches the same compiled patterns in a  dif-
36           ferent way. In certain circumstances, the alternative function has some
37           advantages.  For a discussion of the two matching algorithms,  see  the
38           pcrematching page.
39    
40           PCRE  is  written  in C and released as a C library. A number of people
41           have written wrappers and interfaces of various kinds.  In  particular,
42           Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
43           included as part of the PCRE distribution. The pcrecpp page has details
44           of  this  interface.  Other  people's contributions can be found in the
45         Contrib directory at the primary FTP site, which is:         Contrib directory at the primary FTP site, which is:
46    
47         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
48    
49         Details  of  exactly which Perl regular expression features are and are         Details of exactly which Perl regular expression features are  and  are
50         not supported by PCRE are given in separate documents. See the pcrepat-         not supported by PCRE are given in separate documents. See the pcrepat-
51         tern and pcrecompat pages.         tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
52           page.
53    
54         Some  features  of  PCRE can be included, excluded, or changed when the         Some  features  of  PCRE can be included, excluded, or changed when the
55         library is built. The pcre_config() function makes it  possible  for  a         library is built. The pcre_config() function makes it  possible  for  a
56         client  to  discover  which features are available. Documentation about         client  to  discover  which  features are available. The features them-
57         building PCRE for various operating systems can be found in the  README         selves are described in the pcrebuild page. Documentation about  build-
58         file in the source distribution.         ing  PCRE  for various operating systems can be found in the README and
59           NON-UNIX-USE files in the source distribution.
60    
61           The library contains a number of undocumented  internal  functions  and
62           data  tables  that  are  used by more than one of the exported external
63           functions, but which are not intended  for  use  by  external  callers.
64           Their  names  all begin with "_pcre_", which hopefully will not provoke
65           any name clashes. In some environments, it is possible to control which
66           external  symbols  are  exported when a shared library is built, and in
67           these cases the undocumented symbols are not exported.
68    
69    
70  USER DOCUMENTATION  USER DOCUMENTATION
71    
72         The user documentation for PCRE has been split up into a number of dif-         The user documentation for PCRE comprises a number  of  different  sec-
73         ferent sections. In the "man" format, each of these is a separate  "man         tions.  In the "man" format, each of these is a separate "man page". In
74         page".  In  the  HTML  format, each is a separate page, linked from the         the HTML format, each is a separate page, linked from the  index  page.
75         index page. In the plain text format, all  the  sections  are  concate-         In  the  plain  text format, all the sections, except the pcredemo sec-
76         nated, for ease of searching. The sections are as follows:         tion, are concatenated, for ease of searching. The sections are as fol-
77           lows:
78    
79           pcre              this document           pcre              this document
80           pcreapi           details of PCRE's native API           pcre-config       show PCRE installation configuration information
81             pcreapi           details of PCRE's native C API
82           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
83           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
84           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
85             pcrecpp           details of the C++ wrapper
86             pcredemo          a demonstration C program that uses PCRE
87           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
88             pcrematching      discussion of the two matching algorithms
89             pcrepartial       details of the partial matching facility
90           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
91                               regular expressions                               regular expressions
92           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
93           pcreposix         the POSIX-compatible API           pcreposix         the POSIX-compatible C API
94           pcresample        discussion of the sample program           pcreprecompile    details of saving and re-using precompiled patterns
95           pcretest          the pcretest testing command           pcresample        discussion of the pcredemo program
96             pcrestack         discussion of stack usage
97             pcresyntax        quick syntax reference
98             pcretest          description of the pcretest testing command
99    
100         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
101         each library function, listing its arguments and results.         each C library function, listing its arguments and results.
102    
103    
104  LIMITATIONS  LIMITATIONS
# Line 74  LIMITATIONS Line 111  LIMITATIONS
111         process  regular  expressions  that are truly enormous, you can compile         process  regular  expressions  that are truly enormous, you can compile
112         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
113         the  source  distribution and the pcrebuild documentation for details).         the  source  distribution and the pcrebuild documentation for details).
114         If these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
115         of execution will be slower.         of execution is slower.
116    
117         All values in repeating quantifiers must be less than 65536.  The maxi-         All values in repeating quantifiers must be less than 65536.
        mum number of capturing subpatterns is 65535.  
118    
119         There is no limit to the number of non-capturing subpatterns,  but  the         There is no limit to the number of parenthesized subpatterns, but there
120         maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,         can be no more than 65535 capturing subpatterns.
        including capturing subpatterns, assertions, and other types of subpat-  
        tern, is 200.  
121    
122         The  maximum  length of a subject string is the largest positive number         The maximum length of name for a named subpattern is 32 characters, and
123         that an integer variable can hold. However, PCRE uses recursion to han-         the maximum number of named subpatterns is 10000.
        dle  subpatterns  and indefinite repetition. This means that the avail-  
        able stack space may limit the size of a subject  string  that  can  be  
        processed by certain patterns.  
124    
125           The  maximum  length of a subject string is the largest positive number
126           that an integer variable can hold. However, when using the  traditional
127           matching function, PCRE uses recursion to handle subpatterns and indef-
128           inite repetition.  This means that the available stack space may  limit
129           the size of a subject string that can be processed by certain patterns.
130           For a discussion of stack issues, see the pcrestack documentation.
131    
 UTF-8 SUPPORT  
132    
133         Starting  at  release  3.3,  PCRE  has  had  some support for character  UTF-8 AND UNICODE PROPERTY SUPPORT
134         strings encoded in the UTF-8 format. For  release  4.0  this  has  been  
135         greatly extended to cover most common requirements.         From release 3.3, PCRE has  had  some  support  for  character  strings
136           encoded  in the UTF-8 format. For release 4.0 this was greatly extended
137           to cover most common requirements, and in release 5.0  additional  sup-
138           port for Unicode general category properties was added.
139    
140         In order to process UTF-8 strings, you must build PCRE to include UTF-8         In order to process UTF-8 strings, you must build PCRE to include UTF-8
141         support in the code, and, in addition,  you  must  call  pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
142         with  the PCRE_UTF8 option flag. When you do this, both the pattern and         with  the  PCRE_UTF8  option  flag,  or the pattern must start with the
143         any subject strings that are matched against it are  treated  as  UTF-8         sequence (*UTF8). When either of these is the case,  both  the  pattern
144         strings instead of just strings of bytes.         and  any  subject  strings  that  are matched against it are treated as
145           UTF-8 strings instead of strings of 1-byte characters.
146         If  you compile PCRE with UTF-8 support, but do not use it at run time,  
147         the library will be a bit bigger, but the additional run time  overhead         If you compile PCRE with UTF-8 support, but do not use it at run  time,
148         is  limited  to testing the PCRE_UTF8 flag in several places, so should         the  library will be a bit bigger, but the additional run time overhead
149         not be very large.         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
150           very big.
151         The following comments apply when PCRE is running in UTF-8 mode:  
152           If PCRE is built with Unicode character property support (which implies
153         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and         UTF-8 support), the escape sequences \p{..}, \P{..}, and  \X  are  sup-
154         subjects  are  checked for validity on entry to the relevant functions.         ported.  The available properties that can be tested are limited to the
155         If an invalid UTF-8 string is passed, an error return is given. In some         general category properties such as Lu for an upper case letter  or  Nd
156         situations,  you  may  already  know  that  your strings are valid, and         for  a  decimal number, the Unicode script names such as Arabic or Han,
157         therefore want to skip these checks in order to improve performance. If         and the derived properties Any and L&. A full  list  is  given  in  the
158         you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,         pcrepattern documentation. Only the short names for properties are sup-
159         PCRE assumes that the pattern or subject  it  is  given  (respectively)         ported. For example, \p{L} matches a letter. Its Perl synonym,  \p{Let-
160         contains  only valid UTF-8 codes. In this case, it does not diagnose an         ter},  is  not  supported.   Furthermore,  in Perl, many properties may
161         invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when         optionally be prefixed by "Is", for compatibility with Perl  5.6.  PCRE
162         PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may         does not support this.
163         crash.  
164       Validity of UTF-8 strings
165         2. In a pattern, the escape sequence \x{...}, where the contents of the  
166         braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8         When  you  set  the  PCRE_UTF8 flag, the strings passed as patterns and
167         character whose code number is the given hexadecimal number, for  exam-         subjects are (by default) checked for validity on entry to the relevant
168         ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,         functions. From release 7.3 of PCRE, the check is according to the rules
169         the item is not recognized.  This escape sequence can be used either as         of RFC 3629, which are themselves derived from the  Unicode  specifica-
170         a literal, or within a character class.         tion.  Earlier  releases  of PCRE followed the rules of RFC 2279, which
171           allows the full range of 31-bit values (0 to 0x7FFFFFFF).  The  current
172           check allows only values in the range U+0 to U+10FFFF, excluding U+D800
173           to U+DFFF.
174    
175           The excluded code points are the "Low Surrogate Area"  of  Unicode,  of
176           which  the Unicode Standard says this: "The Low Surrogate Area does not
177           contain any  character  assignments,  consequently  no  character  code
178           charts or namelists are provided for this area. Surrogates are reserved
179           for use with UTF-16 and then must be used in pairs."  The  code  points
180           that  are  encoded  by  UTF-16  pairs are available as independent code
181           points in the UTF-8 encoding. (In  other  words,  the  whole  surrogate
182           thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
183    
184           If  an  invalid  UTF-8  string  is  passed  to  PCRE,  an  error return
185           (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
186           that your strings are valid, and therefore want to skip these checks in
187           order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
188           compile  time  or at run time, PCRE assumes that the pattern or subject
189           it is given (respectively) contains only valid  UTF-8  codes.  In  this
190           case, it does not diagnose an invalid UTF-8 string.
191    
192           If  you  pass  an  invalid UTF-8 string when PCRE_NO_UTF8_CHECK is set,
193           what happens depends on why the string is invalid. If the  string  con-
194           forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
195           string of characters in the range 0  to  0x7FFFFFFF.  In  other  words,
196           apart from the initial validity test, PCRE (when in UTF-8 mode) handles
197           strings according to the more liberal rules of RFC  2279.  However,  if
198           the  string does not even conform to RFC 2279, the result is undefined.
199           Your program may crash.
200    
201           If you want to process strings  of  values  in  the  full  range  0  to
202           0x7FFFFFFF,  encoded in a UTF-8-like manner as per the old RFC, you can
203           set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
204           this situation, you will have to apply your own validity check.
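
The two validity definitions described above can be compared with a small stand-alone check. What follows is only a sketch under stated assumptions: utf8_is_valid() is a hypothetical helper written for this illustration, not a PCRE function, and it is not the code PCRE itself uses. With a non-zero "strict" argument it applies the RFC 3629 limits quoted above (U+0 to U+10FFFF, excluding U+D800 to U+DFFF); with zero it accepts the full RFC 2279 range of 0 to 0x7FFFFFFF. Rejecting overlong encodings in both modes is a choice made for the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: utf8_is_valid() is a hypothetical helper, not part of PCRE.
 * strict != 0: RFC 3629 limits (U+0 to U+10FFFF, no U+D800 to U+DFFF).
 * strict == 0: older RFC 2279 range (0 to 0x7FFFFFFF, up to 6 bytes).
 * Returns 1 for a well-formed buffer and 0 otherwise. */
static int utf8_is_valid(const unsigned char *s, size_t len, int strict)
{
    /* Smallest value that genuinely needs 0..5 continuation bytes. */
    static const unsigned long min_value[6] = {
        0x0UL, 0x80UL, 0x800UL, 0x10000UL, 0x200000UL, 0x4000000UL
    };
    size_t i = 0;

    assert(s != NULL || len == 0);
    while (i < len) {
        unsigned char c = s[i++];
        unsigned long cp;
        size_t extra, k;

        if (c < 0x80) continue;                        /* one-byte character */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; }
        else if ((c & 0xFC) == 0xF8) { extra = 4; cp = c & 0x03; }
        else if ((c & 0xFE) == 0xFC) { extra = 5; cp = c & 0x01; }
        else return 0;               /* stray continuation byte, 0xFE or 0xFF */

        if (extra > len - i) return 0;                 /* truncated sequence */
        for (k = 0; k < extra; k++) {
            unsigned char d = s[i++];
            if ((d & 0xC0) != 0x80) return 0;          /* not of form 10xxxxxx */
            cp = (cp << 6) | (unsigned long)(d & 0x3F);
        }
        if (cp < min_value[extra]) return 0;           /* overlong encoding */
        if (strict && (cp > 0x10FFFFUL ||
                       (cp >= 0xD800UL && cp <= 0xDFFFUL)))
            return 0;                                  /* outside RFC 3629 */
    }
    return 1;
}
```

A caller that intends to set PCRE_NO_UTF8_CHECK could run a check of this kind once on each string beforehand, choosing the strict or the liberal mode according to which range of values it wants to allow.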
205    
206       General comments about UTF-8 mode
207    
208         3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte         1.  An  unbraced  hexadecimal  escape sequence (such as \xb3) matches a
209         UTF-8 character if the value is greater than 127.         two-byte UTF-8 character if the value is greater than 127.
210    
211         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-         2. Octal numbers up to \777 are recognized, and  match  two-byte  UTF-8
212           characters for values greater than \177.
213    
214           3.  Repeat quantifiers apply to complete UTF-8 characters, not to indi-
215         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
216    
217         5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a         4. The dot metacharacter matches one UTF-8 character instead of a  sin-
218         single byte.         gle byte.
219    
220         6. The escape sequence \C can be used to match a single byte  in  UTF-8         5.  The  escape sequence \C can be used to match a single byte in UTF-8
221         mode, but its use can lead to some strange effects.         mode, but its use can lead to some strange effects.  This  facility  is
222           not available in the alternative matching function, pcre_dfa_exec().
223    
224         7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly         6.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly
225         test characters of any code value, but the characters that PCRE  recog-         test characters of any code value, but the characters that PCRE  recog-
226         nizes  as  digits,  spaces,  or  word characters remain the same set as         nizes  as  digits,  spaces,  or  word characters remain the same set as
227         before, all with values less than 256.         before, all with values less than 256. This remains true even when PCRE
228           includes  Unicode  property support, because to do otherwise would slow
229           down PCRE in many common cases. If you really want to test for a  wider
230           sense  of,  say,  "digit",  you must use Unicode property tests such as
231           \p{Nd}. Note that this also applies to \b, because  it  is  defined  in
232           terms of \w and \W.
233    
234           7.  Similarly,  characters that match the POSIX named character classes
235           are all low-valued characters.
236    
237           8. However, the Perl 5.10 horizontal and vertical  whitespace  matching
238           escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
239           acters.
240    
241           9. Case-insensitive matching applies only to  characters  whose  values
242           are  less than 128, unless PCRE is built with Unicode property support.
243           Even when Unicode property support is available, PCRE  still  uses  its
244           own  character  tables when checking the case of low-valued characters,
245           so as not to degrade performance.  The Unicode property information  is
246           used only for characters with higher values. Even when Unicode property
247           support is available, PCRE supports case-insensitive matching only when
248           there  is  a  one-to-one  mapping between a letter's cases. There are a
249           small number of many-to-one mappings in Unicode;  these  are  not  sup-
250           ported by PCRE.
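
Comments 1 and 3 in the list above depend on how many bytes a character occupies in UTF-8. The following sketch, which assumes a hypothetical helper named utf8_encode() that is not part of PCRE, shows why a value such as 0xb3 becomes a two-byte character and why \x{100}{3} spans six bytes of subject text rather than three.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: utf8_encode() is a hypothetical helper, not a PCRE function.
 * Writes the UTF-8 form of cp (0 to 0x10FFFF) into out, which must have
 * room for 4 bytes. Returns the number of bytes written, or 0 if cp is
 * larger than 0x10FFFF. */
static size_t utf8_encode(unsigned long cp, unsigned char *out)
{
    assert(out != NULL);
    if (cp < 0x80UL) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800UL) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000UL) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFFUL) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}

/* Number of subject bytes covered by `count` repetitions of one character,
 * which is what a quantifier applied to that character has to consume. */
static size_t repeated_length(unsigned long cp, size_t count)
{
    unsigned char buf[4];
    return utf8_encode(cp, buf) * count;
}
```

For example, utf8_encode(0xb3, buf) stores the two bytes 0xC2 0xB3, and repeated_length(0x100, 3) is 6.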
251    
        8. Case-insensitive matching applies only to  characters  whose  values  
        are  less  than  256.  PCRE  does  not support the notion of "case" for  
        higher-valued characters.  
252    
253         9. PCRE does not support the use of Unicode tables  and  properties  or  AUTHOR
        the Perl escapes \p, \P, and \X.  
254    
255           Philip Hazel
256           University Computing Service
257           Cambridge CB2 3QH, England.
258    
259  AUTHOR         Putting  an actual email address here seems to have been a spam magnet,
260           so I've taken it away. If you want to email me, use  my  two  initials,
261           followed by the two digits 10, at the domain cam.ac.uk.
262    
        Philip Hazel <ph10@cam.ac.uk>  
        University Computing Service,  
        Cambridge CB2 3QG, England.  
        Phone: +44 1223 334714  
263    
264  Last updated: 20 August 2003  REVISION
265  Copyright (c) 1997-2003 University of Cambridge.  
266  -----------------------------------------------------------------------------         Last updated: 28 September 2009
267           Copyright (c) 1997-2009 University of Cambridge.
268    ------------------------------------------------------------------------------
269    
 PCRE(3)                                                                PCRE(3)  
270    
271    PCREBUILD(3)                                                      PCREBUILD(3)
272    
273    
274  NAME  NAME
275         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
276    
277    
278  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
279    
280         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
281         selected when the library is compiled. They are all selected, or  dese-         selected when the library is compiled. It assumes use of the  configure
282         lected,  by  providing  options  to  the  configure script which is run         script,  where the optional features are selected or deselected by pro-
283         before the make command. The complete list  of  options  for  configure         viding options to configure before running the make  command.  However,
284         (which  includes the standard ones such as the selection of the instal-         the  same  options  can be selected in both Unix-like and non-Unix-like
285         lation directory) can be obtained by running         environments using the GUI facility of cmake-gui if you are using CMake
286           instead of configure to build PCRE.
287    
288           There  is  a  lot more information about building PCRE in non-Unix-like
289           environments in the file called NON-UNIX-USE, which is part of the PCRE
290           distribution.  You  should consult this file as well as the README file
291           if you are building in a non-Unix-like environment.
292    
293           The complete list of options for configure (which includes the standard
294           ones  such  as  the  selection  of  the  installation directory) can be
295           obtained by running
296    
297           ./configure --help           ./configure --help
298    
299         The following sections describe certain options whose names begin  with         The following sections include  descriptions  of  options  whose  names
300         --enable  or  --disable. These settings specify changes to the defaults         begin with --enable or --disable. These settings specify changes to the
301         for the configure command. Because of the  way  that  configure  works,         defaults for the configure command. Because of the way  that  configure
302         --enable  and  --disable  always  come  in  pairs, so the complementary         works,  --enable  and --disable always come in pairs, so the complemen-
303         option always exists as well, but as it specifies the  default,  it  is         tary option always exists as well, but as it specifies the default,  it
304         not described.         is not described.
305    
306    
307    C++ SUPPORT
308    
309           By default, the configure script will search for a C++ compiler and C++
310           header files. If it finds them, it automatically builds the C++ wrapper
311           library for PCRE. You can disable this by adding
312    
313             --disable-cpp
314    
315           to the configure command.
316    
317    
318  UTF-8 SUPPORT  UTF-8 SUPPORT
319    
320         To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF-8 Unicode character strings, add
321    
322           --enable-utf8           --enable-utf8
323    
324         to  the  configure  command.  Of  itself, this does not make PCRE treat         to  the  configure  command.  Of  itself, this does not make PCRE treat
325         strings as UTF-8. As well as compiling PCRE with this option, you  also         strings as UTF-8. As well as compiling PCRE with this option, you  also
326         have  to  set  the PCRE_UTF8 option when you call the pcre_compile()         have  to  set  the PCRE_UTF8 option when you call the pcre_compile()
327         function.         or pcre_compile2() functions.
328    
329           If you set --enable-utf8 when compiling in an EBCDIC environment,  PCRE
330           expects its input to be either ASCII or UTF-8 (depending on the runtime
331           option). It is not possible to support both EBCDIC and UTF-8  codes  in
332           the  same  version  of  the  library.  Consequently,  --enable-utf8 and
333           --enable-ebcdic are mutually exclusive.
334    
335    
336    UNICODE CHARACTER PROPERTY SUPPORT
337    
338           UTF-8 support allows PCRE to process character values greater than  255
339           in  the  strings that it handles. On its own, however, it does not pro-
340           vide any facilities for accessing the properties of such characters. If
341           you  want  to  be able to use the pattern escapes \P, \p, and \X, which
342           refer to Unicode character properties, you must add
343    
344             --enable-unicode-properties
345    
346           to the configure command. This implies UTF-8 support, even if you  have
347           not explicitly requested it.
348    
349           Including  Unicode  property  support  adds around 30K of tables to the
350           PCRE library. Only the general category properties such as  Lu  and  Nd
351           are supported. Details are given in the pcrepattern documentation.
352    
353    
354  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
355    
356         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By  default,  PCRE interprets the linefeed (LF) character as indicating
357         ter. This is the normal newline character on Unix-like systems. You can         the end of a line. This is the normal newline  character  on  Unix-like
358         compile PCRE to use character 13 (carriage return) instead by adding         systems.  You  can compile PCRE to use carriage return (CR) instead, by
359           adding
360    
361           --enable-newline-is-cr           --enable-newline-is-cr
362    
363         to the configure command. For completeness there is  also  a  --enable-         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
364         newline-is-lf  option,  which explicitly specifies linefeed as the new-         option, which explicitly specifies linefeed as the newline character.
365         line character.  
366           Alternatively, you can specify that line endings are to be indicated by
367           the two character sequence CRLF. If you want this, add
368    
369             --enable-newline-is-crlf
370    
371           to the configure command. There is a fourth option, specified by
372    
373             --enable-newline-is-anycrlf
374    
375           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
376           CRLF as indicating a line ending. Finally, a fifth option, specified by
377    
378             --enable-newline-is-any
379    
380           causes PCRE to recognize any Unicode newline sequence.
381    
382           Whatever  line  ending convention is selected when PCRE is built can be
383           overridden when the library functions are called. At build time  it  is
384           conventional to use the standard for your operating system.
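
As an illustration of how the first four conventions above differ, here is a minimal sketch. The enum and the function newline_length() are hypothetical names invented for this example and do not appear in PCRE; the "any Unicode newline sequence" convention is left out because the text above does not enumerate those sequences.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; they are not PCRE identifiers. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Returns the length in bytes of the line ending that starts at s[pos]
 * under the given convention, or 0 if no line ending starts there. */
static size_t newline_length(const char *s, size_t len, size_t pos,
                             enum nl_convention conv)
{
    int is_cr, is_lf, is_crlf;

    assert(s != NULL);
    if (pos >= len) return 0;

    is_cr = (s[pos] == '\r');
    is_lf = (s[pos] == '\n');
    is_crlf = is_cr && pos + 1 < len && s[pos + 1] == '\n';

    switch (conv) {
    case NL_CR:      return is_cr ? 1 : 0;
    case NL_LF:      return is_lf ? 1 : 0;
    case NL_CRLF:    return is_crlf ? 2 : 0;   /* a lone CR or LF does not count */
    case NL_ANYCRLF: return is_crlf ? 2 : ((is_cr || is_lf) ? 1 : 0);
    }
    return 0;
}
```

With the subject "a\r\nb", position 1 is a two-byte line ending under NL_CRLF and NL_ANYCRLF, a one-byte line ending under NL_CR, and not a line ending at all under NL_LF.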
385    
386    
387    WHAT \R MATCHES
388    
389           By  default,  the  sequence \R in a pattern matches any Unicode newline
390           sequence, whatever has been selected as the line  ending  sequence.  If
391           you specify
392    
393             --enable-bsr-anycrlf
394    
395           the  default  is changed so that \R matches only CR, LF, or CRLF. What-
396           ever is selected when PCRE is built can be overridden when the  library
397           functions are called.
398    
399    
400  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
401    
402         The PCRE building process uses libtool to build both shared and  static         The  PCRE building process uses libtool to build both shared and static
403         Unix  libraries by default. You can suppress one of these by adding one         Unix libraries by default. You can suppress one of these by adding  one
404         of         of
405    
406           --disable-shared           --disable-shared
# Line 231  BUILDING SHARED AND STATIC LIBRARIES Line 411  BUILDING SHARED AND STATIC LIBRARIES
411    
412  POSIX MALLOC USAGE  POSIX MALLOC USAGE
413    
414         When PCRE is called through the  POSIX  interface  (see  the  pcreposix         When PCRE is called through the POSIX interface (see the pcreposix doc-
415         documentation),  additional working storage is required for holding the         umentation), additional working storage is  required  for  holding  the
416         pointers to capturing substrings because PCRE requires  three  integers         pointers  to capturing substrings, because PCRE requires three integers
417         per  substring,  whereas  the POSIX interface provides only two. If the         per substring, whereas the POSIX interface provides only  two.  If  the
418         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
419         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
420         The default threshold above which the stack is no longer used is 10; it         The default threshold above which the stack is no longer used is 10; it
# Line 245  POSIX MALLOC USAGE Line 425  POSIX MALLOC USAGE
425         to the configure command.         to the configure command.
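
The choice between stack and heap storage described above can be sketched in C. This is a minimal illustration under stated assumptions, not the pcreposix source: the names SMALL_THRESHOLD and ovector_source() are hypothetical, and only the figures (three integers per substring, a default threshold of 10) come from the text.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the build-time threshold (10 by default). */
#define SMALL_THRESHOLD 10

/*
 * Picks working storage for nmatch substrings, three ints each.
 * Returns 0 if the stack buffer was used, 1 if malloc() was needed,
 * and -1 if the allocation failed.
 */
int ovector_source(size_t nmatch)
{
    int small[SMALL_THRESHOLD * 3];    /* stack space for the common case */
    int *ovector = small;
    int used_heap = 0;
    size_t i;

    if (nmatch > SMALL_THRESHOLD) {    /* too many substrings for the stack */
        ovector = malloc(nmatch * 3 * sizeof(int));
        if (ovector == NULL) return -1;
        used_heap = 1;
    }

    /* A real wrapper would run the match here; the sketch only clears
       the vector to show that the storage is usable. */
    for (i = 0; i < nmatch * 3; i++) ovector[i] = 0;

    if (used_heap) free(ovector);
    return used_heap;
}
```

A call with 10 or fewer substrings stays on the stack in this sketch, and a call with 11 or more goes to the heap.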
426    
427    
 LIMITING PCRE RESOURCE USAGE  
   
        Internally,  PCRE  has a function called match() which it calls repeat-  
        edly (possibly recursively) when performing a  matching  operation.  By  
        limiting  the  number of times this function may be called, a limit can  
        be placed on the resources used by a single call  to  pcre_exec().  The  
        limit  can be changed at run time, as described in the pcreapi documen-  
        tation. The default is 10 million, but this can be changed by adding  a  
        setting such as  
   
          --with-match-limit=500000  
   
        to the configure command.  
   
   
428  HANDLING VERY LARGE PATTERNS  HANDLING VERY LARGE PATTERNS
429    
430         Within  a  compiled  pattern,  offset values are used to point from one         Within a compiled pattern, offset values are used  to  point  from  one
431         part to another (for example, from an opening parenthesis to an  alter-         part  to another (for example, from an opening parenthesis to an alter-
432         nation  metacharacter).  By  default two-byte values are used for these         nation metacharacter). By default, two-byte values are used  for  these
433         offsets, leading to a maximum size for a  compiled  pattern  of  around         offsets,  leading  to  a  maximum size for a compiled pattern of around
434         64K.  This  is sufficient to handle all but the most gigantic patterns.         64K. This is sufficient to handle all but the most  gigantic  patterns.
435         Nevertheless, some people do want to process enormous patterns,  so  it         Nevertheless,  some  people do want to process truly enormous patterns,
436         is  possible  to compile PCRE to use three-byte or four-byte offsets by         so it is possible to compile PCRE to use three-byte or  four-byte  off-
437         adding a setting such as         sets by adding a setting such as
438    
439           --with-link-size=3           --with-link-size=3
440    
441         to the configure command. The value given must be 2,  3,  or  4.  Using         to  the  configure  command.  The value given must be 2, 3, or 4. Using
442         longer  offsets slows down the operation of PCRE because it has to load         longer offsets slows down the operation of PCRE because it has to  load
443         additional bytes when handling them.         additional bytes when handling them.
444    
        If you build PCRE with an increased link size, test 2 (and  test  5  if  
        you  are using UTF-8) will fail. Part of the output of these tests is a  
        representation of the compiled pattern, and this changes with the  link  
        size.  
   
445    
446  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
447    
448         PCRE  implements  backtracking while matching by making recursive calls         When matching with the pcre_exec() function, PCRE implements backtrack-
449         to an internal function called match(). In environments where the  size         ing by making recursive calls to an internal function  called  match().
450         of the stack is limited, this can severely limit PCRE's operation. (The         In  environments  where  the size of the stack is limited, this can se-
451         Unix environment does not usually suffer from this problem.) An  alter-         verely limit PCRE's operation. (The Unix environment does  not  usually
452         native  approach  that  uses  memory  from  the  heap to remember data,         suffer from this problem, but it may sometimes be necessary to increase
453         instead of using recursive function calls, has been implemented to work         the maximum stack size.  There is a discussion in the  pcrestack  docu-
454         round  this  problem. If you want to build a version of PCRE that works         mentation.)  An alternative approach to recursion that uses memory from
455         this way, add         the heap to remember data, instead of using recursive  function  calls,
456           has  been  implemented to work round the problem of limited stack size.
457           If you want to build a version of PCRE that works this way, add
458    
459           --disable-stack-for-recursion           --disable-stack-for-recursion
460    
461         to the configure command. With this configuration, PCRE  will  use  the         to the configure command. With this configuration, PCRE  will  use  the
462         pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
463         management functions. Separate functions are provided because the usage         ment functions. By default these point to malloc() and free(), but  you
464         is very predictable: the block sizes requested are always the same, and         can replace the pointers so that your own functions are used instead.
465         the blocks are always freed in reverse order. A calling  program  might  
466         be  able  to implement optimized functions that perform better than the         Separate  functions  are  provided  rather  than  using pcre_malloc and
467         standard malloc() and  free()  functions.  PCRE  runs  noticeably  more         pcre_free because the  usage  is  very  predictable:  the  block  sizes
468         slowly when built in this way.         requested  are  always  the  same,  and  the blocks are always freed in
469           reverse order. A calling program might be able to  implement  optimized
470           functions  that  perform  better  than  malloc()  and free(). PCRE runs
471           noticeably more slowly when built in this way. This option affects only
472           the pcre_exec() function; it is not relevant for pcre_dfa_exec().
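
Because the blocks are all the same size and are freed in reverse order, a calling program could keep freed blocks on a list and hand them out again instead of returning them to the system. The following C code is a hedged sketch of that idea, not part of PCRE. The names my_stack_malloc(), my_stack_free(), and my_stack_fresh_count() are hypothetical, the code is not thread-safe, and it never releases cached blocks.

```c
#include <stddef.h>
#include <stdlib.h>

/* Header in front of each block. The union keeps the payload that
   follows it suitably aligned (max_align_t needs C11). */
typedef union header {
    struct { union header *next; size_t size; } h;
    max_align_t align;
} header;

static header *cache = NULL;                 /* most recently freed first */
static unsigned long fresh_allocations = 0;  /* calls that reached malloc() */

void *my_stack_malloc(size_t size)
{
    header *b;
    if (cache != NULL && cache->h.size >= size) {
        b = cache;                 /* reuse the block freed last */
        cache = b->h.next;
        return b + 1;
    }
    b = malloc(sizeof(header) + size);
    if (b == NULL) return NULL;
    b->h.size = size;
    fresh_allocations++;
    return b + 1;
}

void my_stack_free(void *p)
{
    header *b;
    if (p == NULL) return;
    b = (header *)p - 1;
    b->h.next = cache;             /* push onto the cache, do not free() */
    cache = b;
}

unsigned long my_stack_fresh_count(void)
{
    return fresh_allocations;
}
```

In a build that uses --disable-stack-for-recursion, functions of this shape could be assigned to the pcre_stack_malloc and pcre_stack_free variables before matching. Whether this is actually faster than malloc() and free() depends on the platform and would need to be measured.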
473    
474    
475    LIMITING PCRE RESOURCE USAGE
476    
477           Internally,  PCRE has a function called match(), which it calls repeat-
478           edly  (sometimes  recursively)  when  matching  a  pattern   with   the
479           pcre_exec()  function.  By controlling the maximum number of times this
480           function may be called during a single matching operation, a limit  can
481           be  placed  on  the resources used by a single call to pcre_exec(). The
482           limit can be changed at run time, as described in the pcreapi  documen-
483           tation.  The default is 10 million, but this can be changed by adding a
484           setting such as
485    
486             --with-match-limit=500000
487    
488           to  the  configure  command.  This  setting  has  no  effect   on   the
489           pcre_dfa_exec() matching function.
490    
491           In  some  environments  it is desirable to limit the depth of recursive
492           calls of match() more strictly than the total number of calls, in order
493           to  restrict  the maximum amount of stack (or heap, if --disable-stack-
494           for-recursion is specified) that is used. A second limit controls this;
495           it  defaults  to  the  value  that is set for --with-match-limit, which
496           imposes no additional constraints. However, you can set a  lower  limit
497           by adding, for example,
498    
499             --with-match-limit-recursion=10000
500    
501           to  the  configure  command.  This  value can also be overridden at run
502           time.
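
The effect of the two limits can be seen in a toy backtracking matcher. The C code below is an illustration only and is not PCRE's match() function: it understands just literal characters, "." and "*", it must match the whole subject, and all of its names and return codes are hypothetical.

```c
#define TOY_NOMATCH      0
#define TOY_MATCH        1
#define TOY_CALL_LIMIT   (-1)   /* too many calls in total */
#define TOY_DEPTH_LIMIT  (-2)   /* recursion too deep */

typedef struct {
    unsigned long calls;        /* calls made so far */
    unsigned long call_limit;   /* maximum total calls */
    unsigned long depth_limit;  /* maximum recursion depth */
} toy_limits;

static int toy_match(const char *pat, const char *s,
                     toy_limits *lim, unsigned long depth)
{
    if (++lim->calls > lim->call_limit) return TOY_CALL_LIMIT;
    if (depth > lim->depth_limit) return TOY_DEPTH_LIMIT;

    if (*pat == '\0') return (*s == '\0') ? TOY_MATCH : TOY_NOMATCH;

    if (pat[1] == '*') {
        /* Greedy repeat: take as much as possible, then give back one
           character at a time until the rest of the pattern matches. */
        const char *t = s;
        while (*t != '\0' && (pat[0] == '.' || *t == pat[0])) t++;
        for (;;) {
            int rc = toy_match(pat + 2, t, lim, depth + 1);
            if (rc != TOY_NOMATCH) return rc;   /* match or limit hit */
            if (t == s) return TOY_NOMATCH;
            t--;
        }
    }

    if (*s != '\0' && (pat[0] == '.' || *s == pat[0]))
        return toy_match(pat + 1, s + 1, lim, depth + 1);
    return TOY_NOMATCH;
}

int toy_exec(const char *pat, const char *subject,
             unsigned long call_limit, unsigned long depth_limit)
{
    toy_limits lim;
    lim.calls = 0;
    lim.call_limit = call_limit;
    lim.depth_limit = depth_limit;
    return toy_match(pat, subject, &lim, 1);
}
```

A pattern such as a*a*a*a*a*b applied to a long run of "a" characters with no "b" makes a very large number of calls before failing, so a small call limit stops it early. The depth can never exceed the number of calls, which is why a depth limit equal to the call limit imposes no additional constraint.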
503    
504    
505    CREATING CHARACTER TABLES AT BUILD TIME
506    
507           PCRE uses fixed tables for processing characters whose code values  are
508           less  than 256. By default, PCRE is built with a set of tables that are
509           distributed in the file pcre_chartables.c.dist. These  tables  are  for
510           ASCII codes only. If you add
511    
512             --enable-rebuild-chartables
513    
514           to  the  configure  command, the distributed tables are no longer used.
515           Instead, a program called dftables is compiled and  run.  This  outputs
516           the source for a new set of tables, created in the default locale of your
517           C runtime system. (This method of replacing the tables does not work if
518           you  are cross compiling, because dftables is run on the local host. If
519           you need to create alternative tables when cross  compiling,  you  will
520           have to do so "by hand".)
521    
522    
523  USING EBCDIC CODE  USING EBCDIC CODE
524    
525         PCRE  assumes  by  default that it will run in an environment where the         PCRE  assumes  by  default that it will run in an environment where the
526         character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
527         can, however, be compiled to run in an EBCDIC environment by adding         This  is  the  case for most computer operating systems. PCRE can, how-
528           ever, be compiled to run in an EBCDIC environment by adding
529    
530           --enable-ebcdic           --enable-ebcdic
531    
532         to the configure command.         to the configure command. This setting implies --enable-rebuild-charta-
533           bles.  You  should  only  use  it if you know that you are in an EBCDIC
534           environment (for example,  an  IBM  mainframe  operating  system).  The
535           --enable-ebcdic option is incompatible with --enable-utf8.
536    
 Last updated: 09 December 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
537    
538  PCRE(3)                                                                PCRE(3)  PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
539    
540           By default, pcregrep reads all files as plain text. You can build it so
541           that it recognizes files whose names end in .gz or .bz2, and reads them
542           with libz or libbz2, respectively, by adding one or both of
543    
544             --enable-pcregrep-libz
545             --enable-pcregrep-libbz2
546    
547           to the configure command. These options naturally require that the rel-
548           evant libraries are installed on your system. Configuration  will  fail
549           if they are not.
550    
551    
552    PCRETEST OPTION FOR LIBREADLINE SUPPORT
553    
554           If you add
555    
556             --enable-pcretest-libreadline
557    
558           to  the  configure  command,  pcretest  is  linked with the libreadline
559           library, and when its input is from a terminal, it reads it  using  the
560           readline() function. This provides line-editing and history facilities.
561           Note that libreadline is GPL-licensed, so if you distribute a binary of
562           pcretest linked in this way, there may be licensing issues.
563    
564           Setting  this  option  causes  the -lreadline option to be added to the
565           pcretest build. In many operating environments with a  system-installed
566           libreadline this is sufficient. However, in some environments (e.g.  if
567           an unmodified distribution version of readline is in use),  some  extra
568           configuration  may  be necessary. The INSTALL file for libreadline says
569           this:
570    
571             "Readline uses the termcap functions, but does not link with the
572             termcap or curses library itself, allowing applications which link
573             with readline the to choose an appropriate library."
574    
575           If your environment has not been set up so that an appropriate  library
576           is automatically included, you may need to add something like
577    
578             LIBS="-lncurses"
579    
580           immediately before the configure command.
581    
582    
583    SEE ALSO
584    
585           pcreapi(3), pcre_config(3).
586    
587    
588    AUTHOR
589    
590           Philip Hazel
591           University Computing Service
592           Cambridge CB2 3QH, England.
593    
594    
595    REVISION
596    
597           Last updated: 29 September 2009
598           Copyright (c) 1997-2009 University of Cambridge.
599    ------------------------------------------------------------------------------
600    
601    
602    PCREMATCHING(3)                                                PCREMATCHING(3)
603    
604    
605    NAME
606           PCRE - Perl-compatible regular expressions
607    
608    
609    PCRE MATCHING ALGORITHMS
610    
611           This document describes the two different algorithms that are available
612           in PCRE for matching a compiled regular expression against a given sub-
613           ject  string.  The  "standard"  algorithm  is  the  one provided by the
614           pcre_exec() function.  This works in the same way  as  Perl's  matching
615           function, and provides a Perl-compatible matching operation.
616    
617           An  alternative  algorithm is provided by the pcre_dfa_exec() function;
618           this operates in a different way, and is not  Perl-compatible.  It  has
619           advantages  and disadvantages compared with the standard algorithm, and
620           these are described below.
621    
622           When there is only one possible way in which a given subject string can
623           match  a pattern, the two algorithms give the same answer. A difference
624           arises, however, when there are multiple possibilities. For example, if
625           the pattern
626    
627             ^<.*>
628    
629           is matched against the string
630    
631             <something> <something else> <something further>
632    
633           there are three possible answers. The standard algorithm finds only one
634           of them, whereas the alternative algorithm finds all three.
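
The three possible answers can be enumerated by hand for this particular pattern. The C function below is a sketch that hard-codes the pattern ^<.*> and does not use PCRE. Its name is hypothetical, and it assumes that the subject contains no newline.

```c
/*
 * Records the length of every match of ^<.*> at the start of subject,
 * shortest first, storing at most max lengths. Returns the number of
 * lengths stored.
 */
int all_match_lengths(const char *subject, int *lengths, int max)
{
    int count = 0;
    int i;

    if (subject[0] != '<') return 0;    /* the match must start with '<' */

    for (i = 1; subject[i] != '\0'; i++) {
        /* Every '>' ends one possible match of the pattern. */
        if (subject[i] == '>' && count < max) lengths[count++] = i + 1;
    }
    return count;
}
```

For the subject string shown above, the function records the lengths 11, 28, and 48. An algorithm that stops at the first match it finds reports only one of these, whereas one that keeps all paths alive reports all three.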
635    
636    
637    REGULAR EXPRESSIONS AS TREES
638    
639           The set of strings that are matched by a regular expression can be rep-
640           resented  as  a  tree structure. An unlimited repetition in the pattern
641           makes the tree of infinite size, but it is still a tree.  Matching  the
642           pattern  to a given subject string (from a given starting point) can be
643           thought of as a search of the tree.  There are two  ways  to  search  a
644           tree:  depth-first  and  breadth-first, and these correspond to the two
645           matching algorithms provided by PCRE.
646    
647    
648    THE STANDARD MATCHING ALGORITHM
649    
650           In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
651           sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
652           depth-first search of the pattern tree. That is, it  proceeds  along  a
653           single path through the tree, checking that the subject matches what is
654           required. When there is a mismatch, the algorithm  tries  any  alterna-
655           tives  at  the  current point, and if they all fail, it backs up to the
656           previous branch point in the  tree,  and  tries  the  next  alternative
657           branch  at  that  level.  This often involves backing up (moving to the
658           left) in the subject string as well.  The  order  in  which  repetition
659           branches  are  tried  is controlled by the greedy or ungreedy nature of
660           the quantifier.
661    
662           If a leaf node is reached, a matching string has  been  found,  and  at
663           that  point the algorithm stops. Thus, if there is more than one possi-
664           ble match, this algorithm returns the first one that it finds.  Whether
665           this  is the shortest, the longest, or some intermediate length depends
666           on the way the greedy and ungreedy repetition quantifiers are specified
667           in the pattern.
668    
669           Because  it  ends  up  with a single path through the tree, it is rela-
670           tively straightforward for this algorithm to keep  track  of  the  sub-
671           strings  that  are  matched  by portions of the pattern in parentheses.
672           This provides support for capturing parentheses and back references.
673    
674    
675    THE ALTERNATIVE MATCHING ALGORITHM
676    
677           This algorithm conducts a breadth-first search of  the  tree.  Starting
678           from  the  first  matching  point  in the subject, it scans the subject
679           string from left to right, once, character by character, and as it does
680           this,  it remembers all the paths through the tree that represent valid
681           matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
682           though  it is not implemented as a traditional finite state machine (it
683           keeps multiple states active simultaneously).
684    
685           Although the general principle of this matching algorithm  is  that  it
686           scans  the subject string only once, without backtracking, there is one
687           exception: when a lookaround assertion is encountered,  the  characters
688           following  or  preceding  the  current  point  have to be independently
689           inspected.
690    
691           The scan continues until either the end of the subject is  reached,  or
692           there  are  no more unterminated paths. At this point, terminated paths
693           represent the different matching possibilities (if there are none,  the
694           match  has  failed).   Thus,  if there is more than one possible match,
695           this algorithm finds all of them, and in particular, it finds the long-
696           est.  There  is  an  option to stop the algorithm after the first match
697           (which is necessarily the shortest) is found.
698    
699           Note that all the matches that are found start at the same point in the
700           subject. If the pattern
701    
702             cat(er(pillar)?)
703    
704           is  matched  against the string "the caterpillar catchment", the result
705           will be the three strings "cat", "cater", and "caterpillar" that  start
706           at the fourth character of the subject. The algorithm does not automat-
707           ically move on to find matches that start at later positions.
708    
709           There are a number of features of PCRE regular expressions that are not
710           supported by the alternative matching algorithm. They are as follows:
711    
712           1.  Because  the  algorithm  finds  all possible matches, the greedy or
713           ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
714           ungreedy quantifiers are treated in exactly the same way. However, pos-
715           sessive quantifiers can make a difference when what follows could  also
716           match what is quantified, for example in a pattern like this:
717    
718             ^a++\w!
719    
720           This  pattern matches "aaab!" but not "aaa!", which would be matched by
721           a non-possessive quantifier. Similarly, if an atomic group is  present,
722           it  is matched as if it were a standalone pattern at the current point,
723           and the longest match is then "locked in" for the rest of  the  overall
724           pattern.
725    
726           2. When dealing with multiple paths through the tree simultaneously, it
727           is not straightforward to keep track of  captured  substrings  for  the
728           different  matching  possibilities,  and  PCRE's implementation of this
729           algorithm does not attempt to do this. This means that no captured sub-
730           strings are available.
731    
732           3.  Because no substrings are captured, back references within the pat-
733           tern are not supported, and cause errors if encountered.
734    
735           4. For the same reason, conditional expressions that use  a  backrefer-
736           ence  as  the  condition or test for a specific group recursion are not
737           supported.
738    
739           5. Because many paths through the tree may be  active,  the  \K  escape
740           sequence, which resets the start of the match when encountered (but may
741           be on some paths and not on others), is not  supported.  It  causes  an
742           error if encountered.
743    
744           6.  Callouts  are  supported, but the value of the capture_top field is
745           always 1, and the value of the capture_last field is always -1.
746    
747           7. The \C escape sequence, which (in the standard algorithm) matches  a
748           single  byte, even in UTF-8 mode, is not supported because the alterna-
749           tive algorithm moves through the subject  string  one  character  at  a
750           time, for all active paths through the tree.
751    
752           8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
753           are not supported. (*FAIL) is supported, and  behaves  like  a  failing
754           negative assertion.
755    
756    
757    ADVANTAGES OF THE ALTERNATIVE ALGORITHM
758    
759           Using  the alternative matching algorithm provides the following advan-
760           tages:
761    
762           1. All possible matches (at a single point in the subject) are automat-
763           ically  found,  and  in particular, the longest match is found. To find
764           more than one match using the standard algorithm, you have to do kludgy
765           things with callouts.
766    
767           2.  Because  the  alternative  algorithm  scans the subject string just
768           once, and never needs to backtrack, it is possible to  pass  very  long
769           subject  strings  to  the matching function in several pieces, checking
770           for partial matching each time.  The  pcrepartial  documentation  gives
771           details of partial matching.
772    
773    
774    DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
775    
776           The alternative algorithm suffers from a number of disadvantages:
777    
778           1.  It  is  substantially  slower  than the standard algorithm. This is
779           partly because it has to search for all possible matches, but  is  also
780           because it is less susceptible to optimization.
781    
782           2. Capturing parentheses and back references are not supported.
783    
784           3. Although atomic groups are supported, their use does not provide the
785           performance advantage that it does for the standard algorithm.
786    
787    
788    AUTHOR
789    
790           Philip Hazel
791           University Computing Service
792           Cambridge CB2 3QH, England.
793    
794    
795    REVISION
796    
797           Last updated: 29 September 2009
798           Copyright (c) 1997-2009 University of Cambridge.
799    ------------------------------------------------------------------------------
800    
801    
802    PCREAPI(3)                                                          PCREAPI(3)
803    
804    
805  NAME  NAME
806         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
807    
808  SYNOPSIS OF PCRE API  
809    PCRE NATIVE API
810    
811         #include <pcre.h>         #include <pcre.h>
812    
# Line 335  SYNOPSIS OF PCRE API Line 814  SYNOPSIS OF PCRE API
814              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
815              const unsigned char *tableptr);              const unsigned char *tableptr);
816    
817           pcre *pcre_compile2(const char *pattern, int options,
818                int *errorcodeptr,
819                const char **errptr, int *erroffset,
820                const unsigned char *tableptr);
821    
822         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
823              const char **errptr);              const char **errptr);
824    
# Line 342  SYNOPSIS OF PCRE API Line 826  SYNOPSIS OF PCRE API
826              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
827              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
828    
829           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
830                const char *subject, int length, int startoffset,
831                int options, int *ovector, int ovecsize,
832                int *workspace, int wscount);
833    
834         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
835              const char *subject, int *ovector,              const char *subject, int *ovector,
836              int stringcount, const char *stringname,              int stringcount, const char *stringname,
# Line 359  SYNOPSIS OF PCRE API Line 848  SYNOPSIS OF PCRE API
848         int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
849              const char *name);              const char *name);
850    
851           int pcre_get_stringtable_entries(const pcre *code,
852                const char *name, char **first, char **last);
853    
854         int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
855              int stringcount, int stringnumber,              int stringcount, int stringnumber,
856              const char **stringptr);              const char **stringptr);
# Line 377  SYNOPSIS OF PCRE API Line 869  SYNOPSIS OF PCRE API
869    
870         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
871    
872           int pcre_refcount(pcre *code, int adjust);
873    
874         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
875    
876         char *pcre_version(void);         char *pcre_version(void);
# Line 392  SYNOPSIS OF PCRE API Line 886  SYNOPSIS OF PCRE API
886         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
887    
888    
889  PCRE API  PCRE API OVERVIEW
890    
891         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
892         is also a set of wrapper functions that correspond to the POSIX regular         are also some wrapper functions that correspond to  the  POSIX  regular
893         expression API.  These are described in the pcreposix documentation.         expression  API.  These  are  described in the pcreposix documentation.
894           Both of these APIs define a set of C function calls. A C++  wrapper  is
895         The  native  API  function  prototypes  are  defined in the header file         distributed with PCRE. It is documented in the pcrecpp page.
896         pcre.h, and on Unix systems the library itself is called libpcre.a,  so  
897         can be accessed by adding -lpcre to the command for linking an applica-         The  native  API  C  function prototypes are defined in the header file
898         tion which calls it. The header file defines the macros PCRE_MAJOR  and         pcre.h, and on Unix systems the library itself is called  libpcre.   It
899         PCRE_MINOR  to  contain  the  major  and  minor release numbers for the         can normally be accessed by adding -lpcre to the command for linking an
900         library. Applications can use these to include  support  for  different         application  that  uses  PCRE.  The  header  file  defines  the  macros
901         releases.         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
902           bers for the library.  Applications can use these  to  include  support
903         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used         for different releases of PCRE.
904         for compiling and matching regular expressions. A sample  program  that  
905         demonstrates  the simplest way of using them is given in the file pcre-         The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
906         demo.c. The pcresample documentation describes how to run it.         pcre_exec() are used for compiling and matching regular expressions  in
907           a  Perl-compatible  manner. A sample program that demonstrates the sim-
908         There are convenience functions for extracting captured substrings from         plest way of using them is provided in the file  called  pcredemo.c  in
909         a matched subject string. They are:         the PCRE source distribution. A listing of this program is given in the
910           pcredemo documentation, and the pcresample documentation describes  how
911           to compile and run it.
912    
913           A second matching function, pcre_dfa_exec(), which is not Perl-compati-
914           ble, is also provided. This uses a different algorithm for  the  match-
915           ing.  The  alternative algorithm finds all possible matches (at a given
916           point in the subject), and scans the subject just  once  (unless  there
917           are  lookbehind  assertions).  However,  this algorithm does not return
918           captured substrings. A description of the two matching  algorithms  and
919           their  advantages  and disadvantages is given in the pcrematching docu-
920           mentation.
921    
922           In addition to the main compiling and  matching  functions,  there  are
923           convenience functions for extracting captured substrings from a subject
924           string that is matched by pcre_exec(). They are:
925    
926           pcre_copy_substring()           pcre_copy_substring()
927           pcre_copy_named_substring()           pcre_copy_named_substring()
928           pcre_get_substring()           pcre_get_substring()
929           pcre_get_named_substring()           pcre_get_named_substring()
930           pcre_get_substring_list()           pcre_get_substring_list()
931             pcre_get_stringnumber()
932             pcre_get_stringtable_entries()
933    
934         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
935         to free the memory used for extracted strings.         to free the memory used for extracted strings.
936    
937         The function pcre_maketables() is used (optionally) to build a  set  of         The  function  pcre_maketables()  is  used  to build a set of character
938         character tables in the current locale for passing to pcre_compile().         tables  in  the  current  locale   for   passing   to   pcre_compile(),
939           pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
940         The  function  pcre_fullinfo()  is used to find out information about a         provided for specialist use.  Most  commonly,  no  special  tables  are
941         compiled pattern; pcre_info() is an obsolete version which returns only         passed,  in  which case internal tables that are generated when PCRE is
942         some  of  the available information, but is retained for backwards com-         built are used.
943         patibility.  The function pcre_version() returns a pointer to a  string  
944           The function pcre_fullinfo() is used to find out  information  about  a
945           compiled  pattern; pcre_info() is an obsolete version that returns only
946           some of the available information, but is retained for  backwards  com-
947           patibility.   The function pcre_version() returns a pointer to a string
948         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
949    
950         The  global  variables  pcre_malloc and pcre_free initially contain the         The function pcre_refcount() maintains a  reference  count  in  a  data
951         entry points of the standard  malloc()  and  free()  functions  respec-         block  containing  a compiled pattern. This is provided for the benefit
952           of object-oriented applications.
953    
954           The global variables pcre_malloc and pcre_free  initially  contain  the
955           entry  points  of  the  standard malloc() and free() functions, respec-
956         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
957         so a calling program can replace them if it  wishes  to  intercept  the         so  a  calling  program  can replace them if it wishes to intercept the
958         calls. This should be done before calling any PCRE functions.         calls. This should be done before calling any PCRE functions.
959    
960         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
961         indirections to memory management functions.  These  special  functions         indirections  to  memory  management functions. These special functions
962         are  used  only  when  PCRE is compiled to use the heap for remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
963         data, instead of recursive function calls. This is a  non-standard  way         data, instead of recursive function calls, when running the pcre_exec()
964         of  building  PCRE,  for  use in environments that have limited stacks.         function. See the pcrebuild documentation for  details  of  how  to  do
965         Because of the greater use of memory management, it runs  more  slowly.         this.  It  is  a non-standard way of building PCRE, for use in environ-
966         Separate  functions  are provided so that special-purpose external code         ments that have limited stacks. Because of the greater  use  of  memory
967         can be used for this case. When used, these functions are always called         management,  it  runs  more  slowly. Separate functions are provided so
968         in  a  stack-like  manner  (last obtained, first freed), and always for         that special-purpose external code can be  used  for  this  case.  When
969         memory blocks of the same size.         used,  these  functions  are always called in a stack-like manner (last
970           obtained, first freed), and always for memory blocks of the same  size.
971           There  is  a discussion about PCRE's stack usage in the pcrestack docu-
972           mentation.
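The property that these functions are called last obtained, first freed, and always for blocks of one size, means a replacement can cache freed blocks and hand them straight back. The following is a minimal sketch under that assumption, not code from the PCRE distribution. The names sketch_stack_malloc, sketch_stack_free, and cached_block are hypothetical. The sketch is not thread-safe, and it assumes that sizeof(cached_block) is a suitable alignment for the blocks.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header placed in front of every block handed out. */
typedef struct cached_block {
    struct cached_block *next;   /* next cached (free) block */
    size_t size;                 /* usable size that was requested */
} cached_block;

/* Freed blocks, most recently freed first. Not thread-safe. */
static cached_block *free_list = NULL;

/* Return a block of at least 'size' bytes, reusing the most recently
   freed block when it is large enough. */
void *sketch_stack_malloc(size_t size)
{
    cached_block *b = free_list;
    if (b != NULL && b->size >= size) {
        free_list = b->next;                 /* pop the cached block */
    } else {
        b = malloc(sizeof(cached_block) + size);
        if (b == NULL) return NULL;
        b->size = size;
    }
    return (void *)(b + 1);                  /* memory just after the header */
}

/* Keep the block for reuse instead of returning it to the system. */
void sketch_stack_free(void *p)
{
    cached_block *b;
    if (p == NULL) return;
    b = (cached_block *)p - 1;               /* step back to the header */
    b->next = free_list;
    free_list = b;
}
```

In a build of PCRE that uses the heap instead of recursion, a program could assign these two functions to pcre_stack_malloc and pcre_stack_free before calling any PCRE functions. Cached blocks are never returned to the system in this sketch; a real replacement would probably want to release them at some point.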
973    
974         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
975         by  the  caller  to  a "callout" function, which PCRE will then call at         by  the  caller  to  a "callout" function, which PCRE will then call at
# Line 455  PCRE API Line 977  PCRE API
977         pcrecallout documentation.         pcrecallout documentation.
978    
979    
980    NEWLINES
981    
982           PCRE  supports five different conventions for indicating line breaks in
983           strings: a single CR (carriage return) character, a  single  LF  (line-
984           feed) character, the two-character sequence CRLF, any of the three pre-
985           ceding, or any Unicode newline sequence. The Unicode newline  sequences
986           are  the  three just mentioned, plus the single characters VT (vertical
987           tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
988           separator, U+2028), and PS (paragraph separator, U+2029).
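The five conventions can be illustrated by a small function that reports how many bytes of newline, if any, start at a given point in a subject. This is a sketch of the idea only, not PCRE's own code. The enum nl_convention and the function newline_length are hypothetical names. The sketch assumes an 8-bit, non-UTF-8 subject, so NEL is taken to be the single byte 0x85, and LS and PS, which need UTF-8 decoding, are left out.

```c
#include <stddef.h>

/* Hypothetical labels for the five conventions described above. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in bytes of the newline that starts at s[0],
   or 0 if no newline starts there under the given convention.
   'remaining' is the number of bytes available from s onwards. */
size_t newline_length(const char *s, size_t remaining,
                      enum nl_convention conv)
{
    unsigned char c;
    int crlf;

    if (remaining == 0) return 0;
    c = (unsigned char)s[0];
    crlf = (c == '\r' && remaining >= 2 && s[1] == '\n');

    switch (conv) {
    case NL_CR:
        return (c == '\r') ? 1 : 0;
    case NL_LF:
        return (c == '\n') ? 1 : 0;
    case NL_CRLF:
        return crlf ? 2 : 0;
    case NL_ANYCRLF:
    case NL_ANY:
        if (crlf) return 2;                   /* CRLF counts as one newline */
        if (c == '\r' || c == '\n') return 1;
        if (conv == NL_ANY &&
            (c == 0x0B || c == 0x0C || c == 0x85))
            return 1;                         /* VT, FF, NEL (byte mode) */
        return 0;
    }
    return 0;
}
```

For example, under NL_CRLF a lone LF is not a newline, whereas under NL_ANYCRLF it is, and a CR followed by LF is treated as a single two-byte newline.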
989    
990           Each  of  the first three conventions is used by at least one operating
991           system as its standard newline sequence. When PCRE is built, a  default
992           can be specified; if none is given, the default is LF, which is the Unix
993           standard. When PCRE is run, the default can be overridden, either when a
994           pattern is compiled, or when it is matched.
995    
996           At compile time, the newline convention can be specified by the options
997           argument of pcre_compile(), or it can be specified by special  text  at
998           the start of the pattern itself; this overrides any other settings. See
999           the pcrepattern page for details of the special character sequences.
1000    
1001           In the PCRE documentation the word "newline" is used to mean "the char-
1002           acter  or pair of characters that indicate a line break". The choice of
1003           newline convention affects the handling of  the  dot,  circumflex,  and
1004           dollar metacharacters, the handling of #-comments in /x mode, and, when
1005           CRLF is a recognized line ending sequence, the match position  advance-
1006           ment for a non-anchored pattern. There is more detail about this in the
1007           section on pcre_exec() options below.
1008    
1009           The choice of newline convention does not affect the interpretation  of
1010           the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
1011           which is controlled in a similar way, but by separate options.
1012    
1013    
1014  MULTITHREADING  MULTITHREADING
1015    
1016         The  PCRE  functions  can be used in multi-threading applications, with         The PCRE functions can be used in  multi-threading  applications,  with
1017         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
1018         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1019         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
# Line 467  MULTITHREADING Line 1023  MULTITHREADING
1023         at once.         at once.
1024    
1025    
1026    SAVING PRECOMPILED PATTERNS FOR LATER USE
1027    
1028           The compiled form of a regular expression can be saved and re-used at a
1029           later time, possibly by a different program, and even on a  host  other
1030           than  the  one  on  which  it  was  compiled.  Details are given in the
1031           pcreprecompile documentation. However, compiling a  regular  expression
1032           with  one version of PCRE for use with a different version is not guar-
1033           anteed to work and may cause crashes.
1034    
1035    
1036  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
1037    
1038         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
1039    
1040         The  function pcre_config() makes it possible for a PCRE client to dis-         The function pcre_config() makes it possible for a PCRE client to  dis-
1041         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
1042         The  pcrebuild documentation has more details about these optional fea-         The pcrebuild documentation has more details about these optional  fea-
1043         tures.         tures.
1044    
1045         The first argument for pcre_config() is an  integer,  specifying  which         The  first  argument  for pcre_config() is an integer, specifying which
1046         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
1047         into which the information is  placed.  The  following  information  is         into  which  the  information  is  placed. The following information is
1048         available:         available:
1049    
1050           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
1051    
1052         The  output is an integer that is set to one if UTF-8 support is avail-         The output is an integer that is set to one if UTF-8 support is  avail-
1053         able; otherwise it is set to zero.         able; otherwise it is set to zero.
1054    
1055             PCRE_CONFIG_UNICODE_PROPERTIES
1056    
1057           The  output  is  an  integer  that is set to one if support for Unicode
1058           character properties is available; otherwise it is set to zero.
1059    
1060           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
1061    
1062         The output is an integer that is set to the value of the code  that  is         The output is an integer whose value specifies  the  default  character
1063         used  for the newline character. It is either linefeed (10) or carriage         sequence  that is recognized as meaning "newline". The five values that
1064         return (13), and should normally be the  standard  character  for  your         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
1065         operating system.         and  -1  for  ANY.  Though they are derived from ASCII, the same values
1066           are returned in EBCDIC environments. The default should normally corre-
1067           spond to the standard sequence for your operating system.
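The numeric values listed for PCRE_CONFIG_NEWLINE can be turned into readable labels with a small helper. The sketch below is illustrative only; newline_config_name is a hypothetical function, not part of the PCRE API. It also shows where 3338 comes from: it is the CR code (13) shifted left by eight bits, combined with the LF code (10).

```c
#include <stddef.h>

/* Hypothetical helper: map a value obtained for PCRE_CONFIG_NEWLINE
   to a short label. The numeric values are those listed above. */
const char *newline_config_name(int value)
{
    switch (value) {
    case 10:             return "LF";
    case 13:             return "CR";
    case (13 << 8) | 10: return "CRLF";      /* 3338 */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return "unknown";
    }
}
```

In a program linked with PCRE, the value would be obtained by passing PCRE_CONFIG_NEWLINE and the address of an int variable to pcre_config().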
1068    
1069             PCRE_CONFIG_BSR
1070    
1071           The output is an integer whose value indicates what character sequences
1072           the \R escape sequence matches by default. A value of 0 means  that  \R
1073           matches  any  Unicode  line ending sequence; a value of 1 means that \R
1074           matches only CR, LF, or CRLF. The default can be overridden when a pat-
1075           tern is compiled or matched.
1076    
1077           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
1078    
# Line 510  CHECKING BUILD-TIME OPTIONS Line 1091  CHECKING BUILD-TIME OPTIONS
1091    
1092           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
1093    
1094         The output is an integer that gives the default limit for the number of         The output is a long integer that gives the default limit for the  num-
1095         internal  matching  function  calls in a pcre_exec() execution. Further         ber  of  internal  matching  function calls in a pcre_exec() execution.
1096         details are given with pcre_exec() below.         Further details are given with pcre_exec() below.
1097    
1098             PCRE_CONFIG_MATCH_LIMIT_RECURSION
1099    
1100           The output is a long integer that gives the default limit for the depth
1101           of   recursion  when  calling  the  internal  matching  function  in  a
1102           pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
1103           below.
1104    
1105           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
1106    
1107         The output is an integer that is set to one if  internal  recursion  is         The  output is an integer that is set to one if internal recursion when
1108         implemented  by recursive function calls that use the stack to remember         running pcre_exec() is implemented by recursive function calls that use
1109         their state. This is the usual way that PCRE is compiled. The output is         the  stack  to remember their state. This is the usual way that PCRE is
1110         zero  if PCRE was compiled to use blocks of data on the heap instead of         compiled. The output is zero if PCRE was compiled to use blocks of data
1111         recursive  function  calls.  In  this   case,   pcre_stack_malloc   and         on  the  heap  instead  of  recursive  function  calls.  In  this case,
1112         pcre_stack_free  are  called  to manage memory blocks on the heap, thus         pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
1113         avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
1114    
1115    
1116  COMPILING A PATTERN  COMPILING A PATTERN
# Line 531  COMPILING A PATTERN Line 1119  COMPILING A PATTERN
1119              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
1120              const unsigned char *tableptr);              const unsigned char *tableptr);
1121    
1122           pcre *pcre_compile2(const char *pattern, int options,
1123                int *errorcodeptr,
1124                const char **errptr, int *erroffset,
1125                const unsigned char *tableptr);
1126    
1127         The function pcre_compile() is called to  compile  a  pattern  into  an         Either of the functions pcre_compile() or pcre_compile2() can be called
1128         internal  form.  The pattern is a C string terminated by a binary zero,         to compile a pattern into an internal form. The only difference between
1129         and is passed in the argument pattern. A pointer to a single  block  of         the  two interfaces is that pcre_compile2() has an additional argument,
1130         memory  that is obtained via pcre_malloc is returned. This contains the         errorcodeptr, via which a numerical error  code  can  be  returned.  To
1131         compiled code and related data.  The  pcre  type  is  defined  for  the         avoid  too  much repetition, we refer just to pcre_compile() below, but
1132         returned  block;  this  is a typedef for a structure whose contents are         the information applies equally to pcre_compile2().
1133         not externally defined. It is up to the caller to free the memory  when  
1134         it is no longer required.         The pattern is a C string terminated by a binary zero, and is passed in
1135           the  pattern  argument.  A  pointer to a single block of memory that is
1136           obtained via pcre_malloc is returned. This contains the  compiled  code
1137           and related data. The pcre type is defined for the returned block; this
1138           is a typedef for a structure whose contents are not externally defined.
1139           It is up to the caller to free the memory (via pcre_free) when it is no
1140           longer required.
1141    
1142         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although the compiled code of a PCRE regex is relocatable, that is,  it
1143         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
1144         fully relocatable, because it contains a copy of the tableptr argument,         fully relocatable, because it may contain a copy of the tableptr  argu-
1145         which is an address (see below).         ment, which is an address (see below).
1146    
1147         The options argument contains independent bits that affect the compila-         The options argument contains various bit settings that affect the com-
1148         tion.  It  should  be  zero  if  no  options  are required. Some of the         pilation. It should be zero if no options are required.  The  available
1149         options, in particular, those that are compatible with Perl,  can  also         options  are  described  below. Some of them (in particular, those that
1150         be  set and unset from within the pattern (see the detailed description         are compatible with Perl, but some others as well) can also be set  and
1151         of regular expressions in the  pcrepattern  documentation).  For  these         unset  from  within  the  pattern  (see the detailed description in the
1152         options,  the  contents of the options argument specifies their initial         pcrepattern documentation). For those options that can be different  in
1153         settings at the start of compilation and execution.  The  PCRE_ANCHORED         different  parts  of  the pattern, the contents of the options argument
1154         option can be set at the time of matching as well as at compile time.         specifies their settings at the start of compilation and execution. The
1155           PCRE_ANCHORED, PCRE_BSR_xxx, and PCRE_NEWLINE_xxx options can be set at
1156           the time of matching as well as at compile time.
1157    
1158         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1159         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1160         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1161         sage. The offset from the start of the pattern to the  character  where         sage. This is a static string that is part of the library. You must not
1162         the  error  was  discovered  is  placed  in  the variable pointed to by         try to free it. The byte offset from the start of the  pattern  to  the
1163         erroffset, which must not be NULL. If it  is,  an  immediate  error  is         character  that  was  being  processed when the error was discovered is
1164         given.         placed in the variable pointed to by erroffset, which must not be NULL.
1165           If  it  is,  an  immediate error is given. Some errors are not detected
1166         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         until checks are carried out when the whole pattern has  been  scanned;
1167         character tables which are built when it is compiled, using the default         in this case the offset is set to the end of the pattern.
1168         C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to  
1169         pcre_maketables(). See the section on locale support below.         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1170           codeptr argument is not NULL, a non-zero error code number is  returned
1171           via  this argument in the event of an error. This is in addition to the
1172           textual error message. Error codes and messages are listed below.
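Because the error offset is a byte offset into the pattern, a caller can use it to point at the place where compilation stopped. The following sketch shows one way to lay out such a report. The function format_compile_error is a hypothetical helper, not part of PCRE, and the message text used in the example after it is only illustrative. The caret lines up with the pattern only when every character before the offset occupies one display column.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the pattern, then a second line with a
   caret under the byte at 'erroffset', followed by the message.
   Returns what snprintf() returns. */
int format_compile_error(char *buf, size_t buflen, const char *pattern,
                         const char *message, int erroffset)
{
    size_t len = strlen(pattern);
    size_t col = (erroffset < 0) ? 0 : (size_t)erroffset;

    /* An offset at the end of the pattern is valid (see above);
       anything beyond it is clamped. */
    if (col > len) col = len;

    /* "%*s" with an empty string prints 'col' spaces. */
    return snprintf(buf, buflen, "%s\n%*s^ %s (offset %d)",
                    pattern, (int)col, "", message, erroffset);
}
```

For the pattern "a(b", an offset of 3, and the message "missing )", the buffer contains the pattern on the first line, and on the second line three spaces, a caret, and the message with the offset. The message pointer passed in would normally be the one set via errptr, which must not be freed.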
1173    
1174           If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
1175           character  tables  that  are  built  when  PCRE  is compiled, using the
1176           default C locale. Otherwise, tableptr must be an address  that  is  the
1177           result  of  a  call to pcre_maketables(). This value is stored with the
1178           compiled pattern, and used again by pcre_exec(), unless  another  table
1179           pointer is passed to it. For more discussion, see the section on locale
1180           support below.
1181    
1182         This code fragment shows a typical straightforward  call  to  pcre_com-         This code fragment shows a typical straightforward  call  to  pcre_com-
1183         pile():         pile():
# Line 581  COMPILING A PATTERN Line 1192  COMPILING A PATTERN
1192             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
1193             NULL);            /* use default character tables */             NULL);            /* use default character tables */
1194    
1195         The following option bits are defined:         The  following  names  for option bits are defined in the pcre.h header
1196           file:
1197    
1198           PCRE_ANCHORED           PCRE_ANCHORED
1199    
1200         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
1201         is constrained to match only at the first matching point in the  string         is  constrained to match only at the first matching point in the string
1202         which is being searched (the "subject string"). This effect can also be         that is being searched (the "subject string"). This effect can also  be
1203         achieved by appropriate constructs in the pattern itself, which is  the         achieved  by appropriate constructs in the pattern itself, which is the
1204         only way to do it in Perl.         only way to do it in Perl.
1205    
1206             PCRE_AUTO_CALLOUT
1207    
1208           If this bit is set, pcre_compile() automatically inserts callout items,
1209           all  with  number  255, before each pattern item. For discussion of the
1210           callout facility, see the pcrecallout documentation.
1211    
1212             PCRE_BSR_ANYCRLF
1213             PCRE_BSR_UNICODE
1214    
1215           These options (which are mutually exclusive) control what the \R escape
1216           sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1217           or to match any Unicode newline sequence. The default is specified when
1218           PCRE is built. It can be overridden from within the pattern, or by set-
1219           ting an option when a compiled pattern is matched.
1220    
1221           PCRE_CASELESS           PCRE_CASELESS
1222    
1223         If  this  bit is set, letters in the pattern match both upper and lower         If this bit is set, letters in the pattern match both upper  and  lower
1224         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
1225         changed within a pattern by a (?i) option setting.         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
1226           always  understands the concept of case for characters whose values are
1227           less than 128, so caseless matching is always possible. For  characters
1228           with  higher  values,  the concept of case is supported if PCRE is com-
1229           piled with Unicode property support, but not otherwise. If you want  to
1230           use  caseless  matching  for  characters 128 and above, you must ensure
1231           that PCRE is compiled with Unicode property support  as  well  as  with
1232           UTF-8 support.
1233    
1234           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
1235    
1236         If  this bit is set, a dollar metacharacter in the pattern matches only         If  this bit is set, a dollar metacharacter in the pattern matches only
1237         at the end of the subject string. Without this option,  a  dollar  also         at the end of the subject string. Without this option,  a  dollar  also
1238         matches  immediately before the final character if it is a newline (but         matches  immediately before a newline at the end of the string (but not
1239         not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is         before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
1240         ignored if PCRE_MULTILINE is set. There is no equivalent to this option         if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
1241         in Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
1242    
1243           PCRE_DOTALL           PCRE_DOTALL
1244    
1245         If this bit is set, a dot metacharacter in the pattern matches all         If this bit is set, a dot metacharacter in the pattern matches all
1246         characters,  including  newlines.  Without  it, newlines are excluded. This         characters,  including  those that indicate newline. Without it, a dot does
1247         option is equivalent to Perl's /s option, and it can be changed  within         not match when the current position is at a  newline.  This  option  is
1248         a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]         equivalent  to Perl's /s option, and it can be changed within a pattern
1249         always matches a newline character, independent of the setting of  this         by a (?s) option setting. A negative class such as [^a] always  matches
1250         option.         newline characters, independent of the setting of this option.
1251    
1252             PCRE_DUPNAMES
1253    
1254           If  this  bit is set, names used to identify capturing subpatterns need
1255           not be unique. This can be helpful for certain types of pattern when it
1256           is  known  that  only  one instance of the named subpattern can ever be
1257           matched. There are more details of named subpatterns  below;  see  also
1258           the pcrepattern documentation.
1259    
1260           PCRE_EXTENDED           PCRE_EXTENDED
1261    
1262         If  this  bit  is  set,  whitespace  data characters in the pattern are         If  this  bit  is  set,  whitespace  data characters in the pattern are
1263         totally ignored except  when  escaped  or  inside  a  character  class.         totally ignored except when escaped or inside a character class. White-
1264         Whitespace  does  not  include the VT character (code 11). In addition,         space does not include the VT character (code 11). In addition, charac-
1265         characters between an unescaped # outside a  character  class  and  the         ters between an unescaped # outside a character class and the next new-
1266         next newline character, inclusive, are also ignored. This is equivalent         line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
1267         to Perl's /x option, and it can be changed within a pattern by  a  (?x)         option, and it can be changed within a pattern by a  (?x)  option  set-
1268         option setting.         ting.
1269    
1270         This  option  makes  it possible to include comments inside complicated         This  option  makes  it possible to include comments inside complicated
1271         patterns.  Note, however, that this applies only  to  data  characters.         patterns.  Note, however, that this applies only  to  data  characters.
# Line 639  COMPILING A PATTERN Line 1281  COMPILING A PATTERN
1281         letter that has no special meaning  causes  an  error,  thus  reserving         letter that has no special meaning  causes  an  error,  thus  reserving
1282         these  combinations  for  future  expansion.  By default, as in Perl, a         these  combinations  for  future  expansion.  By default, as in Perl, a
1283         backslash followed by a letter with no special meaning is treated as  a         backslash followed by a letter with no special meaning is treated as  a
1284         literal.  There  are  at  present  no other features controlled by this         literal.  (Perl can, however, be persuaded to give a warning for this.)
1285         option. It can also be set by a (?X) option setting within a pattern.         There are at present no other features controlled by  this  option.  It
1286           can also be set by a (?X) option setting within a pattern.
1287    
1288             PCRE_FIRSTLINE
1289    
1290           If  this  option  is  set,  an  unanchored pattern is required to match
1291           before or at the first  newline  in  the  subject  string,  though  the
1292           matched text may continue over the newline.
1293    
1294             PCRE_JAVASCRIPT_COMPAT
1295    
1296           If this option is set, PCRE's behaviour is changed in some ways so that
1297           it is compatible with JavaScript rather than Perl. The changes  are  as
1298           follows:
1299    
1300           (1)  A  lone  closing square bracket in a pattern causes a compile-time
1301           error, because this is illegal in JavaScript (by default it is  treated
1302           as a data character). Thus, the pattern AB]CD becomes illegal when this
1303           option is set.
1304    
1305           (2) At run time, a back reference to an unset subpattern group  matches
1306           an  empty  string (by default this causes the current matching alterna-
1307           tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
1308           set  (assuming  it can find an "a" in the subject), whereas it fails by
1309           default, for Perl compatibility.
1310    
1311           PCRE_MULTILINE           PCRE_MULTILINE
1312    
1313         By default, PCRE treats the subject string as consisting  of  a  single         By default, PCRE treats the subject string as consisting  of  a  single
1314         "line"  of  characters (even if it actually contains several newlines).         line  of characters (even if it actually contains newlines). The "start
1315         The "start of line" metacharacter (^) matches only at the start of  the         of line" metacharacter (^) matches only at the  start  of  the  string,
1316         string,  while  the "end of line" metacharacter ($) matches only at the         while  the  "end  of line" metacharacter ($) matches only at the end of
1317         end of the string, or before a terminating  newline  (unless  PCRE_DOL-         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1318         LAR_ENDONLY is set). This is the same as Perl.         is set). This is the same as Perl.
1319    
1320         When PCRE_MULTILINE is set, the "start of line" and "end of line"         When PCRE_MULTILINE is set, the "start of line" and "end of line"
1321         constructs match immediately following or immediately before  any  new-         constructs match immediately following or immediately  before  internal
1322         line  in the subject string, respectively, as well as at the very start         newlines  in  the  subject string, respectively, as well as at the very
1323         and end. This is equivalent to Perl's /m option, and it can be  changed         start and end. This is equivalent to Perl's /m option, and  it  can  be
1324         within a pattern by a (?m) option setting. If there are no "\n" charac-         changed within a pattern by a (?m) option setting. If there are no new-
1325         ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1326         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
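
The rules above can be sketched in C. This is only an illustration of the documented behaviour, not PCRE's own code: the function names are hypothetical, the newline is assumed to be a single LF, and PCRE_DOLLAR_ENDONLY is assumed to be unset.

```c
#include <stddef.h>

/* Hypothetical helper: returns 1 if "^" could match at offset pos of a
   subject of length len. Assumes LF is the newline sequence. */
int caret_can_match(const char *subject, size_t len, size_t pos, int multiline)
{
    if (pos > len) return 0;
    if (pos == 0) return 1;              /* very start of the subject */
    if (!multiline) return 0;            /* default: start of string only */
    /* Multiline: immediately after an internal newline, that is, one that
       is not the last character of the subject. */
    return subject[pos - 1] == '\n' && pos < len;
}

/* Hypothetical helper: returns 1 if "$" could match at offset pos.
   Assumes PCRE_DOLLAR_ENDONLY is not set. */
int dollar_can_match(const char *subject, size_t len, size_t pos, int multiline)
{
    if (pos > len) return 0;
    if (pos == len) return 1;            /* very end of the subject */
    if (multiline)
        return subject[pos] == '\n';     /* immediately before any newline */
    /* Default: also just before a newline that terminates the subject. */
    return pos == len - 1 && subject[pos] == '\n';
}
```

For the six-character subject "ab\ncd\n", offset 3 (just after the first newline) is a possible "^" position only when the multiline flag is set, while offset 5 (just before the final newline) is a possible "$" position in both modes.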
1327    
1328             PCRE_NEWLINE_CR
1329             PCRE_NEWLINE_LF
1330             PCRE_NEWLINE_CRLF
1331             PCRE_NEWLINE_ANYCRLF
1332             PCRE_NEWLINE_ANY
1333    
1334           These  options  override the default newline definition that was chosen
1335           when PCRE was built. Setting the first or the second specifies  that  a
1336           newline  is  indicated  by a single character (CR or LF, respectively).
1337           Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1338           two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1339           that any of the three preceding sequences should be recognized. Setting
1340           PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1341           recognized. The Unicode newline sequences are the three just mentioned,
1342           plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1343           U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1344           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1345           UTF-8 mode.
1346    
1347           The newline setting in the  options  word  uses  three  bits  that  are
1348           treated as a number, giving eight possibilities. Currently only six are
1349           used (default plus the five values above). This means that if  you  set
1350           more  than one newline option, the combination may or may not be sensi-
1351           ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1352           PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1353           cause an error.
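
The effect of combining newline options can be modelled with a small sketch. The constants below are hypothetical stand-ins, not the values from pcre.h: the real options occupy three bits elsewhere in the options word, and the numbers assigned to the last two options here are an assumption. Only the relationships stated above are modelled: six of the eight numbers are in use, and CR combined with LF gives the same number as CRLF.

```c
#include <stddef.h>

/* Hypothetical three-bit encoding of the newline setting (assumption). */
enum {
    NL_DEFAULT = 0,   /* use the build-time default */
    NL_CR      = 1,
    NL_LF      = 2,
    NL_CRLF    = 3,   /* same number as NL_CR | NL_LF */
    NL_ANY     = 4,
    NL_ANYCRLF = 5,
    NL_MASK    = 7    /* three bits, eight possible numbers */
};

/* Hypothetical helper: describes the combined setting, or returns NULL
   when the three-bit number is one of the unused values. */
const char *newline_setting_name(int bits)
{
    switch (bits & NL_MASK) {
    case NL_DEFAULT: return "build-time default";
    case NL_CR:      return "CR";
    case NL_LF:      return "LF";
    case NL_CRLF:    return "CRLF";
    case NL_ANY:     return "ANY";
    case NL_ANYCRLF: return "ANYCRLF";
    default:         return NULL;        /* unused number: an error */
    }
}
```

Under this encoding, NL_CR | NL_LF is reported as "CRLF", whereas NL_LF | NL_ANY yields an unused number and is rejected, which corresponds to the "inconsistent NEWLINE options" compile error.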
1354    
1355           The only time that a line break is specially recognized when  compiling
1356           a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a
1357           character class is encountered. This indicates  a  comment  that  lasts
1358           until  after the next line break sequence. In other circumstances, line
1359           break  sequences  are  treated  as  literal  data,   except   that   in
1360           PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1361           and are therefore ignored.
1362    
1363           The newline option that is set at compile time becomes the default that
1364           is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1365    
1366           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1367    
1368         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
# Line 678  COMPILING A PATTERN Line 1382  COMPILING A PATTERN
1382    
1383         This option causes PCRE to regard both the pattern and the  subject  as         This option causes PCRE to regard both the pattern and the  subject  as
1384         strings  of  UTF-8 characters instead of single-byte character strings.         strings  of  UTF-8 characters instead of single-byte character strings.
1385         However, it is available only if PCRE has been built to  include  UTF-8         However, it is available only when PCRE is built to include UTF-8  sup-
1386         support.  If  not, the use of this option provokes an error. Details of         port.  If not, the use of this option provokes an error. Details of how
1387         how this option changes the behaviour of PCRE are given in the  section         this option changes the behaviour of PCRE are given in the  section  on
1388         on UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1389    
1390           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1391    
1392         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1393         automatically checked. If an invalid UTF-8 sequence of bytes is  found,         automatically checked. There is a  discussion  about  the  validity  of
1394         pcre_compile()  returns an error. If you already know that your pattern         UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
1395         is valid, and you want to skip this check for performance reasons,  you         bytes is found, pcre_compile() returns an error. If  you  already  know
1396         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of         that your pattern is valid, and you want to skip this check for perfor-
1397         passing an invalid UTF-8 string as a pattern is undefined. It may cause         mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
1398         your  program  to  crash.  Note that there is a similar option for sup-         set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
1399         pressing the checking of subject strings passed to pcre_exec().         undefined. It may cause your program to crash. Note  that  this  option
1400           can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
1401           UTF-8 validity checking of subject strings.
1402    
1403    
1404    COMPILATION ERROR CODES
1405    
1406           The following table lists the error  codes  that  may  be  returned  by
1407           pcre_compile2(),  along with the error messages that may be returned by
1408           both compiling functions. As PCRE has developed, some error codes  have
1409           fallen out of use. To avoid confusion, they have not been re-used.
1410    
1411              0  no error
1412              1  \ at end of pattern
1413              2  \c at end of pattern
1414              3  unrecognized character follows \
1415              4  numbers out of order in {} quantifier
1416              5  number too big in {} quantifier
1417              6  missing terminating ] for character class
1418              7  invalid escape sequence in character class
1419              8  range out of order in character class
1420              9  nothing to repeat
1421             10  [this code is not in use]
1422             11  internal error: unexpected repeat
1423             12  unrecognized character after (? or (?-
1424             13  POSIX named classes are supported only within a class
1425             14  missing )
1426             15  reference to non-existent subpattern
1427             16  erroffset passed as NULL
1428             17  unknown option bit(s) set
1429             18  missing ) after comment
1430             19  [this code is not in use]
1431             20  regular expression is too large
1432             21  failed to get memory
1433             22  unmatched parentheses
1434             23  internal error: code overflow
1435             24  unrecognized character after (?<
1436             25  lookbehind assertion is not fixed length
1437             26  malformed number or name after (?(
1438             27  conditional group contains more than two branches
1439             28  assertion expected after (?(
1440             29  (?R or (?[+-]digits must be followed by )
1441             30  unknown POSIX class name
1442             31  POSIX collating elements are not supported
1443             32  this version of PCRE is not compiled with PCRE_UTF8 support
1444             33  [this code is not in use]
1445             34  character value in \x{...} sequence is too large
1446             35  invalid condition (?(0)
1447             36  \C not allowed in lookbehind assertion
1448             37  PCRE does not support \L, \l, \N, \U, or \u
1449             38  number after (?C is > 255
1450             39  closing ) for (?C expected
1451             40  recursive call could loop indefinitely
1452             41  unrecognized character after (?P
1453             42  syntax error in subpattern name (missing terminator)
1454             43  two named subpatterns have the same name
1455             44  invalid UTF-8 string
1456             45  support for \P, \p, and \X has not been compiled
1457             46  malformed \P or \p sequence
1458             47  unknown property name after \P or \p
1459             48  subpattern name is too long (maximum 32 characters)
1460             49  too many named subpatterns (maximum 10000)
1461             50  [this code is not in use]
1462             51  octal value is greater than \377 (not in UTF-8 mode)
1463             52  internal error: overran compiling workspace
1464             53  internal error: previously-checked referenced subpattern
1465                   not found
1466             54  DEFINE group contains more than one branch
1467             55  repeating a DEFINE group is not allowed
1468             56  inconsistent NEWLINE options
1469             57  \g is not followed by a braced, angle-bracketed, or quoted
1470                   name/number or by a plain number
1471             58  a numbered reference must not be zero
1472             59  (*VERB) with an argument is not supported
1473             60  (*VERB) not recognized
1474             61  number is too big
1475             62  subpattern name expected
1476             63  digit expected after (?+
1477             64  ] is an invalid data character in JavaScript compatibility mode
1478    
1479           The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1480           values may be used if the limits were changed when PCRE was built.
1481    
1482    
1483  STUDYING A PATTERN  STUDYING A PATTERN
1484    
1485         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1486              const char **errptr);              const char **errptr);
1487    
1488         When a pattern is going to be used several times, it is worth  spending         If  a  compiled  pattern is going to be used several times, it is worth
1489         more  time  analyzing it in order to speed up the time taken for match-         spending more time analyzing it in order to speed up the time taken for
1490         ing. The function pcre_study() takes a pointer to a compiled pattern as         matching.  The function pcre_study() takes a pointer to a compiled pat-
1491         its first argument. If studing the pattern produces additional informa-         tern as its first argument. If studying the pattern produces additional
1492         tion that will help speed up matching, pcre_study() returns  a  pointer         information  that  will  help speed up matching, pcre_study() returns a
1493         to  a  pcre_extra  block,  in  which the study_data field points to the         pointer to a pcre_extra block, in which the study_data field points  to
1494         results of the study.         the results of the study.
1495    
1496         The returned value from  a  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1497         pcre_exec().  However,  the pcre_extra block also contains other fields         pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-
1498         that can be set by the caller before the block  is  passed;  these  are         tains  other  fields  that can be set by the caller before the block is
1499         described  below.  If  studying  the pattern does not produce any addi-         passed; these are described below in the section on matching a pattern.
1500         tional information, pcre_study() returns NULL. In that circumstance, if  
1501         the  calling  program  wants  to  pass  some  of  the  other  fields to         If studying the  pattern  does  not  produce  any  useful  information,
1502         pcre_exec(), it must set up its own pcre_extra block.         pcre_study() returns NULL. In that circumstance, if the calling program
1503           wants  to  pass  any  of   the   other   fields   to   pcre_exec()   or
1504         The second argument contains option bits. At present,  no  options  are         pcre_dfa_exec(), it must set up its own pcre_extra block.
1505         defined for pcre_study(), and this argument should always be zero.  
1506           The  second  argument of pcre_study() contains option bits. At present,
1507         The  third argument for pcre_study() is a pointer for an error message.         no options are defined, and this argument should always be zero.
1508         If studying succeeds (even if no data is  returned),  the  variable  it  
1509         points  to  is set to NULL. Otherwise it points to a textual error mes-         The third argument for pcre_study() is a pointer for an error  message.
1510         sage. You should therefore test the error pointer for NULL after  call-         If  studying  succeeds  (even  if no data is returned), the variable it
1511         ing pcre_study(), to be sure that it has run successfully.         points to is set to NULL. Otherwise it is set to  point  to  a  textual
1512           error message. This is a static string that is part of the library. You
1513           must not try to free it. You should test the  error  pointer  for  NULL
1514           after calling pcre_study(), to be sure that it has run successfully.
1515    
1516         This is a typical call to pcre_study():         This is a typical call to pcre_study():
1517    
# Line 734  STUDYING A PATTERN Line 1521  STUDYING A PATTERN
1521             0,              /* no options exist */             0,              /* no options exist */
1522             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
1523    
1524         At present, studying a pattern is useful only for non-anchored patterns         Studying a pattern does two things: first, a lower bound for the length
1525         that do not have a single fixed starting character. A bitmap of  possi-         of subject string that is needed to match the pattern is computed. This
1526         ble starting characters is created.         does not mean that there are any strings of that length that match, but
1527           it does guarantee that no shorter strings match. The value is  used  by
1528           pcre_exec()  and  pcre_dfa_exec()  to  avoid  wasting time by trying to
1529           match strings that are shorter than the lower bound. You can  find  out
1530           the value in a calling program via the pcre_fullinfo() function.
1531    
1532           Studying a pattern is also useful for non-anchored patterns that do not
1533           have a single fixed starting character. A bitmap of  possible  starting
1534           bytes  is  created. This speeds up finding a position in the subject at
1535           which to start matching.
1536    
1537    
1538  LOCALE SUPPORT  LOCALE SUPPORT
1539    
1540         PCRE  handles  caseless matching, and determines whether characters are         PCRE handles caseless matching, and determines whether  characters  are
1541         letters, digits, or whatever, by reference to a  set  of  tables.  When         letters,  digits, or whatever, by reference to a set of tables, indexed
1542         running  in UTF-8 mode, this applies only to characters with codes less         by character value. When running in UTF-8 mode, this  applies  only  to
1543         than 256. The library contains a default set of tables that is  created         characters  with  codes  less than 128. Higher-valued codes never match
1544         in  the  default  C locale when PCRE is compiled. This is used when the         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built
1545         final argument of pcre_compile() is NULL, and is  sufficient  for  many         with  Unicode  character property support. The use of locales with Uni-
1546         applications.         code is discouraged. If you are handling characters with codes  greater
1547           than  128, you should either use UTF-8 and Unicode, or use locales, but
1548         An alternative set of tables can, however, be supplied. Such tables are         not try to mix the two.
1549         built by calling the pcre_maketables() function,  which  has  no  argu-  
1550         ments,  in  the  relevant  locale.  The  result  can  then be passed to         PCRE contains an internal set of tables that are used  when  the  final
1551         pcre_compile() as often as necessary. For example,  to  build  and  use         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many
1552         tables that are appropriate for the French locale (where accented char-         applications.  Normally, the internal tables recognize only ASCII char-
1553         acters with codes greater than 128 are treated as letters), the follow-         acters. However, when PCRE is built, it is possible to cause the inter-
1554         ing code could be used:         nal tables to be rebuilt in the default "C" locale of the local system,
1555           which may cause them to be different.
1556    
1557           The  internal tables can always be overridden by tables supplied by the
1558           application that calls PCRE. These may be created in a different locale
1559           from  the  default.  As more and more applications change to using Uni-
1560           code, the need for this locale support is expected to die away.
1561    
1562           External tables are built by calling  the  pcre_maketables()  function,
1563           which  has no arguments, in the relevant locale. The result can then be
1564           passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For
1565           example,  to  build  and use tables that are appropriate for the French
1566           locale (where accented characters with  values  greater  than  128  are
1567           treated as letters), the following code could be used:
1568    
1569           setlocale(LC_CTYPE, "fr");           setlocale(LC_CTYPE, "fr_FR");
1570           tables = pcre_maketables();           tables = pcre_maketables();
1571           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1572    
1573         The  tables  are  built in memory that is obtained via pcre_malloc. The         The  locale  name "fr_FR" is used on Linux and other Unix-like systems;
1574         pointer that is passed to pcre_compile is saved with the compiled  pat-         if you are using Windows, the name for the French locale is "french".
1575         tern, and the same tables are used via this pointer by pcre_study() and  
1576         pcre_exec(). Thus, for any single pattern,  compilation,  studying  and         When pcre_maketables() runs, the tables are built  in  memory  that  is
1577         matching  all  happen in the same locale, but different patterns can be         obtained  via  pcre_malloc. It is the caller's responsibility to ensure
1578         compiled in different locales. It is  the  caller's  responsibility  to         that the memory containing the tables remains available for as long  as
1579         ensure  that  the memory containing the tables remains available for as         it is needed.
1580         long as it is needed.  
1581           The pointer that is passed to pcre_compile() is saved with the compiled
1582           pattern, and the same tables are used via this pointer by  pcre_study()
1583           and normally also by pcre_exec(). Thus, by default, for any single pat-
1584           tern, compilation, studying and matching all happen in the same locale,
1585           but different patterns can be compiled in different locales.
1586    
1587           It  is  possible to pass a table pointer or NULL (indicating the use of
1588           the internal tables) to pcre_exec(). Although  not  intended  for  this
1589           purpose,  this facility could be used to match a pattern in a different
1590           locale from the one in which it was compiled. Passing table pointers at
1591           run time is discussed below in the section on matching a pattern.
1592    
1593    
1594  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
# Line 776  INFORMATION ABOUT A PATTERN Line 1596  INFORMATION ABOUT A PATTERN
1596         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1597              int what, void *where);              int what, void *where);
1598    
1599         The pcre_fullinfo() function returns information about a compiled  pat-         The  pcre_fullinfo() function returns information about a compiled pat-
1600         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern. It replaces the obsolete pcre_info() function, which is neverthe-
1601         less retained for backwards compatibility (and is documented below).         less retained for backwards compatibility (and is documented below).
1602    
1603         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled
1604         pattern.  The second argument is the result of pcre_study(), or NULL if         pattern. The second argument is the result of pcre_study(), or NULL  if
1605         the pattern was not studied. The third argument specifies  which  piece         the  pattern  was not studied. The third argument specifies which piece
1606         of  information  is required, and the fourth argument is a pointer to a         of information is required, and the fourth argument is a pointer  to  a
1607         variable to receive the data. The yield of the  function  is  zero  for         variable  to  receive  the  data. The yield of the function is zero for
1608         success, or one of the following negative numbers:         success, or one of the following negative numbers:
1609    
1610           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
# Line 792  INFORMATION ABOUT A PATTERN Line 1612  INFORMATION ABOUT A PATTERN
1612           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1613           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1614    
1615         Here  is a typical call of pcre_fullinfo(), to obtain the length of the         The "magic number" is placed at the start of each compiled  pattern  as
1616         compiled pattern:         a  simple check against passing an arbitrary memory pointer. Here is a
1617           typical call of pcre_fullinfo(), to obtain the length of  the  compiled
1618           pattern:
1619    
1620           int rc;           int rc;
1621           unsigned long int length;           size_t length;
1622           rc = pcre_fullinfo(           rc = pcre_fullinfo(
1623             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
1624             pe,               /* result of pcre_study(), or NULL */             pe,               /* result of pcre_study(), or NULL */
1625             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
1626             &length);         /* where to put the data */             &length);         /* where to put the data */
1627    
1628         The possible values for the third argument are defined in  pcre.h,  and         The  possible  values for the third argument are defined in pcre.h, and
1629         are as follows:         are as follows:
1630    
1631           PCRE_INFO_BACKREFMAX           PCRE_INFO_BACKREFMAX
1632    
1633         Return  the  number  of  the highest back reference in the pattern. The         Return the number of the highest back reference  in  the  pattern.  The
1634         fourth argument should point to an int variable. Zero  is  returned  if         fourth  argument  should  point to an int variable. Zero is returned if
1635         there are no back references.         there are no back references.
1636    
1637           PCRE_INFO_CAPTURECOUNT           PCRE_INFO_CAPTURECOUNT
1638    
1639         Return  the  number of capturing subpatterns in the pattern. The fourth         Return the number of capturing subpatterns in the pattern.  The  fourth
1640         argument should point to an int variable.         argument should point to an int variable.
1641    
1642             PCRE_INFO_DEFAULT_TABLES
1643    
1644           Return  a pointer to the internal default character tables within PCRE.
1645           The fourth argument should point to an unsigned char *  variable.  This
1646           information call is provided for internal use by the pcre_study() func-
1647           tion. External callers can cause PCRE to use  its  internal  tables  by
1648           passing a NULL table pointer.
1649    
1650           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1651    
1652         Return information about the first byte of any matched  string,  for  a         Return  information  about  the first byte of any matched string, for a
1653         non-anchored    pattern.    (This    option    used    to   be   called         non-anchored pattern. The fourth argument should point to an int  vari-
1654         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name
1655         compatibility.)         is still recognized for backwards compatibility.)
1656    
1657         If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as         If there is a fixed first byte, for example, from  a  pattern  such  as
1658         (cat|cow|coyote), it is returned in the integer pointed  to  by  where.         (cat|cow|coyote), its value is returned. Otherwise, if either
        Otherwise, if either  
1659    
1660         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
1661         branch starts with "^", or         branch starts with "^", or
# Line 846  INFORMATION ABOUT A PATTERN Line 1675  INFORMATION ABOUT A PATTERN
1675         returned. The fourth argument should point to an unsigned char *  vari-         returned. The fourth argument should point to an unsigned char *  vari-
1676         able.         able.
1677    
1678             PCRE_INFO_HASCRORLF
1679    
1680           Return  1  if  the  pattern  contains any explicit matches for CR or LF
1681           characters, otherwise 0. The fourth argument should  point  to  an  int
1682           variable.  An explicit match is either a literal CR or LF character, or
1683           \r or \n.
1684    
1685             PCRE_INFO_JCHANGED
1686    
1687           Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
1688           otherwise  0. The fourth argument should point to an int variable. (?J)
1689           and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1690    
1691           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1692    
1693         Return  the  value of the rightmost literal byte that must exist in any         Return the value of the rightmost literal byte that must exist  in  any
1694         matched string, other than at its  start,  if  such  a  byte  has  been         matched  string,  other  than  at  its  start,  if such a byte has been
1695         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
1696         is no such byte, -1 is returned. For anchored patterns, a last  literal         is  no such byte, -1 is returned. For anchored patterns, a last literal
1697         byte  is  recorded only if it follows something of variable length. For         byte is recorded only if it follows something of variable  length.  For
1698         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1699         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
1700    
1701             PCRE_INFO_MINLENGTH
1702    
1703           If the pattern was studied and a minimum length  for  matching  subject
1704           strings  was  computed,  its  value is returned. Otherwise the returned
1705           value is -1. The value is a number of characters, not bytes  (this  may
1706           be  relevant in UTF-8 mode). The fourth argument should point to an int
1707           variable. A non-negative value is a lower bound to the  length  of  any
1708           matching  string.  There  may not be any strings of that length that do
1709           actually match, but every string that does match is at least that long.
1710    
1711           PCRE_INFO_NAMECOUNT           PCRE_INFO_NAMECOUNT
1712           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
1713           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
1714    
1715         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE supports the use of named as well as numbered capturing  parenthe-
1716         ses. The names are just an additional way of identifying the  parenthe-         ses.  The names are just an additional way of identifying the parenthe-
1717         ses,  which still acquire a number. A caller that wants to extract data         ses, which still acquire numbers. Several convenience functions such as
1718         from a named subpattern must convert the name to a number in  order  to         pcre_get_named_substring()  are  provided  for extracting captured sub-
1719         access  the  correct  pointers  in  the  output  vector (described with         strings by name. It is also possible to extract the data  directly,  by
1720         pcre_exec() below). In order to do this, it must first use these  three         first  converting  the  name to a number in order to access the correct
1721         values to obtain the name-to-number mapping table for the pattern.         pointers in the output vector (described with pcre_exec() below). To do
1722           the  conversion,  you  need  to  use  the  name-to-number map, which is
1723           described by these three values.
1724    
1725         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1726         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1727         of  each  entry;  both  of  these  return  an int value. The entry size         of each entry; both of these  return  an  int  value.  The  entry  size
1728         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns
1729         a  pointer  to  the  first  entry of the table (a pointer to char). The         a pointer to the first entry of the table  (a  pointer  to  char).  The
1730         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1731         sis,  most  significant byte first. The rest of the entry is the corre-         sis, most significant byte first. The rest of the entry is  the  corre-
1732         sponding name, zero terminated. The names are  in  alphabetical  order.         sponding name, zero terminated.
1733         For  example,  consider  the following pattern (assume PCRE_EXTENDED is  
1734         set, so white space - including newlines - is ignored):         The  names are in alphabetical order. Duplicate names may appear if (?|
1735           is used to create multiple groups with the same number, as described in
1736           (?P<date> (?P<year>(\d\d)?\d\d) -         the  section  on  duplicate subpattern numbers in the pcrepattern page.
1737           (?P<month>\d\d) - (?P<day>\d\d) )         Duplicate names for subpatterns with different  numbers  are  permitted
1738           only  if  PCRE_DUPNAMES  is  set. In all cases of duplicate names, they
1739         There are four named subpatterns, so the table has  four  entries,  and         appear in the table in the order in which they were found in  the  pat-
1740         each  entry  in the table is eight bytes long. The table is as follows,         tern.  In  the  absence  of (?| this is the order of increasing number;
1741         with non-printing bytes shown in hex, and undefined bytes shown as ??:         when (?| is used this is not necessarily the case because later subpat-
1742           terns may have lower numbers.
1743    
1744           As  a  simple  example of the name/number table, consider the following
1745           pattern (assume PCRE_EXTENDED is set, so white space -  including  new-
1746           lines - is ignored):
1747    
1748             (?<date> (?<year>(\d\d)?\d\d) -
1749             (?<month>\d\d) - (?<day>\d\d) )
1750    
1751           There  are  four  named subpatterns, so the table has four entries, and
1752           each entry in the table is eight bytes long. The table is  as  follows,
1753           with non-printing bytes shown in hexadecimal, and undefined bytes shown
1754           as ??:
1755    
1756           00 01 d  a  t  e  00 ??           00 01 d  a  t  e  00 ??
1757           00 05 d  a  y  00 ?? ??           00 05 d  a  y  00 ?? ??
1758           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1759           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1760    
1761         When writing code to extract data from named subpatterns, remember that         When writing code to extract data  from  named  subpatterns  using  the
1762         the length of each entry may be different for each compiled pattern.         name-to-number  map,  remember that the length of the entries is likely
1763           to be different for each compiled pattern.
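
The table layout described above can be walked with a few lines of C. The following is a minimal sketch, not part of PCRE: find_group_number() and example_table are hypothetical names, and the table is written out by hand from the example (with the undefined bytes set to zero) rather than obtained from a compiled pattern. In a real program the three inputs would be the values described above as PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and PCRE_INFO_NAMEENTRYSIZE.

```c
#include <stddef.h>
#include <string.h>

/* Look up a subpattern name in a name-to-number table laid out as described
   above: name_count fixed-size entries of entry_size bytes, each holding a
   two-byte group number (most significant byte first) followed by the
   zero-terminated name. Returns the group number, or -1 if the name is
   absent. This is a hypothetical helper, not a PCRE function. */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The four-entry table from the example; undefined bytes are set to zero. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

A linear scan is used here because it stays correct when duplicate names are present, in which case it returns the first entry found. The entry size is passed in rather than hard-coded, since it is likely to differ between compiled patterns.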
1764    
1765             PCRE_INFO_OKPARTIAL
1766    
1767           Return 1  if  the  pattern  can  be  used  for  partial  matching  with
1768           pcre_exec(),  otherwise  0.  The fourth argument should point to an int
1769           variable. From  release  8.00,  this  always  returns  1,  because  the
1770           restrictions  that  previously  applied  to  partial matching have been
1771           lifted. The pcrepartial documentation gives details of  partial  match-
1772           ing.
1773    
1774           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1775    
1776         Return  a  copy of the options with which the pattern was compiled. The         Return  a  copy of the options with which the pattern was compiled. The
1777         fourth argument should point to an unsigned long  int  variable.  These         fourth argument should point to an unsigned long  int  variable.  These
1778         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1779         by any top-level option settings within the pattern itself.         by any top-level option settings at the start of the pattern itself. In
1780           other  words,  they are the options that will be in force when matching
1781           starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with
1782           the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
1783           and PCRE_EXTENDED.
1784    
1785         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A pattern is automatically anchored by PCRE if  all  of  its  top-level
1786         alternatives begin with one of the following:         alternatives begin with one of the following:
# Line 922  INFORMATION ABOUT A PATTERN Line 1803  INFORMATION ABOUT A PATTERN
1803    
1804           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1805    
1806         Returns  the  size of the data block pointed to by the study_data field         Return the size of the data block pointed to by the study_data field in
1807         in a pcre_extra block. That is, it is the  value  that  was  passed  to         a pcre_extra block. That is,  it  is  the  value  that  was  passed  to
1808         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1809         created by pcre_study(). The fourth argument should point to  a  size_t         created by pcre_study(). If pcre_extra is NULL, or there  is  no  study
1810           data,  zero  is  returned. The fourth argument should point to a size_t
1811         variable.         variable.
1812    
1813    
# Line 933  OBSOLETE INFO FUNCTION Line 1815  OBSOLETE INFO FUNCTION
1815    
1816         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1817    
1818         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1819         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1820         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1821         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1822         lowing negative numbers:         lowing negative numbers:
1823    
1824           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1825           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1826    
1827         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1828         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1829         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1830    
1831         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1832         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1833         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1834    
1835    
1836  MATCHING A PATTERN  REFERENCE COUNTS
1837    
1838           int pcre_refcount(pcre *code, int adjust);
1839    
1840           The pcre_refcount() function is used to maintain a reference  count  in
1841           the data block that contains a compiled pattern. It is provided for the
1842           benefit of applications that  operate  in  an  object-oriented  manner,
1843           where different parts of the application may be using the same compiled
1844           pattern, but you want to free the block when they are all done.
1845    
1846           When a pattern is compiled, the reference count field is initialized to
1847           zero.   It is changed only by calling this function, whose action is to
1848           add the adjust value (which may be positive or  negative)  to  it.  The
1849           yield of the function is the new value. However, the value of the count
1850           is constrained to lie between 0 and 65535, inclusive. If the new  value
1851           is outside these limits, it is forced to the appropriate limit value.
1852    
1853           Except  when it is zero, the reference count is not correctly preserved
1854           if a pattern is compiled on one host and then  transferred  to  a  host
1855           whose byte-order is different. (This seems a highly unlikely scenario.)
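
The arithmetic described above can be modelled as follows. This is only a sketch of the documented behaviour, assuming the current count is already within the permitted range; model_refcount() is a hypothetical name and the code is not taken from the PCRE source.

```c
/* Model of the documented reference count arithmetic: add adjust to the
   current count and force the result into the range 0 to 65535 inclusive.
   Assumes 0 <= current <= 65535. The comparisons are arranged so that no
   intermediate sum can overflow an int. */
int model_refcount(int current, int adjust)
{
    if (adjust > 65535 - current)
        return 65535;           /* would exceed the upper limit */
    if (adjust < -current)
        return 0;               /* would fall below zero */
    return current + adjust;
}
```

Because the count starts at zero, an application that shares a compiled pattern would normally call pcre_refcount() with an adjust value of 1 when it starts using the pattern and -1 when it finishes, and free the block once the yield is zero again.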
1856    
1857    
1858    MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1859    
1860         int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
1861              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1862              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1863    
1864         The  function pcre_exec() is called to match a subject string against a         The  function pcre_exec() is called to match a subject string against a
1865         pre-compiled pattern, which is passed in the code argument. If the pat-         compiled pattern, which is passed in the code argument. If the  pattern
1866         tern  has been studied, the result of the study should be passed in the         was  studied,  the  result  of  the study should be passed in the extra
1867         extra argument.         argument. This function is the main matching facility of  the  library,
1868           and it operates in a Perl-like manner. For specialist use there is also
1869           an alternative matching function, which is described below in the  sec-
1870           tion about the pcre_dfa_exec() function.
1871    
1872           In  most applications, the pattern will have been compiled (and option-
1873           ally studied) in the same process that calls pcre_exec().  However,  it
1874           is possible to save compiled patterns and study data, and then use them
1875           later in different processes, possibly even on different hosts.  For  a
1876           discussion about this, see the pcreprecompile documentation.
1877    
1878         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
1879    
# Line 973  MATCHING A PATTERN Line 1886  MATCHING A PATTERN
1886             11,             /* the length of the subject string */             11,             /* the length of the subject string */
1887             0,              /* start at offset 0 in the subject */             0,              /* start at offset 0 in the subject */
1888             0,              /* default options */             0,              /* default options */
1889             ovector,        /* vector for substring information */             ovector,        /* vector of integers for substring information */
1890             30);            /* number of elements in the vector */             30);            /* number of elements (NOT size in bytes) */
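
The final argument counts int elements, not bytes. A minimal sketch, assuming a vector declared at file scope and the hypothetical names OVECCOUNT and ovector_elements(), derives the count from the array itself so that the two cannot disagree:

```c
#include <stddef.h>

#define OVECCOUNT 30                /* hypothetical name for the vector size */

static int ovector[OVECCOUNT];      /* vector of integers for substring data */

/* Number of int elements in ovector; this is the value to pass as ovecsize.
   sizeof(ovector) alone would give the size in bytes, which is wrong. */
size_t ovector_elements(void)
{
    return sizeof(ovector) / sizeof(ovector[0]);
}
```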
1891    
1892       Extra data for pcre_exec()
1893    
1894         If the extra argument is not NULL, it must point to a  pcre_extra  data         If  the  extra argument is not NULL, it must point to a pcre_extra data
1895         block.  The pcre_study() function returns such a block (when it doesn't         block. The pcre_study() function returns such a block (when it  doesn't
1896         return NULL), but you can also create one for yourself, and pass  addi-         return  NULL), but you can also create one for yourself, and pass addi-
1897         tional information in it. The fields in the block are as follows:         tional information in it. The pcre_extra block contains  the  following
1898           fields (not necessarily in this order):
1899    
1900           unsigned long int flags;           unsigned long int flags;
1901           void *study_data;           void *study_data;
1902           unsigned long int match_limit;           unsigned long int match_limit;
1903             unsigned long int match_limit_recursion;
1904           void *callout_data;           void *callout_data;
1905             const unsigned char *tables;
1906    
1907         The  flags  field  is a bitmap that specifies which of the other fields         The  flags  field  is a bitmap that specifies which of the other fields
1908         are set. The flag bits are:         are set. The flag bits are:
1909    
1910           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1911           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1912             PCRE_EXTRA_MATCH_LIMIT_RECURSION
1913           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1914             PCRE_EXTRA_TABLES
1915    
1916         Other flag bits should be set to zero. The study_data field is  set  in         Other flag bits should be set to zero. The study_data field is  set  in
1917         the  pcre_extra  block  that is returned by pcre_study(), together with         the  pcre_extra  block  that is returned by pcre_study(), together with
1918         the appropriate flag bit. You should not set this yourself, but you can         the appropriate flag bit. You should not set this yourself, but you may
1919         add to the block by setting the other fields.         add  to  the  block by setting the other fields and their corresponding
1920           flag bits.
1921    
1922         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1923         a vast amount of resources when running patterns that are not going  to         a  vast amount of resources when running patterns that are not going to
1924         match,  but  which  have  a very large number of possibilities in their         match, but which have a very large number  of  possibilities  in  their
1925         search trees. The classic  example  is  the  use  of  nested  unlimited         search  trees. The classic example is a pattern that uses nested unlim-
1926         repeats. Internally, PCRE uses a function called match() which it calls         ited repeats.
1927         repeatedly (sometimes recursively). The limit is imposed on the  number  
1928         of  times  this function is called during a match, which has the effect         Internally, PCRE uses a function called match() which it calls  repeat-
1929         of limiting the amount of recursion  and  backtracking  that  can  take         edly  (sometimes  recursively). The limit set by match_limit is imposed
1930         place.  For  patterns that are not anchored, the count starts from zero         on the number of times this function is called during  a  match,  which
1931           has  the  effect  of  limiting the amount of backtracking that can take
1932           place. For patterns that are not anchored, the count restarts from zero
1933         for each position in the subject string.         for each position in the subject string.
1934    
1935         The default limit for the library can be set when PCRE  is  built;  the         The  default  value  for  the  limit can be set when PCRE is built; the
1936         default  default  is 10 million, which handles all but the most extreme         default default is 10 million, which handles all but the  most  extreme
1937         cases. You can reduce  the  default  by  supplying  pcre_exec()  with  a         cases.  You  can  override  the  default by supplying pcre_exec() with a
1938         pcre_extra  block  in  which match_limit is set to a smaller value, and         pcre_extra    block    in    which    match_limit    is    set,     and
1939         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is         PCRE_EXTRA_MATCH_LIMIT  is  set  in  the  flags  field. If the limit is
1940         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1941    
1942         The  pcre_callout  field is used in conjunction with the "callout" fea-         The match_limit_recursion field is similar to match_limit, but  instead
1943         ture, which is described in the pcrecallout documentation.         of limiting the total number of times that match() is called, it limits
1944           the depth of recursion. The recursion depth is a  smaller  number  than
1945           the  total number of calls, because not all calls to match() are recur-
1946           sive.  This limit is of use only if it is set smaller than match_limit.
1947    
1948           Limiting the recursion depth limits the amount of  stack  that  can  be
1949           used, or, when PCRE has been compiled to use memory on the heap instead
1950           of the stack, the amount of heap memory that can be used.
1951    
1952           The default value for match_limit_recursion can be  set  when  PCRE  is
1953           built;  the  default  default  is  the  same  value  as the default for
1954           match_limit. You can override the default by supplying pcre_exec()  with
1955           a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1956           PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1957           limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1958    
1959           The  callout_data  field is used in conjunction with the "callout" fea-
1960           ture, and is described in the pcrecallout documentation.
1961    
1962           The tables field  is  used  to  pass  a  character  tables  pointer  to
1963           pcre_exec();  this overrides the value that is stored with the compiled
1964           pattern. A non-NULL value is stored with the compiled pattern  only  if
1965           custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1966           ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1967           PCRE's  internal  tables  to be used. This facility is helpful when re-
1968           using patterns that have been saved after compiling  with  an  external
1969           set  of  tables,  because  the  external tables might be at a different
1970           address when pcre_exec() is called. See the  pcreprecompile  documenta-
1971           tion for a discussion of saving compiled patterns for later use.
1972    
1973       Option bits for pcre_exec()
1974    
1975           The  unused  bits of the options argument for pcre_exec() must be zero.
1976           The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1977           PCRE_NOTBOL,    PCRE_NOTEOL,    PCRE_NOTEMPTY,   PCRE_NOTEMPTY_ATSTART,
1978           PCRE_NO_START_OPTIMIZE,  PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_SOFT,   and
1979           PCRE_PARTIAL_HARD.
1980    
1981         The PCRE_ANCHORED option can be passed in the options  argument,  whose           PCRE_ANCHORED
        unused  bits  must  be zero. This limits pcre_exec() to matching at the  
        first matching position.  However,  if  a  pattern  was  compiled  with  
        PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,  
        it cannot be made unanchored at matching time.
   
        When PCRE_UTF8 was set at compile time, the validity of the subject  as  
        a  UTF-8  string is automatically checked, and the value of startoffset  
        is also checked to ensure that it points to the start of a UTF-8  char-  
        acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()  
        returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an  
        invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.  
1982    
1983         If  you  already  know that your subject is valid, and you want to skip         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
1984         these   checks   for   performance   reasons,   you   can    set    the         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
1985         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to         turned  out to be anchored by virtue of its contents, it cannot be made
1986         do this for the second and subsequent calls to pcre_exec() if  you  are         unanchored at matching time.
1987         making  repeated  calls  to  find  all  the matches in a single subject  
1988         string. However, you should be  sure  that  the  value  of  startoffset           PCRE_BSR_ANYCRLF
1989         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is           PCRE_BSR_UNICODE
1990         set, the effect of passing an invalid UTF-8 string as a subject,  or  a  
1991         value  of startoffset that does not point to the start of a UTF-8 char-         These options (which are mutually exclusive) control what the \R escape
1992         acter, is undefined. Your program may crash.         sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1993           or to match any Unicode newline sequence. These  options  override  the
1994           choice that was made or defaulted when the pattern was compiled.
1995    
1996             PCRE_NEWLINE_CR
1997             PCRE_NEWLINE_LF
1998             PCRE_NEWLINE_CRLF
1999             PCRE_NEWLINE_ANYCRLF
2000             PCRE_NEWLINE_ANY
2001    
2002           These  options  override  the  newline  definition  that  was chosen or
2003           defaulted when the pattern was compiled. For details, see the  descrip-
2004           tion  of  pcre_compile()  above.  During  matching,  the newline choice
2005           affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
2006           ters.  It may also alter the way the match position is advanced after a
2007           match failure for an unanchored pattern.
2008    
2009           When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY  is
2010           set,  and a match attempt for an unanchored pattern fails when the cur-
2011           rent position is at a  CRLF  sequence,  and  the  pattern  contains  no
2012           explicit  matches  for  CR  or  LF  characters,  the  match position is
2013           advanced by two characters instead of one, in other words, to after the
2014           CRLF.
2015    
2016           The above rule is a compromise that makes the most common cases work as
2017           expected. For example, if the  pattern  is  .+A  (and  the  PCRE_DOTALL
2018           option is not set), it does not match the string "\r\nA" because, after
2019           failing at the start, it skips both the CR and the LF before  retrying.
2020           However,  the  pattern  [\r\n]A does match that string, because it con-
2021           tains an explicit CR or LF reference, and so advances only by one char-
2022           acter after the first failure.
2023    
2024           An explicit match for CR or LF is either a literal appearance of one of
2025           those characters, or one of the \r or  \n  escape  sequences.  Implicit
2026           matches  such  as [^X] do not count, nor does \s (which includes CR and
2027           LF in the characters that it matches).
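
The rule for advancing the match position can be modelled as below. This is a sketch of the rule as stated, not PCRE's own code: the enum, next_start(), and the has_explicit_cr_or_lf flag are hypothetical names, and the model works in bytes, so it assumes a subject that is not being treated as UTF-8.

```c
#include <stddef.h>

/* Hypothetical names for illustration; these are not PCRE identifiers. */
enum newline_mode { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the offset at which the next match attempt starts after a failed
   attempt at offset pos in an unanchored search. has_explicit_cr_or_lf is
   non-zero if the pattern contains an explicit match for CR or LF. */
size_t next_start(const char *subject, size_t length, size_t pos,
                  enum newline_mode mode, int has_explicit_cr_or_lf)
{
    int crlf_is_newline =
        mode == NL_CRLF || mode == NL_ANYCRLF || mode == NL_ANY;

    if (crlf_is_newline && !has_explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;         /* step over the whole CRLF */

    return pos + 1;             /* ordinary advance by one character */
}
```

For the subject "\r\nA", the model gives a next position of 2 for a pattern such as .+A (no explicit CR or LF), and 1 for a pattern such as [\r\n]A, which agrees with the two cases described above.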
2028    
2029         There are also three further options that can be set only  at  matching         Notwithstanding the above, anomalous effects may still occur when  CRLF
2030         time:         is a valid newline sequence and explicit \r or \n escapes appear in the
2031           pattern.
2032    
2033           PCRE_NOTBOL           PCRE_NOTBOL
2034    
2035         The  first  character  of the string is not the beginning of a line, so         This option specifies that the first character of the subject string is not
2036         the circumflex metacharacter should not match before it.  Setting  this         the  beginning  of  a  line, so the circumflex metacharacter should not
2037         without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to         match before it. Setting this without PCRE_MULTILINE (at compile  time)
2038         match.         causes  circumflex  never to match. This option affects only the behav-
2039           iour of the circumflex metacharacter. It does not affect \A.
2040    
2041           PCRE_NOTEOL           PCRE_NOTEOL
2042    
2043         The end of the string is not the end of a line, so the dollar metachar-         This option specifies that the end of the subject string is not the end
2044         acter  should  not  match  it  nor (except in multiline mode) a newline         of  a line, so the dollar metacharacter should not match it nor (except
2045         immediately before it. Setting this without PCRE_MULTILINE (at  compile         in multiline mode) a newline immediately before it. Setting this  with-
2046         time) causes dollar never to match.         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
2047           option affects only the behaviour of the dollar metacharacter. It  does
2048           not affect \Z or \z.
2049    
2050           PCRE_NOTEMPTY           PCRE_NOTEMPTY
2051    
# Line 1069  MATCHING A PATTERN Line 2056  MATCHING A PATTERN
2056    
2057           a?b?           a?b?
2058    
2059         is applied to a string not beginning with "a" or "b",  it  matches  the         is applied to a string not beginning with "a" or  "b",  it  matches  an
2060         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
2061         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
2062         rences of "a" or "b".         rences of "a" or "b".
2063    
2064         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-           PCRE_NOTEMPTY_ATSTART
2065         cial case of a pattern match of the empty  string  within  its  split()  
2066         function,  and  when  using  the /g modifier. It is possible to emulate         This  is  like PCRE_NOTEMPTY, except that an empty string match that is
2067         Perl's behaviour after matching a null string by first trying the match         not at the start of  the  subject  is  permitted.  If  the  pattern  is
2068         again at the same offset with PCRE_NOTEMPTY set, and then if that fails         anchored, such a match can occur only if the pattern contains \K.
2069         by advancing the starting offset (see below)  and  trying  an  ordinary  
2070         match again.         Perl     has    no    direct    equivalent    of    PCRE_NOTEMPTY    or
2071           PCRE_NOTEMPTY_ATSTART, but it does make a special  case  of  a  pattern
2072         The  subject string is passed to pcre_exec() as a pointer in subject, a         match  of  the empty string within its split() function, and when using
2073         length in length, and a starting byte offset in startoffset. Unlike the         the /g modifier. It is  possible  to  emulate  Perl's  behaviour  after
2074         pattern  string,  the  subject  may contain binary zero bytes. When the         matching a null string by first trying the match again at the same off-
2075         starting offset is zero, the search for a match starts at the beginning         set with PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED,  and  then  if  that
2076         of the subject, and this is by far the most common case.         fails, by advancing the starting offset (see below) and trying an ordi-
2077           nary match again. There is some code that demonstrates how to  do  this
2078         If the pattern was compiled with the PCRE_UTF8 option, the subject must         in the pcredemo sample program.
2079         be a sequence of bytes that is a valid UTF-8 string, and  the  starting  
2080         offset  must point to the beginning of a UTF-8 character. If an invalid           PCRE_NO_START_OPTIMIZE
2081         UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8  
2082         or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option         There  are a number of optimizations that pcre_exec() uses at the start
2083         PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not         of a match, in order to speed up the process. For  example,  if  it  is
2084         defined.         known  that  a  match must start with a specific character, it searches
2085           the subject for that character, and fails immediately if it cannot find
2086         A  non-zero  starting offset is useful when searching for another match         it,  without actually running the main matching function. When callouts
2087         in the same subject by calling pcre_exec() again after a previous  suc-         are in use, these optimizations can cause  them  to  be  skipped.  This
2088         cess.   Setting  startoffset differs from just passing over a shortened         option  disables  the  "start-up" optimizations, causing performance to
2089         string and setting PCRE_NOTBOL in the case of  a  pattern  that  begins         suffer, but ensuring that the callouts do occur.
2090    
2091             PCRE_NO_UTF8_CHECK
2092    
2093           When PCRE_UTF8 is set at compile time, the validity of the subject as a
2094           UTF-8  string is automatically checked when pcre_exec() is subsequently
2095           called.  The value of startoffset is also checked  to  ensure  that  it
2096           points  to  the start of a UTF-8 character. There is a discussion about
2097           the validity of UTF-8 strings in the section on UTF-8  support  in  the
2098           main  pcre  page.  If  an  invalid  UTF-8  sequence  of bytes is found,
2099           pcre_exec() returns the error PCRE_ERROR_BADUTF8. If  startoffset  con-
2100           tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
2101    
2102           If  you  already  know that your subject is valid, and you want to skip
2103           these   checks   for   performance   reasons,   you   can    set    the
2104           PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to
2105           do this for the second and subsequent calls to pcre_exec() if  you  are
2106           making  repeated  calls  to  find  all  the matches in a single subject
2107           string. However, you should be  sure  that  the  value  of  startoffset
2108           points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is
2109           set, the effect of passing an invalid UTF-8 string as a subject,  or  a
2110           value  of startoffset that does not point to the start of a UTF-8 char-
2111           acter, is undefined. Your program may crash.
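
If you set PCRE_NO_UTF8_CHECK, a cheap check on startoffset is still possible in the calling program. The following sketch, using the hypothetical helper is_utf8_char_start(), tests only that the byte at the offset is not a UTF-8 continuation byte (one of the form 10xxxxxx); it does not validate the subject as a whole. Treating the end of the subject as a valid position is an assumption of this sketch.

```c
#include <stddef.h>

/* Return 1 if offset can be the start of a UTF-8 character in a subject of
   the given length in bytes, and 0 otherwise. Only the byte at offset is
   inspected; the rest of the subject is not validated. */
int is_utf8_char_start(const char *subject, size_t length, size_t offset)
{
    if (offset >= length)
        return offset == length;    /* end of subject counts as a boundary */

    /* Continuation bytes have the top two bits set to 10. */
    return ((unsigned char)subject[offset] & 0xC0) != 0x80;
}
```

For the four-byte subject "a\xC3\xA9z" (the letter a, a two-byte character, and the letter z), offsets 0, 1, 3, and 4 are accepted and offset 2, which falls in the middle of the two-byte character, is rejected.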
2112    
2113             PCRE_PARTIAL_HARD
2114             PCRE_PARTIAL_SOFT
2115    
2116           These options turn on the partial matching feature. For backwards  com-
2117           patibility,  PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A partial
2118           match occurs if the end of the subject string is reached  successfully,
2119           but  there  are not enough subject characters to complete the match. If
2120           this happens when PCRE_PARTIAL_HARD  is  set,  pcre_exec()  immediately
2121           returns  PCRE_ERROR_PARTIAL.  Otherwise,  if  PCRE_PARTIAL_SOFT is set,
2122           matching continues by testing any other alternatives. Only if they  all
2123           fail  is  PCRE_ERROR_PARTIAL  returned (instead of PCRE_ERROR_NOMATCH).
2124           The portion of the string that was inspected when the partial match was
2125           found  is  set  as  the first matching string. There is a more detailed
2126           discussion in the pcrepartial documentation.
2127    
2128       The string to be matched by pcre_exec()
2129    
2130           The subject string is passed to pcre_exec() as a pointer in subject,  a
2131           length (in bytes) in length, and a starting byte offset in startoffset.
2132           In UTF-8 mode, the byte offset must point to the start of a UTF-8 char-
2133           acter.  Unlike  the pattern string, the subject may contain binary zero
2134           bytes. When the starting offset is zero, the search for a match  starts
2135           at  the  beginning  of  the subject, and this is by far the most common
2136           case.
2137    
2138           A non-zero starting offset is useful when searching for  another  match
2139           in  the same subject by calling pcre_exec() again after a previous suc-
2140           cess.  Setting startoffset differs from just passing over  a  shortened
2141           string  and  setting  PCRE_NOTBOL  in the case of a pattern that begins
2142         with any kind of lookbehind. For example, consider the pattern         with any kind of lookbehind. For example, consider the pattern
2143    
2144           \Biss\B           \Biss\B
2145    
2146         which  finds  occurrences  of "iss" in the middle of words. (\B matches         which finds occurrences of "iss" in the middle of  words.  (\B  matches
2147         only if the current position in the subject is not  a  word  boundary.)         only  if  the  current position in the subject is not a word boundary.)
2148         When applied  to the string "Mississippi" the first call to pcre_exec()         When applied to the string "Mississippi" the first call to  pcre_exec()
2149         finds the first occurrence. If pcre_exec() is called  again  with  just         finds  the  first  occurrence. If pcre_exec() is called again with just
2150         the remainder  of  the  subject,  namely  "issippi", it does not match,         the remainder of the subject,  namely  "issippi",  it  does not  match,
2151         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2152         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to be a word boundary. However, if pcre_exec()  is  passed  the  entire
2153         string again, but with startoffset  set  to  4,  it  finds  the  second         string again, but with startoffset set to 4, it finds the second occur-
2154         occurrence  of  "iss"  because  it  is able to look behind the starting         rence of "iss" because it is able to look behind the starting point  to
2155         point to discover that it is preceded by a letter.         discover that it is preceded by a letter.
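
The difference between a non-zero startoffset and a shortened subject can be reproduced without PCRE. The sketch below is a hand-written stand-in for the pattern \Biss\B, not PCRE code; the names is_word_char, not_word_boundary, and find_iss_midword are hypothetical, and word characters are assumed to be ASCII letters, digits, and underscore.

```c
#include <ctype.h>
#include <string.h>

/* A word character in the default \w sense: letter, digit, or underscore. */
int is_word_char(unsigned char c)
{
    return isalnum(c) || c == '_';
}

/* \B holds at `pos` when the characters on both sides are of the same kind.
   Positions outside the subject count as non-word, so the start of a
   subject that begins with a letter is a word boundary. */
int not_word_boundary(const char *subject, int length, int pos)
{
    int before = pos > 0 && is_word_char((unsigned char)subject[pos - 1]);
    int after = pos < length && is_word_char((unsigned char)subject[pos]);
    return before == after;
}

/* Returns the offset of the first "iss" at or after startoffset that has
   no word boundary on either side, or -1 if there is none. */
int find_iss_midword(const char *subject, int length, int startoffset)
{
    int pos;
    for (pos = startoffset; pos + 3 <= length; pos++) {
        if (memcmp(subject + pos, "iss", 3) == 0 &&
            not_word_boundary(subject, length, pos) &&
            not_word_boundary(subject, length, pos + 3))
            return pos;
    }
    return -1;
}
```

Searching the whole subject from offset 4 can look back at the preceding letter, whereas searching only the remainder treats its first byte as the start of the subject, so the first \B fails there.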
2156    
2157         If a non-zero starting offset is passed when the pattern  is  anchored,         If  a  non-zero starting offset is passed when the pattern is anchored,
2158         one  attempt  to match at the given offset is tried. This can only suc-         one attempt to match at the given offset is made. This can only succeed
2159         ceed if the pattern does not require the match to be at  the  start  of         if  the  pattern  does  not require the match to be at the start of the
2160         the subject.         subject.
2161    
2162       How pcre_exec() returns captured substrings
2163    
2164         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
2165         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
2166         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
2167         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
2168         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
2169         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
2170         that do not cause substrings to be captured.         that do not cause substrings to be captured.
2171    
2172         Captured  substrings are returned to the caller via a vector of integer         Captured substrings are returned to the caller via a vector of integers
2173         offsets whose address is passed in ovector. The number of  elements  in         whose  address is passed in ovector. The number of elements in the vec-
2174         the vector is passed in ovecsize. The first two-thirds of the vector is         tor is passed in ovecsize, which must be a non-negative  number.  Note:
2175         used to pass back captured substrings, each substring using a  pair  of         this argument is NOT the size of ovector in bytes.
2176         integers.  The  remaining  third  of the vector is used as workspace by  
2177         pcre_exec() while matching capturing subpatterns, and is not  available         The  first  two-thirds of the vector is used to pass back captured sub-
2178         for  passing  back  information.  The  length passed in ovecsize should         strings, each substring using a pair of integers. The  remaining  third
2179         always be a multiple of three. If it is not, it is rounded down.         of  the  vector is used as workspace by pcre_exec() while matching cap-
2180           turing subpatterns, and is not available for passing back  information.
2181           The  number passed in ovecsize should always be a multiple of three. If
2182           it is not, it is rounded down.
2183    
2184         When a match has been successful, information about captured substrings         When a match is successful, information about  captured  substrings  is
2185         is returned in pairs of integers, starting at the beginning of ovector,         returned  in  pairs  of integers, starting at the beginning of ovector,
2186         and continuing up to two-thirds of its length at the  most.  The  first         and continuing up to two-thirds of its length at the  most.  The  first
2187         element of a pair is set to the offset of the first character in a sub-         element  of  each pair is set to the byte offset of the first character
2188         string, and the second is set to the  offset  of  the  first  character         in a substring, and the second is set to the byte offset of  the  first
2189         after  the  end  of  a  substring. The first pair, ovector[0] and ovec-         character  after  the end of a substring. Note: these values are always
2190         tor[1], identify the portion of  the  subject  string  matched  by  the         byte offsets, even in UTF-8 mode. They are not character counts.
2191         entire  pattern.  The next pair is used for the first capturing subpat-  
2192         tern, and so on. The value returned by pcre_exec()  is  the  number  of         The first pair of integers, ovector[0]  and  ovector[1],  identify  the
2193         pairs  that  have  been set. If there are no capturing subpatterns, the         portion  of  the subject string matched by the entire pattern. The next
2194         return value from a successful match is 1,  indicating  that  just  the         pair is used for the first capturing subpattern, and so on.  The  value
2195         first pair of offsets has been set.         returned by pcre_exec() is one more than the highest numbered pair that
2196           has been set.  For example, if two substrings have been  captured,  the
2197           returned  value is 3. If there are no capturing subpatterns, the return
2198           value from a successful match is 1, indicating that just the first pair
2199           of offsets has been set.
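
The offset pairs can be used directly to pull a substring out of the subject. The following is a sketch under stated assumptions: copy_capture is a hypothetical helper, not a PCRE function, and rc is assumed to be a positive return value from pcre_exec().

```c
#include <string.h>

/* Hypothetical helper, not part of PCRE.
   Copies capture `n` (0 means the whole match) into buf as a
   zero-terminated string. Returns its length, or -1 if pair n was not
   set by the match or the buffer is too small. */
int copy_capture(const char *subject, const int *ovector, int rc,
                 int n, char *buf, int bufsize)
{
    int start, end, len;

    if (n < 0 || n >= rc)
        return -1;
    start = ovector[2 * n];       /* byte offset of the first character */
    end = ovector[2 * n + 1];     /* byte offset just past the last one */
    if (start < 0 || end < start)
        return -1;
    len = end - start;
    if (len + 1 > bufsize)
        return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For a subject "key=value" with two captured substrings, the vector would begin 0, 9, 0, 3, 4, 9 and the return value would be 3.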
2200    
2201         Some  convenience  functions  are  provided for extracting the captured         If a capturing subpattern is matched repeatedly, it is the last portion
2202         substrings as separate strings. These are described  in  the  following         of the string that it matched that is returned.
        section.  
2203    
2204         It  is  possible  for  an capturing subpattern number n+1 to match some         If the vector is too small to hold all the captured substring  offsets,
2205         part of the subject when subpattern n has not been  used  at  all.  For         it is used as far as possible (up to two-thirds of its length), and the
2206         example, if the string "abc" is matched against the pattern (a|(z))(bc)         function returns a value of zero. If the substring offsets are  not  of
2207         subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both         interest,  pcre_exec()  may  be  called with ovector passed as NULL and
2208         offset values corresponding to the unused subpattern are set to -1.         ovecsize as zero. However, if the pattern contains back references  and
2209           the  ovector is not big enough to remember the related substrings, PCRE
2210           has to get additional memory for use during matching. Thus it  is  usu-
2211           ally advisable to supply an ovector.
2212    
2213           The pcre_fullinfo() function can be used to find out how many capturing
2214           subpatterns there are in a compiled  pattern.  The  smallest  size  for
2215           ovector  that  will allow for n captured substrings, in addition to the
2216           offsets of the substring matched by the whole pattern, is (n+1)*3.
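
The (n+1)*3 rule translates into a small allocation routine. This is a sketch only: alloc_ovector is a hypothetical name, and capture_count is assumed to come from a pcre_fullinfo() call made elsewhere.

```c
#include <stdlib.h>

/* Hypothetical helper, not part of PCRE.
   Allocates the smallest ovector that holds the whole-match pair plus
   capture_count captured substrings. The element count (not the byte
   size) is stored in *ovecsize_out, ready to pass as ovecsize. */
int *alloc_ovector(int capture_count, int *ovecsize_out)
{
    int ovecsize;

    *ovecsize_out = 0;
    if (capture_count < 0)
        return NULL;
    ovecsize = (capture_count + 1) * 3;   /* two thirds offsets, one third workspace */
    *ovecsize_out = ovecsize;
    return malloc((size_t)ovecsize * sizeof(int));
}
```

The caller frees the vector with free() once the offsets are no longer needed.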
2217    
2218           It is possible for capturing subpattern number n+1 to match  some  part
2219           of the subject when subpattern n has not been used at all. For example,
2220           if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
2221           return from the function is 4, and subpatterns 1 and 3 are matched, but
2222           2 is not. When this happens, both values in  the  offset  pairs  corre-
2223           sponding to unused subpatterns are set to -1.
2224    
2225           Offset  values  that correspond to unused subpatterns at the end of the
2226           expression are also set to -1. For example,  if  the  string  "abc"  is
2227           matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
2228           matched. The return from the function is 2, because  the  highest  used
2229           capturing subpattern number is 1. However, you can refer to the offsets
2230           for the second and third capturing subpatterns if  you  wish  (assuming
2231           the vector is large enough, of course).
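
Because unused subpatterns are marked with -1 in both halves of the pair, testing whether a group took part in the match needs only the vector itself. The sketch below uses a hypothetical helper, capture_is_set, and bounds the group number by the size of the vector rather than by the return value, so that trailing unused groups can also be queried.

```c
/* Hypothetical helper, not part of PCRE.
   Returns 1 if capturing subpattern n has a valid offset pair in
   ovector, and 0 if it is unset or lies outside the usable two thirds
   of a vector with ovecsize elements. */
int capture_is_set(const int *ovector, int ovecsize, int n)
{
    if (n < 0 || n >= ovecsize / 3)
        return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

For "abc" matched against (a|(z))(bc), groups 1 and 3 report as set and group 2 does not. For "abc" against (abc)(x(yz)?)?, groups 2 and 3 both report as unset even though the return value is only 2.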
2232    
2233         If a capturing subpattern is matched repeatedly, it is the last portion         Some  convenience  functions  are  provided for extracting the captured
2234         of the string that it matched that gets returned.         substrings as separate strings. These are described below.
2235    
2236         If the vector is too small to hold all the captured substrings,  it  is     Error return values from pcre_exec()
        used as far as possible (up to two-thirds of its length), and the func-  
        tion returns a value of zero. In particular, if the  substring  offsets  
        are  not  of interest, pcre_exec() may be called with ovector passed as  
        NULL and ovecsize as zero. However, if the pattern contains back refer-  
        ences  and  the  ovector  isn't big enough to remember the related sub-  
        strings, PCRE has to get additional memory  for  use  during  matching.  
        Thus it is usually advisable to supply an ovector.  
   
        Note  that  pcre_info() can be used to find out how many capturing sub-  
        patterns there are in a compiled pattern. The smallest size for ovector  
        that  will  allow for n captured substrings, in addition to the offsets  
        of the substring matched by the whole pattern, is (n+1)*3.  
2237    
2238         If pcre_exec() fails, it returns a negative number. The  following  are         If pcre_exec() fails, it returns a negative number. The  following  are
2239         defined in the header file:         defined in the header file:
# Line 1196  MATCHING A PATTERN Line 2254  MATCHING A PATTERN
2254           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
2255    
2256         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
2257         to  catch  the case when it is passed a junk pointer. This is the error         to catch the case when it is passed a junk pointer and to detect when a
2258         it gives when the magic number isn't present.         pattern that was compiled in an environment of one endianness is run in
2259           an  environment  with the other endianness. This is the error that PCRE
2260           gives when the magic number is not present.
2261    
2262           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
2263    
2264         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
2265         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled  pattern.  This  error  could be caused by a bug in PCRE or by
# Line 1211  MATCHING A PATTERN Line 2271  MATCHING A PATTERN
2271         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
2272         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE gets a block of memory at the start of matching to  use  for  this
2273         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose.  If the call via pcre_malloc() fails, this error is given. The
2274         memory is freed at the end of matching.         memory is automatically freed at the end of matching.
2275    
2276           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2277    
# Line 1221  MATCHING A PATTERN Line 2281  MATCHING A PATTERN
2281    
2282           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
2283    
2284         The recursion and backtracking limit, as specified by  the  match_limit         The backtracking limit, as specified by  the  match_limit  field  in  a
2285         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         pcre_extra  structure  (or  defaulted) was reached. See the description
2286         description above.         above.
2287    
2288           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2289    
# Line 1242  MATCHING A PATTERN Line 2302  MATCHING A PATTERN
2302         value of startoffset did not point to the beginning of a UTF-8  charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2303         ter.         ter.
2304    
2305             PCRE_ERROR_PARTIAL        (-12)
2306    
2307           The  subject  string did not match, but it did match partially. See the
2308           pcrepartial documentation for details of partial matching.
2309    
2310             PCRE_ERROR_BADPARTIAL     (-13)
2311    
2312           This code is no longer in  use.  It  was  formerly  returned  when  the
2313           PCRE_PARTIAL  option  was used with a compiled pattern containing items
2314           that were  not  supported  for  partial  matching.  From  release  8.00
2315           onwards, there are no restrictions on partial matching.
2316    
2317             PCRE_ERROR_INTERNAL       (-14)
2318    
2319           An  unexpected  internal error has occurred. This error could be caused
2320           by a bug in PCRE or by overwriting of the compiled pattern.
2321    
2322             PCRE_ERROR_BADCOUNT       (-15)
2323    
2324           This error is given if the value of the ovecsize argument is negative.
2325    
2326             PCRE_ERROR_RECURSIONLIMIT (-21)
2327    
2328           The internal recursion limit, as specified by the match_limit_recursion
2329           field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2330           description above.
2331    
2332             PCRE_ERROR_BADNEWLINE     (-23)
2333    
2334           An invalid combination of PCRE_NEWLINE_xxx options was given.
2335    
2336           Error numbers -16 to -20 and -22 are not used by pcre_exec().
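
For logging, the negative return values can be mapped to their names. The following sketch covers only the codes whose numbers appear in this part of the text; exec_error_name is a hypothetical helper, and the numeric literals are used in place of the pcre.h macros so that the fragment stands alone.

```c
/* Hypothetical helper, not part of PCRE. Numeric values are those
   listed in the documentation; real code would use the pcre.h macros. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:  return rc < 0 ? "other error" : "not an error";
    }
}
```

Codes not listed here, including -16 to -20 and -22, fall through to the default branch.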
2337    
2338    
2339  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2340    
# Line 1256  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2349  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2349         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2350              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2351    
2352         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
2353         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
2354         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2355         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
2356         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
2357         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
2358         substrings.  A  substring  that  contains  a  binary  zero is correctly         substrings.
2359         extracted and has a further zero added on the end, but  the  result  is  
2360         not, of course, a C string.         A substring that contains a binary zero is correctly extracted and  has
2361           a  further zero added on the end, but the result is not, of course, a C
2362           string.  However, you can process such a string  by  referring  to  the
2363           length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
2364           string().  Unfortunately, the interface to pcre_get_substring_list() is
2365           not  adequate for handling strings containing binary zeros, because the
2366           end of the final string is not independently indicated.
2367    
2368         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
2369         tions: subject is the subject string which has just  been  successfully         tions:  subject  is  the subject string that has just been successfully
2370         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2371         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2372         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
2373         entire regular expression. This is the value returned by  pcre_exec  if         entire regular expression. This is the value returned by pcre_exec() if
2374         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
2375         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
2376         be the size of the vector divided by three.         be the number of elements in the vector divided by three.
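
The rule for choosing stringcount can be written as a small function. This is a sketch; stringcount_for is a hypothetical name, rc is the value returned by pcre_exec(), and ovecsize is the element count that was passed to it.

```c
/* Hypothetical helper, not part of PCRE.
   Returns the value to pass as stringcount to the substring extraction
   functions, or -1 if the match failed and there is nothing to extract. */
int stringcount_for(int rc, int ovecsize)
{
    if (rc > 0)
        return rc;              /* normal case: use the return value */
    if (rc == 0)
        return ovecsize / 3;    /* vector was too small: use every pair it holds */
    return -1;                  /* negative rc is an error code */
}
```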
2377    
2378         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
2379         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
2380         zero  extracts  the  substring  that  matched the entire pattern, while         zero extracts the substring that matched the  entire  pattern,  whereas
2381         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
2382         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
2383         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
2384         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
2385         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
2386         the terminating zero, or one of         the terminating zero, or one of these error codes:
2387    
2388           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2389    
2390         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
2391         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2392    
2393           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2394    
2395         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2396    
2397         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
2398         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
2399         single block of memory which is obtained via pcre_malloc.  The  address         single block of memory that is obtained via pcre_malloc. The address of
2400         of the memory block is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
2401         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
2402         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all  went  well,  or  the
2403           error code
2404    
2405           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2406    
# Line 1313  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2413  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2413         string  by inspecting the appropriate offset in ovector, which is nega-         string  by inspecting the appropriate offset in ovector, which is nega-
2414         tive for unset substrings.         tive for unset substrings.
2415    
2416         The    two    convenience    functions    pcre_free_substring()     and         The two convenience functions pcre_free_substring() and  pcre_free_sub-
2417         pcre_free_substring_list() can be used to free the memory returned by a         string_list()  can  be  used  to free the memory returned by a previous
2418         previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2419         respectively. They do nothing more than call the function pointed to by         tively.  They  do  nothing  more  than  call the function pointed to by
2420         pcre_free, which of course could be called directly from a  C  program.         pcre_free, which of course could be called directly from a  C  program.
2421         However,  PCRE is used in some situations where it is linked via a spe-         However,  PCRE is used in some situations where it is linked via a spe-
2422         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
2423         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free  directly;  it is for these cases that the functions are pro-
2424         vided.         vided.
2425    
2426    
2427  EXTRACTING CAPTURED SUBSTRINGS BY NAME  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2428    
2429           int pcre_get_stringnumber(const pcre *code,
2430                const char *name);
2431    
2432         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
2433              const char *subject, int *ovector,              const char *subject, int *ovector,
2434              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2435              char *buffer, int buffersize);              char *buffer, int buffersize);
2436    
        int pcre_get_stringnumber(const pcre *code,  
             const char *name);  
   
2437         int pcre_get_named_substring(const pcre *code,         int pcre_get_named_substring(const pcre *code,
2438              const char *subject, int *ovector,              const char *subject, int *ovector,
2439              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2440              const char **stringptr);              const char **stringptr);
2441    
2442         To extract a substring by name, you first have to find the  associated         To extract a substring by name, you first have to find the  associated
2443         number.  This  can  be  done by calling pcre_get_stringnumber(). The first         number.  For example, for this pattern
2444         argument is the compiled pattern, and the second is the name. For exam-  
2445         ple, for this pattern           (a+)b(?<xxx>\d+)...
2446    
2447           ab(?<xxx>\d+)...         the number of the subpattern called "xxx" is 2. If the name is known to
2448           be unique (PCRE_DUPNAMES was not set), you can find the number from the
2449         the  number  of the subpattern called "xxx" is 1. Given the number, you         name by calling pcre_get_stringnumber(). The first argument is the com-
2450         can then extract the substring directly, or use one  of  the  functions         piled pattern, and the second is the name. The yield of the function is
2451         described  in the previous section. For convenience, there are also two         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2452         functions that do the whole job.         subpattern of that name.
2453    
2454           Given the number, you can extract the substring directly, or use one of
2455           the functions described in the previous section. For convenience, there
2456           are also two functions that do the whole job.
2457    
2458         Most   of   the   arguments    of    pcre_copy_named_substring()    and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
2459         pcre_get_named_substring() are the same as those for the functions that         pcre_get_named_substring()  are  the  same  as  those for the similarly
2460         extract by number, and so are not re-described here. There are just two         named functions that extract by number. As these are described  in  the
2461         differences.         previous  section,  they  are not re-described here. There are just two
2462           differences:
2463    
2464         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
2465         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2466         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
2467         name-to-number translation table.         name-to-number translation table.
2468    
2469         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2470         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2471         ate.         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2472           behaviour may not be what you want (see the next section).
2473  Last updated: 09 December 2003  
2474  Copyright (c) 1997-2003 University of Cambridge.         Warning: If the pattern uses the (?| feature to set up multiple subpat-
2475  -----------------------------------------------------------------------------         terns  with  the  same number, as described in the section on duplicate
2476           subpattern numbers in the pcrepattern page, you  cannot  use  names  to
2477           distinguish  the  different subpatterns, because names are not included
2478           in the compiled code. The matching process uses only numbers. For  this
2479           reason,  the  use of different names for subpatterns of the same number
2480           causes an error at compile time.
2481    
2482    
2483    DUPLICATE SUBPATTERN NAMES
2484    
2485           int pcre_get_stringtable_entries(const pcre *code,
2486                const char *name, char **first, char **last);
2487    
2488           When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
2489           subpatterns  are not required to be unique. (Duplicate names are always
2490           allowed for subpatterns with the same number, created by using the  (?|
2491           feature.  Indeed,  if  such subpatterns are named, they are required to
2492           use the same names.)
2493    
2494           Normally, patterns with duplicate names are such that in any one match,
2495           only  one of the named subpatterns participates. An example is shown in
2496           the pcrepattern documentation.
2497    
2498           When   duplicates   are   present,   pcre_copy_named_substring()    and
2499           pcre_get_named_substring()  return the first substring corresponding to
2500           the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
2501           (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
2502           function returns one of the numbers that are associated with the  name,
2503           but it is not defined which it is.
2504    
2505           If  you want to get full details of all captured substrings for a given
2506           name, you must use  the  pcre_get_stringtable_entries()  function.  The
2507           first argument is the compiled pattern, and the second is the name. The
2508           third and fourth are pointers to variables which  are  updated  by  the
2509           function. After it has run, they point to the first and last entries in
2510           the name-to-number table  for  the  given  name.  The  function  itself
2511           returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
2512           there are none. The format of the table is described above in the  sec-
2513           tion  entitled  Information  about  a  pattern.  Given all the relevant
2514           entries for the name, you can extract each of their numbers, and  hence
2515           the captured data, if any.
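
The table walk described above can be sketched as follows. This is a minimal illustration rather than part of the PCRE API: collect_group_numbers() is a hypothetical helper, and it relies on the entry layout that the pcreapi page gives for PCRE_INFO_NAMETABLE (the subpattern number in the first two bytes of each entry, most significant byte first, followed by the zero-terminated name).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not a PCRE function. Walks the name-table entries
 * from first to last (both inclusive), as set by
 * pcre_get_stringtable_entries(), and stores each subpattern number in out.
 * entry_size is the positive value returned by that function.
 * Returns how many numbers were stored (at most max_out).
 */
static int collect_group_numbers(const char *first, const char *last,
                                 int entry_size, int *out, int max_out)
{
    int count = 0;
    const char *entry;

    assert(entry_size > 2 && first <= last);

    for (entry = first; entry <= last && count < max_out; entry += entry_size) {
        /* Read the bytes as unsigned so that values above 127 decode correctly. */
        const unsigned char *p = (const unsigned char *)entry;
        out[count++] = (p[0] << 8) | p[1];
    }
    return count;
}
```

After a successful pcre_exec(), the data captured by group n is then found at ovector[2*n] and ovector[2*n+1]; both are -1 if that group did not take part in the match.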
2516    
2517    
2518    FINDING ALL POSSIBLE MATCHES
2519    
2520           The  traditional  matching  function  uses a similar algorithm to Perl,
2521           which stops when it finds the first match, starting at a given point in
2522           the  subject.  If you want to find all possible matches, or the longest
2523           possible match, consider using the alternative matching  function  (see
2524           below)  instead.  If you cannot use the alternative function, but still
2525           need to find all possible matches, you can kludge it up by  making  use
2526           of the callout facility, which is described in the pcrecallout documen-
2527           tation.
2528    
2529  PCRE(3)                                                                PCRE(3)         What you have to do is to insert a callout right at the end of the pat-
2530           tern.   When your callout function is called, extract and save the cur-
2531           rent matched substring. Then return  1,  which  forces  pcre_exec()  to
2532           backtrack  and  try other alternatives. Ultimately, when it runs out of
2533           matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2534    
2535    
2536    MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
2537    
2538  NAME         int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
2539         PCRE - Perl-compatible regular expressions              const char *subject, int length, int startoffset,
2540                int options, int *ovector, int ovecsize,
2541                int *workspace, int wscount);
2542    
2543  PCRE CALLOUTS         The function pcre_dfa_exec()  is  called  to  match  a  subject  string
2544           against  a  compiled pattern, using a matching algorithm that scans the
2545           subject string just once, and does not backtrack.  This  has  different
2546           characteristics  to  the  normal  algorithm, and is not compatible with
2547           Perl. Some of the features of PCRE patterns are not  supported.  Never-
2548           theless,  there are times when this kind of matching can be useful. For
2549           a discussion of the two matching algorithms, and  a  list  of  features
2550           that  pcre_dfa_exec() does not support, see the pcrematching documenta-
2551           tion.
2552    
2553         int (*pcre_callout)(pcre_callout_block *);         The arguments for the pcre_dfa_exec() function  are  the  same  as  for
2554           pcre_exec(), plus two extras. The ovector argument is used in a differ-
2555           ent way, and this is described below. The other  common  arguments  are
2556           used  in  the  same way as for pcre_exec(), so their description is not
2557           repeated here.
2558    
2559           The two additional arguments provide workspace for  the  function.  The
2560           workspace  vector  should  contain at least 20 elements. It is used for
2561           keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2562           workspace  will  be  needed for patterns and subjects where there are a
2563           lot of potential matches.
2564    
2565         PCRE provides a feature called "callout", which is a means of temporar-         Here is an example of a simple call to pcre_dfa_exec():
        ily passing control to the caller of PCRE  in  the  middle  of  pattern  
        matching.  The  caller of PCRE provides an external function by putting  
        its entry point in the global variable pcre_callout. By  default,  this  
        variable contains NULL, which disables all calling out.  
2566    
2567         Within  a  regular  expression,  (?C) indicates the points at which the           int rc;
2568         external function is to be called.  Different  callout  points  can  be           int ovector[10];
2569         identified  by  putting  a number less than 256 after the letter C. The           int wspace[20];
2570         default value is zero.  For  example,  this  pattern  has  two  callout           rc = pcre_dfa_exec(
2571         points:             re,             /* result of pcre_compile() */
2572               NULL,           /* we didn't study the pattern */
2573               "some string",  /* the subject string */
2574               11,             /* the length of the subject string */
2575               0,              /* start at offset 0 in the subject */
2576               0,              /* default options */
2577               ovector,        /* vector of integers for substring information */
2578               10,             /* number of elements (NOT size in bytes) */
2579               wspace,         /* working space vector */
2580               20);            /* number of elements (NOT size in bytes) */
2581    
2582       Option bits for pcre_dfa_exec()
2583    
2584           The unused bits of the options argument  for  pcre_dfa_exec()  must  be
2585           zero.  The  only  bits  that  may  be  set are PCRE_ANCHORED, PCRE_NEW-
2586           LINE_xxx,        PCRE_NOTBOL,        PCRE_NOTEOL,        PCRE_NOTEMPTY,
2587           PCRE_NOTEMPTY_ATSTART, PCRE_NO_UTF8_CHECK, PCRE_PARTIAL_HARD, PCRE_PAR-
2588           TIAL_SOFT, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All  but  the  last
2589           four  of  these  are  exactly  the  same  as  for pcre_exec(), so their
2590           description is not repeated here.
2591    
2592             PCRE_PARTIAL_HARD
2593             PCRE_PARTIAL_SOFT
2594    
2595           These have the same general effect as they do for pcre_exec(), but  the
2596           details  are  slightly  different.  When  PCRE_PARTIAL_HARD  is set for
2597           pcre_dfa_exec(), it returns PCRE_ERROR_PARTIAL if the end of  the  sub-
2598           ject  is  reached  and there is still at least one matching possibility
2599           that requires additional characters. This happens even if some complete
2600           matches have also been found. When PCRE_PARTIAL_SOFT is set, the return
2601           code PCRE_ERROR_NOMATCH is converted into PCRE_ERROR_PARTIAL if the end
2602           of  the  subject  is  reached, there have been no complete matches, but
2603           there is still at least one matching possibility. The  portion  of  the
2604           string  that  was inspected when the longest partial match was found is
2605           set as the first matching string in both cases.
2606    
2607             PCRE_DFA_SHORTEST
2608    
2609           Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2610           stop as soon as it has found one match. Because of the way the alterna-
2611           tive algorithm works, this is necessarily the shortest  possible  match
2612           at the first possible matching point in the subject string.
2613    
2614             PCRE_DFA_RESTART
2615    
2616           When pcre_dfa_exec() returns a partial match, it is possible to call it
2617           again, with additional subject characters, and have  it  continue  with
2618           the  same match. The PCRE_DFA_RESTART option requests this action; when
2619           it is set, the workspace and wscount arguments must reference the  same
2620           vector  as  before  because data about the match so far is left in them
2621           after a partial match. There is more discussion of this facility in the
2622           pcrepartial documentation.
2623    
2624       Successful returns from pcre_dfa_exec()
2625    
2626           When  pcre_dfa_exec()  succeeds, it may have matched more than one sub-
2627           string in the subject. Note, however, that all the matches from one run
2628           of  the  function  start  at the same point in the subject. The shorter
2629           matches are all initial substrings of the longer matches. For  example,
2630           if the pattern
2631    
2632             <.*>
2633    
2634           is matched against the string
2635    
2636             This is <something> <something else> <something further> no more
2637    
2638           the three matched strings are
2639    
2640             <something>
2641             <something> <something else>
2642             <something> <something else> <something further>
2643    
2644           On  success,  the  yield of the function is a number greater than zero,
2645           which is the number of matched substrings.  The  substrings  themselves
2646           are  returned  in  ovector. Each string uses two elements; the first is
2647           the offset to the start, and the second is the offset to  the  end.  In
2648           fact,  all  the  strings  have the same start offset. (Space could have
2649           been saved by giving this only once, but it was decided to retain  some
2650           compatibility  with  the  way pcre_exec() returns data, even though the
2651           meaning of the strings is different.)
2652    
2653           The strings are returned in reverse order of length; that is, the long-
2654           est  matching  string is given first. If there were too many matches to
2655           fit into ovector, the yield of the function is zero, and the vector  is
2656           filled with the longest matches.
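
The return conventions above can be sketched with two small helpers. They are hypothetical names, not PCRE functions, and they use only the rules stated in this section: a positive yield is the number of offset pairs, and a yield of zero means that ovector was filled with the longest matches (assumed here to be ovecsize/2 pairs).

```c
#include <assert.h>
#include <string.h>

/* Number of usable offset pairs in ovector after pcre_dfa_exec() returned rc. */
static int dfa_match_count(int rc, int ovecsize)
{
    if (rc > 0) return rc;            /* rc matches were stored */
    if (rc == 0) return ovecsize / 2; /* too many matches: vector is full */
    return 0;                         /* negative rc: error or no match */
}

/*
 * Copy match number i (0 is the longest) into buf as a C string.
 * Returns the length of the match, or -1 if buf is too small.
 */
static int dfa_copy_match(const char *subject, const int *ovector, int i,
                          char *buf, int bufsize)
{
    int start = ovector[2 * i];
    int length = ovector[2 * i + 1] - start;

    assert(length >= 0);
    if (length >= bufsize) return -1;
    memcpy(buf, subject + start, (size_t)length);
    buf[length] = '\0';
    return length;
}
```

For the <.*> example above, the three offset pairs would be (8,56), (8,36) and (8,19), so dfa_copy_match() with i set to 2 yields the shortest string, <something>.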
2657    
2658       Error returns from pcre_dfa_exec()
2659    
2660           The  pcre_dfa_exec()  function returns a negative number when it fails.
2661           Many of the errors are the same  as  for  pcre_exec(),  and  these  are
2662           described  above.   There are in addition the following errors that are
2663           specific to pcre_dfa_exec():
2664    
2665             PCRE_ERROR_DFA_UITEM      (-16)
2666    
2667           This return is given if pcre_dfa_exec() encounters an item in the  pat-
2668           tern  that  it  does not support, for instance, the use of \C or a back
2669           reference.
2670    
2671             PCRE_ERROR_DFA_UCOND      (-17)
2672    
2673           This return is given if pcre_dfa_exec()  encounters  a  condition  item
2674           that  uses  a back reference for the condition, or a test for recursion
2675           in a specific group. These are not supported.
2676    
2677             PCRE_ERROR_DFA_UMLIMIT    (-18)
2678    
2679           This return is given if pcre_dfa_exec() is called with an  extra  block
2680           that contains a setting of the match_limit field. This is not supported
2681           (it is meaningless).
2682    
2683             PCRE_ERROR_DFA_WSSIZE     (-19)
2684    
2685           This return is given if  pcre_dfa_exec()  runs  out  of  space  in  the
2686           workspace vector.
2687    
2688             PCRE_ERROR_DFA_RECURSE    (-20)
2689    
2690           When  a  recursive subpattern is processed, the matching function calls
2691           itself recursively, using private vectors for  ovector  and  workspace.
2692           This  error  is  given  if  the output vector is not large enough. This
2693           should be extremely rare, as a vector of size 1000 is used.
2694    
2695    
2696    SEE ALSO
2697    
2698           pcrebuild(3), pcrecallout(3), pcrecpp(3), pcrematching(3),
2699           pcrepartial(3), pcreposix(3), pcreprecompile(3), pcresample(3), pcrestack(3).
2700    
2701    
2702    AUTHOR
2703    
2704           Philip Hazel
2705           University Computing Service
2706           Cambridge CB2 3QH, England.
2707    
2708    
2709    REVISION
2710    
2711           Last updated: 03 October 2009
2712           Copyright (c) 1997-2009 University of Cambridge.
2713    ------------------------------------------------------------------------------
2714    
2715    
2716    PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2717    
2718    
2719    NAME
2720           PCRE - Perl-compatible regular expressions
2721    
2722    
2723    PCRE CALLOUTS
2724    
2725           int (*pcre_callout)(pcre_callout_block *);
2726    
2727           PCRE provides a feature called "callout", which is a means of temporar-
2728           ily passing control to the caller of PCRE  in  the  middle  of  pattern
2729           matching.  The  caller of PCRE provides an external function by putting
2730           its entry point in the global variable pcre_callout. By  default,  this
2731           variable contains NULL, which disables all calling out.
2732    
2733           Within  a  regular  expression,  (?C) indicates the points at which the
2734           external function is to be called.  Different  callout  points  can  be
2735           identified  by  putting  a number less than 256 after the letter C. The
2736           default value is zero.  For  example,  this  pattern  has  two  callout
2737           points:
2738    
2739           (?C1)abc(?C2)def           (?C1)abc(?C2)def
2740    
2741         During matching, when PCRE reaches a callout point (and pcre_callout is         If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() or
2742         set), the external function is called. Its only argument is  a  pointer         pcre_compile2() is called, PCRE  automatically  inserts  callouts,  all
2743         to a pcre_callout block. This contains the following variables:         with  number  255,  before  each  item  in the pattern. For example, if
2744           PCRE_AUTO_CALLOUT is used with the pattern
2745    
2746             A(\d{2}|--)
2747    
2748           it is processed as if it were
2749    
2750           (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
2751    
2752           Notice that there is a callout before and after  each  parenthesis  and
2753           alternation  bar.  Automatic  callouts  can  be  used  for tracking the
2754           progress of pattern matching. The pcretest command has an  option  that
2755           sets  automatic callouts; when it is used, the output indicates how the
2756           pattern is matched. This is useful information when you are  trying  to
2757           optimize the performance of a particular pattern.
2758    
2759    
2760    MISSING CALLOUTS
2761    
2762           You  should  be  aware  that,  because of optimizations in the way PCRE
2763           matches patterns by default, callouts  sometimes  do  not  happen.  For
2764           example, if the pattern is
2765    
2766             ab(?C4)cd
2767    
2768           PCRE knows that any matching string must contain the letter "d". If the
2769           subject string is "abyz", the lack of "d" means that  matching  doesn't
2770           ever  start,  and  the  callout is never reached. However, with "abyd",
2771           though the result is still no match, the callout is obeyed.
2772    
2773           If the pattern is studied, PCRE knows the minimum length of a  matching
2774           string,  and will immediately give a "no match" return without actually
2775           running a match if the subject is not long enough, or,  for  unanchored
2776           patterns, if it has been scanned far enough.
2777    
2778           You  can disable these optimizations by passing the PCRE_NO_START_OPTI-
2779           MIZE option to pcre_exec() or  pcre_dfa_exec().  This  slows  down  the
2780           matching  process,  but  does  ensure that callouts such as the example
2781           above are obeyed.
2782    
2783    
2784    THE CALLOUT INTERFACE
2785    
2786           During matching, when PCRE reaches a callout point, the external  func-
2787           tion  defined by pcre_callout is called (if it is set). This applies to
2788           both the pcre_exec() and the pcre_dfa_exec()  matching  functions.  The
2789           only  argument  to  the callout function is a pointer to a pcre_callout
2790           block. This structure contains the following fields:
2791    
2792           int          version;           int          version;
2793           int          callout_number;           int          callout_number;
# Line 1408  PCRE CALLOUTS Line 2799  PCRE CALLOUTS
2799           int          capture_top;           int          capture_top;
2800           int          capture_last;           int          capture_last;
2801           void        *callout_data;           void        *callout_data;
2802             int          pattern_position;
2803             int          next_item_length;
2804    
2805         The  version  field  is an integer containing the version number of the         The version field is an integer containing the version  number  of  the
2806         block format. The current version  is  zero.  The  version  number  may         block  format. The initial version was 0; the current version is 1. The
2807         change  in  future if additional fields are added, but the intention is         version number will change again in future  if  additional  fields  are
2808         never to remove any of the existing fields.         added, but the intention is never to remove any of the existing fields.
2809    
2810         The callout_number field contains the number of the  callout,  as  com-         The  callout_number  field  contains the number of the callout, as com-
2811         piled into the pattern (that is, the number after ?C).         piled into the pattern (that is, the number after ?C for  manual  call-
2812           outs, and 255 for automatically generated callouts).
2813    
2814         The  offset_vector field is a pointer to the vector of offsets that was         The  offset_vector field is a pointer to the vector of offsets that was
2815         passed by the caller to pcre_exec(). The contents can be  inspected  in         passed  by  the  caller  to  pcre_exec()   or   pcre_dfa_exec().   When
2816         order  to extract substrings that have been matched so far, in the same         pcre_exec()  is used, the contents can be inspected in order to extract
2817         way as for extracting substrings after a match has completed.         substrings that have been matched so  far,  in  the  same  way  as  for
2818           extracting  substrings after a match has completed. For pcre_dfa_exec()
2819           this field is not useful.
2820    
2821         The subject and subject_length fields contain copies  the  values  that         The subject and subject_length fields contain copies of the values that
2822         were passed to pcre_exec().         were passed to pcre_exec().
2823    
2824         The  start_match  field contains the offset within the subject at which         The  start_match  field normally contains the offset within the subject
2825         the current match attempt started. If the pattern is not anchored,  the         at which the current match attempt  started.  However,  if  the  escape
2826         callout  function  may  be  called several times for different starting         sequence  \K has been encountered, this value is changed to reflect the
2827         points.         modified starting point. If the pattern is not  anchored,  the  callout
2828           function may be called several times from the same point in the pattern
2829           for different starting points in the subject.
2830    
2831         The current_position field contains the offset within  the  subject  of         The current_position field contains the offset within  the  subject  of
2832         the current match pointer.         the current match pointer.
2833    
2834         The  capture_top field contains one more than the number of the highest         When  the  pcre_exec() function is used, the capture_top field contains
2835         numbered  captured  substring  so  far.  If  no  substrings  have  been         one more than the number of the highest numbered captured substring  so
2836         captured, the value of capture_top is one.         far.  If  no substrings have been captured, the value of capture_top is
2837           one. This is always the case when pcre_dfa_exec() is used,  because  it
2838           does not support captured substrings.
2839    
2840         The  capture_last  field  contains the number of the most recently cap-         The  capture_last  field  contains the number of the most recently cap-
2841         tured substring.         tured substring. If no substrings have been captured, its value is  -1.
2842           This is always the case when pcre_dfa_exec() is used.
2843    
2844         The callout_data field contains a value that is passed  to  pcre_exec()         The  callout_data  field contains a value that is passed to pcre_exec()
2845         by  the  caller specifically so that it can be passed back in callouts.         or pcre_dfa_exec() specifically so that it can be passed back in  call-
2846         It is passed in the callout_data field of the  pcre_extra  data  struc-         outs.  It  is  passed  in the callout_data field of the pcre_extra data
2847         ture.  If  no  such  data  was  passed,  the value of callout_data in a         structure. If no such data was passed, the value of callout_data  in  a
2848         pcre_callout block is NULL. There is a description  of  the  pcre_extra         pcre_callout  block  is  NULL. There is a description of the pcre_extra
2849         structure in the pcreapi documentation.         structure in the pcreapi documentation.
2850    
2851           The pattern_position field is present from version 1 of the  pcre_call-
2852           out structure. It contains the offset to the next item to be matched in
2853           the pattern string.
2854    
2855           The next_item_length field is present from version 1 of the  pcre_call-
2856           out structure. It contains the length of the next item to be matched in
2857           the pattern string. When the callout immediately precedes  an  alterna-
2858           tion  bar, a closing parenthesis, or the end of the pattern, the length
2859           is zero. When the callout precedes an opening parenthesis,  the  length
2860           is that of the entire subpattern.
2861    
2862           The  pattern_position  and next_item_length fields are intended to help
2863           in distinguishing between different automatic callouts, which all  have
2864           the same callout number. However, they are set for all callouts.
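
As an illustration of how these two fields might be used, the sketch below extracts the text of the next pattern item. copy_next_item() is a hypothetical helper, not part of PCRE. The callout block does not carry the pattern string itself, so the sketch assumes that the caller keeps the pattern available, for example by passing it through callout_data.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical helper: copy the pattern item described by a callout's
 * pattern_position and next_item_length fields into buf as a C string.
 * Returns the item length, or -1 if the item lies outside the pattern
 * or does not fit in buf.
 */
static int copy_next_item(const char *pattern, int pattern_position,
                          int next_item_length, char *buf, int bufsize)
{
    assert(pattern_position >= 0 && next_item_length >= 0);

    if ((size_t)pattern_position + (size_t)next_item_length > strlen(pattern))
        return -1;
    if (next_item_length >= bufsize)
        return -1;

    memcpy(buf, pattern + pattern_position, (size_t)next_item_length);
    buf[next_item_length] = '\0';
    return next_item_length;
}
```

A callout function would call this with the pattern_position and next_item_length values from its block. A zero length gives an empty string, which is the case before an alternation bar, a closing parenthesis, or the end of the pattern.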
2865    
2866    
2867  RETURN VALUES  RETURN VALUES
2868    
2869         The callout function returns an integer. If the value is zero, matching         The  external callout function returns an integer to PCRE. If the value
2870         proceeds as normal. If the value is greater than zero,  matching  fails         is zero, matching proceeds as normal. If  the  value  is  greater  than
2871         at the current point, but backtracking to test other possibilities goes         zero,  matching  fails  at  the current point, but the testing of other
2872         ahead, just as if a lookahead assertion had failed.  If  the  value  is         matching possibilities goes ahead, just as if a lookahead assertion had
2873         less  than  zero,  the  match is abandoned, and pcre_exec() returns the         failed.  If  the  value  is less than zero, the match is abandoned, and
2874         value.         pcre_exec() or pcre_dfa_exec() returns the negative value.
2875    
2876         Negative  values  should  normally  be   chosen   from   the   set   of         Negative  values  should  normally  be   chosen   from   the   set   of
2877         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
# Line 1464  RETURN VALUES Line 2879  RETURN VALUES
2879         reserved  for  use  by callout functions; it will never be used by PCRE         reserved  for  use  by callout functions; it will never be used by PCRE
2880         itself.         itself.
2881    
 Last updated: 21 January 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
2882    
2883  PCRE(3)                                                                PCRE(3)  AUTHOR
2884    
2885           Philip Hazel
2886           University Computing Service
2887           Cambridge CB2 3QH, England.
2888    
2889    
2890    REVISION
2891    
2892           Last updated: 29 September 2009
2893           Copyright (c) 1997-2009 University of Cambridge.
2894    ------------------------------------------------------------------------------
2895    
2896    
2897    PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2898    
2899    
2900  NAME  NAME
2901         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2902    
2903  DIFFERENCES FROM PERL  
2904    DIFFERENCES BETWEEN PCRE AND PERL
2905    
2906         This  document describes the differences in the ways that PCRE and Perl         This  document describes the differences in the ways that PCRE and Perl
2907         handle regular expressions. The differences  described  here  are  with         handle regular expressions. The differences  described  here  are  with
2908         respect to Perl 5.8.         respect to Perl 5.10.
2909    
2910         1.  PCRE does not have full UTF-8 support. Details of what it does have         1.  PCRE has only a subset of Perl's UTF-8 and Unicode support. Details
2911         are given in the section on UTF-8 support in the main pcre page.         of what it does have are given in the section on UTF-8 support  in  the
2912           main pcre page.
2913    
2914         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2915         permits  them,  but they do not mean what you might think. For example,         permits them, but they do not mean what you might think.  For  example,
2916         (?!a){3} does not assert that the next three characters are not "a". It         (?!a){3} does not assert that the next three characters are not "a". It
2917         just asserts that the next character is not "a" three times.         just asserts that the next character is not "a" three times.
2918    
2919         3.  Capturing  subpatterns  that occur inside negative lookahead asser-         3. Capturing subpatterns that occur inside  negative  lookahead  asser-
2920         tions are counted, but their entries in the offsets  vector  are  never         tions  are  counted,  but their entries in the offsets vector are never
2921         set.  Perl sets its numerical variables from any such patterns that are         set. Perl sets its numerical variables from any such patterns that  are
2922         matched before the assertion fails to match something (thereby succeed-         matched before the assertion fails to match something (thereby succeed-
2923         ing),  but  only  if the negative lookahead assertion contains just one         ing), but only if the negative lookahead assertion  contains  just  one
2924         branch.         branch.
2925    
2926         4. Though binary zero characters are supported in the  subject  string,         4.  Though  binary zero characters are supported in the subject string,
2927         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
2928         mal C string, terminated by zero. The escape sequence "\0" can be  used         mal C string, terminated by zero. The escape sequence \0 can be used in
2929         in the pattern to represent a binary zero.         the pattern to represent a binary zero.
2930    
2931         5.  The  following Perl escape sequences are not supported: \l, \u, \L,         5. The following Perl escape sequences are not supported: \l,  \u,  \L,
2932         \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general         \U, and \N. In fact these are implemented by Perl's general string-han-
2933         string-handling and are not part of its pattern matching engine. If any         dling and are not part of its pattern matching engine. If any of  these
2934         of these are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
2935    
2936         6. PCRE does support the \Q...\E escape for quoting substrings. Charac-         6.  The Perl escape sequences \p, \P, and \X are supported only if PCRE
2937         ters  in  between  are  treated as literals. This is slightly different         is built with Unicode character property support. The  properties  that
2938         from Perl in that $ and @ are  also  handled  as  literals  inside  the         can  be tested with \p and \P are limited to the general category prop-
2939         quotes.  In Perl, they cause variable interpolation (but of course PCRE         erties such as Lu and Nd, script names such as Greek or  Han,  and  the
2940           derived  properties  Any  and  L&. PCRE does support the Cs (surrogate)
2941           property, which Perl does not; the  Perl  documentation  says  "Because
2942           Perl hides the need for the user to understand the internal representa-
2943           tion of Unicode characters, there is no need to implement the  somewhat
2944           messy concept of surrogates."
2945    
2946           7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2947           ters in between are treated as literals.  This  is  slightly  different
2948           from  Perl  in  that  $  and  @ are also handled as literals inside the
2949           quotes. In Perl, they cause variable interpolation (but of course  PCRE
2950         does not have variables). Note the following examples:         does not have variables). Note the following examples:
2951    
2952             Pattern            PCRE matches      Perl matches             Pattern            PCRE matches      Perl matches
# Line 1519  DIFFERENCES FROM PERL Line 2956  DIFFERENCES FROM PERL
2956             \Qabc\$xyz\E       abc\$xyz          abc\$xyz             \Qabc\$xyz\E       abc\$xyz          abc\$xyz
2957             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz             \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz
2958    
2959         The \Q...\E sequence is recognized both inside  and  outside  character         The  \Q...\E  sequence  is recognized both inside and outside character
2960         classes.         classes.
2961    
2962         7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2963         constructions. However, there is some experimental support  for  recur-         constructions.  However,  there is support for recursive patterns. This
2964         sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).         is not available in Perl 5.8, but it is in Perl 5.10.  Also,  the  PCRE
2965         Also, the PCRE "callout" feature allows  an  external  function  to  be         "callout"  feature allows an external function to be called during pat-
2966         called during pattern matching.         tern matching. See the pcrecallout documentation for details.
2967    
2968         8.  There  are some differences that are concerned with the settings of         9. Subpatterns that are called  recursively  or  as  "subroutines"  are
2969         captured strings when part of  a  pattern  is  repeated.  For  example,         always  treated  as  atomic  groups  in  PCRE. This is like Python, but
2970         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2         unlike Perl. There is a discussion of an example that explains this  in
2971           more  detail  in  the section on recursion differences from Perl in the
2972           pcrepattern page.
2973    
2974           10. There are some differences that are concerned with the settings  of
2975           captured  strings  when  part  of  a  pattern is repeated. For example,
2976           matching "aba" against the  pattern  /^(a(b)?)+$/  in  Perl  leaves  $2
2977         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
2978    
2979         9. PCRE  provides  some  extensions  to  the  Perl  regular  expression         11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
2980         facilities:         (*FAIL), (*F), (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but  only  in
2981           the forms without an argument. PCRE does not support (*MARK).
2982         (a)  Although  lookbehind  assertions  must match fixed length strings,  
2983         each alternative branch of a lookbehind assertion can match a different         12.  PCRE's handling of duplicate subpattern numbers and duplicate sub-
2984         length of string. Perl requires them all to have the same length.         pattern names is not as general as Perl's. This is a consequence of the
2985           fact that PCRE works internally just with numbers, using  an  external
2986           table to translate between numbers and names. In particular, a pattern
2987           such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses  have
2988           the same number but different names, is not supported,  and  causes  an
2989           error  at compile time. If it were allowed, it would not be possible to
2990           distinguish which parentheses matched, because both names map  to  cap-
2991           turing subpattern number 1. To avoid this confusing situation, an error
2992           is given at compile time.
2993    
2994           13. PCRE provides some extensions to the Perl regular expression facil-
2995           ities.   Perl  5.10  includes new features that are not in earlier ver-
2996           sions of Perl, some of which (such as named parentheses) have  been  in
2997           PCRE for some time. This list is with respect to Perl 5.10:
2998    
2999           (a)  Although  lookbehind  assertions  in  PCRE must match fixed length
3000           strings, each alternative branch of a lookbehind assertion can match  a
3001           different  length  of  string.  Perl requires them all to have the same
3002           length.
3003    
3004         (b)  If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the $         (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is not set, the  $
3005         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
3006    
3007         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
3008         cial meaning is faulted.         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
3009           ignored.  (Perl can be made to issue a warning.)
3010    
3011         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-
3012         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
3013         lowed by a question mark they are.         lowed by a question mark they are.
3014    
3015         (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
3016         the first matching position in the subject string.         tried only at the first matching position in the subject string.
3017    
3018         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NOTEMPTY_ATSTART,
3019         TURE options for pcre_exec() have no Perl equivalents.         and  PCRE_NO_AUTO_CAPTURE  options for pcre_exec() have no Perl equiva-
3020           lents.
3021    
3022         (g)  The (?R), (?number), and (?P>name) constructs allows for recursive         (g) The \R escape sequence can be restricted to match only CR,  LF,  or
3023         pattern matching (Perl can do  this  using  the  (?p{code})  construct,         CRLF by the PCRE_BSR_ANYCRLF option.
        which PCRE cannot support.)  
3024    
3025         (h)  PCRE supports named capturing substrings, using the Python syntax.         (h) The callout facility is PCRE-specific.
3026    
3027         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from         (i) The partial matching facility is PCRE-specific.
        Sun's Java package.  
3028    
3029         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
3030           even on different hosts that have the other endianness.
3031    
3032         (k) The callout facility is PCRE-specific.         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a
3033           different way and is not Perl-compatible.
3034    
3035  Last updated: 09 December 2003         (l)  PCRE  recognizes some special sequences such as (*CR) at the start
3036  Copyright (c) 1997-2003 University of Cambridge.         of a pattern that set overall options that cannot be changed within the
3037  -----------------------------------------------------------------------------         pattern.
3038    
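The captured-string difference described in item 10 above can be observed with a short script. This is only a minimal sketch: it uses Python's re module rather than PCRE itself, an assumption made so that the example is self-contained. Python's engine, like PCRE, keeps the value captured on an earlier iteration of a repeated group, so it reports "b" for the second group, where Perl leaves $2 unset. The helper name captured_groups is hypothetical and is not part of any PCRE API.

```python
import re

# Pattern and subject taken from item 10: /^(a(b)?)+$/ matched against "aba".
PATTERN = r"^(a(b)?)+$"
SUBJECT = "aba"


def captured_groups(pattern, subject):
    """Return the tuple of captured groups, or None if there is no match."""
    match = re.match(pattern, subject)
    return match.groups() if match else None


groups = captured_groups(PATTERN, SUBJECT)

# Group 1 holds the text of the last iteration of the outer group.
# Group 2 still holds the "b" captured on the first iteration, because
# on the second iteration (b)? matched nothing and did not reset it.
print(groups)
```

With PCRE itself the same check would go through pcre_compile() and pcre_exec(), then inspect the offsets returned for the second capturing subpattern.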
 PCRE(3)                                                                PCRE(3)  
3039    
3040    AUTHOR
3041    
3042           Philip Hazel
3043           University Computing Service
3044           Cambridge CB2 3QH, England.
3045    
3046    
3047    REVISION
3048    
3049           Last updated: 04 October 2009
3050           Copyright (c) 1997-2009 University of Cambridge.
3051    ------------------------------------------------------------------------------
3052    
3053    
3054    PCREPATTERN(3)                                                  PCREPATTERN(3)
3055    
3056    
3057  NAME  NAME
3058         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
3059    
3060    
3061  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
3062    
3063         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax and semantics of the regular expressions that are supported
3064         are described below. Regular expressions are also described in the Perl         by PCRE are described in detail below. There is a quick-reference  syn-
3065         documentation  and in a number of other books, some of which have copi-         tax summary in the pcresyntax page. PCRE tries to match Perl syntax and
3066         ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-         semantics as closely as it can. PCRE  also  supports  some  alternative
3067         lished  by  O'Reilly, covers them in great detail. The description here         regular  expression  syntax (which does not conflict with the Perl syn-
3068         is intended as reference documentation.         tax) in order to provide some compatibility with regular expressions in
3069           Python, .NET, and Oniguruma.
3070         The basic operation of PCRE is on strings of bytes. However,  there  is  
3071         also  support for UTF-8 character strings. To use this support you must         Perl's  regular expressions are described in its own documentation, and
3072         build PCRE to include UTF-8 support, and then call pcre_compile()  with         regular expressions in general are covered in a number of  books,  some
3073         the  PCRE_UTF8  option.  How  this affects the pattern matching is men-         of  which  have  copious  examples. Jeffrey Friedl's "Mastering Regular
3074         tioned in several places below. There is also a summary of  UTF-8  fea-         Expressions", published by  O'Reilly,  covers  regular  expressions  in
3075         tures in the section on UTF-8 support in the main pcre page.         great  detail.  This  description  of  PCRE's  regular  expressions  is
3076           intended as reference material.
3077    
3078           The original operation of PCRE was on strings of  one-byte  characters.
3079           However,  there is now also support for UTF-8 character strings. To use
3080           this, PCRE must be built to include UTF-8 support, and  you  must  call
3081           pcre_compile()  or  pcre_compile2() with the PCRE_UTF8 option. There is
3082           also a special sequence that can be given at the start of a pattern:
3083    
3084             (*UTF8)
3085    
3086           Starting a pattern with this sequence  is  equivalent  to  setting  the
3087           PCRE_UTF8  option.  This  feature  is  not Perl-compatible. How setting
3088           UTF-8 mode affects pattern matching  is  mentioned  in  several  places
3089           below.  There  is  also  a  summary of UTF-8 features in the section on
3090           UTF-8 support in the main pcre page.
3091    
3092           The remainder of this document discusses the  patterns  that  are  sup-
3093           ported  by  PCRE when its main matching function, pcre_exec(), is used.
3094           From  release  6.0,   PCRE   offers   a   second   matching   function,
3095           pcre_dfa_exec(),  which matches using a different algorithm that is not
3096           Perl-compatible. Some of the features discussed below are not available
3097           when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
3098           alternative function, and how it differs from the normal function,  are
3099           discussed in the pcrematching page.
3100    
3101    
3102    NEWLINE CONVENTIONS
3103    
3104           PCRE  supports five different conventions for indicating line breaks in
3105           strings: a single CR (carriage return) character, a  single  LF  (line-
3106           feed) character, the two-character sequence CRLF, any of the three pre-
3107           ceding, or any Unicode newline sequence. The pcreapi page  has  further
3108           discussion  about newlines, and shows how to set the newline convention