Diff of /code/trunk/doc/pcre.txt

revision 73 by nigel, Sat Feb 24 21:40:30 2007 UTC revision 313 by ph10, Wed Jan 23 18:05:06 2008 UTC
# Line 1  Line 1 
1    -----------------------------------------------------------------------------
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
# Line 5  synopses of each function in the library Line 6  synopses of each function in the library
6  separate text files for the pcregrep and pcretest commands.  separate text files for the pcregrep and pcretest commands.
7  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
8    
 PCRE(3)                                                                PCRE(3)  
9    
10    PCRE(3)                                                                PCRE(3)
11    
12    
13  NAME  NAME
14         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
15    
16  DESCRIPTION  
17    INTRODUCTION
18    
19         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
20         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
21         just  a  few  differences.  The current implementation of PCRE (release         just  a  few differences. (Certain features that appeared in Python and
22         4.x) corresponds approximately with Perl  5.8,  including  support  for         PCRE before they appeared in Perl are also available using  the  Python
23         UTF-8  encoded  strings.   However,  this  support has to be explicitly         syntax.)
24         enabled; it is not the default.  
25           The  current  implementation of PCRE (release 7.x) corresponds approxi-
26         PCRE is written in C and released as a C library. However, a number  of         mately with Perl 5.10, including support for UTF-8 encoded strings  and
27         people  have  written  wrappers  and interfaces of various kinds. A C++         Unicode general category properties. However, UTF-8 and Unicode support
28         class is included in these contributions, which can  be  found  in  the         has to be explicitly enabled; it is not the default. The Unicode tables
29           correspond to Unicode release 5.0.0.
30    
31           In  addition to the Perl-compatible matching function, PCRE contains an
32           alternative matching function that matches the same  compiled  patterns
33           in  a different way. In certain circumstances, the alternative function
34           has some advantages. For a discussion of the two  matching  algorithms,
35           see the pcrematching page.
36    
37           PCRE  is  written  in C and released as a C library. A number of people
38           have written wrappers and interfaces of various kinds.  In  particular,
39           Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
40           included as part of the PCRE distribution. The pcrecpp page has details
41           of  this  interface.  Other  people's contributions can be found in the
42         Contrib directory at the primary FTP site, which is:         Contrib directory at the primary FTP site, which is:
43    
44         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
45    
46         Details  of  exactly which Perl regular expression features are and are         Details of exactly which Perl regular expression features are  and  are
47         not supported by PCRE are given in separate documents. See the pcrepat-         not supported by PCRE are given in separate documents. See the pcrepat-
48         tern and pcrecompat pages.         tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
49           page.
50    
51         Some  features  of  PCRE can be included, excluded, or changed when the         Some  features  of  PCRE can be included, excluded, or changed when the
52         library is built. The pcre_config() function makes it  possible  for  a         library is built. The pcre_config() function makes it  possible  for  a
53         client  to  discover  which features are available. Documentation about         client  to  discover  which  features are available. The features them-
54         building PCRE for various operating systems can be found in the  README         selves are described in the pcrebuild page. Documentation about  build-
55         file in the source distribution.         ing  PCRE for various operating systems can be found in the README file
56           in the source distribution.
57    
58           The library contains a number of undocumented  internal  functions  and
59           data  tables  that  are  used by more than one of the exported external
60           functions, but which are not intended  for  use  by  external  callers.
61           Their  names  all begin with "_pcre_", which hopefully will not provoke
62           any name clashes. In some environments, it is possible to control which
63           external  symbols  are  exported when a shared library is built, and in
64           these cases the undocumented symbols are not exported.
65    
66    
67  USER DOCUMENTATION  USER DOCUMENTATION
68    
69         The user documentation for PCRE has been split up into a number of dif-         The user documentation for PCRE comprises a number  of  different  sec-
70         ferent sections. In the "man" format, each of these is a separate  "man         tions.  In the "man" format, each of these is a separate "man page". In
71         page".  In  the  HTML  format, each is a separate page, linked from the         the HTML format, each is a separate page, linked from the  index  page.
72         index page. In the plain text format, all  the  sections  are  concate-         In  the  plain text format, all the sections are concatenated, for ease
73         nated, for ease of searching. The sections are as follows:         of searching. The sections are as follows:
74    
75           pcre              this document           pcre              this document
76           pcreapi           details of PCRE's native API           pcre-config       show PCRE installation configuration information
77             pcreapi           details of PCRE's native C API
78           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
79           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
80           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
81             pcrecpp           details of the C++ wrapper
82           pcregrep          description of the pcregrep command           pcregrep          description of the pcregrep command
83             pcrematching      discussion of the two matching algorithms
84             pcrepartial       details of the partial matching facility
85           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
86                               regular expressions                               regular expressions
87             pcresyntax        quick syntax reference
88           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
89           pcreposix         the POSIX-compatible API           pcreposix         the POSIX-compatible C API
90             pcreprecompile    details of saving and re-using precompiled patterns
91           pcresample        discussion of the sample program           pcresample        discussion of the sample program
92           pcretest          the pcretest testing command           pcrestack         discussion of stack usage
93             pcretest          description of the pcretest testing command
94    
95         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
96         each library function, listing its arguments and results.         each C library function, listing its arguments and results.
97    
98    
99  LIMITATIONS  LIMITATIONS
# Line 74  LIMITATIONS Line 106  LIMITATIONS
106         process  regular  expressions  that are truly enormous, you can compile         process  regular  expressions  that are truly enormous, you can compile
107         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in         PCRE with an internal linkage size of 3 or 4 (see the  README  file  in
108         the  source  distribution and the pcrebuild documentation for details).         the  source  distribution and the pcrebuild documentation for details).
109         If these cases the limit is substantially larger.  However,  the  speed         In these cases the limit is substantially larger.  However,  the  speed
110         of execution will be slower.         of execution is slower.
111    
112         All values in repeating quantifiers must be less than 65536.  The maxi-         All values in repeating quantifiers must be less than 65536.
        mum number of capturing subpatterns is 65535.  
113    
114         There is no limit to the number of non-capturing subpatterns,  but  the         There is no limit to the number of parenthesized subpatterns, but there
115         maximum  depth  of  nesting  of  all kinds of parenthesized subpattern,         can be no more than 65535 capturing subpatterns.
        including capturing subpatterns, assertions, and other types of subpat-  
        tern, is 200.  
116    
117         The  maximum  length of a subject string is the largest positive number         The maximum length of name for a named subpattern is 32 characters, and
118         that an integer variable can hold. However, PCRE uses recursion to han-         the maximum number of named subpatterns is 10000.
        dle  subpatterns  and indefinite repetition. This means that the avail-  
        able stack space may limit the size of a subject  string  that  can  be  
        processed by certain patterns.  
119    
120           The  maximum  length of a subject string is the largest positive number
121           that an integer variable can hold. However, when using the  traditional
122           matching function, PCRE uses recursion to handle subpatterns and indef-
123           inite repetition.  This means that the available stack space may  limit
124           the size of a subject string that can be processed by certain patterns.
125           For a discussion of stack issues, see the pcrestack documentation.
126    
 UTF-8 SUPPORT  
127    
128         Starting  at  release  3.3,  PCRE  has  had  some support for character  UTF-8 AND UNICODE PROPERTY SUPPORT
129         strings encoded in the UTF-8 format. For  release  4.0  this  has  been  
130         greatly extended to cover most common requirements.         From release 3.3, PCRE has  had  some  support  for  character  strings
131           encoded  in the UTF-8 format. For release 4.0 this was greatly extended
132           to cover most common requirements, and in release 5.0  additional  sup-
133           port for Unicode general category properties was added.
134    
135         In order to process UTF-8 strings, you must build PCRE to include UTF-8         In order to process UTF-8 strings, you must build PCRE to include UTF-8
136         support in the code, and, in addition,  you  must  call  pcre_compile()         support in the code, and, in addition,  you  must  call  pcre_compile()
# Line 106  UTF-8 SUPPORT Line 140  UTF-8 SUPPORT
140    
141         If  you compile PCRE with UTF-8 support, but do not use it at run time,         If  you compile PCRE with UTF-8 support, but do not use it at run time,
142         the library will be a bit bigger, but the additional run time  overhead         the library will be a bit bigger, but the additional run time  overhead
143         is  limited  to testing the PCRE_UTF8 flag in several places, so should         is limited to testing the PCRE_UTF8 flag occasionally, so should not be
144         not be very large.         very big.
145    
146         The following comments apply when PCRE is running in UTF-8 mode:         If PCRE is built with Unicode character property support (which implies
147           UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-
148           ported.  The available properties that can be tested are limited to the
149           general  category  properties such as Lu for an upper case letter or Nd
150           for a decimal number, the Unicode script names such as Arabic  or  Han,
151           and  the  derived  properties  Any  and L&. A full list is given in the
152           pcrepattern documentation. Only the short names for properties are sup-
153           ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-
154           ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may
155           optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE
156           does not support this.
157    
158       Validity of UTF-8 strings
159    
160           When you set the PCRE_UTF8 flag, the strings  passed  as  patterns  and
161           subjects are (by default) checked for validity on entry to the relevant
162           functions. From release 7.3 of PCRE, the check is according to the rules
163           of  RFC  3629, which are themselves derived from the Unicode specifica-
164           tion. Earlier releases of PCRE followed the rules of  RFC  2279,  which
165           allows  the  full range of 31-bit values (0 to 0x7FFFFFFF). The current
166           check allows only values in the range U+0 to U+10FFFF, excluding U+D800
167           to U+DFFF.
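
         The RFC 3629 rule described above (code points U+0 to U+10FFFF, with
         U+D800 to U+DFFF excluded) can be sketched as a standalone check. This
         is only an illustration: utf8_is_valid_rfc3629() is a hypothetical
         helper name, it is not PCRE's internal validity routine, and rejecting
         overlong encodings is an assumption taken from RFC 3629 itself rather
         than from the text above.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
   Returns 1 if buf[0..len) is well-formed UTF-8 under RFC 3629:
   code points U+0000..U+10FFFF, no surrogates (U+D800..U+DFFF),
   and no overlong encodings. Returns 0 otherwise. */
int utf8_is_valid_rfc3629(const unsigned char *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        unsigned char c = buf[i];
        unsigned long cp;   /* code point being assembled */
        unsigned long min;  /* smallest value allowed for this length */
        size_t extra;       /* number of continuation bytes expected */
        size_t k;

        if (c < 0x80) { i++; continue; }  /* single-byte (ASCII) */
        else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
        else return 0;  /* stray continuation byte, or a 5/6-byte RFC 2279 lead */

        if (len - i - 1 < extra) return 0;  /* sequence is truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char cc = buf[i + k];
            if ((cc & 0xC0) != 0x80) return 0;  /* not a continuation byte */
            cp = (cp << 6) | (cc & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong form */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */

        i += extra + 1;
    }
    return 1;
}
```

         A caller that has run a check of this kind on its own data is in the
         position the text describes as "already know that your strings are
         valid".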
168    
169           The  excluded  code  points are the "Low Surrogate Area" of Unicode, of
170           which the Unicode Standard says this: "The Low Surrogate Area does  not
171           contain  any  character  assignments,  consequently  no  character code
172           charts or namelists are provided for this area. Surrogates are reserved
173           for  use  with  UTF-16 and then must be used in pairs." The code points
174           that are encoded by UTF-16 pairs  are  available  as  independent  code
175           points  in  the  UTF-8  encoding.  (In other words, the whole surrogate
176           thing is a fudge for UTF-16 which unfortunately messes up UTF-8.)
177    
178           If an  invalid  UTF-8  string  is  passed  to  PCRE,  an  error  return
179           (PCRE_ERROR_BADUTF8) is given. In some situations, you may already know
180           that your strings are valid, and therefore want to skip these checks in
181           order to improve performance. If you set the PCRE_NO_UTF8_CHECK flag at
182           compile time or at run time, PCRE assumes that the pattern  or  subject
183           it  is  given  (respectively)  contains only valid UTF-8 codes. In this
184           case, it does not diagnose an invalid UTF-8 string.
185    
186           If you pass an invalid UTF-8 string  when  PCRE_NO_UTF8_CHECK  is  set,
187           what  happens  depends on why the string is invalid. If the string con-
188           forms to the "old" definition of UTF-8 (RFC 2279), it is processed as a
189           string  of  characters  in  the  range 0 to 0x7FFFFFFF. In other words,
190           apart from the initial validity test, PCRE (when in UTF-8 mode) handles
191           strings  according  to  the more liberal rules of RFC 2279. However, if
192           the string does not even conform to RFC 2279, the result is  undefined.
193           Your program may crash.
194    
195           If  you  want  to  process  strings  of  values  in the full range 0 to
196           0x7FFFFFFF, encoded in a UTF-8-like manner as per the old RFC, you  can
197           set PCRE_NO_UTF8_CHECK to bypass the more restrictive test. However, in
198           this situation, you will have to apply your own validity check.
199    
200       General comments about UTF-8 mode
201    
202         1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and         1. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a
203         subjects  are  checked for validity on entry to the relevant functions.         two-byte UTF-8 character if the value is greater than 127.
        If an invalid UTF-8 string is passed, an error return is given. In some  
        situations,  you  may  already  know  that  your strings are valid, and  
        therefore want to skip these checks in order to improve performance. If  
        you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,  
        PCRE assumes that the pattern or subject  it  is  given  (respectively)  
        contains  only valid UTF-8 codes. In this case, it does not diagnose an  
        invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when  
        PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may  
        crash.  
   
        2. In a pattern, the escape sequence \x{...}, where the contents of the  
        braces  is  a  string  of hexadecimal digits, is interpreted as a UTF-8  
        character whose code number is the given hexadecimal number, for  exam-  
        ple:  \x{1234}.  If a non-hexadecimal digit appears between the braces,  
        the item is not recognized.  This escape sequence can be used either as  
        a literal, or within a character class.  
204    
205         3.  The  original hexadecimal escape sequence, \xhh, matches a two-byte         2.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8
206         UTF-8 character if the value is greater than 127.         characters for values greater than \177.
207    
208         4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-         3. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-
209         vidual bytes, for example: \x{100}{3}.         vidual bytes, for example: \x{100}{3}.
210    
211         5.  The  dot  metacharacter  matches  one  UTF-8 character instead of a         4.  The dot metacharacter matches one UTF-8 character instead of a sin-
212         single byte.         gle byte.
213    
214         6. The escape sequence \C can be used to match a single byte  in  UTF-8         5. The escape sequence \C can be used to match a single byte  in  UTF-8
215         mode, but its use can lead to some strange effects.         mode,  but  its  use can lead to some strange effects. This facility is
216           not available in the alternative matching function, pcre_dfa_exec().
217    
218           6. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly
219           test  characters of any code value, but the characters that PCRE recog-
220           nizes as digits, spaces, or word characters  remain  the  same  set  as
221           before, all with values less than 256. This remains true even when PCRE
222           includes Unicode property support, because to do otherwise  would  slow
223           down  PCRE in many common cases. If you really want to test for a wider
224           sense of, say, "digit", you must use Unicode  property  tests  such  as
225           \p{Nd}.
226    
227           7.  Similarly,  characters that match the POSIX named character classes
228           are all low-valued characters.
229    
230           8. However, the Perl 5.10 horizontal and vertical  whitespace  matching
231           escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
232           acters.
233    
234           9. Case-insensitive matching applies only to  characters  whose  values
235           are  less than 128, unless PCRE is built with Unicode property support.
236           Even when Unicode property support is available, PCRE  still  uses  its
237           own  character  tables when checking the case of low-valued characters,
238           so as not to degrade performance.  The Unicode property information  is
239           used only for characters with higher values. Even when Unicode property
240           support is available, PCRE supports case-insensitive matching only when
241           there  is  a  one-to-one  mapping between a letter's cases. There are a
242           small number of many-to-one mappings in Unicode;  these  are  not  sup-
243           ported by PCRE.
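
         Items 1 and 2 in the list above say that \xb3 and octal escapes above
         \177 match two-byte UTF-8 characters. The sketch below shows why, using
         the standard UTF-8 encoding rules. utf8_encode() is a hypothetical
         helper written for this illustration, not a PCRE function, and it does
         not reject surrogate values.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
   Writes the UTF-8 encoding of code point cp into out and returns the
   number of bytes written, or 0 if cp is above U+10FFFF. */
size_t utf8_encode(unsigned long cp, unsigned char out[4])
{
    if (cp < 0x80) {            /* 0..127: one byte, same as ASCII */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {           /* 128..2047: two bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {         /* three bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {       /* four bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

         With this helper, the value 0xb3 encodes as the two bytes C2 B3, and
         the largest octal escape, \777 (decimal 511), encodes as C7 BF. Values
         up to 127 stay as a single byte.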
244    
        7.  The  character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly  
        test characters of any code value, but the characters that PCRE  recog-  
        nizes  as  digits,  spaces,  or  word characters remain the same set as  
        before, all with values less than 256.  
   
        8. Case-insensitive matching applies only to  characters  whose  values  
        are  less  than  256.  PCRE  does  not support the notion of "case" for  
        higher-valued characters.  
245    
246         9. PCRE does not support the use of Unicode tables  and  properties  or  AUTHOR
        the Perl escapes \p, \P, and \X.  
247    
248           Philip Hazel
249           University Computing Service
250           Cambridge CB2 3QH, England.
251    
252  AUTHOR         Putting  an actual email address here seems to have been a spam magnet,
253           so I've taken it away. If you want to email me, use  my  two  initials,
254           followed by the two digits 10, at the domain cam.ac.uk.
255    
        Philip Hazel <ph10@cam.ac.uk>  
        University Computing Service,  
        Cambridge CB2 3QG, England.  
        Phone: +44 1223 334714  
256    
257  Last updated: 20 August 2003  REVISION
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
258    
259  PCRE(3)                                                                PCRE(3)         Last updated: 09 August 2007
260           Copyright (c) 1997-2007 University of Cambridge.
261    ------------------------------------------------------------------------------
262    
263    
264    PCREBUILD(3)                                                      PCREBUILD(3)
265    
266    
267  NAME  NAME
268         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
269    
270    
271  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
272    
273         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
274         selected when the library is compiled. They are all selected, or  dese-         selected when the library is compiled. It assumes use of the  configure
275         lected,  by  providing  options  to  the  configure script which is run         script,  where the optional features are selected or deselected by pro-
276         before the make command. The complete list  of  options  for  configure         viding options to configure before running the make  command.  However,
277         (which  includes the standard ones such as the selection of the instal-         the  same  options  can be selected in both Unix-like and non-Unix-like
278         lation directory) can be obtained by running         environments using the GUI facility of  CMakeSetup  if  you  are  using
279           CMake instead of configure to build PCRE.
280    
281           The complete list of options for configure (which includes the standard
282           ones such as the  selection  of  the  installation  directory)  can  be
283           obtained by running
284    
285           ./configure --help           ./configure --help
286    
287         The following sections describe certain options whose names begin  with         The  following  sections  include  descriptions  of options whose names
288         --enable  or  --disable. These settings specify changes to the defaults         begin with --enable or --disable. These settings specify changes to the
289         for the configure command. Because of the  way  that  configure  works,         defaults  for  the configure command. Because of the way that configure
290         --enable  and  --disable  always  come  in  pairs, so the complementary         works, --enable and --disable always come in pairs, so  the  complemen-
291         option always exists as well, but as it specifies the  default,  it  is         tary  option always exists as well, but as it specifies the default, it
292         not described.         is not described.
293    
294    
295    C++ SUPPORT
296    
297           By default, the configure script will search for a C++ compiler and C++
298           header files. If it finds them, it automatically builds the C++ wrapper
299           library for PCRE. You can disable this by adding
300    
301             --disable-cpp
302    
303           to the configure command.
304    
305    
306  UTF-8 SUPPORT  UTF-8 SUPPORT
# Line 198  UTF-8 SUPPORT Line 309  UTF-8 SUPPORT
309    
310           --enable-utf8           --enable-utf8
311    
312         to  the  configure  command.  Of  itself, this does not make PCRE treat         to the configure command. Of itself, this  does  not  make  PCRE  treat
313         strings as UTF-8. As well as compiling PCRE with this option, you  also         strings  as UTF-8. As well as compiling PCRE with this option, you also
314         have to set the PCRE_UTF8 option when you call the pcre_compile()         have to set the PCRE_UTF8 option when you call the pcre_compile()
315         function.         function.
316    
317    
318    UNICODE CHARACTER PROPERTY SUPPORT
319    
320           UTF-8  support allows PCRE to process character values greater than 255
321           in the strings that it handles. On its own, however, it does  not  pro-
322           vide any facilities for accessing the properties of such characters. If
323           you want to be able to use the pattern escapes \P, \p,  and  \X,  which
324           refer to Unicode character properties, you must add
325    
326             --enable-unicode-properties
327    
328           to  the configure command. This implies UTF-8 support, even if you have
329           not explicitly requested it.
330    
331           Including Unicode property support adds around 30K  of  tables  to  the
332           PCRE  library.  Only  the general category properties such as Lu and Nd
333           are supported. Details are given in the pcrepattern documentation.
334    
335    
336  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
337    
338         By default, PCRE treats character 10 (linefeed) as the newline  charac-         By default, PCRE interprets character 10 (linefeed, LF)  as  indicating
339         ter. This is the normal newline character on Unix-like systems. You can         the  end  of  a line. This is the normal newline character on Unix-like
340         compile PCRE to use character 13 (carriage return) instead by adding         systems. You can compile PCRE to use character 13 (carriage return, CR)
341           instead, by adding
342    
343           --enable-newline-is-cr           --enable-newline-is-cr
344    
345         to the configure command. For completeness there is  also  a  --enable-         to  the  configure  command.  There  is  also  a --enable-newline-is-lf
346         newline-is-lf  option,  which explicitly specifies linefeed as the new-         option, which explicitly specifies linefeed as the newline character.
347         line character.  
348           Alternatively, you can specify that line endings are to be indicated by
349           the two character sequence CRLF. If you want this, add
350    
351             --enable-newline-is-crlf
352    
353           to the configure command. There is a fourth option, specified by
354    
355             --enable-newline-is-anycrlf
356    
357           which  causes  PCRE  to recognize any of the three sequences CR, LF, or
358           CRLF as indicating a line ending. Finally, a fifth option, specified by
359    
360             --enable-newline-is-any
361    
362           causes PCRE to recognize any Unicode newline sequence.
363    
364           Whatever  line  ending convention is selected when PCRE is built can be
365           overridden when the library functions are called. At build time  it  is
366           conventional to use the standard for your operating system.
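
         The difference between the CR, LF, CRLF, and "anycrlf" conventions
         described above can be sketched as a function that reports how long a
         line ending is at a given position. The enum and function names are
         hypothetical and chosen for this illustration; they are not PCRE
         identifiers. The "any Unicode newline" convention is deliberately left
         out, because the set of sequences it covers is not listed in this
         section.

```c
#include <stddef.h>

/* Hypothetical names for illustration; not PCRE identifiers. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF };

/* Returns the length in bytes of the line ending that starts at s[0]
   under the given convention, or 0 if s does not start with one. */
size_t newline_length(const char *s, size_t len, enum nl_convention conv)
{
    if (len == 0) return 0;

    switch (conv) {
    case NL_CR:
        return s[0] == '\r' ? 1 : 0;
    case NL_LF:
        return s[0] == '\n' ? 1 : 0;
    case NL_CRLF:
        /* Only the two-character sequence counts. */
        return (len >= 2 && s[0] == '\r' && s[1] == '\n') ? 2 : 0;
    case NL_ANYCRLF:
        /* Prefer CRLF as a unit, then fall back to a lone CR or LF. */
        if (s[0] == '\r')
            return (len >= 2 && s[1] == '\n') ? 2 : 1;
        return s[0] == '\n' ? 1 : 0;
    }
    return 0;
}
```

         Note that the same input "\r\n" is one line ending of length 2 under
         CRLF and anycrlf, a line ending of length 1 followed by an ordinary
         character under CR, and no line ending at all at that position under
         LF.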
367    
368    
369    WHAT \R MATCHES
370    
371           By  default,  the  sequence \R in a pattern matches any Unicode newline
372           sequence, whatever has been selected as the line  ending  sequence.  If
373           you specify
374    
375             --enable-bsr-anycrlf
376    
377           the  default  is changed so that \R matches only CR, LF, or CRLF. What-
378           ever is selected when PCRE is built can be overridden when the  library
379           functions are called.
380    
381    
382  BUILDING SHARED AND STATIC LIBRARIES  BUILDING SHARED AND STATIC LIBRARIES
383    
384         The PCRE building process uses libtool to build both shared and  static         The  PCRE building process uses libtool to build both shared and static
385         Unix  libraries by default. You can suppress one of these by adding one         Unix libraries by default. You can suppress one of these by adding  one
386         of         of
387    
388           --disable-shared           --disable-shared
# Line 231  BUILDING SHARED AND STATIC LIBRARIES Line 393  BUILDING SHARED AND STATIC LIBRARIES
393    
394  POSIX MALLOC USAGE  POSIX MALLOC USAGE
395    
396         When PCRE is called through the  POSIX  interface  (see  the  pcreposix         When PCRE is called through the POSIX interface (see the pcreposix doc-
397         documentation),  additional working storage is required for holding the         umentation), additional working storage is  required  for  holding  the
398         pointers to capturing substrings because PCRE requires  three  integers         pointers  to capturing substrings, because PCRE requires three integers
399         per  substring,  whereas  the POSIX interface provides only two. If the         per substring, whereas the POSIX interface provides only  two.  If  the
400         number of expected substrings is small, the wrapper function uses space         number of expected substrings is small, the wrapper function uses space
401         on the stack, because this is faster than using malloc() for each call.         on the stack, because this is faster than using malloc() for each call.
402         The default threshold above which the stack is no longer used is 10; it         The default threshold above which the stack is no longer used is 10; it
# Line 245  POSIX MALLOC USAGE Line 407  POSIX MALLOC USAGE
407         to the configure command.         to the configure command.
408    
409    
410    HANDLING VERY LARGE PATTERNS
411    
412           Within a compiled pattern, offset values are used  to  point  from  one
413           part  to another (for example, from an opening parenthesis to an alter-
414           nation metacharacter). By default, two-byte values are used  for  these
415           offsets,  leading  to  a  maximum size for a compiled pattern of around
416           64K. This is sufficient to handle all but the most  gigantic  patterns.
417           Nevertheless,  some  people do want to process enormous patterns, so it
418           is possible to compile PCRE to use three-byte or four-byte  offsets  by
419           adding a setting such as
420    
421             --with-link-size=3
422    
423           to  the  configure  command.  The value given must be 2, 3, or 4. Using
424           longer offsets slows down the operation of PCRE because it has to  load
425           additional bytes when handling them.
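
As a rough illustration of where the 64K figure comes from: an offset stored in n bytes can hold values up to 2^(8n) - 1. The helper names below are hypothetical and not part of PCRE, and the limits that a particular build actually enforces may be lower than this arithmetic bound.

```c
#include <stdio.h>

/* Hypothetical helper (not part of PCRE): largest value that fits in an
   offset of link_size bytes. Returns 0 for sizes outside 2, 3, or 4. */
unsigned long long max_link_offset(int link_size)
{
    if (link_size < 2 || link_size > 4)
        return 0;
    return (1ULL << (8 * link_size)) - 1;
}

/* Print the arithmetic bound for each permitted --with-link-size value. */
void print_link_limits(void)
{
    int n;

    for (n = 2; n <= 4; n++)
        printf("link size %d: max offset %llu\n", n, max_link_offset(n));
}
```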
426    
427    
428    AVOIDING EXCESSIVE STACK USAGE
429    
430           When matching with the pcre_exec() function, PCRE implements backtrack-
431           ing by making recursive calls to an internal function  called  match().
432           In  environments  where  the size of the stack is limited, this can se-
433           verely limit PCRE's operation. (The Unix environment does  not  usually
434           suffer from this problem, but it may sometimes be necessary to increase
435           the maximum stack size.  There is a discussion in the  pcrestack  docu-
436           mentation.)  An alternative approach to recursion that uses memory from
437           the heap to remember data, instead of using recursive  function  calls,
438           has  been  implemented to work round the problem of limited stack size.
439           If you want to build a version of PCRE that works this way, add
440    
441             --disable-stack-for-recursion
442    
443           to the configure command. With this configuration, PCRE  will  use  the
444           pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
445           ment functions. By default these point to malloc() and free(), but  you
446           can replace the pointers so that your own functions are used.
447    
448           Separate  functions  are  provided  rather  than  using pcre_malloc and
449           pcre_free because the  usage  is  very  predictable:  the  block  sizes
450           requested  are  always  the  same,  and  the blocks are always freed in
451           reverse order. A calling program might be able to  implement  optimized
452           functions  that  perform  better  than  malloc()  and free(). PCRE runs
453           noticeably more slowly when built in this way. This option affects only
454           the  pcre_exec()  function;   it   is   not   relevant   for   the
455           pcre_dfa_exec() function.
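
Because the blocks are all the same size and are freed in reverse order, a replacement allocator can be little more than a stack pointer over one preallocated region. The following is a minimal sketch under those assumptions. The names pool_alloc and pool_free and the POOL_BLOCKS capacity are hypothetical, and anything that does not fit the pool falls back to malloc() and free().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define POOL_BLOCKS 64            /* assumed maximum pooled depth */

static unsigned char *pool;       /* backing storage, created on first use */
static size_t pool_req;           /* the one block size being pooled */
static size_t pool_stride;        /* pool_req rounded up for alignment */
static size_t pool_depth;         /* number of pooled blocks in use */

/* Same signature as the function pcre_stack_malloc points to. */
void *pool_alloc(size_t size)
{
    if (pool == NULL && size > 0) {
        pool_req = size;
        pool_stride = (size + 15) & ~(size_t)15;
        pool = malloc(pool_stride * POOL_BLOCKS);
    }
    if (pool == NULL || size != pool_req || pool_depth == POOL_BLOCKS)
        return malloc(size);      /* fall back to the general allocator */
    return pool + pool_stride * pool_depth++;
}

/* Same signature as the function pcre_stack_free points to. */
void pool_free(void *block)
{
    uintptr_t p = (uintptr_t)block;
    uintptr_t lo = (uintptr_t)pool;

    if (pool != NULL && p >= lo && p < lo + pool_stride * POOL_BLOCKS)
        pool_depth--;             /* LIFO order: just pop the top block */
    else
        free(block);
}
```

With a PCRE library built using --disable-stack-for-recursion, a program would presumably install these by assigning pool_alloc to pcre_stack_malloc and pool_free to pcre_stack_free before calling any PCRE function.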
456    
457    
458  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
459    
460         Internally,  PCRE  has a function called match() which it calls repeat-         Internally, PCRE has a function called match(), which it calls  repeat-
461         edly (possibly recursively) when performing a  matching  operation.  By         edly   (sometimes   recursively)  when  matching  a  pattern  with  the
462         limiting  the  number of times this function may be called, a limit can         pcre_exec() function. By controlling the maximum number of  times  this
463           function  may be called during a single matching operation, a limit can
464         be placed on the resources used by a single call  to  pcre_exec().  The         be placed on the resources used by a single call  to  pcre_exec().  The
465         limit  can be changed at run time, as described in the pcreapi documen-         limit  can be changed at run time, as described in the pcreapi documen-
466         tation. The default is 10 million, but this can be changed by adding  a         tation. The default is 10 million, but this can be changed by adding  a
# Line 257  LIMITING PCRE RESOURCE USAGE Line 468  LIMITING PCRE RESOURCE USAGE
468    
469           --with-match-limit=500000           --with-match-limit=500000
470    
471         to the configure command.         to   the   configure  command.  This  setting  has  no  effect  on  the
472           pcre_dfa_exec() matching function.
473    
474           In some environments it is desirable to limit the  depth  of  recursive
475           calls of match() more strictly than the total number of calls, in order
476           to restrict the maximum amount of stack (or heap,  if  --disable-stack-
477           for-recursion is specified) that is used. A second limit controls this;
478           it defaults to the value that  is  set  for  --with-match-limit,  which
479           imposes  no  additional constraints. However, you can set a lower limit
480           by adding, for example,
481    
482             --with-match-limit-recursion=10000
483    
484           to the configure command. This value can  also  be  overridden  at  run
485           time.
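
The difference between the two limits can be seen in a toy model. The matcher below is not PCRE's match() function. It is a tiny backtracking matcher for literals, '.', and greedy '*', written only to show a total-call limit and a recursion-depth limit working independently. All names in it are hypothetical.

```c
/* Toy backtracking matcher (literals, '.', and greedy '*'), NOT PCRE's
   match(). It only illustrates two separate limits: total calls and
   recursion depth. */
struct toy_limits {
    long calls, call_limit;     /* total calls made / allowed */
    long depth, depth_limit;    /* current nesting / allowed nesting */
    int  exceeded;              /* set once either limit is hit */
};

static int toy_match(const char *re, const char *s, struct toy_limits *lim)
{
    int ok = 0;

    if (lim->exceeded)
        return 0;
    if (++lim->calls > lim->call_limit || lim->depth >= lim->depth_limit) {
        lim->exceeded = 1;
        return 0;
    }
    lim->depth++;
    if (re[0] == '\0') {
        ok = 1;                                   /* pattern used up */
    } else if (re[1] == '*') {
        if (*s != '\0' && (re[0] == '.' || re[0] == *s))
            ok = toy_match(re, s + 1, lim);       /* greedy: take one more */
        if (!ok)
            ok = toy_match(re + 2, s, lim);       /* backtrack: skip the star */
    } else if (*s != '\0' && (re[0] == '.' || re[0] == *s)) {
        ok = toy_match(re + 1, s + 1, lim);
    }
    lim->depth--;
    return ok;
}

/* Returns 1 for a match at the start of s, 0 for no match, and -1 if a
   limit stopped the search. */
int toy_exec(const char *re, const char *s, long call_limit, long depth_limit)
{
    struct toy_limits lim = { 0, 0, 0, 0, 0 };
    int ok;

    lim.call_limit = call_limit;
    lim.depth_limit = depth_limit;
    ok = toy_match(re, s, &lim);
    return lim.exceeded ? -1 : ok;
}
```

In this model, a low depth limit stops a deeply nested search even when the call budget is generous, and a low call limit stops a long search even when nesting is shallow.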
486    
487    
488    CREATING CHARACTER TABLES AT BUILD TIME
489    
490           PCRE  uses fixed tables for processing characters whose code values are
491           less than 256. By default, PCRE is built with a set of tables that  are
492           distributed  in  the  file pcre_chartables.c.dist. These tables are for
493           ASCII codes only. If you add
494    
495             --enable-rebuild-chartables
496    
497           to the configure command, the distributed tables are  no  longer  used.
498           Instead,  a  program  called dftables is compiled and run. This outputs
499           the source for a new set of tables, created in the default locale of your
500           C runtime system. (This method of replacing the tables does not work if
501           you are cross compiling, because dftables is run on the local host.  If
502           you  need  to  create alternative tables when cross compiling, you will
503           have to do so "by hand".)
504    
 HANDLING VERY LARGE PATTERNS  
505    
506         Within  a  compiled  pattern,  offset values are used to point from one  USING EBCDIC CODE
        part to another (for example, from an opening parenthesis to an  alter-  
        nation  metacharacter).  By  default two-byte values are used for these  
        offsets, leading to a maximum size for a  compiled  pattern  of  around  
        64K.  This  is sufficient to handle all but the most gigantic patterns.  
        Nevertheless, some people do want to process enormous patterns,  so  it  
        is  possible  to compile PCRE to use three-byte or four-byte offsets by  
        adding a setting such as  
507    
508           --with-link-size=3         PCRE assumes by default that it will run in an  environment  where  the
509           character  code  is  ASCII  (or Unicode, which is a superset of ASCII).
510           This is the case for most computer operating systems.  PCRE  can,  how-
511           ever, be compiled to run in an EBCDIC environment by adding
512    
513         to the configure command. The value given must be 2,  3,  or  4.  Using           --enable-ebcdic
        longer  offsets slows down the operation of PCRE because it has to load  
        additional bytes when handling them.  
514    
515         If you build PCRE with an increased link size, test 2 (and  test  5  if         to the configure command. This setting implies --enable-rebuild-charta-
516         you  are using UTF-8) will fail. Part of the output of these tests is a         bles. You should only use it if you know that  you  are  in  an  EBCDIC
517         representation of the compiled pattern, and this changes with the  link         environment (for example, an IBM mainframe operating system).
        size.  
518    
519    
520  AVOIDING EXCESSIVE STACK USAGE  PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
521    
522         PCRE  implements  backtracking while matching by making recursive calls         By default, pcregrep reads all files as plain text. You can build it so
523         to an internal function called match(). In environments where the  size         that it recognizes files whose names end in .gz or .bz2, and reads them
524         of the stack is limited, this can severely limit PCRE's operation. (The         with libz or libbz2, respectively, by adding one or both of
        Unix environment does not usually suffer from this problem.) An  alter-  
        native  approach  that  uses  memory  from  the  heap to remember data,  
        instead of using recursive function calls, has been implemented to work  
        round  this  problem. If you want to build a version of PCRE that works  
        this way, add  
525    
526           --disable-stack-for-recursion           --enable-pcregrep-libz
527             --enable-pcregrep-libbz2
528    
529         to the configure command. With this configuration, PCRE  will  use  the         to the configure command. These options naturally require that the rel-
530         pcre_stack_malloc   and   pcre_stack_free   variables  to  call  memory         evant libraries are installed on your system. Configuration  will  fail
531         management functions. Separate functions are provided because the usage         if they are not.
        is very predictable: the block sizes requested are always the same, and  
        the blocks are always freed in reverse order. A calling  program  might  
        be  able  to implement optimized functions that perform better than the  
        standard malloc() and  free()  functions.  PCRE  runs  noticeably  more  
        slowly when built in this way.  
532    
533    
534  USING EBCDIC CODE  PCRETEST OPTION FOR LIBREADLINE SUPPORT
535    
536         PCRE  assumes  by  default that it will run in an environment where the         If you add
        character code is ASCII (or UTF-8, which is a superset of ASCII).  PCRE  
        can, however, be compiled to run in an EBCDIC environment by adding  
537    
538           --enable-ebcdic           --enable-pcretest-libreadline
539    
540         to the configure command.         to  the  configure  command,  pcretest  is  linked with the libreadline
541           library, and when its input is from a terminal, it reads it  using  the
542           readline() function. This provides line-editing and history facilities.
543           Note that libreadline is GPL-licenced, so if you distribute a binary of
544           pcretest linked in this way, there may be licensing issues.
545    
 Last updated: 09 December 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
546    
547  PCRE(3)                                                                PCRE(3)  SEE ALSO
548    
549           pcreapi(3), pcre_config(3).
550    
551    
552    AUTHOR
553    
554           Philip Hazel
555           University Computing Service
556           Cambridge CB2 3QH, England.
557    
558    
559    REVISION
560    
561           Last updated: 18 December 2007
562           Copyright (c) 1997-2007 University of Cambridge.
563    ------------------------------------------------------------------------------
564    
565    
566    PCREMATCHING(3)                                                PCREMATCHING(3)
567    
568    
569    NAME
570           PCRE - Perl-compatible regular expressions
571    
572    
573    PCRE MATCHING ALGORITHMS
574    
575           This document describes the two different algorithms that are available
576           in PCRE for matching a compiled regular expression against a given sub-
577           ject  string.  The  "standard"  algorithm  is  the  one provided by the
578           pcre_exec() function.  This works in the same way  as  Perl's  matching
579           function, and provides a Perl-compatible matching operation.
580    
581           An  alternative  algorithm is provided by the pcre_dfa_exec() function;
582           this operates in a different way, and is not  Perl-compatible.  It  has
583           advantages  and disadvantages compared with the standard algorithm, and
584           these are described below.
585    
586           When there is only one possible way in which a given subject string can
587           match  a pattern, the two algorithms give the same answer. A difference
588           arises, however, when there are multiple possibilities. For example, if
589           the pattern
590    
591             ^<.*>
592    
593           is matched against the string
594    
595             <something> <something else> <something further>
596    
597           there are three possible answers. The standard algorithm finds only one
598           of them, whereas the alternative algorithm finds all three.
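
The three possible answers can be enumerated with a small model. The function below does not call PCRE. It hard-codes the behaviour of the pattern ^<.*> under default options, where every '>' that follows a leading '<' on the same line ends one possible match. The name angle_match_lengths is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model, not a PCRE call: stores up to max match lengths for
   the pattern ^<.*>, shortest first, and returns how many were stored. */
size_t angle_match_lengths(const char *subject, size_t *lengths, size_t max)
{
    size_t i, count = 0;

    if (subject[0] != '<')
        return 0;                     /* the ^< part fails */
    for (i = 1; subject[i] != '\0'; i++) {
        if (subject[i] == '\n')
            break;                    /* . does not match newline by default */
        if (subject[i] == '>' && count < max)
            lengths[count++] = i + 1;
    }
    return count;
}
```

For the subject string shown above, this model stores three lengths. The standard algorithm, with the greedy quantifier as written, returns only the longest of them, which is the whole string. The alternative algorithm reports all three.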
599    
600    
601    REGULAR EXPRESSIONS AS TREES
602    
603           The set of strings that are matched by a regular expression can be rep-
604           resented  as  a  tree structure. An unlimited repetition in the pattern
605           makes the tree of infinite size, but it is still a tree.  Matching  the
606           pattern  to a given subject string (from a given starting point) can be
607           thought of as a search of the tree.  There are two  ways  to  search  a
608           tree:  depth-first  and  breadth-first, and these correspond to the two
609           matching algorithms provided by PCRE.
610    
611    
612    THE STANDARD MATCHING ALGORITHM
613    
614           In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
615           sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
616           depth-first search of the pattern tree. That is, it  proceeds  along  a
617           single path through the tree, checking that the subject matches what is
618           required. When there is a mismatch, the algorithm  tries  any  alterna-
619           tives  at  the  current point, and if they all fail, it backs up to the
620           previous branch point in the  tree,  and  tries  the  next  alternative
621           branch  at  that  level.  This often involves backing up (moving to the
622           left) in the subject string as well.  The  order  in  which  repetition
623           branches  are  tried  is controlled by the greedy or ungreedy nature of
624           the quantifier.
625    
626           If a leaf node is reached, a matching string has  been  found,  and  at
627           that  point the algorithm stops. Thus, if there is more than one possi-
628           ble match, this algorithm returns the first one that it finds.  Whether
629           this  is the shortest, the longest, or some intermediate length depends
630           on the way the greedy and ungreedy repetition quantifiers are specified
631           in the pattern.
632    
633           Because  it  ends  up  with a single path through the tree, it is rela-
634           tively straightforward for this algorithm to keep  track  of  the  sub-
635           strings  that  are  matched  by portions of the pattern in parentheses.
636           This provides support for capturing parentheses and back references.
637    
638    
639    THE ALTERNATIVE MATCHING ALGORITHM
640    
641           This algorithm conducts a breadth-first search of  the  tree.  Starting
642           from  the  first  matching  point  in the subject, it scans the subject
643           string from left to right, once, character by character, and as it does
644           this,  it remembers all the paths through the tree that represent valid
645           matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
646           though  it is not implemented as a traditional finite state machine (it
647           keeps multiple states active simultaneously).
648    
649           The scan continues until either the end of the subject is  reached,  or
650           there  are  no more unterminated paths. At this point, terminated paths
651           represent the different matching possibilities (if there are none,  the
652           match  has  failed).   Thus,  if there is more than one possible match,
653           this algorithm finds all of them, and in particular, it finds the long-
654           est.  In PCRE, there is an option to stop the algorithm after the first
655           match (which is necessarily the shortest) has been found.
656    
657           Note that all the matches that are found start at the same point in the
658           subject. If the pattern
659    
660             cat(er(pillar)?)?
661    
662           is  matched  against the string "the caterpillar catchment", the result
663           will be the three strings "cat", "cater", and "caterpillar" that  start
664           at  the fifth character of the subject. The algorithm does not automat-
665           ically move on to find matches that start at later positions.
666    
667           There are a number of features of PCRE regular expressions that are not
668           supported by the alternative matching algorithm. They are as follows:
669    
670           1.  Because  the  algorithm  finds  all possible matches, the greedy or
671           ungreedy nature of repetition quantifiers is not relevant.  Greedy  and
672           ungreedy quantifiers are treated in exactly the same way. However, pos-
673           sessive quantifiers can make a difference when what follows could  also
674           match what is quantified, for example in a pattern like this:
675    
676             ^a++\w!
677    
678           This  pattern matches "aaab!" but not "aaa!", which would be matched by
679           a non-possessive quantifier. Similarly, if an atomic group is  present,
680           it  is matched as if it were a standalone pattern at the current point,
681           and the longest match is then "locked in" for the rest of  the  overall
682           pattern.
683    
684           2. When dealing with multiple paths through the tree simultaneously, it
685           is not straightforward to keep track of  captured  substrings  for  the
686           different  matching  possibilities,  and  PCRE's implementation of this
687           algorithm does not attempt to do this. This means that no captured sub-
688           strings are available.
689    
690           3.  Because no substrings are captured, back references within the pat-
691           tern are not supported, and cause errors if encountered.
692    
693           4. For the same reason, conditional expressions that use  a  backrefer-
694           ence  as  the  condition or test for a specific group recursion are not
695           supported.
696    
697           5. Because many paths through the tree may be  active,  the  \K  escape
698           sequence, which resets the start of the match when encountered (but may
699           be on some paths and not on others), is not  supported.  It  causes  an
700           error if encountered.
701    
702           6.  Callouts  are  supported, but the value of the capture_top field is
703           always 1, and the value of the capture_last field is always -1.
704    
705           7. The \C escape sequence, which (in the standard algorithm) matches  a
706           single  byte, even in UTF-8 mode, is not supported because the alterna-
707           tive algorithm moves through the subject  string  one  character  at  a
708           time, for all active paths through the tree.
709    
710           8.  None  of  the  backtracking control verbs such as (*PRUNE) are sup-
711           ported.
712    
713    
714    ADVANTAGES OF THE ALTERNATIVE ALGORITHM
715    
716           Using the alternative matching algorithm provides the following  advan-
717           tages:
718    
719           1. All possible matches (at a single point in the subject) are automat-
720           ically found, and in particular, the longest match is  found.  To  find
721           more than one match using the standard algorithm, you have to do kludgy
722           things with callouts.
723    
724           2. There is much better support for partial matching. The  restrictions
725           on  the content of the pattern that apply when using the standard algo-
726           rithm for partial matching do not apply to the  alternative  algorithm.
727           For  non-anchored patterns, the starting position of a partial match is
728           available.
729    
730           3. Because the alternative algorithm  scans  the  subject  string  just
731           once,  and  never  needs to backtrack, it is possible to pass very long
732           subject strings to the matching function in  several  pieces,  checking
733           for partial matching each time.
734    
735    
736    DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
737    
738           The alternative algorithm suffers from a number of disadvantages:
739    
740           1.  It  is  substantially  slower  than the standard algorithm. This is
741           partly because it has to search for all possible matches, but  is  also
742           because it is less susceptible to optimization.
743    
744           2. Capturing parentheses and back references are not supported.
745    
746           3. Although atomic groups are supported, their use does not provide the
747           performance advantage that it does for the standard algorithm.
748    
749    
750    AUTHOR
751    
752           Philip Hazel
753           University Computing Service
754           Cambridge CB2 3QH, England.
755    
756    
757    REVISION
758    
759           Last updated: 08 August 2007
760           Copyright (c) 1997-2007 University of Cambridge.
761    ------------------------------------------------------------------------------
762    
763    
764    PCREAPI(3)                                                          PCREAPI(3)
765    
766    
767  NAME  NAME
768         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
769    
770  SYNOPSIS OF PCRE API  
771    PCRE NATIVE API
772    
773         #include <pcre.h>         #include <pcre.h>
774    
# Line 335  SYNOPSIS OF PCRE API Line 776  SYNOPSIS OF PCRE API
776              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
777              const unsigned char *tableptr);              const unsigned char *tableptr);
778    
779           pcre *pcre_compile2(const char *pattern, int options,
780                int *errorcodeptr,
781                const char **errptr, int *erroffset,
782                const unsigned char *tableptr);
783    
784         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
785              const char **errptr);              const char **errptr);
786    
# Line 342  SYNOPSIS OF PCRE API Line 788  SYNOPSIS OF PCRE API
788              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
789              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
790    
791           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
792                const char *subject, int length, int startoffset,
793                int options, int *ovector, int ovecsize,
794                int *workspace, int wscount);
795    
796         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
797              const char *subject, int *ovector,              const char *subject, int *ovector,
798              int stringcount, const char *stringname,              int stringcount, const char *stringname,
# Line 359  SYNOPSIS OF PCRE API Line 810  SYNOPSIS OF PCRE API
810         int pcre_get_stringnumber(const pcre *code,         int pcre_get_stringnumber(const pcre *code,
811              const char *name);              const char *name);
812    
813           int pcre_get_stringtable_entries(const pcre *code,
814                const char *name, char **first, char **last);
815    
816         int pcre_get_substring(const char *subject, int *ovector,         int pcre_get_substring(const char *subject, int *ovector,
817              int stringcount, int stringnumber,              int stringcount, int stringnumber,
818              const char **stringptr);              const char **stringptr);
# Line 377  SYNOPSIS OF PCRE API Line 831  SYNOPSIS OF PCRE API
831    
832         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
833    
834           int pcre_refcount(pcre *code, int adjust);
835    
836         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
837    
838         char *pcre_version(void);         char *pcre_version(void);
# Line 392  SYNOPSIS OF PCRE API Line 848  SYNOPSIS OF PCRE API
848         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
849    
850    
851  PCRE API  PCRE API OVERVIEW
852    
853         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
854         is also a set of wrapper functions that correspond to the POSIX regular         are also some wrapper functions that correspond to  the  POSIX  regular
855         expression API.  These are described in the pcreposix documentation.         expression  API.  These  are  described in the pcreposix documentation.
856           Both of these APIs define a set of C function calls. A C++  wrapper  is
857         The  native  API  function  prototypes  are  defined in the header file         distributed with PCRE. It is documented in the pcrecpp page.
858         pcre.h, and on Unix systems the library itself is called libpcre.a,  so  
859         can be accessed by adding -lpcre to the command for linking an applica-         The  native  API  C  function prototypes are defined in the header file
860         tion which calls it. The header file defines the macros PCRE_MAJOR  and         pcre.h, and on Unix systems the library itself is called  libpcre.   It
861         PCRE_MINOR  to  contain  the  major  and  minor release numbers for the         can normally be accessed by adding -lpcre to the command for linking an
862         library. Applications can use these to include  support  for  different         application  that  uses  PCRE.  The  header  file  defines  the  macros
863         releases.         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-
864           bers for the library.  Applications can use these  to  include  support
865         The  functions  pcre_compile(),  pcre_study(), and pcre_exec() are used         for different releases of PCRE.
866         for compiling and matching regular expressions. A sample  program  that  
867         demonstrates  the simplest way of using them is given in the file pcre-         The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and
868         demo.c. The pcresample documentation describes how to run it.         pcre_exec() are used for compiling and matching regular expressions  in
869           a  Perl-compatible  manner. A sample program that demonstrates the sim-
870         There are convenience functions for extracting captured substrings from         plest way of using them is provided in the file  called  pcredemo.c  in
871         a matched subject string. They are:         the  source distribution. The pcresample documentation describes how to
872           compile and run it.
873    
874           A second matching function, pcre_dfa_exec(), which is not Perl-compati-
875           ble,  is  also provided. This uses a different algorithm for the match-
876           ing. The alternative algorithm finds all possible matches (at  a  given
877           point  in  the subject), and scans the subject just once. However, this
878           algorithm does not return captured substrings. A description of the two
879           matching  algorithms and their advantages and disadvantages is given in
880           the pcrematching documentation.
881    
882           In addition to the main compiling and  matching  functions,  there  are
883           convenience functions for extracting captured substrings from a subject
884           string that is matched by pcre_exec(). They are:
885    
886           pcre_copy_substring()           pcre_copy_substring()
887           pcre_copy_named_substring()           pcre_copy_named_substring()
888           pcre_get_substring()           pcre_get_substring()
889           pcre_get_named_substring()           pcre_get_named_substring()
890           pcre_get_substring_list()           pcre_get_substring_list()
891             pcre_get_stringnumber()
892             pcre_get_stringtable_entries()
893    
894         pcre_free_substring() and pcre_free_substring_list() are also provided,         pcre_free_substring() and pcre_free_substring_list() are also provided,
895         to free the memory used for extracted strings.         to free the memory used for extracted strings.
896    
897         The function pcre_maketables() is used (optionally) to build a  set  of         The  function  pcre_maketables()  is  used  to build a set of character
898         character tables in the current locale for passing to pcre_compile().         tables  in  the  current  locale   for   passing   to   pcre_compile(),
899           pcre_exec(),  or  pcre_dfa_exec(). This is an optional facility that is
900         The  function  pcre_fullinfo()  is used to find out information about a         provided for specialist use.  Most  commonly,  no  special  tables  are
901         compiled pattern; pcre_info() is an obsolete version which returns only         passed,  in  which case internal tables that are generated when PCRE is
902         some  of  the available information, but is retained for backwards com-         built are used.
903         patibility.  The function pcre_version() returns a pointer to a  string  
904           The function pcre_fullinfo() is used to find out  information  about  a
905           compiled  pattern; pcre_info() is an obsolete version that returns only
906           some of the available information, but is retained for  backwards  com-
907           patibility.   The function pcre_version() returns a pointer to a string
908         containing the version of PCRE and its date of release.         containing the version of PCRE and its date of release.
909    
910         The  global  variables  pcre_malloc and pcre_free initially contain the         The function pcre_refcount() maintains a  reference  count  in  a  data
911         entry points of the standard  malloc()  and  free()  functions  respec-         block  containing  a compiled pattern. This is provided for the benefit
912           of object-oriented applications.
913    
914           The global variables pcre_malloc and pcre_free  initially  contain  the
915           entry  points  of  the  standard malloc() and free() functions, respec-
916         tively. PCRE calls the memory management functions via these variables,         tively. PCRE calls the memory management functions via these variables,
917         so a calling program can replace them if it  wishes  to  intercept  the         so  a  calling  program  can replace them if it wishes to intercept the
918         calls. This should be done before calling any PCRE functions.         calls. This should be done before calling any PCRE functions.
919    
920         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also         The global variables pcre_stack_malloc  and  pcre_stack_free  are  also
921         indirections to memory management functions.  These  special  functions         indirections  to  memory  management functions. These special functions
922         are  used  only  when  PCRE is compiled to use the heap for remembering         are used only when PCRE is compiled to use  the  heap  for  remembering
923         data, instead of recursive function calls. This is a  non-standard  way         data, instead of recursive function calls, when running the pcre_exec()
924         of  building  PCRE,  for  use in environments that have limited stacks.         function. See the pcrebuild documentation for  details  of  how  to  do
925         Because of the greater use of memory management, it runs  more  slowly.         this.  It  is  a non-standard way of building PCRE, for use in environ-
926         Separate  functions  are provided so that special-purpose external code         ments that have limited stacks. Because of the greater  use  of  memory
927         can be used for this case. When used, these functions are always called         management,  it  runs  more  slowly. Separate functions are provided so
928         in  a  stack-like  manner  (last obtained, first freed), and always for         that special-purpose external code can be  used  for  this  case.  When
929         memory blocks of the same size.         used,  these  functions  are always called in a stack-like manner (last
930           obtained, first freed), and always for memory blocks of the same  size.
931           There  is  a discussion about PCRE's stack usage in the pcrestack docu-
932           mentation.
933    
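The stack-like calling pattern described above (last obtained, first
freed, always the same block size) suits a simple block cache. The
following is a minimal sketch under that assumption, not PCRE's own
code: the names my_stack_malloc, my_stack_free, block and cache are
hypothetical, and the sketch is not thread-safe.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header placed in front of every block; not part of PCRE. */
typedef union block {
    struct {
        union block *next;   /* next cached block */
        size_t size;         /* usable bytes that follow the header */
    } hdr;
    /* The members below only force conservative payload alignment. */
    long double ld;
    long long ll;
    void *vp;
} block;

static block *cache = NULL;  /* most recently freed block is on top */

/* Candidate for pcre_stack_malloc: reuse the top cached block if it fits. */
void *my_stack_malloc(size_t size)
{
    assert(size > 0);
    if (cache != NULL && cache->hdr.size >= size) {
        block *top = cache;
        cache = top->hdr.next;
        return top + 1;              /* payload starts after the header */
    }
    block *b = malloc(sizeof(block) + size);
    if (b == NULL) return NULL;
    b->hdr.size = size;
    return b + 1;
}

/* Candidate for pcre_stack_free: keep the block for the next request. */
void my_stack_free(void *p)
{
    if (p == NULL) return;
    block *b = (block *)p - 1;       /* step back to the header */
    b->hdr.next = cache;
    cache = b;
}
```

A program would point pcre_stack_malloc and pcre_stack_free at these
functions before calling any PCRE function. Cached blocks are never
returned to the system in this sketch, and because the indirections are
shared by all threads, a multi-threaded program would need to add
locking.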
934         The global variable pcre_callout initially contains NULL. It can be set         The global variable pcre_callout initially contains NULL. It can be set
935         by  the  caller  to  a "callout" function, which PCRE will then call at         by  the  caller  to  a "callout" function, which PCRE will then call at
# Line 455  PCRE API Line 937  PCRE API
937         pcrecallout documentation.         pcrecallout documentation.
938    
939    
940    NEWLINES
941    
942           PCRE  supports five different conventions for indicating line breaks in
943           strings: a single CR (carriage return) character, a  single  LF  (line-
944           feed) character, the two-character sequence CRLF, any of the three pre-
945           ceding, or any Unicode newline sequence. The Unicode newline  sequences
946           are  the  three just mentioned, plus the single characters VT (vertical
947           tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line
948           separator, U+2028), and PS (paragraph separator, U+2029).
949    
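The five conventions listed above can be illustrated by a helper that
reports how many bytes of line break start at a given position. This is
a sketch only, not PCRE's implementation: the enum nl_convention and the
function newline_length are hypothetical names, and the subject is
assumed to be UTF-8 encoded when any Unicode newline is accepted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for this sketch; these are not PCRE identifiers. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Returns the length in bytes of the line break that starts at s[pos]
   under the given convention, or 0 if there is none at that position. */
size_t newline_length(const unsigned char *s, size_t len, size_t pos,
                      enum nl_convention conv)
{
    assert(s != NULL);
    if (pos >= len) return 0;

    unsigned char c = s[pos];
    int crlf = (c == '\r' && pos + 1 < len && s[pos + 1] == '\n');

    switch (conv) {
    case NL_CR:   return c == '\r' ? 1 : 0;
    case NL_LF:   return c == '\n' ? 1 : 0;
    case NL_CRLF: return crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (crlf) return 2;
        return (c == '\r' || c == '\n') ? 1 : 0;
    case NL_ANY:
        if (crlf) return 2;
        /* CR, LF, VT (U+000B) and FF (U+000C) are single bytes. */
        if (c == '\r' || c == '\n' || c == 0x0B || c == 0x0C) return 1;
        /* NEL (U+0085) is the two bytes C2 85 in UTF-8. */
        if (c == 0xC2 && pos + 1 < len && s[pos + 1] == 0x85) return 2;
        /* LS (U+2028) is E2 80 A8 and PS (U+2029) is E2 80 A9. */
        if (c == 0xE2 && pos + 2 < len && s[pos + 1] == 0x80 &&
            (s[pos + 2] == 0xA8 || s[pos + 2] == 0xA9)) return 3;
        return 0;
    }
    return 0;
}
```

Note that a CR followed by LF counts as one CR line break under NL_CR,
as one two-byte break under NL_CRLF, and is not a break at all at the CR
position under NL_LF.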
950           Each  of  the first three conventions is used by at least one operating
951           system as its standard newline sequence. When PCRE is built, a  default
952           can  be  specified.  The default default is LF, which is the Unix stan-
953           dard. When PCRE is run, the default can be overridden,  either  when  a
954           pattern is compiled, or when it is matched.
955    
956           At compile time, the newline convention can be specified by the options
957           argument of pcre_compile(), or it can be specified by special  text  at
958           the start of the pattern itself; this overrides any other settings. See
959           the pcrepattern page for details of the special character sequences.
960    
961           In the PCRE documentation the word "newline" is used to mean "the char-
962           acter  or pair of characters that indicate a line break". The choice of
963           newline convention affects the handling of  the  dot,  circumflex,  and
964           dollar metacharacters, the handling of #-comments in /x mode, and, when
965           CRLF is a recognized line ending sequence, the match position  advance-
966           ment for a non-anchored pattern. There is more detail about this in the
967           section on pcre_exec() options below.
968    
969           The choice of newline convention does not affect the interpretation  of
970           the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
971           which is controlled in a similar way, but by separate options.
972    
973    
974  MULTITHREADING  MULTITHREADING
975    
976         The  PCRE  functions  can be used in multi-threading applications, with         The PCRE functions can be used in  multi-threading  applications,  with
977         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
978         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
979         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
980    
981         The  compiled form of a regular expression is not altered during match-         The compiled form of a regular expression is not altered during  match-
982         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
983         at once.         at once.
984    
985    
986    SAVING PRECOMPILED PATTERNS FOR LATER USE
987    
988           The compiled form of a regular expression can be saved and re-used at a
989           later  time,  possibly by a different program, and even on a host other
990           than the one on which  it  was  compiled.  Details  are  given  in  the
991           pcreprecompile  documentation.  However, compiling a regular expression
992           with one version of PCRE for use with a different version is not  guar-
993           anteed to work and may cause crashes.
994    
995    
996  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
997    
998         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
# Line 486  CHECKING BUILD-TIME OPTIONS Line 1012  CHECKING BUILD-TIME OPTIONS
1012         The  output is an integer that is set to one if UTF-8 support is avail-         The  output is an integer that is set to one if UTF-8 support is avail-
1013         able; otherwise it is set to zero.         able; otherwise it is set to zero.
1014    
1015             PCRE_CONFIG_UNICODE_PROPERTIES
1016    
1017           The output is an integer that is set to  one  if  support  for  Unicode
1018           character properties is available; otherwise it is set to zero.
1019    
1020           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
1021    
1022         The output is an integer that is set to the value of the code  that  is         The  output  is  an integer whose value specifies the default character
1023         used  for the newline character. It is either linefeed (10) or carriage         sequence that is recognized as meaning "newline". The four values  that
1024         return (13), and should normally be the  standard  character  for  your         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
1025         operating system.         and -1 for ANY. The default should normally be  the  standard  sequence
1026           for your operating system.
1027    
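The values listed for PCRE_CONFIG_NEWLINE can be turned into readable
names with a small lookup. This is a sketch: newline_config_name is a
hypothetical helper, and in a real program the integer would be obtained
by calling pcre_config(PCRE_CONFIG_NEWLINE, &value). As an arithmetic
observation, 3338 equals (13 << 8) | 10, that is, the CR and LF codes
packed into one integer.

```c
#include <assert.h>

/* Maps the integer reported for PCRE_CONFIG_NEWLINE to a short name.
   The numeric values are the ones listed in the text above. */
const char *newline_config_name(int value)
{
    /* 3338 is CR (13) in the high byte and LF (10) in the low byte. */
    assert(((13 << 8) | 10) == 3338);

    switch (value) {
    case 10:   return "LF";
    case 13:   return "CR";
    case 3338: return "CRLF";
    case -2:   return "ANYCRLF";
    case -1:   return "ANY";
    default:   return "unknown";
    }
}
```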
1028             PCRE_CONFIG_BSR
1029    
1030           The output is an integer whose value indicates what character sequences
1031           the \R escape sequence matches by default. A value of 0 means  that  \R
1032           matches  any  Unicode  line ending sequence; a value of 1 means that \R
1033           matches only CR, LF, or CRLF. The default can be overridden when a pat-
1034           tern is compiled or matched.
1035    
1036           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
1037    
# Line 514  CHECKING BUILD-TIME OPTIONS Line 1054  CHECKING BUILD-TIME OPTIONS
1054         internal  matching  function  calls in a pcre_exec() execution. Further         internal  matching  function  calls in a pcre_exec() execution. Further
1055         details are given with pcre_exec() below.         details are given with pcre_exec() below.
1056    
1057             PCRE_CONFIG_MATCH_LIMIT_RECURSION
1058    
1059           The output is an integer that gives the default limit for the depth  of
1060           recursion  when calling the internal matching function in a pcre_exec()
1061           execution. Further details are given with pcre_exec() below.
1062    
1063           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
1064    
1065         The output is an integer that is set to one if  internal  recursion  is         The output is an integer that is set to one if internal recursion  when
1066         implemented  by recursive function calls that use the stack to remember         running pcre_exec() is implemented by recursive function calls that use
1067         their state. This is the usual way that PCRE is compiled. The output is         the stack to remember their state. This is the usual way that  PCRE  is
1068         zero  if PCRE was compiled to use blocks of data on the heap instead of         compiled. The output is zero if PCRE was compiled to use blocks of data
1069         recursive  function  calls.  In  this   case,   pcre_stack_malloc   and         on the  heap  instead  of  recursive  function  calls.  In  this  case,
1070         pcre_stack_free  are  called  to manage memory blocks on the heap, thus         pcre_stack_malloc  and  pcre_stack_free  are  called  to  manage memory
1071         avoiding the use of the stack.         blocks on the heap, thus avoiding the use of the stack.
1072    
1073    
1074  COMPILING A PATTERN  COMPILING A PATTERN
# Line 531  COMPILING A PATTERN Line 1077  COMPILING A PATTERN
1077              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
1078              const unsigned char *tableptr);              const unsigned char *tableptr);
1079    
1080           pcre *pcre_compile2(const char *pattern, int options,
1081                int *errorcodeptr,
1082                const char **errptr, int *erroffset,
1083                const unsigned char *tableptr);
1084    
1085         The function pcre_compile() is called to  compile  a  pattern  into  an         Either of the functions pcre_compile() or pcre_compile2() can be called
1086         internal  form.  The pattern is a C string terminated by a binary zero,         to compile a pattern into an internal form. The only difference between
1087         and is passed in the argument pattern. A pointer to a single  block  of         the two interfaces is that pcre_compile2() has an additional  argument,
1088         memory  that is obtained via pcre_malloc is returned. This contains the         errorcodeptr, via which a numerical error code can be returned.
1089         compiled code and related data.  The  pcre  type  is  defined  for  the  
1090         returned  block;  this  is a typedef for a structure whose contents are         The pattern is a C string terminated by a binary zero, and is passed in
1091         not externally defined. It is up to the caller to free the memory  when         the pattern argument. A pointer to a single block  of  memory  that  is
1092         it is no longer required.         obtained  via  pcre_malloc is returned. This contains the compiled code
1093           and related data. The pcre type is defined for the returned block; this
1094           is a typedef for a structure whose contents are not externally defined.
1095           It is up to the caller to free the memory (via pcre_free) when it is no
1096           longer required.
1097    
1098         Although  the compiled code of a PCRE regex is relocatable, that is, it         Although  the compiled code of a PCRE regex is relocatable, that is, it
1099         does not depend on memory location, the complete pcre data block is not         does not depend on memory location, the complete pcre data block is not
1100         fully relocatable, because it contains a copy of the tableptr argument,         fully  relocatable, because it may contain a copy of the tableptr argu-
1101         which is an address (see below).         ment, which is an address (see below).
1102    
1103         The options argument contains independent bits that affect the compila-         The options argument contains various bit settings that affect the com-
1104         tion.  It  should  be  zero  if  no  options  are required. Some of the         pilation.  It  should be zero if no options are required. The available
1105         options, in particular, those that are compatible with Perl,  can  also         options are described below. Some of them, in  particular,  those  that
1106         be  set and unset from within the pattern (see the detailed description         are  compatible  with  Perl,  can also be set and unset from within the
1107         of regular expressions in the  pcrepattern  documentation).  For  these         pattern (see the detailed description  in  the  pcrepattern  documenta-
1108         options,  the  contents of the options argument specifies their initial         tion).  For  these options, the contents of the options argument speci-
1109         settings at the start of compilation and execution.  The  PCRE_ANCHORED         fies their initial settings at the start of compilation and  execution.
1110         option can be set at the time of matching as well as at compile time.         The  PCRE_ANCHORED  and PCRE_NEWLINE_xxx options can be set at the time
1111           of matching as well as at compile time.
1112    
1113         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1114         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if  compilation  of  a  pattern fails, pcre_compile() returns NULL, and
1115         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
1116         sage. The offset from the start of the pattern to the  character  where         sage. This is a static string that is part of the library. You must not
1117         the  error  was  discovered  is  placed  in  the variable pointed to by         try to free it. The offset from the start of the pattern to the charac-
1118         erroffset, which must not be NULL. If it  is,  an  immediate  error  is         ter where the error was discovered is placed in the variable pointed to
1119           by erroffset, which must not be NULL. If it is, an immediate  error  is
1120         given.         given.
1121    
1122         If  the  final  argument, tableptr, is NULL, PCRE uses a default set of         If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1123         character tables which are built when it is compiled, using the default         codeptr argument is not NULL, a non-zero error code number is  returned
1124         C  locale.  Otherwise,  tableptr  must  be  the  result  of  a  call to         via  this argument in the event of an error. This is in addition to the
1125         pcre_maketables(). See the section on locale support below.         textual error message. Error codes and messages are listed below.
1126    
1127           If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
1128           character  tables  that  are  built  when  PCRE  is compiled, using the
1129           default C locale. Otherwise, tableptr must be an address  that  is  the
1130           result  of  a  call to pcre_maketables(). This value is stored with the
1131           compiled pattern, and used again by pcre_exec(), unless  another  table
1132           pointer is passed to it. For more discussion, see the section on locale
1133           support below.
1134    
1135         This code fragment shows a typical straightforward  call  to  pcre_com-         This code fragment shows a typical straightforward  call  to  pcre_com-
1136         pile():         pile():
# Line 581  COMPILING A PATTERN Line 1145  COMPILING A PATTERN
1145             &erroffset,       /* for error offset */             &erroffset,       /* for error offset */
1146             NULL);            /* use default character tables */             NULL);            /* use default character tables */
1147    
1148         The following option bits are defined:         The  following  names  for option bits are defined in the pcre.h header
1149           file:
1150    
1151           PCRE_ANCHORED           PCRE_ANCHORED
1152    
1153         If this bit is set, the pattern is forced to be "anchored", that is, it         If this bit is set, the pattern is forced to be "anchored", that is, it
1154         is constrained to match only at the first matching point in the  string         is  constrained to match only at the first matching point in the string
1155         which is being searched (the "subject string"). This effect can also be         that is being searched (the "subject string"). This effect can also  be
1156         achieved by appropriate constructs in the pattern itself, which is  the         achieved  by appropriate constructs in the pattern itself, which is the
1157         only way to do it in Perl.         only way to do it in Perl.
1158    
1159             PCRE_AUTO_CALLOUT
1160    
1161           If this bit is set, pcre_compile() automatically inserts callout items,
1162           all  with  number  255, before each pattern item. For discussion of the
1163           callout facility, see the pcrecallout documentation.
1164    
1165             PCRE_BSR_ANYCRLF
1166             PCRE_BSR_UNICODE
1167    
1168           These options (which are mutually exclusive) control what the \R escape
1169           sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1170           or to match any Unicode newline sequence. The default is specified when
1171           PCRE is built. It can be overridden from within the pattern, or by set-
1172           ting an option when a compiled pattern is matched.
1173    
1174           PCRE_CASELESS           PCRE_CASELESS
1175    
1176         If  this  bit is set, letters in the pattern match both upper and lower         If this bit is set, letters in the pattern match both upper  and  lower
1177         case letters. It is equivalent to Perl's  /i  option,  and  it  can  be         case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
1178         changed within a pattern by a (?i) option setting.         changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
1179           always  understands the concept of case for characters whose values are
1180           less than 128, so caseless matching is always possible. For  characters
1181           with  higher  values,  the concept of case is supported if PCRE is com-
1182           piled with Unicode property support, but not otherwise. If you want  to
1183           use  caseless  matching  for  characters 128 and above, you must ensure
1184           that PCRE is compiled with Unicode property support  as  well  as  with
1185           UTF-8 support.
1186    
1187           PCRE_DOLLAR_ENDONLY           PCRE_DOLLAR_ENDONLY
1188    
1189         If  this bit is set, a dollar metacharacter in the pattern matches only         If  this bit is set, a dollar metacharacter in the pattern matches only
1190         at the end of the subject string. Without this option,  a  dollar  also         at the end of the subject string. Without this option,  a  dollar  also
1191         matches  immediately before the final character if it is a newline (but         matches  immediately before a newline at the end of the string (but not
1192         not before any  other  newlines).  The  PCRE_DOLLAR_ENDONLY  option  is         before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
1193         ignored if PCRE_MULTILINE is set. There is no equivalent to this option         if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
1194         in Perl, and no way to set it within a pattern.         Perl, and no way to set it within a pattern.
1195    
1196           PCRE_DOTALL           PCRE_DOTALL
1197    
1198         If this bit is set, a dot metacharacter in the pattern matches all char-         If this bit is set, a dot metacharacter in the pattern matches all char-
1199         acters,  including  newlines.  Without  it, newlines are excluded. This         acters,  including  those that indicate newline. Without it, a dot does
1200         option is equivalent to Perl's /s option, and it can be changed  within         not match when the current position is at a  newline.  This  option  is
1201         a  pattern  by  a  (?s)  option  setting. A negative class such as [^a]         equivalent  to Perl's /s option, and it can be changed within a pattern
1202         always matches a newline character, independent of the setting of  this         by a (?s) option setting. A negative class such as [^a] always  matches
1203         option.         newline characters, independent of the setting of this option.
1204    
1205             PCRE_DUPNAMES
1206    
1207           If  this  bit is set, names used to identify capturing subpatterns need
1208           not be unique. This can be helpful for certain types of pattern when it
1209           is  known  that  only  one instance of the named subpattern can ever be
1210           matched. There are more details of named subpatterns  below;  see  also
1211           the pcrepattern documentation.
1212    
1213           PCRE_EXTENDED           PCRE_EXTENDED
1214    
1215         If  this  bit  is  set,  whitespace  data characters in the pattern are         If  this  bit  is  set,  whitespace  data characters in the pattern are
1216         totally ignored except  when  escaped  or  inside  a  character  class.         totally ignored except when escaped or inside a character class. White-
1217         Whitespace  does  not  include the VT character (code 11). In addition,         space does not include the VT character (code 11). In addition, charac-
1218         characters between an unescaped # outside a  character  class  and  the         ters between an unescaped # outside a character class and the next new-
1219         next newline character, inclusive, are also ignored. This is equivalent         line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
1220         to Perl's /x option, and it can be changed within a pattern by  a  (?x)         option, and it can be changed within a pattern by a  (?x)  option  set-
1221         option setting.         ting.
1222    
1223         This  option  makes  it possible to include comments inside complicated         This  option  makes  it possible to include comments inside complicated
1224         patterns.  Note, however, that this applies only  to  data  characters.         patterns.  Note, however, that this applies only  to  data  characters.
# Line 639  COMPILING A PATTERN Line 1234  COMPILING A PATTERN
1234         letter that has no special meaning  causes  an  error,  thus  reserving         letter that has no special meaning  causes  an  error,  thus  reserving
1235         these  combinations  for  future  expansion.  By default, as in Perl, a         these  combinations  for  future  expansion.  By default, as in Perl, a
1236         backslash followed by a letter with no special meaning is treated as  a         backslash followed by a letter with no special meaning is treated as  a
1237         literal.  There  are  at  present  no other features controlled by this         literal.  (Perl can, however, be persuaded to give a warning for this.)
1238         option. It can also be set by a (?X) option setting within a pattern.         There are at present no other features controlled by  this  option.  It
1239           can also be set by a (?X) option setting within a pattern.
1240    
1241             PCRE_FIRSTLINE
1242    
1243           If  this  option  is  set,  an  unanchored pattern is required to match
1244           before or at the first  newline  in  the  subject  string,  though  the
1245           matched text may continue over the newline.
1246    
1247           PCRE_MULTILINE           PCRE_MULTILINE
1248    
1249         By default, PCRE treats the subject string as consisting  of  a  single         By  default,  PCRE  treats the subject string as consisting of a single
1250         "line"  of  characters (even if it actually contains several newlines).         line of characters (even if it actually contains newlines). The  "start
1251         The "start of line" metacharacter (^) matches only at the start of  the         of  line"  metacharacter  (^)  matches only at the start of the string,
1252         string,  while  the "end of line" metacharacter ($) matches only at the         while the "end of line" metacharacter ($) matches only at  the  end  of
1253         end of the string, or before a terminating  newline  (unless  PCRE_DOL-         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1254         LAR_ENDONLY is set). This is the same as Perl.         is set). This is the same as Perl.
1255    
1256         When  PCRE_MULTILINE  is  set,  the "start of line" and "end of line"         When PCRE_MULTILINE is set, the "start of line"  and  "end  of  line"
1257         constructs match immediately following or immediately before  any  new-         constructs  match  immediately following or immediately before internal
1258         line  in the subject string, respectively, as well as at the very start         newlines in the subject string, respectively, as well as  at  the  very
1259         and end. This is equivalent to Perl's /m option, and it can be  changed         start  and  end.  This is equivalent to Perl's /m option, and it can be
1260         within a pattern by a (?m) option setting. If there are no "\n" charac-         changed within a pattern by a (?m) option setting. If there are no new-
1261         ters in a subject string, or no occurrences of ^ or  $  in  a  pattern,         lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,
1262         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
1263    
1264             PCRE_NEWLINE_CR
1265             PCRE_NEWLINE_LF
1266             PCRE_NEWLINE_CRLF
1267             PCRE_NEWLINE_ANYCRLF
1268             PCRE_NEWLINE_ANY
1269    
1270           These options override the default newline definition that  was  chosen
1271           when  PCRE  was built. Setting the first or the second specifies that a
1272           newline is indicated by a single character (CR  or  LF,  respectively).
1273           Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
1274           two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies
1275           that any of the three preceding sequences should be recognized. Setting
1276           PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be
1277           recognized. The Unicode newline sequences are the three just mentioned,
1278           plus the single characters VT (vertical  tab,  U+000B),  FF  (formfeed,
1279           U+000C),  NEL  (next line, U+0085), LS (line separator, U+2028), and PS
1280           (paragraph separator, U+2029). The last  two  are  recognized  only  in
1281           UTF-8 mode.
1282    
1283           The  newline  setting  in  the  options  word  uses three bits that are
1284           treated as a number, giving eight possibilities. Currently only six are
1285           used  (default  plus the five values above). This means that if you set
1286           more than one newline option, the combination may or may not be  sensi-
1287           ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1288           PCRE_NEWLINE_CRLF, but other combinations may yield unused numbers  and
1289           cause an error.
1290    
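The three-bit newline field described above can be sketched as follows.
The constants mirror the values believed to be used for the
PCRE_NEWLINE_xxx options in pcre.h of this period; they are redefined
here under hypothetical SK_ names as assumptions, so check them against
the installed header. The helper functions are likewise illustrative.

```c
#include <assert.h>

/* Assumed option values (hypothetical SK_ names, not from pcre.h). */
#define SK_NEWLINE_CR       0x00100000u
#define SK_NEWLINE_LF       0x00200000u
#define SK_NEWLINE_CRLF     0x00300000u
#define SK_NEWLINE_ANY      0x00400000u
#define SK_NEWLINE_ANYCRLF  0x00500000u
#define SK_NEWLINE_MASK     0x00700000u   /* the three-bit field */

/* Extracts the newline field from an options word as a number 0..7. */
unsigned newline_field(unsigned options)
{
    return (options & SK_NEWLINE_MASK) >> 20;
}

/* Returns 1 if the field holds one of the six values currently in use
   (0 for the build-time default, 1..5 for the five options above). */
int newline_field_in_use(unsigned options)
{
    unsigned n = newline_field(options);
    assert(n <= 7);
    return n <= 5;
}
```

With these values, SK_NEWLINE_CR | SK_NEWLINE_LF gives the same field as
SK_NEWLINE_CRLF, whereas SK_NEWLINE_LF | SK_NEWLINE_ANY gives 6, which
is one of the unused numbers.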
1291           The  only time that a line break is specially recognized when compiling
1292           a pattern is if PCRE_EXTENDED is set, and  an  unescaped  #  outside  a
1293           character  class  is  encountered.  This indicates a comment that lasts
1294           until after the next line break sequence. In other circumstances,  line
1295           break   sequences   are   treated  as  literal  data,  except  that  in
1296           PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters
1297           and are therefore ignored.
1298    
1299           The newline option that is set at compile time becomes the default that
1300           is used for pcre_exec() and pcre_dfa_exec(), but it can be  overridden.
1301    
1302           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
1303    
1304         If this option is set, it disables the use of numbered capturing paren-         If this option is set, it disables the use of numbered capturing paren-
# Line 678  COMPILING A PATTERN Line 1318  COMPILING A PATTERN
1318    
1319         This option causes PCRE to regard both the pattern and the  subject  as         This option causes PCRE to regard both the pattern and the  subject  as
1320         strings  of  UTF-8 characters instead of single-byte character strings.         strings  of  UTF-8 characters instead of single-byte character strings.
1321         However, it is available only if PCRE has been built to  include  UTF-8         However, it is available only when PCRE is built to include UTF-8  sup-
1322         support.  If  not, the use of this option provokes an error. Details of         port.  If not, the use of this option provokes an error. Details of how
1323         how this option changes the behaviour of PCRE are given in the  section         this option changes the behaviour of PCRE are given in the  section  on
1324         on UTF-8 support in the main pcre page.         UTF-8 support in the main pcre page.
1325    
1326           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
1327    
1328         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1329         automatically checked. If an invalid UTF-8 sequence of bytes is  found,         automatically checked. There is a  discussion  about  the  validity  of
1330         pcre_compile()  returns an error. If you already know that your pattern         UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
1331         is valid, and you want to skip this check for performance reasons,  you         bytes is found, pcre_compile() returns an error. If  you  already  know
1332         can  set  the  PCRE_NO_UTF8_CHECK option. When it is set, the effect of         that your pattern is valid, and you want to skip this check for perfor-
1333         passing an invalid UTF-8 string as a pattern is undefined. It may cause         mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
1334         your  program  to  crash.  Note that there is a similar option for sup-         set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
1335         pressing the checking of subject strings passed to pcre_exec().         undefined. It may cause your program to crash. Note  that  this  option
1336           can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
1337           UTF-8 validity checking of subject strings.
1338    
1339    
1340    COMPILATION ERROR CODES
1341    
1342           The following table lists the error  codes  that  may  be  returned  by
1343           pcre_compile2(),  along with the error messages that may be returned by
1344           both compiling functions. As PCRE has developed, some error codes  have
1345           fallen out of use. To avoid confusion, they have not been re-used.
1346    
1347              0  no error
1348              1  \ at end of pattern
1349              2  \c at end of pattern
1350              3  unrecognized character follows \
1351              4  numbers out of order in {} quantifier
1352              5  number too big in {} quantifier
1353              6  missing terminating ] for character class
1354              7  invalid escape sequence in character class
1355              8  range out of order in character class
1356              9  nothing to repeat
1357             10  [this code is not in use]
1358             11  internal error: unexpected repeat
1359             12  unrecognized character after (? or (?-
1360             13  POSIX named classes are supported only within a class
1361             14  missing )
1362             15  reference to non-existent subpattern
1363             16  erroffset passed as NULL
1364             17  unknown option bit(s) set
1365             18  missing ) after comment
1366             19  [this code is not in use]
1367             20  regular expression is too large
1368             21  failed to get memory
1369             22  unmatched parentheses
1370             23  internal error: code overflow
1371             24  unrecognized character after (?<
1372             25  lookbehind assertion is not fixed length
1373             26  malformed number or name after (?(
1374             27  conditional group contains more than two branches
1375             28  assertion expected after (?(
1376             29  (?R or (?[+-]digits must be followed by )
1377             30  unknown POSIX class name
1378             31  POSIX collating elements are not supported
1379             32  this version of PCRE is not compiled with PCRE_UTF8 support
1380             33  [this code is not in use]
1381             34  character value in \x{...} sequence is too large
1382             35  invalid condition (?(0)
1383             36  \C not allowed in lookbehind assertion
1384             37  PCRE does not support \L, \l, \N, \U, or \u
1385             38  number after (?C is > 255
1386             39  closing ) for (?C expected
1387             40  recursive call could loop indefinitely
1388             41  unrecognized character after (?P
1389             42  syntax error in subpattern name (missing terminator)
1390             43  two named subpatterns have the same name
1391             44  invalid UTF-8 string
1392             45  support for \P, \p, and \X has not been compiled
1393             46  malformed \P or \p sequence
1394             47  unknown property name after \P or \p
1395             48  subpattern name is too long (maximum 32 characters)
1396             49  too many named subpatterns (maximum 10000)
1397             50  [this code is not in use]
1398             51  octal value is greater than \377 (not in UTF-8 mode)
1399             52  internal error: overran compiling workspace
1400             53  internal error: previously-checked referenced subpattern not
1401                   found
1402             54  DEFINE group contains more than one branch
1403             55  repeating a DEFINE group is not allowed
1404             56  inconsistent NEWLINE options
1405             57  \g is not followed by a braced name or an optionally braced
1406                   non-zero number
1407             58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number
1408             59  (*VERB) with an argument is not supported
1409             60  (*VERB) not recognized
1410             61  number is too big
1411             62  subpattern name expected
1412             63  digit expected after (?+
1413    
1414           The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
1415           values may be used if the limits were changed when PCRE was built.
1416    
1417    
1418  STUDYING A PATTERN  STUDYING A PATTERN
1419    
1420         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1421              const char **errptr);              const char **errptr);
1422    
1423         When a pattern is going to be used several times, it is worth  spending         If  a  compiled  pattern is going to be used several times, it is worth
1424         more  time  analyzing it in order to speed up the time taken for match-         spending more time analyzing it in order to speed up the time taken for
1425         ing. The function pcre_study() takes a pointer to a compiled pattern as         matching.  The function pcre_study() takes a pointer to a compiled pat-
1426         its first argument. If studing the pattern produces additional informa-         tern as its first argument. If studying the pattern produces additional
1427         tion that will help speed up matching, pcre_study() returns  a  pointer         information  that  will  help speed up matching, pcre_study() returns a
1428         to  a  pcre_extra  block,  in  which the study_data field points to the         pointer to a pcre_extra block, in which the study_data field points  to
1429         results of the study.         the results of the study.
1430    
1431         The returned value from  a  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
1432         pcre_exec().  However,  the pcre_extra block also contains other fields         pcre_exec(). However, a pcre_extra block  also  contains  other  fields
1433         that can be set by the caller before the block  is  passed;  these  are         that  can  be  set  by the caller before the block is passed; these are
1434         described  below.  If  studying  the pattern does not produce any addi-         described below in the section on matching a pattern.
1435         tional information, pcre_study() returns NULL. In that circumstance, if  
1436         the  calling  program  wants  to  pass  some  of  the  other  fields to         If studying the pattern does not  produce  any  additional  information,
1437         pcre_exec(), it must set up its own pcre_extra block.         pcre_study() returns NULL. In that circumstance, if the calling program
1438           wants to pass any of the other fields to pcre_exec(), it  must  set  up
1439         The second argument contains option bits. At present,  no  options  are         its own pcre_extra block.
1440         defined for pcre_study(), and this argument should always be zero.  
1441           The  second  argument of pcre_study() contains option bits. At present,
1442         The  third argument for pcre_study() is a pointer for an error message.         no options are defined, and this argument should always be zero.
1443         If studying succeeds (even if no data is  returned),  the  variable  it  
1444         points  to  is set to NULL. Otherwise it points to a textual error mes-         The third argument for pcre_study() is a pointer for an error  message.
1445         sage. You should therefore test the error pointer for NULL after  call-         If  studying  succeeds  (even  if no data is returned), the variable it
1446         ing pcre_study(), to be sure that it has run successfully.         points to is set to NULL. Otherwise it is set to  point  to  a  textual
1447           error message. This is a static string that is part of the library. You
1448           must not try to free it. You should test the  error  pointer  for  NULL
1449           after calling pcre_study(), to be sure that it has run successfully.
1450    
1451         This is a typical call to pcre_study():         This is a typical call to pcre_study():
1452    
# Line 736  STUDYING A PATTERN Line 1458  STUDYING A PATTERN
1458    
1459         At present, studying a pattern is useful only for non-anchored patterns         At present, studying a pattern is useful only for non-anchored patterns
1460         that do not have a single fixed starting character. A bitmap of  possi-         that do not have a single fixed starting character. A bitmap of  possi-
1461         ble starting characters is created.         ble starting bytes is created.
1462    
1463    
1464  LOCALE SUPPORT  LOCALE SUPPORT
1465    
1466         PCRE  handles  caseless matching, and determines whether characters are         PCRE  handles  caseless matching, and determines whether characters are
1467         letters, digits, or whatever, by reference to a  set  of  tables.  When         letters, digits, or whatever, by reference to a set of tables,  indexed
1468         running  in UTF-8 mode, this applies only to characters with codes less         by  character  value.  When running in UTF-8 mode, this applies only to
1469         than 256. The library contains a default set of tables that is  created         characters with codes less than 128. Higher-valued  codes  never  match
1470         in  the  default  C locale when PCRE is compiled. This is used when the         escapes  such  as  \w or \d, but can be tested with \p if PCRE is built
1471         final argument of pcre_compile() is NULL, and is  sufficient  for  many         with Unicode character property support. The use of locales  with  Uni-
1472         applications.         code  is discouraged. If you are handling characters with codes greater
1473           than 128, you should either use UTF-8 and Unicode, or use locales,  but
1474         An alternative set of tables can, however, be supplied. Such tables are         not try to mix the two.
1475         built by calling the pcre_maketables() function,  which  has  no  argu-  
1476         ments,  in  the  relevant  locale.  The  result  can  then be passed to         PCRE  contains  an  internal set of tables that are used when the final
1477         pcre_compile() as often as necessary. For example,  to  build  and  use         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
1478         tables that are appropriate for the French locale (where accented char-         applications.  Normally, the internal tables recognize only ASCII char-
1479         acters with codes greater than 128 are treated as letters), the follow-         acters. However, when PCRE is built, it is possible to cause the inter-
1480         ing code could be used:         nal tables to be rebuilt in the default "C" locale of the local system,
1481           which may cause them to be different.
1482    
1483           The internal tables can always be overridden by tables supplied by  the
1484           application that calls PCRE. These may be created in a different locale
1485           from the default. As more and more applications change  to  using  Uni-
1486           code, the need for this locale support is expected to die away.
1487    
1488           External  tables  are  built by calling the pcre_maketables() function,
1489           which has no arguments, in the relevant locale. The result can then  be
1490           passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1491           example, to build and use tables that are appropriate  for  the  French
1492           locale  (where  accented  characters  with  values greater than 128 are
1493           treated as letters), the following code could be used:
1494    
1495           setlocale(LC_CTYPE, "fr");           setlocale(LC_CTYPE, "fr_FR");
1496           tables = pcre_maketables();           tables = pcre_maketables();
1497           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
1498    
1499         The  tables  are  built in memory that is obtained via pcre_malloc. The         The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1500         pointer that is passed to pcre_compile is saved with the compiled  pat-         if you are using Windows, the name for the French locale is "french".
1501         tern, and the same tables are used via this pointer by pcre_study() and  
1502         pcre_exec(). Thus, for any single pattern,  compilation,  studying  and         When  pcre_maketables()  runs,  the  tables are built in memory that is
1503         matching  all  happen in the same locale, but different patterns can be         obtained via pcre_malloc. It is the caller's responsibility  to  ensure
1504         compiled in different locales. It is  the  caller's  responsibility  to         that  the memory containing the tables remains available for as long as
1505         ensure  that  the memory containing the tables remains available for as         it is needed.
1506         long as it is needed.  
1507           The pointer that is passed to pcre_compile() is saved with the compiled
1508           pattern,  and the same tables are used via this pointer by pcre_study()
1509           and normally also by pcre_exec(). Thus, by default, for any single pat-
1510           tern, compilation, studying and matching all happen in the same locale,
1511           but different patterns can be compiled in different locales.
1512    
1513           It is possible to pass a table pointer or NULL (indicating the  use  of
1514           the  internal  tables)  to  pcre_exec(). Although not intended for this
1515           purpose, this facility could be used to match a pattern in a  different
1516           locale from the one in which it was compiled. Passing table pointers at
1517           run time is discussed below in the section on matching a pattern.
1518    
1519    
1520  INFORMATION ABOUT A PATTERN  INFORMATION ABOUT A PATTERN
# Line 792  INFORMATION ABOUT A PATTERN Line 1538  INFORMATION ABOUT A PATTERN
1538           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1539           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADOPTION  the value of what was invalid
1540    
1541         Here  is a typical call of pcre_fullinfo(), to obtain the length of the         The  "magic  number" is placed at the start of each compiled pattern as
1542         compiled pattern:         a simple check against passing an arbitrary memory pointer. Here is  a
1543           typical  call  of pcre_fullinfo(), to obtain the length of the compiled
1544           pattern:
1545    
1546           int rc;           int rc;
1547           unsigned long int length;           size_t length;
1548           rc = pcre_fullinfo(           rc = pcre_fullinfo(
1549             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
1550             pe,               /* result of pcre_study(), or NULL */             pe,               /* result of pcre_study(), or NULL */
# Line 817  INFORMATION ABOUT A PATTERN Line 1565  INFORMATION ABOUT A PATTERN
1565         Return  the  number of capturing subpatterns in the pattern. The fourth         Return  the  number of capturing subpatterns in the pattern. The fourth
1566         argument should point to an int variable.         argument should point to an int variable.
1567    
1568             PCRE_INFO_DEFAULT_TABLES
1569    
1570           Return a pointer to the internal default character tables within  PCRE.
1571           The  fourth  argument should point to an unsigned char * variable. This
1572           information call is provided for internal use by the pcre_study() func-
1573           tion.  External  callers  can  cause PCRE to use its internal tables by
1574           passing a NULL table pointer.
1575    
1576           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
1577    
1578         Return information about the first byte of any matched  string,  for  a         Return information about the first byte of any matched  string,  for  a
1579         non-anchored    pattern.    (This    option    used    to   be   called         non-anchored  pattern. The fourth argument should point to an int vari-
1580         PCRE_INFO_FIRSTCHAR; the old name is  still  recognized  for  backwards         able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
1581         compatibility.)         is still recognized for backwards compatibility.)
   
        If  there  is  a  fixed  first  byte,  e.g.  from  a  pattern  such  as  
        (cat|cow|coyote), it is returned in the integer pointed  to  by  where.  
        Otherwise, if either  
1582    
1583         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         If  there  is  a  fixed first byte, for example, from a pattern such as
1584           (cat|cow|coyote), its value is returned. Otherwise, if either
1585    
1586           (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1587         branch starts with "^", or         branch starts with "^", or
1588    
1589         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1590         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
1591    
1592         -1  is  returned, indicating that the pattern matches only at the start         -1 is returned, indicating that the pattern matches only at  the  start
1593         of a subject string or after any newline within the  string.  Otherwise         of  a  subject string or after any newline within the string. Otherwise
1594         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
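
The three outcomes described for PCRE_INFO_FIRSTBYTE can be told apart with a small helper. The sketch below is illustrative only: classify_first_byte and the first_byte_kind enum are hypothetical names, not part of PCRE, and the value is assumed to be the int that a pcre_fullinfo() call stored through its fourth argument.

```c
#include <assert.h>

/* Hypothetical names for illustration; none of these are defined by PCRE. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,       /* value >= 0: every match starts with this byte */
    FIRST_BYTE_LINE_START,  /* value == -1: match only at start or after a newline */
    FIRST_BYTE_NONE         /* value == -2: no information, or pattern is anchored */
};

/* Classify the int that PCRE_INFO_FIRSTBYTE stores through the fourth argument. */
enum first_byte_kind classify_first_byte(int value)
{
    /* The text above describes only byte values, -1, and -2. */
    assert(value >= -2 && value <= 255);

    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_BYTE_LINE_START;
    return FIRST_BYTE_NONE;
}
```

A caller would typically switch on the result, using the fixed byte (when there is one) to skip quickly to candidate match positions.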
1595    
1596           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
1597    
1598         If  the pattern was studied, and this resulted in the construction of a         If the pattern was studied, and this resulted in the construction of  a
1599         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of bytes for the first byte in any
1600         matching  string, a pointer to the table is returned. Otherwise NULL is         matching string, a pointer to the table is returned. Otherwise NULL  is
1601         returned. The fourth argument should point to an unsigned char *  vari-         returned.  The fourth argument should point to an unsigned char * vari-
1602         able.         able.
1603    
1604             PCRE_INFO_HASCRORLF
1605    
1606           Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
1607           characters,  otherwise  0.  The  fourth argument should point to an int
1608           variable. An explicit match is either a literal CR or LF character,  or
1609           \r or \n.
1610    
1611             PCRE_INFO_JCHANGED
1612    
1613           Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
1614           otherwise 0. The fourth argument should point to an int variable.  (?J)
1615           and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1616    
1617           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
1618    
1619         Return  the  value of the rightmost literal byte that must exist in any         Return  the  value of the rightmost literal byte that must exist in any
# Line 862  INFORMATION ABOUT A PATTERN Line 1630  INFORMATION ABOUT A PATTERN
1630    
1631         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE  supports the use of named as well as numbered capturing parenthe-
1632         ses. The names are just an additional way of identifying the  parenthe-         ses. The names are just an additional way of identifying the  parenthe-
1633         ses,  which still acquire a number. A caller that wants to extract data         ses, which still acquire numbers. Several convenience functions such as
1634         from a named subpattern must convert the name to a number in  order  to         pcre_get_named_substring() are provided for  extracting  captured  sub-
1635         access  the  correct  pointers  in  the  output  vector (described with         strings  by  name. It is also possible to extract the data directly, by
1636         pcre_exec() below). In order to do this, it must first use these  three         first converting the name to a number in order to  access  the  correct
1637         values to obtain the name-to-number mapping table for the pattern.         pointers in the output vector (described with pcre_exec() below). To do
1638           the conversion, you need  to  use  the  name-to-number  map,  which  is
1639           described by these three values.
1640    
1641         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1642         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
# Line 876  INFORMATION ABOUT A PATTERN Line 1646  INFORMATION ABOUT A PATTERN
1646         first two bytes of each entry are the number of the capturing parenthe-         first two bytes of each entry are the number of the capturing parenthe-
1647         sis,  most  significant byte first. The rest of the entry is the corre-         sis,  most  significant byte first. The rest of the entry is the corre-
1648         sponding name, zero terminated. The names are  in  alphabetical  order.         sponding name, zero terminated. The names are  in  alphabetical  order.
1649         For  example,  consider  the following pattern (assume PCRE_EXTENDED is         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-
1650         set, so white space - including newlines - is ignored):         theses numbers. For example, consider  the  following  pattern  (assume
1651           PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is
1652           ignored):
1653    
1654           (?P<date> (?P<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
1655           (?P<month>\d\d) - (?P<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
1656    
1657         There are four named subpatterns, so the table has  four  entries,  and         There are four named subpatterns, so the table has  four  entries,  and
1658         each  entry  in the table is eight bytes long. The table is as follows,         each  entry  in the table is eight bytes long. The table is as follows,
1659         with non-printing bytes shows in hex, and undefined bytes shown as ??:         with non-printing bytes shown in hexadecimal, and undefined bytes shown
1660           as ??:
1661    
1662           00 01 d  a  t  e  00 ??           00 01 d  a  t  e  00 ??
1663           00 05 d  a  y  00 ?? ??           00 05 d  a  y  00 ?? ??
1664           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
1665           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
1666    
1667         When writing code to extract data from named subpatterns, remember that         When  writing  code  to  extract  data from named subpatterns using the
1668         the length of each entry may be different for each compiled pattern.         name-to-number map, remember that the length of the entries  is  likely
1669           to be different for each compiled pattern.
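
As a rough sketch of how the name-to-number map might be walked by hand, the following C function scans a table with the layout described above. The function name find_group_number and the array example_table are hypothetical and not part of PCRE. In a real program the table pointer, entry count, and entry size would come from pcre_fullinfo(); here the four-entry table from the example is written out directly, with the undefined bytes set to zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Look up a subpattern name in a name-to-number table: name_count entries
 * of entry_size bytes each. The first two bytes of an entry hold the group
 * number, most significant byte first; the zero-terminated name follows.
 * Returns the group number, or -1 if the name is not in the table.
 * (Hypothetical helper, not a PCRE function.)
 */
int find_group_number(const unsigned char *table, int name_count,
                      int entry_size, const char *name)
{
    int i;

    assert(table != NULL && name != NULL && entry_size >= 3);

    for (i = 0; i < name_count; i++) {
        /* Step by the per-pattern entry size, never a hard-coded width. */
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* The table from the example: four entries of eight bytes each. */
const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};
```

A linear scan keeps the sketch short. Because the names are stored in alphabetical order, a binary search would also work for larger tables.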
1670    
1671             PCRE_INFO_OKPARTIAL
1672    
1673           Return  1 if the pattern can be used for partial matching, otherwise 0.
1674           The fourth argument should point to an int  variable.  The  pcrepartial
1675           documentation  lists  the restrictions that apply to patterns when par-
1676           tial matching is used.
1677    
1678           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
1679    
1680         Return  a  copy of the options with which the pattern was compiled. The         Return a copy of the options with which the pattern was  compiled.  The
1681         fourth argument should point to an unsigned long  int  variable.  These         fourth  argument  should  point to an unsigned long int variable. These
1682         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
1683         by any top-level option settings within the pattern itself.         by any top-level option settings at the start of the pattern itself. In
1684           other words, they are the options that will be in force  when  matching
1685           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1686           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1687           and PCRE_EXTENDED.
1688    
1689         A pattern is automatically anchored by PCRE if  all  of  its  top-level         A  pattern  is  automatically  anchored by PCRE if all of its top-level
1690         alternatives begin with one of the following:         alternatives begin with one of the following:
1691    
1692           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 915  INFORMATION ABOUT A PATTERN Line 1700  INFORMATION ABOUT A PATTERN
1700    
1701           PCRE_INFO_SIZE           PCRE_INFO_SIZE
1702    
1703         Return the size of the compiled pattern, that is, the  value  that  was         Return  the  size  of the compiled pattern, that is, the value that was
1704         passed as the argument to pcre_malloc() when PCRE was getting memory in         passed as the argument to pcre_malloc() when PCRE was getting memory in
1705         which to place the compiled data. The fourth argument should point to a         which to place the compiled data. The fourth argument should point to a
1706         size_t variable.         size_t variable.
1707    
1708           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
1709    
1710         Returns  the  size of the data block pointed to by the study_data field         Return the size of the data block pointed to by the study_data field in
1711         in a pcre_extra block. That is, it is the  value  that  was  passed  to         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to
1712         pcre_malloc() when PCRE was getting memory into which to place the data         pcre_malloc() when PCRE was getting memory into which to place the data
1713         created by pcre_study(). The fourth argument should point to  a  size_t         created  by  pcre_study(). The fourth argument should point to a size_t
1714         variable.         variable.
1715    
1716    
# Line 933  OBSOLETE INFO FUNCTION Line 1718  OBSOLETE INFO FUNCTION
1718    
1719         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1720    
1721         The  pcre_info()  function is now obsolete because its interface is too         The pcre_info() function is now obsolete because its interface  is  too
1722         restrictive to return all the available data about a compiled  pattern.         restrictive  to return all the available data about a compiled pattern.
1723         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of
1724         pcre_info() is the number of capturing subpatterns, or one of the  fol-         pcre_info()  is the number of capturing subpatterns, or one of the fol-
1725         lowing negative numbers:         lowing negative numbers:
1726    
1727           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL       the argument code was NULL
1728           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1729    
1730         If  the  optptr  argument is not NULL, a copy of the options with which         If the optptr argument is not NULL, a copy of the  options  with  which
1731         the pattern was compiled is placed in the integer  it  points  to  (see         the  pattern  was  compiled  is placed in the integer it points to (see
1732         PCRE_INFO_OPTIONS above).         PCRE_INFO_OPTIONS above).
1733    
1734         If  the  pattern  is  not anchored and the firstcharptr argument is not         If the pattern is not anchored and the  firstcharptr  argument  is  not
1735         NULL, it is used to pass back information about the first character  of         NULL,  it is used to pass back information about the first character of
1736         any matched string (see PCRE_INFO_FIRSTBYTE above).         any matched string (see PCRE_INFO_FIRSTBYTE above).
1737    
1738    
1739  MATCHING A PATTERN  REFERENCE COUNTS
1740    
1741           int pcre_refcount(pcre *code, int adjust);
1742    
1743           The pcre_refcount() function is used to maintain a reference  count  in
1744           the data block that contains a compiled pattern. It is provided for the
1745           benefit of applications that  operate  in  an  object-oriented  manner,
1746           where different parts of the application may be using the same compiled
1747           pattern, but you want to free the block when they are all done.
1748    
1749           When a pattern is compiled, the reference count field is initialized to
1750           zero.   It is changed only by calling this function, whose action is to
1751           add the adjust value (which may be positive or  negative)  to  it.  The
1752           yield of the function is the new value. However, the value of the count
1753           is constrained to lie between 0 and 65535, inclusive. If the new  value
1754           is outside these limits, it is forced to the appropriate limit value.
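
The clamping rule can be modelled in a few lines of C. This is only a model of the arithmetic described above, written under the assumption that the count behaves exactly as the text says; it is not the library's source, and model_refcount is a hypothetical name.

```c
/*
 * Model of the documented behaviour: add adjust to the current count,
 * then force the result into the range 0..65535 inclusive.
 * (Hypothetical helper for illustration; not a PCRE function.)
 */
int model_refcount(int current, int adjust)
{
    /* Use a wider type so that a large adjust value cannot overflow. */
    long long value = (long long)current + (long long)adjust;

    if (value < 0)
        value = 0;
    if (value > 65535)
        value = 65535;
    return (int)value;
}
```

One consequence worth noting: after the count has been forced to a limit, later adjustments no longer balance, so an application relying on the count should keep it well inside the range.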
1755    
1756           Except  when it is zero, the reference count is not correctly preserved
1757           if a pattern is compiled on one host and then  transferred  to  a  host
1758           whose byte-order is different. (This seems a highly unlikely scenario.)
1759    
1760    
1761    MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1762    
1763         int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
1764              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1765              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
1766    
1767         The  function pcre_exec() is called to match a subject string against a         The function pcre_exec() is called to match a subject string against  a
1768         pre-compiled pattern, which is passed in the code argument. If the pat-         compiled  pattern, which is passed in the code argument. If the pattern
1769         tern  has been studied, the result of the study should be passed in the         has been studied, the result of the study should be passed in the extra
1770         extra argument.         argument.  This  function is the main matching facility of the library,
1771           and it operates in a Perl-like manner. For specialist use there is also
1772           an  alternative matching function, which is described below in the sec-
1773           tion about the pcre_dfa_exec() function.
1774    
1775           In most applications, the pattern will have been compiled (and  option-
1776           ally  studied)  in the same process that calls pcre_exec(). However, it
1777           is possible to save compiled patterns and study data, and then use them
1778           later  in  different processes, possibly even on different hosts. For a
1779           discussion about this, see the pcreprecompile documentation.
1780    
1781         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
1782    
# Line 973  MATCHING A PATTERN Line 1789  MATCHING A PATTERN
1789             11,             /* the length of the subject string */             11,             /* the length of the subject string */
1790             0,              /* start at offset 0 in the subject */             0,              /* start at offset 0 in the subject */
1791             0,              /* default options */             0,              /* default options */
1792             ovector,        /* vector for substring information */             ovector,        /* vector of integers for substring information */
1793             30);            /* number of elements in the vector */             30);            /* number of elements (NOT size in bytes) */
1794    
1795       Extra data for pcre_exec()
1796    
1797         If the extra argument is not NULL, it must point to a  pcre_extra  data         If the extra argument is not NULL, it must point to a  pcre_extra  data
1798         block.  The pcre_study() function returns such a block (when it doesn't         block.  The pcre_study() function returns such a block (when it doesn't
1799         return NULL), but you can also create one for yourself, and pass  addi-         return NULL), but you can also create one for yourself, and pass  addi-
1800         tional information in it. The fields in the block are as follows:         tional  information  in it. The pcre_extra block contains the following
1801           fields (not necessarily in this order):
1802    
1803           unsigned long int flags;           unsigned long int flags;
1804           void *study_data;           void *study_data;
1805           unsigned long int match_limit;           unsigned long int match_limit;
1806             unsigned long int match_limit_recursion;
1807           void *callout_data;           void *callout_data;
1808             const unsigned char *tables;
1809    
1810         The  flags  field  is a bitmap that specifies which of the other fields         The flags field is a bitmap that specifies which of  the  other  fields
1811         are set. The flag bits are:         are set. The flag bits are:
1812    
1813           PCRE_EXTRA_STUDY_DATA           PCRE_EXTRA_STUDY_DATA
1814           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
1815             PCRE_EXTRA_MATCH_LIMIT_RECURSION
1816           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_CALLOUT_DATA
1817             PCRE_EXTRA_TABLES
1818    
1819         Other flag bits should be set to zero. The study_data field is  set  in         Other  flag  bits should be set to zero. The study_data field is set in
1820         the  pcre_extra  block  that is returned by pcre_study(), together with         the pcre_extra block that is returned by  pcre_study(),  together  with
1821         the appropriate flag bit. You should not set this yourself, but you can         the appropriate flag bit. You should not set this yourself, but you may
1822         add to the block by setting the other fields.         add to the block by setting the other fields  and  their  corresponding
1823           flag bits.
1824    
1825         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
1826         a vast amount of resources when running patterns that are not going  to         a vast amount of resources when running patterns that are not going  to
1827         match,  but  which  have  a very large number of possibilities in their         match,  but  which  have  a very large number of possibilities in their
1828         search trees. The classic  example  is  the  use  of  nested  unlimited         search trees. The classic  example  is  the  use  of  nested  unlimited
1829         repeats. Internally, PCRE uses a function called match() which it calls         repeats.
1830         repeatedly (sometimes recursively). The limit is imposed on the  number  
1831         of  times  this function is called during a match, which has the effect         Internally,  PCRE uses a function called match() which it calls repeat-
1832         of limiting the amount of recursion  and  backtracking  that  can  take         edly (sometimes recursively). The limit set by match_limit  is  imposed
1833         place.  For  patterns that are not anchored, the count starts from zero         on  the  number  of times this function is called during a match, which
1834           has the effect of limiting the amount of  backtracking  that  can  take
1835           place. For patterns that are not anchored, the count restarts from zero
1836         for each position in the subject string.         for each position in the subject string.
1837    
1838         The default limit for the library can be set when PCRE  is  built;  the         The default value for the limit can be set  when  PCRE  is  built;  the
1839         default  default  is 10 million, which handles all but the most extreme         default  default  is 10 million, which handles all but the most extreme
1840         cases. You can reduce  the  default  by  supplying  pcre_exec()  with  a         cases. You can override the default  by  supplying  pcre_exec()  with  a
1841         pcre_extra  block  in  which match_limit is set to a smaller value, and         pcre_extra     block    in    which    match_limit    is    set,    and
1842         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is         PCRE_EXTRA_MATCH_LIMIT is set in the  flags  field.  If  the  limit  is
1843         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.         exceeded, pcre_exec() returns PCRE_ERROR_MATCHLIMIT.
1844    
1845           The  match_limit_recursion field is similar to match_limit, but instead
1846           of limiting the total number of times that match() is called, it limits
1847           the  depth  of  recursion. The recursion depth is a smaller number than
1848           the total number of calls, because not all calls to match() are  recur-
1849           sive.  This limit is of use only if it is set smaller than match_limit.
1850    
1851           Limiting the recursion depth limits the amount of  stack  that  can  be
1852           used, or, when PCRE has been compiled to use memory on the heap instead
1853           of the stack, the amount of heap memory that can be used.
1854    
1855           The default value for match_limit_recursion can be  set  when  PCRE  is
1856           built;  the  default  default  is  the  same  value  as the default for
1857           match_limit. You can override the default by supplying pcre_exec()  with
1858           a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and
1859           PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the
1860           limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
1861    
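The difference between the two limits may be easier to see in a toy backtracking matcher. The sketch below is not PCRE's match() function. It is a deliberately tiny matcher (literals, '.', and 'x*' only, anchored at the start of the string) that counts every call and tracks the nesting depth. All names and return codes in it are hypothetical.

```c
#include <stddef.h>

/* Toy return codes; hypothetical values for this sketch only. */
enum { TOY_NOMATCH = -1, TOY_CALL_LIMIT = -2, TOY_DEPTH_LIMIT = -3 };

typedef struct {
    unsigned long calls;        /* total calls made so far */
    unsigned long call_limit;   /* plays the role of match_limit */
    unsigned long depth_limit;  /* plays the role of match_limit_recursion */
} toy_limits;

/* Returns 1 on a match at the start of s, or a negative TOY_* code. */
int toy_match(const char *re, const char *s, unsigned long depth,
              toy_limits *lim)
{
    if (++lim->calls > lim->call_limit) return TOY_CALL_LIMIT;
    if (depth > lim->depth_limit) return TOY_DEPTH_LIMIT;
    if (re[0] == '\0') return 1;
    if (re[1] == '*') {
        /* Greedy repeat: take as many characters as possible, then
           give them back one at a time (backtracking). */
        size_t n = 0;
        while (s[n] != '\0' && (re[0] == '.' || re[0] == s[n])) n++;
        for (;;) {
            int rc = toy_match(re + 2, s + n, depth + 1, lim);
            if (rc != TOY_NOMATCH) return rc;   /* match or limit hit */
            if (n == 0) return TOY_NOMATCH;
            n--;
        }
    }
    if (s[0] != '\0' && (re[0] == '.' || re[0] == s[0]))
        return toy_match(re + 1, s + 1, depth + 1, lim);
    return TOY_NOMATCH;
}
```

In this toy, matching a*a*a*b against a run of sixteen "a" characters never nests more than a few levels deep, but it makes over a thousand calls before failing, so a small call limit stops it early while a depth limit does not. A long literal pattern shows the opposite behaviour: few calls, but one level of nesting per character.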
1862         The  callout_data  field is used in conjunction with the "callout" fea-         The  callout_data  field is used in conjunction with the "callout" fea-
1863         ture, which is described in the pcrecallout documentation.         ture, which is described in the pcrecallout documentation.
1864    
1865         The PCRE_ANCHORED option can be passed in the options  argument,  whose         The tables field  is  used  to  pass  a  character  tables  pointer  to
1866         unused  bits  must  be zero. This limits pcre_exec() to matching at the         pcre_exec();  this overrides the value that is stored with the compiled
1867         first matching position.  However,  if  a  pattern  was  compiled  with         pattern. A non-NULL value is stored with the compiled pattern  only  if
1868         PCRE_ANCHORED,  or turned out to be anchored by virtue of its contents,         custom  tables  were  supplied to pcre_compile() via its tableptr argu-
1869         it cannot be made unanchored at matching time.         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
1870           PCRE's  internal  tables  to be used. This facility is helpful when re-
1871         When PCRE_UTF8 was set at compile time, the validity of the subject  as         using patterns that have been saved after compiling  with  an  external
1872         a  UTF-8  string is automatically checked, and the value of startoffset         set  of  tables,  because  the  external tables might be at a different
1873         is also checked to ensure that it points to the start of a UTF-8  char-         address when pcre_exec() is called. See the  pcreprecompile  documenta-
1874         acter.  If  an  invalid  UTF-8  sequence of bytes is found, pcre_exec()         tion for a discussion of saving compiled patterns for later use.
1875         returns  the  error  PCRE_ERROR_BADUTF8.  If  startoffset  contains  an  
1876         invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.     Option bits for pcre_exec()
1877    
1878         If  you  already  know that your subject is valid, and you want to skip         The  unused  bits of the options argument for pcre_exec() must be zero.
1879         these   checks   for   performance   reasons,   you   can    set    the         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,
1880         PCRE_NO_UTF8_CHECK  option  when calling pcre_exec(). You might want to         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and
1881         do this for the second and subsequent calls to pcre_exec() if  you  are         PCRE_PARTIAL.
1882         making  repeated  calls  to  find  all  the matches in a single subject  
1883         string. However, you should be  sure  that  the  value  of  startoffset           PCRE_ANCHORED
1884         points  to  the  start of a UTF-8 character. When PCRE_NO_UTF8_CHECK is  
1885         set, the effect of passing an invalid UTF-8 string as a subject,  or  a         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first
1886         value  of startoffset that does not point to the start of a UTF-8 char-         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or
1887         acter, is undefined. Your program may crash.         turned out to be anchored by virtue of its contents, it cannot be  made
1888           unanchored at matching time.
1889    
1890             PCRE_BSR_ANYCRLF
1891             PCRE_BSR_UNICODE
1892    
1893           These options (which are mutually exclusive) control what the \R escape
1894           sequence matches. The choice is either to match only CR, LF,  or  CRLF,
1895           or  to  match  any Unicode newline sequence. These options override the
1896           choice that was made or defaulted when the pattern was compiled.
1897    
1898             PCRE_NEWLINE_CR
1899             PCRE_NEWLINE_LF
1900             PCRE_NEWLINE_CRLF
1901             PCRE_NEWLINE_ANYCRLF
1902             PCRE_NEWLINE_ANY
1903    
1904           These options override  the  newline  definition  that  was  chosen  or
1905           defaulted  when the pattern was compiled. For details, see the descrip-
1906           tion of pcre_compile()  above.  During  matching,  the  newline  choice
1907           affects  the  behaviour  of the dot, circumflex, and dollar metacharac-
1908           ters. It may also alter the way the match position is advanced after  a
1909           match failure for an unanchored pattern.
1910    
1911           When  PCRE_NEWLINE_CRLF,  PCRE_NEWLINE_ANYCRLF,  or PCRE_NEWLINE_ANY is
1912           set, and a match attempt for an unanchored pattern fails when the  cur-
1913           rent  position  is  at  a  CRLF  sequence,  and the pattern contains no
1914           explicit matches for  CR  or  LF  characters,  the  match  position  is
1915           advanced by two characters instead of one, in other words, to after the
1916           CRLF.
1917    
1918           The above rule is a compromise that makes the most common cases work as
1919           expected.  For  example,  if  the  pattern  is .+A (and the PCRE_DOTALL
1920           option is not set), it does not match the string "\r\nA" because, after
1921           failing  at the start, it skips both the CR and the LF before retrying.
1922           However, the pattern [\r\n]A does match that string,  because  it  con-
1923           tains an explicit CR or LF reference, and so advances only by one char-
1924           acter after the first failure.
1925    
1926           An explicit match for CR or LF is either a literal appearance of one of
1927           those  characters,  or  one  of the \r or \n escape sequences. Implicit
1928           matches such as [^X] do not count, nor does \s (which includes  CR  and
1929           LF in the characters that it matches).
1930    
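The advance rule for CRLF described above can be written as a small decision function. This is a sketch of the rule as stated in the text, not PCRE's code, and it assumes single-byte characters. The function name and both flag parameters are hypothetical: crlf_is_newline stands for the CRLF, ANYCRLF, or ANY newline settings, and pattern_has_explicit_cr_or_lf stands for the "explicit match" test.

```c
/* Hypothetical sketch: the offset at which to retry after a match
   attempt for an unanchored pattern has failed at `pos`. */
int next_start_after_failure(const char *subject, int length, int pos,
                             int crlf_is_newline,
                             int pattern_has_explicit_cr_or_lf)
{
    if (crlf_is_newline && !pattern_has_explicit_cr_or_lf &&
        pos + 1 < length &&
        subject[pos] == '\r' && subject[pos + 1] == '\n')
        return pos + 2;   /* step over the whole CRLF */
    return pos + 1;       /* ordinary one-character advance */
}
```

For the subject "\r\nA" and the pattern .+A, the first retry under this rule is at offset 2, where only "A" remains, so the match fails. For [\r\n]A the explicit flag is set, the retry is at offset 1, and the match succeeds.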
1931         There are also three further options that can be set only  at  matching         Notwithstanding  the above, anomalous effects may still occur when CRLF
1932         time:         is a valid newline sequence and explicit \r or \n escapes appear in the
1933           pattern.
1934    
1935           PCRE_NOTBOL           PCRE_NOTBOL
1936    
1937         The  first  character  of the string is not the beginning of a line, so         This option specifies that the first character of the subject string is not
1938         the circumflex metacharacter should not match before it.  Setting  this         the beginning of a line, so the  circumflex  metacharacter  should  not
1939         without  PCRE_MULTILINE  (at  compile  time) causes circumflex never to         match  before it. Setting this without PCRE_MULTILINE (at compile time)
1940         match.         causes circumflex never to match. This option affects only  the  behav-
1941           iour of the circumflex metacharacter. It does not affect \A.
1942    
1943           PCRE_NOTEOL           PCRE_NOTEOL
1944    
1945         The end of the string is not the end of a line, so the dollar metachar-         This option specifies that the end of the subject string is not the end
1946         acter  should  not  match  it  nor (except in multiline mode) a newline         of a line, so the dollar metacharacter should not match it nor  (except
1947         immediately before it. Setting this without PCRE_MULTILINE (at  compile         in  multiline mode) a newline immediately before it. Setting this with-
1948         time) causes dollar never to match.         out PCRE_MULTILINE (at compile time) causes dollar never to match. This
1949           option  affects only the behaviour of the dollar metacharacter. It does
1950           not affect \Z or \z.
1951    
1952           PCRE_NOTEMPTY           PCRE_NOTEMPTY
1953    
1954         An empty string is not considered to be a valid match if this option is         An empty string is not considered to be a valid match if this option is
1955         set. If there are alternatives in the pattern, they are tried.  If  all         set.  If  there are alternatives in the pattern, they are tried. If all
1956         the  alternatives  match  the empty string, the entire match fails. For         the alternatives match the empty string, the entire  match  fails.  For
1957         example, if the pattern         example, if the pattern
1958    
1959           a?b?           a?b?
1960    
1961         is applied to a string not beginning with "a" or "b",  it  matches  the         is  applied  to  a string not beginning with "a" or "b", it matches the
1962         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty string at the start of the subject. With PCRE_NOTEMPTY set,  this
1963         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
1964         rences of "a" or "b".         rences of "a" or "b".
1965    
1966         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-
1967         cial case of a pattern match of the empty  string  within  its  split()         cial  case  of  a  pattern match of the empty string within its split()
1968         function,  and  when  using  the /g modifier. It is possible to emulate         function, and when using the /g modifier. It  is  possible  to  emulate
1969         Perl's behaviour after matching a null string by first trying the match         Perl's behaviour after matching a null string by first trying the match
1970         again at the same offset with PCRE_NOTEMPTY set, and then if that fails         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then
1971         by advancing the starting offset (see below)  and  trying  an  ordinary         if  that  fails by advancing the starting offset (see below) and trying
1972         match again.         an ordinary match again. There is some code that demonstrates how to do
1973           this in the pcredemo.c sample program.
1974         The  subject string is passed to pcre_exec() as a pointer in subject, a  
1975         length in length, and a starting byte offset in startoffset. Unlike the           PCRE_NO_UTF8_CHECK
1976         pattern  string,  the  subject  may contain binary zero bytes. When the  
1977         starting offset is zero, the search for a match starts at the beginning         When PCRE_UTF8 is set at compile time, the validity of the subject as a
1978         of the subject, and this is by far the most common case.         UTF-8 string is automatically checked when pcre_exec() is  subsequently
1979           called.   The  value  of  startoffset is also checked to ensure that it
1980         If the pattern was compiled with the PCRE_UTF8 option, the subject must         points to the start of a UTF-8 character. There is a  discussion  about
1981         be a sequence of bytes that is a valid UTF-8 string, and  the  starting         the  validity  of  UTF-8 strings in the section on UTF-8 support in the
1982         offset  must point to the beginning of a UTF-8 character. If an invalid         main pcre page. If  an  invalid  UTF-8  sequence  of  bytes  is  found,
1983         UTF-8 string or offset is passed, an error  (either  PCRE_ERROR_BADUTF8         pcre_exec()  returns  the error PCRE_ERROR_BADUTF8. If startoffset con-
1984         or   PCRE_ERROR_BADUTF8_OFFSET)   is   returned,   unless   the  option         tains an invalid value, PCRE_ERROR_BADUTF8_OFFSET is returned.
1985         PCRE_NO_UTF8_CHECK is set,  in  which  case  PCRE's  behaviour  is  not  
1986         defined.         If you already know that your subject is valid, and you  want  to  skip
1987           these    checks    for   performance   reasons,   you   can   set   the
1988           PCRE_NO_UTF8_CHECK option when calling pcre_exec(). You might  want  to
1989           do  this  for the second and subsequent calls to pcre_exec() if you are
1990           making repeated calls to find all  the  matches  in  a  single  subject
1991           string.  However,  you  should  be  sure  that the value of startoffset
1992           points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is
1993           set,  the  effect of passing an invalid UTF-8 string as a subject, or a
1994           value of startoffset that does not point to the start of a UTF-8  char-
1995           acter, is undefined. Your program may crash.
1996    
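When PCRE_NO_UTF8_CHECK is set, the caller becomes responsible for the startoffset requirement. A minimal sketch of such a check follows. It tests only that the offset does not land on a continuation byte (bit pattern 10xxxxxx) and does not validate the subject as a whole. The function name is hypothetical and is not part of the library.

```c
/* Hypothetical helper: returns 1 if `offset` lies within the subject
   and does not point at a UTF-8 continuation byte, otherwise 0. */
int utf8_offset_is_char_start(const char *subject, int length, int offset)
{
    if (offset < 0 || offset > length) return 0;
    if (offset == length) return 1;    /* end of subject: nothing to split */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

For the four-byte subject "a\xc3\xa9z" (the letter a, a two-byte character, the letter z), offsets 0, 1, 3, and 4 pass the check, while offset 2 falls in the middle of the two-byte character and is rejected.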
1997             PCRE_PARTIAL
1998    
1999           This  option  turns  on  the  partial  matching feature. If the subject
2000           string fails to match the pattern, but at some point during the  match-
2001           ing  process  the  end of the subject was reached (that is, the subject
2002           partially matches the pattern and the failure to  match  occurred  only
2003           because  there were not enough subject characters), pcre_exec() returns
2004           PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is
2005           used,  there  are restrictions on what may appear in the pattern. These
2006           are discussed in the pcrepartial documentation.
2007    
2008       The string to be matched by pcre_exec()
2009    
2010           The subject string is passed to pcre_exec() as a pointer in subject,  a
2011           length  in  length, and a starting byte offset in startoffset. In UTF-8
2012           mode, the byte offset must point to the start  of  a  UTF-8  character.
2013           Unlike  the  pattern string, the subject may contain binary zero bytes.
2014           When the starting offset is zero, the search for a match starts at  the
2015           beginning of the subject, and this is by far the most common case.
2016    
2017         A  non-zero  starting offset is useful when searching for another match         A  non-zero  starting offset is useful when searching for another match
2018         in the same subject by calling pcre_exec() again after a previous  suc-         in the same subject by calling pcre_exec() again after a previous  suc-
# Line 1111  MATCHING A PATTERN Line 2029  MATCHING A PATTERN
2029         the  remainder  of  the  subject,  namely  "issipi", it does not match,         the  remainder  of  the  subject,  namely  "issipi", it does not match,
2030         because \B is always false at the start of the subject, which is deemed         because \B is always false at the start of the subject, which is deemed
2031         to  be  a  word  boundary. However, if pcre_exec() is passed the entire         to  be  a  word  boundary. However, if pcre_exec() is passed the entire
2032         string again, but with startoffset  set  to  4,  it  finds  the  second         string again, but with startoffset set to 4, it finds the second occur-
2033         occurrence  of  "iss"  because  it  is able to look behind the starting         rence  of "iss" because it is able to look behind the starting point to
2034         point to discover that it is preceded by a letter.         discover that it is preceded by a letter.
2035    
2036         If a non-zero starting offset is passed when the pattern  is  anchored,         If a non-zero starting offset is passed when the pattern  is  anchored,
2037         one  attempt  to match at the given offset is tried. This can only suc-         one attempt to match at the given offset is made. This can only succeed
2038         ceed if the pattern does not require the match to be at  the  start  of         if the pattern does not require the match to be at  the  start  of  the
2039         the subject.         subject.
2040    
2041       How pcre_exec() returns captured substrings
2042    
2043         In  general, a pattern matches a certain portion of the subject, and in         In  general, a pattern matches a certain portion of the subject, and in
2044         addition, further substrings from the subject  may  be  picked  out  by         addition, further substrings from the subject  may  be  picked  out  by
# Line 1130  MATCHING A PATTERN Line 2050  MATCHING A PATTERN
2050    
2051         Captured  substrings are returned to the caller via a vector of integer         Captured  substrings are returned to the caller via a vector of integer
2052         offsets whose address is passed in ovector. The number of  elements  in         offsets whose address is passed in ovector. The number of  elements  in
2053         the vector is passed in ovecsize. The first two-thirds of the vector is         the  vector is passed in ovecsize, which must be a non-negative number.
2054         used to pass back captured substrings, each substring using a  pair  of         Note: this argument is NOT the size of ovector in bytes.
        integers.  The  remaining  third  of the vector is used as workspace by  
        pcre_exec() while matching capturing subpatterns, and is not  available  
        for  passing  back  information.  The  length passed in ovecsize should  
        always be a multiple of three. If it is not, it is rounded down.  
   
        When a match has been successful, information about captured substrings  
        is returned in pairs of integers, starting at the beginning of ovector,  
        and continuing up to two-thirds of its length at the  most.  The  first  
        element of a pair is set to the offset of the first character in a sub-  
        string, and the second is set to the  offset  of  the  first  character  
        after  the  end  of  a  substring. The first pair, ovector[0] and ovec-  
        tor[1], identify the portion of  the  subject  string  matched  by  the  
        entire  pattern.  The next pair is used for the first capturing subpat-  
        tern, and so on. The value returned by pcre_exec()  is  the  number  of  
        pairs  that  have  been set. If there are no capturing subpatterns, the  
        return value from a successful match is 1,  indicating  that  just  the  
        first pair of offsets has been set.  
   
        Some  convenience  functions  are  provided for extracting the captured  
        substrings as separate strings. These are described  in  the  following  
        section.  
2055    
2056         It  is  possible  for  an capturing subpattern number n+1 to match some         The first two-thirds of the vector is used to pass back  captured  sub-
2057         part of the subject when subpattern n has not been  used  at  all.  For         strings,  each  substring using a pair of integers. The remaining third
2058         example, if the string "abc" is matched against the pattern (a|(z))(bc)         of the vector is used as workspace by pcre_exec() while  matching  cap-
2059         subpatterns 1 and 3 are matched, but 2 is not. When this happens,  both         turing  subpatterns, and is not available for passing back information.
2060         offset values corresponding to the unused subpattern are set to -1.         The length passed in ovecsize should always be a multiple of three.  If
2061           it is not, it is rounded down.
2062    
2063           When  a  match  is successful, information about captured substrings is
2064           returned in pairs of integers, starting at the  beginning  of  ovector,
2065           and  continuing  up  to two-thirds of its length at the most. The first
2066           element of a pair is set to the offset of the first character in a sub-
2067           string,  and  the  second  is  set to the offset of the first character
2068           after the end of a substring. The  first  pair,  ovector[0]  and  ovec-
2069           tor[1],  identify  the  portion  of  the  subject string matched by the
2070           entire pattern. The next pair is used for the first  capturing  subpat-
2071           tern, and so on. The value returned by pcre_exec() is one more than the
2072           highest numbered pair that has been set. For example, if two substrings
2073           have  been captured, the returned value is 3. If there are no capturing
2074           subpatterns, the return value from a successful match is 1,  indicating
2075           that just the first pair of offsets has been set.
2076    
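To make the pair layout concrete, the following sketch copies one captured substring out of the subject, given the vector and the positive return value from a successful call. The library's own convenience functions do this job in practice. The helper name copy_capture_sketch is hypothetical, and the code shows only how the offsets are read.

```c
#include <string.h>

/* Hypothetical helper: copies captured substring n (0 = whole match)
   into buf as a zero-terminated string. `rc` is the positive value
   returned by the matching call. Returns the substring length, or -1
   if n is out of range, the substring is unset, or buf is too small. */
int copy_capture_sketch(const char *subject, const int *ovector, int rc,
                        int n, char *buf, int bufsize)
{
    int start, end, len;
    if (n < 0 || n >= rc) return -1;
    start = ovector[2 * n];        /* offset of first character */
    end = ovector[2 * n + 1];      /* offset just past the last character */
    if (start < 0 || end < start) return -1;
    len = end - start;
    if (len + 1 > bufsize) return -1;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

For example, if the subject is "id 12-345" and the vector begins 3, 9, 3, 5, 6, 9 with a return value of 3, substring 0 is "12-345", substring 1 is "12", and substring 2 is "345".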
2077         If a capturing subpattern is matched repeatedly, it is the last portion         If a capturing subpattern is matched repeatedly, it is the last portion
2078         of the string that it matched that gets returned.         of the string that it matched that is returned.
2079    
2080         If the vector is too small to hold all the captured substrings,  it  is         If the vector is too small to hold all the captured substring  offsets,
2081         used as far as possible (up to two-thirds of its length), and the func-         it is used as far as possible (up to two-thirds of its length), and the
2082         tion returns a value of zero. In particular, if the  substring  offsets         function returns a value of zero. In particular, if the substring  off-
2083         are  not  of interest, pcre_exec() may be called with ovector passed as         sets are not of interest, pcre_exec() may be called with ovector passed
2084         NULL and ovecsize as zero. However, if the pattern contains back refer-         as NULL and ovecsize as zero. However, if  the  pattern  contains  back
2085         ences  and  the  ovector  isn't big enough to remember the related sub-         references  and  the  ovector is not big enough to remember the related
2086         strings, PCRE has to get additional memory  for  use  during  matching.         substrings, PCRE has to get additional memory for use during  matching.
2087         Thus it is usually advisable to supply an ovector.         Thus it is usually advisable to supply an ovector.
2088    
2089         Note  that  pcre_info() can be used to find out how many capturing sub-         The  pcre_info()  function  can  be used to find out how many capturing
2090         patterns there are in a compiled pattern. The smallest size for ovector         subpatterns there are in a compiled  pattern.  The  smallest  size  for
2091         that  will  allow for n captured substrings, in addition to the offsets         ovector  that  will allow for n captured substrings, in addition to the
2092         of the substring matched by the whole pattern, is (n+1)*3.         offsets of the substring matched by the whole pattern, is (n+1)*3.
2093    
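The sizing arithmetic above is small enough to state as code. Both helper names below are hypothetical. The second function reflects the rule that ovecsize is rounded down to a multiple of three and that only the first two-thirds of the vector carry results.

```c
/* Smallest ovecsize that can return `capture_count` captured
   substrings in addition to the whole-match pair. */
int ovector_size_for(int capture_count)
{
    return (capture_count + 1) * 3;
}

/* Number of offset pairs that fit in a vector of `ovecsize` elements
   (ovecsize is assumed to be non-negative). */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

A vector of 30 elements therefore has room for the whole match plus nine captured substrings, and passing 32 instead of 30 gains nothing.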
2094           It is possible for capturing subpattern number n+1 to match  some  part
2095           of the subject when subpattern n has not been used at all. For example,
2096           if the string "abc" is matched  against  the  pattern  (a|(z))(bc)  the
2097           return from the function is 4, and subpatterns 1 and 3 are matched, but
2098           2 is not. When this happens, both values in  the  offset  pairs  corre-
2099           sponding to unused subpatterns are set to -1.
2100    
2101           Offset  values  that correspond to unused subpatterns at the end of the
2102           expression are also set to -1. For example,  if  the  string  "abc"  is
2103           matched  against the pattern (abc)(x(yz)?)? subpatterns 2 and 3 are not
2104           matched. The return from the function is 2, because  the  highest  used
2105           capturing subpattern number is 1. However, you can refer to the offsets
2106           for the second and third capturing subpatterns if  you  wish  (assuming
2107           the vector is large enough, of course).
2108    
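Because unused subpatterns are marked with -1 whether they fall before or after the highest used one, a caller can test any pair that fits in the vector directly, rather than relying on the return value alone. A minimal sketch, with a hypothetical function name:

```c
/* Hypothetical helper: returns 1 if capturing subpattern n holds a
   substring, 0 if it is unset. `pairs` is the number of offset pairs
   the caller's vector can hold (ovecsize / 3). */
int capture_is_set(const int *ovector, int pairs, int n)
{
    if (n < 0 || n >= pairs) return 0;
    return ovector[2 * n] >= 0 && ovector[2 * n + 1] >= 0;
}
```

For "abc" matched against (a|(z))(bc), the vector begins 0, 3, 0, 1, -1, -1, 1, 3, so subpatterns 1 and 3 are set and 2 is not. For "abc" matched against (abc)(x(yz)?)?, the vector begins 0, 3, 0, 3, -1, -1, -1, -1, so subpatterns 2 and 3 are both unset.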
2109           Some  convenience  functions  are  provided for extracting the captured
2110           substrings as separate strings. These are described below.
2111    
2112       Error return values from pcre_exec()
2113    
2114         If pcre_exec() fails, it returns a negative number. The  following  are         If pcre_exec() fails, it returns a negative number. The  following  are
2115         defined in the header file:         defined in the header file:
# Line 1196  MATCHING A PATTERN Line 2130  MATCHING A PATTERN
2130           PCRE_ERROR_BADMAGIC       (-4)           PCRE_ERROR_BADMAGIC       (-4)
2131    
2132         PCRE stores a 4-byte "magic number" at the start of the compiled  code,         PCRE stores a 4-byte "magic number" at the start of the compiled  code,
2133         to  catch  the case when it is passed a junk pointer. This is the error         to catch the case when it is passed a junk pointer and to detect when a
2134         it gives when the magic number isn't present.         pattern that was compiled in an environment of one endianness is run in
2135           an  environment  with the other endianness. This is the error that PCRE
2136           gives when the magic number is not present.
2137    
2138           PCRE_ERROR_UNKNOWN_NODE   (-5)           PCRE_ERROR_UNKNOWN_OPCODE (-5)
2139    
2140         While running the pattern match, an unknown item was encountered in the         While running the pattern match, an unknown item was encountered in the
2141         compiled  pattern.  This  error  could be caused by a bug in PCRE or by         compiled  pattern.  This  error  could be caused by a bug in PCRE or by
# Line 1211  MATCHING A PATTERN Line 2147  MATCHING A PATTERN
2147         to pcre_exec() is not big enough to remember the referenced substrings,         to pcre_exec() is not big enough to remember the referenced substrings,
2148         PCRE gets a block of memory at the start of matching to  use  for  this         PCRE gets a block of memory at the start of matching to  use  for  this
2149         purpose.  If the call via pcre_malloc() fails, this error is given. The         purpose.  If the call via pcre_malloc() fails, this error is given. The
2150         memory is freed at the end of matching.         memory is automatically freed at the end of matching.
2151    
2152           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2153    
# Line 1221  MATCHING A PATTERN Line 2157  MATCHING A PATTERN
2157    
2158           PCRE_ERROR_MATCHLIMIT     (-8)           PCRE_ERROR_MATCHLIMIT     (-8)
2159    
2160         The recursion and backtracking limit, as specified by  the  match_limit         The backtracking limit, as specified by  the  match_limit  field  in  a
2161         field  in  a  pcre_extra  structure (or defaulted) was reached. See the         pcre_extra  structure  (or  defaulted) was reached. See the description
2162         description above.         above.
2163    
2164           PCRE_ERROR_CALLOUT        (-9)           PCRE_ERROR_CALLOUT        (-9)
2165    
# Line 1242  MATCHING A PATTERN Line 2178  MATCHING A PATTERN
2178         value of startoffset did not point to the beginning of a UTF-8  charac-         value of startoffset did not point to the beginning of a UTF-8  charac-
2179         ter.         ter.
2180    
2181             PCRE_ERROR_PARTIAL        (-12)
2182    
2183           The  subject  string did not match, but it did match partially. See the
2184           pcrepartial documentation for details of partial matching.
2185    
2186             PCRE_ERROR_BADPARTIAL     (-13)
2187    
2188           The PCRE_PARTIAL option was used with  a  compiled  pattern  containing
2189           items  that are not supported for partial matching. See the pcrepartial
2190           documentation for details of partial matching.
2191    
2192             PCRE_ERROR_INTERNAL       (-14)
2193    
2194           An unexpected internal error has occurred. This error could  be  caused
2195           by a bug in PCRE or by overwriting of the compiled pattern.
2196    
2197             PCRE_ERROR_BADCOUNT       (-15)
2198    
2199           This  error is given if the value of the ovecsize argument is negative.
2200    
2201             PCRE_ERROR_RECURSIONLIMIT (-21)
2202    
2203           The internal recursion limit, as specified by the match_limit_recursion
2204           field  in  a  pcre_extra  structure (or defaulted) was reached. See the
2205           description above.
2206    
2207             PCRE_ERROR_BADNEWLINE     (-23)
2208    
2209           An invalid combination of PCRE_NEWLINE_xxx options was given.
2210    
2211           Error numbers -16 to -20 and -22 are not used by pcre_exec().
2212    
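A caller will often want to turn a negative return value into something readable. The sketch below maps the values listed in this section to their symbolic names. The function name exec_error_name is hypothetical, and literal numbers are used only so that the sketch compiles without pcre.h; a real program would compare against the macros from the header file instead.

```c
#include <stddef.h>

/* Maps a negative pcre_exec() return value to the symbolic name
   given in this section. Returns NULL for values not covered here. */
const char *exec_error_name(int rc)
{
    switch (rc) {
    case -4:  return "PCRE_ERROR_BADMAGIC";
    case -5:  return "PCRE_ERROR_UNKNOWN_OPCODE";
    case -6:  return "PCRE_ERROR_NOMEMORY";
    case -7:  return "PCRE_ERROR_NOSUBSTRING";
    case -8:  return "PCRE_ERROR_MATCHLIMIT";
    case -9:  return "PCRE_ERROR_CALLOUT";
    case -12: return "PCRE_ERROR_PARTIAL";
    case -13: return "PCRE_ERROR_BADPARTIAL";
    case -14: return "PCRE_ERROR_INTERNAL";
    case -15: return "PCRE_ERROR_BADCOUNT";
    case -21: return "PCRE_ERROR_RECURSIONLIMIT";
    case -23: return "PCRE_ERROR_BADNEWLINE";
    default:  return NULL;   /* includes -16 to -20 and -22 */
    }
}
```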
2213    
2214  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2215    
# Line 1256  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2224  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2224         int pcre_get_substring_list(const char *subject,         int pcre_get_substring_list(const char *subject,
2225              int *ovector, int stringcount, const char ***listptr);              int *ovector, int stringcount, const char ***listptr);
2226    
2227         Captured  substrings  can  be  accessed  directly  by using the offsets         Captured substrings can be  accessed  directly  by  using  the  offsets
2228         returned by pcre_exec() in  ovector.  For  convenience,  the  functions         returned  by  pcre_exec()  in  ovector.  For convenience, the functions
2229         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-         pcre_copy_substring(),    pcre_get_substring(),    and    pcre_get_sub-
2230         string_list() are provided for extracting captured substrings  as  new,         string_list()  are  provided for extracting captured substrings as new,
2231         separate,  zero-terminated strings. These functions identify substrings         separate, zero-terminated strings. These functions identify  substrings
2232         by number. The next section describes functions  for  extracting  named         by  number.  The  next section describes functions for extracting named
2233         substrings.  A  substring  that  contains  a  binary  zero is correctly         substrings.
2234         extracted and has a further zero added on the end, but  the  result  is  
2235         not, of course, a C string.         A substring that contains a binary zero is correctly extracted and  has
2236           a  further zero added on the end, but the result is not, of course, a C
2237           string.  However, you can process such a string  by  referring  to  the
2238           length  that  is  returned  by  pcre_copy_substring() and pcre_get_sub-
2239           string().  Unfortunately, the interface to pcre_get_substring_list() is
2240           not  adequate for handling strings containing binary zeros, because the
2241           end of the final string is not independently indicated.
2242    
2243         The  first  three  arguments  are the same for all three of these func-         The first three arguments are the same for all  three  of  these  func-
2244         tions: subject is the subject string which has just  been  successfully         tions:  subject  is  the subject string that has just been successfully
2245         matched, ovector is a pointer to the vector of integer offsets that was         matched, ovector is a pointer to the vector of integer offsets that was
2246         passed to pcre_exec(), and stringcount is the number of substrings that         passed to pcre_exec(), and stringcount is the number of substrings that
2247         were  captured  by  the match, including the substring that matched the         were captured by the match, including the substring  that  matched  the
2248         entire regular expression. This is the value returned by  pcre_exec  if         entire regular expression. This is the value returned by pcre_exec() if
2249         it  is greater than zero. If pcre_exec() returned zero, indicating that         it is greater than zero. If pcre_exec() returned zero, indicating  that
2250         it ran out of space in ovector, the value passed as stringcount  should         it  ran out of space in ovector, the value passed as stringcount should
2251         be the size of the vector divided by three.         be the number of elements in the vector divided by three.
2252    
2253         The  functions pcre_copy_substring() and pcre_get_substring() extract a         The functions pcre_copy_substring() and pcre_get_substring() extract  a
2254         single substring, whose number is given as  stringnumber.  A  value  of         single  substring,  whose  number  is given as stringnumber. A value of
2255         zero  extracts  the  substring  that  matched the entire pattern, while         zero extracts the substring that matched the  entire  pattern,  whereas
2256         higher values  extract  the  captured  substrings.  For  pcre_copy_sub-         higher  values  extract  the  captured  substrings.  For pcre_copy_sub-
2257         string(),  the  string  is  placed  in buffer, whose length is given by         string(), the string is placed in buffer,  whose  length  is  given  by
2258         buffersize, while for pcre_get_substring() a new  block  of  memory  is         buffersize,  while  for  pcre_get_substring()  a new block of memory is
2259         obtained  via  pcre_malloc,  and its address is returned via stringptr.         obtained via pcre_malloc, and its address is  returned  via  stringptr.
2260         The yield of the function is the length of the  string,  not  including         The  yield  of  the function is the length of the string, not including
2261         the terminating zero, or one of         the terminating zero, or one of these error codes:
2262    
2263           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2264    
2265         The  buffer  was too small for pcre_copy_substring(), or the attempt to         The buffer was too small for pcre_copy_substring(), or the  attempt  to
2266         get memory failed for pcre_get_substring().         get memory failed for pcre_get_substring().
2267    
2268           PCRE_ERROR_NOSUBSTRING    (-7)           PCRE_ERROR_NOSUBSTRING    (-7)
2269    
2270         There is no substring whose number is stringnumber.         There is no substring whose number is stringnumber.
2271    
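The behaviour described for pcre_copy_substring() can be sketched directly in terms of the offset vector. This is an illustration of the documented contract, not the library's source: the names copy_group, effective_stringcount, and the SKETCH_ macros are hypothetical, and treating an unset substring as an empty string is an assumption of the sketch.

```c
#include <string.h>

/* Error values as listed above; named locally so that the sketch
   does not need pcre.h. */
#define SKETCH_ERROR_NOMEMORY    (-6)
#define SKETCH_ERROR_NOSUBSTRING (-7)

/* If pcre_exec() returned 0 (ovector too small), use a third of the
   vector's element count as stringcount, as the text recommends. */
int effective_stringcount(int rc, int ovecsize)
{
    return (rc == 0) ? ovecsize / 3 : rc;
}

/* Copies substring number n into buffer and zero-terminates it.
   Returns the length, excluding the terminator, or an error value. */
int copy_group(const char *subject, const int *ovector, int stringcount,
               int n, char *buffer, int buffersize)
{
    int start, len;
    if (n < 0 || n >= stringcount)
        return SKETCH_ERROR_NOSUBSTRING;
    start = ovector[2 * n];
    /* An unset pair is (-1,-1); treat it as an empty string. */
    len = (start < 0) ? 0 : ovector[2 * n + 1] - start;
    if (buffersize < len + 1)
        return SKETCH_ERROR_NOMEMORY;
    if (len > 0)
        memcpy(buffer, subject + start, (size_t)len);
    buffer[len] = '\0';
    return len;
}
```

Because the length is returned separately from the data, a caller can still process a substring that contains a binary zero by using the return value rather than strlen().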
2272         The pcre_get_substring_list()  function  extracts  all  available  sub-         The  pcre_get_substring_list()  function  extracts  all  available sub-
2273         strings  and  builds  a list of pointers to them. All this is done in a         strings and builds a list of pointers to them. All this is  done  in  a
2274         single block of memory which is obtained via pcre_malloc.  The  address         single block of memory that is obtained via pcre_malloc. The address of
2275         of the memory block is returned via listptr, which is also the start of         the memory block is returned via listptr, which is also  the  start  of
2276         the list of string pointers. The end of the list is marked  by  a  NULL         the  list  of  string pointers. The end of the list is marked by a NULL
2277         pointer. The yield of the function is zero if all went well, or         pointer. The yield of the function is zero if all  went  well,  or  the
2278           error code
2279    
2280           PCRE_ERROR_NOMEMORY       (-6)           PCRE_ERROR_NOMEMORY       (-6)
2281    
# Line 1313  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER Line 2288  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2288         string  by inspecting the appropriate offset in ovector, which is nega-         string  by inspecting the appropriate offset in ovector, which is nega-
2289         tive for unset substrings.         tive for unset substrings.
2290    
2291         The    two    convenience    functions    pcre_free_substring()     and         The two convenience functions pcre_free_substring() and  pcre_free_sub-
2292         pcre_free_substring_list() can be used to free the memory returned by a         string_list()  can  be  used  to free the memory returned by a previous
2293         previous call  of  pcre_get_substring()  or  pcre_get_substring_list(),         call  of  pcre_get_substring()  or  pcre_get_substring_list(),  respec-
2294         respectively. They do nothing more than call the function pointed to by         tively.  They  do  nothing  more  than  call the function pointed to by
2295         pcre_free, which of course could be called directly from a  C  program.         pcre_free, which of course could be called directly from a  C  program.
2296         However,  PCRE is used in some situations where it is linked via a spe-         However,  PCRE is used in some situations where it is linked via a spe-
2297         cial  interface  to  another  programming  language  which  cannot  use         cial  interface  to  another  programming  language  that  cannot   use
2298         pcre_free  directly;  it is for these cases that the functions are pro-         pcre_free  directly;  it is for these cases that the functions are pro-
2299         vided.         vided.
2300    
2301    
2302  EXTRACTING CAPTURED SUBSTRINGS BY NAME  EXTRACTING CAPTURED SUBSTRINGS BY NAME
2303    
2304           int pcre_get_stringnumber(const pcre *code,
2305                const char *name);
2306    
2307         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
2308              const char *subject, int *ovector,              const char *subject, int *ovector,
2309              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2310              char *buffer, int buffersize);              char *buffer, int buffersize);
2311    
        int pcre_get_stringnumber(const pcre *code,  
             const char *name);  
   
2312         int pcre_get_named_substring(const pcre *code,         int pcre_get_named_substring(const pcre *code,
2313              const char *subject, int *ovector,              const char *subject, int *ovector,
2314              int stringcount, const char *stringname,              int stringcount, const char *stringname,
2315              const char **stringptr);              const char **stringptr);
2316    
2317         To extract a substring by name, you first have to find the associated         To extract a substring by name, you first have to find the associated
2318         number. This can be done by calling pcre_get_stringnumber(). The first         number. For example, for this pattern
2319         argument is the compiled pattern, and the second is the name. For exam-  
2320         ple, for this pattern           (a+)b(?<xxx>\d+)...
2321    
2322           ab(?<xxx>\d+)...         the number of the subpattern called "xxx" is 2. If the name is known to
2323           be unique (PCRE_DUPNAMES was not set), you can find the number from the
2324         the  number  of the subpattern called "xxx" is 1. Given the number, you         name by calling pcre_get_stringnumber(). The first argument is the com-
2325         can then extract the substring directly, or use one  of  the  functions         piled pattern, and the second is the name. The yield of the function is
2326         described  in the previous section. For convenience, there are also two         the  subpattern  number,  or PCRE_ERROR_NOSUBSTRING (-7) if there is no
2327         functions that do the whole job.         subpattern of that name.
2328    
2329           Given the number, you can extract the substring directly, or use one of
2330           the functions described in the previous section. For convenience, there
2331           are also two functions that do the whole job.
2332    
2333         Most   of   the   arguments    of    pcre_copy_named_substring()    and         Most   of   the   arguments    of    pcre_copy_named_substring()    and
2334         pcre_get_named_substring() are the same as those for the functions that         pcre_get_named_substring()  are  the  same  as  those for the similarly
2335         extract by number, and so are not re-described here. There are just two         named functions that extract by number. As these are described  in  the
2336         differences.         previous  section,  they  are not re-described here. There are just two
2337           differences:
2338    
2339         First,  instead  of a substring number, a substring name is given. Sec-         First, instead of a substring number, a substring name is  given.  Sec-
2340         ond, there is an extra argument, given at the start, which is a pointer         ond, there is an extra argument, given at the start, which is a pointer
2341         to  the compiled pattern. This is needed in order to gain access to the         to the compiled pattern. This is needed in order to gain access to  the
2342         name-to-number translation table.         name-to-number translation table.
2343    
2344         These functions call pcre_get_stringnumber(), and if it succeeds,  they         These  functions call pcre_get_stringnumber(), and if it succeeds, they
2345         then  call  pcre_copy_substring() or pcre_get_substring(), as appropri-         then call pcre_copy_substring() or pcre_get_substring(),  as  appropri-
2346         ate.         ate.  NOTE:  If PCRE_DUPNAMES is set and there are duplicate names, the
2347           behaviour may not be what you want (see the next section).
2348    
2349    
2350    DUPLICATE SUBPATTERN NAMES
2351    
2352           int pcre_get_stringtable_entries(const pcre *code,
2353                const char *name, char **first, char **last);
2354    
2355           When a pattern is compiled with the  PCRE_DUPNAMES  option,  names  for
2356           subpatterns  are  not  required  to  be unique. Normally, patterns with
2357           duplicate names are such that in any one match, only one of  the  named
2358           subpatterns  participates. An example is shown in the pcrepattern docu-
2359           mentation.
2360    
2361           When   duplicates   are   present,   pcre_copy_named_substring()    and
2362           pcre_get_named_substring()  return the first substring corresponding to
2363           the given name that is set. If  none  are  set,  PCRE_ERROR_NOSUBSTRING
2364           (-7)  is  returned;  no  data  is returned. The pcre_get_stringnumber()
2365           function returns one of the numbers that are associated with the  name,
2366           but it is not defined which it is.
2367    
2368           If  you want to get full details of all captured substrings for a given
2369           name, you must use  the  pcre_get_stringtable_entries()  function.  The
2370           first argument is the compiled pattern, and the second is the name. The
2371           third and fourth are pointers to variables which  are  updated  by  the
2372           function. After it has run, they point to the first and last entries in
2373           the name-to-number table  for  the  given  name.  The  function  itself
2374           returns  the  length  of  each entry, or PCRE_ERROR_NOSUBSTRING (-7) if
2375           there are none. The format of the table is described above in the  sec-
2376           tion  entitled  Information  about  a  pattern.  Given all the relevant
2377           entries for the name, you can extract each of their numbers, and  hence
2378           the captured data, if any.
2379    
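Once pcre_get_stringtable_entries() has supplied the first and last entries and the entry length, the remaining work is a walk over fixed-size records. The sketch below assumes the table layout given in the section on information about a pattern, which is not part of this excerpt: each entry starts with the subpattern number in two bytes, most significant byte first, followed by the zero-terminated name. The function names are hypothetical, and ovector is assumed to be large enough to hold a pair for every listed number.

```c
/* Number stored in the first two bytes of a name-table entry,
   most significant byte first. */
int entry_group_number(const char *entry)
{
    return (((unsigned char)entry[0]) << 8) | (unsigned char)entry[1];
}

/* Walks the entries from first to last (inclusive), entry_size bytes
   apart, and returns the number of the first subpattern that is set
   in ovector, or -1 if none of them took part in the match. */
int first_set_group(const char *first, const char *last,
                    int entry_size, const int *ovector)
{
    const char *entry;
    for (entry = first; entry <= last; entry += entry_size) {
        int n = entry_group_number(entry);
        if (ovector[2 * n] >= 0)
            return n;
    }
    return -1;
}
```

Returning the first set subpattern mirrors what the text says pcre_get_named_substring() does for duplicate names; a caller that needs every captured value would collect all set numbers instead of stopping at the first.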
2380    
2381    FINDING ALL POSSIBLE MATCHES
2382    
2383           The  traditional  matching  function  uses a similar algorithm to Perl,
2384           which stops when it finds the first match, starting at a given point in
2385           the  subject.  If you want to find all possible matches, or the longest
2386           possible match, consider using the alternative matching  function  (see
2387           below)  instead.  If you cannot use the alternative function, but still
2388           need to find all possible matches, you can kludge it up by  making  use
2389           of the callout facility, which is described in the pcrecallout documen-
2390           tation.
2391    
2392  Last updated: 09 December 2003         What you have to do is to insert a callout right at the end of the pat-
2393  Copyright (c) 1997-2003 University of Cambridge.         tern.   When your callout function is called, extract and save the cur-
2394  -----------------------------------------------------------------------------         rent matched substring. Then return  1,  which  forces  pcre_exec()  to
2395           backtrack  and  try other alternatives. Ultimately, when it runs out of
2396           matches, pcre_exec() will yield PCRE_ERROR_NOMATCH.
2397    
 PCRE(3)                                                                PCRE(3)  
2398    
2399    MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
2400    
2401           int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
2402                const char *subject, int length, int startoffset,
2403                int options, int *ovector, int ovecsize,
2404                int *workspace, int wscount);
2405    
2406  NAME         The function pcre_dfa_exec()  is  called  to  match  a  subject  string
2407         PCRE - Perl-compatible regular expressions         against  a  compiled pattern, using a matching algorithm that scans the
2408           subject string just once, and does not backtrack.  This  has  different
2409           characteristics  to  the  normal  algorithm, and is not compatible with
2410           Perl. Some of the features of PCRE patterns are not  supported.  Never-
2411           theless,  there are times when this kind of matching can be useful. For
2412           a discussion of the two matching algorithms, see the pcrematching docu-
2413           mentation.
2414    
2415           The  arguments  for  the  pcre_dfa_exec()  function are the same as for
2416           pcre_exec(), plus two extras. The ovector argument is used in a differ-
2417           ent  way,  and  this is described below. The other common arguments are
2418           used in the same way as for pcre_exec(), so their  description  is  not
2419           repeated here.
2420    
2421           The  two  additional  arguments provide workspace for the function. The
2422           workspace vector should contain at least 20 elements. It  is  used  for
2423           keeping  track  of  multiple  paths  through  the  pattern  tree.  More
2424           workspace will be needed for patterns and subjects where  there  are  a
2425           lot of potential matches.
2426    
2427  PCRE CALLOUTS         Here is an example of a simple call to pcre_dfa_exec():
2428    
2429         int (*pcre_callout)(pcre_callout_block *);           int rc;
2430             int ovector[10];
2431             int wspace[20];
2432             rc = pcre_dfa_exec(
2433               re,             /* result of pcre_compile() */
2434               NULL,           /* we didn't study the pattern */
2435               "some string",  /* the subject string */
2436               11,             /* the length of the subject string */
2437               0,              /* start at offset 0 in the subject */
2438               0,              /* default options */
2439               ovector,        /* vector of integers for substring information */
2440               10,             /* number of elements (NOT size in bytes) */
2441               wspace,         /* working space vector */
2442               20);            /* number of elements (NOT size in bytes) */
2443    
2444         PCRE provides a feature called "callout", which is a means of temporar-     Option bits for pcre_dfa_exec()
        ily passing control to the caller of PCRE  in  the  middle  of  pattern  
        matching.  The  caller of PCRE provides an external function by putting  
        its entry point in the global variable pcre_callout. By  default,  this  
        variable contains NULL, which disables all calling out.  
2445    
2446         Within  a  regular  expression,  (?C) indicates the points at which the         The  unused  bits  of  the options argument for pcre_dfa_exec() must be
2447         external function is to be called.  Different  callout  points  can  be         zero. The only bits  that  may  be  set  are  PCRE_ANCHORED,  PCRE_NEW-
2448         identified  by  putting  a number less than 256 after the letter C. The         LINE_xxx,  PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, PCRE_NO_UTF8_CHECK,
2449         default value is zero.  For  example,  this  pattern  has  two  callout         PCRE_PARTIAL, PCRE_DFA_SHORTEST, and PCRE_DFA_RESTART. All but the last
2450         points:         three of these are the same as for pcre_exec(), so their description is
2451           not repeated here.
2452    
2453           (?C1)abc(?C2)def           PCRE_PARTIAL
2454    
2455         During matching, when PCRE reaches a callout point (and pcre_callout is         This has the same general effect as it does for  pcre_exec(),  but  the
2456         set), the external function is called. Its only argument is  a  pointer         details   are   slightly   different.  When  PCRE_PARTIAL  is  set  for
2457         to a pcre_callout block. This contains the following variables:         pcre_dfa_exec(), the return code PCRE_ERROR_NOMATCH is  converted  into
2458           PCRE_ERROR_PARTIAL  if  the  end  of the subject is reached, there have
2459           been no complete matches, but there is still at least one matching pos-
2460           sibility.  The portion of the string that provided the partial match is
2461           set as the first matching string.
2462    
2463           int          version;           PCRE_DFA_SHORTEST
          int          callout_number;  
          int         *offset_vector;  
          const char  *subject;  
          int          subject_length;  
          int          start_match;  
          int          current_position;  
          int          capture_top;  
          int          capture_last;  
          void        *callout_data;  
2464    
2465         The  version  field  is an integer containing the version number of the         Setting the PCRE_DFA_SHORTEST option causes the matching  algorithm  to
2466         block format. The current version  is  zero.  The  version  number  may         stop as soon as it has found one match. Because of the way the alterna-
2467         change  in  future if additional fields are added, but the intention is         tive algorithm works, this is necessarily the shortest  possible  match
2468         never to remove any of the existing fields.         at the first possible matching point in the subject string.
2469    
2470         The callout_number field contains the number of the  callout,  as  com-           PCRE_DFA_RESTART
        piled into the pattern (that is, the number after ?C).  
2471    
2472         The  offset_vector field is a pointer to the vector of offsets that was         When  pcre_dfa_exec()  is  called  with  the  PCRE_PARTIAL  option, and
2473         passed by the caller to pcre_exec(). The contents can be  inspected  in         returns a partial match, it is possible to call it  again,  with  addi-
2474         order  to extract substrings that have been matched so far, in the same         tional  subject  characters,  and have it continue with the same match.
2475         way as for extracting substrings after a match has completed.         The PCRE_DFA_RESTART option requests this action; when it is  set,  the
2476           workspace and wscount arguments must reference the same vector as before
2477           because data about the match so far is left in  them  after  a  partial
2478           match.  There  is  more  discussion of this facility in the pcrepartial
2479           documentation.
2480    
2481         The subject and subject_length fields contain copies  the  values  that     Successful returns from pcre_dfa_exec()
        were passed to pcre_exec().  
2482    
2483         The  start_match  field contains the offset within the subject at which         When pcre_dfa_exec() succeeds, it may have matched more than  one  sub-
2484         the current match attempt started. If the pattern is not anchored,  the         string in the subject. Note, however, that all the matches from one run
2485         callout  function  may  be  called several times for different starting         of the function start at the same point in  the  subject.  The  shorter
2486         points.         matches  are all initial substrings of the longer matches. For example,
2487           if the pattern
2488    
2489         The current_position field contains the offset within  the  subject  of           <.*>
        the current match pointer.  
2490    
2491         The  capture_top field contains one more than the number of the highest         is matched against the string
        numbered  captured  substring  so  far.  If  no  substrings  have  been  
        captured, the value of capture_top is one.  
2492    
2493         The  capture_last  field  contains the number of the most recently cap-           This is <something> <something else> <something further> no more
        tured substring.  
2494    
2495         The callout_data field contains a value that is passed  to  pcre_exec()         the three matched strings are
2496         by  the  caller specifically so that it can be passed back in callouts.  
2497         It is passed in the pcre_callout field of the  pcre_extra  data  struc-           <something>
2498         ture.  If  no  such  data  was  passed,  the value of callout_data in a           <something> <something else>
2499         pcre_callout block is NULL. There is a description  of  the  pcre_extra           <something> <something else> <something further>
2500    
2501           On success, the yield of the function is a number  greater  than  zero,
2502           which  is  the  number of matched substrings. The substrings themselves
2503           are returned in ovector. Each string uses two elements;  the  first  is
2504           the  offset  to  the start, and the second is the offset to the end. In
2505           fact, all the strings have the same start  offset.  (Space  could  have
2506           been  saved by giving this only once, but it was decided to retain some
2507           compatibility with the way pcre_exec() returns data,  even  though  the
2508           meaning of the strings is different.)
2509    
2510           The strings are returned in reverse order of length; that is, the long-
2511           est matching string is given first. If there were too many  matches  to
2512           fit  into ovector, the yield of the function is zero, and the vector is
2513           filled with the longest matches.
2514    
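The ordering of the results can be made concrete with a small sketch that reads the pairs back out of ovector. It does not call pcre_dfa_exec(); the function names are hypothetical, and the sketch assumes a positive return value rc, so the case of a zero return (too many matches) is not handled.

```c
#include <stdio.h>

/* Length of the i-th string returned by pcre_dfa_exec(); i runs from
   0 (the longest match) to rc - 1 (the shortest). */
int dfa_match_length(const int *ovector, int i)
{
    return ovector[2 * i + 1] - ovector[2 * i];
}

/* Prints all rc matches, longest first, and returns the length of
   the shortest one. Assumes rc > 0. */
int print_dfa_matches(const char *subject, const int *ovector, int rc)
{
    int i;
    for (i = 0; i < rc; i++)
        printf("%2d: %.*s\n", i, dfa_match_length(ovector, i),
               subject + ovector[2 * i]);
    return dfa_match_length(ovector, rc - 1);
}
```

For the <.*> example above, the three pairs would be (8,56), (8,36), and (8,19). These offsets were worked out by hand for this subject string, not produced by the library.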
2515       Error returns from pcre_dfa_exec()
2516    
2517           The pcre_dfa_exec() function returns a negative number when  it  fails.
2518           Many  of  the  errors  are  the  same as for pcre_exec(), and these are
2519           described above.  There are in addition the following errors  that  are
2520           specific to pcre_dfa_exec():
2521    
2522             PCRE_ERROR_DFA_UITEM      (-16)
2523    
2524           This  return is given if pcre_dfa_exec() encounters an item in the pat-
2525           tern that it does not support, for instance, the use of \C  or  a  back
2526           reference.
2527    
2528             PCRE_ERROR_DFA_UCOND      (-17)
2529    
2530           This  return  is  given  if pcre_dfa_exec() encounters a condition item
2531           that uses a back reference for the condition, or a test  for  recursion
2532           in a specific group. These are not supported.
2533    
2534             PCRE_ERROR_DFA_UMLIMIT    (-18)
2535    
2536           This  return  is given if pcre_dfa_exec() is called with an extra block
2537           that contains a setting of the match_limit field. This is not supported
2538           (it is meaningless).
2539    
2540             PCRE_ERROR_DFA_WSSIZE     (-19)
2541    
2542           This  return  is  given  if  pcre_dfa_exec()  runs  out of space in the
2543           workspace vector.
2544    
2545             PCRE_ERROR_DFA_RECURSE    (-20)
2546    
2547           When a recursive subpattern is processed, the matching  function  calls
2548           itself  recursively,  using  private vectors for ovector and workspace.
2549           This error is given if the output vector  is  not  large  enough.  This
2550           should be extremely rare, as a vector of size 1000 is used.
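
       The error numbers listed above can be mapped to messages by  a  small
       helper.  What  follows is only a sketch: the constants are redefined
       locally, with the values given in this section, so that the fragment
       stands  alone  (real  code  would  include  pcre.h  instead),  and
       dfa_error_text() is a hypothetical name, not a PCRE function.

```c
#include <stddef.h>

/* Values copied from the list above; real code takes them from pcre.h. */
#ifndef PCRE_ERROR_DFA_UITEM
#define PCRE_ERROR_DFA_UITEM    (-16)
#define PCRE_ERROR_DFA_UCOND    (-17)
#define PCRE_ERROR_DFA_UMLIMIT  (-18)
#define PCRE_ERROR_DFA_WSSIZE   (-19)
#define PCRE_ERROR_DFA_RECURSE  (-20)
#endif

/* Hypothetical helper: returns a short description of an error that is
   specific to pcre_dfa_exec(), or NULL for any other return value. */
static const char *dfa_error_text(int rc)
{
  switch (rc)
    {
    case PCRE_ERROR_DFA_UITEM:   return "unsupported item in pattern";
    case PCRE_ERROR_DFA_UCOND:   return "unsupported condition item";
    case PCRE_ERROR_DFA_UMLIMIT: return "match_limit set in extra block";
    case PCRE_ERROR_DFA_WSSIZE:  return "workspace vector too small";
    case PCRE_ERROR_DFA_RECURSE: return "recursion output vector too small";
    default:                     return NULL;
    }
}
```

       A caller would typically test the yield of pcre_dfa_exec()  for  a
       negative value first, and fall back to its handling of the errors
       shared with pcre_exec() when this helper returns NULL.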
2551    
2552    
2553    SEE ALSO
2554    
2555           pcrebuild(3),  pcrecallout(3), pcrecpp(3), pcrematching(3), pcrepar-
2556           tial(3), pcreposix(3), pcreprecompile(3), pcresample(3),  pcrestack(3).
2557    
2558    
2559    AUTHOR
2560    
2561           Philip Hazel
2562           University Computing Service
2563           Cambridge CB2 3QH, England.
2564    
2565    
2566    REVISION
2567    
2568           Last updated: 23 January 2008
2569           Copyright (c) 1997-2008 University of Cambridge.
2570    ------------------------------------------------------------------------------
2571    
2572    
2573    PCRECALLOUT(3)                                                  PCRECALLOUT(3)
2574    
2575    
2576    NAME
2577           PCRE - Perl-compatible regular expressions
2578    
2579    
2580    PCRE CALLOUTS
2581    
2582           int (*pcre_callout)(pcre_callout_block *);
2583    
2584           PCRE provides a feature called "callout", which is a means of temporar-
2585           ily passing control to the caller of PCRE  in  the  middle  of  pattern
2586           matching.  The  caller of PCRE provides an external function by putting
2587           its entry point in the global variable pcre_callout. By  default,  this
2588           variable contains NULL, which disables all calling out.
2589    
2590           Within  a  regular  expression,  (?C) indicates the points at which the
2591           external function is to be called.  Different  callout  points  can  be
2592           identified  by  putting  a number less than 256 after the letter C. The
2593           default value is zero.  For  example,  this  pattern  has  two  callout
2594           points:
2595    
2596             (?C1)abc(?C2)def
2597    
2598           If  the  PCRE_AUTO_CALLOUT  option  bit  is  set when pcre_compile() is
2599           called, PCRE automatically  inserts  callouts,  all  with  number  255,
2600           before  each  item in the pattern. For example, if PCRE_AUTO_CALLOUT is
2601           used with the pattern
2602    
2603             A(\d{2}|--)
2604    
2605           it is processed as if it were
2606    
2607           (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
2608    
2609           Notice that there is a callout before and after  each  parenthesis  and
2610           alternation  bar.  Automatic  callouts  can  be  used  for tracking the
2611           progress of pattern matching. The pcretest command has an  option  that
2612           sets  automatic callouts; when it is used, the output indicates how the
2613           pattern is matched. This is useful information when you are  trying  to
2614           optimize the performance of a particular pattern.
2615    
2616    
2617    MISSING CALLOUTS
2618    
2619           You  should  be  aware  that,  because of optimizations in the way PCRE
2620           matches patterns, callouts sometimes do not happen. For example, if the
2621           pattern is
2622    
2623             ab(?C4)cd
2624    
2625           PCRE knows that any matching string must contain the letter "d". If the
2626           subject string is "abyz", the lack of "d" means that  matching  doesn't
2627           ever  start,  and  the  callout is never reached. However, with "abyd",
2628           though the result is still no match, the callout is obeyed.
2629    
2630    
2631    THE CALLOUT INTERFACE
2632    
2633           During matching, when PCRE reaches a callout point, the external  func-
2634           tion  defined by pcre_callout is called (if it is set). This applies to
2635           both the pcre_exec() and the pcre_dfa_exec()  matching  functions.  The
2636           only  argument  to  the callout function is a pointer to a pcre_callout
2637           block. This structure contains the following fields:
2638    
2639             int          version;
2640             int          callout_number;
2641             int         *offset_vector;
2642             const char  *subject;
2643             int          subject_length;
2644             int          start_match;
2645             int          current_position;
2646             int          capture_top;
2647             int          capture_last;
2648             void        *callout_data;
2649             int          pattern_position;
2650             int          next_item_length;
2651    
2652           The version field is an integer containing the version  number  of  the
2653           block  format. The initial version was 0; the current version is 1. The
2654           version number will change again in future  if  additional  fields  are
2655           added, but the intention is never to remove any of the existing fields.
2656    
2657           The callout_number field contains the number of the  callout,  as  com-
2658           piled  into  the pattern (that is, the number after ?C for manual call-
2659           outs, and 255 for automatically generated callouts).
2660    
2661           The offset_vector field is a pointer to the vector of offsets that  was
2662           passed   by   the   caller  to  pcre_exec()  or  pcre_dfa_exec().  When
2663           pcre_exec() is used, the contents can be inspected in order to  extract
2664           substrings  that  have  been  matched  so  far,  in the same way as for
2665           extracting substrings after a match has completed. For  pcre_dfa_exec()
2666           this field is not useful.
2667    
2668           The subject and subject_length fields contain copies of the values that
2669           were passed to pcre_exec().
2670    
2671           The start_match field normally contains the offset within  the  subject
2672           at  which  the  current  match  attempt started. However, if the escape
2673           sequence \K has been encountered, this value is changed to reflect  the
2674           modified  starting  point.  If the pattern is not anchored, the callout
2675           function may be called several times from the same point in the pattern
2676           for different starting points in the subject.
2677    
2678           The  current_position  field  contains the offset within the subject of
2679           the current match pointer.
2680    
2681           When the pcre_exec() function is used, the capture_top  field  contains
2682           one  more than the number of the highest numbered captured substring so
2683           far. If no substrings have been captured, the value of  capture_top  is
2684           one.  This  is always the case when pcre_dfa_exec() is used, because it
2685           does not support captured substrings.
2686    
2687           The capture_last field contains the number of the  most  recently  cap-
2688           tured  substring. If no substrings have been captured, its value is -1.
2689           This is always the case when pcre_dfa_exec() is used.
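
       As an illustration of the offset_vector and capture_top fields,  the
       following  sketch copies one captured substring out of the subject
       during a pcre_exec() callout. The helper name copy_capture() and its
       argument  list are assumptions made for this example; the layout of
       the vector (a pair of offsets per substring, with -1 for  an  unset
       one)  is the same as when extracting substrings after a match has
       completed.

```c
#include <string.h>

/* Hypothetical helper for use inside a pcre_exec() callout. Copies
   captured substring number n into buf as a zero-terminated string.
   Returns its length, or -1 if the substring has not been captured so
   far, is unset, or does not fit in the buffer. */
static int copy_capture(const int *offset_vector, int capture_top,
                        const char *subject, int n,
                        char *buf, int bufsize)
{
  int start, end, len;

  /* capture_top is one more than the highest captured number so far. */
  if (n < 1 || n >= capture_top) return -1;

  start = offset_vector[2*n];
  end = offset_vector[2*n + 1];
  if (start < 0 || end < start) return -1;   /* substring is unset */

  len = end - start;
  if (len >= bufsize) return -1;             /* no room for the copy */

  memcpy(buf, subject + start, (size_t)len);
  buf[len] = '\0';
  return len;
}
```

       Inside a callout, the arguments would come from the offset_vector,
       capture_top, and subject fields of the block. The helper is of  no
       use with pcre_dfa_exec(), which does not capture substrings.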
2690    
2691           The callout_data field contains a value that is passed  to  pcre_exec()
2692           or  pcre_dfa_exec() specifically so that it can be passed back in call-
2693           outs. It is passed in the callout_data field  of  the  pcre_extra  data
2694           structure.  If  no such data was passed, the value of callout_data in a
2695           pcre_callout block is NULL. There is a description  of  the  pcre_extra
2696         structure in the pcreapi documentation.         structure in the pcreapi documentation.
2697    
2698           The  pattern_position field is present from version 1 of the pcre_call-
2699           out structure. It contains the offset to the next item to be matched in
2700           the pattern string.
2701    
2702           The  next_item_length field is present from version 1 of the pcre_call-
2703           out structure. It contains the length of the next item to be matched in
2704           the  pattern  string. When the callout immediately precedes an alterna-
2705           tion bar, a closing parenthesis, or the end of the pattern, the  length
2706           is  zero.  When the callout precedes an opening parenthesis, the length
2707           is that of the entire subpattern.
2708    
2709           The pattern_position and next_item_length fields are intended  to  help
2710           in  distinguishing between different automatic callouts, which all have
2711           the same callout number. However, they are set for all callouts.
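
       The fields described above are enough to write a  simple  tracing
       callout.  The  sketch  below is not taken from PCRE: it declares a
       stand-in structure with the fields listed in this section so that
       it compiles without pcre.h, and it assumes that the caller has put
       a pointer to the pattern string in callout_data. Real  code  would
       take  a  pcre_callout_block  pointer  instead,  and  would  store
       the function's address in pcre_callout.

```c
#include <stdio.h>

/* Stand-in for the block described above (an assumption for this
   sketch); real code uses pcre_callout_block from pcre.h. */
typedef struct {
  int          version;
  int          callout_number;
  int         *offset_vector;
  const char  *subject;
  int          subject_length;
  int          start_match;
  int          current_position;
  int          capture_top;
  int          capture_last;
  void        *callout_data;
  int          pattern_position;
  int          next_item_length;
} callout_block_sketch;

/* The most recent trace line, kept in a buffer so it can be inspected. */
static char trace_line[256];

/* Hypothetical tracing callout. Always returns zero, so matching
   proceeds as normal. */
static int trace_callout(callout_block_sketch *cb)
{
  const char *pattern = (const char *)cb->callout_data;
  const char *item = "";
  int item_length = 0;

  /* pattern_position and next_item_length exist only from version 1. */
  if (cb->version >= 1 && pattern != NULL)
    {
    item = pattern + cb->pattern_position;
    item_length = cb->next_item_length;
    }

  snprintf(trace_line, sizeof trace_line,
    "callout %d at subject offset %d, next item \"%.*s\"",
    cb->callout_number, cb->current_position, item_length, item);
  return 0;
}
```

       Because automatic callouts all have number 255, it is  the  next
       item  text in such a trace that tells one callout point from
       another.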
2712    
2713    
2714  RETURN VALUES  RETURN VALUES
2715    
2716         The callout function returns an integer. If the value is zero, matching         The external callout function returns an integer to PCRE. If the  value
2717         proceeds as normal. If the value is greater than zero,  matching  fails         is  zero,  matching  proceeds  as  normal. If the value is greater than
2718         at the current point, but backtracking to test other possibilities goes         zero, matching fails at the current point, but  the  testing  of  other
2719         ahead, just as if a lookahead assertion had failed.  If  the  value  is         matching possibilities goes ahead, just as if a lookahead assertion had
2720         less  than  zero,  the  match is abandoned, and pcre_exec() returns the         failed. If the value is less than zero, the  match  is  abandoned,  and
2721         value.         pcre_exec() (or pcre_dfa_exec()) returns the negative value.
2722    
2723         Negative  values  should  normally  be   chosen   from   the   set   of         Negative   values   should   normally   be   chosen  from  the  set  of
2724         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-         PCRE_ERROR_xxx values. In particular, PCRE_ERROR_NOMATCH forces a stan-
2725         dard "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT  is         dard  "no  match"  failure.   The  error  number  PCRE_ERROR_CALLOUT is
2726         reserved  for  use  by callout functions; it will never be used by PCRE         reserved for use by callout functions; it will never be  used  by  PCRE
2727         itself.         itself.
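
       The three classes of return value can be seen in  the  following
       sketch  of a decision function. The policy and the name
       callout_verdict() are invented for illustration; a  real  callout
       would  take  these values from the block it is passed and from its
       callout_data. PCRE_ERROR_CALLOUT is defined locally,  with  the
       value used in pcre.h, only so that the fragment stands alone.

```c
#ifndef PCRE_ERROR_CALLOUT
#define PCRE_ERROR_CALLOUT (-9)   /* as in pcre.h */
#endif

/* Hypothetical policy for a callout function:
     - abandon the whole match once a budget of callouts is used up;
     - fail at this point if the match pointer has passed a limit;
     - otherwise let matching proceed. */
static int callout_verdict(int calls_so_far, int max_calls,
                           int current_position, int max_position)
{
  if (calls_so_far >= max_calls)
    return PCRE_ERROR_CALLOUT;   /* negative: the match is abandoned */

  if (current_position > max_position)
    return 1;                    /* positive: fail here, other
                                    possibilities are still tried */

  return 0;                      /* zero: proceed as normal */
}
```

       With the negative  return,  the  matching  function  would  yield
       PCRE_ERROR_CALLOUT  to  its  caller, which can therefore tell this
       outcome apart from an ordinary failure to match.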
2728    
 Last updated: 21 January 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
2729    
2730  PCRE(3)                                                                PCRE(3)  AUTHOR
2731    
2732           Philip Hazel
2733           University Computing Service
2734           Cambridge CB2 3QH, England.
2735    
2736    
2737    REVISION
2738    
2739           Last updated: 29 May 2007
2740           Copyright (c) 1997-2007 University of Cambridge.
2741    ------------------------------------------------------------------------------
2742    
2743    
2744    PCRECOMPAT(3)                                                    PCRECOMPAT(3)
2745    
2746    
2747  NAME  NAME
2748         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2749    
 DIFFERENCES FROM PERL  
2750    
2751         This  document describes the differences in the ways that PCRE and Perl  DIFFERENCES BETWEEN PCRE AND PERL
        handle regular expressions. The differences  described  here  are  with  
        respect to Perl 5.8.  
2752    
2753         1.  PCRE does not have full UTF-8 support. Details of what it does have         This  document describes the differences in the ways that PCRE and Perl
2754         are given in the section on UTF-8 support in the main pcre page.         handle regular expressions. The differences described here  are  mainly
2755           with  respect  to  Perl 5.8, though PCRE versions 7.0 and later contain
2756           some features that are expected to be in the forthcoming Perl 5.10.
2757    
2758           1. PCRE has only a subset of Perl's UTF-8 and Unicode support.  Details
2759           of  what  it does have are given in the section on UTF-8 support in the
2760           main pcre page.
2761    
2762         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl         2. PCRE does not allow repeat quantifiers on lookahead assertions. Perl
2763         permits  them,  but they do not mean what you might think. For example,         permits  them,  but they do not mean what you might think. For example,
# Line 1498  DIFFERENCES FROM PERL Line 2773  DIFFERENCES FROM PERL
2773    
2774         4. Though binary zero characters are supported in the  subject  string,         4. Though binary zero characters are supported in the  subject  string,
2775         they are not allowed in a pattern string because it is passed as a nor-         they are not allowed in a pattern string because it is passed as a nor-
2776         mal C string, terminated by zero. The escape sequence "\0" can be  used         mal C string, terminated by zero. The escape sequence \0 can be used in
2777         in the pattern to represent a binary zero.         the pattern to represent a binary zero.
2778    
2779         5.  The  following Perl escape sequences are not supported: \l, \u, \L,         5.  The  following Perl escape sequences are not supported: \l, \u, \L,
2780         \U, \P, \p, \N, and \X. In fact these are implemented by Perl's general         \U, and \N. In fact these are implemented by Perl's general string-han-
2781         string-handling and are not part of its pattern matching engine. If any         dling  and are not part of its pattern matching engine. If any of these
2782         of these are encountered by PCRE, an error is generated.         are encountered by PCRE, an error is generated.
2783    
2784           6. The Perl escape sequences \p, \P, and \X are supported only if  PCRE
2785           is  built  with Unicode character property support. The properties that
2786           can be tested with \p and \P are limited to the general category  prop-
2787           erties  such  as  Lu and Nd, script names such as Greek or Han, and the
2788           derived properties Any and L&.
2789    
2790         6. PCRE does support the \Q...\E escape for quoting substrings. Charac-         7. PCRE does support the \Q...\E escape for quoting substrings. Charac-
2791         ters  in  between  are  treated as literals. This is slightly different         ters  in  between  are  treated as literals. This is slightly different
2792         from Perl in that $ and @ are  also  handled  as  literals  inside  the         from Perl in that $ and @ are  also  handled  as  literals  inside  the
2793         quotes.  In Perl, they cause variable interpolation (but of course PCRE         quotes.  In Perl, they cause variable interpolation (but of course PCRE
# Line 1522  DIFFERENCES FROM PERL Line 2803  DIFFERENCES FROM PERL
2803         The \Q...\E sequence is recognized both inside  and  outside  character         The \Q...\E sequence is recognized both inside  and  outside  character
2804         classes.         classes.
2805    
2806         7. Fairly obviously, PCRE does not support the (?{code}) and (?p{code})         8. Fairly obviously, PCRE does not support the (?{code}) and (??{code})
2807         constructions. However, there is some experimental support  for  recur-         constructions. However, there is support for recursive  patterns.  This
2808         sive  patterns  using the non-Perl items (?R), (?number) and (?P>name).         is  not available in Perl 5.8, but will be in Perl 5.10. Also, the PCRE
2809         Also, the PCRE "callout" feature allows  an  external  function  to  be         "callout" feature allows an external function to be called during  pat-
2810         called during pattern matching.         tern matching. See the pcrecallout documentation for details.
2811    
2812           9.  Subpatterns  that  are  called  recursively or as "subroutines" are
2813           always treated as atomic groups in  PCRE.  This  is  like  Python,  but
2814           unlike Perl.
2815    
2816         8.  There  are some differences that are concerned with the settings of         10.  There are some differences that are concerned with the settings of
2817         captured strings when part of  a  pattern  is  repeated.  For  example,         captured strings when part of  a  pattern  is  repeated.  For  example,
2818         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2         matching  "aba"  against  the  pattern  /^(a(b)?)+$/  in Perl leaves $2
2819         unset, but in PCRE it is set to "b".         unset, but in PCRE it is set to "b".
2820    
2821         9. PCRE  provides  some  extensions  to  the  Perl  regular  expression         11.  PCRE  does  support  Perl  5.10's  backtracking  verbs  (*ACCEPT),
2822         facilities:         (*FAIL),  (*F),  (*COMMIT), (*PRUNE), (*SKIP), and (*THEN), but only in
2823           the forms without an  argument.  PCRE  does  not  support  (*MARK).  If
2824           (*ACCEPT)  is within capturing parentheses, PCRE does not set that cap-
2825           ture group; this is different to Perl.
2826    
2827           12. PCRE provides some extensions to the Perl regular expression facil-
2828           ities.   Perl  5.10  will  include new features that are not in earlier
2829           versions, some of which (such as named parentheses) have been  in  PCRE
2830           for some time. This list is with respect to Perl 5.10:
2831    
2832         (a)  Although  lookbehind  assertions  must match fixed length strings,         (a)  Although  lookbehind  assertions  must match fixed length strings,
2833         each alternative branch of a lookbehind assertion can match a different         each alternative branch of a lookbehind assertion can match a different
# Line 1544  DIFFERENCES FROM PERL Line 2837  DIFFERENCES FROM PERL
2837         meta-character matches only at the very end of the string.         meta-character matches only at the very end of the string.
2838    
2839         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-         (c) If PCRE_EXTRA is set, a backslash followed by a letter with no spe-
2840         cial meaning is faulted.         cial meaning is faulted. Otherwise, like Perl, the backslash is quietly
2841           ignored.  (Perl can be made to issue a warning.)
2842    
2843         (d)  If  PCRE_UNGREEDY is set, the greediness of the repetition quanti-         (d) If PCRE_UNGREEDY is set, the greediness of the  repetition  quanti-
2844         fiers is inverted, that is, by default they are not greedy, but if fol-         fiers is inverted, that is, by default they are not greedy, but if fol-
2845         lowed by a question mark they are.         lowed by a question mark they are.
2846    
2847         (e)  PCRE_ANCHORED  can  be used to force a pattern to be tried only at         (e) PCRE_ANCHORED can be used at matching time to force a pattern to be
2848         the first matching position in the subject string.         tried only at the first matching position in the subject string.
2849    
2850         (f) The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and  PCRE_NO_AUTO_CAP-         (f)  The PCRE_NOTBOL, PCRE_NOTEOL, PCRE_NOTEMPTY, and PCRE_NO_AUTO_CAP-
2851         TURE options for pcre_exec() have no Perl equivalents.         TURE options for pcre_exec() have no Perl equivalents.
2852    
2853         (g)  The (?R), (?number), and (?P>name) constructs allows for recursive         (g) The \R escape sequence can be restricted to match only CR,  LF,  or
2854         pattern matching (Perl can do  this  using  the  (?p{code})  construct,         CRLF by the PCRE_BSR_ANYCRLF option.
        which PCRE cannot support.)  
2855    
2856         (h)  PCRE supports named capturing substrings, using the Python syntax.         (h) The callout facility is PCRE-specific.
2857    
2858         (i) PCRE supports the possessive quantifier  "++"  syntax,  taken  from         (i) The partial matching facility is PCRE-specific.
        Sun's Java package.  
2859    
2860         (j) The (R) condition, for testing recursion, is a PCRE extension.         (j) Patterns compiled by PCRE can be saved and re-used at a later time,
2861           even on different hosts that have the other endianness.
2862    
2863         (k) The callout facility is PCRE-specific.         (k) The alternative matching function (pcre_dfa_exec())  matches  in  a
2864           different way and is not Perl-compatible.
2865    
2866  Last updated: 09 December 2003         (l)  PCRE  recognizes some special sequences such as (*CR) at the start
2867  Copyright (c) 1997-2003 University of Cambridge.         of a pattern that set overall options that cannot be changed within the
2868  -----------------------------------------------------------------------------         pattern.
2869    
2870    
2871    AUTHOR
2872    
2873           Philip Hazel
2874           University Computing Service
2875           Cambridge CB2 3QH, England.
2876    
 PCRE(3)                                                                PCRE(3)  
2877    
2878    REVISION
2879    
2880           Last updated: 11 September 2007
2881           Copyright (c) 1997-2007 University of Cambridge.
2882    ------------------------------------------------------------------------------
2883    
2884    
2885    PCREPATTERN(3)                                                  PCREPATTERN(3)
2886    
2887    
2888  NAME  NAME
2889         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
2890    
2891    
2892  PCRE REGULAR EXPRESSION DETAILS  PCRE REGULAR EXPRESSION DETAILS
2893    
2894         The  syntax  and semantics of the regular expressions supported by PCRE         The  syntax and semantics of the regular expressions that are supported
2895         are described below. Regular expressions are also described in the Perl         by PCRE are described in detail below. There is a quick-reference  syn-
2896         documentation  and in a number of other books, some of which have copi-         tax  summary  in  the  pcresyntax  page. Perl's regular expressions are
2897         ous examples. Jeffrey Friedl's "Mastering  Regular  Expressions",  pub-         described in its own documentation, and regular expressions in  general
2898         lished  by  O'Reilly, covers them in great detail. The description here         are  covered in a number of books, some of which have copious examples.
2899         is intended as reference documentation.         Jeffrey  Friedl's  "Mastering  Regular   Expressions",   published   by
2900           O'Reilly,  covers regular expressions in great detail. This description
2901         The basic operation of PCRE is on strings of bytes. However,  there  is         of PCRE's regular expressions is intended as reference material.
2902         also  support for UTF-8 character strings. To use this support you must  
2903         build PCRE to include UTF-8 support, and then call pcre_compile()  with         The original operation of PCRE was on strings of  one-byte  characters.
2904         the  PCRE_UTF8  option.  How  this affects the pattern matching is men-         However,  there is now also support for UTF-8 character strings. To use
2905         tioned in several places below. There is also a summary of  UTF-8  fea-         this, you must build PCRE to  include  UTF-8  support,  and  then  call
2906         tures in the section on UTF-8 support in the main pcre page.         pcre_compile()  with  the  PCRE_UTF8  option.  How this affects pattern
2907           matching is mentioned in several places below. There is also a  summary
2908           of  UTF-8  features  in  the  section on UTF-8 support in the main pcre
2909           page.
2910    
2911           The remainder of this document discusses the  patterns  that  are  sup-
2912           ported  by  PCRE when its main matching function, pcre_exec(), is used.
2913           From  release  6.0,   PCRE   offers   a   second   matching   function,
2914           pcre_dfa_exec(),  which matches using a different algorithm that is not
2915           Perl-compatible. Some of the features discussed below are not available
2916           when  pcre_dfa_exec()  is used. The advantages and disadvantages of the
2917           alternative function, and how it differs from the normal function,  are
2918           discussed in the pcrematching page.
2919    
2920    
2921    NEWLINE CONVENTIONS
2922    
2923           PCRE  supports five different conventions for indicating line breaks in
2924           strings: a single CR (carriage return) character, a  single  LF  (line-
2925           feed) character, the two-character sequence CRLF, any of the three pre-
2926           ceding, or any Unicode newline sequence. The pcreapi page  has  further
2927           discussion  about newlines, and shows how to set the newline convention
2928           in the options arguments for the compiling and matching functions.
2929    
2930           It is also possible to specify a newline convention by starting a  pat-
2931           tern string with one of the following five sequences:
2932    
2933             (*CR)        carriage return
2934             (*LF)        linefeed
2935             (*CRLF)      carriage return, followed by linefeed
2936             (*ANYCRLF)   any of the three above
2937             (*ANY)       all Unicode newline sequences
2938    
2939           These override the default and the options given to pcre_compile(). For
2940           example, on a Unix system where LF is the default newline sequence, the
2941           pattern
2942    
2943             (*CR)a.b
2944    
2945           changes the convention to CR. That pattern matches "a\nb" because LF is
2946           no longer a newline. Note that these special settings,  which  are  not
2947           Perl-compatible,  are  recognized  only at the very start of a pattern,
2948           and that they must be in upper case.  If  more  than  one  of  them  is
2949           present, the last one is used.
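
       The rules just given (recognition only at the very  start  of  the
       pattern,  upper  case only, last setting wins) can be sketched as
       follows. This is an illustration of the rules, not  PCRE's  own
       code; leading_newline_setting() is a hypothetical name, and the
       sketch knows only the five sequences listed above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns the newline-convention sequence that
   takes effect for a pattern, or NULL if the pattern does not start
   with one. When several are present at the start, the last is used. */
static const char *leading_newline_setting(const char *pattern)
{
  static const char *const names[] =
    { "(*CR)", "(*LF)", "(*CRLF)", "(*ANYCRLF)", "(*ANY)" };
  const size_t count = sizeof names / sizeof names[0];
  const char *last = NULL;

  for (;;)
    {
    size_t i;
    for (i = 0; i < count; i++)
      {
      size_t len = strlen(names[i]);
      /* The comparison is case-sensitive, so "(*cr)" is not accepted. */
      if (strncmp(pattern, names[i], len) == 0)
        {
        last = names[i];
        pattern += len;
        break;
        }
      }
    if (i == count) return last;   /* no further sequence at this point */
    }
}
```

       A sequence that appears anywhere other than  the  start  of  the
       pattern  is not seen by this helper, which mirrors the statement
       that such settings are recognized only at the very start.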
2950    
2951           The  newline  convention  does  not  affect what the \R escape sequence
2952           matches. By default, this is any Unicode  newline  sequence,  for  Perl
2953           compatibility.  However, this can be changed; see the description of \R
2954           in the section entitled "Newline sequences" below. A change of \R  set-
2955           ting can be combined with a change of newline convention.
2956    
2957    
2958    CHARACTERS AND METACHARACTERS
2959    
2960         A  regular  expression  is  a pattern that is matched against a subject         A  regular  expression  is  a pattern that is matched against a subject
2961         string from left to right. Most characters stand for  themselves  in  a         string from left to right. Most characters stand for  themselves  in  a
# Line 1603  PCRE REGULAR EXPRESSION DETAILS Line 2964  PCRE REGULAR EXPRESSION DETAILS
2964    
2965           The quick brown fox           The quick brown fox
2966    
2967         matches a portion of a subject string that is identical to itself.  The         matches a portion of a subject string that is identical to itself. When
2968         power of regular expressions comes from the ability to include alterna-         caseless  matching is specified (the PCRE_CASELESS option), letters are
2969         tives and repetitions in the pattern. These are encoded in the  pattern         matched independently of case. In UTF-8 mode, PCRE  always  understands
2970         by  the  use  of meta-characters, which do not stand for themselves but         the  concept  of case for characters whose values are less than 128, so
2971         instead are interpreted in some special way.         caseless matching is always possible. For characters with  higher  val-
2972           ues,  the concept of case is supported if PCRE is compiled with Unicode
2973         There are two different sets of meta-characters: those that are  recog-         property support, but not otherwise.   If  you  want  to  use  caseless
2974         nized  anywhere in the pattern except within square brackets, and those         matching  for  characters  128  and above, you must ensure that PCRE is
2975         that are recognized in square brackets. Outside  square  brackets,  the         compiled with Unicode property support as well as with UTF-8 support.
2976         meta-characters are as follows:  
2977           The power of regular expressions comes  from  the  ability  to  include
2978           alternatives  and  repetitions in the pattern. These are encoded in the
2979           pattern by the use of metacharacters, which do not stand for themselves
2980           but instead are interpreted in some special way.
2981    
2982           There  are  two different sets of metacharacters: those that are recog-
2983           nized anywhere in the pattern except within square brackets, and  those
2984           that  are  recognized  within square brackets. Outside square brackets,
2985           the metacharacters are as follows:
2986    
2987           \      general escape character with several uses           \      general escape character with several uses
2988           ^      assert start of string (or line, in multiline mode)           ^      assert start of string (or line, in multiline mode)
# Line 1630  PCRE REGULAR EXPRESSION DETAILS Line 3000  PCRE REGULAR EXPRESSION DETAILS
3000                  also "possessive quantifier"                  also "possessive quantifier"
3001           {      start min/max quantifier           {      start min/max quantifier
3002    
3003         Part  of  a  pattern  that is in square brackets is called a "character         Part of a pattern that is in square brackets  is  called  a  "character
3004         class". In a character class the only meta-characters are:         class". In a character class the only metacharacters are:
3005    
3006           \      general escape character           \      general escape character
3007           ^      negate the class, but only if the first character           ^      negate the class, but only if the first character
# Line 1640  PCRE REGULAR EXPRESSION DETAILS Line 3010  PCRE REGULAR EXPRESSION DETAILS
3010                    syntax)                    syntax)
3011           ]      terminates the character class           ]      terminates the character class
3012    
3013         The following sections describe the use of each of the meta-characters.         The  following sections describe the use of each of the metacharacters.
3014    
3015    
3016  BACKSLASH  BACKSLASH
3017    
3018         The backslash character has several uses. Firstly, if it is followed by         The backslash character has several uses. Firstly, if it is followed by
3019         a non-alphameric character, it takes  away  any  special  meaning  that         a  non-alphanumeric  character,  it takes away any special meaning that
3020         character  may  have.  This  use  of  backslash  as an escape character         character may have. This  use  of  backslash  as  an  escape  character
3021         applies both inside and outside character classes.         applies both inside and outside character classes.
3022    
3023         For example, if you want to match a * character, you write  \*  in  the         For  example,  if  you want to match a * character, you write \* in the
3024         pattern.   This  escaping  action  applies whether or not the following         pattern.  This escaping action applies whether  or  not  the  following
3025         character would otherwise be interpreted as a meta-character, so it  is         character  would  otherwise be interpreted as a metacharacter, so it is
3026         always  safe to precede a non-alphameric with backslash to specify that         always safe to precede a non-alphanumeric  with  backslash  to  specify
3027         it stands for itself. In particular, if you want to match a  backslash,         that  it stands for itself. In particular, if you want to match a back-
3028         you write \\.         slash, you write \\.
3029    
3030         If  a  pattern is compiled with the PCRE_EXTENDED option, whitespace in         If a pattern is compiled with the PCRE_EXTENDED option,  whitespace  in
3031         the pattern (other than in a character class) and characters between  a         the  pattern (other than in a character class) and characters between a
3032         # outside a character class and the next newline character are ignored.         # outside a character class and the next newline are ignored. An escap-
3033         An escaping backslash can be used to include a whitespace or #  charac-         ing  backslash  can  be  used to include a whitespace or # character as
3034         ter as part of the pattern.         part of the pattern.
3035    
3036         If  you  want  to remove the special meaning from a sequence of charac-         If you want to remove the special meaning from a  sequence  of  charac-
3037         ters, you can do so by putting them between \Q and \E. This is  differ-         ters,  you can do so by putting them between \Q and \E. This is differ-
3038         ent  from  Perl  in  that  $  and  @ are handled as literals in \Q...\E         ent from Perl in that $ and  @  are  handled  as  literals  in  \Q...\E
3039         sequences in PCRE, whereas in Perl, $ and @ cause  variable  interpola-         sequences  in  PCRE, whereas in Perl, $ and @ cause variable interpola-
3040         tion. Note the following examples:         tion. Note the following examples:
3041    
3042           Pattern            PCRE matches   Perl matches           Pattern            PCRE matches   Perl matches
# Line 1676  BACKSLASH Line 3046  BACKSLASH
3046           \Qabc\$xyz\E       abc\$xyz       abc\$xyz           \Qabc\$xyz\E       abc\$xyz       abc\$xyz
3047           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz           \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz
3048    
3049         The  \Q...\E  sequence  is recognized both inside and outside character         The \Q...\E sequence is recognized both inside  and  outside  character
3050         classes.         classes.
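
The \Q...\E behaviour described above can be sketched as a rewrite into backslash-escaped literals. This is only an illustration, not how PCRE implements it: the function name expand_literal_quoting is hypothetical, and the sketch relies on the earlier statement that it is always safe to precede a non-alphanumeric with a backslash. An isolated \E outside a quoted run is not handled specially here.

```python
def expand_literal_quoting(pattern: str) -> str:
    r"""Rewrite \Q...\E runs as individually escaped literal characters.

    Hypothetical helper for illustration; an unterminated \Q is assumed
    to quote everything up to the end of the pattern.
    """
    out = []
    i = 0
    quoting = False
    while i < len(pattern):
        two = pattern[i:i + 2]
        if not quoting and two == "\\Q":
            quoting = True              # start of a literal run
            i += 2
        elif quoting and two == "\\E":
            quoting = False             # end of the literal run
            i += 2
        elif quoting:
            ch = pattern[i]
            # Alphanumerics stand for themselves; anything else is escaped,
            # so $ and @ become literals rather than anything special.
            out.append(ch if ch.isalnum() else "\\" + ch)
            i += 1
        elif pattern[i] == "\\" and i + 1 < len(pattern):
            out.append(two)             # copy an ordinary escape pair intact
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)


print(expand_literal_quoting(r"\Qabc\$xyz\E"))
print(expand_literal_quoting(r"\Qabc\E\$\Qxyz\E"))
```

Inside the quoted run the backslash is itself treated as a literal, which is why the first example pattern matches the subject text abc\$xyz rather than abc$xyz.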
3051    
3052       Non-printing characters
3053    
3054         A second use of backslash provides a way of encoding non-printing char-         A second use of backslash provides a way of encoding non-printing char-
3055         acters  in patterns in a visible manner. There is no restriction on the         acters in patterns in a visible manner. There is no restriction on  the
3056         appearance of non-printing characters, apart from the binary zero  that         appearance  of non-printing characters, apart from the binary zero that
3057         terminates  a  pattern,  but  when  a pattern is being prepared by text         terminates a pattern, but when a pattern  is  being  prepared  by  text
3058         editing, it is usually easier  to  use  one  of  the  following  escape         editing,  it  is  usually  easier  to  use  one of the following escape
3059         sequences than the binary character it represents:         sequences than the binary character it represents:
3060    
3061           \a        alarm, that is, the BEL character (hex 07)           \a        alarm, that is, the BEL character (hex 07)
3062           \cx       "control-x", where x is any character           \cx       "control-x", where x is any character
3063           \e        escape (hex 1B)           \e        escape (hex 1B)
3064           \f        formfeed (hex 0C)           \f        formfeed (hex 0C)
3065           \n        newline (hex 0A)           \n        linefeed (hex 0A)
3066           \r        carriage return (hex 0D)           \r        carriage return (hex 0D)
3067           \t        tab (hex 09)           \t        tab (hex 09)
3068           \ddd      character with octal code ddd, or backreference           \ddd      character with octal code ddd, or backreference
3069           \xhh      character with hex code hh           \xhh      character with hex code hh
3070           \x{hhh..} character with hex code hhh... (UTF-8 mode only)           \x{hhh..} character with hex code hhh..
3071    
3072         The  precise  effect of \cx is as follows: if x is a lower case letter,         The precise effect of \cx is as follows: if x is a lower  case  letter,
3073         it is converted to upper case. Then bit 6 of the character (hex 40)  is         it  is converted to upper case. Then bit 6 of the character (hex 40) is
3074         inverted.   Thus  \cz becomes hex 1A, but \c{ becomes hex 3B, while \c;         inverted.  Thus \cz becomes hex 1A, but \c{ becomes hex 3B,  while  \c;
3075         becomes hex 7B.         becomes hex 7B.
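
The \cx rule above amounts to an upper-casing step followed by flipping one bit. A minimal sketch, assuming single-byte ASCII input (the function name control_escape is hypothetical and is not part of PCRE):

```python
def control_escape(x: str) -> int:
    r"""Return the character code produced by \cx under the rule above."""
    if len(x) != 1:
        raise ValueError("expected exactly one character after \\c")
    # Step 1: a lower case letter is converted to upper case.
    if "a" <= x <= "z":
        x = x.upper()
    # Step 2: bit 6 of the character code (hex 40) is inverted.
    return ord(x) ^ 0x40


for ch in ("z", "{", ";"):
    print(f"\\c{ch} -> hex {control_escape(ch):02X}")
```

Running it on z, { and ; reproduces the three values quoted in the text (1A, 3B and 7B).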
3076    
3077         After \x, from zero to two hexadecimal digits are read (letters can  be         After  \x, from zero to two hexadecimal digits are read (letters can be
3078         in  upper or lower case). In UTF-8 mode, any number of hexadecimal dig-         in upper or lower case). Any number of hexadecimal  digits  may  appear
3079         its may appear between \x{ and }, but the value of the  character  code         between  \x{  and  },  but the value of the character code must be less
3080         must  be  less  than  2**31  (that is, the maximum hexadecimal value is         than 256 in non-UTF-8 mode, and less than 2**31 in UTF-8 mode. That is,
3081         7FFFFFFF). If characters other than hexadecimal digits  appear  between         the  maximum value in hexadecimal is 7FFFFFFF. Note that this is bigger
3082         \x{  and }, or if there is no terminating }, this form of escape is not         than the largest Unicode code point, which is 10FFFF.
3083         recognized. Instead, the initial \x will be interpreted as a basic hex-  
3084         adecimal escape, with no following digits, giving a byte whose value is         If characters other than hexadecimal digits appear between \x{  and  },
3085           or if there is no terminating }, this form of escape is not recognized.
3086           Instead, the initial \x will be  interpreted  as  a  basic  hexadecimal
3087           escape,  with  no  following  digits, giving a character whose value is
3088         zero.         zero.
3089    
3090         Characters whose value is less than 256 can be defined by either of the         Characters whose value is less than 256 can be defined by either of the
3091         two  syntaxes for \x when PCRE is in UTF-8 mode. There is no difference         two  syntaxes  for  \x. There is no difference in the way they are han-
3092         in the way they are handled. For example, \xdc is exactly the  same  as         dled. For example, \xdc is exactly the same as \x{dc}.
3093         \x{dc}.  
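
The two \x forms and the fallback for a malformed braced form can be sketched as a small parser, following the newer wording in the right-hand column. This is an illustration under stated assumptions, not PCRE's code: the name parse_hex_escape is hypothetical, an empty \x{} is assumed to be "not recognized", and an out-of-range value is reported with a Python ValueError, since the text does not say how PCRE reports it.

```python
HEX_DIGITS = set("0123456789abcdefABCDEF")


def parse_hex_escape(pattern: str, pos: int, utf8: bool = False):
    r"""Parse what follows \x; pos indexes the character right after the x.

    Returns (character_value, index_of_next_unread_character).
    """
    # Braced form \x{hhh..}: any number of hexadecimal digits.
    if pos < len(pattern) and pattern[pos] == "{":
        close = pattern.find("}", pos + 1)
        digits = pattern[pos + 1:close] if close != -1 else ""
        if digits and all(c in HEX_DIGITS for c in digits):
            value = int(digits, 16)
            limit = 2**31 if utf8 else 256
            if value >= limit:
                # Assumption: signalled here as a ValueError.
                raise ValueError("character value too large")
            return value, close + 1
        # Otherwise the braced form is not recognized; fall through so that
        # \x is read as a basic escape. "{" is not a hex digit, so zero
        # digits are consumed and the value is zero.
    # Basic form: from zero to two hexadecimal digits.
    end = pos
    while end < len(pattern) and end - pos < 2 and pattern[end] in HEX_DIGITS:
        end += 1
    value = int(pattern[pos:end], 16) if end > pos else 0
    return value, end


print(parse_hex_escape("dc", 0))      # basic form
print(parse_hex_escape("{dc}", 0))    # braced form, same value
print(parse_hex_escape("{zz}", 0))    # malformed braces: value zero
```

In the malformed case the returned index still points at the "{", so the brace and what follows would be read as ordinary pattern text.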
3094           After \0 up to two further octal digits are read. If  there  are  fewer
3095         After  \0  up  to  two further octal digits are read. In both cases, if         than  two  digits,  just  those  that  are  present  are used. Thus the
3096         there are fewer than two digits, just those that are present are  used.         sequence \0\x\07 specifies two binary zeros followed by a BEL character
3097         Thus  the sequence \0\x\07 specifies two binary zeros followed by a BEL         (code  value 7). Make sure you supply two digits after the initial zero
3098         character (code value 7). Make sure you supply  two  digits  after  the         if the pattern character that follows is itself an octal digit.
        initial zero if the character that follows is itself an octal digit.  
3099    
3100         The handling of a backslash followed by a digit other than 0 is compli-         The handling of a backslash followed by a digit other than 0 is compli-
3101         cated.  Outside a character class, PCRE reads it and any following dig-         cated.  Outside a character class, PCRE reads it and any following dig-
3102         its  as  a  decimal  number. If the number is less than 10, or if there         its as a decimal number. If the number is less than  10,  or  if  there
3103         have been at least that many previous capturing left parentheses in the         have been at least that many previous capturing left parentheses in the
3104         expression,  the  entire  sequence  is  taken  as  a  back reference. A         expression, the entire  sequence  is  taken  as  a  back  reference.  A
3105         description of how this works is given later, following the  discussion         description  of how this works is given later, following the discussion
3106         of parenthesized subpatterns.         of parenthesized subpatterns.
3107    
3108         Inside  a  character  class, or if the decimal number is greater than 9         Inside a character class, or if the decimal number is  greater  than  9
3109         and there have not been that many capturing subpatterns, PCRE  re-reads         and  there have not been that many capturing subpatterns, PCRE re-reads
3110         up  to three octal digits following the backslash, and generates a sin-         up to three octal digits following the backslash, and uses them to gen-
3111         gle byte from the least significant 8 bits of the value. Any subsequent         erate  a data character. Any subsequent digits stand for themselves. In
3112         digits stand for themselves.  For example:         non-UTF-8 mode, the value of a character specified  in  octal  must  be
3113           less  than  \400.  In  UTF-8 mode, values up to \777 are permitted. For
3114           example:
3115    
3116           \040   is another way of writing a space           \040   is another way of writing a space
3117           \40    is the same, provided there are fewer than 40           \40    is the same, provided there are fewer than 40
# Line 1752  BACKSLASH Line 3128  BACKSLASH
3128           \81    is either a back reference, or a binary zero           \81    is either a back reference, or a binary zero
3129                     followed by the two characters "8" and "1"                     followed by the two characters "8" and "1"
3130    
3131         Note  that  octal  values of 100 or greater must not be introduced by a         Note that octal values of 100 or greater must not be  introduced  by  a
3132         leading zero, because no more than three octal digits are ever read.         leading zero, because no more than three octal digits are ever read.
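
The decision between a back reference and an octal character code can be sketched as follows, using the newer wording in the right-hand column. The function name interpret_digit_escape and its return shape are hypothetical, and the ValueError for an out-of-range value in non-UTF-8 mode is an assumption about how the limit might be reported.

```python
OCTAL_DIGITS = set("01234567")
DECIMAL_DIGITS = set("0123456789")


def interpret_digit_escape(digits: str, groups_so_far: int,
                           in_class: bool = False, utf8: bool = False):
    """Interpret a backslash followed by digits, the first of which is not 0.

    digits        -- the whole run of decimal digits after the backslash
    groups_so_far -- previous capturing left parentheses in the pattern
    Returns ("backref", number, "") or ("char", value, literal_digits).
    """
    if not digits or digits[0] == "0" or not set(digits) <= DECIMAL_DIGITS:
        raise ValueError("expected decimal digits not starting with 0")
    if not in_class:
        # Outside a class the digits are first read as a decimal number.
        number = int(digits)
        if number < 10 or groups_so_far >= number:
            return ("backref", number, "")
    # Otherwise re-read up to three octal digits after the backslash.
    end = 0
    while end < len(digits) and end < 3 and digits[end] in OCTAL_DIGITS:
        end += 1
    value = int(digits[:end], 8) if end else 0
    if value > 0o377 and not utf8:
        # Assumption: signalled here as a ValueError.
        raise ValueError("octal value must be less than \\400")
    # Any subsequent digits stand for themselves.
    return ("char", value, digits[end:])


print(interpret_digit_escape("7", 0))     # always a back reference
print(interpret_digit_escape("40", 0))    # a space when there are < 40 groups
print(interpret_digit_escape("81", 0))    # binary zero, then "8" and "1"
```

The \81 case shows why the rule reads zero octal digits: "8" is not an octal digit, so the value is zero and both digits remain as literal characters.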
3133    
3134         All the sequences that define a single byte value  or  a  single  UTF-8         All the sequences that define a single character value can be used both
3135         character (in UTF-8 mode) can be used both inside and outside character         inside and outside character classes. In addition, inside  a  character
3136         classes. In addition, inside a character  class,  the  sequence  \b  is         class,  the  sequence \b is interpreted as the backspace character (hex
3137         interpreted  as  the  backspace character (hex 08). Outside a character         08), and the sequences \R and \X are interpreted as the characters  "R"
3138         class it has a different meaning (see below).         and  "X", respectively. Outside a character class, these sequences have
3139           different meanings (see below).
3140    
3141       Absolute and relative back references
3142    
3143           The sequence \g followed by an unsigned or a negative  number,  option-
3144           ally  enclosed  in braces, is an absolute or relative back reference. A
3145           named back reference can be coded as \g{name}. Back references are dis-
3146           cussed later, following the discussion of parenthesized subpatterns.
3147    
3148         The third use of backslash is for specifying generic character types:     Generic character types
3149    
3150           Another use of backslash is for specifying generic character types. The
3151           following are always recognized:
3152    
3153           \d     any decimal digit           \d     any decimal digit
3154           \D     any character that is not a decimal digit           \D     any character that is not a decimal digit
3155             \h     any horizontal whitespace character
3156             \H     any character that is not a horizontal whitespace character
3157           \s     any whitespace character           \s     any whitespace character
3158           \S     any character that is not a whitespace character           \S     any character that is not a whitespace character
3159             \v     any vertical whitespace character
3160             \V     any character that is not a vertical whitespace character
3161           \w     any "word" character           \w     any "word" character
3162           \W     any "non-word" character           \W     any "non-word" character
3163    
# Line 1774  BACKSLASH Line 3165  BACKSLASH
3165         into  two disjoint sets. Any given character matches one, and only one,         into  two disjoint sets. Any given character matches one, and only one,
3166         of each pair.         of each pair.
3167    
        In UTF-8 mode, characters with values greater than 255 never match  \d,  
        \s, or \w, and always match \D, \S, and \W.  
   
        For  compatibility  with Perl, \s does not match the VT character (code  
        11).  This makes it different from the POSIX "space" class. The  \s  
        characters are HT (9), LF (10), FF (12), CR (13), and space (32).  
   
        A  "word" character is any letter or digit or the underscore character,