Diff of /code/trunk/doc/pcre.txt

revision 69 by nigel, Sat Feb 24 21:40:18 2007 UTC revision 708 by ph10, Fri Sep 23 11:03:03 2011 UTC
# Line 1  Line 1 
1    -----------------------------------------------------------------------------
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
5  synopses of each function in the library have not been included. There are  synopses of each function in the library have not been included. Neither has
6  separate text files for the pcregrep and pcretest commands.  the pcredemo program. There are separate text files for the pcregrep and
7    pcretest commands.
8  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
9    
10    
11    PCRE(3)                                                                PCRE(3)
12    
13    
14  NAME  NAME
15       PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
16    
17    
18  DESCRIPTION  INTRODUCTION
19    
20       The PCRE library is a set of functions that implement  regu-         The  PCRE  library is a set of functions that implement regular expres-
21       lar  expression  pattern  matching using the same syntax and         sion pattern matching using the same syntax and semantics as Perl, with
22       semantics as Perl, with just a few differences. The  current         just  a few differences. Some features that appeared in Python and PCRE
23       implementation  of  PCRE  (release 4.x) corresponds approxi-         before they appeared in Perl are also available using the  Python  syn-
24       mately with Perl 5.8, including support  for  UTF-8  encoded         tax,  there  is  some  support for one or two .NET and Oniguruma syntax
25       strings.    However,  this  support  has  to  be  explicitly         items, and there is an option for requesting some  minor  changes  that
26       enabled; it is not the default.         give better JavaScript compatibility.
27    
28       PCRE is written in C and released as a C library. However, a         The  current implementation of PCRE corresponds approximately with Perl
29       number  of  people  have  written wrappers and interfaces of         5.12, including support for UTF-8 encoded strings and  Unicode  general
30       various kinds. A C++ class is included  in  these  contribu-         category  properties.  However,  UTF-8  and  Unicode  support has to be
31       tions,  which  can  be found in the Contrib directory at the         explicitly enabled; it is not the default. The  Unicode  tables  corre-
32       primary FTP site, which is:         spond to Unicode release 6.0.0.
33    
34       ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         In  addition to the Perl-compatible matching function, PCRE contains an
35           alternative function that matches the same compiled patterns in a  dif-
36       Details of exactly which Perl  regular  expression  features         ferent way. In certain circumstances, the alternative function has some
37       are  and  are  not  supported  by PCRE are given in separate         advantages.  For a discussion of the two matching algorithms,  see  the
38       documents. See the pcrepattern and pcrecompat pages.         pcrematching page.
39    
40       Some features of PCRE can be included, excluded, or  changed         PCRE  is  written  in C and released as a C library. A number of people
41       when  the library is built. The pcre_config() function makes         have written wrappers and interfaces of various kinds.  In  particular,
42       it possible for a client  to  discover  which  features  are         Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now
43       available.  Documentation  about  building  PCRE for various         included as part of the PCRE distribution. The pcrecpp page has details
44       operating systems can be found in the  README  file  in  the         of  this  interface.  Other  people's contributions can be found in the
45       source distribution.         Contrib directory at the primary FTP site, which is:
46    
47           ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
48    
49           Details of exactly which Perl regular expression features are  and  are
50           not supported by PCRE are given in separate documents. See the pcrepat-
51           tern and pcrecompat pages. There is a syntax summary in the  pcresyntax
52           page.
53    
54           Some  features  of  PCRE can be included, excluded, or changed when the
55           library is built. The pcre_config() function makes it  possible  for  a
56           client  to  discover  which  features are available. The features them-
57           selves are described in the pcrebuild page. Documentation about  build-
58           ing  PCRE  for various operating systems can be found in the README and
59           NON-UNIX-USE files in the source distribution.
60    
61           The library contains a number of undocumented  internal  functions  and
62           data  tables  that  are  used by more than one of the exported external
63           functions, but which are not intended  for  use  by  external  callers.
64           Their  names  all begin with "_pcre_", which hopefully will not provoke
65           any name clashes. In some environments, it is possible to control which
66           external  symbols  are  exported when a shared library is built, and in
67           these cases the undocumented symbols are not exported.
68    
69    
70  USER DOCUMENTATION  USER DOCUMENTATION
71    
72       The user documentation for PCRE has been  split  up  into  a         The user documentation for PCRE comprises a number  of  different  sec-
73       number  of  different sections. In the "man" format, each of         tions.  In the "man" format, each of these is a separate "man page". In
74       these is a separate "man page". In the HTML format, each  is         the HTML format, each is a separate page, linked from the  index  page.
75       a  separate  page,  linked from the index page. In the plain         In  the  plain  text format, all the sections, except the pcredemo sec-
76       text format, all the sections are concatenated, for ease  of         tion, are concatenated, for ease of searching. The sections are as fol-
77       searching. The sections are as follows:         lows:
78    
79         pcre              this document           pcre              this document
80         pcreapi           details of PCRE's native API           pcre-config       show PCRE installation configuration information
81         pcrebuild         options for building PCRE           pcreapi           details of PCRE's native C API
82         pcrecallout       details of the callout feature           pcrebuild         options for building PCRE
83         pcrecompat        discussion of Perl compatibility           pcrecallout       details of the callout feature
84         pcregrep          description of the pcregrep command           pcrecompat        discussion of Perl compatibility
85         pcrepattern       syntax and semantics of supported           pcrecpp           details of the C++ wrapper
86                             regular expressions           pcredemo          a demonstration C program that uses PCRE
87         pcreperform       discussion of performance issues           pcregrep          description of the pcregrep command
88         pcreposix         the POSIX-compatible API           pcrejit           discussion of the just-in-time optimization support
89         pcresample        discussion of the sample program           pcrelimits        details of size and other limits
90         pcretest          the pcretest testing command           pcrematching      discussion of the two matching algorithms
91             pcrepartial       details of the partial matching facility
92       In addition, in the "man" and HTML formats, there is a short           pcrepattern       syntax and semantics of supported
93       page  for  each  library function, listing its arguments and                               regular expressions
94       results.           pcreperform       discussion of performance issues
95             pcreposix         the POSIX-compatible C API
96             pcreprecompile    details of saving and re-using precompiled patterns
97  LIMITATIONS           pcresample        discussion of the pcredemo program
98             pcrestack         discussion of stack usage
99       There are some size limitations in PCRE but it is hoped that           pcresyntax        quick syntax reference
100       they will never in practice be relevant.           pcretest          description of the pcretest testing command
101             pcreunicode       discussion of Unicode and UTF-8 support
      The maximum length of a  compiled  pattern  is  65539  (sic)  
      bytes  if PCRE is compiled with the default internal linkage  
      size of 2. If you want to process regular  expressions  that  
      are  truly  enormous,  you can compile PCRE with an internal  
      linkage size of 3 or 4 (see the README file  in  the  source  
      distribution  and  the pcrebuild documentation for details).  
      In these cases the limit is substantially larger.   However,  
      the speed of execution will be slower.  
   
      All values in repeating quantifiers must be less than 65536.  
      The maximum number of capturing subpatterns is 65535.  
   
      There is no limit to the  number  of  non-capturing  subpat-  
      terns,  but  the  maximum  depth  of nesting of all kinds of  
      parenthesized subpattern, including  capturing  subpatterns,  
      assertions, and other types of subpattern, is 200.  
   
      The maximum length of a subject string is the largest  posi-  
      tive number that an integer variable can hold. However, PCRE  
      uses recursion to handle subpatterns and indefinite  repeti-  
      tion.  This  means  that the available stack space may limit  
      the size of a subject string that can be processed  by  cer-  
      tain patterns.  
102    
103           In  addition,  in the "man" and HTML formats, there is a short page for
104           each C library function, listing its arguments and results.
105    
 UTF-8 SUPPORT  
106    
107       Starting at release 3.3, PCRE has had some support for char-  AUTHOR
      acter  strings  encoded in the UTF-8 format. For release 4.0  
      this has been greatly extended to cover most common require-  
      ments.  
   
      In order to process UTF-8 strings, you  must  build  PCRE  to  
      include  UTF-8  support  in  the code, and, in addition, you  
      must call pcre_compile() with  the  PCRE_UTF8  option  flag.  
      When  you  do this, both the pattern and any subject strings  
      that are matched against it are  treated  as  UTF-8  strings  
      instead of just strings of bytes.  
   
      If you compile PCRE with UTF-8 support, but do not use it at  
      run  time,  the  library will be a bit bigger, but the addi-  
      tional run time overhead is limited to testing the PCRE_UTF8  
      flag in several places, so should not be very large.  
   
      The following comments apply when PCRE is running  in  UTF-8  
      mode:  
   
      1. PCRE assumes that the strings it is given  contain  valid  
      UTF-8  codes. It does not diagnose invalid UTF-8 strings. If  
      you pass invalid UTF-8 strings  to  PCRE,  the  results  are  
      undefined.  
   
      2. In a pattern, the escape sequence \x{...}, where the con-  
      tents  of  the  braces is a string of hexadecimal digits, is  
      interpreted as a UTF-8 character whose code  number  is  the  
      given  hexadecimal  number, for example: \x{1234}. If a non-  
      hexadecimal digit appears between the braces,  the  item  is  
      not  recognized.  This escape sequence can be used either as  
      a literal, or within a character class.  
   
      3. The original hexadecimal escape sequence, \xhh, matches a  
      two-byte UTF-8 character if the value is greater than 127.  
   
      4. Repeat quantifiers apply to  complete  UTF-8  characters,  
      not to individual bytes, for example: \x{100}{3}.  
   
      5. The dot metacharacter matches one UTF-8 character instead  
      of a single byte.  
   
      6. The escape sequence \C can be used to match a single byte  
      in UTF-8 mode, but its use can lead to some strange effects.  
   
      7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  
      correctly test characters of any code value, but the charac-  
      ters that PCRE recognizes as digits, spaces, or word charac-  
      ters  remain  the  same  set as before, all with values less  
      than 256.  
   
      8. Case-insensitive  matching  applies  only  to  characters  
      whose  values  are  less than 256. PCRE does not support the  
      notion of "case" for higher-valued characters.  
108    
109       9. PCRE does not support the use of Unicode tables and  pro-         Philip Hazel
110       perties or the Perl escapes \p, \P, and \X.         University Computing Service
111           Cambridge CB2 3QH, England.
112    
113           Putting an actual email address here seems to have been a spam  magnet,
114           so  I've  taken  it away. If you want to email me, use my two initials,
115           followed by the two digits 10, at the domain cam.ac.uk.
116    
 AUTHOR  
117    
118       Philip Hazel <ph10@cam.ac.uk>  REVISION
119       University Computing Service,  
120       Cambridge CB2 3QG, England.         Last updated: 24 August 2011
121       Phone: +44 1223 334714         Copyright (c) 1997-2011 University of Cambridge.
122    ------------------------------------------------------------------------------
123    
124    
125    PCREBUILD(3)                                                      PCREBUILD(3)
126    
 Last updated: 04 February 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
127    
128  NAME  NAME
129       PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
130    
131    
132  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
133    
134       This document describes the optional features of  PCRE  that         This  document  describes  the  optional  features  of PCRE that can be
135       can  be  selected when the library is compiled. They are all         selected when the library is compiled. It assumes use of the  configure
136       selected, or deselected, by providing options to the config-         script,  where the optional features are selected or deselected by pro-
137       ure  script  which  is run before the make command. The com-         viding options to configure before running the make  command.  However,
138       plete list of options  for  configure  (which  includes  the         the  same  options  can be selected in both Unix-like and non-Unix-like
139       standard  ones  such  as  the  selection of the installation         environments using the GUI facility of cmake-gui if you are using CMake
140       directory) can be obtained by running         instead of configure to build PCRE.
141    
142         ./configure --help         There  is  a  lot more information about building PCRE in non-Unix-like
143           environments in the file called NON_UNIX_USE, which is part of the PCRE
144       The following sections describe certain options whose  names         distribution.  You  should consult this file as well as the README file
145       begin  with  --enable  or  --disable. These settings specify         if you are building in a non-Unix-like environment.
146       changes to the defaults for the configure  command.  Because  
147       of  the  way  that  configure  works, --enable and --disable         The complete list of options for configure (which includes the standard
148       always come in pairs, so  the  complementary  option  always         ones  such  as  the  selection  of  the  installation directory) can be
149       exists  as  well, but as it specifies the default, it is not         obtained by running
150       described.  
151             ./configure --help
152    
153           The following sections include  descriptions  of  options  whose  names
154           begin with --enable or --disable. These settings specify changes to the
155           defaults for the configure command. Because of the way  that  configure
156           works,  --enable  and --disable always come in pairs, so the complemen-
157           tary option always exists as well, but as it specifies the default,  it
158           is not described.
159    
160    
161    BUILDING SHARED AND STATIC LIBRARIES
162    
163           The  PCRE building process uses libtool to build both shared and static
164           Unix libraries by default. You can suppress one of these by adding  one
165           of
166    
167             --disable-shared
168             --disable-static
169    
170           to the configure command, as required.
171    
172    
173    C++ SUPPORT
174    
175           By default, the configure script will search for a C++ compiler and C++
176           header files. If it finds them, it automatically builds the C++ wrapper
177           library for PCRE. You can disable this by adding
178    
179             --disable-cpp
180    
181           to the configure command.
182    
183    
184  UTF-8 SUPPORT  UTF-8 SUPPORT
185    
186       To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF-8 Unicode character strings, add
187    
188             --enable-utf8
189    
190           to  the  configure  command.  Of  itself, this does not make PCRE treat
191           strings as UTF-8. As well as compiling PCRE with this option, you  also
192           have to set the PCRE_UTF8 option when you call  the  pcre_compile()
193           or pcre_compile2() functions.
194    
195           If you set --enable-utf8 when compiling in an EBCDIC environment,  PCRE
196           expects its input to be either ASCII or UTF-8 (depending on the runtime
197           option). It is not possible to support both EBCDIC and UTF-8  codes  in
198           the  same  version  of  the  library.  Consequently,  --enable-utf8 and
199           --enable-ebcdic are mutually exclusive.
200    
201    
202    UNICODE CHARACTER PROPERTY SUPPORT
203    
204           UTF-8 support allows PCRE to process character values greater than  255
205           in  the  strings that it handles. On its own, however, it does not pro-
206           vide any facilities for accessing the properties of such characters. If
207           you  want  to  be able to use the pattern escapes \P, \p, and \X, which
208           refer to Unicode character properties, you must add
209    
210             --enable-unicode-properties
211    
212           to the configure command. This implies UTF-8 support, even if you  have
213           not explicitly requested it.
214    
215           Including  Unicode  property  support  adds around 30K of tables to the
216           PCRE library. Only the general category properties such as  Lu  and  Nd
217           are supported. Details are given in the pcrepattern documentation.
218    
        --enable-utf8  
219    
220       to the configure command. Of itself, this does not make PCRE  JUST-IN-TIME COMPILER SUPPORT
221       treat  strings as UTF-8. As well as compiling PCRE with this  
222       option, you also have to set the PCRE_UTF8  option  when          Just-in-time compiler support is included in the build by specifying
223       you call the pcre_compile() function.  
224             --enable-jit
225    
226           This  support  is available only for certain hardware architectures. If
227           this option is set for an  unsupported  architecture,  a  compile  time
228           error  occurs.   See  the pcrejit documentation for a discussion of JIT
229           usage. When JIT support is enabled, pcregrep automatically makes use of
230           it, unless you add
231    
232             --disable-pcregrep-jit
233    
234           to the "configure" command.
235    
236    
237  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
238    
239       By default, PCRE treats character 10 (linefeed) as the  new-         By  default,  PCRE interprets the linefeed (LF) character as indicating
240       line  character.  This  is  the  normal newline character on         the end of a line. This is the normal newline  character  on  Unix-like
241       Unix-like systems. You can compile PCRE to use character  13         systems.  You  can compile PCRE to use carriage return (CR) instead, by
242       (carriage return) instead by adding         adding
   
        --enable-newline-is-cr  
   
      to the configure command. For completeness there is  also  a  
      --enable-newline-is-lf  option,  which  explicitly specifies  
      linefeed as the newline character.  
243    
244             --enable-newline-is-cr
245    
246  BUILDING SHARED AND STATIC LIBRARIES         to the  configure  command.  There  is  also  a  --enable-newline-is-lf
247           option, which explicitly specifies linefeed as the newline character.
248    
249           Alternatively, you can specify that line endings are to be indicated by
250           the two character sequence CRLF. If you want this, add
251    
252             --enable-newline-is-crlf
253    
254           to the configure command. There is a fourth option, specified by
255    
256             --enable-newline-is-anycrlf
257    
258           which causes PCRE to recognize any of the three sequences  CR,  LF,  or
259           CRLF as indicating a line ending. Finally, a fifth option, specified by
260    
261             --enable-newline-is-any
262    
263           causes PCRE to recognize any Unicode newline sequence.
264    
265           Whatever  line  ending convention is selected when PCRE is built can be
266           overridden when the library functions are called. At build time  it  is
267           conventional to use the standard for your operating system.
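
The newline conventions above differ only in which byte sequences are
taken to end a line. The following C fragment is a rough sketch of that
idea, not PCRE's internal code: the enum and the function name are
hypothetical, and for brevity the NL_ANY case covers only the
single-byte sequences (LF, VT, FF, CR) plus CRLF, leaving out the
multi-byte Unicode newlines.

```c
#include <stddef.h>

/* Hypothetical names, for illustration only. */
enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_ANY };

/* Return the length in bytes of the line ending that starts at s[i],
   or 0 if no line ending starts there under the given convention. */
static size_t newline_length(const char *s, size_t len, size_t i,
                             enum nl_convention conv)
{
    if (i >= len) return 0;
    int is_crlf = (s[i] == '\r' && i + 1 < len && s[i + 1] == '\n');

    switch (conv) {
    case NL_CR:      return s[i] == '\r' ? 1 : 0;
    case NL_LF:      return s[i] == '\n' ? 1 : 0;
    case NL_CRLF:    return is_crlf ? 2 : 0;
    case NL_ANYCRLF:
        if (is_crlf) return 2;
        return (s[i] == '\r' || s[i] == '\n') ? 1 : 0;
    case NL_ANY:
        if (is_crlf) return 2;
        /* Single-byte cases only (ASCII 10..13): LF, VT, FF, CR. */
        return (s[i] >= '\n' && s[i] <= '\r') ? 1 : 0;
    }
    return 0;
}
```

Under this sketch, the CR in "a\r\nb" counts as a one-byte line ending
for NL_CR, as the start of a two-byte ending for NL_CRLF and NL_ANYCRLF,
and as no ending at all for NL_LF.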
268    
269    
270    WHAT \R MATCHES
271    
272       The PCRE building process uses libtool to build both  shared         By  default,  the  sequence \R in a pattern matches any Unicode newline
273       and  static  Unix libraries by default. You can suppress one         sequence, whatever has been selected as the line  ending  sequence.  If
274       of these by adding one of         you specify
275    
276         --disable-shared           --enable-bsr-anycrlf
        --disable-static  
277    
278       to the configure command, as required.         the  default  is changed so that \R matches only CR, LF, or CRLF. What-
279           ever is selected when PCRE is built can be overridden when the  library
280           functions are called.
281    
282    
283  POSIX MALLOC USAGE  POSIX MALLOC USAGE
284    
285       When PCRE is called through the  POSIX  interface  (see  the         When PCRE is called through the POSIX interface (see the pcreposix doc-
286       pcreposix  documentation),  additional  working  storage  is         umentation), additional working storage is  required  for  holding  the
287       required for holding the pointers  to  capturing  substrings         pointers  to capturing substrings, because PCRE requires three integers
288       because  PCRE requires three integers per substring, whereas         per substring, whereas the POSIX interface provides only  two.  If  the
289       the POSIX interface provides only  two.  If  the  number  of         number of expected substrings is small, the wrapper function uses space
290       expected  substrings  is  small,  the  wrapper function uses         on the stack, because this is faster than using malloc() for each call.
291       space on the stack, because this is faster than  using  mal-         The default threshold above which the stack is no longer used is 10; it
292       loc()  for  each call. The default threshold above which the         can be changed by adding a setting such as
      stack is no longer used is 10; it can be changed by adding a  
      setting such as  
293    
294         --with-posix-malloc-threshold=20           --with-posix-malloc-threshold=20
295    
296       to the configure command.         to the configure command.
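
The stack-versus-heap decision described under POSIX MALLOC USAGE can be
pictured with the C sketch below. It is an illustration only, not the
wrapper's actual source: the macro and function names are hypothetical,
and the matching step is replaced by a placeholder loop. It assumes, as
the text states, three integers per substring and a threshold above
which the stack is no longer used.

```c
#include <stdlib.h>

/* Stands in for the value set by --with-posix-malloc-threshold;
   the macro name is illustrative. */
#define MALLOC_THRESHOLD 10

/* Hypothetical wrapper: obtains working space for nmatch substrings,
   three ints each, and reports whether the heap was needed.
   Returns 0 on success, -1 if memory could not be obtained. */
static int run_with_workspace(size_t nmatch, int *used_heap)
{
    int small[MALLOC_THRESHOLD * 3];   /* stack space for the common case */
    int *ovector = small;

    *used_heap = 0;
    if (nmatch > MALLOC_THRESHOLD) {
        ovector = malloc(nmatch * 3 * sizeof(int));
        if (ovector == NULL) return -1;
        *used_heap = 1;
    }

    /* Placeholder for the real matching call: just fill the workspace. */
    for (size_t i = 0; i < nmatch * 3; i++) ovector[i] = -1;

    if (*used_heap) free(ovector);
    return 0;
}
```

With the threshold at 10, a request for 10 substrings stays on the stack
and a request for 11 goes to malloc().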
297    
298    
299    HANDLING VERY LARGE PATTERNS
300    
301           Within a compiled pattern, offset values are used  to  point  from  one
302           part  to another (for example, from an opening parenthesis to an alter-
303           nation metacharacter). By default, two-byte values are used  for  these
304           offsets,  leading  to  a  maximum size for a compiled pattern of around
305           64K. This is sufficient to handle all but the most  gigantic  patterns.
306           Nevertheless,  some  people do want to process truyl enormous patterns,
307           so it is possible to compile PCRE to use three-byte or  four-byte  off-
308           sets by adding a setting such as
309    
310             --with-link-size=3
311    
312           to  the  configure  command.  The value given must be 2, 3, or 4. Using
313           longer offsets slows down the operation of PCRE because it has to  load
314           additional bytes when handling them.
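
To see why the link size bounds the pattern size, and why longer links
cost time, consider the C sketch below. It is not taken from the PCRE
source: the function names are hypothetical and the most-significant-
byte-first layout is an assumption made for the example. The exact
limits enforced by PCRE may differ slightly from the raw arithmetic
shown here.

```c
#include <stddef.h>

/* Store value in link_size bytes at p, most significant byte first.
   Bits that do not fit in link_size bytes are silently lost. */
static void put_offset(unsigned char *p, int link_size, unsigned long value)
{
    for (int i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xff);
        value >>= 8;
    }
}

/* Read back an offset: one load and one shift per byte, which is why
   a larger link size means more work for every offset handled. */
static unsigned long get_offset(const unsigned char *p, int link_size)
{
    unsigned long value = 0;
    for (int i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset that fits in link_size bytes (2, 3, or 4). */
static unsigned long max_offset(int link_size)
{
    return (link_size >= 4) ? 0xfffffffful
                            : (1ul << (8 * link_size)) - 1;
}
```

Two bytes can hold offsets up to 65535, which matches the "around 64K"
figure above; an offset of 70000 does not survive a two-byte round trip
but does survive a three-byte one.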
315    
316    
317    AVOIDING EXCESSIVE STACK USAGE
318    
319           When matching with the pcre_exec() function, PCRE implements backtrack-
320           ing by making recursive calls to an internal function  called  match().
321           In  environments  where  the size of the stack is limited, this can se-
322           verely limit PCRE's operation. (The Unix environment does  not  usually
323           suffer from this problem, but it may sometimes be necessary to increase
324           the maximum stack size.  There is a discussion in the  pcrestack  docu-
325           mentation.)  An alternative approach to recursion that uses memory from
326           the heap to remember data, instead of using recursive  function  calls,
327           has  been  implemented to work round the problem of limited stack size.
328           If you want to build a version of PCRE that works this way, add
329    
330             --disable-stack-for-recursion
331    
332           to the configure command. With this configuration, PCRE  will  use  the
333           pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
334           ment functions. By default these point to malloc() and free(), but  you
335           can replace the pointers so that your own functions are used instead.
336    
337           Separate  functions  are  provided  rather  than  using pcre_malloc and
338           pcre_free because the  usage  is  very  predictable:  the  block  sizes
339           requested  are  always  the  same,  and  the blocks are always freed in
340           reverse order. A calling program might be able to  implement  optimized
341           functions  that  perform  better  than  malloc()  and free(). PCRE runs
342           noticeably more slowly when built in this way. This option affects only
343           the pcre_exec() function; it is not relevant for pcre_dfa_exec().
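
Because the blocks are always freed in reverse order, a replacement for
malloc() and free() can be as simple as a bump allocator over a fixed
arena. The C sketch below shows one possible shape. Everything in it is
an assumption made for illustration: the names my_stack_malloc and
my_stack_free, the arena size, and the 16-byte rounding are not part of
PCRE, and the code is not thread-safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define ARENA_SIZE 65536   /* illustrative capacity in bytes */
#define ALIGNMENT  16      /* block sizes are rounded up to this */

/* The union forces the byte array to be suitably aligned. */
static union {
    long double   ld;
    long long     ll;
    void         *ptr;
    unsigned char bytes[ARENA_SIZE];
} arena;

static size_t arena_top = 0;   /* offset of the first free byte */

static int in_arena(const void *block)
{
    uintptr_t p = (uintptr_t)block;
    uintptr_t base = (uintptr_t)arena.bytes;
    return p >= base && p < base + ARENA_SIZE;
}

/* Hand out the next slice of the arena; fall back to malloc() when
   the request does not fit. */
static void *my_stack_malloc(size_t size)
{
    if (size > ARENA_SIZE) return malloc(size);
    size_t rounded = (size + ALIGNMENT - 1) / ALIGNMENT * ALIGNMENT;
    if (rounded > ARENA_SIZE - arena_top) return malloc(size);
    void *block = arena.bytes + arena_top;
    arena_top += rounded;
    return block;
}

/* Blocks come back in reverse order, so releasing an arena block just
   moves the top of the arena back to where that block started. */
static void my_stack_free(void *block)
{
    if (in_arena(block))
        arena_top = (size_t)((unsigned char *)block - arena.bytes);
    else
        free(block);
}
```

In a program linked against a PCRE built with
--disable-stack-for-recursion, functions of this shape would be
installed by assigning them to the pcre_stack_malloc and pcre_stack_free
variables before matching begins.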
344    
345    
346  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
347    
348       Internally, PCRE has a  function  called  match()  which  it         Internally,  PCRE has a function called match(), which it calls repeat-
349       calls  repeatedly  (possibly  recursively) when performing a         edly  (sometimes  recursively)  when  matching  a  pattern   with   the
350       matching operation. By limiting the  number  of  times  this         pcre_exec()  function.  By controlling the maximum number of times this
351       function  may  be  called,  a  limit  can  be  placed on the         function may be called during a single matching operation, a limit  can
352       resources used by a single call to  pcre_exec().  The  limit         be  placed  on  the resources used by a single call to pcre_exec(). The
353       can  be  changed  at  run  time, as described in the pcreapi         limit can be changed at run time, as described in the pcreapi  documen-
354       documentation. The default is 10 million, but  this  can  be         tation.  The default is 10 million, but this can be changed by adding a
355       changed by adding a setting such as         setting such as
356    
357         --with-match-limit=500000           --with-match-limit=500000
358    
359       to the configure command.         to  the  configure  command.  This  setting  has  no  effect   on   the
360           pcre_dfa_exec() matching function.
361    
362           In  some  environments  it is desirable to limit the depth of recursive
363           calls of match() more strictly than the total number of calls, in order
364           to  restrict  the maximum amount of stack (or heap, if --disable-stack-
365           for-recursion is specified) that is used. A second limit controls this;
366           it  defaults  to  the  value  that is set for --with-match-limit, which
367           imposes no additional constraints. However, you can set a  lower  limit
368           by adding, for example,
369    
370  HANDLING VERY LARGE PATTERNS           --with-match-limit-recursion=10000
371    
372       Within a compiled pattern, offset values are used  to  point         to  the  configure  command.  This  value can also be overridden at run
373       from  one  part  to  another  (for  example, from an opening         time.
      parenthesis to an  alternation  metacharacter).  By  default  
      two-byte  values  are  used  for these offsets, leading to a  
      maximum size for a compiled pattern of around 64K.  This  is  
      sufficient  to  handle  all  but the most gigantic patterns.  
      Nevertheless, some people do want to process  enormous  pat-  
      terns,  so  it is possible to compile PCRE to use three-byte  
      or four-byte offsets by adding a setting such as  
   
        --with-link-size=3  
   
      to the configure command. The value given must be 2,  3,  or  
      4.  Using  longer  offsets  slows down the operation of PCRE  
      because it has to load additional bytes when handling them.  
   
      If you build PCRE with an increased link size, test  2  (and  
      test 5 if you are using UTF-8) will fail. Part of the output  
      of these tests is a representation of the compiled  pattern,  
      and this changes with the link size.  
374    
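
The two limits described under LIMITING PCRE RESOURCE USAGE, one on the
total number of match() calls and one on the depth of recursion, can be
illustrated with a toy backtracking matcher. The C sketch below is not
PCRE's match() function: it handles only literal characters and "*"
(any run of characters), and the struct, the return codes, and their
values are hypothetical.

```c
#include <stddef.h>

/* Hypothetical return codes for this sketch. */
enum { NOMATCH = 0, MATCHED = 1, HIT_CALL_LIMIT = -1, HIT_DEPTH_LIMIT = -2 };

struct limits {
    unsigned long calls;      /* calls made so far; start at 0 */
    unsigned long max_calls;  /* plays the role of the match limit */
    unsigned long max_depth;  /* plays the role of the recursion limit */
};

/* Match pat against the whole of sub. Every call is counted, and the
   current recursion depth is checked separately. */
static int match(const char *pat, const char *sub,
                 struct limits *lim, unsigned long depth)
{
    if (++lim->calls > lim->max_calls) return HIT_CALL_LIMIT;
    if (depth > lim->max_depth) return HIT_DEPTH_LIMIT;

    if (*pat == '\0') return *sub == '\0' ? MATCHED : NOMATCH;

    if (*pat == '*') {
        /* Try every possible length for the star, shortest first. */
        for (const char *s = sub; ; s++) {
            int rc = match(pat + 1, s, lim, depth + 1);
            if (rc != NOMATCH) return rc;    /* success or a limit error */
            if (*s == '\0') return NOMATCH;
        }
    }

    if (*sub != '\0' && *pat == *sub)
        return match(pat + 1, sub + 1, lim, depth + 1);
    return NOMATCH;
}
```

Setting max_depth equal to max_calls, as the default described above
does for the real limits, means the depth check can never fire first; a
lower max_depth bounds stack use independently of the total work done.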
 Last updated: 21 January 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
375    
376  NAME  CREATING CHARACTER TABLES AT BUILD TIME
      PCRE - Perl-compatible regular expressions  
377    
378           PCRE uses fixed tables for processing characters whose code values  are
379           less  than 256. By default, PCRE is built with a set of tables that are
380           distributed in the file pcre_chartables.c.dist. These  tables  are  for
381           ASCII codes only. If you add
382    
383  SYNOPSIS OF PCRE API           --enable-rebuild-chartables
384    
385       #include <pcre.h>         to  the  configure  command, the distributed tables are no longer used.
386           Instead, a program called dftables is compiled and  run.  This  outputs
387           the source for a new set of tables, created in the default locale of your
388           C runtime system. (This method of replacing the tables does not work if
389           you  are cross compiling, because dftables is run on the local host. If
390           you need to create alternative tables when cross  compiling,  you  will
391           have to do so "by hand".)
392    
      pcre *pcre_compile(const char *pattern, int options,  
           const char **errptr, int *erroffset,  
           const unsigned char *tableptr);  
393    
394       pcre_extra *pcre_study(const pcre *code, int options,  USING EBCDIC CODE
           const char **errptr);  
395    
396       int pcre_exec(const pcre *code, const pcre_extra *extra,         PCRE  assumes  by  default that it will run in an environment where the
397            const char *subject, int length, int startoffset,         character code is ASCII (or Unicode, which is  a  superset  of  ASCII).
398            int options, int *ovector, int ovecsize);         This  is  the  case for most computer operating systems. PCRE can, how-
399           ever, be compiled to run in an EBCDIC environment by adding
400    
401       int pcre_copy_named_substring(const pcre *code,           --enable-ebcdic
           const char *subject, int *ovector,  
           int stringcount, const char *stringname,  
           char *buffer, int buffersize);  
402    
403       int pcre_copy_substring(const char *subject, int *ovector,         to the configure command. This setting implies --enable-rebuild-charta-
404            int stringcount, int stringnumber, char *buffer,         bles.  You  should  only  use  it if you know that you are in an EBCDIC
405            int buffersize);         environment (for example,  an  IBM  mainframe  operating  system).  The
406           --enable-ebcdic option is incompatible with --enable-utf8.
407    
      int pcre_get_named_substring(const pcre *code,  
           const char *subject, int *ovector,  
           int stringcount, const char *stringname,  
           const char **stringptr);  
408    
409       int pcre_get_stringnumber(const pcre *code,  PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
           const char *name);  
410    
411       int pcre_get_substring(const char *subject, int *ovector,         By default, pcregrep reads all files as plain text. You can build it so
412            int stringcount, int stringnumber,         that it recognizes files whose names end in .gz or .bz2, and reads them
413            const char **stringptr);         with libz or libbz2, respectively, by adding one or both of
414    
415       int pcre_get_substring_list(const char *subject,           --enable-pcregrep-libz
416            int *ovector, int stringcount, const char ***listptr);           --enable-pcregrep-libbz2
417    
418       void pcre_free_substring(const char *stringptr);         to the configure command. These options naturally require that the rel-
419           evant libraries are installed on your system. Configuration  will  fail
420           if they are not.
421    
      void pcre_free_substring_list(const char **stringptr);  
422    
423       const unsigned char *pcre_maketables(void);  PCREGREP BUFFER SIZE
424    
425       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         pcregrep  uses  an internal buffer to hold a "window" on the file it is
426            int what, void *where);         scanning, in order to be able to output "before" and "after" lines when
427           it  finds  a match. The size of the buffer is controlled by a parameter
428           whose default value is 20K. The buffer itself is three times this size,
429           but because of the way it is used for holding "before" lines, the long-
430           est line that is guaranteed to be processable is  the  parameter  size.
431           You can change the default parameter value by adding, for example,
432    
433             --with-pcregrep-bufsize=50K
434    
435       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);     to the configure command. The caller of pcregrep can, however, override
436           this value by specifying a run-time option.
437    
      int pcre_config(int what, void *where);  
438    
439       char *pcre_version(void);  PCRETEST OPTION FOR LIBREADLINE SUPPORT
440    
441       void *(*pcre_malloc)(size_t);         If you add
442    
443       void (*pcre_free)(void *);           --enable-pcretest-libreadline
444    
445       int (*pcre_callout)(pcre_callout_block *);         to the configure command,  pcretest  is  linked  with  the  libreadline
446           library,  and  when its input is from a terminal, it reads it using the
447           readline() function. This provides line-editing and history facilities.
448           Note that libreadline is GPL-licensed, so if you distribute a binary of
449           pcretest linked in this way, there may be licensing issues.
450    
451           Setting this option causes the -lreadline option to  be  added  to  the
452           pcretest  build.  In many operating environments with a system-installed
453           libreadline this is sufficient. However, in some environments (e.g.  if
454           an  unmodified  distribution version of readline is in use), some extra
455           configuration may be necessary. The INSTALL file for  libreadline  says
456           this:
457    
458  PCRE API           "Readline uses the termcap functions, but does not link with the
459             termcap or curses library itself, allowing applications which link
460             with readline the to choose an appropriate library."
461    
462       PCRE has its own native API,  which  is  described  in  this         If  your environment has not been set up so that an appropriate library
463       document.  There  is  also  a  set of wrapper functions that         is automatically included, you may need to add something like
      correspond to the POSIX regular expression API.   These  are  
      described in the pcreposix documentation.  
464    
465       The native API function prototypes are defined in the header           LIBS="-lncurses"
      file  pcre.h,  and  on  Unix  systems  the library itself is  
      called libpcre.a, so can be accessed by adding -lpcre to the  
      command  for  linking  an  application  which  calls it. The  
      header file defines the macros PCRE_MAJOR and PCRE_MINOR  to  
      contain the major and minor release numbers for the library.  
      Applications can use these to include support for  different  
      releases.  
466    
467       The functions pcre_compile(), pcre_study(), and  pcre_exec()         immediately before the configure command.
      are  used  for compiling and matching regular expressions. A  
      sample program that demonstrates the simplest way  of  using  
      them  is  given in the file pcredemo.c. The pcresample docu-  
      mentation describes how to run it.  
468    
      There are convenience functions for extracting captured sub-  
      strings from a matched subject string. They are:  
469    
470         pcre_copy_substring()  SEE ALSO
        pcre_copy_named_substring()  
        pcre_get_substring()  
        pcre_get_named_substring()  
        pcre_get_substring_list()  
471    
472       pcre_free_substring()  and  pcre_free_substring_list()   are         pcreapi(3), pcre_config(3).
      also  provided,  to  free  the  memory  used  for  extracted  
      strings.  
473    
      The function pcre_maketables() is used (optionally) to build  
      a  set of character tables in the current locale for passing  
      to pcre_compile().  
474    
475       The function pcre_fullinfo() is used to find out information  AUTHOR
      about a compiled pattern; pcre_info() is an obsolete version  
      which returns only some of the available information, but is  
      retained   for   backwards   compatibility.    The  function  
      pcre_version() returns a pointer to a string containing  the  
      version of PCRE and its date of release.  
476    
477       The global variables  pcre_malloc  and  pcre_free  initially         Philip Hazel
478       contain the entry points of the standard malloc() and free()         University Computing Service
479       functions respectively. PCRE  calls  the  memory  management         Cambridge CB2 3QH, England.
      functions  via  these  variables,  so  a calling program can  
      replace them if it  wishes  to  intercept  the  calls.  This  
      should be done before calling any PCRE functions.  
480    
      The global variable pcre_callout initially contains NULL. It  
      can be set by the caller to a "callout" function, which PCRE  
      will then call at specified points during a matching  opera-  
      tion. Details are given in the pcrecallout documentation.  
481    
482    REVISION
483    
484  MULTITHREADING         Last updated: 06 September 2011
485           Copyright (c) 1997-2011 University of Cambridge.
486    ------------------------------------------------------------------------------
487    
      The PCRE functions can be used in  multi-threading  applica-  
      tions, with the proviso that the memory management functions  
      pointed to by pcre_malloc and  pcre_free,  and  the  callout  
      function  pointed  to  by  pcre_callout,  are  shared by all  
      threads.  
   
      The compiled form of a regular  expression  is  not  altered  
      during  matching, so the same compiled pattern can safely be  
      used by several threads at once.  
488    
489    PCREMATCHING(3)                                                PCREMATCHING(3)
490    
 CHECKING BUILD-TIME OPTIONS  
491    
492       int pcre_config(int what, void *where);  NAME
493           PCRE - Perl-compatible regular expressions
494    
      The function pcre_config() makes  it  possible  for  a  PCRE  
      client  to  discover  which optional features have been com-  
      piled into the PCRE library. The pcrebuild documentation has  
      more details about these optional features.  
495    
496       The first argument for pcre_config() is an integer, specify-  PCRE MATCHING ALGORITHMS
      ing  which information is required; the second argument is a  
      pointer to a variable into which the information is  placed.  
      The following information is available:  
497    
498         PCRE_CONFIG_UTF8         This document describes the two different algorithms that are available
499           in PCRE for matching a compiled regular expression against a given sub-
500           ject  string.  The  "standard"  algorithm  is  the  one provided by the
501           pcre_exec() function.  This works in the same way  as  Perl's  matching
502           function, and provides a Perl-compatible matching operation.
503    
504       The output is an integer that is set to one if UTF-8 support         An  alternative  algorithm is provided by the pcre_dfa_exec() function;
505       is available; otherwise it is set to zero.         this operates in a different way, and is not  Perl-compatible.  It  has
506           advantages  and disadvantages compared with the standard algorithm, and
507           these are described below.
508    
509         PCRE_CONFIG_NEWLINE         When there is only one possible way in which a given subject string can
510           match  a pattern, the two algorithms give the same answer. A difference
511           arises, however, when there are multiple possibilities. For example, if
512           the pattern
513    
514       The output is an integer that is set to  the  value  of  the           ^<.*>
      code  that  is  used for the newline character. It is either  
      linefeed (10) or carriage return (13), and  should  normally  
      be the standard character for your operating system.  
515    
516         PCRE_CONFIG_LINK_SIZE         is matched against the string
517    
518       The output is an integer that contains the number  of  bytes           <something> <something else> <something further>
      used  for  internal linkage in compiled regular expressions.  
      The value is 2, 3, or 4. Larger values allow larger  regular  
      expressions  to be compiled, at the expense of slower match-  
      ing. The default value of 2 is sufficient for  all  but  the  
      most  massive patterns, since it allows the compiled pattern  
      to be up to 64K in size.  
519    
520         PCRE_CONFIG_POSIX_MALLOC_THRESHOLD         there are three possible answers. The standard algorithm finds only one
521           of them, whereas the alternative algorithm finds all three.
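The difference can be sketched without calling PCRE at all. The fragment below is an illustration only, hand-written for the single pattern ^<.*>; the function names are hypothetical and are not part of the PCRE API. The first function mimics a depth-first search with a greedy ".*" and stops at the first success; the second makes one forward scan and records every possible end point.

```c
#include <stddef.h>
#include <string.h>

/* Hand-rolled matcher for the single pattern ^<.*> (not PCRE code).
 * Depth-first with a greedy ".*": take as many characters as possible,
 * then back up one at a time until a '>' can be matched.
 * Returns the length of the match, or 0 if there is none. */
size_t first_match_depth_first(const char *s)
{
    size_t n = strlen(s);
    if (n == 0 || s[0] != '<')
        return 0;
    for (size_t end = n; end >= 2; end--) {   /* back up from the right */
        if (s[end - 1] == '>')
            return end;                       /* stop at the first success */
    }
    return 0;
}

/* Breadth-first flavour: one left-to-right scan that records every
 * position at which the pattern could end. Match lengths are stored in
 * lens[] in decreasing order; at most max are kept and any further
 * ones are dropped. Returns the number of lengths stored. */
size_t all_matches_breadth_first(const char *s, size_t *lens, size_t max)
{
    size_t n = strlen(s), count = 0;
    if (n == 0 || s[0] != '<')
        return 0;
    for (size_t i = 1; i < n; i++)            /* single forward scan */
        if (s[i] == '>' && count < max)
            lens[count++] = i + 1;
    /* reverse so that the longest match comes first */
    for (size_t a = 0, b = count; a + 1 < b; a++, b--) {
        size_t t = lens[a];
        lens[a] = lens[b - 1];
        lens[b - 1] = t;
    }
    return count;
}
```

For the subject string shown above, the first function reports a single match covering the whole string, while the second reports three lengths, one for each ">" in the subject.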
522    
      The output is an integer that contains the  threshold  above  
      which  the POSIX interface uses malloc() for output vectors.  
      Further details are given in the pcreposix documentation.  
523    
524         PCRE_CONFIG_MATCH_LIMIT  REGULAR EXPRESSIONS AS TREES
525    
526       The output is an integer that gives the  default  limit  for         The set of strings that are matched by a regular expression can be rep-
527       the   number  of  internal  matching  function  calls  in  a         resented  as  a  tree structure. An unlimited repetition in the pattern
528       pcre_exec()  execution.  Further  details  are  given   with         makes the tree of infinite size, but it is still a tree.  Matching  the
529       pcre_exec() below.         pattern  to a given subject string (from a given starting point) can be
530           thought of as a search of the tree.  There are two  ways  to  search  a
531           tree:  depth-first  and  breadth-first, and these correspond to the two
532           matching algorithms provided by PCRE.
533    
534    
535  COMPILING A PATTERN  THE STANDARD MATCHING ALGORITHM
536    
537       pcre *pcre_compile(const char *pattern, int options,         In the terminology of Jeffrey Friedl's book "Mastering Regular  Expres-
538            const char **errptr, int *erroffset,         sions",  the  standard  algorithm  is an "NFA algorithm". It conducts a
539            const unsigned char *tableptr);         depth-first search of the pattern tree. That is, it  proceeds  along  a
540           single path through the tree, checking that the subject matches what is
541       The function pcre_compile() is called to compile  a  pattern         required. When there is a mismatch, the algorithm  tries  any  alterna-
542       into  an internal form. The pattern is a C string terminated         tives  at  the  current point, and if they all fail, it backs up to the
543       by a binary zero, and is passed in the argument  pattern.  A         previous branch point in the  tree,  and  tries  the  next  alternative
544       pointer  to  a  single  block of memory that is obtained via         branch  at  that  level.  This often involves backing up (moving to the
545       pcre_malloc is returned. This contains the compiled code and         left) in the subject string as well.  The  order  in  which  repetition
546       related  data.  The  pcre  type  is defined for the returned         branches  are  tried  is controlled by the greedy or ungreedy nature of
547       block; this is a typedef for a structure whose contents  are         the quantifier.
      not  externally  defined. It is up to the caller to free the  
      memory when it is no longer required.  
   
      Although the compiled code of a PCRE regex  is  relocatable,  
      that is, it does not depend on memory location, the complete  
      pcre data block is not fully relocatable,  because  it  con-  
      tains  a  copy of the tableptr argument, which is an address  
      (see below).  
      The options argument contains independent bits  that  affect  
      the  compilation.  It  should  be  zero  if  no  options are  
      required. Some of the options, in particular, those that are  
      compatible  with Perl, can also be set and unset from within  
      the pattern (see the detailed description of regular expres-  
      sions  in the pcrepattern documentation). For these options,  
      the contents of the options argument specifies their initial  
      settings  at  the  start  of  compilation and execution. The  
      PCRE_ANCHORED option can be set at the time of  matching  as  
      well as at compile time.  
   
      If errptr is NULL, pcre_compile() returns NULL  immediately.  
      Otherwise, if compilation of a pattern fails, pcre_compile()  
      returns NULL, and sets the variable pointed to by errptr  to  
      point  to a textual error message. The offset from the start  
      of  the  pattern  to  the  character  where  the  error  was  
      discovered   is   placed  in  the  variable  pointed  to  by  
      erroffset, which must not be NULL. If it  is,  an  immediate  
      error is given.  
   
      If the final  argument,  tableptr,  is  NULL,  PCRE  uses  a  
      default  set  of character tables which are built when it is  
      compiled, using the default C  locale.  Otherwise,  tableptr  
      must  be  the result of a call to pcre_maketables(). See the  
      section on locale support below.  
   
      This code fragment shows a typical straightforward  call  to  
      pcre_compile():  
   
        pcre *re;  
        const char *error;  
        int erroffset;  
        re = pcre_compile(  
          "^A.*Z",          /* the pattern */  
          0,                /* default options */  
          &error,           /* for error message */  
          &erroffset,       /* for error offset */  
          NULL);            /* use default character tables */  
   
      The following option bits are defined:  
   
        PCRE_ANCHORED  
   
      If this bit is set, the pattern is forced to be  "anchored",  
      that is, it is constrained to match only at the first match-  
      ing point in the string which is being searched  (the  "sub-  
      ject string"). This effect can also be achieved by appropri-  
      ate constructs in the pattern itself, which is the only  way  
      to do it in Perl.  
   
        PCRE_CASELESS  
   
      If this bit is set, letters in the pattern match both  upper  
      and  lower  case  letters.  It  is  equivalent  to Perl's /i  
      option, and it can be changed within a  pattern  by  a  (?i)  
      option setting.  
   
        PCRE_DOLLAR_ENDONLY  
   
      If this bit is set, a dollar metacharacter  in  the  pattern  
      matches  only at the end of the subject string. Without this  
      option, a dollar also matches immediately before  the  final  
      character  if it is a newline (but not before any other new-  
      lines).  The  PCRE_DOLLAR_ENDONLY  option  is   ignored   if  
      PCRE_MULTILINE is set. There is no equivalent to this option  
      in Perl, and no way to set it within a pattern.  
   
        PCRE_DOTALL  
   
      If this bit is  set,  a  dot  metacharacter  in  the  pattern  
      matches all characters, including newlines. Without it, new-  
      lines are excluded. This option is equivalent to  Perl's  /s  
      option,  and  it  can  be changed within a pattern by a (?s)  
      option setting. A negative class such as [^a] always matches  
      a  newline  character,  independent  of  the setting of this  
      option.  
   
        PCRE_EXTENDED  
   
      If this bit is set, whitespace data characters in  the  pat-  
      tern  are  totally  ignored  except when escaped or inside a  
      character class. Whitespace does not include the VT  charac-  
      ter  (code 11). In addition, characters between an unescaped  
      # outside a character class and the next newline  character,  
      inclusive, are also ignored. This is equivalent to Perl's /x  
      option, and it can be changed within a  pattern  by  a  (?x)  
      option setting.  
   
      This option makes it possible  to  include  comments  inside  
      complicated patterns.  Note, however, that this applies only  
      to data characters. Whitespace characters may  never  appear  
      within special character sequences in a pattern, for example  
      within the sequence (?( which introduces a conditional  sub-  
      pattern.  
   
        PCRE_EXTRA  
   
      This option was invented in  order  to  turn  on  additional  
      functionality of PCRE that is incompatible with Perl, but it  
      is currently of very little use. When set, any backslash  in  
      a  pattern  that is followed by a letter that has no special  
      meaning causes an error, thus reserving  these  combinations  
      for  future  expansion.  By default, as in Perl, a backslash  
      followed by a letter with no special meaning is treated as a  
      literal.  There  are at present no other features controlled  
      by this option. It can also be set by a (?X) option  setting  
      within a pattern.  
   
        PCRE_MULTILINE  
   
      By default, PCRE treats the subject string as consisting  of  
      a  single "line" of characters (even if it actually contains  
      several newlines). The "start  of  line"  metacharacter  (^)  
      matches  only  at the start of the string, while the "end of  
      line" metacharacter ($) matches  only  at  the  end  of  the  
      string,    or   before   a   terminating   newline   (unless  
      PCRE_DOLLAR_ENDONLY is set). This is the same as Perl.  
   
      When PCRE_MULTILINE is set, the "start of line" and  "end  
      of  line"  constructs match immediately following or immedi-  
      ately before any newline  in  the  subject  string,  respec-  
      tively,  as  well  as  at  the  very  start and end. This is  
      equivalent to Perl's /m option, and it can be changed within  
      a  pattern  by  a  (?m) option setting. If there are no "\n"  
      characters in a subject string, or no occurrences of ^ or  $  
      in a pattern, setting PCRE_MULTILINE has no effect.  
   
        PCRE_NO_AUTO_CAPTURE  
   
      If this option is set, it disables the use of numbered  cap-  
      turing  parentheses  in the pattern. Any opening parenthesis  
      that is not followed by ? behaves as if it were followed  by  
      ?:  but  named  parentheses  can still be used for capturing  
      (and they acquire numbers in the usual  way).  There  is  no  
      equivalent of this option in Perl.  
   
        PCRE_UNGREEDY  
   
      This option inverts the "greediness" of the  quantifiers  so  
      that  they  are  not greedy by default, but become greedy if  
      followed by "?". It is not compatible with Perl. It can also  
      be set by a (?U) option setting within the pattern.  
   
        PCRE_UTF8  
   
      This option causes PCRE to regard both the pattern  and  the  
      subject  as  strings  of UTF-8 characters instead of single-  
      byte character strings. However, it  is  available  only  if  
      PCRE  has  been  built to include UTF-8 support. If not, the  
      use of this option provokes an error. Details  of  how  this  
      option  changes  the behaviour of PCRE are given in the sec-  
      tion on UTF-8 support in the main pcre page.  
548    
549           If a leaf node is reached, a matching string has  been  found,  and  at
550           that  point the algorithm stops. Thus, if there is more than one possi-
551           ble match, this algorithm returns the first one that it finds.  Whether
552           this  is the shortest, the longest, or some intermediate length depends
553           on the way the greedy and ungreedy repetition quantifiers are specified
554           in the pattern.
555    
556  STUDYING A PATTERN         Because  it  ends  up  with a single path through the tree, it is rela-
557           tively straightforward for this algorithm to keep  track  of  the  sub-
558           strings  that  are  matched  by portions of the pattern in parentheses.
559           This provides support for capturing parentheses and back references.
560    
      pcre_extra *pcre_study(const pcre *code, int options,  
           const char **errptr);  
561    
562       When a pattern is going to be  used  several  times,  it  is  THE ALTERNATIVE MATCHING ALGORITHM
      worth  spending  more time analyzing it in order to speed up  
      the time taken for matching. The function pcre_study() takes  
      a  pointer  to  a compiled pattern as its first argument. If  
      studying the pattern  produces  additional  information  that  
      will  help speed up matching, pcre_study() returns a pointer  
      to a pcre_extra block, in which the study_data field  points  
      to the results of the study.  
   
      The  returned  value  from  a  pcre_study()  can  be  passed  
      directly  to pcre_exec(). However, the pcre_extra block also  
      contains other fields that can be set by the  caller  before  
      the  block is passed; these are described below. If studying  
      the pattern does not  produce  any  additional  information,  
      pcre_study() returns NULL. In that circumstance, if the cal-  
      ling program wants to pass  some  of  the  other  fields  to  
      pcre_exec(), it must set up its own pcre_extra block.  
   
      The second argument contains option  bits.  At  present,  no  
      options  are  defined  for  pcre_study(),  and this argument  
      should always be zero.  
   
      The third argument for pcre_study()  is  a  pointer  for  an  
      error  message.  If  studying  succeeds  (even if no data is  
      returned), the variable it points to is set to NULL.  Other-  
      wise it points to a textual error message. You should there-  
      fore  test  the  error  pointer  for  NULL   after   calling  
      pcre_study(), to be sure that it has run successfully.  
   
      This is a typical call to pcre_study():  
   
        pcre_extra *pe;  
        pe = pcre_study(  
          re,             /* result of pcre_compile() */  
          0,              /* no options exist */  
          &error);        /* set to NULL or points to a message */  
   
      At present, studying a  pattern  is  useful  only  for  non-  
      anchored  patterns  that do not have a single fixed starting  
      character. A  bitmap  of  possible  starting  characters  is  
      created.  
563    
564           This algorithm conducts a breadth-first search of  the  tree.  Starting
565           from  the  first  matching  point  in the subject, it scans the subject
566           string from left to right, once, character by character, and as it does
567           this,  it remembers all the paths through the tree that represent valid
568           matches. In Friedl's terminology, this is a kind  of  "DFA  algorithm",
569           though  it is not implemented as a traditional finite state machine (it
570           keeps multiple states active simultaneously).
571    
572  LOCALE SUPPORT         Although the general principle of this matching algorithm  is  that  it
573           scans  the subject string only once, without backtracking, there is one
574           exception: when a lookaround assertion is encountered,  the  characters
575           following  or  preceding  the  current  point  have to be independently
576           inspected.
577    
578       PCRE handles caseless matching, and determines whether char-         The scan continues until either the end of the subject is  reached,  or
579       acters  are  letters, digits, or whatever, by reference to a         there  are  no more unterminated paths. At this point, terminated paths
580       set of tables. When running in UTF-8 mode, this applies only         represent the different matching possibilities (if there are none,  the
581       to characters with codes less than 256. The library contains         match  has  failed).   Thus,  if there is more than one possible match,
582       a default set of tables that is created  in  the  default  C         this algorithm finds all of them, and in particular, it finds the long-
583       locale  when  PCRE  is compiled. This is used when the final         est.  The  matches are returned in decreasing order of length. There is
584       argument of pcre_compile() is NULL, and  is  sufficient  for         an option to stop the algorithm after the first match (which is  neces-
585       many applications.         sarily the shortest) is found.
   
      An alternative set of tables can, however, be supplied. Such  
      tables  are built by calling the pcre_maketables() function,  
      which has no arguments, in the relevant locale.  The  result  
      can  then be passed to pcre_compile() as often as necessary.  
      For example, to build and use tables  that  are  appropriate  
      for  the French locale (where accented characters with codes  
      greater than 128 are treated as letters), the following code  
      could be used:  
   
        setlocale(LC_CTYPE, "fr");  
        tables = pcre_maketables();  
        re = pcre_compile(..., tables);  
   
      The  tables  are  built  in  memory  that  is  obtained  via  
      pcre_malloc.  The  pointer that is passed to pcre_compile is  
      saved with the compiled pattern, and  the  same  tables  are  
      used via this pointer by pcre_study() and pcre_exec(). Thus,  
      for any single pattern, compilation, studying  and  matching  
      all happen in the same locale, but different patterns can be  
      compiled in different locales. It is the caller's  responsi-  
      bility  to  ensure  that  the  memory  containing the tables  
      remains available for as long as it is needed.  
586    
587           Note that all the matches that are found start at the same point in the
588           subject. If the pattern
589    
590  INFORMATION ABOUT A PATTERN           cat(er(pillar)?)?
591    
592       int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         is matched against the string "the caterpillar catchment",  the  result
593            int what, void *where);         will  be the three strings "caterpillar", "cater", and "cat" that start
594           at the fifth character of the subject. The algorithm does not automati-
595           cally move on to find matches that start at later positions.
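The behaviour described for this example can be imitated with a small state chain. This is a sketch under stated assumptions, not the way PCRE implements pcre_dfa_exec(): it is hard-wired to the pattern cat(er(pillar)?)?, and the names matches_at, first_matching_point and shortest_only are hypothetical. Because the pattern accepts exactly "cat", "cater" and "caterpillar", one forward scan along "caterpillar" passes every accepting state.

```c
#include <stddef.h>
#include <string.h>

/* Illustration only, not PCRE code: cat(er(pillar)?)? accepts exactly
 * "cat", "cater" and "caterpillar", so its states form one chain
 * through "caterpillar" with accepting states after 3, 5 and 11
 * characters. */
static int is_accepting(size_t consumed)
{
    return consumed == 3 || consumed == 5 || consumed == 11;
}

/* Scan forward once from offset start, recording each accepting state
 * that is passed. Lengths go into lens[] longest first; the return
 * value is the number of matches (0 if none start here). If
 * shortest_only is nonzero the scan stops at the first match. */
size_t matches_at(const char *subject, size_t start,
                  size_t lens[3], int shortest_only)
{
    static const char chain[] = "caterpillar";
    size_t found[3], count = 0;

    for (size_t i = 0; chain[i] != '\0'; i++) {
        if (subject[start + i] != chain[i])
            break;                      /* path dies; no backtracking */
        if (is_accepting(i + 1)) {
            found[count++] = i + 1;
            if (shortest_only)
                break;
        }
    }
    for (size_t k = 0; k < count; k++)  /* report longest first */
        lens[k] = found[count - 1 - k];
    return count;
}

/* Find the first offset at which any match starts. Once a starting
 * point succeeds, later starting points are not tried. Returns the
 * offset, or the subject length if nothing matches. */
size_t first_matching_point(const char *subject, size_t lens[3],
                            size_t *nmatches)
{
    size_t n = strlen(subject);
    for (size_t start = 0; start < n; start++) {
        *nmatches = matches_at(subject, start, lens, 0);
        if (*nmatches > 0)
            return start;
    }
    *nmatches = 0;
    return n;
}
```

Calling first_matching_point() on "the caterpillar catchment" yields the three lengths for "caterpillar", "cater" and "cat", all starting at the fifth character. The "cat" in "catchment" is found only if the caller explicitly restarts the scan at a later offset.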
596    
597       The pcre_fullinfo() function  returns  information  about  a         There are a number of features of PCRE regular expressions that are not
598       compiled pattern. It replaces the obsolete pcre_info() func-         supported by the alternative matching algorithm. They are as follows:
      tion, which is nevertheless retained for backwards compatibil-  
      ity (and is documented below).  
   
      The first argument for pcre_fullinfo() is a pointer  to  the  
      compiled  pattern.  The  second  argument  is  the result of  
      pcre_study(), or NULL if the pattern was  not  studied.  The  
      third  argument  specifies  which  piece  of  information is  
      required, and the fourth argument is a pointer to a variable  
      to  receive  the data. The yield of the function is zero for  
      success, or one of the following negative numbers:  
   
        PCRE_ERROR_NULL       the argument code was NULL  
                              the argument where was NULL  
        PCRE_ERROR_BADMAGIC   the "magic number" was not found  
        PCRE_ERROR_BADOPTION  the value of what was invalid  
   
      Here is a typical call of  pcre_fullinfo(),  to  obtain  the  
      length of the compiled pattern:  
   
        int rc;  
        size_t length;
        rc = pcre_fullinfo(  
          re,               /* result of pcre_compile() */  
          pe,               /* result of pcre_study(), or NULL */  
          PCRE_INFO_SIZE,   /* what is required */  
          &length);         /* where to put the data */  
   
      The possible values for the third argument  are  defined  in  
      pcre.h, and are as follows:  
   
        PCRE_INFO_BACKREFMAX  
   
      Return the number of the highest back reference in the  pat-  
      tern.  The  fourth argument should point to an int variable.  
      Zero is returned if there are no back references.  
   
        PCRE_INFO_CAPTURECOUNT  
   
      Return the number of capturing subpatterns in  the  pattern.  
      The fourth argument should point to an int variable.  
   
        PCRE_INFO_FIRSTBYTE  
   
      Return information about  the  first  byte  of  any  matched  
      string,  for a non-anchored pattern. (This option used to be  
      called PCRE_INFO_FIRSTCHAR; the old name is still recognized  
      for backwards compatibility.)  
   
      If there is a fixed first byte, e.g. from a pattern such  as  
      (cat|cow|coyote),  it  is returned in the integer pointed to  
      by where. Otherwise, if either  
   
      (a) the pattern was compiled with the PCRE_MULTILINE option,  
      and every branch starts with "^", or  
   
      (b) every  branch  of  the  pattern  starts  with  ".*"  and  
      PCRE_DOTALL is not set (if it were set, the pattern would be  
      anchored),  
   
      -1 is returned, indicating that the pattern matches only  at  
      the  start  of  a subject string or after any newline within  
      the string. Otherwise -2 is returned. For anchored patterns,  
      -2 is returned.  
   
        PCRE_INFO_FIRSTTABLE  
   
      If the pattern was studied, and this resulted  in  the  con-  
      struction of a 256-bit table indicating a fixed set of bytes  
      for the first byte in any matching string, a pointer to  the  
      table  is  returned.  Otherwise NULL is returned. The fourth  
      argument should point to an unsigned char * variable.  
   
        PCRE_INFO_LASTLITERAL  
   
      Return the value of the rightmost  literal  byte  that  must  
      exist  in  any  matched  string, other than at its start, if  
      such a byte has been recorded. The  fourth  argument  should  
      point  to  an  int variable. If there is no such byte, -1 is  
      returned. For anchored patterns,  a  last  literal  byte  is  
      recorded  only  if  it follows something of variable length.  
      For example, for the pattern /^a\d+z\d+/ the returned  value  
      is "z", but for /^a\dz\d/ the returned value is -1.  
   
        PCRE_INFO_NAMECOUNT  
        PCRE_INFO_NAMEENTRYSIZE  
        PCRE_INFO_NAMETABLE  
   
      PCRE supports the use of named as well as numbered capturing  
      parentheses. The names are just an additional way of identi-  
      fying the parentheses,  which  still  acquire  a  number.  A  
      caller  that  wants  to extract data from a named subpattern  
      must convert the name to a number in  order  to  access  the  
      correct  pointers  in  the  output  vector  (described  with  
      pcre_exec() below). In order to do this, it must  first  use  
      these  three  values  to  obtain  the name-to-number mapping  
      table for the pattern.  
   
      The  map  consists  of  a  number  of  fixed-size   entries.  
      PCRE_INFO_NAMECOUNT   gives   the  number  of  entries,  and  
      PCRE_INFO_NAMEENTRYSIZE gives the size of each  entry;  both  
      of  these return an int value. The entry size depends on the  
      length of the longest name.  PCRE_INFO_NAMETABLE  returns  a  
      pointer to the first entry of the table (a pointer to char).  
      The first two bytes of each entry are the number of the cap-  
      turing parenthesis, most significant byte first. The rest of  
      the entry is the corresponding name,  zero  terminated.  The  
      names  are  in alphabetical order. For example, consider the  
      following pattern (assume PCRE_EXTENDED  is  set,  so  white  
      space - including newlines - is ignored):  
   
        (?P<date> (?P<year>(\d\d)?\d\d) -  
        (?P<month>\d\d) - (?P<day>\d\d) )  
   
      There are four named subpatterns,  so  the  table  has  four  
      entries,  and  each  entry in the table is eight bytes long.  
      The table is as follows, with non-printing  bytes  shown  in
      hex, and undefined bytes shown as ??:
   
        00 01 d  a  t  e  00 ??  
        00 05 d  a  y  00 ?? ??  
        00 04 m  o  n  t  h  00  
        00 02 y  e  a  r  00 ??  
   
      When writing code to extract data  from  named  subpatterns,  
      remember  that the length of each entry may be different for  
      each compiled pattern.  
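
      The table layout described above can be searched with a few lines
      of C. The following is a minimal sketch, not part of the PCRE API:
      lookup_group_number() and example_table are hypothetical names
      introduced for illustration. In a real program the table pointer,
      the entry count, and the entry size would come from
      PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT, and
      PCRE_INFO_NAMEENTRYSIZE; here the table from the example is
      hard-coded, with the undefined (??) bytes assumed to be zero.

```c
#include <stddef.h>
#include <string.h>

/* Sample table copied from the example above: four entries of eight
   bytes each. The undefined (??) bytes are filled with zero here. */
const unsigned char example_table[] = {
    0, 1, 'd', 'a', 't', 'e', 0,   0,
    0, 5, 'd', 'a', 'y', 0,   0,   0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0,   0
};

/* Hypothetical helper (not a PCRE function): return the number of the
   capturing parenthesis called `name`, or -1 if no entry matches. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + (size_t)i * (size_t)entry_size;
        /* Bytes 0-1: group number, most significant byte first.
           Bytes 2 onwards: the zero-terminated name. */
        if (strcmp((const char *)entry + 2, name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```

      For the sample table, lookup_group_number(example_table, 4, 8,
      "month") yields 4. The entry size is passed in rather than fixed
      because it can differ from one compiled pattern to the next.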
   
        PCRE_INFO_OPTIONS  
   
      Return a copy of the options with which the pattern was com-  
      piled.  The fourth argument should point to an unsigned long  
      int variable. These option bits are those specified  in  the  
      call  to  pcre_compile(),  modified  by any top-level option  
      settings within the pattern itself.  
   
      A pattern is automatically anchored by PCRE if  all  of  its  
      top-level alternatives begin with one of the following:  
   
        ^     unless PCRE_MULTILINE is set  
        \A    always  
        \G    always  
        .*    if PCRE_DOTALL is set and there are no back  
                references to the subpattern in which .* appears  
   
      For such patterns, the  PCRE_ANCHORED  bit  is  set  in  the  
      options returned by pcre_fullinfo().  
   
        PCRE_INFO_SIZE  
   
      Return the size of the compiled pattern, that is, the  value  
      that  was  passed as the argument to pcre_malloc() when PCRE  
      was getting memory in which to place the compiled data.  The  
      fourth argument should point to a size_t variable.  
   
        PCRE_INFO_STUDYSIZE  
   
      Returns the size  of  the  data  block  pointed  to  by  the  
      study_data  field  in a pcre_extra block. That is, it is the  
      value that was passed to pcre_malloc() when PCRE was getting  
      memory into which to place the data created by pcre_study().  
      The fourth argument should point to a size_t variable.  
599    
600           1. Because the algorithm finds all  possible  matches,  the  greedy  or
601           ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
602           ungreedy quantifiers are treated in exactly the same way. However, pos-
603           sessive  quantifiers can make a difference when what follows could also
604           match what is quantified, for example in a pattern like this:
605    
606  OBSOLETE INFO FUNCTION           ^a++\w!
607    
608       int pcre_info(const pcre *code, int *optptr, int *firstcharptr);     This pattern matches "aaab!" but not "aaa!", which would be matched  by
609           a  non-possessive quantifier. Similarly, if an atomic group is present,
610           it is matched as if it were a standalone pattern at the current  point,
611           and  the  longest match is then "locked in" for the rest of the overall
612           pattern.
613    
614       The pcre_info() function is now obsolete because its  inter-         2. When dealing with multiple paths through the tree simultaneously, it
615       face  is  too  restrictive  to return all the available data         is  not  straightforward  to  keep track of captured substrings for the
616       about  a  compiled  pattern.   New   programs   should   use         different matching possibilities, and  PCRE's  implementation  of  this
617       pcre_fullinfo()  instead.  The  yield  of pcre_info() is the         algorithm does not attempt to do this. This means that no captured sub-
618       number of capturing subpatterns, or  one  of  the  following         strings are available.
      negative numbers:  
   
        PCRE_ERROR_NULL       the argument code was NULL  
        PCRE_ERROR_BADMAGIC   the "magic number" was not found  
   
      If the optptr argument is not NULL, a copy  of  the  options  
      with which the pattern was compiled is placed in the integer  
      it points to (see PCRE_INFO_OPTIONS above).  
   
      If the pattern is not anchored and the firstcharptr argument  
      is  not  NULL, it is used to pass back information about the  
      first    character    of    any    matched    string    (see  
      PCRE_INFO_FIRSTBYTE above).  
619    
620           3. Because no substrings are captured, back references within the  pat-
621           tern are not supported, and cause errors if encountered.
622    
623  MATCHING A PATTERN         4.  For  the same reason, conditional expressions that use a backrefer-
624           ence as the condition or test for a specific group  recursion  are  not
625           supported.
626    
627       int pcre_exec(const pcre *code, const pcre_extra *extra,         5.  Because  many  paths  through the tree may be active, the \K escape
628            const char *subject, int length, int startoffset,         sequence, which resets the start of the match when encountered (but may
629            int options, int *ovector, int ovecsize);         be  on  some  paths  and not on others), is not supported. It causes an
630           error if encountered.
      The function pcre_exec() is called to match a subject string  
      against  a pre-compiled pattern, which is passed in the code  
      argument. If the pattern has been studied, the result of the  
      study should be passed in the extra argument.  
   
      Here is an example of a simple call to pcre_exec():  
   
        int rc;  
        int ovector[30];  
        rc = pcre_exec(  
          re,             /* result of pcre_compile() */  
          NULL,           /* we didn't study the pattern */  
          "some string",  /* the subject string */  
          11,             /* the length of the subject string */  
          0,              /* start at offset 0 in the subject */  
          0,              /* default options */  
          ovector,        /* vector for substring information */  
          30);            /* number of elements in the vector */  
   
      If the extra argument is  not  NULL,  it  must  point  to  a  
      pcre_extra  data  block.  The  pcre_study() function returns  
      such a block (when it doesn't return NULL), but you can also  
      create  one for yourself, and pass additional information in  
      it. The fields in the block are as follows:  
   
        unsigned long int flags;  
        void *study_data;  
        unsigned long int match_limit;  
        void *callout_data;  
   
      The flags field is a bitmap  that  specifies  which  of  the  
      other fields are set. The flag bits are:  
   
        PCRE_EXTRA_STUDY_DATA  
        PCRE_EXTRA_MATCH_LIMIT  
        PCRE_EXTRA_CALLOUT_DATA  
   
      Other flag bits should be set to zero. The study_data  field  
      is   set  in  the  pcre_extra  block  that  is  returned  by  
      pcre_study(), together with the appropriate  flag  bit.  You  
      should  not  set this yourself, but you can add to the block  
      by setting the other fields.  
   
      The match_limit field provides a means  of  preventing  PCRE  
      from  using  up a vast amount of resources when running pat-  
      terns that are not going to match, but  which  have  a  very  
      large  number  of  possibilities  in their search trees. The  
      classic example is the  use  of  nested  unlimited  repeats.  
      Internally,  PCRE  uses  a  function called match() which it  
      calls  repeatedly  (sometimes  recursively).  The  limit  is  
      imposed  on the number of times this function is called dur-  
      ing a match, which has the effect of limiting the amount  of  
      recursion and backtracking that can take place. For patterns  
      that are not anchored, the count starts from zero  for  each  
      position in the subject string.  
   
      The default limit for the library can be set  when  PCRE  is  
      built;  the default default is 10 million, which handles all  
      but the most extreme cases. You can reduce  the  default  by  
      supplying pcre_exec()  with  a  pcre_extra  block  in  which
      match_limit   is   set   to    a    smaller    value,    and  
      PCRE_EXTRA_MATCH_LIMIT  is  set  in  the flags field. If the  
      limit      is      exceeded,       pcre_exec()       returns  
      PCRE_ERROR_MATCHLIMIT.  
   
      The callout_data field is used in conjunction with the
      "callout" feature, which is described in the pcrecallout
      documentation.
   
      The PCRE_ANCHORED option can be passed in the options  argu-  
      ment,   whose   unused   bits  must  be  zero.  This  limits  
      pcre_exec() to matching at the first matching position. How-  
      ever,  if  a  pattern  was  compiled  with PCRE_ANCHORED, or  
      turned out to be anchored by virtue of its contents, it
      cannot be made unanchored at matching time.
   
      There are also three further options that can be set only at  
      matching time:  
   
        PCRE_NOTBOL  
   
      The first character of the string is not the beginning of  a  
      line,  so  the  circumflex  metacharacter  should  not match  
      before it. Setting this without PCRE_MULTILINE  (at  compile  
      time) causes circumflex never to match.  
   
        PCRE_NOTEOL  
   
      The end of the string is not the end of a line, so the  dol-  
      lar  metacharacter should not match it nor (except in multi-  
      line mode) a newline immediately  before  it.  Setting  this  
      without PCRE_MULTILINE (at compile time) causes dollar never  
      to match.  
   
        PCRE_NOTEMPTY  
   
      An empty string is not considered to be  a  valid  match  if  
      this  option  is  set. If there are alternatives in the pat-  
      tern, they are tried. If  all  the  alternatives  match  the  
      empty  string,  the  entire match fails. For example, if the  
      pattern  
   
        a?b?  
   
      is applied to a string not beginning with  "a"  or  "b",  it  
      matches  the  empty string at the start of the subject. With  
      PCRE_NOTEMPTY set, this match is not valid, so PCRE searches  
      further into the string for occurrences of "a" or "b".  
   
      Perl has no direct equivalent of PCRE_NOTEMPTY, but it  does  
      make  a  special case of a pattern match of the empty string  
      within its split() function, and when using the /g modifier.  
      It  is possible to emulate Perl's behaviour after matching a  
      null string by first trying the  match  again  at  the  same  
      offset  with  PCRE_NOTEMPTY  set,  and then if that fails by  
      advancing the starting offset  (see  below)  and  trying  an  
      ordinary match again.  
   
      The subject string is passed to pcre_exec() as a pointer  in  
      subject,  a length in length, and a starting offset in star-  
      toffset. Unlike the pattern string, the subject may  contain  
      binary  zero  bytes.  When  the starting offset is zero, the  
      search for a match starts at the beginning of  the  subject,  
      and this is by far the most common case.  
   
      If the pattern was compiled with the PCRE_UTF8  option,  the  
      subject  must  be  a sequence of bytes that is a valid UTF-8  
      string.  If  an  invalid  UTF-8  string  is  passed,  PCRE's  
      behaviour is not defined.  
   
      A non-zero starting offset  is  useful  when  searching  for  
      another  match  in  the  same subject by calling pcre_exec()  
      again after a previous success.  Setting startoffset differs  
      from  just  passing  over  a  shortened  string  and setting  
      PCRE_NOTBOL in the case of a pattern that  begins  with  any  
      kind of lookbehind. For example, consider the pattern  
   
        \Biss\B  
   
      which finds occurrences of "iss" in the middle of words. (\B  
      matches only if the current position in the subject is not a  
      word boundary.) When applied to the string "Mississippi" the
      first  call  to  pcre_exec()  finds the first occurrence. If
      pcre_exec() is called again with just the remainder  of  the
      subject,  namely "issippi", it does not match, because \B is
      always false at the start of the subject, which is deemed to  
      be  a  word  boundary. However, if pcre_exec() is passed the  
      entire string again, but with startoffset set to 4, it finds  
      the  second  occurrence  of "iss" because it is able to look  
      behind the starting point to discover that it is preceded by  
      a letter.  
   
      If a non-zero starting offset is passed when the pattern  is  
      anchored, one attempt to match at the given offset is tried.  
      This can only succeed if the pattern does  not  require  the  
      match to be at the start of the subject.  
   
      In general, a pattern matches a certain portion of the  sub-  
      ject,  and  in addition, further substrings from the subject  
      may be picked out by parts of  the  pattern.  Following  the  
      usage  in  Jeffrey Friedl's book, this is called "capturing"  
      in what follows, and the phrase  "capturing  subpattern"  is  
      used for a fragment of a pattern that picks out a substring.  
      PCRE supports several other kinds of  parenthesized  subpat-  
      tern that do not cause substrings to be captured.  
   
      Captured substrings are returned to the caller via a  vector  
      of  integer  offsets whose address is passed in ovector. The  
      number of elements in the vector is passed in ovecsize.  The  
      first two-thirds of the vector is used to pass back captured  
      substrings, each substring using a  pair  of  integers.  The  
      remaining  third  of  the  vector  is  used  as workspace by  
      pcre_exec() while matching capturing subpatterns, and is not  
      available for passing back information. The length passed in  
      ovecsize should always be a multiple of three. If it is not,  
      it is rounded down.  
   
      When a match has been successful, information about captured  
      substrings is returned in pairs of integers, starting at the  
      beginning of ovector, and continuing up to two-thirds of its  
      length  at  the  most. The first element of a pair is set to  
      the offset of the first character in a  substring,  and  the  
      second is set to the offset of the first character after the  
      end of a substring. The first  pair,  ovector[0]  and  ovec-  
      tor[1],  identify  the portion of the subject string matched  
      by the entire pattern. The next pair is used for  the  first  
      capturing  subpattern,  and  so  on.  The  value returned by  
      pcre_exec() is the number of pairs that have  been  set.  If  
      there  are no capturing subpatterns, the return value from a  
      successful match is 1, indicating that just the  first  pair  
      of offsets has been set.  
   
      Some convenience functions are provided for  extracting  the  
      captured substrings as separate strings. These are described  
      in the following section.  
   
      It is possible for a capturing  subpattern  number  n+1  to
      match  some  part  of  the subject when subpattern n has not  
      been used at all.  For  example,  if  the  string  "abc"  is  
      matched  against the pattern (a|(z))(bc) subpatterns 1 and 3  
      are matched, but 2 is not. When this  happens,  both  offset  
      values corresponding to the unused subpattern are set to -1.  
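
      The offset pairs can be used directly to pull a substring out of
      the subject. The following is a minimal sketch under stated
      assumptions: copy_group() is a hypothetical helper, not a PCRE
      function, and its return codes (-1 for an unset subpattern, -2
      for a buffer that is too small) are invented for this example.
      It assumes that ovector was filled in by a successful match and
      that n is less than the number of pairs that were set.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not a PCRE function): copy the text of capture
   group n into buf and zero-terminate it. Returns the length copied,
   -1 if the group is unset (both offsets are -1), or -2 if buf is too
   small to hold the text plus the terminating zero. */
int copy_group(const char *subject, const int *ovector, int n,
               char *buf, int bufsize)
{
    int start = ovector[2 * n];      /* offset of first character */
    int end = ovector[2 * n + 1];    /* offset just past the last one */
    int len;

    if (start < 0 || end < 0)
        return -1;
    len = end - start;
    if (len + 1 > bufsize)
        return -2;
    memcpy(buf, subject + start, (size_t)len);
    buf[len] = '\0';
    return len;
}
```

      For the example above ("abc" matched against (a|(z))(bc)), the
      first four pairs of ovector are 0,3, 0,1, -1,-1, and 1,3, so
      group 3 copies "bc" and group 2 is reported as unset.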
   
      If a capturing subpattern is matched repeatedly, it  is  the  
      last  portion  of  the  string  that  it  matched  that gets  
      returned.  
   
      If the vector is too small to hold  all  the  captured  sub-  
      strings,  it is used as far as possible (up to two-thirds of  
      its length), and the function returns a value  of  zero.  In  
      particular,  if  the  substring offsets are not of interest,  
      pcre_exec() may be called with ovector passed  as  NULL  and  
      ovecsize  as  zero.  However,  if  the pattern contains back  
      references and the ovector isn't big enough to remember  the  
      related  substrings,  PCRE  has to get additional memory for  
      use during matching. Thus it is usually advisable to  supply  
      an ovector.  
   
      Note that pcre_info() can be used to find out how many  cap-  
      turing  subpatterns  there  are  in  a compiled pattern. The  
      smallest size for ovector that will  allow  for  n  captured  
      substrings,  in  addition  to  the  offsets of the substring  
      matched by the whole pattern, is (n+1)*3.  
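
      The sizing rules can be written down as two one-line functions.
      This is an illustrative sketch only; min_ovecsize() and
      usable_pairs() are hypothetical names, not part of PCRE.

```c
/* Smallest ovecsize that holds n captured substrings plus the pair
   for the whole match: (n+1) pairs, plus one third as workspace. */
int min_ovecsize(int n)
{
    return (n + 1) * 3;
}

/* Number of offset pairs that can be passed back for a given
   ovecsize. The size is rounded down to a multiple of three and only
   two-thirds of it holds pairs, which works out as ovecsize / 3. */
int usable_pairs(int ovecsize)
{
    return ovecsize / 3;
}
```

      With the 30-element vector used in the earlier pcre_exec()
      example, usable_pairs(30) is 10: the whole match plus up to nine
      captured substrings.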
   
      If pcre_exec() fails, it returns a negative number. The fol-  
      lowing are defined in the header file:  
   
        PCRE_ERROR_NOMATCH        (-1)  
   
      The subject string did not match the pattern.  
   
        PCRE_ERROR_NULL           (-2)  
   
      Either code or subject was passed as NULL,  or  ovector  was  
      NULL and ovecsize was not zero.  
   
        PCRE_ERROR_BADOPTION      (-3)  
   
      An unrecognized bit was set in the options argument.  
   
        PCRE_ERROR_BADMAGIC       (-4)  
   
      PCRE stores a 4-byte "magic number" at the start of the com-  
      piled  code,  to  catch  the  case  when it is passed a junk  
      pointer. This is the error it gives when  the  magic  number  
      isn't present.  
   
        PCRE_ERROR_UNKNOWN_NODE   (-5)  
   
      While running the pattern match, an unknown item was encoun-  
      tered in the compiled pattern. This error could be caused by  
      a bug in PCRE or by overwriting of the compiled pattern.  
   
        PCRE_ERROR_NOMEMORY       (-6)  
   
      If a pattern contains back references, but the ovector  that  
      is  passed  to pcre_exec() is not big enough to remember the  
      referenced substrings, PCRE gets a block of  memory  at  the  
      start  of  matching to use for this purpose. If the call via  
      pcre_malloc() fails, this error  is  given.  The  memory  is  
      freed at the end of matching.  
   
        PCRE_ERROR_NOSUBSTRING    (-7)  
   
      This   error   is   used   by   the   pcre_copy_substring(),  
      pcre_get_substring(),  and  pcre_get_substring_list()  func-  
      tions (see below). It is never returned by pcre_exec().  
   
        PCRE_ERROR_MATCHLIMIT     (-8)  
   
      The recursion and backtracking limit, as  specified  by  the  
      match_limit  field  in a pcre_extra structure (or defaulted)  
      was reached. See the description above.  
   
        PCRE_ERROR_CALLOUT        (-9)  
   
      This error is never generated by pcre_exec() itself.  It  is  
      provided  for  use by callout functions that want to yield a  
      distinctive error code. See  the  pcrecallout  documentation  
      for details.  
631    
632           6. Callouts are supported, but the value of the  capture_top  field  is
633           always 1, and the value of the capture_last field is always -1.
634    
635  EXTRACTING CAPTURED SUBSTRINGS BY NUMBER         7.  The \C escape sequence, which (in the standard algorithm) matches a
636           single byte, even in UTF-8 mode, is not supported because the  alterna-
637           tive  algorithm  moves  through  the  subject string one character at a
638           time, for all active paths through the tree.
639    
640       int pcre_copy_substring(const char *subject, int *ovector,         8. Except for (*FAIL), the backtracking control verbs such as  (*PRUNE)
641            int stringcount, int stringnumber, char *buffer,         are  not  supported.  (*FAIL)  is supported, and behaves like a failing
642            int buffersize);         negative assertion.
   
      int pcre_get_substring(const char *subject, int *ovector,  
           int stringcount, int stringnumber,  
           const char **stringptr);  
   
      int pcre_get_substring_list(const char *subject,  
           int *ovector, int stringcount, const char ***listptr);  
   
      Captured substrings can be accessed directly  by  using  the  
      offsets returned by pcre_exec() in ovector. For convenience,  
      the functions  pcre_copy_substring(),  pcre_get_substring(),  
      and  pcre_get_substring_list()  are  provided for extracting  
      captured  substrings  as  new,   separate,   zero-terminated  
      strings.  These functions identify substrings by number. The  
      next section describes functions for extracting  named  sub-  
      strings.   A  substring  that  contains  a  binary  zero  is  
      correctly extracted and has a further zero added on the end,  
      but the result is not, of course, a C string.  
   
      The first three arguments are the  same  for  all  three  of  
      these  functions:   subject  is the subject string which has  
      just been successfully matched, ovector is a pointer to  the  
      vector  of  integer  offsets that was passed to pcre_exec(),  
      and stringcount is the number of substrings that  were  cap-  
      tured by the match, including the substring that matched the  
      entire regular expression. This is  the  value  returned  by  
      pcre_exec  if  it  is  greater  than  zero.  If  pcre_exec()  
      returned zero, indicating that it ran out of space in  ovec-  
      tor,  the  value passed as stringcount should be the size of  
      the vector divided by three.  
   
      The functions pcre_copy_substring() and pcre_get_substring()  
      extract a single substring, whose number is given as string-  
      number. A value of zero extracts the substring that  matched  
      the entire pattern, while higher values extract the captured  
      substrings. For pcre_copy_substring(), the string is  placed  
      in  buffer,  whose  length is given by buffersize, while for  
      pcre_get_substring() a new block of memory is  obtained  via  
      pcre_malloc,  and its address is returned via stringptr. The  
      yield of the function is  the  length  of  the  string,  not  
      including the terminating zero, or one of  
   
        PCRE_ERROR_NOMEMORY       (-6)  
   
      The buffer was too small for pcre_copy_substring(),  or  the  
      attempt to get memory failed for pcre_get_substring().  
   
        PCRE_ERROR_NOSUBSTRING    (-7)  
   
      There is no substring whose number is stringnumber.  
   
      The pcre_get_substring_list() function extracts  all  avail-  
      able  substrings  and builds a list of pointers to them. All  
      this is done in a single block of memory which  is  obtained  
      via pcre_malloc. The address of the memory block is returned  
      via listptr, which is also the start of the list  of  string  
      pointers.  The  end of the list is marked by a NULL pointer.  
      The yield of the function is zero if all went well, or  
   
        PCRE_ERROR_NOMEMORY       (-6)  
   
      if the attempt to get the memory block failed.  
   
      When any of these functions encounters a substring  that  is
      unset, which can happen when capturing subpattern number n+1  
      matches some part of the subject, but subpattern n  has  not  
      been  used  at all, they return an empty string. This can be  
      distinguished  from  a  genuine  zero-length  substring   by  
      inspecting the appropriate offset in ovector, which is nega-  
      tive for unset substrings.  
   
      The  two  convenience  functions  pcre_free_substring()  and  
      pcre_free_substring_list()  can  be  used to free the memory  
      returned by  a  previous  call  of  pcre_get_substring()  or  
      pcre_get_substring_list(),  respectively.  They  do  nothing  
      more than call the function pointed to by  pcre_free,  which  
      of  course  could  be called directly from a C program. How-  
      ever, PCRE is used in some situations where it is linked via  
      a  special  interface  to another programming language which  
      cannot use pcre_free directly; it is for  these  cases  that  
      the functions are provided.  
643    
644    
645  EXTRACTING CAPTURED SUBSTRINGS BY NAME  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
646    
647       int pcre_copy_named_substring(const pcre *code,         Using the alternative matching algorithm provides the following  advan-
648            const char *subject, int *ovector,         tages:
           int stringcount, const char *stringname,  
           char *buffer, int buffersize);  
   
      int pcre_get_stringnumber(const pcre *code,  
           const char *name);  
   
      int pcre_get_named_substring(const pcre *code,  
           const char *subject, int *ovector,  
           int stringcount, const char *stringname,  
           const char **stringptr);  
   
      To extract a substring by name, you first have to  find  the
      associated number. This can be done by calling
      pcre_get_stringnumber(). The first argument is the  compiled
      pattern,  and  the second is the name. For example, for this  
      pattern  
   
        ab(?P<xxx>\d+)...
   
      the number of the subpattern called "xxx" is  1.  Given  the  
      number,  you can then extract the substring directly, or use  
      one of the functions described in the previous section.  For  
      convenience,  there are also two functions that do the whole  
      job.  
   
      Most of the  arguments  of  pcre_copy_named_substring()  and  
      pcre_get_named_substring()  are  the  same  as those for the  
      functions that  extract  by  number,  and  so  are  not  re-  
      described here. There are just two differences.  
   
      First, instead of a substring number, a  substring  name  is  
      given.  Second,  there  is  an  extra argument, given at the  
      start, which is a pointer to the compiled pattern.  This  is  
      needed  in order to gain access to the name-to-number trans-  
      lation table.  
   
      These functions  call  pcre_get_stringnumber(),  and  if  it  
      succeeds,    they   then   call   pcre_copy_substring()   or  
      pcre_get_substring(), as appropriate.  
649    
650  Last updated: 03 February 2003         1. All possible matches (at a single point in the subject) are automat-
651  Copyright (c) 1997-2003 University of Cambridge.         ically found, and in particular, the longest match is  found.  To  find
652  -----------------------------------------------------------------------------         more than one match using the standard algorithm, you have to do kludgy
653           things with callouts.
654    
655  NAME         2. Because the alternative algorithm  scans  the  subject  string  just
656       PCRE - Perl-compatible regular expressions         once,  and  never  needs to backtrack, it is possible to pass very long
657           subject strings to the matching function in  several  pieces,  checking
658           for  partial  matching  each time. Although it is possible to do multi-
659           segment matching using the standard algorithm (pcre_exec()), by retain-
660           ing  partially matched substrings, it is more complicated. The pcrepar-
661           tial documentation gives details  of  partial  matching  and  discusses
662           multi-segment matching.
663    
664    
665  PCRE CALLOUTS  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
666    
667       int (*pcre_callout)(pcre_callout_block *);         The alternative algorithm suffers from a number of disadvantages:
668    
669       PCRE provides a feature called "callout", which is  a  means         1.  It  is  substantially  slower  than the standard algorithm. This is
670       of  temporarily passing control to the caller of PCRE in the         partly because it has to search for all possible matches, but  is  also
671       middle of pattern matching. The caller of PCRE  provides  an         because it is less susceptible to optimization.
      external  function  by putting its entry point in the global  
      variable pcre_callout. By default,  this  variable  contains  
      NULL, which disables all calling out.  
   
      Within a regular expression, (?C) indicates  the  points  at  
      which  the external function is to be called. Different cal-  
      lout points can be identified by putting a number less  than  
      256  after  the  letter  C.  The default value is zero.  For  
      example, this pattern has two callout points:  
   
        (?C1)9abc(?C2)def  
   
      During matching, when PCRE  reaches  a  callout  point  (and  
      pcre_callout  is  set), the external function is called. Its  
      only argument is a pointer to  a  pcre_callout  block.  This  
      contains the following variables:  
   
        int          version;  
        int          callout_number;  
        int         *offset_vector;  
        const char  *subject;  
        int          subject_length;  
        int          start_match;  
        int          current_position;  
        int          capture_top;  
        int          capture_last;  
        void        *callout_data;  
   
      The version field  is  an  integer  containing  the  version  
      number of the block format. The current version is zero. The  
      version number may change in future if additional fields are  
      added,  but  the  intention  is  never  to remove any of the  
      existing fields.  
   
      The callout_number field contains the number of the callout,  
      as compiled into the pattern (that is, the number after ?C).  
   
      The offset_vector field  is  a  pointer  to  the  vector  of  
      offsets  that  was  passed by the caller to pcre_exec(). The  
      contents can be inspected in  order  to  extract  substrings  
      that  have  been  matched  so  far,  in  the same way as for  
      extracting substrings after a match has completed.  

      The subject and subject_length fields contain copies of  the
      values that were passed to pcre_exec().
   
      The start_match field contains the offset within the subject  
      at  which  the current match attempt started. If the pattern  
      is not anchored, the callout function may be called  several  
      times for different starting points.  
   
      The current_position field contains the  offset  within  the  
      subject of the current match pointer.  
   
      The capture_top field contains the  number  of  the  highest  
      captured substring so far.  
   
      The capture_last field  contains  the  number  of  the  most  
      recently captured substring.  
   
      The callout_data field contains a value that  is  passed  to  
      pcre_exec()  by  the  caller  specifically so that it can be  
      passed back in callouts. It is passed  in  the  callout_data
      field  of the pcre_extra data structure. If no such data was  
      passed, the value of callout_data in a pcre_callout block is  
      NULL.  There is a description of the pcre_extra structure in  
      the pcreapi documentation.  
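
      As an illustration of how the start_match and  current_position
      fields relate to the subject, the following sketch  copies  out
      the text matched so far in the current attempt. It is not  PCRE
      code: callout_block_sketch is a hypothetical stand-in whose
      fields follow the list above, so that the example compiles
      without pcre.h, and matched_so_far() is an invented helper name.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the callout block. The field names follow
   the list in the text; real code would use the type from pcre.h. */
typedef struct {
  int          version;
  int          callout_number;
  int         *offset_vector;
  const char  *subject;
  int          subject_length;
  int          start_match;
  int          current_position;
  int          capture_top;
  int          capture_last;
  void        *callout_data;
} callout_block_sketch;

/* Copy the text between start_match and current_position into buf as a
   zero-terminated string. Returns the number of bytes copied, or -1 if
   the offsets are inconsistent or buf is too small. */
int matched_so_far(const callout_block_sketch *cb, char *buf, int bufsize)
{
  int len = cb->current_position - cb->start_match;

  if (cb->start_match < 0 || len < 0 ||
      cb->current_position > cb->subject_length || len >= bufsize)
    return -1;

  memcpy(buf, cb->subject + cb->start_match, (size_t)len);
  buf[len] = '\0';
  return len;
}
```

      A callout function could call such a helper for logging and
      then return zero so that matching proceeds as normal.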
672    
673           2. Capturing parentheses and back references are not supported.
674    
675           3. Although atomic groups are supported, their use does not provide the
676           performance advantage that it does for the standard algorithm.
677    
 RETURN VALUES  
678    
679       The callout function returns an integer.  If  the  value  is  AUTHOR
680       zero,  matching  proceeds as normal. If the value is greater  
681       than zero, matching fails at the current  point,  but  back-         Philip Hazel
682       tracking  to test other possibilities goes ahead, just as if         University Computing Service
683       a lookahead assertion had failed. If the value is less  than         Cambridge CB2 3QH, England.
684       zero,  the  match  is abandoned, and pcre_exec() returns the  
685       value.  
686    REVISION
687       Negative values should normally be chosen from  the  set  of  
688       PCRE_ERROR_xxx  values.  In  particular,  PCRE_ERROR_NOMATCH         Last updated: 17 November 2010
689       forces a standard "no  match"  failure.   The  error  number         Copyright (c) 1997-2010 University of Cambridge.
690       PCRE_ERROR_CALLOUT is reserved for use by callout functions;  ------------------------------------------------------------------------------
691       it will never be used by PCRE itself.  
692    
693    PCREAPI(3)                                                          PCREAPI(3)
694    
 Last updated: 21 January 2003  
 Copyright (c) 1997-2003 University of Cambridge.  
 -----------------------------------------------------------------------------  
695    
696  NAME  NAME
697       PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
698    
699    
700  DIFFERENCES FROM PERL  PCRE NATIVE API BASIC FUNCTIONS
701    
702       This document describes the differences  in  the  ways  that         #include <pcre.h>
      PCRE  and  Perl  handle regular expressions. The differences  
      described here are with respect to Perl 5.8.  
   
      1. PCRE does  not  allow  repeat  quantifiers  on  lookahead  
      assertions. Perl permits them, but they do not mean what you  
      might think. For example, (?!a){3} does not assert that  the  
      next  three characters are not "a". It just asserts that the  
      next character is not "a" three times.  
   
      2. Capturing subpatterns that occur inside  negative  looka-  
      head  assertions  are  counted,  but  their  entries  in the  
      offsets vector are never set. Perl sets its numerical  vari-  
      ables  from  any  such  patterns that are matched before the  
      assertion fails to match something (thereby succeeding), but  
      only  if  the negative lookahead assertion contains just one  
      branch.  
   
      3. Though binary zero characters are supported in  the  sub-  
      ject  string,  they  are  not  allowed  in  a pattern string  
      because it is passed as a normal  C  string,  terminated  by  
      zero. The escape sequence "\0" can be used in the pattern to  
      represent a binary zero.  
   
      4. The following Perl escape sequences  are  not  supported:  
      \l,  \u,  \L,  \U,  \P, \p, and \X. In fact these are imple-  
      mented by Perl's general string-handling and are not part of  
      its pattern matching engine. If any of these are encountered  
      by PCRE, an error is generated.  
   
      5. PCRE does support the \Q...\E  escape  for  quoting  sub-  
      strings. Characters in between are treated as literals. This  
      is slightly different from Perl in that $  and  @  are  also  
      handled  as  literals inside the quotes. In Perl, they cause  
      variable interpolation (but of course  PCRE  does  not  have  
      variables). Note the following examples:  
   
          Pattern            PCRE matches      Perl matches  
   
          \Qabc$xyz\E        abc$xyz           abc followed by the  
                                                 contents of $xyz  
          \Qabc\$xyz\E       abc\$xyz          abc\$xyz  
          \Qabc\E\$\Qxyz\E   abc$xyz           abc$xyz  
   
      In PCRE, the \Q...\E mechanism is not  recognized  inside  a  
      character class.  
   
      8. Fairly obviously, PCRE does not support the (?{code}) and  
      (?p{code})  constructions. However, there is some experimen-  
      tal support for recursive patterns using the non-Perl  items  
      (?R),  (?number)  and  (?P>name).  Also,  the PCRE "callout"  
      feature allows an external function to be called during pat-  
      tern matching.  
   
      9. There are some differences that are  concerned  with  the  
      settings  of  captured  strings  when  part  of a pattern is  
      repeated. For example, matching "aba"  against  the  pattern  
      /^(a(b)?)+$/  in Perl leaves $2 unset, but in PCRE it is set  
      to "b".  
   
      10. PCRE  provides  some  extensions  to  the  Perl  regular  
      expression facilities:  
   
      (a) Although lookbehind assertions must match  fixed  length  
      strings,  each  alternative branch of a lookbehind assertion  
      can match a different length of string. Perl  requires  them  
      all to have the same length.  
   
      (b) If PCRE_DOLLAR_ENDONLY is set and PCRE_MULTILINE is  not  
      set,  the  $  meta-character matches only at the very end of  
      the string.  
   
      (c) If PCRE_EXTRA is set, a backslash followed by  a  letter  
      with no special meaning is faulted.  
   
      (d) If PCRE_UNGREEDY is set, the greediness of  the  repeti-  
      tion  quantifiers  is inverted, that is, by default they are  
      not greedy, but if followed by a question mark they are.  
   
      (e) PCRE_ANCHORED can be used to force a pattern to be tried  
      only at the first matching position in the subject string.  
   
      (f)  The  PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   and  
      PCRE_NO_AUTO_CAPTURE  options  for  pcre_exec() have no Perl  
      equivalents.  
   
      (g) The (?R), (?number), and (?P>name) constructs allow  for
      recursive  pattern  matching  (Perl  can  do  this using the  
      (?p{code}) construct, which PCRE cannot support.)  
   
      (h) PCRE supports  named  capturing  substrings,  using  the  
      Python syntax.  
   
      (i) PCRE supports the  possessive  quantifier  "++"  syntax,  
      taken from Sun's Java package.  
703    
704       (j) The (R) condition, for  testing  recursion,  is  a  PCRE         pcre *pcre_compile(const char *pattern, int options,
705       extension.              const char **errptr, int *erroffset,
706                const unsigned char *tableptr);
707    
708       (k) The callout facility is PCRE-specific.         pcre *pcre_compile2(const char *pattern, int options,
709                int *errorcodeptr,
710                const char **errptr, int *erroffset,
711                const unsigned char *tableptr);
712    
713  Last updated: 03 February 2003         pcre_extra *pcre_study(const pcre *code, int options,
714  Copyright (c) 1997-2003 University of Cambridge.              const char **errptr);
 -----------------------------------------------------------------------------  
715    
716  NAME         void pcre_free_study(pcre_extra *extra);
      PCRE - Perl-compatible regular expressions  
717    
718           int pcre_exec(const pcre *code, const pcre_extra *extra,
719                const char *subject, int length, int startoffset,
720                int options, int *ovector, int ovecsize);
721    
 PCRE REGULAR EXPRESSION DETAILS  
722    
723       The syntax and semantics of  the  regular  expressions  sup-  PCRE NATIVE API AUXILIARY FUNCTIONS
      ported  by PCRE are described below. Regular expressions are  
      also described in the Perl documentation and in a number  of  
      other  books,  some  of which have copious examples. Jeffrey  
      Friedl's  "Mastering  Regular  Expressions",  published   by  
      O'Reilly,  covers them in great detail. The description here  
      is intended as reference documentation.  
   
      The basic operation of PCRE is on strings of bytes. However,  
      there  is  also  support for UTF-8 character strings. To use  
      this support you must build PCRE to include  UTF-8  support,  
      and  then call pcre_compile() with the PCRE_UTF8 option. How  
      this affects the pattern matching is  mentioned  in  several  
      places  below.  There is also a summary of UTF-8 features in  
      the section on UTF-8 support in the main pcre page.  
   
      A regular expression is a pattern that is matched against  a  
      subject string from left to right. Most characters stand for  
      themselves in a pattern, and match the corresponding charac-  
      ters in the subject. As a trivial example, the pattern  
   
        The quick brown fox  
   
      matches a portion of a subject string that is  identical  to  
      itself.  The  power  of  regular  expressions comes from the  
      ability to include alternatives and repetitions in the  pat-  
      tern.  These  are encoded in the pattern by the use of meta-  
      characters, which do not stand for  themselves  but  instead  
      are interpreted in some special way.  
   
      There are two different sets of meta-characters: those  that  
      are  recognized anywhere in the pattern except within square  
      brackets, and those that are recognized in square  brackets.  
      Outside square brackets, the meta-characters are as follows:  
   
        \      general escape character with several uses  
        ^      assert start of string (or line, in multiline mode)  
        $      assert end of string (or line, in multiline mode)  
        .      match any character except newline (by default)  
        [      start character class definition  
        |      start of alternative branch  
        (      start subpattern  
        )      end subpattern  
        ?      extends the meaning of (  
               also 0 or 1 quantifier  
               also quantifier minimizer  
        *      0 or more quantifier  
        +      1 or more quantifier  
               also "possessive quantifier"  
        {      start min/max quantifier  
   
      Part of a pattern that is in square  brackets  is  called  a  
      "character  class".  In  a  character  class  the only meta-  
      characters are:  
   
        \      general escape character  
        ^      negate the class, but only if the first character  
        -      indicates character range  
        [      POSIX character class (only if followed by POSIX  
                 syntax)  
        ]      terminates the character class  
724    
725       The following sections describe  the  use  of  each  of  the         pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
      meta-characters.  
726    
727           void pcre_jit_stack_free(pcre_jit_stack *stack);
728    
729  BACKSLASH         void pcre_assign_jit_stack(pcre_extra *extra,
730                pcre_jit_callback callback, void *data);
731    
732       The backslash character has several uses. Firstly, if it  is         int pcre_dfa_exec(const pcre *code, const pcre_extra *extra,
733       followed  by  a  non-alphameric character, it takes away any              const char *subject, int length, int startoffset,
734       special  meaning  that  character  may  have.  This  use  of              int options, int *ovector, int ovecsize,
735       backslash  as  an  escape  character applies both inside and              int *workspace, int wscount);
      outside character classes.  
   
      For example, if you want to match a * character,  you  write  
      \*  in the pattern.  This escaping action applies whether or  
      not the following character would otherwise  be  interpreted  
      as  a meta-character, so it is always safe to precede a non-  
      alphameric with backslash to  specify  that  it  stands  for  
      itself. In particular, if you want to match a backslash, you  
      write \\.  
   
      If a pattern is compiled with the PCRE_EXTENDED option, whi-  
      tespace in the pattern (other than in a character class) and  
      characters between a # outside a  character  class  and  the  
      next  newline  character  are ignored. An escaping backslash  
      can be used to include a whitespace or # character  as  part  
      of the pattern.  
   
      If you want to remove the special meaning from a sequence of  
      characters, you can do so by putting them between \Q and \E.  
      This is different from Perl in that $ and @ are  handled  as  
      literals  in  \Q...\E  sequences in PCRE, whereas in Perl, $  
      and @ cause variable interpolation. Note the following exam-  
      ples:  
   
        Pattern            PCRE matches   Perl matches  
   
        \Qabc$xyz\E        abc$xyz        abc followed by the
                                            contents of $xyz
        \Qabc\$xyz\E       abc\$xyz       abc\$xyz  
        \Qabc\E\$\Qxyz\E   abc$xyz        abc$xyz  
   
      The \Q...\E sequence is recognized both inside  and  outside  
      character classes.  
   
      A second use of backslash provides a way  of  encoding  non-  
      printing  characters  in patterns in a visible manner. There  
      is no restriction on the appearance of non-printing  charac-  
      ters,  apart from the binary zero that terminates a pattern,  
      but when a pattern is being prepared by text editing, it  is  
      usually  easier to use one of the following escape sequences  
      than the binary character it represents:  
   
        \a        alarm, that is, the BEL character (hex 07)  
        \cx       "control-x", where x is any character  
        \e        escape (hex 1B)  
        \f        formfeed (hex 0C)  
        \n        newline (hex 0A)  
        \r        carriage return (hex 0D)  
        \t        tab (hex 09)  
        \ddd      character with octal code ddd, or backreference  
        \xhh      character with hex code hh  
        \x{hhh..} character with hex code hhh... (UTF-8 mode only)  
   
      The precise effect of \cx is as follows: if  x  is  a  lower  
      case  letter,  it  is converted to upper case. Then bit 6 of  
      the character (hex 40) is inverted.  Thus  \cz  becomes  hex  
      1A, but \c{ becomes hex 3B, while \c; becomes hex 7B.  
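
      The rule for \cx can be written out  as  a  small  C  function.
      This is a sketch of the rule as stated above, not PCRE's own
      code; the name control_escape_value() is invented for the
      example, and ASCII character codes are assumed.

```c
#include <ctype.h>

/* Value produced by \cx for the character x: a lower case letter is
   first converted to upper case, then bit 6 (hex 40) is inverted. */
int control_escape_value(int x)
{
  if (islower((unsigned char)x))
    x = toupper((unsigned char)x);
  return x ^ 0x40;
}
```

      With this function, 'z' gives hex 1A, '{' gives hex 3B, and ';'
      gives hex 7B, in line with the examples above.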
   
      After \x, from zero  to  two  hexadecimal  digits  are  read  
      (letters  can be in upper or lower case). In UTF-8 mode, any  
      number of hexadecimal digits may appear between \x{  and  },  
      but  the value of the character code must be less than 2**31  
      (that is, the maximum hexadecimal  value  is  7FFFFFFF).  If  
      characters  other than hexadecimal digits appear between \x{  
      and }, or if there is no terminating }, this form of  escape  
      is  not  recognized.  Instead, the initial \x will be inter-  
      preted as a basic  hexadecimal  escape,  with  no  following  
      digits, giving a byte whose value is zero.  
   
      Characters whose value is less than 256 can  be  defined  by  
      either  of  the  two  syntaxes  for \x when PCRE is in UTF-8  
      mode. There is no difference in the way  they  are  handled.  
      For example, \xdc is exactly the same as \x{dc}.  
   
      After \0 up to two further octal digits are  read.  In  both  
      cases,  if  there are fewer than two digits, just those that  
      are present are used. Thus the  sequence  \0\x\07  specifies  
      two binary zeros followed by a BEL character (code value 7).  
      Make sure you supply two digits after the  initial  zero  if  
      the character that follows is itself an octal digit.  
   
      The handling of a backslash followed by a digit other than 0  
      is  complicated.   Outside  a character class, PCRE reads it  
      and any following digits as a decimal number. If the  number  
      is  less  than  10, or if there have been at least that many  
      previous capturing left parentheses in the  expression,  the  
      entire  sequence is taken as a back reference. A description  
      of how this works is given later, following  the  discussion  
      of parenthesized subpatterns.  
   
      Inside a character  class,  or  if  the  decimal  number  is  
      greater  than  9 and there have not been that many capturing  
      subpatterns, PCRE re-reads up to three octal digits  follow-  
      ing  the  backslash,  and  generates  a single byte from the  
      least significant 8 bits of the value. Any subsequent digits  
      stand for themselves.  For example:  
   
        \040   is another way of writing a space  
        \40    is the same, provided there are fewer than 40  
                  previous capturing subpatterns  
        \7     is always a back reference  
        \11    might be a back reference, or another way of  
                  writing a tab  
        \011   is always a tab  
        \0113  is a tab followed by the character "3"  
        \113   might be a back reference, otherwise the  
                  character with octal code 113  
        \377   might be a back reference, otherwise  
                  the byte consisting entirely of 1 bits  
        \81    is either a back reference, or a binary zero  
                  followed by the two characters "8" and "1"  
   
      Note that octal values of 100 or greater must not be  intro-  
      duced  by  a  leading zero, because no more than three octal  
      digits are ever read.  
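
      The decision between a back reference and an octal byte can  be
      sketched in C as follows. This follows the description above
      and is not taken from the PCRE source; digit_escape and
      interpret_digit_escape() are names invented for the example.
      The digits argument points just after the backslash.

```c
typedef struct {
  int is_backref;  /* 1 for a back reference, 0 for a single byte */
  int value;       /* group number, or byte value */
  int consumed;    /* number of digits used after the backslash */
} digit_escape;

digit_escape interpret_digit_escape(const char *digits, int groups_so_far,
                                    int in_class)
{
  digit_escape r = {0, 0, 0};
  int n = 0, i = 0;

  /* First read every following digit as one decimal number. The value
     is capped so that a very long digit string cannot overflow. */
  while (digits[i] >= '0' && digits[i] <= '9') {
    if (n < 100000)
      n = n * 10 + (digits[i] - '0');
    i++;
  }

  /* Outside a class, a number below 10, or one that does not exceed
     the count of previous capturing groups, is a back reference.
     A leading 0 always introduces an octal value instead. */
  if (!in_class && i > 0 && digits[0] != '0' &&
      (n < 10 || n <= groups_so_far)) {
    r.is_backref = 1;
    r.value = n;
    r.consumed = i;
    return r;
  }

  /* Otherwise re-read up to three octal digits and keep the least
     significant 8 bits. Later digits stand for themselves. */
  n = 0;
  for (i = 0; i < 3 && digits[i] >= '0' && digits[i] <= '7'; i++)
    n = n * 8 + (digits[i] - '0');
  r.value = n & 0xFF;
  r.consumed = i;
  return r;
}
```

      For "81" with no capturing groups the function reports a byte
      value of zero with no digits consumed, so "8" and "1" are left
      to stand for themselves, as in the last row of the table.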
   
      All the sequences that define a single byte value or a  sin-  
      gle  UTF-8 character (in UTF-8 mode) can be used both inside  
      and outside character classes. In addition, inside a charac-  
      ter  class,  the sequence \b is interpreted as the backspace  
      character (hex 08). Outside a character class it has a  dif-  
      ferent meaning (see below).  
   
      The third use of backslash is for specifying generic charac-  
      ter types:  
   
        \d     any decimal digit  
        \D     any character that is not a decimal digit  
        \s     any whitespace character  
        \S     any character that is not a whitespace character  
        \w     any "word" character  
        \W     any "non-word" character
   
      Each pair of escape sequences partitions the complete set of  
      characters  into  two  disjoint  sets.  Any  given character  
      matches one, and only one, of each pair.  
   
      In UTF-8 mode, characters with values greater than 255 never  
      match \d, \s, or \w, and always match \D, \S, and \W.  
   
      For compatibility with Perl, \s does not match the VT  char-
      acter (code 11).  This makes it different from  the  POSIX
      "space" class. The \s characters are HT  (9),  LF  (10),  FF  
      (12), CR (13), and space (32).  
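
      Expressed as a C predicate, the set of characters listed  above
      for \s looks like this. The function name is invented for the
      example, and the sketch covers only the default set given in
      this section, not locale-specific tables.

```c
/* Non-zero if c is one of the characters listed for \s:
   HT (9), LF (10), FF (12), CR (13), or space (32).
   VT (11) is deliberately excluded. */
int is_backslash_s_char(int c)
{
  return c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
}
```

      The standard C function isspace() also accepts VT in the  "C"
      locale, which is the difference from the POSIX "space" class
      mentioned above.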
   
      A "word" character is any letter or digit or the  underscore  
      character,  that  is,  any  character which can be part of a  
      Perl "word". The definition of letters and  digits  is  con-  
      trolled  by PCRE's character tables, and may vary if locale-  
      specific matching is taking place (see "Locale  support"  in  
      the pcreapi page). For example, in the "fr" (French) locale,  
      some character codes greater than 128 are used for  accented  
      letters, and these are matched by \w.  
   
      These character type sequences can appear  both  inside  and  
      outside  character classes. They each match one character of  
      the appropriate type. If the current matching  point  is  at  
      the end of the subject string, all of them fail, since there  
      is no character to match.  
   
      The fourth use of backslash is  for  certain  simple  asser-  
      tions. An assertion specifies a condition that has to be met  
      at a particular point in  a  match,  without  consuming  any  
      characters  from  the subject string. The use of subpatterns  
      for more complicated  assertions  is  described  below.  The  
      backslashed assertions are  
   
        \b     matches at a word boundary  
        \B     matches when not at a word boundary  
        \A     matches at start of subject  
        \Z     matches at end of subject or before newline at end  
        \z     matches at end of subject  
        \G     matches at first matching position in subject  
   
      These assertions may not appear in  character  classes  (but  
      note  that  \b has a different meaning, namely the backspace  
      character, inside a character class).  
   
      A word boundary is a position in the  subject  string  where  
      the current character and the previous character do not both  
      match \w or \W (i.e. one matches \w and  the  other  matches  
      \W),  or the start or end of the string if the first or last  
      character matches \w, respectively.  
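
      The word boundary test can be sketched in  C  as  follows.  The
      function names are invented for the example, and \w is
      approximated with isalnum() plus underscore rather than with
      PCRE's own character tables.

```c
#include <ctype.h>

/* Approximation of \w: a letter, a digit, or underscore. */
static int is_word_char(unsigned char c)
{
  return isalnum(c) || c == '_';
}

/* Non-zero if position pos (0 <= pos <= len) in s is a word boundary:
   exactly one of the previous and current characters is a word
   character. Positions before the start and after the end of the
   subject count as non-word. */
int at_word_boundary(const char *s, int len, int pos)
{
  int prev = pos > 0 && is_word_char((unsigned char)s[pos - 1]);
  int cur = pos < len && is_word_char((unsigned char)s[pos]);
  return prev != cur;
}
```

      For the subject "ab cd" this reports boundaries at offsets 0,
      2, 3 and 5, and none at offsets 1 and 4.
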
      The \A, \Z, and \z assertions differ  from  the  traditional  
      circumflex  and  dollar  (described below) in that they only  
      ever match at the very start and end of the subject  string,  
      whatever options are set. Thus, they are independent of mul-  
      tiline mode.  
   
      They are not affected  by  the  PCRE_NOTBOL  or  PCRE_NOTEOL  
      options.  If the startoffset argument of pcre_exec() is non-  
      zero, indicating that matching is to start at a point  other  
      than  the  beginning of the subject, \A can never match. The  
      difference between \Z and \z is that  \Z  matches  before  a  
      newline  that is the last character of the string as well as  
      at the end of the string, whereas \z  matches  only  at  the  
      end.  
   
      The \G assertion is true  only  when  the  current  matching  
      position is at the start point of the match, as specified by  
      the startoffset argument of pcre_exec(). It differs from  \A  
      when  the  value  of  startoffset  is  non-zero.  By calling  
      pcre_exec() multiple times with appropriate  arguments,  you  
      can mimic Perl's /g option, and it is in this kind of imple-  
      mentation where \G can be useful.  
   
      Note, however, that PCRE's  interpretation  of  \G,  as  the  
      start of the current match, is subtly different from Perl's,  
      which defines it as the end of the previous match. In  Perl,  
      these  can  be  different when the previously matched string  
      was empty. Because PCRE does just one match at  a  time,  it  
      cannot reproduce this behaviour.  
   
      If all the alternatives of a  pattern  begin  with  \G,  the  
      expression  is  anchored to the starting match position, and  
      the "anchored" flag is set in the compiled  regular  expres-  
      sion.  
736    
737           int pcre_copy_named_substring(const pcre *code,
738                const char *subject, int *ovector,
739                int stringcount, const char *stringname,
740                char *buffer, int buffersize);
741    
742  CIRCUMFLEX AND DOLLAR         int pcre_copy_substring(const char *subject, int *ovector,
743                int stringcount, int stringnumber, char *buffer,
744                int buffersize);
745    
746       Outside a character class, in the default matching mode, the         int pcre_get_named_substring(const pcre *code,
747       circumflex  character  is an assertion which is true only if              const char *subject, int *ovector,
748       the current matching point is at the start  of  the  subject              int stringcount, const char *stringname,
749       string.  If  the startoffset argument of pcre_exec() is non-              const char **stringptr);
      zero, circumflex  can  never  match  if  the  PCRE_MULTILINE  
      option is unset. Inside a character class, circumflex has an  
      entirely different meaning (see below).  
   
      Circumflex need not be the first character of the pattern if  
      a  number of alternatives are involved, but it should be the  
      first thing in each alternative in which it appears  if  the  
      pattern is ever to match that branch. If all possible alter-  
      natives start with a circumflex, that is, if the pattern  is  
      constrained to match only at the start of the subject, it is  
      said to be an "anchored" pattern. (There are also other con-  
      structs that can cause a pattern to be anchored.)  
   
      A dollar character is an assertion which is true only if the  
      current  matching point is at the end of the subject string,  
      or immediately before a newline character that is  the  last  
      character in the string (by default). Dollar need not be the  
      last character of the pattern if a  number  of  alternatives  
      are  involved,  but it should be the last item in any branch  
      in which it appears.  Dollar has no  special  meaning  in  a  
      character class.  
   
      The meaning of dollar can be changed so that it matches only  
      at   the   very   end   of   the   string,  by  setting  the  
      PCRE_DOLLAR_ENDONLY option at compile time.  This  does  not  
      affect the \Z assertion.  
   
      The meanings of the circumflex  and  dollar  characters  are  
      changed  if  the  PCRE_MULTILINE option is set. When this is  
      the case,  they  match  immediately  after  and  immediately  
      before an internal newline character, respectively, in addi-  
      tion to matching at the start and end of the subject string.  
      For  example, the pattern /^abc$/ matches the subject string  
      "def\nabc" in multiline  mode,  but  not  otherwise.  Conse-  
      quently,  patterns  that  are  anchored  in single line mode  
      because all branches start with ^ are not anchored in multi-  
      line  mode,  and a match for circumflex is possible when the  
      startoffset  argument  of  pcre_exec()  is   non-zero.   The  
      PCRE_DOLLAR_ENDONLY  option  is ignored if PCRE_MULTILINE is  
      set.  
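
      The positions at which circumflex and dollar can match, as
      described in this section, can be sketched in C  like  this.
      The function and parameter names are invented for the example;
      the flags stand for PCRE_MULTILINE and PCRE_DOLLAR_ENDONLY, and
      the effects of startoffset, PCRE_NOTBOL and PCRE_NOTEOL are not
      modelled.

```c
/* Non-zero if ^ can match at offset pos in s (length len). */
int circumflex_matches(const char *s, int len, int pos, int multiline)
{
  if (pos == 0)
    return 1;                        /* start of the subject */
  /* In multiline mode, also immediately after an internal newline. */
  return multiline && pos < len && s[pos - 1] == '\n';
}

/* Non-zero if $ can match at offset pos in s (length len). */
int dollar_matches(const char *s, int len, int pos,
                   int multiline, int dollar_endonly)
{
  if (pos == len)
    return 1;                        /* very end of the subject */
  if (multiline)
    return s[pos] == '\n';           /* before any newline */
  if (dollar_endonly)
    return 0;                        /* only at the very end */
  /* Default: also before a newline that is the last character. */
  return pos == len - 1 && s[pos] == '\n';
}
```

      For the subject "def\nabc", circumflex_matches() accepts offset
      4 only when the multiline flag is set, which is why /^abc$/
      matches that subject in multiline mode but not otherwise.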
   
      Note that the sequences \A, \Z, and \z can be used to  match  
      the  start  and end of the subject in both modes, and if all  
      branches of a pattern start with \A it is  always  anchored,  
      whether PCRE_MULTILINE is set or not.  
   
   
 FULL STOP (PERIOD, DOT)  
   
      Outside a character class, a dot in the pattern matches  any  
      one character in the subject, including a non-printing char-  
      acter, but not (by default) newline.  In UTF-8 mode,  a  dot  
      matches  any  UTF-8  character, which might be more than one  
      byte  long,  except  (by  default)  for  newline.   If   the  
      PCRE_DOTALL  option is set, dots match newlines as well. The  
      handling of dot is entirely independent of the  handling  of  
      circumflex and dollar, the only relationship being that they  
      both involve newline characters. Dot has no special  meaning  
      in a character class.  
750    
751           int pcre_get_stringnumber(const pcre *code,
752                const char *name);
753    
754           int pcre_get_stringtable_entries(const pcre *code,
755                const char *name, char **first, char **last);
756    
757  MATCHING A SINGLE BYTE         int pcre_get_substring(const char *subject, int *ovector,
758                int stringcount, int stringnumber,
759                const char **stringptr);
760    
761       Outside a character class, the escape  sequence  \C  matches         int pcre_get_substring_list(const char *subject,
762       any  one  byte, both in and out of UTF-8 mode. Unlike a dot,              int *ovector, int stringcount, const char ***listptr);
      it always matches a newline. The feature is provided in Perl  
      in  order  to match individual bytes in UTF-8 mode.  Because  
      it breaks up UTF-8 characters into  individual  bytes,  what  
      remains  in  the string may be a malformed UTF-8 string. For  
      this reason it is best avoided.  
   
      PCRE does not allow \C to appear  in  lookbehind  assertions  
      (see below), because in UTF-8 mode it makes it impossible to  
      calculate the length of the lookbehind.  
   
   
 SQUARE BRACKETS  
   
      An opening square bracket introduces a character class, ter-  
      minated  by  a  closing  square  bracket.  A  closing square  
      bracket on its own is  not  special.  If  a  closing  square  
      bracket  is  required as a member of the class, it should be  
      the first data character in the class (after an initial cir-  
      cumflex, if present) or escaped with a backslash.  
   
      A character class matches a single character in the subject.  
      In  UTF-8 mode, the character may occupy more than one byte.  
      A matched character must be in the set of characters defined  
      by the class, unless the first character in the class defin-  
      ition is a circumflex, in which case the  subject  character  
      must not be in the set defined by the class. If a circumflex  
      is actually required as a member of the class, ensure it  is  
      not the first character, or escape it with a backslash.  
   
      For example, the character class [aeiou] matches  any  lower  
      case vowel, while [^aeiou] matches any character that is not  
      a lower case vowel. Note that a circumflex is  just  a  con-  
      venient  notation for specifying the characters which are in  
      the class by enumerating those that are not. It  is  not  an  
      assertion:  it  still  consumes a character from the subject  
      string, and fails if the current pointer is at  the  end  of  
      the string.  
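
      A small sketch of the point that a negated class still consumes a
      character. It uses Python's re module on the assumption that it
      treats these simple classes the same way as PCRE; the names are
      illustrative.

```python
import re

# [^aeiou] must consume one character, so it fails at the end of the subject.
at_end = re.search(r"x[^aeiou]", "x")
consumes = re.search(r"x[^aeiou]", "xyz")

# A circumflex that is not the first character is an ordinary class member.
literal_caret = re.findall(r"[a^]", "a^b")

print(at_end, consumes.group(), literal_caret)
```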
   
      In UTF-8 mode, characters with values greater than  255  can  
      be  included  in a class as a literal string of bytes, or by  
      using the \x{ escaping mechanism.  
   
      When caseless matching  is  set,  any  letters  in  a  class  
      represent  both their upper case and lower case versions, so  
      for example, a caseless [aeiou] matches "A" as well as  "a",  
      and  a caseless [^aeiou] does not match "A", whereas a case-  
      ful version would. PCRE does not support the concept of case  
      for characters with values greater than 255.

      The newline character is never treated in any special way in
      character  classes,  whatever the setting of the PCRE_DOTALL  
      or PCRE_MULTILINE options is. A  class  such  as  [^a]  will  
      always match a newline.  
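
      The newline rule can be checked with a short sketch. Python's
      re.DOTALL and re.MULTILINE are assumed here to correspond to
      PCRE_DOTALL and PCRE_MULTILINE.

```python
import re

# A negated class matches a newline with or without the option flags.
plain = re.fullmatch(r"[^a]", "\n")
flagged = re.fullmatch(r"[^a]", "\n", re.DOTALL | re.MULTILINE)

print(plain is not None, flagged is not None)
```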
   
      The minus (hyphen) character can be used to specify a  range  
      of  characters  in  a  character  class.  For example, [d-m]  
      matches any letter between d and m, inclusive.  If  a  minus  
      character  is required in a class, it must be escaped with a  
      backslash or appear in a position where it cannot be  inter-  
      preted as indicating a range, typically as the first or last  
      character in the class.  
   
      It is not possible to have the literal character "]" as  the  
      end  character  of  a  range.  A  pattern such as [W-]46] is  
      interpreted as a class of two characters ("W" and "-")  fol-  
      lowed by a literal string "46]", so it would match "W46]" or  
      "-46]". However, if the "]" is escaped with a  backslash  it  
      is  interpreted  as  the end of range, so [W-\]46] is inter-  
      preted as a single class containing a range followed by  two  
      separate characters. The octal or hexadecimal representation  
      of "]" can also be used to end a range.  
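
      The two readings of the bracket example can be compared directly.
      This is a sketch in Python's re module, which is assumed to parse
      these two classes the same way as PCRE.

```python
import re

two_char_class = re.compile(r"[W-]46]")   # class {W, -} followed by literal "46]"
range_class = re.compile(r"[W-\]46]")     # range W..] plus "4" and "6"

literal_hits = [s for s in ("W46]", "-46]", "X46]") if two_char_class.fullmatch(s)]
range_hits = [c for c in "WXZ]46-" if range_class.fullmatch(c)]

print(literal_hits, range_hits)
```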
   
      Ranges  operate  in  the  collating  sequence  of  character  
      values.  They  can  also  be  used  for characters specified  
      numerically, for example [\000-\037]. In UTF-8 mode,  ranges  
      can  include  characters  whose values are greater than 255,  
      for example [\x{100}-\x{2ff}].  
   
      If a range that  includes  letters  is  used  when  caseless  
      matching  is set, it matches the letters in either case. For  
      example, [W-c] is  equivalent  to  [][\^_`wxyzabc],  matched  
      caselessly,  and if character tables for the "fr" locale are  
      in use, [\xc8-\xcb] matches accented E  characters  in  both  
      cases.  
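
      A sketch of the caseless range [W-c], using Python's re.IGNORECASE
      as an assumed equivalent of PCRE_CASELESS. The locale-dependent
      "fr" example is not reproduced.

```python
import re

caseless_range = re.compile(r"[W-c]", re.IGNORECASE)
caseful_range = re.compile(r"[W-c]")

sample = "WwZzAaCc[]^_`\\dD-"
caseless_hits = "".join(ch for ch in sample if caseless_range.fullmatch(ch))
caseful_hits = "".join(ch for ch in sample if caseful_range.fullmatch(ch))

print(caseless_hits, caseful_hits)
```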
   
      The character types \d, \D, \s, \S,  \w,  and  \W  may  also  
      appear  in  a  character  class, and add the characters that  
      they match to the class. For example, [\dABCDEF] matches any  
      hexadecimal  digit.  A  circumflex  can conveniently be used  
      with the upper case character types to specify a  more  res-  
      tricted set of characters than the matching lower case type.  
      For example, the class [^\W_] matches any letter  or  digit,  
      but not underscore.  
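
      Both class examples can be tried out as follows. The sketch uses
      Python's re module with re.ASCII, on the assumption that this
      keeps \d and \W close to PCRE's default (non-UTF-8) tables.

```python
import re

# \d inside a class adds the digits to the listed letters.
hex_digits = re.findall(r"[\dABCDEF]", "7Fg", re.ASCII)

# Negating \W and underscore leaves letters and digits only.
letters_and_digits = re.findall(r"[^\W_]", "a_1-Z", re.ASCII)

print(hex_digits, letters_and_digits)
```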
   
      All non-alphanumeric characters other than \,  -,  ^  (at  the
      start)  and  the  terminating ] are non-special in character  
      classes, but it does no harm if they are escaped.  
763
764           void pcre_free_substring(const char *stringptr);
765
766           void pcre_free_substring_list(const char **stringptr);
767
768           const unsigned char *pcre_maketables(void);


 POSIX CHARACTER CLASSES

      Perl supports the  POSIX  notation  for  character  classes,
      which  uses names enclosed by [: and :] within the enclosing  
      square brackets. PCRE also supports this notation. For exam-  
      ple,  
   
        [01[:alpha:]%]  
   
      matches "0", "1", any alphabetic character, or "%". The sup-  
      ported class names are  
   
        alnum    letters and digits  
        alpha    letters  
        ascii    character codes 0 - 127  
        blank    space or tab only  
        cntrl    control characters  
        digit    decimal digits (same as \d)  
        graph    printing characters, excluding space  
        lower    lower case letters  
        print    printing characters, including space  
        punct    printing characters, excluding letters and digits  
        space    white space (not quite the same as \s)  
        upper    upper case letters  
        word     "word" characters (same as \w)  
        xdigit   hexadecimal digits  
   
      The "space" characters are HT (9),  LF  (10),  VT  (11),  FF  
      (12),  CR  (13),  and  space  (32).  Notice  that  this list  
      includes the VT character (code 11). This makes "space" dif-  
      ferent  to  \s, which does not include VT (for Perl compati-  
      bility).  
   
      The name "word" is a Perl extension, and "blank"  is  a  GNU  
      extension from Perl 5.8. Another Perl extension is negation,  
      which is indicated by a ^ character  after  the  colon.  For  
      example,  
   
        [12[:^digit:]]  
   
      matches "1", "2", or any non-digit.  PCRE  (and  Perl)  also  
      recognize the POSIX syntax [.ch.] and [=ch=] where "ch" is a  
      "collating element", but these are  not  supported,  and  an  
      error is given if they are encountered.  

      In UTF-8 mode, characters with values greater  than  255  do
      not match any of the POSIX character classes.

769
770           int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
771                int what, void *where);
772
773           int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
774
775           int pcre_refcount(pcre *code, int adjust);
776
777           int pcre_config(int what, void *where);
778
779           char *pcre_version(void);
780


 VERTICAL BAR

      Vertical bar characters are  used  to  separate  alternative
      patterns. For example, the pattern

        gilbert|sullivan

      matches either "gilbert" or "sullivan". Any number of alter-  
      natives  may  appear,  and an empty alternative is permitted  
      (matching the empty string).   The  matching  process  tries  
      each  alternative in turn, from left to right, and the first  
      one that succeeds is used. If the alternatives are within  a  
      subpattern  (defined  below),  "succeeds" means matching the  
      rest of the main pattern as well as the alternative  in  the  
      subpattern.  
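
      The left-to-right rule for alternatives can be sketched as below.
      Python's re module is used on the assumption that it tries
      alternatives in the same order as PCRE; the names are
      illustrative.

```python
import re

# The first alternative that succeeds is used, even if a later one is longer.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes matching the rest of the pattern,
# so the second alternative is chosen here.
with_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"x(y|)", "x").group(1)

print(first_wins, with_rest, repr(empty_alt))
```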
781
782    PCRE NATIVE API INDIRECTED FUNCTIONS
783
784           void *(*pcre_malloc)(size_t);
785
786           void (*pcre_free)(void *);


 INTERNAL OPTION SETTING

      The   settings   of   the   PCRE_CASELESS,   PCRE_MULTILINE,
      PCRE_DOTALL,  and  PCRE_EXTENDED options can be changed from  
      within the pattern by a  sequence  of  Perl  option  letters  
      enclosed between "(?" and ")". The option letters are  
   
        i  for PCRE_CASELESS  
        m  for PCRE_MULTILINE  
        s  for PCRE_DOTALL  
        x  for PCRE_EXTENDED  
   
      For example, (?im) sets caseless, multiline matching. It  is  
      also possible to unset these options by preceding the letter  
      with a hyphen, and a combined setting and unsetting such  as  
      (?im-sx),  which sets PCRE_CASELESS and PCRE_MULTILINE while  
      unsetting PCRE_DOTALL and PCRE_EXTENDED, is also  permitted.  
      If  a  letter  appears both before and after the hyphen, the  
      option is unset.  
   
      When an option change occurs at  top  level  (that  is,  not  
      inside  subpattern  parentheses),  the change applies to the  
      remainder of the pattern that follows.   If  the  change  is  
      placed  right  at  the  start of a pattern, PCRE extracts it  
      into the global options (and it will therefore  show  up  in  
      data extracted by the pcre_fullinfo() function).  
   
      An option change within a subpattern affects only that  part  
      of the current pattern that follows it, so  
   
        (a(?i)b)c  
   
      matches  abc  and  aBc  and  no  other   strings   (assuming  
      PCRE_CASELESS  is  not used).  By this means, options can be  
      made to have different settings in different  parts  of  the  
      pattern.  Any  changes  made  in one alternative do carry on  
      into subsequent branches within  the  same  subpattern.  For  
      example,  
   
        (a(?i)b|c)  
   
      matches "ab", "aB", "c", and "C", even though when  matching  
      "C" the first branch is abandoned before the option setting.  
      This is because the effects of  option  settings  happen  at  
      compile  time. There would be some very weird behaviour oth-  
      erwise.  
   
      The PCRE-specific options PCRE_UNGREEDY and  PCRE_EXTRA  can  
      be changed in the same way as the Perl-compatible options by  
      using the characters U and X  respectively.  The  (?X)  flag  
      setting  is  special in that it must always occur earlier in  
      the pattern than any of the additional features it turns on,  
      even when it is at top level. It is best put at the start.  
787
788           void *(*pcre_stack_malloc)(size_t);
789
790           void (*pcre_stack_free)(void *);
791
792           int (*pcre_callout)(pcre_callout_block *);


 SUBPATTERNS

      Subpatterns are delimited by parentheses  (round  brackets),
      which can be nested.  Marking part of a pattern as a subpat-  
      tern does two things:  
   
      1. It localizes a set of alternatives. For example, the pat-  
      tern  
   
        cat(aract|erpillar|)  
   
      matches one of the words "cat",  "cataract",  or  "caterpil-  
      lar".  Without  the  parentheses, it would match "cataract",  
      "erpillar" or the empty string.  
   
      2. It sets up the subpattern as a capturing  subpattern  (as  
      defined  above).   When the whole pattern matches, that por-  
      tion of the subject string that matched  the  subpattern  is  
      passed  back  to  the  caller  via  the  ovector argument of  
      pcre_exec(). Opening parentheses are counted  from  left  to  
      right (starting from 1) to obtain the numbers of the captur-  
      ing subpatterns.  
   
      For example, if the string "the red king" is matched against  
      the pattern  
   
        the ((red|white) (king|queen))  
   
      the captured substrings are "red king", "red",  and  "king",  
      and are numbered 1, 2, and 3, respectively.  
   
      The fact that plain parentheses fulfil two functions is  not  
      always  helpful.  There are often times when a grouping sub-  
      pattern is required without a capturing requirement.  If  an  
      opening  parenthesis  is  followed  by a question mark and a  
      colon, the subpattern does not do any capturing, and is  not  
      counted  when computing the number of any subsequent captur-  
      ing subpatterns. For  example,  if  the  string  "the  white  
      queen" is matched against the pattern  
   
        the ((?:red|white) (king|queen))  
   
      the captured substrings are "white queen" and  "queen",  and  
      are  numbered  1 and 2. The maximum number of capturing sub-  
      patterns is 65535, and the maximum depth of nesting  of  all  
      subpatterns, both capturing and non-capturing, is 200.  
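
      The numbering of capturing and non-capturing subpatterns can be
      seen in a short sketch. It uses Python's re module, which is
      assumed to number groups in the same way; pcre_exec() and its
      ovector argument are not involved here.

```python
import re

capturing = re.match(r"the ((red|white) (king|queen))", "the red king")
non_capturing = re.match(r"the ((?:red|white) (king|queen))", "the white queen")

# Parentheses also localize a set of alternatives.
local_alts = re.compile(r"cat(aract|erpillar|)")
words = [w for w in ("cat", "cataract", "caterpillar", "erpillar")
         if local_alts.fullmatch(w)]

print(capturing.groups(), non_capturing.groups(), words)
```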
   
      As a  convenient  shorthand,  if  any  option  settings  are  
      required  at  the  start  of a non-capturing subpattern, the  
      option letters may appear between the "?" and the ":".  Thus  
      the two patterns  
   
        (?i:saturday|sunday)  
        (?:(?i)saturday|sunday)  
   
      match exactly the same set of strings.  Because  alternative  
      branches  are  tried from left to right, and options are not  
      reset until the end of the subpattern is reached, an  option  
      setting  in  one  branch does affect subsequent branches, so  
      the above patterns match "SUNDAY" as well as "Saturday".  
793
794
795    PCRE API OVERVIEW
796
797           PCRE has its own native API, which is described in this document. There
798           are also some wrapper functions that correspond to  the  POSIX  regular
799           expression  API,  but they do not give access to all the functionality.
800           They are described in the pcreposix documentation. Both of  these  APIs
801           define  a  set  of  C function calls. A C++ wrapper is also distributed
802           with PCRE. It is documented in the pcrecpp page.


 NAMED SUBPATTERNS

      Identifying capturing parentheses by number is  simple,  but
      it  can be very hard to keep track of the numbers in compli-
      cated regular expressions. Furthermore, if an expression  is
      modified,  the  numbers  may change. To help with the diffi-
      culty, PCRE supports the naming  of  subpatterns,  something
      that  Perl does not provide. The Python syntax (?P<name>...)
      is used. Names consist of alphanumeric characters and under-  
      scores, and must be unique within a pattern.  
   
      Named capturing parentheses are still allocated  numbers  as  
      well  as  names.  The  PCRE  API provides function calls for  
      extracting the name-to-number translation table from a  com-  
      piled  pattern. For further details see the pcreapi documen-  
      tation.  
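
      Because the (?P<name>...) syntax comes from Python, the idea can
      be sketched in Python itself. The groupindex attribute stands in
      here for the name-to-number table; a C program would obtain the
      same information through the PCRE API, for example with
      pcre_get_stringnumber(). The pattern and names are illustrative.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.match("2011-09")

# Named groups are still numbered from left to right.
by_name = m.group("year")
by_number = m.group(1)

# Mapping from group name to group number.
name_to_number = dict(date.groupindex)

print(by_name, by_number, name_to_number)
```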
803
804           The native API C function prototypes are defined  in  the  header  file
805           pcre.h,  and  on Unix systems the library itself is called libpcre.  It
806           can normally be accessed by adding -lpcre to the command for linking an
807           application  that  uses  PCRE.  The  header  file  defines  the  macros
808           PCRE_MAJOR and PCRE_MINOR to contain the major and minor  release  num-
809           bers  for  the  library.  Applications can use these to include support
810           for different releases of PCRE.
811
812           In a Windows environment, if you want to statically link an application
813           program  against  a  non-dll  pcre.a  file, you must define PCRE_STATIC
814           before including pcre.h or pcrecpp.h, because otherwise  the  pcre_mal-
815           loc()   and   pcre_free()   exported   functions   will   be   declared
816           __declspec(dllimport), with unwanted results.
817
818           The  functions  pcre_compile(),  pcre_compile2(),   pcre_study(),   and
819           pcre_exec()  are used for compiling and matching regular expressions in
820           a Perl-compatible manner. A sample program that demonstrates  the  sim-
821           plest  way  of  using them is provided in the file called pcredemo.c in
822           the PCRE source distribution. A listing of this program is given in the
823           pcredemo  documentation, and the pcresample documentation describes how
824           to compile and run it.
825
826           Just-in-time compiler support is an optional feature of PCRE  that  can
827           be built in appropriate hardware environments. It greatly speeds up the
828           matching performance of  many  patterns.  Simple  programs  can  easily
829           request  that  it  be  used  if available, by setting an option that is
830           ignored when it is not relevant. More complicated programs  might  need
831           to     make    use    of    the    functions    pcre_jit_stack_alloc(),
832           pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to  control
833           the  JIT  code's  memory  usage.   These functions are discussed in the
834           pcrejit documentation.


 REPETITION

      Repetition is specified by quantifiers, which can follow any
      of the following items:

        a literal data character
        the . metacharacter
        the \C escape sequence
        escapes such as \d that match single characters
        a character class
        a back reference (see next section)
        a parenthesized subpattern (unless it is an assertion)

      The general repetition quantifier specifies  a  minimum  and
      maximum  number  of  permitted  matches,  by  giving the two  
      numbers in curly brackets (braces), separated  by  a  comma.  
      The  numbers  must be less than 65536, and the first must be  
      less than or equal to the second. For example:  
   
        z{2,4}  
   
      matches "zz", "zzz", or "zzzz". A closing brace on  its  own  
      is not a special character. If the second number is omitted,  
      but the comma is present, there is no upper  limit;  if  the  
      second number and the comma are both omitted, the quantifier  
      specifies an exact number of required matches. Thus  
   
        [aeiou]{3,}  
   
      matches at least 3 successive vowels,  but  may  match  many  
      more, while  
   
        \d{8}  
   
      matches exactly 8 digits.  An  opening  curly  bracket  that  
      appears  in a position where a quantifier is not allowed, or  
      one that does not match the syntax of a quantifier, is taken  
      as  a literal character. For example, {,6} is not a quantif-  
      ier, but a literal string of four characters.  
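
      The three quantifier forms above can be exercised in a short
      sketch using Python's re module, assumed to agree with PCRE for
      these cases. The {,6} case is left out deliberately: Python reads
      it as {0,6} rather than as a literal string.

```python
import re

candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
two_to_four = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# No upper limit: a greedy match takes the whole run of vowels.
three_or_more = re.search(r"[aeiou]{3,}", "queueing").group()

# Exact count.
eight_ok = re.fullmatch(r"\d{8}", "20110923") is not None
seven_ok = re.fullmatch(r"\d{8}", "2011092") is not None

print(two_to_four, three_or_more, eight_ok, seven_ok)
```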
   
      In UTF-8 mode, quantifiers apply to UTF-8 characters  rather  
      than  to  individual  bytes.  Thus,  for example, \x{100}{2}  
      matches two UTF-8 characters, each of which  is  represented  
      by a two-byte sequence.  
   
      The quantifier {0} is permitted, causing the  expression  to  
      behave  as  if the previous item and the quantifier were not  
      present.  
   
      For convenience (and  historical  compatibility)  the  three  
      most common quantifiers have single-character abbreviations:  
   
        *    is equivalent to {0,}  
        +    is equivalent to {1,}  
        ?    is equivalent to {0,1}  
   
      It is possible to construct infinite loops  by  following  a  
      subpattern  that  can  match no characters with a quantifier  
      that has no upper limit, for example:  
   
        (a?)*  
   
      Earlier versions of Perl and PCRE used to give an  error  at  
      compile  time  for such patterns. However, because there are  
      cases where this  can  be  useful,  such  patterns  are  now  
      accepted,  but  if  any repetition of the subpattern does in  
      fact match no characters, the loop is forcibly broken.  
   
      By default, the quantifiers  are  "greedy",  that  is,  they  
      match  as much as possible (up to the maximum number of per-  
      mitted times), without causing the rest of  the  pattern  to  
      fail. The classic example of where this gives problems is in  
      trying to match comments in C programs. These appear between  
      the  sequences /* and */ and within the sequence, individual  
      * and / characters may appear. An attempt to  match  C  com-  
      ments by applying the pattern  
   
        /\*.*\*/  
   
      to the string  
   
        /* first comment */  not comment  /* second comment */
   
      fails, because it matches the entire  string  owing  to  the  
      greediness of the .*  item.  
   
      However, if a quantifier is followed by a question mark,  it  
      ceases  to be greedy, and instead matches the minimum number  
      of times possible, so the pattern  
   
        /\*.*?\*/  
   
      does the right thing with the C comments. The meaning of the  
      various  quantifiers is not otherwise changed, just the pre-  
      ferred number of matches.  Do not confuse this use of  ques-  
      tion  mark  with  its  use as a quantifier in its own right.  
      Because it has two uses, it can sometimes appear doubled, as  
      in  
   
        \d??\d  
   
      which matches one digit by preference, but can match two  if  
      that is the only way the rest of the pattern matches.  
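
      The difference between greedy and minimizing quantifiers can be
      sketched as follows, using Python's re module on the assumption
      that it follows the same Perl-style rules. The subject string is
      an illustrative variant of the one above.

```python
import re

source = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/", so the whole string is matched.
greedy = re.search(r"/\*.*\*/", source).group()

# Minimizing .*? stops at the first "*/", so each comment is found separately.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit but takes two if the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()

print(greedy == source, lazy, one_digit, two_digits)
```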
   
      If the PCRE_UNGREEDY option is set (an option which  is  not  
      available  in  Perl),  the  quantifiers  are  not  greedy by  
      default, but individual ones can be made greedy by following  
      them  with  a  question mark. In other words, it inverts the  
      default behaviour.  
   
      When a parenthesized subpattern is quantified with a minimum  
      repeat  count  that is greater than 1 or with a limited max-  
      imum, more store is required for the  compiled  pattern,  in  
      proportion to the size of the minimum or maximum.

      If a pattern starts with .* or  .{0,}  and  the  PCRE_DOTALL
      option (equivalent to Perl's /s) is set, thus allowing the .  
      to match  newlines,  the  pattern  is  implicitly  anchored,  
      because whatever follows will be tried against every charac-  
      ter position in the subject string, so there is no point  in  
      retrying  the overall match at any position after the first.  
      PCRE normally treats such a pattern as though it  were  pre-  
      ceded by \A.  
   
      In cases where it is known that the subject string  contains  
      no  newlines,  it  is  worth setting PCRE_DOTALL in order to  
      obtain this optimization, or alternatively using ^ to  indi-  
      cate anchoring explicitly.  
   
      However, there is one situation where the optimization  can-  
      not  be  used. When .*  is inside capturing parentheses that  
      are the subject of a backreference elsewhere in the pattern,  
      a match at the start may fail, and a later one succeed. Con-  
      sider, for example:  
   
        (.*)abc\1  
   
      If the subject is "xyz123abc123"  the  match  point  is  the  
      fourth  character.  For  this  reason, such a pattern is not  
      implicitly anchored.  
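
      The back reference case can be checked with a sketch in Python's
      re module. Here re.DOTALL is assumed to correspond to PCRE_DOTALL,
      and an explicit \A is used to show what treating the pattern as
      anchored would do.

```python
import re

subject = "xyz123abc123"

# Unanchored: the match starts at offset 3, the fourth character.
m = re.search(r"(.*)abc\1", subject, re.DOTALL)
start, captured = m.start(), m.group(1)

# Forcing an anchor at the start makes the same pattern fail.
anchored = re.search(r"\A(.*)abc\1", subject, re.DOTALL)

print(start, captured, anchored)
```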
   
      When a capturing subpattern is repeated, the value  captured  
      is the substring that matched the final iteration. For exam-  
      ple, after  
   
        (tweedle[dume]{3}\s*)+  
   
      has matched "tweedledum tweedledee" the value  of  the  cap-  
      tured  substring  is  "tweedledee".  However,  if  there are  
      nested capturing  subpatterns,  the  corresponding  captured  
      values  may  have been set in previous iterations. For exam-  
      ple, after  

        /(a|(b))+/

      matches "aba" the value of the second captured substring  is
      "b".

835
836           A second matching function, pcre_dfa_exec(), which is not Perl-compati-
837           ble,  is  also provided. This uses a different algorithm for the match-
838           ing. The alternative algorithm finds all possible matches (at  a  given
839           point  in  the  subject), and scans the subject just once (unless there
840           are lookbehind assertions). However, this  algorithm  does  not  return
841           captured  substrings.  A description of the two matching algorithms and
842           their advantages and disadvantages is given in the  pcrematching  docu-
843           mentation.
844
845           In  addition  to  the  main compiling and matching functions, there are
846           convenience functions for extracting captured substrings from a subject
847           string that is matched by pcre_exec(). They are:
848
849             pcre_copy_substring()
850             pcre_copy_named_substring()
851             pcre_get_substring()
852             pcre_get_named_substring()
853             pcre_get_substring_list()
854             pcre_get_stringnumber()
855             pcre_get_stringtable_entries()
856
857           pcre_free_substring() and pcre_free_substring_list() are also provided,
858           to free the memory used for extracted strings.
859
860           The function pcre_maketables() is used to  build  a  set  of  character
861           tables   in   the   current   locale  for  passing  to  pcre_compile(),
862           pcre_exec(), or pcre_dfa_exec(). This is an optional facility  that  is
863           provided  for  specialist  use.  Most  commonly,  no special tables are
864           passed, in which case internal tables that are generated when  PCRE  is
865           built are used.


 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS

      With both maximizing and minimizing repetition,  failure  of
      what  follows  normally  causes  the repeated item to be re-
      evaluated to see if a different number of repeats allows the
      rest  of  the  pattern  to  match. Sometimes it is useful to
      prevent this, either to change the nature of the  match,  or
      to cause it to fail earlier than it otherwise might, when the
      author of the pattern knows there is no  point  in  carrying  
      on.  
   
      Consider, for example, the pattern \d+foo  when  applied  to  
      the subject line  
   
        123456bar  
   
      After matching all 6 digits and then failing to match "foo",  
      the normal action of the matcher is to try again with only 5  
      digits matching the \d+ item, and then with 4,  and  so  on,  
      before  ultimately  failing. "Atomic grouping" (a term taken  
      from Jeffrey Friedl's book) provides the means for  specify-  
      ing  that once a subpattern has matched, it is not to be re-  
      evaluated in this way.  
   
      If we use atomic grouping  for  the  previous  example,  the  
      matcher  would give up immediately on failing to match "foo"  
      the  first  time.  The  notation  is  a  kind   of   special  
      parenthesis, starting with (?> as in this example:  
   
        (?>\d+)foo
   
      This kind of parenthesis "locks up" the  part of the pattern  
      it  contains once it has matched, and a failure further into  
      the pattern is prevented from backtracking  into  it.  Back-  
      tracking  past  it to previous items, however, works as nor-  
      mal.  
   
      An alternative description is that a subpattern of this type  
      matches  the  string  of  characters that an identical stan-  
      dalone pattern would match, if anchored at the current point  
      in the subject string.  
   
      Atomic grouping subpatterns are not  capturing  subpatterns.  
      Simple  cases such as the above example can be thought of as  
      a maximizing repeat that must swallow everything it can. So,  
      while both \d+ and \d+? are prepared to adjust the number of  
      digits they match in order to make the rest of  the  pattern  
      match, (?>\d+) can only match an entire sequence of digits.  
   
      Atomic groups in general can of course  contain  arbitrarily  
      complicated  subpatterns,  and  can be nested. However, when  
      the subpattern for an atomic group is just a single repeated  
      item,  as in the example above, a simpler notation, called a  
      "possessive quantifier" can be used.  This  consists  of  an  
      additional  +  character  following a quantifier. Using this  
      notation, the previous example can be rewritten as  
   
        \d++foo
   
      Possessive quantifiers are always greedy; the setting of the  
      PCRE_UNGREEDY option is ignored. They are a convenient nota-  
      tion for the simpler forms of atomic group.  However,  there  
      is  no  difference in the meaning or processing of a posses-  
      sive quantifier and the equivalent atomic group.  
   
      The possessive quantifier syntax is an extension to the Perl  
      syntax. It originates in Sun's Java package.  
   
      When a pattern contains an unlimited repeat inside a subpat-  
      tern  that  can  itself  be  repeated an unlimited number of  
      times, the use of an atomic group is the only way  to  avoid  
      some  failing  matches  taking  a very long time indeed. The  
      pattern  
   
        (\D+|<\d+>)*[!?]  
   
      matches an unlimited number of substrings that  either  con-  
      sist  of  non-digits,  or digits enclosed in <>, followed by  
      either ! or ?. When it matches, it runs quickly. However, if  
      it is applied to  
   
        aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa  
   
      it takes a long  time  before  reporting  failure.  This  is  
      because the string can be divided between the two repeats in  
      a large number of ways, and all have to be tried. (The exam-  
      ple  used  [!?]  rather  than a single character at the end,  
      because both PCRE and Perl have an optimization that  allows  
      for  fast  failure  when  a  single  character is used. They  
      remember the last single character that is  required  for  a  
      match,  and  fail early if it is not present in the string.)  
      If the pattern is changed to  
866    
867         ((?>\D+)|<\d+>)*[!?]         The  function  pcre_fullinfo()  is used to find out information about a
868           compiled pattern; pcre_info() is an obsolete version that returns  only
869           some  of  the available information, but is retained for backwards com-
870           patibility.  The function pcre_version() returns a pointer to a  string
871           containing the version of PCRE and its date of release.
872    
873       sequences of non-digits cannot be broken, and  failure  hap-         The  function  pcre_refcount()  maintains  a  reference count in a data
874       pens quickly.         block containing a compiled pattern. This is provided for  the  benefit
875           of object-oriented applications.
876    
877           The  global  variables  pcre_malloc and pcre_free initially contain the
878           entry points of the standard malloc()  and  free()  functions,  respec-
879           tively. PCRE calls the memory management functions via these variables,
880           so a calling program can replace them if it  wishes  to  intercept  the
881           calls. This should be done before calling any PCRE functions.
882    
883  BACK REFERENCES         The  global  variables  pcre_stack_malloc  and pcre_stack_free are also
884           indirections to memory management functions.  These  special  functions
885           are  used  only  when  PCRE is compiled to use the heap for remembering
886           data, instead of recursive function calls, when running the pcre_exec()
887           function.  See  the  pcrebuild  documentation  for details of how to do
888           this. It is a non-standard way of building PCRE, for  use  in  environ-
889           ments  that  have  limited stacks. Because of the greater use of memory
890           management, it runs more slowly. Separate  functions  are  provided  so
891           that  special-purpose  external  code  can  be used for this case. When
892           used, these functions are always called in a  stack-like  manner  (last
893           obtained,  first freed), and always for memory blocks of the same size.
894           There is a discussion about PCRE's stack usage in the  pcrestack  docu-
895           mentation.
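
As a rough illustration of the "special-purpose external code" mentioned above, the sketch below caches freed blocks and hands them back on the next request. It relies on the stated property that blocks are all the same size and are released in last-obtained, first-freed order. The names pool_malloc and pool_free are hypothetical and not part of PCRE. The assumption is that a program would assign them to pcre_stack_malloc and pcre_stack_free before calling any PCRE function.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical replacements with the same shape as malloc() and free().
   Not thread-safe: a real program sharing them between threads would
   need locking or per-thread state. */

static void  *free_list  = NULL;  /* freed blocks kept for reuse (a stack) */
static size_t block_size = 0;     /* the single block size this pool serves */

void *pool_malloc(size_t size)
{
    if (size < sizeof(void *))
        size = sizeof(void *);    /* each block must be able to hold a link */
    if (block_size == 0)
        block_size = size;        /* the first request fixes the size */
    if (size != block_size)
        return NULL;              /* this sketch handles one size only */

    if (free_list != NULL) {
        void *block = free_list;
        free_list = *(void **)block;  /* pop the most recently freed block */
        return block;
    }
    return malloc(size);
}

void pool_free(void *block)
{
    if (block == NULL)
        return;
    *(void **)block = free_list;  /* push onto the cache instead of freeing */
    free_list = block;
}
```

Because the cache is a stack, a block released by one call is the first one returned by the next, which matches the stack-like calling pattern described above. The cached blocks are never returned to the system in this sketch.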
896    
897       Outside a character class, a backslash followed by  a  digit         The global variable pcre_callout initially contains NULL. It can be set
898       greater  than  0  (and  possibly  further  digits) is a back         by the caller to a "callout" function, which PCRE  will  then  call  at
899       reference to a capturing subpattern earlier (that is, to its         specified  points during a matching operation. Details are given in the
900       left)  in  the  pattern,  provided there have been that many         pcrecallout documentation.
      previous capturing left parentheses.  
   
      However, if the decimal number following  the  backslash  is  
      less  than  10,  it is always taken as a back reference, and  
      causes an error only if there are not  that  many  capturing  
      left  parentheses in the entire pattern. In other words, the  
      parentheses that are referenced need not be to the  left  of  
      the  reference  for  numbers  less  than 10. See the section  
      entitled "Backslash" above for further details of  the  han-  
      dling of digits following a backslash.  
   
      A back reference matches whatever actually matched the  cap-  
      turing subpattern in the current subject string, rather than  
      anything matching the subpattern itself (see "Subpatterns as  
      subroutines" below for a way of doing that). So the pattern  
   
        (sens|respons)e and \1ibility  
   
      matches "sense and sensibility" and "response and  responsi-  
      bility",  but  not  "sense  and  responsibility". If caseful  
      matching is in force at the time of the back reference,  the  
      case of letters is relevant. For example,  
   
        ((?i)rah)\s+\1  
   
      matches "rah rah" and "RAH RAH", but  not  "RAH  rah",  even  
      though  the  original  capturing subpattern is matched case-  
      lessly.  
   
      Back references to named subpatterns use the  Python  syntax  
      (?P=name). We could rewrite the above example as follows:  
   
        (?P<p1>(?i)rah)\s+(?P=p1)
   
      There may be more than one back reference to the  same  sub-  
      pattern.  If  a  subpattern  has not actually been used in a  
      particular match, any back references to it always fail. For  
      example, the pattern  
   
        (a|(bc))\2  
   
      always fails if it starts to match  "a"  rather  than  "bc".  
      Because  there  may  be many capturing parentheses in a pat-  
      tern, all digits following the backslash are taken  as  part  
      of a potential back reference number. If the pattern contin-  
      ues with a digit character, some delimiter must be  used  to  
      terminate the back reference. If the PCRE_EXTENDED option is  
      set, this can be whitespace.  Otherwise an empty comment can  
      be used.  
   
      A back reference that occurs inside the parentheses to which  
      it  refers  fails when the subpattern is first used, so, for  
      example, (a\1) never matches.  However, such references  can  
      be useful inside repeated subpatterns. For example, the pat-  
      tern  
   
        (a|b\1)+  
   
      matches any number of "a"s and also "aba", "ababbaa" etc. At  
      each iteration of the subpattern, the back reference matches  
      the character string corresponding to  the  previous  itera-  
      tion.  In  order  for this to work, the pattern must be such  
      that the first iteration does not need  to  match  the  back  
      reference.  This  can  be  done using alternation, as in the  
      example above, or by a quantifier with a minimum of zero.  
901    
902    
903  ASSERTIONS  NEWLINES
904    
905           PCRE supports five different conventions for indicating line breaks  in
906           strings:  a  single  CR (carriage return) character, a single LF (line-
907           feed) character, the two-character sequence CRLF, any of the three pre-
908           ceding,  or any Unicode newline sequence. The Unicode newline sequences
909           are the three just mentioned, plus the single characters  VT  (vertical
910           tab,  U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS (line
911           separator, U+2028), and PS (paragraph separator, U+2029).
912    
913           Each of the first three conventions is used by at least  one  operating
914           system  as its standard newline sequence. When PCRE is built, a default
915           can be specified. The built-in default is LF, which is the Unix standard.
916           When PCRE is run, the default can be overridden, either when a
917           pattern is compiled, or when it is matched.
918    
919       An assertion is  a  test  on  the  characters  following  or         At compile time, the newline convention can be specified by the options
920       preceding  the current matching point that does not actually         argument  of  pcre_compile(), or it can be specified by special text at
921       consume any characters. The simple assertions coded  as  \b,         the start of the pattern itself; this overrides any other settings. See
922       \B,  \A, \G, \Z, \z, ^ and $ are described above.  More com-         the pcrepattern page for details of the special character sequences.
      plicated assertions are coded as subpatterns. There are  two  
      kinds:  those that look ahead of the current position in the  
      subject string, and those that look behind it.  
923    
924       An assertion subpattern is matched in the normal way, except         In the PCRE documentation the word "newline" is used to mean "the char-
925       that  it  does not cause the current matching position to be         acter or pair of characters that indicate a line break". The choice  of
926       changed. Lookahead assertions start with  (?=  for  positive         newline  convention  affects  the  handling of the dot, circumflex, and
927       assertions and (?! for negative assertions. For example,         dollar metacharacters, the handling of #-comments in /x mode, and, when
928           CRLF  is a recognized line ending sequence, the match position advance-
929           ment for a non-anchored pattern. There is more detail about this in the
930           section on pcre_exec() options below.
931    
932         \w+(?=;)         The  choice of newline convention does not affect the interpretation of
933           the \n or \r escape sequences, nor does  it  affect  what  \R  matches,
934           which is controlled in a similar way, but by separate options.
935    
      matches a word followed by a semicolon, but does not include  
      the semicolon in the match, and  
936    
937         foo(?!bar)  MULTITHREADING
938    
939       matches any occurrence of "foo"  that  is  not  followed  by         The  PCRE  functions  can be used in multi-threading applications, with
940       "bar". Note that the apparently similar pattern         the  proviso  that  the  memory  management  functions  pointed  to  by
941           pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
942           callout function pointed to by pcre_callout, are shared by all threads.
943    
944           The compiled form of a regular expression is not altered during  match-
945           ing, so the same compiled pattern can safely be used by several threads
946           at once.
947    
948           If the just-in-time optimization feature is being used, it needs  sepa-
949           rate  memory stack areas for each thread. See the pcrejit documentation
950           for more details.
951    
952    
953    SAVING PRECOMPILED PATTERNS FOR LATER USE
954    
955           The compiled form of a regular expression can be saved and re-used at a
956           later  time,  possibly by a different program, and even on a host other
957           than the one on which  it  was  compiled.  Details  are  given  in  the
958           pcreprecompile  documentation.  However, compiling a regular expression
959           with one version of PCRE for use with a different version is not  guar-
960           anteed to work and may cause crashes.
961    
        (?!foo)bar  
962    
963       does not find an occurrence of "bar"  that  is  preceded  by  CHECKING BUILD-TIME OPTIONS
      something other than "foo"; it finds any occurrence of "bar"  
      whatsoever, because the assertion  (?!foo)  is  always  true  
      when  the  next  three  characters  are  "bar". A lookbehind  
      assertion is needed to achieve this effect.  
964    
965       If you want to force a matching failure at some point  in  a         int pcre_config(int what, void *where);
      pattern,  the  most  convenient  way  to  do it is with (?!)  
      because an empty string always matches, so an assertion that  
      requires there not to be an empty string must always fail.  
966    
967       Lookbehind assertions start with (?<=  for  positive  asser-         The  function pcre_config() makes it possible for a PCRE client to dis-
968       tions and (?<! for negative assertions. For example,         cover which optional features have been compiled into the PCRE library.
969           The  pcrebuild documentation has more details about these optional fea-
970           tures.
971    
972         (?<!foo)bar         The first argument for pcre_config() is an  integer,  specifying  which
973           information is required; the second argument is a pointer to a variable
974           into which the information is  placed.  The  following  information  is
975           available:
976    
977       does find an occurrence of "bar" that  is  not  preceded  by           PCRE_CONFIG_UTF8
      "foo". The contents of a lookbehind assertion are restricted  
      such that all the strings  it  matches  must  have  a  fixed  
      length.  However, if there are several alternatives, they do  
      not all have to have the same fixed length. Thus  
978    
979         (?<=bullock|donkey)         The  output is an integer that is set to one if UTF-8 support is avail-
980           able; otherwise it is set to zero.
981    
982       is permitted, but           PCRE_CONFIG_UNICODE_PROPERTIES
983    
984         (?<!dogs?|cats?)         The output is an integer that is set to  one  if  support  for  Unicode
985           character properties is available; otherwise it is set to zero.
986    
987       causes an error at compile time. Branches  that  match  dif-           PCRE_CONFIG_JIT
      ferent length strings are permitted only at the top level of  
      a lookbehind assertion. This is an extension  compared  with  
      Perl  (at  least  for  5.8),  which requires all branches to  
      match the same length of string. An assertion such as  
988    
989         (?<=ab(c|de))         The output is an integer that is set to one if support for just-in-time
990           compiling is available; otherwise it is set to zero.
991    
992       is not permitted, because its single  top-level  branch  can           PCRE_CONFIG_NEWLINE
      match two different lengths, but it is acceptable if rewrit-  
      ten to use two top-level branches:  
993    
994         (?<=abc|abde)         The output is an integer whose value specifies  the  default  character
995           sequence  that is recognized as meaning "newline". The five values that
996           are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,
997           and  -1  for  ANY.  Though they are derived from ASCII, the same values
998           are returned in EBCDIC environments. The default should normally corre-
999           spond to the standard sequence for your operating system.
1000    
1001       The implementation of lookbehind  assertions  is,  for  each           PCRE_CONFIG_BSR
      alternative,  to  temporarily move the current position back  
      by the fixed width and then  try  to  match.  If  there  are  
      insufficient  characters  before  the  current position, the  
      match is deemed to fail.  
1002    
1003       PCRE does not allow the \C escape (which  matches  a  single         The output is an integer whose value indicates what character sequences
1004       byte  in  UTF-8  mode)  to  appear in lookbehind assertions,         the \R escape sequence matches by default. A value of 0 means  that  \R
1005       because it makes it impossible to calculate  the  length  of         matches  any  Unicode  line ending sequence; a value of 1 means that \R
1006       the lookbehind.         matches only CR, LF, or CRLF. The default can be overridden when a pat-
1007           tern is compiled or matched.
1008    
1009       Atomic groups can be used  in  conjunction  with  lookbehind           PCRE_CONFIG_LINK_SIZE
      assertions  to  specify efficient matching at the end of the  
      subject string. Consider a simple pattern such as  
1010    
1011         abcd$         The  output  is  an  integer that contains the number of bytes used for
1012           internal linkage in compiled regular expressions. The value is 2, 3, or
1013           4.  Larger  values  allow larger regular expressions to be compiled, at
1014           the expense of slower matching. The default value of  2  is  sufficient
1015           for  all  but  the  most massive patterns, since it allows the compiled
1016           pattern to be up to 64K in size.
1017    
1018       when applied to a long string that does not  match.  Because           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
      matching  proceeds  from  left  to right, PCRE will look for  
      each "a" in the subject and then see if what follows matches  
      the rest of the pattern. If the pattern is specified as  
1019    
1020         ^.*abcd$         The output is an integer that contains the threshold  above  which  the
1021           POSIX  interface  uses malloc() for output vectors. Further details are
1022           given in the pcreposix documentation.
1023    
1024       the initial .* matches the entire string at first, but  when           PCRE_CONFIG_MATCH_LIMIT
      this  fails  (because  there  is no following "a"), it back-  
      tracks to match all but the last character, then all but the  
      last  two  characters,  and so on. Once again the search for  
      "a" covers the entire string, from right to left, so we  are  
      no better off. However, if the pattern is written as  
1025    
1026         ^(?>.*)(?<=abcd)         The output is a long integer that gives the default limit for the  num-
1027           ber  of  internal  matching  function calls in a pcre_exec() execution.
1028           Further details are given with pcre_exec() below.
1029    
1030       or, equivalently,           PCRE_CONFIG_MATCH_LIMIT_RECURSION
1031    
1032         ^.*+(?<=abcd)         The output is a long integer that gives the default limit for the depth
1033           of   recursion  when  calling  the  internal  matching  function  in  a
1034           pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
1035           below.
1036    
1037       there can be no backtracking for the .* item; it  can  match           PCRE_CONFIG_STACKRECURSE
      only  the entire string. The subsequent lookbehind assertion  
      does a single test on the last four characters. If it fails,  
      the match fails immediately. For long strings, this approach  
      makes a significant difference to the processing time.  
1038    
1039       Several assertions (of any sort) may  occur  in  succession.         The  output is an integer that is set to one if internal recursion when
1040       For example,         running pcre_exec() is implemented by recursive function calls that use
1041           the  stack  to remember their state. This is the usual way that PCRE is
1042           compiled. The output is zero if PCRE was compiled to use blocks of data
1043           on  the  heap  instead  of  recursive  function  calls.  In  this case,
1044           pcre_stack_malloc and  pcre_stack_free  are  called  to  manage  memory
1045           blocks on the heap, thus avoiding the use of the stack.
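
To make the PCRE_CONFIG_NEWLINE encoding listed above more concrete, here is a small sketch that maps each documented value to a name. The helper newline_name() is hypothetical and not part of PCRE. In a real program the value would presumably come from a call such as pcre_config(PCRE_CONFIG_NEWLINE, &value).

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): name the newline convention
   encoded by a value reported for PCRE_CONFIG_NEWLINE. */
const char *newline_name(int value)
{
    switch (value) {
    case 10:             return "LF";       /* ASCII linefeed */
    case 13:             return "CR";       /* ASCII carriage return */
    case (13 << 8) | 10: return "CRLF";     /* 3338: CR in the high byte, LF in the low */
    case -2:             return "ANYCRLF";
    case -1:             return "ANY";
    default:             return NULL;       /* not a documented value */
    }
}
```

The CRLF value is simply the two character codes packed into one integer (13 * 256 + 10 = 3338), which is why it looks unrelated to the others at first glance.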
1046    
        (?<=\d{3})(?<!999)foo  
1047    
1048       matches "foo" preceded by three digits that are  not  "999".  COMPILING A PATTERN
      Notice  that each of the assertions is applied independently  
      at the same point in the subject string. First  there  is  a  
      check that the previous three characters are all digits, and  
      then there is a check that the same three characters are not  
      "999".   This  pattern  does not match "foo" preceded by six  
      characters, the first three of which are digits and the last three
      of  which  are  not  "999".  For  example,  it doesn't match  
      "123abcfoo". A pattern to do that is  
1049    
1050         (?<=\d{3}...)(?<!999)foo         pcre *pcre_compile(const char *pattern, int options,
1051                const char **errptr, int *erroffset,
1052                const unsigned char *tableptr);
1053    
1054           pcre *pcre_compile2(const char *pattern, int options,
1055                int *errorcodeptr,
1056                const char **errptr, int *erroffset,
1057                const unsigned char *tableptr);
1058    
1059           Either of the functions pcre_compile() or pcre_compile2() can be called
1060           to compile a pattern into an internal form. The only difference between
1061           the  two interfaces is that pcre_compile2() has an additional argument,
1062           errorcodeptr, via which a numerical error  code  can  be  returned.  To
1063           avoid  too  much repetition, we refer just to pcre_compile() below, but
1064           the information applies equally to pcre_compile2().
1065    
1066           The pattern is a C string terminated by a binary zero, and is passed in
1067           the  pattern  argument.  A  pointer to a single block of memory that is
1068           obtained via pcre_malloc is returned. This contains the  compiled  code
1069           and related data. The pcre type is defined for the returned block; this
1070           is a typedef for a structure whose contents are not externally defined.
1071           It is up to the caller to free the memory (via pcre_free) when it is no
1072           longer required.
1073    
1074           Although the compiled code of a PCRE regex is relocatable, that is,  it
1075           does not depend on memory location, the complete pcre data block is not
1076           fully relocatable, because it may contain a copy of the tableptr  argu-
1077           ment, which is an address (see below).
1078    
1079           The options argument contains various bit settings that affect the com-
1080           pilation. It should be zero if no options are required.  The  available
1081           options  are  described  below. Some of them (in particular, those that
1082           are compatible with Perl, but some others as well) can also be set  and
1083           unset  from  within  the  pattern  (see the detailed description in the
1084           pcrepattern documentation). For those options that can be different  in
1085           different  parts  of  the pattern, the contents of the options argument
1086           specify their settings at the start of compilation and execution. The
1087           PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
1088           PCRE_NO_START_OPT options can be set at the time of matching as well as
1089           at compile time.
1090    
1091           If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
1092           if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
1093           sets the variable pointed to by errptr to point to a textual error mes-
1094           sage. This is a static string that is part of the library. You must not
1095           try  to  free it. Normally, the offset from the start of the pattern to
1096           the byte that was being processed when  the  error  was  discovered  is
1097           placed  in the variable pointed to by erroffset, which must not be NULL
1098           (if it is, an immediate error is given). However, for an invalid  UTF-8
1099           string,  the offset is that of the first byte of the failing character.
1100           Also, some errors are not detected until checks are  carried  out  when
1101           the  whole  pattern  has been scanned; in these cases the offset passed
1102           back is the length of the pattern.
1103    
1104           Note that the offset is in bytes, not characters, even in  UTF-8  mode.
1105           It may sometimes point into the middle of a UTF-8 character.
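
Since the offset is a byte count, a program that wants to show the user a character position has to convert it. The sketch below is one way to do that, assuming the pattern is valid UTF-8. The function utf8_char_index() is a hypothetical helper, not part of PCRE.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE): return the zero-based index of
   the character that contains the byte at byte_offset in a UTF-8 string.
   An offset that points into the middle of a multi-byte character gives
   the index of that character. An offset at or beyond the end of the
   string gives the number of characters in the string. */
size_t utf8_char_index(const char *s, size_t byte_offset)
{
    size_t starts = 0;   /* characters that begin at or before byte_offset */
    size_t i;

    for (i = 0; i <= byte_offset; i++) {
        /* Continuation bytes have the form 10xxxxxx; anything else
           begins a new character. */
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            starts++;
        if (s[i] == '\0')
            break;       /* never read past the terminating zero */
    }
    return starts > 0 ? starts - 1 : 0;
}
```

For the pattern "a\xC3\xA9(b", where the second character occupies two bytes, byte offset 3 (the parenthesis) corresponds to character index 2.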
1106    
1107           If  pcre_compile2()  is  used instead of pcre_compile(), and the error-
1108           codeptr argument is not NULL, a non-zero error code number is  returned
1109           via  this argument in the event of an error. This is in addition to the
1110           textual error message. Error codes and messages are listed below.
1111    
1112           If the final argument, tableptr, is NULL, PCRE uses a  default  set  of
1113           character  tables  that  are  built  when  PCRE  is compiled, using the
1114           default C locale. Otherwise, tableptr must be an address  that  is  the
1115           result  of  a  call to pcre_maketables(). This value is stored with the
1116           compiled pattern, and used again by pcre_exec(), unless  another  table
1117           pointer is passed to it. For more discussion, see the section on locale
1118           support below.
1119    
1120           This code fragment shows a typical straightforward  call  to  pcre_com-
1121           pile():
1122    
1123             pcre *re;
1124             const char *error;
1125             int erroffset;
1126             re = pcre_compile(
1127               "^A.*Z",          /* the pattern */
1128               0,                /* default options */
1129               &error,           /* for error message */
1130               &erroffset,       /* for error offset */
1131               NULL);            /* use default character tables */
1132    
1133           The  following  names  for option bits are defined in the pcre.h header
1134           file:
1135    
1136             PCRE_ANCHORED
1137    
1138           If this bit is set, the pattern is forced to be "anchored", that is, it
1139           is  constrained to match only at the first matching point in the string
1140           that is being searched (the "subject string"). This effect can also  be
1141           achieved  by appropriate constructs in the pattern itself, which is the
1142           only way to do it in Perl.
1143    
1144             PCRE_AUTO_CALLOUT
1145    
1146           If this bit is set, pcre_compile() automatically inserts callout items,
1147           all  with  number  255, before each pattern item. For discussion of the
1148           callout facility, see the pcrecallout documentation.
1149    
1150             PCRE_BSR_ANYCRLF
1151             PCRE_BSR_UNICODE
1152    
1153           These options (which are mutually exclusive) control what the \R escape
1154           sequence  matches.  The choice is either to match only CR, LF, or CRLF,
1155           or to match any Unicode newline sequence. The default is specified when
1156           PCRE is built. It can be overridden from within the pattern, or by set-
1157           ting an option when a compiled pattern is matched.
1158    
1159             PCRE_CASELESS
1160    
1161           If this bit is set, letters in the pattern match both upper  and  lower
1162           case  letters.  It  is  equivalent  to  Perl's /i option, and it can be
1163           changed within a pattern by a (?i) option setting. In UTF-8 mode,  PCRE
1164           always  understands the concept of case for characters whose values are
1165           less than 128, so caseless matching is always possible. For  characters
1166           with  higher  values,  the concept of case is supported if PCRE is com-
1167           piled with Unicode property support, but not otherwise. If you want  to
1168           use  caseless  matching  for  characters 128 and above, you must ensure
1169           that PCRE is compiled with Unicode property support  as  well  as  with
1170           UTF-8 support.
1171    
1172             PCRE_DOLLAR_ENDONLY
1173    
1174           If  this bit is set, a dollar metacharacter in the pattern matches only
1175           at the end of the subject string. Without this option,  a  dollar  also
1176           matches  immediately before a newline at the end of the string (but not
1177           before any other newlines). The PCRE_DOLLAR_ENDONLY option  is  ignored
1178           if  PCRE_MULTILINE  is  set.   There is no equivalent to this option in
1179           Perl, and no way to set it within a pattern.
1180    
1181             PCRE_DOTALL
1182    
1183           If this bit is set, a dot metacharacter in the pattern matches a  char-
1184           acter of any value, including one that indicates a newline. However, it
1185           only ever matches one character, even if newlines are  coded  as  CRLF.
1186           Without  this option, a dot does not match when the current position is
1187           at a newline. This option is equivalent to Perl's /s option, and it can
1188           be  changed within a pattern by a (?s) option setting. A negative class
1189           such as [^a] always matches newline characters, independent of the set-
1190           ting of this option.
1191    
1192             PCRE_DUPNAMES
1193    
1194           If  this  bit is set, names used to identify capturing subpatterns need
1195           not be unique. This can be helpful for certain types of pattern when it
1196           is  known  that  only  one instance of the named subpattern can ever be
1197           matched. There are more details of named subpatterns  below;  see  also
1198           the pcrepattern documentation.
1199    
1200             PCRE_EXTENDED
1201    
1202           If  this  bit  is  set,  whitespace  data characters in the pattern are
1203           totally ignored except when escaped or inside a character class. White-
1204           space does not include the VT character (code 11). In addition, charac-
1205           ters between an unescaped # outside a character class and the next new-
1206           line,  inclusive,  are  also  ignored.  This is equivalent to Perl's /x
1207           option, and it can be changed within a pattern by a  (?x)  option  set-
1208           ting.
1209    
1210           Which  characters  are  interpreted  as  newlines  is controlled by the
1211           options passed to pcre_compile() or by a special sequence at the  start
1212           of  the  pattern, as described in the section entitled "Newline conven-
1213           tions" in the pcrepattern documentation. Note that the end of this type
1214           of  comment  is  a  literal  newline  sequence  in  the pattern; escape
1215           sequences that happen to represent a newline do not count.
1216    
1217           This option makes it possible to include  comments  inside  complicated
1218           patterns.   Note,  however,  that this applies only to data characters.
1219           Whitespace  characters  may  never  appear  within  special   character
1220           sequences in a pattern, for example within the sequence (?( that intro-
1221           duces a conditional subpattern.
1222    
1223             PCRE_EXTRA
1224    
1225           This option was invented in order to turn on  additional  functionality
1226           of  PCRE  that  is  incompatible with Perl, but it is currently of very
1227           little use. When set, any backslash in a pattern that is followed by  a
1228           letter  that  has  no  special  meaning causes an error, thus reserving
1229           these combinations for future expansion. By  default,  as  in  Perl,  a
1230           backslash  followed by a letter with no special meaning is treated as a
1231           literal. (Perl can, however, be persuaded to give an error for this, by
1232           running  it with the -w option.) There are at present no other features
1233           controlled by this option. It can also be set by a (?X) option  setting
1234           within a pattern.
1235    
1236             PCRE_FIRSTLINE
1237    
1238           If  this  option  is  set,  an  unanchored pattern is required to match
1239           before or at the first  newline  in  the  subject  string,  though  the
1240           matched text may continue over the newline.
1241    
1242             PCRE_JAVASCRIPT_COMPAT
1243    
1244           If this option is set, PCRE's behaviour is changed in some ways so that
1245           it is compatible with JavaScript rather than Perl. The changes  are  as
1246           follows:
1247    
1248           (1)  A  lone  closing square bracket in a pattern causes a compile-time
1249           error, because this is illegal in JavaScript (by default it is  treated
1250           as a data character). Thus, the pattern AB]CD becomes illegal when this
1251           option is set.
1252    
1253           (2) At run time, a back reference to an unset subpattern group  matches
1254           an  empty  string (by default this causes the current matching alterna-
1255           tive to fail). A pattern such as (\1)(a) succeeds when this  option  is
1256           set  (assuming  it can find an "a" in the subject), whereas it fails by
1257           default, for Perl compatibility.
1258    
1259             PCRE_MULTILINE
1260    
1261           By default, PCRE treats the subject string as consisting  of  a  single
1262           line  of characters (even if it actually contains newlines). The "start
1263           of line" metacharacter (^) matches only at the  start  of  the  string,
1264           while  the  "end  of line" metacharacter ($) matches only at the end of
1265           the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
1266           is set). This is the same as Perl.
1267    
1268           When PCRE_MULTILINE is set, the "start of line" and "end of line"
1269           constructs match immediately following or immediately  before  internal
1270           newlines  in  the  subject string, respectively, as well as at the very
1271           start and end. This is equivalent to Perl's /m option, and  it  can  be
1272           changed within a pattern by a (?m) option setting. If there are no new-
1273           lines in a subject string, or no occurrences of ^ or $  in  a  pattern,
1274           setting PCRE_MULTILINE has no effect.
1275    
1276             PCRE_NEWLINE_CR
1277             PCRE_NEWLINE_LF
1278             PCRE_NEWLINE_CRLF
1279             PCRE_NEWLINE_ANYCRLF
1280             PCRE_NEWLINE_ANY
1281    
1282           These  options  override the default newline definition that was chosen
1283           when PCRE was built. Setting the first or the second specifies  that  a
1284           newline  is  indicated  by a single character (CR or LF, respectively).
1285           Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the
1286           two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies
1287           that any of the three preceding sequences should be recognized. Setting
1288           PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be
1289           recognized. The Unicode newline sequences are the three just mentioned,
1290           plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,
1291           U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS
1292           (paragraph  separator,  U+2029).  The  last  two are recognized only in
1293           UTF-8 mode.
1294    
1295           The newline setting in the  options  word  uses  three  bits  that  are
1296           treated as a number, giving eight possibilities. Currently only six are
1297           used (default plus the five values above). This means that if  you  set
1298           more  than one newline option, the combination may or may not be sensi-
1299           ble. For example, PCRE_NEWLINE_CR with PCRE_NEWLINE_LF is equivalent to
1300           PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
1301           cause an error.
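
The bit arithmetic can be checked with a small sketch. The numeric values below are the ones I believe pcre.h of the 8.x series uses; treat them as an assumption and verify them against your own header. NEWLINE_MASK and newline_bits_are_valid() are hypothetical names introduced only for this illustration.

```c
/* Assumed values (check your pcre.h); only defined here if the header
   has not already supplied them. */
#ifndef PCRE_NEWLINE_CR
#define PCRE_NEWLINE_CR       0x00100000
#define PCRE_NEWLINE_LF       0x00200000
#define PCRE_NEWLINE_CRLF     0x00300000
#define PCRE_NEWLINE_ANY      0x00400000
#define PCRE_NEWLINE_ANYCRLF  0x00500000
#endif

/* Hypothetical helper: the three bits that hold the newline setting. */
#define NEWLINE_MASK 0x00700000

/* Returns 1 if the newline bits in options are one of the six values in
   use (default plus the five named options), 0 for an unused number. */
static int newline_bits_are_valid(unsigned long options)
{
    switch (options & NEWLINE_MASK) {
    case 0:                       /* default chosen when PCRE was built */
    case PCRE_NEWLINE_CR:
    case PCRE_NEWLINE_LF:
    case PCRE_NEWLINE_CRLF:
    case PCRE_NEWLINE_ANY:
    case PCRE_NEWLINE_ANYCRLF:
        return 1;
    default:
        return 0;
    }
}
```

Under these assumed values, PCRE_NEWLINE_CR | PCRE_NEWLINE_LF is numerically the same as PCRE_NEWLINE_CRLF, whereas PCRE_NEWLINE_CRLF | PCRE_NEWLINE_ANY produces an unused number.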
1302    
1303           The only time that a line break in a pattern  is  specially  recognized
1304           when  compiling  is when PCRE_EXTENDED is set. CR and LF are whitespace
1305           characters, and so are ignored in this mode. Also, an unescaped #  out-
1306           side  a  character class indicates a comment that lasts until after the
1307           next line break sequence. In other circumstances, line break  sequences
1308           in patterns are treated as literal data.
1309    
1310           The newline option that is set at compile time becomes the default that
1311           is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
1312    
1313             PCRE_NO_AUTO_CAPTURE
1314    
1315           If this option is set, it disables the use of numbered capturing paren-
1316           theses  in the pattern. Any opening parenthesis that is not followed by
1317           ? behaves as if it were followed by ?: but named parentheses can  still
1318           be  used  for  capturing  (and  they acquire numbers in the usual way).
1319           There is no equivalent of this option in Perl.
1320    
1321             PCRE_NO_START_OPTIMIZE
1322    
1323           This is an option that acts at matching time; that is, it is really  an
1324           option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
1325           time, it is remembered with the compiled pattern and assumed at  match-
1326           ing  time.  For  details  see  the discussion of PCRE_NO_START_OPTIMIZE
1327           below.
1328    
1329             PCRE_UCP
1330    
1331           This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,
1332           \w,  and  some  of  the POSIX character classes. By default, only ASCII
1333           characters are recognized, but if PCRE_UCP is set,  Unicode  properties
1334           are  used instead to classify characters. More details are given in the
1335           section on generic character types in the pcrepattern page. If you  set
1336           PCRE_UCP,  matching  one of the items it affects takes much longer. The
1337           option is available only if PCRE has been compiled with  Unicode  prop-
1338           erty support.
1339    
1340             PCRE_UNGREEDY
1341    
1342           This  option  inverts  the "greediness" of the quantifiers so that they
1343           are not greedy by default, but become greedy if followed by "?". It  is
1344           not  compatible  with Perl. It can also be set by a (?U) option setting
1345           within the pattern.
1346    
1347             PCRE_UTF8
1348    
1349           This option causes PCRE to regard both the pattern and the  subject  as
1350           strings  of  UTF-8 characters instead of single-byte character strings.
1351           However, it is available only when PCRE is built to include UTF-8  sup-
1352           port.  If not, the use of this option provokes an error. Details of how
1353           this option changes the behaviour of PCRE are given in the  pcreunicode
1354           page.
1355    
1356             PCRE_NO_UTF8_CHECK
1357    
1358           When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
1359           automatically checked. There is a  discussion  about  the  validity  of
1360           UTF-8  strings  in  the main pcre page. If an invalid UTF-8 sequence of
1361           bytes is found, pcre_compile() returns an error. If  you  already  know
1362           that your pattern is valid, and you want to skip this check for perfor-
1363           mance reasons, you can set the PCRE_NO_UTF8_CHECK option.  When  it  is
1364           set,  the  effect  of  passing  an invalid UTF-8 string as a pattern is
1365           undefined. It may cause your program to crash. Note  that  this  option
1366           can  also be passed to pcre_exec() and pcre_dfa_exec(), to suppress the
1367           UTF-8 validity checking of subject strings.
1368    
1369    
1370    COMPILATION ERROR CODES
1371    
1372           The following table lists the error  codes  that  may  be  returned  by
1373           pcre_compile2(),  along with the error messages that may be returned by
1374           both compiling functions. As PCRE has developed, some error codes  have
1375           fallen out of use. To avoid confusion, they have not been re-used.
1376    
1377              0  no error
1378              1  \ at end of pattern
1379              2  \c at end of pattern
1380              3  unrecognized character follows \
1381              4  numbers out of order in {} quantifier
1382              5  number too big in {} quantifier
1383              6  missing terminating ] for character class
1384              7  invalid escape sequence in character class
1385              8  range out of order in character class
1386              9  nothing to repeat
1387             10  [this code is not in use]
1388             11  internal error: unexpected repeat
1389             12  unrecognized character after (? or (?-
1390             13  POSIX named classes are supported only within a class
1391             14  missing )
1392             15  reference to non-existent subpattern
1393             16  erroffset passed as NULL
1394             17  unknown option bit(s) set
1395             18  missing ) after comment
1396             19  [this code is not in use]
1397             20  regular expression is too large
1398             21  failed to get memory
1399             22  unmatched parentheses
1400             23  internal error: code overflow
1401             24  unrecognized character after (?<
1402             25  lookbehind assertion is not fixed length
1403             26  malformed number or name after (?(
1404             27  conditional group contains more than two branches
1405             28  assertion expected after (?(
1406             29  (?R or (?[+-]digits must be followed by )
1407             30  unknown POSIX class name
1408             31  POSIX collating elements are not supported
1409             32  this version of PCRE is not compiled with PCRE_UTF8 support
1410             33  [this code is not in use]
1411             34  character value in \x{...} sequence is too large
1412             35  invalid condition (?(0)
1413             36  \C not allowed in lookbehind assertion
1414             37  PCRE does not support \L, \l, \N{name}, \U, or \u
1415             38  number after (?C is > 255
1416             39  closing ) for (?C expected
1417             40  recursive call could loop indefinitely
1418             41  unrecognized character after (?P
1419             42  syntax error in subpattern name (missing terminator)
1420             43  two named subpatterns have the same name
1421             44  invalid UTF-8 string
1422             45  support for \P, \p, and \X has not been compiled
1423             46  malformed \P or \p sequence
1424             47  unknown property name after \P or \p
1425             48  subpattern name is too long (maximum 32 characters)
1426             49  too many named subpatterns (maximum 10000)
1427             50  [this code is not in use]
1428             51  octal value is greater than \377 (not in UTF-8 mode)
1429             52  internal error: overran compiling workspace
1430             53  internal error: previously-checked referenced subpattern
1431                   not found
1432             54  DEFINE group contains more than one branch
1433             55  repeating a DEFINE group is not allowed
1434             56  inconsistent NEWLINE options
1435             57  \g is not followed by a braced, angle-bracketed, or quoted
1436                   name/number or by a plain number
1437             58  a numbered reference must not be zero
1438             59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
1439             60  (*VERB) not recognized
1440             61  number is too big
1441             62  subpattern name expected
1442             63  digit expected after (?+
1443             64  ] is an invalid data character in JavaScript compatibility mode
1444             65  different names for subpatterns of the same number are
1445                   not allowed
1446             66  (*MARK) must have an argument
1447             67  this version of PCRE is not compiled with PCRE_UCP support
1448             68  \c must be followed by an ASCII character
1449             69  \k is not followed by a braced, angle-bracketed, or quoted name
1450    
      This time the first assertion looks  at  the  preceding  six
      characters,  checking  that  the first three are digits, and
      then the second assertion checks that  the  preceding  three
      characters are not "999".

      Assertions can be nested in any combination. For example,

        (?<=(?<!foo)bar)baz

      matches an occurrence of "baz" that  is  preceded  by  "bar"
      which in turn is not preceded by "foo", while

        (?<=\d{3}(?!999)...)foo

1451           The  numbers  32  and 10000 in errors 48 and 49 are defaults; different
1452           values may be used if the limits were changed when PCRE was built.
1453    
1454    
1455    STUDYING A PATTERN
1456    
1457           pcre_extra *pcre_study(const pcre *code, int options,
1458                const char **errptr);
1459    
1460           If a compiled pattern is going to be used several times,  it  is  worth
1461           spending more time analyzing it in order to speed up the time taken for
1462           matching. The function pcre_study() takes a pointer to a compiled  pat-
1463           tern as its first argument. If studying the pattern produces additional
1464           information that will help speed up matching,  pcre_study()  returns  a
1465           pointer  to a pcre_extra block, in which the study_data field points to
1466           the results of the study.
1467    
1468           The  returned  value  from  pcre_study()  can  be  passed  directly  to
1469           pcre_exec()  or  pcre_dfa_exec(). However, a pcre_extra block also con-
1470           tains other fields that can be set by the caller before  the  block  is
1471           passed; these are described below in the section on matching a pattern.
1472    
1473           If  studying  the  pattern  does  not  produce  any useful information,
1474           pcre_study() returns NULL. In that circumstance, if the calling program
1475           wants   to   pass   any   of   the   other  fields  to  pcre_exec()  or
1476           pcre_dfa_exec(), it must set up its own pcre_extra block.
1477    
1478           The second argument of pcre_study() contains option bits. There is only
1479           one  option:  PCRE_STUDY_JIT_COMPILE.  If this is set, and the just-in-
1480           time compiler is  available,  the  pattern  is  further  compiled  into
1481           machine  code  that  executes much faster than the pcre_exec() matching
1482           function. If the just-in-time compiler is not available, this option is
1483           ignored. All other bits in the options argument must be zero.
1484    
1485           JIT  compilation  is  a heavyweight optimization. It can take some time
1486           for patterns to be analyzed, and for one-off matches  and  simple  pat-
1487           terns  the benefit of faster execution might be offset by a much slower
1488           study time.  Not all patterns can be optimized by the JIT compiler. For
1489           those  that cannot be handled, matching automatically falls back to the
1490           pcre_exec() interpreter. For more details, see the  pcrejit  documenta-
1491           tion.
1492    
1493           The  third argument for pcre_study() is a pointer for an error message.
1494           If studying succeeds (even if no data is  returned),  the  variable  it
1495           points  to  is  set  to NULL. Otherwise it is set to point to a textual
1496           error message. This is a static string that is part of the library. You
1497           must  not  try  to  free it. You should test the error pointer for NULL
1498           after calling pcre_study(), to be sure that it has run successfully.
1499    
1500           When you are finished with a pattern, you can free the memory used  for
1501           the study data by calling pcre_free_study(). This function was added to
1502           the API for release 8.20. For earlier versions,  the  memory  could  be
1503           freed  with  pcre_free(), just like the pattern itself. This will still
1504           work in cases where PCRE_STUDY_JIT_COMPILE  is  not  used,  but  it  is
1505           advisable to change to the new function when convenient.
1506    
1507           This  is  a typical way in which pcre_study() is used (except that in a
1508           real application there should be tests for errors):
1509    
1510             int rc;
1511             pcre *re;
1512             pcre_extra *sd;
1513             re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
1514             sd = pcre_study(
1515               re,             /* result of pcre_compile() */
1516               0,              /* no options */
1517               &error);        /* set to NULL or points to a message */
1518             rc = pcre_exec(   /* see below for details of pcre_exec() options */
1519               re, sd, "subject", 7, 0, 0, ovector, 30);
1520             ...
1521             pcre_free_study(sd);
1522             pcre_free(re);
1523    
1524           Studying a pattern does two things: first, a lower bound for the length
1525           of subject string that is needed to match the pattern is computed. This
1526           does not mean that there are any strings of that length that match, but
1527           it  does  guarantee that no shorter strings match. The value is used by
1528           pcre_exec() and pcre_dfa_exec() to avoid  wasting  time  by  trying  to
1529           match  strings  that are shorter than the lower bound. You can find out
1530           the value in a calling program via the pcre_fullinfo() function.
1531    
1532           Studying a pattern is also useful for non-anchored patterns that do not
1533           have  a  single fixed starting character. A bitmap of possible starting
1534           bytes is created. This speeds up finding a position in the  subject  at
1535           which to start matching.
1536    
1537           These  two optimizations apply to both pcre_exec() and pcre_dfa_exec().
1538           However, they are not used by pcre_exec()  if  pcre_study()  is  called
1539           with  the  PCRE_STUDY_JIT_COMPILE option, and just-in-time compiling is
1540           successful.  The  optimizations  can  be  disabled   by   setting   the
1541           PCRE_NO_START_OPTIMIZE    option    when    calling    pcre_exec()   or
1542           pcre_dfa_exec(). You might want to do this  if  your  pattern  contains
1543           callouts  or (*MARK) (which cannot be handled by the JIT compiler), and
1544           you want to make use of these facilities in cases where matching fails.
1545           See the discussion of PCRE_NO_START_OPTIMIZE below.
1546    
      is another pattern which matches  "foo"  preceded  by  three  
      digits and any three characters that are not "999".  
1547    
      Assertion subpatterns are not capturing subpatterns, and may
      not  be  repeated,  because  it makes no sense to assert the
      same thing several times. If any kind of assertion  contains
      capturing  subpatterns  within it, these are counted for the
      purposes of numbering the capturing subpatterns in the whole
      pattern.   However,  substring capturing is carried out only
      for positive assertions, because it does not make sense  for
      negative assertions.
1548    LOCALE SUPPORT
1549    
1550           PCRE  handles  caseless matching, and determines whether characters are
1551           letters, digits, or whatever, by reference to a set of tables,  indexed
1552           by  character  value.  When running in UTF-8 mode, this applies only to
1553           characters with codes less than 128. By  default,  higher-valued  codes
1554           never match escapes such as \w or \d, but they can be tested with \p if
1555           PCRE is built with Unicode character property  support.  Alternatively,
1556           the  PCRE_UCP  option  can  be  set at compile time; this causes \w and
1557           friends to use Unicode property support instead of built-in tables. The
1558           use of locales with Unicode is discouraged. If you are handling charac-
1559           ters with codes greater than 128, you should either use UTF-8 and  Uni-
1560           code, or use locales, but not try to mix the two.
1561    
1562           PCRE  contains  an  internal set of tables that are used when the final
1563           argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
1564           applications.  Normally, the internal tables recognize only ASCII char-
1565           acters. However, when PCRE is built, it is possible to cause the inter-
1566           nal tables to be rebuilt in the default "C" locale of the local system,
1567           which may cause them to be different.
1568    
1569           The internal tables can always be overridden by tables supplied by  the
1570           application that calls PCRE. These may be created in a different locale
1571           from the default. As more and more applications change  to  using  Uni-
1572           code, the need for this locale support is expected to die away.
1573    
1574           External  tables  are  built by calling the pcre_maketables() function,
1575           which has no arguments, in the relevant locale. The result can then  be
1576           passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
1577           example, to build and use tables that are appropriate  for  the  French
1578           locale  (where  accented  characters  with  values greater than 128 are
1579           treated as letters), the following code could be used:
1580    
1581             setlocale(LC_CTYPE, "fr_FR");
1582             tables = pcre_maketables();
1583             re = pcre_compile(..., tables);
1584    
1585           The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
1586           if you are using Windows, the name for the French locale is "french".
1587    
1588           When  pcre_maketables()  runs,  the  tables are built in memory that is
1589           obtained via pcre_malloc. It is the caller's responsibility  to  ensure
1590           that  the memory containing the tables remains available for as long as
1591           it is needed.
1592    
1593           The pointer that is passed to pcre_compile() is saved with the compiled
1594           pattern,  and the same tables are used via this pointer by pcre_study()
1595           and normally also by pcre_exec(). Thus, by default, for any single pat-
1596           tern, compilation, studying and matching all happen in the same locale,
1597           but different patterns can be compiled in different locales.
1598    
1599           It is possible to pass a table pointer or NULL (indicating the  use  of
1600           the  internal  tables)  to  pcre_exec(). Although not intended for this
1601           purpose, this facility could be used to match a pattern in a  different
1602           locale from the one in which it was compiled. Passing table pointers at
1603           run time is discussed below in the section on matching a pattern.
1604    
 CONDITIONAL SUBPATTERNS  
1605    
      It is possible to cause the matching process to obey a  sub-
      pattern  conditionally  or to choose between two alternative
      subpatterns, depending on the result  of  an  assertion,  or
      whether  a previous capturing subpattern matched or not. The
      two possible forms of conditional subpattern are

        (?(condition)yes-pattern)
        (?(condition)yes-pattern|no-pattern)

      If the condition is satisfied, the yes-pattern is used; oth-
      erwise  the  no-pattern  (if  present) is used. If there are
      more than two alternatives in the subpattern, a compile-time
      error occurs.

      There are three kinds of condition. If the text between  the
      parentheses  consists of a sequence of digits, the condition
      is satisfied if the capturing subpattern of that number  has
      previously  matched.  The  number must be greater than zero.
      Consider  the  following  pattern,   which   contains   non-
      significant white space to make it more readable (assume the
      PCRE_EXTENDED option) and to divide it into three parts  for
      ease of discussion:

        ( \( )?    [^()]+    (?(1) \) )

      The first part matches an optional opening parenthesis,  and
      if  that character is present, sets it as the first captured
      substring. The second part matches one  or  more  characters
      that  are  not  parentheses. The third part is a conditional
      subpattern that tests whether the first set  of  parentheses
      matched  or  not.  If  they did, that is, if subject started
      with an opening parenthesis, the condition is true,  and  so
      the  yes-pattern  is  executed  and a closing parenthesis is
      required. Otherwise, since no-pattern is  not  present,  the
      subpattern  matches  nothing.  In  other words, this pattern
      matches a sequence of non-parentheses,  optionally  enclosed
      in parentheses.

      If the condition is the string (R), it  is  satisfied  if  a
      recursive  call  to the pattern or subpattern has been made.
      At "top level", the condition is  false.   This  is  a  PCRE
      extension.  Recursive  patterns  are  described  in the next
      section.

      If the condition is not a sequence of digits or (R), it must
      be  an assertion.  This may be a positive or negative looka-
      head or lookbehind assertion. Consider this  pattern,  again
      containing  non-significant  white  space,  and with the two
      alternatives on the second line:

        (?(?=[^a-z]*[a-z])
        \d{2}-[a-z]{3}-\d{2}  |  \d{2}-\d{2}-\d{2} )

      The condition is a positive lookahead assertion that matches
      an optional sequence of non-letters followed by a letter. In
      other words, it tests for  the  presence  of  at  least  one
      letter  in the subject. If a letter is found, the subject is
      matched against  the  first  alternative;  otherwise  it  is
      matched  against the second. This pattern matches strings in
      one of the two forms dd-aaa-dd or dd-dd-dd,  where  aaa  are
      letters and dd are digits.

1606    INFORMATION ABOUT A PATTERN
1607    
1608           int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1609                int what, void *where);
1610    
 COMMENTS

1611           The pcre_fullinfo() function returns information about a compiled  pat-
1612           tern. It replaces the obsolete pcre_info() function, which is neverthe-
1613           less retained for backwards compatibility (and is documented below).
1614    
1615           The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
1616           pattern.  The second argument is the result of pcre_study(), or NULL if
1617           the pattern was not studied. The third argument specifies  which  piece
1618           of  information  is required, and the fourth argument is a pointer to a
1619           variable to receive the data. The yield of the  function  is  zero  for
1620           success, or one of the following negative numbers:
1621    
1622             PCRE_ERROR_NULL       the argument code was NULL
1623                                   the argument where was NULL
1624             PCRE_ERROR_BADMAGIC   the "magic number" was not found
1625             PCRE_ERROR_BADOPTION  the value of what was invalid
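
A caller will usually want to turn the yield into a readable message. The sketch below shows one way to do that. The numeric values are the ones I believe pcre.h defines for these error codes; they are stated here as an assumption and are used only when the header is not included. The function name fullinfo_strerror() is hypothetical and is not part of the PCRE API.

```c
/* Assumed values (check your pcre.h); only defined here if the header
   has not already supplied them. */
#ifndef PCRE_ERROR_NULL
#define PCRE_ERROR_NULL       (-2)
#define PCRE_ERROR_BADOPTION  (-3)
#define PCRE_ERROR_BADMAGIC   (-4)
#endif

/* Hypothetical helper: map the yield of pcre_fullinfo() to the
   descriptions given in the table above. */
static const char *fullinfo_strerror(int rc)
{
    switch (rc) {
    case 0:                    return "success";
    case PCRE_ERROR_NULL:      return "the argument code or where was NULL";
    case PCRE_ERROR_BADMAGIC:  return "the \"magic number\" was not found";
    case PCRE_ERROR_BADOPTION: return "the value of what was invalid";
    default:                   return "unexpected return value";
    }
}
```

A typical use would be to call fullinfo_strerror(rc) only when rc is non-zero and print the result to stderr.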
1626    
1627           The  "magic  number" is placed at the start of each compiled pattern as
1628           a simple check against passing an arbitrary memory pointer. Here is  a
1629           typical  call  of pcre_fullinfo(), to obtain the length of the compiled
1630           pattern:
1631    
1632             int rc;
1633             size_t length;
1634             rc = pcre_fullinfo(
1635               re,               /* result of pcre_compile() */
1636               sd,               /* result of pcre_study(), or NULL */
1637               PCRE_INFO_SIZE,   /* what is required */
1638               &length);         /* where to put the data */
1639    
1640           The possible values for the third argument are defined in  pcre.h,  and
1641           are as follows:
1642    
1643             PCRE_INFO_BACKREFMAX
1644    
1645           Return  the  number  of  the highest back reference in the pattern. The
1646           fourth argument should point to an int variable. Zero  is  returned  if
1647           there are no back references.
1648    
1649             PCRE_INFO_CAPTURECOUNT
1650    
1651           Return  the  number of capturing subpatterns in the pattern. The fourth
1652           argument should point to an int variable.
1653    
1654             PCRE_INFO_DEFAULT_TABLES
1655    
1656           Return a pointer to the internal default character tables within  PCRE.
1657           The  fourth  argument should point to an unsigned char * variable. This
1658           information call is provided for internal use by the pcre_study() func-
1659           tion.  External  callers  can  cause PCRE to use its internal tables by
1660           passing a NULL table pointer.
1661    
1662             PCRE_INFO_FIRSTBYTE
1663    
1664           Return information about the first byte of any matched  string,  for  a
1665           non-anchored  pattern. The fourth argument should point to an int vari-
1666           able. (This option used to be called PCRE_INFO_FIRSTCHAR; the old  name
1667           is still recognized for backwards compatibility.)
1668    
1669           If  there  is  a  fixed first byte, for example, from a pattern such as
1670           (cat|cow|coyote), its value is returned. Otherwise, if either
1671    
1672           (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
1673           branch starts with "^", or
1674    
1675           (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
1676           set (if it were set, the pattern would be anchored),
1677    
1678           -1 is returned, indicating that the pattern matches only at  the  start
1679           of  a  subject string or after any newline within the string. Otherwise
1680           -2 is returned. For anchored patterns, -2 is returned.
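
The three kinds of result can be told apart as in the sketch below. The enum and the function classify_first_byte() are hypothetical names used only for this illustration; the int passed in is assumed to be the value that pcre_fullinfo() stored for PCRE_INFO_FIRSTBYTE.

```c
/* Hypothetical classification of a PCRE_INFO_FIRSTBYTE result. */
enum first_byte_kind {
    FIRST_BYTE_FIXED,     /* value is the fixed first byte itself */
    FIRST_AT_LINE_START,  /* -1: matches only at start of subject or after a newline */
    FIRST_BYTE_UNKNOWN    /* -2: no information, or the pattern is anchored */
};

static enum first_byte_kind classify_first_byte(int value)
{
    if (value >= 0)
        return FIRST_BYTE_FIXED;
    if (value == -1)
        return FIRST_AT_LINE_START;
    return FIRST_BYTE_UNKNOWN;
}
```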
1681    
1682             PCRE_INFO_FIRSTTABLE
1683    
1684           If the pattern was studied, and this resulted in the construction of  a
1685           256-bit table indicating a fixed set of bytes for the first byte in any
1686           matching string, a pointer to the table is returned. Otherwise NULL  is
1687           returned.  The fourth argument should point to an unsigned char * vari-
1688           able.
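
A 256-bit table occupies 32 bytes, one bit per possible byte value. The text above does not state the bit order, so the sketch below assumes that byte value c is represented by bit (c & 7) of table byte c/8, least significant bit first; check this against the PCRE source before relying on it. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout: bit (c & 7) of byte c/8 is set when c may start a match. */
static int can_start_with(const unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    return (table[c / 8] & (1u << (c & 7))) != 0;
}

/* Companion used here only to build a hand-made table for experiments. */
static void set_start_byte(unsigned char *table, unsigned char c)
{
    assert(table != NULL);
    table[c / 8] |= (unsigned char)(1u << (c & 7));
}
```

In a real program the table pointer would come from pcre_fullinfo() with PCRE_INFO_FIRSTTABLE, and a NULL result would have to be handled before calling can_start_with().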
1689    
1690             PCRE_INFO_HASCRORLF
1691    
1692           Return 1 if the pattern contains any explicit  matches  for  CR  or  LF
1693           characters,  otherwise  0.  The  fourth argument should point to an int
1694           variable. An explicit match is either a literal CR or LF character,  or
1695           \r or \n.
1696    
1697             PCRE_INFO_JCHANGED
1698    
1699           Return  1  if  the (?J) or (?-J) option setting is used in the pattern,
1700           otherwise 0. The fourth argument should point to an int variable.  (?J)
1701           and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
1702    
1703             PCRE_INFO_JIT
1704    
1705           Return  1  if  the  pattern was studied with the PCRE_STUDY_JIT_COMPILE
1706           option, and just-in-time compiling was successful. The fourth  argument
1707           should  point  to  an  int variable. A return value of 0 means that JIT
1708           support is not available in this version of PCRE, or that  the  pattern
1709           was not studied with the PCRE_STUDY_JIT_COMPILE option, or that the JIT
1710           compiler could not handle this particular pattern. See the pcrejit doc-
1711           umentation for details of what can and cannot be handled.
1712    
1713             PCRE_INFO_LASTLITERAL
1714    
1715           Return  the  value of the rightmost literal byte that must exist in any
1716           matched string, other than at its  start,  if  such  a  byte  has  been
1717           recorded. The fourth argument should point to an int variable. If there
1718           is no such byte, -1 is returned. For anchored patterns, a last  literal
1719           byte  is  recorded only if it follows something of variable length. For
1720           example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
1721           /^a\dz\d/ the returned value is -1.
1722    
1723             PCRE_INFO_MINLENGTH
1724    
1725           If  the  pattern  was studied and a minimum length for matching subject
1726           strings was computed, its value is  returned.  Otherwise  the  returned
1727           value  is  -1. The value is a number of characters, not bytes (this may
1728           be relevant in UTF-8 mode). The fourth argument should point to an  int
1729           variable.  A  non-negative  value is a lower bound to the length of any
1730           matching string. There may not be any strings of that  length  that  do
1731           actually match, but every string that does match is at least that long.
1732    
1733             PCRE_INFO_NAMECOUNT
1734             PCRE_INFO_NAMEENTRYSIZE
1735             PCRE_INFO_NAMETABLE
1736    
1737           PCRE  supports the use of named as well as numbered capturing parenthe-
1738           ses. The names are just an additional way of identifying the  parenthe-
1739           ses, which still acquire numbers. Several convenience functions such as
1740           pcre_get_named_substring() are provided for  extracting  captured  sub-
1741           strings  by  name. It is also possible to extract the data directly, by
1742           first converting the name to a number in order to  access  the  correct
1743           pointers in the output vector (described with pcre_exec() below). To do
1744           the conversion, you need  to  use  the  name-to-number  map,  which  is
1745           described by these three values.
1746    
1747           The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
1748           gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
1749           of  each  entry;  both  of  these  return  an int value. The entry size
1750           depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns
1751           a  pointer  to  the  first  entry of the table (a pointer to char). The
1752           first two bytes of each entry are the number of the capturing parenthe-
1753           sis,  most  significant byte first. The rest of the entry is the corre-
1754           sponding name, zero terminated.
1755    
1756           The names are in alphabetical order. Duplicate names may appear if  (?|
1757           is used to create multiple groups with the same number, as described in
1758           the section on duplicate subpattern numbers in  the  pcrepattern  page.
1759           Duplicate  names  for  subpatterns with different numbers are permitted
1760           only if PCRE_DUPNAMES is set. In all cases  of  duplicate  names,  they
1761           appear  in  the table in the order in which they were found in the pat-
1762           tern. In the absence of (?| this is the  order  of  increasing  number;
1763           when (?| is used this is not necessarily the case because later subpat-
1764           terns may have lower numbers.
1765    
1766           As a simple example of the name/number table,  consider  the  following
1767           pattern  (assume  PCRE_EXTENDED is set, so white space - including new-
1768           lines - is ignored):
1769    
1770             (?<date> (?<year>(\d\d)?\d\d) -
1771             (?<month>\d\d) - (?<day>\d\d) )
1772    
1773           There are four named subpatterns, so the table has  four  entries,  and
1774           each  entry  in the table is eight bytes long. The table is as follows,
1775           with non-printing bytes shown in hexadecimal, and undefined bytes shown
1776           as ??:
1777    
1778             00 01 d  a  t  e  00 ??
1779             00 05 d  a  y  00 ?? ??
1780             00 04 m  o  n  t  h  00
1781             00 02 y  e  a  r  00 ??
1782    
1783           When  writing  code  to  extract  data from named subpatterns using the
1784           name-to-number map, remember that the length of the entries  is  likely
1785           to be different for each compiled pattern.
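
The table layout described above can be searched with a short loop. This is a minimal sketch under stated assumptions: lookup_group_number() is a hypothetical helper, not a library function, and its table, name_count and entry_size parameters stand for the values obtained via PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and PCRE_INFO_NAMEENTRYSIZE. The table is treated as unsigned bytes so that the two-byte number is combined without sign problems.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: find a name in the name-to-number table and
   return its group number, or -1 if the name is not present. If a name
   is duplicated, the first entry in table order is returned. */
int lookup_group_number(const unsigned char *table, int name_count,
                        int entry_size, const char *name)
{
    int i;
    /* Each entry holds two number bytes plus at least a terminator. */
    assert(entry_size >= 3);
    for (i = 0; i < name_count; i++) {
        const unsigned char *entry = table + i * entry_size;
        /* The zero-terminated name starts after the two number bytes. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];  /* most significant byte first */
    }
    return -1;
}
```

Applied to the four-entry table shown above with an entry size of 8, looking up "month" gives 4 and "day" gives 5. A linear scan is used here for simplicity; because the names are in alphabetical order, a binary search is also possible.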
1786    
1787             PCRE_INFO_OKPARTIAL
1788    
1789           Return  1  if  the  pattern  can  be  used  for  partial  matching with
1790           pcre_exec(), otherwise 0. The fourth argument should point  to  an  int
1791           variable.  From  release  8.00,  this  always  returns  1,  because the
1792           restrictions that previously applied  to  partial  matching  have  been
1793           lifted.  The  pcrepartial documentation gives details of partial match-
1794           ing.
1795    
1796             PCRE_INFO_OPTIONS
1797    
1798           Return a copy of the options with which the pattern was  compiled.  The
1799           fourth  argument  should  point to an unsigned long int variable. These
1800           option bits are those specified in the call to pcre_compile(), modified
1801           by any top-level option settings at the start of the pattern itself. In
1802           other words, they are the options that will be in force  when  matching
1803           starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with
1804           the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,
1805           and PCRE_EXTENDED.
1806    
1807           A  pattern  is  automatically  anchored by PCRE if all of its top-level
1808           alternatives begin with one of the following:
1809    
1810             ^     unless PCRE_MULTILINE is set
1811             \A    always
1812             \G    always
1813             .*    if PCRE_DOTALL is set and there are no back
1814                     references to the subpattern in which .* appears
1815    
1816           For such patterns, the PCRE_ANCHORED bit is set in the options returned
1817           by pcre_fullinfo().
1818    
1819             PCRE_INFO_SIZE
1820    
1821           Return  the  size  of the compiled pattern, that is, the value that was
1822           passed as the argument to pcre_malloc() when PCRE was getting memory in
1823           which to place the compiled data. The fourth argument should point to a
1824           size_t variable.
1825    
1826             PCRE_INFO_STUDYSIZE
1827    
1828           Return the size of the data block pointed to by the study_data field in
1829           a  pcre_extra  block. If pcre_extra is NULL, or there is no study data,
1830           zero is returned. The fourth argument should point to  a  size_t  vari-
1831           able.   The  study_data field is set by pcre_study() to record informa-
1832           tion that will speed up matching (see the section entitled "Studying  a
1833           pattern" above). The format of the study_data block is private, but its
1834           length is made available via this option so that it can  be  saved  and
1835           restored (see the pcreprecompile documentation for details).
1836    
      The sequence (?# marks the start of a comment which  contin-  
      ues  up  to the next closing parenthesis. Nested parentheses  
      are not permitted. The characters that  make  up  a  comment  
      play no part in the pattern matching at all.  
   
      If the PCRE_EXTENDED option is set, an unescaped # character  
      outside  a character class introduces a comment that contin-  
      ues up to the next newline character in the pattern.  
1837    
1838    OBSOLETE INFO FUNCTION
1839    
1840  RECURSIVE PATTERNS         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);
1841    
1842       Consider the problem of matching a  string  in  parentheses,         The  pcre_info()  function is now obsolete because its interface is too
1843       allowing  for  unlimited nested parentheses. Without the use         restrictive to return all the available data about a compiled  pattern.
1844       of recursion, the best that can be done is to use a  pattern         New   programs   should  use  pcre_fullinfo()  instead.  The  yield  of
1845       that  matches  up  to some fixed depth of nesting. It is not         pcre_info() is the number of capturing subpatterns, or one of the  fol-
1846       possible to handle an arbitrary nesting depth. Perl has pro-         lowing negative numbers:
1847       vided  an  experimental facility that allows regular expres-  
1848       sions to recurse (amongst other things).  It  does  this  by           PCRE_ERROR_NULL       the argument code was NULL
1849       interpolating  Perl  code in the expression at run time, and           PCRE_ERROR_BADMAGIC   the "magic number" was not found
1850       the code can refer to the expression itself. A Perl  pattern  
1851       to solve the parentheses problem can be created like this:         If  the  optptr  argument is not NULL, a copy of the options with which
1852           the pattern was compiled is placed in the integer  it  points  to  (see
1853         $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;         PCRE_INFO_OPTIONS above).
1854    
1855       The (?p{...}) item interpolates Perl code at run  time,  and         If  the  pattern  is  not anchored and the firstcharptr argument is not
1856       in  this  case refers recursively to the pattern in which it         NULL, it is used to pass back information about the first character  of
1857       appears. Obviously, PCRE cannot support the interpolation of         any matched string (see PCRE_INFO_FIRSTBYTE above).
1858       Perl  code.  Instead,  it  supports  some special syntax for  
1859       recursion of the entire pattern,  and  also  for  individual  
1860       subpattern recursion.  REFERENCE COUNTS
1861    
1862       The special item that consists of (? followed  by  a  number         int pcre_refcount(pcre *code, int adjust);
1863       greater  than  zero and a closing parenthesis is a recursive  
1864       call of the subpattern of the given number, provided that it         The  pcre_refcount()  function is used to maintain a reference count in
1865       occurs inside that subpattern. (If not, it is a "subroutine"         the data block that contains a compiled pattern. It is provided for the
1866       call, which is described in the next section.)  The  special         benefit  of  applications  that  operate  in an object-oriented manner,
1867       item  (?R) is a recursive call of the entire regular expres-         where different parts of the application may be using the same compiled
1868       sion.         pattern, but you want to free the block when they are all done.
1869    
1870       For example, this PCRE pattern solves the nested parentheses         When a pattern is compiled, the reference count field is initialized to
1871       problem  (assume  the  PCRE_EXTENDED  option  is set so that         zero.  It is changed only by calling this function, whose action is  to
1872       white space is ignored):         add  the  adjust  value  (which may be positive or negative) to it. The
1873           yield of the function is the new value. However, the value of the count
1874         \( ( (?>[^()]+) | (?R) )* \)         is  constrained to lie between 0 and 65535, inclusive. If the new value
1875           is outside these limits, it is forced to the appropriate limit value.
1876       First it matches an opening parenthesis. Then it matches any  
1877       number  of substrings which can either be a sequence of non-         Except when it is zero, the reference count is not correctly  preserved
1878       parentheses, or a recursive  match  of  the  pattern  itself         if  a  pattern  is  compiled on one host and then transferred to a host
1879       (that  is  a  correctly  parenthesized  substring).  Finally         whose byte-order is different. (This seems a highly unlikely scenario.)
1880       there is a closing parenthesis.  
1881    
1882       If this were part of a larger pattern, you would not want to  MATCHING A PATTERN: THE TRADITIONAL FUNCTION
1883       recurse the entire pattern, so instead you could use this:  
1884           int pcre_exec(const pcre *code, const pcre_extra *extra,
1885         ( \( ( (?>[^()]+) | (?1) )* \) )              const char *subject, int length, int startoffset,
1886                int options, int *ovector, int ovecsize);
1887       We have put the pattern into  parentheses,  and  caused  the  
1888       recursion  to refer to them instead of the whole pattern. In         The function pcre_exec() is called to match a subject string against  a
1889       a larger pattern, keeping track of parenthesis  numbers  can         compiled  pattern, which is passed in the code argument. If the pattern
1890       be   tricky.   It  may  be  more  convenient  to  use  named         was studied, the result of the study should  be  passed  in  the  extra
1891       parentheses instead. For this, PCRE uses (?P>name), which is         argument.  You  can call pcre_exec() with the same code and extra argu-
1892       an  extension  to the Python syntax that PCRE uses for named         ments as many times as you like, in order to  match  different  subject
1893       parentheses (Perl does not provide  named  parentheses).  We         strings with the same pattern.
1894       could rewrite the above example as follows:  
1895           This  function  is  the  main  matching facility of the library, and it
1896         (?<pn> \( ( (?>[^()]+) | (?P>pn) )* \) )         operates in a Perl-like manner. For specialist use  there  is  also  an
1897           alternative  matching function, which is described below in the section
1898       This particular example pattern  contains  nested  unlimited         about the pcre_dfa_exec() function.
1899       repeats,  and  so  the  use  of atomic grouping for matching  
1900       strings of non-parentheses is important  when  applying  the         In most applications, the pattern will have been compiled (and  option-
1901       pattern to strings that do not match. For example, when this         ally  studied)  in the same process that calls pcre_exec(). However, it
1902       pattern is applied to         is possible to save compiled patterns and study data, and then use them
1903           later  in  different processes, possibly even on different hosts. For a
1904         (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()         discussion about this, see the pcreprecompile documentation.
1905    
1906       it yields "no match" quickly. However, if atomic grouping is         Here is an example of a simple call to pcre_exec():
1907       not used, the match runs for a very long time indeed because  
1908       there are so many different ways the +  and  *  repeats  can           int rc;
1909       carve  up  the  subject,  and  all  have to be tested before           int ovector[30];
1910       failure can be reported.           rc = pcre_exec(
1911       At the end of a match, the values set for any capturing sub-             re,             /* result of pcre_compile() */
1912       patterns are those from the outermost level of the recursion             NULL,           /* we didn't study the pattern */
1913       at which the subpattern value is set.  If you want to obtain             "some string",  /* the subject string */
1914       intermediate  values,  a  callout  function can be used (see             11,             /* the length of the subject string */
1915       below and the pcrecallout  documentation).  If  the  pattern             0,              /* start at offset 0 in the subject */
1916       above is matched against             0,              /* default options */
1917               ovector,        /* vector of integers for substring information */
1918         (ab(cd)ef)             30);            /* number of elements (NOT size in bytes) */
1919    
1920       the value for the capturing parentheses is  "ef",  which  is     Extra data for pcre_exec()
1921       the  last  value  taken  on  at the top level. If additional  
1922       parentheses are added, giving         If the extra argument is not NULL, it must point to a  pcre_extra  data
1923           block.  The pcre_study() function returns such a block (when it doesn't
1924         \( ( ( (?>[^()]+) | (?R) )* ) \)         return NULL), but you can also create one for yourself, and pass  addi-
1925            ^                        ^         tional  information  in it. The pcre_extra block contains the following
1926            ^                        ^         fields (not necessarily in this order):