Diff of /code/trunk/doc/pcre.txt

revision 208 by ph10, Mon Aug 6 15:23:29 2007 UTC revision 1298 by ph10, Fri Mar 22 16:13:13 2013 UTC
# Line 2  Line 2 
2  This file contains a concatenation of the PCRE man pages, converted to plain  This file contains a concatenation of the PCRE man pages, converted to plain
3  text format for ease of searching with a text editor, or for use on systems  text format for ease of searching with a text editor, or for use on systems
4  that do not have a man page processor. The small individual files that give  that do not have a man page processor. The small individual files that give
5  synopses of each function in the library have not been included. There are  synopses of each function in the library have not been included. Neither has
6  separate text files for the pcregrep and pcretest commands.  the pcredemo program. There are separate text files for the pcregrep and
7    pcretest commands.
8  -----------------------------------------------------------------------------  -----------------------------------------------------------------------------
9    
10    
11  PCRE(3)                                                                PCRE(3)  PCRE(3)                    Library Functions Manual                    PCRE(3)
12    
13    
14    
15  NAME  NAME
16         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
17    
   
18  INTRODUCTION  INTRODUCTION
19    
20         The  PCRE  library is a set of functions that implement regular expres-         The  PCRE  library is a set of functions that implement regular expres-
21         sion pattern matching using the same syntax and semantics as Perl, with         sion pattern matching using the same syntax and semantics as Perl, with
22         just  a  few differences. (Certain features that appeared in Python and         just  a few differences. Some features that appeared in Python and PCRE
23         PCRE before they appeared in Perl are also available using  the  Python         before they appeared in Perl are also available using the  Python  syn-
24         syntax.)         tax,  there  is  some  support for one or two .NET and Oniguruma syntax
25           items, and there is an option for requesting some  minor  changes  that
26         The  current  implementation of PCRE (release 7.x) corresponds approxi-         give better JavaScript compatibility.
27         mately with Perl 5.10, including support for UTF-8 encoded strings  and  
28         Unicode general category properties. However, UTF-8 and Unicode support         Starting with release 8.30, it is possible to compile two separate PCRE
29           libraries:  the  original,  which  supports  8-bit  character   strings
30           (including  UTF-8  strings),  and a second library that supports 16-bit
31           character strings (including UTF-16 strings). The build process  allows
32           either  one  or both to be built. The majority of the work to make this
33           possible was done by Zoltan Herczeg.
34    
35           Starting with release 8.32 it is possible to compile a  third  separate
36           PCRE library, which supports 32-bit character strings (including UTF-32
37           strings). The build process allows any set of the 8-,  16-  and  32-bit
38           libraries. The work to make this possible was done by Christian Persch.
39    
40           The  three  libraries  contain identical sets of functions, except that
41           the names in the 16-bit library start with pcre16_  instead  of  pcre_,
42           and  the  names  in  the  32-bit  library start with pcre32_ instead of
43           pcre_. To avoid over-complication and reduce the documentation  mainte-
44           nance load, most of the documentation describes the 8-bit library, with
45           the differences for the 16-bit and  32-bit  libraries  described  sepa-
46           rately  in  the  pcre16  and  pcre32  pages. References to functions or
47           structures of the  form  pcre[16|32]_xxx  should  be  read  as  meaning
48           "pcre_xxx  when  using  the  8-bit  library,  pcre16_xxx when using the
49           16-bit library, or pcre32_xxx when using the 32-bit library".
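
The pcre[16|32]_xxx notation described above can be sketched as a small helper that expands it for a chosen code unit width. The function name sketch_api_name is hypothetical and is not part of PCRE; it only builds the name as a string, as a reading aid.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not part of PCRE): writes the library function
   name for the given code unit width (8, 16 or 32) and suffix into buf.
   Returns 0 on success, -1 for an unsupported width or a short buffer. */
static int sketch_api_name(int width, const char *suffix,
                           char *buf, size_t size)
{
    const char *prefix;
    int n;

    switch (width) {
    case 8:  prefix = "pcre_";   break;   /* original 8-bit library */
    case 16: prefix = "pcre16_"; break;   /* 16-bit library */
    case 32: prefix = "pcre32_"; break;   /* 32-bit library */
    default: return -1;
    }
    n = snprintf(buf, size, "%s%s", prefix, suffix);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Calling sketch_api_name(16, "exec", buf, sizeof buf) fills buf with the 16-bit name for the function documented as pcre_exec().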
50    
51           The current implementation of PCRE corresponds approximately with  Perl
52           5.12,  including  support  for  UTF-8/16/32 encoded strings and Unicode
53           general category properties. However, UTF-8/16/32 and  Unicode  support
54         has to be explicitly enabled; it is not the default. The Unicode tables         has to be explicitly enabled; it is not the default. The Unicode tables
55         correspond to Unicode release 5.0.0.         correspond to Unicode release 6.2.0.
56    
57         In  addition to the Perl-compatible matching function, PCRE contains an         In addition to the Perl-compatible matching function, PCRE contains  an
58         alternative matching function that matches the same  compiled  patterns         alternative  function that matches the same compiled patterns in a dif-
59         in  a different way. In certain circumstances, the alternative function         ferent way. In certain circumstances, the alternative function has some
60         has some advantages. For a discussion of the two  matching  algorithms,         advantages.   For  a discussion of the two matching algorithms, see the
61         see the pcrematching page.         pcrematching page.
62    
63         PCRE  is  written  in C and released as a C library. A number of people         PCRE is written in C and released as a C library. A  number  of  people
64         have written wrappers and interfaces of various kinds.  In  particular,         have  written  wrappers and interfaces of various kinds. In particular,
65         Google  Inc.   have  provided  a comprehensive C++ wrapper. This is now         Google Inc.  have provided a comprehensive C++ wrapper  for  the  8-bit
66         included as part of the PCRE distribution. The pcrecpp page has details         library.  This  is  now  included as part of the PCRE distribution. The
67         of  this  interface.  Other  people's contributions can be found in the         pcrecpp page has details of this interface.  Other  people's  contribu-
68         Contrib directory at the primary FTP site, which is:         tions  can  be  found in the Contrib directory at the primary FTP site,
69           which is:
70    
71         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre         ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre
72    
# Line 52  INTRODUCTION Line 79  INTRODUCTION
79         library is built. The pcre_config() function makes it  possible  for  a         library is built. The pcre_config() function makes it  possible  for  a
80         client  to  discover  which  features are available. The features them-         client  to  discover  which  features are available. The features them-
81         selves are described in the pcrebuild page. Documentation about  build-         selves are described in the pcrebuild page. Documentation about  build-
82         ing  PCRE for various operating systems can be found in the README file         ing  PCRE  for various operating systems can be found in the README and
83         in the source distribution.         NON-AUTOTOOLS_BUILD files in the source distribution.
84    
85         The library contains a number of undocumented  internal  functions  and         The libraries contain a number of undocumented internal  functions  and
86         data  tables  that  are  used by more than one of the exported external         data  tables  that  are  used by more than one of the exported external
87         functions, but which are not intended  for  use  by  external  callers.         functions, but which are not intended  for  use  by  external  callers.
88         Their  names  all begin with "_pcre_", which hopefully will not provoke         Their  names all begin with "_pcre_" or "_pcre16_" or "_pcre32_", which
89         any name clashes. In some environments, it is possible to control which         hopefully will not provoke any name clashes. In some  environments,  it
90         external  symbols  are  exported when a shared library is built, and in         is  possible  to  control  which  external  symbols are exported when a
91         these cases the undocumented symbols are not exported.         shared library is built, and in these cases  the  undocumented  symbols
92           are not exported.
93    
94    
95    SECURITY CONSIDERATIONS
96    
97           If  you  are  using PCRE in a non-UTF application that permits users to
98           supply arbitrary patterns for compilation, you should  be  aware  of  a
99           feature that allows users to turn on UTF support from within a pattern,
100           provided that PCRE was built with UTF support. For  example,  an  8-bit
101           pattern  that  begins  with  "(*UTF8)" or "(*UTF)" turns on UTF-8 mode,
102           which interprets patterns and subjects as strings of  UTF-8  characters
103           instead  of  individual 8-bit characters.  This causes both the pattern
104           and any data against which it is matched to be checked for UTF-8 valid-
105           ity.  If  the  data  string is very long, such a check might use suffi-
106           ciently many resources as to cause your  application  to  lose  perfor-
107           mance.
108    
109           The  best  way  of  guarding  against  this  possibility  is to use the
110           pcre_fullinfo() function to check the compiled  pattern's  options  for
111           UTF.
112    
113           If  your  application  is one that supports UTF, be aware that validity
114           checking can take time. If the same data string is to be  matched  many
115           times, you can use the PCRE_NO_UTF[8|16|32]_CHECK option for the second
116           and subsequent matches to save redundant checks.
117    
118           Another way that performance can be hit is by running  a  pattern  that
119           has  a  very  large search tree against a string that will never match.
120           Nested unlimited repeats in a pattern are a common example.  PCRE  pro-
121           vides some protection against this: see the PCRE_EXTRA_MATCH_LIMIT fea-
122           ture in the pcreapi page.
123    
124    
125  USER DOCUMENTATION  USER DOCUMENTATION
# Line 69  USER DOCUMENTATION Line 127  USER DOCUMENTATION
127         The user documentation for PCRE comprises a number  of  different  sec-         The user documentation for PCRE comprises a number  of  different  sec-
128         tions.  In the "man" format, each of these is a separate "man page". In         tions.  In the "man" format, each of these is a separate "man page". In
129         the HTML format, each is a separate page, linked from the  index  page.         the HTML format, each is a separate page, linked from the  index  page.
130         In  the  plain text format, all the sections are concatenated, for ease         In  the  plain  text format, all the sections, except the pcredemo sec-
131         of searching. The sections are as follows:         tion, are concatenated, for ease of searching. The sections are as fol-
132           lows:
133    
134           pcre              this document           pcre              this document
135             pcre16            details of the 16-bit library
136             pcre32            details of the 32-bit library
137           pcre-config       show PCRE installation configuration information           pcre-config       show PCRE installation configuration information
138           pcreapi           details of PCRE's native C API           pcreapi           details of PCRE's native C API
139           pcrebuild         options for building PCRE           pcrebuild         options for building PCRE
140           pcrecallout       details of the callout feature           pcrecallout       details of the callout feature
141           pcrecompat        discussion of Perl compatibility           pcrecompat        discussion of Perl compatibility
142           pcrecpp           details of the C++ wrapper           pcrecpp           details of the C++ wrapper for the 8-bit library
143           pcregrep          description of the pcregrep command           pcredemo          a demonstration C program that uses PCRE
144             pcregrep          description of the pcregrep command (8-bit only)
145             pcrejit           discussion of the just-in-time optimization support
146             pcrelimits        details of size and other limits
147           pcrematching      discussion of the two matching algorithms           pcrematching      discussion of the two matching algorithms
148           pcrepartial       details of the partial matching facility           pcrepartial       details of the partial matching facility
149           pcrepattern       syntax and semantics of supported           pcrepattern       syntax and semantics of supported
150                               regular expressions                               regular expressions
          pcresyntax        quick syntax reference  
151           pcreperform       discussion of performance issues           pcreperform       discussion of performance issues
152           pcreposix         the POSIX-compatible C API           pcreposix         the POSIX-compatible C API for the 8-bit library
153           pcreprecompile    details of saving and re-using precompiled patterns           pcreprecompile    details of saving and re-using precompiled patterns
154           pcresample        discussion of the sample program           pcresample        discussion of the pcredemo program
155           pcrestack         discussion of stack usage           pcrestack         discussion of stack usage
156             pcresyntax        quick syntax reference
157           pcretest          description of the pcretest testing command           pcretest          description of the pcretest testing command
158             pcreunicode       discussion of Unicode and UTF-8/16/32 support
159    
160         In  addition,  in the "man" and HTML formats, there is a short page for         In  addition,  in the "man" and HTML formats, there is a short page for
161         each C library function, listing its arguments and results.         each C library function, listing its arguments and results.
162    
163    
164  LIMITATIONS  AUTHOR
165    
166         There are some size limitations in PCRE but it is hoped that they  will         Philip Hazel
167         never in practice be relevant.         University Computing Service
168           Cambridge CB2 3QH, England.
169    
170         The  maximum  length of a compiled pattern is 65539 (sic) bytes if PCRE         Putting an actual email address here seems to have been a spam  magnet,
171         is compiled with the default internal linkage size of 2. If you want to         so  I've  taken  it away. If you want to email me, use my two initials,
172         process  regular  expressions  that are truly enormous, you can compile         followed by the two digits 10, at the domain cam.ac.uk.
        PCRE with an internal linkage size of 3 or 4 (see the  README  file  in  
        the  source  distribution and the pcrebuild documentation for details).  
        In these cases the limit is substantially larger.  However,  the  speed  
        of execution is slower.  
173    
        All values in repeating quantifiers must be less than 65536.  
174    
175         There is no limit to the number of parenthesized subpatterns, but there  REVISION
        can be no more than 65535 capturing subpatterns.  
176    
177         The maximum length of name for a named subpattern is 32 characters, and         Last updated: 11 November 2012
178         the maximum number of named subpatterns is 10000.         Copyright (c) 1997-2012 University of Cambridge.
179    ------------------------------------------------------------------------------
180    
181    
182    PCRE(3)                    Library Functions Manual                    PCRE(3)
183    
        The  maximum  length of a subject string is the largest positive number  
        that an integer variable can hold. However, when using the  traditional  
        matching function, PCRE uses recursion to handle subpatterns and indef-  
        inite repetition.  This means that the available stack space may  limit  
        the size of a subject string that can be processed by certain patterns.  
        For a discussion of stack issues, see the pcrestack documentation.  
184    
185    
186  UTF-8 AND UNICODE PROPERTY SUPPORT  NAME
187           PCRE - Perl-compatible regular expressions
188    
189         From release 3.3, PCRE has  had  some  support  for  character  strings         #include <pcre.h>
        encoded  in the UTF-8 format. For release 4.0 this was greatly extended  
        to cover most common requirements, and in release 5.0  additional  sup-  
        port for Unicode general category properties was added.  
   
        In  order  to process UTF-8 strings, you must build PCRE to include UTF-8  
        support in the code, and, in addition,  you  must  call  pcre_compile()  
        with  the PCRE_UTF8 option flag. When you do this, both the pattern and  
        any subject strings that are matched against it are  treated  as  UTF-8  
        strings instead of just strings of bytes.  
   
        If  you compile PCRE with UTF-8 support, but do not use it at run time,  
        the library will be a bit bigger, but the additional run time  overhead  
        is limited to testing the PCRE_UTF8 flag occasionally, so should not be  
        very big.  
190    
        If PCRE is built with Unicode character property support (which implies  
        UTF-8  support),  the  escape sequences \p{..}, \P{..}, and \X are sup-  
        ported.  The available properties that can be tested are limited to the  
        general  category  properties such as Lu for an upper case letter or Nd  
        for a decimal number, the Unicode script names such as Arabic  or  Han,  
        and  the  derived  properties  Any  and L&. A full list is given in the  
        pcrepattern documentation. Only the short names for properties are sup-  
        ported.  For example, \p{L} matches a letter. Its Perl synonym, \p{Let-  
        ter}, is not supported.  Furthermore,  in  Perl,  many  properties  may  
        optionally  be  prefixed by "Is", for compatibility with Perl 5.6. PCRE  
        does not support this.  
   
        The following comments apply when PCRE is running in UTF-8 mode:  
   
        1. When you set the PCRE_UTF8 flag, the strings passed as patterns  and  
        subjects  are  checked for validity on entry to the relevant functions.  
        If an invalid UTF-8 string is passed, an error return is given. In some  
        situations,  you  may  already  know  that  your strings are valid, and  
        therefore want to skip these checks in order to improve performance. If  
        you  set  the  PCRE_NO_UTF8_CHECK  flag at compile time or at run time,  
        PCRE assumes that the pattern or subject  it  is  given  (respectively)  
        contains  only valid UTF-8 codes. In this case, it does not diagnose an  
        invalid UTF-8 string. If you pass an invalid UTF-8 string to PCRE  when  
        PCRE_NO_UTF8_CHECK  is set, the results are undefined. Your program may  
        crash.  
   
        2. An unbraced hexadecimal escape sequence (such  as  \xb3)  matches  a  
        two-byte UTF-8 character if the value is greater than 127.  
   
        3.  Octal  numbers  up to \777 are recognized, and match two-byte UTF-8  
        characters for values greater than \177.  
   
        4. Repeat quantifiers apply to complete UTF-8 characters, not to  indi-  
        vidual bytes, for example: \x{100}{3}.  
   
        5.  The dot metacharacter matches one UTF-8 character instead of a sin-  
        gle byte.  
   
        6. The escape sequence \C can be used to match a single byte  in  UTF-8  
        mode,  but  its  use can lead to some strange effects. This facility is  
        not available in the alternative matching function, pcre_dfa_exec().  
   
        7. The character escapes \b, \B, \d, \D, \s, \S, \w, and  \W  correctly  
        test  characters of any code value, but the characters that PCRE recog-  
        nizes as digits, spaces, or word characters  remain  the  same  set  as  
        before, all with values less than 256. This remains true even when PCRE  
        includes Unicode property support, because to do otherwise  would  slow  
        down  PCRE in many common cases. If you really want to test for a wider  
        sense of, say, "digit", you must use Unicode  property  tests  such  as  
        \p{Nd}.  
   
        8.  Similarly,  characters that match the POSIX named character classes  
        are all low-valued characters.  
   
        9. However, the Perl 5.10 horizontal and vertical  whitespace  matching  
        escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-  
        acters.  
191    
192         10. Case-insensitive matching applies only to characters  whose  values  PCRE 16-BIT API BASIC FUNCTIONS
193         are  less than 128, unless PCRE is built with Unicode property support.  
194         Even when Unicode property support is available, PCRE  still  uses  its         pcre16 *pcre16_compile(PCRE_SPTR16 pattern, int options,
195         own  character  tables when checking the case of low-valued characters,              const char **errptr, int *erroffset,
196         so as not to degrade performance.  The Unicode property information  is              const unsigned char *tableptr);
197         used only for characters with higher values. Even when Unicode property  
198         support is available, PCRE supports case-insensitive matching only when         pcre16 *pcre16_compile2(PCRE_SPTR16 pattern, int options,
199         there  is  a  one-to-one  mapping between a letter's cases. There are a              int *errorcodeptr,
200         small number of many-to-one mappings in Unicode;  these  are  not  sup-              const char **errptr, int *erroffset,
201         ported by PCRE.              const unsigned char *tableptr);
202    
203           pcre16_extra *pcre16_study(const pcre16 *code, int options,
204                const char **errptr);
205    
206           void pcre16_free_study(pcre16_extra *extra);
207    
208           int pcre16_exec(const pcre16 *code, const pcre16_extra *extra,
209                PCRE_SPTR16 subject, int length, int startoffset,
210                int options, int *ovector, int ovecsize);
211    
212           int pcre16_dfa_exec(const pcre16 *code, const pcre16_extra *extra,
213                PCRE_SPTR16 subject, int length, int startoffset,
214                int options, int *ovector, int ovecsize,
215                int *workspace, int wscount);
216    
217    
218    PCRE 16-BIT API STRING EXTRACTION FUNCTIONS
219    
220           int pcre16_copy_named_substring(const pcre16 *code,
221                PCRE_SPTR16 subject, int *ovector,
222                int stringcount, PCRE_SPTR16 stringname,
223                PCRE_UCHAR16 *buffer, int buffersize);
224    
225           int pcre16_copy_substring(PCRE_SPTR16 subject, int *ovector,
226                int stringcount, int stringnumber, PCRE_UCHAR16 *buffer,
227                int buffersize);
228    
229           int pcre16_get_named_substring(const pcre16 *code,
230                PCRE_SPTR16 subject, int *ovector,
231                int stringcount, PCRE_SPTR16 stringname,
232                PCRE_SPTR16 *stringptr);
233    
234           int pcre16_get_stringnumber(const pcre16 *code,
235                PCRE_SPTR16 name);
236    
237           int pcre16_get_stringtable_entries(const pcre16 *code,
238                PCRE_SPTR16 name, PCRE_UCHAR16 **first, PCRE_UCHAR16 **last);
239    
240           int pcre16_get_substring(PCRE_SPTR16 subject, int *ovector,
241                int stringcount, int stringnumber,
242                PCRE_SPTR16 *stringptr);
243    
244           int pcre16_get_substring_list(PCRE_SPTR16 subject,
245                int *ovector, int stringcount, PCRE_SPTR16 **listptr);
246    
247           void pcre16_free_substring(PCRE_SPTR16 stringptr);
248    
249           void pcre16_free_substring_list(PCRE_SPTR16 *stringptr);
250    
251    
252    PCRE 16-BIT API AUXILIARY FUNCTIONS
253    
254           pcre16_jit_stack *pcre16_jit_stack_alloc(int startsize, int maxsize);
255    
256           void pcre16_jit_stack_free(pcre16_jit_stack *stack);
257    
258           void pcre16_assign_jit_stack(pcre16_extra *extra,
259                pcre16_jit_callback callback, void *data);
260    
261           const unsigned char *pcre16_maketables(void);
262    
263           int pcre16_fullinfo(const pcre16 *code, const pcre16_extra *extra,
264                int what, void *where);
265    
266           int pcre16_refcount(pcre16 *code, int adjust);
267    
268           int pcre16_config(int what, void *where);
269    
270           const char *pcre16_version(void);
271    
272           int pcre16_pattern_to_host_byte_order(pcre16 *code,
273                pcre16_extra *extra, const unsigned char *tables);
274    
275    
276    PCRE 16-BIT API INDIRECTED FUNCTIONS
277    
278           void *(*pcre16_malloc)(size_t);
279    
280           void (*pcre16_free)(void *);
281    
282           void *(*pcre16_stack_malloc)(size_t);
283    
284           void (*pcre16_stack_free)(void *);
285    
286           int (*pcre16_callout)(pcre16_callout_block *);
287    
288    
289    PCRE 16-BIT API 16-BIT-ONLY FUNCTION
290    
291           int pcre16_utf16_to_host_byte_order(PCRE_UCHAR16 *output,
292                PCRE_SPTR16 input, int length, int *byte_order,
293                int keep_boms);
294    
295    
296    THE PCRE 16-BIT LIBRARY
297    
298           Starting  with  release  8.30, it is possible to compile a PCRE library
299           that supports 16-bit character strings, including  UTF-16  strings,  as
300           well  as  or instead of the original 8-bit library. The majority of the
301           work to make  this  possible  was  done  by  Zoltan  Herczeg.  The  two
302           libraries contain identical sets of functions, used in exactly the same
303           way. Only the names of the functions and the data types of their  argu-
304           ments  and results are different. To avoid over-complication and reduce
305           the documentation maintenance load,  most  of  the  PCRE  documentation
306           describes  the  8-bit  library,  with only occasional references to the
307           16-bit library. This page describes what is different when you use  the
308           16-bit library.
309    
310           WARNING:  A  single  application can be linked with both libraries, but
311           you must take care when processing any particular pattern to use  func-
312           tions  from  just one library. For example, if you want to study a pat-
313           tern that was compiled with  pcre16_compile(),  you  must  do  so  with
314           pcre16_study(), not pcre_study(), and you must free the study data with
315           pcre16_free_study().
316    
317    
318    THE HEADER FILE
319    
320           There is only one header file, pcre.h. It contains prototypes  for  all
321           the functions in all libraries, as well as definitions of flags, struc-
322           tures, error codes, etc.
323    
324    
325    THE LIBRARY NAME
326    
327           In Unix-like systems, the 16-bit library is called libpcre16,  and  can
328           normally  be  accessed by adding -lpcre16 to the command for linking an
329           application that uses PCRE.
330    
331    
332    STRING TYPES
333    
334           In the 8-bit library, strings are passed to PCRE library  functions  as
335           vectors  of  bytes  with  the  C  type "char *". In the 16-bit library,
336           strings are passed as vectors of unsigned 16-bit quantities. The  macro
337           PCRE_UCHAR16  specifies  an  appropriate  data type, and PCRE_SPTR16 is
338           defined as "const PCRE_UCHAR16 *". In very  many  environments,  "short
339           int" is a 16-bit data type. When PCRE is built, it defines PCRE_UCHAR16
340           as "unsigned short int", but checks that it really  is  a  16-bit  data
341           type.  If  it is not, the build fails with an error message telling the
342           maintainer to modify the definition appropriately.
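
A build-time width check of the kind described above might look like the following sketch. The type name sketch_uchar16 is a hypothetical stand-in for PCRE_UCHAR16, and the negative-array-size trick is one common way to force a compile error; it is an assumption that PCRE's own build does something comparable, not a copy of its code.

```c
#include <limits.h>

/* Hypothetical stand-in for PCRE_UCHAR16; the real macro is in pcre.h. */
typedef unsigned short sketch_uchar16;

/* Compile-time check: the array size becomes -1 (a compile error)
   unless sketch_uchar16 is exactly 16 bits wide. */
typedef char sketch_uchar16_must_be_16_bits[
    (sizeof(sketch_uchar16) * CHAR_BIT == 16) ? 1 : -1];

/* Run-time view of the same quantity, useful for diagnostics. */
static int sketch_uchar16_bits(void)
{
    return (int)(sizeof(sketch_uchar16) * CHAR_BIT);
}
```

On a platform where "unsigned short int" is not 16 bits wide, this sketch fails to compile, which mirrors the behaviour the text describes for the PCRE build.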
343    
344    
345    STRUCTURE TYPES
346    
347           The types of the opaque structures that are used  for  compiled  16-bit
348           patterns  and  JIT stacks are pcre16 and pcre16_jit_stack respectively.
349           The  type  of  the  user-accessible  structure  that  is  returned   by
350           pcre16_study()  is  pcre16_extra, and the type of the structure that is
351           used for passing data to a callout  function  is  pcre16_callout_block.
352           These structures contain the same fields, with the same names, as their
353           8-bit counterparts. The only difference is that pointers  to  character
354           strings are 16-bit instead of 8-bit types.
355    
356    
357    16-BIT FUNCTIONS
358    
359           For  every function in the 8-bit library there is a corresponding func-
360           tion in the 16-bit library with a name that starts with pcre16_ instead
361           of  pcre_.  The  prototypes are listed above. In addition, there is one
362           extra function, pcre16_utf16_to_host_byte_order(). This  is  a  utility
363           function  that converts a UTF-16 character string to host byte order if
364           necessary. The other 16-bit  functions  expect  the  strings  they  are
365           passed to be in host byte order.
366    
367           The input and output arguments of pcre16_utf16_to_host_byte_order() may
368           point to the same address, that is, conversion in place  is  supported.
369           The output buffer must be at least as long as the input.
370    
371           The  length  argument  specifies the number of 16-bit data units in the
372           input string; a negative value specifies a zero-terminated string.
373    
374           If byte_order is NULL, it is assumed that the string starts off in host
375           byte  order. This may be changed by byte-order marks (BOMs) anywhere in
376           the string (commonly as the first character).
377    
378           If byte_order is not NULL, a non-zero value of the integer to which  it
379           points  means  that  the input starts off in host byte order, otherwise
380           the opposite order is assumed. Again, BOMs in  the  string  can  change
381           this. The final byte order is passed back at the end of processing.
382    
383           If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
384           copied into the output string. Otherwise they are discarded.
385    
386           The result of the function is the number of 16-bit  units  placed  into
387           the  output  buffer,  including  the  zero terminator if the string was
388           zero-terminated.
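
The behaviour described above can be sketched in plain C as follows. This is an illustrative re-implementation written from the description only; the name sketch_utf16_to_host and the type sketch_u16 are hypothetical, and the code is not taken from the PCRE sources.

```c
#include <stddef.h>

typedef unsigned short sketch_u16;  /* assumed to be exactly 16 bits wide */

/* Converts input to host byte order, following the description above.
   Returns the number of 16-bit units written to output. In-place use
   (output == input) works because writes never overtake reads. */
static int sketch_utf16_to_host(sketch_u16 *output, const sketch_u16 *input,
                                int length, int *byte_order, int keep_boms)
{
    /* Non-zero means "the data is currently in host byte order". */
    int host_order = (byte_order != NULL) ? (*byte_order != 0) : 1;
    int written = 0;
    int i;

    if (length < 0) {
        /* Zero-terminated input: count units, including the terminator. */
        length = 0;
        while (input[length] != 0) length++;
        length++;
    }

    for (i = 0; i < length; i++) {
        sketch_u16 c = input[i];
        if (c == 0xfeff || c == 0xfffe) {
            /* A BOM read as 0xfeff is in host order; 0xfffe is swapped. */
            host_order = (c == 0xfeff);
            if (keep_boms != 0) output[written++] = 0xfeff;
        } else if (host_order) {
            output[written++] = c;
        } else {
            /* Swap the two bytes of the unit. */
            output[written++] = (sketch_u16)(((c >> 8) | (c << 8)) & 0xffff);
        }
    }

    if (byte_order != NULL) *byte_order = host_order;
    return written;
}
```

For example, a zero-terminated input that starts with a swapped BOM (0xfffe) has every following unit byte-swapped, and with keep_boms set to zero the BOM itself is dropped from the output and from the returned count.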
389    
390    
391    SUBJECT STRING OFFSETS
392    
393           The offsets within subject strings that are returned  by  the  matching
394           functions are in 16-bit units rather than bytes.
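
Because the offsets count 16-bit units, an application that needs byte positions has to scale them. The sketch below assumes the usual ovector layout, in which the first pair of entries holds the start of the whole match and the offset just past its end; the names sketch_unit16 and sketch_match_bytes are hypothetical.

```c
#include <stddef.h>

typedef unsigned short sketch_unit16;  /* assumed to be 16 bits wide */

/* Converts the first offset pair of an ovector (counted in 16-bit
   units) into a byte offset and a byte length within the subject. */
static void sketch_match_bytes(const int *ovector,
                               size_t *byte_offset, size_t *byte_length)
{
    *byte_offset = (size_t)ovector[0] * sizeof(sketch_unit16);
    *byte_length = (size_t)(ovector[1] - ovector[0]) * sizeof(sketch_unit16);
}
```

The same scaling applies to any other offset pair in the vector, provided the pair is set (unset pairs hold negative values and should be skipped before converting).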
395    
396    
397    NAMED SUBPATTERNS
398    
399           The  name-to-number translation table that is maintained for named sub-
400           patterns uses 16-bit characters.  The  pcre16_get_stringtable_entries()
401           function returns the length of each entry in the table as the number of
402           16-bit data units.
403    
404    
405    OPTION NAMES
406    
407           There   are   two   new   general   option   names,   PCRE_UTF16    and
408           PCRE_NO_UTF16_CHECK,     which     correspond    to    PCRE_UTF8    and
409           PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
410           define  the  same bits in the options word. There is a discussion about
411           the validity of UTF-16 strings in the pcreunicode page.
412    
413           For the pcre16_config() function there is an  option  PCRE_CONFIG_UTF16
414           that  returns  1  if UTF-16 support is configured, otherwise 0. If this
415           option  is  given  to  pcre_config()  or  pcre32_config(),  or  if  the
416           PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF32  option is given to pcre16_con-
417           fig(), the result is the PCRE_ERROR_BADOPTION error.
418    
419    
420    CHARACTER CODES
421    
422           In 16-bit mode, when  PCRE_UTF16  is  not  set,  character  values  are
423           treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
424           that they can range from 0 to 0xffff instead of 0  to  0xff.  Character
425           types  for characters less than 0xff can therefore be influenced by the
426           locale in the same way as before.  Characters greater  than  0xff  have
427           only one case, and no "type" (such as letter or digit).
428    
429           In  UTF-16  mode,  the  character  code  is  Unicode, in the range 0 to
430           0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
431           because  those  are "surrogate" values that are used in pairs to encode
432           values greater than 0xffff.
433    
434           A UTF-16 string can indicate its endianness by a special code known as a
435           byte-order mark (BOM). The PCRE functions do not handle this, expecting
436           strings  to  be  in  host  byte  order.  A  utility   function   called
437           pcre16_utf16_to_host_byte_order()  is  provided  to help with this (see
438           above).
439    
440    
441    ERROR NAMES
442    
443           The errors PCRE_ERROR_BADUTF16_OFFSET and PCRE_ERROR_SHORTUTF16  corre-
444           spond  to  their  8-bit  counterparts.  The error PCRE_ERROR_BADMODE is
445           given when a compiled pattern is passed to a  function  that  processes
446           patterns  in  the  other  mode, for example, if a pattern compiled with
447           pcre_compile() is passed to pcre16_exec().
448    
449           There are new error codes whose names  begin  with  PCRE_UTF16_ERR  for
450           invalid  UTF-16  strings,  corresponding to the PCRE_UTF8_ERR codes for
451           UTF-8 strings that are described in the section entitled "Reason  codes
452           for  invalid UTF-8 strings" in the main pcreapi page. The UTF-16 errors
453           are:
454    
455             PCRE_UTF16_ERR1  Missing low surrogate at end of string
456             PCRE_UTF16_ERR2  Invalid low surrogate follows high surrogate
457             PCRE_UTF16_ERR3  Isolated low surrogate
458             PCRE_UTF16_ERR4  Non-character
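
A minimal sketch of how the four conditions listed above might be detected in a vector of 16-bit units. The enum, the function names and the offset convention are hypothetical and are not part of the PCRE API. The non-character test uses the Unicode definition (U+FDD0 to U+FDEF, plus the last two code points of every plane); whether PCRE checks exactly this set is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes mirroring the four conditions in the list. */
enum sketch_utf16_reason {
    SKETCH_UTF16_OK = 0,
    SKETCH_UTF16_MISSING_LOW,   /* high surrogate at end of string */
    SKETCH_UTF16_INVALID_LOW,   /* high surrogate not followed by a low one */
    SKETCH_UTF16_ISOLATED_LOW,  /* low surrogate with no preceding high one */
    SKETCH_UTF16_NONCHARACTER   /* Unicode non-character (assumed set) */
};

/* Unicode non-characters: U+FDD0..U+FDEF and U+xxFFFE/U+xxFFFF. */
static int sketch_is_noncharacter(uint32_t cp)
{
    return (cp >= 0xfdd0 && cp <= 0xfdef) || (cp & 0xfffe) == 0xfffe;
}

/* Checks len 16-bit units; on failure *error_offset is set to the index
   of the unit that starts the offending sequence. */
enum sketch_utf16_reason sketch_check_utf16(const uint16_t *s, size_t len,
                                            size_t *error_offset)
{
    assert(s != NULL && error_offset != NULL);

    for (size_t i = 0; i < len; i++) {
        size_t start = i;
        uint32_t cp = s[i];

        if (cp >= 0xdc00 && cp <= 0xdfff) {
            *error_offset = start;
            return SKETCH_UTF16_ISOLATED_LOW;
        }
        if (cp >= 0xd800 && cp <= 0xdbff) {
            if (i + 1 >= len) {
                *error_offset = start;
                return SKETCH_UTF16_MISSING_LOW;
            }
            uint32_t low = s[i + 1];
            if (low < 0xdc00 || low > 0xdfff) {
                *error_offset = start;
                return SKETCH_UTF16_INVALID_LOW;
            }
            /* Combine the surrogate pair into one code point. */
            cp = 0x10000 + ((cp - 0xd800) << 10) + (low - 0xdc00);
            i++;
        }
        if (sketch_is_noncharacter(cp)) {
            *error_offset = start;
            return SKETCH_UTF16_NONCHARACTER;
        }
    }
    return SKETCH_UTF16_OK;
}
```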
459    
460    
461    ERROR TEXTS
462    
463           If there is an error while compiling a pattern, the error text that  is
464           passed  back by pcre16_compile() or pcre16_compile2() is still an 8-bit
465           character string, zero-terminated.
466    
467    
468    CALLOUTS
469    
470           The subject and mark fields in the callout block that is  passed  to  a
471           callout function point to 16-bit vectors.
472    
473    
474    TESTING
475    
476           The  pcretest  program continues to operate with 8-bit input and output
477           files, but it can be used for testing the 16-bit library. If it is  run
478           with the command line option -16, patterns and subject strings are con-
479           verted from 8-bit to 16-bit before being passed to PCRE, and the 16-bit
480           library  functions  are used instead of the 8-bit ones. Returned 16-bit
481           strings are converted to 8-bit for output. If neither the 8-bit nor the
482           32-bit library was compiled, pcretest defaults to 16-bit and the
483           -16 option is ignored.
484    
485           When PCRE is being built, the RunTest script that is  called  by  "make
486           check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
487           16-bit and 32-bit libraries have been built, and runs the tests
488           appropriately.
489    
490    
491    NOT SUPPORTED IN 16-BIT MODE
492    
493           Not all the features of the 8-bit library are available with the 16-bit
494           library. The C++ and POSIX wrapper functions  support  only  the  8-bit
495           library, and the pcregrep program is at present 8-bit only.
496    
497    
498  AUTHOR  AUTHOR
# Line 219  AUTHOR Line 501  AUTHOR
501         University Computing Service         University Computing Service
502         Cambridge CB2 3QH, England.         Cambridge CB2 3QH, England.
503    
        Putting  an actual email address here seems to have been a spam magnet,  
        so I've taken it away. If you want to email me, use  my  two  initials,  
        followed by the two digits 10, at the domain cam.ac.uk.  
   
504    
505  REVISION  REVISION
506    
507         Last updated: 06 August 2007         Last updated: 08 November 2012
508         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
509  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
510    
511    
512    PCRE(3)                    Library Functions Manual                    PCRE(3)
513    
514    
 PCREBUILD(3)                                                      PCREBUILD(3)  
   
515    
516  NAME  NAME
517         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
518    
519           #include <pcre.h>
520    
521    
522    PCRE 32-BIT API BASIC FUNCTIONS
523    
524           pcre32 *pcre32_compile(PCRE_SPTR32 pattern, int options,
525                const char **errptr, int *erroffset,
526                const unsigned char *tableptr);
527    
528           pcre32 *pcre32_compile2(PCRE_SPTR32 pattern, int options,
529                int *errorcodeptr,
530                const char **errptr, int *erroffset,
531                const unsigned char *tableptr);
532    
533           pcre32_extra *pcre32_study(const pcre32 *code, int options,
534                const char **errptr);
535    
536           void pcre32_free_study(pcre32_extra *extra);
537    
538           int pcre32_exec(const pcre32 *code, const pcre32_extra *extra,
539                PCRE_SPTR32 subject, int length, int startoffset,
540                int options, int *ovector, int ovecsize);
541    
542           int pcre32_dfa_exec(const pcre32 *code, const pcre32_extra *extra,
543                PCRE_SPTR32 subject, int length, int startoffset,
544                int options, int *ovector, int ovecsize,
545                int *workspace, int wscount);
546    
547    
548    PCRE 32-BIT API STRING EXTRACTION FUNCTIONS
549    
550           int pcre32_copy_named_substring(const pcre32 *code,
551                PCRE_SPTR32 subject, int *ovector,
552                int stringcount, PCRE_SPTR32 stringname,
553                PCRE_UCHAR32 *buffer, int buffersize);
554    
555           int pcre32_copy_substring(PCRE_SPTR32 subject, int *ovector,
556                int stringcount, int stringnumber, PCRE_UCHAR32 *buffer,
557                int buffersize);
558    
559           int pcre32_get_named_substring(const pcre32 *code,
560                PCRE_SPTR32 subject, int *ovector,
561                int stringcount, PCRE_SPTR32 stringname,
562                PCRE_SPTR32 *stringptr);
563    
564           int pcre32_get_stringnumber(const pcre32 *code,
565                PCRE_SPTR32 name);
566    
567           int pcre32_get_stringtable_entries(const pcre32 *code,
568                PCRE_SPTR32 name, PCRE_UCHAR32 **first, PCRE_UCHAR32 **last);
569    
570           int pcre32_get_substring(PCRE_SPTR32 subject, int *ovector,
571                int stringcount, int stringnumber,
572                PCRE_SPTR32 *stringptr);
573    
574           int pcre32_get_substring_list(PCRE_SPTR32 subject,
575                int *ovector, int stringcount, PCRE_SPTR32 **listptr);
576    
577           void pcre32_free_substring(PCRE_SPTR32 stringptr);
578    
579           void pcre32_free_substring_list(PCRE_SPTR32 *stringptr);
580    
581    
582    PCRE 32-BIT API AUXILIARY FUNCTIONS
583    
584           pcre32_jit_stack *pcre32_jit_stack_alloc(int startsize, int maxsize);
585    
586           void pcre32_jit_stack_free(pcre32_jit_stack *stack);
587    
588           void pcre32_assign_jit_stack(pcre32_extra *extra,
589                pcre32_jit_callback callback, void *data);
590    
591           const unsigned char *pcre32_maketables(void);
592    
593           int pcre32_fullinfo(const pcre32 *code, const pcre32_extra *extra,
594                int what, void *where);
595    
596           int pcre32_refcount(pcre32 *code, int adjust);
597    
598           int pcre32_config(int what, void *where);
599    
600           const char *pcre32_version(void);
601    
602           int pcre32_pattern_to_host_byte_order(pcre32 *code,
603                pcre32_extra *extra, const unsigned char *tables);
604    
605    
606    PCRE 32-BIT API INDIRECTED FUNCTIONS
607    
608           void *(*pcre32_malloc)(size_t);
609    
610           void (*pcre32_free)(void *);
611    
612           void *(*pcre32_stack_malloc)(size_t);
613    
614           void (*pcre32_stack_free)(void *);
615    
616           int (*pcre32_callout)(pcre32_callout_block *);
617    
618    
619    PCRE 32-BIT API 32-BIT-ONLY FUNCTION
620    
621           int pcre32_utf32_to_host_byte_order(PCRE_UCHAR32 *output,
622                PCRE_SPTR32 input, int length, int *byte_order,
623                int keep_boms);
624    
625    
626    THE PCRE 32-BIT LIBRARY
627    
628           Starting  with  release  8.32, it is possible to compile a PCRE library
629           that supports 32-bit character strings, including  UTF-32  strings,  as
630           well as or instead of the original 8-bit library. This work was done by
631           Christian Persch, based on the work done  by  Zoltan  Herczeg  for  the
632           16-bit  library.  All  three  libraries contain identical sets of func-
633           tions, used in exactly the same way.  Only the names of  the  functions
634           and  the  data  types  of their arguments and results are different. To
635           avoid over-complication and reduce the documentation maintenance  load,
636           most  of  the PCRE documentation describes the 8-bit library, with only
637           occasional references to the 16-bit and  32-bit  libraries.  This  page
638           describes what is different when you use the 32-bit library.
639    
640           WARNING:  A  single  application  can  be linked with all or any of the
641           three libraries, but you must take care when processing any  particular
642           pattern  to  use  functions  from just one library. For example, if you
643           want to study a pattern that was compiled  with  pcre32_compile(),  you
644           must do so with pcre32_study(), not pcre_study(), and you must free the
645           study data with pcre32_free_study().
646    
647    
648    THE HEADER FILE
649    
650           There is only one header file, pcre.h. It contains prototypes  for  all
651           the functions in all libraries, as well as definitions of flags, struc-
652           tures, error codes, etc.
653    
654    
655    THE LIBRARY NAME
656    
657           In Unix-like systems, the 32-bit library is called libpcre32,  and  can
658           normally  be  accessed  by adding -lpcre32 to the command for linking an
659           application that uses PCRE.
660    
661    
662    STRING TYPES
663    
664           In the 8-bit library, strings are passed to PCRE library  functions  as
665           vectors  of  bytes  with  the  C  type "char *". In the 32-bit library,
666           strings are passed as vectors of unsigned 32-bit quantities. The  macro
667           PCRE_UCHAR32  specifies  an  appropriate  data type, and PCRE_SPTR32 is
668           defined as "const PCRE_UCHAR32 *". In very many environments, "unsigned
669           int" is a 32-bit data type. When PCRE is built, it defines PCRE_UCHAR32
670           as "unsigned int", but checks that it really is a 32-bit data type.  If
671           it is not, the build fails with an error message telling the maintainer
672           to modify the definition appropriately.
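
The kind of build-time width check described above can be sketched with a C11 static assertion. The type names sketch_uchar32 and sketch_sptr32 and the helper sketch_strlen32() are stand-ins invented for this example; how PCRE's own build performs its check is not shown here.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical stand-ins for PCRE_UCHAR32 and PCRE_SPTR32. */
typedef unsigned int sketch_uchar32;
typedef const sketch_uchar32 *sketch_sptr32;

/* Compilation fails here if "unsigned int" is not exactly 32 bits wide,
   telling the maintainer to change the typedef above. */
static_assert(sizeof(sketch_uchar32) * CHAR_BIT == 32,
              "sketch_uchar32 must be a 32-bit type; adjust the typedef");

/* Counts the 32-bit units before the zero terminator. */
size_t sketch_strlen32(sketch_sptr32 s)
{
    size_t n = 0;
    while (s[n] != 0) n++;
    return n;
}
```

Note that lengths are counted in 32-bit units, not bytes, which matches how the 32-bit functions measure strings and offsets.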
673    
674    
675    STRUCTURE TYPES
676    
677           The types of the opaque structures that are used  for  compiled  32-bit
678           patterns  and  JIT stacks are pcre32 and pcre32_jit_stack respectively.
679           The  type  of  the  user-accessible  structure  that  is  returned   by
680           pcre32_study()  is  pcre32_extra, and the type of the structure that is
681           used for passing data to a callout  function  is  pcre32_callout_block.
682           These structures contain the same fields, with the same names, as their
683           8-bit counterparts. The only difference is that pointers  to  character
684           strings are 32-bit instead of 8-bit types.
685    
686    
687    32-BIT FUNCTIONS
688    
689           For  every function in the 8-bit library there is a corresponding func-
690           tion in the 32-bit library with a name that starts with pcre32_ instead
691           of  pcre_.  The  prototypes are listed above. In addition, there is one
692           extra function, pcre32_utf32_to_host_byte_order(). This  is  a  utility
693           function  that converts a UTF-32 character string to host byte order if
694           necessary. The other 32-bit  functions  expect  the  strings  they  are
695           passed to be in host byte order.
696    
697           The input and output arguments of pcre32_utf32_to_host_byte_order() may
698           point to the same address, that is, conversion in place  is  supported.
699           The output buffer must be at least as long as the input.
700    
701           The  length  argument  specifies the number of 32-bit data units in the
702           input string; a negative value specifies a zero-terminated string.
703    
704           If byte_order is NULL, it is assumed that the string starts off in host
705           byte  order. This may be changed by byte-order marks (BOMs) anywhere in
706           the string (commonly as the first character).
707    
708           If byte_order is not NULL, a non-zero value of the integer to which  it
709           points  means  that  the input starts off in host byte order, otherwise
710           the opposite order is assumed. Again, BOMs in  the  string  can  change
711           this. The final byte order is passed back at the end of processing.
712    
713           If  keep_boms  is  not  zero,  byte-order  mark characters (0xfeff) are
714           copied into the output string. Otherwise they are discarded.
715    
716           The result of the function is the number of 32-bit  units  placed  into
717           the  output  buffer,  including  the  zero terminator if the string was
718           zero-terminated.
719    
720    
721    SUBJECT STRING OFFSETS
722    
723           The offsets within subject strings that are returned  by  the  matching
724           functions are in 32-bit units rather than bytes.
725    
726    
727    NAMED SUBPATTERNS
728    
729           The  name-to-number translation table that is maintained for named sub-
730           patterns uses 32-bit characters.  The  pcre32_get_stringtable_entries()
731           function returns the length of each entry in the table as the number of
732           32-bit data units.
733    
734    
735    OPTION NAMES
736    
737           There   are   two   new   general   option   names,   PCRE_UTF32    and
738           PCRE_NO_UTF32_CHECK,     which     correspond    to    PCRE_UTF8    and
739           PCRE_NO_UTF8_CHECK in the 8-bit library. In  fact,  these  new  options
740           define  the  same bits in the options word. There is a discussion about
741           the validity of UTF-32 strings in the pcreunicode page.
742    
743           For the pcre32_config() function there is an  option  PCRE_CONFIG_UTF32
744           that  returns  1  if UTF-32 support is configured, otherwise 0. If this
745           option  is  given  to  pcre_config()  or  pcre16_config(),  or  if  the
746           PCRE_CONFIG_UTF8  or  PCRE_CONFIG_UTF16  option is given to pcre32_con-
747           fig(), the result is the PCRE_ERROR_BADOPTION error.
748    
749    
750    CHARACTER CODES
751    
752           In 32-bit mode, when  PCRE_UTF32  is  not  set,  character  values  are
753           treated in the same way as in 8-bit, non UTF-8 mode, except, of course,
754           that they can range from 0 to 0x7fffffff instead of 0 to 0xff.  Charac-
755           ter  types for characters less than 0xff can therefore be influenced by
756           the locale in the same way as before.   Characters  greater  than  0xff
757           have only one case, and no "type" (such as letter or digit).
758    
759           In  UTF-32  mode,  the  character  code  is  Unicode, in the range 0 to
760           0x10ffff, with the exception of values in the range  0xd800  to  0xdfff
761           because those are "surrogate" values that are ill-formed in UTF-32.
762    
763           A  UTF-32 string can indicate its endianness by a special code known as a
764           byte-order mark (BOM). The PCRE functions do not handle this, expecting
765           strings   to   be  in  host  byte  order.  A  utility  function  called
766           pcre32_utf32_to_host_byte_order() is provided to help  with  this  (see
767           above).
768    
769    
770    ERROR NAMES
771    
772           The  error  PCRE_ERROR_BADUTF32  corresponds  to its 8-bit counterpart.
773           The error PCRE_ERROR_BADMODE is given when a compiled pattern is passed
774           to  a  function that processes patterns in the other mode, for example,
775           if a pattern compiled with pcre_compile() is passed to pcre32_exec().
776    
777           There are new error codes whose names  begin  with  PCRE_UTF32_ERR  for
778           invalid  UTF-32  strings,  corresponding to the PCRE_UTF8_ERR codes for
779           UTF-8 strings that are described in the section entitled "Reason  codes
780           for  invalid UTF-8 strings" in the main pcreapi page. The UTF-32 errors
781           are:
782    
783             PCRE_UTF32_ERR1  Surrogate character (range from 0xd800 to 0xdfff)
784             PCRE_UTF32_ERR2  Non-character
785             PCRE_UTF32_ERR3  Character > 0x10ffff
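
A minimal sketch of the three checks listed above, applied to a vector of 32-bit units. The enum and function names are hypothetical, not part of the PCRE API. The non-character test follows the Unicode definition, and it is an assumption that PCRE tests the same set.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical reason codes mirroring the three conditions in the list. */
enum sketch_utf32_reason {
    SKETCH_UTF32_OK = 0,
    SKETCH_UTF32_SURROGATE,     /* 0xd800 to 0xdfff */
    SKETCH_UTF32_NONCHARACTER,  /* Unicode non-character (assumed set) */
    SKETCH_UTF32_TOO_LARGE      /* greater than 0x10ffff */
};

/* Checks len 32-bit units; on failure *error_offset is set to the index
   of the offending unit. */
enum sketch_utf32_reason sketch_check_utf32(const uint32_t *s, size_t len,
                                            size_t *error_offset)
{
    assert(s != NULL && error_offset != NULL);

    for (size_t i = 0; i < len; i++) {
        uint32_t cp = s[i];
        enum sketch_utf32_reason r = SKETCH_UTF32_OK;

        /* Test the range first, so that large values are not misreported
           as non-characters by the mask test below. */
        if (cp > 0x10ffff)
            r = SKETCH_UTF32_TOO_LARGE;
        else if (cp >= 0xd800 && cp <= 0xdfff)
            r = SKETCH_UTF32_SURROGATE;
        else if ((cp >= 0xfdd0 && cp <= 0xfdef) || (cp & 0xfffe) == 0xfffe)
            r = SKETCH_UTF32_NONCHARACTER;

        if (r != SKETCH_UTF32_OK) {
            *error_offset = i;
            return r;
        }
    }
    return SKETCH_UTF32_OK;
}
```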
786    
787    
788    ERROR TEXTS
789    
790           If there is an error while compiling a pattern, the error text that  is
791           passed  back by pcre32_compile() or pcre32_compile2() is still an 8-bit
792           character string, zero-terminated.
793    
794    
795    CALLOUTS
796    
797           The subject and mark fields in the callout block that is  passed  to  a
798           callout function point to 32-bit vectors.
799    
800    
801    TESTING
802    
803           The  pcretest  program continues to operate with 8-bit input and output
804           files, but it can be used for testing the 32-bit library. If it is  run
805           with the command line option -32, patterns and subject strings are con-
806           verted from 8-bit to 32-bit before being passed to PCRE, and the 32-bit
807           library  functions  are used instead of the 8-bit ones. Returned 32-bit
808           strings are converted to 8-bit for output. If neither the 8-bit nor the
809           16-bit library was compiled, pcretest defaults to 32-bit and the
810           -32 option is ignored.
811    
812           When PCRE is being built, the RunTest script that is  called  by  "make
813           check"  uses  the  pcretest  -C  option to discover which of the 8-bit,
814           16-bit and 32-bit libraries have been built, and runs the tests
815           appropriately.
816    
817    
818    NOT SUPPORTED IN 32-BIT MODE
819    
820           Not all the features of the 8-bit library are available with the 32-bit
821           library. The C++ and POSIX wrapper functions  support  only  the  8-bit
822           library, and the pcregrep program is at present 8-bit only.
823    
824    
825    AUTHOR
826    
827           Philip Hazel
828           University Computing Service
829           Cambridge CB2 3QH, England.
830    
831    
832    REVISION
833    
834           Last updated: 08 November 2012
835           Copyright (c) 1997-2012 University of Cambridge.
836    ------------------------------------------------------------------------------
837    
838    
839    PCREBUILD(3)               Library Functions Manual               PCREBUILD(3)
840    
841    
842    
843    NAME
844           PCRE - Perl-compatible regular expressions
845    
846  PCRE BUILD-TIME OPTIONS  PCRE BUILD-TIME OPTIONS
847    
848         This  document  describes  the  optional  features  of PCRE that can be         This  document  describes  the  optional  features  of PCRE that can be
849         selected when the library is compiled. They are all selected, or  dese-         selected when the library is compiled. It assumes use of the  configure
850         lected, by providing options to the configure script that is run before         script,  where the optional features are selected or deselected by pro-
851         the make command. The complete list of  options  for  configure  (which         viding options to configure before running the make  command.  However,
852         includes  the  standard  ones such as the selection of the installation         the  same  options  can be selected in both Unix-like and non-Unix-like
853         directory) can be obtained by running         environments using the GUI facility of cmake-gui if you are using CMake
854           instead of configure to build PCRE.
855    
856           There  is a lot more information about building PCRE without using con-
857           figure (including information about using CMake or building "by  hand")
858           in  the file called NON-AUTOTOOLS-BUILD, which is part of the PCRE dis-
859           tribution. You should consult this file as well as the README  file  if
860           you are building in a non-Unix-like environment.
861    
862           The complete list of options for configure (which includes the standard
863           ones such as the  selection  of  the  installation  directory)  can  be
864           obtained by running
865    
866           ./configure --help           ./configure --help
867    
868         The following sections include  descriptions  of  options  whose  names         The  following  sections  include  descriptions  of options whose names
869         begin with --enable or --disable. These settings specify changes to the         begin with --enable or --disable. These settings specify changes to the
870         defaults for the configure command. Because of the way  that  configure         defaults  for  the configure command. Because of the way that configure
871         works,  --enable  and --disable always come in pairs, so the complemen-         works, --enable and --disable always come in pairs, so  the  complemen-
872         tary option always exists as well, but as it specifies the default,  it         tary  option always exists as well, but as it specifies the default, it
873         is not described.         is not described.
874    
875    
876    BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
877    
878           By default, a library called libpcre  is  built,  containing  functions
879           that  take  string  arguments  contained in vectors of bytes, either as
880           single-byte characters, or interpreted as UTF-8 strings. You  can  also
881           build  a  separate library, called libpcre16, in which strings are con-
882           tained in vectors of 16-bit data units and interpreted either  as  sin-
883           gle-unit characters or UTF-16 strings, by adding
884    
885             --enable-pcre16
886    
887           to the configure command. You can also build a separate library, called
888           libpcre32, in which strings are contained in  vectors  of  32-bit  data
889           units  and  interpreted  either  as  single-unit  characters  or UTF-32
890           strings, by adding
891    
892             --enable-pcre32
893    
894           to the configure command. If you do not want the 8-bit library, add
895    
896             --disable-pcre8
897    
898           as well. At least one of the three libraries must be built.  Note  that
899           the  C++  and  POSIX  wrappers are for the 8-bit library only, and that
900           pcregrep is an 8-bit program. None of these are  built  if  you  select
901           only the 16-bit or 32-bit libraries.
902    
903    
904    BUILDING SHARED AND STATIC LIBRARIES
905    
906           The  PCRE building process uses libtool to build both shared and static
907           Unix libraries by default. You can suppress one of these by adding  one
908           of
909    
910             --disable-shared
911             --disable-static
912    
913           to the configure command, as required.
914    
915    
916  C++ SUPPORT  C++ SUPPORT
917    
918         By default, the configure script will search for a C++ compiler and C++         By  default,  if the 8-bit library is being built, the configure script
919         header files. If it finds them, it automatically builds the C++ wrapper         will search for a C++ compiler and C++ header files. If it finds  them,
920         library for PCRE. You can disable this by adding         it  automatically  builds  the C++ wrapper library (which supports only
921           8-bit strings). You can disable this by adding
922    
923           --disable-cpp           --disable-cpp
924    
925         to the configure command.         to the configure command.
926    
927    
928  UTF-8 SUPPORT  UTF-8, UTF-16 AND UTF-32 SUPPORT
929    
930         To build PCRE with support for UTF-8 character strings, add         To build PCRE with support for UTF Unicode character strings, add
931    
932           --enable-utf8           --enable-utf
933    
934         to  the  configure  command.  Of  itself, this does not make PCRE treat         to the configure command. This setting applies to all three  libraries,
935         strings as UTF-8. As well as compiling PCRE with this option, you  also         adding  support  for  UTF-8 to the 8-bit library, support for UTF-16 to
936         have to set the PCRE_UTF8 option when you call the pcre_compile()         the 16-bit library, and  support  for  UTF-32  to  the  32-bit
937         function.         library.  There  are no separate options for enabling UTF-8, UTF-16 and
938           UTF-32 independently because that would allow ridiculous settings  such
939           as  requesting UTF-16 support while building only the 8-bit library. It
940           is not possible to build one library with UTF support and another with-
941           out  in the same configuration. (For backwards compatibility, --enable-
942           utf8 is a synonym of --enable-utf.)
943    
944           Of itself, this setting does not make  PCRE  treat  strings  as  UTF-8,
945           UTF-16  or UTF-32. As well as compiling PCRE with this option, you also
946           have to set the PCRE_UTF8, PCRE_UTF16  or  PCRE_UTF32  option  (as
947           appropriate) when you call one of the pattern compiling functions.
948    
949           If  you  set --enable-utf when compiling in an EBCDIC environment, PCRE
950           expects its input to be either ASCII or UTF-8 (depending  on  the  run-
951           time option). It is not possible to support both EBCDIC and UTF-8 codes
952           in the same version of  the  library.  Consequently,  --enable-utf  and
953           --enable-ebcdic are mutually exclusive.
954    
955    
956  UNICODE CHARACTER PROPERTY SUPPORT  UNICODE CHARACTER PROPERTY SUPPORT
957    
958         UTF-8 support allows PCRE to process character values greater than  255         UTF  support allows the libraries to process character codepoints up to
959         in  the  strings that it handles. On its own, however, it does not pro-         0x10ffff in the strings that they handle. On its own, however, it  does
960         vide any facilities for accessing the properties of such characters. If         not provide any facilities for accessing the properties of such charac-
961         you  want  to  be able to use the pattern escapes \P, \p, and \X, which         ters. If you want to be able to use the pattern escapes \P, \p, and \X,
962         refer to Unicode character properties, you must add         which refer to Unicode character properties, you must add
963    
964           --enable-unicode-properties           --enable-unicode-properties
965    
966         to the configure command. This implies UTF-8 support, even if you  have         to  the  configure  command. This implies UTF support, even if you have
967         not explicitly requested it.         not explicitly requested it.
968    
969         Including  Unicode  property  support  adds around 30K of tables to the         Including Unicode property support adds around 30K  of  tables  to  the
970         PCRE library. Only the general category properties such as  Lu  and  Nd         PCRE  library.  Only  the general category properties such as Lu and Nd
971         are supported. Details are given in the pcrepattern documentation.         are supported. Details are given in the pcrepattern documentation.
972    
973    
974    JUST-IN-TIME COMPILER SUPPORT
975    
976           Just-in-time compiler support is included in the build by specifying
977    
978             --enable-jit
979    
980           This support is available only for certain hardware  architectures.  If
981           this  option  is  set  for  an unsupported architecture, a compile time
982           error occurs.  See the pcrejit documentation for a  discussion  of  JIT
983           usage. When JIT support is enabled, pcregrep automatically makes use of
984           it, unless you add
985    
986             --disable-pcregrep-jit
987    
988           to the "configure" command.
989    
990    
991  CODE VALUE OF NEWLINE  CODE VALUE OF NEWLINE
992    
993         By  default,  PCRE interprets character 10 (linefeed, LF) as indicating         By default, PCRE interprets the linefeed (LF) character  as  indicating
994         the end of a line. This is the normal newline  character  on  Unix-like         the  end  of  a line. This is the normal newline character on Unix-like
995         systems. You can compile PCRE to use character 13 (carriage return, CR)         systems. You can compile PCRE to use carriage return (CR)  instead,  by
996         instead, by adding         adding
997    
998           --enable-newline-is-cr           --enable-newline-is-cr
999    
1000         to the  configure  command.  There  is  also  a  --enable-newline-is-lf         to  the  configure  command.  There  is  also  a --enable-newline-is-lf
1001         option, which explicitly specifies linefeed as the newline character.         option, which explicitly specifies linefeed as the newline character.
1002    
1003         Alternatively, you can specify that line endings are to be indicated by         Alternatively, you can specify that line endings are to be indicated by
# Line 319  CODE VALUE OF NEWLINE Line 1009  CODE VALUE OF NEWLINE
1009    
1010           --enable-newline-is-anycrlf           --enable-newline-is-anycrlf
1011    
1012         which causes PCRE to recognize any of the three sequences  CR,  LF,  or         which  causes  PCRE  to recognize any of the three sequences CR, LF, or
1013         CRLF as indicating a line ending. Finally, a fifth option, specified by         CRLF as indicating a line ending. Finally, a fifth option, specified by
1014    
1015           --enable-newline-is-any           --enable-newline-is-any
# Line 331  CODE VALUE OF NEWLINE Line 1021  CODE VALUE OF NEWLINE
1021         conventional to use the standard for your operating system.         conventional to use the standard for your operating system.
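
As an illustration of the "anycrlf" convention described above, the following C sketch reports whether a position in a buffer is the start of a CR, LF, or CRLF line ending. The function name anycrlf_length is hypothetical; this is not code from PCRE, only a minimal sketch of the rule as stated.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE.
   Returns the length (1 or 2) of the line ending that starts at p,
   or 0 if p does not point at CR, LF, or CRLF.
   'end' points one past the last byte of the buffer. */
size_t anycrlf_length(const char *p, const char *end)
{
    if (p >= end)
        return 0;
    if (*p == '\n')
        return 1;                                     /* LF */
    if (*p == '\r')
        return (p + 1 < end && p[1] == '\n') ? 2 : 1; /* CRLF or lone CR */
    return 0;
}
```

A CR that is immediately followed by LF is treated as one two-byte line ending rather than as two separate ones.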
1022    
1023    
1024  BUILDING SHARED AND STATIC LIBRARIES  WHAT \R MATCHES
1025    
1026         The PCRE building process uses libtool to build both shared and  static         By default, the sequence \R in a pattern matches  any  Unicode  newline
1027         Unix  libraries by default. You can suppress one of these by adding one         sequence,  whatever  has  been selected as the line ending sequence. If
1028         of         you specify
1029    
1030           --disable-shared           --enable-bsr-anycrlf
          --disable-static  
1031    
1032         to the configure command, as required.         the default is changed so that \R matches only CR, LF, or  CRLF.  What-
1033           ever  is selected when PCRE is built can be overridden when the library
1034           functions are called.
1035    
1036    
1037  POSIX MALLOC USAGE  POSIX MALLOC USAGE
1038    
1039         When PCRE is called through the POSIX interface (see the pcreposix doc-         When the 8-bit library is called through the POSIX interface  (see  the
1040         umentation),  additional  working  storage  is required for holding the         pcreposix  documentation),  additional  working storage is required for
1041         pointers to capturing substrings, because PCRE requires three  integers         holding the pointers to capturing  substrings,  because  PCRE  requires
1042         per  substring,  whereas  the POSIX interface provides only two. If the         three integers per substring, whereas the POSIX interface provides only
1043         number of expected substrings is small, the wrapper function uses space         two. If the number of expected substrings is small, the  wrapper  func-
1044         on the stack, because this is faster than using malloc() for each call.         tion  uses  space  on the stack, because this is faster than using mal-
1045         The default threshold above which the stack is no longer used is 10; it         loc() for each call. The default threshold above which the stack is  no
1046         can be changed by adding a setting such as         longer used is 10; it can be changed by adding a setting such as
1047    
1048           --with-posix-malloc-threshold=20           --with-posix-malloc-threshold=20
1049    
# Line 363  HANDLING VERY LARGE PATTERNS Line 1054  HANDLING VERY LARGE PATTERNS
1054    
1055         Within  a  compiled  pattern,  offset values are used to point from one         Within  a  compiled  pattern,  offset values are used to point from one
1056         part to another (for example, from an opening parenthesis to an  alter-         part to another (for example, from an opening parenthesis to an  alter-
1057         nation  metacharacter).  By default, two-byte values are used for these         nation  metacharacter).  By default, in the 8-bit and 16-bit libraries,
1058         offsets, leading to a maximum size for a  compiled  pattern  of  around         two-byte values are used for these offsets, leading to a  maximum  size
1059         64K.  This  is sufficient to handle all but the most gigantic patterns.         for  a compiled pattern of around 64K. This is sufficient to handle all
1060         Nevertheless, some people do want to process enormous patterns,  so  it         but the most gigantic patterns.  Nevertheless, some people do  want  to
1061         is  possible  to compile PCRE to use three-byte or four-byte offsets by         process  truly  enormous patterns, so it is possible to compile PCRE to
1062         adding a setting such as         use three-byte or four-byte offsets by adding a setting such as
1063    
1064           --with-link-size=3           --with-link-size=3
1065    
1066         to the configure command. The value given must be 2,  3,  or  4.  Using         to the configure command. The value given must be 2, 3, or 4.  For  the
1067         longer  offsets slows down the operation of PCRE because it has to load         16-bit  library,  a  value of 3 is rounded up to 4. In these libraries,
1068         additional bytes when handling them.         using longer offsets slows down the operation of PCRE because it has to
1069           load  additional  data  when  handling them. For the 32-bit library the
1070           value is always 4 and cannot be overridden; the value  of  --with-link-
1071           size is ignored.
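
The relationship between the link size and the maximum pattern size can be sketched in C as follows. The helper names (put_link, get_link, max_link_offset) are hypothetical and are not PCRE's internal macros, and the byte order used here is an assumption made for the illustration.

```c
#include <assert.h>

/* Hypothetical helpers, not part of PCRE. They store and load an
   offset in link_size bytes, most significant byte first. */
void put_link(unsigned char *p, int link_size, unsigned long long value)
{
    int i;
    assert(link_size >= 2 && link_size <= 4);
    for (i = link_size - 1; i >= 0; i--) {
        p[i] = (unsigned char)(value & 0xff);  /* keep the low 8 bits */
        value >>= 8;
    }
}

unsigned long long get_link(const unsigned char *p, int link_size)
{
    unsigned long long value = 0;
    int i;
    assert(link_size >= 2 && link_size <= 4);
    for (i = 0; i < link_size; i++)
        value = (value << 8) | p[i];
    return value;
}

/* Largest offset that fits in link_size bytes. */
unsigned long long max_link_offset(int link_size)
{
    assert(link_size >= 2 && link_size <= 4);
    return (1ULL << (8 * link_size)) - 1;
}
```

With two bytes the largest offset is 65535, which is where the limit of around 64K comes from; an offset that does not fit is silently truncated by put_link in this sketch. Each extra byte also means one more load and shift in get_link, which is the kind of additional work referred to above.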
1072    
1073    
1074  AVOIDING EXCESSIVE STACK USAGE  AVOIDING EXCESSIVE STACK USAGE
1075    
1076         When matching with the pcre_exec() function, PCRE implements backtrack-         When matching with the pcre_exec() function, PCRE implements backtrack-
1077         ing  by  making recursive calls to an internal function called match().         ing by making recursive calls to an internal function  called  match().
1078         In environments where the size of the stack is limited,  this  can  se-         In  environments  where  the size of the stack is limited, this can se-
1079         verely  limit  PCRE's operation. (The Unix environment does not usually         verely limit PCRE's operation. (The Unix environment does  not  usually
1080         suffer from this problem, but it may sometimes be necessary to increase         suffer from this problem, but it may sometimes be necessary to increase
1081         the  maximum  stack size.  There is a discussion in the pcrestack docu-         the maximum stack size.  There is a discussion in the  pcrestack  docu-
1082         mentation.) An alternative approach to recursion that uses memory  from         mentation.)  An alternative approach to recursion that uses memory from
1083         the  heap  to remember data, instead of using recursive function calls,         the heap to remember data, instead of using recursive  function  calls,
1084         has been implemented to work round the problem of limited  stack  size.         has  been  implemented to work round the problem of limited stack size.
1085         If you want to build a version of PCRE that works this way, add         If you want to build a version of PCRE that works this way, add
1086    
1087           --disable-stack-for-recursion           --disable-stack-for-recursion
1088    
1089         to  the  configure  command. With this configuration, PCRE will use the         to the configure command. With this configuration, PCRE  will  use  the
1090         pcre_stack_malloc and pcre_stack_free variables to call memory  manage-         pcre_stack_malloc  and pcre_stack_free variables to call memory manage-
1091         ment  functions. By default these point to malloc() and free(), but you         ment functions. By default these point to malloc() and free(), but  you
1092         can replace the pointers so that your own functions are used.         can replace the pointers so that your own functions are used instead.
1093    
1094         Separate functions are  provided  rather  than  using  pcre_malloc  and         Separate  functions  are  provided  rather  than  using pcre_malloc and
1095         pcre_free  because  the  usage  is  very  predictable:  the block sizes         pcre_free because the  usage  is  very  predictable:  the  block  sizes
1096         requested are always the same, and  the  blocks  are  always  freed  in         requested  are  always  the  same,  and  the blocks are always freed in
1097         reverse  order.  A calling program might be able to implement optimized         reverse order. A calling program might be able to  implement  optimized
1098         functions that perform better  than  malloc()  and  free().  PCRE  runs         functions  that  perform  better  than  malloc()  and free(). PCRE runs
1099         noticeably more slowly when built in this way. This option affects only         noticeably more slowly when built in this way. This option affects only
1100         the  pcre_exec()  function;  it   is   not   relevant   for   the   the         the pcre_exec() function; it is not relevant for pcre_dfa_exec().
        pcre_dfa_exec() function.  
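
Because the requested blocks are always the same size, a calling program could serve them from a fixed-size pool. The following C sketch shows one possible pair of replacement functions. All names (fixed_pool_alloc, fixed_pool_free, SLOT_SIZE, SLOT_COUNT) are hypothetical, the slot size is an assumed upper bound rather than a value taken from PCRE, and the code is not thread-safe. It does not rely on the order in which blocks are freed.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define SLOT_SIZE  1024   /* assumed upper bound on one block */
#define SLOT_COUNT 64     /* blocks held before falling back to malloc() */

union slot {
    union slot *next;               /* link while the slot is free */
    long double align_ld;           /* these members force alignment */
    long long align_ll;
    unsigned char bytes[SLOT_SIZE];
};

static union slot pool[SLOT_COUNT];
static union slot *free_list = NULL; /* slots returned by the caller */
static size_t never_used = 0;        /* pool[never_used..] are untouched */

void *fixed_pool_alloc(size_t size)
{
    if (size <= SLOT_SIZE) {
        if (free_list != NULL) {     /* reuse a released slot first */
            union slot *s = free_list;
            free_list = s->next;
            return s;
        }
        if (never_used < SLOT_COUNT)
            return &pool[never_used++];
    }
    return malloc(size);             /* too large, or pool exhausted */
}

void fixed_pool_free(void *block)
{
    uintptr_t addr = (uintptr_t)block;
    if (block == NULL)
        return;
    if (addr >= (uintptr_t)pool && addr < (uintptr_t)(pool + SLOT_COUNT)) {
        union slot *s = block;       /* pool block: push on the free list */
        s->next = free_list;
        free_list = s;
    } else {
        free(block);                 /* block came from malloc() */
    }
}

/* With a PCRE built using --disable-stack-for-recursion, the hooks
   would be installed before matching like this:
       pcre_stack_malloc = fixed_pool_alloc;
       pcre_stack_free   = fixed_pool_free;                          */
```

Whether such a pool actually performs better than malloc() and free() depends on the platform and should be measured.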
1101    
1102    
1103  LIMITING PCRE RESOURCE USAGE  LIMITING PCRE RESOURCE USAGE
# Line 449  CREATING CHARACTER TABLES AT BUILD TIME Line 1142  CREATING CHARACTER TABLES AT BUILD TIME
1142         to  the  configure  command, the distributed tables are no longer used.         to  the  configure  command, the distributed tables are no longer used.
1143         Instead, a program called dftables is compiled and  run.  This  outputs         Instead, a program called dftables is compiled and  run.  This  outputs
1144         the source for a new set of tables, created in the default locale of your         the source for a new set of tables, created in the default locale of your
1145         C runtime system. (This method of replacing the tables does not work if         C run-time system. (This method of replacing the tables does  not  work
1146         you  are cross compiling, because dftables is run on the local host. If         if  you are cross compiling, because dftables is run on the local host.
1147         you need to create alternative tables when cross  compiling,  you  will         If you need to create alternative tables when cross compiling, you will
1148         have to do so "by hand".)         have to do so "by hand".)
1149    
1150    
# Line 466  USING EBCDIC CODE Line 1159  USING EBCDIC CODE
1159    
1160         to the configure command. This setting implies --enable-rebuild-charta-         to the configure command. This setting implies --enable-rebuild-charta-
1161         bles.  You  should  only  use  it if you know that you are in an EBCDIC         bles.  You  should  only  use  it if you know that you are in an EBCDIC
1162         environment (for example, an IBM mainframe operating system).         environment (for example,  an  IBM  mainframe  operating  system).  The
1163           --enable-ebcdic option is incompatible with --enable-utf.
1164    
1165           The EBCDIC character that corresponds to an ASCII LF is assumed to have
1166           the value 0x15 by default. However, in some EBCDIC  environments,  0x25
1167           is used. In such an environment you should use
1168    
1169             --enable-ebcdic-nl25
1170    
1171           as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
1172           has the same value as in ASCII, namely, 0x0d.  Whichever  of  0x15  and
1173           0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
1174           acter (which, in Unicode, is 0x85).
1175    
1176           The options that select newline behaviour, such as --enable-newline-is-
1177           cr, and equivalent run-time options, refer to these character values in
1178           an EBCDIC environment.
1179    
1180    
1181    PCREGREP OPTIONS FOR COMPRESSED FILE SUPPORT
1182    
1183           By default, pcregrep reads all files as plain text. You can build it so
1184           that it recognizes files whose names end in .gz or .bz2, and reads them
1185           with libz or libbz2, respectively, by adding one or both of
1186    
1187             --enable-pcregrep-libz
1188             --enable-pcregrep-libbz2
1189    
1190           to the configure command. These options naturally require that the rel-
1191           evant  libraries  are installed on your system. Configuration will fail
1192           if they are not.
1193    
1194    
1195    PCREGREP BUFFER SIZE
1196    
1197           pcregrep uses an internal buffer to hold a "window" on the file  it  is
1198           scanning, in order to be able to output "before" and "after" lines when
1199           it finds a match. The size of the buffer is controlled by  a  parameter
1200           whose default value is 20K. The buffer itself is three times this size,
1201           but because of the way it is used for holding "before" lines, the long-
1202           est  line  that  is guaranteed to be processable is the parameter size.
1203           You can change the default parameter value by adding, for example,
1204    
1205             --with-pcregrep-bufsize=50K
1206    
1207           to the configure command. The caller of pcregrep can, however, override
1208           this value by specifying a run-time option.
1209    
1210    
1211    PCRETEST OPTION FOR LIBREADLINE SUPPORT
1212    
1213           If you add
1214    
1215             --enable-pcretest-libreadline
1216    
1217           to  the  configure  command,  pcretest  is  linked with the libreadline
1218           library, and when its input is from a terminal, it reads it  using  the
1219           readline() function. This provides line-editing and history facilities.
1220           Note that libreadline is GPL-licensed, so if you distribute a binary of
1221           pcretest linked in this way, there may be licensing issues.
1222    
1223           Setting  this  option  causes  the -lreadline option to be added to the
1224           pcretest build. In many operating environments with  a system-installed
1225           libreadline this is sufficient. However, in some environments (e.g.  if
1226           an unmodified distribution version of readline is in use),  some  extra
1227           configuration  may  be necessary. The INSTALL file for libreadline says
1228           this:
1229    
1230             "Readline uses the termcap functions, but does not link with the
1231             termcap or curses library itself, allowing applications which link
1232             with readline the to choose an appropriate library."
1233    
1234           If your environment has not been set up so that an appropriate  library
1235           is automatically included, you may need to add something like
1236    
1237             LIBS="-lncurses"
1238    
1239           immediately before the configure command.
1240    
1241    
1242    DEBUGGING WITH VALGRIND SUPPORT
1243    
1244           By adding the
1245    
1246             --enable-valgrind
1247    
1248           option  to  the  configure command, PCRE will use valgrind annotations
1249           to mark certain memory regions as  unaddressable.  This  allows  it  to
1250           detect invalid memory accesses, and is mostly useful for debugging PCRE
1251           itself.
1252    
1253    
1254    CODE COVERAGE REPORTING
1255    
1256           If your C compiler is gcc, you can build a version  of  PCRE  that  can
1257           generate a code coverage report for its test suite. To enable this, you
1258           must install lcov version 1.6 or above. Then specify
1259    
1260             --enable-coverage
1261    
1262           to the configure command and build PCRE in the usual way.
1263    
1264           Note that using ccache (a caching C compiler) is incompatible with code
1265           coverage  reporting. If you have configured ccache to run automatically
1266           on your system, you must set the environment variable
1267    
1268             CCACHE_DISABLE=1
1269    
1270           before running make to build PCRE, so that ccache is not used.
1271    
1272           When --enable-coverage is used, the  following  additional targets are
1273           added to the Makefile:
1274    
1275             make coverage
1276    
1277           This  creates  a  fresh  coverage report for the PCRE test suite. It is
1278           equivalent to running "make coverage-reset", "make  coverage-baseline",
1279           "make check", and then "make coverage-report".
1280    
1281             make coverage-reset
1282    
1283           This zeroes the coverage counters, but does nothing else.
1284    
1285             make coverage-baseline
1286    
1287           This captures baseline coverage information.
1288    
1289             make coverage-report
1290    
1291           This creates the coverage report.
1292    
1293             make coverage-clean-report
1294    
1295           This  removes the generated coverage report without cleaning the cover-
1296           age data itself.
1297    
1298             make coverage-clean-data
1299    
1300           This removes the captured coverage data without removing  the  coverage
1301           files created at compile time (*.gcno).
1302    
1303             make coverage-clean
1304    
1305           This  cleans all coverage data including the generated coverage report.
1306           For more information about code coverage, see the gcov and  lcov  docu-
1307           mentation.
1308    
1309    
1310  SEE ALSO  SEE ALSO
1311    
1312         pcreapi(3), pcre_config(3).         pcreapi(3), pcre16, pcre32, pcre_config(3).
1313    
1314    
1315  AUTHOR  AUTHOR
# Line 483  AUTHOR Line 1321  AUTHOR
1321    
1322  REVISION  REVISION
1323    
1324         Last updated: 30 July 2007         Last updated: 30 October 2012
1325         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
1326  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
1327    
1328    
1329    PCREMATCHING(3)            Library Functions Manual            PCREMATCHING(3)
1330    
1331    
 PCREMATCHING(3)                                                PCREMATCHING(3)  
   
1332    
1333  NAME  NAME
1334         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
1335    
   
1336  PCRE MATCHING ALGORITHMS  PCRE MATCHING ALGORITHMS
1337    
1338         This document describes the two different algorithms that are available         This document describes the two different algorithms that are available
1339         in PCRE for matching a compiled regular expression against a given sub-         in PCRE for matching a compiled regular expression against a given sub-
1340         ject  string.  The  "standard"  algorithm  is  the  one provided by the         ject  string.  The  "standard"  algorithm  is  the  one provided by the
1341         pcre_exec() function.  This works in the same was  as  Perl's  matching         pcre_exec(), pcre16_exec() and pcre32_exec() functions. These  work  in
1342         function, and provides a Perl-compatible matching operation.         the same way as Perl's matching function, and provide a Perl-compatible
1343           matching  operation.   The  just-in-time  (JIT)  optimization  that  is
1344         An  alternative  algorithm is provided by the pcre_dfa_exec() function;         described  in  the pcrejit documentation is compatible with these func-
1345         this operates in a different way, and is not  Perl-compatible.  It  has         tions.
1346         advantages  and disadvantages compared with the standard algorithm, and  
1347         these are described below.         An  alternative  algorithm  is   provided   by   the   pcre_dfa_exec(),
1348           pcre16_dfa_exec()  and  pcre32_dfa_exec()  functions; they operate in a
1349           different way, and are not Perl-compatible. This alternative has advan-
1350           tages and disadvantages compared with the standard algorithm, and these
1351           are described below.
1352    
1353         When there is only one possible way in which a given subject string can         When there is only one possible way in which a given subject string can
1354         match  a pattern, the two algorithms give the same answer. A difference         match  a pattern, the two algorithms give the same answer. A difference
# Line 571  THE ALTERNATIVE MATCHING ALGORITHM Line 1413  THE ALTERNATIVE MATCHING ALGORITHM
1413         though  it is not implemented as a traditional finite state machine (it         though  it is not implemented as a traditional finite state machine (it
1414         keeps multiple states active simultaneously).         keeps multiple states active simultaneously).
1415    
1416           Although the general principle of this matching algorithm  is  that  it
1417           scans  the subject string only once, without backtracking, there is one
1418           exception: when a lookaround assertion is encountered,  the  characters
1419           following  or  preceding  the  current  point  have to be independently
1420           inspected.
1421    
1422         The scan continues until either the end of the subject is  reached,  or         The scan continues until either the end of the subject is  reached,  or
1423         there  are  no more unterminated paths. At this point, terminated paths         there  are  no more unterminated paths. At this point, terminated paths
1424         represent the different matching possibilities (if there are none,  the         represent the different matching possibilities (if there are none,  the
1425         match  has  failed).   Thus,  if there is more than one possible match,         match  has  failed).   Thus,  if there is more than one possible match,
1426         this algorithm finds all of them, and in particular, it finds the long-         this algorithm finds all of them, and in particular, it finds the long-
1427         est.  In PCRE, there is an option to stop the algorithm after the first         est.  The  matches are returned in decreasing order of length. There is
1428         match (which is necessarily the shortest) has been found.         an option to stop the algorithm after the first match (which is  neces-
1429           sarily the shortest) is found.
1430    
1431         Note that all the matches that are found start at the same point in the         Note that all the matches that are found start at the same point in the
1432         subject. If the pattern         subject. If the pattern
1433    
1434           cat(er(pillar)?)           cat(er(pillar)?)?
1435    
1436         is  matched  against the string "the caterpillar catchment", the result         is matched against the string "the caterpillar catchment",  the  result
1437         will be the three strings "cat", "cater", and "caterpillar" that  start         will  be the three strings "caterpillar", "cater", and "cat" that start
1438         at the fourth character of the subject. The algorithm does not automat-         at the fifth character of the subject. The algorithm does not automati-
1439         ically move on to find matches that start at later positions.         cally move on to find matches that start at later positions.
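
The behaviour described above for the pattern cat(er(pillar)?)? can be imitated with a small hand-written C sketch. It does not call PCRE and is not how pcre_dfa_exec() is implemented; the function name matches_at is hypothetical. It only shows that every match is anchored at one start offset and that the lengths are reported longest first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of PCRE. Walks the pattern
   cat(er(pillar)?)? by hand from one start offset and stores the
   length of every match found there in lengths[], longest first.
   Returns the number of matches (0 to 3). */
size_t matches_at(const char *subject, size_t start, size_t lengths[3])
{
    static const char *const pieces[3] = { "cat", "er", "pillar" };
    size_t found[3];
    size_t n = 0, pos = start, i;

    if (start > strlen(subject))
        return 0;

    for (i = 0; i < 3; i++) {
        size_t piece_len = strlen(pieces[i]);
        if (strncmp(subject + pos, pieces[i], piece_len) != 0)
            break;                  /* the next optional group is absent */
        pos += piece_len;
        found[n++] = pos - start;   /* a match can end after each piece */
    }

    for (i = 0; i < n; i++)         /* reverse: longest match first */
        lengths[i] = found[n - 1 - i];
    return n;
}
```

Called with the subject "the caterpillar catchment" and a start offset of 4 (the fifth character), this yields the lengths of "caterpillar", "cater", and "cat". The later "cat" in "catchment" is found only if the function is called again with that start offset.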
1440    
1441         There are a number of features of PCRE regular expressions that are not         There are a number of features of PCRE regular expressions that are not
1442         supported by the alternative matching algorithm. They are as follows:         supported by the alternative matching algorithm. They are as follows:
1443    
1444         1.  Because  the  algorithm  finds  all possible matches, the greedy or         1. Because the algorithm finds all  possible  matches,  the  greedy  or
1445         ungreedy nature of repetition quantifiers is not relevant.  Greedy  and         ungreedy  nature  of repetition quantifiers is not relevant. Greedy and
1446         ungreedy quantifiers are treated in exactly the same way. However, pos-         ungreedy quantifiers are treated in exactly the same way. However, pos-
1447         sessive quantifiers can make a difference when what follows could  also         sessive  quantifiers can make a difference when what follows could also
1448         match what is quantified, for example in a pattern like this:         match what is quantified, for example in a pattern like this:
1449    
1450           ^a++\w!           ^a++\w!
1451    
1452         This  pattern matches "aaab!" but not "aaa!", which would be matched by         This pattern matches "aaab!" but not "aaa!", which would be matched  by
1453         a non-possessive quantifier. Similarly, if an atomic group is  present,         a  non-possessive quantifier. Similarly, if an atomic group is present,
1454         it  is matched as if it were a standalone pattern at the current point,         it is matched as if it were a standalone pattern at the current  point,
1455         and the longest match is then "locked in" for the rest of  the  overall         and  the  longest match is then "locked in" for the rest of the overall
1456         pattern.         pattern.
1457    
1458         2. When dealing with multiple paths through the tree simultaneously, it         2. When dealing with multiple paths through the tree simultaneously, it
1459         is not straightforward to keep track of  captured  substrings  for  the         is  not  straightforward  to  keep track of captured substrings for the
1460         different  matching  possibilities,  and  PCRE's implementation of this         different matching possibilities, and  PCRE's  implementation  of  this
1461         algorithm does not attempt to do this. This means that no captured sub-         algorithm does not attempt to do this. This means that no captured sub-
1462         strings are available.         strings are available.
1463    
1464         3.  Because no substrings are captured, back references within the pat-         3. Because no substrings are captured, back references within the  pat-
1465         tern are not supported, and cause errors if encountered.         tern are not supported, and cause errors if encountered.
1466    
1467         4. For the same reason, conditional expressions that use  a  backrefer-         4.  For  the same reason, conditional expressions that use a backrefer-
1468         ence  as  the  condition or test for a specific group recursion are not         ence as the condition or test for a specific group  recursion  are  not
1469         supported.         supported.
1470    
1471         5. Because many paths through the tree may be  active,  the  \K  escape         5.  Because  many  paths  through the tree may be active, the \K escape
1472         sequence, which resets the start of the match when encountered (but may         sequence, which resets the start of the match when encountered (but may
1473         be on some paths and not on others), is not  supported.  It  causes  an         be  on  some  paths  and not on others), is not supported. It causes an
1474         error if encountered.         error if encountered.
1475    
1476         6.  Callouts  are  supported, but the value of the capture_top field is         6. Callouts are supported, but the value of the  capture_top  field  is
1477         always 1, and the value of the capture_last field is always -1.         always 1, and the value of the capture_last field is always -1.
1478    
1479         7.  The \C escape sequence, which (in the standard algorithm) matches a         7.  The  \C  escape  sequence, which (in the standard algorithm) always
1480         single  byte, even in UTF-8 mode, is not supported because the alterna-         matches a single data unit, even in UTF-8, UTF-16 or UTF-32  modes,  is
1481         tive algorithm moves through the subject  string  one  character  at  a         not  supported  in these modes, because the alternative algorithm moves
1482         time, for all active paths through the tree.         through the subject string one character (not data unit) at a time, for
1483           all active paths through the tree.
1484    
1485           8.  Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
1486           are not supported. (*FAIL) is supported, and  behaves  like  a  failing
1487           negative assertion.
1488    
1489    
1490  ADVANTAGES OF THE ALTERNATIVE ALGORITHM  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
# Line 643  ADVANTAGES OF THE ALTERNATIVE ALGORITHM Line 1497  ADVANTAGES OF THE ALTERNATIVE ALGORITHM
1497         more than one match using the standard algorithm, you have to do kludgy         more than one match using the standard algorithm, you have to do kludgy
1498         things with callouts.         things with callouts.
1499    
1500         2.  There is much better support for partial matching. The restrictions         2.  Because  the  alternative  algorithm  scans the subject string just
1501         on the content of the pattern that apply when using the standard  algo-         once, and never needs to backtrack (except for lookbehinds), it is pos-
1502         rithm  for  partial matching do not apply to the alternative algorithm.         sible  to  pass  very  long subject strings to the matching function in
1503         For non-anchored patterns, the starting position of a partial match  is         several pieces, checking for partial matching each time. Although it is
1504         available.         possible  to  do multi-segment matching using the standard algorithm by
1505           retaining partially matched substrings, it  is  more  complicated.  The
1506         3.  Because  the  alternative  algorithm  scans the subject string just         pcrepartial  documentation  gives  details of partial matching and dis-
1507         once, and never needs to backtrack, it is possible to  pass  very  long         cusses multi-segment matching.
        subject  strings  to  the matching function in several pieces, checking  
        for partial matching each time.  
1508    
1509    
1510  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM  DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
# Line 678  AUTHOR Line 1530  AUTHOR
1530    
1531  REVISION  REVISION
1532    
1533         Last updated: 29 May 2007         Last updated: 08 January 2012
1534         Copyright (c) 1997-2007 University of Cambridge.         Copyright (c) 1997-2012 University of Cambridge.
1535  ------------------------------------------------------------------------------  ------------------------------------------------------------------------------
1536    
1537    
1538    PCREAPI(3)                 Library Functions Manual                 PCREAPI(3)
1539    
1540    
 PCREAPI(3)                                                          PCREAPI(3)  
   
1541    
1542  NAME  NAME
1543         PCRE - Perl-compatible regular expressions         PCRE - Perl-compatible regular expressions
1544    
1545           #include <pcre.h>
1546    
 PCRE NATIVE API  
1547    
1548         #include <pcre.h>  PCRE NATIVE API BASIC FUNCTIONS
1549    
1550         pcre *pcre_compile(const char *pattern, int options,         pcre *pcre_compile(const char *pattern, int options,
1551              const char **errptr, int *erroffset,              const char **errptr, int *erroffset,
# Line 706  PCRE NATIVE API Line 1559  PCRE NATIVE API
1559         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
1560              const char **errptr);              const char **errptr);
1561    
1562           void pcre_free_study(pcre_extra *extra);
1563    
1564         int pcre_exec(const pcre *code, const pcre_extra *extra,         int pcre_exec(const pcre *code, const pcre_extra *extra,
1565              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
1566              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
# Line 715  PCRE NATIVE API Line 1570  PCRE NATIVE API
1570              int options, int *ovector, int ovecsize,              int options, int *ovector, int ovecsize,
1571              int *workspace, int wscount);              int *workspace, int wscount);
1572    
1573    
1574    PCRE NATIVE API STRING EXTRACTION FUNCTIONS
1575    
1576         int pcre_copy_named_substring(const pcre *code,         int pcre_copy_named_substring(const pcre *code,
1577              const char *subject, int *ovector,              const char *subject, int *ovector,
1578              int stringcount, const char *stringname,              int stringcount, const char *stringname,
# Line 746  PCRE NATIVE API Line 1604  PCRE NATIVE API
1604    
1605         void pcre_free_substring_list(const char **stringptr);         void pcre_free_substring_list(const char **stringptr);
1606    
1607    
1608    PCRE NATIVE API AUXILIARY FUNCTIONS
1609    
1610           int pcre_jit_exec(const pcre *code, const pcre_extra *extra,
1611                const char *subject, int length, int startoffset,
1612                int options, int *ovector, int ovecsize,
1613                pcre_jit_stack *jstack);
1614    
1615           pcre_jit_stack *pcre_jit_stack_alloc(int startsize, int maxsize);
1616    
1617           void pcre_jit_stack_free(pcre_jit_stack *stack);
1618    
1619           void pcre_assign_jit_stack(pcre_extra *extra,
1620                pcre_jit_callback callback, void *data);
1621    
1622         const unsigned char *pcre_maketables(void);         const unsigned char *pcre_maketables(void);
1623    
1624         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
1625              int what, void *where);              int what, void *where);
1626    
        int pcre_info(const pcre *code, int *optptr, int *firstcharptr);  
   
1627         int pcre_refcount(pcre *code, int adjust);         int pcre_refcount(pcre *code, int adjust);
1628    
1629         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
1630    
1631         char *pcre_version(void);         const char *pcre_version(void);
1632    
1633           int pcre_pattern_to_host_byte_order(pcre *code,
1634                pcre_extra *extra, const unsigned char *tables);
1635    
1636    
1637    PCRE NATIVE API INDIRECTED FUNCTIONS
1638    
1639         void *(*pcre_malloc)(size_t);         void *(*pcre_malloc)(size_t);
1640    
# Line 770  PCRE NATIVE API Line 1647  PCRE NATIVE API
1647         int (*pcre_callout)(pcre_callout_block *);         int (*pcre_callout)(pcre_callout_block *);
1648    
1649    
1650    PCRE 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
1651    
1652           As  well  as  support  for  8-bit character strings, PCRE also supports
1653           16-bit strings (from release 8.30) and  32-bit  strings  (from  release
1654           8.32),  by means of two additional libraries. They can be built as well
1655           as, or instead of, the 8-bit library. To avoid too  much  complication,
1656           this  document describes the 8-bit versions of the functions, with only
1657           occasional references to the 16-bit and 32-bit libraries.
1658    
1659           The 16-bit and 32-bit functions operate in the same way as their  8-bit
1660           counterparts;  they  just  use different data types for their arguments
1661           and results, and their names start with pcre16_ or pcre32_  instead  of
1662           pcre_.  For  every  option  that  has  UTF8  in  its name (for example,
1663           PCRE_UTF8), there are corresponding 16-bit and 32-bit names  with  UTF8
1664           replaced by UTF16 or UTF32, respectively. This facility is in fact just
1665           cosmetic; the 16-bit and 32-bit option names define the same  bit  val-
1666           ues.
1667    
1668           References to bytes and UTF-8 in this document should be read as refer-
1669           ences to 16-bit data  quantities  and  UTF-16  when  using  the  16-bit
1670           library,  or  32-bit  data  quantities and UTF-32 when using the 32-bit
1671           library, unless specified otherwise. More details of the specific  dif-
1672           ferences  for  the  16-bit and 32-bit libraries are given in the pcre16
1673           and pcre32 pages.
1674    
1675    
1676  PCRE API OVERVIEW  PCRE API OVERVIEW
1677    
1678         PCRE has its own native API, which is described in this document. There         PCRE has its own native API, which is described in this document. There
1679         are also some wrapper functions that correspond to  the  POSIX  regular         are  also some wrapper functions (for the 8-bit library only) that cor-
1680         expression  API.  These  are  described in the pcreposix documentation.         respond to the POSIX regular expression  API,  but  they  do  not  give
1681         Both of these APIs define a set of C function calls. A C++  wrapper  is         access  to  all  the functionality. They are described in the pcreposix
1682         distributed with PCRE. It is documented in the pcrecpp page.         documentation. Both of these APIs define a set of C function  calls.  A
1683           C++ wrapper (again for the 8-bit library only) is also distributed with
1684         The  native  API  C  function prototypes are defined in the header file         PCRE. It is documented in the pcrecpp page.
1685         pcre.h, and on Unix systems the library itself is called  libpcre.   It  
1686         can normally be accessed by adding -lpcre to the command for linking an         The native API C function prototypes are defined  in  the  header  file
1687         application  that  uses  PCRE.  The  header  file  defines  the  macros         pcre.h,  and  on Unix-like systems the (8-bit) library itself is called
1688         PCRE_MAJOR  and  PCRE_MINOR to contain the major and minor release num-         libpcre. It can normally be accessed by adding -lpcre  to  the  command
1689         bers for the library.  Applications can use these  to  include  support         for  linking an application that uses PCRE. The header file defines the
1690           macros PCRE_MAJOR and PCRE_MINOR to contain the major and minor release
1691           numbers  for the library. Applications can use these to include support
1692         for different releases of PCRE.         for different releases of PCRE.
1693    
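The use of PCRE_MAJOR and PCRE_MINOR described above can be sketched as follows. This is only an illustration: the macro PCRE_AT_LEAST and the two helper functions are hypothetical names, not part of PCRE, and the fallback definitions exist only so that the fragment compiles without pcre.h. The release numbers are those given in this document for the 16-bit library (8.30) and for the 32-bit library and the direct JIT interface (8.32).

```c
/* In a real program these come from <pcre.h>; the stand-in values below
   are an assumption used only to make this sketch compile on its own. */
#ifndef PCRE_MAJOR
#define PCRE_MAJOR 8
#define PCRE_MINOR 32
#endif

/* Hypothetical convenience macro: true if the header describes
   release maj.min or a later one. Usable in #if as well as in code. */
#define PCRE_AT_LEAST(maj, min) \
    (PCRE_MAJOR > (maj) || (PCRE_MAJOR == (maj) && PCRE_MINOR >= (min)))

/* Hypothetical helper: could the 16-bit library exist in this release?
   (It may still not have been built; this checks the release only.) */
int release_may_have_pcre16(void)
{
    return PCRE_AT_LEAST(8, 30);
}

/* Hypothetical helper: does this release define pcre_jit_exec()? */
int release_has_jit_exec(void)
{
#if PCRE_AT_LEAST(8, 32)
    return 1;
#else
    return 0;
#endif
}
```

A release check of this kind says only what the header declares. Whether an optional feature was actually built into the library is a separate question, answered at run time by pcre_config().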
1694         The   functions   pcre_compile(),  pcre_compile2(),  pcre_study(),  and         In a Windows environment, if you want to statically link an application
1695         pcre_exec() are used for compiling and matching regular expressions  in         program  against  a  non-dll  pcre.a  file, you must define PCRE_STATIC
1696         a  Perl-compatible  manner. A sample program that demonstrates the sim-         before including pcre.h or pcrecpp.h, because otherwise  the  pcre_mal-
1697         plest way of using them is provided in the file  called  pcredemo.c  in         loc()   and   pcre_free()   exported   functions   will   be   declared
1698         the  source distribution. The pcresample documentation describes how to         __declspec(dllimport), with unwanted results.
1699         run it.  
1700           The  functions  pcre_compile(),  pcre_compile2(),   pcre_study(),   and
1701           pcre_exec()  are used for compiling and matching regular expressions in
1702           a Perl-compatible manner. A sample program that demonstrates  the  sim-
1703           plest  way  of  using them is provided in the file called pcredemo.c in
1704           the PCRE source distribution. A listing of this program is given in the
1705           pcredemo  documentation, and the pcresample documentation describes how
1706           to compile and run it.
1707    
1708           Just-in-time compiler support is an optional feature of PCRE  that  can
1709           be built in appropriate hardware environments. It greatly speeds up the
1710           matching performance of  many  patterns.  Simple  programs  can  easily
1711           request  that  it  be  used  if available, by setting an option that is
1712           ignored when it is not relevant. More complicated programs  might  need
1713           to     make    use    of    the    functions    pcre_jit_stack_alloc(),
1714           pcre_jit_stack_free(), and pcre_assign_jit_stack() in order to  control
1715           the JIT code's memory usage.
1716    
1717           From  release  8.32 there is also a direct interface for JIT execution,
1718           which gives improved performance. The JIT-specific functions  are  dis-
1719           cussed in the pcrejit documentation.
1720    
1721         A second matching function, pcre_dfa_exec(), which is not Perl-compati-         A second matching function, pcre_dfa_exec(), which is not Perl-compati-
1722         ble,  is  also provided. This uses a different algorithm for the match-         ble, is also provided. This uses a different algorithm for  the  match-
1723         ing. The alternative algorithm finds all possible matches (at  a  given         ing.  The  alternative algorithm finds all possible matches (at a given
1724         point  in  the subject), and scans the subject just once. However, this         point in the subject), and scans the subject just  once  (unless  there
1725         algorithm does not return captured substrings. A description of the two         are  lookbehind  assertions).  However,  this algorithm does not return
1726         matching  algorithms and their advantages and disadvantages is given in         captured substrings. A description of the two matching  algorithms  and
1727         the pcrematching documentation.         their  advantages  and disadvantages is given in the pcrematching docu-
1728           mentation.
1729    
1730         In addition to the main compiling and  matching  functions,  there  are         In addition to the main compiling and  matching  functions,  there  are
1731         convenience functions for extracting captured substrings from a subject         convenience functions for extracting captured substrings from a subject
# Line 824  PCRE API OVERVIEW Line 1750  PCRE API OVERVIEW
1750         built are used.         built are used.
1751    
1752         The function pcre_fullinfo() is used to find out  information  about  a         The function pcre_fullinfo() is used to find out  information  about  a
1753         compiled  pattern; pcre_info() is an obsolete version that returns only         compiled  pattern.  The  function pcre_version() returns a pointer to a
1754         some of the available information, but is retained for  backwards  com-         string containing the version of PCRE and its date of release.
        patibility.   The function pcre_version() returns a pointer to a string  
        containing the version of PCRE and its date of release.  
1755    
1756         The function pcre_refcount() maintains a  reference  count  in  a  data         The function pcre_refcount() maintains a  reference  count  in  a  data
1757         block  containing  a compiled pattern. This is provided for the benefit         block  containing  a compiled pattern. This is provided for the benefit
# Line 866  NEWLINES Line 1790  NEWLINES
1790         feed) character, the two-character sequence CRLF, any of the three pre-         feed) character, the two-character sequence CRLF, any of the three pre-
1791         ceding, or any Unicode newline sequence. The Unicode newline  sequences         ceding, or any Unicode newline sequence. The Unicode newline  sequences
1792         are  the  three just mentioned, plus the single characters VT (vertical         are  the  three just mentioned, plus the single characters VT (vertical
1793         tab, U+000B), FF (formfeed, U+000C), NEL (next line, U+0085), LS  (line         tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
1794         separator, U+2028), and PS (paragraph separator, U+2029).         separator, U+2028), and PS (paragraph separator, U+2029).
1795    
1796         Each  of  the first three conventions is used by at least one operating         Each  of  the first three conventions is used by at least one operating
# Line 875  NEWLINES Line 1799  NEWLINES
1799         dard. When PCRE is run, the default can be overridden,  either  when  a         dard. When PCRE is run, the default can be overridden,  either  when  a
1800         pattern is compiled, or when it is matched.         pattern is compiled, or when it is matched.
1801    
1802           At compile time, the newline convention can be specified by the options
1803           argument of pcre_compile(), or it can be specified by special  text  at
1804           the start of the pattern itself; this overrides any other settings. See
1805           the pcrepattern page for details of the special character sequences.
1806    
1807         In the PCRE documentation the word "newline" is used to mean "the char-         In the PCRE documentation the word "newline" is used to mean "the char-
1808         acter or pair of characters that indicate a line break". The choice  of         acter  or pair of characters that indicate a line break". The choice of
1809         newline  convention  affects  the  handling of the dot, circumflex, and         newline convention affects the handling of  the  dot,  circumflex,  and
1810         dollar metacharacters, the handling of #-comments in /x mode, and, when         dollar metacharacters, the handling of #-comments in /x mode, and, when
1811         CRLF  is a recognized line ending sequence, the match position advance-         CRLF is a recognized line ending sequence, the match position  advance-
1812         ment for a non-anchored pattern. The choice of newline convention  does         ment for a non-anchored pattern. There is more detail about this in the
1813         not affect the interpretation of the \n or \r escape sequences.         section on pcre_exec() options below.
1814    
1815           The choice of newline convention does not affect the interpretation  of
1816           the  \n  or  \r  escape  sequences, nor does it affect what \R matches,
1817           which is controlled in a similar way, but by separate options.
1818    
1819    
1820  MULTITHREADING  MULTITHREADING
1821    
1822         The  PCRE  functions  can be used in multi-threading applications, with         The PCRE functions can be used in  multi-threading  applications,  with
1823         the  proviso  that  the  memory  management  functions  pointed  to  by         the  proviso  that  the  memory  management  functions  pointed  to  by
1824         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the         pcre_malloc, pcre_free, pcre_stack_malloc, and pcre_stack_free, and the
1825         callout function pointed to by pcre_callout, are shared by all threads.         callout function pointed to by pcre_callout, are shared by all threads.
# Line 895  MULTITHREADING Line 1828  MULTITHREADING
1828         ing, so the same compiled pattern can safely be used by several threads         ing, so the same compiled pattern can safely be used by several threads
1829         at once.         at once.
1830    
1831           If  the just-in-time optimization feature is being used, it needs sepa-
1832           rate memory stack areas for each thread. See the pcrejit  documentation
1833           for more details.
1834    
1835    
1836  SAVING PRECOMPILED PATTERNS FOR LATER USE  SAVING PRECOMPILED PATTERNS FOR LATER USE
1837    
1838         The compiled form of a regular expression can be saved and re-used at a         The compiled form of a regular expression can be saved and re-used at a
1839         later time, possibly by a different program, and even on a  host  other         later time, possibly by a different program, and even on a  host  other
1840         than  the  one  on  which  it  was  compiled.  Details are given in the         than  the  one  on  which  it  was  compiled.  Details are given in the
1841         pcreprecompile documentation. However, compiling a  regular  expression         pcreprecompile documentation,  which  includes  a  description  of  the
1842         with  one version of PCRE for use with a different version is not guar-         pcre_pattern_to_host_byte_order()  function. However, compiling a regu-
1843         anteed to work and may cause crashes.         lar expression with one version of PCRE for use with a  different  ver-
1844           sion is not guaranteed to work and may cause crashes.
1845    
1846    
1847  CHECKING BUILD-TIME OPTIONS  CHECKING BUILD-TIME OPTIONS
1848    
1849         int pcre_config(int what, void *where);         int pcre_config(int what, void *where);
1850    
1851         The function pcre_config() makes it possible for a PCRE client to  dis-         The  function pcre_config() makes it possible for a PCRE client to dis-
1852         cover which optional features have been compiled into the PCRE library.         cover which optional features have been compiled into the PCRE library.
1853         The pcrebuild documentation has more details about these optional  fea-         The  pcrebuild documentation has more details about these optional fea-
1854         tures.         tures.
1855    
1856         The  first  argument  for pcre_config() is an integer, specifying which         The first argument for pcre_config() is an  integer,  specifying  which
1857         information is required; the second argument is a pointer to a variable         information is required; the second argument is a pointer to a variable
1858         into  which  the  information  is  placed. The following information is         into which the information is placed. The returned  value  is  zero  on
1859           success,  or  the negative error code PCRE_ERROR_BADOPTION if the value
1860           in the first argument is not recognized. The following  information  is
1861         available:         available:
1862    
1863           PCRE_CONFIG_UTF8           PCRE_CONFIG_UTF8
1864    
1865         The output is an integer that is set to one if UTF-8 support is  avail-         The  output is an integer that is set to one if UTF-8 support is avail-
1866         able; otherwise it is set to zero.         able; otherwise it is set to zero. This value should normally be  given
1867           to the 8-bit version of this function, pcre_config(). If it is given to
1868           the  16-bit  or  32-bit  version  of  this  function,  the  result   is
1869           PCRE_ERROR_BADOPTION.
1870    
1871             PCRE_CONFIG_UTF16
1872    
1873           The output is an integer that is set to one if UTF-16 support is avail-
1874           able; otherwise it is set to zero. This value should normally be  given
1875           to the 16-bit version of this function, pcre16_config(). If it is given
1876           to the 8-bit  or  32-bit  version  of  this  function,  the  result  is
1877           PCRE_ERROR_BADOPTION.
1878    
1879             PCRE_CONFIG_UTF32
1880    
1881           The output is an integer that is set to one if UTF-32 support is avail-
1882           able; otherwise it is set to zero. This value should normally be  given
1883           to the 32-bit version of this function, pcre32_config(). If it is given
1884           to the 8-bit  or  16-bit  version  of  this  function,  the  result  is
1885           PCRE_ERROR_BADOPTION.
1886    
1887           PCRE_CONFIG_UNICODE_PROPERTIES           PCRE_CONFIG_UNICODE_PROPERTIES
1888    
1889         The  output  is  an  integer  that is set to one if support for Unicode         The  output  is  an  integer  that is set to one if support for Unicode
1890         character properties is available; otherwise it is set to zero.         character properties is available; otherwise it is set to zero.
1891    
1892             PCRE_CONFIG_JIT
1893    
1894           The output is an integer that is set to one if support for just-in-time
1895           compiling is available; otherwise it is set to zero.
1896    
1897             PCRE_CONFIG_JITTARGET
1898    
1899           The  output is a pointer to a zero-terminated "const char *" string. If
1900           JIT support is available, the string contains the name of the architec-
1901           ture  for  which the JIT compiler is configured, for example "x86 32bit
1902           (little endian + unaligned)". If JIT  support  is  not  available,  the
1903           result is NULL.
1904    
1905           PCRE_CONFIG_NEWLINE           PCRE_CONFIG_NEWLINE
1906    
1907         The output is an integer whose value specifies  the  default  character         The  output  is  an integer whose value specifies the default character
1908         sequence  that is recognized as meaning "newline". The four values that         sequence that is recognized as meaning "newline". The values  that  are
1909         are supported are: 10 for LF, 13 for CR, 3338 for CRLF, -2 for ANYCRLF,         supported in ASCII/Unicode environments are: 10 for LF, 13 for CR, 3338
1910         and  -1  for  ANY. The default should normally be the standard sequence         for CRLF, -2 for ANYCRLF, and -1 for ANY. In EBCDIC  environments,  CR,
1911         for your operating system.         ANYCRLF,  and  ANY  yield the same values. However, the value for LF is
1912           normally 21, though some EBCDIC environments use 37. The  corresponding
1913           values  for  CRLF are 3349 and 3365. The default should normally corre-
1914           spond to the standard sequence for your operating system.
1915    
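The positive values listed for PCRE_CONFIG_NEWLINE are consistent with a simple packing: a single-character newline is its character code, and a two-character newline is the first code times 256 plus the second (13*256 + 10 = 3338 for CRLF, and 13*256 + 21 = 3349 or 13*256 + 37 = 3365 in the EBCDIC cases). The text above does not state this rule, so the following is a sketch under that assumption, and newline_bytes() is a hypothetical helper rather than a PCRE function.

```c
/* Hypothetical helper, not part of PCRE: unpack a PCRE_CONFIG_NEWLINE
   value into the byte(s) of the newline sequence.
   Returns the number of bytes written to out[] (1 or 2), or 0 for the
   negative values (-1 for ANY, -2 for ANYCRLF), which do not name a
   single fixed sequence. */
int newline_bytes(int value, unsigned char out[2])
{
    if (value <= 0)
        return 0;                       /* ANY, ANYCRLF, or not a sequence */
    if (value < 256) {
        out[0] = (unsigned char)value;  /* single character, e.g. 10 or 13 */
        return 1;
    }
    out[0] = (unsigned char)(value >> 8);    /* first character (CR)  */
    out[1] = (unsigned char)(value & 0xff);  /* second character (LF) */
    return 2;
}
```

In a program, the value passed to such a helper would be the integer that pcre_config() stores when called with PCRE_CONFIG_NEWLINE.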
1916             PCRE_CONFIG_BSR
1917    
1918           The output is an integer whose value indicates what character sequences
1919           the  \R  escape sequence matches by default. A value of 0 means that \R
1920           matches any Unicode line ending sequence; a value of 1  means  that  \R
1921           matches only CR, LF, or CRLF. The default can be overridden when a pat-
1922           tern is compiled or matched.
1923    
1924           PCRE_CONFIG_LINK_SIZE           PCRE_CONFIG_LINK_SIZE
1925    
1926         The output is an integer that contains the number  of  bytes  used  for         The output is an integer that contains the number  of  bytes  used  for
1927         internal linkage in compiled regular expressions. The value is 2, 3, or         internal  linkage  in  compiled  regular  expressions.  For  the  8-bit
1928         4. Larger values allow larger regular expressions to  be  compiled,  at         library, the value can be 2, 3, or 4. For the 16-bit library, the value
1929         the  expense  of  slower matching. The default value of 2 is sufficient         is  either  2  or  4  and  is  still  a number of bytes. For the 32-bit
1930         for all but the most massive patterns, since  it  allows  the  compiled         library, the value is either 2 or 4 and is still a number of bytes. The
1931         pattern to be up to 64K in size.         default value of 2 is sufficient for all but the most massive patterns,
1932           since it allows the compiled pattern to be up to 64K  in  size.  Larger
1933           values  allow larger regular expressions to be compiled, at the expense
1934           of slower matching.
1935    
1936           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD           PCRE_CONFIG_POSIX_MALLOC_THRESHOLD
1937    
1938         The  output  is  an integer that contains the threshold above which the         The output is an integer that contains the threshold  above  which  the
1939         POSIX interface uses malloc() for output vectors. Further  details  are         POSIX  interface  uses malloc() for output vectors. Further details are
1940         given in the pcreposix documentation.         given in the pcreposix documentation.
1941    
1942           PCRE_CONFIG_MATCH_LIMIT           PCRE_CONFIG_MATCH_LIMIT
1943    
1944         The output is an integer that gives the default limit for the number of         The output is a long integer that gives the default limit for the  num-
1945         internal matching function calls in a  pcre_exec()  execution.  Further         ber  of  internal  matching  function calls in a pcre_exec() execution.
1946         details are given with pcre_exec() below.         Further details are given with pcre_exec() below.
1947    
1948           PCRE_CONFIG_MATCH_LIMIT_RECURSION           PCRE_CONFIG_MATCH_LIMIT_RECURSION
1949    
1950         The  output is an integer that gives the default limit for the depth of         The output is a long integer that gives the default limit for the depth
1951         recursion when calling the internal matching function in a  pcre_exec()         of   recursion  when  calling  the  internal  matching  function  in  a
1952         execution. Further details are given with pcre_exec() below.         pcre_exec() execution.  Further  details  are  given  with  pcre_exec()
1953           below.
1954    
1955           PCRE_CONFIG_STACKRECURSE           PCRE_CONFIG_STACKRECURSE
1956    
# Line 990  COMPILING A PATTERN Line 1977  COMPILING A PATTERN
1977         Either of the functions pcre_compile() or pcre_compile2() can be called         Either of the functions pcre_compile() or pcre_compile2() can be called
1978         to compile a pattern into an internal form. The only difference between         to compile a pattern into an internal form. The only difference between
1979         the  two interfaces is that pcre_compile2() has an additional argument,         the  two interfaces is that pcre_compile2() has an additional argument,
1980         errorcodeptr, via which a numerical error code can be returned.         errorcodeptr, via which a numerical error  code  can  be  returned.  To
1981           avoid  too  much repetition, we refer just to pcre_compile() below, but
1982           the information applies equally to pcre_compile2().
1983    
1984         The pattern is a C string terminated by a binary zero, and is passed in         The pattern is a C string terminated by a binary zero, and is passed in
1985         the  pattern  argument.  A  pointer to a single block of memory that is         the  pattern  argument.  A  pointer to a single block of memory that is
# Line 1007  COMPILING A PATTERN Line 1996  COMPILING A PATTERN
1996    
1997         The options argument contains various bit settings that affect the com-         The options argument contains various bit settings that affect the com-
1998         pilation. It should be zero if no options are required.  The  available         pilation. It should be zero if no options are required.  The  available
1999         options  are  described  below. Some of them, in particular, those that         options  are  described  below. Some of them (in particular, those that
2000         are compatible with Perl, can also be set and  unset  from  within  the         are compatible with Perl, but some others as well) can also be set  and
2001         pattern  (see  the  detailed  description in the pcrepattern documenta-         unset  from  within  the  pattern  (see the detailed description in the
2002         tion). For these options, the contents of the options  argument  speci-         pcrepattern documentation). For those options that can be different  in
2003         fies  their initial settings at the start of compilation and execution.         different  parts  of  the pattern, the contents of the options argument
2004         The PCRE_ANCHORED and PCRE_NEWLINE_xxx options can be set at  the  time         specifies their settings at the start of compilation and execution. The
2005         of matching as well as at compile time.         PCRE_ANCHORED,  PCRE_BSR_xxx, PCRE_NEWLINE_xxx, PCRE_NO_UTF8_CHECK, and
2006           PCRE_NO_START_OPTIMIZE options can be set at the time  of  matching  as
2007           well as at compile time.
2008    
2009         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,         If errptr is NULL, pcre_compile() returns NULL immediately.  Otherwise,
2010         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and         if compilation of a pattern fails,  pcre_compile()  returns  NULL,  and
2011         sets the variable pointed to by errptr to point to a textual error mes-         sets the variable pointed to by errptr to point to a textual error mes-
2012         sage. This is a static string that is part of the library. You must not         sage. This is a static string that is part of the library. You must not
2013         try to free it. The offset from the start of the pattern to the charac-         try  to  free it. Normally, the offset from the start of the pattern to
2014         ter where the error was discovered is placed in the variable pointed to         the byte that was being processed when  the  error  was  discovered  is
2015         by  erroffset,  which must not be NULL. If it is, an immediate error is         placed  in the variable pointed to by erroffset, which must not be NULL
2016         given.         (if it is, an immediate error is given). However, for an invalid  UTF-8
2017           string, the offset is that of the first byte of the failing character.
2018    
2019           Some  errors are not detected until the whole pattern has been scanned;
2020           in these cases, the offset passed back is the length  of  the  pattern.
2021           Note  that  the offset is in bytes, not characters, even in UTF-8 mode.
2022           It may sometimes point into the middle of a UTF-8 character.
2023    
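Because the error offset is a byte count and may point into the middle of a UTF-8 character, a program that displays the offending part of a pattern may want to move the offset back to a character boundary first. The following is a minimal sketch of one way to do that. It is not part of PCRE: utf8_char_start() is a hypothetical helper, and it assumes that the pattern is a valid, zero-terminated UTF-8 string and that the offset is no greater than the pattern length.

```c
#include <stddef.h>

/* Hypothetical helper, not part of PCRE: given a byte offset into a
   UTF-8 string, return the offset of the first byte of the character
   that contains it. UTF-8 continuation bytes have the form 10xxxxxx,
   so the loop steps back while the current byte matches that form.
   An offset equal to the string length (the terminating zero) is
   returned unchanged. */
size_t utf8_char_start(const char *s, size_t offset)
{
    while (offset > 0 && (((unsigned char)s[offset]) & 0xC0) == 0x80)
        offset--;
    return offset;
}
```

The offset passed in would be the value that pcre_compile() places in the variable pointed to by erroffset. The adjusted value is suitable for display only, since PCRE itself always reports and expects byte offsets.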
2024         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-         If pcre_compile2() is used instead of pcre_compile(),  and  the  error-
2025         codeptr  argument is not NULL, a non-zero error code number is returned         codeptr  argument is not NULL, a non-zero error code number is returned
# Line 1067  COMPILING A PATTERN Line 2064  COMPILING A PATTERN
2064         all with number 255, before each pattern item. For  discussion  of  the         all with number 255, before each pattern item. For  discussion  of  the
2065         callout facility, see the pcrecallout documentation.         callout facility, see the pcrecallout documentation.
2066    
2067             PCRE_BSR_ANYCRLF
2068             PCRE_BSR_UNICODE
2069    
2070           These options (which are mutually exclusive) control what the \R escape
2071           sequence matches. The choice is either to match only CR, LF,  or  CRLF,
2072           or to match any Unicode newline sequence. The default is specified when
2073           PCRE is built. It can be overridden from within the pattern, or by set-
2074           ting an option when a compiled pattern is matched.
2075    
2076           PCRE_CASELESS           PCRE_CASELESS
2077    
2078         If  this  bit is set, letters in the pattern match both upper and lower         If  this  bit is set, letters in the pattern match both upper and lower
# Line 1091  COMPILING A PATTERN Line 2097  COMPILING A PATTERN
2097    
2098           PCRE_DOTALL           PCRE_DOTALL
2099    
2100         If this bit is set, a dot metacharater in the pattern matches all char-         If  this bit is set, a dot metacharacter in the pattern matches a char-
2101         acters, including those that indicate newline. Without it, a  dot  does         acter of any value, including one that indicates a newline. However, it
2102         not  match  when  the  current position is at a newline. This option is         only  ever  matches  one character, even if newlines are coded as CRLF.
2103         equivalent to Perl's /s option, and it can be changed within a  pattern         Without this option, a dot does not match when the current position  is
2104         by  a (?s) option setting. A negative class such as [^a] always matches         at a newline. This option is equivalent to Perl's /s option, and it can
2105         newline characters, independent of the setting of this option.         be changed within a pattern by a (?s) option setting. A negative  class
2106           such as [^a] always matches newline characters, independent of the set-
2107           ting of this option.
2108    
2109           PCRE_DUPNAMES           PCRE_DUPNAMES
2110    
# Line 1108  COMPILING A PATTERN Line 2116  COMPILING A PATTERN
2116    
2117           PCRE_EXTENDED           PCRE_EXTENDED
2118    
2119         If this bit is set, whitespace  data  characters  in  the  pattern  are         If this bit is set, white space data  characters  in  the  pattern  are
2120         totally ignored except when escaped or inside a character class. White-         totally  ignored except when escaped or inside a character class. White
2121         space does not include the VT character (code 11). In addition, charac-         space does not include the VT character (code 11). In addition, charac-
2122         ters between an unescaped # outside a character class and the next new-         ters between an unescaped # outside a character class and the next new-
2123         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x         line, inclusive, are also ignored. This  is  equivalent  to  Perl's  /x
2124         option,  and  it  can be changed within a pattern by a (?x) option set-         option,  and  it  can be changed within a pattern by a (?x) option set-
2125         ting.         ting.
2126    
2127         This option makes it possible to include  comments  inside  complicated         Which characters are interpreted  as  newlines  is  controlled  by  the
2128         patterns.   Note,  however,  that this applies only to data characters.         options  passed to pcre_compile() or by a special sequence at the start
2129         Whitespace  characters  may  never  appear  within  special   character         of the pattern, as described in the section entitled  "Newline  conven-
2130         sequences  in  a  pattern,  for  example  within the sequence (?( which         tions" in the pcrepattern documentation. Note that the end of this type
2131         introduces a conditional subpattern.         of comment is  a  literal  newline  sequence  in  the  pattern;  escape
2132           sequences that happen to represent a newline do not count.
2133    
2134           This  option  makes  it possible to include comments inside complicated
2135           patterns.  Note, however, that this applies only  to  data  characters.
2136           White  space  characters  may  never  appear  within  special character
2137           sequences in a pattern, for example within the sequence (?( that intro-
2138           duces a conditional subpattern.
2139    
2140           PCRE_EXTRA           PCRE_EXTRA
2141    
2142         This option was invented in order to turn on  additional  functionality         This  option  was invented in order to turn on additional functionality
2143         of  PCRE  that  is  incompatible with Perl, but it is currently of very         of PCRE that is incompatible with Perl, but it  is  currently  of  very
2144         little use. When set, any backslash in a pattern that is followed by  a         little  use. When set, any backslash in a pattern that is followed by a
2145         letter  that  has  no  special  meaning causes an error, thus reserving         letter that has no special meaning  causes  an  error,  thus  reserving
2146         these combinations for future expansion. By  default,  as  in  Perl,  a         these  combinations  for  future  expansion.  By default, as in Perl, a
2147         backslash  followed by a letter with no special meaning is treated as a         backslash followed by a letter with no special meaning is treated as  a
2148         literal. (Perl can, however, be persuaded to give a warning for  this.)         literal. (Perl can, however, be persuaded to give an error for this, by
2149         There  are  at  present no other features controlled by this option. It         running it with the -w option.) There are at present no other  features
2150         can also be set by a (?X) option setting within a pattern.         controlled  by this option. It can also be set by a (?X) option setting
2151           within a pattern.
2152    
2153           PCRE_FIRSTLINE           PCRE_FIRSTLINE
2154    
# Line 1140  COMPILING A PATTERN Line 2156  COMPILING A PATTERN
2156         before  or  at  the  first  newline  in  the subject string, though the         before  or  at  the  first  newline  in  the subject string, though the
2157         matched text may continue over the newline.         matched text may continue over the newline.
2158    
2159             PCRE_JAVASCRIPT_COMPAT
2160    
2161           If this option is set, PCRE's behaviour is changed in some ways so that
2162           it  is  compatible with JavaScript rather than Perl. The changes are as
2163           follows:
2164    
2165           (1) A lone closing square bracket in a pattern  causes  a  compile-time
2166           error,  because this is illegal in JavaScript (by default it is treated
2167           as a data character). Thus, the pattern AB]CD becomes illegal when this
2168           option is set.
2169    
2170           (2)  At run time, a back reference to an unset subpattern group matches
2171           an empty string (by default this causes the current  matching  alterna-
2172           tive  to  fail). A pattern such as (\1)(a) succeeds when this option is
2173           set (assuming it can find an "a" in the subject), whereas it  fails  by
2174           default, for Perl compatibility.
2175    
2176           (3) \U matches an upper case "U" character; by default \U causes a com-
2177           pile time error (Perl uses \U to upper case subsequent characters).
2178    
2179           (4) \u matches a lower case "u" character unless it is followed by four
2180           hexadecimal  digits,  in  which case the hexadecimal number defines the
2181           code point to match. By default, \u causes a compile time  error  (Perl
2182           uses it to upper case the following character).
2183    
2184           (5)  \x matches a lower case "x" character unless it is followed by two
2185           hexadecimal digits, in which case the hexadecimal  number  defines  the
2186           code  point  to  match. By default, as in Perl, a hexadecimal number is
2187           always expected after \x, but it may have zero, one, or two digits (so,
2188           for example, \xz matches a binary zero character followed by z).
2189    
2190           PCRE_MULTILINE           PCRE_MULTILINE
2191    
2192         By default, PCRE treats the subject string as consisting  of  a  single         By  default,  PCRE  treats the subject string as consisting of a single
2193         line  of characters (even if it actually contains newlines). The "start         line of characters (even if it actually contains newlines). The  "start
2194         of line" metacharacter (^) matches only at the  start  of  the  string,         of  line"  metacharacter  (^)  matches only at the start of the string,
2195         while  the  "end  of line" metacharacter ($) matches only at the end of         while the "end of line" metacharacter ($) matches only at  the  end  of
2196         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY         the string, or before a terminating newline (unless PCRE_DOLLAR_ENDONLY
2197         is set). This is the same as Perl.         is set). This is the same as Perl.
2198    
2199         When  PCRE_MULTILINE  is set, the "start of line" and "end of line"         When PCRE_MULTILINE is set, the "start of line" and  "end  of  line"
2200         constructs match immediately following or immediately  before  internal         constructs  match  immediately following or immediately before internal
2201         newlines  in  the  subject string, respectively, as well as at the very         newlines in the subject string, respectively, as well as  at  the  very
2202         start and end. This is equivalent to Perl's /m option, and  it  can  be         start  and  end.  This is equivalent to Perl's /m option, and it can be
2203         changed within a pattern by a (?m) option setting. If there are no new-         changed within a pattern by a (?m) option setting. If there are no new-
2204         lines in a subject string, or no occurrences of ^ or $  in  a  pattern,         lines  in  a  subject string, or no occurrences of ^ or $ in a pattern,
2205         setting PCRE_MULTILINE has no effect.         setting PCRE_MULTILINE has no effect.
2206    
2207           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
# Line 1163  COMPILING A PATTERN Line 2210  COMPILING A PATTERN
2210           PCRE_NEWLINE_ANYCRLF           PCRE_NEWLINE_ANYCRLF
2211           PCRE_NEWLINE_ANY           PCRE_NEWLINE_ANY
2212    
2213         These  options  override the default newline definition that was chosen         These options override the default newline definition that  was  chosen
2214         when PCRE was built. Setting the first or the second specifies  that  a         when  PCRE  was built. Setting the first or the second specifies that a
2215         newline  is  indicated  by a single character (CR or LF, respectively).         newline is indicated by a single character (CR  or  LF,  respectively).
2216         Setting PCRE_NEWLINE_CRLF specifies that a newline is indicated by  the         Setting  PCRE_NEWLINE_CRLF specifies that a newline is indicated by the
2217         two-character  CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF specifies         two-character CRLF  sequence.  Setting  PCRE_NEWLINE_ANYCRLF  specifies
2218         that any of the three preceding sequences should be recognized. Setting         that any of the three preceding sequences should be recognized. Setting
2219         PCRE_NEWLINE_ANY  specifies that any Unicode newline sequence should be         PCRE_NEWLINE_ANY specifies that any Unicode newline sequence should  be
2220         recognized. The Unicode newline sequences are the three just mentioned,         recognized.
2221         plus  the  single  characters  VT (vertical tab, U+000B), FF (formfeed,  
2222         U+000C), NEL (next line, U+0085), LS (line separator, U+2028),  and  PS         In  an ASCII/Unicode environment, the Unicode newline sequences are the
2223         (paragraph  separator,  U+2029).  The  last  two are recognized only in         three just mentioned, plus the  single  characters  VT  (vertical  tab,
2224         UTF-8 mode.         U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line sep-
2225           arator, U+2028), and PS (paragraph separator, U+2029).  For  the  8-bit
2226           library, the last two are recognized only in UTF-8 mode.
2227    
2228           When  PCRE is compiled to run in an EBCDIC (mainframe) environment, the
2229           code for CR is 0x0d, the same as ASCII. However, the character code for
2230           LF  is  normally 0x15, though in some EBCDIC environments 0x25 is used.
2231           Whichever of these is not LF is made to  correspond  to  Unicode's  NEL
2232           character.  EBCDIC  codes  are all less than 256. For more details, see
2233           the pcrebuild documentation.
2234    
2235         The newline setting in the  options  word  uses  three  bits  that  are         The newline setting in the  options  word  uses  three  bits  that  are
2236         treated as a number, giving eight possibilities. Currently only six are         treated as a number, giving eight possibilities. Currently only six are
# Line 1184  COMPILING A PATTERN Line 2240  COMPILING A PATTERN
2240         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and         PCRE_NEWLINE_CRLF,  but other combinations may yield unused numbers and
2241         cause an error.         cause an error.
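
The way the three newline bits behave as a single number can be sketched in C as follows. This is an illustration only: the DEMO_ constants are local copies whose values are assumed to mirror the PCRE_NEWLINE_xxx definitions in pcre.h, and the two helper functions are hypothetical. Real code should use the macros from pcre.h.

```c
/* Assumed values, mirroring what pcre.h is believed to define for the
   PCRE_NEWLINE_xxx options. They are not taken from this document. */
#define DEMO_NEWLINE_CR       0x00100000
#define DEMO_NEWLINE_LF       0x00200000
#define DEMO_NEWLINE_CRLF     0x00300000
#define DEMO_NEWLINE_ANY      0x00400000
#define DEMO_NEWLINE_ANYCRLF  0x00500000
#define DEMO_NEWLINE_MASK     0x00700000  /* the three newline bits */

/* Extract the newline setting from an options word as a number 0..7. */
static int newline_field(int options)
{
    return (options & DEMO_NEWLINE_MASK) >> 20;
}

/* Return 1 if the number is one of the six in use: 0 (the build-time
   default) or 1..5 (the five explicit settings above). */
static int newline_field_in_use(int options)
{
    return newline_field(options) <= 5;
}
```

Under these assumed values, DEMO_NEWLINE_CR | DEMO_NEWLINE_LF yields the same bits as DEMO_NEWLINE_CRLF, whereas DEMO_NEWLINE_LF | DEMO_NEWLINE_ANY yields the number 6, which does not correspond to any defined setting.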
2242    
2243         The only time that a line break is specially recognized when  compiling         The only time that a line break in a pattern  is  specially  recognized
2244         a  pattern  is  if  PCRE_EXTENDED  is set, and an unescaped # outside a         when  compiling is when PCRE_EXTENDED is set. CR and LF are white space
2245         character class is encountered. This indicates  a  comment  that  lasts         characters, and so are ignored in this mode. Also, an unescaped #  out-
2246         until  after the next line break sequence. In other circumstances, line         side  a  character class indicates a comment that lasts until after the
2247         break  sequences  are  treated  as  literal  data,   except   that   in         next line break sequence. In other circumstances, line break  sequences
2248         PCRE_EXTENDED mode, both CR and LF are treated as whitespace characters         in patterns are treated as literal data.
        and are therefore ignored.  
2249    
2250         The newline option that is set at compile time becomes the default that         The newline option that is set at compile time becomes the default that
2251         is  used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.         is used for pcre_exec() and pcre_dfa_exec(), but it can be overridden.
2252    
2253           PCRE_NO_AUTO_CAPTURE           PCRE_NO_AUTO_CAPTURE
2254    
# Line 1203  COMPILING A PATTERN Line 2258  COMPILING A PATTERN
2258         be  used  for  capturing  (and  they acquire numbers in the usual way).         be  used  for  capturing  (and  they acquire numbers in the usual way).
2259         There is no equivalent of this option in Perl.         There is no equivalent of this option in Perl.
2260    
2261             PCRE_NO_START_OPTIMIZE
2262    
2263           This is an option that acts at matching time; that is, it is really  an
2264           option  for  pcre_exec()  or  pcre_dfa_exec().  If it is set at compile
2265           time, it is remembered with the compiled pattern and assumed at  match-
2266           ing  time.  For  details  see  the discussion of PCRE_NO_START_OPTIMIZE
2267           below.
2268    
2269             PCRE_UCP
2270    
2271           This option changes the way PCRE processes \B, \b, \D, \d, \S, \s,  \W,
2272           \w,  and  some  of  the POSIX character classes. By default, only ASCII
2273           characters are recognized, but if PCRE_UCP is set,  Unicode  properties
2274           are  used instead to classify characters. More details are given in the
2275           section on generic character types in the pcrepattern page. If you  set
2276           PCRE_UCP,  matching  one of the items it affects takes much longer. The
2277           option is available only if PCRE has been compiled with  Unicode  prop-
2278           erty support.
2279    
2280           PCRE_UNGREEDY           PCRE_UNGREEDY
2281    
2282         This option inverts the "greediness" of the quantifiers  so  that  they         This  option  inverts  the "greediness" of the quantifiers so that they
2283         are  not greedy by default, but become greedy if followed by "?". It is         are not greedy by default, but become greedy if followed by "?". It  is
2284         not compatible with Perl. It can also be set by a (?U)  option  setting         not  compatible  with Perl. It can also be set by a (?U) option setting
2285         within the pattern.         within the pattern.
2286    
2287           PCRE_UTF8           PCRE_UTF8
2288    
2289         This  option  causes PCRE to regard both the pattern and the subject as         This option causes PCRE to regard both the pattern and the  subject  as
2290         strings of UTF-8 characters instead of single-byte  character  strings.         strings of UTF-8 characters instead of single-byte strings. However, it
2291         However,  it is available only when PCRE is built to include UTF-8 sup-         is available only when PCRE is built to include UTF  support.  If  not,
2292         port. If not, the use of this option provokes an error. Details of  how         the  use  of  this option provokes an error. Details of how this option
2293         this  option  changes the behaviour of PCRE are given in the section on         changes the behaviour of PCRE are given in the pcreunicode page.
        UTF-8 support in the main pcre page.  
2294    
2295           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
2296    
2297         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is         When PCRE_UTF8 is set, the validity of the pattern as a UTF-8 string is
2298         automatically  checked. If an invalid UTF-8 sequence of bytes is found,         automatically  checked.  There  is  a  discussion about the validity of
2299         pcre_compile() returns an error. If you already know that your  pattern         UTF-8 strings in the pcreunicode page. If an invalid UTF-8 sequence  is
2300         is  valid, and you want to skip this check for performance reasons, you         found,  pcre_compile()  returns an error. If you already know that your
2301         can set the PCRE_NO_UTF8_CHECK option. When it is set,  the  effect  of         pattern is valid, and you want to skip this check for performance  rea-
2302         passing an invalid UTF-8 string as a pattern is undefined. It may cause         sons,  you  can set the PCRE_NO_UTF8_CHECK option.  When it is set, the
2303         your program to crash.  Note that this option can  also  be  passed  to         effect of passing an invalid UTF-8 string as a pattern is undefined. It
2304         pcre_exec()  and pcre_dfa_exec(), to suppress the UTF-8 validity check-         may  cause  your  program  to  crash. Note that this option can also be
2305         ing of subject strings.         passed to pcre_exec() and pcre_dfa_exec(),  to  suppress  the  validity
2306           checking  of  subject strings only. If the same string is being matched
2307           many times, the option can be safely set for the second and  subsequent
2308           matchings to improve performance.
2309    
2310    
2311  COMPILATION ERROR CODES  COMPILATION ERROR CODES
2312    
2313         The following table lists the error  codes  that  may  be  returned  by         The  following  table  lists  the  error  codes that may be returned by
2314         pcre_compile2(),  along with the error messages that may be returned by         pcre_compile2(), along with the error messages that may be returned  by
2315         both compiling functions. As PCRE has developed, some error codes  have         both  compiling  functions.  Note  that error messages are always 8-bit
2316         fallen out of use. To avoid confusion, they have not been re-used.         ASCII strings, even in 16-bit or 32-bit mode. As  PCRE  has  developed,
2317           some  error codes have fallen out of use. To avoid confusion, they have
2318           not been re-used.
2319    
2320            0  no error            0  no error
2321            1  \ at end of pattern            1  \ at end of pattern
# Line 1251  COMPILATION ERROR CODES Line 2329  COMPILATION ERROR CODES
2329            9  nothing to repeat            9  nothing to repeat
2330           10  [this code is not in use]           10  [this code is not in use]
2331           11  internal error: unexpected repeat           11  internal error: unexpected repeat
2332           12  unrecognized character after (?           12  unrecognized character after (? or (?-
2333           13  POSIX named classes are supported only within a class           13  POSIX named classes are supported only within a class
2334           14  missing )           14  missing )
2335           15  reference to non-existent subpattern           15  reference to non-existent subpattern
# Line 1259  COMPILATION ERROR CODES Line 2337  COMPILATION ERROR CODES
2337           17  unknown option bit(s) set           17  unknown option bit(s) set
2338           18  missing ) after comment           18  missing ) after comment
2339           19  [this code is not in use]           19  [this code is not in use]
2340           20  regular expression too large           20  regular expression is too large
2341           21  failed to get memory           21  failed to get memory
2342           22  unmatched parentheses           22  unmatched parentheses
2343           23  internal error: code overflow           23  internal error: code overflow
# Line 1271  COMPILATION ERROR CODES Line 2349  COMPILATION ERROR CODES
2349           29  (?R or (?[+-]digits must be followed by )           29  (?R or (?[+-]digits must be followed by )
2350           30  unknown POSIX class name           30  unknown POSIX class name
2351           31  POSIX collating elements are not supported           31  POSIX collating elements are not supported
2352           32  this version of PCRE is not compiled with PCRE_UTF8 support           32  this version of PCRE is compiled without UTF support
2353           33  [this code is not in use]           33  [this code is not in use]
2354           34  character value in \x{...} sequence is too large           34  character value in \x{...} sequence is too large
2355           35  invalid condition (?(0)           35  invalid condition (?(0)
2356           36  \C not allowed in lookbehind assertion           36  \C not allowed in lookbehind assertion
2357           37  PCRE does not support \L, \l, \N, \U, or \u           37  PCRE does not support \L, \l, \N{name}, \U, or \u
2358           38  number after (?C is > 255           38  number after (?C is > 255
2359           39  closing ) for (?C expected           39  closing ) for (?C expected
2360           40  recursive call could loop indefinitely           40  recursive call could loop indefinitely
2361           41  unrecognized character after (?P           41  unrecognized character after (?P
2362           42  syntax error in subpattern name (missing terminator)           42  syntax error in subpattern name (missing terminator)
2363           43  two named subpatterns have the same name           43  two named subpatterns have the same name
2364           44  invalid UTF-8 string           44  invalid UTF-8 string (specifically UTF-8)
2365           45  support for \P, \p, and \X has not been compiled           45  support for \P, \p, and \X has not been compiled
2366           46  malformed \P or \p sequence           46  malformed \P or \p sequence
2367           47  unknown property name after \P or \p           47  unknown property name after \P or \p
2368           48  subpattern name is too long (maximum 32 characters)           48  subpattern name is too long (maximum 32 characters)
2369           49  too many named subpatterns (maximum 10,000)           49  too many named subpatterns (maximum 10000)
2370           50  [this code is not in use]           50  [this code is not in use]
2371           51  octal value is greater than \377 (not in UTF-8 mode)           51  octal value is greater than \377 in 8-bit non-UTF-8 mode
2372           52  internal error: overran compiling workspace           52  internal error: overran compiling workspace
2373           53   internal  error:  previously-checked  referenced  subpattern not           53  internal error: previously-checked referenced subpattern
2374         found                 not found
2375           54  DEFINE group contains more than one branch           54  DEFINE group contains more than one branch
2376           55  repeating a DEFINE group is not allowed           55  repeating a DEFINE group is not allowed
2377           56  inconsistent NEWLINE options"           56  inconsistent NEWLINE options
2378           57  \g is not followed by a braced name or an optionally braced           57  \g is not followed by a braced, angle-bracketed, or quoted
2379                 non-zero number                 name/number or by a plain number
2380           58  (?+ or (?- or (?(+ or (?(- must be followed by a non-zero number           58  a numbered reference must not be zero
2381             59  an argument is not allowed for (*ACCEPT), (*FAIL), or (*COMMIT)
2382             60  (*VERB) not recognized
2383             61  number is too big
2384             62  subpattern name expected
2385             63  digit expected after (?+
2386             64  ] is an invalid data character in JavaScript compatibility mode
2387             65  different names for subpatterns of the same number are
2388                   not allowed
2389             66  (*MARK) must have an argument
2390             67  this version of PCRE is not compiled with Unicode property
2391                   support
2392             68  \c must be followed by an ASCII character
2393             69  \k is not followed by a braced, angle-bracketed, or quoted name
2394             70  internal error: unknown opcode in find_fixedlength()
2395             71  \N is not supported in a class
2396             72  too many forward references
2397             73  disallowed Unicode code point (>= 0xd800 && <= 0xdfff)
2398             74  invalid UTF-16 string (specifically UTF-16)
2399             75  name is too long in (*MARK), (*PRUNE), (*SKIP), or (*THEN)
2400             76  character value in \u.... sequence is too large
2401             77  invalid UTF-32 string (specifically UTF-32)
2402    
2403           The numbers 32 and 10000 in errors 48 and 49  are  defaults;  different
2404           values may be used if the limits were changed when PCRE was built.
2405    
2406    
2407  STUDYING A PATTERN  STUDYING A PATTERN
# Line 1307  STUDYING A PATTERN Line 2409  STUDYING A PATTERN
2409         pcre_extra *pcre_study(const pcre *code, int options,         pcre_extra *pcre_study(const pcre *code, int options,
2410              const char **errptr);              const char **errptr);
2411    
2412         If a compiled pattern is going to be used several times,  it  is  worth         If  a  compiled  pattern is going to be used several times, it is worth
2413         spending more time analyzing it in order to speed up the time taken for         spending more time analyzing it in order to speed up the time taken for
2414         matching. The function pcre_study() takes a pointer to a compiled  pat-         matching.  The function pcre_study() takes a pointer to a compiled pat-
2415         tern as its first argument. If studying the pattern produces additional         tern as its first argument. If studying the pattern produces additional
2416         information that will help speed up matching,  pcre_study()  returns  a         information  that  will  help speed up matching, pcre_study() returns a
2417         pointer  to a pcre_extra block, in which the study_data field points to         pointer to a pcre_extra block, in which the study_data field points  to
2418         the results of the study.         the results of the study.
2419    
2420         The  returned  value  from  pcre_study()  can  be  passed  directly  to         The  returned  value  from  pcre_study()  can  be  passed  directly  to
2421         pcre_exec().  However,  a  pcre_extra  block also contains other fields         pcre_exec() or pcre_dfa_exec(). However, a pcre_extra block  also  con-
2422         that can be set by the caller before the block  is  passed;  these  are         tains  other  fields  that can be set by the caller before the block is
2423         described below in the section on matching a pattern.         passed; these are described below in the section on matching a pattern.
2424    
2425         If  studying  the  pattern  does not produce any additional information         If studying the  pattern  does  not  produce  any  useful  information,
2426         pcre_study() returns NULL. In that circumstance, if the calling program         pcre_study()  returns  NULL  by  default.  In that circumstance, if the
2427         wants  to  pass  any of the other fields to pcre_exec(), it must set up         calling program wants to pass any of the other fields to pcre_exec() or
2428         its own pcre_extra block.         pcre_dfa_exec(),  it  must set up its own pcre_extra block. However, if
2429           pcre_study() is called  with  the  PCRE_STUDY_EXTRA_NEEDED  option,  it
2430         The second argument of pcre_study() contains option bits.  At  present,         returns a pcre_extra block even if studying did not find any additional
2431         no options are defined, and this argument should always be zero.         information. It may still return NULL, however, if an error  occurs  in
2432           pcre_study().
2433         The  third argument for pcre_study() is a pointer for an error message.  
2434         If studying succeeds (even if no data is  returned),  the  variable  it         The  second  argument  of  pcre_study() contains option bits. There are
2435         points  to  is  set  to NULL. Otherwise it is set to point to a textual         three further options in addition to PCRE_STUDY_EXTRA_NEEDED:
2436    
2437             PCRE_STUDY_JIT_COMPILE
2438             PCRE_STUDY_JIT_PARTIAL_HARD_COMPILE
2439             PCRE_STUDY_JIT_PARTIAL_SOFT_COMPILE
2440    
2441           If any of these are set, and the just-in-time  compiler  is  available,
2442           the  pattern  is  further compiled into machine code that executes much
2443           faster than the pcre_exec()  interpretive  matching  function.  If  the
2444           just-in-time  compiler is not available, these options are ignored. All
2445           undefined bits in the options argument must be zero.
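
The requirement that undefined bits be zero can be checked before the call. The following C sketch is an illustration only: the DEMO_ constants are local copies whose values are assumed to mirror the PCRE_STUDY_xxx definitions in pcre.h, and study_options_valid is a hypothetical helper, not a library function.

```c
/* Assumed values, mirroring what pcre.h is believed to define for the
   PCRE_STUDY_xxx options. They are not taken from this document. */
#define DEMO_STUDY_JIT_COMPILE               0x0001
#define DEMO_STUDY_JIT_PARTIAL_SOFT_COMPILE  0x0002
#define DEMO_STUDY_JIT_PARTIAL_HARD_COMPILE  0x0004
#define DEMO_STUDY_EXTRA_NEEDED              0x0008

/* All option bits that have a defined meaning for studying. */
#define DEMO_STUDY_ALL (DEMO_STUDY_JIT_COMPILE | \
                        DEMO_STUDY_JIT_PARTIAL_SOFT_COMPILE | \
                        DEMO_STUDY_JIT_PARTIAL_HARD_COMPILE | \
                        DEMO_STUDY_EXTRA_NEEDED)

/* Return 1 if no undefined bit is set in the options word, else 0. */
static int study_options_valid(int options)
{
    return (options & ~DEMO_STUDY_ALL) == 0;
}
```

A program that builds its options word from configuration data might call such a check first and report a configuration error itself, rather than passing an unknown bit to the library.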
2446    
2447           JIT compilation is a heavyweight optimization. It can  take  some  time
2448           for  patterns  to  be analyzed, and for one-off matches and simple pat-
2449           terns the benefit of faster execution might be offset by a much  slower
2450           study time.  Not all patterns can be optimized by the JIT compiler. For
2451           those that cannot be handled, matching automatically falls back to  the
2452           pcre_exec()  interpreter.  For more details, see the pcrejit documenta-
2453           tion.
2454    
2455           The third argument for pcre_study() is a pointer for an error  message.
2456           If  studying  succeeds  (even  if no data is returned), the variable it
2457           points to is set to NULL. Otherwise it is set to  point  to  a  textual
2458         error message. This is a static string that is part of the library. You         error message. This is a static string that is part of the library. You
2459         must  not  try  to  free it. You should test the error pointer for NULL         must not try to free it. You should test the  error  pointer  for  NULL
2460         after calling pcre_study(), to be sure that it has run successfully.         after calling pcre_study(), to be sure that it has run successfully.
2461    
2462         This is a typical call to pcre_study():         When  you are finished with a pattern, you can free the memory used for
2463           the study data by calling pcre_free_study(). This function was added to
2464           the  API  for  release  8.20. For earlier versions, the memory could be
2465           freed with pcre_free(), just like the pattern itself. This  will  still
2466           work  in  cases where JIT optimization is not used, but it is advisable
2467           to change to the new function when convenient.
2468    
2469           This is a typical way in which pcre_study() is used (except that  in  a
2470           real application there should be tests for errors):
2471    
2472           pcre_extra *pe;           int rc;
2473           pe = pcre_study(           pcre *re;
2474             pcre_extra *sd;
2475             re = pcre_compile("pattern", 0, &error, &erroroffset, NULL);
2476             sd = pcre_study(
2477             re,             /* result of pcre_compile() */             re,             /* result of pcre_compile() */
2478             0,              /* no options exist */             0,              /* no options */
2479             &error);        /* set to NULL or points to a message */             &error);        /* set to NULL or points to a message */
2480             rc = pcre_exec(   /* see below for details of pcre_exec() options */
2481         At present, studying a pattern is useful only for non-anchored patterns             re, sd, "subject", 7, 0, 0, ovector, 30);
2482         that  do not have a single fixed starting character. A bitmap of possi-           ...
2483         ble starting bytes is created.           pcre_free_study(sd);
2484             pcre_free(re);
2485    
2486           Studying a pattern does two things: first, a lower bound for the length
2487           of subject string that is needed to match the pattern is computed. This
2488           does not mean that there are any strings of that length that match, but
2489           it does guarantee that no shorter strings match. The value is  used  to
2490           avoid wasting time by trying to match strings that are shorter than the
2491           lower bound. You can find out the value in a calling  program  via  the
2492           pcre_fullinfo() function.
2493    
2494           Studying a pattern is also useful for non-anchored patterns that do not
2495           have a single fixed starting character. A bitmap of  possible  starting
2496           bytes  is  created. This speeds up finding a position in the subject at
2497           which to start matching. (In 16-bit mode, the bitmap is used for 16-bit
2498           values  less  than  256.  In 32-bit mode, the bitmap is used for 32-bit
2499           values less than 256.)
2500    
2501           These two optimizations apply to both pcre_exec() and  pcre_dfa_exec(),
2502           and  the  information  is also used by the JIT compiler.  The optimiza-
2503           tions can be disabled by setting the PCRE_NO_START_OPTIMIZE option when
2504           calling pcre_exec() or pcre_dfa_exec(), but if this is done, JIT execu-
2505           tion is also disabled. You might want to do this if your  pattern  con-
2506           tains  callouts or (*MARK) and you want to make use of these facilities
2507           in   cases   where   matching   fails.   See    the    discussion    of
2508           PCRE_NO_START_OPTIMIZE below.
2509    
2510    
2511  LOCALE SUPPORT  LOCALE SUPPORT
2512    
2513         PCRE handles caseless matching, and determines whether  characters  are         PCRE  handles  caseless matching, and determines whether characters are
2514         letters,  digits, or whatever, by reference to a set of tables, indexed         letters, digits, or whatever, by reference to a set of tables,  indexed
2515         by character value. When running in UTF-8 mode, this  applies  only  to         by  character  value.  When running in UTF-8 mode, this applies only to
2516         characters  with  codes  less than 128. Higher-valued codes never match         characters with codes less than 128. By  default,  higher-valued  codes
2517         escapes such as \w or \d, but can be tested with \p if  PCRE  is  built         never match escapes such as \w or \d, but they can be tested with \p if
2518         with  Unicode  character property support. The use of locales with Uni-         PCRE is built with Unicode character property  support.  Alternatively,
2519         code is discouraged. If you are handling characters with codes  greater         the  PCRE_UCP  option  can  be  set at compile time; this causes \w and
2520         than  128, you should either use UTF-8 and Unicode, or use locales, but         friends to use Unicode property support instead of built-in tables. The
2521         not try to mix the two.         use of locales with Unicode is discouraged. If you are handling charac-
2522           ters with codes greater than 128, you should either use UTF-8 and  Uni-
2523           code, or use locales, but not try to mix the two.
2524    
2525         PCRE contains an internal set of tables that are used  when  the  final         PCRE  contains  an  internal set of tables that are used when the final
2526         argument  of  pcre_compile()  is  NULL.  These  are sufficient for many         argument of pcre_compile() is  NULL.  These  are  sufficient  for  many
2527         applications.  Normally, the internal tables recognize only ASCII char-         applications.  Normally, the internal tables recognize only ASCII char-
2528         acters. However, when PCRE is built, it is possible to cause the inter-         acters. However, when PCRE is built, it is possible to cause the inter-
2529         nal tables to be rebuilt in the default "C" locale of the local system,         nal tables to be rebuilt in the default "C" locale of the local system,
2530         which may cause them to be different.         which may cause them to be different.
2531    
2532         The  internal tables can always be overridden by tables supplied by the         The internal tables can always be overridden by tables supplied by  the
2533         application that calls PCRE. These may be created in a different locale         application that calls PCRE. These may be created in a different locale
2534         from  the  default.  As more and more applications change to using Uni-         from the default. As more and more applications change  to  using  Uni-
2535         code, the need for this locale support is expected to die away.         code, the need for this locale support is expected to die away.
2536    
2537         External tables are built by calling  the  pcre_maketables()  function,         External  tables  are  built by calling the pcre_maketables() function,
2538         which  has no arguments, in the relevant locale. The result can then be         which has no arguments, in the relevant locale. The result can then  be
2539         passed to pcre_compile() or pcre_exec()  as  often  as  necessary.  For         passed  to  pcre_compile()  or  pcre_exec()  as often as necessary. For
2540         example,  to  build  and use tables that are appropriate for the French         example, to build and use tables that are appropriate  for  the  French
2541         locale (where accented characters with  values  greater  than  128  are         locale  (where  accented  characters  with  values greater than 128 are
2542         treated as letters), the following code could be used:         treated as letters), the following code could be used:
2543    
2544           setlocale(LC_CTYPE, "fr_FR");           setlocale(LC_CTYPE, "fr_FR");
2545           tables = pcre_maketables();           tables = pcre_maketables();
2546           re = pcre_compile(..., tables);           re = pcre_compile(..., tables);
2547    
2548         The  locale  name "fr_FR" is used on Linux and other Unix-like systems;         The locale name "fr_FR" is used on Linux and other  Unix-like  systems;
2549         if you are using Windows, the name for the French locale is "french".         if you are using Windows, the name for the French locale is "french".
2550    
2551         When pcre_maketables() runs, the tables are built  in  memory  that  is         When  pcre_maketables()  runs,  the  tables are built in memory that is
2552         obtained  via  pcre_malloc. It is the caller's responsibility to ensure         obtained via pcre_malloc. It is the caller's responsibility  to  ensure
2553         that the memory containing the tables remains available for as long  as         that  the memory containing the tables remains available for as long as
2554         it is needed.         it is needed.
2555    
2556         The pointer that is passed to pcre_compile() is saved with the compiled         The pointer that is passed to pcre_compile() is saved with the compiled
2557         pattern, and the same tables are used via this pointer by  pcre_study()         pattern,  and the same tables are used via this pointer by pcre_study()
2558         and normally also by pcre_exec(). Thus, by default, for any single pat-         and normally also by pcre_exec(). Thus, by default, for any single pat-
2559         tern, compilation, studying and matching all happen in the same locale,         tern, compilation, studying and matching all happen in the same locale,
2560         but different patterns can be compiled in different locales.         but different patterns can be compiled in different locales.
2561    
2562         It  is  possible to pass a table pointer or NULL (indicating the use of         It is possible to pass a table pointer or NULL (indicating the  use  of
2563         the internal tables) to pcre_exec(). Although  not  intended  for  this         the  internal  tables)  to  pcre_exec(). Although not intended for this
2564         purpose,  this facility could be used to match a pattern in a different         purpose, this facility could be used to match a pattern in a  different
2565         locale from the one in which it was compiled. Passing table pointers at         locale from the one in which it was compiled. Passing table pointers at
2566         run time is discussed below in the section on matching a pattern.         run time is discussed below in the section on matching a pattern.
2567    
# Line 1409  INFORMATION ABOUT A PATTERN Line 2571  INFORMATION ABOUT A PATTERN
2571         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,         int pcre_fullinfo(const pcre *code, const pcre_extra *extra,
2572              int what, void *where);              int what, void *where);
2573    
2574         The  pcre_fullinfo() function returns information about a compiled pat-         The pcre_fullinfo() function returns information about a compiled  pat-
2575         tern. It replaces the obsolete pcre_info() function, which is neverthe-         tern.  It replaces the pcre_info() function, which was removed from the
2576         less retained for backwards compatibility (and is documented below).         library at version 8.30, after more than 10 years of obsolescence.
2577    
2578         The  first  argument  for  pcre_fullinfo() is a pointer to the compiled         The first argument for pcre_fullinfo() is a  pointer  to  the  compiled
2579         pattern. The second argument is the result of pcre_study(), or NULL  if         pattern.  The second argument is the result of pcre_study(), or NULL if
2580         the  pattern  was not studied. The third argument specifies which piece         the pattern was not studied. The third argument specifies  which  piece
2581         of information is required, and the fourth argument is a pointer  to  a         of  information  is required, and the fourth argument is a pointer to a
2582         variable  to  receive  the  data. The yield of the function is zero for         variable to receive the data. The yield of the  function  is  zero  for
2583         success, or one of the following negative numbers:         success, or one of the following negative numbers:
2584    
2585           PCRE_ERROR_NULL       the argument code was NULL           PCRE_ERROR_NULL           the argument code was NULL
2586                                 the argument where was NULL                                     the argument where was NULL
2587           PCRE_ERROR_BADMAGIC   the "magic number" was not found           PCRE_ERROR_BADMAGIC       the "magic number" was not found
2588           PCRE_ERROR_BADOPTION  the value of what was invalid           PCRE_ERROR_BADENDIANNESS  the pattern was compiled with different
2589                                       endianness
2590         The "magic number" is placed at the start of each compiled  pattern  as           PCRE_ERROR_BADOPTION      the value of what was invalid
2591         a  simple check against passing an arbitrary memory pointer. Here is a  
2592         typical call of pcre_fullinfo(), to obtain the length of  the  compiled         The  "magic  number" is placed at the start of each compiled pattern as
2593         pattern:         a simple check against passing an arbitrary memory pointer. The  endi-
2594           anness error can occur if a compiled pattern is saved and reloaded on a
2595           different host. Here is a typical call of  pcre_fullinfo(),  to  obtain
2596           the length of the compiled pattern:
2597    
2598           int rc;           int rc;
2599           size_t length;           size_t length;
2600           rc = pcre_fullinfo(           rc = pcre_fullinfo(
2601             re,               /* result of pcre_compile() */             re,               /* result of pcre_compile() */
2602             pe,               /* result of pcre_study(), or NULL */             sd,               /* result of pcre_study(), or NULL */
2603             PCRE_INFO_SIZE,   /* what is required */             PCRE_INFO_SIZE,   /* what is required */
2604             &length);         /* where to put the data */             &length);         /* where to put the data */
2605    
# Line 1462  INFORMATION ABOUT A PATTERN Line 2627  INFORMATION ABOUT A PATTERN
2627    
2628           PCRE_INFO_FIRSTBYTE           PCRE_INFO_FIRSTBYTE
2629    
2630         Return  information  about  the first byte of any matched string, for a         Return information about the first data unit of any matched string, for
2631         non-anchored pattern. The fourth argument should point to an int  vari-         a non-anchored pattern. (The name of this option refers  to  the  8-bit
2632         able.  (This option used to be called PCRE_INFO_FIRSTCHAR; the old name         library,  where data units are bytes.) The fourth argument should point
2633         is still recognized for backwards compatibility.)         to an int variable.
2634    
2635           If there is a fixed first value, for example, the  letter  "c"  from  a
2636           pattern  such  as (cat|cow|coyote), its value is returned. In the 8-bit
2637           library, the value is always less than 256. In the 16-bit  library  the
2638           value can be up to 0xffff. In the 32-bit library the value can be up to
2639           0x10ffff.
2640    
2641         If there is a fixed first byte, for example, from  a  pattern  such  as         If there is no fixed first value, and if either
        (cat|cow|coyote), its value is returned. Otherwise, if either  
2642    
2643         (a)  the pattern was compiled with the PCRE_MULTILINE option, and every         (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
2644         branch starts with "^", or         branch starts with "^", or
2645    
2646         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not         (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2647         set (if it were set, the pattern would be anchored),         set (if it were set, the pattern would be anchored),
2648    
2649         -1  is  returned, indicating that the pattern matches only at the start         -1 is returned, indicating that the pattern matches only at  the  start
2650         of a subject string or after any newline within the  string.  Otherwise         of  a  subject string or after any newline within the string. Otherwise
2651         -2 is returned. For anchored patterns, -2 is returned.         -2 is returned. For anchored patterns, -2 is returned.
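
The three kinds of result described above for PCRE_INFO_FIRSTBYTE can be
told apart with a small helper. This is only a sketch: the enum and the
function classify_first_unit() are hypothetical names and are not part of
the PCRE API. The value passed in is assumed to be the int that
pcre_fullinfo() stored for this option.

```c
#include <assert.h>

/* Hypothetical names, not part of PCRE. */
enum first_unit_kind {
    FIRST_FIXED,       /* value >= 0: every match starts with this data unit */
    FIRST_LINE_START,  /* -1: matches only at subject start or after a newline */
    FIRST_UNKNOWN      /* -2: anchored pattern, or no information available */
};

/* Classify the int obtained for PCRE_INFO_FIRSTBYTE. */
static enum first_unit_kind classify_first_unit(int value)
{
    assert(value >= -2);           /* the text documents no other negatives */
    if (value >= 0)
        return FIRST_FIXED;
    if (value == -1)
        return FIRST_LINE_START;
    return FIRST_UNKNOWN;
}
```

For the pattern (cat|cow|coyote) the stored value would be the code of
"c", which this helper reports as FIRST_FIXED.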
2652    
2653           Since for the 32-bit library using the non-UTF-32 mode,  this  function
2654           is  unable to return the full 32-bit range of the character, this value
2655           is   deprecated;   instead   the   PCRE_INFO_FIRSTCHARACTERFLAGS    and
2656           PCRE_INFO_FIRSTCHARACTER values should be used.
2657    
2658           PCRE_INFO_FIRSTTABLE           PCRE_INFO_FIRSTTABLE
2659    
2660         If  the pattern was studied, and this resulted in the construction of a         If  the pattern was studied, and this resulted in the construction of a
2661         256-bit table indicating a fixed set of bytes for the first byte in any         256-bit table indicating a fixed set of values for the first data  unit
2662         matching  string, a pointer to the table is returned. Otherwise NULL is         in  any  matching string, a pointer to the table is returned. Otherwise
2663         returned. The fourth argument should point to an unsigned char *  vari-         NULL is returned. The fourth argument should point to an unsigned  char
2664         able.         * variable.
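
A 256-bit table of this kind occupies 32 bytes. The following sketch
shows one way a caller might test whether a given value can start a
match. It assumes that the value c is represented by bit (c % 8) of byte
(c / 8); that layout is an assumption made for illustration and should be
checked against the library source before being relied on. Both function
names are hypothetical, and mark_start() exists only to build a table for
the example.

```c
#include <assert.h>
#include <stddef.h>

/* Return nonzero if value c (0..255) is marked in the 32-byte table.
   A NULL table means no information, so nothing can be ruled out.
   Assumed layout: bit (c % 8) of byte (c / 8) stands for c. */
static int can_start_with(const unsigned char *table, unsigned int c)
{
    assert(c < 256);
    if (table == NULL)
        return 1;
    return (table[c / 8] & (1u << (c % 8))) != 0;
}

/* Illustration only: mark c as a possible first value in a table. */
static void mark_start(unsigned char *table, unsigned int c)
{
    assert(table != NULL && c < 256);
    table[c / 8] |= (unsigned char)(1u << (c % 8));
}
```

A scanner could use can_start_with() to skip subject positions whose
first value is not marked, which is the kind of saving the start
optimization is meant to provide.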
2665    
2666             PCRE_INFO_HASCRORLF
2667    
2668           Return  1  if  the  pattern  contains any explicit matches for CR or LF
2669           characters, otherwise 0. The fourth argument should  point  to  an  int
2670           variable.  An explicit match is either a literal CR or LF character, or
2671           \r or \n.
2672    
2673           PCRE_INFO_JCHANGED           PCRE_INFO_JCHANGED
2674    
2675         Return  1  if the (?J) option setting is used in the pattern, otherwise         Return 1 if the (?J) or (?-J) option setting is used  in  the  pattern,
2676         0. The fourth argument should point to an int variable. The (?J) inter-         otherwise  0. The fourth argument should point to an int variable. (?J)
2677         nal option setting changes the local PCRE_DUPNAMES option.         and (?-J) set and unset the local PCRE_DUPNAMES option, respectively.
2678    
2679             PCRE_INFO_JIT
2680    
2681           Return 1 if the pattern was studied with one of the  JIT  options,  and
2682           just-in-time compiling was successful. The fourth argument should point
2683           to an int variable. A return value of 0 means that JIT support  is  not
2684           available  in this version of PCRE, or that the pattern was not studied
2685           with a JIT option, or that the JIT compiler could not handle this  par-
2686           ticular  pattern. See the pcrejit documentation for details of what can
2687           and cannot be handled.
2688    
2689             PCRE_INFO_JITSIZE
2690    
2691           If the pattern was successfully studied with a JIT option,  return  the
2692           size  of the JIT compiled code, otherwise return zero. The fourth argu-
2693           ment should point to a size_t variable.
2694    
2695           PCRE_INFO_LASTLITERAL           PCRE_INFO_LASTLITERAL
2696    
2697         Return  the  value of the rightmost literal byte that must exist in any         Return the value of the rightmost literal data unit that must exist  in
2698         matched string, other than at its  start,  if  such  a  byte  has  been         any  matched  string, other than at its start, if such a value has been
2699         recorded. The fourth argument should point to an int variable. If there         recorded. The fourth argument should point to an int variable. If there
2700         is no such byte, -1 is returned. For anchored patterns, a last  literal         is no such value, -1 is returned. For anchored patterns, a last literal
2701         byte  is  recorded only if it follows something of variable length. For         value is recorded only if it follows something of variable length.  For
2702         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for         example, for the pattern /^a\d+z\d+/ the returned value is "z", but for
2703         /^a\dz\d/ the returned value is -1.         /^a\dz\d/ the returned value is -1.
2704    
2705           Since for the 32-bit library using the non-UTF-32 mode,  this  function
2706           is  unable to return the full 32-bit range of the character, this value
2707           is   deprecated;   instead    the    PCRE_INFO_REQUIREDCHARFLAGS    and
2708           PCRE_INFO_REQUIREDCHAR values should be used.
2709    
2710             PCRE_INFO_MAXLOOKBEHIND
2711    
2712           Return  the  number of characters (NB not bytes) in the longest lookbe-
2713           hind assertion in the pattern. This information is  useful  when  doing
2714           multi-segment matching using the partial matching facilities. Note that
2715           the simple assertions \b and \B require a one-character lookbehind.  \A
2716           also  registers a one-character lookbehind, though it does not actually
2717           inspect the previous character. This is to ensure  that  at  least  one
2718           character  from  the old segment is retained when a new segment is pro-
2719           cessed. Otherwise, if there are no lookbehinds in the pattern, \A might
2720           match incorrectly at the start of a new segment.
2721    
2722             PCRE_INFO_MINLENGTH
2723    
2724           If  the  pattern  was studied and a minimum length for matching subject
2725           strings was computed, its value is  returned.  Otherwise  the  returned
2726           value  is  -1. The value is a number of characters, which in UTF-8 mode
2727           may be different from the number of bytes. The fourth  argument  should
2728           point  to an int variable. A non-negative value is a lower bound to the
2729           length of any matching string. There may not be  any  strings  of  that
2730           length  that  do actually match, but every string that does match is at
2731           least that long.
2732    
2733           PCRE_INFO_NAMECOUNT           PCRE_INFO_NAMECOUNT
2734           PCRE_INFO_NAMEENTRYSIZE           PCRE_INFO_NAMEENTRYSIZE
2735           PCRE_INFO_NAMETABLE           PCRE_INFO_NAMETABLE
2736    
2737         PCRE  supports the use of named as well as numbered capturing parenthe-         PCRE supports the use of named as well as numbered capturing  parenthe-
2738         ses. The names are just an additional way of identifying the  parenthe-         ses.  The names are just an additional way of identifying the parenthe-
2739         ses, which still acquire numbers. Several convenience functions such as         ses, which still acquire numbers. Several convenience functions such as
2740         pcre_get_named_substring() are provided for  extracting  captured  sub-         pcre_get_named_substring()  are  provided  for extracting captured sub-
2741         strings  by  name. It is also possible to extract the data directly, by         strings by name. It is also possible to extract the data  directly,  by
2742         first converting the name to a number in order to  access  the  correct         first  converting  the  name to a number in order to access the correct
2743         pointers in the output vector (described with pcre_exec() below). To do         pointers in the output vector (described with pcre_exec() below). To do
2744         the conversion, you need  to  use  the  name-to-number  map,  which  is         the  conversion,  you  need  to  use  the  name-to-number map, which is
2745         described by these three values.         described by these three values.
2746    
2747         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT         The map consists of a number of fixed-size entries. PCRE_INFO_NAMECOUNT
2748         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size         gives the number of entries, and PCRE_INFO_NAMEENTRYSIZE gives the size
2749         of  each  entry;  both  of  these  return  an int value. The entry size         of each entry; both of these  return  an  int  value.  The  entry  size
2750         depends on the length of the longest name. PCRE_INFO_NAMETABLE  returns         depends  on the length of the longest name. PCRE_INFO_NAMETABLE returns
2751         a  pointer  to  the  first  entry of the table (a pointer to char). The         a pointer to the first entry of the table. This is a pointer to char in
2752         first two bytes of each entry are the number of the capturing parenthe-         the 8-bit library, where the first two bytes of each entry are the num-
2753         sis,  most  significant byte first. The rest of the entry is the corre-         ber of the capturing parenthesis, most significant byte first.  In  the
2754         sponding name, zero terminated. The names are  in  alphabetical  order.         16-bit  library,  the pointer points to 16-bit data units, the first of
2755         When PCRE_DUPNAMES is set, duplicate names are in order of their paren-         which contains the parenthesis number.   In  the  32-bit  library,  the
2756         theses numbers. For example, consider  the  following  pattern  (assume         pointer  points  to  32-bit data units, the first of which contains the
2757         PCRE_EXTENDED  is  set,  so  white  space  -  including  newlines  - is         parenthesis number. The rest of the entry is  the  corresponding  name,
2758         ignored):         zero terminated.
2759    
2760           The  names are in alphabetical order. Duplicate names may appear if (?|
2761           is used to create multiple groups with the same number, as described in
2762           the  section  on  duplicate subpattern numbers in the pcrepattern page.
2763           Duplicate names for subpatterns with different  numbers  are  permitted
2764           only  if  PCRE_DUPNAMES  is  set. In all cases of duplicate names, they
2765           appear in the table in the order in which they were found in  the  pat-
2766           tern.  In  the  absence  of (?| this is the order of increasing number;
2767           when (?| is used this is not necessarily the case because later subpat-
2768           terns may have lower numbers.
2769    
2770           As  a  simple  example of the name/number table, consider the following
2771           pattern after compilation by the 8-bit library (assume PCRE_EXTENDED is
2772           set, so white space - including newlines - is ignored):
2773    
2774           (?<date> (?<year>(\d\d)?\d\d) -           (?<date> (?<year>(\d\d)?\d\d) -
2775           (?<month>\d\d) - (?<day>\d\d) )           (?<month>\d\d) - (?<day>\d\d) )
2776    
2777         There are four named subpatterns, so the table has  four  entries,  and         There  are  four  named subpatterns, so the table has four entries, and
2778         each  entry  in the table is eight bytes long. The table is as follows,         each entry in the table is eight bytes long. The table is  as  follows,
2779         with non-printing bytes shown in hexadecimal, and undefined bytes shown         with non-printing bytes shown in hexadecimal, and undefined bytes shown
2780         as ??:         as ??:
2781    
# Line 1544  INFORMATION ABOUT A PATTERN Line 2784  INFORMATION ABOUT A PATTERN
2784           00 04 m  o  n  t  h  00           00 04 m  o  n  t  h  00
2785           00 02 y  e  a  r  00 ??           00 02 y  e  a  r  00 ??
2786    
2787         When  writing  code  to  extract  data from named subpatterns using the         When writing code to extract data  from  named  subpatterns  using  the
2788         name-to-number map, remember that the length of the entries  is  likely         name-to-number  map,  remember that the length of the entries is likely
2789         to be different for each compiled pattern.         to be different for each compiled pattern.
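
As an illustration of the layout just described, the following sketch
walks an 8-bit-library name table and converts a name to its group
number. The function name_to_number() and the array example_table are
hypothetical and not part of PCRE. The array reproduces the example
table for the date pattern, with the entries for "date" and "day" filled
in from the group numbering of that pattern and with undefined bytes set
to zero. In a real program the three inputs would come from
pcre_fullinfo() with PCRE_INFO_NAMETABLE, PCRE_INFO_NAMECOUNT and
PCRE_INFO_NAMEENTRYSIZE.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: look up a group name in an 8-bit name table.
   Returns the group number, or -1 if the name is not in the table.
   If a name is duplicated, the first entry found is used. */
static int name_to_number(const unsigned char *table, int count,
                          int entry_size, const char *name)
{
    int i;
    assert(table != NULL && name != NULL && entry_size > 2);
    for (i = 0; i < count; i++) {
        /* Step by entry_size; it differs from pattern to pattern. */
        const unsigned char *entry = table + (size_t)i * entry_size;
        /* Bytes 0-1 hold the number, most significant byte first;
           the zero-terminated name starts at byte 2. */
        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}

/* Example table: four entries of eight bytes each. */
static const unsigned char example_table[4 * 8] = {
    0, 1, 'd', 'a', 't', 'e', 0, 0,
    0, 5, 'd', 'a', 'y', 0, 0, 0,
    0, 4, 'm', 'o', 'n', 't', 'h', 0,
    0, 2, 'y', 'e', 'a', 'r', 0, 0
};
```

The entry size is passed in rather than hard-coded because, as noted
above, it depends on the longest name in each compiled pattern. A linear
scan is used for simplicity; since the names are in alphabetical order, a
binary search would also be possible.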
2790    
2791           PCRE_INFO_OKPARTIAL           PCRE_INFO_OKPARTIAL
2792    
2793         Return  1 if the pattern can be used for partial matching, otherwise 0.         Return 1  if  the  pattern  can  be  used  for  partial  matching  with
2794         The fourth argument should point to an int  variable.  The  pcrepartial         pcre_exec(),  otherwise  0.  The fourth argument should point to an int
2795         documentation  lists  the restrictions that apply to patterns when par-         variable. From  release  8.00,  this  always  returns  1,  because  the
2796         tial matching is used.         restrictions  that  previously  applied  to  partial matching have been
2797           lifted. The pcrepartial documentation gives details of  partial  match-
2798           ing.
2799    
2800           PCRE_INFO_OPTIONS           PCRE_INFO_OPTIONS
2801    
2802         Return a copy of the options with which the pattern was  compiled.  The         Return  a  copy of the options with which the pattern was compiled. The
2803         fourth  argument  should  point to an unsigned long int variable. These         fourth argument should point to an unsigned long  int  variable.  These
2804         option bits are those specified in the call to pcre_compile(), modified         option bits are those specified in the call to pcre_compile(), modified
2805         by any top-level option settings at the start of the pattern itself. In         by any top-level option settings at the start of the pattern itself. In
2806         other words, they are the options that will be in force  when  matching         other  words,  they are the options that will be in force when matching
2807         starts.  For  example, if the pattern /(?im)abc(?-i)d/ is compiled with         starts. For example, if the pattern /(?im)abc(?-i)d/ is  compiled  with
2808         the PCRE_EXTENDED option, the result is PCRE_CASELESS,  PCRE_MULTILINE,         the  PCRE_EXTENDED option, the result is PCRE_CASELESS, PCRE_MULTILINE,
2809         and PCRE_EXTENDED.         and PCRE_EXTENDED.
2810    
2811         A  pattern  is  automatically  anchored by PCRE if all of its top-level         A pattern is automatically anchored by PCRE if  all  of  its  top-level
2812         alternatives begin with one of the following:         alternatives begin with one of the following:
2813    
2814           ^     unless PCRE_MULTILINE is set           ^     unless PCRE_MULTILINE is set
# Line 1580  INFORMATION ABOUT A PATTERN Line 2822  INFORMATION ABOUT A PATTERN
2822    
2823           PCRE_INFO_SIZE           PCRE_INFO_SIZE
2824    
2825         Return  the  size  of the compiled pattern, that is, the value that was         Return the size of the compiled pattern in bytes (for both  libraries).
2826         passed as the argument to pcre_malloc() when PCRE was getting memory in         The  fourth argument should point to a size_t variable. This value does
2827         which to place the compiled data. The fourth argument should point to a         not include the  size  of  the  pcre  structure  that  is  returned  by
2828         size_t variable.         pcre_compile().  The  value that is passed as the argument to pcre_mal-
2829           loc() when pcre_compile() is getting memory in which to place the  com-
2830           piled  data  is  the value returned by this option plus the size of the
2831           pcre structure. Studying a compiled pattern, with or without JIT,  does
2832           not alter the value returned by this option.
2833    
2834           PCRE_INFO_STUDYSIZE           PCRE_INFO_STUDYSIZE
2835    
2836         Return the size of the data block pointed to by the study_data field in         Return the size in bytes of the data block pointed to by the study_data
2837         a  pcre_extra  block.  That  is,  it  is  the  value that was passed to         field in a pcre_extra block. If pcre_extra is  NULL,  or  there  is  no
2838         pcre_malloc() when PCRE was getting memory into which to place the data         study  data,  zero  is  returned. The fourth argument should point to a
2839         created  by  pcre_study(). The fourth argument should point to a size_t         size_t variable. The study_data field is set by pcre_study() to  record
2840           information  that  will  speed  up  matching  (see the section entitled
2841           "Studying a pattern" above). The format of the study_data block is pri-
2842           vate,  but  its length is made available via this option so that it can
2843           be  saved  and  restored  (see  the  pcreprecompile  documentation  for
2844           details).
2845    
2846             PCRE_INFO_FIRSTCHARACTERFLAGS
2847    
2848           Return information about the first data unit of any matched string, for
2849           a non-anchored pattern. The fourth argument  should  point  to  an  int
2850         variable.         variable.
2851    
2852           If  there  is  a  fixed first value, for example, the letter "c" from a
2853           pattern such as (cat|cow|coyote), 1  is  returned,  and  the  character
2854           value can be retrieved using PCRE_INFO_FIRSTCHARACTER.
2855    
2856           If there is no fixed first value, and if either
2857    
2858           (a)  the pattern was compiled with the PCRE_MULTILINE option, and every
2859           branch starts with "^", or
2860    
2861           (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2862           set (if it were set, the pattern would be anchored),
2863    
2864           2 is returned, indicating that the pattern matches only at the start of
2865           a subject string or after any newline within the string. Otherwise 0 is
2866           returned. For anchored patterns, 0 is returned.
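
The three values described above can be turned into a readable summary with a small helper. The following is a minimal sketch: describe_first_char_flags() is a hypothetical name, not a PCRE function, and it assumes that the caller has already obtained the int value by calling pcre_fullinfo() with PCRE_INFO_FIRSTCHARACTERFLAGS.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE). Maps the int written by
   pcre_fullinfo() for PCRE_INFO_FIRSTCHARACTERFLAGS to a description.
   Only the values 0, 1, and 2 are documented above; anything else is
   reported as unexpected. */
static const char *describe_first_char_flags(int flags)
{
    switch (flags) {
    case 1:
        /* A fixed first data unit exists; its value is available
           via PCRE_INFO_FIRSTCHARACTER. */
        return "fixed first data unit";
    case 2:
        /* Case (a) or (b) above applies. */
        return "starts at subject start or after a newline";
    case 0:
        /* No information recorded, or the pattern is anchored. */
        return "no first data unit information";
    default:
        return "unexpected value";
    }
}
```

A caller would normally ask for PCRE_INFO_FIRSTCHARACTER only when the helper reports the first case.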
2867    
2868             PCRE_INFO_FIRSTCHARACTER
2869    
2870           Return  the  fixed  first character value, if PCRE_INFO_FIRSTCHARACTER-
2871           FLAGS returned 1; otherwise returns 0. The fourth argument should point
2872           to a uint32_t variable.
2873    
2874           In  the 8-bit library, the value is always less than 256. In the 16-bit
2875           library the value can be up to 0xffff. In the 32-bit library in  UTF-32
2876           mode  the  value  can  be up to 0x10ffff, and up to 0xffffffff when not
2877           using UTF-32 mode.
2878    
2879           If there is no fixed first value, and if either
2880    
2881           (a) the pattern was compiled with the PCRE_MULTILINE option, and  every
2882           branch starts with "^", or
2883    
2884           (b) every branch of the pattern starts with ".*" and PCRE_DOTALL is not
2885           set (if it were set, the pattern would be anchored),
2886    
2887  OBSOLETE INFO FUNCTION         -1 is returned, indicating that the pattern matches only at  the  start
2888           of  a  subject string or after any newline within the string. Otherwise
2889           -2 is returned. For anchored patterns, -2 is returned.
2890    
2891         int pcre_info(const pcre *code, int *optptr, int *firstcharptr);           PCRE_INFO_REQUIREDCHARFLAGS
2892    
2893         The pcre_info() function is now obsolete because its interface  is  too         Returns 1 if there is a rightmost literal data unit that must exist  in
2894         restrictive  to return all the available data about a compiled pattern.         any matched string, other than at its start. The fourth argument should
2895         New  programs  should  use  pcre_fullinfo()  instead.  The   yield   of         point to an int variable. If there is no such value, 0 is returned.  If
2896         pcre_info()  is the number of capturing subpatterns, or one of the fol-         returning  1,  the  character  value  itself  can  be  retrieved  using
2897         lowing negative numbers:         PCRE_INFO_REQUIREDCHAR.
2898    
2899           PCRE_ERROR_NULL       the argument code was NULL         For anchored patterns, a last literal value is recorded only if it fol-
2900           PCRE_ERROR_BADMAGIC   the "magic number" was not found         lows  something  of  variable  length.  For  example,  for  the pattern
2901           /^a\d+z\d+/  the  returned  value  is  1  (with  "z"  returned  from
2902         If the optptr argument is not NULL, a copy of the  options  with  which         PCRE_INFO_REQUIREDCHAR), but for /^a\dz\d/ the returned value is 0.
2903         the  pattern  was  compiled  is placed in the integer it points to (see  
2904         PCRE_INFO_OPTIONS above).           PCRE_INFO_REQUIREDCHAR
2905    
2906         If the pattern is not anchored and the  firstcharptr  argument  is  not         Return  the value of the rightmost literal data unit that must exist in
2907         NULL,  it is used to pass back information about the first character of         any matched string, other than at its start, if such a value  has  been
2908         any matched string (see PCRE_INFO_FIRSTBYTE above).         recorded.  The fourth argument should point to a uint32_t variable.  If
2909           there is no such value, 0 is returned.
2910    
2911    
2912  REFERENCE COUNTS  REFERENCE COUNTS
# Line 1644  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2937  MATCHING A PATTERN: THE TRADITIONAL FUNC
2937              const char *subject, int length, int startoffset,              const char *subject, int length, int startoffset,
2938              int options, int *ovector, int ovecsize);              int options, int *ovector, int ovecsize);
2939    
2940         The function pcre_exec() is called to match a subject string against  a         The  function pcre_exec() is called to match a subject string against a
2941         compiled  pattern, which is passed in the code argument. If the pattern         compiled pattern, which is passed in the code argument. If the  pattern
2942         has been studied, the result of the study should be passed in the extra         was  studied,  the  result  of  the study should be passed in the extra
2943         argument.  This  function is the main matching facility of the library,         argument. You can call pcre_exec() with the same code and  extra  argu-
2944         and it operates in a Perl-like manner. For specialist use there is also         ments  as  many  times as you like, in order to match different subject
2945         an  alternative matching function, which is described below in the sec-         strings with the same pattern.
2946         tion about the pcre_dfa_exec() function.  
2947           This function is the main matching facility  of  the  library,  and  it
2948           operates  in  a  Perl-like  manner. For specialist use there is also an
2949           alternative matching function, which is described below in the  section
2950           about the pcre_dfa_exec() function.
2951    
2952         In most applications, the pattern will have been compiled (and  option-         In  most applications, the pattern will have been compiled (and option-
2953         ally  studied)  in the same process that calls pcre_exec(). However, it         ally studied) in the same process that calls pcre_exec().  However,  it
2954         is possible to save compiled patterns and study data, and then use them         is possible to save compiled patterns and study data, and then use them
2955         later  in  different processes, possibly even on different hosts. For a         later in different processes, possibly even on different hosts.  For  a
2956         discussion about this, see the pcreprecompile documentation.         discussion about this, see the pcreprecompile documentation.
2957    
2958         Here is an example of a simple call to pcre_exec():         Here is an example of a simple call to pcre_exec():
# Line 1674  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 2971  MATCHING A PATTERN: THE TRADITIONAL FUNC
2971    
2972     Extra data for pcre_exec()     Extra data for pcre_exec()
2973    
2974         If the extra argument is not NULL, it must point to a  pcre_extra  data         If  the  extra argument is not NULL, it must point to a pcre_extra data
2975         block.  The pcre_study() function returns such a block (when it doesn't         block. The pcre_study() function returns such a block (when it  doesn't
2976         return NULL), but you can also create one for yourself, and pass  addi-         return  NULL), but you can also create one for yourself, and pass addi-
2977         tional  information  in it. The pcre_extra block contains the following         tional information in it. The pcre_extra block contains  the  following
2978         fields (not necessarily in this order):         fields (not necessarily in this order):
2979    
2980           unsigned long int flags;           unsigned long int flags;
2981           void *study_data;           void *study_data;
2982             void *executable_jit;
2983           unsigned long int match_limit;           unsigned long int match_limit;
2984           unsigned long int match_limit_recursion;           unsigned long int match_limit_recursion;
2985           void *callout_data;           void *callout_data;
2986           const unsigned char *tables;           const unsigned char *tables;
2987             unsigned char **mark;
2988    
2989         The flags field is a bitmap that specifies which of  the  other  fields         In  the  16-bit  version  of  this  structure,  the mark field has type
2990         are set. The flag bits are:         "PCRE_UCHAR16 **".
2991    
2992           PCRE_EXTRA_STUDY_DATA         In the 32-bit version of  this  structure,  the  mark  field  has  type
2993           "PCRE_UCHAR32 **".
2994    
2995           The  flags  field is used to specify which of the other fields are set.
2996           The flag bits are:
2997    
2998             PCRE_EXTRA_CALLOUT_DATA
2999             PCRE_EXTRA_EXECUTABLE_JIT
3000             PCRE_EXTRA_MARK
3001           PCRE_EXTRA_MATCH_LIMIT           PCRE_EXTRA_MATCH_LIMIT
3002           PCRE_EXTRA_MATCH_LIMIT_RECURSION           PCRE_EXTRA_MATCH_LIMIT_RECURSION
3003           PCRE_EXTRA_CALLOUT_DATA           PCRE_EXTRA_STUDY_DATA
3004           PCRE_EXTRA_TABLES           PCRE_EXTRA_TABLES
3005    
3006         Other  flag  bits should be set to zero. The study_data field is set in         Other flag bits should be set to zero. The study_data field  and  some-
3007         the pcre_extra block that is returned by  pcre_study(),  together  with         times  the executable_jit field are set in the pcre_extra block that is
3008         the appropriate flag bit. You should not set this yourself, but you may         returned by pcre_study(), together with the appropriate flag bits.  You
3009         add to the block by setting the other fields  and  their  corresponding         should  not set these yourself, but you may add to the block by setting
3010         flag bits.         other fields and their corresponding flag bits.
3011    
3012         The match_limit field provides a means of preventing PCRE from using up         The match_limit field provides a means of preventing PCRE from using up
3013         a vast amount of resources when running patterns that are not going  to         a  vast amount of resources when running patterns that are not going to
3014         match,  but  which  have  a very large number of possibilities in their         match, but which have a very large number  of  possibilities  in  their
3015         search trees. The classic  example  is  the  use  of  nested  unlimited         search  trees. The classic example is a pattern that uses nested unlim-
3016         repeats.         ited repeats.
3017    
3018         Internally,  PCRE uses a function called match() which it calls repeat-         Internally, pcre_exec() uses a function called match(), which it  calls
3019         edly (sometimes recursively). The limit set by match_limit  is  imposed         repeatedly  (sometimes  recursively).  The  limit set by match_limit is
3020         on  the  number  of times this function is called during a match, which         imposed on the number of times this function is called during a  match,
3021         has the effect of limiting the amount of  backtracking  that  can  take         which  has  the  effect of limiting the amount of backtracking that can
3022         place. For patterns that are not anchored, the count restarts from zero         take place. For patterns that are not anchored, the count restarts from
3023         for each position in the subject string.         zero for each position in the subject string.
3024    
3025           When pcre_exec() is called with a pattern that was successfully studied
3026           with a JIT option, the way that the matching is  executed  is  entirely
3027           different.  However, there is still the possibility of runaway matching
3028           that goes on for a very long time, and so the match_limit value is also
3029           used in this case (but in a different way) to limit how long the match-
3030           ing can continue.
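
The idea of limiting the number of calls to a backtracking function, with the count restarting at each starting position, can be illustrated with a toy matcher. This is a sketch only: it is not PCRE's match() function, it supports just literals, "." and "x*", and all the names and return codes (toy_match, toy_search, TOY_NOMATCH, TOY_MATCHLIMIT) are invented for the illustration.

```c
#include <stddef.h>

/* Illustrative return codes; these are not PCRE's error values. */
#define TOY_NOMATCH     (-1)
#define TOY_MATCHLIMIT  (-2)

typedef struct {
    unsigned long calls;  /* calls to toy_match() at this start position */
    unsigned long limit;  /* plays the role of match_limit */
} toy_budget;

/* Match pattern p at the start of s. Returns 1 on success,
   TOY_NOMATCH on failure, or TOY_MATCHLIMIT if the call budget
   is exhausted. */
static int toy_match(const char *p, const char *s, toy_budget *b)
{
    if (++b->calls > b->limit) return TOY_MATCHLIMIT;
    if (*p == '\0') return 1;
    if (p[1] == '*') {
        /* Greedy repeat: take as much as possible, then give back
           one character at a time (this is the backtracking). */
        const char *t = s;
        while (*t != '\0' && (*p == '.' || *t == *p)) t++;
        for (;;) {
            int rc = toy_match(p + 2, t, b);
            if (rc != TOY_NOMATCH) return rc;
            if (t == s) return TOY_NOMATCH;
            t--;
        }
    }
    if (*s != '\0' && (*p == '.' || *s == *p))
        return toy_match(p + 1, s + 1, b);
    return TOY_NOMATCH;
}

/* Unanchored search: the count restarts from zero for each
   starting position in the subject. */
static int toy_search(const char *p, const char *s, unsigned long limit)
{
    for (;; s++) {
        toy_budget b = { 0, limit };
        int rc = toy_match(p, s, &b);
        if (rc != TOY_NOMATCH) return rc;
        if (*s == '\0') return TOY_NOMATCH;
    }
}
```

With a pattern such as a*a*a*a*b and a subject consisting only of "a" characters, a small limit makes toy_search() give up early instead of exploring every way of dividing the subject among the repeats.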
3031    
3032         The default value for the limit can be set  when  PCRE  is  built;  the         The default value for the limit can be set  when  PCRE  is  built;  the
3033         default  default  is 10 million, which handles all but the most extreme         default  default  is 10 million, which handles all but the most extreme
# Line 1728  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3042  MATCHING A PATTERN: THE TRADITIONAL FUNC
3042         the total number of calls, because not all calls to match() are  recur-         the total number of calls, because not all calls to match() are  recur-
3043         sive.  This limit is of use only if it is set smaller than match_limit.         sive.  This limit is of use only if it is set smaller than match_limit.
3044    
3045         Limiting the recursion depth limits the amount of  stack  that  can  be         Limiting  the  recursion  depth limits the amount of machine stack that
3046         used, or, when PCRE has been compiled to use memory on the heap instead         can be used, or, when PCRE has been compiled to use memory on the  heap
3047         of the stack, the amount of heap memory that can be used.         instead  of the stack, the amount of heap memory that can be used. This
3048           limit is not relevant, and is ignored, when matching is done using  JIT
3049         The default value for match_limit_recursion can be  set  when  PCRE  is         compiled code.
3050         built;  the  default  default  is  the  same  value  as the default for  
3051         match_limit. You can override the default by supplying pcre_exec() with         The  default  value  for  match_limit_recursion can be set when PCRE is
3052         a   pcre_extra   block  in  which  match_limit_recursion  is  set,  and         built; the default default  is  the  same  value  as  the  default  for
3053         PCRE_EXTRA_MATCH_LIMIT_RECURSION is set in  the  flags  field.  If  the         match_limit. You can override the default by supplying pcre_exec() with
3054           a  pcre_extra  block  in  which  match_limit_recursion  is   set,   and
3055           PCRE_EXTRA_MATCH_LIMIT_RECURSION  is  set  in  the  flags field. If the
3056         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.         limit is exceeded, pcre_exec() returns PCRE_ERROR_RECURSIONLIMIT.
3057    
3058         The  pcre_callout  field is used in conjunction with the "callout" fea-         The callout_data field is used in conjunction with the  "callout"  fea-
3059         ture, which is described in the pcrecallout documentation.         ture, and is described in the pcrecallout documentation.
3060    
3061         The tables field  is  used  to  pass  a  character  tables  pointer  to         The  tables  field  is  used  to  pass  a  character  tables pointer to
3062         pcre_exec();  this overrides the value that is stored with the compiled         pcre_exec(); this overrides the value that is stored with the  compiled
3063         pattern. A non-NULL value is stored with the compiled pattern  only  if         pattern.  A  non-NULL value is stored with the compiled pattern only if
3064         custom  tables  were  supplied to pcre_compile() via its tableptr argu-         custom tables were supplied to pcre_compile() via  its  tableptr  argu-
3065         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces         ment.  If NULL is passed to pcre_exec() using this mechanism, it forces
3066         PCRE's  internal  tables  to be used. This facility is helpful when re-         PCRE's internal tables to be used. This facility is  helpful  when  re-
3067         using patterns that have been saved after compiling  with  an  external         using  patterns  that  have been saved after compiling with an external
3068         set  of  tables,  because  the  external tables might be at a different         set of tables, because the external tables  might  be  at  a  different
3069         address when pcre_exec() is called. See the  pcreprecompile  documenta-         address  when  pcre_exec() is called. See the pcreprecompile documenta-
3070         tion for a discussion of saving compiled patterns for later use.         tion for a discussion of saving compiled patterns for later use.
3071    
3072           If PCRE_EXTRA_MARK is set in the flags field, the mark  field  must  be
3073           set  to point to a suitable variable. If the pattern contains any back-
3074           tracking control verbs such as (*MARK:NAME), and the execution ends  up
3075           with  a  name  to  pass back, a pointer to the name string (zero termi-
3076           nated) is placed in the variable pointed to  by  the  mark  field.  The
3077           names  are  within  the  compiled pattern; if you wish to retain such a
3078           name you must copy it before freeing the memory of a compiled  pattern.
3079           If  there  is no name to pass back, the variable pointed to by the mark
3080           field is set to NULL. For details of the  backtracking  control  verbs,
3081           see the section entitled "Backtracking control" in the pcrepattern doc-
3082           umentation.
3083    
3084     Option bits for pcre_exec()     Option bits for pcre_exec()
3085    
3086         The  unused  bits of the options argument for pcre_exec() must be zero.         The unused bits of the options argument for pcre_exec() must  be  zero.
3087         The only bits that may  be  set  are  PCRE_ANCHORED,  PCRE_NEWLINE_xxx,         The  only  bits  that  may  be set are PCRE_ANCHORED, PCRE_NEWLINE_xxx,
3088         PCRE_NOTBOL,   PCRE_NOTEOL,   PCRE_NOTEMPTY,   PCRE_NO_UTF8_CHECK   and         PCRE_NOTBOL,   PCRE_NOTEOL,    PCRE_NOTEMPTY,    PCRE_NOTEMPTY_ATSTART,
3089         PCRE_PARTIAL.         PCRE_NO_START_OPTIMIZE,   PCRE_NO_UTF8_CHECK,   PCRE_PARTIAL_HARD,  and
3090           PCRE_PARTIAL_SOFT.
3091    
3092           If the pattern was successfully studied with one  of  the  just-in-time
3093           (JIT) compile options, the only supported options for JIT execution are
3094           PCRE_NO_UTF8_CHECK,    PCRE_NOTBOL,     PCRE_NOTEOL,     PCRE_NOTEMPTY,
3095           PCRE_NOTEMPTY_ATSTART,  PCRE_PARTIAL_HARD, and PCRE_PARTIAL_SOFT. If an
3096           unsupported option is used, JIT execution is disabled  and  the  normal
3097           interpretive code in pcre_exec() is run.
3098    
3099           PCRE_ANCHORED           PCRE_ANCHORED
3100    
3101         The PCRE_ANCHORED option limits pcre_exec() to matching  at  the  first         The  PCRE_ANCHORED  option  limits pcre_exec() to matching at the first
3102         matching  position.  If  a  pattern was compiled with PCRE_ANCHORED, or         matching position. If a pattern was  compiled  with  PCRE_ANCHORED,  or
3103         turned out to be anchored by virtue of its contents, it cannot be  made         turned  out to be anchored by virtue of its contents, it cannot be made
3104         unanchored at matching time.         unanchored at matching time.
3105    
3106             PCRE_BSR_ANYCRLF
3107             PCRE_BSR_UNICODE
3108    
3109           These options (which are mutually exclusive) control what the \R escape
3110           sequence  matches.  The choice is either to match only CR, LF, or CRLF,
3111           or to match any Unicode newline sequence. These  options  override  the
3112           choice that was made or defaulted when the pattern was compiled.
3113    
3114           PCRE_NEWLINE_CR           PCRE_NEWLINE_CR
3115           PCRE_NEWLINE_LF           PCRE_NEWLINE_LF
3116           PCRE_NEWLINE_CRLF           PCRE_NEWLINE_CRLF
# Line 1778  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3122  MATCHING A PATTERN: THE TRADITIONAL FUNC
3122         tion  of  pcre_compile()  above.  During  matching,  the newline choice         tion  of  pcre_compile()  above.  During  matching,  the newline choice
3123         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-         affects the behaviour of the dot, circumflex,  and  dollar  metacharac-
3124         ters.  It may also alter the way the match position is advanced after a         ters.  It may also alter the way the match position is advanced after a
3125         match  failure  for  an  unanchored  pattern.  When  PCRE_NEWLINE_CRLF,         match failure for an unanchored pattern.
3126         PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY is set, and a match attempt  
3127         fails when the current position is at a CRLF sequence, the match  posi-         When PCRE_NEWLINE_CRLF, PCRE_NEWLINE_ANYCRLF,  or  PCRE_NEWLINE_ANY  is
3128         tion  is  advanced by two characters instead of one, in other words, to         set,  and a match attempt for an unanchored pattern fails when the cur-
3129         after the CRLF.         rent position is at a  CRLF  sequence,  and  the  pattern  contains  no
3130           explicit  matches  for  CR  or  LF  characters,  the  match position is
3131           advanced by two characters instead of one, in other words, to after the
3132           CRLF.
3133    
3134           The above rule is a compromise that makes the most common cases work as
3135           expected. For example, if the  pattern  is  .+A  (and  the  PCRE_DOTALL
3136           option is not set), it does not match the string "\r\nA" because, after
3137           failing at the start, it skips both the CR and the LF before  retrying.
3138           However,  the  pattern  [\r\n]A does match that string, because it con-
3139           tains an explicit CR or LF reference, and so advances only by one char-
3140           acter after the first failure.
3141    
3142           An explicit match for CR or LF is either a literal appearance of one of
3143           those characters, or one of the \r or  \n  escape  sequences.  Implicit
3144           matches  such  as [^X] do not count, nor does \s (which includes CR and
3145           LF in the characters that it matches).
3146    
3147           Notwithstanding the above, anomalous effects may still occur when  CRLF
3148           is a valid newline sequence and explicit \r or \n escapes appear in the
3149           pattern.
3150    
3151           PCRE_NOTBOL           PCRE_NOTBOL
3152    
# Line 1810  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3174  MATCHING A PATTERN: THE TRADITIONAL FUNC
3174    
3175           a?b?           a?b?
3176    
3177         is applied to a string not beginning with "a" or "b",  it  matches  the         is applied to a string not beginning with "a" or  "b",  it  matches  an
3178         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this         empty  string at the start of the subject. With PCRE_NOTEMPTY set, this
3179         match is not valid, so PCRE searches further into the string for occur-         match is not valid, so PCRE searches further into the string for occur-
3180         rences of "a" or "b".         rences of "a" or "b".
3181    
3182         Perl has no direct equivalent of PCRE_NOTEMPTY, but it does make a spe-           PCRE_NOTEMPTY_ATSTART
3183         cial case of a pattern match of the empty  string  within  its  split()  
3184         function,  and  when  using  the /g modifier. It is possible to emulate         This  is  like PCRE_NOTEMPTY, except that an empty string match that is
3185         Perl's behaviour after matching a null string by first trying the match         not at the start of  the  subject  is  permitted.  If  the  pattern  is
3186         again at the same offset with PCRE_NOTEMPTY and PCRE_ANCHORED, and then         anchored, such a match can occur only if the pattern contains \K.
3187         if that fails by advancing the starting offset (see below)  and  trying  
3188         an ordinary match again. There is some code that demonstrates how to do         Perl     has    no    direct    equivalent    of    PCRE_NOTEMPTY    or
3189         this in the pcredemo.c sample program.         PCRE_NOTEMPTY_ATSTART, but it does make a special  case  of  a  pattern
3190           match  of  the empty string within its split() function, and when using
3191           the /g modifier. It is  possible  to  emulate  Perl's  behaviour  after
3192           matching a null string by first trying the match again at the same off-
3193           set with PCRE_NOTEMPTY_ATSTART and  PCRE_ANCHORED,  and  then  if  that
3194           fails, by advancing the starting offset (see below) and trying an ordi-
3195           nary match again. There is some code that demonstrates how to  do  this
3196           in  the  pcredemo sample program. In the most general case, you have to
3197           check to see if the newline convention recognizes CRLF  as  a  newline,
3198           and  if so, and the current character is CR followed by LF, advance the
3199           starting offset by two characters instead of one.
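
The advance step described above can be sketched as a small helper. This is an illustration under stated assumptions: next_start_offset() is a hypothetical name, not a PCRE function; the subject is assumed not to be in UTF-8 mode, so that one character is one byte; and the caller is assumed to have already determined whether the newline convention in force recognizes CRLF, and to have checked that offset is less than length.

```c
#include <stddef.h>

/* Hypothetical helper (not part of PCRE). Called after the retry at
   `offset` with PCRE_NOTEMPTY_ATSTART and PCRE_ANCHORED has failed.
   Returns the offset at which to try an ordinary match next.
   `crlf_is_newline` is nonzero when CRLF counts as a newline
   under the convention in force. One byte per character is assumed. */
static size_t next_start_offset(const char *subject, size_t length,
                                size_t offset, int crlf_is_newline)
{
    /* Step over a whole CRLF pair so that the next attempt does not
       start between the CR and the LF. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    /* Otherwise advance by a single character. */
    return offset + 1;
}
```

In UTF-8 mode the single-character step would instead have to skip any continuation bytes that follow the byte at offset.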
3200    
3201             PCRE_NO_START_OPTIMIZE
3202    
3203           There are a number of optimizations that pcre_exec() uses at the  start
3204           of  a  match,  in  order to speed up the process. For example, if it is
3205           known that an unanchored match must start with a specific character, it
3206           searches  the  subject  for that character, and fails immediately if it
3207           cannot find it, without actually running the  main  matching  function.
3208           This means that a special item such as (*COMMIT) at the start of a pat-
3209           tern is not considered until after a suitable starting  point  for  the
3210           match  has been found. When callouts or (*MARK) items are in use, these
3211           "start-up" optimizations can cause them to be skipped if the pattern is
3212           never  actually  used.  The start-up optimizations are in effect a pre-
3213           scan of the subject that takes place before the pattern is run.
3214    
3215           The PCRE_NO_START_OPTIMIZE option disables the start-up  optimizations,
3216           possibly  causing  performance  to  suffer,  but ensuring that in cases
3217           where the result is "no match", the callouts do occur, and  that  items
3218           such as (*COMMIT) and (*MARK) are considered at every possible starting
3219           position in the subject string. If  PCRE_NO_START_OPTIMIZE  is  set  at
3220           compile  time,  it  cannot  be  unset  at  matching  time.  The  use of
3221           PCRE_NO_START_OPTIMIZE disables JIT execution; when it is set, matching
3222           is always done interpretively.
3223    
3224           Setting  PCRE_NO_START_OPTIMIZE  can  change  the outcome of a matching
3225           operation.  Consider the pattern
3226    
3227             (*COMMIT)ABC
3228    
3229           When this is compiled, PCRE records the fact that a  match  must  start
3230           with  the  character  "A".  Suppose the subject string is "DEFABC". The
3231           start-up optimization scans along the subject, finds "A" and  runs  the
3232           first  match attempt from there. The (*COMMIT) item means that the pat-
3233           tern must match the current starting position, which in this  case,  it
3234           does.  However,  if  the  same match is run with PCRE_NO_START_OPTIMIZE
3235           set, the initial scan along the subject string  does  not  happen.  The
3236           first  match  attempt  is  run  starting  from "D" and when this fails,
3237           (*COMMIT) prevents any further matches  being  tried,  so  the  overall
3238           result  is  "no  match". If the pattern is studied, more start-up opti-
3239           mizations may be used. For example, a minimum length  for  the  subject
3240           may be recorded. Consider the pattern
3241    
3242             (*MARK:A)(X|Y)
3243    
3244           The  minimum  length  for  a  match is one character. If the subject is
3245           "ABC", there will be attempts to  match  "ABC",  "BC",  "C",  and  then
3246           finally  an empty string.  If the pattern is studied, the final attempt
3247           does not take place, because PCRE knows that the subject is too  short,
3248           and  so  the  (*MARK) is never encountered.  In this case, studying the
3249           pattern does not affect the overall match result, which  is  still  "no
3250           match", but it does affect the auxiliary information that is returned.
3251    
3252           PCRE_NO_UTF8_CHECK           PCRE_NO_UTF8_CHECK
3253    
3254         When PCRE_UTF8 is set at compile time, the validity of the subject as a         When PCRE_UTF8 is set at compile time, the validity of the subject as a
3255         UTF-8  string is automatically checked when pcre_exec() is subsequently         UTF-8 string is automatically checked when pcre_exec() is  subsequently
3256         called.  The value of startoffset is also checked  to  ensure  that  it         called.  The entire string is checked before any other processing takes
3257         points  to the start of a UTF-8 character. If an invalid UTF-8 sequence         place. The value of startoffset is  also  checked  to  ensure  that  it
3258         of bytes is found, pcre_exec() returns the error PCRE_ERROR_BADUTF8. If         points  to  the start of a UTF-8 character. There is a discussion about
3259         startoffset  contains  an  invalid  value, PCRE_ERROR_BADUTF8_OFFSET is         the validity of UTF-8 strings in the pcreunicode page.  If  an  invalid
3260         returned.         sequence   of   bytes   is   found,   pcre_exec()   returns  the  error
3261           PCRE_ERROR_BADUTF8 or, if PCRE_PARTIAL_HARD is set and the problem is a
3262           truncated character at the end of the subject, PCRE_ERROR_SHORTUTF8. In
3263           both cases, information about the precise nature of the error may  also
3264           be  returned (see the descriptions of these errors in the section enti-
3265           tled Error return values from pcre_exec() below).  If startoffset  con-
3266           tains a value that does not point to the start of a UTF-8 character (or
3267           to the end of the subject), PCRE_ERROR_BADUTF8_OFFSET is returned.
3268    
3269         If you already know that your subject is valid, and you  want  to  skip         If you already know that your subject is valid, and you  want  to  skip
3270         these    checks    for   performance   reasons,   you   can   set   the         these    checks    for   performance   reasons,   you   can   set   the
# Line 1840  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3272  MATCHING A PATTERN: THE TRADITIONAL FUNC
3272         do  this  for the second and subsequent calls to pcre_exec() if you are         do  this  for the second and subsequent calls to pcre_exec() if you are
3273         making repeated calls to find all  the  matches  in  a  single  subject         making repeated calls to find all  the  matches  in  a  single  subject
3274         string.  However,  you  should  be  sure  that the value of startoffset         string.  However,  you  should  be  sure  that the value of startoffset
3275         points to the start of a UTF-8 character.  When  PCRE_NO_UTF8_CHECK  is         points to the start of a character (or the end of  the  subject).  When
3276         set,  the  effect of passing an invalid UTF-8 string as a subject, or a         PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a
3277         value of startoffset that does not point to the start of a UTF-8  char-         subject or an invalid value of startoffset is undefined.  Your  program
3278         acter, is undefined. Your program may crash.         may crash.
3279    
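Because PCRE_NO_UTF8_CHECK removes the library's own safety net, a caller may want a cheap check of startoffset before setting it. The following is a minimal sketch, not part of the PCRE API: the function name utf8_offset_is_char_start is hypothetical, and it only tests that the offset lies within the subject and does not land on a UTF-8 continuation byte (10xxxxxx). It does not validate the subject string itself.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function).
 * Returns 1 if `offset` is within the subject and points either to the
 * end of the subject or to a byte that is not a UTF-8 continuation byte.
 * Returns 0 otherwise. The subject is assumed to be valid UTF-8 already. */
int utf8_offset_is_char_start(const char *subject, size_t length, size_t offset)
{
    if (offset > length)
        return 0;                       /* beyond the end of the subject */
    if (offset == length)
        return 1;                       /* end of subject is acceptable  */
    /* Continuation bytes have the bit pattern 10xxxxxx. */
    return (((unsigned char)subject[offset]) & 0xC0) != 0x80;
}
```

For the four-byte subject "a", 0xC3, 0xA9, "z", offsets 0, 1, 3 and 4 pass this test, while offset 2 (the second byte of the two-byte character) does not.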
3280           PCRE_PARTIAL           PCRE_PARTIAL_HARD
3281             PCRE_PARTIAL_SOFT
3282         This  option  turns  on  the  partial  matching feature. If the subject  
3283         string fails to match the pattern, but at some point during the  match-         These  options turn on the partial matching feature. For backwards com-
3284         ing  process  the  end of the subject was reached (that is, the subject         patibility, PCRE_PARTIAL is a synonym for PCRE_PARTIAL_SOFT. A  partial
3285         partially matches the pattern and the failure to  match  occurred  only         match  occurs if the end of the subject string is reached successfully,
3286         because  there were not enough subject characters), pcre_exec() returns         but there are not enough subject characters to complete the  match.  If
3287         PCRE_ERROR_PARTIAL instead of PCRE_ERROR_NOMATCH. When PCRE_PARTIAL  is         this happens when PCRE_PARTIAL_SOFT (but not PCRE_PARTIAL_HARD) is set,
3288         used,  there  are restrictions on what may appear in the pattern. These         matching continues by testing any remaining alternatives.  Only  if  no
3289         are discussed in the pcrepartial documentation.         complete  match  can be found is PCRE_ERROR_PARTIAL returned instead of
3290           PCRE_ERROR_NOMATCH. In other words,  PCRE_PARTIAL_SOFT  says  that  the
3291           caller  is  prepared to handle a partial match, but only if no complete
3292           match can be found.
3293    
3294           If PCRE_PARTIAL_HARD is set, it overrides  PCRE_PARTIAL_SOFT.  In  this
3295           case,  if  a  partial  match  is found, pcre_exec() immediately returns
3296           PCRE_ERROR_PARTIAL, without  considering  any  other  alternatives.  In
3297           other  words, when PCRE_PARTIAL_HARD is set, a partial match is consid-
3298           ered to be more important than an alternative complete match.
3299    
3300           In both cases, the portion of the string that was  inspected  when  the
3301           partial match was found is set as the first matching string. There is a
3302           more detailed discussion of partial and  multi-segment  matching,  with
3303           examples, in the pcrepartial documentation.
3304    
3305     The string to be matched by pcre_exec()     The string to be matched by pcre_exec()
3306    
3307         The subject string is passed to pcre_exec() as a pointer in subject,  a         The  subject string is passed to pcre_exec() as a pointer in subject, a
3308         length  in  length, and a starting byte offset in startoffset. In UTF-8         length in bytes in length, and a starting byte offset  in  startoffset.
3309         mode, the byte offset must point to the start  of  a  UTF-8  character.         If  this  is  negative  or  greater  than  the  length  of the subject,
3310         Unlike  the  pattern string, the subject may contain binary zero bytes.         pcre_exec() returns PCRE_ERROR_BADOFFSET. When the starting  offset  is
3311         When the starting offset is zero, the search for a match starts at  the         zero,  the  search  for a match starts at the beginning of the subject,
3312         beginning of the subject, and this is by far the most common case.         and this is by far the most common case. In UTF-8 mode, the byte offset
3313           must  point  to  the start of a UTF-8 character (or the end of the sub-
3314           ject). Unlike the pattern string, the subject may contain  binary  zero
3315           bytes.
3316    
3317         A  non-zero  starting offset is useful when searching for another match         A  non-zero  starting offset is useful when searching for another match
3318         in the same subject by calling pcre_exec() again after a previous  suc-         in the same subject by calling pcre_exec() again after a previous  suc-
# Line 1884  MATCHING A PATTERN: THE TRADITIONAL FUNC Line 3333  MATCHING A PATTERN: THE TRADITIONAL FUNC
3333         rence  of "iss" because it is able to look behind the starting point to         rence  of "iss" because it is able to look behind the starting point to
3334         discover that it is preceded by a letter.         discover that it is preceded by a letter.
3335    
3336         If a non-zero starting offset is passed when the pattern  is  anchored,         Finding all the matches in a subject is tricky  when  the  pattern  can
3337           match an empty string. It is possible to emulate Perl's /g behaviour by
3338           first  trying  the  match  again  at  the   same   offset,   with   the
3339           PCRE_NOTEMPTY_ATSTART  and  PCRE_ANCHORED  options,  and  then  if that
3340           fails, advancing the starting  offset  and  trying  an  ordinary  match
3341           again. There is some code that demonstrates how to do this in the pcre-
3342           demo sample program. In the most general case, you have to check to see
3343           if  the newline convention recognizes CRLF as a newline, and if so, and
3344           the current character is CR followed by LF, advance the starting offset
3345           by two characters instead of one.
3346    
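The "advance the starting offset" step described above can be sketched as a small helper. This is an illustration under stated assumptions rather than code taken from PCRE: the function name and its parameters are hypothetical, the caller is assumed to have already determined whether the newline convention recognizes CRLF, and in UTF-8 mode "one character" is taken to mean one byte plus any continuation bytes that follow it.

```c
#include <stddef.h>

/* Hypothetical helper (not a PCRE function).
 * Given the offset at which an empty match was found and the retry with
 * PCRE_NOTEMPTY_ATSTART | PCRE_ANCHORED failed, return the offset at which
 * the next ordinary match attempt should start.
 *   crlf_is_newline: non-zero if the newline convention recognizes CRLF
 *   utf8:            non-zero if the pattern was compiled in UTF-8 mode  */
size_t next_offset_after_empty_match(const char *subject, size_t length,
                                     size_t offset, int crlf_is_newline,
                                     int utf8)
{
    if (offset >= length)
        return length;                  /* nothing left to skip over */

    /* Treat CR followed by LF as a single newline: skip both bytes. */
    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n')
        return offset + 2;

    offset++;                           /* advance by one byte */

    /* In UTF-8 mode, also skip the rest of a multi-byte character. */
    if (utf8) {
        while (offset < length &&
               (((unsigned char)subject[offset]) & 0xC0) == 0x80)
            offset++;
    }
    return offset;
}
```

If the returned offset equals the subject length, the caller can still make one final attempt there, since a pattern that matches the empty string may match at the very end of the subject.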
3347           If  a  non-zero starting offset is passed when the pattern is anchored,
3348         one attempt to match at the given offset is made. This can only succeed         one attempt to match at the given offset is made. This can only succeed
3349         if the pattern does not require the match to be at  the  start  of  the         if  the  pattern  does  not require the match to be at the start of the
3350         subject.         subject.
3351    
3352     How pcre_exec() returns captured substrings     How pcre_exec() returns captured substrings
3353    
3354         In  general, a pattern matches a certain portion of the subject, and in         In general, a pattern matches a certain portion of the subject, and  in
3355         addition, further substrings from the subject  may  be  picked  out  by         addition,  further  substrings  from  the  subject may be picked out by
3356         parts  of  the  pattern.  Following the usage in Jeffrey Friedl's book,         parts of the pattern. Following the usage  in  Jeffrey  Friedl's  book,
3357         this is called "capturing" in what follows, and the  phrase  "capturing         this  is  called "capturing" in what follows, and the phrase "capturing
3358         subpattern"  is  used for a fragment of a pattern that picks out a sub-         subpattern" is used for a fragment of a pattern that picks out  a  sub-
3359         string. PCRE supports several other kinds of  parenthesized  subpattern         string.  PCRE  supports several other kinds of parenthesized subpattern
3360         that do not cause substrings to be captured.         that do not cause substrings to be captured.
3361    
3362         Captured  substrings are returned to the caller via a vector of integer         Captured substrings are returned to the caller via a vector of integers
3363         offsets whose address is passed in ovector. The number of  elements  in         whose  address is passed in ovector. The number of elements in the vec-
3364         the  vector is passed in ovecsize, which must be a non-negative number.         tor is passed in ovecsize, which must be a non-negative  number.  Note:
3365         Note: this argument is NOT the size of ovector in bytes.         this argument is NOT the size of ovector in bytes.
3366    
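A common mistake is to pass sizeof(ovector) as ovecsize. The sketch below shows one way to derive the element count from the array itself; the macro name OVECTOR_ELEMENTS and the function example_ovecsize are hypothetical, and the array size of 30 is purely illustrative.

```c
#include <stddef.h>

/* Hypothetical macro: number of elements in a true array (not a pointer). */
#define OVECTOR_ELEMENTS(v) (sizeof(v) / sizeof((v)[0]))

/* Returns the value that would be passed as ovecsize for a 30-element
 * vector. Note that sizeof(ovector) would give the size in bytes
 * (30 * sizeof(int)), which is not what the argument expects. */
int example_ovecsize(void)
{
    int ovector[30];
    return (int)OVECTOR_ELEMENTS(ovector);
}
```

The macro only works where the array declaration is in scope; once the vector has been passed to another function as an int pointer, the element count has to be passed alongside it.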
3367         The first two-thirds of the vector is used to pass back  captured  sub-         The  first  two-thirds of the vector is used to pass back captured sub-
3368         strings,  each  substring using a pair of integers. The remaining third         strings, each substring using a pair of integers. The  remaining  third
3369         of the vector is used as workspace by pcre_exec() while  matching  cap-         of  the  vector is used as workspace by pcre_exec() while matching cap-
3370         turing  subpatterns, and is not available for passing back information.         turing subpatterns, and is not available for passing back  information.
3371         The length passed in ovecsize should always be a multiple of three.  If         The  number passed in ovecsize should always be a multiple of three. If